Provable Algorithms for Machine Learning Problems - Duke Computer ...

Provable Algorithms for Machine Learning Problems

Rong Ge

A Dissertation Presented to the Faculty of Princeton University in Candidacy for the Degree of Doctor of Philosophy

Recommended for Acceptance by the Department of Computer Science Adviser: Professor Sanjeev Arora

November 2013

c Copyright by Rong Ge, 2013.

All rights reserved.

Abstract Modern machine learning algorithms can extract useful information from text, images and videos. All these applications involve solving NP-hard problems in average case using heuristics. What properties of the input allow it to be solved efficiently? Theoretically analyzing the heuristics is often very challenging. Few results were known. This thesis takes a different approach: we identify natural properties of the input, then design new algorithms that provably works assuming the input has these properties. We are able to give new, provable and sometimes practical algorithms for learning tasks related to text corpus, images and social networks. The first part of the thesis presents new algorithms for learning thematic structure in documents. We show under a reasonable assumption, it is possible to provably learn many topic models, including the famous Latent Dirichlet Allocation. Our algorithm is the first provable algorithms for topic modeling. An implementation runs 50 times faster than latest MCMC implementation and produces comparable results. The second part of the thesis provides ideas for provably learning deep, sparse representations. We start with sparse linear representations, and give the first algorithm for dictionary learning problem with provable guarantees. Then we apply similar ideas to deep learning: under reasonable assumptions our algorithms can learn a deep network built by denoising autoencoders. The final part of the thesis develops a framework for learning latent variable models. We demonstrate how various latent variable models can be reduced to orthogonal tensor decomposition, and then be solved using tensor power method. We give a tight perturbation analysis for tensor power method, which reduces the number of samples required to learn many latent variable models. In theory, the assumptions in this thesis help us understand why intractable problems in machine learning can often be solved; in practice, the results suggest inherently new approaches for machine learning. We hope the assumptions and algorithms inspire new research problems and learning algorithms.

iii

Acknowledgements I would like to thank my parents for giving me life, love, and constant support. I thank my advisor, Prof. Sanjeev Arora, for his advice and encouragements. His knowledge, way of thinking, passion, and optimism guided me through my five years in Princeton. He is always willing to spend lots of time to discuss about research. He also helped me with my writing and presentations with lots of patience. I could not have hoped for a better advisor. I would also like to thank Sham Kakade, who mentored me when I was interning at Microsoft Research New England in summer 2012, and after I left Princeton University. His viewpoints and intuitions opened a different gate for me, which is important for the line of research presented in this thesis. Thanks to Moses Charikar, Sham Kakade, Mark Bravermann, and Zeev Dvir for being in my thesis committee. I thank the entire theory group in Princeton, IAS, and more generally in the Center for Computational Intractability. Their discussions and numerous talks give me the chance of learning new results and ideas. I thank Princeton University and the Computer Science Department for providing a perfect environment for research. I especially thank Melissa Lawson and Mitra Kelly, for their help in all administrative work. I also thank Andrew Chi-Chih Yao, who went to Tsinghua University when I was an undergraduate student there. The courses he taught introduced theoretical computer science to me, and lead me to the research path. This thesis will not be possible without my collaborators. An incomplete list would be: Animashree Anandkumar, Boaz Barak, Aditya Bhaskara, Markus Brunnermeier, Decheng Dai, Yoni Halpern, Daniel Hsu, Ravindran Kannan, Tengyu Ma, David Mimno, Ankur Moitra, Sushant Sachdeva, Grant Schoenebeck, Ali Kemal Sinop, David Sontag, Matus Telgarsky, Yichen Wu, and Michael Zhu. This thesis is supported under NSF grants CCF-0832797, CCF-1117309, CCF-1302518, DMS-1317308, and Simons Investigator Grant. Finally, I would like to thank my wife, who brought happiness and passion to my life.

iv

Contents Abstract . . . . . . Acknowledgements List of Tables . . . List of Figures . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

1 Introduction 1.1 Provable, Practical Topic Modeling Algorithm . . . . . 1.2 Towards Provably Learning Deep Representations . . . 1.3 Tensor Decomposition for General Matrix Factorization 1.4 Other Related Works . . . . . . . . . . . . . . . . . . .

I

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . .

iii iv ix x

. . . .

3 6 7 9 11

Provable, Practical Topic Modeling Algorithm

2 Nonnegative Matrix Factorization 2.1 Background and Main Results . . . . . . . . . . . . . . . 2.1.1 History of NMF . . . . . . . . . . . . . . . . . . . 2.1.2 Main Results . . . . . . . . . . . . . . . . . . . . 2.2 Simplicial Factorization . . . . . . . . . . . . . . . . . . . 2.2.1 Justification for Simplicial Factorization . . . . . 2.2.2 Algorithm for Simplicial Factorization . . . . . . 2.3 General NMF . . . . . . . . . . . . . . . . . . . . . . . . 2.3.1 General Structure Theorem: Minimality . . . . . 2.3.2 General Structure Theorem: Simplicial Partitions 2.3.3 Enumerating Simplicial Partitions . . . . . . . . . 2.3.4 Solving Systems of Polynomial Inequalities . . . . 2.4 Strong Intractability of Simplicial Factorization . . . . . 2.4.1 The Gadget . . . . . . . . . . . . . . . . . . . . . 2.4.2 The Reduction . . . . . . . . . . . . . . . . . . . 2.4.3 Completeness and Soundness . . . . . . . . . . . 2.5 Fully-Efficient Factorization under Separability . . . . . . 2.5.1 Adding Noise . . . . . . . . . . . . . . . . . . . . 2.6 Related Works for Separable-NMF . . . . . . . . . . . . 2.7 Approximate Nonnegative Matrix Factorization . . . . . v

. . . . . . . . . . . . . . . . . . .

13 . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . .

14 14 14 15 17 17 18 19 19 22 23 25 27 28 31 33 36 38 41 42

3 From NMF to Topic Modeling 3.1 Main Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.2 Tools for (Noisy) Nonnegative Matrix Factorization . . . . . . . . 3.2.1 Various Condition Numbers . . . . . . . . . . . . . . . . . 3.2.2 Noisy Nonnegative Matrix Factorization under Separability 3.3 Algorithm for Learning a Topic Model: Proof of Theorem 3.4 . . . 3.3.1 Recover R and A with Anchor Words . . . . . . . . . . . . 3.3.2 Recover R and A with Almost Anchor Words . . . . . . . 3.3.3 Error Bounds for Q . . . . . . . . . . . . . . . . . . . . . . 3.3.4 Proving the Main Theorem . . . . . . . . . . . . . . . . . 3.3.5 Reducing Dictionary Size . . . . . . . . . . . . . . . . . . . 3.4 The Dirichlet Subcase . . . . . . . . . . . . . . . . . . . . . . . . 3.4.1 Condition Number of a Dirichlet Distribution . . . . . . . 3.4.2 Recovering the Parameters of a Dirichlet Distribution . . . 3.5 Obtaining Almost Anchor Words . . . . . . . . . . . . . . . . . . 3.6 Maximum Likelihood Estimation is Hard . . . . . . . . . . . . . . 4 Practical Implementations and Experiments 4.1 A Probabilistic Approach to Exploiting Separability . . 4.2 A Combinatorial Algorithm for Finding Anchor Words 4.3 Experimental Results . . . . . . . . . . . . . . . . . . . 4.3.1 Methodology . . . . . . . . . . . . . . . . . . . 4.3.2 Efficiency . . . . . . . . . . . . . . . . . . . . . 4.3.3 Semi-synthetic documents . . . . . . . . . . . . 4.3.4 Effect of separability . . . . . . . . . . . . . . . 4.3.5 Effect of correlation . . . . . . . . . . . . . . . . 4.3.6 Real documents . . . . . . . . . . . . . . . . . . 4.4 Proof for Nonnegative Recover Procedure . . . . . . . . 4.5 Proof for Anchor-Words Finding Algorithm . . . . . . .

II

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

47 47 49 49 50 51 52 53 57 58 58 59 60 61 61 64

. . . . . . . . . . .

66 66 69 71 71 72 73 73 74 75 76 77

Towards Provably Learning Deep Networks

5 Provable Sparse Coding: Learning Overcomplete Dictionaries 5.1 Background and Results . . . . . . . . . . . . . . . . . . . . . . . 5.1.1 Sparse Coding and Dictionary Learning . . . . . . . . . . . 5.1.2 Incoherence Assumption . . . . . . . . . . . . . . . . . . . 5.1.3 Main Results . . . . . . . . . . . . . . . . . . . . . . . . . 5.1.4 Previous Work . . . . . . . . . . . . . . . . . . . . . . . . 5.1.5 Our approach . . . . . . . . . . . . . . . . . . . . . . . . . 5.2 The Connection Graph . . . . . . . . . . . . . . . . . . . . . . . . 5.3 Graph Recovery . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.4 Recovering the Dictionary . . . . . . . . . . . . . . . . . . . . . . 5.4.1 Finding the Relative Signs . . . . . . . . . . . . . . . . . . vi

82 . . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

83 83 83 84 84 86 87 88 90 93 93

5.5

5.4.2 An Approach via SVD . . . . . . . . . . . . . . . . . . . . . . . . . . A Higher-Order Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . .

6 Learning Deep Representations 6.1 Background and Main Results . . . . . . . . . . . . 6.1.1 Deep Networks . . . . . . . . . . . . . . . . 6.1.2 Main Results . . . . . . . . . . . . . . . . . 6.2 Denoising Property . . . . . . . . . . . . . . . . . . 6.3 Learning a Single Layer Network . . . . . . . . . . 6.3.1 Correlation Implies Common Cause . . . . . 6.3.2 Paritial Encoder: Finding h given y . . . . . 6.3.3 Learning the Graph: Finding −1 edges. . . . 6.4 Learning a Multi-layer Network . . . . . . . . . . . 6.4.1 Correlation Grapph in a Multilayer Network 6.4.2 Learning the −1 Edges . . . . . . . . . . . . 6.5 Graph Recovery . . . . . . . . . . . . . . . . . . . . 6.6 Layer with Real-valued Output . . . . . . . . . . . 6.7 Lower Bound . . . . . . . . . . . . . . . . . . . . . 6.8 Random Graph Properties . . . . . . . . . . . . . . 6.8.1 Unique neighbor property . . . . . . . . . . 6.8.2 Properties required by each steps . . . . . .

III

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

95 98 101 102 102 104 104 106 107 110 110 111 112 113 113 115 117 119 119 120

Tensor Decomposition for General Matrix Factorization 124

7 Orthogonal Tensor Decomposition 7.1 Orthogonal Tensors . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.2 Whitening: Getting Orthogonality from General Vectors . . . . . . 7.2.1 The reduction . . . . . . . . . . . . . . . . . . . . . . . . . . 7.3 Tensor structure in latent variable models . . . . . . . . . . . . . . 7.3.1 Exchangeable single topic models . . . . . . . . . . . . . . . 7.3.2 Beyond raw moments . . . . . . . . . . . . . . . . . . . . . . 7.3.3 Multi-view models . . . . . . . . . . . . . . . . . . . . . . . 7.4 Tensor power method . . . . . . . . . . . . . . . . . . . . . . . . . . 7.4.1 Convergence analysis for orthogonally decomposable tensors 7.4.2 Perturbation analysis of a robust tensor power method . . . 7.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.5.1 Computational complexity . . . . . . . . . . . . . . . . . . . 7.5.2 Sample complexity bounds . . . . . . . . . . . . . . . . . . . 7.6 Analysis of robust power method . . . . . . . . . . . . . . . . . . . 7.6.1 Initialization . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.6.2 Tensor power iterations . . . . . . . . . . . . . . . . . . . . . 7.6.3 Deflation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.6.4 Proof of the main theorem . . . . . . . . . . . . . . . . . . . vii

. . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . .

125 125 126 127 127 128 129 130 133 133 134 136 136 136 137 138 140 145 148

7.7

Better Initializers and Adaptive Deflation . . . . . . . . . . . . . . . . . . . . 152

8 Tensor Decomposition in Community Detection 8.1 Background and Main Results . . . . . . . . . . . . . . . 8.1.1 History of Overlapping Communities . . . . . . . 8.1.2 Main Result . . . . . . . . . . . . . . . . . . . . . 8.2 Graph Moments Under Mixed Membership Models . . . 8.3 Algorithm for Learning Mixed Membership Models . . . 8.3.1 Learning Mixed Membership Models Under Exact 8.3.2 Learning Algorithm Under Empirical Moments . . 8.4 Sample Analysis for Proposed Learning Algorithm . . . . 8.4.1 Sufficient Conditions and Recovery Guarantees . . 8.4.2 Proof Outline . . . . . . . . . . . . . . . . . . . . 8.5 Proof of the Main Theorems . . . . . . . . . . . . . . . . 8.5.1 Concentration Bounds . . . . . . . . . . . . . . . 8.5.2 Good Initializer for Tensor Power Method . . . . 8.5.3 Reconstruction After Tensor Power Method . . . 8.5.4 Dirichlet Properties . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Moments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

155 155 155 156 158 161 161 163 167 167 169 170 170 179 182 186

A Matrices and Tensors A.1 Matrix Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . A.1.1 Basic Notations, Rows and Columns . . . . . . . . . . . . . A.1.2 Norms and Eigenvalues . . . . . . . . . . . . . . . . . . . . . A.1.3 Singular Value Decomposition and Eigenspace Perturbation A.2 Tensor Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

190 190 190 190 191 192

B Concentration Bounds 194 B.1 Concentration Bounds for Single Variables or Functions . . . . . . . . . . . . 194 B.2 Concentration Bounds for Vectors and Matrices . . . . . . . . . . . . . . . . 195 Bibliography

197

viii

List of Tables 1.1

Machine Learning Problems in General Matrix Factorization Framework . .

5

2.1

Summary of NMF Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . .

43

4.1

Example topic pairs from NY Times (closest `1 ), anchor words in bold. All 100 topics are in suppl. material. . . . . . . . . . . . . . . . . . . . . . . . .

76

ix

List of Figures 1.1

General Matrix Factorization Framework . . . . . . . . . . . . . . . . . . . .

4

2.1 2.2 2.3

The Gadget . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Proof of Lemma 2.30 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Proof of Lemma 2.31 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

28 29 31

3.1

The matrix Q . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

53

4.1 4.2

73

4.7 4.8

Training time on synthetic NIPS documents. . . . . . . . . . . . . . . . . . . `1 error for a semi-synthetic model generated from a model trained on NY Times articles with K = 100. The horizontal line indicates the `1 error of K uniform distributions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . `1 error for a semi-synthetic model generated from a model trained on NIPS papers with K = 100. Recover fails for D = 2000. . . . . . . . . . . . . . . . When we add artificial anchor words before generating synthetic documents, `1 error goes to zero for Recover and close to zero for RecoverKL and RecoverL2. `1 error increases as we increase topic correlation. We use the same K = 100 topic model from NY Times articles, but add correlation: TOP ρ = 0.05, BOTTOM ρ = 0.1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Held-out probability (per token) is similar for RecoverKL, RecoverL2, and Gibbs sampling. RecoverKL and RecoverL2 have better coherence, but fewer unique words than Gibbs. (Up is better for all three metrics.) . . . . . . . . Illustration of the Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . Proof of Lemma 4.5, after projecting to the orthogonal subspace of span(S).

6.1 6.2 6.3 6.4

Example of a deep network . Single layered network . . . Two-layer network(G1 , G2 ) . Single-layer network (A, b) .

7.1

Examples of latent variable models.

8.1

Our moment-based learning algorithm uses 3-star count tensor from set X to sets A, B, C (and the roles of the sets are interchanged to get various estimates). Specifically, T is a third order tensor, where T(u, v, w) is the normalized count of the 3-stars with u, v, w as leaves over all x ∈ X. . . . . . 158

4.3 4.4 4.5

4.6

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

73 74 74

75

75 79 80 102 107 118 118

. . . . . . . . . . . . . . . . . . . . . . 131

x

Notations a, b, c, ... u, v, w, ... 0m , 1m supp(u) M, N, ... In T, ... u·v Diag(u) kuk |v|1 kM k |M |1 kM kF kT k kT kF λi (M ) κ(M ) M> M −1 M = U DV > M† ui , u[i] ui,j , u[i, j]

constants column vectors all 0’s/1’s vector of length m (m is omitted if it is clear from context) the support of vector u (set of nonzero indices) matrices n × n identity matrix (n is omitted if it is clear from context) tensors inner-product of two vectors diagonal matrix whose diagonal entries are from the vector u `2 norm of a vector `1 norm of a vector spectral norm of a matrix `1 norm of a matrix Frobenius norm of a matrix spectral norm of a tensor Frobenius norm of a tensor i-th largest eigenvalue of a matrix condition number of a matrix Transpose of a matrix inverse of a matrix (M will always be invertible) the singular value decomposition of matrix M the Moore-Penrose pseudo-inverse of the matrix M the i-th entry of the vector the entry in i-th row and j-th column of the matrix

1

Mi Mj T (U, V, W ) u⊗v u⊗v⊗w u⊗k [n]

i-th column of a matrix j-th row of a matrix tri-linear form of a tensor tensor product of vectors u and v, equivalent to matrix uv > tensor of three vectors shorthand for u ⊗ u ⊗ · · · ⊗ u (k u’s in the formula) the set {1, 2, . . . , n}

2

Chapter 1 Introduction Most machine learning problems are computationally intractable. However in practice, many heuristic algorithms work well. How do we explain this mismatch? At a high level, this is a mismatch between worst case and average case complexity. Computational intractability only implies hardness on worst case inputs, which may not occur in practice. The heuristic algorithms work because they only need to solve the “easy” instances. Investigating the details of this explanation raises more questions. What properties of the input make it “easy” for heuristic algorithms? Assuming the input satisfies these properties, can we prove current heuristic algorithms work? Can we design new algorithms that work provably on “easy” inputs? In this thesis, we try to answer these problems using a few examples. For each learning task, we identify reasonable properties that are satisfied by real-life inputs, and design provable algorithms using these properties. In theory, these new results show why hard problems in machine learning can in fact be solved; in practice, these results lead to fundamentally different approaches. We focus on unsupervised learning problems: given data points, the goal is to discover useful structure hiding in data. This is in contrast to supervised learning, where we are given data points and labels, and the goal is to find a function that maps data points to labels. Take the clustering problem as an example. Clustering Problem Given: Data points Goal: Group the data points, points within each group are close to each other. Clustering is a classical unsupervised learning problem: we are only given data points but not which clusters they are in; the hidden structure is “points within each group are close to each other”. In order to formally define the “correct” clustering, we specify a probabilistic model. Probabilistic modeling is a general way of describing unsupervised learning tasks. The clustering problem can be modeled as mixture of Gaussians. Each data point is in one of r groups. Each group has its own mean µi . If a point is in i-th group, the point is generated according to a Gaussian distribution with mean µi . Doing clustering is then equivalent to learning the groups. 3

Figure 1.1: General Matrix Factorization Framework This probabilistic view is very useful because it specifies how the data is generated, and provides a “ground truth” for measuring qualities of solutions. Many probabilistic models are also called latent variable models, as the observed data points depend on some latent variables (e.g. the group numbers). There are many latent variable models in different application areas that at first glance seem quite different from one another. Can we have a unified framework for thinking about them? Here we propose a new framework that we call General Matrix Factorization (GMF) which encompasses PCA, ICA, NMF, dictionary learning etc. (see Figure 1.1 for illustration, and Table 1.1 for a list of problems in the framework). Definition 1.1 (General Matrix Factorization). Each General Matrix Factorization (GMF) problem has three elements: Latent Variables The latent variables form a matrix W ∈ Rr×m . The i-th column Wi of W contains latent variables related to the i-th data point. The columns are drawn independently. Linear Transform The i-th data point is closely related to AWi , where the matrix A ∈ Rn×r is an unknown linear transform. Observation Function The observed data forms a matrix M . Each data point is a column Mi of M , generated by applying a simple function f to AWi (Mi = f (AWi )). Goal: Given M , learn matrices A and W . Many classical learning tasks fit naturally in the GMF framework. For mixture of Gaussians, the columns of matrix W are basis vectors determining which group the data point belongs to (i.e. Wi = ej if point i is in group j). The columns of matrix A correspond to the means of Gaussians µi . The observation function f simply adds Gaussian noise to the product AW . The well-known Principal Component Analysis (PCA) tries to find a subspace where the data points have the largest variance. To reduce PCA to GMF, the hidden variables Wi are 4

Problem PCA Clustering

Observation Function M = AW + noise M = AW + noise

Constraints on A dimension n × r dimension n × r

ICA

M = AW (+noise)

full rank

Nonnegative Matrix Factorization

M = AW (+noise)

dimension n × r nonnegative entries

Topic Modeling Dictionary Learning Neural Net (1 layer) Mixed Membership Community Model

M = sample AW M = AW M = threshold(AW ) M = sample∗ AW

dimension n × r columns are distributions general general A = W >P

Constraints on W dimension r × m Columns are basis vectors independent nonGaussian entries dimension r × m nonnegative entries dimension r × m priors columns are sparse 0/1 entries dimension r × m specific priors

Table 1.1: Machine Learning Problems in General Matrix Factorization Framework the coordinates of the data points in this subspace. The subspace itself is encoded in A (the columns of A form an orthonormal basis of the subspace). The observation function f adds small noise to the product AW . Independent Component Analysis (ICA, [42]) is another famous example. The goal of ICA is to learn the matrix A given samples of the form y = Ax, where the vector x has independent non-Gaussian entries. This is already in the GMF framework: the hidden variables are the vectors x, the linear transform is the matrix A, and the observation function is just the identity function f (Ax) = Ax. In Table 1.1 we summarize how different machine learning problems fit into the general matrix factorization framework (some problems are defined later in this Chapter). Let us now go back to the questions we asked at the beginning. For mixture of Gaussians we want to know Question: What kind of data is easy for learning mixture of Gaussians? Are there algorithms that work provably for this kind of data? These questions are already partly answered. The seminal work of Dasgupta[46] shows when the centers of Gaussians are far enough from each other, the problem can be solved efficiently. Follow-up works[31, 125, 141, 155] build on the same assumption, but reduce amount of separation required. In this thesis we answer similar questions for other problems. The last five problems in Table 1.1 can be categorized into three classes according to the constraints on A and W . Each part of this thesis investigates a different class of General Matrix Factorization problem. We identify very different simplifying assumptions in each class, and design provable algorithms under such assumptions. We hope these assumptions explain why such GMF problems can be solved efficiently, and these algorithms inspire different approaches for machine learning problems. 5

1.1

Provable, Practical Topic Modeling Algorithm

The first part of the thesis focus on GMF problems with nonnegative constraints. Topic modeling, which tries to extract thematic information from large corpus of documents, is an important problem in this class. A long line of work on topic modeling sets up the basic structure for topic models, starting from probabilistic latent semantic indexing (pLSI, [83]), followed by Latent Dirichlet Allocation[26], Correlated Topic Models[25] ,Pachinko allocation[112] and many others. According to these works, each topic is a probability distribution over words (e.g., in a topic related to “weather”, the words “snow”, “rain” and “sunny” have high probability); each document is a probability distribution over topics. These models also assume the ordering of the words is not crucial in determining the topics (“bag-of-words” assumption). Hence words in a document are generated independently at random. To generate a word, first generate its topic according to the document-topic distribution, then pick a word from the corresponding topic-word distribution. The process of generating a word can be summarized as a matrix product. The latent variables are the document-topic distributions. The linear transform is constructed from the topic-word distributions. That is, the (i, j)-th entry of matrix W ∈ Rr×m corresponds to the probability that document j generates topic i. The (l, i)-th entry of A ∈ Rn×r corresponds to the probability that the topic i generates the word l. The (l, j)-th entry of product AW is exactly the probability that document j generates word l. The observation function f samples N words according to the distribution AWj . Therefore we have the following description of topic modeling in GMF framework. Topic Modeling Given: an unknown topic matrix A with nonnegative entries that is dimension n × r, and a stochastically generated unknown matrix W that is dimension r × m. Each column of AW is viewed as a probability distribution on rows, and for each column we are given N n i.i.d. samples from the associated distribution. Goal: Reconstruct A and parameters of the generating distribution for W . Directly analyzing the General Matrix Factorization problem for topic modeling is not easy because much information is lost in the sampling process. In Chapter 2 We first focus on a simpler variant called Nonnegative Matrix Factorization (NMF). Nonnegative Matrix Factorization Given: Matrix M = AW + noise, where matrices A ∈ Rn×r and W ∈ Rr×m have nonnegative entries. Goal: Find A, W . We show a natural assumption called “separability” [54] simplifies the problem, and give a polynomial time algorithm when the instance is separable. 6

Separability Assumption: A nonnegative factorization M = AW is separable if for each column i of A, there is some row r(i) of A that has a single nonzero entry and this entry is in the i-th column. Under this assumption, we show the NMF problem has a nice geometric interpretation that leads to polynomial time algorithms. In Chapter 3 we show how to apply similar techniques to topic modeling. For topic modeling the separability assumption naturally translates to the “anchor words assumption”. Anchor Words Assumption: Every topic has a word with probability at least p in that topic, and does not appear (has probability 0) in all other topics. This assumption greatly simplifies the learning task. Even though we get samples from AW which form a very coarse approximation of the product, we show the noise can be reduced. Then utilizing the algorithm in Chapter 2 we give a polynomial time algorithm for topic modeling. Although the algorithms in Chapters 2 and 3 are interesting in theory, they are very slow in practice. In Chapter 4 we redesign the algorithms seeking good empirical performance. The new algorithm is order of magnitude faster than classical topic modeling toolkit MALLET[122], and produces results with similar quality.

1.2

Towards Provably Learning Deep Representations

Deep learning is one of the hottest topics in machine learning in recent years. The basic idea of deep learning is using a multi-layer network to represent the mapping from data points to their labels. Although multi-layer neural nets have been around for a long time, they can only be effectively learned after Hinton’s fundamental work[80, 81]. Later, with new insights from many researchers (see survey by Bengio [19]), deep learning has broken many records in tasks related to voice, images and videos. Despite its success in practice, theoretical analysis for deep learning algorithms is very limited. The main difficulties come from the nonlinearity in the network and correlations introduced by the layers of the network. Livni et al.[115] analyzed an algorithm of learning low degree polynomials, which can efficiently explain the input data. However there are no guarantees for recovering the ground truth parameters. Traditionally, deep learning is considered a very different topic from what we’ve discussed so far. However, new developments have characterized the kinds of deep networks that are useful in practice, and made deep learning quite reminiscent of the other problems in GMF framework. Although deep learning is a technique for supervised learning, its performance highly relies on an unsupervised pretraining step [81]. In this step, the algorithm tries to automatically discover hidden features in the input. These features are then used as a starting point for supervised training. The unsupervised step can be viewed as learning a “deep representation”. 7

Researchers have also assumed that the net (or some modification) can be run in reverse to get a generative model for a distribution that is a close fit to the empirical input distribution. This idea was implicit in Hinton’s work [81] as the Restricted Boltzmann Machines (RBM) he uses are naturally reversible. Later it is formalized as the idea of denoising autoencoders by Vincent et al.[156]. The generative model view of deep learning puts it into the General Matrix Factorization framework, where each layer provides a nonlinear code to represent the key structure of the data. In the second part of the thesis we try to design algorithms for provably learning deep representations. In particular, our algorithms learn sparse representations that can be viewed as GMF problems with sparsity constraints. In Chapter 5 we first look at the problem of learning a single-layer, sparse linear representation. This problem is usually called dictionary learning or sparse coding. Dictionary Learning Given: Samples y = Ax where x is a random sparse vector. Goal: Learn matrix A and vectors x. Dictionary learning fits easily to the GMF framework: the latent variables Wi ’s are just the vectors x; the linear transformation is the matrix A; the observation function is identity. When the matrix A is given, dictionary learning becomes the simpler sparse recovery problem (closely related to compressed sensing[55]). Even this simpler problem is hard in the worst case ([48, 95]). However, if we look at the worst case instances more carefully, they are all in the regime where the vector y can have multiple sparse representations x. Such ambiguity makes the representation less useful. This inspires the following assumption Robustness Assumption: The sparse representation for y is unique, and is stable when the vector y is perturbed. Previous works [36, 53, 75] have identified sufficient conditions for the sparse representation to be unique. In Chapter 5 we work with incoherent assumption of Donoho and Huo [53]. Under this assumption, we show the correlation structure of samples contains enough information about the support graph of the vectors x. This information is then used by a graph recovery algorithm to learn the supports of x, and eventually leads to reconstructing the matrix A. In Chapter 6, we show how to learn a network of autoencoders([82, 156]). An autoencoder is a one layer network with an encoding function E and a decoding function D. The functions D and E are usually of the form D(x) = s(Ax + b) and E(y) = s(A> y + b0 ) where s is a nonlinear function that is applied coordinate-wise. We choose s to be the threshold function (sgn(x) = 1 if x > 0 and sgn(x) = 0 otherwise). The autoencoder is then just a one-layer neural network. Learning Autoencoders (one-layer neural net) Given: Samples y = sgn(Ax) where x is a random sparse vector. Goal: Learn matrix A and vectors x. 8

Learning an autoencoder again reduces to the GMF problem: the latent variables are the vectors x; the linear transformation is the matrix A; the observation function applies the threshold function sgn to the vector Ax. The robustness assumption translates to the denoising property for autoencoders introduced by Vincent et al. [156]. Denoising Assumption: Any valid code x can be recovered even if the output is perturbed: E(D(x) + noise) = x. In Chapter 6 we show how to learn a denoising autoencoder when A is a random sparse matrix (this is inspired by Berinde et al. [21] ). Our algorithm for the single-layer network is robust, so that even when the input vector x comes from a deep network, the algorithm can still learn the correct network A. Therefore we can apply this algorithm iteratively to learn the entire deep network.

1.3

Tensor Decomposition for General Matrix Factorization

In the final part of the thesis, we focus on a unified methodology for solving many GMF problems in the statistical recovery framework. The key assumption in this part is distributional: Distribution Assumption: The latent variables have a specific, simple prior. The data is generated exactly according to the model. Statistical recovery algorithms are algorithms that work on synthetic data generated according to known distributions. There are many beautiful statistical recovery results in both theoretical computer science and machine learning, including algorithms for planted clique or dense subgraphs [125], Hidden Markov Models [88], mixture of Gaussians [128] and LDA model[9]. Most of these problems still fit into the General Matrix Factorization framework. However, the constraints on matrix A are not as strong as we have seen in the previous parts. For example, Anandkumar et al.[9] show how to learn the LDA model without using anchor words assumption1 . However, previous statistical recovery algorithms use problem-specific techniques, and usually require a large number of samples. In Chapter 7, we observe that in many cases it is possible to reduce General Matrix Factorization problem to the Orthogonal Tensor Decomposition problem. In fact, Orthogonal Tensor Decomposition can even be applied to settings where it is hard to formulate the problem as GMF. 1

Unlike the algorithms in Chapter 3, the algorithm does not generalize to other topic models such as Pachinko Allocation[112]

9

Definition 1.2. Orthogonal Pr Tensor Decomposition ˆ Given: Tensor T ≈ T = i=1 λi vi ⊗ vi ⊗ vi , the vectors vi are orthonormal. ˆ i ≈ λi , vî ≈ vi . Goal: Find estimates λ The key insight in the reductions is the method-of-moments ([137]): first compute the moments of observed variables2 , then apply suitable linear transformation to get an orthogonal tensor. This reduction crucially relies on the Distribution Assumption. After the reduction, the Orthogonal Tensor Decomposition problem can be solved using tensor power method. In Chapter 7 we give tight sample complexity bounds for this algorithm. Many of the previous works (including [8, 9, 38, 87, 88, 130]) can be reinterpreted in the Orthogonal Tensor Decomposition framework. Our tight sample complexity analysis gives better bounds for learning many well known models, including Hidden Markov Model, Independent Component Analysis and simple topic models. In Chapter 8, we highlight a new application of Orthogonal Tensor Decomposition: finding overlapping communities in social networks. The most popular model for communities is the stochastic block model. In this model, every person is in a single community. Two persons in the same community know each other with probability p; two persons in different communities know each other with probability q. Many algorithms (e.g. [125], see a summary in [166]) can learn stochastic block models. Stochastic block model is limited because every person can be in only one community. Airoldi et al.[4] purposed mixed membership stochastic block (MMSB) model. In this model, there are n persons and r communities. Each person belongs fractionally to different communities. If we use Wi,j to denote the fraction that j-th person is involved in the i-th community, the model asserts the probability that two persons u, v know each other should be Wu> P Wv (where P is an unknown matrix with entries in [0, 1]). Learning MMSB model Given: There is an unknown matrix W ∈ Rr×n with columns drawn from Dirichlet distribution, and an unknown matrix P with entries in [0, 1]. Given a graph G on n vertices, the edge (u, v) is in the graph with probability Wu> P Wv independently. Goal: Reconstruct W matrix. Learning MMSB model can be reduced to GMF framework, if we assume the linear transformation has the form W > P , and the observation function f samples according to entries of AW to form the graph G. Using method-of-moments, we can also reduce the learning problem to orthogonal tensor decomposition. The improved sample complexity analysis in Chapter 7 becomes crucial for learning MMSB model. Unlike topic models, where we can fix the topic matrix A and get as many documents as we want, in the community model the dimensions of both A and W increase as the number of persons increase. Using tensor power method, we can match the best known 2

The moments for an observed vector y include its mean E[y], variance E[yy > ] and higher order correlations

10

trade-off between number of communities and number of persons for the stochastic block model (even though our algorithm works for the more general MMSB model).

1.4

Other Related Works

Supervised Learning In supervised learning, the input includes data points and corresponding labels, the goal of learning is to predict the labels of new data points. These problems are traditionally studied via the PAC (Probabilistic Approximately Correct) model of Valiant [152], and VC dimension and SVMs (Support Vector Machines) of Vapnik and others [29, 153]. However, our work has to go beyond these approaches. First, obtaining unlabeled data points has become much easier with vast data generated by the Internet, while data sets with good quality labels require significantly more effort. Also, unsupervised learning is very closely related to understanding the properties of the data. In fact, insights from deep learning [19, 81] hint that supervised learning tasks might be “easy” because the structure of the input can be learned by unsupervised learning algorithms. Without the correct structure of the input, most learning problems are computationally hard in PAC model or even simpler variants of PAC model (e.g. [93, 97]). Semi-random Models A common critique to model assumptions and statistical recovery algorithms is that they are not robust to adversarial perturbations to the model. Semirandom models allow an adversary to perturb the instance after it is generated using a stochastic model. The first example is the planted bisection model of Feige and Kilian [60]. Recently, [118] generalized this result. Similar results are also known for graph coloring[27], finding hidden clique [61] and unique games [102]. Approximation Stability Assumptions Another line of work, initiated by Balcan, Blum and Gupta[16] focuses on problems where approximation algorithms are applied, such as k-means or k-median clustering. They assume the instances satisfy the property that if a solution has large objective function, then it should also be close to the “ground truth” solution. The reasoning behind this assumption is if it were false, then it would make no sense to apply approximation algorithms, and maybe people should seek a different objective function. Under this kind of assumption people were able to develop algorithms for various clustering-related problems. These assumptions are quite different from our work. In particular, we try to make assumptions that natural input should satisfy, without looking into specific objective functions. Tensor Decompositions There has been several lines of research studying tensor decomposition, including CP decomposition [78], Tucker decomposition [151] and HOSVD [50] (for a survey see [99]). Use of tensor decompositions in learning latent variable models also has a long history, some of the earlier works are summarized in the book [123]. For Orthogonal Tensor Decomposition [101], when the exact tensor is given, it is known how to decompose the tensor into sum of rank one tensors via variants of power method 11

[100, 167]. However, before our work it was not clear whether any variant of power method could work under sampling noise.

12

Part I Provable, Practical Topic Modeling Algorithm

13

Chapter 2 Nonnegative Matrix Factorization Before talking about topic modeling, we first focus on a closely related problem called nonnegative matrix factorization (NMF). In the NMF problem, we are given an n×m matrix M with nonnegative real entries (such a matrix will be henceforth called “nonnegative”) and an integer r > 0. Our goal is to express M as AW where A and W are nonnegative matrices of size n × r and r × m respectively. NMF arises naturally in the context of topic modeling, where the matrix product AW can concisely describe the process of generating a word from a document. In fact, NMF can be viewed as a limiting case of topic models where the number of words in a document is much larger than the size of the vocabulary. In this chapter, we first give algorithms for solving the exact NMF problem (that is, the matrix M is exactly equal to the product of AW ), and an almost matching lowerbound. The running time of this algorithm is polynomial in the dimensions n, m when the rank r is constant, but exponential in r. However, in order to apply NMF in the context of topic modeling, the algorithm needs to be robust to noise, and fully polynomial over all the parameters (n, m and r). Under a natural separability assumption [54], we design a new algorithm that meets these requirements.

2.1 2.1.1

Background and Main Results History of NMF

Besides applications in topic modeling, NMF is itself an interesting problem that has been independently introduced in a number of different contexts and applications. Many interesting heuristics and local search algorithms (including the familiar Expectation Maximization or EM) have been proposed to find such factorizations. One compelling family of applications is data analysis, where a nonnegative factorization is computed in order to extract certain latent relationships in the data and has been applied to image segmentation [109], [110] information retrieval [83] and document clustering [162]. NMF also has applications in fields such as chemometrics [107] (where the problem has a long history of study under the name self modeling curve resolution) and biology (e.g. in vision research [32]): in some cases, the 14

underlying physical model for a system has natural restrictions that force a corresponding matrix factorization to be nonnegative. In demography (see e.g., [79]), NMF is used to model the dynamics of marriage through a mechanism similar to the chemical laws of mass action. In combinatorial optimization, Yannakakis [164] characterized the number of extra variables needed to succinctly describe a given polytope as the nonnegative rank of an appropriate matrix (called the “slack matrix”). In communication complexity, Aho et al [3] showed that the log of the nonnegative rank of a Boolean matrix is polynomially related to its deterministic communication complexity - and hence the famous Log-Rank Conjecture of Lovasz and Saks [116] is equivalent to showing a quasi-polynomial relationship between real rank and nonnegative rank for Boolean matrices. In complexity theory, Nisan used nonnegative rank to prove lower bounds for non-commutative models of computation [133]. Additionally, the 1993 paper of Cohen and Rothblum [41] gives a long list of other applications in statistics and quantum mechanics. That paper also gives an exact algorithm that runs in exponential time.

2.1.2

Main Results

We give both an algorithm for accomplishing this algorithmic task that runs in polynomial time for any constant value of r and we complement this with an intractability result which states that assuming the Exponential Time Hypothesis [92] no algorithm can solve the exact NMF problem in time (nm)o(r) . Theorem 2.1. There is an algorithm for the Exact NMF problem (where r is the target 2 r inner-dimension) that runs in time O((nm)O(r 2 ) ). This result is based on algorithms for deciding the first order theory of the reals - roughly the goal is to express the decision question of whether or not the matrix M has nonnegative rank at most r as a system of polynomial equations and then to apply algorithms in algebraic geometry to determine if this semi-algebraic set is non-empty. The complexity of these procedures is dominated by the number of distinct variables occurring in the system of polynomial equations. In fact, the number of distinct variables plays an analogous role to VC-dimension. The naive formulation of the NMF decision problem as a non-emptiness problem is to use nr + mr variables, one for each entry in A or W [41]. Yet even for constant values of r, an algorithm based on such a formulation would run in time exponential in n and m. At the heart of our algorithm is a structure theorem – based on a novel method for reducing the number of variables needed to define the associated semi-algebraic set. We are able to express the decision problem for nonnegative matrix factorization using r2 2r distinct variables (and we make use of tools in geometry, such as the notion of a separable partition, to accomplish this [77], [6], [90]). Thus we obtain the algorithm quoted in the above theorem, and we note that prior to our work it was unknown whether even the case r = 3 was hard or solvable in polynomial time, and indeed it is the latter. In fact, in a natural special case of the problem where the rank of M is equal to the target inner-dimension r we can obtain a further improvement: We refer to this problem as 15

the Simplicial Factorization (SF) problem. See Section 2.2.1 for an explanation for why this special case is natural in the context of information retrieval and other applications where the goal is to learn latent statistical structure. Our algorithm is again based on the first order theory of the reals, but here the system of equations is much smaller so in practice one may be able to use heuristic approaches to solve this system (in which case, the validity of the solution can be easily checked). Theorem 2.2. There is an algorithm for the Exact SF problem (where r is the target inner2 dimension) that runs in time O((nm)O(r ) )1 . We complement these algorithms with a fixed parameter intractability result. We make use of a recent result of Patrascu and Williams [136] and engineer low-dimensional gadgets inspired by the gadgets of Vavasis [154] to show that under the Exponential Time Hypothesis [92], there is no exact algorithm for NMF that runs in time (nm)o(r) . This intractability result holds also for the SF problem. Theorem 2.3. If there is an exact algorithm for the SF problem (or for the NMF problem) that runs in time O((nm)o(r) ) then 3-SAT can be solved in 2o(n) time on instances with n variables. Although the algorithms mentioned above run in polynomial time for constant r, in practice they are still intractable even for small r. We consider the nonnegative matrix factorization problem under an assumption introduced by Donoho and Stodden [54] in the context of image segmentation called “separability”. This assumption asserts that there are r rows of A that can be permuted to form a diagonal matrix. If we knew the names of these rows, then computing a nonnegative factorization would be easy. The challenge in this context, is to avoid brute-force search (which runs in time nr ) and to find these rows in time polynomial in n, m and r. To the best of our knowledge the following is the first example of a polynomial-time algorithm that provably works under a non-trivial condition on the input. Theorem 2.4. There is an exact algorithm that can compute a separable, nonnegative factorization M = AW (where r is the inner-dimension) in time polynomial in n, m and r if such a factorization exists. Donoho and Stodden [54] argue that the separability condition is naturally met in the context of image segmentation. Additionally, Donoho and Stodden prove that separability in conjunction with some other conditions guarantees that the solution to the NMF problem is unique. Our theorem above is an algorithmic counterpart to their results, but requires only separability. Our algorithm can also be made noise tolerant, and hence works even when the separability condition only holds in an approximate sense. Indeed, an approximate separability condition is regarded as a fairly benign assumption and is believed to hold in many practical contexts in machine learning. For instance it is usually satisfied by model parameters fitted to various generative models (e.g. LDA [26] in information retrieval). We thank David Blei for this information. 1

This result is later improved in [127]

16

Lastly, we consider the case in which the given matrix M does not have an exact lowrank NMF but rather can be approximated by a nonnegative factorization with small innerdimension. Theorem 2.5. There is a 2poly(r log(1/)) poly(n, m)-time algorithm that, given a M for which there is a nonnegative factorization AW (of inner-dimension r) which is an -approximation to M in Frobenius norm, computes A0 and W 0 satisfying kM − A0 W 0 kF ≤ O(1/2 r1/4 ) kM kF . The rest of the chapter is organized as follows: In Section 2.2 we give an exact algorithm for the SF problem and in Section 2.3 we give an exact algorithm for the general NMF problem. In Section 2.4 we prove a fixed parameter intractability result for the SF problem. And in Section 2.5 we give algorithms for the separable nonnegative fatorization problems, which is particularly interesting for topic modeling. The same problem is solved again in later chapters and also in related follow-up works, these different algorithms provides a large variety of tradeoffs, which are summarized in Section 2.6. Finally we give an algorithm for approximate NMF in Section 2.7.

2.2

Simplicial Factorization

Here we consider the simplicial factorization problem, in which M has rank equal to the target inner-dimension r. Hence in any factorization, the factors A and W must have full column rank and full row rank respectively.

2.2.1

Justification for Simplicial Factorization

We first argue that the extra restriction imposed in simplicial factorization is natural in many contexts: Through a re-scaling we can assume that the columns of M , A and W all have unit `1 norm. The factorization M = AW can be interpreted probabilistically: each column of M can be expressed as a convex combination (given by the corresponding column of W ) of columns in A. In the example in the introduction, columns of M represent documents and the columns of A represent “topics”. Hence a nonnegative factorization is an “explanation” : each document can be expressed as a convex combination of the topics. But if A does not have full column rank then this explanation is seriously deficient. This follows from a restatement of Radon’s Lemma. Let conv(AU ) be the convex hull of the columns Ai for i ∈ U. Observation 1. If A is an n × r (with n ≥ r) matrix and rank(A) < r, then there are two disjoint sets of columns U, V ⊂ [r] so that conv(AU ) ∩ conv(AV ) 6= ∅. The observation implies that there can be a candidate document x that can be expressed as a convex combination of one set (U ) of topics, or instead can be expressed as a convex combination of an entirely disjoint other set (V ) of topics. The end goal of NMF is often to use the representation of documents as distributions on topics to perform various tasks, such as clustering or information retrieval. But if (even given the set of topics in a database) 17

it is this ambiguous to determine how we should represent a given document as a convex combination of topics, then the topics we have extracted cannot be very useful for clustering! In fact, it seems unnatural to not require the columns of A to be linearly independent! One should consider the process (stochastic, presumably) that generates the columns of W . Any reasonable process would almost surely result in a matrix M whose rank is equal to the rank of A.

2.2.2

Algorithm for Simplicial Factorization

In this section we give an algorithm that solves the simplicial factorization problem in 2 (nm)O(r ) time. Let L be the maximum bit complexity of any coefficient in the input. 2

Theorem 2.6. There is an O((nm)O(r ) ) time algorithm for deciding if the simplicial factorization problem has a solution of inner-dimension at most r. Furthermore, we can compute 2 a rational approximation to the solution up to accuracy δ in time poly(L, (nm)O(r ) , log 1/δ). The above theorem is proved by using Lemma 2.7 below to reduce the problem of finding a simplicial factorization to finding a point inside a semi-algebraic set with poly(n) constraints and 2r2 real-valued variables (or deciding that this set is empty). The decision problem can be solved using the well-known algorithm of Basu et. al.[18] that solves this 2 problem in nO(r ) time. We can instead use the algorithm of Renegar [139] (and a bound of 2 poly(L, (nm)O(r ) ) on the bit complexity of the coefficients in the solution due to Grigor’ev and Vorobjov [73]) to compute a rational approximation to the solution up to accuracy δ in 2 time poly(L, (nm)O(r ) , log 1/δ). This reduction uses the fact that since A, W have full rank they have “pseudo-inverses” + A , W + which are r × n and n × r matrices respectively such that A+ A = W W + = Ir×r . Thus A+ Mi = A+ AWi = Wi and similarly M j W + = Aj . Hence the columns of A can be obtained from linear combinations of the columns of M and similarly the rows of W can be obtain from linear combinations of rows of M . Let C be a maximal set of linearly independent columns of M and let R be a maximal set of linearly independent rows of M . Lemma 2.7 (Structure Lemma for Simplicial Factorization). Let M be a rank r matrix. Then M has a nonnegative matrix factorization of inner-dimension r if and only if there are linear transformations S and T such that: (i) CST R = M (ii) CS and T R are both non-negative Proof. (“if”) Suppose the conditions in the theorem are met. Then set A = CS and W = T R and hese matrices are nonnegative and have size n×r and r×m respectively, and furthermore are a factorization for M . Since rank(M ) = r, A and W are a simplicial factorization. (“only if”) Conversely suppose that there is a simplicial factorization M = AW . Recall that because A and W have full column and row rank respectively, the columns of A can be obtained from linear combinations of the columns of M and also the rows of W can be obtain from linear combinations of rows of M . Since R and C span the row and column space of 18

M respectively, we have that the columns of A can be obtained by a linear transformation of the columns of C and similarly W can be obtained by a linear transformation of R.

2.3

General NMF

Now we consider the NMF problem where the factor matrices A, W need not have full rank. Although the simplicial factorization problem is a quite natural special case in applications in machine learning and statistics, indeed in other applications such as in extended formulations [164] it is crucial that the nonnegative rank of a matrix can be much different than its rank. We note that subsequent to our work, the fourth author gave a singly-exponential time algorithm for computing the nonnegative rank of a matrix [127] and coupled with our hardness results in Section 2.4 this almost characterizes the fastest algorithm we could hope for under the Exponential Time Hypothesis [92]. 2 r

Theorem 2.8. There is a O((nm)O(r 2 ) ) time deterministic algorithm that given an n × m nonnegative matrix M outputs a factorization AW of inner dimension r if such a factorization exists. Furthermore, we can compute a rational approximation to the solution up to 2 r accuracy δ in time poly(L, (nm)O(r 2 ) , log 1/δ). As in the simplicial case the main idea will again be a reduction to an existence question for a semi-algebraic set, but this reduction is significantly more complicated than Lemma 2.7.

2.3.1

General Structure Theorem: Minimality

Our goal is to re-cast nonnegative matrix factorization (for constant r) as a system of polynomial inequalities where the number of variables is constant, the maximum degree is constant and the number of constraints is polynomially bounded in n and m. The main obstacle is that A and W are large - we cannot afford to introduce a new variable to represent each entry in these matrices. We will demonstrate there is always a ”minimal” choice for A and W so that: 1. there is a collection of linear transformations T1 , T2 , ...Tg(r) from the column-span of M to Rr and a choice function σW : [m] → [g(r)] 2. and a collection of linear transformations S1 , S2 , ...Sg(r) from the row-span of M to Rr and a choice function σA : [n] → [g(r)] And these linear transformations and choice functions satisfy the conditions: 1. for each i ∈ [n], Wi = TσW (i) Mi and 2. for each j ∈ [m], Aj = M j SσA (j) . 19

2

Furthermore, the number of possible choice functions σW is at most mO(r f (r)) and the 2 number of possible choice functions for σA is at most nO(r g(r)) . These choice functions are based on the notion of a simplicial partition, which we introduce later. We then give an algorithm for enumerating all simplicial partitions (this is the primary bottleneck in the algorithm). Fixing the choice functions σW and σA , the question of finding linear transformations T1 , T2 , ...Tg(r) and S1 , S2 , ...Sg(r) that satisfy the above constraints (and the constraint that M = AW , and A and W are nonnegative) is exactly a system of polynomial inequalities with a O(r2 g(r)) variables (each matrix Ti or Sj is r × r), degree at most four and furthermore there are at most O(mn) polynomial constraints. In this subsection, we will give a procedure (which given A and W ) generates a “minimal” choice for A and W (call this minimal choice A0 and W 0 ), and we will later establish that this “minimal” choice satisfies the structural property stated informally above. Definition 2.9. Let C(A) ⊂ 2[r] denote the subsets of [r] corresponding to maximal independent sets of columns (of A). Similarly let R(W ) ⊂ 2[r] denote the subsets of [r] corresponding to maximal independent sets of rows (of W ). A basic fact from linear algebra is that all maximal independent sets of columns of A have exactly rank(A) elements and all maximal independent sets of rows of W similarly have exactly rank(W ) elements. Definition 2.10. Let s be the total ordering on subsets of [r] of size s so that if U and V are both subsets of [r] of size s, U ≺s V iff U is lexicographically before V . Definition 2.11. Given a column Mi , we will call a subset U ∈ C(A) a minimal basis for Mi (with respect to A) if Mi ∈ cone(AU ) and for all V ∈ C(A) such that Mi ∈ cone(AV ) we must have U ≺s V . Claim. If Mi ∈ cone(A), then there is some U ∈ C(A) such that Mi ∈ cone(AU ). Definition 2.12. A proper chain (A, W, A0 , W 0 ) is a set of nonnegative matrices for which M = AW , M = AW 0 and M = A0 W 0 (the inner dimension of these factorizations is r) and functions σW 0 : [m] → C(A) and σA0 : [n] → R(W 0 ) such that 1. for all i ∈ [m], AWi0 = Mi , supp(Wi0 ) ⊂ σW 0 (i) and σW 0 (i) is a minimal basis w.r.t. A for Mi 2. for all j ∈ [n], A0j W 0 = M j , supp(Aj ) ⊂ σA0 (j) and σA0 (j) is a minimal basis w.r.t. W 0 for M j .

20

Note that the extra conditions on W 0 (i.e. the minimal basis constraint) is with respect to A and the extra conditions on A0 are with respect to W 0 . This simplifies the proof that there is always some proper chain, since we can compute a W 0 that satisfies the above conditions with respect to A and then find an A0 that satisfies the conditions with respect to W 0 . Lemma 2.13. If there is a nonnegative factorization M = AW (of inner-dimension r), then there is a choice of nonnegative A0 , W 0 of inner-dimension r and functions σW 0 : [m] → C(A) and σA0 : [n] → R(W 0 ) such that (A, W, A0 , W 0 ) and σW 0 , σA0 form a proper chain. Proof. The condition that there is some nonnegative W for which M = AW is just the condition that for all i ∈ [m], Mi ∈ cone(A). Hence, for each vector Mi , we can choose a minimal basis U ∈ C(A) using Claim 2.3.1. Then Mi ∈ cone(AU ) so there is some nonnegative vector Wi0 supported on U such AWi0 = Mi and we can set σW 0 (i) = U . Repeating this procedure for each column Mi , results in a nonnegative matrix W 0 that satisfies the condition M = AW 0 and for each i ∈ [m], by design supp(Wi0 ) ⊂ σW 0 (i) and σW 0 (i) is a minimal basis with respect to A for Mi . We can re-use this argument above, setting M T = (W 0T )AT and this interchanges the role of A and W . Hence we obtain a nonnegative matrix A0 which satisfies M = A0 W 0 and for each j ∈ [n], again by design we have that supp(Aj ) ⊂ σA0 (j) and σA0 (j) is a minimal basis with respect to W for M j . Definition 2.14. Let Π(A, U ) (for U ∈ C(A)) denote the r × n linear transformation that is zero on all rows not in U (i.e. Π(A, U )j = 0 for j ∈ / U ) and restricted to U is Π(A, U )U = (AU )+ (where the + operation denotes the Moore-Penrose pseudoinverse). Lemma 2.15. Let (A, W, A0 , W 0 ) and σW 0 and σA0 form a proper chain. For any index i ∈ [m], let Ui = σW 0 (i) and for any index j ∈ [n] let Vj = σA0 (j). Then Wi0 = Π(A, Ui )Mi and A0j = M j Π(W 0T , Vj )T . Notice that in the above lemma, the linear transformation that recovers the columns of W is based on column subsets of A, while the linear transformation to recover the rows of A0 is based on the row subsets of W 0 (not W ). 0

Proof. Since (A, W, A0 , W 0 ) and σW 0 and σA0 form a proper chain we have that AW 0 = M . Also supp(Wi0 ) ⊂ Ui = σW 0 (i). Consider the quantity Π(A, Ui )Mi . For any j ∈ / Ui , (Π(A, Ui )Mi )j = 0. So consider (Π(A, Ui )Mi )Ui = (AUi )+ AWi0 = (AUi )+ AUi (Wi0 )Ui where the last equality follows from the condition supp(Wi0 ) ⊂ Ui . Since Ui ∈ C(A) we have that (AUi )+ AUi is the |Ui | × |Ui | identity matrix. Hence Wi0 = Π(A, Ui )Mi . An identical argument with W 0 replaced with A0 and with A replaced by W 0T (and i and Ui replaced with j and Vj ) respectively implies that A0j = M j Π(W 0T , Vj )T too. 21

Note that there are at most |C(A)| ≤ 2r linear transformations of the form Π(A, Ui ) and hence the columns of W 0 can be recovered by at most 2r linear transformations of the column span of M , and similarly the rows of A0 can also be recovered. The remaining technical issue is we need to demonstrate that there are not too many choice functions σW 0 and σA0 and that we can enumerate over this set efficiently. In principle, even if say C(A) is just two sets, there are exponentially many choices of which (of the two) linear transformation to use for each column of M . However, when we use lexicographic ordering to tie break (as in the definition of a minimal basis), the number of choice functions is polynomially bounded. We will demonstrate that the choice function σW 0 : [m] → C(A) arising in the definition of a proper chain can be embedded in a restricted type of geometric partitioning of M which we call a simplicial partition.

2.3.2

General Structure Theorem: Simplicial Partitions

Here, we establish that the choice functions σW 0 and σA0 in a proper chain are combinatorially simple. The choice function σW 0 can be regarded as a partition of the columns of M into |C(A)| sets, and similarly the choice function σA0 is a partition of the rows of M into R(W 0 ) sets. Here we define a geometric type of partitioning scheme which we call a simplicial partition, which has the property that there are not too many simplicial partitions (by virtue of this class having small VC-dimension), and we show that the partition functions σW 0 and σA0 arising in the definition of a proper chain are realizable as (small) simplicial partitions. Definition 2.16. A (k, s)-simplicial partition of the columns of M is generated by a collection of k sets of s hyperplanes H1 = {h11 , h12 , ...h1s }, H2 = {h21 , h22 , ...h2s }, ...Hk = {hk1 , hk2 , ...hks }. Let Qi = {i0 s.t. for all j ∈ [s], hij · Mi0 ≥ 0}. Then this collection of sets of hyperplanes results in the partition • P1 = Q1 • P2 = Q2 − P1 • Pk = Qk − P1 − P2 ... − Pk−1 • Pk+1 = [m] − P1 − P2 ... − Pk If rank(A) = s, we will be interested in a (

r s

, s)-simplicial partition.

Lemma 2.17. Let (A, W, A0 , W 0 ) and σW 0 and σA0 form a proper chain. Then the partitions corresponding to σW 0 and to σA0 (of columns and rows of M respectively) are a ( rs , s)simplicial partition and a ( rt , t)-simplicial partition respectively, where rank(A) = s and rank(W 0 ) = t. 22

Proof. Order the sets in C(A) according to the lexicographic ordering s , so that V1 ≺s V2 ≺s ...Vk for k = |C(A)|. Then for each j, let Hj be the rows of the matrix (AVj )+ . Note that there are exactly rank(A) = s rows, hence this defines a (k, s)-simplicial partition. Claim. σW 0 (i) = j if and only if Mi ∈ Pj in the (k, s)-simplicial partition generated by H1 , H2 , ...Hk . Proof. Since (A, W, A0 , W 0 ) and σW 0 and σA0 forms a proper chain, we have that M = AW 0 . Consider a column i and the corresponding set Vi = σW 0 (i). Recall that Vj is the j th set in C(A) according to the lexicographic ordering s . Also from the definition of a proper chain Vi is a minimal basis for Mi with respect to A. Consider any set Vj 0 ∈ C(A) with j 0 < j. Then from the definition of a minimal basis we must have that Mi ∈ / cone(AVj0 ). + 0 Since Vj ∈ C(A), we have that the transformation (AVj0 )(AVj0 ) is a projection onto span(A) which contains span(M ). Hence (AVj0 )(AVj0 )+ Mi = Mi , but Mi ∈ / cone(AVj0 ) so (AVj0 )+ Mi cannot be a nonnegative vector. Hence Mi is not in Pj 0 for any j 0 < j. Furthermore, Mi is in Qj : using Lemma 2.15 we have Π(A, Vj )Mi = Π(A, Vj )AWi0 = Wi0 ≥ 0 and so (AVj )+ Mi = (Π(A, Vj )Mi )Vj ≥ 0. We can repeat the above replacing A with W 0T and W 0 with A0 , and this implies the lemma.

2.3.3

Enumerating Simplicial Partitions

Here we give an algorithm for enumerating all (k, s)-simplicial partitions (of, say, the columns of M ) that runs in time O(mks(r+1) ). An important observation is that the problem of enumerating all simplicial partitions can be reduced to enumerating all partitions that arise from a single hyperplane. Indeed, we can over-specify a simplicial partition by specifying the partition (of the columns of M ) that results from each hyperplane in the set of ks total hyperplanes that generates the simplicial partition. From this set of partitions, we can recover exactly the simplicial partition. A number of results are known in this domain, but surprisingly we are not aware of any algorithm that enumerates all partitions of the columns of M (by a single hyperplane) that runs in polynomial time (for dim(M ) ≤ r and r is constant) without some assumption on M . For example, the VC-dimension of a hyperplane in r dimensions is r + 1 and hence the Sauer-Shelah lemma implies that there are at most O(mr+1 ) distinct partitions of the columns of M by a hyperplane. In fact, a classic result of Harding (1967) gives a tight upper bound of O(mr ). Yet these bounds do not yield an algorithm for efficiently enumerating this structured set of partitions without checking all partitions of the data. A recent result of Hwang and Rothblum [90] comes close to our intended application. A separable partition into p parts is a partition of the columns of M into p sets so that the convex hulls of these sets are disjoint. Setting p = 2, the number of separable partitions is exactly the number of distinct hyperplane partitions. Under the condition that M is in general position (i.e. there are no t columns of M lying on a dimension t − 2 subspace where t = rank(M ) − 1), Hwang and Rothblum give an algorithm for efficiently enumerating all distinct hyperplane partitions [90]. 23

Here we give an improvement on this line of work, by removing any conditions on M (although our algorithm will be slightly slower). The idea is to encode each hyperplane partition by a choice of not too many data points. To do this, we will define a slight generalization of a hyperplane partition that we will call a hyperplane separation: Definition 2.18. To a hyperplane h we associate a mapping which we call a hyperplane separation that maps columns of M to {−1, 0, 1} depending on the sign of h · Mi (where the sign function is 1 for positive values, −1 for negative values and 0 for zero). A hyperplane partition can be regarded as a mapping from columns of M to {−1, 1} where we adopt the convention that Mi such that h ◦ Mi is mapped to 1. Definition 2.19. A hyperplane partition (defined by h) is an extension of a hyperplane separation (defined by g) if for all i, g(Mi ) 6= 0 ⇒ g(Mi ) = h(Mi ). Lemma 2.20. Let rank(M ) = s, then for any hyperplane partition (defined by h), there is a hyperplane g that contains s affinely independent columns of M and for which h (as a partition) is an extension of g (as a separation). Proof. After an appropriate linear transformation (of the columns of M and the hyperplanes), we can assume that M is full rank. If the h already contains s affinely independent columns of M , then we can choose g = h. If not we can perturb h in some direction so that for any column with h(Mi ) = 0, we maintain the invariant that Mi is contained on the perturbed hyperplane h0 . Since rank(M ) = s this perturbation has non-zero inner product with some column in M and so this hyperplane h0 will eventually contain a new column from M (without changing the sign of h(Mi ) for any other column). We can continue this argument until the hyperplane contains s affinely independent columns of M and by design on all remaining columns agrees in sign with h. Lemma 2.21. Let rank(M ) = s. For any hyperplane h (which defines a partition), there is a collection of k ≤ s sets of (at most s) columns of M , S1 , S2 , ..Sk so that any hyperplanes g1 , g2 , ..gk which contain S1 , S2 , ...Sk respectively satisfy: For all i, h(Mi ) (as a partition) is equal to the value of gj (Mi ), where j is the smallest index for which gj (Mi ) 6= 0. Furthermore these subsets are nested: S1 ⊃ S2 ⊃ ... ⊃ Sk . Proof. We can apply Lemma 2.20 repeatedly. When we initially apply the lemma, we obtain a hyperplane g1 that can be extended (as a separation) to the partition corresponding to h. In the above function (defined implicitly in the lemma) this fixes the partition of the columns except those contained in g1 . So we can then choose M 0 to be the columns of M that are contained in g1 , and recurse. If S2 is the largest set of columns output from the recursive call, we can add columns of M contained in g1 to this set until we obtain a set of s + 1 affinely independent columns contained in g1 , and we can output this set (as S1 ). 24

Theorem 2.22. Let rank(M ) = s. There is an algorithm that runs in time O(ms (s + 2)s ) time to enumerate all hyperplane partitions of the columns of M . Proof. We can apply Lemma 2.21 and instead enumerate the sets of points S1 , S2 , ...Ss . Since these sets are nested, we can enumerate all choices as follows: • choose at most s columns corresponding to the set S1 • initialize an active set T = S1 • until T is empty either – choose a column to be removed from the active set – or indicate that the current active set represents the next set Si and choose the sign of the corresponding hyperplane There are at most O(ms (s+2)s ) such choices, and for each choice we can then run a linear program to determine if there is a corresponding hyperplane partition. (In fact, all partitions that result from the above procedure will indeed correspond to a hyperplane partition). The correctness of this algorithm follows from Lemma 2.21. This immediately implies: 2

Corollary 2.23. There is an algorithm that runs in time O(mks ) ) that enumerates a set of partitions of the columns of M that contains the set of all (k, s)-simplicial partitions (of the columns of M ).

2.3.4

Solving Systems of Polynomial Inequalities

The results of Basu et al [18] give an algorithm for finding a point in a semi-algebraic set defined by O(mn) constraints on polynomials of total degree at most d, and f (r) variables in time O((mnd)cf (r) ). Using our structure theorem for nonnegative matrix factorization, we will re-cast the decision problem of whether a nonnegative matrix M has nonnegative rank r as an existence question for a semi-algebraic set. Theorem 2.24. There is an algorithm for deciding if a n × m nonnegative matrix M has 2 r nonnegative rank r that runs in time O((nm)O(r 2 ) ). Furthermore, we can compute a rational 2 r approximation to the solution up to accuracy δ in time poly(L, (nm)O(r 2 ) , log 1/δ). We first prove the first part of this theorem using the algorithm of Basu et al [18], and we instead use the algorithm of Renegar [139] to compute a rational approximation to the 2 r solution up to accuracy δ in time poly(L, (nm)O(r 2 ) , log 1/δ). 25

Proof. Suppose there is such a factorization. Using Lemma 2.13, there is also a proper chain. We can apply Lemma 2.17 and using the algorithm in Theorem 2.22 we can enumerate over a superset of simplicial partitions. Hence, at least one of those partitions will result in the choice functions σW 0 and σA0 in the proper chain decomposition for M = AW . Using Lemma 2.15 there is a set of at most 2r linear transformations T1 , T2 , ...T2r which recover columns of W 0 given columns of M , and similarly there is a set of at most 2r linear transformations S1 , S2 , ...S2r which recover the rows of A0 given rows of M . Note that these linear transformations are from the column-span and row-span of M respectively, and hence are from subspaces of dimension at most r. So apply a linear transformation to columns of M and one to rows of M to to recover matrices MC and MR respectively (which are no longer necessarily nonnegative) but which are dimension r × m and n × r respectively. There will still be a collection of at most 2r linear transformations from columns of MC to columns of W 0 , and similarly for MR and A0 . We will choose r2 variables for each linear transformation, so there are 2 ∗ r2 ∗ 2r variables in total. Then we can write a set of m linear constraints to enforce that for each column of (MC )i , the transformation corresponding to σW 0 (i) recovers a nonnegative vector. Similarly we can define a set of n constraints based on rows in MR . Lastly we can define a set of constraints that enforce that we do recover a factorization for M : For all i ∈ [m], j ∈ [n], let i0 = σW 0 (i) and j 0 = σA0 (j). Then we write the constraint (MC )j Sj 0 Ti0 (MR )i = Mij . This constraint has degree at two in the variables corresponding to the linear transformations. Lemma 2.13 implies that there is some choice of these transformations that will satisfy these constraints (when we formulate these constraints using the correct choice functions in the proper chain decomposition). Furthermore, any set of transformations that satisfies these constraints does define a nonnegative matrix factorization of inner dimension r for M . And of course, if there is no inner dimension r nonnegative factorization, then all calls to the algorithm of Basu et al [18] will fail and we can return that there is no such factorization. The result in Basu et al. [18] is a quantifier elimination algorithm in the Blum, Shub and Smale (BSS) model of computation [28]. The BSS model is a model for real number computation and it is natural to ask what is the bit complexity of finding a rational approximation of the solutions. There has been a long line of research on the decision problem for first order theory of reals: given a quantified predicate over polynomial inequalities of reals, determine whether it is true or false. What we need for our algorithm is actually a special case of this problem: given a set of polynomial inequalities over real variables, determine whether there exists a set of values for the variables so that all polynomial inequalities are satisfied. In particular, all variables in our problem are quantified by existential quantifier and there are no alternations. For this kind of problem Grigor’ev and Vorobjov [73] first 2 gave a singly-exponential time algorithm that runs in (nd)O(f (r) ) where n is the number of polynomial inequalities, d is the maximum degree of the polynomials and f (r) is the num2 ber of variables. The bit complexity of the algorithm is poly(L, (nd)O(f (r) ) ) where L is the maximum length of the coefficients in the input. Moreover, their algorithm also gives an upperbound of poly(L, (nd)O(f (r)) ) on the number of bits required to represent the solutions. 26

Renegar[139] gave a better algorithm that for the special case we are interested in takes time (nd)O(f (r)) . Using his algorithm with binary search (with search range bounded by Grigor’ev et.al.[73]), we can find rational approximations to the solutions with accuracy up to δ in time poly(L, (nm)O(f (r)) , log 1/δ). We note that our results on the SF problem are actually a special case of the theorem above (because our structural lemma for simplicial factorization is a special case of our general structure theorem): Corollary 2.25. There is an algorithm for determining whether the nonnegative rank of a 2 nonnegative n × m matrix M equals the rank and this algorithm runs in time O((nm)O(r ) ). Proof. If rank(M ) = r, then we know that both A and W must be full rank. Hence C(A) and R(W ) are both just the set {1, 2, ...r}. Hence we can circumvent the simplicial partition machinery, and set up a system of polynomial constraints in at most 2r2 variables.

2.4

Strong Intractability of Simplicial Factorization

Here we prove that there is no algorithm for computing simplicial factorization of dimension r that runs in (nm)o(r) time unless 3-SAT can be solved in 2o(n) time. Surprisingly, even the N P -hardness of the problem was only proved quite recently by Vavasis [154]. That reduction is the inspiration for our result, though unfortunately we were unable to use it directly to get low-dimensional instances. Instead we give a new reduction using the d-SUM Problem. Definition 2.26 (d-SUM). In the d-SUM problem we are given a set of N values {s1 , s2 , ...sN } each in the range [0, 1], and the goal is to determine if there is a set of d numbers (not necessarily distinct) that sum to exactly d/2. This definition for the d-SUM Problem is slightly unconventional in that here we allow repetition (i.e. the choice of d numbers need not be distinct). Patrascu and Williams [136] recently proved that if d-SUM can be solved in N o(d) time then 3-SAT has a sub-exponential time algorithm. In fact, in the instances constructed in [136] we can allow repetition of numbers without affecting the reduction since in these instances choosing any number more than once will never result in a sum that is exactly d/2. Hence we can re-state the results in [136] for our (slightly unconventional definition for) d-SUM. Theorem 2.27. If d < N 0.99 and if d-SUM instances of N distinct numbers each of O(d log N ) bits can be solved in N o(d) time then 3-SAT on n variables can be solved in time 2o(n) . Given an instance of the d-SUM, we will reduce to an instance of the Intermediate Simplex problem defined in [154]. 27

z

C D

B I2

ε A

I1

M I3N E F

1

y

Figure 2.1: The Gadget Definition 2.28 (Intermediate Simplex). Given a polyhedron P = {x ∈ Rr−1 : Hx ≥ b} where H is an n × (r − 1) size matrix and b ∈ Rn such that the matrix [H, b] has rank r and a set S of m points in Rr−1 , the goal of the Intermediate Simplex Problem is to find a set of points T that form a simplex (i.e. T is a set of r affinely independent points) each in P such that the convex hull of T contains the points in S. Vavasis [154] proved that Intermediate Simplex is equivalent to the Simplicial Factorization problem. Theorem 2.29 (Vavasis [154]). There is a polynomial time reduction from Intermediate Simplex problem to Simplicial Factorization problem and vice versa and furthermore both reductions preserve the value of r.

2.4.1

The Gadget

Given the universe U = {s1 , s2 , . . . , sN } for the d-SUM problem, we construct a two dimensional Intermediate Simplex instance as shown in Figure 2.1. We will show that the Intermediate Simplex instance has exactly N solutions, each representing a choice of si . Later in the reduction we use d such gadgets to represent the choice of d numbers in the set U. Recall for a two dimensional Intermediate Simplex problem, the input consists of a polygon P (which is the hexagon ABCDEF in Figure 2.1) and a set of points S = {I1 , I2 , . . . , I3N } inside P (which are the dots, except for M ). A solution to this two dimensional Intermediate Simplex instance will be a triangle inside P such that all the points in S are contained in the triangle (in Figure 2.1 ACE is a valid solution). 28

z 1

1

A

D Cj

Ai Aj

Ci C

B

y

y1 y2

Figure 2.2: Proof of Lemma 2.30 We first specify the polygon P for the Intermediate Simplex instance. The polygon P is just the hexagon ABCDEF inscribed in a circle with center M . All angles in the hexagon are 2π/3, the edges AB = CD = EF = where is a small constant depending on N , d that we determine later. The other 3 edges also have equal lengths BC = DE = F A. We use y(A) and z(A) to denote the y and z coordinates for the point A (and similarly for all other points in the gadget). The hexagon is placed so that y(A) = y(B) = 0, y(D) = y(E) = 1. Now we specify the set S of 3N points for the Intermediate Simplex instance. To get these points first take N points in each of the 3 segements AB, CD, EF . On AB these N points are called A1 , A2 , ..., AN , and |AAi | = si . Similarly we have points Ci ’s on CD and Ei ’s on EF , |CCi | = |EEi | = si . Now we have N triangles Ai Ci Ei (the thin lines in Figure 2.1). We claim (see Lemma 2.30 below) that the intersection of these triangles is a polygon with 3N vertices. The points in S are just the vertices of this intersection. Lemma 2.30. When < 1/50, the points {Ai }, {Ci }, {Ei } are on AB, CD, EF respectively and AAi = CCi = EEi = si , the intersection of the N triangles {Ai Ci Ei } is a polygon with 3N vertices. Proof. Since the intersection of N triangles Ai Ci Ei is the intersection of 3N halfplanes, it has at most 3N vertices. Therefore we only need to prove every edge in the triangles has a segment remaining in the intersection. Notice that the gadget is symmetric with respect to rotations of 2π/3 around the center M . By symmetry we only need to look at edges Ai Ci . The situation here is illustrated in Figure 2.2. Since all the halfplanes that come from triangles Ai Ci Ei contain the center M , later when talking about halfplanes we will only specify the boundary line. For example, the halfplane with boundary Ai Ci and contains Ei (as well as M ) is called halfplane Ai Ci . 29

The two thick lines in√Figure 2.2 are extensions of AB and CD, now they are rotated so that they are z = ± 3y. The two thin lines are two possible lines Ai Ci and Aj Cj . The differences between y coordinates of Ai and Ci are the same for all i (here normalized to 1) by the of the points Ai ’s and Ci ’s. Assume the coordinates for Ai , Aj √ construction √ are (yi , − √3yi ) and (yj , − 3yj ) respectively. Then the coordinates for the intersection is (yi +yj +1, 3(1+yi +yj +2yi yj )). This means if we have N segments with y1 < y2 < . . . < yN , segment i will be the highest one when y is in range (yi−1 + yi + 1, yi + yi+1 + 1) (indeed, the lines with j > i have higher slope and will win when y > yi + yj + 1 ≥ yi + yi+1 + 1; the lines with j < i have lower slope and will win when y < yi + yj + 1 ≤ yi + yi−1 + 1). We also want to make sure that all these intersection points are inside the halfplanes Ci Ei ’s and Ei Ai ’s. Since < 1/50, all the yi ’s are within √ [−1/2 − 1/20, −1/2 + 1/20]. Hence the intersection point is always close to the point (0, 3/2), the√distance is at most 1/5. At the same time, since is small, the distances of this point (0, 3/2) to all the Ci Ei ’s and Ei Ai ’s are all larger than 1/4. Therefore all the intersection points are inside the other 2N halfplanes and the segments will indeed remain in the intersection. The intersection has 3N edges and 3N vertices. The Intermediate Simplex instance has N obvious solutions: the triangles Ai Ci Ei , each one corresponds to a value si for the d-SUM problem. In the following Lemma we show that these are the only possible solutions. Lemma 2.31. Let < 1/1000. If the solution of the Intermediate Simplex problem is P QR, then P QR must be one of the Ai Ci Ei ’s. Proof. Suppose P QR is a solution of the Intermediate Simplex problem, since M is in the convex hull of {I1 , I2 , . . . , I3N }, it must be in P QR. Thus one of the angles ∠P M Q, ∠QM R, ∠RM P must be at least 2π/3 (their sum is 2π). Without loss of generality we assume this angle is ∠P M Q and by symmetry assume P is either on AB or BC. We shall show in either of the two cases, when P is not one of the Ai ’s, there will be some Ik that is not in the halfplane P Q (recall the halfplanes we are interested in always contain M so we don’t specify the direction). When P is on AB, since ∠P M Q ≥ 2π/3, we have CQ ≥ AP (by symmetry when CQ = AP the angle is exactly 2π/3). This means we can move Q to Q0 such that CQ0 = AP . The intersection of halfplane P Q0 and the hexagon ABCDEF is at least as large as the intersection of halfplane P Q and the hexagon. However, if P is not any of the points {Ai } (that is, |P Q0 |/ 6∈ {s1 , s2 , ..., sN }), then P Q0 can be viewed as AN +1 CN +1 if we add sN +1 = |AP |/ to the set U . By Lemma 2.30 introducing P Q0 must increase the number of vertices. One of the original vertices Ik is not in the hyperplane P Q0 , and hence not in P QR. Therefore when P is on AB it must coincide with one of the Ai ’s, by symmetry P QR must be one of Ai Ci Ei ’s. When P is on BC, there are two cases as shown in Figure 2.3. First observe that if we take U 0 = U ∪ {1 − s1 , 1 − s2 , . . . , 1 − sN }, and generate the set S = {I1 , I2 , . . . , I6N } according to U 0 , then the gadget is further symmetric with respect to 30

C P’

D Q’’ Q’

P

P

Q

Q

B A R

Figure 2.3: Proof of Lemma 2.31 flipping along the perpendicular bisector of BC. Now without loss of generality BP ≤ BC/2. Since every Ik is now in the intersection of 2N triangles, in particular they are also in the intersection of the original N triangles, it suffices to show one of Ik (k ∈ [6N ]) is outside halfplane P Q. The first case (left part of Figure 2.3) is when BP < . In this case we extend P Q to get intersection on AB (P 0 ) and intersection on CD (Q0 ). Again since ∠P M Q ≥ 2π/3, we have DQ ≥ BP . At the same time we know ∠DQQ0 ≥ ∠P 0 P B, so DQ0 > BP 0 . Similar to the previous case, we take Q00 so that CQ00 = AP 0 . The intersection of hyperplane P 0 Q00 and the hexagon ABCDEF is at least as large as the intersection of halfplane P Q and the hexagon. When < 1/1000, we can check AP 0 < 2 1/50, therefore we can still view P 0 Q00 as some A2N +1 C2N +1 for s2N +1 < 2. Now Lemma 2.30 shows there is some vertex Ik not in halfplane P 0 Q00 (and hence not in halfplane P Q). The final case (right part of Figure 2.3) is when BP ≥ . In this case we notice the triangle with 3 edges AD, BE, CF (the shaded triangle in the figure) is contained in every Ai Ci Ei , thus it must also be in P QR. However, since BC/2 ≥ BP ≥ , we know AR ≤ and DQ ≤ . In this case P QR does not even contain the center M .

2.4.2

The Reduction

Suppose we are given an instance of the d-SUM Problem with N values {s1 , s2 , ...sN }. We will give a reduction to an instance of Intermediate Simplex in dimension r − 1 = 3d + 1. To encode the choice of d numbers in the set {s1 , s2 , ..., sN }, we use d gadgets defined in Section 2.4.1. The final solution of the Intermediate Simplex instance we constructed will include solutions to each gadget. As the solution of a gadget always corresponds to a number in {s1 , s2 , ..., sN } (Lemma 2.31) we can decode the solution and get d numbers, and we use an extra dimension w that “computes” the sum of these numbers and ensures the sum is equal to d/2. We use three variables {xi , yi , zi } for the ith gadget. 31

Variables 1. We will use 3d + 1 variables: sets {xi , yi , zi } for i ∈ [d] and w. Constraints 1 (Box). For all i ∈ [d], xi , yi ∈ [0, 1], zi ∈ [0, 2] and also w ∈ [0, 1]. Definition 2.32. Let G ⊂ R2 be the hexagon ABCDEF in the two-dimensional gadget given in the Section 2.4.1.Let H ⊂ R3 be the set conv({(xi , yi , zi ) ∈ R3 |(yi , zi ) ∈ G, xi = 1}, 0). H is a tilted-cone that has a hexagonal base G and has an apex at the origin. Definition 2.33. Let R be a 7 × 3 matrix and b ∈ R7 so that {x|Rx ≥ b} = H. We will use these gadgets to define (some of the) constraints on the polyhedron P in an instance of intermediate simplex: Constraints 2 (Gadget). For each i ∈ [d], R(xi , yi , zi ) ≥ b. Hence when restricted to dimensions xi , yi , zi the ith gadget G is on the plane xi = 1. We hope that in a gadget, if we choose three points corresponding to the triangle for some value si , that of these three points only the point on the AB line will have a non-zero value for w and that this value will be si . The points on the lines CD or EF will hopefully have a value close to zero. We add constraints to enforce these conditions: Constraints 3 (CE). For all i ∈ [d], w ≤ 1 − yi + (1 − xi ) These constraints make sure that points on CD or EF cannot have large w value. Recall that we use z(A) to denote the z coordinate of A in the gadget in Section 2.4.1. Constraints 4 (AB). For all i ∈ [d]: w ∈

h

(zi −z(A)xi )

±

( 10 yi

i + (1 − xi ))

Theses constraints make sure that points on AB have values in {s1 , s2 , ..., sN }. The AB and CE constraints all have the property that when xi < 1 (i.e. the corresponding point is off of the gadget on the plane xi = 1) then these constraints gradually become relaxed. To make sure the gadget still works, we don’t want the extra constraints on w to rule out some possible values for xi , yi , zi ’s. Indeed we show the following claim. Claim. For all points in (xi , yi , zi ) ∈ H, there is some choice of w ∈ [0, 1] so that xi , yi , zi and w satisfy the CE and AB Constraints.

32

The proof is by observing that Constraints AB have almost no effect when y > 0 and Constraints CE have no effect when y = 0. Constraints 1 to 4 define a polyhedron P in 3d + 1-dimensional space and furthermore the set of constraints that define P have full rank (in fact even the inequalities in the Box Constraints have full rank). Thus this polyhedron is a valid polyhedron for the Intermediate Simplex problem. Next we specify the points in S for the Intermediate Simplex problem(each of which will be contained in the polyhedron P ). Let Ik (for k ∈ [3N ]) be the set S in the gadget in Section 2.4.1. As before, let z(Ik ) and y(Ik ) be the z and y coordinates of Ik respectively. Definition 2.34 (w-max(Ik )). Let w-max(Ik ) be the maximum possible w-value of any point I with xi = 1, yi = y(Ik ), zi = z(Ik ) and xj , yj , zj = 0 for all j 6= i so that I is still contained in P . Definition 2.35 (O, W, Iki , Q). The set S of points for the Intermediate Simplex problem is O points: For all i ∈ [d], xi , yi , zi = 0 and w = 0 W point: For all i ∈ [d], xi , yi , zi = 0 and w = 1 Iki points: For each i ∈ [d], for each k ∈ [3N ] set xi = 1/4, yi = 1/4y(Ik ), zi = 1/4z(Ik ) and for j 6= i set xj , yj , zj = 0. Also set w to be the 1/4 × w-max(Ik ). Q point: For each i ∈ [d], xi = 1/d, yi = y(M )/d, zi = z(M )/d and w = 1/6 This completes the reduction of 3-SUM to intermediate simplex, and next we establish the COMPLETENESS and SOUNDNESS of this reduction.

2.4.3

Completeness and Soundness

The completeness part is straight forward: for ith gadget we just select the triangle that corresponds to ski . Lemma 2.36. If there is a set {sk1 , sk2 , ...skd } of d values (not necessarily distinct) such that P i∈[d] ski = d/2 then there is a solution to the corresponding Intermediate Simplex Problem. Proof. We will choose a set of 3d + 2 points T : We will include the O and W points, and for each ski , we will choose the triangle corresponding to the value ski in the ith gadget. Recall the triangle is Aki Cki Eki in the gadget defined in Section 2.4.1. The points we choose have xi = 1 and yi , zi equal to the corresponding point in the gadget. We will set w to be ski for the point on the line AB and we will set w to be zero for the other two points not contained in the line AB. The rest of the dimensions are all set to 0. 33

Next we prove that the convex hull of this set of points T contains all the points in S: The points O and W are clearly contained in the convex hull of T (and are in fact in T !). Next consider some point Iki in S corresponding to some intersection point Ik in the gadget G. Since Ik is in the convex hull of the triangle corresponding to ski in the gadget G, there is a convex combination of the these three points Aki , Cki , Eki in T (which we call J) so that 1/4J matches Iki on all coordinates except possibly the w-coordinate. Furthermore the point J has some value in the coordinate corresponding to w and this must be at most the corresponding value in Iki (because we chose the w-value in Iki to be 1/4 × w-max(Ik )). Hence we can distribute the remaining 3/4 weight among the O and W points to recover Iki exactly on all coordinates. Lastly, we observe that if we equally weight all points in T (except O and W ) we recover Pd 1 s the point Q. In particular, the w coordinate of Q should be 3d i=1 ki = 1/6. Next we prove SOUNDNESS for our reduction. Suppose the solution is T , which is a set of 3d + 2 points in the polyhedron P and the convex hull of points in T contains all the O, W , Iki , Q points (in Definition 2.35). Claim. The points O and W must be in the set T . Proof. The points O and W are vertices of the polyhedron P and hence cannot be expressed as a convex combination of any other set of points in P . Now we want to prove the rest of the 3d points in set T is partitioned into d triples, each triple belongs to one gadget. Set T 0 = T − {O} − {W }. Definition 2.37. For i ∈ [d], let Ti0 = {Z ∈ T 0 |j 6= i ⇒ xj (Z), yj (Z), zj (Z) = 0 and one of xi (Z), yi (Z), zi (Z) 6= 0} Claim. The sets Ti0 partition T 0 and each contain exactly 3 nodes. Proof. The sets Ti0 are disjoint, and additionally each set Ti0 must contain at least 3 nodes (otherwise the convex hull of Ti0 even restricted to xi , yi , zi cannot contain the points Iki ). This implies the Claim. Recall the gadget in Section 2.4.1 is a two dimensional object, but it is represented as a three dimensional cone in our construction. We would like to apply Lemma 2.31 to points on the plane xi = 1 (in this plane the coordinates yi ,zi act the same as y, z in the gadget). Definition 2.38. For each point Z ∈ Ti0 , let ext(Z) ∈ R3 be the intersection of the line connecting the origin and (xi (Z), yi (Z), zi (Z)) with the xi = 1 base of the set {(xi , yi , zi )|R(xi , yi , zi ) ≥ b}. Let ext(Ti0 ) be the point-wise ext operation applied to each point in Ti0 . 34

Since the points Iki are in the affine hull of Ti0 when restricted to xi , yi , zi , we know ext(Iki ) must be in the convex hull of ext(Ti0 ). Using Lemma 2.31 in Section 2.4.1, we get: Corollary 2.39. ext(Ti0 ) must correspond to some triangle Aki Cki Eki for some value ski . Now we know how to decode the solution T and get the numbers ski . We will abuse notation and call the 3 points in Ti0 Aki , Cki , Eki (they were used to denote the corresponding points in the 2-d gadget in Section 2.4.1).We still want to make sure the w coordinate correctly “computes” the sum of these numbers. As a first step we want to show that the xi of all points in Ti0 must be 1 (we need this because the Constraints AB and CE are only strict when xi = 1). Lemma 2.40. For each point Z ∈ Ti0 , xi (Z) = 1 Proof. Suppose, forPthe sake of contradiction, that xi (Z)
0 be a lower bound on the `1 -condition number of the matrix √ R(T ). This is defined in Section 3.2, but for a r × r matrix it is within a factor of r of the smallest singular value. Our algorithm will work for any γ, but the number of documents we require will depend (polynomially) on 1/γ: Theorem 3.4 (Main). There is a polynomial time algorithm that learns the parameters of a topic model if the number of documents is at least log r · a2 r4 log n · a4 r6 , m = max O ,O 2 p6 γ 2 N γ2 where the three numbers a, p, γ are as defined above. The algorithm learns the topic-term matrix A up to entry-wise additive error . Moreover, when the number of documents is also log r·r2 larger than O the algorithm can learn the topic-topic covariance matrix R(T ) up to 2 entry-wise additive error . As noted earlier, we are able to recover the topic matrix even though we do not always recover the parameters of the column distribution T . In some special cases we can also recover the parameters of T , e.g. when this distribution is Dirichlet, as happens in the popular Latent Dirichlet Allocation (LDA) model [24, 26]. In Section 3.4.1 we compute a lower bound on the γ parameter for the Dirichlet distribution, which allows us to apply our main learning algorithm, and also the parameters of T can be recovered from the co-variance matrix R(T ) (see Section 3.4.2). 48

Recently the basic LDA model has been refined to allow correlation among different topics, which is more realistic. See for example the Correlated Topic Model (CTM) [25] and the Pachinko Allocation Model (PAM) [112]. A compelling aspect of our algorithm is that it extends to these models as well: we can learn the topic matrix, even though we cannot always identify T . (Indeed, the distribution T in the Pachinko is not even identifiable: two different sets of parameters can generate exactly the same distribution.) This Chapter is organized as follows: in Section 3.2 we introduce various parameters related to the A matrix. This prepares us for the proof of the main theorem in Section 3.3. The proof is adapted to the special case of LDA in Section 3.4. We give a better algorithm for NMF in Section 3.5 which is important for improving the sample complexity of the main theorem. Finally, Section 3.6 shows maximum likelihood estimator is NP-hard for topic modeling even if topics are separable, and justifies the need to assume data is actually generated according to the model.

3.2 3.2.1

Tools for (Noisy) Nonnegative Matrix Factorization Various Condition Numbers

Central to our arguments will be various notions of matrices being “far” from being low-rank. The most interesting one for our purposes was introduced by Kleinberg and Sandler [96] in the context of collaborative filtering; and can be thought of as an `1 -analogue to the smallest singular value of a matrix. Definition 3.5 (`1 Condition Number). If matrix B has nonnegative entries and all rows sum to 1 then its `1 Condition Number Γ(B) is defined as: Γ(B) = min |x> B|1 . |x|1 =1

If B does not have row sums of one then Γ(B) is equal to Γ(DB) where D is the diagonal matrix such that DB has row sums of one. For example, if the rows of B have disjoint support then Γ(B) = 1 and in general the quantity Γ(B) can be thought of a measure of how close two distributions on √ disjoint sets of rows can be. Note that, if x is an n-dimensional real vector, kxk2 ≤ |x|1 ≤ nkxk2 and hence (if σmin (B) is the smallest singular value of B), we have: √ 1 √ σmin (B) ≤ Γ(B) ≤ mσmin (B). n The above notions of condition number will be most relevant in the context of the topictopic covariance matrix R(T ). We shall always use γ to denote the `1 condition number of R(T ). The definition of condition number will be preserved even when we estimate the topic-topic covariance matrix using random samples. 49

Lemma 3.6. When m > 5 log r/20 , with high probability the matrix R = m1 W W > is entrywise close to R(T ) with error 0 . Further, when 0 < 1/4γar2 where a is topic imbalance, the matrix R has `1 condition number at least γ/2. Proof. Since E[Wi Wi> ] = R(T ), the first part is just by Chernoff bound and union bound. The further part follows because R(T ) is γ robustly simplicial, and the error can change the `1 norm of vR for any unit v by at most ar · r0 . The extra factor ar comes from the normalization to make rows of R sum up to 1. In Chapter 2 we defined a different measure of “distance” from singular which is essential to the polynomial time algorithm for NMF: Definition 3.7 (β-robustly simplicial). If each column of a matrix A has unit `1 norm, then we say it is β-robustly simplicial if no column in A has `1 distance smaller than β to the convex hull of the remaining columns in A. The following claim clarifies the interrelationships of these latter condition numbers. Claim. (i) If A is p-separable then A> has `1 condition number at least p. (ii) If A> has all row sums equal to 1 then A is β-robustly simplicial for β = Γ(A> )/2. We shall see that the `1 condition number for product of matrices is at least the product of `1 condition number. The main application of this composition is to show that the matrix R(T )A> (or the empirical version RA> ) is at least Ω(γp)-robustly simplicial. The following lemma will play a crucial role in analyzing our main algorithm: Lemma 3.8 (Composition Lemma). If B and C are matrices with `1 condition number Γ(B) ≥ γ and Γ(C) ≥ β , then Γ(BC) is at least βγ. Specificially, when A is p-separable the matrix R(T )A> is at least γp/2-robustly simplicial. Proof. The proof is straight forward because for any vector x, we know |x> BC|1 ≤ Γ(C)|x> B|1 ≤ Γ(C)Γ(B)|x|1 . For the matrix R(T )A> , by Claim 3.2.1 we know the matrix A> has `1 condition number at least p. Hence Γ(R(T )A> ) is at least γp and again by Claim 3.2.1 the matrix is γp/2-robustly simplicial.

3.2.2

Noisy Nonnegative Matrix Factorization under Separability

A key ingredient is an approximate NMF algorithm in Chapter 2, which can recover an ˜ ≈ AW when the `1 distance between each approximate nonnegative matrix factorization M ˜ and the corresponding row in AW is small. We emphasize that this is not enough row of M ˜ will have a substantial amount for our purposes, since the term-by-document matrix M of noise (when compared to its expectation) precisely because the number of words in a document N is much smaller than the dictionary size n. Rather, we will apply the following ˜M ˜ >. algorithm (and an improvement that we give in Section 3.5) to the Gram matrix M For topic modeling we need a slightly different goal here than in Chapter 2. Instead of recovering estimates to the anchor words that are close in `1 -norm, we would rather recover 50

almost anchor words that have almost all its weights on a single coordinate. (this was called the simplicial norm in Section 2.6.) Hence, we will be able to achieve better bounds by treating this problem directly, we defer the proof to Section 3.5. Theorem 3.9. Suppose M = AW where W and M are normalized to have rows sum up to 1, A is separable and W is γ-robustly simplicial. When < γ/100 there is a polynomial ˜ such that for all rows |M ˜ i − M i |1 < , finds r row (almost time algorithm that given M ˜ . The i-th almost anchor word corresponds to a row in M that can be anchor words) in M represented as (1 − O(/γ))W i + O(/γ)W −i . Here W −i is a vector in the convex hull of other rows in W with unit length in `1 norm.

3.3

Algorithm for Learning a Topic Model: Proof of Theorem 3.4

First it is important to understand why separability helps in nonnegative matrix factorization, and specifically, the exact role played by the anchor words. Suppose the NMF algorithm is given a matrix AB. If A is p-separable then this means that A contains a diagonal matrix (up to row permutations). Thus a scaled copy of each row of B is present as a row in AB. In fact, if we knew the anchor words of A, then by looking at the corresponding rows of AB we could“read off” the corresponding row of B (up to scaling), and use these in turn to recover all of A. Thus the anchor words constitute the “key” that “unlocks” the factorization, and indeed the main step of our earlier NMF algorithm was a geometric procedure to identify the anchor words. When one is given a noisy version of AB, the analogous notion is “almost anchor” words, which correspond to rows of AB that are “very close” to rows of B; see Theorem 3.9. Now we sketch how to apply these insights to learning topic models. Let M denote the provided term-by-document matrix, whose each column describes the empirical word frequencies in the documents. It is obtained from sampling AW and thus is an extremely noisy approximation to AW . Our algorithm starts by forming the Gram matrix M M > , which can be thought of as an empirical word-word covariance matrix. In fact as the number of documents increases m1 M M > tends to a limit Q = m1 E[AW W > A], implying Q = AR(T )A> . (See Lemma 3.15.) Imagine that we are given the exact matrix Q instead of a noisy approximation. Notice that Q is a product of three nonnegative matrices, the first of which is p-separable and the last is the transpose of the first. NMF at first sight seems too weak to help find such factorizations. However, if we think of Q as a product of two nonnegative matrices, A and R(T )A> , then our NMF algorithm [11] can at least identify the anchor words of A. As noted above, these suffice to recover R(T )A> , and then (using the anchor words of A again) all of A as well. See Section 3.3.1 for details. Of course, we are not given Q but merely a good approximation to it. Now our NMF algorithm allows us to recover “almost anchor” words of A, and the crux of the proof is Section 3.3.2 showing that these suffice to recover provably good estimates to A and W W > . This uses (mostly) bounds from matrix perturbation theory, and interrelationships of condition numbers mentioned in Section 3.2. 51

For simplicity we assume the following condition on the topic model, which we will see in Section 3.3.5 can be assumed without loss of generality: (*) The number of words, n, is at most 4ar/. Please see Algorithm 3.1: Main Algorithm for description of the algorithm. Note that R is our shorthand for m1 W W > , which as noted converges to R(T ) as the number of documents increases. Algorithm 3.1. Learning Topic Models input Samples from a topic model output R and A 1: Query the oracle for m documents, where log n · a4 r6 log r · a2 r4 log r · r2 m = max O ,O ,O 2 p6 γ 2 N γ2 2 ˜, M ˜ 0 be the term-bySplit the words of each document into two halves, and let M document matrix with first and second half of words respectively. ˜M ˜ 0> 3: Compute word-by-word matrix Q = N 42 m M 4: Apply the “Robust NMF” algorithm of Theorem 3.9 to Q which returns r words that are ”almost” the anchor words of A. 5: Use these r words as input to Recover with Almost Anchor Words to compute R = m1 W W > and A 2:

3.3.1

Recover R and A with Anchor Words

We first describe how the recovery procedure works in an “idealized” setting (Algorthm 3.2,Recover with True Anchor Words), when we are given the exact value of ARA> and a set of anchor words – one for each topic. We can permute the rows of A so that the anchor words are exactly the first r words. Therefore A> = (D, U > ) where D is a diagonal matrix. Note that D is not necessarily the identity matrix (nor even a scaled copy of the identity matrix), but we do know that the diagonal entries are at least p. We apply the same permutation to the rows and columns of Q. As shown in Figure 3.1, if we look at the submatrix formed by the first r rows and r columns, it is exactly DRD. Similarly, the submatrix consisting of the first r rows is exactly DRA> . We can use these two matrices to compute R and A, in this idealized setting (and we will use the same basic strategy in the general case, but need only be more careful about how we analyze how errors compound in our algorithm). Our algorithm has exact knowledge of the matrices DRD and DRA> , and so the main task is to recover the diagonal matrix D. Given D, we can then compute A and R (for the Dirichlet Allocation we can also compute its parameters - i.e. the α so that R(α) = R). The key idea to this algorithm is that the row sums of DR and DRA> are the same, and we can 52

Algorithm 3.2. Recover with True Anchor Words input r anchor words, empirical correlation matrix Q output R and A 1: Permute the rows and columns of Q so that the anchor words appear in the first r rows and columns 2: Compute DRA> 1 (which is equal to DR1) 3: Solve for z: DRDz = DR1. 4: Output A> = ((DRDDiag(z))−1 DRA> ). 5: Output R = (Diag(z)DRDDiag(z)).

D

D

W

U

WT

=

U

DRD

DRUT

URD

URUT

R

Figure 3.1: The matrix Q use the row sums of DR to set up a system of linear constraints on the diagonal entries of D−1 . Lemma 3.10. When the matrix Q is exactly equal to ARA> and we know the set of anchor words, Recover with True Anchor Words outputs A and R correctly. Proof. The Lemma is straight forward from Figure 3.1 and the procedure. By Figure 3.1 we can find the exact value of DRA> and DRD in the matrix Q. Step 2 of recover computes DR1 by computing DRA> 1. The two vectors are equal because A is the topic-term matrix and its columns sum up to 1, in particular A> 1 = 1. In Step 3, since R is invertible by Lemma 3.6, D is a diagonal matrix with entries at least p, the matrix DRD is also invertible. Therefore there is a unique solution z = (DRD)−1 DR1 = D−1 1. Also Dz = 1 and hence DDiag(z) = I. Finally, using the fact that DDiag(z) = I, the output in step 4 is just (DR)−1 DRA> = A> , and the output in step 5 is equal to R.

3.3.2

Recover R and A with Almost Anchor Words

What if we are not given the exact anchor words, but are given words that are “close” to anchor words? As we noted, in general we cannot hope to recover the true anchor words, but even a good approximation will be enough to recover R and A. 53

When we restrict A to the rows corresponding to “almost” anchor words, the submatrix will not be diagonal. However, it will be close to a diagonal in the sense that the submatrix will be a diagonal matrix D multiplied by E, and E is close to the identity matrix (and the diagonal entries of D are at least Ω(p)). Here we analyze the same procedure as above and show that it still recovers A and R (approximately) even when given “almost” anchor words instead of true anchor words. For clarity we state the procedure again in Algorithm 3.3: Recover with Almost Anchor Words. The guarantees at each step are different than before, but the implementation of the procedure is the same. Notice that here we permute the rows of A (and hence the rows and columns of Q) so that the “almost” anchor words returned by Theorem 3.9 appear first and the submatrix A on these rows is equal to DE. Here, we still assume that the matrix Q is exactly equal to ARA> and hence the first r rows of Q form the submatrix DERA> and the first r rows and columns are DERE > D. The complication here is that Diag(z) is not necessarily equal to D−1 , since the matrix E is not necessarily the identity. However, we can show that Diag(z) is ”close” to D−1 if E is suitably close to the identity matrix – i.e. given good enough proxies for the anchor words, we can bound the error of the above recovery procedure. We write E = I + Z. Intuitively when Z has only small entries E should behave like the identity matrix. In particular, E −1 should have only small off-diagonal entries. We make this precise through the following lemmas: P Lemma 3.11. Let E = I + Z and i,j |Zi,j | = < 1/2, then E −1 1 is a vector with entries in the range [1 − 2, 1 + 2]. Proof. E is clearly invertible because the spectral norm of Z is at most 1/2. Let b = E −1 1. Since E = I + Z we multiply E on both sides to get b + Zb = 1. Let bmax be the largest absolute value of any entry of b (bmax = max P |bi |). Consider the entry i where bmax is achieved, we know bmax = |bi | ≤ 1 + |(Zb)i | ≤ 1 + j |Zi,j ||bj | ≤ 1 + bmax . Thus bmax ≤ 1/(1 − ) ≤ 2. Now all the entries in Zb are within 2 in absolute value, and we know that b = 1 + Zb. Hence all the entries of b are in the range [1 − 2, 1 + 2], as desired. P Lemma 3.12. Let E = I + Z and i,j |Zi,j | = < 1/2, then the columns of E −1 − I have `1 norm at most 2. Proof. Without loss of generality, we can consider just the first column of E −1 − I, which is equal to (E −1 − I)e1 , where e1 is the indicator vector that is one on the first coordinate and zero elsewhere. The approach is similar to that in Lemma 3.11. Let b = (E −1 − I)e1 . Left multiply by E = (I + Z) and we obtain b + Zb = −Ze1 . Hence b = −Z(b + e1 ). Let bmax be the largest absolute value of entries of b (bmax = max |bi |). Let i be the entry in which bmax is achieved. Then bmax = |bi | ≤ |(Zb)i | + |(Ze1 )i | ≤ bmax + Therefore bmax ≤ /(1 − ) ≤ 2. Further, the |b|1 ≤ |Ze1 |1 + |Zb|1 ≤ + 22 ≤ 2.

Now we are ready to show that the procedure Recover with Almost Anchor Words succeeds when given ”almost” anchor words: 54

Algorithm 3.3. Recover with Almost Anchors input r ”almost” anchor words, empirical correlation Q output R and A. 1: Permute the rows and columns of Q so that the ”almost” anchor words appear in the first r rows and columns. 2: Compute DERA> 1 (which is equal to DER1) 3: Solve for z: DERE > Dz = DER1. 4: Output A> = ((DERE > DDiag(z))−1 DERA> ). 5: Output R = (Diag(z)DERE > DDiag(z)). Lemma 3.13. When the matrix Q is exactly equal to ARA> , the matrix A restricted to almost anchor words is DE where E − I has `1 norm < 1/10 when viewed as a vector, procedure Recover with Almost Anchor Words outputs A such that each column of A has `1 error at most 6. The matrix R has additive error ZR whose `1 norm when viewed as a vector is at most 8. Proof. Since Q is exactly ARA> , our algorithm is given DERA> and DERE > D with no error. In Step 3, since D, E and R are all invertible, we have z = (DERE > D)−1 DER1 = D−1 (E > )−1 1 Ideally we would want Diag(z) = D−1 , and indeed DDiag(z) = Diag((E > )−1 1). From Lemma 3.11, the vector (E > )−1 1 has entries in the range [1 − 2, 1 + 2], thus each entry of Diag(z) is within a (1 ± 2) multiplicative factor from the corresponding entry in D−1 . Consider the output in Step 4. Since D, E, R are invertible, the first output is (DERE > DDiag(z))−1 DERA> = (DDiag(z))−1 (E > )−1 A> Our goal is to bound the `1 error of the columns of the output compared to the corresponding columns of A. Notice that it is sufficient to show that the j th row of (DDiag(z))−1 (E > )−1 is close (in `1 distance) to the indicator vector e>j . Claim. For each j, |e>j (DDiag(z))−1 (E > )−1 − e>j |1 ≤ 5 Proof. Again, without loss of generality we can consider just the first row. From Lemma 3.12 e>1 (E > )−1 has `1 distance at most 2 to e>1 . (DDiag(z))−1 has entries in the range [1 − 3, 1 + 3]. And so |e>1 (DDiag(z))−1 (E > )−1 − e>1 |1 ≤ |e>1 (DDiag(z))−1 (E > )−1 − e>1 (E > )−1 |1 + |e>1 (E > )−1 − e>1 |1 The last term can be bounded by 2. Consider the first term on the right hand side: The vector e>1 (DDiag(z))−1 − e>1 has one non-zero entry (the first one) whose absolute value is at most 3. Hence, from Lemma 3.12 the first term can be bounded by 62 ≤ 3, and this implies the claim. 55

The first row of (DDiag(z))−1 (E > )−1 A> is A1 + z > A where z is a vector with `1 norm at most 5. So every column of A is recovered with `1 error at most 6. Consider the second output of the algorithm. The output is Diag(z)DERE > DDiag(z) and we can write Diag(z)D = I + Z1 and E = I + Z2 . The leading error are Z1 R + Z2 R + RZ1 + RZ2 and hence the `1 norm of the leading error term (when treated as a vector) is at most 6 and other terms are of order 2 and can safely be bounded by 2 for suitably small ). Finally we consider the general case (in which there is additive noise in Step 1): we are not given ARA> exactly. We are given Q which is close to ARA> (by Lemma 3.15). We will bound the accumulation of this last type of error. Suppose in Step 1 of RECOVER we obtain DERA> + U and DERE > D + V and furthermore the entries of U and U 1 have absolute value at most 1 and the matrix V has `1 norm 2 when viewed as a vector. Lemma 3.14. If , 1 , 2 are sufficiently small, RECOVER outputs A such that each entry of A has additive error at most O( + (ra2 /p3 + 1 r/p2 )/γ). Also the matrix R has additive error ZR whose `1 norm when viewed as a vector is at most O( + (ra2 /p3 + 1 r/p2 )/γ). The main idea of the proof is to write DERE > D + V as DER(E > + V 0 )D. In this way the error V can be translated to an error V 0 on E and Lemma 3.13 can be applied. The error U can be handled similarly. Proof. We shall follow the proof of Lemma 3.13. First can express the error term V instead as V = (DER)V 0 (D). This is always possible because all of D, E, R are invertible. Moreover, the `1 norm of V 0 when viewed as a vector is at most 8ra2 /γp3 , because this norm will grow by a factor of at most 1/p when multiplied by D−1 , a factor of at most 2 when multiplied by E −1 and at most ra/Γ(R) when multiplied by R−1 . The bound of Γ(R) comes from Lemma 3.6, we lose an extra ra because R may not have rows sum up to 1. Hence DERE > D + V = DER(E > + V 0 )D and the additive error for DERE > D can be transformed into error in E, and we will be able to apply the analysis in Lemma 3.13. Similarly, we can express the error term U as U = DERU 0 . Entries of U 0 have absolute value at most 81 r/γp2 . The right hand side of the equation in step 3 is equal to DER1+U 1 so the error is at most 1 per entry. Following the proof of Lemma 3.13, we know Diag(z)D has diagonal entries within 1 ± (2 + 162 /γp3 + 21 ). Now we consider the output. The output for A> is equal to (DER(E > + V 0 )DDiag(z))−1 DER(A> + U 0 ) = (DDiag(z))−1 (E > + V 0 )−1 (A> + U 0 ). Here we know (E > + V 0 )−1 − I has `1 norm at most O( + ra2 /γp3 ) per row, (DDiag(z)) is a diagonal matrix with entries in 1 ± O( + ra2 /γp3 + 1 ), entries of U 0 has absolute value O(1 r/γp2 ). Following the proof of Lemma 3.13 the final entry-wise error of A is roughly the sum of these three errors, and is bounded by O( + (ra2 /p3 + 1 r/p2 )/γ) (Notice that Lemma 3.13 gives bound for `1 norm of rows, which is stronger. Here we switched to entry-wise error because the entries of U are bounded while the `1 norm of U might be large). 56

Similarly, the output of R is equal to Diag(z)(DERE > D + V )Diag(z). Again we write Diag(z)D = I + Z1 and E = I + Z2 . The extra term Diag(z)V Diag(z) is small because the entries of z are at most to 2/p (otherwise Diag(z)D won’t be close to identity). The error can be bounded by O( + (ra2 /p3 + 1 r/p2 )/γ). Now in order to prove our main theorem we just need to show that when number of documents is large enough, the matrix Q is close to the ARA> , and plug the error bounds into Lemma 3.14.

3.3.3

Error Bounds for Q

Here we show that the matrix Q indeed converges to enough.

1 AW W > A> m

= ARA> when m is large

n , with high probability all entries of Q − m1 AW W > A> have Lemma 3.15. When m > 50Nlog 2Q absolute value at most Q . Further, the `1 norm of rows of Q are also Q close to the `1 norm of the corresponding row in m1 AW W > A> .

Proof. We shall first show that the expectation of Q is equal to ARA> where R is m1 W W > . Then by concentration bounds we show that entries of Q are close to their expectations. Notice that we can also hope to show that Q converges to AR(T )A> . However in that case we will not be able to get the inverse polynomial relationship with N (indeed, even if N goes to infinity it is impossible to learn R(T ) with only one document). Replacing R(T ) with the empirical R allows our algorithm to perform better when the number of words per document is larger. To show the expectation is correct we observe that conditioned on W , the entries of two ˜ and M ˜ 0 are independent. Their expectations are both N AW . Therefore, matrices M 2 4 ˜M ˜ 0> ] = 1 E[ M E[Q] = mN 2 m

2 2 0 > ˜] ˜ ] = 1 AW W > A> = ARA> . E[M E[M N N m

We still need to show that Q is close to this expectation. This is not surprising be˜ iM ˜ 0 ). Further, the variance of cause Q is the average of m independent samples (of N42 M i ˜ iM ˜ i0> can be bounded because M ˜ and M ˜ 0 also come from independent each entry in N42 M ˜ i and M ˜0 samples. For any i, j1 , j2 , let v = AWi be the probability distribution that M i ˜ i (j1 ) is distributed as Binomial(N/2, v(j1 )) and M ˜ 0 (j2 ) is disare sampled from, then M i tributed as Binomial(N/2, v(j2 )). The variance of these two variables are less than N/8 no matter what v is by the properties of binomial distribution. Conditioned on the vector v these two variables are independent, thus the variance of their product is at most ˜ i (j1 )EM ˜ 0 (j2 )2 + EM ˜ i (j1 )2 Var M ˜ 0 (j2 ) + Var M ˜ i (j1 ) Var M ˜ 0 (j2 ) ≤ N 3 /4 + N 2 /64. The Var M i i i ˜ iM ˜ 0> is at most 4/N + 1/16N 2 = O(1/N ). Higher moments variance of any entry in N42 M i can be bounded similarly and they satisfy the assumptions of Bernstein inequalities. Thus by Bernstein inequalities the probability that any entry is more than Q away from its true value is much smaller than 1/n2 . 57

The further part follows from the observation that the `1 norm of a row in Q is proportional to the number of appearances of the word. As long as the number of appearances concentrates the error in `1 norm must be small. The words are all independent (conditioned on W ) so this is just a direct application of Chernoff bounds.

3.3.4

Proving the Main Theorem

We are now ready to prove Theorem 3.4: Proof. By Lemma 3.15 we know when we have at least 50 log n/N 2Q documents, Q is entry wise close to ARA> . In this case error per row for Theorem 3.9 is at most Q · O(a2 r2 /p2 ) because in this step we can assume that there are at most 4ar/p words (see Section 3.3.5) and to normalize the row we need a multiplicative factor of at most 10ar/p (we shall only consider rows with `1 norm at least p/10ar, with high probability all the anchor words are in these rows). The γ parameter for Theorem 3.9 is pγ/4 by Lemma 3.6. Thus the almost anchor words found by the algorithm has weight at least 1−O(Q a2 r2 /γp3 ) on diagonals. The error for DERE > D is at most Q r2 , the error for any entry of DERA> and DERA> 1 is at most O(Q ). Therefore by Lemma 3.14 the entry-wise error for A is at most O(Q a2 r3 /γp3 ). When Q < p3 γ/a2 r3 the error is bounded by . In this case we need log r · a2 r4 log n · a4 r6 ,O . m = max O 2 p6 γ 2 N γ2 The latter constraint comes from Lemma 3.6. To get within additive error for the parameter α, we further need R to be close enough to the variance-covariance matrix of the document-topic distribution, which means m is at least log r · a2 r4 log r · r2 log n · a4 r6 ,O ,O . m = max O 2 p6 γ 2 N γ2 2

3.3.5

Reducing Dictionary Size

Above we assumed that the number of distinct words is small. Here, we give a simple gadget that shows in the general case we can assume that this is the case at the loss of an additional additive in our accuracy: Lemma 3.16. The general case can be reduced to an instance in which there are at most 4ar/ words all of which (with at most one exception) occur with probability at least /4ar. Proof. In fact, we can collect all words that occur infrequently and “merge” all of these words into a aggregate word that we will call the runoff word. To this end, we call a word log n large if it appears more than mN/3ar times in m = 100ar documents, and otherwise N we call it small. Indeed, with high probability all large words are words that occur with 58

probability at least /4ar in our model. Also, all words that has a entry larger than in the corresponding row of A will appear with at least /ar probability, and is thus a large word with high probability. We can merge all small words (i.e. rename all of these words to a single, new word). Hence we can apply the above algorithm (which assumed that there are not too many distinct words). After we get a result with the modified documents we can ignore the runoff words and assign 0 weight for all the small words. The result will still be correct up to additive error.

3.4

The Dirichlet Subcase

Here we demonstrate that the parameters of a Dirichlet distribution can be (robustly) recovered from just the covariance matrix R(T ). Hence an immediate corollary is that our main learning algorithm can recover both the topic matrix A and the distribution that generates columns of W in a Latent Dirichlet Allocation (LDA) Model [26], provided that A is separable. We believe that this algorithm may be of practical use, and provides the first alternative to local search and (unproven) approximation procedures for this inference problem [157], [51], [26]. The Dirichlet distribution is parametrized by a vector α of positive reals is a natural family of continuous multivariate probability distributions. The support of the Dirichlet Distribution is the unit simplex whose dimension is the same as the dimension of α. Let α be a r dimensional vector. Then for a vector θ ∈ Rr in the r dimensional simplex, its probability density is given by P r Γ( ri=1 αi ) Y αi −1 θi , P r[θ|α] = Qr i=1 Γ(αi ) i=1 where Γ is the Gamma function. In particular, when all the αi ’s are equal to one, the Dirichlet Distribution is just the uniform random distribution over the probability simplex. The expectationP and variance of θi ’s are easy to compute given the parameters α. We denote α0 = |α|1 = ri=1 αi , then the ratio αi /α0 should be interpreted as the “size” of the i-th variable θi , and α0 shows whether the distributions is concentrated in the interior (when α0 is large) or near the boundary (when α0 is small). The first two moments of Dirichlet Distribution is listed as below: E[θi ] = ( E[θi θj ] =

αi αj α0 (α0 +1) αi (αi +1) α0 (α0 +1)

αi . α0 when i 6= j . when i = j

Suppose the Dirichlet distribution has max αi / min αi = a and the sum of parameters is α0 ; we give an algorithm that computes close estimates to the vector of parameters α given a sufficiently close estimate to the co-variance matrix R(T ) (Theorem 3.19). Combining this with Theorem 3.4, we obtain the following corollary: 59

Theorem 3.17. There is an algorithm that learns the topic matrix A with high probability up to an additive error of from at most log n · a6 r8 (α0 + 1)4 log r · a2 r4 (α0 + 1)2 ,O m = max O 2 p6 N 2 documents sampled from the LDA model and runs in time polynomial in n, m. Furthermore, we also recover the parameters of the Dirichlet distribution to within an additive . Our main goal in this section is to bound the `1 -condition number of the Dirichlet distribution (Section 3.4.1), and using this we show how to recover the parameters of the distribution from its covariance matrix (Section 3.4.2).

3.4.1

Condition Number of a Dirichlet Distribution

There is a well-known meta-principle that if a matrix W is chosen by picking its columns independently from a fairly diffuse distribution, then it will be far from low rank. However, our analysis will require us to prove an explicit lower bound on Γ(R(T )). We now prove such a bound when the columns of W are chosen from a Dirichlet distribution with parameter vector α. We note that it is easy to establish such bounds for other types of distributions as well. Recall that we defined R(T ) in Section 3.1, and here we will abuse notation and throughout this section we will denote by R(α) the matrix R(T ) where T is a Dirichlet distribution with P parameter α. Let α0 = ri=1 αi . The mean, variance and co-variance for a Dirichlet distribution are α αj when i 6= j and is equal well-known, from which we observe that R(α)i,j is equal to α0 (αi 0 +1) to

αi (αi +1) α0 (α0 +1)

when i = j.

Lemma 3.18. The `1 condition number of R(α) is at least αα

1 . 2(α0 +1)

j i +1) Proof. As the entries R(α)i,j is α0 (αi 0 +1) when i 6= j and αα0i (α when i = j, after normal(α0 +1) 1 0 ization R(α) is just the matrix D = α0 +1 (α × (1, 1, ..., 1) + I) where × is outer product and I is the identity matrix. Let x be a vector such that |x|1 = 1 and |D0 x|1 achieves the minimum in Γ(R(α)) and let I = {i|xi ≥P0} and letPJ = I¯ be the complement. We can assume without loss of generality that i∈I xi ≥ | i∈J xi | (otherwise just take −x instead). The product D0 x is P xi α + α01+1 x. The first term is a nonnegative vector and hence for each i ∈ I, (D0 x)i ≥ 0. α0 +1 This implies that 1 X 1 |D0 x|1 ≥ xi ≥ . α0 + 1 i∈I 2(α0 + 1)

60

3.4.2

Recovering the Parameters of a Dirichlet Distribution

When the variance covariance matrix R(α) is recovered with error R in `1 norm when viewed as a vector, we can use Algorithm 3.4: Dirichlet to compute the parameters for the Dirichlet. Algorithm 3.4. Dirichlet Parameters input matrix R output Dirichlet parameters α Set α/α0 = R1. Let i be the row with smallest `1 norm, let u = Ri.i and v = αi /α0 . 1−u/v . Set α0 = u/v−v Output α = α0 · (α/α0 ). Theorem 3.19. When the variance covariance matrix R(α) is recovered with error R in `1 norm when viewed as a vector, the procedure Dirichlet(R) learns the parameter of the Dirichlet distribution with error at most O(ar(α0 + 1)R ). ± R and the value v is Proof. The αi /α0 ’s all have error at most R . The value u is αα0i αα0i +1 +1 αi /α0 ± R . Since v ≥ 1/ar we know the error for u/v is at most 2arR . Finally we need to bound the denominator αα0i +1 − αα0i > 2(α01+1) (since αα0i ≤ 1/r ≤ 1/2). Thus the final error is +1 at most 5ar(α0 + 1)R .

3.5

Obtaining Almost Anchor Words

In this section, we prove Theorem 3.9, which we restate here: Theorem 3.20. Suppose M = AW where W and M are normalized to have rows sum up to 1, A is separable and W is γ-robustly simplicial. When < γ/100 there is a polynomial ˜ such that for all rows |M ˜ i − M i |1 < , finds r row (almost time algorithm that given M ˜ . The i-th almost anchor word corresponds to a row in M that can be anchor words) in M represented as (1 − O(/γ))W i + O(/γ)W −i . Here W −i is a vector in the convex hull of other rows in W with unit length in `1 norm. The major weakness of the algorithm in [11] is that it only considers the `1 norm. However in an r dimensional simplex there is another natural measure of distance more suited to our purposes: since each point is a unique convex combination of the vertices, we can view this convex combination as a probability distribution and use statistical distance (on this representation) as a norm for points inside the simplex. We will in fact need a slight modification to this norm, since we would like it to extend to points outside the simplex too: Definition 3.21 ((δ, )-close). A point M 0j is (δ, )-close to M 0i if and only if

ck ≥0,

Pn

min

k=1 ck =1,cj ≥1−δ

0i

|M − 61

n X k=1

ck M 0k |1 ≤ .

Intuitively think of point M 0j is (δ, )-close to M 0i if M 0i is close in `1 distance to some point Q, where Q is a convex combination of the rows of M 0 that places at least 1 − δ weight on M 0j . Notice that this definition is not a distance since it is not symmetric, but we will abuse notation and nevertheless call it a distance function. We remark that this distance is easy to compute: To check whether M 0j is (δ, )-close to M 0i we just need to solve a linear program that minimizes the `1 distance when the c vector is a probability distribution with at least 1 − δ weight on j (the constraints on c are clearly all linear). We also consider all points that a row M 0j is close to, this is called the neighborhood of 0j M . Definition 3.22 ((δ, )-neighborhood). The (δ, )-neighborhood of M 0j are the rows M 0i such that M 0j is (δ, )-close to M 0i . For each point M 0j ,P we know its original (unperturbed) point Mj is in a convex combii j nation of W ’s: M = ri=1 Aj,i W i . Separability implies that for any column index i there is a row f (i) in A whose only nonzero entry is in the ith column. Then M f (i) = W i and consequently |M 0f (i) − W i |1 < . Let us call these rows M 0f (i) for all i the canonical rows. From the above description the following claim is clear. Claim. Every row M 0j has `1 -distance at most 2 to the convex hull of canonical rows. Proof. We have: 0j

|M −

r X k=1

Aj,k M

0f (k)

0j

j

j

|1 ≤ |M − M |1 + |M −

r X k=1

Aj,k M

f (k)

|1 + |

r X k=1

Aj,k (M f (k) − M 0f (k) )|1

and we can bound the right hand side by 2. The algorithm will distinguish rows that are close to vertices and rows that are far by testing whether each row is close (in `1 norm) to the convex hull of rows outside its neighborhood. In particular, we define a robust loner as: Definition 3.23 (robust loner). We call a row M 0j a robust-loner if it has `1 distance at most 2 to the convex hull of rows that are outside its (6/γ, 2) neighborhood. Our goal is to show that a row is a robust loner if and only if it is close to some row in W . The following lemma establishes one direction: Lemma 3.24. If Aj,t is smaller than 1 − 10/γ, the point M 0j cannot be (6/γ, 2)-close to the canonical row that corresponds to W > . Proof. Assume towards contradiction that M 0j is (6/γ, 2)-close to the canonical row M 0i which is a perturbation of W > . By definition there must be probability distribution c ∈ Rn P over the rows such that cj ≥ 1−6/γ, and |M 0i − nk=1 ck M 0k |1 ≤ 2. Now we instead consider the unperturbed matrix M , since every row of M 0 is Pclose (in `1 norm) to M we know Pn i k |M − k=1 ck M |1 ≤ 4. Now we represent M i and nk=1 ck M k as convex combinations of rows of W and consider the coefficient on W > . Clearly M i = W > so the coefficient is 1. 62

P But for nk=1 ck M k , since cj ≥ 1 − 6/γ and the coefficient Aj,t ≤ 1 − 10/γ, we know the coefficient of W > in the sum must be strictly P smaller than 1 − 10/γ + 6/γ = 1 − 4/γ. i By the robustly simplicial assumption M and nk=1 ck M k must be more than 4/γ · γ = 4 apart in `1 norm, which contradicts our assumption. As a corollary: Corollary 3.25. If Aj,t is smaller than 1 − 10/γ for all t, the row M 0j cannot be a robust loner. Proof. By the above lemma, we know the canonical rows are not in the (6/γ, 2) neighborhood of M 0j . Thus by Claim 3.5 the row is close to the convex hull of canonical rows and cannot be a robust loner. Next we prove the other direction: a canonical row is necessarily a robust loner: Lemma 3.26. All canonical rows are robust loners. Proof. Suppose M 0i is a canonical row that corresponds to W > . We first observe that all the rows that are outside the (6/γ, 2) neighborhoodPof M 0j must have Aj,t < 1 − 6/γ. This is because when Aj,t ≥ 1 − 6/γ we have M j − rk=1 Aj,t W > = 0. If we replace M j by M 0j and W > by the corresponding canonical row, the distance is still at most 2 and the coefficient on M 0i is at least 1 − 6/γ. By definition the corresponding row M 0j must be in the neighborhood of M 0i . Now we try to represent M 0i with convex combination of rows that has Aj,t < 1 − 6/γ. However this is impossible because every point in the convex combination will also have weight smaller than 1 − 6/γ on W > , while M i has weight 1 on W > . The `1 distance between Mi and the convex combination of the Mj ’s where Aj,t < 1 − 6/γ is a least 6 by robust simplicial property. Even when the points are perturbed by (in `1 ) the distance can change by at most 2 and is still more than 2. Therefore M 0i is a robust loner. Now we can prove the main theorem of this section: Proof. Suppose we know γ and 100 < γ.When γ is so small we have the following claim: Claim. If Aj,t and Ai,l is at least 1 − 10/γ, and t 6= l, then M 0j cannot be (10/γ, 2)-close to M 0i and vice versa. The proof is almost identical to Lemma 3.24. Also, the canonical row that corresponds to W > is (10/γ, 2) close to all rows with Aj,t ≥ 1 − 10/γ. Thus if we connect two robust loners when one is (10/γ, 2) close to the other, the connected component of the graph will exactly be a partition according to the row in W that the robust loner is close to. We pick one robust loner in each connected component to get the almost anchor words. Now suppose we don’t know γ. In this case the problem is we don’t know what is the right size of neighborhood to look at. However, since we know γ > 100, we shall first run the algorithm with γ = 100 to get r rows W 0 that are very close to the true rows in W . It is not hard to show that these rows are at least γ/2 robustly simplicial and at most γ + 2 robustly simplicial. Therefore we can compute the γ(W 0 ) parameter for this set of rows and use γ(W 0 ) − 2 as the γ parameter. 63

3.6

Maximum Likelihood Estimation is Hard

Here we prove that computing the Maximum Likelihood Estimate (MLE) of the parameters of a topic model is N P -hard. We call this problem the Topic Model Maximum Likelihoood Estimation (TM-MLE) problem: Definition 3.27 (TM-MLE). Given m documents and a target of r topics, the TM-MLE problem asks to compute the topic matrix A that has the largest probability of generating the observed documents (when the columns of W are generated by a uniform Dirichlet distribution). Surprisingly, this appears to be the first proof that computing the MLE estimate in a topic model is indeed computationally hard, although its hardness is certainly to be expected. On a related note, Sontag and Roy [143] recently proved that given the topic matrix and a document, computing the Maximum A Posteriori (MAP) estimate for the distribution on topics that generated this document is N P -hard. Here we will establish that TM-MLE is N P -hard via a reduction from the MIN-BISECTION problem: In MIN-BISECTION the input is a graph with n vertices (n is an even integer), and the goal is to partition the vertices into two equal sized sets of n/2 vertices each so as to minimize the number of edges crossing the cut. Theorem 3.28. There is a polynomial time reduction from MIN-BISECTION to TM-MLE (r = 2). Proof. Suppose we are given an instance G of the MIN-BISECTION problem with n vertices and m edges. We will now define an instance of the TM-MLE problem. First, we set the number of words to be n. For each word i, we construct N = d200m3 log ne documents each of which contain the word i twice and no other words. For each edge in the graph G, we construct a document whose two words correspond to the endpoints of the edge. Suppose that x = (x1 , x2 )> is generated by the Dirichlet distribution Dir(1, 1). Consequently the probability that words i and j appear in a document with only two words is exactly (Ai x)·(Aj x). We can take the expectation of this term over the Dirichlet distribution Dir(1, 1) and hence the probability that a document (with exactly two words) contains the words i and j is 1 1 E[(Ai x) · (Aj x)] = (Ai · Aj ) + (Ai1 Aj2 + Aj1 Ai2 ) 3 6 In the TM-MLE problem, our goal is to maximize the following objective function (which is the log of the probability of generating the collection of documents): X 1 i j 1 i j j i OBJ = log (A · A ) + (A1 A2 + A1 A2 ) . 3 6 document = {i, j} For any bisection, we define a canonical solution: the first topic is uniform on all words on one side of the bisection and the second topic is uniform on all words on the other side of the bisection.To prove the correctness of our reduction, a key step is to show that any 64

candidate solution to the MLE problem must be close to a canonical solution. In particular, we show the following: 1. The rows Ai have almost the same `1 norm. 2. In each row Ai , almost all of the weight will be in one of the two topics. Indeed, canonical solutions have large objective value. Any canonical solution has objective value at least −N n log 3n2 /4 − m log 3n2 /2 (this is because documents with same words contribute − log 3n2 /4 and documents with different words contribute at least − log 3n2 /2). Recall, in our reduction N is large. Roughly, if one of the rows has `1 norm that is bounded away from 2/n by at least 1/20nm, the contribution (to the objective function) of documents with a repeated word will decrease significantly and the solution cannot be optimal. To prove this we use the fact that the function log x2 = 2 log x is concave. Therefore when one of the rows has `1 norm more than 2/n + 1/20nm, the optimal value for documents with a repeated word will be attained when all other rows have the same `1 norm 2/n − 1/20nm(n − 1). Using a Taylor expansion, we conclude that the sum of terms for documents with a repeated word will decrease by at least N/50m2 which is much larger than any effect the remaining m documents can recoup. In fact, an identical argument establishes that in each row Ai , the topic with smaller weight will always have weight smaller than 1/20nm. Now we claim that among canonical solutions, the one with largest objective value corresponds to a minimum bisection. The proof follows from the observation that the value of the objective function is −N n log 3n2 /4 − k log 3n2 /2 − (m − k) log 3n2 /4 for canonical solutions, where k is the number of edges cut by the bisection. In particular, the objective function of the minimum bisection will be at least an additive log 2 larger than the objective function of a non-minimum bisection. However, even if the canonical solution is perturbed by 1/20nm, the objective function will only change by at most m · 1/10m = 1/10, which is much smaller than log2 2 . And this completes our reduction. We remark that the canonical solutions in our reduction are all separable, and hence this reduction applies even when the topic matrix A is known (and required) to be separable. So, even in the case of a separable topic matrix, it is N P -hard to compute the MLE.

65

Chapter 4 Practical Implementations and Experiments In the previous chapter, we designed a new algorithm that can provably learn a topic model. However, the algorithm is not very practical for several reasons. First, the algorithm requires solving many linear programs, which are very slow in practice. We need to replace linear programming with a combinatorial anchor-word selection algorithm. Second, the recovery algorithm relies on matrix inversion, which is notoriously unstable and can potentially produce negative entries. In this chapter we present a simple probabilistic interpretation of topic recovery given anchor words that replaces matrix inversion with a new gradient-based inference method. In this chapter we show how to redesign and implement the algorithm in Chapter 3 efficiently. The new algorithm produces results comparable to the best MCMC implementations while running orders of magnitude faster. We study both the empirical sample complexity of the algorithms on synthetic distributions, and the performance of the algorithms on real-world document corpora. We find that our algorithm performs as well as collapsed Gibbs sampling on a variety of metrics, and runs at least an order of magnitude faster. The rest of the chapter is organized as follows: in Section 4.1 we reformulate the main framework of the algorithm, and show how the recovery step can be replaced by a probabilistic approach. The proofs of new recovery algorithms are deferred to Section 4.4. In Section 4.2 we give a combinatorial NMF algorithm that is much faster than the ones in previous Chapters. The proof of the algorithm is deferred to Section 4.5. Section 4.3 introduces the methodology of experiments and presents the results.

4.1

A Probabilistic Approach to Exploiting Separability

Recall that the algorithm in Chapter 3 has two steps: anchor selection, which identifies anchor words, and recovery, which recovers the parameters of A and of τ . Both anchor 66

Algorithm 4.1. High Level Algorithm input Textual corpus D, Number of anchors K, Tolerance parameters a , b > 0. output Word-topic matrix A, topic-topic matrix R Q ← Word Co-occurences(D) ¯ 1, Q ¯ 2 , ...Q ¯ V }, the normalized rows of Q. Form {Q ¯ 1, Q ¯ 2 , ...Q ¯ V }, K, a ) (Algorithm 4.3) S ← FastAnchorWords({Q A, R ← RecoverKL(Q, S, b ) (Algorithm 4.2) return A, R Algorithm 4.2. RecoverKL input Matrix Q, Set of anchor words S, tolerance parameter . output Matrices A,R ¯ Normalize the rows of Q to form Q. Store the normalization constants pw = Q1. ¯ s is the row of Q ¯ for the k th anchor word. Q k for i = 1, ..., V do ¯ i || P ¯ Solve Ci· = arg min D ( Q KL C k∈S Ci,k Qsk ) i P Subject to: k Ci,k = 1 and Ci,k ≥ 0 With tolerance: end for A0 = diag(pw )C Normalize the columns of A0 to form A. T R = A† QA† return A, R selection and recovery take as input the matrix Q (of size V ×V ) of word-word co-occurrence counts, whose construction is described in the supplementary material. Q is normalized so that the sum of all entries is 1. The high-level flow of our complete learning algorithm is described in Algorithm 4.1, and follows the same two steps. In this section we will introduce a new recovery method based on a probabilistic framework. We defer the discussion of anchor selection to the next section, where we provide a purely combinatorial algorithm for finding the anchor words. Here we adopt a new probabilistic approach, which we describe below after introducing some notation. Consider any two words in a document and call them w1 and w2 , and let z1 and z2 refer to their topic assignments. We will use Ai,k to index the matrix of word-topic distributions, i.e. Ai,k = Pr(w1 = i|z1 = k) = Pr(w2 = i|z2 = k). Given infinite data, the elements of the Q matrix can be interpreted as Qi,j = Pr(w1 = i, w2 = j). The row¯ which plays a role in both finding the anchor words and normalized Q matrix, denoted Q, ¯ i,j = Pr(w2 = j|w1 = i). the recovery step, can be interpreted as a conditional probability Q Denoting the indices of the anchor words as S = {s1 , s2 , ..., sK }, the rows indexed by ¯ lies in the convex hull of the rows elements of S are special in that every other row of Q 67

indexed by the anchor words. To see this, first note that for an anchor word sk , X ¯ s ,j = Q Pr(z1 = k 0 |w1 = sk ) Pr(w2 = j|z1 = k 0 ) k

(4.1)

k0

= Pr(w2 = j|z1 = k),

(4.2)

where (4.1) uses the fact that in an admixture model w2 ⊥w1 | z1 , and (4.2) is because Pr(z1 = k|w1 = sk ) = 1. For any other word i, we have X ¯ i,j = Q Pr(z1 = k|w1 = i) Pr(w2 = j|z1 = k). k

¯ i,j = P Ci,k Q ¯ s ,j . Since C is Denoting the probability Pr(z = k|w = i) as C , we have Q 1 1 i,k k k P ¯ non-negative and k Ci,k = 1, we have that any row of Q lies in the convex hull of the rows corresponding to the anchor words. The mixing weights give us Pr(z1 |w1 = i)! Using this together with Pr(w1 = i), we can recover the A matrix simply by using Bayes’ rule: Pr(z1 = k|w1 = i) Pr(w1 = i) . Pr(w1 = i|z1 = k) = P 0 0 i0 Pr(z1 = k|w1 = i )p(w1 = i ) P P Finally, we observe that Pr(w1 = i) is easy to solve for since j Qi,j = j p(w1 = i, w2 = j) = Pr(w1 = i). Our new algorithm finds, for each row of the empirical row normalized co-occurrence ˆ i , the coefficients Pr(z1 |w1 = i) that best reconstruct it as a convex combination matrix, Q of the rows that correspond to anchor words. This step can be solved quickly and in parallel (independently) for each word using the exponentiated gradient algorithm. Once we have Pr(z1 |w1 ), we recover the A matrix using Bayes’ rule. The full algorithm using KL divergence as an objective is found in Algorithm 4.2. Further details of the exponentiated gradient algorithm are given in the supplementary material. One reason to use KL divergence as the measure of reconstruction error is that the recovery procedure can then be understood as maximum likelihood estimation. In particular, we seek the parameters Pr(w1 ), Pr(z1 |w1 ), Pr(w2 |z1 ) that maximize the likelihood of observing ˆ However, the optimization problem does not explicitly the word co-occurence counts, Q. constrain the parameters to correspond an admixture model. We can also define a similar algorithm using quadratic loss, which we call RecoverL2. This formulation has the extremely useful property that both the objective and gradient can be kernelized so that the optimization problem is independent of the vocabulary size. To see this, notice that the objective can be re-written as T

T

||Qi − CiT QS ||2 = ||Qi ||2 − 2Ci (QS Qi ) + CiT (QS QS )Ci , T

T

where QS QS is K × K and can be computed once and used for all words, and QS Qi is K × 1 and can be computed once prior to running the exponentiated gradient algorithm for word i. 68

To recover the R matrix for an admixture model, recall that Q = ARAT . This may be an over-constrained system of equations with no solution for R, but we can find a least-squares approximation to R by pre- and post-multiplying Q by the pseudo-inverse A† . For the special case of LDA we can learnPthe Dirichlet hyperparameters. Recall that in applying Bayes’ rule we calculated Pr(z1 ) = i0 Pr(z1 |w1 = i0 ) Pr(w1 = i0 ). These values for Pr(z) specify the Dirichlet hyperparameters up to a constant scaling. This constant could be recovered from the R matrix using Algorithm 3.4, but in practice we find it is better to choose it using a grid search to maximize the likelihood of the training data.

4.2

A Combinatorial Algorithm for Finding Anchor Words

Here we consider the anchor selection step of the algorithm where our goal is to find the anchor words. In the infinite data case where we have infinitely many documents, the convex hull of the rows in Q will be a simplex where the vertices of this simplex correspond to the anchor words. Since we only have a finite number of documents, the rows of Q are only an approximation to their expectation. We are therefore given a set of V points d1 , d2 , ...dV that are each a perturbation of a1 , a2 , ...aV whose convex hull P defines a simplex. We would like to find an approximation to the vertices of P . This is the same as the problems we solved in Section 2.5 and Section 3.5. In this section we describe a purely combinatorial algorithm for this task that avoids linear programming altogether. The new “FastAnchorWords” algorithm is given in Algorithm 4.3. To find all of the anchor words, our algorithm iteratively finds the furthest point from the subspace spanned by the anchor words found so far. Since the points we are given are perturbations of the true points, we cannot hope to find the anchor words exactly. Nevertheless, the intuition is that even if one has only found r points S that are close to r (distinct) anchor words, the point that is furthest from span(S) will itself be close to a (new) anchor word. The additional advantage of this procedure is that when faced with many choices for a next anchor word to find, our algorithm tends to find the one that is most different than the ones we have found so far. The main contribution of this section is a proof that the FastAnchorWords algorithm succeeds in finding K points that are close to anchor words. To precisely state the guarantees, we use a similar notion of robustness as Section 2.5 (but notice that they are different because of different norms used): Definition 4.1. A simplex P is γ-robust if for every vertex v of P , the `2 distance between v and the convex hull of the rest of the vertices is at least γ. In most reasonable settings the parameters of the topic model define lower bounds on the robustness of the polytope P . For example, in LDA, this lower bound is based on the largest ratio of any pair of hyper-parameters in the model [12]. Our goal is to find a set of points that are close to the vertices of the simplex, and to make this precise we introduce the following definition: 69

Algorithm 4.3. FastAnchorWords Input: V points {d1 , d2 , ...dV } in V dimensions, almost in a simplex with K vertices and >0 Output: K points that are close to the vertices of the simplex. Project the points di to a randomly chosen 4 log V /2 dimensional subspace S ← {di } s.t. di is the farthest point from the origin. for i = 1 TO K − 1 do Let dj be the point in {d1 , . . . , dV } that has the largest distance to span(S). S ← S ∪ {dj }. end for 0 }. S = {v10 , v20 , ...vK for i = 1 TO K do 0 Let dj be the point that has the largest distance to span({v10 , v20 , ..., vK }\{vi0 }) Update vi0 to dj end for 0 return {v10 , v20 , ..., vK }. Notation: span(S) denotes the subspace spanned by the points in the set S. We compute the distance from a point x to the subspace span(S) by computing the norm of the projection of x onto the orthogonal complement of span(S).

Definition 4.2. Let a1 , a2 , ...aV be a set of points whose convex hull P is a simplex with vertices v1 , v2 , ...vK . Then weP say ai -covers vj if when aj is written as a convex combination of the vertices as ai = j cj vj , then cj ≥ 1 − . Furthermore we will say that a set of K points -covers the vertices if each vertex is covered by some point in the set. We will prove the following theorem: suppose there is a set of points A = a1 , a2 , ...aV whose convex hull P is γ-robust and has vertices v1 , v2 , ...vK (which appear in A) and that we are given a perturbation d1 , d2 , ...dV of the points so that for each i, kai − di k ≤ , then: ˜ 2 + V K/2 ) 1 and Theorem 4.3. There is a combinatorial algorithm that runs in time O(V outputs a subset of {d1 , . . . , dV } of size K that O(/γ)-covers the vertices provided that 20K/γ 2 < γ. This new algorithm not only helps us avoid linear programming altogether in inferring the parameters of a topic model, but also can be used to solve the nonnegative matrix factorization problem under the separability assumption, again without resorting to linear programming. Our analysis rests on the following lemmas, whose proof we defer to the supplementary material. Suppose the algorithm has found a set S of k points that are each δ-close to distinct vertices in {v1 , v2 , ..., vK } and that δ < γ/20K. Lemma 4.4. There is a vertex vi whose distance from span(S) is at least γ/2. 1

In practice we find setting dimension to 1000 works well. The running time is then O(V 2 + 1000V K).

70

The proof of this lemma is based on a volume argument, and the connection between the volume of a simplex and the determinant of the matrix of distances between its vertices. Lemma 4.5. The point dj found by the algorithm must be δ = O(/γ 2 ) close to some vertex vi . This lemma is used to show that the error does not accumulate too badly in our algorithm, since δ only depends on , γ (not on the δ used in the previous step of the algorithm). This prevents the error from accumulating exponentially in the dimension of the problem, which would be catastrophic for our proof. After running the first phase of our algorithm, we run a cleanup phase (the second loop in Alg. 4.3) that can reduce the error in our algorithm. When we have K − 1 points close to K − 1 vertices, only one of the vertices can be far from their span. The farthest point must be close to this missing vertex. The following lemma shows that this cleanup phase can improve the guarantees of Lemma 4.5: Lemma 4.6. Suppose |S| = K − 1 and each point in S is δ = O(/γ 2 ) < γ/20K close to some vertex vi , then the farthest point vj0 found by the algorithm is 1 − O(/γ) close to the remaining vertex. This algorithm is a greedy approach to maximizing the volume of the simplex. The larger the volume is, the more words per document the resulting model can explain. Better anchor word selection is an open question for future work. We have experimented with a variety of other heuristics for maximizing simplex volume, with varying degrees of success.

4.3

Experimental Results

We compare three parameter recovery methods, Recover [12], RecoverKL and RecoverL2 to a fast implementation of Gibbs sampling [122].2 Linear programming-based anchor word finding is too slow to be comparable, so we use FastAnchorWords for all three recovery algorithms. Using Gibbs sampling we obtain the word-topic distributions by averaging over 10 saved states, each separated by 100 iterations, after 1000 burn-in iterations.

4.3.1

Methodology

We train models on two synthetic data sets to evaluate performance when model assumptions are correct, and real documents to evaluate real-world performance. To ensure that synthetic documents resemble the dimensionality and sparsity characteristics of real data, we generate semi-synthetic corpora. For each real corpus, we train a model using MCMC and then generate new documents using the parameters of that model (these parameters are not guaranteed to be separable). 2

We were not able to obtain [9]’s implementation of their algorithm, and our own implementation is too slow to be practical.

71

We use two real-world data sets, a large corpus of New York Times articles (295k documents, vocabulary size 15k, mean document length 298) and a small corpus of NIPS abstracts (1100 documents, vocabulary size 2500, mean length 68). Vocabularies were pruned with document frequency cutoffs. We generate semi-synthetic corpora of various sizes from models trained with K = 100 from NY Times and NIPS, with document lengths set to 300 and 70, respectively, and with document-topic distributions drawn from a Dirichlet with symmetric hyperparameters 0.03. We use a variety of metrics to evaluate models: For the semi-synthetic corpora, we can compute reconstruction error between the true word-topic matrix A and learned topic distributions. Given a learned matrix Aˆ and the true matrix A, we use an LP to find the best matching between topics. Once topics are aligned, we evaluate `1 distance between each pair of topics. When true parameters are not available, a standard evaluation for topic models is to compute held-out probability, the probability of previously unseen documents under the learned model. This computation is intractable but there are reliable approximation methods [33, 158]. Topic models are useful because they provide interpretable latent dimensions. We can evaluate the semantic quality of individual topics using a metric called Coherence. Coherence is based on two functions, D(w) and D(w1 , w2 ), which are number of documents with at least one instance of w, and of w1 and w2 , respectively [126]. Given a set of words W, coherence is Coherence(W) =

X w1 ,w2 ∈W

log

D(w1 , w2 ) + . D(w2 )

(4.3)

The parameter = 0.01 is used to avoid taking the log of zero for words that never co-occur [146]. This metric has been shown to correlate well with human judgments of topic quality. If we perfectly reconstruct topics, all the high-probability words in a topic should co-occur frequently, otherwise, the model may be mixing unrelated concepts. Coherence measures the quality of individual topics, but does not measure redundancy, so we measure inter-topic similarity. For each topic, we gather the set of the N most probable words. We then count how many of those words do not appear in any other topic’s set of N most probable words. Some overlap is expected due to semantic ambiguity, but lower numbers of unique words indicate less useful models.

4.3.2

Efficiency

The Recover algorithms, in Python, are faster than a heavily optimized Java Gibbs sampling implementation [165]. Fig. 4.1 shows the time to train models on synthetic corpora on a single machine. Gibbs sampling is linear in the corpus size. RecoverL2 is also linear (ρ = 0.79), but only varies from 33 to 50 seconds. Estimating Q is linear, but takes only 7 seconds for the largest corpus. FastAnchorWords takes less than 6 seconds for all corpora.

72

3000

Seconds

Algorithm 2000

Gibbs Recover RecoverL2

1000

RecoverKL

0 0

25000

50000

75000

100000

Documents

Figure 4.1: Training time on synthetic NIPS documents.

SynthNYT, L1 error 2.0 Algorithm

L1.error

1.5

Gibbs Recover

1.0

RecoverL2 RecoverKL

0.5 0.0 50000

100000

150000

200000

250000

300000

400000

500000 1000000 2000000 Infinite

Documents

Figure 4.2: `1 error for a semi-synthetic model generated from a model trained on NY Times articles with K = 100. The horizontal line indicates the `1 error of K uniform distributions.

4.3.3

Semi-synthetic documents

The new algorithms have good `1 reconstruction error on semi-synthetic documents, especially for larger corpora. Results for semi-synthetic corpora drawn from topics trained on NY Times articles are shown in Fig. 4.2 for corpus sizes ranging from 50k to 2M synthetic documents. In addition, we report results for the three Recover algorithms on “infinite data,” that is, the true Q matrix from the model used to generate the documents. Error bars show variation between topics. Recover performs poorly in all but the noiseless, infinite data setting. Gibbs sampling has lower `1 with smaller corpora, while the new algorithms get better recovery and lower variance with more data (although more sampling might reduce MCMC error further). Results for semi-synthetic corpora drawn from NIPS topics are shown in Fig. 4.3. Recover does poorly for the smallest corpora (topic matching fails for D = 2000, so `1 is not meaningful), but achieves moderate error for D comparable to the NY Times corpus. RecoverKL and RecoverL2 also do poorly for the smallest corpora, but are comparable to or better than Gibbs sampling, with much lower variance, after 40,000 documents.

4.3.4

Effect of separability

The non-negative algorithms are more robust to violations of the separability assumption than the original Recover algorithm. In Fig. 4.3, Recover does not achieve zero `1 error even with noiseless “infinite” data. Here we show that this is due to lack of separability. In our semi-synthetic corpora, documents are generated from the LDA model, but the topicword distributions are learned from data and may not satisfy the anchor words assumption. We test the sensitivity of algorithms to violations of the separability condition by adding a 73

SynthNIPS, L1 error 2.0 Algorithm

L1.error

1.5

Gibbs Recover

1.0

RecoverL2 RecoverKL

0.5 0.0 2000

4000

6000

8000

10000

20000

40000

60000

80000

100000

Infinite

Documents

Figure 4.3: `1 error for a semi-synthetic model generated from a model trained on NIPS papers with K = 100. Recover fails for D = 2000.

SynthNYT+Anchors, L1 error 2.0 Algorithm

L1.error

1.5

Gibbs Recover

1.0

RecoverL2 RecoverKL

0.5 0.0 50000

100000

150000

200000

250000

300000

400000

500000 1000000 2000000 Infinite

Documents

Figure 4.4: When we add artificial anchor words before generating synthetic documents, `1 error goes to zero for Recover and close to zero for RecoverKL and RecoverL2.

synthetic anchor word to each topic that is by construction unique to the topic. We assign the synthetic anchor word a probability equal to the most probable word in the original topic. This causes the distribution to sum to greater than 1.0, so we renormalize. Results are shown in Fig. 4.4. The `1 error goes to zero for Recover, and close to zero for RecoverKL and RecoverL2. The reason RecoverKL and RecoverL2 do not reach exactly zero is because we do not solve the optimization problems to perfect optimality.

4.3.5

Effect of correlation

The theoretical guarantees of the new algorithms apply even if topics are correlated. To test how algorithms respond to correlation, we generated new synthetic corpora from the same K = 100 model trained on NY Times articles. Instead of a symmetric Dirichlet distribution, we use a logistic normal distribution with a block-structured covariance matrix. We partition topics into 10 groups. For each pair of topics in a group, we add a non-zero off-diagonal element to the covariance matrix. This block structure is not necessarily realistic, but shows the effect of correlation. Results for two levels of covariance (ρ = 0.05, ρ = 0.1) are shown in Fig. 4.5. Results for Recover are much worse in both cases than the Dirichlet-generated corpora in Fig. 4.2. The other three algorithms, especially Gibbs sampling, are more robust to correlation, but performance consistently degrades as correlation increases, and improves with larger corpora. With infinite data `1 error is equal to `1 error in the uncorrelated synthetic corpus (non-zero because of violations of the separability assumption).

74

SynthNYT, strong corr, L1 error

SynthNYT, moderate corr, L1 error

2.0

2.0 Algorithm

Algorithm 1.5

Gibbs

L1.error

L1.error

1.5

Recover 1.0

RecoverL2 RecoverKL

0.5

Gibbs Recover

1.0

RecoverL2 RecoverKL

0.5

0.0

0.0 50000

100000

150000

200000

250000

300000

400000

500000 1000000 2000000 Infinite

50000

100000

150000

200000

250000

Documents

300000

400000

500000 1000000 2000000 Infinite

Documents

Figure 4.5: `1 error increases as we increase topic correlation. We use the same K = 100 topic model from NY Times articles, but add correlation: TOP ρ = 0.05, BOTTOM ρ = 0.1.

NIPS

NYT

0.0 LogProbPerToken

−2.5 −5.0 −7.5 0

Algorithm

Value

Coherence

−250 −500 −750

Gibbs Recover RecoverL2 RecoverKL

−1000

UniqueWords

10

5

0 50

100

150

200

50

100

150

200

Topics

Figure 4.6: Held-out probability (per token) is similar for RecoverKL, RecoverL2, and Gibbs sampling. RecoverKL and RecoverL2 have better coherence, but fewer unique words than Gibbs. (Up is better for all three metrics.)

4.3.6

Real documents

The new algorithms produce comparable quantitative and qualitative results on real data. Fig. 4.6 shows three metrics for both corpora. Error bars show the distribution of log probabilities across held-out documents (top panel) and coherence and unique words across topics (center and bottom panels). Held-out sets are 230 documents for NIPS and 59k for NY Times. For the small NIPS corpus we average over 5 non-overlapping train/test splits. The matrix-inversion in Recover failed for the smaller corpus (NIPS). In the larger corpus (NY Times), Recover produces noticeably worse held-out log probability per token than the other algorithms. Gibbs sampling produces the best average held-out probability (p < 0.0001 under a paired t-test), but the difference is within the range of variability between documents. We tried several methods for estimating hyperparameters, but the observed differences did not change the relative performance of algorithms. Gibbs sampling has worse coherence than the Recover algorithms, but produces more unique words per topic. These patterns are consistent with semi-synthetic results for similarly sized corpora (details are in supplementary material). 75

Table 4.1: Example topic pairs from NY Times (closest `1 ), anchor words in bold. All 100 topics are in suppl. material. RecoverL2 Gibbs RecoverL2 Gibbs RecoverL2 Gibbs

run inning game hit season zzz anaheim angel run inning hit game ball pitch father family zzz elian boy court zzz miami zzz cuba zzz miami cuban zzz elian boy protest file sport read internet email zzz los angeles web site com www mail zzz internet

For each NY Times topic learned by RecoverL2 we find the closest Gibbs topic by `1 distance. The closest, median, and farthest topic pairs are shown in Table 4.1.3 We observe that when there is a difference, recover-based topics tend to have more specific words (Anaheim Angels vs. pitch).

4.4

Proof for Nonnegative Recover Procedure

¯ are perturbed, In order to show RecoverL2 learns the parameters even when the rows of Q ¯ are close to the expectation, we need the following lemma that shows when columns of Q the posteriors c computed by the algorithm is also close to the true value. Lemma 4.7. For a γ robust simplex S with vertices {v1 , v2 , ..., P vK }, let v be a point in the simplex that can be represented as a convex combination v = K i=1 ci vi . If the vertices 0 0 0 0 of S are perturbed to S = {..., vi , ...} where kvi − vi k ≤ δ1 and v is perturbed PK to0 v where 0 ∗ 0 0 ∗ kv√− v k ≤ δ2 . Let v be the point in S that is closest to v , and v = i=1 ci vi , when 10 Kδ1 ≤ γ for all i ∈ [K] |ci − c0i | ≤ 4(δ1 + δ2 )/γ. PK P 0 0 Proof. Consider the point u = K i=1 ci kvi − vi k ≤ i=1 ci vi , by triangle inequality: ku − vk ≤ δ1 . Hence ku − v 0 k ≤ ku − vk + kv − v 0 k ≤ δ1 + δ2 , and u is in S 0 . The point v ∗ is the point in S 0 that is closest to v 0 , so kv ∗ − v 0 k ≤ δ1 + δ2 and kv ∗ − uk ≤ 2(δ1 + δ2 ). Then we need to show when a point (u) moves a small distance, its representation also changes √ by a small amount. Intuitively this is true because S is γ robust. By Lemma 4.4 when 10 Kδ1 < γ, the simplex S 0 is also γ/2 robust. For any i, let P roji (v ∗ ) and P roji (u) be the projections of v ∗ and u in the orthogonal subspace of span(S 0 \{vi0 }), then |ci − c0i | = kP roji (v ∗ ) − P roji (u)k /dis(vi , span(S 0 \{vi0 })) ≤ 4(δ1 + δ2 )/γ and this completes the proof. With this lemma it is not hard to show that RecoverL2 has polynomial sample complexity. Theorem 4.8. When the number of documents M is at least max{O(aK 3 log V /D(γp)6 ), O((aK)3 log V /D3 (γp)4 )} 3

The UCI NY Times corpus includes named-entity annotations, indicated by the zzz prefix.

76

our algorithm using the conjunction of FastAnchorWords and RecoverL2 learns the A matrix with entry-wise error at most . Proof. (sketch) We can assume without loss of generality that each word occurs with probability at least /4aK and furthermore that if M is at least 50 log V /D2Q then the empirical ˜ is entry-wise within an additive Q to the true Q = 1 PM AWd W T AT see [12] matrix Q d d=1 M ¯ form a simplex that is γp robust. for the details. Also, the K anchor rows of Q p ¯ can be at most δ2 = Q 4aK/. By Theorem 4.13 when The error in each column of Q 20Kδ2 /(γp)2 < γp (which is satisfied when M = O(aK 3 log V /D(γp)6 )) , the anchor words found are δ1 = O(δ2 /(γp)) close to the true anchor words. Hence by Lemma 4.7 every entry of C has error at most O(δ2 /(γp)2 ). With such number of documents, all the word probabilities p(w = i) are estimated more accurately than the entries of Ci,j , so we omit their perturbations here for simplicity. When we apply the Bayes rule, we know Ai,k = Ci,k p(w = i)/p(z = k), where p(z = k) is αk which is lower bounded by 1/aK. The numerator and denominator are all related to entries of C with positive coefficients sum up to at most 1. Therefore the errors δnum and δdenom are at most the error of a single entry of C, which is bounded by O(δ2 /(γp)2 ). Applying Taylor’s Expansion to (p(z = k, w = i) + δnum )/(αk + δdenom ), the error on entries of A is at most O(aKδ2 /(γp)2 ). When Q ≤ O((γp)2 1.5 /(aK)1.5 ), we have O(aKδ2 /(γp)2 ) ≤ , and get the desired accuracy of A. The number of document required is M = O((aK)3 log V /D3 (γp)4 ). The sample complexity for R can then be bounded using matrix perturbation theory.

4.5

Proof for Anchor-Words Finding Algorithm

Recall that the correctness of the algorithm depends on the following Lemmas: Lemma 4.9. There is a vertex vi whose distance from span(S) is at least γ/2. Lemma 4.10. The point ∆j found by the algorithm must be δ = O(/γ 2 ) close to some vertex vi . In order to prove Lemma 4.4, we use a volume argument. First we show that the volume of a robust simplex cannot change by too much when the vertices are perturbed. Lemma 4.11. Suppose {v1 , v2 , ..., vK } are the vertices of a γ-robust simplex S. Let S 0 be 0 0 0 0 a simplex with vertices {v 1 , v2 , ..., vK }, each of the vertices vi is a perturbation of vi and √ kvi0 − vi k2 ≤ δ. When 10 Kδ < γ the volume of the two simplices satisfy vol(S)(1 − 2δ/γ)K−1 ≤ vol(S 0 ) ≤ vol(S)(1 + 4δ/γ)K−1 . Proof. As the volume of a simplex is proportional to the determinant of a matrix whose columns are the edges of the simplex, we first show the following perturbation bound for determinant.

77

Claim. Let A, √ E be K ×K matrices, √ the smallest eigenvalue of A is at least γ, the Frobenius norm kEkF ≤ Kδ, when γ > 5 Kδ we have det(A + E)/ det(A) ≥ (1 − δ/γ)K . Proof. Since det(AB) = det(A) det(B), we can multiply both A and A + E by A−1 . Hence det(A + E)/ det(A) = det(I + A−1 E). The Frobenius norm of A−1 E is bounded by √

−1

A E ≤ A−1 kEk ≤ Kδ/γ. F F

of A−1 E be λ1 , λ2 , ..., λK , then by definition of Frobenius Norm PKLet 2 the eigenvalues 2 −1 2 2 of I +A−1 E are just 1+λ1 , 1+λ2 , ..., 1+λK , i=1 λi ≤ kA EkF ≤ Kδ /γ . The eigenvalues Q and the determinant det(I + A−1 E) = K i=1 (1 + λi ). Hence it suffices to show K K Y X K min (1 + λi ) ≥ (1 − δ/γ) when λ2i ≤ Kδ 2 /γ 2 . i=1

i=1

To do this we apply Lagrangian method and show the minimum is only obtained when all λi ’s are equal. The optimal value must be obtained at a local optimum of K Y

(1 + λi ) + C

i=1

K X

λ2i .

i=1

partial derivatives with √ respect to λi ’s, we get the equations −λi (1 + λi ) = QTaking K − i=1 (1 + λi )/2C (here using Kδ/γ is small so 1 + λi > 1/2 > 0). The right hand side is a constant, so each λi must be one of the two solutions of this equation. However, only one of the solution is larger than 1/2, therefore all the λi ’s are equal. For the lower bound, we can project the perturbed subspace to the K − 1 dimensional space. Such a projection cannot increase the volume and the perturbation distances only get smaller. Therefore we can apply the claim directly, the columns of A are just vi+1 − v1 for 0 i = 1, 2, ..., K − 1; columns of E are just vi+1 − vi+1 − (v10 − v1 ). The smallest eigenvalue of A is at least γ because the polytope is γ robust, which is equivalent to saying after orthogonalization each column still has length at least γ. The Frobenius norm of E is at √ most 2 K − 1δ. We get the lower bound directly by applying the claim. For the upper bound, swap the two sets S and S 0 and use the argument for the lower bound. The only thing we need to show is that the smallest eigenvalue of the matrix generated by points in S 0 is still √ at least γ/2. This follows from Wedin’s Theorem [160] and the fact that kEk ≤ kEkF ≤ Kδ ≤ γ/2. Now we are ready to prove Lemma 4.4. Proof. The first case is for the first step of the algorithm, when we try to find the farthest point to the origin. Here essentially S = {0}. For any two vertices v1 , v2 , since the simplex is 78

(a)

(b)

(c)

(d)

Figure 4.7: Illustration of the Algorithm

γ robust, the distance between v1 and v2 is at least γ. Which means dis(0, v1 ) + dis(0, v2 ) ≥ γ, one of them must be at least γ/2. For the later steps, recall that S contains vertices of a perturbed simplex. Let S 0 be the set of original vertices corresponding to the perturbed vertices in S. Let v be any vertex in {v1 , v2 , ..., vK } which is not in S. Now we know the distance between v and S is equal to vol(S ∪ {v})/(|S| − 1)vol(S). On the other hand, we know vol(S 0 ∪ {v})/(|S 0 | − 1)vol(S 0 ) ≥ γ. Using Lemma 4.11 to bound the ratio between the two pairs vol(S)/vol(S 0 ) and vol(S ∪ {v})/vol(S 0 ∪ {v}), we get: dis(v, S) ≥ (1 − 40 /γ)2|S|−2 γ > γ/2 when γ > 20K0 . Lemma 4.5 is based on the following observation: in a simplex the point with largest `2 is always a vertex. Even if two vertices have the same norm if they are not close to each other the vertices on the edge connecting them will have significantly lower norm. Proof. (Lemma 4.5) Since dj is the point found by the algorithm, let us consider the point aj before perturbation. The point aj is inside the simplex, therefore we can write aj as a convex combination of the vertices: aj =

K X t=1

79

ct vt

w Δ

Δ-2ε 0

γ/4 Δ

vt

Figure 4.8: Proof of Lemma 4.5, after projecting to the orthogonal subspace of span(S).

Let vt be the vertex with largest coefficient ct . Let ∆ be the largest distance from some vertex to the space spanned by points in S (∆ = maxl dis(vl , span(S)). By Lemma 4.4 we know ∆ > γ/2. Also notice that we are not assuming dis(vt , span(S)) = ∆. Now we rewrite aj as ct vt + (1 − ct )w, where w is a vector in the convex hull of vertices other than vt . Observe that aj must be far from span(S), because dj is the farthest point found by the algorithm. Indeed: dis(aj , span(S)) ≥ dis(dj , span(S)) − ≥ dis(vl , span(S)) − 2 ≥ ∆ − 2 The second inequality is because there must be some point dl that correspond to the farthest vertex vl and have dis(dl , span(S)) ≥ ∆ − . Thus as dj is the farthest point dis(dj , span(S)) ≥ dis(dl , span(S)) ≥ ∆ − . The point aj is on the segment connecting vt and w, the distance between aj and span(S) is not much smaller than that of vt and w. Following the intuition in `2 norm when vt and w are far we would expect aj to be very close to either vt or w. Since ct ≥ 1/K it cannot be really close to w, so it must be really close to vt . We formalize this intuition by the following calculation (see Figure 4.8): Project everything to the orthogonal subspace of span(S) (points in span(S) are now at the origin). After projection distance to span(S) is just the `2 norm of a vector. Without loss of generality we assume kvt k2 = kwk = ∆ because these two have length at most ∆, and extending these two vectors to have length ∆ can only increase the length of dj . The point vt must be far from w by applying Lemma 4.4: consider the set of vertices 0 V = {vi : vi does not correspond to any point in S and i 6= t}. The set V 0 ∪ S satisfy the assumptions in Lemma 4.4 so there must be one vertex that is far from span(V 0 ∪ S), and it can only be vt . Therefore even after projecting to orthogonal subspace of span(S), vt is still far from any convex combination of V 0 . The vertices that are not in V 0 all have very small norm after projecting to orthogonal subspace (at most δ0 ) so we know the distance of vt and w is at least γ/2 − δ0 > γ/4. Now the problem becomes a two dimensional calculation. When ct is fixed the length of aj is strictly increasing when the distance of vt and w decrease, so we assume the distance is γ/4. Simple calculation (using essentially just pythagorean theorem) shows 80

ct (1 − ct ) ≤

p . ∆ − ∆2 − γ 2 /16

The right hand side is largest when ∆ = 2 (since the vectors are in unit ball) and the maximum value is O(/γ 2 ). When this value is smaller than 1/K, we must have 1 − ct ≤ O(/γ 2 ). Thus ct ≥ 1 − O(/γ 2 ) and δ ≤ (1 − ct ) + ≤ O(/γ 2 ). The cleanup phase tries to find the farthest point to a subset of K − 1 vertices, and use that point as the K-th vertex. This will improve the result because when we have K − 1 points close to K − 1 vertices, only one of the vertices can be far from their span. Therefore the farthest point must be close to the only remaining vertex. Another way of viewing this is that the algorithm is trying to greedily maximize the volume of the simplex, which makes sense because the larger the volume is, the more words/documents the final LDA model can explain. The following lemma makes the intuitions rigorous and shows how cleanup improves the guarantee of Lemma 4.5. Lemma 4.12. Suppose |S| = K − 1 and each point in S is δ = O(/γ 2 ) < γ/20K close to some vertex vi , then the farthest point vj0 found by the algorithm is 1 − O(/γ) close to the remaining vertex. P Proof. We still look at the original point aj and express it as K t=1 ct vt . Without loss of generality let v1 be the vertex that does not correspond to anything in S. By Lemma 4.4 v1 is γ/2 far from span(S). On the other hand all other vertices are at least γ/20r close to span(S). We know the distance dis(aj , span(S)) ≥ dis(v1 , span(S)) − 2, this cannot be true unless ct ≥ 1 − O(/γ). These lemmas directly lead to the following theorem: ˜ 2 + V K/2 ) and outputs a Theorem 4.13. FastAnchorWords algorithm runs in time O(V subset of {d1 , ..., dV } of size K that O(/γ)-covers the vertices provided that 20K/γ 2 < γ. Proof. In the first phase of the algorithm, do induction using Lemma 4.5. When 20K/γ 2 < γ Lemma 4.5 shows that we find a set of points that O(/γ 2 )-covers the vertices. Now Lemma 4.6 shows after cleanup phase the points are refined to O(/γ)-cover the vertices.

81

Part II Towards Provably Learning Deep Networks

82

Chapter 5 Provable Sparse Coding: Learning Overcomplete Dictionaries Dictionary learning, or sparse coding, tries to learn a sparse linear code to represent the given data succinctly. Such a code represents data points (e.g. image patches) using sparse linear combinations of dictionary elements. This sparse representation captures the useful structure in the input data, and can be used in various tasks including denoising the input data or providing features for supervised learning algorithms. Learning a sparse linear code can also be viewed as a preliminary for learning a sparse autoencoder (a sparse nonlinear code), which is a main building block for deep learning algorithms. In this chapter we show under mild assumptions, it is possible to learn an overcomplete dictionary (i.e. the number of dictionary elements is larger than their dimension) in polynomial time.

5.1 5.1.1

Background and Results Sparse Coding and Dictionary Learning

Sparse coding is widely used in machine learning. However, it is first introduced in neural science, trying to explain how evolution may have produced cells in the first layer of visual cortex: Olshausen and Field [134] observed the dictionary elements learned by sparse coding share similar features as the cells in the internal image representations in the visual cortex. Inspired by the neural analog, sparse coding is often used for feature selection [10] and for building classifiers atop sparse coding routines [94]. Dictionary learning is also a basic building block in design of deep learning systems [138]. Dictionary learning is also used in image processing, where experiments suggest that hand-designed dictionaries (such as sinusoids, wavelets, ridgelets, curvelets, etc. [120]) do not do as well as dictionaries learned from data. Dictionaries learned by sparse coding algorithms are used to perform denoising [57], edge-detection [117], super-resolution [163] and compression. See [1], [58] for further applications. 83

5.1.2

Incoherence Assumption

Designing provably correct algorithms for dictionary learning has proved challenging. Even if the dictionary is completely known, it can be NP-hard to represent a vector u as a sparse linear combination of the columns of A (even if such a representation exists) [47]. However for many natural types of dictionaries, the problem of finding a sparse representation is computationally easy. The pioneering work of Donoho and Huo [53] (building on the uncertainty principle of Donoho and Stark [56]) presented a number of important examples of dictionaries that are incoherent and gave an algorithm to find a sparse representation in a known, incoherent dictionary if one exists. More precisely: Definition 5.1 (µ-incoherent). An n × m matrix A whose columns are unit vectors is said to be µ-incoherent if for all i 6= j we have √ Ai · Aj ≤ µ/ n. We sometimes say just incoherent if µ is small, like log n. A randomly chosen dictionary is incoherent with high probability (even if m = nC ). Donoho and Huo [53] gave many other important examples of incoherent dictionaries, such as one constructed from spikes and sines, as well as those built up from wavelets and sines, or even wavelets and ridgelets. There is a rich body of literature devoted to incoherent dictionaries (see additional references in [64]).√Donoho and Huo [53] proved that given u = Av where v has k nonzero entries, where k ≤ n/2µ, basis pursuit (solvable by a linear program) recovers v exactly and it is unique. Gilbert, Muthukrishnan and Strauss [64] (and subsequently [150]) gave algorithms for recovering v even in the presence of additive noise. Tropp [148] gave a more general exact recovery condition (ERC) under which the sparse recovery problem for incoherent dictionaries can be algorithmically solved. All of these require n > k 2 µ2 . In a foundational work, Candes, Romberg and Tao [35] showed that basis pursuit solves the sparse recovery problem even for n = O(k log(m/k)) if A satisfies the weaker restricted isometry property [34]. Also if A is a full-rank square matrix, then we can compute v from A−1 u, trivially. But our focus here will be on incoherent and overcomplete dictionaries.

5.1.3

Main Results

A range of results are possible which trade off more assumptions with better performance. We give two illustrative ones: the first makes the most assumptions but has the best performance; the second has the weakest assumptions and somewhat worse performance. The theorem statements will be cleaner if we use asymptotic notation: the parameters k, n, m will go to infinity and the constants denoted as“O(1)” are arbitrary so long as they do not grow with these parameters. Throughout this chapter, we will use Y (i) to denote the ith sample and X (i) as the vector that generated it – i.e. Y (i) = AX (i) . When we are working with a graph G we will use ΓG (u) to denote the set of neighbors of u in G. 84

First we define the class of distributions that the k-sparse vectors must be drawn from. We will be interested in distributions on k-sparse vectors in Rm where each coordinate is nonzero with probability Θ(k/m) (the constant in Θ(·) can differ among coordinates). Definition 5.2 (Distribution class Γ and its moments). The distribution is in class Γ if (i) each nonzero Xi has expectation 0 and lies in [−C, 1] ∪ [1, C] where C = O(1). (ii) Conditional on any subset of coordinates in X being nonzero, the values Xi are independent of each other. The distribution has bounded `-wise moments Q if the probability that X is nonzero in any ` subset S of ` coordinates is at most c times i∈S Pr[Xi 6= 0] where c = O(1). Remark 5.3. (i) The bounded moments condition trivially holds for any constant ` if the set of nonzero locations is a random subset of size k. The values of these nonzero locations are allowed to be distributed very differently from one another. (ii) The requirement that nonzero Xi ’s be bounded away from zero in magnitude is similar in spirit to the Spike-andSlab Sparse Coding (S3C) model of Goodfellow et al. [70], which also encourages nonzero latent variables to be bounded away from zero. Theorem 5.4. There is a polynomial time algorithm to learn a µ-incoherent dictionary A from random examples of the form Y = AX, where X ∈ Rn is chosen according to some distribution in Γ and A is Rm×n . √

• If k ≤ O(min(m2/5 , µ logn n )) and the distribution has bounded 3-wise moments, then the m algorithm requires p ≥ Ω(max(m2 /k 2 log m, m log )) samples and with high probability 2 ˆ ˆ returns a dictionary A so that for each i, kAi − Ai k ≤ . The algorithm runs in time ˜ 2 + m2 p). O(p √

• If k ≤ c min(m(`−1)/(2`−1) , µ logn n ) and the distribution has bounded `-wise moments, then there is a polynomial time algorithm with the same guarantees and sample complexity ˜ 2 + k `−1 m + m2 p). that runs in time O(p • Even if we are given Y (i) = AX (i) + ηi , where ηi ’s are independent spherical Gaus√ sian noise with standard deviation σ = o( n), the algorithm still works with sample complexity p ≥ Ω(mσ 2 log m/2 ). √

Remark 5.5. (i) The sparsity that this algorithm can tolerate – roughly µ logn n or m1/2− – approaches the sparsity that the best known algorithms require even if A is known. Furthermore since this procedure learns a dictionary that is close to the true dictionary, what it learns must also be incoherent. Hence: Remark 5.6. This algorithm learns a dictionary for which we can actually solve sparse recovery problems. Now we describe the other result with fewer assumptions on X but lower performance. 85

Definition 5.7 (Distribution class D). A distribution is in class D if (i) the events Xi = 6 0 have weakly bounded second moment and third moments, in the sense that Pr[Xi = 6 1/4 0 and Xj 6= 0] ≤ n Pr[Xi 6= 0] Pr[Xj 6= 0], Pr[Xi , Xj , Xt 6= 0] ≤ o(n ) Pr[Xi 6= 0] Pr[Xj = 6 0] Pr[Xt 6= 0]. (ii) Each nonzero Xi is in [−C, −1] ∪ [1, C] where C = O(1). Remark 5.8. A major difference from class Γ is that the Xi ’s do not have expectation 0 and are not forbidden from taking values close to 0 (provided they do have reasonable probability of taking values away from 0). Another major difference is that the distribution of Xi can depend upon the values of other nonzero coordinates. The weaker moment condition allows a fair bit of correlation among the set of nonzero coordinates. Theorem 5.9. There is a polynomial time algorithm to learn a µ-incoherent dictionary A from random examples of the form Y = AX, where X is chosen according to some distribution in D. 1/4−/2

• If k ≤ c min(m1/4 , n

√

µ

) and we are given p ≥ Ω(max(m2 /k 2 log m, mn

3/2

log m log n )) k2 µ

samples, then the algorithm succeeds with high probability and returns a dictionary Aˆ so √ ˜ 2 +m2 p). that for each i, kAi − Aî k ≤ O(k µ/n1/4−/2 ). The algorithm runs in time O(p

The algorithm is also noise-tolerant as the one in Theorem 5.4. Theorem 5.9 is proved similarly to Theorem 5.4, the proof can be found in [13]. Remark 5.10. It is also possible to relax the condition that each nonzero Xi is in [−C, 1] ∪ [1, C]. Instead we require Xi has magnitude at most O(1), and has a weak anti-concentration property: for every δ > 0 it has probability at least cδ > 0 of exceeding δ in magnitude.

5.1.4

Previous Work

There has also been significant prior work on dictionary learning and sparse coding. Lewicki and Sejnowski [111] provided the first approach, and subsequently Engan, Aase and Husoy [59] introduce the method of optimal directions (MOD) and Aharon, Elad and Bruckstein [2] introduced K-SVD, both of which have had considerable success in practice. For completeness, we will describe these latter two approaches. Suppose the algorithm is given a matrix Y , generated as AX where A is the unknown dictionary and X is an unknown matrix with iid columns. Method of Optimal Direction [59] : Start with an initial guess A, and then alternately update either A or X: • Given A, compute a sparse X so that AX ≈ Y (using e.g. matching pursuit [119] or basis pursuit [39]) • Given X, compute the A that minimizes kAX − Y kF This algorithm converges to a local optimum, because in each step the error kAX − Y kF will only decrease. 86

K-SVD [2] Start with an initial guess for A. Then repeat the following procedure: • Given A, compute a sparse X so that AX ≈ Y (again, using a pursuit method) • Group all data points Y (1) where the corresponding X vector has a non-zero at index i. Subtract off components in the other directions X (1) Y (1) − A j Xj j6=i

• Compute the first singular vector v1 of the residual matrix, and update the column Ai to v1 However, these algorithms do not come with provable guarantees. After all, since the first step in the algorithm uses some arbitrary guess for A there may not be any sparse representation X so that AX ≈ Y . Furthermore, even if the algorithm learns some dictionary there is no guarantee that it is incoherent, and consequently no guarantee that one can use it to solve sparse recovery. Instead, our goal here is to give an algorithm for dictionary learning that has provable guarantees. Assuming the data is generated by an incoherent dictionary, we recover it to within arbitrary accuracy. The most closely related work is an elegant paper of Spielman, Wang and Wright [144], giving an algorithm that provably recovers A to arbitrary accuracy if it is a full column rank matrix, and X satisfies reasonable assumptions. However, requiring A to be full column rank precludes most interesting settings where the dictionary is redundant and hence cannot have full column rank (see e.g. [53], [58], [120]).

5.1.5

Our approach

The main idea is to recover the support of the vector X by a graph recovery algorithm. If given many samples, we can find all the samples for which Xi 6= 0 then we can recover the column Ai from a singular value decomposition. (Alternatively, we can instead find all the samples for which Xi > 0 and then the samples will be biased in the direction of Ai ). This is exactly the approach taken in K-SVD [59]. The difference is that in K-SVD, the support of samples are determined after running matching pursuit on an intermediate guess for the dictionary. If this guess is not accurate, then the support could be far from correct. Our main observation is that for incoherent dictionaries, we can recover the support without knowing the dictionary! We build a graph – called the connection graph. We give a simple combinatorial algorithm for provably recovering the support graph of X from this graph, and from this we can immediately recover the true dictionary. In order to prove correctness of our combinatorial algorithm, we rely on tools from discrete geometry, namely the piercing number [121], [5]. Related Works The graph recovery problem we have is related to the graph square root problem ([106, 131]), which tries to reconstruct a graph given all pairs of distance at most 87

2 in the graph. Our setting is slightly different, and our algorithm only solves the problem in average case. There is also a connection between our graph recovery problem and those studied in community detection literature, where recent papers [14], [17] give provable algorithms for finding all of the overlapping communities in a graph. However these algorithms applied to our problem would run in time quasi-polynomial in the sparsity k. We remark that algorithms for independent component analysis [42] can also provably learn an unknown dictionary A from random examples. Suppose we are given random samples of the form AX where the components of X are exactly independent. Frieze, Jerrum and Kannan [63] gave an algorithm with provable guarantees that recovers the dictionary A up to arbitrary accuracy, provided that the random variables in the vector X are nonGaussian (since in that case A is only determined up to rotations anyways). Subsequent work on this problem considers the over-determined case and gives provable algorithms even when A is n × m with m larger than n [49], [72]. However, these algorithms are very brittle to the assumption that the entries in X be independent; if this condition is even slightly violated, the algorithms can fail. The weakest conditions that we are aware of require that X be 4-wise independent, and typically even higher orders of independence are required for the overcomplete case [49], [72]. Hence these algorithms are not well-suited for dictionary learning where this condition seems too restrictive.

5.2

The Connection Graph

Let A be an unknown n × m dictionary whose columns are incoherent – i.e. each column is a unit vector and for all i 6= j we have √ Ai · Aj ≤ µ/ n Our goal is to recover the dictionary A given sparse linear combinations of its columns. In particular, suppose that there is a distribution Γ on k-sparse columns X of length m. We will assume the following generative model: • each X has at most k non-zeroes and the probability that Xi 6= 0 is Θ(k/m) and furthermore Xi = 6 0 is non-positively correlated with other coordinates being non-zero • if a coordinate is non-zero, its magnitude is in the range [1, C] and its expectation is zero. In fact, our analysis will immediately extend to even more general models (see the full version in [13]) but for simplicity we encourage the reader to restrict the distribution Γ so that the support of X is uniformly random over all k-tuples and that if a coordinate is non-zero it is a Rademacher random variable (equally likely to be +1 or −1) since this will allow us to not have to keep track of as many extraneous constants through the various steps of our analysis. Recall our strategy is to find the support of X. Based on the support, we form an overlapping clustering where we gather together all the samples Y for which Xi 6= 0. We 88

suggest two approaches for how to use this overlapping clustering to determine the dictionary in Section 5.4.1 and Section 5.4.2 respectively. The first approach is to further determine which set of samples have Xi > 0. This set of samples will be biased in exactly the direction of Ai and hence we can recover this column by taking an average. Alternatively, we can compute the direction of maximum variance. This latter approach recovers (roughly) the direction Ai and yields insight into why approaches like K-SVD [59] are effective in practice. But how can we compute the correct support? As a starting point: if there are two samples Y (1) = AX (1) and Y (2) = AX (2) then we can determine if the support of X (1) and X (2) intersect, but with false negatives: √ Lemma 5.11. If k 2 µ < n/2 then |Y (1) · Y (2) | > 1/2 implies that he support of X (1) and X (2) intersect Proof. Suppose that the support of X (1) and X (2) are disjoint. Then the following upper bound holds: X √ (1) (2) |Ai · Aj Xi Xj | ≤ k 2 µ/ n < 1/2 |Y (1) · Y (2) i| ≤ i6=j

and this implies the lemma. √ This would be a weak bound in our setting, since we would require that k = O(n1/4 / µ). Whereas if the dictionary is known, then classic results in compressed sensing √ show (algorithmically) how to recover a vector that is a linear combination of at most n/2µ columns of A [53]. In fact we can impose weaker conditions on k, if we use the randomness of the values in X. We will appeal to the classic Hanson-Wright inequality: Theorem 5.12 (Hanson-Wright). [76] Let X be a vector of independent, sub-Gaussian random variables with mean zero and variance one. Let M be a symmetric matrix. Then P r[|X > M X − tr(M )| > t] ≤ 2exp{−c min(t2 /kM k2F , t/kM k)} √

n for large enough constant C 0 . Then if the support of Lemma 5.13. Suppose kµ < C 0 log n X (1) and X (2) are disjoint, with high probability |Y (1) · Y (2) | < 1/2

Proof. Let N be the k × k submatrix resulting from restricting A> A to the locations where X (1) and X (2) are non-zero. Set M to be a 2k × 2k matrix where the k × k submatrices in the top-left and bottom-right are zero, and the k × k submatrices in the bottom-left and top-right are (1/2)N and (1/2)N > respectively. Here we think of the vector X as being a length 2k vector whose first k entries are the non-zero entries in X (1) and whose last k entries are the non-zero entries in X (2) . And by construction, we have that Y (1) · Y (2) = X > M X We can now appeal to the Hanson-Wright inequality (above). Note that since √ the support of X (1) and X (2) do not intersect, the entries in M are each at most µ/ n and so the Frobenius norm of M is at most √µk2n . This is also an upper-bound on the spectral norm of 89

M . We can set t = 1/2, and for kµ < and this implies the lemma.

√ n/C 0 log n both terms in the minimum are Ω(log n)

We can build a graph of which pairs intersect, by connecting a pair of samples Y (1) and Y if |Y (1) · Y (2) | > 1/2. (2)

Definition 5.14. Given p samples Y (1) , Y (2) , ..., Y (p) , build a connection graph on p nodes where i and j are connected by an edge if and only if Y (1) · Y (2) > 1/2. This graph will “miss” some edges, since if a pair X (1) and X (2) have intersecting support we do not necessarily meet the above condition, but (with high probability) this graph will not have any false positives: Corollary 5.15. With high probability, each edge (i, j) that is present in the connection graph corresponds to a pair where the support of X (i) and X (j) intersect. It is important to note that a sample Y (1) will intersect many other samples (say, Y (2) and Y (3) ), each of which corresponds to a coordinate i (reps. i0 ) where the support of X (2) (resp. X (3) ) intersect with the support of X (1) . But the challenge is that we do not know if they intersect X (1) in the same coordinate or not. We will use the other samples as a way to “cross-check” and determine if X (1) , X (2) and X (3) have a common intersection. Remark 5.16. Even if we are given samples Y = AX + η where η is additive noise, under natural conditions on the correlation of η √ with our samples (e.g. if it is spherical Gaussian noise whose expected norm is at most o( n)) this will not affect the construction of the connection graph, and hence will not affect the proof of correctness of our algorithms.

5.3

Graph Recovery

Our goal in this section is to determine which samples Y have Xi 6= 0 just from the connection graph. To do this, we reduce this problem to a graph recovery problem. Definition 5.17 (Graph Recovery). There is an unknown bipartite graph G0 (U, V, E0 ). Given a graph G(V, E), constructed randomly based on G0 satisfy the following properties: 1. For two vertices (p, q) ∈ V , if they have no common neighbor in G0 , then (p, q) 6∈ E. 2. For vertices p, q1 , q2 , ..., ql , if there is a vertex u ∈ U where {(u, p), (u, q1 ), ..., (u, ql )} ⊂ E0 , then Pr[all edges (p, qi ) ∈ E] ≥ 2−l . The goal is to recover the graph G0 . Consider G0 as the support graph of X, where U is the set of coordinates, V is the set of (i) samples, and (i, p) ∈ E0 iff Xp 6= 0. The connection graph we constructed in Definition 5.14 satisfies the requirement for graph recovery problem. In order to prove this we just need the following lemma. 90

Lemma 5.18. Suppose the support of X intersects the support of each of X (1) , X (2) , ..., X (`) . Then P r[ for all j = 1, 2, ..., `, |Y · Y (j) | > 1/2] ≥ 2−` Proof. We have that with high probability that round(Y · Y (j) ) = X · X (j) for all j using Lemma 5.13, and hence we need to prove that with probability at least 2−` , each pair X ·X (j) is non-zero. We can prove this by induction, by considering the X (j) ’s in an order such that no set Sj = supp(X) ∩ supp(X (j) ) is entirely contained in a earlier set Si (for i < j). Then conditioned on all previous inner-products X · X (i) being non-zero, the probability that X (1) · X (j) is non-zero is at least 1/2. Now given two samples Y (1) and Y (2) so that the support of X (1) and X (2) intersect uniquely at index i, we will prove that all the samples Y for which Xi 6= 0 can be recovered because the expected number of common neighbors between Y (1) , Y (2) and Y will be much larger than in any other case (in particular if Xi = 0 instead). Claim. Suppose supp(X (1) ) ∩ supp(X (2) ) ∩ supp(X (3) ) 6= ∅, then P rY [ for all j = 1, 2, 3, |Y · Y (j) | > 1/2] ≥

k 8m

Proof. Let i ∈ supp(X (1) ) ∩ supp(X (2) ) ∩ supp(X (3) ). Then the probability that supp(X) 3 i is exactly k/m, and hence using the previous lemma this implies the claim. Intuitively, if supp(X (1) ) ∩ supp(X (2) ) ∩ supp(X (3) ) = ∅ then the expected number of common neighbors of Y (1) , Y (2) and Y (3) should be smaller. But in principle we should be concerned that the support of X could still intersect the support of each of X (1) , X (2) and X (3) . Let a = |supp(X (1) )∩supp(X (2) )|, b = |supp(X (1) )∩supp(X (3) )| and c = |supp(X (2) )∩ supp(X (3) )|. Then: Lemma 5.19. Suppose that supp(X (1) ) ∩ supp(X (2) ) ∩ supp(X (3) ) = ∅. Then the probability that the support of X intersects the support of each of X (1) , X (2) and X (3) is at most k6 3k 3 (a + b + c) + m3 m2 Proof. We can break up the event whose probability we would like to bound into two (not necessarily disjoint) events: (1) the probability that X intersects each of X (1) , X (2) and X (3) disjointly (i.e. it contains a point i ∈ supp(X (1) ) but i ∈ / supp(X (2) ), supp(X (3) ), and similarly for the other sets ). (2) the probability that X contains a point in the common intersection of two of the sets, and one point from the remaining set. Clearly if the support of X intersects the support of each of X (1) , X (2) and X (3) then at least one of these two events must occur. The probability of the first event is at most the probability that the support of X contains at least one element from each of three disjoint sets of size at most k. The probability that the support of X contains an element of just one such set is at most the expected intersection 2 which is km , and since the expected intersection of X with each of these sets are non-positively 91

correlated (because they are disjoint) we have that the probability of the first event can be k6 bounded by m 3. Similarly, for the second event: consider the probability that the support of X contains an element in supp(X (1) ) ∩ supp(X (2) ). Since supp(X (1) ) ∩ supp(X (2) ) ∩ supp(X (3) ) = ∅, the support of X must also contain an element in supp(X (3) ) too. The expected intersection of X 2 and the expected intersection of X and X (3) is km , and again and supp(X (1) )∩supp(X (2) ) is ka m the expectations are non-positively correlated since the two sets supp(X (1) ) ∩ supp(X (2) ) and supp(X (3) ) are disjoint by assumption. Repeating this argument for the other pairs completes the proof of the lemma. Note that the probability that two sets of size k intersect in at least Q elements is at most k Q ) by the same non-positive correlation argument. Hence we can assume that with high (m probability there is no pair of sets that intersect in more than Q locations. When we only have bounded 3-wise moments, we need to slightly modify the clustering algorithm, see [13] for more discussions. And comparing the lemma and the claim above, we find that if k ≤ cm2/5 then the expected number of common neighbors is much larger if supp(X (1) )∩supp(X (2) )∩supp(X (3) ) 6= ∅ than if the intersection is empty. Under this condition, if we take p = O(m2 /k 2 log n) samples each triple with a common intersection will have at least T common neighbors, and each triple whose intersection is empty will have less than T /2 common neighbors. Hence we can search for a triple with a common intersection as follows: We can find a pair that intersects, since (with high probability) for any pair |Y (1) · Y (2) | > 1/2 the support of X (1) and X (2) intersect. We can try neighbors of Y (1) at random, and we can use the number of common neighbors as a test to verify whether all three sets have a common intersection. Algorithm 5.1. RecoverGraph input p samples Y (1) , Y (2) , ..., Y (p) output Sets that are equal to the support of rows of X Compute a graph G on p nodes where there is an edge between i and j iff |Y (1) ·Y (2) | > 1/2 Initialize list of triples to be empty pk Set T = 10m for i = 1 TO Ω(mk log2 m) do Choose a random edge (u, v) in G and a random neighbor w of u if |ΓG (u) ∩ ΓG (v) ∩ ΓG (w)| ≥ T then Add (u, v, w) to the list of triples end if end for For each node x, add x to each set Su,v,w where (u, v, w) is a triple and |ΓG (u) ∩ ΓG (v) ∩ ΓG (x)| ≥ T Delete any set Su,v,w that contains another set Sa,b,c Output the remaining sets Su,v,w ∪ {u, v, w} 92

Definition 5.20. We will call a triple of samples Y (1) , Y (2) and Y (3) an identifying triple for coordinate i if the support of X (1) , X (2) and X (3) intersect and furthermore the support of X (1) and X (2) is exactly {i}. Theorem 5.21. Algorithm 5.1(RecoverGraph) output sets that each corresponds to some i e 2 n) and and contains all Y (j) for which i ∈ supp(X (j) ). √The algorithm runs in time O(p succeeds with high probability if k ≤ c min(m2/5 , µ logn n ) and if p = Ω(m2 /k 2 log m) Proof. We can use Lemma 5.13 to conclude that each edge in G corresponds to a pair whose support intersects. We can appeal to Lemma 5.19 and Claim 5.3 to conclude that for p = Ω(m2 /k 2 log m), with high probability each triple with a common intersection has at least T common neighbors, and each triple without a common intersection has at most T /2 common neighbors. In fact, for a random edge (u, v), the probability that the common intersection of u and v (of the supports of their X’s) is exactly {i} is Ω(1/m) because we know that their X’s do intersect, and that intersection has a constant probability of being size one and it is uniformly distributed over m possible locations. Furthermore the probability that a randomly chosen neighbor w of u contains i (in the support of its X) is at least k/m, and hence appealing to a coupon collector argument we conclude that if the inner loop is run at least Ω(km log2 m) times then the algorithm finds an identifying triple (u, v, w) for each column Ai with high probability. Note that we may have triples that are not an identifying triple for some coordinate i. However, any other triple (u, v, w) found by the algorithm must have a common intersection. Consider for example a triple (u, v, w) where u and v have a common intersection {i, j}. Then we know that there is some other triple (a, b, c) which is an identifying triple for i and hence Sa,b,c ⊂ Su,v,w . (In fact this containment is strict, since Su,v,w will also contain a set corresponding to an identifying triple for j too). Then the second-to-last step in the algorithm will necessarily delete all such triples Su,v,w . What is the running time of this e algorithm? We need O(p2 n) time to build the connection graph, and the loop takes O(mpk) e 2 p) since there will be O(m e 2 ) triples found time. Finally, the deletion step requires time O(m in the previous step. This concludes the proof of correctness of the algorithm, and its running time analysis.

5.4 5.4.1

Recovering the Dictionary Finding the Relative Signs

Here we show how to recover the columns of A once we have learned the support of X. (1) (2) The key observation is that if the support of Xi and Xi uniquely intersect in index i then (1) (2) the sign of Y (1) · Y (2) is equal to the sign of Xi Xi . And if there are enough such pairs (1) (2) (1) (2) Xi and Xi , then we can correctly determine the relative sign of every pair Xi and Xi . We formalize this idea in the following lemma: (u)

Lemma 5.22. In Algorithm 5.2, Si is either {u : Xi 93

(u)

> 0} or {u : Xi

< 0}.

Algorithm 5.2. OverlappingAverage input p samples Y (1) , Y (2) , ...Y (p) output estimation Aˆ of A Run RecoverGraph (or RecoverGraphHighOrder) on the p samples Let C1 , C2 , ...Cm be the m returned sets for each Ci do for each pair u, v ∈ Ci such that X (u) and X (v) have a unique intersection do Label the pair +1 if Y (u) · Y (v) > 0 and otherwise label it −1. end for Choose an arbitrary ui ∈ Ci , and set Si = {ui } for each v ∈ Ci do if the pair ui , v is labeled +1, or there is w ∈ Ci where the pairs ui , w and v, w have the same label then add v to Si . end if end for if |Si | ≤ |Ci |/2 then set Si = Ci \Si . end if P P Let Aî = v∈Si X (v) /k v∈Si X (v) k end for ˆ where each column is Aî for some i return A, Proof. It suffices to prove the lemma at the start of Step 12, since this step only takes the complement of Si with respect to Ci . Appealing to Lemma 5.13 we conclude that if the support of X (u) and X (v) uniquely intersect in coordinate i then the sign of Y (u) · Y (v) is (u) (v) equal to the sign of Xi Xi . Hence when Algorithm 5.2 adds an element to Si it must have the same sign as the ith component of X (ui ) . What remains is to prove that each node v ∈ Ci is correctly labeled. We will do this by showing that for any such vertex, there is a length two path of labeled pairs that connects ui to v, and this is true because the number of labeled pairs is large. We need the following simple claim: Claim. If p > m2 log m/k 2 then with high probability any two rows of X share at most 2pk 2 /m2 coordinates in common. This follows since the probability that a node is contained in any fixed pair of sets is at most k 2 /m2 . Then for any node u ∈ Ci , we would like to lower bound the number of labeled pairs it has in Ci . Since u is in at most k − 1 other sets Ci1 , ..., Cik−1 , the number of pairs u, v where v ∈ Ci that are not labeled for Ci is at most k−1 X t=1

|Cit ∩ Ci | ≤ k · 2pk 2 /m2 pk/3m = |Ci |/3

94

Therefore for a fixed node u for at least a 2/3 fraction of the other nodes w ∈ Ci the pair u, w is labeled. Hence we conclude that for each pair of nodes ui , v ∈ Ci the number of w for which both ui , w and w, v are labeled is at least |Ci |/3 > 0 and so for every v, there is a labeled path of length two connecting ui to v. Using this lemma, we are ready to prove Algorithm 5.2 correctly learns all columns of A. Theorem 5.23. Algorithm 5.2(OverlappingAverage) outputs a dictionary Aˆ so that for each √ n 2/5 i, kAi − Aî k ≤ with high probability if k ≤ c min(m , µ log n ) and if p = Ω max(m2 /k 2 log m, m log m/2 ) Furthermore the algorithm runs in time O(p2 k 2 /m). (u)

Proof. We can invoke Lemma 5.22 and conclude that Si is either {u : Xi > 0} or {u : (u) Xi < 0}, whichever set is larger. Let us suppose that it is the former. Then each Y (u) in Si is an independent sample from the distribution conditioned on Xi > 0, which we call Γ+ i . We have that EΓ+i [AX] = cAi where c is a constant in [1, C] because EΓ+i [Xj ] = 0 for all j 6= i. Let us compute the variance: X X EΓ+i [kAX − EΓ+i AXk2 ] ≤ EΓ+i Xi2 + EΓ+i [Xj2 ] ≤ C 2 + C 2 k/m ≤ C 2 (k + 1), j6=i

j6=i

Note that there are no cross-terms because the signs of each Xj are independent. Furthermore we can bound the norm of each vector Y (u) via incoherence. We can conclude that if |Si | > C 2 k log m/2 , then with high probability kAî − Ai k ≤ using vector Bernstein’s inequality ([74], Theorem 12). This latter condition holds because we set Si to itself or its complement based on which one is larger.

5.4.2

An Approach via SVD

Here we give an alternative algorithm for recovering the dictionary based instead on SVD. The advantage is that approaches such as K-SVD which are quite popular in practice also rely on finding directions of maximum variance, so the analysis we provide here yields insights into why these approaches work. However, we note that the crucial difference is that we rely on finding the correct support in the first step of our dictionary learning algorithms, whereas K-SVD and approaches like it attempt to alternate between updating the dictionary and fixing the estimation of the support. Let us fix some notation: Let Γi be the distribution conditioned on Xi 6= 0. Then once we have found the support of X, each set from Algorithm 5.1 RecoverGraph is a set of random samples from Γi . Also let α = |u · Ai |. P Definition 5.24. Let Ri2 = 1 + j6=i (Ai · Aj )2 EΓi [Xj2 ]. 95

Note that Ri2 is the projected variance of Γi on the direction u = Ai . Our goal is to show that for any u 6= Ai (i.e. α 6= 1), the variance is strictly smaller. Lemma 5.25. The projected variance of Γi on u is at most α2 Ri2 + α

p 2µk µk k (1 − α2 ) √ + (1 − α2 )( + √ ) m n n

Proof. Let u|| and u⊥ be the components of u in the direction of Ai and perpendicular to Ai . Then we want bound EΓi [(u · Y )2 ] where Y is sampled from Γi . Since the signs of each Xj are independent, we can write X X EΓi [(u · Y )2 ] = EΓi [(u · Aj Xj )2 ] = EΓi [((u|| + u⊥ ) · Aj Xj )2 ] j

j

Since α = ku|| k we have: EΓi [(u · Y )2 ] = α2 Ri2 + EΓi [

X j6=i

(2(u|| · Aj )(u⊥ · Aj ) + (u⊥ · Aj )2 )Xj2 ]

Also EΓi [Xj2 ] = (k − 1)/(m − 1). Let v be the unit vector in the direction u⊥ . We can write X k−1 > )v A−i A> EΓi [ (u⊥ · Aj )2 Xj2 ] = (1 − α2 )( −i v m − 1 j6=i where A−i denotes the dictionary A with the ith column removed. The maximum over v > of v > A−i A> −i v is just the largest singular value of A−i A−i which is the same as the largest singular value of A> −i A−i which by the Greshgorin Disk Theorem (see e.g. [85]) is at most 1 + √µn m. And hence we can bound EΓi [

X k µk (u⊥ · Aj )2 Xj2 ] ≤ (1 − α2 )( + √ ) m n j6=i

√ Also since |u|| · Aj | = α|Ai · Aj | ≤ αµ/ n we obtain: E[

X j6=i

2(u|| · Aj )(u⊥ · Aj )Xj2 ] ≤ α

p 2µk (1 − α2 ) √ n

and this concludes the proof of the lemma. q µk k √ Definition 5.26. Let ζ = max{ n , m }, so the expression in Lemma 5.25 can be be an √ 2 2 upper bounded by α Ri + 2α 1 − α2 · ζ + (1 − α2 )ζ 2 . We will show that an approach based on SVD recovers the true dictionary up to additive accuracy ±ζ. Note that here ζ is a parameter that converges to zero as the size of the 96

problem increases, but is not a function of the number of samples. So unlike the algorithm in the previous subsection, we cannot make the error in our algorithm arbitrarily small by increasing the number of samples, but this algorithm has the advantage that it succeeds even when E[Xi ] 6= 0. Corollary 5.27. The maximum singular value of Γi is at least Ri and the direction u satisfies ku − Ai k ≤ O(ζ). Furthermore the second largest singular value is bounded by O(Ri2 ζ 2 ). Proof. The bound in Lemma 5.25 is only an upper bound, however the direction α = 1 has variance Ri2 > 1 and hence the direction of maximum variance must correspond to α ∈ [1 − O(ζ 2 ), 1]. Then we can appeal to the variational characterization of singular values (see [85]) that u> Σi u σ2 (Σi ) = max > u⊥Ai u u Then condition that α ∈ [−O(ζ), O(ζ)] for the second singular value implies the second part of the corollary. Since we have a lower bound on the separation between the first and second singular values of Σi , we can apply Wedin’s Theorem (see Appendix A.1) and show that we can recover Ai approximately even in the presence of noise. ˆ i (e.g. an Hence even if we do not have access to Σi but rather an approximation to it Σ empirical covariance matrix computed from our samples), we can use the above perturbation bound to show that we can still recover a direction that is close to Ai – and in fact converges to Ai as we take more and more samples. Algorithm 5.3. OverlappingSVD Run RecoverGraph (or RecoverGraphHighOrder) on the p samples Let C1 , C2 , ...Cm be P the m returned sets 1 ˆ Compute Σi = |Ci | Y ∈Ci Y Y > î Compute the first singular value Aî of Σ ˆ ˆ return A, where each column is Ai for some i Theorem 5.28. Algorithm 5.3 (OverlappingSVD) outputs a dictionary Aˆ so that for each √ i, kAi − Aî k ≤ ζ with high probability if k ≤ c min(m2/5 , µ logn n ) and if p ≥ max(m2 /k 2 log m,

mn log m log n ) ζ2

Proof. Appealing to Theorem 5.21, we have that with high probability the call to Algon rithm 5.1 (RecoverGraph) returns the correct support. Then given n log samples from the ζ2 distribution Γi the classic result of Rudelson implies that the computed empirical covariance ˆ i is close in spectral norm to the true co-variance matrix [140]. This, combined matrix Σ with the separation of the first and second singular values established in Corollary 5.27 and 97

Wedin’s Theorem A.2 imply that we recover each column of A up to an additive accuracy of and this implies the theorem. Note that since we only need to compute the first singular vector, this can be done via power iteration [68] and hence the bottleneck in the running time is the call to Algorithm 5.1(RecoverGraph).

5.5

A Higher-Order Algorithm

Here we extend the Algorithm 5.1 (RecoverGraph) presented in Section 5.3 to succeed even √ when k ≤ c min(m1/2− , n/µ log n). We can then use the Algorithm 5.2 in conjunction with this new clustering algorithm and obtain the second part of Theorem 6.1. The premise of Algorithm 5.1(RecoverGraph) is that we can distinguish whether or not a triple of sets X (1) , X (2) , X (3) has a common intersection based on their number of common neighbors in the connection graph. However for k = ω(m2/5 ) this is no longer true! But we will instead consider higher-order groups of sets. In particular, for any > 0 there is an ` so that we can distinguish whether an `-tuple of sets X (1) , X (2) , ..., X (`) has a common intersection or not based on their number of common neighbors even for k = Ω(m1/2− ). The main technical challenge is in showing that if the sets X (1) , X (2) , ..., X (`) do not have a common intersection, that we can upper bound the probability that a random set X intersects each of them. To accomplish this, we will need to bound the number of ways of piercing ` sets X (1) , X (2) , ..., X (`) that have bounded pairwise intersections by at most s points (see Lemma 5.31), and from this an analysis of Algorithm 5.4(RecoverGraphHighOrder) will be immediate. Algorithm 5.4. RecoverGraphHighOrder Compute a graph G on p nodes where there is an edge between i and j iff |Y (1) ·Y (2) | > 1/2 Initialize list of `-tuples to be empty pk Set T = Cm2 ` for i = 1 TO Ω(k `−2 m log2 m) do Choose a random node u in G, and ` − 1 neighbors u1 , u2 , ...u`−1 if |ΓG (u) ∩ ΓG (u1 ) ∩ ... ∩ ΓG (u`−1 )| ≥ T then Add (u, u1 , u2 , ...u`−1 ) to the list of `-tuples end if end for For each node x, add x to each set Su,u1 ,u2 ,...u`−1 where (u1 , u2 , ...u`−1 ) is an `-tuple and |ΓG (u) ∩ ΓG (u1 ) ∩ ... ∩ ΓG (x)| ≥ T Delete any set Su,u1 ,u2 ,...u`−1 that contains another set Sv,v1 ,v2 ,...v`−1 return the remaining sets Su,u1 ,u2 ,...u`−1 ∪ {u, u1 , u2 , ...u`−1 } What we need is an analogue of Claim 5.3 and Lemma 5.19. First the easy part:

98

Claim. Suppose supp(X (1) ) ∩ supp(X (2) ) ∩ ... ∩ supp(X (`) ) 6= ∅, then P rY [ for all j = 1, 2, ..., `, |Y · Y (j) | > 1/2] ≥ 2−`

k m

The proof of this claim is identical to the proof of Claim 5.3. But what about an analogue of Lemma 5.19? To analyze the probability that a set X intersects each of the sets X (1) , X (2) , ..., X (`) we will rely on the following standard definition: Definition 5.29. Given a collection of sets supp(X (1) ), supp(X (2) ), ..., supp(X (`) ), the piercing number is the minimum number of points p1 , p2 , ..., pr so that each set contains at least one point pi . The notion of piercing number is well-studied in combinatorics (see e.g. [121]). However, one is usually interested in upper-bounding the piercing number. For example, a classic result of Alon and Kleitman concerns the (p, q)-problem [5]: Suppose we are given a collection of sets that has the property that each choice of p of them has a subset of q which intersect. Then how large can the piercing number be? Alon and Kleitman proved that the piercing number is at most a fixed constant c(p, q) independent of the number of sets [5]. However, here our interest in piercing number is not in bounding the minimum number of points needed but rather in analyzing how many ways there are of piercing a collection of sets with at most s points, since this will directly yield bounds on the probability that X intersects each of X (1) , X (2) , ..., X (`) . We will need as a condition that each pair of sets has bounded intersection, and this holds in our model with high-probability. Claim. With high probability, the intersection of any pair supp(X (1) ), supp(X (2) ) has size at most Q Definition 5.30. We will call a set of ` sets a (k, Q) family if each set has size at most k and the intersection of each pair of sets has size at most Q. Lemma 5.31. The number of ways of piercing (k, Q) family (of ` sets) with s points is at most (`k)s . And crucially if ` ≥ s + 1, then the number of ways of piercing it with s points is at most Qs(s + 1) + (`k)s−1 . Proof. The first part of the lemma is the obvious upper bound. Now let us assume ` ≥ s + 1: Then given a set of s points that pierce the sets, we can partition the ` sets into s sets based on which of the s points is hits the set. (In general, a set may be hit by more than one point, but we can break ties arbitrarily). Let us fix any s + 1 of the ` sets, and let U be the the union of the pairwise intersections of each of these sets. Then U has size at most Qs(s + 1). Furthermore by the Pidgeon Hole Principle, there must be a pair of these sets that is hit by the same point. Hence one of the s points must belong to the set U , and we can remove this point and appeal to the first part of the lemma (removing any sets that are hit by this point). This concludes the proof of the second part of the lemma, too.

99

Corollary 5.32. The probability that the support of X hits each set in a (k, Q) family (of ` sets) is at most k s X `k 2 s X (Cs + (`k)s−1 ) + m m 2≤s≤`−1 s≥` where Cs is a constant depending polynomially on s. Proof. We can break up the probability of the event that the support of X hits each set in a (k, Q) family into another family of events. Let us consider the probability that X pierces the family with s ≤ ` − 1 points or s ≥ ` points. In the former case, we can invoke the second part of Lemma 5.31 and the probability that X hits any particular set of s points is at most (k/m)s . In the latter case, we can invoke the first part of Lemma 5.31. Note that if k ≤ m1/2 then k/m is always greater than or equal to k s−1 (k/m)s . And so asymptotically the largest term in the above sum is (k 2 /m)` which we want to be asymptotically smaller than k/m which is the probability in Claim 5.5. So if k ≤ cm(`−1)/(2`−1) then above bound is o(k/m) which is asymptotically smaller than the probability that a given set of ` nodes that have a common intersection are each connected to a random (new) node in the connection graph. So again, we can distinguish between whether or not an `-tuple has a common intersection √ or not and this immediately yields a new algorithm that works for k almost as large as m, although the running time depends on how close k is to this bound. Theorem 5.33. Algorithm 5.4(RecoverGraphHighOrder(`)) output sets that each corresponds to some i and contains all Y (j) for which i ∈ supp(X (j) ). The algorithm runs in √ n `−2 2 (`−1)/(2`−1) e time O(k mp + p n) and succeeds with high probability if k ≤ c min(m , µ log n ) 2 2 and if p = Ω(m /k log m) Hence this yields a polynomial time algorithm for finding an unknown incoherent dic√ n tionary whenever k ≤ c min( µ log n , m1/2− ) using Theorem 5.23. The proof of correctness of this algorithm is identical to one of Theorem 5.23 except that the definition of an identifying `-tuple for coordinate i is a set of ` samples that have a common intersection and for which the first ` − 1 have a common intersection that is exactly {i}. And note that the probability of finding an identifying `-tuple for coordinate i is at least Ω(1/(mk `−1 )). √ The threshold of k ≤ n/2µ is a natural barrier: Suppose √ we are given a vector u of the form u = Av where v has at most k non-zeros. Then if k ≤ n/2µ, the vector v is uniquely defined and√is the sparsest solution to the system u = Ax (where x is the variable). However when k > n/2µ this is no-longer true and there are incoherent dictionaries where a vector u admits more than one representation as a sparse linear combination of the columns of A. In fact, there are many known algorithms for recovering v from u up to the uniqueness threshold when the dictionary A is known. The above algorithm√gives a method to recover the dictionary at almost the same threshold – i.e. if k ≤ c min( µ logn n , m1/2− ).

100

Chapter 6 Learning Deep Representations Can we provide theoretical explanation for the success of deep nets in applications such as vision, speech recognition etc. (see Bengio’s survey [19])? Like most other ML tasks, learning deep nets seems NP-hard in any reasonable formulation, and in fact “badly NP-hard”because of many layers of hidden variables connected by nonlinear operations. Usually one imagines that NP-hardness is not a barrier to provable algorithms in ML because the input is drawn from some simple distribution and not worst-case. However, supervised learning of neural nets even on random inputs still seems as hard as cracking cryptographic schemes: this holds for depth-5 neural nets [93] and even ANDs of thresholds (a simple depth two network) [97]. However, modern deep nets are not “just”neural nets. They assume that the net (or some modification) can and should be run in reverse to get a generative model that produces a distribution that fits the empirical input distribution. Hinton promoted this viewpoint, and assumed that each level is a Restricted Boltzmann Machine (RBM), which is “reversible”in this sense. Vincent et al. [156] introduced denoising autoencoder, a generalization of an RBM (see Definition 6.3). These viewpoints allow a different methodology than classical backpropagation: layerwise learning. The bottom (observed) layer is learnt in unsupervised fashion using the provided data. This gives values for the next layer of hidden variables, which are used as the data to learn the next higher layer, and so on. The final net thus learned is also a good generative model for the distribution of the bottom layer1 . Viewing layers of deep networks as reversible representations alludes to assumptions and techniques in the previous chapter. Each layer now is a sparse nonlinear code (as opposed to a sparse linear code in the previous section). We shall show a similar robustness assumption allows us to learn the nonlinear code under mild assumptions, and the learning algorithm can be composed to learn a random deep network. The rest of this chapter is organized as follows: in Section 6.1 we introduce the generative model and our main results. Every layer of our network will be a denoising autoencoder, and this is shown in Section 6.2. In Section 6.3 we illustrate the high-level algorithm, and prove it learns one layer networks. These results are extended in Section 6.4 to learn a multi-layer network. Section 6.5 describes 1

Recent work suggests that classical backpropagation-based learning of neural nets together with a few modern ideas like convolution and dropout training also performs very well [103], though the authors suggest that some unsupervised pretraining should help further.

101

the graph recovery algorithm, which is a crucial step in learning the network. The last layer of our network will be slightly different from the previous layers, and is discussed in Section 6.6. In Section 6.7 we show our network cannot be simplified: a two layer network cannot be represented by a one layer network. Finally we list the graph properties we used in Section 6.8.

6.1 6.1.1

Background and Main Results Deep Networks

Here we describe the class of randomly chosen neural nets that can be learned by our algorithm. A network D(`, ρ` , {Gi }) has ` hidden layers of binary variables h(`) , h(`−1) , .., h(1) from top to bottom and an observed layer y at bottom (see Figure 6.1). The set of nodes at layer h(i) is denoted by Ni , and |Ni | = ni . The nodes of the output layer are denoted by Y (or N0 ), and |Y | = n0 . For simplicity let n = maxi ni , and assume each ni > nc for some positive constant c.

h(`) random neural net G`−1

h(`−1) = sgn(G`−1 h(`) )

h(`−1) random neural nets

h(i−1) = sgn(Gi−1 h(i) )

h(1) random linear function G0

y = G0 h(1)

y

(observed layer)

Figure 6.1: Example of a deep network The graph between layers i + 1 and i is a random bipartite graph Gi (Ni+1 , Ni , Ei , w), where every edge (u, v) ∈ Ni+1 × Ni in Ei has probability pi . We denote this distribution of graphs by Gni+1 ,ni ,pi . Each edge e ∈ Ei carries a random weight w(e) in {−1, 1}.The set of positive edges are denoted by Ei+ = {(u, v) ∈ Ni+1 × Ni : w(u, v) > 0}, and the set of negative edges by E − . The threshold for the generative direction (from top to bottom) is 0 for every node. The top layer h(`) is a random vector where every component is an independent Bernoulli variable with probability ρ` 2 . Each node in layer ` − 1 computes a weighted sum of its neighbors in 2

We can also allow the top layer to be uniform over all the ρ` n` sparse vectors, because in that case all the probabilities computed in the following sections changes only by a multiplicative factor of (1 + o(1)).

102

layer `, and becomes 1 iff that sum strictly exceeds 0. We use sgn(·) to denote the threshold function: sgn(x) = 1 if x > 0 and 0 else. (6.1) Applying sgn() to a vector involves applying it componentwise. Hence the network generates the observed examples using the following equations: h(i−1) = sgn(Gi−1 h(i) ) for all i > 0 and h(0) = G0 h(1) (i.e., no threshold at the observed layer).Here (with slight abuse of notation) Gi stands for both the bipartite graph and the bipartite weight matrix of the graph at layer i. Throughout this chapter, “with high probability” means the probability is at least 1−n−C for arbitrary large constant C. Moreover, f g, f g mean f ≥ n g , f ≤ n− g where is a arbitrarily small constant3 . More network notations. The expected degree from Ni to Ni+1 is di , that is, di , pi |Ni+1 | = pi ni+1 , and the expected degree from Ni+1 to Ni is denoted by d0i , pi |Ni | = pi ni . The set of forward neighbors of u ∈ Ni+1 in graph Gi is denoted by Fi (u) = {v ∈ Ni : (u, v) ∈ Ei }, and the set of backward neighbors of v ∈ Ni in Gi is denoted by Bi (v) = {u ∈ Ni+1 : (u, v) ∈ Ei }. We use Fi+ (u) to denote the positive neighbors: Fi+ (u) , {v, : (u, v) ∈ Ei+ } (and similarly for Bi+ (v)). The expected density of the layers are defined as ρi−1 = ρi di−1 /2 (ρ` is given as a parameter of the model). For higher level ancestors, we use B (j) (u) to denote (j) the all nodes that have a path to u at layer j, and B+ (u) to denote all nodes that have a positive path to u at layer j. Our analysis works while allowing network layers of different sizes and different degrees. For simplicity, we recommend first-time readers to assume all the ni ’s are equal, and di = d0i for all layers. Discussions Recent papers have given theoretical analysis of models with multiple levels of hidden features, including SVMs [40, 115]. However, none of these solves the task of recovering a ground-truth neural network given its output distribution. Though real-life neural nets are not random, our consideration of random deep networks makes some sense for theory. Sparse denoising autoencoders are reminiscent of other objects such as error-correcting codes, compressed sensing, etc. which were all first analysed in the random case. As mentioned, provable reconstruction of the hidden layer in a known autoencoder already seems a nonlinear generalization of compressed sensing (1-bit compressed sensing, [30]), and even the usual compressed sensing seems possible only if the adjacency matrix has “random-like” properties (low coherence or restricted isoperimetry or lossless expansion). In fact our result that a single layer of our generative model is a sparse denoising autoencoder can be seen as an analog of the fact that random matrices are good for compressed sensing/sparse reconstruction (see Donoho [55] for general matrices and Berinde et al. [21] for sparse matrices). Of course, in compressed sensing the matrix of edge weights is known whereas here it has to be learnt, which is the main contribution of our work. Fur3

It is possible to make the analysis tighter and allow f ≥ g poly log n, we choose this to make the presentation simpler.

103

thermore, we show that our algorithm for learning a single layer of weights be extended to do layerwise learning of the entire network. Along the way we show interesting properties of such randomly-generated neural nets: (a) Each pair of layers constitutes a denoising autoencoder in the sense of Vincent et al.; see Lemma 6.4. (b) The reverse computation (computing the top hidden layer given the value of the observed vector) can also be performed by the same network by appropriately changing the thresholds at the computation nodes (c) The reverse computation is stable to dropouts (d) the distribution generated by a two-layer net cannot be represented by any single layer neural net (see Appendix), which in turn suggests that a random t-layer network cannot be represented by any t/2-level neural net4 . Note that properties (a) to (d) are assumed in modern deep network research (e.g., (b) is a heuristic trick called “weight tying”) so the fact that they provably hold for a random generative model can be seen as some theoretical validation of those assumptions.

6.1.2

Main Results

Theorem 6.1. For a network D(`, ρ` , {Gi }), if all graphs Gi ’s are chosen according to Gni+1 ,ni ,pi , weights are chosen at random, and the parameters satisfy: 1. All di 1, d0i 1. 2. For all layers, n3i (d0i−1 )8 /n8i−1 1. 3. For all but last layer (i ≥ 1), ρ3i ρi+1 . 3/2

4. For last layer, ρ1 d0 = O(1), d0 /d1 d2 1,

√ d0 /d1 1, d20 n1

P Then there is an algorithm using O(log2 n/ρ2` ) samples, running in time O( ì=1 ni ((d0i−1 )3 + ni−1 )) that learns the network with high probability over both the graph and the samples. Remark 6.2. We include the last layer whose output is real instead of 0/1, in order to get fully dense outputs. We can also learn a network without this layer, in which case the last layer needs to have density at most 1/ poly log(n), and Condition 4 is no long needed. Although we assume each layer of the network is a random graph, we are not using all the properties of the random graph. The properties we need are listed in Section 6.8.

6.2

Denoising Property

Experts feel that deep networks satisfy some intuitive properties. First, intermediate layers in a deep representation should approximately preserve the useful information in the input layer. Next, it should be possible to go back/forth easily between the representations in two 4

Proving this for t > 3 is difficult however since showing limitations of even 2-layer neural nets is a major open problem in computational complexity theory. Some deep learning papers mistakenly cite an old paper for such a result, but the result that actually exists is far weaker.

104

successive layers, and in fact they should be able to use the neural net itself to do so. Finally, this process of translating between layers should be noise-stable to small amounts of random noise. All this was implicit in the early work on RBM and made explicit in the paper of Vincent et al. [156] on denoising autoencoders. For a theoretical justification of the notion of a denoising autoencoder based upon the “manifold assumption” of machine learning see the survey of Bengio [19]. Definition 6.3 (Denoising autoencoder). An autoencoder consists of an decoding function D(h) = s(W h + b) and a encoding function E(y) = s(W 0 y + b0 ) where W, W 0 are linear transformations, b, b0 are fixed vectors and s is a nonlinear function that acts identically on each coordinate. The autoencoder is denoising if E(D(h) + η) = h with high probability where h is drawn from the input distribution, η is a noise vector drawn from the noise distribution, and D(h) + η is a shorthand for “E(h) corrupted with noise η.” The autoencoder is said to use weight tying if W 0 = W T . The popular choices of s includes logistic function, soft max, etc. In this work we choose s to be a simple threshold on each coordinate (i.e., the test > 0, this can be viewed as an extreme case of logistic function). Weight tying is a popular constraint and is implicit in RBMs. Our work also satisfies weight tying. In practice the denoising autoencoder property is a “soft” constraint on the nonlinear optimization used in learning: it takes the form of a term that penalizes deviations from this ideal property. A denoising autoencoder is a network that makes this penalty term zero. Thus one could conceivably weaken the above definition to say E(D(h) + η) ≈ h instead of exact equality. However, our networks satisfy the stronger equality constraint for most datapoints. We will show that each layer of our network is a denoising autoencoder with very high probability. (Each layer can also be viewed as an RBM with an additional energy term to ensure sparsity of h.) Later we will of course give efficient algorithms to learn such networks without recoursing to local search. In this section we just prove they satisfy Definition 6.3. The single layer has m hidden and n output (observed) nodes. The connection graph between them is picked randomly by selecting each edge independently with probability p and putting a random weight on it in {−1, 1}. Then the linear transformation W corresponds simply to this matrix of weights. In our autoencoder we set b = 0 and b0 = 0.2d0 × 1, where d0 = pn is the expected degree of the random graph on the hidden side. (By simple Chernoff bounds, every node has degree very close to d0 .) The hidden layer h has the following prior: it is given a 0/1 assignment that is 1 on a random subset of hidden nodes of size ρm. This means the number of nodes in the output layer that are 1 is at most ρmd0 = ρnd, where d = pm is the expected degree on the observed side. We will see that since b = 0 the number of nodes that are 1 in the output layer is close to ρmd0 /2. Lemma 6.4. If ρmd0 < 0.05n (i.e., the assignment to the observed layer is also fairly sparse) then the single-layer network above is a denoising autoencoder with high probability (over the choice of the random graph and weights), where the noise distribution is allowed to flip every output bit independently with probability 0.01. 105

Remark: The parameters accord with the usual intuition that the information content must decrease when going from observed layer to hidden layer. Proof. By definition, D(h) = sgn(W h). Let’s understand what D(h) looks like. If S is the subset of nodes in the hidden layer that are 1 in h, then the unique neighbor property (Lemma 6.27) implies that (i) With high probability each node u in S has at least 0.9d0 neighboring nodes in the observed layer that are neighbors to no other node in S. Furthermore, at least 0.44d0 of these are connected to u by a positive edge and 0.44d0 are connected by a negative edge. All 0.44d0 of the former nodes must therefore have a value 1 in D(h). Furthermore, it is also true that the total weight of these 0.44d0 positive edges is at least 0.21d0 . (ii) Each v not in S has at most 0.1d0 neighbors that are also neighbors of any node in S. Now let’s understand the encoder, specifically, E(D(h)). It assigns 1 to a node in the hidden layer iff the weighted sum of all nodes adjacent to it is at least 0.2d0 . By (i), every node in S must be set to 1 in E(D(h)) and no node in S is set to 1. Thus E(D(h)) = h for most h’s and we have shown that the autoencoder works correctly. Furthermore, there is enough margin that the decoding stays stable when we flip 0.01 fraction of bits in the observed layer. For the last layer, since the output is not 0/1, we shall choose a different threshold. Lemma 6.5. For the last layer, with high probability the encoder E(y) = sgn(GT y − 0.51) and linear decoder D(h) = Gh form a denoising autoencoder, where the noise vector η has independent components each with variance at most o(d00 / log2 n). Proof. When there is no noise, we know E(D(h)) = sgn(GT Gh − 0.5d00 1). With high probability the matrix GT G has value at least 0.9d00 on diagonals. For any fixed h, the support of the i-th row of GT G and the support of h have a intersection of size at most d00 log2 n with high probability. Also, at all these intersections, the entries of GT G has p 0random T 0 signs, and variance bounded by O(1). Hence if h = 1 (G Gh) ≥ 0.9d − O( d0 log2 n) i i 0 p ; if hi = 0 (GT Gh)i ≤ O( d00 log2 n). Setting the threshold at 0.5d00 easily distinguishes between these two cases. Even if we add noise, since the inner product of Gi and the noise vector is bounded by o(d0 ) with high probability, the autoencoder still works.

6.3

Learning a Single Layer Network

We first consider the question of learning a single layer network, which as noted amounts to learning nonlinear dictionaries. It perfectly illustrates how we leverage the sparsity and the randomness of the support graph. The overall algorithm is illustrated in Algorithm 6.1. We start with the simplest subcase of the first step: all edge weights are 1, and consider only pairwise correlations. Then we shall consider 3-wise correlations to figure out which triples share a common neighbor in the hidden layer. The correlation structure is used by 106

Algorithm 6.1. High Level Algorithm input samples y’s generated by a deep network described in Section 6.1 output Output the network/encoder and decoder functions 1: for i = 1 TO ` do 2: Call LastLayerGraph/3-Wise Graph on h(i−1) to construct the correlation graph 3: Call RecoverGraph3Wise to learn the positive edges Ei+ 4: Use PartialEncoder to encode all h(i−1) to h(i) 5: Call LearnGraphLast/LearnGraph to learn the graph/decoder between layer i and i − 1. 6: end for the Graph Recovery procedure (described later in Section 6.5) to learn the support of the graph. In Section 6.3.1 we show how to generalize these ideas to learn positive edges when edges can have weights {±1}. In Section 6.3.2 we construct a partial encoder, which can do encoding given only the support of positive edges. The result there is general and works in the multi-layer setting. Finally we give a simple algorithm for learning the negative edges.

6.3.1

Correlation Implies Common Cause

Warm up: 0/1 weights In this part we work with the simplest setting: a single level network with m hidden nodes, n observed nodes, and a random (but unknown) bipartite graph G(U, V, E) connecting them (see Figure 6.2). The expected backdegree of observed nodes are d. All edge weights are 1, so learning G is equivalent to finding the edges. Recall that we denote the hidden variables by h and the observed variables by y, and the neural network implies y = sgn(Gh). w

h

(hidden layer)

G y

(observed layer)

u0

u

v

Figure 6.2: Single layered network Theorem 6.6. Let G be a graph satisfying properties Psing . Suppose ρ 1/d2 , with high ˆ where u, v are connected in probability over the samples, Algorithm 6.2 construct a graph G, ˆ iff they have a common neighbor in G. G As mentioned, the crux of the algorithm is to compute the correlations between observed variables. The following lemma shows pairs of variables with a common parent both get value 107

Algorithm 6.2. PairwiseGraph input N = O(log2 n/ρ2 ) samples of y = sgn(Gh) ˆ on vertices V , u, v are connected if u, v share a positive neighbor in G output Graph G for each u, v in the output layer do if there are at least 2ρN/3 samples of y: yu = yv = 1 then ˆ connect u and v in G end if end for 1 more frequently than a typical pair. To simplify notation let ρy = ρd be the approximate expected density of output layer. Lemma 6.7. Under the assumptions of Theorem 6.6, if two observed nodes u, v have a common neighbor in the hidden layer then Pr[yu = 1, yv = 1] ≥ ρ h

otherwise, Pr[yu = 1, yv = 1] ≤ 2ρ2y h

Proof. When u and v have a common neighbor z in the input layer, as long as z is fired both u and v are fired. Thus Pr[yu = 1, yv = 1] ≥ Pr[hz = 1] = ρ. On the other hand, suppose the neighbor of u (B(u)) and the neighbors of v (B(v)) are disjoint. Then to make them both 1, we need supp(h) to intersect both B(u) and B(v). These two events are independent (because the sets are disjoing), the probability is at most (ρ|B(u)|)(ρ|B(v)|) ≤ 2ρ2y . The lemma implies that when ρ2y ρ(which is equivalent to ρ 1/(d2 )), we can find pairs of nodes with common neighbors by estimating the probability that they are both 1. The theorem follos directly from the lemma. Algorithm 6.3. 3-WiseGraph input N = O(log n/ρ2 ) samples of y = sgn(Gh), where h is unknown and chosen from uniform ρm-sparse distribution ˆ on vertices V . {u, v, s} is an edge if and only if they share a positive output Hypergraph G neighbor in G for each u, v, s in the observed layer of y do if there are at least ρN/3 samples of y satisfying all u, v and s are fired then ˆ add {u, v, s} as an hyperedge for G end if end for

108

In order to weaken the assumption that ρ 1/d2 , we can consider higher order correlations using Algorithm 6.3 3-WiseGraph. In the following lemma we show how 3-wise correlation works when ρ d−3/2 (ρ3y ρ). Lemma 6.8. If the graph G satisfies Psing and Psing+ , for any u, v, s in the observed layer, 1. Prh [yu = yv = ys = 1] ≥ ρ, if u, v, s have a common neighbor 2. Prh [yu = yv = ys = 1] ≤ 2ρ3y + O(ρy ρ log n) otherwise. Proof. The proof is very similar to the proof of Lemma 6.7. If u,v and s have a common neighbor z, then Pr[yu = yv = ys = 1] ≤ Pr[hu = 1] = ρ. On the other hand, if they don’t share a common neighbor, then Prh [yu = yv = ys = 1] = Pr[supp(h) intersects with B(u), B(v), B(s)]. There are two ways supp(h) can intersect with three sets: 1. it intersects with the three sets independently or 2. it intersects with the intersection of two. The probability of the first type is easily bound by 2ρ3y . Notice that the property Psing+ bounds the pairwise intersections of these sets by O(log n), so the probability of second type is bounded by O(ρy ρ log n). General case: finding common positive neighbors The previous two algorithms still work even when weights can be both positive and negative, as long as the weights are random. We will show this in the next lemma. A difference in notation here is ρy = ρd/2 (because only half of the edges have positive weights). Lemma 6.9. When the graph G satisfies properties Psing and Psing+ , for any u, v, s in the observed layer, 1. Prh [yu = yv = ys = 1] ≥ ρ/2, if u, v, s have a common positive neighbor 2. Prh [yu = yv = ys = 1] ≤ O(ρ3y + ρy ρ log n), otherwise. Proof. The proof is similar to the proof of Lemma 6.8. First, when u, v, s have a common positive neighbor z, let U be the neighbors of u, v, s except z. Hence U = B(u) ∪ B(v) ∪ B(s) \ {z}. By property Psing , we know the size of U is at most 3.3d. With at least 1 − 3.3ρd ≥ 0.9 probability, none of them is 1. Hence Pr[yu = yv = ys = 1] ≥ Pr[hz = 1] Pr[hU = 0] ≥ ρ/2. On the other hand, if u, v and s don’t have a positive common neighbor, then we have Prh [yu = yv = ys = 1] ≤ Pr[supp(h) intersects with B + (u), B + (v), B + (s)]. This case is the same as Lemma 6.8.

109

6.3.2

Paritial Encoder: Finding h given y

Although we only know the positive edges of G, it is enough for the encoding process. This is based on unique neighbor property (see Section 6.8 for more details). Intuitively, when ρd 1 any hz = 1 will have a large fraction of unique positive neighbors (that are not neighbors of any z 0 with hz0 = 1). These unique neighbors cannot be canceled by any −1 edges, so they must be 1 in y. Algorithm 6.4. PartialEncoder input positive edges E + , sample y = sgn(Gh), threshold θ output the hidden variable h return h = sgn((E + )> y − θ1) Lemma 6.10. If the support of vector h has the 11/12-strong unique neighbor property in G, then Algorithm 6.4 returns h given input E + and θ = 0.3d0 . Proof. Let S = supp(h). If u ∈ S, at most d0 /12 of its neighbors (in particular that many of its positive neighbors) can be shared with other vertices in S. Thus u has at least (0.3)d0 unique positive neighbors, and these are all “on”. Now if u 6∈ S, it can have an intersection at most d0 /12 with F (S) (by the definition of strong unique neighbors). Anything outside F (S) must be 0 in y. Thus there cannot be (0.3)d0 of its neighbors that are 1. Remark 6.11. Notice that Lemma 6.10 only depends on the unique neighbor property, which holds for the support of any vector h with high probability over the randomness of the graph. Therefore this ParitialEncoder can be used even when we are learning the layers of deep network (and h is not a i.i.d. random sparse vector).

6.3.3

Learning the Graph: Finding −1 edges.

Now that we can find h given y, the idea is to use many such pairs (h, y) and the partial graph E + to determine all the non-edges (i.e., edges of 0 weight) of the graph. Since we know all the +1 edges, we can thus find all the −1 edges. Consider some (h, y), and suppose yv = 1, for some output v. Now suppose we knew that precisely one element of B + (v) is 1 in h (recall: B + denotes the back edges with weight +1). Note that this is a condition we can verify, since we know both h and E + . In this case, it must be that there is no edge between v and S \ B + , since if there had been an edge, it must be with weight −1, in which case it would cancel out the contribution of +1 from the B + . Thus we ended up “discovering” that there is no edge between v and several vertices in the hidden layer. We now claim that observing polynomially many samples (h, y) and using the above argument, we can discover every non-edge in the graph. Thus the complement is precisely the support of the graph, which in turn lets us find all the −1 edges. 110

Algorithm 6.5. Learning Graph input positive edges E + , samples of (h, y) output E − 1: R ← (U × V ) \ E + . 2: for each of the samples (h, y), and each v do 3: Let S be the support of h 4: if yv = 1 and S ∩ B + (v) = {u} for some u then 5: for s ∈ S do 6: remove (s, v) from R. 7: end for 8: end if 9: end for 10: return R Note that the algorithm R maintains a set of candidate E − , which it initializes to (U × V ) \ E + , and then removes all the non-edges it finds (using the argument above). The main lemma is now the following. Lemma 6.12. Suppose we have N = O(log2 n/(ρ2 d)) samples (h, y). Then with high probability over choice of the samples, Algorithm 6.5 outputs the set E − . The lemma follows from the following proposition, which says that the probability that a non-edge (z, u) is identified by one sample (h, y) is at least ρ2 d/3. The probability that this particular non-edge is not identified after O(log2 n/(ρ2 d)) samples is < 1/nO(log n) . Therefore all non-edges must be found with high probability by union bound. Proposition 6.13. Let (z, u) be a non-edge, then with probability at least ρ2 d/3, all of the followings hold: 1. hz = 1, 2. |B + (u) ∩ supp(h)| = 1, 3. |B − (u) ∩ supp(h)| = 0. If such (h, y) is one of the samples we consider, (z, u) will be removed from R by Algorithm 6.5. Proof. The latter part of the proposition follows from the description of the algorithm. Hence we only need to bound the probability of the three events. The three events are independent because {z}, B + (u), B − (u) are three disjoint sets. The first event has probability ρ, the second event have probability roughly ρ|B + (u)| (since ρd 1) and the third event has probability at least 1 − ρ|B − (u)|. Therefore the probability that all three events happen is at least ρ2 /3.

6.4

Learning a Multi-layer Network

Among the four steps of the algorithm, graph recovery and partial encoding are not assuming any distribution on the hidden variable h. Hence these two steps can still be used for learning a multi-layer network. On the other hand, building the correlation graph and learning the −1 edges all depend on the distribution of h. In this section we shall show the effect of 111

correlation between hidden variables are bounded. Although in a multi-layer network the hidden variable h(1) does not have independent components, the first and last step of the algorithm can still work. The key idea here is that although the maximum correlation between two nodes in z, t in layer h(i+1) can be large, there are only very few pairs with such high correlation. Since the graph Gi is random and independent of the upper layers, we don’t expect to see many such pairs in the B(u), B(v) for any u, v in h(i) .

6.4.1

Correlation Grapph in a Multilayer Network

For the first step of the algorithm, we show correlation does not change the behavior of the algorithm by the following theorem. Theorem 6.14. For any 1 ≤ i ≤ `−1, and if the network satisfies Property Pmul + (which are satisfied by random graphs if ρ3i+1 ρi ), then given O(log2 n/ρ2i+1 ) samples, Algorithm 6.3 ˆ where (u, v, s) is an edge if and only if they share 3-WiseGraph constructs a hypergraph G, a positive neighbor in Gi . The theorem follows directly from the lemma below. Lemma 6.15. Under the assumption of Theorem 6.14, for any i ≤ ` − 1 and any u, v, s in the layer of h(i) , if they have a common positive neighbor(parent) in layer of h(i+1) Pr[h(i) u = h(i) v = h(i) s = 1] ≥ ρi+1 /3, otherwise Pr[h(i) u = h(i) v = h(i) s = 1] ≤ 0.2ρi+1 Proof. The idea is to go all the way back to the `-th layer and use the assumption on the input distribution there. Consider first the case when u, v and s have a common positive neighbor z in the layer of h(i+1) . Similar to the proof of Lemma 6.9, when h(i+1) z = 1 and none of other neighbors of u, v and s in the layer of h(i+1) is 1, we know h(i) u = h(i) v = h(i) s = 1. Going back to level (`) `, this is implied by the event B+ (z) has at least one component being 1, and everything (`) (`) (`) (`) else in B (u) ∪ B (v) ∪ B (s) are all 0. By Property Pmul , B+ (z) ≥ 0.8ρi+1 /ρ` , and (`) (`) B (u) ∪ B (`) (v) ∪ B (`) (s) ≤ 2`+1 ρi /ρ` . Let S = B (`) (u) ∪ B (`) (v) ∪ B (`) (s) \ B+ (z), we (`) know (using independence at the first layer h ) (`)

(`)

Pr[h(i) u = h(i) v = h(i) s = 1] ≥ Pr[supp(h(`) )∩B+ (z) 6= ∅] Pr[supp(h(`) )∩B+ (z) = ∅] ≥ ρi+1 /3. On the other hand, if u, v and s don’t have a common positive neighbor in layer of (`) (`) (`) h , consider event E: S intersects each of B+ (u), B+ (v), B+ (s). Clearly, the target probability can be upperbounded by the probability of E. There are three ways the support of h(`) can intersect with all three sets: 1. intersect with the intersection of three sets; 2. intersect with the intersection of two of the sets, and intersect at a different place with the third set; 3. intersect all three sets at different places. (i+1)

112

By graph Property Pmul+ , the size of pairwise and three-wise intersections can be bounded, let A ≤ 2`+1 ρi /ρ` be the size of the largest set, B ≤ O(log n)ρi+1 /ρ` be the size of the largest pairwise intersection, and let C = o(ρi+1 /ρ` ) be the size of 3-wise intersection, we can bound the probability of event E by Pr[E] ≤ ρ` C + 3ρ2` AB + ρ3` A3 ≤ 0.2ρi+1 . The bound on the third term uses the assumption that ρ3i ρi+1 .

6.4.2

Learning the −1 Edges

In order to show Algorithm 6.5 still works, we prove the following lemma that replaces Lemma 6.12. Lemma 6.16. If the network satisfy Pneg+ , given O(log2 n/(ρ2i+1 )) samples, with high probability over the choice of the samples, Algorithm 6.5 returns E − . Proof. Similar as Lemma 6.12, it suffices to prove the following Proposition 6.17 (which replaces Proposition 6.13). Proposition 6.17. For any (x, u) 6∈ Ei , with probability Ω(ρ2i+1 ) over the choice of h(i+1) , the following events happen simultaneously: 1. x ∈ supp(h(i+1) ), 2. B + (u) ∩ supp(h(i+1) ) = − 1, 3. B (u) ∩ supp(h(i+1) ) = 0. When these events happen, (x, u) is removed from R by Algorithm 6.5. Proof. Let t be an arbitrary node in B + (u), and let S = B(u) \ {t}. For simplicity, we shall (i+1) bound probability of the following three events happen simultaneously: 1. hx = 1; 2. (i+1) ht = 1; 3. supp(h(i+1) ) ∩ S = ∅. Clearly, these three events imply the three events in the lemma. The set S does not (`) depend on graphs Gi+1 to G`−1 . Going back to the first layer, let S1 = B+ (x)\B (`) ({x}∪S), (`) S2 = B+ (t) \ B (`) ({x} ∪ S), S3 = B (`) (t) ∪ B (`) (x) ∪ B (`) (S) \ (S1 ∪ S2 ). By Property Pneg+ , |S1 | ≥ ρi+1 /2ρ` , |S2 | ≥ ρi+1 /2ρ` , and |S3 | ≤ 2`+2 ρi /ρ` . Let H = supp(h(`) ), the probability of the three events is at least Pr[H ∩ S1 6= ∅] Pr[H ∩ S2 6= ∅] Pr[H ∩ S3 = ∅] ≥ (ρi+1 /3)2 (1 − O(ρi )) = Ω(ρ2i+1 ).

6.5

Graph Recovery

By Algorithms 6.2 PairwiseGraph and 6.3 3-WiseGraph, we can construct a graph/hypergraph that is related to the support graph of the positive edges in the network. How can we find the positive edges? The problem reduces to a graph recovery problem, which is very similar to the problem we solved in Section 5.3. For the pairwise correlation graph, the problem is formulated as follows: 113

Definition 6.18 (Graph Recovery Problem). There is an unknown random bipartite graph G1 (U, V, E1 ) between |U | = m and |V | = n vertices. Each edge is in E1 with probability d0 /n. ˆ Given: Graph G(V, E) where (v1 , v2 ) ∈ E iff v1 and v2 share a common parent in G1 (i.e. ∃u ∈ U where (u, v1 ) ∈ E1 and (u, v2 ) ∈ E1 ). Goal: Find the bipartite graph G1 . This problem is really similar to the problem we solved in Section 5.3. In fact, it is even simpler because whenever two vertices share a common parent they are connected (in Section 5.3 such pairs are only connected with probability /12). We can directly apply Algorithm 5.1 or the more complicated Algorithm 5.4 to solve this problem. However, there is a limit in the sparsity of the graphs for these algorithms to work. In √ 0 particular, they would require d n. This is not very satisfying for our application. In this case, we can rely on the hypergraph generated by 3WiseGraph. Such higher-order correlation structure was not available for algorithms in the previous chapter. Definition 6.19 (Graph Recovery with 3-wise Correlation). There is an unknown random bipartite graph G1 (U, V, E1 ) between |U | = m and |V | = n vertices. Each edge is in E1 with probability d0 /n. ˆ Given: Hypergraph G(V, E) where (v1 , v2 , v3 ) ∈ E iff there exists u ∈ U where (u, v1 ), (u, v2 ) and (u, v3 ) are all in E1 . Goal: Find the bipartite graph G1 . Algorithm 6.6. RecoverGraph3Wise ˆ in Definition 6.19 input Hypergraph G output Graph G1 in Defintion 6.19 repeat Pick a random hyperedge (v1 , v2 , v3 ) in E Let S = {v : (v, v1 , v2 ), (v, v1 , v3 ), (v, v2 , v3 ) ∈ E} if |S| < 1.3d0 then 0 Let S 0 = {v ∈ S : v is correlated with at least 0.8d2 −1 pairs in S} In G1 , create a vertex u and connect u to every v ∈ S 0 . Mark all hyperedges (v1 , v2 , v3 ) for v1 , v2 , v3 ∈ S 0 end if until all hyperedges are marked The intuitions behind Algorithm 6.6 are very similar to the algorithms in previous chapter: since 3-wise correlations are rare, not many vertices should have 3-wise correlation with all three pairs (v1 , v2 ), (v1 , v3 ) and (v2 , v3 ) unless they are all in the same community. The performance of Algorithm 6.6 is better than the previous algorithms because 3-wise correlations are rarer.

114

ˆ is constructed according Theorem 6.20. When the graph G1 satisfy properties Prec+ , G 3 08 8 to Definition 6.19, and when m d /n 1, Algorithm 6.6 solves Graph Recovery with 3wise Correlation. The expected running time is O(m(d03 + n)) over the randomness of the algorithm. Proof. If (v1 , v2 , v3 ) has more than one common neighbor, then Property 2 shows the condition in if statement must be false (as |S| ≥ |F (B(v1 ) ∩ B(v2 ) ∩ B(v3 ))| ≥ 1.5d0 . When (v1 , v2 , v3 ) has only one common neighbor u, then Property 1 shows S = F (u) ∪ T where |T | ≤ d0 /20. Now consider S 0 , for any v ∈ F (u), it is correlated with all other pairs in F (u). Hence it 0 must be correlated with at least 0.8d2 −1 pairs in S, which implies v is in S. For any v 0 6∈ F (u), by Property 3 it can only be correlated with d02 /40 pairs in F (u). Therefore, the total number of correlated pairs it can have in S is bounded by |T | |F (u)| + |T | 0.8d0 −1 02 + d /40 < . This implies v 0 is not in S. 2 2 The argument above shows S 0 = F (u), so the algorithm correctly learns the edges related to u. Finally, to bound the running time, notice that Property 4 shows the algorithm finds a new vertex u in 10 iterations in expectation. Each iteration takes at most n + d03 time. Therefore the algorithm takes O(m(d03 + n)) expected time.

6.6

Layer with Real-valued Output

In previous sections, we considered hidden layers with sparse binary input and output. However, in most applications of deep learning, the observed vector is dense and real-valued. Bengio et al.[20] suggested a variant auto-encoder with linear decoder, which is particularly useful in this case. We use similar ideas as in Section 6.3 to learn the last layer. The algorithm collects the correlation-structure of the observed variables, and use this information to reconstruct the edges. When we observe real-valued output in the last layer, we can learn whether three nodes u,v and s have a common parent from their correlation measured by E[xu xv xs ], even if the output vector is fully dense. Note that compared to Theorem 6.6 in Section 6.3, the main difference here is that we allow ρpm to be any constant (before ρpm 1/d). √ 3/2 Theorem 6.21. When ρ1 d0 = O(1), d20 n1 , d0 /(d1 d2 ) 1 and d0 /d1 1, the last layer is generated at random and the rest of the network satisfy Pmul+ , then with high probability over the choice of the weights and the choice of the graph, for any three nodes u, v, s at the observed vector y 1. If u, v and s have no common neighbor, then | Eh [yu yv ys ]| ≤ 0.6ρ1 2. If u, v and s have a unique common neighbor, then | Eh [yu yv ys ]| ≥ 0.1ρ1 Before proving the theorem we state a concentration bounds for trilinear forms: 115

Lemma 6.22. If x, y and z are three independent uniformly random vectors in {±1}n , with high probability sX X A2i,j,k log3 n). Ai,j,k xi yj zk ≤ O( i,j,k

i,j,k

This bound follows easily from the McDiarmid’s inequality (see Appendix B) and union bound. This is not tight (for example compared to the Hanson-Wright inequality in the previous chapter) but only loses poly log factors and is good enough for our need. Proof. In this proof we shall use h for h(1) to simplify notations. By the structure of the network we know X wi,u wj,v wk,s E[hi hj hk ] E[yu yv ys ] =

(6.2)

i∈B(u),j∈B(v),k∈B(s)

Observe that wi,u and wj,v are different random variables, no matter whether i is equal to j. Thus by concentration bounds for trilinear forms, with high probability over the choice of the weights s X 3 (6.3) | E[yu yv ys ]| ≤ E[hi hj hk ]2 log n i∈B(u),j∈B(v),k∈B(s)

Let V be the main term in the bound: X V ,

2 E[hi hj hk ] .

i∈B(u),j∈B(v),k∈B(s)

We need to bound the size of V . From Property Pmul+ we can bound the expectations of h variables. Proposition 6.23. ( ≤ O(ρ31 + ρ2 ρ1 + ρ2 ) log n if B + (i) ∩ B + (j) ∩ B + (k) 6= ∅, [h h h ] is E i j k ≤ O(ρ31 + log nρ2 ρ1 + ρ3 ) log n if B + (i) ∩ B + (j) ∩ B + (k) = ∅, and

( ≤ O(ρ21 + ρ2 ) log n if B + (i) ∩ B + (j) 6= ∅, E[hi hj ] is ≤ O(ρ21 + ρ3 ) log n if B + (i) ∩ B + (j) = ∅,

When d20 n1 , with high probability we can also bound the intersection of neighbors |B(u) ∩ B(v)| ≤ log n for any two u, v in the observed layer. Combining all these, we can carefully of V when u, v, s do not share a √ √bound the size 3/2 common neighbor. In particular V ≤ v1 + v2 ≤ O(d (ρ31 + ρ2 ρ1 + ρ3 ) + 10d1/2 (ρ21 + ρ2 )) poly log n. p Under the assumption of theorem, V log3 n ≤ ρ1 /3. With high probability, when u, v, s do not share a common neighbor, we know | E[yu yv ys ]| ≤ ρ1 /10. 116

When u, v, s share a unique neighbor z, we take the term E[h3z ]wz,u wz,v wz,s out of the sum. The remaining terms can still be bounded by ρ1 /3. Hence | E[yu yv ys ]| ≥ E[h3z ] − ρ1 /10 ≥ 0.6ρ1 (we argue that E[hz ] ≥ 0.7ρ by going back to the `-th layer as before). Although the hypergraph generated by Theorem 6.21 is not exactly the same as the hypergraph required in Definiton 6.19, the same Algorithm 6.6 can find the edges of the graph. This is because the two hypergraphs only differ when u, v, s share more than one common neighbor, which happens with extremely low probability (Algorithm 6.6 can actually tolerate this effect even if the probability is a small constant). Note that different from the previous sections, here Algorithm 6.3 actually returns a graph that contains both the positive edges and negative edges. In order to reconstruct the weighted bipartite graph we need to differentiate between positive and negative edges. This is not hard: from the proof of Theorem 6.21, when u, v, s share a unique neighbor x, Eh [yu yv ys ] > 2ρ/3 if an even number of edges (x, u), (x, v), (x, s) are negative. Therefore Algorithm 6.7 can distinguish positive and negative edges. Algorithm 6.7. LearnLastLayer input E, the set of edges output E + , E − , the set of positive and negative edges for each node x ∈ h(1) do Pick u, v in F (x) Let S be the set {s : Eh [yu yv ys ] > 0}. for each t : (x, t) ∈ E do if most triples (t, u0 , v 0 )(u0 , v 0 ∈ S) have positive Eh [yt yu0 yv0 ] then add (x, t) to E + else add (x, t) to E − end if end for end for To prove the algorithm works we first consider the case when any triple share at most one parent. In that case all nodes in S will have edges with the same sign to x, therefore the sign of t is exactly the sign of Eh [yt yu0 yv0 ] for any u0 , v 0 ∈ S. The algorithm is extremely noise tolerant, so it still works even if there is a small number of triples that share more than one parent (and the test Eh [yu yv ys ] > 0 is not perfect).

6.7

Lower Bound

In this section we show that a two-layer network with ±1 weights is more expressive than one layer network with arbitrary weights. A two-layer network (G1 , G2 ) consists of random graphs G1 and G2 with random ±1 weights on the edges. Viewed as a generative model, its 117

input is h(3) and the output is h(1) = sgn(G1 sgn(G2 h(3) )). We will show that a single-layer network even with arbitrary weights and arbitrary threshold functions must generate a fairly different distribution. Lemma 6.24. For almost all choices of (G1 , G2 ), the following is true. For every one layer network with matrix A and vector b, if h(3) is chosen to be a random ρ3 n-sparse vector with ρ3 d2 d1 1, the probability (over the choice of h(3) ) is at least Ω(ρ23 ) that sgn(G1 sgn(G1 h(3) )) 6= sgn(Ah(3) + b). The idea is that the cancellations possible in the two-layer network simply cannot all be accomodated in a single-layer network even using arbitrary weights. More precisely, even the bit at a single output node v cannot be well-represented by a simple threshold function. First, observe that the output at v is determined by values of d1 d2 nodes at the top layer that are its ancestors. Wlog, in the one layer net (A, b), there should be no edge between v and any node u that is not its ancestor. The reason is that these edges between v and its ancestors in (A, b) can be viewed collectively as a single random variables that is not correlated with values at v’s ancestors, and either these edges are “influential” with probability at least ρ23 /4 in which case it causes a wrong bit at v; or else it is not influential and removing it will not change the function computed on ρ23 /4 fraction of probability mass. Similarly, if there is a path from u to v then there must be a corresponding edge in the one-layer network. The question is what weight it should have, and we show that no weight assignment can avoid producing erroneous answers. The reason is that with probability at least ρ3 /2, among all ancestors of v in the input layer, only u is 1. Thus in order to produce the same output in all these cases, in the onelayer net the edge between u and v should be positive iff the path from u to v consists of two positive edges. But now we show that with high probability there is a cancellation effect involving a local structure in the two layer net whose effect cannot be duplicated by such a single-layer net (See the Figure 6.3 and 6.4). u1 u2

u3

u4

u1 u2

h(3) +1 -1 (2)

h

h(1)

u3

u4

h(3) +1

s

s0

+1

+1

-1

Au1 v > 0 < 0

G2

>0

0, Au2 ,v + Au3 ,v + bv > 0. However, there can 118

be no choices of (A, b) that satisfies all four constraints! To see that, simply add the first P4 two and the last two, the left-hand-sides are all i=1 Aui ,v + 2bv , but the right-hand-sides are ≤ 0 and > 0. The single-layer network cannot agree on all four inputs. Each of the four inputs occurs with probability at least Ω(ρ23 ). Therefore the outputs of two networks must disagree with probability Ω(ρ23 ). Remark: It is well-known in complexity theory that such simple arguments do not suffice to prove lowerbounds on neural nets with more than one layer.

6.8

Random Graph Properties

In this Section we state the properties of random graphs that are used by the algorithm. We first describe the unique-neighbor property, which is central to our analysis. In the next part we list the properties required by different steps of the algorithm.

6.8.1

Unique neighbor property

Recall that in bipartite graph G(U, V, E, w), the set F (u) denotes the neighbors of u ∈ U , and the set B(v) denotes the neighbors of v ∈ V . For any node u ∈ U and any subset S ⊂ U , let UF (u, S) be the sets of unique neighbors of u with respect to S, UF (u, S) , {v ∈ V : v ∈ F (u), v 6∈ F (S \ {u})} Definition 6.25. In a bipartite graph G(U, V, E, w), a node u ∈ U has (1 − )-unique neighbor property with respect to S if |U F (u, S)| ≥ (1 − ) |F (u)|. The set S has (1 − )-strong unique neighbor property if for every u ∈ U , u has (1 − )unique neighbor property with respect to S. Remark 6.26. Notice that in our definition, u does not need to be inside S. This is not much stronger than the usual definition where u has to be in S: if u is not in S we are simply saying u has unique neighbor property with respect to S ∪ {u}. When all (or most) sets of size |S| + 1 have the “usual” unique neighbor properties, all (or most) sets of size |S| have the unique neighbor property according to our definition. Lemma 6.27. If G(U, V, E) is from distribution Gm,n,p , for every subset S of U with (1 − p)|S| > 1 − /2 (note that p|S| = d0 |S|/n is the expected density of F (S)), with probability 1−2m exp(−2 d0 ) over the randomness of the graph, S has the (1−)-strong unique neighbor property. Proof. Fix the vertex u, first sample the edges incident to u. Without loss of generality assume u has neighbors v1 , . . . , vk (where k ≈ pn). Now sample the edges incident to the vi ’s. For each vi , with probability (1 − p)|S| ≥ 1 − /2, vi is a unique neighbor of u. Call this event Goodi (and we also use Goodi as the indicator for this event). 119

By the construction of the graph we know Goodi ’s are independent, hence by Chernoff bounds, we have that with probability 1 − 2 exp(−2 d0 ) the node u has unique neighbor property with respect to S. By union bound, every u satisfies this property with probability at least 1−2m exp(−2 C).

6.8.2

Properties required by each steps

We now list the properties required in our analysis. These properties hold with high probability for random graphs. Properties for Correlation Graph: Single-layer The algorithm PairwisGraph requires the following properties Psing 1. For any u in the observed layer, |B(u)| ∈ [0.9d, 1.1d], (if G has negative weights, we also need |B + (u)| ∈ [0.9d/2, 1.1d/2]) 2. For any z in the hidden layer, |F (z)| ∈ [0.9d0 , 1.1d0 ], (if G has negative weights, we also need |F + (z)| ∈ [0.45d0 , 0.55d0 ]) The algorithm 3-WiseGraph needs Psing , and additionally Psing+

1. For any u, v in the observed layer, |B + (u) ∪ B + (v)| ≤ 10 log n.

Lemma 6.28. If graph G is chosen from Gm,n,p with expected degrees d, d0 log n, with high probability over the choice of the graph, Psing is satisfied. If in addition d2 n, the property Psing+ is also satisfied with high probability. The lemma is a straight-forward application of concentration bounds.

Properties for Correlation Graph: Multi-layer For the multi-layer setting, the algorithm PairwisGraph requires the following expansion properties Pmul . 0 0 1. For any node u at the layer i, |Fi−1 (u)| ∈ [0.9di−1 , 1.1di−1 ], |Bi (u)| ∈ [0.9di , 1.1di ], F + (u) ∈ [0.45d0i−1 , 0.55d0i−1 ], B + (u) ∈ [0.45di , 0.55di ] i−1 i (t)

2. For any node u at the layer i, |B+ (u)| ≥= 0.8ρi /ρt , and |B (`) (u)| ≤ 2`−i+1 ρi /ρ` . 3. For any pair of nodes u, v at layer i, (`) (i+1) (`) (i+1) B+ (u) ∩ B+ (v) ≤ O ρi+1 /ρ` · 1/di+1 + log n/(ρ` n` ) + B+ (u) ∩ B+ (v) (i+1) (i+1) In particular, if u and v have no common positive parent at layer i+1 ( B+ (u) ∩ B+ (v) = 0), then (`) (`) B+ (u) ∩ B+ (v) ≤ o(ρi+1 /ρ` ) 120

The algorithm 3-WiseGraph needs the additional property Pmul+ : (`) (`) 1. For any pair of nodes u, v at layer i, B+ (u) ∩ B+ (v) ≤ O(ρi+1 log n/ρ` ) 2. For any three nodes u, v and s at layer i, if they don’t have a common positive neighbor in layer i + 1, (`) (`) (`) B+ (u) ∩ B+ (v) ∩ B+ (s) ≤ O ρi+1 log n/ρ` · 1/di+1 + ρ` /(ρ` n` ) + 1/(ρ2` n2` ) ≤ o(ρi+1 /ρ` ) Property 1 can be relaxed but we choose to present this simpler condition. Lemma 6.29. If the network D(`, ρ` , {Gi }) have parameters satisfying di log n, and ρ2i ρi+1 , then with high probability over the randomness of the graphs, {Gi }’s satisfy Pmul . Additionally, if gi log n and ρ3i ρi+1 , then {Gi }’s satisfy Pmul+ with high probability. In order to prove Lemma 6.29 we need the following claim: Claim. If the graph G ∼ Gm,n,p with d = pm being the expected back degree, and d log n. For two arbitrary sets T1 and T2 , with d|T1 | m, d|T2 | m, we have with high probability |B(T1 ) ∩ B(T2 )| ≤ (1 + )d|T1 ∩ T2 | + (1 + )d2 |T1 ||T2 |/m + 5 log n This Claim simply follows from simple concentration bounds. Now we are ready to prove Lemma 6.29. Proof of Lemma 6.29. Property 1 in Pmul follows directly from di log n. Property 2 in Pmul follows from unique neighbor properties (when we view the bipartite graph from Ni to Ni+1 ). For Property 3, we prove the following proposition by induction on t: Proposition 6.30. For any two nodes u, v at layer 1, (2) (t) (t) B+ (u) ∩ B+ (v) ≤ (1+)t tρ21 /(ρ2t nt )+6t(1+)t ρ3 log n/ρt +(1+)t ρ2 B+ (u) ∩ B (2) (v) /ρt This is true for t = 2 (simply because of the third term). Suppose it is true for all the values at most t. When we are considering t + 1-th level, by Claim 6.8.2, we know with high probability (notice that we only need to do union bound on n2 pairs)

121

(t+1) (t) (t+1) (t) B+ (u) ∩ B+ (v) ≤ (1 + )dt /2 · ρ2 B+ (u) ∩ B+ (v) /ρt (t)

(t)

+(1 + )d2t /4 · |B+ (u)||B+ (v)| + 5 log n (2) ≤ (1 + )t+1 ρ2 B+ (u) ∩ B (2) (v) /ρt+1 + 6t(1 + )t+1 ρ3 log n/ρt+1

+(1 + )t tρ21 /(ρ2t nt ) + (1 + )d2t /4 · ρ21 /(ρ2t nt+1 ) + 5 log (2) ≤ (1 + )t+1 ρ2 B+ (u) ∩ B (2) (v) /ρt+1 + 6(t + 1)(1 + )t+1 ρ3 log n/ρt+1 +2(1 + )t+1 (t + 1)ρ21 /(ρ2t+1 nt+1 ),

where the last inequality uses the fact that ρ21 /(ρ2t nt ) ≤ d2t /4 · ρ21 /(ρ2t nt+1 ). This is because nt dt /nt+1 = d0t 1. Proposition 6.30 implies that when ρ21 ρ2 , and ` is a constant, (i+1) (t) (i+1) (t) B+ (u) ∩ B+ (v) ≤ O ρi+1 /ρ` · 1/di+1 + log n/(ρ` n` ) + B+ (u) ∩ B+ (v) Property 2 in Pmul+ is similar but more complicated. Properties for Graph Reovery For the algorithm RecoverGraph3Wise to work, the hypergraph generated from the random graph should have the following properties Prec+ . 1. For any (v1 , v2 , v3 ) ∈ E, if S is the set defined as in the algorithm, then |S\F (B(v1 ) ∩ B(v2 ) ∩ B(v3 ))| < d0 /20. 2. For any u1 , u2 ∈ U , |F (u1 ) ∪ F (u2 )| > 1.5d0 . 3. For any u ∈ U , v ∈ V and v 6∈ F (u), v is correlated with at most d02 /40 pairs in F (u). 4. For any u ∈ U , at least 0.1 fraction of triples v1 , v2 , v3 ∈ F (u) does not have a common neighbor other than u. 5. For any u ∈ U , its degree du ∈ [0.9d0 , 1.1d0 ]. Lemma 6.31. When m3 d08 /n8 1, d0 1, with high probability a random graph satisfies Property Prec+ . Proof. Property 1: Fix any v1 , v2 and v3 , Consider the graph G1 to be sampled in the following order. First sample the edges related to v1 , v2 and v3 , then sample the rest of the graph. At step 2, let S1 = B(v2 ) ∩ B(v3 )\B(v1 ), S2 = B(v1 ) ∩ B(v3 )\B(v2 ) and S3 = B(v1 ) ∩ ˆ every vertex in S must be in F (S1 ) ∩ B(v2 )\B(v3 ). By the construction of the graph G, −Ω(d0 ) F (S2 )∩F (S3 ). With high probability (e ) we know |S1 | ≤ 5m(d0 /n)2 (this is by Chernoff 122

bound, because each vertex u is connected to two vertices v2 , v3 with probability (d0 /n)2 ). Similar things hold for S2 , S3 . Now F (S1 ), F (S2 ) and F (S3 ) are three random sets of size at most 10m(d0 )3 /n2 , thus again by Chernoff bound we know their intersection has size smaller than 10(10m(d0 )3 /n2 )3 /n2 , which is smaller than d0 /20 by assumption. Property 2: This follows directly from the unique neighbor property of sets of size 2. Property 3: First sample the edges related to u and v, then sample edges related to vertices in B(v). For any vertex in B(v), the expected number of pairs is (d02 /n)2 . Since |B(v)| ≤ md0 /n with high probability, again by Chernoff bound we know the number of pairs is smaller than O((d02 /n)2 md0 /n) = O(md05 /n3 ). This is much smaller than d02 /40 under 1/3 assumption (notice that (md03 )/n3 = ((m3 d08 /n8 )(d0 /n)) 1). Property 4: Again change the sampling process: first sample all the edges not related to u, then sample 1/2 of the edges connecting to u, and finally sample the second half. Let S1 be the first half of F (u). For a vertex outside S1 , similar to Property 3 we know every v 6∈ S1 has at most d0 /40 neighboring pairs in S1 , therefore any new sample in the second half is going to introduce many new triples of correlation. The total number of new correlations is at least 0.1 fraction. Property 5: This simply follows from d0 1. Properties for Partial Encoder The partial encoder only relies on the strong unique-neighbor property. Properties for Learning −1 Edges In order to learn the −1 edges, we need Property Pneg+ . It includes some properties in Pmul , and additionally the property that nodes in i-th layer cannot be too negatively correlated, in particular 1. 1 and 2 in Pmul . 2. For fixed node u and set S at layer i, if |S| ≤ 2di−1 , then with high probability (`) B+ (u) \ B (`) (S) ≥ ρi /2ρ` . The proof of second property is very similar to Lemma 6.29.

123

Part III Tensor Decomposition for General Matrix Factorization

124

Chapter 7 Orthogonal Tensor Decomposition In this chapter we discuss Orthogonal Tensor Decomposition, which is a unified methodology for solving General Matrix Factorization problems (or learning latent variable models in general). This methodology provides computationally and statistically efficient parameter estimation methods for a wide class of latent variable models—including simple topic models, independent component analysis and hidden Markov models—which exploits a certain tensor structure in their low-order observable moments. Specifically, parameter estimation is reduced to the problem of extracting a certain (orthogonal) decomposition of a symmetric tensor derived from the moments; this decomposition can be viewed as a natural generalization of the singular value decomposition for matrices. Although tensor decompositions are generally intractable to compute, the decomposition of these specially structured tensors can be efficiently obtained by a variety of approaches, including power iterations and maximization approaches (similar to the case of matrices). We provide a detailed analysis of a robust tensor power method, establishing an analogue of Wedin’s perturbation theorem for the singular vectors of matrices. This implies a robust and computationally tractable estimation approach for several popular latent variable models.

7.1

Orthogonal Tensors

We call the decomposition T =

k X

λi vi⊗3

(7.1)

i=1

an orthogonal decomposition of tensor T if all the vectors vi ’s are orthonormal. And we call the tensor T orthogonally decomposable. Notice this is similar to the singular value Pthat k decomposition (SVD) of symmetric matrices: M = i=1 σi vi viT . All symmetric matrices can be decomposed into orthogonal components, but only very special tensors are orthogonally decomposable. Given a tensor Tˆ which is close to an orthogonally decomposable tensor T

125

Tˆ ≈ T =

k X

λi vi⊗3 ,

i=1

ˆ i ’s and v the goal of orthogonal tensor decomposition is to find λ î ’s that are close to λi ’s and vi ’s. Our main result is Theorem 7.1. LetP Tˆ = T + E ∈ Rk×k×k , where T is a symmetric tensor with orthogonal decomposition T = ki=1 λi vi⊗3 where each λi > 0, {v1 , v2 , . . . , vk } is an orthonormal basis, and E has operator norm := kEk. Define λmin := min{λi : i ∈ [k]}, and λmax := max{λi : i ∈ [k]}. There exists universal constant C such that when ≤ C λmin , there is an algorithm that k ˆ 1 ), (ˆ ˆ 2 ), . . . , (ˆ ˆ k ). With high probability, there exists a permutation π on returns (ˆ v1 , λ v2 , λ vk , λ [k] such that ˆ j | ≤ 5, ∀j ∈ [k], kvπ(j) − v ˆj k ≤ 8/λπ(j) , |λπ(j) − λ and

k X

⊗3 ˆj v

T − λ ˆ j ≤ 55.

j=1

The orthogonal constraint may seem very unnatural, as the parameters of models do not usually have such strong structure. However in Section 7.2 we show that a more general case reduces to orthogonal tensor decomposition via commonly applied whitening. In Section 7.3 we give a few examples to show how this problem is crucial in learning many different latent variable models. In Section 7.4 we present the tensor power method. This is followed by discussions in Section 7.5, and the proof of the main theorem appears in Section 7.6.

7.2

Whitening: Getting Orthogonality from General Vectors

We now demonstrate how orthogonality can be enforced using the whitening operation.For concreteness, we take the following specific form, which appears in the exchangeable single topic model (Theorem 7.2): M2 = M3 =

k X i=1 k X i=1

wi µi ⊗ µi , wi µi ⊗ µi ⊗ µi .

126

(The more general case allows the weights wi in M2 to differ in M3 , but for simplicity we keep them the same in the following discussion.) We now show how to reduce these forms to an orthogonally decomposable tensor from which the wi and µi can be recovered. Throughout, we assume the following non-degeneracy condition. Condition 7.2.1 (Non-degeneracy). The vectors µ1 , µ2 , . . . , µk ∈ Rd are linearly independent, and the scalars w1 , w2 , . . . , wk > 0 are strictly positive. Observe that Condition 7.2.1 implies that M2 0 is positive semidefinite and has rank-k. This is a mild condition; furthermore, when this condition is not met, learning is conjectured to be hard for both computational [130] and information-theoretic reasons [128].

7.2.1

The reduction

First, let W ∈ Rd×k be a linear transformation such that M2 (W, W ) = W > M2 W = I where I is the k × k identity matrix (i.e., W whitens M2 ). Since M2 0, we may for concreteness take W := U D−1/2 , where U ∈ Rd×k is the matrix of orthonormal eigenvectors of M2 , and D ∈ Rk×k is the diagonal matrix of positive eigenvalues of M2 . Let µ ˜ i :=

√

wi W > µi .

Observe that M2 (W, W ) =

k X

k X √ √ > µ ˜iµ ˜ >i = I, W ( wi µi )( wi µi ) W = >

i=1

i=1

so the µ ˜ i ∈ Rk are orthonormal vectors. f3 := M3 (W, W, W ) ∈ Rk×k×k , so that Now define M f3 = M

k X

wi (W > µi )⊗3 =

i=1

k X 1 ˜ ⊗3 √ µ i . w i i=1

When applying this reduction to learning latent variable models, it is important to bound the error of the orthogonal tensor. Loose bounds follow from McDiarmid’s inequality (see Appendix B). Tighter bounds can be obtained by matrix bernstein inequalities, see Chapter 8 for an example.

7.3

Tensor structure in latent variable models

In this section, we give several examples of latent variable models whose low-order moments can be written as symmetric tensors of low symmetric rank. This form is demonstrated 127

in Theorem 7.2 for the first example. The general pattern will emerge from subsequent examples.

7.3.1

Exchangeable single topic models

We first consider a simple bag-of-words model for documents in which the words in the document are assumed to be exchangeable. Recall that a collection of random variables x1 , x2 , . . . , x` are exchangeable if their joint probability distribution is invariant to permutation of the indices. The well-known De Finetti’s theorem [15] implies that such exchangeable models can be viewed as mixture models in which there is a latent variable h such that x1 , x2 , . . . , x` are conditionally i.i.d. given h (see Figure 7.1(a) for the corresponding graphical model) and the conditional distributions are identical at all the nodes. In our simplified topic model for documents, the latent variable h is interpreted as the (sole) topic of a given document, and it is assumed to take only a finite number of distinct values. Let k be the number of distinct topics in the corpus, d be the number of distinct words in the vocabulary, and ` ≥ 3 be the number of words in each document. The generative process for a document is as follows: the document’s topic is drawn according to the discrete distribution specified by the probability vector w := (w1 , w2 , . . . , wk ) ∈ ∆k−1 . This is modeled as a discrete random variable h such that Pr[h = j] = wj ,

j ∈ [k].

Given the topic h, the document’s ` words are drawn independently according to the discrete distribution specified by the probability vector µh ∈ ∆d−1 . It will be convenient to represent the ` words in the document by d-dimensional random vectors x1 , x2 , . . . , x` ∈ Rd . Specifically, we set xt = ei

if and only if the t-th word in the document is i,

t ∈ [`],

where e1 , e2 , . . . ed is the standard coordinate basis for Rd . One advantage of this encoding of words is that the (cross) moments of these random vectors correspond to joint probabilities over words. For instance, observe that X E[x1 ⊗ x2 ] = Pr[x1 = ei , x2 = ej ] ei ⊗ ej 1≤i,j≤d

=

X 1≤i,j≤d

Pr[1st word = i, 2nd word = j] ei ⊗ ej ,

so the (i, j)-the entry of the matrix E[x1 ⊗ x2 ] is Pr[1st word = i, 2nd word = j]. More generally, the (i1 , i2 , . . . , i` )-th entry in the tensor E[x1 ⊗ x2 ⊗ · · · ⊗ x` ] is Pr[1st word = i1 , 2nd word = i2 , . . . , `-th word = i` ]. This means that estimating cross moments, say, of x1 ⊗ x2 ⊗ x3 , is the same as estimating joint probabilities of the first three words over all documents. (Recall that we assume that each document has at least three words.) 128

The second advantage of the vector encoding of words is that the conditional expectation of xt given h = j is simply µj , the vector of word probabilities for topic j: E[xt |h = j] =

d X

Pr[t-th word = i|h = j] ei =

i=1

d X

[µj ]i ei = µj ,

i=1

j ∈ [k]

(where [µj ]i is the i-th entry in the vector µj ). Because the words are conditionally independent given the topic, we can use this same property with conditional cross moments, say, of x1 and x2 : E[x1 ⊗ x2 |h = j] = E[x1 |h = j] ⊗ E[x2 |h = j] = µj ⊗ µj ,

j ∈ [k].

This and similar calculations lead one to the following theorem. Theorem 7.2 ([8]). If M2 := E[x1 ⊗ x2 ] M3 := E[x1 ⊗ x2 ⊗ x3 ], then M2 = M3 =

k X i=1 k X i=1

wi µi ⊗ µi wi µi ⊗ µi ⊗ µi .

As we will see in Section 7.2, the structure of M2 and M3 revealed in Theorem 7.2 implies that the topic vectors µ1 , µ2 , . . . , µk can be estimated by computing a certain symmetric tensor decomposition. Moreover, due to exchangeability, all triples (resp., pairs) of words in a document—and not just the first three (resp., two) words—can be used in forming M3 (resp., M2 ).

7.3.2

Beyond raw moments

In the single topic model above, the raw (cross) moments of the observed words directly yield the desired symmetric tensor structure. In some other models, the raw moments do not explicitly have this form. Here, we show that the desired tensor structure can be found through various manipulations of different moments. Independent component analysis (ICA) The standard model for ICA [37, 42, 44, 91], in which independent signals are linearly mixed and corrupted with Gaussian noise before being observed, is specified as follows. Let h ∈ Rk be a latent random vector with independent coordinates, A ∈ Rd×k the mixing matrix, and 129

z be a multivariate Gaussian random vector. The random vectors h and z are assumed to be independent. The observed random vector is x := Ah + z. Let µi denote the i-th column of the mixing matrix A. Theorem 7.3 ([44]). Define M4 := E[x ⊗ x ⊗ x ⊗ x] − T where T is the fourth-order tensor with [T ]i1 ,i2 ,i3 ,i4 := E[xi1 xi2 ]E[xi3 xi4 ]+E[xi1 xi3 ]E[xi2 xi4 ]+E[xi1 xi4 ]E[xi2 xi3 ],

1 ≤ i1 , i2 , i3 , i4 ≤ k

( i.e., T is the fourth derivative tensor of the function v 7→ 8−1 E[(v > x)2 ]2 . Let κi := E[h4i ] − 3 for each i ∈ [k]. Then M4 =

k X i=1

κi µi ⊗ µi ⊗ µi ⊗ µi .

See [87] for a proof of this theorem in this form. Note that κi corresponds to the excess kurtosis, a measure of non-Gaussianity as κi = 0 if hi is a standard normal random variable. Furthermore, note that A is not identifiable if h is a multivariate Gaussian. We may derive forms similar to that of M2 and M3 from Theorem 7.2 using M4 by observing that M4 (I, I, u, u) = M4 (I, I, I, v) =

k X i=1 k X i=1

κi (µ>i u)(µ>i u) µi ⊗ µi , κi (µ>i v) µi ⊗ µi ⊗ µi

for any vectors u, v ∈ Rd .

7.3.3

Multi-view models

Multi-view models (also sometimes called na¨ıve Bayes models) are a special class of Bayesian networks in which observed variables x1 , x2 , . . . , x` are conditionally independent given a latent variable h. This is similar to the exchangeable single topic model, but here we do not require the conditional distributions of the xt , t ∈ [`] to be identical. Techniques developed for this class can be used to handle a number of widely used models including hidden Markov models (HMMs) [8, 130], phylogenetic tree models [38, 130], certain tree mixtures [7], and certain probabilistic grammar models [86]. 130

h x1

x2

···

x`

(a) Multi-view models

h1

h2

x1

x2

···

h` x`

(b) Hidden Markov model

Figure 7.1: Examples of latent variable models. As before, we let h ∈ [k] be a discrete random variable with Pr[h = j] = wj for all j ∈ [k]. Now consider random vectors x1 ∈ Rd1 , x2 ∈ Rd2 , and x3 ∈ Rd3 which are conditionally independent given h, and E[xt |h = j] = µt,j ,

j ∈ [k], t ∈ {1, 2, 3}

where the µt,j ∈ Rdt are the conditional means of the xt given h = j. Thus, we allow the observations x1 , x2 , . . . , x` to be random vectors, parameterized only by their conditional means. Importantly, these conditional distributions may be discrete, continuous, or even a mix of both. We first note the form for the raw (cross) moments. Proposition 7.4. We have that: E[xt ⊗ xt0 ] = E[x1 ⊗ x2 ⊗ x3 ] =

k X i=1 k X i=1

wi µt,i ⊗ µt0 ,i ,

{t, t0 } ⊂ {1, 2, 3}, t 6= t0

wi µ1,i ⊗ µ2,i ⊗ µ3,i .

The cross moments do not possess a symmetric tensor form when the conditional distributions are different. Nevertheless, the moments can be “symmetrized” via a simple linear transformation of x1 and x2 (roughly speaking, this relates x1 and x2 to x3 ); this leads to an expression from which the conditional means of x3 (i.e., µ3,1 , µ3,2 , . . . , µ3,k ) can be recovered. For simplicity, we assume d1 = d2 = d3 = k; the general case (with dt ≥ k) is easily handled using low-rank singular value decompositions. Theorem 7.5 ([9]). Assume that the vectors {µv,1 , µv,2 , . . . , µv,k } are linearly independent for each v ∈ {1, 2, 3}. Define x ˜1 x ˜2 M2 M3

:= := := :=

E[x3 ⊗ x2 ]E[x1 ⊗ x2 ]−1 x1 E[x3 ⊗ x1 ]E[x2 ⊗ x1 ]−1 x2 E[˜ x1 ⊗ x ˜2 ] E[˜ x1 ⊗ x ˜2 ⊗ x3 ]. 131

Then M2 =

k X i=1

M3 =

k X i=1

wi µ3,i ⊗ µ3,i wi µ3,i ⊗ µ3,i ⊗ µ3,i .

We now discuss three examples (taken mostly from [8]) where the above observations can be applied. The first two concern mixtures of product distributions, and the last one is the time-homogeneous hidden Markov model. Hidden Markov models Our last example is the time-homogeneous HMM for sequences of vector-valued observations x1 , x2 , . . . ∈ Rd . Consider a Markov chain of discrete hidden states y1 → y2 → y3 → · · · over k possible states [k]; given a state yt at time t, the observation xt at time t (a random vector taking values in Rd ) is independent of all other observations and hidden states. See Figure 7.1(b). Let π ∈ ∆k−1 be the initial state distribution (i.e., the distribution of y1 ), and T ∈ Rk×k be the stochastic transition matrix for the hidden state Markov chain: for all times t, i, j ∈ [k].

Pr[yt+1 = i|yt = j] = Ti,j ,

Finally, let M ∈ Rd×k be the matrix whose j-th column is the conditional expectation of xt given yt = j: for all times t, E[xt |yt = j] = M ej ,

j ∈ [k].

Proposition 7.6 ([8]). Define h := y2 , where y2 is the second hidden state in the Markov chain. Then • x1 , x2 , x3 are conditionally independent given h; • the distribution of h is given by the vector w := T π ∈ ∆k−1 ; • for all j ∈ [k], E[x1 |h = j] = M diag(π)T > diag(w)−1 ej E[x2 |h = j] = M ej E[x3 |h = j] = M T ej . Note the matrix of conditional means of xt has full column rank, for each t ∈ {1, 2, 3}, provided that: (i) M has full column rank, (ii) T is invertible, and (iii) π and T π have positive entries. 132

7.4

Tensor power method

In this section, we consider the tensor power method of [105, Remark 3] for orthogonal tensor decomposition. We first state a simple convergence analysis for an orthogonally decomposable tensor T . When only an approximation Tˆ to an orthogonally decomposable tensor T is available (e.g., when empirical moments are used to estimate population moments), an orthogonal decomposition need not exist for this perturbed tensor (unlike for the case of matrices), and a more robust approach is required to extract the approximate decomposition. Here, we propose such a variant in Algorithm 7.1 and provide a detailed perturbation analysis. We note that alternative approaches such as simultaneous diagonalization can also be employed.

7.4.1

Convergence analysis for orthogonally decomposable tensors

The following lemma establishes the quadratic convergence of the tensor power method (i.e., repeated iteration of (7.2)) for extracting a single component of the orthogonal decomposition. Note that the initial vector θ0 determines which robust eigenvector will be the convergent point. Computation of subsequent eigenvectors can be computed with deflation, i.e., by subtracting appropriate terms from T . N3 k Lemma 7.7. Let T ∈ R have an orthogonal decomposition as given in (7.1). For a vector θ0 ∈ Rk , suppose that the set of numbers |λ1 v1> θ0 |, |λ2 v2> θ0 |, . . . , |λk vk> θ0 | has a unique largest element. Without loss of generality, say |λ1 v1> θ0 | is this largest value and |λ2 v2> θ0 | is the second largest value. For t = 1, 2, . . . , let θt := Then

T (I, θt−1 , θt−1 ) . kT (I, θt−1 , θt−1 )k

(7.2)

t+1 k X λ2 v2> θ0 2 −2 2 . kv1 − θt k ≤ 2λ1 λi · > λ v θ 1 0 1 i=2 2

That is, repeated iteration of (7.2) starting from θ0 converges to v1 at a quadratic rate. To obtain all eigenvectors, simply proceed iteratively using deflation, executing P we may ⊗3 the power method on T − j λj vj after having obtained robust eigenvector / eigenvalue pairs {(vj , λj )}. Proof. Let θ0 , θ 1 , θ 2 , . . . be the sequence given by θ0 := θ0 and θt := T (I, θt−1 , θt−1 ) for t ≥ 1. Let ci := vi> θ0 for all i ∈ [k]. It is easy to check that (i) θt = θt /kθt k, P P P t t t t and (ii) θt = ki=1 λ2i −1 c2i vi . (Indeed, θ t+1 = ki=1 λi (vi> θt )2 vi = ki=1 λi (λ2i −1 c2i )2 vi = Pk 2t+1 −1 2t+1 ci vi .) Then i=1 λi 2t+1 Pk t+1 t+1 k 2t+1 −2 2t+1 X ci (v1> θt )2 λ21 −2 c21 −2 λ2 c2 2 i=2 λi 1 − (v1 θt ) = 1 − = 1 − ≤ ≤ λ λ · P P t+1 t+1 1 i t+1 t+1 λ 1 c1 . k k 2 −2 2 2 −2 2 kθt k2 ci ci i=1 λi i=1 λi i=2 >

2

133

Since λ1 > 0, we have v1> θt > 0 and hence kv1 − θt k2 = 2(1 − v1> θt ) ≤ 2(1 − (v1> θt )2 ) as required. Pk Pk 2 In this proof, the key observation is if u = i=1 ci vi , then T (I, u, u = i=1 λi ci vi , which is the same as the result of left multiplying u by a matrix M whose eigenvectors are vi ’s and eigenvalues are λi ci ’s. Therefore we call |λi ci | the effective eigenvalue. P P Definition 7.8 (Effective Eigenvalue). Given T = ki=1 λi vi and vector u = ki=1 ci vi , the effective eigenvalue with respect to vi is |λi ci |.

7.4.2

Perturbation analysis of a robust tensor power method

Now we consider the case where we have an approximation Tˆ to an orthogonally decomposable tensor T . Here, a more robust approach is required to extract an approximate decomposition. We propose such an algorithm in Algorithm 7.1, and provide a detailed perturbation analysis. Algorithm 7.1. Robust tensor power method input symmetric tensor T˜ ∈ Rk×k×k , number of iterations L, N . output the estimated eigenvector/eigenvalue pair; the deflated tensor. 1: for τ = 1 to L do (τ ) 2: Draw θ0 uniformly at random from the unit sphere in Rk . 3: for t = 1 to N do 4: Compute power iteration update (τ ) θt

(τ ) (τ ) T˜(I, θt−1 , θt−1 ) := (τ ) (τ ) kT˜(I, θ , θ )k t−1

(7.3)

t−1

end for end for (τ ) (τ ) (τ ) Let τ ∗ := arg maxτ ∈[L] {T˜(θN , θN , θN )}. (τ ∗ ) ˆ := T˜(θ, ˆ and set λ ˆ θ, ˆ θ). ˆ 8: Do N power iteration updates starting from θN to obtain θ, ˆ the deflated tensor T˜ − λ ˆ θˆ⊗3 . ˆ λ); 9: return the estimated eigenvector/eigenvalue pair (θ,

5: 6: 7:

Assume that the symmetric tensor T ∈ Rk×k×k is orthogonally decomposable, and that Tˆ = T + E, where the perturbation E ∈ Rk×k×k is a symmetric tensor with small spectral norm. In our latent variable model applications, Tˆ is the tensor formed by using empirical moments, while T is the orthogonally decomposable tensor derived from the population moments for the given model. The following theorem is similar to Wedin’s perturbation theorem for singular vectors of matrices [160] in that it bounds the error of the (approximate) decomposition returned by Algorithm 7.1 on input Tˆ in terms of the size of the perturbation, provided that the perturbation is small enough. 134

Algorithm 7.2. Robust tensor power method: Main Loop input symmetric tensor T˜ ∈ Rk×k×k , number of iterations L, N . output the estimated eigenvector/eigenvalue pairs for i = 1 to k do Call Algorithm 7.1, get an eigenvector/eigenvalue pair, and replace T˜ with the deflated tensor end for return the estimated eigenvector/eigenvalue pairs Theorem 7.9. LetP Tˆ = T + E ∈ Rk×k×k , where T is a symmetric tensor with orthogonal decomposition T = ki=1 λi vi⊗3 where each λi > 0, {v1 , v2 , . . . , vk } is an orthonormal basis, and E has operator norm := kEk. Define λmin := min{λi : i ∈ [k]}, and λmax := max{λi : i ∈ [k]}. For any δ > 0, there exists universal constants C1 , C2 , C3 > 0 such that the following holds. Pick any η ∈ (0, 1), and suppose λ λmin max , N ≥ C2 · log(k) + log log ≤ C1 · , k ˆ 1 ), (ˆ ˆ 2 ), and L = k 1+δ log(k/η). If Algorithm 7.2 returned eigenvector/eigenvalue pairs (ˆ v1 , λ v2 , λ ˆ k ), then with probability at least 1 − η, there exists a permutation π on [k] such . . . , (ˆ vk , λ that ˆ j | ≤ 5, ∀j ∈ [k], kvπ(j) − v ˆj k ≤ 8/λπ(j) , |λπ(j) − λ and

k X

⊗3 ˆj v

T − λ ˆj

≤ 55.

j=1

The proof of Theorem 7.9 is given in Section 7.6. One important difference from Wedin’s theorem is that this is an algorithm dependent perturbation analysis, specific to Algorithm 7.1 (since the perturbed tensor need not have an orthogonal decomposition). Furthermore, note that Algorithm 7.1 uses multiple restarts to ensure (approximate) convergence—the intuition is that by restarting at multiple points, we eventually start at a point in which the initial contraction towards some eigenvector dominates the error E in our tensor. The proof shows that we find such a point with high probability within L = poly(k) trials. It should be noted that for large k, the required bound on L is very close to linear in k. In general, it is possible, when run on a general symmetric tensor (e.g., Tˆ), for the tensor power method to exhibit oscillatory behavior [98, Example 1]. This is not in conflict with Theorem 7.9, which effectively bounds the amplitude of these oscillations; in particular, if Tˆ = T + E is a tensor built from empirical moments, the error term E (and thus the amplitude of the oscillations) can be driven down by drawing more samples. The practical value of addressing these oscillations and perhaps stabilizing the algorithm is an interesting direction for future research [100]. 135

A final consideration is that for specific applications, it may be possible to use domain knowledge to choose better initialization points. For instance, in the topic modeling applications (cf. Section 7.3.1), the eigenvectors are related to the topic word distributions, and many documents may be primarily composed of words from just single topic. Therefore, good initialization points can be derived from these single-topic documents themselves, as these points would already be close to one of the eigenvectors.

7.5 7.5.1

Discussion Computational complexity

It is interesting to consider the computational complexity of the tensor power method in the dense setting where T ∈ Rk×k×k is orthogonally decomposable but otherwise unstructured. Each iteration requires O(k 3 ) operations, and assuming at most k 1+δ random restarts for extracting each eigenvector (for some small δ > 0) and O(log(k) + log log(1/)) iterations per restart, the total running time is O(k 5+δ (log(k) + log log(1/))) to extract all k eigenvectors and eigenvalues. An alternative approach to extracting the orthogonal decomposition of T is to reorganize 2 T into aPmatrix M ∈ Rk×k byPflattening two of the dimensions into one. In this case, k k ⊗3 if T = i=1 λi vi , then M = i=1 λi vi ⊗ vec(vi ⊗ vi ). This reveals the singular value decomposition of M (assuming the eigenvalues λ1 , λ2 , . . . , λk are distinct), and therefore can be computed with O(k 4 ) operations. Therefore it seems that the tensor power method is less efficient than a pure matrix-based approach via singular value decomposition. However, it should be noted that this matrix-based approach fails to recover the decomposition when eigenvalues are repeated, and it is unstable when the gap between eigenvalues is small. It is worth noting that the running times differ by roughly a factor of Θ(k 1+δ ), which can be accounted for by the random restarts. This gap can potentially be alleviated or removed by using a more clever method for initialization. Moreover, using special structure in the problem (as discussed above) can also improve the running time of the tensor power method.

7.5.2

Sample complexity bounds

Previous work on using linear algebraic methods for estimating latent variable models crucially rely on matrix perturbation analysis for deriving sample complexity bounds [8, 9, 87, 88, 130]. The learning algorithms in these works are plug-in estimators that use empirical moments in place of the population moments, and then follow algebraic manipulations that result in the desired parameter estimates. As long as these manipulations can tolerate small perturbations of the population moments, a sample complexity bound can be obtained by exploiting the convergence of the empirical moments to the population moments via the law of large numbers. Using the perturbation analysis for the tensor power method, improved sample complexity bounds can be obtained for all of the examples discussed in Section 7.3. The underlying analysis remains the same as in previous works (e.g., [9, 87]), the main difference being 136

the accuracy of the orthogonal tensor decomposition obtained via the tensor power method. Relative to the previously cited works, the sample complexity bound will be considerably improved in its dependence on the rank parameter k, as Theorem 7.9 implies that the tensor f3 from Section 7.2) is not amplified by any factor estimation error (e.g., error in estimating M explicitly depending on k (there is a requirement that the error be smaller than some factor depending on k, but this only contributes to a lower-order term in the sample complexity bound).

7.6

Analysis of robust power method

In this section, we prove Theorem 7.9. The proof is divided into several steps. First in Section 7.6.1 we define what are “good” initializazers for the tensor power iterations, and show that a random vector is good with significant probability. Then in Section 7.6.2 we show given a good initializer, the tensor power method will quickly reach the neighborhood of the top eigenvector. Finally in Section 7.6.3, we show that the error in deflation step does not accrue very badly. These three steps is assembled in Section 7.6.4 to complete the proof. The vectors during the tensor power method will be called θ0 , θ1 , ..., and we express each vector in the basis of {v1 , v2 , . . . , vk }: θt =

k X

θi,t vi .

i=1

.

Recall the update rule used in the power method,

θt+1 =

k X

θi,t+1 vi := T˜(I, θt , θt )/kT˜(I, θt , θt )k.

i=1

In this subsection, we assume that T˜ has the form T˜ =

k X

˜ i v ⊗3 + E˜ λ i

i=1

where {v1 , v2 , . . . , vk } is an orthonormal basis, and, without loss of generality, ˜ 1 |θ1,t | = max λ ˜ i |θi,t | > 0. λ i∈[k]

Also, define ˜ min := min{λ ˜ i : i ∈ [k], λ ˜ i > 0}, λ 137

˜ max := max{λ ˜ i : i ∈ [k]}. λ

(7.4)

We do not use the original eigenvalues λi ’s here, because throughout the deflation process, some eigenvalues will be deflated and becomes 0. We further assume the error E˜ is a symmetric tensor such that, for some constant p > 1, ˜ u, u)k ≤ ˜, ∀u ∈ S k−1 ; kE(I, ˜ 1 )2 . ˜ u, u)k ≤ ˜/p, ∀u ∈ S k−1 s.t. (u> v1 )2 ≥ 1 − (3˜/λ kE(I,

(7.5) (7.6)

All these settings may seem very unnatural at this point, but they are important for the later deflation analysis. In particular, after some deflation steps, the error E˜ will be the original error E plus the additional error introduced in the deflation process. Later we will show that the spectral norm of the additional error is bounded by ˜ which is a constant factor more than . However that alone is not enough because if the error grows by a constant factor at every deflation step, it will be huge at the end of iteration. The additional insight is that the error introduced by deflation is not uniform along all directions, it is only large when the current vector has large projection on the previously found eigenvectors, and will be small when the current vector is close to a new eigenvector (which itself is orthogonal to all the previously found eigenvectors). That is why constant p is introduced here.

7.6.1

Initialization

For γ ∈ (0, 1), we say a unit vector θ0 ∈ Rk is γ-separated relative to i∗ ∈ [k] if |θi∗ ,0 | − max∗ |θi,0 | ≥ γ|θi∗ ,0 |. i∈[k]\{i }

Intuitively, a γ-separated vector is a good initializer for the tensor power method, because it has significantly larger correlation with one of the vectors, and we can hope that this initial advantage will be amplified and in the end the process converges to the eigenvector. The following lemma shows that for any constant γ, with probability at least 1 − η, at least one out of poly(k) log(1/η) i.i.d. random vectors (uniformly distributed over the unit ˜ i . (For small enough γ and large enough sphere S k−1 ) is γ-separated relative to arg maxi∈[k] λ k, the polynomial is close to linear in k.) Lemma 7.10. For any constant γ, there exists a δ such that with probability at least 1−η over k 1+δ · log2 (1/η) i.i.d. uniform random unit vectors, at least one of the vectors is γ-separated ˜i. relative to arg maxi∈[k] λ ˜ i = 1. Suppose we take L = k 1+δ Proof. Without loss of generality, assume arg maxi∈[k] λ k×L vectors. Consider a random matrix Z ∈ R whose entries are independent N (0, 1) random variables; we take the j-th column of Z to be comprised of the random variables used for the j-th random vector (before normalization). Specifically, for the j-th random vector, Zi,j θi,0 := qP k

i0 =1

, Zi20 ,j

138

i ∈ [n].

It suffices to show that with probability at least 1/2, there is a column j ∗ ∈ [L] such that |Z1,j ∗ | ≥

1 max |Zi,j ∗ |. 1 − γ i∈[k]\{1}

Since maxj∈[L] |Z1,j | is a 1-Lipschitz function of L independent N (0, 1) random variables, it follows that h i p Pr max |Z1,j | − median max |Z1,j | > 2 ln(8) ≤ 1/4. j∈[L]

j∈[L]

Moreover, h i h i median max |Z1,j | ≥ median max Z1,j =: m. j∈[L]

j∈[L]

Observe that the cumulative distribution function of maxj∈[L] Z1,j is given by F (z) = Φ(z)L , where Φ is the standard Gaussian CDF. Since F (m) = 1/2, it follows that m = Φ−1 (2−1/L ). It can be checked that Φ−1 (2−1/L ) ≥

p ln(ln(L)) + c 2 ln(L) − p 2 2 ln(L)

for some absolute constant c > 0. Also, let j ∗ := arg maxj∈[L] |Z1,j |. Now for each j ∈ [L], let |Z2:k,j | := max{|Z2,j |, |Z3,j |, . . . , |Zk,j |}. Again, since |Z2:k,j | is a 1-Lipschitz function of k − 1 independent N (0, 1) random variables, it follows that h i p Pr |Z2:k,j | > E |Z2:k,j | + 2 ln(4) ≤ 1/4. Moreover, by a standard argument, h i p E |Z2:k,j | ≤ 2 ln(k). Since |Z2:k,j | is independent of |Z1,j | for all j ∈ [L], it follows that the previous two displayed inequalities also hold with j replaced by j ∗ . Therefore we conclude with a union bound that with probability at least 1/2, |Z1,j ∗ | ≥

p p p ln(ln(L)) + c p 2 ln(L) − p − 2 ln(8) and |Z2:k,j ∗ | ≤ 2 ln(k) + 2 ln(4). 2 2 ln(L)

When this even happens, and when δ is large enough, the j ∗ -th random vector is γseparated. The probability of success can easily be boosted from 1/2 to 1 − η by taking log(1/η)L samples, as all the samples are independent.

139

7.6.2

Tensor power iterations

In this section we show a good initializer will converge to the neighborhood of the top eigenvector if the error on the tensor is small enough. In the next two propositions (Propositions 7.11 and 7.12) and next two lemmas (Lemmas 7.13 and 7.14), we analyze the power method iterations using T˜ at some arbitrary iterate ˜ But throughout, the quantity ˜ can be replaced by θt using only the property (7.5) of E. > 2 ˜ ˜/p if θt satisfies (θt v1 ) ≥ 1 − (3˜/λ1 )2 as per property (7.6). We first define several quantities to measure the progress and simplify calculations. Rτ :=

2 θ1,τ 2 1 − θ1,τ

γτ := 1 −

˜ 1 θ1,τ λ , ˜ i |θi,τ | λ ˜ δτ := ˜ 1 θ2 λ 1,τ

1/2 , 1

mini6=1 |ri,τ |

ri,τ := ,

(7.7)

for τ ∈ {t, t + 1}. Here R is used to measure progress (in the end we hope R would be very large). The quantity γ is similar to the “eigengap” for matrices (recall the definition of effective eigenvalue in Definition 7.8. Initially γ should be a small constant for good initial vectors, and will remain at least a constant. The quantity δ should be considered as the relative size of the error (at the beginning it is a small constant and decreases as the algorithm proceeds). The two propositions below are really just calculations. Proposition 7.11. min |ri,t | ≥ Rt , i6=1

1 γt ≥ 1 − , Rt

2 θ1,t

Rt2 = . 1 + Rt2

Proposition 7.12. 2 ri,t+1 ≥ ri,t ·

1 − δt 1 − δt = , 1 2 1 + δt ri,t + δ 2 t r

Rt+1 ≥ Rt ·

1 − δt 1 − δt ≥ 1 . 1 − γt + δt Rt + δt R2

i,t

i ∈ [k],

(7.8) (7.9)

t

Proof. Let θˇt+1 := T˜(I, θt , θt ), so θt+1 = θˇt+1 /kθˇt+1 k. Since θˇi,t+1 = T˜(vi , θt , θt ) = T (vi , θt , θt ) + E(vi , θt , θt ), we have ˜ i θ2 + E(vi , θt , θt ), θˇi,t+1 = λ i,t

i ∈ [k].

Using the triangle inequality and the fact kE(vi , θt , θt )k ≤ ˜, we have 2 ˇ ˜ ˜ θi,t+1 ≥ λi θi,t − ˜ ≥ |θi,t | · λi |θi,t | − ˜/|θi,t | 140

(7.10)

and

˜ i θ2 | + ˜ ≤ |θi,t | · λ ˜ i |θi,t | + ˜/|θi,t | |θˇi,t+1 | ≤ |λ i,t

(7.11)

for all i ∈ [k]. Combining (7.10) and (7.11) gives ri,t+1 =

˜ 1 θˇ1,t+1 ˜ 1 θ1,t+1 λ 1 − δt λ 1 − δt 1 − δt 2 2 2 · = ≥ ri,t · ≥ ri,t . = ri,t · ˜ 2 2 ˜ i |θi,t+1 | ˜ i |θˇi,t+1 | ˜ i /λ ˜ 1 )δt r 1 + δt ri,t 1 + λ˜ θ2 λ λ 1 + ( λ i,t i i,t

Moreover, by the triangle inequality and Hölder’s inequality, X n

[θˇi,t+1 ]

2

1/2 =

X n

˜ i θ2 λ i,t

2 1/2 + E(vi , θt , θt )

i=2

i=2

≤

X n

˜ 2 θ4 λ i i,t

1/2 +

X n

i=2

2

1/2

E(vi , θt , θt )

i=2

X 1/2 n 2 ˜ + ˜ θi,t ≤ max λi |θi,t | i6=1

= (1 −

i=2

2 1/2 θ1,t )

˜ i |θi,t | + ˜/(1 − θ2 )1/2 . · max λ 1,t i6=1

(7.12)

Combining (7.10) and (7.12) gives ˜ 1 |θ1,t | − ˜/|θ1,t | |θ1,t | |θ1,t+1 | λ |θˇ1,t+1 | ≥ . = · 1/2 P 2 2 1/2 ˜ i |θi,t | + ˜/(1 − θ2 )1/2 (1 − θ1,t+1 )1/2 (1 − θ1,t ) n max λ i6 = 1 2 ˇ 1,t i=2 [θi,t+1 ] In terms of Rt+1 , Rt , γt , and δt , this reads Rt+1 ≥

(1 − γt )

1 − δt 1−θ2 1/2 1,t

2 θ1,t

+ δt

= Rt ·

1 − δt 1 − δt 1 − δt = 1−γt ≥ 1 1 − γt + δt Rt + δt + δt R2 Rt t

where the last inequality follows from Proposition 7.11. The following Lemma shows how the vector changes during tensor power iterations. There are several phases, in the first phase Theorem 7.13 shows we make progress as long as Rt ≤ 9. Lemma 7.13 (First phase). Assume 0 ≤ δt < 1/18, and γt > 18δt . 2 1. If ri,t ≤ 8, then ri,t+1 ≥ |ri,t | 1 + γ2t . 2 2 2. If 4 < ri,t , then ri,t+1 ≥ min{ri,t /2,

1/2−δt }. δt

3. “Spectral gap” always increases until it reaches 1/2: γt+1 ≥ min{γt , 1/2}. 141

4. Make progress whenever Rt is small: If Rt ≤ 9, then Rt+1 ≥ Rt 1 + and δt+1 ≤ δt .

γt 3

2 2 , θ1,t+1 ≥ θ1,t

2 . Proof. Consider two (overlapping) cases depending on ri,t 2 ≤ 2ρ2 . By (7.8) from Proposition 7.12, • Case 1: ri,t 2 ri,t+1 ≥ ri,t ·

1 − δt 1 1 − δt γt ≥ |r | · · ≥ |r | 1 + i,t i,t 2 1 + κδt ri,t 1 − γt 1 + 2κρ2 δt 2

where the last inequality uses the assumption γt > 2(1 + 2κρ2 )δt . This proves the first claim. 2 2 • Case 2: ρ2 < ri,t . We split into two sub-cases. Suppose ri,t ≤ (ρ(1 − δt ) − 1)/(κδt ). Then, by (7.8),

2 ri,t+1 ≥ ri,t ·

2 ri,t 1 − δt 1 − δt 2 = ≥ r · . i,t 2 t )−1 1 + κδt ri,t ρ 1 + κδt ρ(1−δ κδt

2 Now suppose instead ri,t > (ρ(1 − δt ) − 1)/(κδt ). Then

ri,t+1 ≥

1 − δt − 1/ρ 1 − δt = . κδt + κδt

(7.13)

κδt ρ(1−δt )−1

2 Observe that if mini6=1 ri,t ≤ (ρ(1 − δt ) − 1)/(κδt ), then ri,t+1 ≥ |ri,t | for all i ∈ [k], and hence t γt+1 ≥ γt . Otherwise we have γt+1 > 1 − 1−δκδ > 1 − 1/ρ. This proves the third claim. t −1/ρ 2 Finally, for the last claim, if Rt ≤ 1 + 2κρ , then by (7.9) from Proposition 7.12 and the assumption γt > 2(1 + 2κρ2 )δt ,

Rt+1

γt 1 − 2(1+2κρ 2) 1 − δt κρ2 γt ≥ Rt 1 + γt · 1 + . ≥ Rt · ≥ Rt · ≥ R t 1 − γt + δt Rt 1 − γt /2 1 + 2κρ2 3

2 2 via Proposition 7.11, and thus δt+1 ≤ δt . ≥ θ1,t This in turn implies that θ1,t+1

Next Lemma deals with the second phase, when R is larger than 9. It shows Rt grows in a quadratic speed until it becomes really large, and once it is really large it stays large. Lemma 7.14 (Second Phase). When 9 ≤ Rt ≤ ˜ ˜ When Rt ≥ λ3˜1 , Rt+1 ≥ λ3˜1

˜1 λ , 3˜

˜

Rt+1 ≥ min{Rt2 /2, λ3˜1 }.

Proof. Observe that for any c > 0, Rt ≥

1 c

⇔

2 θ1,t ≥

1 1 + c2

Now consider the following cases depending on Rt . 142

⇔

δt ≤

(1 + c2 )˜ . ˜1 λ

(7.14)

• Case 1: Rt ≥

˜1 λ 3˜

= 1/α. In this case, we have δt ≤

(1 + α2 )˜ αγt ≤ ˜1 1+α λ

by (7.14) (with c = α). Combining this with (7.9) from Proposition 7.12 gives αγt 1 − 1+α 1 − δt ≥ ≥ 1−γt (1 − γt )α + + δt Rt

Rt+1

• Case 2: 9 ≤ Rt
1 = 1 2 + δt + 2+R 2 R2 R2 t

t

t

Approximate recovery of a single eigenvector We now state the main result regarding the approximate recovery of a single eigenvector using the tensor power method on T˜. Here, we exploit the special properties of the error E˜ (both (7.5) and (7.6)). Lemma 7.15 (Convergence for Top Eigenvector). There exists a universal constant C > 0 such that the following holds. ∗ ˜ If the initializer is γ0 -biased (0 < γ0 < 1) with respect to i = arg max λi , when ˜ < ˜ max · θ2∗ · γ0 /18, and N ≥ C · λ i ,0

˜

log(k/γ0 ) + log log pλ˜i∗

, after t ≥ N iterations of the

tensor power method on tensor T˜ as defined in (7.4) and satisfying (7.5) and (7.6), the final vector θt satisfies s 2 2 3˜ 4˜ ˜ ˜ ˜ i∗ | ≤ 27 θi∗ ,t ≥ 1 − , kθt − vi∗ k ≤ , |T˜(θt , θt , θt ) − λ +2 . ˜ ˜ pλi∗ p pλi∗ pλi∗

Proof. Assume without loss of generality that i∗ = 1. We consider three phases: (i) iterations before the first time t such that Rt > 9, (ii) the subsequent iterations before the first time 143

˜

t such that Rt ≥ λ3˜1 and (iii) If p > 1, the subsequent iterations before the first time t such ˜1 λ . that Rt ≥ p3˜ We begin by analyzing the first phase, i.e., the iterates in T1 := {t ≥ 0 : Rt ≤ 9}. Observe that the condition on ˜ implies δ0 =

˜ < γ0 /18, ˜ 1 θ2 λ 1,0

and hence the preconditions on δt and γt of Lemma 7.13 hold for t = 0. For all t ∈ T1 satisfying the preconditions, Lemma 7.13 implies that δt+1 ≤ δt and γt+1 ≥ min{γt , 1/2}, so the next iteration also satisfies the preconditions. Hence by induction, the preconditions hold for all iterations in T1 . Moreover, for all i ∈ [k], we have |ri,0 | ≥

1 ; 1 − γ0

2 ≤ 8, and (ii) |ri,t | stays above 2 and while t ∈ T1 : (i) |ri,t | increases at a linear rate while ri,t once it is at least 2.(The specific rates are given, respectively, in Lemma 7.13, claims 1 and 2.) 2 It follows that mini6=1 ri,t ≤ 4 for at most

√ 8 2 ln = O(log 1/γ0 ) 1 γ0 1−γ0

(7.15)

iterations in T1 . √ We know R0 ≥ 1/ k at the beginning, and it increases by a factor of (1 + γt /3) at every step, therefore there are at most an additional O(log k) steps after γ reaches 1/2. Therefore, by combining we have that the number of iterations in the first the counts, phase is at most |T1 | = O log k/γ . ˜

We now analyze the second phase, i.e., the iterates in T2 := {t ≥ 0 : t ∈ / T1 , Rt < λ3˜1 }. 0 Note that for the initial iteration t := min T2 , we have that Rt0 ≥ 1+2κρ2 = 1+8κ = 1/β, ˜ Lemma 7.14 implies that Rt+1 ≥ min{Rt , λ3˜1 }. To bound the number of iterations in T2 , observe that Rt increases at a quadratic rate until Rt ≥

˜1 λ , 3˜

˜

so |T2 | = O log log λ3˜1 .

After Rt00 ≥

˜1 λ , 3˜

we have 2 θ1,t 00

≥1−

144

3˜ ˜ λ1

2 .

Therefore, the vector θt00 satisfies the condition for property (7.6) of E˜ to hold. Now we apply Lemma 7.14 using ˜/p in place of ˜, including in the definition of δt (which we call δ t ): δ t :=

˜ ; ˜ 1 θ2 pλ 1,t

Similar as before, by Lemma 7.14, we have that Rt increases at a quadratic rate in this ˜1 ˜1 pλ λ final phase until Rt ≥ 3˜ . So the number of iterations before Rt ≥ 3˜ can be bounded as ˜

O log log pλ˜1 . Once Rt ≥

˜1 pλ , 3˜

we have

2 θ1,t

≥1−

3˜ ˜1 pλ

2 .

2 2 ) > 0 by Proposition 7.12, we · (1 − δ t−1 )/(1 + δ t−1 r1,t−1 Since sign(θ1,t ) = r1,t ≥ r1,t−1 have θ1,t > 0. Therefore we can conclude that s q q ˜ 1 ))2 ≤ 4˜/(pλ ˜ 1 ). kθt − v1 k = 2(1 − θ1,t ) ≤ 2 1 − 1 − (3˜/(pλ

Finally, k X 3 3 ˜ 1 | = λ ˜ 1 (θ − 1) + ˜ i θ + E(θ ˜ t , θt , θt ) |T˜(θt , θt , θt ) − λ λ 1,t i,t i=2

k q X 2 2 2 ˜ ˜ ˜ θt , θt )k θi,t ≤ λ1 1 − θ1,t + |θ1,t (1 − θ1,t )| + max λi 1 − θ1,t + kE(I, i6=1

i=2

˜ 1 ) + 2)˜ (27κ · (˜/pλ . p 2

≤

7.6.3

Deflation

In this section we give a bound on the additional error caused by deflation. Assuming all previous steps found vectors that are close to the true vectors, then the Lemma below claims the error introduced by deflation depends on the projection of the current vector into the space spanned by the previously found eigenvectors. The benefit of a direction-dependent bound is that once we approach the end of the power method iterations, the vector will be close to the new eigenvector, and therefore has very low projection to the previously found eigenvectors. This allows the algorithm to find more accurate vectors, in particular the guarantee will not deteriorate as the deflation process continues. Lemma 7.16. Fix some ˜ ≥ 0. Let {v1 , v2 , . . . , vk } be an orthonormal basis for Rk , and λ1 , λ2 , . . . , λk ≥ 0 with λmin := mini∈[k] λi . Also, let {ˆ v1 , v ˆ2 , . . . , v ˆk } be a set of unit vectors

145

ˆ1, λ ˆ2, . . . , λ ˆ k ≥ 0 be non-negative scalars, and define in Rk (not necessarily orthogonal), λ îv Ei := λi vi⊗3 − λ î⊗3 ,

i ∈ [k].

Pick any t ∈ [k]. If ˆ i − λi | ≤ ˜, |λ

√ kˆ vi − vi k ≤ min{ 2, 2˜/λi }

√ for all i ∈ [t], then for any unit vector u ∈ S k−1 , ˜ ≤ λmin /10000 k implies

X

2 t X

t

> 2

Ei (I, u, u) (u vi ) ˜2 .

≤ 0.01 + 100 i=1

2

i=1

Proof. For any unit vector u and i ∈ [t], the error term ˆ i (u> v Ei (I, u, u) = λi (u> vi )2 vi − λ î )2 v î lives in span{vi , v î }; this space is the same as span{vi , v î⊥ }, where v î⊥ := v î − (vi> v î )vi is the projection of v î onto the subspace orthogonal to vi . Since kˆ vi − vi k2 = 2(1 − vi> v î ), it follows that ci := vi> v î = 1 − kˆ vi − vi k2 /2 ≥ 0 √ (the inequality follows from the assumption kˆ vi −vi k ≤ 2, which in turn implies 0 ≤ ci ≤ 1). By the Pythagorean theorem and the above inequality for ci , kˆ vi⊥ k2 = 1 − c2i ≤ kˆ vi − vi k2 . Later, we will also need the following bound, which is easily derived from the above inequalities and the triangle inequality: vi − vi k2 . |1 − c3i | = |1 − ci + ci (1 − c2i )| ≤ 1 − ci + |ci (1 − c2i )| ≤ 1.5kˆ We now express Ei (I, u, u) in terms of the coordinate system defined by vi and v î⊥ , depicted below. vi

ci

vî

  

vî⊥ /kˆ vi⊥ k

 

vî⊥

subspace orthogonal to vi

span{vi , vî⊥ }⊥

146

Define ai := u> vi

î⊥ /kˆ vi⊥ k . and bi := u> v

(Note that the part of u living in span{vi , v î⊥ }⊥ is irrelevant for analyzing Ei (I, u, u).) We have ˆ i (u> v Ei (I, u, u) = λi (u> vi )2 vi − λ î )2 v î 2 2 ⊥ ˆ i ai ci + kˆ î⊥ = λi ai vi − λ vi kbi ci vi + v 2 ⊥ ⊥ ⊥ 2 2 ⊥ ˆ i a2 c2 + 2kˆ ˆ i ai ci + kˆ = λi a2i vi − λ v ka b c + kˆ v k b c v − λ v kb v î i i i i i i i i i i i i 2 ⊥ ˆ i kˆ ˆ i c3 )a2 − 2λ ˆ i kˆ ˆ i kˆ = (λi − λ vi⊥ kai bi c2i − λ vi⊥ k2 b2i ci vi − λ vi⊥ k ai ci + kˆ vi⊥ kbi v î /kˆ vi⊥ k i i {z } | | {z } =:B i =:A i ⊥ ⊥ vi k . = Ai vi − Bi v î /kˆ The overall error can also be expressed in terms of the Ai and Bi :

t

2 X

2 t X

X

t

⊥ ⊥

E (I, u, u) = A v − B (ˆ v /kˆ v k) i i i i i i

i=1

2

i=1

2

i=1

t

2

t

2

X

X

⊥ ⊥

≤ 2 Ai vi + 2 Bi (ˆ vi /kˆ vi k) ≤2

i=1 t X

i=1

A2i + 2

i=1

X t i=1

2 |Bi |

2

(7.16)

where the first inequality uses the fact (x + y)2 ≤ 2(x2 + y 2 ) and the triangle inequality, and the second inequality uses the orthonormality of the vi and the triangle inequality. It remains to bound A2i and |Bi | in terms of |ai |, λi , and ˜. The first term, A2i , can be ˆ i |, kˆ bounded using the triangle inequality and the various bounds on |λi − λ vi − vi k, kˆ vi⊥ k, and ci : ˆ i |c3 + λi |c3 − 1|)a2 + 2(λi + |λi − λ ˆ i |)kˆ ˆ i |)kˆ |Ai | ≤ (|λi − λ vi⊥ k|ai bi |c2i + (λi + |λi − λ vi⊥ k2 b2i ci i i i ˆ i | + 1.5λi kˆ ˆ i |)kˆ ˆ i |)kˆ ≤ (|λi − λ vi − vi k2 + 2(λi + |λi − λ vi − vi k)|ai | + (λi + |λi − λ vi − vi k2 ≤ (˜ + 7˜2 /λi + 4˜ + 4˜2 /λi )|ai | + 4˜2 /λi + ˜3 /λ2i = (5 + 11˜/λi )˜|ai | + 4(1 + ˜/λi )˜2 /λi ,

and therefore (via (x + y)2 ≤ 2(x2 + y 2 )) A2i ≤ 2(5 + 11˜/λi )2 ˜2 a2i + 32(1 + ˜/λi )2 ˜4 /λ2i .

147

The second term, |Bi |, is bounded similarly: ˆ i |)kˆ |Bi | ≤ 2(λi + |λi − λ vi⊥ k2 (a2i + kˆ vi⊥ k2 ) ˆ i |)kˆ vi − vi k2 ) ≤ 2(λi + |λi − λ vi − vi k2 (a2 + kˆ i

≤ 8(1 + ˜/λi )(˜2 /λi )a2i + 32(1 + ˜/λi )˜4 /λ3i .

Therefore, using the inequality from (7.16) and again (x + y)2 ≤ 2(x2 + y 2 ),

t

2 X 2 t t X

X

2

Ei (I, u, u) ≤ 2 Ai + 2 |Bi |

i=1

2

i=1

i=1

t X > 2 ≤ 0.01 + 100 (u vi ) ˜2 i=1

7.6.4

Proof of the main theorem

Finally we are ready to prove the main theorem in this Chapter. The proof is basically an induction on Theorem 7.15, plugging in the error bound in Theorem 7.16. Ideally we would find the t-th largest eigenvector at the t-th iteration of the algorithm. Theorem 7.10 will guarantee that in one of the random initial vectors we indeed find the top eigenvector. However, additional complication arises because we don’t know which random initialization worked! The solution to this problem is to pick the one vector with the largest T˜(v, v, v), because the value T˜(v, v, v) is very similar to the quadratic form v T M v for matrices. Analog to matrices suggest that having large T˜(v, v, v) value should imply closeness to a large eigenvector. We show that this intuition is correct during the induction step in the proof below. Theorem 7.9 restated. Let TˆP= T + E ∈ Rk×k×k , where T is a symmetric tensor with k ⊗3 orthogonal decomposition T = where each λi > 0, {v1 , v2 , . . . , vk } is an ori=1 λi vi thonormal basis, and E has operator norm := kEk. Define λmin := min{λi : i ∈ [k]}, and λmax := max{λi : i ∈ [k]}. For any δ > 0, there exists universal constants C1 , C2 , C3 > 0 such that the following holds. Pick any η ∈ (0, 1), and suppose λ λmin max , N ≥ C2 · log(k) + log log , ≤ C1 · k and L = k 1+δ log(k/η) Suppose that Algorithm 7.1 is iteratively called k times, where the input tensor is Tˆ in the first call, and in each subsequent call, the input tensor is the deflated tensor returned ˆ 1 ), (ˆ ˆ 2 ), . . . , (ˆ ˆ k ) be the sequence of estimated eigenvecby the previous call. Let (ˆ v1 , λ v2 , λ vk , λ tor/eigenvalue pairs returned in these k calls. With probability at least 1 − η, there exists a 148

permutation π on [k] such that ˆ j | ≤ 5, |λπ(j) − λ

kvπ(j) − v ˆj k ≤ 8/λπ(j) , and

∀j ∈ [k],

k X

⊗3 ˆj v

T − λ ˆj

≤ 55. j=1

Proof. We prove by induction that for each i ∈ [k] (corresponding to the i-th call to Algorithm 7.1), with probability at least 1 − iη/k, there exists a permutation π on [k] such that the following assertions hold. ˆ j | ≤ 12. 1. For all j ≤ i, kvπ(j) − v ˆj k ≤ 8/λπ(j) and |λπ(j) − λ 2. The error tensor X X X ⊗3 ⊗3 ⊗3 ˆj v ˆj v λ ˆj − λπ(j) vπ(j) =E+ λπ(j) vπ(j) −λ ˆj⊗3 E˜i+1 := Tˆ − j≤i

j≥i+1

j≤i

satisfies kE˜i+1 (I, u, u)k ≤ 56, ∀u ∈ S k−1 ; (7.17) k−1 > 2 kE˜i+1 (I, u, u)k ≤ 2, ∀u ∈ S s.t. ∃j ≥ i + 1, (u vπ(j) ) ≥ 1 − (168/λπ(j) )2 . (7.18) We actually take i = 0 as the base case, so we can ignore the first assertion, and just observe that for i = 0, k X E˜1 = Tˆ − λi vi⊗3 = E. j=1

We have kE˜1 k = kEk = , and therefore the second assertion holds. Now fix some i ∈ [k], and assume as the inductive hypothesis that, with probability at least 1 − (i − 1)η/k, there exists a permutation π such that two assertions above hold for i − 1 (call this Eventi−1 ). The i-th call to Algorithm 7.1 takes as input X ˆj v T˜i := Tˆ − λ ˆj⊗3 , j≤i−1

which is intended to be an approximation to X ⊗3 Ti := λπ(j) vπ(j) . j≥i

Observe that

T˜i − Ti = E˜i , 149

which satisfies the second assertion in the inductive hypothesis. We may write Ti = P k ˜ ⊗3 where λ ˜ l = λl whenever π −1 (l) ≥ i, and λ ˜ l = 0 whenever π −1 (l) ≤ i − 1. This l=1 λl vl ˜ i in preceding lemmas (in particular, Lemma 7.10 form is used when referring to T˜ or the λ and Lemma 7.15). By Lemma 7.10, with conditional probability at least 1 − η/k given Eventi−1 , at least (τ ) one of θ0 for τ ∈ [L] is γ-separated relative to π(jmax ), where jmax := arg maxj≥i λπ(j) , (for γ = 0.01; call this Event0i ; note that the application of Lemma 7.10 determines C3 ). Therefore Pr[Eventi−1 ∩ Event0i ] = Pr[Event0i |Eventi−1 ] Pr[Eventi−1 ] ≥ (1 − η/k)(1 − (i − 1)η/k) ≥ 1 − iη/k. It remains to show that Eventi−1 ∩ Event0i ⊆ Eventi ; so henceforth we condition on Eventi−1 ∩ Event0i . (τ ) Set C1 to be a universal constant that is small enough. For all τ ∈ [L] such that θ0 is √ (τ ) γ-separated relative to π(jmax ), we have (i) |θjmax ,0 | ≥ 1/ k, and (ii) that by Lemma 7.15 (using ˜/p := 2, and i∗ := π(jmax ), C = C2 ), |T˜i (θN , θN , θN ) − λπ(jmax ) | ≤ 5 (τ )

(τ )

(τ )

(notice by definition that γ ≥ 1/100 implies γ0 ≥ 1 − /(1 + γ) ≥ 1/101, thus it follows γ0 ˜ min · θ2∗ as < 2(1+8κ) ·λ from the bounds on the other quantities that ˜ = 2p ≤ 56C1 · λmin i ,0 k (τ ∗ )

necessary). Therefore θN := θN

must satisfy

(τ ) (τ ) (τ ) T˜i (θN , θN , θN ) = max T˜i (θN , θN , θN ) ≥ max λπ(j) − 5 = λπ(jmax ) − 5. j≥i

τ ∈[L]

On the other hand, by the triangle inequality, X 3 T˜i (θN , θN , θN ) ≤ λπ(j) θπ(j),N + |E˜i (θN , θN , θN )| j≥i

≤

X j≥i

2 λπ(j) |θπ(j),N |θπ(j),N + 56

≤ λπ(j ∗ ) |θπ(j ∗ ),N | + 56 where j ∗ := arg maxj≥i λπ(j) |θπ(j),N |. Therefore 4 λπ(j ∗ ) |θπ(j ∗ ),N | ≥ λπ(jmax ) − 5 − 56 ≥ λπ(jmax ) . 5 2 2 ∗ Squaring both sides and using the fact that θπ(j ∗ ),N + θπ(j),N ≤ 1 for any j 6= j ,

λπ(j ∗ ) θπ(j ∗ ),N

2

2 16 2 16 λπ(jmax ) θπ(j ∗ ),N + λπ(jmax ) θπ(j),N 25 25 2 16 2 16 ≥ λπ(j ∗ ) θπ(j ∗ ),N + λπ(j) θπ(j),N 25 25 ≥

150

which in turn implies 3 λπ(j) |θπ(j),N | ≤ λπ(j ∗ ) |θπ(j ∗ ),N |, 4

j 6= j ∗ .

This means that θN is (1/4)-separated relative to π(j ∗ ). Also, observe that |θπ(j ∗ ),N | ≥

4 λπ(jmax ) 4 · ≥ , 5 λπ(j ∗ ) 5

λπ(jmax ) 5 ≤ . λπ(j ∗ ) 4

Therefore we are in a situation that is very similar to Theorem 7.14, except that now the dominating |λj ∗ θj ∗ | may not have the same j ∗ as the largest jmax . The proof of Theorem 7.14 λ ) ≤ 54 , hence we can still apply can be easily adapted in this case because we still know λπ(jπ(jmax ∗) Theorem 7.15 and get kθˆ − vπ(j ∗ ) k ≤

8 λπ(j ∗ )

,

ˆ − λπ(j ∗ ) | ≤ 5. |λ

ˆ i = λ, ˆ the first assertion of the inductive hypothesis is satisfied, as we Since v î = θˆ and λ can modify the permutation π by swapping π(i) and π(j ∗ ) without affecting the values of {π(j) : j ≤ i − 1} (recall j ∗ ≥ i). We now argue that E˜i+1 has the required properties to complete the inductive step. By Lemma 7.16 (using ˜ := 5), we have for any unit vector u ∈ S k−1 ,

1/2 i

X

X

⊗3 ⊗3 > 2 ˆ 5 ≤ 55. (7.19) λπ(j) vπ(j) − λj v ˆj (I, u, u) ≤ 1/50 + 100 (u vπ(j) )

j=1

j≤i

Therefore by the triangle inequality,

X

⊗3 ˆj v kE˜i+1 (I, u, u)k ≤ kE(I, u, u)k + λπ(j) vπ(j) −λ ˆj⊗3 (I, u, u) ≤ 56.

j≤i

Thus the bound (7.17) holds. To prove that (7.18) holds, pick any unit vector u ∈ S k−1 such that there exists j 0 ≥ i + 1 ) with (u> vπ(j 0 ) )2 ≥ 1 − (168/λπ(j 0 ) )2 . We have (via the assumed bound ≤ C1 · λmin k i X

2 1 168 > 2 100 (u vπ(j) ) ≤ 100 1 − (u vπ(j 0 ) ) ≤ 100 ≤ , λπ(j 0 ) 50 j=1 >

2

and therefore 1/2 i X > 2 1/50 + 100 (u vπ(j) ) 5 ≤ (1/50 + 1/50)1/2 5 ≤ . j=1

151

By the triangle inequality, we have kE˜i+1 (I, u, u)k ≤ 2. Therefore (7.18) holds, so the second assertion of the inductive hypothesis holds. Thus Eventi−1 ∩ Event0i ⊆ Eventi , and Pr[Eventi ] ≥ Pr[Eventi−1 ∩ Event0i ] ≥ 1 − iη/k. We conclude that by the induction principle, there exists a permutation π such that two assertions hold for i = k, with probability at least 1 − η. P ˆ ⊗3 ˆj k ≤ From the last induction step (i = k), it is also clear from (7.19) that kT − kj=1 λ jv 55 (in Eventk−1 ∩ Event0k ). This completes the proof of the theorem.

7.7

Better Initializers and Adaptive Deflation

If we take a closer look at the analysis of the tensor power method, it is easy to see the requirement of error smaller than λmin /k is only caused by Theorem 7.15, in particular in the requirement that the relative error δ0 should be a small constant. For √ example, the deflation step, Theorem 7.16 actually requires error only smaller than λmin / k. In some situations, it is possible to obtain much better initializers. The exchangeable single topic model is a good example for this idea: for a single document, the expected word frequency vector v will be µi if the document has topic i. When the document is long enough it is easy to show that v is actually very close to its expectation. If we apply the reduction in Section 7.2.1, then W T v will be close to one of the component in the tensor decomposition. If we have many such good initializers, then it is possible to do orthogonal tensor decomposition even when the error is more than λmin /k. To make this precise, define a vector to be (γ, Q)-good respect to i if u · vi > Q and |u · vi | − maxj6=i |u · vj | > γ|u · vi |. Theorem 7.15 shows if the initializer is (γ, Q)-good with respect to the top eigenvector, we only require ˜ < λmin Q2 γ/18. This is really good as if Q ˆ max . and γ are both constants, ˜ is only required to be smaller than a small constant times λ When Q and γ are large, it is possible√that the deflation step becomes the bottleneck as it still requires to be smaller than λmin / k. In order to avoid this we need to do deflation in a more careful way, as in Algorithm 7.3 The main difference between Algorithm 7.3 and Algorithm 7.1 is in the deflation step: instead of the one shot deflation in Algorithm 7.1 (just subtract the found component from the tensor), in Algorithm 7.3 we are more careful in that we only subtract the component when it “threats” the current iteration (that is, the |λi θi,t | value is larger or almost as large as |λj θj,t | when we are trying to converge to vj ). Lemma 7.17 (Deflation analysis). Let ˜ > 0 and let {v1 , . . . , vk } be an orthonormal basis ˆ i ≥ 0. for Rk and λi ≥ 0 for i ∈ [k]. Let {ˆ v1 , . . . , v ˆk } ∈ Rk be a set of unit vectors and λ Define third order tensor Ei such that îv î⊗3 , Ei := λi vi⊗3 − λ

152

∀ i ∈ k.

Algorithm 7.3. {λ, Φ} ←TensorEigen(T, {vi }i∈[L] , N )

Input: Tensor T ∈ Rk×k×k , set of L initialization vectors {vi }i∈L , number of iterations N . î, v ˆ are the eigenvalues Output: the estimated eigenvalue/eigenvector pairs {λ î }, where λ and v î ’s are the eigenvectors. for i = 1 to k do for τ = 1 to L do θ0 ← vτ . for t = 1 to N do T˜ ← T . for j = 1 to i − 1 (when i > 1) do ˆ j θ (τ ) · v if |λ ˆj | > ξ = 5˜ then t ˆj v ˜ ˜ T ←T −λ ˆj⊗3 . end if end for (τ ) (τ ) T˜(I,θ ,θt−1 ) (τ ) Compute power iteration update θt := ˜ t−1 (τ ) (τ ) kT (I,θt−1 ,θt−1 )k

end for end for (τ ) (τ ) (τ ) Let τ ∗ := arg maxτ ∈L {T˜(θN , θN , θN )}. (τ ∗ ) Do N power iteration updates starting from θN to obtain eigenvector estimate v î , ˆ i := T˜(ˆ and set λ vi , v î , v î ). end for î, v return the estimated eigenvalue/eigenvectors (λ î ). For some t ∈ [k] and a unit vector u ∈ S k−1 such that u = have for i ∈ [t],

P

i∈[k] θi vi

and θî := u · v î , we

ˆ i θî | ≥ ξ ≥ 5˜, |λ

ˆ i − λi | ≤ ˜, |λ

√ kˆ vi − vi k ≤ min{ 2, 2˜/λi }, then when ˜ ≤ 10−5 λmin implies

t

2 t X

X

2

Ei (I, u, u) ≤ 200˜ θi2 .

i=1

2

i=1

Proof: The proof is on lines of deflation analysis in Theorem 7.16, but we improve the bounds based on additional properties of vector u. From Theorem 7.16, we have that for all

153

i ∈ [t], and any unit vector u,

2

t t t X X

X 2 2 2

˜ /λi ) + 100 θi ˜2 Ei (I, u, u) ≤ O(

i=1

2

i=1

(7.20)

i=1

ˆ i = λi + δi and θî = θi + βi . We have δi ≤ ˜ and βi ≤ 2˜/λi , and that |λ ˆ i θî | ≥ ξ. Let λ ˆ i θî | − |λi θi || ≤ |λ ˆ i θî − λi θi | ≤ |(λi + δi )(θi + βi ) − λi θi | ≤ 4˜. ||λ (7.21) P P Thus, we have that |λi θi | ≥ 5˜ − 4˜ = ˜. Thus ti=1 ˜2 /λ2i ≤ i θi2 . Substituting in (7.20), we have the result. We also need to show that the deflation process indeed deflated all the previously found eigenvectors that has a large effective eigenvalue. This is again by (7.21), if |λi θi | ≥ 9˜, then ˆ i θî | ≥ 5˜ and will get deflated. On the other hand, the effective eigenvalue of the vector |λ we want to converge to is at least λmin Q 5˜.

154

Chapter 8 Tensor Decomposition in Community Detection Using the tensor decomposition framework, it is possible to learn many latent variable models. In this chapter we use mixed membership community model as an example. The new algorithm in this section illustrates many useful techniques for proving sharp sample complexity bounds using tensor power method. In particular, the sample complexity bounds we get almost match the state-of-art for stochastic block model (which is a special case of mixed membership model), even though our algorithm works for more general models.

8.1 8.1.1

Background and Main Results History of Overlapping Communities

Studying communities forms an integral part of social network analysis. A community generally refers to a group of individuals with shared interests (e.g. music, sports), or relationships (e.g. friends, co-workers). Community formation in social networks has been studied by many sociologists, e.g. [45, 108, 124, 129], starting with the seminal work of [129]. Various probabilistic and non-probabilistic network models attempt to explain community formation. In addition, they also attempt to quantify interactions and the extent of overlap between different communities, relative sizes among the communities, and various other network properties. Studying such community models are also of interest in other domains, e.g. in biological networks. While there exists a vast literature on community models, learning these models is typically challenging, and various heuristics such as Markov Chain Monte Carlo (MCMC) or variational expectation maximization (EM) are employed in practice. Such heuristics tend to scale poorly for large networks. On the other hand, community models with guaranteed learning methods tend to be restrictive. A popular class of probabilistic models, termed as stochastic blockmodels, have been widely studied and enjoy strong theoretical learning guarantees, e.g. [62, 84, 125, 142, 159, 161]. On the other hand, they posit that an individual belongs to a single community, which does not hold in most real settings [135]. 155

The mixed membership community model of [4] has a number of attractive properties. It retains many of the convenient properties of the stochastic block model. For instance, conditional independence of the edges is assumed, given the community memberships of the nodes in the network. At the same time, it allows for communities to overlap, and for every individual to be fractionally involved in different communities.

8.1.2

Main Result

In this section, we describe the mixed membership community model based on Dirichlet priors for the community draws by the individuals. Then we state our main result. We first introduce the special case of the popular stochastic block model, where each node belongs to a single community. Notation: Let G be the {0, 1} adjacency matrix for the random network and let GA,B be the submatrix of G corresponding to rows A ⊆ [n] and columns B ⊆ [n]. We consider models with k underlying (hidden) communities. For node i, let πi ∈ Rk denote its community membership vector, i.e., the vector is supported on the communities to which the node belongs. In the special case of the popular stochastic block model described below, πi is a basis coordinate vector, while the more general mixed membership model relaxes this assumption and a node can be in multiple communities with fractional memberships. Define Π := [π1 |π2 | · · · |πn ] ∈ Rk×n . and let ΠA := [πi : i ∈ A] ∈ Rk×|A| denote the set of column vectors restricted to A ⊆ [n]. For a matrix M , recall that (M )i and (M )i denote its ith column and row respectively. Stochastic block model (special case): In this model, each individual is independently assigned to a single community, chosen at random: each node i chooses community j independently with probability α bj , for i ∈ [n], j ∈ [k], and we assign πi = ej in this case, k th where ej ∈ {0, 1} is the j coordinate basis vector. Given the community assignments Π, every directed1 edge in the network is independently drawn: if node u is in community i and node v is in community j (and u 6= v), then the probability of having the edge (u, v) in the network is Pi,j . Here, P ∈ [0, 1]k×k and we refer to it as the community connectivity matrix. This implies that given the community membership vectors πu and πv , the probability of an edge from u to v is πu> P πv (since when πu = ei and πv = ej , we have πu> P πv = Pi,j .). Throughout this Chapter we work in the special case where Pi,i = p and Pi,j = q when i 6= j. Here p, q are two probabilities that two individual know each other when they are in/not in the same community respectively. Also, we assume all the communities have equal size, that is, α bi = 1/k for all i ∈ [k]. Mixed membership model: We now consider the extension of the stochastic block model which allows for an individual to belong to multiple communities and yet preserves 1

We limit our discussion to directed networks in this paper, but note that the results also hold for undirected community models, where P is a symmetric matrix, and an edge (u, v) is formed with probability > πu P πv = πv> P πu .

156

some of the convenient independence assumptions of the block model. InPthis model, the community membership vector πu at node u is a probability vector, i.e., i∈[k] πu (i) = 1, for all u ∈ [n]. Given the community membership vectors, the generation of the edges is identical to the block model: given vectors πu and πv , the probability of an edge from u to v is πu> P πv , and the edges are independently drawn. This formulation allows for the nodes to be in multiple communities, and at the same time, preserves the conditional independence of the edges, given the community memberships of the nodes. Dirichlet prior for community membership: The only aspect left to be specified for the mixed membership model is the distribution from which the community membership vectors Π are drawn. We consider the popular setting of [4, 71], where the community vectors {πu } are i.i.d. draws from the Dirichlet distribution, denoted by Dir(α), with parameter vector α ∈ Rk>0 . The probability density function of the Dirichlet distribution is given by Qk P[π] =

k

Γ(αi ) Y αi −1 , π Γ(α0 ) i=1 i

i=1

π ∼ Dir(α), α0 :=

X

αi ,

(8.1)

i

where Γ(·) is the Gamma function and the ratio of the Gamma function serves as the normalization constant. The Dirichlet distribution is widely employed for specifying priors in Bayesian statistics, e.g. latent Dirichlet allocation [26]. The Dirichlet distribution is the conjugate prior of the multinomial distribution which makes it attractive for Bayesian inference. P Let α b denote the normalized parameter vector α/α , where α := 0 0 i αi . In particular, P bi = 1. Intuitively, α b denotes the relative expected note that α b is a probability vector: P iα −1 bi ). Again in this Chapter we work sizes of the communities (since E[n u∈[n] πu [i]] = α with the simplified case where αi = α0 /k for all i. The stochastic block model is a limiting case of the mixed membership model when the Dirichlet parameter is α = α0 · α b, where the probability vector α b is held fixed and α0 → 0. In the other extreme when α0 → ∞, the Dirichlet distribution becomes peaked around a single point, for instance, if αi ≡ c and c → ∞, the Dirichlet distribution is peaked at k −1 · 1, where 1 is the all-ones vector. Thus, the parameter α0 serves as a measure of the average sparsity of the Dirichlet draws or equivalently, of how concentrated the Dirichlet measure is along the different coordinates. This in effect, controls the extent of overlap among different communities. Specific Parameter Choices in Homophilic Models A sub-class of community model are those satisfying homophily. Homophily or the tendency to form edges within the members of the same community has been posited as an important factor in community formation, especially in social settings. Many of the existing learning algorithms, e.g. [166] require this assumption to provide guarantees in the stochastic block model setting. Throughout this Chapter we deal with a standard special case, where the diagonal entries of P are all equal to p, and off-diagonal entries are all equal to q < p. All the communities have the same (expected size), which means α bi = 1/k for all i ∈ [k]. The parameter α0 157

x

A

B u

C v

w

Figure 8.1: Our moment-based learning algorithm uses 3-star count tensor from set X to sets A, B, C (and the roles of the sets are interchanged to get various estimates). Specifically, T is a third order tensor, where T(u, v, w) is the normalized count of the 3-stars with u, v, w as leaves over all x ∈ X. should generally be thought of as a constant (which correspond to the case that everyone is ˜ and Ω ˜ notations expected to be in constant number of communities). In particular, our O will ignore polynomial powers of α0 + 1. Main Result Informally, the main result of this chapter is Theorem 8.1 (Informal). There is an algorithm for estimating the parameters of Mixed Membership Stochastic Block model, when everyone is expected to be in constant number of communities, and communities have similar sizes, the guarantee of the algorithm is matches the state-of-art result for the simpler Stochastic Block model. The rest of this chapter is structured as follows: we show what graph moments we use and give the structure of the graph moments in Section 8.2. In Section 8.3, we describe learning algorithms when we have exact moments (just to give intuition) and how this can be adapted to the case when we are only given samples. The next Section 8.4 states the performance guarantee of our algorithms, and outlines the proof. All the details of the proof are deferred to the final Section 8.5. Implementation In a subsequent work, Huang et al. [89] implemented the algorithm using GPUs. The algorithm scales up to networks with more than 100,000 nodes and 500 communities, and produces reasonable results.

8.2

Graph Moments Under Mixed Membership Models

Our approach for learning a mixed membership community model relies on the form of the graph moments2 under the mixed membership model. We now describe the specific graph moments used by our learning algorithm (based on 3-star and edge counts) and provide explicit forms for the moments, assuming draws from a mixed membership model. 2

We interchangeably use the term first order moments for edge counts and third order moments for 3-star counts.

158

To simplify notation we define F := Π> P > = [π1 |π2 | · · · |πn ]> P > .

(8.2)

For a subset A ⊆ [n] of individuals, let FA ∈ R|A|×k denote the submatrix of F corresponding > to nodes in A, i.e., FA := Π> AP . Graph Moments under Mixed Membership Dirichlet Model 3-star counts: The primary quantity of interest is a third-order tensor which counts the number of 3-stars. A 3-star is a star graph with three leaves {a, b, c} and we refer to the internal node x of the star as its “head”, and denote the structure by x → {a, b, c} (see figure 8.1). We partition the network into four3 parts and consider 3-stars such that each node in the 3-star belongs to a different partition. Specifically, consider4 a partition A, B, C, X of the network. We count the number of 3-stars from X to A, B, C and our quantity of interest is TX→{A,B,C} :=

1 X > > [G ⊗ G> i,B ⊗ Gi,C ]. |X| i∈X i,A

(8.3)

T ∈ R|A|×|B|×|C| is a third order tensor, and an element of the tensor is given by TX→{A,B,C} (a, b, c) =

1 X G(x, a)G(x, b)G(x, c), |X| x∈X

∀a ∈ A, b ∈ B, c ∈ C,

(8.4)

which is the normalized count of the number of 3-stars with leaves a, b, c such that its “head” is in set X. We now analyze the graph moments for the general mixed membership Dirichlet model. Instead of the raw moments (i.e. edge and 3-star counts), we consider modified moments to obtain similar expressions as in the case of the stochastic block model. Let µX→A ∈ R|A| denote a vector which gives the normalized count of edges from X to A: 1 X > [G ]. (8.5) µX→A := |X| i∈X i,A 0 We now define a modified adjacency matrix5 GαX,A as 0 GαX,A :=

√

√ α0 + 1GX,A − ( α0 + 1 − 1)1µ> X→A .

3

(8.6)

For sample complexity analysis, we require dividing the graph into more than four partitions to deal with statistical dependency issues, and we outline it in Section 8.3. 4 For our theoretical guarantees to hold, the partitions A, B, C, X can be randomly chosen and are of size Θ(n). P 5 To compute the modified moments Gα0 , and Tα0 , we need to know the value of the scalar α0 := i αi , which is the concentration parameter of the Dirichlet distribution and is a measure of the extent of overlap between the communities. We assume its knowledge here.

159

0 In the special case of the stochastic block model (α0 → 0), GαX,A = GX,A is the submatrix of the adjacency matrix G. Similarly, we define modified third-order statistics, 0 :=(α0 + 1)(α0 + 2) TX→{A,B,C} +2 α02 µX→A ⊗ µX→B ⊗ µX→C TαX→{A,B,C} α0 (α0 + 1) X > > > > > − Gi,A ⊗ G> i,B ⊗ µX→C + Gi,A ⊗ µX→B ⊗ Gi,C + µX→A ⊗ Gi,B ⊗ Gi,C , |X| i∈X

(8.7)

and it reduces to (a scaled version of) the 3-star count TX→{A,B,C} defined in (8.3) for the stochastic block model (α0 → 0). The modified adjacency matrix and the 3-star count tensor can be viewed as a form of “centering” of the raw moments which simplifies the expressions for the moments. The following relationships hold between the modified graph moments 0 GαX,A , Tα0 and the model parameters P and α b of the mixed membership model. Proposition 8.2 (Moments in Mixed Membership Model). Given partitions A, B, C, X 0 and GαX,A and Tα0 , as in (8.6) and (8.7), normalized Dirichlet concentration vector α b, and > > F := Π P , where P is the community connectivity matrix and Π is the matrix of community memberships, we have 0 E[(GαX,A )> |ΠA , ΠX ] = FA Diag(b α1/2 )ΨX , 0 E[TαX→{A,B,C}

|ΠA , ΠB , ΠC ] =

k X i=1

(8.8)

α bi (FA )i ⊗ (FB )i ⊗ (FC )i ,

(8.9)

where (FA )i corresponds to ith column of FA and ΨX relates to the community membership matrix ΠX as ! ! X √ √ 1 α0 + 1ΠX − ( α0 + 1 − 1) ΨX := Diag(b α−1/2 ) πi 1> . |X| i∈X Moreover, we have that |X|−1 EΠX [ΨX Ψ> X ] = I.

(8.10)

Recall E[G> membership model, and µX→A := i,A |πi , ΠA ] = FA πi for a mixed P 1 > > G , therefore E[µ |Π , Π ] = F π X→A A X A |X| i,A i∈X i∈X i 1 . Equation (8.8) follows

Proof: P 1 |X|

directly. For Equation (8.10), we note the Dirichlet moment, E[ππ > ] = α0 α bα b> , when π ∼ Dir(α) and α0 +1

1 α0 +1

Diag(b α) +

√ √ |X|−1 E[ΨX Ψ> α−1/2 ) (α0 + 1)E[ππ > ] + (−2 α0 + 1( α0 + 1 − 1) X ] = Diag(b √ +( α0 + 1 − 1)2 )E[π]E[π]> Diag(b α−1/2 ) = Diag(b α−1/2 ) Diag(b α ) + α0 α bα b> + (−α0 )b αα b> Diag(b α−1/2 ) = I. 160

The expectation in (8.9) involves multi-linear map of the expectation of the tensor products π ⊗ π ⊗ π among other terms. Collecting these terms, we have that (α0 + 1)(α0 + 2)E[π ⊗ π ⊗ π] − (α0 )(α0 + 1)(E[π ⊗ π ⊗ E[π]] +E[π ⊗ E[π] ⊗ π] + E[E[π] ⊗ π ⊗ π]) + 2α02 E[π] ⊗ E[π] ⊗ E[π] is a diagonal tensor, in the sense that its (p, p, p)-th entry is α bp , and its (p, q, r)-th entry is 0 when p, q, r are not all equal. With this, we have (8.9). Note that the form of the tensor in this Lemma is very similar to Theorem 7.2. However, because FA , FB , FC are three different matrices, we need a more complicated reduction than Section 7.2.1.

8.3

Algorithm for Learning Mixed Membership Models

The simple form of the graph moments derived in the previous section is now utilized to recover the community vectors Π and model parameters P, α b of the mixed membership model. We first analyze the simpler case when exact moments are available in Section 8.3.1 and then extend the method to handle empirical moments computed from the network observations in Section 8.3.2.

8.3.1

Learning Mixed Membership Models Under Exact Moments

We first describe the learning approach when exact moments are available. In Section 8.3.2, we suitably modify the approach to handle perturbations, which are introduced when only empirical moments are available. We first describe a “simultaneous whitening” procedure similar to Section 7.2.1 to convert the graph moment tensor Tα0 to a orthogonally decomposable tensor through a multi-linear transformation of Tα0 . We then employ the power method to obtain the eigenvalues and eigenvectors of the tensor. Finally, by reversing the multi-linear transform in the whitening step we get the parameters of interest. This yields a guaranteed method for obtaining the decomposition of graph moment tensor Tα0 under exact moments. Reduction of the graph-moment tensor to symmetric orthogonal form (Whitening): Recall from Proposition 8.2 that the modified 3-star count tensor Tα0 has the form α0

E[T

|ΠA , ΠB , ΠC ] =

k X i=1

α bi (FA )i ⊗ (FB )i ⊗ (FC )i .

In order to whiten the tensor, we make use of the modified adjacency matrix Gα0 , defined in (8.6). Consider the singular value decomposition (SVD) of the modified adjacency matrix 161

Gα0 under exact moments: 0 |X|−1/2 E[(GαX,A )> |Π] = UA DA VA> .

−1 Define WA := UA DA , and similarly define WB and WC using the corresponding matrices α0 α0 GX,B and GX,C respectively. Now define

RA,B :=

1 0 0 W > E[(GαX,B )> |Π] · E[(GαX,A )|Π]WA , |X| B

˜ B := WB RA,B , W

(8.11)

˜ C . We establish that a multilinear transformation (as defined in (A.1)) and similarly define W ˜ B , and W ˜ C results in a orthogonally of the graph-moment tensor Tα0 using matrices WA , W decomposable tensor. Lemma 8.3 (Reducing to Orthogonally Decomposable Tensor). Assume that the matrices FA , FB , FC and ΠX have rank k, where k is the number of communities. We have an orthogonal symmetric tensor form for the modified 3-star count tensor Tα0 in (8.7) under a ˜ B , and W ˜ C: multilinear transformation using matrices WA , W X k×k×k ˜ B, W ˜ C )|ΠA , ΠB , ΠC ] = E[Tα0 (WA , W λi (Φ)⊗3 , (8.12) i ∈ R i∈[k]

where λi := α bi−0.5 and Φ ∈ Rk×k is an orthogonal matrix, given by Φ := WA> FA Diag(b α0.5 ).

(8.13)

Remark: Note that the matrix WA orthogonalizes FA under exact moments, and is ˜ B = RA,B WB and W ˜ C = RA,C WC referred to as a whitening matrix. Similarly, the matrices W consist of whitening matrices WB and WC , and in addition, the matrices RA,B and RA,C serve to symmetrize the tensor. Proof: Recall that the modified adjacency matrix Gα0 satisfies 0 E[(GαX,A )> |ΠA , ΠX ] = FA Diag(b α1/2 )ΨX .

ΨX := Diag(b α−1/2 )

√

√ α0 + 1ΠX − ( α0 + 1 − 1)

! ! 1 X πi 1> . |X| i∈X

From the definition of ΨX above, we see that it has rank k when ΠX has rank k. Using the Sylvester’s rank inequality, we have that the rank of FA Diag(b α1/2 )ΨX is at least 2k − k = k. This implies that the whitening matrix WA also has rank k. Notice that −1 > −1 2 > 0 0 = I ∈ Rk×k , |X|−1 WA> E[(GαX,A )> |Π] · E[(GαX,A )|Π]WA = DA UA UA DA UA UA DA

162

or in other words, |X|−1 M M > = I, where M := WA> FA Diag(b α1/2 )ΨX . We now have that α1/2 )E[ΨX Ψ> α1/2 )FA> WA I = |X|−1 EΠX M M > = |X|−1 WA> FA Diag(b X ] Diag(b α)FA> WA , = WA> FA Diag(b

since |X|−1 EΠX [ΨX Ψ> X ] = I from (8.10), and we use the fact that the sets A and X do not overlap. Thus, WA whitens FA Diag(b α1/2 ) under exact moments (up on taking expectation over ΠX ) and the columns of WA> FA Diag(b α1/2 ) are orthonormal. Now note from the ˜ definition of WB that ˜ > E[(Gα0 )> |Π] = W > E[(Gα0 )> |Π], W B A X,B X,A since WB satisfies 0 0 |X|−1 WB> E[(GαX,B )> |Π] · E[(GαX,B )|Π]WB = I,

˜ C . The final result in (8.12) follows by taking expectation of and similar result holds for W α0 tensor T over ΠX .

Overview of the learning approach under exact moments: With the above result in place, we are now ready to describe the high-level approach for learning the mixed membership model under exact moments. First, whiten the graph-moment tensor Tα0 as described above and then apply the tensor power method described in the previous Chapter. This enables us to obtain the vector of eigenvalues λ := α b−1/2 and the matrix of eigenvectors > 0.5 Φ = WA FA Diag(b α ) using tensor power iterations. We can then recover the community membership vectors of set Ac (i.e., nodes not in set A) under exact moments as ΠAc ← Diag(λ)−1 Φ> WA> E[G> Ac ,A |Π], c −1 > > since E[G> Ac ,A |Π] = FA ΠAc (since A and A do not overlap) and Diag(λ) Φ WA = Diag(b α)FA> WA WA> under exact moments. In order to recover the community membership vectors of set A, viz., ΠA , we can reverse the direction of the 3-star counts, i.e., consider the 3-stars from set A to X, B, C and obtain ΠA in a similar manner. Once all the community membership vectors Π are obtained, we can obtain the community connectivity matrix P , using the relationship: Π> P Π = E[G|Π] and noting that we assume Π to be of rank k. Thus, we are able to learn the community membership vectors Π and the model parameters α b and P of the mixed membership model using edge counts and the 3-star count tensor. We now describe modifications to this approach to handle empirical moments.

8.3.2

Learning Algorithm Under Empirical Moments

In the previous section, we explored a tensor-based approach for learning mixed membership model under exact moments. However, in practice, we only have samples (i.e. the observed network), and the method needs to be robust to perturbations when empirical moments are employed. 163

ˆ Pˆ , α Algorithm 8.1. {Π, b} ← LearnMixedMembership(G, k, α0 ) P Input: Adjacency matrix G ∈ Rn×n , k is the number of communities, α0 := i αi , where α is the Dirichlet parameter vector. Let Ac := [n] \ A denote the set of nodes not in A. Output: Estimates of the community membership vectors Π ∈ Rn×k , community connectivity matrix P ∈ [0, 1]k×k , and the normalized Dirichlet parameter vector α b. Partition the vertex set [n] into 5 parts X, Y , A, B, C. 0 0 0 Compute moments GαX,A , GαX,B , GαX,C , TαY 0→{A,B,C} using (8.6) and (8.7). ˆ Ac , α , G). {Π b} ← LearnPartitionCommunity(Gα0 , Gα0 , Gα0 , Tα0 X,A

X,B

X,C

Y →{A,B,C}

ˆ Y c. Interchange roles of Y and A to obtain Π ˆ such that its i-th row is Q ˆ i := (α0 + 1) Πˆ i − α0 1> . {We will establish that Define Q ˆ i |1 n |Π † > ˆ Q ≈ (Π ) .} ˆ Q ˆ > . {Recall that E[G] = Π> P Π in our model.} Estimate Pˆ ← QG ˆ Pˆ , α Return Π, b

Pre-processing steps 0 0 0 ˆ Ac , α Algorithm 8.2. {Π b} ← LearnPartitionCommunity(GαX,A , GαX,B , GαX,C , TαY 0→{A,B,C} , G) 0 0 0 Input: Require modified adjacency submatrices GαX,A , GαX,B , GαX,C , 3-star count tensor α0 TY →{A,B,C} , adjacency matrix G Output: Estimates of ΠAc and α b. −1/2 α0 > Compute rank-k SVD: (|X| GX,A )k−svd = UA DA VA> and compute whitening matrices ˆ A := UA D−1 . Similarly, compute W ˆ B, W ˆ C and R ˆ AB , R ˆ AC using (8.14). W A ˆ A, W ˆ BR ˆ AB , W ˆ CR ˆ AC ). Compute whitened and symmetrized tensor T ← TαY 0→{A,B,C} (W ˆ Φ} ˆ ← Algorithm 7.3 TensorEigen(T, {W ˆ > G> }i∈B , 10 log n). {Φ ˆ is a k × k matrix with {λ, A i,A ˆ is the vector of estimated eigenvalues. each columns being an estimated eigenvector and λ > > ˆ G ’s are initial vectors} W A i,A ˆ −1 Φ ˆ −2 , for i ∈ [k]. ˆ ˆ >W ˆ > G>c , τ ) and α ΠAc ← Thres(Diag(λ) î ← λ A A ,A i ˆ Return ΠAc and α ˆ.

Partitioning: In the previous section, we partitioned the nodes into four sets A, B, C, X for learning under exact moments. However, we require more partitions under empirical moments to avoid statistical dependency issues and obtain stronger reconstruction guarantees. We now divide the network into five non-overlapping sets A, B, C, X, Y . The set X is ˆ A, W ˆ B and W ˆ C , described in detail subsequently, employed to compute whitening matrices W the set Y is employed to compute the 3-star count tensor Tα0 and sets A, B, C contain the leaves of the 3-stars under consideration. The roles of the sets can be interchanged to obtain the community membership vectors of all the sets. 164

Whitening: The whitening procedure is along the same lines as described in the previous section, except that now empirical moments are used. Specifically, consider the k-rank singular value decomposition (SVD) of the modified adjacency matrix Gα0 defined in (8.6), > 0 (|X|−1/2 GαX,A )> k−svd = UA DA VA .

ˆ A := UA D−1 , and similarly define W ˆ B and W ˆ C using the corresponding matrices Define W A α0 α0 GX,B and GX,C respectively. Now define α0 ˆ A,B := 1 W ˆ B> (Gα0 )> ˆ R X,B k−svd · (GX,A )k−svd WA , |X|

(8.14)

ˆ AC . The whitened and symmetrized graph-moment tensor is now and similarly define R computed as ˆ A, W ˆ BR ˆ AB , W ˆ CR ˆ AC ), TαY 0→{A,B,C} (W where Tα0 is given by (8.7) and the multi-linear transformation of a tensor is defined in (A.1). Efficient Initialization: For a mixed membership model in the sparse regime, recall that the community membership vectors Π are sparse (with high probability). Under this regime of the model, we note that the whitened neighborhood vectors contain good initializers for the power iterations. Specifically, in Algorithm 7.1, we initialize with the whitened neighborhood ˆ > G> , for i ∈ vectors W / A. The intuition behind this is as follows: for a suitable choice of A i,A parameters (such as the scaling of network size n with respect to the number of communities k), we expect neighborhood vectors G> i,A to concentrate around their mean values, viz., , FA πi . Since πi is sparse (w.h.p) for the model regime under consideration, this implies that ˆ > FA πi , for i ∈ Ac , which concentrate (w.h.p) on only along a few there exist vectors W A eigen-directions of the whitened tensor, and hence, serve as an effective initializer. Reconstruction after tensor power method Recall that previously in Section 8.3.1, when exact moments are available, estimating the community membership vectors Π is straightforward, once we recover all the stable tensor eigen-pairs. However, in case of empirical moments, we can obtain better guarantees with the following modification: the estimated community membership vectors Π are further subject to thresholding so that the weak values are set to zero. Since we are limiting ourselves to the regime of the mixed membership model, where the community vectors Π are sparse (w.h.p), this modification strengthens our reconstruction guarantees. This thresholding step is incorporated in Algorithm 8.1. Moreover, recall that under exact moments, estimating the community connectivity matrix P is straightforward, once we recover the community membership vectors since P ← (Π> )† E[G|Π]Π† . However, when empirical moments are available, we are able to establish better reconstruction guarantees through a different method, outlined in Algorithm 8.1. 165

ˆ such that its i-th row is We define Q î ˆ i := (α0 + 1) Π − α0 1> , Q ˆ i |1 n |Π ˆ and the matrix Pˆ is obtained as Pˆ ← QG ˆ Q ˆ > . We subsequently based on estimates Π, ˆΠ ˆ > ≈ I. establish that Q Improved support recovery estimates in homophilic models: We describe the post-processing method in Algorithm 8.3, which gives improved estimates by averaging. Specifically, consider nodes in set C and edges going from C to nodes in B. First, consider the special case of the stochastic block model: for each node c ∈ C, compute the number of ˆ from Algorithm 8.1), neighbors in B belonging to each community (as given by the estimate Π and declare the community with the maximum number of such neighbors as the community of node c. Intuitively, this provides a better estimate for ΠC since we average over the edges in B. This method has been used before in the context of spectral clustering [125]. The same idea can be extended to the general mixed membership models: declare communities to be significant if they exceed a certain threshold, as evaluated by the average number of edges to each community. The details are provided in Algorithm 8.3. In the next section, we establish that in certain regime of parameters, this procedure can lead to zero-error support recovery of significant community memberships of the nodes and also rule out communities where a node does not have a strong presence. ˆ ← SupportRecoveryHomophilicModels(G, k, α0 , ξ, Π) ˆ Algorithm 8.3. {S} P Input: Adjacency matrix G ∈ Rn×n , k is the number of communities, α0 := i αi , where α is the Dirichlet parameter vector, ξ is the threshold for support recovery, corresponding to ˆ from Algorithm 8.1. significant community memberships of an individual. Get estimate Π n×k ˆ Output: S ∈ {0, 1} is the estimated support for significant community memberships (see Theorem 8.6 for guarantees). Consider partitions A, B, C, X, Y as in Algorithm 8.1. ˆ on lines of definition in Algorithm 8.1, using estimates Π. ˆ Let the i-th row for Define Q i ˆ î . ˆ i := (α0 + 1) ΠiB − α0 1> . Similarly define Q set B be Q C B ˆ |1 n |Π B > ˆ , Pˆ ← Q ˆ C FˆC . Estimate FˆC ← GC,B Q B Let H be the average of diagonals of Pˆ , L be the average of off-diagonals of Pˆ for x ∈ C, i ∈ [k] do ˆ x) ← 1 if FˆC (x, i) ≥ L + (H − L) · 3ξ and zero otherwise.{Identify large entries} S(i, 4 end for Permute the roles of the sets A, B, C, X, Y to get results for remaining nodes.

Computational complexity: We note that the computational complexity of the method is O(n2 k). This is because the time for computing whitening matrices is dominated by SVD 166

of the top k singular vectors of n × n matrix, which takes O(n2 k) time. We then compute the whitened tensor T which requires time O(n2 k + k 3 n) = O(n2 k), since for each i ∈ Y , we multiply Gi,A , Gi,B , Gi,C with the corresponding whitening matrices, and this step takes O(nk) time. We then average this k × k × k tensor over different nodes i ∈ Y to the result, which takes O(k 3 ) time in each step. For the tensor power method, the time required for a single iteration is O(k 3 ). We need at most log n iterations per initial vector, and we need to consider k γ0 initial vectors (recall that γ0 is a constant close to 1). Hence the total running time of tensor power method is O(k 4+γ ) which is dominated by O(n2 k). In the process of estimating Π and P , the dominant operation is multiplying k × n matrix by n × n matrix, which takes O(n2 k) time. For support recovery, the dominant operation is computing the “average degree”, which again takes O(n2 k) time. Thus, we have that the overall computational time is O(n2 k). Note that the above bound on complexity of our method nearly matches the bound for spectral clustering method [125], since computing the k-rank SVD requires O(n2 k) time. Another method for learning stochastic block models is based on convex optimization involving semi-definite programming (SDP) [166], and it provides the best scaling bounds (for both the network size n and the separation p−q for edge connectivity) known so far. The specific convex problem can be solved via the method of augmented Lagrange multipliers [114], where each step consists of an SVD operation and q-linear convergence is established by [114]. This implies that the method has complexity O(n3 ), since it involves taking SVD of a general n × n matrix, rather than a k-rank SVD. Thus, our method has significant advantage in terms of computational complexity, when the number of communities is much smaller than the network size (k n).

8.4 8.4.1

Sample Analysis for Proposed Learning Algorithm Sufficient Conditions and Recovery Guarantees

Recall that for a matrix M , (M )i and (M )i denote the ith row and column respectively. We say that an event holds with high probability, if it occurs with probability 1 − n−c for some ˜ and Ω ˜ hides polylog factors and polynomials of (1 + α0 ). constant c > 0. Recall that O Theorem 8.4 (Guarantees on Estimating P , Π). When p > q > poly log n, with high probability √ np i i ˆ ˜ επ,`1 := max kΠ − Π k1 = O i (p − q) √ ˜ k√ p . εP := max |Pî,j − Pi,j | = O i,j∈[n] n

p−q √ p

˜ √k ), = Ω( n

The proofs are given in next section and a proof outline is provided in Section 8.4.2.

167

The main ingredient in establishing the above result is the tensor concentration bound and additionally, recovery guarantees under the tensor power method in Procedure 7.1. We now provide these results below. > > Recall that FA := Π> α1/2 ) denotes the set of tensor eigenvectors A P and Φ = WA FA Diag(b ˆ is the set of estimated eigenvectors under empirical under exact moments in (8.13), and Φ moments, obtained using Algorithm 8.2. We establish the following guarantees. Lemma 8.5 (Perturbation bound for estimated eigen-pairs). The recovered eigenvectorˆ i ) from the tensor power method satisfies with high probability, for a ˆ i, λ eigenvalue pairs (Φ permutation θ, such that −1/2

ˆ i − Φθ(i) k ≤ 8k −1/2 εT , max kΦ

max |λi − α bθ(i) | ≤ 5εT , i

i∈[k]

(8.15)

The tensor perturbation bound εT is given by

α0 α0 ˆ ˆ ˆ ˆ ˆ ˜ ˜ εT := TY →{A,B,C} (WA , WB RAB , WC RAC ) − E[TY →{A,B,C} (WA , WB , WC )|ΠA∪B∪C ] ! 3/2 √ k p ˜ √ =O . (8.16) (p − q) n Notice that the power √ on k is 3/2 here, the extra this tensor are of order k.

√ k factor is due to all eigenvalues of

Zero-error guarantees for support recovery Recall that we proposed Procedure 8.3 as a post-processing step to provide improved support recovery estimates. We now provide guarantees for this method. We now specify the threshold ξ for support recovery in Algorithm 8.3. Theorem 8.6 (Support Recovery). The support recovery method in Algorithm 8.3 has the √ pk ˆ ˜ following guarantees on the estimated support set S when ξ = Ω( (p−q)√n ): with high probability, ˆ j) = 1 Π(i, j) ≥ ξ ⇒ S(i,

and

Π(i, j) ≤

ξ ˆ j) = 0, ⇒ S(i, 2

∀i ∈ [k], j ∈ [n],

(8.17)

where Π is the true community membership matrix. Thus, the above result guarantees that the Algorithm 8.3 correctly recovers all the “large” entries of Π and also correctly rules out all the “small” entries in Π. In other words, we can correctly infer all the significant memberships of each node and also rule out the set of communities where a node does not have a strong presence. The only shortcoming of the above result is that there is a gap between the “large” and “small” values, and for an intermediate set of values (in [ξ/2, ξ]), we cannot guarantee correct inferences about the community memberships. Note this gap depends on εP , the error in 168

estimating the P matrix. This is intuitive, since as the error εP decreases, we can infer the community memberships over a large range of values. For the special case of stochastic block models (i.e. lim α0 → 0), we can improve the above result and give a zero error guarantee at all nodes (w.h.p). Note that we no longer require a threshold ξ in this case, and only infer one community for each node.

8.4.2

Proof Outline

We now summarize the main techniques involved in proving Theorems 8.4 and 8.6. The details are in Section 8.5. The main ingredient is the concentration of the adjacency matrix: since the edges are drawn independently conditioned on the community memberships, we establish that the adjacency matrix concentrates around its mean under the stated assumptions. See Section 8.5.1 for details. With this in hand, we can then establish concentration of various quantities used by our learning algorithm. Step 1: Whitening matrices. We first establish concentration bounds on the whitening ˆ A, W ˆ B, W ˆ C computed using empirical moments, described in Section 8.3.2. With matrices W ˆ > F Diag(b this in hand, we can approximately recover the span of matrix FA since W αi )1/2 is A a rotation matrix. The main technique employed is the Matrix Bernstein’s inequality [149, thm. 1.4]. See Appendix B for details. Step 2: Tensor concentration bounds Recall that we use the whitening matrices to obtain a symmetric orthogonal tensor. We establish that the whitened and symmetrized tensor concentrates around its mean. This is done in several stages and we carefully control the tensor perturbation bounds. See Section 8.5.1 for details. Step 3: Tensor power method analysis. We utilize the algorithm in Section 7.7. Here we need to establish that there exist good initializers for the power method among (whitened) neighborhood vectors, see Section 8.5.2 for details. Step 4: Thresholding of estimated community vectors In Step 3, we provide guarantees for recovery of each eigenvector in `2 norm. Direct application of this result only allows us to obtain `2 norm bounds for row-wise recovery of the community matrix Π. In order to strengthen the result to an `1 norm bound, we threshold the estimated Π vectors. Here, we exploit the sparsity in Dirichlet draws and carefully control the contribution of weak entries in the vector. Finally, we establish perturbation bounds on P through rather straightforward concentration bound arguments. See Section 8.5.3 for details. Step 5: Support recovery guarantees. It is convenient to consider the case of in stochastic block model here in the canonical setting of Section 8.4.1. Recall that Procedure 8.3 readjusts the community membership estimates based on degree averaging. For each vertex, if we count the average degree towards these “approximate communities”, for 169

the correct community the result is concentrated around value p and for the wrong community the result is around value q. Therefore, we can correctly identify the community memberships of all the nodes, when p − q is sufficiently large. The argument can be easily extended to general mixed membership models. See Section 8.5.3 for details.

8.5

Proof of the Main Theorems

In this section we prove the main theorems. The order of the proof follows the order of the algorithm: we will first show concentration bounds for the whitening matrices and the tensor; then we give the guarantee for tensor power method (which mainly involves showing the initializers are good); finally we prove the reconstruction guarantees. The proof is simplified due to the assumptions (all communities are of the same size, P has the special structure), but is not oversimplified and can still be easily extended to the general case. Throughout the section we heavily use the fact that the columns of Π are chosen according to Dirichlet distribution, in particular the moments of the distribution is stated as follows Proposition 8.7 (Moments under Dirichlet distribution). Suppose π ∼ Dir(α), the moments of π satisfies the following formulas: αi α0 αi (αi + 1) E[πi2 ] = α0 (α0 + 1) αi αj E[πi πj ] = , α0 (α0 + 1) E[πi ] =

More generally, if a(t) =

Qt−1

i=0 (a

E[

+ i), then we have k Y

(a ) πi i ]

(ai ) i=1 αi . P ( k a) α0 i=1 i

Qk =

i=1

8.5.1

i 6= j.

Concentration Bounds

Let’s first look at the concentration bounds for the matrices and tensors used in the algorithm. Bounding Π and F

170

iid

Lemma 8.8 (Bounds for Π and F ). For πi ∼ Dir(α) with high probability, we have s |A| ˜ 1/4 ), σmin (ΠA ) ≥ − O(n k(α0 + 1) p ˜ 1/4 ), kΠA k ≤ |A|/k + O(n p κ(ΠA ) ≤ 2 (α0 + 1). Moreover, with high probability |FA |1 ≤ 2p|A| Proof:

Consider ΠA Π> A =

P

i∈A

(8.18)

πi πi> .

1 > E[Π> A ΠA ] =Eπ∼Dir(α) [ππ ] |A| α0 1 = α bα b> + Diag(b α), α0 + 1 α0 + 1 from Proposition 8.7. The first term is positive semi-definite so the eigenvalues of the sum are at least the eigenvalues of the second component. Smallest eigenvalue of second component gives lower bound on σmin (E[ΠA Π> A ]). The spectral norm of the first component is α0 α0 αk ≤ α0 +1 α bmax , the spectral norm of second component is α01+1 αmax . Thus bounded by α0 +1 kˆ

E[ΠA Π> ] ≤ |A| · α bmax . A P 1 > > Now applying Matrix Bernstein’s inequality to |A| i πi πi − E[ππ ] . We have that the variance is O(1/|A|). Thus with probability 1 − δ, s !

1

log(k/δ) > >

.

|A| ΠA ΠA − E[ΠA ΠA ] = O |A| To show bound on |FA |1 , note that each column of FA satisfies E[(FA )i ] = (b αT (P )i )1> , and thus |E[FA ]|1 ≤ |A| maxi (P α b)i ≤ p|A|. Using Bernstein’s inequality, for each column of FA , we have, with probability 1 − δ, ! r |A| |(FA )i |1 − |A|b p|A| log , α> (P )i = O δ by applying Bernstein’s inequality, since |b α> (P )i | ≤ p.

Bounding the Modified Adjacency Matrix The amount of error in the whitening step largely depends on the spectral norm of the difference between the true modified adjacency 0 matrix GαX,A and its expectation. 171

Lemma 8.9 (Concentration of adjacency submatrices). With high probability ˜ (√pn) . kGX,A − E[GX,A |Π]k = O p ˜ p/n). kµX→A − E[µX→A |Π]k = O(

(8.19) (8.20)

As a corollary, the modified adjacency matrix has bound 0 0 ˜ √pn) G := kGαX,A − E[(GαX,A )> |Π]k = O(

(8.21)

Proof: Recall E[GX,A |Π] = FA ΠX and GA,X = Ber(FA ΠX ) where Ber(·) denotes the Bernoulli random matrix with independent entries. Let > Zi := (G> i,A − FA πi )ei .

P We have G> X,A − FA ΠX = i∈X P Zi . We apply matrix P Bernstein’s inequality. P We compute the variances i E[Zi Zi> |Π] and i E[Zi> Zi |Π]. We have that i E[Zi Zi> |Π] only the diagonal terms are non-zero due to independence of Bernoulli variables, and E[Zi Zi> |Π] ≤ Diag(FA πi )

(8.22)

entry-wise. Thus, k

X i∈X

E[Zi Zi> |Π]k ≤ max a∈[k]

X i∈X,b∈[k]

X

= max a∈[k]

≤ max c∈[k]

FA (a, b)πi (b) FA (a, b)ΠX (b, i)

i∈X,b∈[k]

X

P (c, b)ΠX (b, d)

i∈X,b∈[k]

= kP ΠX k∞ = |FX |1 .

(8.23)

P P 2 Similarly i∈X E[Zi> Zi ] = i∈X Diag(E[kG> i,A − FA πi k ]) ≤ |FX |1 . From Lemma 8.8, we ˜ have |FX |1 = O(p|X|) . We now bound kZi k. First note that the entries in Gi,A are independent and we can use the vector to bound kGi,A −FA πi k. We have maxj∈A |Gi,j −(FA πi )j | ≤ P Bernstein’s inequality P 2 2 and j E[Gi,j − (FA πi )j ] ≤ j (FA πi )j ≤ |FA |1 . Thus with high probability p ˜ |FA |1 ). kGi,A − FA πi k = O( p p P ˜ √pn). The con˜ Thus, we have the bound that k i Zi k = O(max( |FA |1 , |FX |1 )) = O( centration of the mean term follows from this result. The statement for µ is just a straight forward application of Bernstein’s inequality.

172

0 Finally, for the modified adjacency matrix GαX,A , we have

G ≤

√

p √ α0 + 1kGX,A − E[GX,A |Π]k + ( α0 + 1 − 1) |X|kµX→A − E[µX→A |Π]k.

substituting in the bounds for adjacency matrix and mean vector gives the result. Another factor in the whitening error is the smallest singular value of the expectation of which we estimate below. Define √ √ α). (8.24) ψi := Diag(ˆ α)−1/2 ( α0 + 1πi − ( α0 + 1 − 1)b

0 GαX,A ,

Let ΨX be the matrix with columns ψi , for i ∈ X. We have 0 E[(GαX,A )> |Π] = FA Diag(ˆ α)1/2 ΨX , 0 from definition of E[(GαX,A )> |Π].

Lemma 8.10 (Smallest Singular Value). With hith probability, p ˜ ε1 := kI − |X|−1 ΨX Ψ> k/n) k ≤ O( X α0 > ˜ σmin E[(G ) |Π] = Ω((p − q)n/k). X,A

p αmin ) from Proof: Note that ψi is a random vector with norm bounded by O( (α0 + 1)/b > Lemma 8.8 and E[ψi ψi ] = I. We know bound ε1 using Matrix Bernstein Inequality. Each αmin |X|). The variance σ 2 is bounded matrix ψi ψi> /|X| has spectral norm at most O((α0 +1)/b by

1

1

X X

2 2 > > E[ kψi k ψi ψi ] ≤ max kψi k E[ ψi ψi ] ≤ O((α0 + 1)/b αmin |X|).

|X|2

|X|2

i∈X

i∈X

Since O((α0 + 1)/αmin |X|) < 1, the variance dominates in Matrix Bernstein’s inequality. Let B := |X|−1 ΨX Ψ> X . We have with probability 1 − δ, q α0 > α)1/2 B Diag(ˆ α)1/2 FA> ), σmin (E[(GX,A ) |Π]) = |X|σmin (FA Diag(ˆ p = Ω( α bmin |X|(1 − 1 ) · σmin (FA )). From Lemma 8.8, with probability 1 − δ, s  |A|b αmin σmin (FA ) ≥  − O((|A| log k/δ)1/4 ) · σmin (P ). α0 + 1 Similarly other results follow.

173

0 ˆ ˆ ˆ> Whitening Error Consider rank-k SVD of |X|−1/2 (GαX,A )> k−svd = UA DA VA , and the ˆ A := UÂ D ˆ −1 and thus |X|−1 W ˆ > (Gα0 )> (Gα0 )W ˆ A = I. whitening matrix is given by W A A X,A X,A Now consider the singular value decomposition of

ˆ > E[(Gα0 )> |Π] · E[(Gα0 )|Π]W ˆ A = ΦDΦ ˜ >. |X|−1 W A X,A X,A ˆ A does not whiten the exact moments in general. On the other hand, consider W ˆ A ΦA D ˜ −1/2 Φ> . WA := W A A 0 Observe that WA whitens |X|−1/2 E[(GαX,A )|Π]

> 0 0 ˜ −1/2 Φ> ˜ > ˜ −1/2 Φ> |X|−1 WA> E[(GαX,A )> |Π]E[(GαX,A )|Π]WA = (ΦA D A = I A ) ΦA DA ΦA ΦA DA A

ˆ A may differ and we control the perturbations below. Now the ranges of WA and W ˆ A,B , R ˆ A,C are given by Also note that R α0 ˆ AB := |X|−1 W ˆ B> (Gα0 )> ˆ R X,B k−svd (GX,A )k−svd WA . −1

RAB := |X|

0 WB> E[(GαX,B )> |Π]

·

0 E[GαX,A |Π]

(8.25)

· WA .

(8.26)

0 Recall G is given by (8.21), and σmin E[GαX,A |Π] is given in Theorem 8.10. ˆ A , as in general there are Remark 1: Here it is really important to define WA using W α0 infinitely many matrices that whitens E[(GX,A |Π], and these matrices are not close in spectral norm.

ˆ

Remark 2: It is tempting to define the whitening error as WA − WA , however for tight results in the more general settings, it is better to define the whitening error as the following Lemma. Remark 3: The whitening matrix perturbation is very central in the perturbation bounds, ˜ W ) < 1. in our algorithm works when O(ε A Lemma 8.11 (Whitening matrix perturbations). With high probability, ˆ A − WA )k = O WA := k Diag(b α)1/2 FA> (W

(1 − ε1 )−1/2 G 0 σmin E[GαX,A |Π]

ˆ BR ˆ AB − WB RAB )k = O W˜ B := k Diag(b α)1/2 FB> (W

(1 − ε1 )−1/2 G 0 σmin E[GαX,B |Π]

Thus, ˜ WA = W˜ B = O

174

√

pk √ (p − q) n

! (8.27) ! (8.28)

(8.29)

Proof:

ˆ A ΦA D ˜ −1/2 Φ> or W ˆ A = WA ΦA D ˜ 1/2 Φ> we have that Using the fact that WA = W A A A A ˆ A − WA )k ≤ k Diag(b ˜ 1/2 Φ> k Diag(b α)1/2 FA> (W α)1/2 FA> WA (I − ΦA D A )k A 1/2 1/2 > ˜ )k = k Diag(b α) F WA (I − D A

≤ ≤

A 1/2 > ˜ k Diag(b α) FA WA (I − DA )(I ˜ Ak k Diag(b α)1/2 FA> WA k · kI − D 1/2

˜ 1/2 )k +D A

˜ A is a diagonal matrix. using the fact that D 0 Now note that WA whitens |X|−1/2 E[GαX,A |Π] = |X|−1/2 FA Diag(α1/2 )ΨX , where ΨX is defined in (8.24). Further it is shown in Lemma 8.10 that ΨX satisfies with probability 1 − δ that s ! (α + 1) k 0 · log ε1 := kI − |X|−1 ΨX Ψ> Xk ≤ O α bmin |X| δ Since ε1 1 when X, A are of size Ω(n), we have that |X|−1/2 ΨX has singular values around 0 1. Since WA whitens |X|−1/2 E[GαX,A |Π], we have 1/2 |X|−1 WA> FA Diag(α1/2 )ΨX Ψ> )FA> WA = I. X Diag(α

Thus, with probability 1 − δ, k Diag(b α)1/2 FA> WA k = O((1 − ε1 )−1/2 ). 0 0 Let E[(GαX,A )|Π] = (GαX,A )k−svd + ∆. We have

˜ A k = kI − ΦA D ˜ A Φ> kI − D Ak ˆ A> E[(Gα0 )> |Π] · E[(Gα0 )|Π]W ˆ Ak = kI − |X|−1 W X,A X,A ˆ > ∆> (Gα0 )k−svd + ∆(Gα0 )> ˆ Ak = O |X|−1 kW W A X,A X,A k−svd ˆ A> ∆> VÂ + VÂ> ∆W ˆ Ak , = O |X|−1/2 kW −1/2 ˆ = O |X| kWA kk∆k = O |X|−1/2 kWA kG , 0 since k∆k ≤ G + σk+1 (GαX,A ) ≤ 2G , using Weyl’s theorem for singular value perturbation 0 and the fact that G · kWA k 1 and kWA k = |X|1/2 /σmin E[GαX,A |Π] . We now consider perturbation of WB RAB . By definition, we have that 0 0 E[GαX,B |Π] · WB RAB = E[GαX,A |Π] · WA .

and 0 kWB RAB k = |X|1/2 σmin (E[GαX,B |Π])−1 .

175

Along the lines of previous derivation for WA , let ˆ BR ˆ AB )> · E[(Gα0 )> |Π] · E[Gα0 |Π]W ˆ BR ˆ AB = ΦB D ˜ B Φ> . |X|−1 (W B X,B X,B Again using the fact that |X|−1 ΨX Ψ> X ≈ I, we have α)1/2 FA> WA k, k Diag(b α)1/2 FB> WB RAB k ≈ k Diag(b and the rest of the proof follows.

Expected Error in a Single Whitened Vector Before going to the concentration bound ˆ > Gi,A . of the tensor, let us first take a look at the expected error in a whitened tensor W This is used in both the tensor concentration bounds and for showing the existence of good initilizers for tensor power method. Lemma 8.12 (Concentration of a random whitened vector). Conditioned on Π matrix, for all i 6∈ X, r

2 √

ˆT T

−1/2 T ˜ √pk 3/2 /(p − q) n). G − W F π bmin ) = O( E[ W A i,A A A i |Π] ≤ O(εWA α In particular, with probability at least 1/4,

2 √

ˆT T

−1/2 T ˜ √pk 3/2 /(p − q) n). bmin ) = O(

WA Gi,A − WA FA πi ≤ O(εWA α Proof. We have

ˆT T

ˆ

ˆT T

T T T ˆ W G − W F π ≤ ( W − W ) F π + W (G − W F π)

A i,A

A

A i,A

. A A i A A i A A −1/2

The first term is bounded by εWA α bmin by Lemma 8.11. ˆ> Now we bound the second term. Note that G> i,A is independent of WA , since they are related to disjoint subset of edges. The whitened neighborhood vector can be viewed as a sum of vectors: X X X > > ˆ ˆ ˆ ˆ ˆ A> G> G ( D U ) = D Gi,j (UÂ> )j . = G ( W ) = W A i,j i,j A A j i,A A j j∈A

j∈A

j∈A

Conditioned on πi and FA , Gi,j are Bernoulli variables with probability (FA πi )j . The goal

2 P

is to compute the variance of the sum, j∈A (FA πi )j (UÂ> )j , and then use Chebyshev’s inequality. By Wedin’s theorem, we know the span of columns of UÂ is O(G /σmin (GαX0 , A)) = O(WA ) close to the span of columns of FA . The span of columns of FA is the same as the span of rows in ΠA . In particular, let P rojΠ be the projection matrix of the span of rows in ΠA , we 176

have

ˆ ˆ>

UA UA − P rojΠ ≤ O(WA ).

Using the spectral norm bound, we have the Frobenius norm

√

ˆ ˆ>

UA UA − P rojΠ ≤ O(WA k) F

since they are rank k matrices. This implies that

X

2

ˆT j

= O(2WA k). ( U ) − P roj

A j Π j∈A

We also know

s

kP rojΠj k ≤

kπj k = O σmin (ΠA )



(α0 + 1)  , nb αmin

by the SVD of ΠA . P Now we can bound the variance of the vectors j∈A Gi,j (UÂT )j , since the variance of Gi,j is bounded by (FA πi )j (its probability), and the variance of the vectors is at most X j∈A

2 X X

2

2

(FA πi )j (UÂT )j − P rojΠj (FA πi )j P rojΠj + 2 (FA πi )j (UÂ> )j ≤ 2 j∈A

j∈A

≤2

X j∈A

≤O

X

2

2

ˆT j (FA πi )j max P rojΠj + max Pi,j

(UA )j − P rojΠ i,j

j∈A

|FA |1 (α0 + 1) nb αmin

j∈A

Chebyshev’s inequality implies that with probability at least 1/4 (or any other constant),

2

X

|FA |1 (α0 + 1)

T ˆ .

(Gi,j − FA πi )(UA )j ≤ O

nb αmin j∈A And thus, we have s ˆ AT (Gi,A − FA πi ) ≤ W

|FA |1 (α0 + 1)

ˆ T −1/2 bmin . · WA ≤ O WA α nb αmin

Combining the two terms, we have the result. Concentration of the Tensor Finally we are ready to prove the tensor concentration bound.

177

Theorem 8.13 (Perturbation of whitened tensor). With high probability,

α0 α0 ˜ ˜ ˆ ˆ ˆ ˆ ˆ εT := TY →{A,B,C} (WA , WB RAB , WC RAC ) − E[TY →{A,B,C} (WA , WB , WC )|ΠA , ΠB , ΠC ] √ 3/2 ! pk ˜ √ =O . (8.30) (p − q) n Proof:

In tensor T α0 in (8.7), the first term is X > > (α0 + 1)(α0 + 2) G> i,A ⊗ Gi,B ⊗ Gi,C . i∈Y

We claim that this term dominates in the perturbation analysis since the mean vector perturbation is of lower order. We now consider perturbation of the whitened tensor 1 X ˆ > > ˆ> W ˆ > G> ) ⊗ (R ˆ> W ˆ > G> ) . Λ0 = (WA Gi,A ) ⊗ (R AB B i,B AC C i,C |Y | i∈Y We show that this tensor is close to the corresponding term in the expectation in three steps. First we show it is close to Λ1 =

1 X ˆ > > ˆ > > ˆ > ˆ AB ˆ AC (WA FA πi ) ⊗ (R WB FB πi ) ⊗ (R WC FC πi ) . |Y | i∈Y

Then this vector is close to the expectation over ΠY . ˆ > FA π) ⊗ (R ˆ> W ˆ > FB π) ⊗ (R ˆ> W ˆ > FC π) . Λ2 = Eπ∼Dir(α) (W A AB B AC C ˆ A with WA . Finally we replace the estimated whitening matrix W ˜ > FB π) ⊗ (W ˜ > FC π) . Λ3 = Eπ∼Dir(α) (WA> FA π) ⊗ (W B C ˆ > G> as the sum of two terms W ˆ > FA πi and W ˆ > G> − W ˆ > FA πi For Λ0 − Λ1 , we write W A i,A A A i,A A (and similarly for B and C), then expand Λ0 into eight terms. The first term will be equal to Λ1 . In the remaining terms, the dominant term in the perturbation bound is, 1 X ˆ> > ˆ> W ˆ> ˆ> ˆ > W (G − FA πi ) ⊗ (R AB B FB πi ) ⊗ (RAC WC FC πi ). |Y | i∈Y A i,A We can view the tensor as a matrix (and by definition the spectral norm can only go down), then bound. Both

apply matrix Bernstein

terms of the variance can be bounded

ˆ> >

2 ˜ > 2 ˜ > 2 by |Y |E[ W (G − FA πi ) ] W FB W FC = O(nε2 k 3 ), the norm of each vector A

i,A

B

C

WA

can also be √ bounded. By√matrix Bernstein with high probability this term is bounded by 3/2 O(εWA k / n) ≤ O(εWA k) because k 2 < n. 178

ˆ A FA Diag(b For Λ1 − Λ2 , since W α)1/2 has spectral norm almost 1, by Lemma 8.14 the spectral norm of the perturbation is at most

3 1 X

ˆ

(Diag(b α)−1/2 πi )⊗3 − Eπ∼Dir(α) (Diag(b α)−1/2 πi )⊗3 α)1/2

WA FA Diag(b

|Y | i∈Y p √ ≤ O(1/b αmin n log n/δ). For the final term Λ2 − Λ3 , the dominating term is ˆ A − WA )FA Diag(b (W α)1/2 kΛ3 k ≤ εWA kΛ3 k Putting all these together, the third term kΛ2 − Λ3 k dominates and we get the result.

Lemma 8.14. With high probability,

1 X p 1

−1/2 ⊗3 −1/2 ⊗3 √ (Diag(b α) πi ) − Eπ∼Dir(α) (Diag(b α) π) ≤ O(· log n/δ)

|Y | α bmin n i∈Y √ ˜ = O(k/ n) Proof. The spectral norm of this tensor cannot be larger than the spectral norm of a k × k 2 matrix that we obtain be “collapsing” the last two dimensions (by definitions of norms). Let φi = Diag(ˆ α)−1/2 πi , the “collapsed” tensor is just the matrix φi (φi ⊗ φi )> (here we view 2 > φi ⊗ φi as a vector in Rk ). We apply Matrix Bernstein matrices Zi =

on the −2

φi (φ>i ⊗

φi ) . P 4 > > | max kφk E[φφ ] ≤ |Y |b αmin because E[φφ ] ≤ 2. Clearly, i∈Y E[Zi Zi ] ≤ |Y

P >

For the other variance term i∈Y E[Zi Zi ] , we have

X

> E[Z Z ] αmin E[(φ ⊗ φ)(φ ⊗ φ)> ]

i ≤ |Y |b i

i∈Y

. It remains to bound the norm of E[(φ ⊗ φ)(φ ⊗ φ)> ]. This is almost a fixed matrix for constant α0 , the details are lengthy because of the number of different sized entries, and are omitted here.

8.5.2

Good Initializer for Tensor Power Method

The eigenvectors of the orthogonally decomposable tensor obtained in Lemma 8.3 are columns of Φ. Recall that a vector u is (γ, Q)-good with respect to Φi if |u> Φi | > Q, and |λi u> Φi | ≥ (1 + γ)|λj u> Φj | for all λj ≤ λi . The following Lemma shows there exists very good initial ˆ > Gi,A for i ∈ B. vectors in among W A 179

˜ Lemma 8.15. With high probability, for any 10/9 > γ > 1 there are (γ, Ω(1))-good initial> ˆ izers with respect to all Φi ’s within the set {WA Gi,A : i ∈ B}. ˆ > Gi,A is close to Intuitively, this Lemma should be true because by Lemma 8.12, W A WA> FA πi = (Diag(b α)−1/2 πi )> Φ. This shows when this vector is expressed in the basis of Φ, the coefficients are Diag(b α)−1/2 π. The vector π is chosen from Dirichlet distribution so we expect π to have a single large entry, thus the vector should be a good initialzer. We first prove the intuition that Dirichlet vectors should have a single large entry with good probability. Without loss of generality we assume α b1 ≤ α b2 ≤ · · · ≤ α bk (this is trivially true in the special case, but the Lemma below applies to more general cases).

Lemma 8.16. Let v be the unit vector Diag(ˆ α)−1/2 π/ Diag(ˆ α)−1/2 π where π is chosen according to Dir(α). For any 1 < γ ≤ 5/4, with probability Ω(b αi /k γ ), the i-th entry of v is p −1/2 −1/2 at least α bi /(b αi + (α0 + 1)k), the j-th entry of v is at most 1/γ of the i-th entry for all j > i, and πi > 9/10(α0 + 1). Proof. We are assuming αi < 1 throughout. This is the reasonable case where Dirichlet distributions are sparse. When α0 < 1/10, we derive the bound directly from properties of Dirichlet distribution. Let (X1 , X2 , ..., Xk ) ∼ Dir(α), by properties of Dirichlet distribution, we know E[πi ] = α bi αi +1 2 and E[πi ] = α bi α0 +1 . Let pi = Pr[πi ≥ 9/10(α0 + 1)]. We have 9 9 ] + (1 − pi )E[πi2 |πi < ] 10(α0 + 1) 10(α0 + 1) 9 9 ≤ pi + (1 − pi ) E[πi |πi < ] 10(α0 + 1) 10(α0 + 1) 9 ≤ pi + (1 − pi ) E[πi ] 10(α0 + 1)

E[πi2 ] = pi E[πi2 |πi ≥

Here the second line is using the fact that when πi < t, πi2 ≤ tπi (let t = 9/10(α0 + 1) and take expectation). Substitute E[πi ] and E[πi2 ], we know pi ≥ α bi /10(α0 + 1) = Ω(b αi ). −1/2 Also, when πi ≥ 9/10(α0 + 1) ≥ 9/11, all other πj ’s are at most 2/11, so α bi πi is at least −1/2 4.5 · α bj πj for j > i. qP −1/2 k The only thing left in this case is to show α bi πi / bj−1 πj2 is large. j=1 α 1 Notice that for Dirichlet distributions, πi is independent of 1−π (π1 , π2 , ..., πi−1 , πi+1 , ..., πk ). i P −1 2 When we condition on πi ≥ 9/11, the expectation E[ j6=i α bi πj |πi ≥ 9/11] can only be P −1 2 smaller than E[ j6=i α bi πj ] ≤ 2k/(α0 + 1) ≤ 2k. By Markov, with probability 1/4 the true P value is below 4 times the expectation, so Pr[ j6=i α bi−1 πj2 ≤ 8k|πi ≥ 9/11] ≥ 1/4. P Now, with probability pi /4 we know πi ≥ 9/11, and j6=i α bi−1 πj2 ≤ 8k, which means v u k uX −1 √ −1/2 −1/2 −1/2 α bj πj2 ≥ α bi /(b αi + k) α bi πi /t j=1

180

The second case is when α0 ≥ 1/10. In this case we need to several properties of Dirichlet distribution, Gamma distribution and Gamma functions in Proposition 8.21. Let Yi ∼ Γ(αi , 1). By the proposition it is clear that v is just the normalized version of −1/2 (..., α bi Yi , ...). By property 5 in Proposition 8.21, if we pick t = log k, then each Yi is bigger than t with probability at most 1/k. Since they are independent, the probability that there exists some j 6= i such that Yj > t is at most (1 − 1/k)k ≤ e−1 . Also, the expected value of P −1/2 bj Yj2 ≤ 2kα j6=i α P 0 , the probability that it is larger than 20kα0 is at most 1/10. The expected value of j6=i Yj ≤ α0 , the probability that it is larger than 10α0 is at most 1/10. By bound, with probability at least e−1 − 1/10 − 1/10 > 0.1, for all j 6= i, Yj < log k, P union P bj−1 Yj2 ≤ 20kα0 , and j6=i Yj ≤ 10α0 . j6=i α However, when we pick t = γ log k, we know with probability Cαi /4tk γ = Ω(b αi /k γ log k) (here we know αi ≥ α bi /10 because α0 > 1/10), Yi is at least γ log k. Since Yi and all the other Y ’s are independent, with probability at least Ω(b αi /k γ log k), we have Yi ≥ γ log k, Yj ≤ log k(j 6= i), and v u k q p uX −1 −1/2 −1/2 −1/2 −1/2 2 t + α0 k). log k/ α αi α b i Yj ≥ α bi bi−1 log2 k + 10kα0 ≤ α bi /(b α bi Yi / j=1

Finally, πi ≥

Yi P Yi + j6=i Yj

≥ log t/(log t + 10α0 ) > 9/10(α0 + 1).

Remark: The probability can be improved when α bmax /b αmin is small. In particular for ˜ −γ0 ). We will the special case we are interested in, the probability of a good initializer is O(k use this bound. Now we are ready to prove Lemma 8.15. Proof of Lemma 8.15: By Lemma 8.16, the probability that πj gives a good vector for index i (i ∈ k) is at least 1/k γ0 . When that happens the normalization of sj = W T FA πj is ˜ a (γ0 , Ω(1))-good initializer. Before normalization, by Lemma 8.16 we know πj (i) ≥ 9/10(α0 + 1), therefore every −1/2 good vector sj for index i had projection at least 9b αi /10(α0 + 1) on Φi , and at most s> j Φi /γ0 projection on all Φt (t > i).

ˆ A , πj , ˆ > G> − W > FA πj By Lemma 8.12, with constant probability, conditioned on W

W

≤ A j,A A √ Therefore with constant probability the inner product with Φi can change by at O(εWA k). √ ˆ > Gj,A is a (γ0 − ∆, Ω(1))-good ˜ most O(εWA k), when that happens W intializer for arbitrary small ∆. Since the constant probability in Lemma 8.12 is conditioned on πj , we can multiply the probability in Lemma 8.12 and Lemma 8.16 (we use the bound in the Remark following that ˜ ˜ −γ0 ). Lemma). Therefore, the probability that a vector is (γ0 − ∆, Ω(1))-good is at least Ω(k ˆ > Gi,A The size |B| is much larger than k γ0 log n (because n > k 2 ), by union bound the set W A has good initializer for all i with high probability. 181

8.5.3

Reconstruction After Tensor Power Method

Reconstructing Π Let (M )i and (M )i denote the ith row and ith column in matrix M respectively. Let Z ⊆ Ac denote any subset of nodes not in A, considered in Procedure LearnPartition Community. Define ˜ Z := Diag(λ)−1 Φ ˆ >W ˆ A> G> Π Z,A .

(8.31)

ˆ Z is obtained by thresholding Π ˜ Z element-wise with threshold Recall that the final estimate Π ˜ Z. τ in Procedure 8.2. We first analyze perturbation of Π ˜ Z ). With high probability, Lemma 8.17 (Reconstruction Guarantees for Π √ pk i i ˜ √ kΠZ k . επ := max k(ΠZ ) − (ΠZ ) k = O i∈Z (p − q) n ˆ > G> . ˜ Z )i = λ−1 ((Φ) ˆ i )> W Proof: We have (Π A Z,A i 1/2 > ˜i Notice that ΠiZ = Diag(b αi )Φ> W F i A A ΠZ , we can write the difference between ΠZ and ΠiZ in four terms > −1 ˆ > > ˜ iZ − ΠiZ =(Diag(λi )−1 − Diag(b Π αi1 /2))Φ> i WA FA ΠZ + Diag(λi ) (Φi − Φi ) WA FA ΠZ > −1 ˆ > ˆ > ˆ> ˆ > + Diag(λi )−1 Φ i (WA − WA )FA ΠZ + Diag(λi ) Φi WA (GA,Z − FA ΠZ )

We will now use perturbation bounds for each of the terms to get the result. The first term is bounded by √ 1/2 k Diag(λi )−1 − Diag(b αi )k · kΦk · kWA> FA k · kΠZ k ≤ εT /k · 2 · 2 k · kΠZ k ≤ O(εWA ) kΠZ k). The second term is bounded by √ √ √ ˆ i k · kW > FA k · kΠZ k ≤ 2/ k · εT / k · 2 k · kΠZ k ≤ O(εW kΠZ k). k Diag(λi )−1 k · k(Φ)i − Φ A A The third term is bounded by √ √ ˆ i k · k(W ˆ > − W > )FA k · kΠZ k ≤ 2/ k · 1 · εW k · kΠZ k = O(εW kΠZ k), k Diag(λi )−1 k · kΦ A A A A from Lemma 8.11 and finally, we bound the fourth term by

√ √

ˆ > −1 ˆ k Diag(λi ) k · kΦi k · WA · kGA,Z − FA ΠZ k ≤ O 1/ k · 1/σmin (FA ) · pn √ ≤ O(εWA / k) from Lemma 8.9 and Lemma 8.10.

182

The first three terms dominates the last term (by Lemma 8.8), therefore we get the result. ˜ Z , the the resulting matrix Π ˆ Z has We now show that if we threshold the entries of Π rows close to those in ΠZ in `1 norm. Which is stronger than the `2 guarantee we had in the previous Lemma. ˆ Z := Thres(Π ˜ Z , τ ), where τ = Lemma (Guarantees after thresholding). For Π √ 8.18 √ ˜(Θ)(επ k/ n) = O(ε ˜ W ) is the threshold, we have with high probability that √ pk n i i ˆ ˜ ˜ √ · . επ,`1 := max |(ΠZ ) − (ΠZ ) |1 = O(εWA n/k) = O i∈[k] (p − q) n k Remark: Notice that the `1 norm of a row of Π should be around n/k by Lemma 8.8, so this bound makes sense as long as εWA 1. ˆ Z (i, j) > 0} = {j : Π ˜ Z (i, j) > τ }, Si0 = {j : ΠZ (i, j) > τ /2} and Proof: Let Si := {j : Π 1 0 Si = {j : ΠZ (i, j) > 2τ }. The idea is Si should contain almost all entries in Si , and Si should contain almost all entries of Si1 (approximately Si1 ⊂ Si ⊂ Si0 ). For a vector v, let vS denote the sub-vector by considering entries in set S. We now have ˆ Z )i − (ΠZ )i |1 ≤ |(Π ˆ Z )i − (ΠZ )i |1 + |(ΠZ )i c |1 + |(Π ˆ Z )i c |1 |(Π Si Si Si Si Let us first assume α0 < 1 for simplicity. ˆ Z )i = 0 so that term disappears. By thresholding procedure (Π Si For the first term, from Lemma 8.20, we have P[Π(i, j) ≥ τ /2] ≤ 8b αi log(2/τ ). Since 0 ˜ Π(i, j) are independent for j ∈ Z, by Chernoff bound maxi∈[k] |Si | < O(n/k). We claim that most of Si should be in Si0 , because for every j ∈ Si \Si0 , we have |ΠZ (i, j)− ˜ Z (i, j)| ≥ τ /2, for j ∈ Si , and number of such terms is at most 4ε2π /τ 2 = O(n/k). ˜ Π Thus, ˆ Z )i |1 ≤ επ |(ΠZ )iSi − (Π Si

p

˜ π |Si | ≤ O(ε

p n/k).

For the other term, from Lemma 8.20, we have E[ΠZ (i, j) · δ(ΠZ (i, j) ≤ 2τ )] ≤ α bi (2τ ). ˜ · n/k) ≤ O(ε ˜ π Concentration bounds show that maxi∈[k] (ΠZ )iS 1,c ≤ O(τ i

p n/k).

Again we claim that most indices in Sic are also in Si1,c , because for every j ∈ Sic and ˜ Z (i, j)| ≥ τ . This implies that there are at most O(n/k) ˜ j 6∈ Si1,c , we have |ΠZ (i, j) − Π such

183

entries and X j∈Sic \Si1,c

q s X 1,c c ΠZ (i, j) ≤ |Si \Si | [ΠZ (i, j)]2 j∈Sic \Si1,c

s p ˜ ≤ O( n/k) · 4

X j∈Sic \Si1,c

˜ Z (i, j)]2 [ΠZ (i, j) − Π

p ˜ n/kεπ ). ≤ O( Case α0 ∈ [1, k): From Lemma 8.20, we see that the results hold when we replace α bmax with αmax , losing factors of (α0 + 1). Reconstruction of P Finally we would like to use the community vectors Π and the adjacency matrix G to estimate the P matrix. Recall that in the generative model, we ˆ † )> GΠ ˆ † . However, our have E[G] = Π> P Π. Thus, a straightforward estimate is to use (Π ˆ are not strong enough to control the error on Π ˆ † (since we only have row-wise guarantees on Π `1 guarantees). ˆ for Π ˆ † and use it to find Pˆ in Algorithm 8.1. We propose an alternative estimator Q ˆ is given by Recall that the i-th row of Q î ˆ i := (α0 + 1) Π − α0 1> . Q ˆ i |1 n |Π Define Q using exact communities, i.e. Qi := (α0 + 1)

Πi α0 − 1> . i |Π |1 n

0 +1 Note that if we define Q as (Q0 )i := αnb Πi − αn0 1> , then EΠ [Q0 Π> ] = I, and we show that αi Q0 is close to Q since E[|Πi |1 ] = nb αi . For Q, we normalize by |Πi |1 in order to ensure that the first term of Q has equal column norms (which will be used in our proofs subsequently). ˆ is close to Π† , and therefore, Pˆ := Q ˆ > GQ ˆ is close to P w.h.p. We show below that Q

Lemma 8.19 (Reconstruction of P ). With probability 1 − 5δ, ˜ W (p − q)) εP := max |Pî,j − Pi,j | ≤ O(ε A i,j∈[n]

Remark: Note that again this bound is meaningful when εWA 1. Proof: The proof goes in three steps: ˆ > P ΠQ ˆ > ≈ QG ˆ Q ˆ >. P ≈ QΠ> P ΠQ> ≈ QΠ

184

Note that EΠ [ΠQ> ] = I and by Bernstein’s bound, we can claim that ΠQ> is close to I and can show that the i-th row of QΠ> satisfies √ ˜ ∆i := |(QΠ> )i − e> i |1 = O(k/ n). Moreover, |(Π> P ΠQ> )i,j − (Π> P )i,j | ≤ |(Π> P )i ((Q)j − ej )| = |(Π> P )i ∆j | √ ≤ O(pk/ n) ≤ O(εWA (p − q)). using the fact that (Π> P )i,j ≤ p. This can be translated P and QΠ> P ΠQ> because the dominating

> to a> bound>between

term would be |Q|1 (Π P ΠQ ) − (Π P ) ∞ . ˆ is close to Q and by Lemma 8.17 |Qi − Q ˆ i |1 ≤ O(εW ) , which Now we claim that Q B B A implies ˆ > )i,j | = |(Π> P Π)i (Q> − Q ˆ > )j | |(Π> P ΠQ> )i,j − (Π> P ΠQ

ˆ > )j |1 = ((Π> P Π)i − q1> )|(Q> − Q ≤ O(εWA (p − q)).

ˆ j )1 = 0, due to the normalization. using the fact that (Qj − Q ˆ > )i,j (Π> P ΠQ ˆ > )i,j | are small by standard concentration bounds (and the Finally, |(GQ differences are of lower order). Combining these |Pî,j − Pi,j | ≤ O(εP ). Support Recovery Proof of Theorem 8.6: From Lemma 8.19, |Pî,j − Pi,j | ≤ O(εP ) which implies bounds for the average of diagonals H and average of off-diagonals L: |H − p| = O(εP ),

|L − q| = O(εP ).

In particular, |L + (H − L) · 3ξ4 − (q + (p − q) 3ξ4 )| ≤ O(εP ), which means the threshold is very close to (q + (p − q) 3ξ4 ). On similar lines as the proof of Lemma 8.19 and from independence of edges used to ˆ we also have define Fˆ from the edges used to estimate Π, |Fˆ (j, i) − F (j, i)| ≤ O(εP ). Note that Fj,i = q + Πi,j (p − q). The threshold ξ satisfies ξ = Ω(εWA ) and in particular (p − q)ξ > O(εP ).

185

If Πi,j > xi, then Fi,j > q+(p−q)ξ and Fˆj,i > q+(p−q)ξ−O(εP ) > (q+(p−q) 3ξ4 )+O(εP ), thus all large entries are identified. Similarly, if Πi,j < xi/2, Fˆj,i cannot be larger than the threshold.

8.5.4

Dirichlet Properties

Lemma 8.20 (Marginal Dirichlet distribution in sparse regime). For Z ∼ B(a, b), the following results hold: Case b ≤ 1, C ∈ [0, 1/2]: a a+b a E[Z · δ(Z ≤ C)] ≤ C · E[Z] = C · a+b Pr[Z ≥ C] ≤ 8 log(1/C) ·

Case b ≥ 1, C ≤ (b + 1)−1 :

(8.32) (8.33)

we have Pr[Z ≥ C] ≤ a log(1/C) E[Z · δ(Z ≤ C)] ≤ 6aC

(8.34) (8.35)

Remark: The guarantee for b ≥ 1 is worse and this agrees with the intuition that the Dirichlet vectors are more spread out (or less sparse) when b = α0 − αi is large. Proof. We have C

1 xa (1 − x)b−1 dx 0 B(a, b) Z (1 − C)b−1 C a ≤ x dx B(a, b) 0 (1 − C)b−1 C a+1 = (a + 1)B(a, b)

E[Z · δ(Z ≤ C)] =

Z

For E[Z · δ(Z ≥ C)], we have, 1

1 xa (1 − x)b−1 dx B(a, b) C Z 1 Ca ≥ (1 − x)b−1 dx B(a, b) C (1 − C)b C a = bB(a, b)

E[Z · δ(Z ≥ C)] =

Z

186

The ratio between these two is at least E[Z · δ(Z ≥ C)] (1 − C)(a + 1) 1 ≥ ≥ . E[Z · δ(Z ≤ C)] bC C The last inequality holds when a, b < 1 and C < 1/2. The sum of the two is exactly E[Z], so when C < 1/2 we know E[Z · δ(Z ≤ C)] < C · E[Z]. 2a Next we bound the probability Pr[Z ≥ C]. Note that Pr[Z ≥ 1/2] ≤ 2E[Z] = a+b by Markov’s inequality. Now we show Pr[Z ∈ [C, 1/2]] is not much larger than Pr[Z ≥ 1/2] by bounding the integrals. Z 1 Z 1 a−1 b−1 (1 − x)b−1 dx = (1/2)b /b. x (1 − x) dx ≥ A= 1/2

1/2

Z

1/2

x

B= C

a−1

b−1

(1 − x)

b−1

≤ (1/2)

Z

1/2

xa−1 dx

C

0.5a − C a ≤ (1/2)b−1 a 1 − (1 − a log 1/C) ≤ (1/2)b−1 a b−1 = (1/2) log(1/C). The last inequality uses the fact that ex ≥ 1 + x for all x. Now Pr[Z ≥ C] = (1 +

2a a B ) Pr[Z ≥ 1/2] ≤ (1 + 2b log(1/C)) ≤ 8 log(1/C) · A a+b a+b

and we have the result. Case 2: When b ≥ 1, we have an alternative bound. We use the fact that if X ∼ Γ(a, 1) and Y ∼ Γ(b, 1) then Z ∼ X/(X + Y ). Since Y is distributed as Γ(b, 1), its PDF is 1 xb−1 e−x . This is proportional to the PDF of Γ(1) (e−x ) multiplied by a increasing function Γ(b) xb−1 . Therefore we know Pr[Y ≥ t] ≥ PrY 0 ∼Γ(1) [Y 0 ≥ t] = e−t . Now we use this bound to compute the probability that Z ≤ 1/R for all R ≥ 1. This is equivalent to

187

X 1 Pr[ ≤ ]= X +Y R

∞

Z 0

P r[X = x]P r[Y ≥ (R − 1)X]dx ∞

1 a−1 −Rx x e dx Γ(a) 0 Z ∞ 1 a−1 −y −a y e dy =R Γ(a) 0 = R−a ≥

Z

In particular, Pr[Z ≤ C] ≥ C a , which means Pr[Z ≥ C] ≤ 1 − C a ≤ a log(1/C). For E[Zδ(Z < C)], the proof is similar as before: Z

C

1 C a+1 xa (1 − x)b dx ≤ B(a, b) B(a, b)(a + 1)

1

1 C a (1 − C)b+1 xa (1 − x)b dx ≥ B(a, b) B(a, b)(b + 1)

P = E[Zδ(Z < C)] = 0

Q = E[Zδ(Z ≥ C)] = Now E[Zδ(Z ≤ C)] ≤

P E[Z] Q

Z

C

≤ 6aC when C < 1/(b + 1).

Properties of Gamma and Dirichlet Distributions Recall Gamma distribution Γ(α, β) is a distribution on nonnegative real values with density β α α−1 −βx function Γ(α) x e . Proposition 8.21 (Dirichlet and Gamma distributions). The following facts are known for Dirichlet distribution and Gamma distribution. 1. Let Yi ∼ Γ(αi , 1) be independent random variables, then the vector (Y1 , Y2 , ..., Yk )/ is distributed as Dir(α).

Pk

i=1

Yk

2. The Γ function satisfies Euler’s reflection formula: Γ(1 − z)Γ(z) ≤ π/ sin πz . 3. The Γ(z) ≥ 1 when 0 < z < 1. 4. There exists a universal constant C such that Γ(z) ≤ C/z when 0 < z < 1. 5. For Y ∼ Γ(α, 1) and t > 0 and α ∈ (0, 1), we have α α−1 −t t e ≤ Pr[Y ≥ t] ≤ tα−1 e−t , 4C

(8.36)

and for any η, c > 1, we have P[Y > ηt|Y ≥ t] ≥ (cη)α−1 e−(η−1)t . 188

(8.37)

Proof: and

and Z t

∞

The bounds in (8.36) is derived using the fact that 1 ≤ Γ(α) ≤ C/α when α ∈ (0, 1) Z ∞ Z ∞ 1 1 αi −1 −x x e dx ≤ tαi −1 e−x dx ≤ tαi −1 e−t , Γ(αi ) Γ(αi ) t t 1 1 xαi −1 e−x dx ≥ Γ(αi ) Γ(αi )

Z

2t

x t

αi −1 −x

e dx ≥ αi /C

Z t

2t

(2t)αi −1 e−x dx ≥

αi αi −1 −t t e . 4C

189

Appendix A Matrices and Tensors A.1 A.1.1

Matrix Notations Basic Notations, Rows and Columns

A matrix is a rectangular array of numbers. Throughout this thesis we only consider real matrices. Most notations in this paper are commonly used. We used bold letters for vectors like v. All the vectors are by default column vectors. The entries of a vector is usually denoted as ui , but sometimes we also use u[i] (especially when the vector itself has many superscripts or subscripts). The all 1’s and all 0’s vectors of length m are denoted as 0m and 1m , but the length is usually omitted as it is clear from the context. The basis vectors ei are vectors which has only one entry ei [i] = 1 and all other entries are 0, again the length of the vector will be clear from context. Inner products of vectors is denoted as u · v, although sometimes equivalent matrix notation u> v is also used (mainly in manipulations that also involves matrices). We use I to denote the identity matrix, M > to denote the transpose of a matrix, and M −1 to denote the inverse of a matrix. A matrix is symmetric if M = M > . It is orthonormal if M M > = M > M = I. The entries of a matrix M is usually denoted as Mi,j , however sometimes we also use M [i, j]. To represent a diagonal matrix, we use Diag(u) or Diag(ui ) where u is the vector of the diagonal entries. We use Mi to denote the i-th column of a matrix, and M j to denote the j-th row of a matrix. To denote a submatrix, let M be an n × m matrix, A ⊂ [n] and B ⊂ [m], then MA,B or M [A, B] denote the submatrix of M with indices in A and B. When either A or B is a singleton we also abuse notation to use Mi,B instead of M{i},B .

A.1.2

Norms and Eigenvalues

We again use commonly qP accepted definitions for vector and matrix norms. The `2 norm k 2 of a vector kuk = i=1 ui (notice that the subscript 2 is omitted). The `1 norm is Pk |u|1 = i=1 |ui |. Infinity norm is also sometimes used but we will simply write maxk |uk |. 190

For a matrix M , if M v = λv, then λ is an eigenvalue and v is an eigenvector of the matrix. We will only talk about eigenvalues and eigenvectors of real symmetric matrices, so the eigenvalues and eigenvectors are themselves real. The singular values of a matrix M are the square roots of the eigenvalues of M > M , usually denoted by σi (M )’s (and usually we assume σ1 ≥ σ2 ≥ · · · ≥ σm ). The spectral norm of a matrix is the same as the operator norm, and has several equivalent definitions: kM k = σ1 (M ) = max kM uk = max u> M u. kuk≤1

kuk≤1

Here the last equality only holds when M is symmetric. The Frobenius norm of a matrix is defined as follows s sX X 2 Mi,j = σi (M )2 . kM kF = i,j

i

There are several standard facts about these norms 1. kA + Bk ≤ kAk + kBk. 2. kABk ≤ kAk kBk 3. kABkF ≤ kAk kBkF . 4. For rank r matrices kAk ≤ kAkF ≤

√

r kAk.

We call a symmetric matrix is positive semidefinite (PSD) if all its eigenvalues are nonnegative. The smallest singular value of a matrix is usually denoted as σmin (M ). The condition number of a matrix is equal to the ratio between its largest and smallest singular values, κ(M ) = kM k /σmin (M ). When the condition number is small we say the matrix is well conditioned.

A.1.3

Singular Value Decomposition and Eigenspace Perturbation

Every matrix can be decomposed into the form M = U DV > , where U > U = I and V V > = I, and D is a diagonal matrix whose values are the singular values of M . The columns of U and V are called left and right singular vectors. When M is symmetric PSD we have U = V . the top-k singular values to approximate a matrix M , Mk−svd = Pk Often we use > This is the optimal approximation using rank k matrices in terms i=1 Ui σi (M )Vi . of both spectral norm and Frobenius norm. The Moore-Penrose pseudo-inverse of an n × m (n > m) full rank matrix M is M † = V D−1 U > , which always satisfies M † M = I (but keep in mind that there are more than one matrices that satisfy this equation). When a matrix M is symmetric PSD that has singular value decomposition M = U DU > , the p-th power p ∈ R of M is defined as M p = p U Diag(Di,i )U > . 191

There are several theorems controlling the eigenvalues and eigenvectors of matrices when they are perturbed. Weyl’s Theorem Weyl’s Theorem bounds the perturbation in the singular values when a noise with bounded spectral norm is added to the matrix. Theorem A.1 (Weyl’s theorem; Theorem 4.11, p.204 in [147].). Let A, E ∈ Rm×n with m > n, then max |σi (A + E) − σi (A)| ≤ kEk . i∈[n]

Wedin’s Theorem: Wedin’s Theorem bounds the perturbation of subspaces spanned by singular vectors when a noise with bounded spectral norm is added to the matrix. When the spectral norm of the noise is much smaller than the top-k singular values, the space spanned by the top-k singular vectors does not change much. Theorem A.2 (Wedin’s theorem; Theorem 4.4, p. 262 in [147].). Let A, E ∈ Rm×n with m ≥ n be given. Let A have the singular value decomposition    >  Σ1 0 U1  U2>  A V1 V2 =  0 Σ2  . U3> 0 0 ˜ 1, Σ ˜ 2 , V˜1 V˜2 ). Let Let A˜ := A + E, with analogous singular value decomposition (U˜1 , U˜2 , U˜3 , Σ ˜ Φ be the matrix of canonical angles between range(U1 ) and range(U1 ), and Θ be the matrix of canonical angles between range(V1 ) and range(V˜1 ). If there exists δ, α > 0 such that ˜ 1 ) ≥ α + δ and maxi σi (Σ2 ) ≤ α, then mini σi (Σ max{k sin Φk2 , k sin Θk2 } ≤

A.2

kEk2 . δ

Tensor Notations

N A real p-th order tensor T ∈ pi=1 is a member of the tensor product of Euclidean spaces Rni , i ∈ [p]. WeN generally restrict to the case where n1 = n2 = · · · = np = n, N and simply write T ∈ p Rn . For a vector v ∈ Rn , we use v ⊗p := v ⊗ v ⊗ · · · ⊗ v ∈ p Rn to denote its p-th tensor power. As is the case for vectors (where p = 1) and matrices (where p = 2), we may identify a p-th order tensor with the p-way array of real numbers [Ti1 ,i2 ,...,ip : i1 , i2 , . . . , ip ∈ [n]], where Ti1 ,i2 ,...,ip is the (i1 , i2 , . . . , ip )-th coordinate of A (with respect to a canonical basis). An useful way to work with tensors is to view the tensor as a multilinear form. We use the following notation1 : 1

the notation first appeared in [113]

192

Notation (Multilinear Form). We can consider T to be a multilinear map in the following sense: for a set of matrices {Vi ∈ Rn×mi : i ∈ [p]}, the (i1 , i2 , . . . , ip )-th entry in the p-way array representation of T (V1 , V2 , . . . , Vp ) ∈ Rm1 ×m2 ×···×mp is X [T (V1 , V2 , . . . , Vp )]i1 ,i2 ,...,ip := Tj1 ,j2 ,...,jp [V1 ]j1 ,i1 [V2 ]j2 ,i2 · · · [Vp ]jp ,ip . (A.1) j1 ,j2 ,...,jp ∈[n]

Note that if T is a matrix (p = 2), then T (V1 , V2 ) = V1> T V2 . Similarly, for a matrix M and vector v ∈ Rn , we can express M v as M (I, v) = M v ∈ Rn , where I is the n × n identity matrix. As a final example of this notation, observe T (ei1 , ei2 , . . . , eip ) = Ti1 ,i2 ,...,ip , where {e1 , e2 , . . . , en }N is the canonical basis for Rn . Most tensors T ∈ p Rn considered in this thesis will be symmetric (sometimes called supersymmetric), which means that their p-way array representations are invariant to permutations of the array indices: i.e., for all indices i1 , i2 , . . . , ip ∈ [n], Ti1 ,i2 ,...,ip = Tiπ(1) ,iπ(2) ,...,iπ(p) for any permutation π on [p]. It can be checked that this reduces to the usual definition of a symmetric matrix for p = 2. Np n Definition A.3 (Tensor Rank). The rank of a p-th order tensor T ∈ R is the smallest P non-negative integer k such that T = kj=1 u1,j ⊗ u2,j ⊗ · · · ⊗ up,j for some ui,j ∈ Rn , i ∈ [p], j ∈ [k], and the symmetric rank P of a symmetric p-th order tensor T is the smallest n non-negative integer k such that T = kj=1 u⊗p j for some uj ∈ R , j ∈ [k]. The notion of rank readily reduces to the usual definition of matrix rank when p = 2, as revealed by the singular value decomposition. Similarly, for symmetric matrices, the symmetric rank is equivalent to the matrix rank as given by the spectral theorem. The notion of tensor (symmetric) rank is considerably more delicate than matrix (symmetric) rank. For instance, it is not clear a priori that the symmetric rank of a tensor should even be finite [43]. In addition, removal of the best rank-1 approximation of a (general) tensor may increase the tensor rank of the residual [145]. Definition A.4 (Spectral Norm of Tensor). The spectral norm of a tensor T is defined by viewing it as a linear operator. kT k =

sup kvi k≤1,i∈[p−1]

kT (v1 , v2 , . . . , vp−1 , I)k =

sup kvi k≤1,i∈[p]

kT (v1 , v2 , . . . , vp )k .

For symmetric tensors it is easy to show that the maximum is always achieved when all the vectors are the same, so for symmetric T , kT k = supkvk≤1 T (v, v, ..., v). 193

Appendix B Concentration Bounds B.1

Concentration Bounds for Single Variables or Functions

Given a random variable X, concentration bounds for X tries to bound the probability that X is far from its expectation E[X]. All the random variables in this thesis will have bounded moments. The most basic inequalities are the Markov’s inequality and Chebyshev’s inequality: Theorem B.1 (Markov’s Inequality). If X is a nonnegative random variable, then for any k≥1 Pr[X ≥ kE[X]] ≤ 1/k. Theorem B.2 (Chebyshev’s Inequality). For any random variable X and for any k ≥ 1 √ Pr[|X − E[X]| > k Var X] ≤ 1/k 2 . Much stronger guarantees can be obtained for sum of independent random variables, including Chernoff bounds and Bernstein’s inequality. Theorem B.3 (Chernoff Bounds). Let X1 , ..., Xn be P independent random variables P taking values in {0, 1}, Pr[Xi = 1] = pi . Then if we let X = ni=1 Xi and µ = E[X] = ni=1 pi , for any δ > 0, µ eδ Pr[X > (1 + δ)µ] ≤ , (1 + δ)(1+δ) µ e−δ Pr[X < (1 − δ)µ] ≤ . (1 − δ)(1−δ)

194

Theorem B.4 (Bernstein’s Inequality). Let X1 , ..., Xn be independent random variables with mean 0. Suppose |Xi | ≤ M almost surely for all i, then for all t > 0, Pr[|

n X i=1

t2 /2 i=1 Var[Xi ] + M t/3

Xi | > t] ≤ 2 exp − Pn

.

Finally, when the random variables are not independent, or when we are interested in other functions of random variables, the common tool is martingales and Azuma’s inequality. However here we just state the bounded differences/McDiarmid’s inequality as it is the simplest way to obtain polynomial bounds for the quantities of interest in Chapter 7. Theorem B.5 (McDiarmid’s Inequality). Consider independent random variables X1 , ..., Xn ∈ X , and a real valued function f (X1 , X2 , ..., Xn ). If for all index i ∈ [n], and for all x1 , x2 , ..., xn , x0i ∈ X , the function satisfies |f (x1 , x2 , ..., xi−1 , xi , xi+1 , ..., xn ) − f (x1 , x2 , ..., xi−1 , x0i , xi+1 , ..., xn )| ≤ ci , then we have for all t > 0 Pr[|f (X1 , ..., Xn ) − E[f (X1 , ..., Xn )]| ≥ t] ≤ 2 exp

2t2 Pn

2 i=1 ci

.

For example, the simplest way to show the sum of independent matrices/tensors have concentrated spectral norm is to let f be the spectral norm of the sum. However, this usually won’t give the tightest bound. In order to better understand the concentration behavior we need more precise bounds in the next Section.

B.2

Concentration Bounds for Vectors and Matrices

Vectors and matrices also have concentration bounds that behave similarly to the Bernstein’s inequality. Proposition B.6 (Vector Bernstein Inequality). Let z = (z1 , z2 , ..., zn ) ∈ Rn be a random vector with independent entries, E[zi ] = 0, E[zi2 ] = σi2 , and Pr[|zi | ≤ 1] = 1. Let A = [a1 |a2 | · · · |an ] ∈ Rm×n be a matrix, then v u n √ uX kai k2 σi2 + (4/3) max kai k t] ≥ 1 − e−t . Pr[kAzk ≤ (1 + 8t)t i∈[n]

i=1

Proposition B.7 (Matrix Bernstein Inequality). Suppose Z =

P

j

1. Wj are independent random matrices with dimension d1 × d2 , 2. E[Wj ] = 0 for all j,

3. kWj k ≤ R almost surely.

195

Wj where

o

P n P

> > Let d = d1 + d2 , and σ = max j E[Wj Wj ] , j E[Wj Wj ] , then we have −t2 /2 Pr[kZk ≥ t] ≤ d · exp . σ 2 + Rt/3 2

196

Bibliography [1] M. Aharon. Overcomplete Dictionaries for Sparse Representation of Signals. PhD thesis, Technion - Israel Institute of Technology, 2006. [2] M. Aharon, M. Elad, and A. Bruckstein. K-svd: An algorithm for designing overcomplete dictionaries for sparse representation. Trans. Sig. Proc., 54(11):4311–4322, November 2006. [3] A. Aho, J. Ullman, and M. Yannakakis. On notions of information transfer in vlsi circuits. In Proceedings of the 15th symposium on Theory of Computing, pages 133– 139, 1983. [4] Edoardo M. Airoldi, David M. Blei, Stephen E. Fienberg, and Eric P. Xing. Mixed membership stochastic blockmodels. Journal of Machine Learning Research, 9:1981– 2014, June 2008. [5] N. Alon and D. Kleitman. Piercing convex sets and the hadwigder debrunner (p, q)problem. Advances in Mathematics, 96:103–112, 1992. [6] N. Alon and S. Onn. Separable partitions. Discrete Applied Math, pages 39–51, 1999. [7] A. Anandkumar, D. Hsu, F. Huang, and S.M. Kakade. Learning mixtures of tree graphical models. In Advances in Neural Information Processing Systems 25, 2012. [8] A. Anandkumar, D. Hsu, and S.M. Kakade. A method of moments for mixture models and hidden markov models. In Proc. of Conf. on Learning Theory, June 2012. [9] Anima Anandkumar, Dean P. Foster, Daniel Hsu, Sham M. Kakade, and Yi-Kai Liu. A spectral algorithm for latent Dirichlet allocation. In Advances in Neural Information Processing Systems 25, 2012. [10] Andreas Argyriou, Theodoros Evgeniou, and Massimiliano Pontil. Multi-task feature learning. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 41–48, Cambridge, MA, 2007. MIT Press. [11] Sanjeev Arora, Rong Ge, Ravindran Kannan, and Ankur Moitra. Computing a nonnegative matrix factorization – provably. In Proceedings of the 44th Symposium on Theory of Computing Conference, STOC 2012, New York, NY, USA, May 19 - 22, pages 145–162, 2012. 197

[12] Sanjeev Arora, Rong Ge, and Ankur Moitra. Learning topic models – going beyond svd. In Proceedings of the 53rd Annual IEEE Symposium on Foundations of Computer Science, pages 1–10, 2012. [13] Sanjeev Arora, Rong Ge, and Ankur Moitra. New algorithms for learning incoherent and overcomplete dictionaries. arXiv, 1308.6273, 2013. [14] Sanjeev Arora, Rong Ge, Sushant Sachdeva, and Grant Schoenebeck. Finding overlapping communities in social networks: toward a rigorous approach. In Proceedings of the 13th ACM Conference on Electronic Commerce, EC ’12, pages 37–54, New York, NY, USA, 2012. ACM. [15] T. Austin. On exchangeable random variables and the statistics of large graphs and hypergraphs. Probab. Survey, 5:80–145, 2008. [16] Maria-Florina Balcan, Avrim Blum, and Anupam Gupta. Clustering under approximation stability. J. ACM, 60(2):8:1–8:34, May 2013. [17] Maria-Florina Balcan, Christian Borgs, Mark Braverman, Jennifer T. Chayes, and Shang-Hua Teng. Finding endogenously formed communities. In Proceedings of the 24th annual ACM-SIAM symposium on Discrete algorithms, volume abs/1201.4899, 2013. [18] S. Basu, R. Pollack, and M. Roy. On the combinatorial and algebraic complexity of quantifier elimination. Journal of the ACM, pages 1002–1045, 1996. [19] Yoshua Bengio. Learning deep architectures for ai. Found. Trends Mach. Learn., 2(1):1–127, January 2009. [20] Yoshua Bengio, Aaron C. Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell., 35(8):1798– 1828, 2013. [21] R. Berinde, A.C. Gilbert, P. Indyk, H. Karloff, and M.J. Strauss. Combining geometry and combinatorics: a unified approach to sparse signal recovery. In 46th Annual Allerton Conference on Communication, Control, and Computing, pages 798–805, 2008. [22] V. Bittorf, B. Recht, C. Re, and J. Tropp. Factoring nonnegative matrices with linear programs. In NIPS, pages 1223–1231, 2012. [23] D. Blei. Personal communication. [24] D. Blei. Introduction to probabilistic topic models. Communications of the ACM, pages 77–84, 2012. [25] David M. Blei and John D. Lafferty. A correlated topic model of science. AAS, 1(1):17–35, 2007. 198

[26] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, March 2003. [27] Avrim Blum and Joel Spencer. Coloring random and semi-random k-colorable graphs. J. Algorithms, 19(2):204–234, September 1995. [28] L. Blum, F. Cucker, M. Shub, and S. Smale. Complexity of Real Computations. Springer Verlag, 1998. [29] Bernhard E. Boser, Isabelle M. Guyon, and Vladimir N. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the fifth annual workshop on Computational learning theory, COLT ’92, pages 144–152, New York, NY, USA, 1992. ACM. [30] Petros T Boufounos and Richard G Baraniuk. 1-bit compressive sensing. In Information Sciences and Systems, 2008. CISS 2008. 42nd Annual Conference on, pages 16–21. IEEE, 2008. [31] Spencer Charles Brubaker and Santosh Vempala. Isotropic pca and affine-invariant clustering. In Proceedings of the 49th Annual IEEE Symposium on Foundations of Computer Science, FOCS ’08, pages 551–560, Washington, DC, USA, 2008. IEEE Computer Society. [32] G. Buchsbaum and O. Bloch. Color categories revealed by non-negative matrix factorization of munsell color spectra. Vision Research, pages 559–563, 2002. [33] Wray L. Buntine. Estimating likelihoods for topic models. In Asian Conference on Machine Learning, 2009. [34] E. J. Candes and T. Tao. Decoding by linear programming. IEEE Trans. Inf. Theor., 51(12):4203–4215, December 2005. [35] Emmanuel J. Candès, Justin K. Romberg, and Terence Tao. Stable signal recovery from incomplete and inaccurate measurements. Comm. Pure Appl. Math., 59(8):1207– 1223, August 2006. [36] Emmanuel J Candes and Terence Tao. Decoding by linear programming. Information Theory, IEEE Transactions on, 51(12):4203–4215, 2005. [37] J.-F. Cardoso and Pierre Comon. Independent component analysis, a survey of some algebraic methods. In IEEE International Symposium on Circuits and Systems, pages 93–96, 1996. [38] Joseph T. Chang. Full reconstruction of Markov models on evolutionary trees: Identifiability and consistency. Mathematical Biosciences, 137:51–73, 1996. [39] Scott Shaobing Chen, David L. Donoho, and Michael A. Saunders. Atomic decomposition by basis pursuit. SIAM Rev., 43(1):129–159, January 2001. 199

[40] Youngmin Cho and Lawrence Saul. Kernel methods for deep learning. In Advances in Neural Information Processing Systems 22, pages 342–350. 2009. [41] J. Cohen and U. Rothblum. Nonnegative ranks, decompositions and factorizations of nonnegative matices. Linear Algebra and its Applications, pages 149–168, 1993. [42] P. Comon. Independent component analysis, a new concept? 36(3):287–314, 1994.

Signal Processing,

[43] P. Comon, G. Golub, L.-H. Lim, and B. Mourrain. Symmetric tensors and symmetric tensor rank. SIAM Journal on Matrix Analysis Appl., 30(3):1254–1279, 2008. [44] P. Comon and C. Jutten. Handbook of Blind Source Separation: Independent Component Analysis and Applications. Academic Press. Elsevier, 2010. [45] S. Currarini, M.O. Jackson, and P. Pin. An economic model of friendship: Homophily, minorities, and segregation. Econometrica, 77(4):1003–1045, 2009. [46] Sanjoy Dasgupta. Learning mixtures of gaussians. In Proceedings of the 40th Annual IEEE Symposium on Foundations of Computer Science, pages 634–644, 1999. [47] G. Davis, S. Mallat, and M. Avellaneda. Adaptive greedy approximations. Constructive Approximation, 13(1):57–98, March 1997. [48] Geoff Davis, Stephane Mallat, and Marco Avellaneda. Adaptive greedy approximations. Constructive approximation, 13(1):57–98, 1997. [49] L. De Lathauwer, J. Castaing, and J.-F. Cardoso. Fourth-order cumulant-based blind identification of underdetermined mixtures. Signal Processing, IEEE Transactions on, 55(6):2965–2973, 2007. [50] Lieven De Lathauwer, Bart De Moor, and Joos Vandewalle. A multilinear singular value decomposition. SIAM journal on Matrix Analysis and Applications, 21(4):1253– 1278, 2000. [51] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum-likelihood from incomplete data via the EM algorithm. J. Royal Statist. Soc. Ser. B, 39:1–38, 1977. [52] Weicong Ding, Mohammad H. Rohban, Prakash Ishwar, and Venkatesh Saligrama. Topic discovery through data dependent and random projections. In ICMLProceedings of the 30th International Conference on Machine Learning (ICML-13), 2013. [53] D. L. Donoho and X. Huo. Uncertainty principles and ideal atomic decomposition. IEEE Trans. Inf. Theor., 47(7):2845–2862, September 2006. [54] David Donoho and Victoria Stodden. When does non-negative matrix factorization give a correct decomposition into parts? In NIPS, 2003. 200

[55] David L Donoho. Compressed sensing. Information Theory, IEEE Transactions on, 52(4):1289–1306, 2006. [56] David L. Donoho and Philip B. Stark. Uncertainty principles and signal recovery. SIAM J. Appl. Math., 49(3):906–931, June 1989. [57] M. Elad and M. Aharon. Image denoising via sparse and redundant representations over learned dictionaries. Image Processing, IEEE Transactions on, 15(12):3736–3745, December 2006. [58] Michael Elad. Sparse and Redundant Representations: From Theory to Applications in Signal and Image Processing. Springer Publishing Company, Incorporated, 1st edition, 2010. [59] K. Engan, S. O. Aase, and J. Hakon Husoy. Method of optimal directions for frame design. In Proceedings of the Acoustics, Speech, and Signal Processing, 1999. on 1999 IEEE International Conference - Volume 05, ICASSP ’99, pages 2443–2446, Washington, DC, USA, 1999. IEEE Computer Society. [60] Uriel Feige and Joe Kilian. Heuristics for semirandom graph problems. Journal of Computing and System Sciences, 63:639–671, 2001. [61] Uriel Feige and Robert Krauthgamer. Finding and certifying a large hidden clique in a semi-random graph. Technical report, Jerusalem, Israel, Israel, 1999. [62] S.E. Fienberg, M.M. Meyer, and S.S. Wasserman. Statistical analysis of multiple sociometric relations. Journal of the american Statistical association, 80(389):51–67, 1985. [63] Alan M. Frieze, Mark Jerrum, and Ravindran Kannan. Learning linear transformations. In Proceedings of the 37th Annual IEEE Symposium on Foundations of Computer Science, 1996. [64] Anna C. Gilbert, S. Muthukrishnan, and Martin J. Strauss. Approximation of functions over redundant dictionaries using coherence. In Proceedings of the 14th annual ACMSIAM symposium on Discrete algorithms, SODA ’03, pages 243–252, Philadelphia, PA, USA, 2003. Society for Industrial and Applied Mathematics. [65] N. Gillis and S. Vavasis. Fast and robust recursive algorithms for separable nonnegative matrix factorization. arXiv, 1208.1237, 2012. [66] Nicolas Gillis. Robustness analysis of hottopixx, a linear programming model for factoring nonnegative matrices. SIAM Journal on Matrix Analysis and Applications, 34(3):1189–1212, 2013. [67] Nicolas Gillis and Robert Luce. Robust near-separable nonnegative matrix factorization using linear optimization. arXiv, 1302.4385, 2013. 201

[68] G. H. Golub and C. F. van Loan. Matrix Computations. Johns Hopkins University Press, 1996. [69] C. Gomez, H. Le Borgne, P. Allemand, C. Delacourt, and P. Ledru. N-findr method versus independent component analysis for lithological identification in hyperspectral imagery. Int. J. Remote Sens., 28(23), January 2007. [70] I. J. Goodfellow, A. Courville, and Y. Bengio. Large-scale feature learning with spikeand-slab sparse coding. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), 2012. [71] P. Gopalan, D. Mimno, S. Gerrish, M. Freedman, and D. Blei. Scalable inference of overlapping communities. In Advances in Neural Information Processing Systems 25, pages 2258–2266, 2012. [72] N. Goyal, S. Vempala, and Y. Xiao. Fourier pca. arXiv, 1306.5825, 2013. [73] D. Yu. Grigor’ev and N.N. Vorobjov Jr. Solving systems of polynomial inequalities in subexponential time,. Journal of Symbolic Computation, 5-1-2:37–64, 1988. [74] David Gross. Recovering low-rank matrices from few coefficients in any basis. Information Theory, IEEE Transactions on, 57(3):1548–1566, 2011. [75] V. Guruswami and P. Indyk. Expander-based constructions of efficiently decodable codes. In Proceedings of the 42nd IEEE symposium on Foundations of Computer Science, FOCS ’01, pages 658–, Washington, DC, USA, 2001. IEEE Computer Society. [76] D.L. Hanson and F.T. Wright. A bound on tail probabilities for quadratic forms in independent random variables. The Annals of Mathematical Statistics, 42(3):1079– 1083, 1971. [77] E. Harding. The number of partitions of a set of n points in k dimensions induced by hyperplanes. Edinburgh Math. Society, pages 285–289, 1967. [78] Richard A Harshman. Foundations of the parafac procedure: models and conditions for an” explanatory” multimodal factor analysis. 1970. [79] L. Henry. Schémas de nuptialité: déséquilibre des sexes et célibat. Population, 24:457– 486, 1969. [80] Geoffrey E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Comput., 14(8):1771–1800, August 2002. [81] Geoffrey E. Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for deep belief nets. Neural Comput., 18(7):1527–1554, July 2006. [82] Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006. 202

[83] Thomas Hofmann. Probabilistic latent semantic indexing. In Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’99, pages 50–57, New York, NY, USA, 1999. ACM. [84] P.W. Holland, K.B. Laskey, and S. Leinhardt. Stochastic blockmodels: first steps. Social networks, 5(2):109–137, 1983. [85] Roger A. Horn and Charles R. Johnson, editors. Matrix analysis. Cambridge University Press, New York, NY, USA, 1986. [86] D. Hsu, S. Kakade, and P. Liang. Identifiability and unmixing of latent parse trees. In Advances in Neural Information Processing Systems 25, 2012. [87] Daniel Hsu and Sham M. Kakade. Learning mixtures of spherical gaussians: moment methods and spectral decompositions. In Proceedings of the 4th conference on Innovations in Theoretical Computer Science, ITCS ’13, pages 11–20, New York, NY, USA, 2013. ACM. [88] Daniel Hsu, Sham M. Kakade, and Tong Zhang. A spectral algorithm for learning hidden Markov models. Journal of Computer and System Sciences, 78(5):1460–1480, 2012. [89] Furong Huang, Mohammad Umar Hakeem, Prateek Verma, Animashree Anandkumar, et al. Fast detection of overlapping communities via online tensor methods on gpus. arXiv, 1309.0787, 2013. [90] F. Hwang and U. Rothblum. On the number of separable partitions. Journal of Combinatorial Optimization, pages 423–433, 2011. [91] A. Hyvärinen and E. Oja. Independent component analysis: algorithms and applications. Neural Networks, 13(4–5):411–430, 2000. [92] Russel Impagliazzo and Ramamohan Paturi. On the complexity of k-sat. J. Comput. Syst. Sci., 62(2):367–375, March 2001. [93] Jeffrey C Jackson, Adam R Klivans, and Rocco A Servedio. Learnability beyond ac0 . In Proceedings of the thiry-fourth annual ACM symposium on Theory of computing, pages 776–784. ACM, 2002. [94] Koray Kavukcuoglu, Marc’Aurelio Ranzato, and Yann Lecun. Fast inference in sparse coding algorithms with applications to object recognition. Technical report, Technical report, Computational and Biological Learning Lab, Courant Institute, NYU, 2008. [95] Leonid Khachiyan. On the complexity of approximating extremal determinants in matrices. Journal of Complexity, 11(1):138–153, 1995. [96] J. Kleinberg and M. Sandler. Using mixture models for collaborative filtering. JCSS, pages 49–69, 2008. 203

[97] Adam R Klivans and Alexander A Sherstov. Cryptographic hardness for learning intersections of halfspaces. Journal of Computer and System Sciences, 75(1):2–12, 2009. [98] E. Kofidis and P. A. Regalia. On the best rank-1 approximation of higher-order supersymmetric tensors. SIAM Journal on Matrix Analysis and Applications, 23(3):863–884, 2002. [99] T. G. Kolda and B. W. Bader. Tensor decompositions and applications. SIAM review, 51(3):455, 2009. [100] T. G. Kolda and J. R. Mayo. Shifted power method for computing tensor eigenpairs. SIAM Journal on Matrix Analysis and Applications, 32(4):1095–1124, October 2011. [101] Tamara G Kolda. Orthogonal tensor decompositions. SIAM Journal on Matrix Analysis and Applications, 23(1):243–255, 2001. [102] Alexandra Kolla, Konstantin Makarychev, and Yury Makarychev. How to play unique games against a semi-random adversary. In Proceedings of the IEEE 52nd Annual Symposium on Foundations of Computer Science, FOCS ’11, pages 443–452, Washington, DC, USA, 2011. IEEE Computer Society. [103] Alex Krizhevsky, Ilya Sutskever, and Geoff Hinton. Imagenet classification with deep convolutional neural networks. In P. Bartlett, F.C.N. Pereira, C.J.C. Burges, L. Bottou, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1106–1114. 2012. [104] A. Kumar, V. Sindhwani, and P. Kambadur. Fast conical hull algorithms for nearseparable non-negative matrix factorization. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), 2013. [105] L. De Lathauwer, B. De Moor, and J. Vandewalle. On the best rank-1 and rank(R1 , R2 , ..., Rn ) approximation and applications of higher-order tensors. SIAM J. Matrix Anal. Appl., 21(4):1324–1342, 2000. [106] Lap Chi Lau. Bipartite roots of graphs. In In Proceedings of the 15th Annual ACMSIAM Symposium on Discrete Algorithms, pages 952–961, 2004. [107] W. Lawton and E. Sylvestre. Self modeling curve resolution. Technometrics, pages 617–633, 1971. [108] P.F. Lazarsfeld, R.K. Merton, et al. Friendship as a social process: A substantive and methodological analysis. Freedom and control in modern society, 18(1):18–66, 1954. [109] D. Lee and H. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, pages 788–791, 1999. 204

[110] Daniel D. Lee and H. Sebastian Seung. Algorithms for non-negative matrix factorization. In NIPS, pages 556–562, 2000. [111] Michael S. Lewicki and Terrence J. Sejnowski. Learning overcomplete representations. Neural Comput., 12(2):337–365, February 2000. [112] Wei Li and Andrew McCallum. Pachinko allocation: Dag-structured mixture models of topic correlations. In Proceedings of the 23rd international conference on Machine learning, ICML ’06, pages 577–584, New York, NY, USA, 2006. ACM. [113] Lek-Heng Lim. Singular values and eigenvalues of tensors: a variational approach. In Computational Advances in Multi-Sensor Adaptive Processing, 2005 1st IEEE International Workshop on, pages 129–132. IEEE, 2005. [114] Zhouchen Lin, Minming Chen, and Yi Ma. The augmented lagrange multiplier method for exact recovery of corrupted low-rank matrices. Mathematical Programming, 2010. [115] Roi Livni, Shai Shalev-Shwartz, and Ohad Shamir. A provably efficient algorithm for training deep networks. arXiv, 1304.7045, 2013. [116] L. Lovász and M. Saks. Communication complexity and combinatorial lattice theory. JCSS, pages 322–349, 1993. [117] Julien Mairal, Marius Leordeanu, Francis Bach, Martial Hebert, and Jean Ponce. Discriminative sparse image models for class-specific edge detection and image interpretation. In Proceedings of the 10th European Conference on Computer Vision: Part III, ECCV ’08, pages 43–56, Berlin, Heidelberg, 2008. Springer-Verlag. [118] Konstantin Makarychev, Yury Makarychev, and Aravindan Vijayaraghavan. Approximation algorithms for semi-random partitioning problems. In Proceedings of the 44th symposium on Theory of Computing, STOC ’12, pages 367–384, New York, NY, USA, 2012. ACM. [119] S.G. Mallat and Zhifeng Zhang. Matching pursuits with time-frequency dictionaries. Trans. Sig. Proc., 41(12):3397–3415, December 1993. [120] Stphane Mallat. A Wavelet Tour of Signal Processing, Third Edition: The Sparse Way. Academic Press, 3rd edition, 2008. [121] Jiri Matousek. Lectures on Discrete Geometry. Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2002. [122] Andrew Kachites McCallum. Mallet: A machine learning for language toolkit, 2002. [123] Peter McCullagh. Tensor methods in statistics, volume 161. Chapman and Hall London, 1987.

205

[124] M. McPherson, L. Smith-Lovin, and J.M. Cook. Birds of a feather: Homophily in social networks. Annual review of sociology, pages 415–444, 2001. [125] F. McSherry. Spectral partitioning of random graphs. In Proceedings of the 42nd Annual IEEE Symposium on Foundations of Computer Science, 2001. [126] David Mimno, Hanna Wallach, Edmund Talley, Miriam Leenders, and Andrew McCallum. Optimizing semantic coherence in topic models. In EMNLP, 2011. [127] A. Moitra. An almost optimal algorithm for computing the nonnegative rank. In Proceedings of the 24th annual ACM-SIAM symposium on Discrete algorithms, 2013. [128] Ankur Moitra and Gregory Valiant. Settling the polynomial learnability of mixtures of gaussians. In Proceedings of the 51st Annual IEEE Symposium on Foundations of Computer Science, 2010. [129] J.L. Moreno. Who shall survive?: A new approach to the problem of human interrelations. Nervous and Mental Disease Publishing Co, 1934. [130] Elchanan Mossel and Sébastian Roch. Learning nonsingular phylogenies and hidden Markov models. Annals of Applied Probability, 16(2):583–614, 2006. [131] Rajeev Motwani and Madhu Sudan. Computing roots of graphs is hard. DISCRETE APPLIED MATHEMATICS, 54:54–81, 1994. [132] J.M. P. Nascimento and J. M. B. Dias. Vertex component analysis: A fast algorithm to unmix hyperspectral data. IEEE TRANS. GEOSCI. REM. SENS, 43:898–910, 2004. [133] N. Nisan. Lower bounds for non-commutative computation (extended abstract). In Proceedings of the 23rd symposium on Theory of Computing, pages 410–418, 1991. [134] B. A. Olshausen and D. J. Field. Sparse coding with an overcomplete basis set: a strategy employed by v1? Vision Res, 37:3311–25, 1997. [135] G. Palla, I. Derényi, I. Farkas, and T. Vicsek. Uncovering the overlapping community structure of complex networks in nature and society. Nature, 435(7043):814–818, 2005. [136] M. Patrascu and R. Williams. On the possibility of faster sat algorithms. In Proceedings of the 21st annual ACM-SIAM symposium on Discrete algorithms, pages 1065–1075, 2010. [137] K. Pearson. Contributions to the mathematical theory of evolution. Philosophical Transactions of the Royal Society, London, A., page 71, 1894. [138] Marc’Aurelio Ranzato, Y-Lan Boureau, and Yann LeCun. Sparse feature learning for deep belief networks. In J.C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 1185–1192, Cambridge, MA, 2008. MIT Press. 206

[139] J. Renegar. On the computational complexity and geometry of the first-order theory of the reals. Journal of Symbolic Computation, 13-3:255–352., 1992. [140] M. Rudelson. Random vectors in the isotropic position. J. Funct. Anal., 164(1):60–72, 1999. [141] Arora Sanjeev and Ravi Kannan. Learning mixtures of arbitrary gaussians. In Proceedings of the 33rd annual ACM symposium on Theory of computing, STOC ’01, pages 247–257, New York, NY, USA, 2001. ACM. [142] T.A.B. Snijders and K. Nowicki. Estimation and prediction for stochastic blockmodels for graphs with latent block structure. Journal of Classification, 14(1):75–100, 1997. [143] D. Sontag and D. Roy. Complexity of inference in latent dirichlet allocation. In NIPS, pages 1008–1016, 2011. [144] Daniel A. Spielman, Huan Wang, and John Wright. Exact recovery of sparsely-used dictionaries. In COLT, 2012. [145] A. Stegeman and P. Comon. Subtracting a best rank-1 approximation may increase tensor rank. Linear Algebra and Its Applications, 433:1276–1300, 2010. [146] Keith Stevens, Philip Kegelmeyer, David Andrzejewski, and David Buttler. Exploring topic coherence over many models and many topics. In EMNLP, 2012. [147] G.W. Stewart and J. Sun. Matrix perturbation theory, volume 175. Academic press New York, 1990. [148] J. A. Tropp. Greed is good: algorithmic results for sparse approximation. IEEE Trans. Inf. Theor., 50(10):2231–2242, September 2006. [149] J.A. Tropp. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12(4):389–434, 2012. [150] Joel A. Tropp, Anna C. Gilbert, S. Muthukrishnan, and Martin Strauss. Improved sparse approximation over quasiincoherent dictionaries. In IEEE International Conf. on Image Processing, pages 37–40, 2003. [151] Ledyard R Tucker. Some mathematical notes on three-mode factor analysis. Psychometrika, 31(3):279–311, 1966. [152] L. G. Valiant. A theory of the learnable. In Proceedings of the 16th annual ACM symposium on Theory of computing, STOC ’84, pages 436–445, New York, NY, USA, 1984. ACM. [153] Vladimir N. Vapnik and Alexey Chervonenkis. On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities. Theory of Probability and its Applications, 16(2):264–280, 1971. 207

[154] Stephen A. Vavasis. On the complexity of nonnegative matrix factorization. SIAM J. on Optimization, 20(3):1364–1377, October 2009. [155] Santosh Vempala and Grant Wang. A spectral algorithm for learning mixture models. J. Comput. Syst. Sci., 68(4):841–860, June 2004. [156] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In ICML, pages 1096–1103, 2008. [157] M. Wainwright and M. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, pages 1–305, 2008. [158] Hanna Wallach, Iain Murray, Ruslan Salakhutdinov, and David Mimno. Evaluation methods for topic models. In Proceedings of the 26th International Conference on Machine Learning (ICML-09), 2009. [159] Y.J. Wang and G.Y. Wong. Stochastic blockmodels for directed graphs. Journal of the American Statistical Association, 82(397):8–19, 1987. [160] P. Wedin. Perturbation bounds in connection with singular value decomposition. BIT Numerical Mathematics, 12(1):99–111, 1972. [161] H.C. White, S.A. Boorman, and R.L. Breiger. Social structure from multiple networks. i. blockmodels of roles and positions. American journal of sociology, pages 730–780, 1976. [162] Wei Xu, Xin Liu, and Yihong Gong. Document clustering based on non-negative matrix factorization. In Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval, SIGIR ’03, pages 267–273, New York, NY, USA, 2003. ACM. [163] Jianchao Yang, John Wright, Thomas S. Huang, and Yi Ma. Image super-resolution via sparse representation. Trans. Img. Proc., 19(11):2861–2873, November 2010. [164] M. Yannakakis. Expressing combinatorial optimization problems by linear programs. JCSS, pages 441–466, 1991. [165] Limin Yao, David Mimno, and Andrew McCallum. Efficient methods for topic model inference on streaming document collections. In KDD, 2009. [166] Chen Yudong, Sujay Sanghavi, and Huan Xu. Clustering sparse graphs. In Advances in Neural Information Processing Systems 25, 2012. [167] Tong Zhang and Gene H. Golub. Rank-one approximation to high order tensors. SIAM J. Matrix Anal. Appl., 23(2):534–550, February 2001.

208

Provable Algorithms for Machine Learning Problems - Duke Computer ...

Provable Algorithms for Machine Learning Problems - Duke Computer ...

Suggest Documents

Optimization Algorithms in Machine Learning - Computer Sciences ...

Machine Learning of Computer Vision Algorithms - CiteSeerX

Learning Multiple Defaults for Machine Learning Algorithms

Fast and Provable Algorithms for Learning Two-Layer Polynomial

Machine Learning Algorithms

Supervised Machine Learning Algorithms

Machine Learning Algorithms - analyticsvidhya.com

Machine Learning Algorithms for Real Data Sources

educational tool for machine learning algorithms

machine learning algorithms for damage detection

Learning Hybrid Algorithms for Vehicle Routing Problems

Calibrating Unsupervised Machine Learning Algorithms for the

Machine learning approximation algorithms for high ...

evaluating machine learning algorithms for detecting ... - CiteSeerX

Evaluating Machine Learning Algorithms for Automated ... - CAIA

Parallelizing Machine Learning Algorithms - CS229

Generative Learning algorithms - CS 229: Machine Learning

Towards provable security for ubiquitous applications - Computer

Machine Learning for Computer Games - Google Sites

Machine Learning for Diabetes Decision Support - Computer ...

Machine Learning for Computer Games - Google Sites

MACHINE LEARNING TECHNIQUES FOR BRAIN-COMPUTER ...

Machine Learning Methods for Computer Security

Machine Learning for Computer Security - Google Sites