Geometric Embedding for Learning Combinatorial Structures

Terran Lane, Ben Yackley, Sergey Plis, Blake Anderson
University of New Mexico, Albuquerque, NM 87131 USA

1. Introduction

Since the inception of computer science, there has been a strong dichotomy between optimization in continuous spaces (such as R^d) and combinatorial spaces (such as permutations on d objects). While there are computationally hard problems in both kinds of spaces, combinatorial spaces are far more often the villain. It seems as if nearly all interesting learning, optimization, and representation problems in combinatorial spaces are NP-complete in the best case.

We feel that a key factor at the heart of this dichotomy is that combinatorial spaces are far more unstructured than the familiar continuous spaces. A priori, a combinatorial space is simply a set of objects, with no relationships among them. Compare this to, say, Euclidean d-space, which comes equipped with a topology, limits, continuity, completeness, a metric, an inner product, and so on (Munkres, 1975). On these are built the entire infrastructure of analysis, including tools like the derivative, most optimization algorithms, and representations like the Fourier basis (Rudin, 1964; Kreyszig, 1978). Essentially, the last four centuries of mathematics have been spent developing tools for representation and optimization in continuous spaces. Combinatorial spaces, on the other hand, have been burdened with fewer assumptions, but also endowed with fewer advantages.

One strategy for working with combinatorial spaces is to embed them into continuous spaces and work there with powerful analytic tools. This trick has proven powerful in, for example, continuous relaxations of integer programming problems (Gomory, 1958). It has enjoyed relatively less penetration in machine learning, however. Where versions of it have appeared (Fligner & Verducci, 1986; Meila et al., 2007), the connection to the topology and analytic properties of the embedding space is typically not made explicit, nor fully exploited. In this abstract, we argue that the embedding approach offers us a powerful way to finesse some of the challenges of combinatorial spaces. We outline two of our research directions in this space: rapid Bayesian network structure identification via induced topology, and Bayesian inference on permutations via low-dimensional embedding.
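As a concrete, if simplified, illustration of the relaxation idea (the numbers and the naive rounding step below are ours, not Gomory's cutting-plane method), consider relaxing a small 0/1 program to a linear program over [0, 1]^n and solving it with an off-the-shelf LP solver:

```python
# Hypothetical illustration: continuous relaxation of a small 0/1 integer program.
# We relax x_i in {0, 1} to x_i in [0, 1] and solve the resulting LP.
import numpy as np
from scipy.optimize import linprog

values = np.array([10.0, 7.0, 4.0, 3.0])   # made-up objective coefficients
weights = np.array([5.0, 4.0, 3.0, 1.0])   # made-up constraint coefficients
capacity = 8.0

# linprog minimizes, so negate the values to maximize total value.
res = linprog(c=-values,
              A_ub=weights.reshape(1, -1), b_ub=[capacity],
              bounds=[(0.0, 1.0)] * len(values),
              method="highs")

x_relaxed = res.x                           # fractional solution in [0, 1]^n
x_rounded = (x_relaxed > 0.5).astype(int)   # naive rounding back to {0, 1}^n
# (A real integer-programming method would instead add cutting planes or branch and bound.)
print(x_relaxed, x_rounded)
```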

2. Fast Bayes Net Structure Search

A key learning problem for Bayesian networks (BNs) is structure identification: finding the "best" structure corresponding to a set of observed data. The structure identification problem is NP-hard (Chickering et al., 1994) and considerable work has been devoted to efficient algorithms for it (Moore & Lee, 1998; Heckerman, 1999; Friedman & Koller, 2003). Essentially, the structure identification problem boils down to Ĝ = arg max_{g ∈ G} S(g | D), where S() is some score function that measures the goodness of fit between a proposed graph, g, and a set of data, D. Typical examples of S() are BDe (Heckerman, 1999), BIC (Schwarz, 1978), or AIC (Akaike, 1973). The exponential challenge of this optimization problem comes down to the combinatorial structure of the space of DAGs, G, but evaluating the score function can also be a significant computational burden. Even with a decomposable score, the scan incurred in accumulating even per-variable sufficient statistics can be quite expensive for large data sets.

We observe that S : G → R is simply a function mapping, so we can build a function approximator for it, in the classical tradition of supervised machine learning. If this approximator, f() ≈ S(), is sufficiently accurate and is fast to evaluate (say, constant time), then f() can be used as a proxy for S() in any incremental search for Ĝ. To build the approximator f(), we embed the set of graphs on d nodes into an O(d^2)-dimensional Euclidean space by imposing a distance metric between graphs. We use (weighted) Hamming distance between the adjacency matrix representations of the graphs. This imposes a hypercube topology on the space of graphs: the "meta-graph," a graph over the space of graphs. From this topology we construct standard function approximators, such as linear models over the Fourier basis or Gaussian process regressors.

Figure 1. Structure identification speedup from two variants of the meta-graph approach to BN structure search (GP-HC and RS) over standard hill-climbing search with exact score evaluation (HC). Horizontal axis gives the number of random variables in the target BN; vertical axis gives total search time in seconds, on a log scale. (Down is faster.)

Figure 1 demonstrates the speedup that can be achieved by using two versions of our proxy score function, f() (GP-HC and RS), compared to the same search strategy using exact score evaluation, S() (HC). Both GP-HC and RS achieve better than an order-of-magnitude speedup over HC (vertical axis), over a wide range of Bayesian network sizes (horizontal axis). The speedup is particularly pronounced for very large data sets, where HC's time scales with data set size and GP-HC/RS are effectively independent of data set size (data not shown).
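The sketch below is a minimal illustration of this proxy-score idea, not the implementation behind Figure 1: a synthetic exact_score() stands in for a real score such as BDe or BIC, acyclicity checks on candidate graphs are omitted, graphs are embedded as flattened 0/1 adjacency vectors (so Euclidean distance coincides with Hamming distance), and a scikit-learn Gaussian process plays the role of f() ≈ S().

```python
# Minimal sketch of a GP proxy score for BN structure search.
# exact_score() is a synthetic stand-in for a real score (BDe/BIC), and
# acyclicity constraints on candidate graphs are ignored for brevity.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
d = 6                                        # number of random variables
true_adj = rng.integers(0, 2, size=(d, d))   # "target" structure for the toy score

def exact_score(adj):
    """Synthetic score: higher when adj is Hamming-close to true_adj."""
    return -np.abs(adj - true_adj).sum()

def embed(adj):
    """Embed a graph as its flattened 0/1 adjacency matrix (a hypercube vertex)."""
    return adj.reshape(-1).astype(float)

# Train the proxy f() ~= S() on a small sample of exactly scored graphs.
train_graphs = [rng.integers(0, 2, size=(d, d)) for _ in range(200)]
X = np.stack([embed(g) for g in train_graphs])
y = np.array([exact_score(g) for g in train_graphs])
proxy = GaussianProcessRegressor(kernel=RBF(length_scale=2.0)).fit(X, y)

# Hill-climb on the meta-graph: flip one edge at a time, ranking moves by the proxy.
current = rng.integers(0, 2, size=(d, d))
for _ in range(50):
    neighbors = []
    for i in range(d):
        for j in range(d):
            if i == j:
                continue
            nbr = current.copy()
            nbr[i, j] = 1 - nbr[i, j]        # single-edge flip = one hypercube step
            neighbors.append(nbr)
    preds = proxy.predict(np.stack([embed(n) for n in neighbors]))
    best = neighbors[int(np.argmax(preds))]
    if exact_score(best) <= exact_score(current):   # verify the chosen move with S()
        break
    current = best

print("proxy-guided score:", exact_score(current))
```

In this toy version the exact score is still consulted once per accepted move to confirm improvement; the point is that the full neighborhood is ranked with the cheap proxy rather than with S().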

3. Bayesian Inference on Permutations

A key component of object tracking, ranking, and related tasks is Bayesian inference on permutations. This is the problem of representing and manipulating probability distributions over the super-exponentially large family of permutations of n objects. An exact representation of a general PDF here requires factorially many parameters. Prior approaches have approximated a general PDF with a restricted set of basis functions (Huang et al., 2009), or have embedded the permutation space only implicitly and worked with a heuristically chosen probability distribution (Meila et al., 2007).

We start with the n × n matrix representation of a permutation, and take the metric to be the Euclidean distance between (vectorizations of) these matrices. It turns out that this metric implies a specific embedding into R^d, in which the permutations fall on the surface of a hypersphere. In turn, this suggests a natural family of probability distributions over permutations: the Fisher-Bingham distributions, of which a common special case is the von Mises-Fisher (vMF) distribution (Mardia & Jupp, 2000). We can give polynomial (in n) time mappings between permutation space and the Euclidean hypersphere. This allows us to sample data from permutation space, map to Euclidean space, manipulate probability distributions there (the marginalization and projection steps of Bayesian inference), and then map the resulting inferences back to permutation space. Coupling the probability representations to the mapping operations bridges the gap between the discrete, combinatorial space of permutations and the continuous, low-dimensional hypersphere. This allows us to lift the large body of results developed for directional statistics (Mardia & Jupp, 2000) directly to permutation inference.
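A minimal sketch of the embedding and a moment-based vMF fit follows. It is illustrative only: the noisy-sample generator, the helper names perm_to_point/point_to_perm, and the use of the standard moment approximation for the concentration kappa are our choices, and the Hungarian algorithm (scipy's linear_sum_assignment) is used as one way to round a point on the sphere back to a permutation.

```python
# Sketch: embed permutations of n items as vectorized permutation matrices on the
# unit hypersphere, fit a von Mises-Fisher (vMF) distribution by moments, and map
# the fitted mean direction back to a permutation.
import numpy as np
from scipy.optimize import linear_sum_assignment

n = 5
rng = np.random.default_rng(1)

def perm_to_point(perm):
    """Permutation (as an index array) -> unit vector in R^{n^2}."""
    M = np.zeros((n, n))
    M[np.arange(n), perm] = 1.0
    v = M.reshape(-1)
    return v / np.linalg.norm(v)             # every permutation matrix has norm sqrt(n)

def point_to_perm(v):
    """Round a point in R^{n^2} back to the nearest permutation (Hungarian step)."""
    M = v.reshape(n, n)
    _, cols = linear_sum_assignment(-M)      # negate to maximize total matrix weight
    return cols

# Toy data: noisy samples around a reference permutation.
reference = np.array([2, 0, 1, 4, 3])
samples = []
for _ in range(500):
    p = reference.copy()
    i, j = rng.integers(0, n, size=2)
    if rng.random() < 0.3:
        p[i], p[j] = p[j], p[i]              # occasional random transposition
    samples.append(perm_to_point(p))
X = np.stack(samples)

# Moment-based vMF fit: mean direction mu, plus the standard approximation
# kappa ~= rbar * (p - rbar^2) / (1 - rbar^2), with p the ambient dimension.
mean_vec = X.mean(axis=0)
rbar = np.linalg.norm(mean_vec)
mu = mean_vec / rbar
p_dim = X.shape[1]
kappa = rbar * (p_dim - rbar**2) / (1 - rbar**2)

print("estimated mode (as a permutation):", point_to_perm(mu))
print("concentration kappa:", round(float(kappa), 2))
```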

Acknowledgements

This material is based upon work supported by the National Science Foundation under Grant No. 0705681. Dr. Plis's work was supported by the NIH under grant number NCRR 1P20 RR021938.

References

Akaike, H. Information theory and an extension of the maximum likelihood principle. In Petrov, B. N. and Csaki, F. (eds.), Proceedings of the Second International Symposium on Information Theory, pp. 267–281. Akademiai Kiado, Budapest, 1973.

Chickering, D., Geiger, D., and Heckerman, D. Learning Bayesian networks is NP-hard. Technical Report MSR-TR-94-17, Microsoft, Redmond, WA, 1994.

Fligner, M. A. and Verducci, J. S. Distance based ranking models. Journal of the Royal Statistical Society, Series B (Methodological), 48(3):359–369, 1986.

Friedman, N. and Koller, D. Being Bayesian about network structure: A Bayesian approach to structure discovery in Bayesian networks. Machine Learning, 50(1–2):95–125, January 2003. doi: 10.1023/A:1020249912095.

Gomory, R. E. Outline of an algorithm for integer solutions to linear programs. Bulletin of the American Mathematical Society, 64(5):275–278, 1958. doi: 10.1090/S0002-9904-1958-10224-4.

Heckerman, D. A tutorial on learning with Bayesian networks. In Jordan, M. (ed.), Learning in Graphical Models. MIT Press, Cambridge, MA, 1999. URL http://research.microsoft.com/scripts/pubs/view.asp?TR_ID=MSR-TR-95-06.

Huang, J., Guestrin, C., and Guibas, L. Fourier theoretic probabilistic inference over permutations. Journal of Machine Learning Research, 10:997–1070, May 2009. URL http://www.jmlr.org/papers/volume10/huang09a/huang09a.pdf.

Kreyszig, E. Introductory Functional Analysis with Applications. John Wiley & Sons, 1978.

Mardia, K. V. and Jupp, P. E. Directional Statistics. John Wiley & Sons, Ltd., second edition, 2000.

Meila, M., Phadnis, K., Patterson, A., and Bilmes, J. Consensus ranking under the exponential model. In Proceedings of the 23rd Annual Conference on Uncertainty in Artificial Intelligence (UAI), 2007.

Moore, A. and Lee, M. S. Cached sufficient statistics for efficient machine learning with large datasets. Journal of Artificial Intelligence Research, 8:67–91, March 1998. doi: 10.1613/jair.453.

Munkres, J. R. Topology: A First Course. Prentice Hall, 1975.

Rudin, W. Principles of Mathematical Analysis. McGraw-Hill, third edition, 1964.

Schwarz, G. Estimating the dimension of a model. Annals of Statistics, 6:461–464, 1978.
