Scalable Algorithms for Structured Prediction (applicable to, but without applications in Computer Vision)
Thomas Hofmann
[email protected]
Motivation & Overview
Structured Prediction Generalize machine learning methods to deal with structured outputs and/or with multiple, interdependent outputs
Structured objects such as sequences, strings, trees, labeled graphs, lattices, etc.
Multiple response variables that are interdependent = collective classification
Natural Language Processing: Syntactic sentence parsing, dependency parsing, PoS tagging, named entity detection, language modeling. B. Taskar, D. Klein, M. Collins, D. Koller, and C. Manning, Max-Margin Parsing, EMNLP 2004. R. McDonald, K. Crammer, F. Pereira, Online Large-Margin Training of Dependency Parsers, ACL 2005. H. C. Daumé III, Practical Structured Learning Techniques for Natural Language Processing, Ph.D. Thesis, Univ. of Southern California, 2006. B. Roark, M. Saraclar, M. Collins, Discriminative n-gram Language Modeling, Computer Speech & Language 21(2): 373-392, 2007. P. Blunsom, Structured Classification for Multilingual Natural Language Processing, Ph.D. Thesis, Univ. of Melbourne, 2007. L. S. Zettlemoyer, Learning to Map Sentences to Logical Form, Ph.D. Thesis, MIT, 2009. T. Koo, Advances in Discriminative Dependency Parsing, Ph.D. Thesis, MIT, 2010.
Bioinformatics: Protein secondary & tertiary structure, function prediction; gene structure prediction (splicing), gene finding. Y. Liu, E. P. Xing, and J. Carbonell, Predicting Protein Folds with Structural Repeats Using a Chain Graph Model, ICML 2005. G. Rätsch and S. Sonnenburg, Large Scale Hidden Semi-Markov SVMs, NIPS 2006. G. Schweikert et al., mGene: Accurate SVM-based Gene Finding with an Application to Nematode Genomes, Genome Res. 19: 2133-2143, 2009. A. Sokolov and A. Ben-Hur, A Structured-Outputs Method for Prediction of Protein Function, Proceedings of the 3rd International Workshop on Machine Learning in Systems Biology, 2008. K. Astikainen et al., Towards Structured Output Prediction of Enzyme Function, BMC Proceedings 2008, 2 (Suppl. 4):S2.
Computer Vision [ but I didn't want to talk about Computer Vision here ]: T. Caetano & R. Hartley, ICCV 2009 Tutorial on Structured Prediction in Computer Vision; e.g., M. P. Kumar, P. Torr, A. Zisserman, Efficient Discriminative Learning of Parts-based Models, ICCV 2009.
Maximum Margin Structured Prediction
Scoring Functions. Score each input/output pair with a scoring (or compatibility) function; consider linear scoring functions defined over a joint feature map of input/output pairs; predict by maximizing the score over all outputs.
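In the standard notation (a sketch; w is the weight vector, \Phi the joint feature map, \mathcal{Y} the output space):
f(x, y; w) = \langle w, \Phi(x, y) \rangle,    \hat{y}(x) = \arg\max_{y \in \mathcal{Y}} f(x, y; w)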
Minimal Risk Structured Prediction. Goal of learning: find the scoring function that minimizes the expected prediction loss (= minimal risk). Ingredients: a loss function penalizing incorrect outputs; the expected loss of the prediction function over randomly generated instances; the empirical risk of the prediction function on the sample set.
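In symbols (standard definitions matching the bullets above; \Delta(y, y') is the loss for predicting y' when the truth is y):
R(\hat{y}) = E_{(x,y) \sim P} [ \Delta(y, \hat{y}(x)) ],    \hat{R}_n(\hat{y}) = \frac{1}{n} \sum_{i=1}^{n} \Delta(y_i, \hat{y}(x_i))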
Hinge-style Upper Bound. Empirical risk functionals are difficult to optimize directly; use an upper bound on the loss instead (with regularization). Which one? The choice determines theoretical properties (e.g. consistency) and the type of optimization problem. Hinge loss (for binary classification). Margin re-scaling
Slack re-scaling
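For reference, the two per-instance upper bounds in the notation above (standard forms, as in Tsochantaridis et al., JMLR 2005):
Margin re-scaling:  \ell_{MR}(x_i, y_i; w) = \max_{y \in \mathcal{Y}} [ \Delta(y_i, y) + \langle w, \Phi(x_i, y) - \Phi(x_i, y_i) \rangle ]
Slack re-scaling:   \ell_{SR}(x_i, y_i; w) = \max_{y \in \mathcal{Y}} \Delta(y_i, y) [ 1 + \langle w, \Phi(x_i, y) - \Phi(x_i, y_i) \rangle ]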
Proving the Upper Bound
Detail
It is easy to see that this results in an upper bound. For the margin re-scaled loss:
A similar relation holds for the slack loss (in fact one can define a one-parametric family of loss functions)
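One-line argument for the margin re-scaled case (sketch): let \hat{y}_i = \arg\max_y \langle w, \Phi(x_i, y) \rangle be the prediction. Then \langle w, \Phi(x_i, \hat{y}_i) - \Phi(x_i, y_i) \rangle \ge 0, and therefore
\Delta(y_i, \hat{y}_i) \le \Delta(y_i, \hat{y}_i) + \langle w, \Phi(x_i, \hat{y}_i) - \Phi(x_i, y_i) \rangle \le \max_{y} [ \Delta(y_i, y) + \langle w, \Phi(x_i, y) - \Phi(x_i, y_i) \rangle ] = \ell_{MR}(x_i, y_i; w).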
Hinge + \|w\|_2 = structured SVM. With an L2-norm regularizer one obtains a generalization of the SVM: the structured SVM.
Rolling out with slack variables
Shorthand
Softmargin
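Written out, the soft-margin (n-slack, margin re-scaled) structured SVM takes the standard form
\min_{w, \xi \ge 0}  \frac{1}{2} \|w\|_2^2 + \frac{C}{n} \sum_{i=1}^{n} \xi_i   s.t.   \langle w, \Phi(x_i, y_i) - \Phi(x_i, y) \rangle \ge \Delta(y_i, y) - \xi_i   \forall i, \forall y \in \mathcal{Y}.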
Illustration
Representer Theorem
Insight
Special case: linear functions
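For linear scoring functions the representer theorem yields an expansion of the solution in terms of training inputs paired with arbitrary outputs, e.g.
w^* = \sum_{i=1}^{n} \sum_{y \in \mathcal{Y}} \alpha_{iy} \Phi(x_i, y),
so that scores depend only on inner products \langle \Phi(x_i, y), \Phi(x, y') \rangle, which can be replaced by a joint kernel.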
T. Hofmann, B. Schölkopf, and A. Smola, A Tutorial Review of RKHS Methods in Machine Learning, Annals of Statistics, 2008
Structured SVM
Comment
Converting the single max-constraint per training instance into many linear constraints, one per output (# = cardinality of the output space). Assuming the output space is large (combinatorial explosion in the number of variables or parts), the resulting convex QP is intractable as such. Challenge: devise efficient algorithms, exact or approximate.
SVMstruct: Iterative Strengthening. Incrementally add constraints to define a sequence of relaxed QPs; trade-off between accuracy and computational speed. Define a chain of (sub-)sets of constraints:
Alternation of two steps: (1) solve the relaxed QP with constraint set C_t; (2) find a set of violated constraints C_t^+ not yet in C_t and add them:
Requires a black box mechanism for generating constraints: "separation oracle"
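A minimal Python sketch of this outer loop (not the reference implementation; the helpers phi, delta, loss_augmented_argmax and solve_relaxed_qp are hypothetical user-supplied callables, and feature vectors are assumed to be numpy arrays):

def cutting_plane_ssvm(data, C, eps, phi, delta, loss_augmented_argmax, solve_relaxed_qp, max_iter=100):
    # data: list of (x_i, y_i) pairs; phi(x, y): joint feature map (numpy vector)
    # delta(y, y_bar): structured loss
    # loss_augmented_argmax(w, x, y): separation oracle, returns argmax over y_bar of
    #     delta(y, y_bar) + <w, phi(x, y_bar)>
    # solve_relaxed_qp(constraints, C): returns (w, xi) solving the QP restricted to `constraints`
    constraints = [set() for _ in data]            # working sets C_t, one per training instance
    w, xi = solve_relaxed_qp(constraints, C)       # trivial solution for the empty constraint set
    for _ in range(max_iter):
        added = 0
        for i, (x, y) in enumerate(data):
            y_bar = loss_augmented_argmax(w, x, y)                      # most violated output
            violation = delta(y, y_bar) - w @ (phi(x, y) - phi(x, y_bar))
            if violation > xi[i] + eps and y_bar not in constraints[i]:
                constraints[i].add(y_bar)                               # strengthen the relaxation
                added += 1
        if added == 0:                              # every constraint is eps-satisfied: terminate
            break
        w, xi = solve_relaxed_qp(constraints, C)    # re-optimize over the enlarged working set
    return w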
SVMstruct: Analysis. Method: derive the dual QP by solving for w and plugging the solution back in (using the expansion from the representer theorem). Upper bound the dual objective via the feasible primal point with weight vector zero and slack variables chosen to satisfy the constraints -> B.
Find constraints/dual variables such that the dual objective is guaranteed to increase by at least some non-zero amount -> eta. Result: the number of iterations is finite.
Separation Oracle How to find (appropriate) violated constraints? Lemma:
What are the most violated constraints per training instance?
Loss-augmented prediction problem Do not add constraints that are not epsilon-violated (termination)
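Concretely (margin re-scaling), the most violated constraint for instance i is found by loss-augmented prediction,
\hat{y}_i = \arg\max_{y \in \mathcal{Y}} [ \Delta(y_i, y) + \langle w, \Phi(x_i, y) \rangle ],
and it is added only if its violation exceeds \xi_i + \epsilon.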
Final result: I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun, Large Margin Methods for Structured and Interdependent Output Variables, JMLR 2005
Approximation Quality. At termination all violated constraints are violated by less than epsilon (= termination condition). Increasing all slack variables by epsilon yields a feasible solution, which is however no longer optimal. Define:
Then: the solution found is at most epsilon worse than the optimum.
I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun, Large Margin Methods for Structured and Interdependent Output Variables, JMLR 2005
Advanced Primal Methods
Cutting Planes via Pooling
Improvement
1-slack formulation: blowing-up constraints even more!
Problem is equivalent to structured SVM:
But: fewer constraints need to be added -> more sparseness. Bottom line: one pooled constraint is as good as n individual ones. [ Gains are largest when the separation oracle is fast. ] T. Joachims, T. Finley, C.-N. J. Yu, Cutting-Plane Training of Structural SVMs, MLJ 2009. T. Joachims, Training Linear SVMs in Linear Time, ACM SIGKDD 2006, 217-226.
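For reference, the 1-slack (margin re-scaled) formulation pools one constraint per joint labeling of the whole training set:
\min_{w, \xi \ge 0}  \frac{1}{2} \|w\|_2^2 + C \xi   s.t.   \frac{1}{n} \sum_{i=1}^{n} \langle w, \Phi(x_i, y_i) - \Phi(x_i, \bar{y}_i) \rangle \ge \frac{1}{n} \sum_{i=1}^{n} \Delta(y_i, \bar{y}_i) - \xi   \forall (\bar{y}_1, ..., \bar{y}_n) \in \mathcal{Y}^n.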
Cutting Planes via Pooling
Experiments
T. Joachims, T. Finley, C.-N. J. Yu, Cutting-Plane Training of Structural SVMs, MLJ 2009
Online Subgradient Method for sSVM. Why bother with the constraints in the first place? Optimize the non-smooth objective directly - it is piecewise linear!
Compute subgradient
Perform stochastic subgradient descent (w/ learning rates) N. Ratliff, J. A. Bagnell, M. Zinkevich, (Online) Subgradient Methods for Structured Prediction, AISTATS 2007. N. Ratliff, J. A. Bagnell, M. Zinkevich, Subgradient Methods for Maximum Margin Structured Learning, 2007 N. Z. Shor, Minimization methods for non-differentiable functions. Springer-Verlag, 1985
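A minimal Python sketch of the stochastic subgradient update (assumptions as in the earlier sketch: phi and loss_augmented_argmax are hypothetical user-supplied callables, feature vectors are numpy arrays):

import random

def ssvm_subgradient(data, lam, eta0, phi, loss_augmented_argmax, n_epochs=10):
    # data: list of (x_i, y_i) pairs; phi(x, y): joint feature map (numpy vector)
    # loss_augmented_argmax(w, x, y): returns argmax over y_bar of delta(y, y_bar) + <w, phi(x, y_bar)>
    w = 0 * phi(*data[0])                    # zero vector of the right dimension
    t = 0
    for _ in range(n_epochs):
        random.shuffle(data)
        for x, y in data:
            t += 1
            eta = eta0 / (t ** 0.5)          # decaying step size, e.g. eta_t ~ 1/sqrt(t)
            y_bar = loss_augmented_argmax(w, x, y)
            # subgradient of (lam/2)*||w||^2 + max_{y'} [delta(y, y') + <w, phi(x, y') - phi(x, y)>];
            # if y_bar == y the feature term vanishes and only the regularizer contributes
            g = lam * w + phi(x, y_bar) - phi(x, y)
            w = w - eta * g
    return w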
Subgradient Methods
Background
Subgradient Methods Convergence Proof
Recurrence leads to
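The standard recurrence behind such proofs (a sketch, assuming subgradients bounded by \|g_t\| \le G and a convex objective F with minimizer w^*):
\|w_{t+1} - w^*\|^2 \le \|w_t - w^*\|^2 - 2 \eta_t ( F(w_t) - F(w^*) ) + \eta_t^2 G^2;
summing over t and choosing, e.g., \eta_t \propto 1/\sqrt{t} bounds the suboptimality of the best iterate by O(1/\sqrt{T}).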
Background
Subgradient w/ Projection
Improvement
PEGASOS Algorithm. Two key improvements: (1) project the weight vector back onto a sphere after each update; (2) use a subset of k violated constraints in each subgradient update step (between batch and online). Analysis: very fast convergence to an epsilon-good solution.
S. Shalev-Shwartz, Y. Singer, N. Srebro, Pegasos: Primal Estimated sub-GrAdient SOlver for SVM, ICML 2007.
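Schematically, the projection step keeps the iterate inside a ball known to contain the optimum (a sketch; \lambda is the regularization parameter):
w_{t+1} <- \min\{ 1, (1/\sqrt{\lambda}) / \|w_{t+1/2}\| \} \, w_{t+1/2},
i.e., after the subgradient step the weight vector is rescaled back onto the ball of radius 1/\sqrt{\lambda}.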
Non-Convex Loss Bounds. Convex upper bounds like the hinge loss become very loose for large losses -> mismatch, sensitivity to outliers. Use a non-convex bound instead. Ramp loss:
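A common structured form of the ramp loss (written as a difference of two convex max-terms, which is what the CCCP step below exploits; exact constants vary across papers):
\ell_{ramp}(x_i, y_i; w) = \max_{y} [ \Delta(y_i, y) + \langle w, \Phi(x_i, y) \rangle ] - \max_{y} \langle w, \Phi(x_i, y) \rangle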
C. B. Do, Q. Le, C.H. Teo, O. Chapelle, A. Smola, Tighter Bounds for Structured Estimation, NIPS 2008 R. Collobert, F. Sinz, J. Weston. L. Bottou, Trading Convexity for Scalability, ICML 2006
Convex ConCave Procedure Slight modification of structured SVM: Rescaled target margin, linearization of negative max
CCCP method Taylor expansion - upper bound
Iterate minimization and re-computation of upper bound (convergence guarantee) A.L. Yuille and A. Rangarajan. The concave-convex procedure. Neural Computation, 15:915–936, 2003.
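Sketch of one CCCP step for the ramp loss above: write the objective as convex + concave; upper bound the concave part -\max_y \langle w, \Phi(x_i, y) \rangle by its linearization at the current iterate w_t, i.e. by -\langle w, \Phi(x_i, y_i^*) \rangle with y_i^* = \arg\max_y \langle w_t, \Phi(x_i, y) \rangle; minimizing the resulting convex surrogate is a standard structured-SVM problem, and iterating monotonically decreases the objective.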
Non-convex Loss Optimization
Results
C. B. Do, Q. Le, C.H. Teo, O. Chapelle, A. Smola, Tighter Bounds for Structured Estimation, NIPS 2008
Saddle Point Formulation (1). So far: the argmax for prediction or loss-augmented prediction is performed in a black box (outside the QP optimization). New idea: incorporate prediction directly into the QP. Class of problems for which prediction can be solved exactly by an LP relaxation:
Binary MRFs with submodular potentials; matchings; tree-structured MRFs. B. Taskar, S. Lacoste-Julien, M. Jordan, Structured Prediction, Dual Extragradient and Bregman Projections, JMLR 2006
Saddle Point Formulation (2) Combine into min/max problem
w-space: spherical constraint; z-space: linear constraints (problem-dependent). Extragradient methods: a solution method for saddle-point problems (game theory). Perform a gradient step in w, z, then project; recompute the gradient at the projected point (w^p, z^p) and apply this new gradient from (w, z), followed by projection, to obtain the corrected iterate (w^c, z^c). B. Taskar, S. Lacoste-Julien, M. Jordan, Structured Prediction via the Extragradient Method, NIPS 2005. B. Taskar, S. Lacoste-Julien, M. Jordan, Structured Prediction, Dual Extragradient and Bregman Projections, JMLR 2006
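Schematically, with \Pi_W, \Pi_Z denoting Euclidean projections onto the two feasible sets, \mathcal{L}(w, z) the saddle objective and \eta the step size (a sketch of the update):
prediction step:  w^p = \Pi_W( w - \eta \nabla_w \mathcal{L}(w, z) ),   z^p = \Pi_Z( z + \eta \nabla_z \mathcal{L}(w, z) )
correction step:  w^c = \Pi_W( w - \eta \nabla_w \mathcal{L}(w^p, z^p) ),   z^c = \Pi_Z( z + \eta \nabla_z \mathcal{L}(w^p, z^p) )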
Relaxed Constraint Generation. LP relaxations can also be used for approximate constraint generation (even if the LP relaxation is not exact)! Idea: round the solution of the relaxed LP and treat it as a constraint (even if it is not a feasible output) -> constraint over-generation. Less principled than the saddle-point approach, but good results in practice for intractable problems.
T. Finley, T. Joachims, Training Structural SVMs when Exact Inference is Intractable, ICML 2008
Dual Methods
Dual QP for Structured SVM Dual QP (margin re-scaling)
Dual variables can be re-scaled such that they define for each training instance a probability mass function over the possible outputs.
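Up to scaling conventions, the margin re-scaled dual takes the standard form (a sketch, with one dual variable \alpha_{iy} per training instance and output):
\max_{\alpha \ge 0}  \sum_{i, y} \alpha_{iy} \Delta(y_i, y) - \frac{1}{2} \| \sum_{i, y} \alpha_{iy} ( \Phi(x_i, y_i) - \Phi(x_i, y) ) \|^2   s.t.   \sum_{y} \alpha_{iy} \le \frac{C}{n}  \forall i,
with w = \sum_{i, y} \alpha_{iy} ( \Phi(x_i, y_i) - \Phi(x_i, y) ); rescaling \alpha_{i \cdot} by n/C gives the per-instance probability interpretation.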
sSVM Algorithm: Dual View. Iterative strengthening of the primal corresponds to variable selection in the dual. At iteration t: most dual variables are clamped to zero (sparseness). At iteration t+1: a subset of additional variables is unclamped (those corresponding to the selected constraints). But: the real power of the dual view comes from incorporating decomposition properties of the feature map into the optimization problem. Primal methods: decomposition is exploited only in prediction and/or loss-augmented prediction.
Part-based Decomposition & MRFs Often the feature function decomposes into contributions over parts (or factors, or cliques in an MRF)
Assume similar additive decomposition holds for the loss
Then: one can rewrite the dual QP in terms of marginals over factor (sub-)configurations - much more compact!
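In symbols (a sketch): with factors/cliques c \in \mathcal{C}(x) and sub-configurations y_c,
\Phi(x, y) = \sum_{c \in \mathcal{C}(x)} \Phi_c(x, y_c),    \Delta(y, \bar{y}) = \sum_{c \in \mathcal{C}(x)} \Delta_c(y_c, \bar{y}_c).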
Representer Theorem Definitions
Representation
Can be directly kernelized by introducing kernels on factor level.
Reparameterizing the Dual Interpret dual variables as probabilities and introduce "marginals" over factors
New QP with marginal probabilities sorry for the messy notation
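Schematically, the marginals are obtained by summing the dual variables over all outputs consistent with a factor sub-configuration,
\mu_{i,c}(y_c) = \sum_{y' : y'_c = y_c} \alpha_{i y'},
and under the decomposition above both the linear and the quadratic term of the dual can be expressed through these polynomially many marginals (subject to local consistency constraints).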
P. L. Bartlett, M. Collins, B. Taskar, D. McAllester. Exponentiated gradient algorithms for large-margin structured classification. NIPS 2005.
Exponentiated Gradient Descent. Essential idea: everything can be formulated in terms of variables defined over factor configurations (instead of global ones). Simplified sketch: exponential parameterization
Perform gradient updates w.r.t. canonical parameters Compute marginals \mu from dual variables \alpha (assumed to be efficient) Excellent convergence rate bounds! (here: out of scope) M. Collins, A. Globerson, T. Koo, X. Carreras, P.L. Bartlett, Exponentiated Gradient Algorithms for Conditional Random Fields and Max-Margin Markov Networks, JMLR 2008
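Schematically, the exponentiated gradient update is multiplicative in the dual variables (a sketch; J denotes the dual objective being maximized and \eta the step size):
\alpha^{t+1}_{iy} \propto \alpha^{t}_{iy} \exp( \eta \nabla_{iy} J(\alpha^t) ),  normalized so that \sum_y \alpha^{t+1}_{iy} = 1;
with the exponential parameterization over factor configurations, the required gradients and normalizations reduce to computing factor marginals by standard inference.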
Conclusion
Conclusion. Significant progress on scalable structured prediction problems. Constraint-generation approaches: sSVM with theoretical guarantees; deep cuts from pooled constraints; non-convex upper bounds and CCCP; online learning via stochastic subgradient, Pegasos; over-generating constraints. Other methods: saddle-point formulation and extragradient; exponentiated gradient descent on the dual [new work by Meshi et al., ICML 2010].