Scalable Algorithms for Structured Prediction (applicable to, but without applications in Computer Vision)
Thomas Hofmann
[email protected]
Motivation & Overview
Structured Prediction Generalize machine learning methods to deal with structured outputs and/or with multiple, interdependent outputs
Structured objects such as sequences, strings, trees, labeled graphs, lattices, etc.
Multiple response variables that are interdependent = collective classification
Natural Language Processing: Syntactic sentence parsing, dependency parsing, PoS tagging, named entity detection, language modeling. B. Taskar, D. Klein, M. Collins, D. Koller, and C. Manning, Max-Margin Parsing, EMNLP 2004. R. McDonald, K. Crammer, F. Pereira, Online Large-Margin Training of Dependency Parsers, ACL 2005. H. C. Daumé III, Practical Structured Learning Techniques for Natural Language Processing, Ph.D. Thesis, Univ. of Southern California, 2006. B. Roark, M. Saraclar, M. Collins, Discriminative n-gram Language Modeling, Computer Speech & Language 21(2): 373-392, 2007. P. Blunsom, Structured Classification for Multilingual Natural Language Processing, Ph.D. Thesis, Univ. of Melbourne, 2007. L. S. Zettlemoyer, Learning to Map Sentences to Logical Form, Ph.D. Thesis, MIT, 2009. T. Koo, Advances in Discriminative Dependency Parsing, Ph.D. Thesis, MIT, 2010.
Bioinformatics: Protein secondary & tertiary structure, function prediction; gene structure prediction (splicing), gene finding. Y. Liu, E. P. Xing, and J. Carbonell, Predicting Protein Folds with Structural Repeats Using a Chain Graph Model, ICML 2005. G. Rätsch and S. Sonnenburg, Large Scale Hidden Semi-Markov SVMs, NIPS 2006. G. Schweikert et al., mGene: Accurate SVM-based Gene Finding with an Application to Nematode Genomes, Genome Res. 19: 2133-2143, 2009. A. Sokolov and A. Ben-Hur, A Structured-Outputs Method for Prediction of Protein Function, Proceedings of the 3rd International Workshop on Machine Learning in Systems Biology, 2008. K. Astikainen et al., Towards Structured Output Prediction of Enzyme Function, BMC Proceedings 2008, 2 (Suppl. 4):S2.
Computer Vision [ but I didn't want to talk about Computer Vision here ]: T. Caetano & R. Hartley, ICCV 2009 Tutorial on Structured Prediction in Computer Vision; e.g., M. P. Kumar, P. Torr, A. Zisserman, Efficient Discriminative Learning of Parts-based Models, ICCV 2009.
Maximum Margin Structured Prediction
Scoring Functions. Score each input/output pair with a scoring (or compatibility) function; consider linear scoring functions defined over a joint feature map of input/output pairs; predict by maximizing the score over all outputs.
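In the standard notation (a sketch; w is the weight vector, \Phi the joint feature map, \mathcal{Y} the output space):
f(x, y; w) = \langle w, \Phi(x, y) \rangle,    \hat{y}(x) = \arg\max_{y \in \mathcal{Y}} f(x, y; w)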
Minimal Risk Structured Prediction. Goal of learning: find the scoring function that minimizes the expected prediction loss (= minimal risk). Ingredients: a loss function penalizing incorrect outputs; the expected loss of the prediction function over randomly generated instances; the empirical risk of the prediction function on the sample set.
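In symbols (standard definitions matching the bullets above; \Delta(y, y') is the loss for predicting y' when the truth is y):
R(\hat{y}) = E_{(x,y) \sim P} [ \Delta(y, \hat{y}(x)) ],    \hat{R}_n(\hat{y}) = \frac{1}{n} \sum_{i=1}^{n} \Delta(y_i, \hat{y}(x_i))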
Hinge-style Upper Bound. Empirical risk functionals are difficult to optimize directly; use an upper bound on the loss instead (with regularization). Which one? The choice determines theoretical properties (e.g. consistency) and the type of optimization problem. Hinge loss (for binary classification). Margin re-scaling
Slack re-scaling
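For reference, the two per-instance upper bounds in the notation above (standard forms, as in Tsochantaridis et al., JMLR 2005):
Margin re-scaling:  \ell_{MR}(x_i, y_i; w) = \max_{y \in \mathcal{Y}} [ \Delta(y_i, y) + \langle w, \Phi(x_i, y) - \Phi(x_i, y_i) \rangle ]
Slack re-scaling:   \ell_{SR}(x_i, y_i; w) = \max_{y \in \mathcal{Y}} \Delta(y_i, y) [ 1 + \langle w, \Phi(x_i, y) - \Phi(x_i, y_i) \rangle ]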
Proving the Upper Bound
Detail
It is easy to see that this results in an upper bound. For the margin re-scaled loss:
A similar relation holds for the slack loss (in fact one can define a one-parametric family of loss functions)
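One-line argument for the margin re-scaled case (sketch): let \hat{y}_i = \arg\max_y \langle w, \Phi(x_i, y) \rangle be the prediction. Then \langle w, \Phi(x_i, \hat{y}_i) - \Phi(x_i, y_i) \rangle \ge 0, and therefore
\Delta(y_i, \hat{y}_i) \le \Delta(y_i, \hat{y}_i) + \langle w, \Phi(x_i, \hat{y}_i) - \Phi(x_i, y_i) \rangle \le \max_{y} [ \Delta(y_i, y) + \langle w, \Phi(x_i, y) - \Phi(x_i, y_i) \rangle ] = \ell_{MR}(x_i, y_i; w).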
Hinge + \|w\|_2 = structured SVM. With an L2-norm regularizer one obtains a generalization of the SVM: the structured SVM.
Rolling out with slack variables
Shorthand
Softmargin
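Written out, the soft-margin (n-slack, margin re-scaled) structured SVM takes the standard form
\min_{w, \xi \ge 0}  \frac{1}{2} \|w\|_2^2 + \frac{C}{n} \sum_{i=1}^{n} \xi_i   s.t.   \langle w, \Phi(x_i, y_i) - \Phi(x_i, y) \rangle \ge \Delta(y_i, y) - \xi_i   \forall i, \forall y \in \mathcal{Y}.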
Illustration
Representer Theorem
Insight
Special case: linear functions
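For linear scoring functions the representer theorem yields an expansion of the solution in terms of training inputs paired with arbitrary outputs, e.g.
w^* = \sum_{i=1}^{n} \sum_{y \in \mathcal{Y}} \alpha_{iy} \Phi(x_i, y),
so that scores depend only on inner products \langle \Phi(x_i, y), \Phi(x, y') \rangle, which can be replaced by a joint kernel.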
T. Hofmann, B. Schölkopf, and A. Smola, A Tutorial Review of RKHS Methods in Machine Learning, Annals of Statistics, 2008
Structured SVM
Comment
Converting the single max-constraint per training instance into many linear constraints, one per output (# = cardinality of the output space). Assuming the output space is large (combinatorial explosion in the number of variables or parts), the resulting convex QP is intractable as such. Challenge: devise efficient algorithms, exact or approximate.
SVMstruct: Iterative Strengthening. Incrementally add constraints to define a sequence of relaxed QPs; trade-off between accuracy and computational speed. Define a chain of (sub-)sets of constraints:
Alternation of two steps: (1) solve the relaxed QP with constraint set C_t; (2) find a set of violated constraints C_t^+ not yet in C_t and add them:
Requires a black box mechanism for generating constraints: "separation oracle"
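A minimal Python sketch of this outer loop (not the reference implementation; the helpers phi, delta, loss_augmented_argmax and solve_relaxed_qp are hypothetical user-supplied callables, and feature vectors are assumed to be numpy arrays):

def cutting_plane_ssvm(data, C, eps, phi, delta, loss_augmented_argmax, solve_relaxed_qp, max_iter=100):
    # data: list of (x_i, y_i) pairs; phi(x, y): joint feature map (numpy vector)
    # delta(y, y_bar): structured loss
    # loss_augmented_argmax(w, x, y): separation oracle, returns argmax over y_bar of
    #     delta(y, y_bar) + <w, phi(x, y_bar)>
    # solve_relaxed_qp(constraints, C): returns (w, xi) solving the QP restricted to `constraints`
    constraints = [set() for _ in data]            # working sets C_t, one per training instance
    w, xi = solve_relaxed_qp(constraints, C)       # trivial solution for the empty constraint set
    for _ in range(max_iter):
        added = 0
        for i, (x, y) in enumerate(data):
            y_bar = loss_augmented_argmax(w, x, y)                      # most violated output
            violation = delta(y, y_bar) - w @ (phi(x, y) - phi(x, y_bar))
            if violation > xi[i] + eps and y_bar not in constraints[i]:
                constraints[i].add(y_bar)                               # strengthen the relaxation
                added += 1
        if added == 0:                              # every constraint is eps-satisfied: terminate
            break
        w, xi = solve_relaxed_qp(constraints, C)    # re-optimize over the enlarged working set
    return w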
SVMstruct: Analysis. Method: derive the dual QP by solving for w and plugging the solution back in (using the expansion from the representer theorem). Upper bound the dual objective via the feasible primal point with weight vector zero and slack variables chosen to satisfy the constraints -> B.
Find constraints/dual variables such that the dual objective is guaranteed to increase by at least some non-zero amount -> eta. Result: the number of iterations is finite.
Separation Oracle How to find (appropriate) violated constraints? Lemma:
What are the most violated constraints per training instance?
Loss-augmented prediction problem Do not add constraints that are not epsilon-violated (termination)
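Concretely (margin re-scaling), the most violated constraint for instance i is found by loss-augmented prediction,
\hat{y}_i = \arg\max_{y \in \mathcal{Y}} [ \Delta(y_i, y) + \langle w, \Phi(x_i, y) \rangle ],
and it is added only if its violation exceeds \xi_i + \epsilon.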
Final result: I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun, Large Margin Methods for Structured and Interdependent Output Variables, JMLR 2005
Approximation Quality. At termination all violated constraints are violated by less than epsilon (= termination condition). Increasing all slack variables by epsilon yields a feasible solution, which is however no longer optimal. Define:
Then: the solution found is at most epsilon worse than the optimum.
I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun, Large Margin Methods for Structured and Interdependent Output Variables, JMLR 2005
Advanced Primal Methods
Cutting Planes via Pooling
Improvement
1-slack formulation: blowing-up constraints even more!
Problem is equivalent to structured SVM:
But: fewer constraints need to be added -> more sparseness. Bottom line: one pooled constraint is as good as n individual ones. [ Gains are largest when the separation oracle is fast. ] T. Joachims, T. Finley, C.-N. J. Yu, Cutting-Plane Training of Structural SVMs, MLJ 2009. T. Joachims, Training Linear SVMs in Linear Time, ACM SIGKDD 2006, 217-226.
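For reference, the 1-slack (margin re-scaled) formulation pools one constraint per joint labeling of the whole training set:
\min_{w, \xi \ge 0}  \frac{1}{2} \|w\|_2^2 + C \xi   s.t.   \frac{1}{n} \sum_{i=1}^{n} \langle w, \Phi(x_i, y_i) - \Phi(x_i, \bar{y}_i) \rangle \ge \frac{1}{n} \sum_{i=1}^{n} \Delta(y_i, \bar{y}_i) - \xi   \forall (\bar{y}_1, ..., \bar{y}_n) \in \mathcal{Y}^n.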
Cutting Planes via Pooling
Experiments
T. Joachims, T. Finley, C.-N. J. Yu, Cutting-Plane Training of Structural SVMs, MLJ 2009
Online Subgradient Method for sSVM. Why bother with the constraints in the first place? Optimize the non-smooth objective directly - it is piecewise linear!
Compute subgradient
Perform stochastic subgradient descent (w/ learning rates) N. Ratliff, J. A. Bagnell, M. Zinkevich, (Online) Subgradient Methods for Structured Prediction, AISTATS 2007. N. Ratliff, J. A. Bagnell, M. Zinkevich, Subgradient Methods for Maximum Margin Structured Learning, 2007 N. Z. Shor, Minimization methods for non-differentiable functions. Springer-Verlag, 1985
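A minimal Python sketch of the stochastic subgradient update (assumptions as in the earlier sketch: phi and loss_augmented_argmax are hypothetical user-supplied callables, feature vectors are numpy arrays):

import random

def ssvm_subgradient(data, lam, eta0, phi, loss_augmented_argmax, n_epochs=10):
    # data: list of (x_i, y_i) pairs; phi(x, y): joint feature map (numpy vector)
    # loss_augmented_argmax(w, x, y): returns argmax over y_bar of delta(y, y_bar) + <w, phi(x, y_bar)>
    w = 0 * phi(*data[0])                    # zero vector of the right dimension
    t = 0
    for _ in range(n_epochs):
        random.shuffle(data)
        for x, y in data:
            t += 1
            eta = eta0 / (t ** 0.5)          # decaying step size, e.g. eta_t ~ 1/sqrt(t)
            y_bar = loss_augmented_argmax(w, x, y)
            # subgradient of (lam/2)*||w||^2 + max_{y'} [delta(y, y') + <w, phi(x, y') - phi(x, y)>];
            # if y_bar == y the feature term vanishes and only the regularizer contributes
            g = lam * w + phi(x, y_bar) - phi(x, y)
            w = w - eta * g
    return w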
Subgradient Methods
Background
Subgradient Methods Convergence Proof
Recurrence leads to
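The standard recurrence behind such proofs (a sketch, assuming subgradients bounded by \|g_t\| \le G and a convex objective F with minimizer w^*):
\|w_{t+1} - w^*\|^2 \le \|w_t - w^*\|^2 - 2 \eta_t ( F(w_t) - F(w^*) ) + \eta_t^2 G^2;
summing over t and choosing, e.g., \eta_t \propto 1/\sqrt{t} bounds the suboptimality of the best iterate by O(1/\sqrt{T}).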
Background
Subgradient w/ Projection
Improvement
PEGASOS Algorithm. Two key improvements: (1) project the weight vector back onto a sphere after each update; (2) use a subset of k violated constraints in each subgradient update step (between batch and online). Analysis: very fast convergence to an epsilon-good solution.
S. Shalev-Shwartz, Y. Singer, N. Srebro, Pegasos: Primal Estimated sub-GrAdient SOlver for SVM, ICML 2007.
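Schematically, the projection step keeps the iterate inside a ball known to contain the optimum (a sketch; \lambda is the regularization parameter):
w_{t+1} <- \min\{ 1, (1/\sqrt{\lambda}) / \|w_{t+1/2}\| \} \, w_{t+1/2},
i.e., after the subgradient step the weight vector is rescaled back onto the ball of radius 1/\sqrt{\lambda}.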
Non-Convex Loss Bounds. Convex upper bounds like the hinge loss become very loose for large losses -> mismatch, sensitivity to outliers. Use a non-convex bound instead. Ramp loss:
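A common structured form of the ramp loss (written as a difference of two convex max-terms, which is what the CCCP step below exploits; exact constants vary across papers):
\ell_{ramp}(x_i, y_i; w) = \max_{y} [ \Delta(y_i, y) + \langle w, \Phi(x_i, y) \rangle ] - \max_{y} \langle w, \Phi(x_i, y) \rangle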
C. B. Do, Q. Le, C.H. Teo, O. Chapelle, A. Smola, Tighter Bounds for Structured Estimation, NIPS 2008 R. Collobert, F. Sinz, J. Weston. L. Bottou, Trading Convexity for Scalability, ICML 2006
Convex ConCave Procedure Slight modification of structured SVM: Rescaled target margin, linearization of negative max
CCCP method Taylor expansion - upper bound
Iterate minimization and re-computation of upper bound (convergence guarantee) A.L. Yuille and A. Rangarajan. The concave-convex procedure. Neural Computation, 15:915–936, 2003.
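Sketch of one CCCP step for the ramp loss above: write the objective as convex + concave; upper bound the concave part -\max_y \langle w, \Phi(x_i, y) \rangle by its linearization at the current iterate w_t, i.e. by -\langle w, \Phi(x_i, y_i^*) \rangle with y_i^* = \arg\max_y \langle w_t, \Phi(x_i, y) \rangle; minimizing the resulting convex surrogate is a standard structured-SVM problem, and iterating monotonically decreases the objective.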
Non-convex Loss Optimization
Results
C. B. Do, Q. Le, C.H. Teo, O. Chapelle, A. Smola, Tighter Bounds for Structured Estimation, NIPS 2008
Saddle Point Formulation (1). So far: the argmax for prediction or loss-augmented prediction is performed in a black box (outside the QP optimization). New idea: incorporate prediction directly into the QP. Class of problems for which prediction can be solved exactly by an LP relaxation:
Binary MRFs with submodular potentials; matchings; tree-structured MRFs. B. Taskar, S. Lacoste-Julien, M. Jordan, Structured Prediction, Dual Extragradient and Bregman Projections, JMLR 2006
Saddle Point Formulation (2) Combine into min/max problem
w-space: spherical constraint; z-space: linear constraints (problem-dependent). Extragradient methods: a solution method for saddle-point problems (game theory). Perform a gradient step in w, z, then project; recompute the gradient at the projected point (w^p, z^p) and apply this new gradient from (w, z), followed by projection, to obtain the corrected iterate (w^c, z^c). B. Taskar, S. Lacoste-Julien, M. Jordan, Structured Prediction via the Extragradient Method, NIPS 2005. B. Taskar, S. Lacoste-Julien, M. Jordan, Structured Prediction, Dual Extragradient and Bregman Projections, JMLR 2006
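Schematically, with \Pi_W, \Pi_Z denoting Euclidean projections onto the two feasible sets, \mathcal{L}(w, z) the saddle objective and \eta the step size (a sketch of the update):
prediction step:  w^p = \Pi_W( w - \eta \nabla_w \mathcal{L}(w, z) ),   z^p = \Pi_Z( z + \eta \nabla_z \mathcal{L}(w, z) )
correction step:  w^c = \Pi_W( w - \eta \nabla_w \mathcal{L}(w^p, z^p) ),   z^c = \Pi_Z( z + \eta \nabla_z \mathcal{L}(w^p, z^p) )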
Relaxed Constraint Generation. LP relaxations can also be used for approximate constraint generation (even if the LP relaxation is not exact)! Idea: round the solution of the relaxed LP and treat it as a constraint (even if it is not a feasible output) -> constraint over-generation. Less principled than the saddle-point approach, but good results in practice for intractable problems.
T. Finley, T. Joachims, Training Structural SVMs when Exact Inference is Intractable, ICML 2008
Dual Methods
Dual QP for Structured SVM Dual QP (margin re-scaling)
Dual variables can be re-scaled such that they define for each training instance a probability mass function over the possible outputs.
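Up to scaling conventions, the margin re-scaled dual takes the standard form (a sketch, with one dual variable \alpha_{iy} per training instance and output):
\max_{\alpha \ge 0}  \sum_{i, y} \alpha_{iy} \Delta(y_i, y) - \frac{1}{2} \| \sum_{i, y} \alpha_{iy} ( \Phi(x_i, y_i) - \Phi(x_i, y) ) \|^2   s.t.   \sum_{y} \alpha_{iy} \le \frac{C}{n}  \forall i,
with w = \sum_{i, y} \alpha_{iy} ( \Phi(x_i, y_i) - \Phi(x_i, y) ); rescaling \alpha_{i \cdot} by n/C gives the per-instance probability interpretation.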
sSVM Algorithm: Dual View. Iterative strengthening of the primal corresponds to variable selection in the dual. At iteration t: most dual variables are clamped to zero (sparseness). At iteration t+1: a subset of additional variables is unclamped (those corresponding to the selected constraints). But: the real power of the dual view comes from incorporating decomposition properties of the feature map into the optimization problem. Primal methods: decomposition is exploited only in prediction and/or loss-augmented prediction.
Part-based Decomposition & MRFs Often the feature function decomposes into contributions over parts (or factors, or cliques in an MRF)
Assume similar additive decomposition holds for the loss
Then: one can rewrite the dual QP in terms of marginals over factor (sub-)configurations - much more compact!
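In symbols (a sketch): with factors/cliques c \in \mathcal{C}(x) and sub-configurations y_c,
\Phi(x, y) = \sum_{c \in \mathcal{C}(x)} \Phi_c(x, y_c),    \Delta(y, \bar{y}) = \sum_{c \in \mathcal{C}(x)} \Delta_c(y_c, \bar{y}_c).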
Representer Theorem Definitions
Representation
Can be directly kernelized by introducing kernels on factor level.
Reparameterizing the Dual Interpret dual variables as probabilities and introduce "marginals" over factors
New QP with marginal probabilities sorry for the messy notation
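Schematically, the marginals are obtained by summing the dual variables over all outputs consistent with a factor sub-configuration,
\mu_{i,c}(y_c) = \sum_{y' : y'_c = y_c} \alpha_{i y'},
and under the decomposition above both the linear and the quadratic term of the dual can be expressed through these polynomially many marginals (subject to local consistency constraints).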
P. L. Bartlett, M. Collins, B. Taskar, D. McAllester. Exponentiated gradient algorithms for large-margin structured classification. NIPS 2005.
Exponentiated Gradient Descent. Essential idea: everything can be formulated in terms of variables defined over factor configurations (instead of global ones). Simplified sketch: exponential parameterization
Perform gradient updates w.r.t. canonical parameters Compute marginals \mu from dual variables \alpha (assumed to be efficient) Excellent convergence rate bounds! (here: out of scope) M. Collins, A. Globerson, T. Koo, X. Carreras, P.L. Bartlett, Exponentiated Gradient Algorithms for Conditional Random Fields and Max-Margin Markov Networks, JMLR 2008
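Schematically, the exponentiated gradient update is multiplicative in the dual variables (a sketch; J denotes the dual objective being maximized and \eta the step size):
\alpha^{t+1}_{iy} \propto \alpha^{t}_{iy} \exp( \eta \nabla_{iy} J(\alpha^t) ),  normalized so that \sum_y \alpha^{t+1}_{iy} = 1;
with the exponential parameterization over factor configurations, the required gradients and normalizations reduce to computing factor marginals by standard inference.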
Conclusion
Conclusion. Significant progress on scalable structured prediction problems. Constraint-generation approaches: sSVM with theoretical guarantees; deep cuts from pooled constraints; non-convex upper bounds and CCCP; online learning via stochastic subgradient, Pegasos; over-generating constraints. Other methods: saddle-point formulation and extragradient; exponentiated gradient descent on the dual [new work by Meshi et al., ICML 2010].