Decomposing Structured Prediction via Constrained Conditional Models
Dan Roth
Department of Computer Science, University of Illinois at Urbana-Champaign
SLG Workshop, ICML, Atlanta GA, June 2013
With thanks to collaborators: Ming-Wei Chang, Lev Ratinov, Rajhans Samdani, Vivek Srikumar, and many others.
Funding: NSF; DHS; NIH; DARPA. DASH Optimization (Xpress-MP)
Comprehension

(ENGLAND, June, 1989) - Christopher Robin is alive and well. He lives in England. He is the same person that you read about in the book, Winnie the Pooh. As a boy, Chris lived in a pretty home called Cotchfield Farm. When Chris was three years old, his father wrote a poem about him. The poem was printed in a magazine for others to read. Mr. Robin then wrote a book. He made up a fairy tale land where Chris lived. His friends were animals. There was a bear called Winnie the Pooh. There was also an owl and a young pig, called a piglet. All the animals were stuffed toys that Chris owned. Mr. Robin made them come to life with his words. The places in the story were all near Cotchfield Farm. Winnie the Pooh was written in 1925. Children still love to read about Christopher Robin and his animal friends. Most people don't know he is a real person who is grown now. He has written two books of his own. They tell what it is like to be famous.

1. Christopher Robin was born in England.
2. Winnie the Pooh is a title of a book.
3. Christopher Robin's dad was a magician.
4. Christopher Robin must be at least 65 now.

This is an Inference Problem.
Learning and Inference

- Global decisions in which several local decisions play a role, but there are mutual dependencies on their outcome.
  - In current NLP we often think about simpler structured problems: parsing, information extraction, SRL, etc.
  - As we move up the problem hierarchy (textual entailment, QA, ...), not all component models can be learned simultaneously.
  - We need to think about (learned) models for different sub-problems, often pipelined.
  - Knowledge relating sub-problems (constraints) becomes more essential and may appear only at evaluation time.
- Goal: incorporate models' information, along with prior knowledge (constraints), in making coherent decisions.
  - Decisions that respect the local models as well as domain- and context-specific knowledge/constraints.
Outline

- Constrained Conditional Models
  - A formulation for global inference with knowledge modeled as expressive structural constraints
  - A structured prediction perspective
- Decomposed Learning (DecL)
  - Efficient structure learning by reducing the learning-time inference to a small output space
  - Conditions under which DecL is provably identical to global structural learning (GL)
Three Ideas Underlying Constrained Conditional Models

- Modeling. Idea 1: Separate modeling and problem formulation from algorithms.
  - Similar to the philosophy of probabilistic modeling.
- Inference. Idea 2: Keep models simple, make expressive decisions (via constraints).
  - Unlike probabilistic modeling, where models become more expressive.
- Learning. Idea 3: Expressive structured decisions can be supported by simply learned models.
  - Amplified and minimally supervised by exploiting dependencies among models' outcomes.
Inference with General Constraint Structure [Roth & Yih '04, '07]: Recognizing Entities and Relations

Example: "Dole 's wife, Elizabeth , is a native of N.C.", with entity variables E1, E2, E3 and relation variables R12, R23.

[Figure: independently learned local classifiers assign scores to each entity label (per, loc, other) and to each relation label (spouse_of, born_in, irrelevant); constrained joint inference over these scores yields a significant performance improvement over the sequential/pipeline models.]

Note: a non-sequential model. An objective function that incorporates the learned models' scores along with knowledge (constraints):

    y = argmax ∑ score(y = v) · 1[y = v]
      = argmax score(E1 = PER) · 1[E1 = PER] + score(E1 = LOC) · 1[E1 = LOC] + ...
              + score(R12 = S-of) · 1[R12 = S-of] + ...
      subject to constraints.

Key questions:
- How to guide the global inference over independently learned or pipelined models?
- How to learn: independently or jointly?

Models could be learned separately; constraints may come up only at decision time.
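To make the idea concrete, here is a minimal sketch of constrained joint inference over independently learned local scores. It uses brute-force enumeration rather than the ILP formulation used in the actual system, and the scores, variable names, and constraints (e.g., spouse_of requiring two PER arguments) are illustrative assumptions, not values from the talk.

```python
from itertools import product

# Illustrative local scores (not the actual classifier outputs).
ENTITY_LABELS = ["per", "loc", "other"]
RELATION_LABELS = ["spouse_of", "born_in", "irrelevant"]

entity_scores = {
    "E1": {"per": 0.85, "loc": 0.05, "other": 0.10},
    "E2": {"per": 0.60, "loc": 0.10, "other": 0.30},
    "E3": {"per": 0.10, "loc": 0.45, "other": 0.45},
}
relation_scores = {
    "R12": {"spouse_of": 0.45, "born_in": 0.10, "irrelevant": 0.45},
    "R23": {"spouse_of": 0.05, "born_in": 0.50, "irrelevant": 0.45},
}
relation_args = {"R12": ("E1", "E2"), "R23": ("E2", "E3")}

def satisfies_constraints(ents, rels):
    """Declarative knowledge: spouse_of takes two people; born_in takes a person and a location."""
    for r, label in rels.items():
        a, b = relation_args[r]
        if label == "spouse_of" and not (ents[a] == "per" and ents[b] == "per"):
            return False
        if label == "born_in" and not (ents[a] == "per" and ents[b] == "loc"):
            return False
    return True

def constrained_argmax():
    best, best_score = None, float("-inf")
    for e_assign in product(ENTITY_LABELS, repeat=len(entity_scores)):
        ents = dict(zip(entity_scores, e_assign))
        for r_assign in product(RELATION_LABELS, repeat=len(relation_scores)):
            rels = dict(zip(relation_scores, r_assign))
            if not satisfies_constraints(ents, rels):
                continue
            score = sum(entity_scores[e][l] for e, l in ents.items()) + \
                    sum(relation_scores[r][l] for r, l in rels.items())
            if score > best_score:
                best, best_score = (ents, rels), score
    return best, best_score

print(constrained_argmax())
```

The point of the sketch is only that the local models are scored independently while the constraints act globally over the joint assignment; the talk's system solves the same argmax with an ILP rather than enumeration.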
Constrained Conditional Models

The objective (annotated on the slide):

    y = argmax_y  w^T φ(x, y)  −  ∑_k ρ_k d(y, 1_{C_k})

- w is the weight vector for the "local" models; φ are features or classifiers: log-linear models (HMM, CRF) or a combination.
- The second term is the (soft) constraints component: ρ_k is the penalty for violating constraint C_k, and d(y, 1_{C_k}) measures how far y is from a "legal" assignment.

How to solve? This is an Integer Linear Program; solving with ILP packages gives an exact solution. Cutting planes, dual decomposition, and other search techniques are possible (cf. the Inferning workshop).

How to train? Training is learning the objective function. Decouple? Decompose? How to exploit the structure to minimize supervision?
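As a hedged sketch of the ILP view, the snippet below encodes a tiny CCM-style inference problem with the PuLP package (assumed available via `pip install pulp`): binary indicator variables select one label per token, the objective adds the local scores, and a soft constraint is enforced through a penalized slack variable. The tokens, scores, constraint, and penalty ρ are made up for illustration.

```python
import pulp

tokens = ["Dole", "Elizabeth", "N.C."]
labels = ["PER", "LOC", "OTHER"]
score = {  # local model scores, illustrative only
    ("Dole", "PER"): 0.85, ("Dole", "LOC"): 0.05, ("Dole", "OTHER"): 0.10,
    ("Elizabeth", "PER"): 0.60, ("Elizabeth", "LOC"): 0.10, ("Elizabeth", "OTHER"): 0.30,
    ("N.C.", "PER"): 0.10, ("N.C.", "LOC"): 0.45, ("N.C.", "OTHER"): 0.45,
}

prob = pulp.LpProblem("ccm_inference", pulp.LpMaximize)
z = pulp.LpVariable.dicts("z", (tokens, labels), cat="Binary")  # z[t][l] = 1 iff token t gets label l
viol = pulp.LpVariable("violation", lowBound=0)                 # slack for a soft constraint

# Objective: local scores minus a penalty rho for violating the soft constraint.
rho = 0.3
prob += pulp.lpSum(score[(t, l)] * z[t][l] for t in tokens for l in labels) - rho * viol

# Each token gets exactly one label.
for t in tokens:
    prob += pulp.lpSum(z[t][l] for l in labels) == 1

# Soft constraint (illustrative): at most one LOC; extra LOCs are paid for via `viol`.
prob += pulp.lpSum(z[t]["LOC"] for t in tokens) <= 1 + viol

prob.solve(pulp.PULP_CBC_CMD(msg=0))
print({t: next(l for l in labels if z[t][l].value() > 0.5) for t in tokens})
```

The same pattern scales to the constraints discussed later (citation fields, SRL argument structure): each declarative statement becomes a linear constraint over the indicator variables, either hard or with a penalized slack.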
Placing in context: a crash course in structured prediction
Structured Prediction: Inference

- Inference: given input x (a document, a sentence), predict the best structure y = {y1, y2, ..., yn} ∈ Y (entities & relations).
  - Assign values to y1, y2, ..., yn, accounting for dependencies among the yi's.
- Inference is expressed as a maximization of a scoring function:

    y' = argmax_{y ∈ Y} w^T φ(x, y)

  where Y is the set of allowed structures, w are the feature weights (estimated during learning), and φ(x, y) are joint features on inputs and outputs.
- Inference requires, in principle, enumerating all y ∈ Y at decision time: given x ∈ X, we attempt to determine the best y ∈ Y for it, given w.
  - For some structures, inference is computationally easy, e.g., using the Viterbi algorithm.
  - In general, it is NP-hard (and can be formulated as an ILP).
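Since the slide cites Viterbi as a case where inference is easy, here is a minimal sketch of chain-structured argmax inference. The emission and transition scores stand in for the relevant parts of w^T φ; all names and numbers are illustrative assumptions.

```python
def viterbi(tokens, labels, emit, trans):
    """Return the highest-scoring label sequence under additive emission/transition scores."""
    # best[i][l] = best score of any labeling of tokens[:i+1] that ends in label l
    best = [{l: emit(tokens[0], l) for l in labels}]
    back = [{}]
    for i in range(1, len(tokens)):
        best.append({})
        back.append({})
        for l in labels:
            prev, s = max(
                ((p, best[i - 1][p] + trans(p, l) + emit(tokens[i], l)) for p in labels),
                key=lambda t: t[1],
            )
            best[i][l], back[i][l] = s, prev
    # Backtrack from the best final label.
    last = max(labels, key=lambda l: best[-1][l])
    path = [last]
    for i in range(len(tokens) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))

# Toy usage with made-up scores.
labels = ["PER", "OTHER"]
emit = lambda tok, l: 1.0 if tok.istitle() == (l == "PER") else 0.0
trans = lambda a, b: 0.2 if a == b else 0.0
print(viterbi(["Dole", "is", "a", "native"], labels, emit, trans))
```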
Structured Prediction: Learning

- Learning: given a set of structured examples {(x, y)}, find a scoring function w that minimizes empirical loss.
- Learning is thus driven by the attempt to find a weight vector w such that for each given annotated example (xi, yi):

    w^T φ(xi, yi)  ≥  w^T φ(xi, y) + Δ(y, yi)    ∀ y

  (score of the annotated structure ≥ score of any other structure + penalty for predicting that other structure).
- We call these conditions the learning constraints.
- In most structured learning algorithms used today, the update of the weight vector w is done in an online fashion; w.l.o.g. (almost), we can thus write the generic structured learning algorithm as follows.
- What follows is a Structured Perceptron, but with minor variations this procedure applies to CRFs and linear Structured SVMs.
Structured Prediction: Learning Algorithm

For each example (xi, yi), do (with the current weight vector w):
- Predict: perform inference with the current weight vector:
    yi' = argmax_{y ∈ Y} w^T φ(xi, y)
- Check the learning constraints: is the score of the current prediction better than that of (xi, yi)?
  - If yes (a mistaken prediction): update w.
  - Otherwise: no need to update w on this example.
EndFor

Note: in the structured prediction case, the prediction (inference) step is often intractable and needs to be done many times.
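A minimal sketch of this generic loop as a structured perceptron, assuming a user-supplied joint feature map `phi` and an `argmax_inference` routine (both hypothetical placeholders, not the talk's code); the margin/loss term Δ is omitted for simplicity.

```python
import numpy as np

def structured_perceptron(examples, phi, argmax_inference, dim, epochs=5):
    """Generic structured learning loop (perceptron-style).

    examples:          list of (x_i, y_i) pairs; structures y are assumed comparable (e.g., tuples)
    phi(x, y):         joint feature map returning a numpy vector of length `dim`
    argmax_inference:  computes argmax_{y in Y} w . phi(x, y), given (x, w) -- assumed supplied
    """
    w = np.zeros(dim)
    for _ in range(epochs):
        for x_i, y_i in examples:
            # Predict: perform inference with the current weight vector.
            y_pred = argmax_inference(x_i, w)
            # Check the learning constraint; update on a mistaken prediction.
            if y_pred != y_i and w @ phi(x_i, y_pred) >= w @ phi(x_i, y_i):
                w += phi(x_i, y_i) - phi(x_i, y_pred)
    return w
```

The expensive part is exactly the `argmax_inference` call inside the inner loop, which motivates the three decomposition solutions on the next slides.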
Structured Prediction: Learning Algorithm

Solution I: decompose the scoring function into EASY and HARD parts.

For each example (xi, yi), do:
- Predict: perform inference with the current weight vector:
    yi' = argmax_{y ∈ Y} wEASY^T φEASY(xi, y) + wHARD^T φHARD(xi, y)
- Check the learning constraint: is the score of the current prediction better than that of (xi, yi)?
  - If yes (a mistaken prediction): update w.
  - Otherwise: no need to update w on this example.
EndDo

EASY could be feature functions that correspond to an HMM, a linear CRF, or a bank of classifiers (omitting dependence on y at learning time). This may not be enough if the HARD part is still part of each inference step.
Structured Prediction: Learning Algorithm

Solution II: disregard some of the dependencies: assume a simple model.

For each example (xi, yi), do:
- Predict: perform inference with the current weight vector, keeping only the EASY part:
    yi' = argmax_{y ∈ Y} wEASY^T φEASY(xi, y)
- Check the learning constraint: is the score of the current prediction better than that of (xi, yi)?
  - If yes (a mistaken prediction): update w.
  - Otherwise: no need to update w on this example.
EndDo
Structured Prediction: Learning Algorithm

Solution III: disregard some of the dependencies during learning; take them into account at decision time.

For each example (xi, yi), do:
- Predict: perform inference with the current weight vector, using only the EASY part:
    yi' = argmax_{y ∈ Y} wEASY^T φEASY(xi, y)
- Check the learning constraint: is the score of the current prediction better than that of (xi, yi)?
  - If yes (a mistaken prediction): update w.
  - Otherwise: no need to update w on this example.
EndDo

At decision time:
    yi' = argmax_{y ∈ Y} wEASY^T φEASY(xi, y) + wHARD^T φHARD(xi, y)

This is the most commonly used solution in NLP today. A sketch of this learn-simple / decide-expressive scheme appears below.
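A minimal sketch of Solution III (illustrative only, not the talk's implementation): learn with the EASY score alone, then add the HARD part (constraints) at decision time. All helpers (`phi_easy`, `argmax_easy`, `candidates`, `feasible`, `hard_score`) are hypothetical placeholders.

```python
import numpy as np

def learn_easy(examples, phi_easy, argmax_easy, dim, epochs=5):
    """Perceptron updates driven only by the EASY (unconstrained) inference."""
    w = np.zeros(dim)
    for _ in range(epochs):
        for x, y in examples:
            y_pred = argmax_easy(x, w)            # cheap inference, no constraints
            if y_pred != y:
                w += phi_easy(x, y) - phi_easy(x, y_pred)
    return w

def predict_with_constraints(x, w, phi_easy, candidates, feasible, hard_score):
    """Decision time: restrict to feasible structures and add a constraint-based
    score on top of the learned EASY model."""
    return max(
        (y for y in candidates(x) if feasible(y)),
        key=lambda y: w @ phi_easy(x, y) + hard_score(x, y),
    )
```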
Examples: CCM Formulations

The (soft) constraints component is more general since constraints can be declarative, non-grounded statements. CCMs can be viewed as a general interface to easily combine declarative domain knowledge with data-driven statistical models.

Formulate NLP problems as ILP problems (inference may be done otherwise):
1. Sequence tagging (HMM/CRF + global constraints), e.g., argmax ∑ λij xij subject to "cannot have both A states and B states in an output sequence."
2. Sentence compression/summarization (language model + global constraints), e.g., argmax ∑ λijk xijk subject to "if a modifier is chosen, include its head; if a verb is chosen, include its arguments."
3. SRL (independent classifiers + global/linguistic constraints).

Constrained Conditional Models allow:
- Learning a simple model (or multiple models, or pipelines).
- Making decisions with a more complex model: global decisions composed of simpler models' decisions, accomplished by directly incorporating constraints to bias/re-rank the output sequence.
- More sophisticated algorithmic approaches exist to bias the output [CoDL: Chang et al. '07, '12; PR: Ganchev et al. '10; UEM: Samdani et al. '12].
Outline

- Constrained Conditional Models
  - A formulation for global inference with knowledge modeled as expressive structural constraints
  - A structured prediction perspective
- Decomposed Learning (DecL)
  - Efficient structure learning by reducing the learning-time inference to a small output space
  - Conditions under which DecL is provably identical to global structural learning (GL)
Training Constrained Conditional Models

- Training:
  - Independently of the constraints (L+I): decompose the model from the constraints.
  - Jointly, in the presence of the constraints (IBT, GL).
  - Decomposed into simpler models: decompose the model itself.
- Not surprisingly, decomposition is good. See [Chang et al., Machine Learning Journal 2012].
- Little can be said theoretically on the quality/generalization of predictions made with a decomposed model.
- Next: an algorithmic approach to decomposition that is both good and comes with interesting guarantees.
Decomposed Structured Prediction

- Inference: y = argmax_{y ∈ Y} w^T φ(x, y). Learning is driven by the attempt to find a weight vector w such that for each given annotated example (xi, yi):

    w^T φ(xi, yi)  ≥  w^T φ(xi, y) + Δ(y, yi)    ∀ y

- In Global Learning, the output space is exponential in the number of variables, so accurate learning can be intractable.
- "Standard" ways to decompose it forget some of the structure and bring it back only at decision time.

[Figure: a structure over variables y1, ..., y6 with entity weights w_e and relation weights w_r; at learning time the structure is decomposed (e.g., entities and relations learned separately), while at inference time the full structure is used.]
Decomposed Structural Learning (DecL) [Samdani & Roth, ICML'12]

- Algorithm: restrict the 'argmax' inference to a small subset of the output variables while fixing the remaining variables to their ground-truth values in yi,
  - ... and repeat for different subsets of the output variables: a decomposition.
  - The resulting set of assignments considered for yi is called a neighborhood, nbr(yi).

[Figure: variables y1, ..., y6 split into small subsets; each subset contributes its local assignments (00, 01, 10, 11 and 000, ..., 111) while the remaining variables stay at their gold labels.]

- Related work: pseudolikelihood [Besag, '77]; piecewise pseudolikelihood [Sutton and McCallum, '07]; Pseudomax [Sontag et al., '10].
- Key contribution: conditions under which DecL is provably equivalent to Global Learning (GL); experimentally, DecL provides results close to GL even when such conditions do not exactly hold.

A sketch of the DecL training loop appears below.
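A minimal sketch of the DecL update, under the assumption that the neighborhood is supplied as an explicit list of candidate structures around the gold labeling (the helpers `phi` and `neighborhood` are hypothetical placeholders, not the paper's code).

```python
import numpy as np

def decl_perceptron(examples, phi, neighborhood, dim, epochs=5):
    """Decomposed Learning sketch: the argmax is restricted to nbr(y_i)
    instead of the full (exponential) output space Y."""
    w = np.zeros(dim)
    for _ in range(epochs):
        for x_i, y_i in examples:
            # Inference only over the small neighborhood around the gold structure,
            # e.g., perturbing one subset of the decomposition at a time.
            candidates = neighborhood(x_i, y_i)
            y_pred = max(candidates, key=lambda y: w @ phi(x_i, y))
            if y_pred != y_i and w @ phi(x_i, y_pred) >= w @ phi(x_i, y_i):
                w += phi(x_i, y_i) - phi(x_i, y_pred)
    return w
```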
What are good neighborhoods? DecL vs. Global Learning (GL)

- GL: separate the ground truth from every y ∈ Y:

    w^T φ(xi, yi)  ≥  w^T φ(xi, y) + Δ(y, yi)    ∀ y ∈ Y

  For six binary variables this means 2^6 = 64 outputs.
- DecL: separate the ground truth from every y ∈ nbr(yj), i.e., the same inequality but only ∀ y ∈ nbr(yj); in the running example this is 16 outputs.
- Likely scenario: nbr(yj) is much smaller than Y.

[Figure: six binary variables y1, ..., y6; GL considers all 64 joint assignments, while DecL considers only the assignments obtained by perturbing one subset of the decomposition at a time while keeping the rest at their gold values.]
Creating Decompositions

- DecL allows different decompositions Sj for different training instances yj.
- Example: learning with decompositions in which all subsets of size k are considered: DecL-k.
  - For each k-subset of the variables, enumerate its assignments; keep the remaining n-k variables at gold.
  - k=1 is Pseudomax [Sontag et al., 2010].
  - k=2 is Constraint Classification [Har-Peled, Zimak, Roth 2002; Crammer, Singer 2002].
- In practice, neighborhoods should be determined based on domain knowledge: put highly coupled variables in the same set.
- The goal is to get results that are close to doing exact inference. Are there small & good neighborhoods?

A sketch of enumerating DecL-k neighborhoods follows.
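For illustration, here is a small sketch of enumerating a DecL-k neighborhood over discrete label sets; the representation (gold labeling as a tuple, per-variable label sets) is an assumption, not the paper's code.

```python
from itertools import combinations, product

def decl_k_neighborhood(y_gold, label_sets, k):
    """Enumerate the DecL-k neighborhood of a gold labeling: for every subset of
    k output variables, enumerate all their joint relabelings while keeping the
    remaining n-k variables fixed at their gold values.

    y_gold:     tuple of gold labels (y_1, ..., y_n)
    label_sets: label_sets[i] is the set of possible labels for variable i
    """
    n = len(y_gold)
    seen = {tuple(y_gold)}            # include the gold labeling itself
    for subset in combinations(range(n), k):
        for values in product(*(label_sets[i] for i in subset)):
            y = list(y_gold)
            for i, v in zip(subset, values):
                y[i] = v
            seen.add(tuple(y))
    return seen

# Toy usage: 4 binary variables, k = 2.
gold = (0, 1, 1, 0)
labels = [(0, 1)] * 4
nbr = decl_k_neighborhood(gold, labels, k=2)
print(len(nbr))   # 11 assignments within Hamming distance <= 2 of gold, vs 2^4 = 16 in the full space
```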
Exactness of DecL

- Key result: YES. Under "reasonable conditions", DecL with small-sized neighborhoods nbr(yj) gives the same results as Global Learning.
- For analyzing the equivalence between DecL and GL, we need a notion of 'separability' of the data.
- Separability: existence of a set of weights W* that satisfy

    W* = {w | w·φ(xj, yj) ≥ w·φ(xj, y) + Δ(yj, y), ∀ y ∈ Y}

  (score of the ground truth yj versus the score of all non-ground-truth y).
- Separating weights for DecL (the same condition over a different label space):

    Wdecl = {w | w·φ(xj, yj) ≥ w·φ(xj, y) + Δ(yj, y), ∀ y ∈ nbr(yj)}

- Naturally, W* ⊆ Wdecl.
- Exactness results: the set of separating weights for DecL is equal to the set of separating weights for GL: W* = Wdecl.
Example of Exactness: Pairwise Markov Networks

- Scoring function defined over a graph with edges E, with singleton/vertex components and pairwise/edge components φ_{i,k}(·; w).
- Assume domain knowledge on W*: for every correct (separating) w ∈ W*, each pairwise component is either
  - submodular: φ_{i,k}(0,0) + φ_{i,k}(1,1) > φ_{i,k}(0,1) + φ_{i,k}(1,0), or
  - supermodular: φ_{i,k}(0,0) + φ_{i,k}(1,1) < φ_{i,k}(0,1) + φ_{i,k}(1,0).
- For an example (xj, yj), define Ej by removing from E the edges where the gold labels disagree with the φ's:

    Ej = {(u, v) ∈ E | (sub(φ_uv) ∧ y_u^j = y_v^j) ∨ (sup(φ_uv) ∧ y_u^j ≠ y_v^j)}

- Theorem: decomposing the variables according to the connected components of Ej yields exactness.

[Figure: a small graph over y1, ..., y5 with 0/1 gold labels, illustrating which edges survive in Ej.]

A sketch of constructing such a decomposition is given below.
Experiments: Information Extraction

Citation:
Lars Ole Andersen . Program analysis and specialization for the C Programming language . PhD thesis . DIKU , University of Copenhagen , May 1994 .

Prediction result of a trained HMM: the citation is segmented into the fields [AUTHOR] [TITLE] [EDITOR] [BOOKTITLE] [TECH-REPORT] [INSTITUTION] [DATE], with the field boundaries in the wrong places. It violates lots of natural constraints!
Adding Expressivity via Constraints

- Each field must be a consecutive list of words and can appear at most once in a citation.
- State transitions must occur on punctuation marks.
- The citation can only start with AUTHOR or EDITOR.
- The words "pp.", "pages" correspond to PAGE.
- Four digits starting with 20xx or 19xx are DATE.
- Quotations can appear only in TITLE.
- ...

A sketch of checking such constraints on a predicted labeling appears after this list.
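As an illustration of how such declarative constraints can be used (for instance as the soft-constraint penalties d(y, 1_C) in the CCM objective), here is a small sketch of constraint checkers over a predicted field sequence. The representation (one field label per token) and the particular checks are assumptions for illustration, not the talk's code.

```python
import re

def count_violations(tokens, labels):
    """Count violations of a few of the citation constraints for a predicted
    label sequence (one label per token). Illustrative sketch only."""
    violations = 0
    # "Each field can appear at most once": count extra label blocks per field.
    blocks = []
    for i, lab in enumerate(labels):
        if i == 0 or lab != labels[i - 1]:
            blocks.append(lab)
    violations += sum(blocks.count(lab) - 1 for lab in set(blocks))
    # "The citation can only start with AUTHOR or EDITOR."
    if labels[0] not in ("AUTHOR", "EDITOR"):
        violations += 1
    # "State transitions must occur on punctuation marks."
    for i in range(1, len(tokens)):
        if labels[i] != labels[i - 1] and not re.fullmatch(r"[.,;:]", tokens[i - 1]):
            violations += 1
    # "Four digits starting with 19xx or 20xx are DATE."
    for tok, lab in zip(tokens, labels):
        if re.fullmatch(r"(19|20)\d\d", tok) and lab != "DATE":
            violations += 1
    return violations
```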
Information Extraction with Constraints

- Adding constraints, we get correct results!
  - [AUTHOR] Lars Ole Andersen .
  - [TITLE] Program analysis and specialization for the C Programming language .
  - [TECH-REPORT] PhD thesis .
  - [INSTITUTION] DIKU , University of Copenhagen ,
  - [DATE] May, 1994 .
- Experimental goal:
  - Investigate DecL with small neighborhoods.
  - Note that the required theoretical conditions hold only approximately: output tokens tend to appear in contiguous blocks.
  - Use neighborhoods similar to the pairwise Markov network case.
Typical Results: Information Extraction (Ads Data)

[Figure: bar chart of F1 scores (roughly 75 to 81) comparing HMM (LL), L+I, GL, and DecL.]
Typical Results: Information Extraction (Ads Data)

[Figure: F1 scores (roughly 75 to 81) alongside time taken to train (0 to 80 minutes) for HMM (LL), L+I, GL, and DecL.]
Conclusion

- Presented Constrained Conditional Models:
  - An ILP formulation for structured prediction that augments statistically learned models with declarative constraints, as a way to incorporate knowledge and support decisions in expressive output spaces.
  - Supports joint inference while maintaining modularity and tractability of training.
  - Interdependent components are learned (independently or pipelined) and, via joint inference, support coherent decisions, modulo declarative constraints.
- Presented Decomposed Learning (DecL): efficient joint learning by reducing the learning-time inference to a small output space.
  - Provided conditions under which DecL is provably identical to global structural learning (GL).
- Interesting open questions remain in developing further understanding of how to support efficient joint inference.

Thank You! Check out our tools, demos, and tutorials.