Boosting Algorithm with Sequence-loss Cost Function for Structured Prediction
Tomasz Kajdanowicz, Przemysław Kazienko, Jan Kraszewski Wroclaw University of Technology, Poland
Outline
1. Introduction to Structured Prediction
2. Problem Description
3. The Concept of AdaBoostSeq
4. Experiments
Structured prediction
• Single value prediction: a function f maps an input to a simple output (binary classification, multiclass classification, or regression). Example: predicting whether the next day will or will not be rainy on the basis of historical weather data.
• Structured prediction: prediction problems with more complex outputs. Example: predicting the weather for the next few days.
Structured prediction
• Structured prediction is a cost-sensitive prediction problem whose output has a structure of elements decomposing into variable-length vectors [Daume]. Vector notation is treated as a useful encoding not only for sequence labeling problems, e.g. a binary output vector such as (0, 1, 0, 1, 1, 1).
• Input = original input + partially produced output (an extended notion of the feature input space)
Structured prediction algorithms
• Most algorithms are based on well-known binary classification, adapted in a specific way [Nguyen et al.]
• Structured perceptron [Collins] – minimal requirements on output space shape – easy to implement – poor generalization
• Max-Margin Markov Nets [Taskar et al.] – very useful – very slow to train – limited to the Hamming loss function
Structured prediction algorithms
• Conditional Random Fields [Lafferty et al.] • extension of logistic regression to structured outputs • probabilistic outputs • good generalization
• relatively slow
• Support Vector Machine for Interdependent and Structured Outputs (SVMSTRUCT) [Tsochantaridis et al.] • more loss functions
Ensembles
• Combined may be better – the goal is to select the right components for building a good hybrid system
• Lotfi Zadeh is reputed to have said:
  – A good combined system is like: British police, German mechanics, French cuisine, Swiss banking, Italian love
  – A bad combined system is like: British cuisine, German police, French mechanics, Italian banking, Swiss love
Problem Description
prediction of sequential values
• for a single case, a sequence of output values
[Figure: the data as a table – input attributes and the corresponding output sequence for each case]
Problem Statement
• Binary sequence classification problem $f: X \to Y$, where:
  – $X$ – vector input, $Y$ – variable-length vector $(y^1, y^2, \dots, y^T)$, $y_i^{\mu} \in \{-1, 1\}$
  – $i = 1, 2, \dots, N$ – index over observations; $\mu = 1, 2, \dots, T$ – position in the sequence of length $T$
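To make the setting concrete, here is a minimal sketch of the data layout this statement implies (NumPy and the variable names are my own choices; the sizes are taken from the experiments slide):

```python
import numpy as np

# N observations, each with a d-dimensional input vector and a
# length-T sequence of binary labels y_i^mu in {-1, +1}.
N, d, T = 4019, 20, 10                       # sizes from the experiments slide
X = np.random.randn(N, d)                    # inputs: one row per observation
Y = np.random.choice([-1, 1], size=(N, T))   # outputs: one label per position
```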
Problem Statement
• Goal: $T$ classifiers combined – one per sequence position, each an optimally designed linear combination of $K$ base classifiers of the form
$$F(x) = \sum_{k=1}^{K} \alpha_k \Phi(x; \Theta_k)$$
where $\Phi(x; \Theta_k)$ is the $k$th base classifier, $\Theta_k$ the parameters of the $k$th classifier, and $\alpha_k$ the weight associated with the $k$th classifier.
General Idea of AdaBoostSeq
[Figure: the training data as a matrix – cases 1..N in rows, attribute columns 1-8 as the input, and the sequence of labels as the target]
AdaBoostSeq
• A novel algorithm for sequence prediction
• Optimization for each sequence item:
$$\arg\min_{\alpha_k, \Theta_k;\, k = 1, \dots, K} \sum_{i=1}^{N} \exp\left(-y_i F(x_i)\right)$$
• This equation is highly complex, so a stage-wise suboptimal method is performed instead
AdaBoostSeq
• By definition of the $m$th partial sum:
$$F_m(x) = \sum_{k=1}^{m} \alpha_k \Phi(x; \Theta_k), \quad m = 1, 2, \dots, K$$
• The recurrence is obvious:
$$F_m(x) = F_{m-1}(x) + \alpha_m \Phi(x; \Theta_m)$$
• Stagewise optimization – at the $m$th step, $F_{m-1}(x)$ comes from the previous step – the new target is:
$$(\alpha_m, \Theta_m) = \arg\min_{\alpha, \Theta} J(\alpha, \Theta)$$
AdaBoostSeq
$$J(\alpha, \Theta) = \sum_{i=1}^{N} \exp\left(-y_i F_{m-1}(x_i) - (1 - \xi)\, y_i R^{\mu}(x_i) - y_i\, \alpha\, \Phi(x_i; \Theta)\right)$$
where
• $\xi$ – parameter balancing the loss on the current item against the quality of the preceding items (for $\xi = 1$ the method reduces to standard AdaBoost, cf. the experiments)
• $R^{\mu}$ – impact function denoting the influence of the quality of the preceding sequence labels' prediction:
$$R^{\mu}(x_i) = \frac{1}{\mu - 1} \sum_{\tau=1}^{\mu - 1} R^{\tau}(x_i), \qquad R^{\tau}(x) = \frac{y^{\tau} F^{\tau}(x)}{\sum_{j=1}^{K} \alpha_j}$$
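A small Python sketch of the impact function as reconstructed above (function and variable names are my own; only the formulas come from the slide). For the first sequence position there are no predecessors, so the impact term is simply zero:

```python
import numpy as np

def normalized_margin(x, alphas, classifiers, y_tau):
    """R^tau(x) = y^tau * F^tau(x) / sum_j alpha_j: margin of the finished
    boosted classifier for an earlier position tau, scaled by the weight sum."""
    F = sum(a * clf(x) for a, clf in zip(alphas, classifiers))
    return y_tau * F / np.sum(alphas)

def impact(x, mu, models, y):
    """R^mu(x): mean normalized margin over the preceding positions 1..mu-1.
    `models[tau]` holds the (alphas, classifiers) pair trained for position tau;
    `mu` is 0-based here, so `range(mu)` covers the predecessors."""
    if mu == 0:                      # first position: no predecessors
        return 0.0
    return np.mean([normalized_margin(x, *models[tau], y[tau])
                    for tau in range(mu)])
```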
AdaBoostSeq
• For a given $\alpha$:
$$\Theta_m = \arg\min_{\Theta} \sum_{i=1}^{N} w_i^{(m)} \exp\left(-y_i\, \alpha\, \Phi(x_i; \Theta)\right), \qquad w_i^{(m)} = \exp\left(-y_i F_{m-1}(x_i) - (1 - \xi)\, y_i R^{\mu}(x_i)\right)$$
• Because $w_i^{(m)}$ depends neither on $\alpha$ nor on $\Phi(x_i; \Theta)$, it can be treated as a weight of $x_i$
• Due to the binary nature of the base classifier:
$$\Theta_m = \arg\min_{\Theta} P_m, \qquad P_m = \sum_{i=1}^{N} w_i^{(m)}\, I\!\left(1 - y_i \Phi(x_i; \Theta)\right)$$
• $P_m$ – weighted empirical error
$$I(x) = \begin{cases} 0, & \text{if } x \le 0 \\ 1, & \text{if } x > 0 \end{cases}$$
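In code the weighted empirical error is one line: the sum of the weights of the misclassified cases (a sketch with illustrative names):

```python
import numpy as np

def weighted_error(w, y, pred):
    """P_m = sum of w_i over cases where the base classifier disagrees
    with the label, i.e. where y_i * Phi(x_i; Theta) < 0."""
    return np.sum(w[y * pred < 0])
```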
AdaBoostSeq
• Computing the base classifier at step $m$:
$$\sum_{i:\, y_i \Phi(x_i; \Theta_m) < 0} w_i^{(m)} = P_m, \qquad \sum_{i:\, y_i \Phi(x_i; \Theta_m) > 0} w_i^{(m)} = 1 - P_m$$
AdaBoostSeq
• Getting the equations together:
$$\alpha_m = \arg\min_{\alpha}\left[\exp(-\alpha)(1 - P_m) + \exp(\alpha) P_m\right]$$
• Setting the derivative to zero yields:
$$\alpha_m = \frac{1}{2} \ln \frac{1 - P_m}{P_m}$$
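For completeness, the intermediate algebra the slide compresses into the word "derivative" (this is the standard AdaBoost step):

```latex
\frac{\partial}{\partial\alpha}\Big[e^{-\alpha}(1-P_m) + e^{\alpha}P_m\Big]
  = -e^{-\alpha}(1-P_m) + e^{\alpha}P_m = 0
\quad\Longrightarrow\quad
e^{2\alpha} = \frac{1-P_m}{P_m}
\quad\Longrightarrow\quad
\alpha_m = \frac{1}{2}\ln\frac{1-P_m}{P_m}
```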
AdaBoostSeq
• Weight of the $i$th case:
$$w_i^{(m+1)} = \frac{w_i^{(m)} \exp\left(-y_i\, \alpha_m \Phi(x_i; \Theta_m) - (1 - \xi)\, \alpha_m R^{\mu}(x_i)\right)}{Z_m}$$
• $Z_m$ – normalizing constant:
$$Z_m = \sum_{i=1}^{N} w_i^{(m)} \exp\left(-y_i\, \alpha_m \Phi(x_i; \Theta_m) - (1 - \xi)\, \alpha_m R^{\mu}(x_i)\right)$$
Algorithm AdaBoostSeq
• For each sequence position (μ = 1 to T):
  – Initialization: w_i^(1) = 1/N, i = 1, 2, ..., N; m = 1
  – While termination criterion is not met:
    • obtain optimal Θ_m and Φ(∙; Θ_m) (minimizing P_m)
    • obtain optimal P_m
    • α_m = ½ ln((1 - P_m)/P_m)
    • Z_m = 0.0
    • For i = 1 to N:
      – w_i^(m+1) = w_i^(m) exp(-y_i α_m Φ(x_i; Θ_m) - (1 - ξ) α_m R^μ(x))
      – Z_m = Z_m + w_i^(m+1)
    • End For
    • For i = 1 to N:
      – w_i^(m+1) = w_i^(m+1) / Z_m
    • End For
    • K = m; m = m + 1
  – End While
  – f^μ(∙) = sign(Σ_{k=1}^{K} α_k Φ(∙; Θ_k))
• End For
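Putting the pseudocode together, here is a compact Python sketch of the training loop for a single sequence position μ (a reconstruction from the slides, not the authors' reference code; `fit_stump` stands in for an assumed base-learner routine, and `R` is the per-case impact vector from the impact-function slide, fixed for the current position):

```python
import numpy as np

def adaboost_seq_position(X, y, R, fit_stump, xi=0.6, n_rounds=50):
    """Train the boosted classifier f^mu for one sequence position.
    X: (N, d) inputs; y: (N,) labels in {-1, +1};
    R: (N,) impact values R^mu(x_i) from preceding positions (zeros for mu=1);
    xi: the sequence-loss parameter (xi=1 recovers standard AdaBoost)."""
    N = len(y)
    w = np.full(N, 1.0 / N)                 # w_i^(1) = 1/N
    alphas, stumps = [], []
    for _ in range(n_rounds):
        stump = fit_stump(X, y, w)          # base learner minimizing P_m (assumed)
        pred = stump.predict(X)             # Phi(x_i; Theta_m) in {-1, +1}
        P_m = np.sum(w[y * pred < 0])       # weighted empirical error
        if P_m <= 0.0 or P_m >= 0.5:        # one common termination criterion
            break
        alpha = 0.5 * np.log((1 - P_m) / P_m)
        # weight update: the usual AdaBoost term plus the impact of
        # errors on preceding sequence items, balanced by xi
        w = w * np.exp(-y * alpha * pred - (1 - xi) * alpha * R)
        w /= w.sum()                        # divide by Z_m
        alphas.append(alpha)
        stumps.append(stump)

    def f_mu(X_new):                        # f^mu(.) = sign(sum_k alpha_k Phi(.))
        scores = sum(a * s.predict(X_new) for a, s in zip(alphas, stumps))
        return np.sign(scores)
    return f_mu, alphas, stumps
```

Running this once per position μ = 1..T, feeding each call the impact values computed from the positions already trained, reproduces the outer loop of the pseudocode.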
Profile of AdaBoostSeq
• A new algorithm for sequence prediction
• For each sequence item – AdaBoostSeq also considers prediction errors on all previous items in the sequence within the boosting algorithm – the more errors on previous sequence items, the stronger the focus on misclassified cases at the current item
• Self-adaptive
Experiments
• 4019 cases in the dataset
• 20 input features
• Sequence length = 10
• Decision stump as the base classifier
• 10-fold cross-validation
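The evaluation protocol could be reproduced along these lines (a sketch assuming scikit-learn, which the slides do not name; a depth-1 decision tree serves as the decision stump):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

def make_stump(X, y, w):
    """Decision stump = depth-1 tree, fit on the current boosting weights."""
    return DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)

def cross_validate(X, Y, train_fn, n_splits=10, seed=0):
    """10-fold CV; Y holds the length-10 label sequences in {-1, +1}."""
    errors = []
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in kf.split(X):
        model = train_fn(X[train_idx], Y[train_idx])     # returns a predictor
        pred = model(X[test_idx])                        # (n_test, T) labels
        errors.append(np.mean(pred != Y[test_idx]))      # per-label error rate
    return float(np.mean(errors))
```

`make_stump` matches the `fit_stump` hook assumed in the training sketch above.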
AdaBoost vs. AdaBoostSeq (with ξ)
[Bar chart: sequence mean absolute error for ξ = 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0; the plotted values are 0.0762, 0.0810, 0.0828, 0.0872, 0.0922, 0.0945, 0.1011]
• ξ = 0.6 is the best (lowest error, 0.0762)
• For ξ = 1 the method is standard AdaBoost – the worst (highest error, 0.1011)
Summary of the Experiments
• ξ influences the error
• For ξ = 0.6 the error decreases by 24% over the whole sequence compared to the standard approach (ξ = 1)
• For items 2+ the error is reduced dramatically (6 times!) since AdaBoostSeq respects errors on previous items
[Line chart: mean absolute error per sequence item (1-10) for ξ = 0.4, 0.6, 0.8, 1]
Conclusions and Future Work
• AdaBoostSeq – a new algorithm for sequence prediction based on AdaBoost
• When predicting subsequent items in the sequence, the errors from the previous items are utilized
• Much more accurate than AdaBoost applied to sequence items independently
• Parametrized with ξ – controls how much the errors are respected
• Recent application: prediction for debt valuation
• Future work: new cost functions (on the HMM canvas)