Boosting Algorithm with Sequence-loss Cost Function for Structured Prediction
Tomasz Kajdanowicz, Przemysław Kazienko, Jan Kraszewski Wroclaw University of Technology, Poland
Outline
1. Introduction to Structured Prediction
2. Problem Description
3. The Concept of AdaBoostSeq
4. Experiments
Structured prediction
• Single value prediction: a function f maps an input to a simple output (binary classification, multiclass classification, or regression). Example: predicting whether the next day will or will not be rainy on the basis of historical weather data.
• Structured prediction: prediction problems with more complex outputs. Example: predicting the weather for the next few days.
Structured prediction
• Structured prediction is a cost-sensitive prediction problem whose output has a structure of elements decomposing into variable-length vectors [Daume]. Vector notation is treated as a useful encoding not only for sequence labeling problems, e.g. a binary output vector such as (0, 1, 0, 1, 1, 1).
• Input = original input + partially produced output (an extended notion of the feature input space)
Structured prediction algorithms
• Most algorithms are based on well-known binary classification, adapted in a specific way [Nguyen et al.]
• Structured perceptron [Collins] – minimal requirements on output space shape – easy to implement – poor generalization
• Max-Margin Markov Nets [Taskar et al.] – very useful – very slow to train – limited to the Hamming loss function
Structured prediction algorithms
• Conditional Random Fields [Lafferty et al.] • extension of logistic regression to structured outputs • probabilistic outputs • good generalization
• relatively slow
• Support Vector Machine for Interdependent and Structured Outputs (SVMSTRUCT) [Tsochantaridis et al.] • more loss functions
Ensembles
• Combined may be better – the goal is to select the right components for building a good hybrid system
• Lotfi Zadeh is reputed to have said:
  – A good combined system is like: British police, German mechanics, French cuisine, Swiss banking, Italian love
  – A bad combined system is like: British cuisine, German police, French mechanics, Italian banking, Swiss love
Problem Description
prediction of sequential values
• for a single case, a sequence of output values
[Figure: the data as a table – input attributes and the corresponding output sequence for each case]
Problem Statement
• Binary sequence classification problem $f: X \to Y$, where:
  – $X$ – vector input, $Y$ – variable-length vector $(y^1, y^2, \dots, y^T)$, $y_i^{\mu} \in \{-1, 1\}$
  – $i = 1, 2, \dots, N$ – index over observations; $\mu = 1, 2, \dots, T$ – position in the sequence of length $T$
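To make the setting concrete, here is a minimal sketch of the data layout this statement implies (NumPy and the variable names are my own choices; the sizes are taken from the experiments slide):

```python
import numpy as np

# N observations, each with a d-dimensional input vector and a
# length-T sequence of binary labels y_i^mu in {-1, +1}.
N, d, T = 4019, 20, 10                       # sizes from the experiments slide
X = np.random.randn(N, d)                    # inputs: one row per observation
Y = np.random.choice([-1, 1], size=(N, T))   # outputs: one label per position
```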
Problem Statement
• Goal: $T$ classifiers combined – one per sequence position, each an optimally designed linear combination of $K$ base classifiers of the form
$$F(x) = \sum_{k=1}^{K} \alpha_k \Phi(x; \Theta_k)$$
where $\Phi(x; \Theta_k)$ is the $k$th base classifier, $\Theta_k$ the parameters of the $k$th classifier, and $\alpha_k$ the weight associated with the $k$th classifier.
General Idea of AdaBoostSeq
[Figure: the training data as a matrix – cases 1..N in rows, attribute columns 1-8 as the input, and the sequence of labels as the target]
AdaBoostSeq
• A novel algorithm for sequence prediction
• Optimization for each sequence item:
$$\arg\min_{\alpha_k, \Theta_k;\, k = 1, \dots, K} \sum_{i=1}^{N} \exp\left(-y_i F(x_i)\right)$$
• This equation is highly complex, so a stage-wise suboptimal method is performed instead
AdaBoostSeq
• By definition of the $m$th partial sum:
$$F_m(x) = \sum_{k=1}^{m} \alpha_k \Phi(x; \Theta_k), \quad m = 1, 2, \dots, K$$
• The recurrence is obvious:
$$F_m(x) = F_{m-1}(x) + \alpha_m \Phi(x; \Theta_m)$$
• Stagewise optimization – at the $m$th step, $F_{m-1}(x)$ comes from the previous step – the new target is:
$$(\alpha_m, \Theta_m) = \arg\min_{\alpha, \Theta} J(\alpha, \Theta)$$
AdaBoostSeq
$$J(\alpha, \Theta) = \sum_{i=1}^{N} \exp\left(-y_i F_{m-1}(x_i) - (1 - \xi)\, y_i R^{\mu}(x_i) - y_i\, \alpha\, \Phi(x_i; \Theta)\right)$$
where
• $\xi$ – parameter balancing the loss on the current item against the quality of the preceding items (for $\xi = 1$ the method reduces to standard AdaBoost, cf. the experiments)
• $R^{\mu}$ – impact function denoting the influence of the quality of the preceding sequence labels' prediction:
$$R^{\mu}(x_i) = \frac{1}{\mu - 1} \sum_{\tau=1}^{\mu - 1} R^{\tau}(x_i), \qquad R^{\tau}(x) = \frac{y^{\tau} F^{\tau}(x)}{\sum_{j=1}^{K} \alpha_j}$$
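A small Python sketch of the impact function as reconstructed above (function and variable names are my own; only the formulas come from the slide). For the first sequence position there are no predecessors, so the impact term is simply zero:

```python
import numpy as np

def normalized_margin(x, alphas, classifiers, y_tau):
    """R^tau(x) = y^tau * F^tau(x) / sum_j alpha_j: margin of the finished
    boosted classifier for an earlier position tau, scaled by the weight sum."""
    F = sum(a * clf(x) for a, clf in zip(alphas, classifiers))
    return y_tau * F / np.sum(alphas)

def impact(x, mu, models, y):
    """R^mu(x): mean normalized margin over the preceding positions 1..mu-1.
    `models[tau]` holds the (alphas, classifiers) pair trained for position tau;
    `mu` is 0-based here, so `range(mu)` covers the predecessors."""
    if mu == 0:                      # first position: no predecessors
        return 0.0
    return np.mean([normalized_margin(x, *models[tau], y[tau])
                    for tau in range(mu)])
```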
AdaBoostSeq
• For a given $\alpha$:
$$\Theta_m = \arg\min_{\Theta} \sum_{i=1}^{N} w_i^{(m)} \exp\left(-y_i\, \alpha\, \Phi(x_i; \Theta)\right), \qquad w_i^{(m)} = \exp\left(-y_i F_{m-1}(x_i) - (1 - \xi)\, y_i R^{\mu}(x_i)\right)$$
• Because $w_i^{(m)}$ depends neither on $\alpha$ nor on $\Phi(x_i; \Theta)$, it can be treated as a weight of $x_i$
• Due to the binary nature of the base classifier:
$$\Theta_m = \arg\min_{\Theta} P_m, \qquad P_m = \sum_{i=1}^{N} w_i^{(m)}\, I\!\left(1 - y_i \Phi(x_i; \Theta)\right)$$
• $P_m$ – weighted empirical error
$$I(x) = \begin{cases} 0, & \text{if } x \le 0 \\ 1, & \text{if } x > 0 \end{cases}$$
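In code the weighted empirical error is one line: the sum of the weights of the misclassified cases (a sketch with illustrative names):

```python
import numpy as np

def weighted_error(w, y, pred):
    """P_m = sum of w_i over cases where the base classifier disagrees
    with the label, i.e. where y_i * Phi(x_i; Theta) < 0."""
    return np.sum(w[y * pred < 0])
```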
AdaBoostSeq
• Computing the base classifier at step $m$:
$$\sum_{i:\, y_i \Phi(x_i; \Theta_m) < 0} w_i^{(m)} = P_m, \qquad \sum_{i:\, y_i \Phi(x_i; \Theta_m) > 0} w_i^{(m)} = 1 - P_m$$
AdaBoostSeq
• Getting the equations together:
$$\alpha_m = \arg\min_{\alpha}\left[\exp(-\alpha)(1 - P_m) + \exp(\alpha) P_m\right]$$
• Setting the derivative to zero yields:
$$\alpha_m = \frac{1}{2} \ln \frac{1 - P_m}{P_m}$$
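For completeness, the intermediate algebra the slide compresses into the word "derivative" (this is the standard AdaBoost step):

```latex
\frac{\partial}{\partial\alpha}\Big[e^{-\alpha}(1-P_m) + e^{\alpha}P_m\Big]
  = -e^{-\alpha}(1-P_m) + e^{\alpha}P_m = 0
\quad\Longrightarrow\quad
e^{2\alpha} = \frac{1-P_m}{P_m}
\quad\Longrightarrow\quad
\alpha_m = \frac{1}{2}\ln\frac{1-P_m}{P_m}
```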
AdaBoostSeq
• Weight of the $i$th case:
$$w_i^{(m+1)} = \frac{w_i^{(m)} \exp\left(-y_i\, \alpha_m \Phi(x_i; \Theta_m) - (1 - \xi)\, \alpha_m R^{\mu}(x_i)\right)}{Z_m}$$
• $Z_m$ – normalizing constant:
$$Z_m = \sum_{i=1}^{N} w_i^{(m)} \exp\left(-y_i\, \alpha_m \Phi(x_i; \Theta_m) - (1 - \xi)\, \alpha_m R^{\mu}(x_i)\right)$$
Algorithm AdaBoostSeq
• For each sequence position (μ = 1 to T):
  – Initialization: w_i^(1) = 1/N, i = 1, 2, ..., N; m = 1
  – While termination criterion is not met:
    • obtain optimal Θ_m and Φ(∙; Θ_m) (minimizing P_m)
    • obtain optimal P_m
    • α_m = ½ ln((1 - P_m)/P_m)
    • Z_m = 0.0
    • For i = 1 to N:
      – w_i^(m+1) = w_i^(m) exp(-y_i α_m Φ(x_i; Θ_m) - (1 - ξ) α_m R^μ(x))
      – Z_m = Z_m + w_i^(m+1)
    • End For
    • For i = 1 to N:
      – w_i^(m+1) = w_i^(m+1) / Z_m
    • End For
    • K = m; m = m + 1
  – End While
  – f^μ(∙) = sign(Σ_{k=1}^{K} α_k Φ(∙; Θ_k))
• End For
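Putting the pseudocode together, here is a compact Python sketch of the training loop for a single sequence position μ (a reconstruction from the slides, not the authors' reference code; `fit_stump` stands in for an assumed base-learner routine, and `R` is the per-case impact vector from the impact-function slide, fixed for the current position):

```python
import numpy as np

def adaboost_seq_position(X, y, R, fit_stump, xi=0.6, n_rounds=50):
    """Train the boosted classifier f^mu for one sequence position.
    X: (N, d) inputs; y: (N,) labels in {-1, +1};
    R: (N,) impact values R^mu(x_i) from preceding positions (zeros for mu=1);
    xi: the sequence-loss parameter (xi=1 recovers standard AdaBoost)."""
    N = len(y)
    w = np.full(N, 1.0 / N)                 # w_i^(1) = 1/N
    alphas, stumps = [], []
    for _ in range(n_rounds):
        stump = fit_stump(X, y, w)          # base learner minimizing P_m (assumed)
        pred = stump.predict(X)             # Phi(x_i; Theta_m) in {-1, +1}
        P_m = np.sum(w[y * pred < 0])       # weighted empirical error
        if P_m <= 0.0 or P_m >= 0.5:        # one common termination criterion
            break
        alpha = 0.5 * np.log((1 - P_m) / P_m)
        # weight update: the usual AdaBoost term plus the impact of
        # errors on preceding sequence items, balanced by xi
        w = w * np.exp(-y * alpha * pred - (1 - xi) * alpha * R)
        w /= w.sum()                        # divide by Z_m
        alphas.append(alpha)
        stumps.append(stump)

    def f_mu(X_new):                        # f^mu(.) = sign(sum_k alpha_k Phi(.))
        scores = sum(a * s.predict(X_new) for a, s in zip(alphas, stumps))
        return np.sign(scores)
    return f_mu, alphas, stumps
```

Running this once per position μ = 1..T, feeding each call the impact values computed from the positions already trained, reproduces the outer loop of the pseudocode.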
Profile of AdaBoostSeq
• A new algorithm for sequence prediction
• For each sequence item – AdaBoostSeq also considers prediction errors on all previous items in the sequence within the boosting algorithm – the more errors on previous sequence items, the stronger the focus on misclassified cases at the current item
• Self-adaptive
Experiments
• 4019 cases in the dataset
• 20 input features
• Sequence length = 10
• Decision stump as the base classifier
• 10-fold cross-validation
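The evaluation protocol could be reproduced along these lines (a sketch assuming scikit-learn, which the slides do not name; a depth-1 decision tree serves as the decision stump):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

def make_stump(X, y, w):
    """Decision stump = depth-1 tree, fit on the current boosting weights."""
    return DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)

def cross_validate(X, Y, train_fn, n_splits=10, seed=0):
    """10-fold CV; Y holds the length-10 label sequences in {-1, +1}."""
    errors = []
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in kf.split(X):
        model = train_fn(X[train_idx], Y[train_idx])     # returns a predictor
        pred = model(X[test_idx])                        # (n_test, T) labels
        errors.append(np.mean(pred != Y[test_idx]))      # per-label error rate
    return float(np.mean(errors))
```

`make_stump` matches the `fit_stump` hook assumed in the training sketch above.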
AdaBoost vs. AdaBoostSeq (with ξ)
[Bar chart: sequence mean absolute error for ξ = 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0; the plotted values are 0.0762, 0.0810, 0.0828, 0.0872, 0.0922, 0.0945, 0.1011]
• ξ = 0.6 is the best (lowest error, 0.0762)
• For ξ = 1 the method is standard AdaBoost – the worst (highest error, 0.1011)
Summary of the Experiments
• ξ influences the error
• For ξ = 0.6 the error decreases by 24% over the whole sequence compared to the standard approach (ξ = 1)
• For items 2+ the error is reduced dramatically (6 times!) since AdaBoostSeq respects errors on previous items
[Line chart: mean absolute error per sequence item (1-10) for ξ = 0.4, 0.6, 0.8, 1]
Conclusions and Future Work
• AdaBoostSeq – a new algorithm for sequence prediction based on AdaBoost
• When predicting subsequent items in the sequence, the errors from the previous items are utilized
• Much more accurate than AdaBoost applied to sequence items independently
• Parametrized with ξ – controls how much the errors are respected
• Recent application: prediction for debt valuation
• Future work: new cost functions (on the HMM canvas)