Online Learning of Prediction Suffix Trees

Nikos Karampatziakis and Dexter Kozen
Cornell University

April 24, 2009
To appear in ICML 2009
Introduction

Sequential prediction
Prediction Suffix Trees
Learning algorithm
Motivation

Monitoring of applications in a computer system
System calls
Prediction model
Some assumptions
Outline

Prediction Suffix Trees
Algorithm and Properties
Results
What is the next item in the sequence?

open(), read(), read(), . . .
G, A, T, T, A, C, . . .
+1, −1, +1, −1, +1, . . .
all, your, base, . . .

Formally, predict y_t from y_1, y_2, . . . , y_{t−2}, y_{t−1}: that is, y_t = h(y_1^{t−1}), where y_i^j := y_i, . . . , y_j.

More general problem: p(y_t | y_1^{t−1}) → Markovian Assumption
Markov Models

Zero order: p(y_t | y_1^{t−1}) = p(y_t)

First order:
[Figure: a depth-one tree branching on y_{t−1} ∈ {A, C, G, T}; each child stores a distribution such as p(y_t | y_{t−1} = G).]
Markov Models

Second order:
[Figure: a depth-two tree branching on y_{t−1} and then y_{t−2}, both over {A, C, G, T}; each of the 16 leaves stores a distribution such as p(y_t | y_{t−1} = C, y_{t−2} = G).]

Third order, k-th order, etc.
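To make the cost of a k-th order model concrete, here is a minimal count-based Markov predictor (an illustrative sketch, not from the talk; the class and its methods are invented names). Its table can hold up to |Σ|^k contexts — 4^k for DNA — which is exactly the exponential blowup discussed on the next slide.

from collections import Counter, defaultdict

class MarkovModel:
    def __init__(self, k):
        self.k = k
        self.counts = defaultdict(Counter)   # context -> Counter of next symbols

    def update(self, history, y):
        # record that context history[-k:] was followed by y
        self.counts[tuple(history[-self.k:])][y] += 1

    def predict(self, history):
        ctx = self.counts.get(tuple(history[-self.k:]))
        return max(ctx, key=ctx.get) if ctx else None

m = MarkovModel(k=2)
seq = list("GATTACA")
for i in range(m.k, len(seq)):
    m.update(seq[:i], seq[i])
print(m.predict(list("GAT")))   # context ('A','T') was followed by 'T'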
Markov Models — Problems

How to select the best k?
Number of model parameters is exponential in k.
Poor family of models: what if I want a model with 100 parameters?
What if y_t does not depend on y_{t−2} when y_{t−1} = A, but does depend on it when y_{t−1} = T?

Prediction Suffix Trees address these problems [Willems et al., 1995, Ron et al., 1996, Helmbold & Schapire, 1997, Pereira & Singer, 1999].
Prediction Suffix Trees

[Figure: the full depth-two tree over {A, C, G, T} from the previous slide, branching on y_{t−1} and then y_{t−2}.]
Prediction Suffix Trees

Keep a model for each k. Use a weighted sum.

[Figure: the same tree, now annotated with a model at every node: p(y_t) at the root, p(y_t | A), p(y_t | C), p(y_t | G), p(y_t | T) at depth one, and the p(y_t | y_{t−1}, y_{t−2}) models at depth two.]
Prediction Suffix Trees

Keep a model for each k. Use a weighted sum. Prune useless branches and subtrees.

[Figure: the pruned tree keeps p(y_t) and the depth-one models p(y_t | A), . . . , p(y_t | T), but only a few of the depth-two children survive.]

Assumptions: p(y_t | y_{t−1} = T, y_{t−2}) = p(y_t | y_{t−1} = T), and p(y_t | y_{t−1} = A, y_{t−2} ≠ A) = p(y_t | y_{t−1} = A).
Moving Away from Probabilistic Modeling

We just want to predict y_t from y_1^{t−1}.
Assume y_t ∈ {−1, +1}.
Store a weight instead of a probability.
Use the sign of a weighted sum of the values in the appropriate path.
Example

True Sequence: −1, −1, +1, −1, −1, +1, . . .
Discounting: a node at depth d is discounted by 2^{−d}.
Tree: the root has weight 0; its −1 child has weight −1 and its +1 child has weight −2; under the −1 child, the −1 grandchild has weight 5.

Input Suffix      Decision
. . . , −1, −1    0 − (1/2)·1 + (1/4)·5 = 3/4 → sign → +1
. . . , +1, −1    0 − (1/2)·1 = −1/2 → sign → −1
. . . , +1        0 − (1/2)·2 = −1 → sign → −1
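The decision rule on this slide is easy to code. Below is a minimal sketch (the nested-dict tree representation is an assumption for illustration, not the authors' implementation): walk the tree along the reversed recent history, discount the weight at depth d by 2^{−d}, and output the sign.

def pst_decide(tree, history):
    """tree: nested dict {'w': weight, 'children': {symbol: subtree}}."""
    node = tree
    score = node['w']                      # root is at depth 0, discount 2**0 = 1
    depth = 0
    for sym in reversed(history):          # y_{t-1}, then y_{t-2}, ...
        node = node['children'].get(sym)
        if node is None:                   # path ends where the tree ends
            break
        depth += 1
        score += 2.0 ** (-depth) * node['w']
    return +1 if score > 0 else -1

# The tree from the slide: root 0; child -1 -> weight -1; child +1 -> weight -2;
# grandchild for suffix (-1, -1) -> weight 5.
tree = {'w': 0, 'children': {
    -1: {'w': -1, 'children': {-1: {'w': 5, 'children': {}}}},
    +1: {'w': -2, 'children': {}},
}}
print(pst_decide(tree, [-1, -1]))   # 0 - 1/2 + 5/4 = 3/4  -> +1
print(pst_decide(tree, [+1, -1]))   # 0 - 1/2 = -1/2       -> -1
print(pst_decide(tree, [+1]))       # 0 - 1 = -1           -> -1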
Remarks

Notation for the node values: g_{t,s}.

Discounting scheme in this work:
x^+_{t,s} = (1 − ε)^{|s|} if s = y_{t−i}^{t−1} for some i = 1, . . . , t − 1, and x^+_{t,s} = 0 otherwise.
The best value of ε will be determined (much) later.

Decision at time t: sign( Σ_s g_{t,s} x^+_{t,s} ) = sign(⟨g_t, x^+_t⟩)
Outline

Prediction Suffix Trees
Algorithm and Properties
Results
Algorithms for Linear Prediction

SVMs, logistic regression, . . .
Focus on the online setting. The algorithm maintains a hypothesis w_t. In each round:
  The algorithm receives information x_t.
  It outputs ŷ_t = sign(⟨w_t, x_t⟩).
  It receives the correct output y_t.
  It updates: w_{t+1} ← f(w_t).

A mistake is made when y_t ⟨w_t, x_t⟩ ≤ 0.
Perceptron

If a mistake is made at round t: w_{t+1} = w_t + α y_t x_t.
α is a learning rate. It is selected so that it optimizes various tradeoffs.
Balanced Winnow

Balanced Winnow [Littlestone, 1989]:
  θ_1 ← 0
  for t = 1, 2, . . . , T do
    w_{t,i} ← e^{θ_{t,i}} / Σ_j e^{θ_{t,j}}
    ŷ_t ← ⟨w_t, x_t⟩
    if y_t ŷ_t ≤ 0 then θ_{t+1} ← θ_t + α y_t x_t else θ_{t+1} ← θ_t

Perceptron [Rosenblatt, 1958]:
  θ_1 ← 0
  for t = 1, 2, . . . , T do
    w_t ← θ_t
    ŷ_t ← ⟨w_t, x_t⟩
    if y_t ŷ_t ≤ 0 then θ_{t+1} ← θ_t + α y_t x_t else θ_{t+1} ← θ_t

Important for Balanced Winnow, the features come in positive/negative pairs:
x_t = [x^+_t, −x^+_t] = [x^+_{t,1}, . . . , x^+_{t,d}, −x^+_{t,1}, . . . , −x^+_{t,d}]
with x^+_{t,s} = (1 − ε)^{|s|} if s = y_{t−i}^{t−1} for some i = 1, . . . , t − 1, and 0 otherwise.
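For concreteness, here is a toy Python rendering of both loops on dense vectors (an illustrative sketch, not the paper's code; run_online and its synthetic data are made up). The only difference between the two learners is how w_t is derived from θ_t.

import numpy as np

def run_online(xs, ys, alpha=0.5, winnow=True):
    d = xs.shape[1]
    theta = np.zeros(d)
    mistakes = 0
    for x, y in zip(xs, ys):
        if winnow:
            w = np.exp(theta)
            w /= w.sum()                  # w_{t,i} = e^{theta_i} / sum_j e^{theta_j}
        else:
            w = theta                     # Perceptron: w_t = theta_t
        y_hat = w @ x
        if y * y_hat <= 0:                # mistake (or zero-margin) round
            theta = theta + alpha * y * x
            mistakes += 1
    return theta, mistakes

# Toy usage: features in the [x+, -x+] balanced form, labels in {-1, +1}.
rng = np.random.default_rng(0)
xp = rng.random((200, 5))
xs = np.hstack([xp, -xp])
ys = np.sign(xp[:, 0] - 0.5)
print(run_online(xs, ys)[1], run_online(xs, ys, winnow=False)[1])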
Advantages

Multiplicative updates — fast convergence.
Can cope with many features when few of them are relevant.
For our application it can track changes better.
Where’s the catch?

Perceptron and Winnow don’t care about the sparsity of their hypothesis. Eventually, we have a feature for every substring of y_1^T. Implications:
  Need to learn O(T²) weights in T rounds.
  Need to store O(T²) numbers.
Naively, all weights affect the decision and must be stored.
Does the decision depend on every feature?

Recall the specific form of the features: x_t = [x^+_t, x^−_t] = [x^+_t, −x^+_t].
Notation: w_t = [w^+_t, w^−_t] and θ_t = [θ^+_t, θ^−_t]. Recall w_{t,i} = e^{θ_{t,i}} / Σ_j e^{θ_{t,j}}.
An easy inductive argument shows that θ^−_t = −θ^+_t.

Decision:
ŷ_t = ⟨w_t, x_t⟩ = ⟨w^+_t − w^−_t, x^+_t⟩ = Σ_{i=1}^d [ sinh(θ^+_{t,i}) / Σ_{j=1}^d cosh(θ^+_{t,j}) ] x^+_{t,i} ∝ Σ_{i=1}^d sinh(θ^+_{t,i}) x^+_{t,i}

where sinh(x) = (e^x − e^{−x})/2 and cosh(x) = (e^x + e^{−x})/2.

Iff θ^+_{t,i} = 0, the decision does not depend on feature i.
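Since sinh(0) = 0, only the suffixes with nonzero θ^+ need to be stored or touched at decision time. A tiny sketch of that sparse decision (the dict-of-suffixes layout is a hypothetical choice for illustration):

from math import sinh

def sparse_decide(theta_plus, x_plus):
    """theta_plus, x_plus: dicts mapping suffix -> value; zero entries omitted."""
    score = sum(sinh(th) * x_plus.get(s, 0.0) for s, th in theta_plus.items())
    return +1 if score > 0 else -1

theta_plus = {(-1,): 1.0}                       # a single nonzero node
x_plus = {(): 1.0, (-1,): 0.5, (-1, -1): 0.25}  # discounted suffix features
print(sparse_decide(theta_plus, x_plus))        # (1/2)*sinh(1) > 0 -> +1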
Observations

If θ_{t,s} = 0 we don’t have to store it.
Initially θ_1 = 0. The tree has only one node.
As mistakes are made, the tree grows.
The classic Winnow/Perceptron update quickly destroys sparsity.
Illustration

Input Sequence: . . . , −1, +1, −1, +1, −1, ?
x^+_{t,s} = 2^{−|s|} if s = y_{t−i}^{t−1} for some i = 1, . . . , t − 1, and 0 otherwise.
Tree: the root (θ = 0) has a −1 child with θ = 1 and a +1 child with θ = −2; deeper nodes hold −1 and −5.

Decision: (1/2) · sinh(1) > 0 → sign → +1. The revealed symbol is −1, so this round is a mistake.

[Figure: after the classic update, every suffix of the input receives a node. The tree sprouts a chain of fresh nodes with values −1/4, −1/8, −1/16, . . . , one per depth, all the way down — sparsity is gone.]
Summary of the Problem

Mistake at time t: O(t) nodes are inserted in the tree.
x^+_{t,s} is non-zero even when s is very long.
Bad idea: change the definition of x^+_{t,s} to avoid this.
Need to learn a good θ while keeping it sparse. Not all sparse vectors are equal.
Better Update Rule

Adaptive bound d_t on the length of the suffixes — the same as the depth up to which the tree will grow on round t. d_t will be growing slowly if necessary.

New update:
θ_{t+1,s} = θ_{t,s} + α y_t x_{t,s} if |s| ≤ d_t, and θ_{t+1,s} = θ_{t,s} otherwise.

Equivalently: θ_{t+1} = θ_t + α y_t x_t + α n_t, where n_t is a noise vector that cancels part of the update:
n_{t,s} = −y_t x_{t,s} if |s| > d_t, and n_{t,s} = 0 otherwise.
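In code, the depth bound simply caps the suffix lengths the update touches, so at most d_t + 1 nodes are created per mistake. A sketch under the sparse dict layout assumed earlier (function and parameter names are illustrative):

def depth_bounded_update(theta, history, y, alpha, d_t, eps=0.5):
    # theta: dict suffix-tuple -> value, stored sparsely
    for length in range(0, d_t + 1):                 # |s| = 0, 1, ..., d_t
        if length > len(history):
            break
        s = tuple(history[len(history) - length:])   # suffix ending at y_{t-1}
        x = (1 - eps) ** length                      # x^+_{t,s} = (1-eps)^{|s|}
        theta[s] = theta.get(s, 0.0) + alpha * y * x

theta = {}
depth_bounded_update(theta, [+1, -1, +1], y=-1, alpha=1.0, d_t=2)
print(theta)   # {(): -1.0, (1,): -0.5, (-1, 1): -0.25}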
Setting d_t

h_t = the length of the path from the root using y_{t−1}, y_{t−2}, . . .
J_t = the subset of rounds 1, . . . , t in which a mistake was made.
P_t = Σ_{i∈J_t} ||n_i||_∞ = Σ_{i∈J_t} (1 − ε)^{d_i + 1}.

d_t will be the smallest integer such that P_t ≤ |J_t|^{2/3}. To guarantee that, we can set

d_t = max( h_t, log_{1−ε}( (P_{t−1}^3 + 2 P_{t−1}^{3/2} + 1)^{1/3} − P_{t−1} ) − 1 )
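A small helper computing this bound (a sketch; set_depth_bound and its arguments are hypothetical names, and the cube-root reading of the radical is assumed). Note that log base (1 − ε) < 1, and d_t must be an integer at least the bound, hence the ceiling.

import math

def set_depth_bound(h_t, P_prev, eps):
    inner = (P_prev**3 + 2 * P_prev**1.5 + 1) ** (1 / 3) - P_prev
    bound = math.log(inner, 1 - eps) - 1
    return max(h_t, math.ceil(bound))

print(set_depth_bound(h_t=3, P_prev=2.0, eps=0.5))   # 3: the path length dominates here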
Algorithm Properties — Learning

Mistake Bound: If there exists a tree which, over the input sequence y_1, y_2, . . . , y_T, correctly predicts all items with confidence ≥ δ, our algorithm makes at most

max( (8 log T)/δ², 64/δ³ )

mistakes. Instead of assuming a perfect tree, assume a tree u that attains a cumulative δ-hinge loss L > 0, where L = Σ_{t=1}^T ℓ_t(u) and ℓ_t(u) = max(0, δ − y_t ⟨u, x_t⟩). The mistake bound becomes

max( 2L/δ + (8 log T)/δ², 64/δ³ )
Algorithm Properties — Resources

Let M_t be the number of mistakes up to round t. Set ε so that
x^+_{t,s} = 2^{−|s|/3} if s = y_{t−i}^{t−1} for some i = 1, . . . , t − 1, and 0 otherwise.

Growth Bound: Our algorithm will not grow a tree deeper than log_2 M_{T−1} + 4.
How to go about proving the properties

Bounding how fast the tree grows is straightforward.
Mistake bound: show that each update contributes towards a goal.
We need a goal and a measure of progress.
Measuring Progress

Goal: a fixed tree with good performance — for example, a vector u such that y_t ⟨u, x_t⟩ ≥ δ.
Our adaptive tree is represented by w_t. Remember Σ_i w_{t,i} = 1.
To be fair, assume u_i ≥ 0 and Σ_i u_i = 1.

Measure of progress: the relative entropy between u and w_t,
D(u || w_t) = Σ_i u_i log( u_i / w_{t,i} )

Potential function: Φ(w_t) = D(u || w_t) ≥ 0
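A quick numeric illustration of the potential (toy values, not from the talk): against the uniform start w_1, the initial potential is at most log d.

import numpy as np

def relative_entropy(u, w):
    u, w = np.asarray(u), np.asarray(w)
    mask = u > 0                      # 0 * log(0/w) = 0 by convention
    return float(np.sum(u[mask] * np.log(u[mask] / w[mask])))

u = [0.7, 0.3, 0.0, 0.0]              # a sparse comparator tree
w1 = [0.25, 0.25, 0.25, 0.25]         # uniform start: theta_1 = 0
print(relative_entropy(u, w1))        # about 0.78, below log 4 ~ 1.39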
Proof Technique

Upper bound the initial potential.
Lower bound the change in the potential with each update.
Keep the total effect of noise bounded. [Dekel et al., 2004]
Pictorially

[Figure: a progress bar of total length Φ(w_1). Examples x_1, x_2, . . . arrive in turn; correctly predicted examples contribute nothing, while each mistake adds progress due to the classic update, minus a small effect of noise (whose running total is P_t), leaving some net progress.]
Proof

(net progress) ≤ Φ(w_1)
(progress due to classic updates) − (effect of noise) ≤ Φ(w_1)

The classic updates contribute at least (minimum progress per mistake) × mistakes, and the noise contributes at most mistakes^{2/3}. Hence

(min progress) · mistakes − mistakes^{2/3} ≤ Φ(w_1)
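To see how an inequality of this shape yields a mistake bound, here is a hedged sketch with a generic per-mistake progress p and mistake count M (shape only; the paper's exact constants in δ and log T are not rederived here):

% Case analysis on  pM - M^{2/3} <= \Phi(w_1):
% if M^{2/3} <= pM/2, then pM/2 <= \Phi(w_1), so M <= 2\Phi(w_1)/p;
% otherwise M^{2/3} > pM/2, i.e. M^{1/3} < 2/p, so M < 8/p^3.
\[
  pM - M^{2/3} \le \Phi(w_1)
  \quad\Longrightarrow\quad
  M \le \max\!\left( \frac{2\,\Phi(w_1)}{p},\ \frac{8}{p^3} \right)
\]

This is where the max(·, ·) shape of the stated mistake bounds comes from.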
Multiclass extension

We maintain weights w^{(1)}, w^{(2)}, . . . , w^{(k)} and predict ŷ_t = argmax_i ⟨w^{(i)}, x_t⟩.
In case of a mistake:
θ^{(ŷ_t)}_{t+1,s} = θ^{(ŷ_t)}_{t,s} − α x_{t,s}
θ^{(y_t)}_{t+1,s} = θ^{(y_t)}_{t,s} + α x_{t,s}
for all s such that |s| ≤ d_t.
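One step of this in Python (a dense-array sketch with invented names; the depth limit |s| ≤ d_t is assumed to be already folded into x): demote the wrongly predicted class and promote the true one.

import numpy as np

def multiclass_step(theta, x, y_true, alpha=0.5):
    """theta: (k, d) array of per-class parameters; x: (d,) feature vector."""
    w = np.exp(theta)
    w /= w.sum(axis=1, keepdims=True)     # per-class normalized weights
    y_hat = int(np.argmax(w @ x))
    if y_hat != y_true:                   # mistake: move the two classes apart
        theta[y_hat] -= alpha * x
        theta[y_true] += alpha * x
    return y_hat

theta = np.zeros((3, 4))
print(multiclass_step(theta, np.array([1.0, 0.5, 0.25, 0.0]), y_true=2))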
Outline

Prediction Suffix Trees
Algorithm and Properties
Results
Data

120 sequences of system calls from Outlook, Excel and Firefox (40 each). The monitoring program records 23 different system calls. A typical sequence contains hundreds of thousands of system calls.
Results

Averages over the 40 sequences of each application (“Perceptron” is the PST learning algorithm of [Dekel et al., 2004]):

            Perceptron    Winnow
Outlook
  % Error        5.1        4.43
  PST Size     41239       25679
Firefox
  % Error      14.86       13.88
  PST Size     21081       12662
Excel
  % Error      22.68       20.59
  PST Size     24402       15338

Moreover, Winnow made fewer mistakes and grew smaller trees for all 120 sequences.
Results

[Three figures, one per application: total number of mistakes (y-axis) vs. round t (x-axis) on a single sequence, comparing perceptron and winnow; the winnow curve accumulates fewer mistakes.]
Related Work

Many learning algorithms assume an a priori bound on the tree’s depth: [Willems et al., 1995], [Ron et al., 1996], [Pereira & Singer, 1999], . . .
[Dekel et al., 2004] present a perceptron algorithm similar to ours.
[Kivinen & Warmuth, 1997] show how to compete against vectors u with ||u||_1 ≤ U.
Sparsifying Winnow is popular (e.g. [Blum, 1997]) but comes with no guarantees.
Summary

Outlined the benefits of prediction suffix trees.
Introduced an online learning algorithm to learn PSTs.
It is competitive with the best fixed PST in hindsight.
The resulting trees grow slowly if necessary.
On our task, it made fewer mistakes and grew smaller trees than other state-of-the-art algorithms.
References

Blum, A. (1997). Empirical support for Winnow and weighted-majority algorithms: Results on a calendar scheduling domain. Machine Learning, 26, 5–23.

Dekel, O., Shalev-Shwartz, S., & Singer, Y. (2004). The power of selective memory: Self-bounded learning of prediction suffix trees. Advances in Neural Information Processing Systems, 17.

Helmbold, D., & Schapire, R. (1997). Predicting nearly as well as the best pruning of a decision tree. Machine Learning, 27, 51–68.

Kivinen, J., & Warmuth, M. (1997). Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132, 1–63.

Littlestone, N. (1989). Mistake bounds and logarithmic linear-threshold learning algorithms. Doctoral dissertation, Santa Cruz, CA, USA.

Pereira, F., & Singer, Y. (1999). An efficient extension to mixture techniques for prediction and decision trees. Machine Learning, 36, 183–199.

Ron, D., Singer, Y., & Tishby, N. (1996). The power of amnesia: Learning probabilistic automata with variable memory length. Machine Learning, 25, 117–149.

Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65, 386–408.

Willems, F., Shtarkov, Y., & Tjalkens, T. (1995). The context-tree weighting method: Basic properties. IEEE Transactions on Information Theory, 41, 653–664.