Online Learning of Prediction Suffix Trees

Nikos Karampatziakis and Dexter Kozen

Cornell University

April 24, 2009

To appear in ICML 2009


Introduction

- Sequential prediction
- Prediction Suffix Trees
- Learning algorithm


Motivation

- Monitoring of applications in a computer system
- System calls
- Prediction model
- Some assumptions


Outline

- Prediction Suffix Trees
- Algorithm and Properties
- Results


What is the next item in the sequence?

- open(), read(), read(), ...
- G, A, T, T, A, C, ...
- +1, −1, +1, −1, +1, ...
- all, your, base, ...

Formally, predict y_t from y_1, y_2, ..., y_{t−2}, y_{t−1}:

  y_t = h(y_1^{t−1}),  where y_i^j := y_i, ..., y_j

More general problem: p(y_t | y_1^{t−1}).

Markovian assumption: y_t depends only on the recent past.


Markov Models

Zero order: p(y_t | y_1^{t−1}) = p(y_t)

First order: a tree branches on y_{t−1}, with one child for each symbol A, C, G, T; each child stores a distribution such as

  p(y_t | y_{t−1} = G)

Markov Models

Second order: a depth-two tree branches on y_{t−1} at the root and on y_{t−2} at the next level, so each of the 16 leaves stores a distribution such as

  p(y_t | y_{t−1} = C, y_{t−2} = G)

[Figure: the depth-two tree over y_{t−1} and y_{t−2} with branches A, C, G, T at every node.]

Third order, k-th order, etc.

Markov Models — Problems

- How to select the best k?
- The number of model parameters is exponential in k.
- Poor family of models: what if I want a model with 100 parameters?
- What if y_t does not depend on y_{t−2} when y_{t−1} = A, but does depend on it when y_{t−1} = T?

Prediction Suffix Trees address these problems
[Willems et al., 1995; Ron et al., 1996; Helmbold & Schapire, 1997; Pereira & Singer, 1999].

Prediction Suffix Trees

[Figure: the full depth-two tree over y_{t−1} and y_{t−2} from the previous slide, the starting point for a prediction suffix tree.]

Prediction Suffix Trees

Keep a model for each k. Use a weighted sum.

[Figure: the depth-two tree where every node stores a prediction: the root stores p(y_t), the depth-one nodes store p(y_t|A), p(y_t|C), p(y_t|G), p(y_t|T), and the leaves store p(y_t | y_{t−1}, y_{t−2}).]

Prediction Suffix Trees

Keep a model for each k. Use a weighted sum. Prune useless branches and subtrees.

[Figure: the same tree with several subtrees pruned; e.g. the T branch keeps no children and only the A child of the A branch survives.]

Assumptions implied by the pruning:

  p(y_t | y_{t−1} = T, y_{t−2}) = p(y_t | y_{t−1} = T),
  p(y_t | y_{t−1} = A, y_{t−2} ≠ A) = p(y_t | y_{t−1} = A).

Moving Away from Probabilistic Modeling

- We just want to predict y_t from y_1^{t−1}.
- Assume y_t ∈ {−1, +1}.
- Store a weight instead of a probability.
- Use the sign of a weighted sum of the values along the appropriate path.

Example

True sequence: −1, −1, +1, −1, −1, +1, ...
Discounting: a node at depth d is discounted by 2^{−d}.
Tree: the root stores 0; its −1 child stores −1, its +1 child stores −2, and the −1 child has a −1 child storing 5.

To decide, walk the path matching the most recent symbols and take the sign of the discounted sum of node values:

  Input ..., −1, −1:  0 − (1/2)·1 + (1/4)·5 = 3/4,  sign → +1
  Input ..., +1, −1:  0 − (1/2)·1 = −1/2,           sign → −1
  Input ..., +1:      0 − (1/2)·2 = −1,             sign → −1

Remarks

- Notation for the node values: g_{t,s}.
- Discounting scheme in this work:

    x^+_{t,s} = (1 − ε)^{|s|}  if s = y_{t−i}^{t−1} for some i = 1, ..., t−1
    x^+_{t,s} = 0              otherwise

- The best value of ε will be determined (much) later.
- Decision at time t: sign( Σ_s g_{t,s} x^+_{t,s} ) = sign(⟨g_t, x^+_t⟩).
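A minimal sketch of this decision rule (my own illustration, not the authors' code), assuming the tree is stored as nested dicts; with ε = 1/2 the discount equals the 2^{−d} of the worked example above:

```python
def pst_decide(tree, history, eps=0.5):
    """tree: nested dicts {'g': node value, 'children': {symbol: subtree}};
    history: y_1, ..., y_{t-1}. Returns the predicted label in {-1, +1}."""
    node, score, depth = tree, tree['g'], 0   # root: discount (1 - eps)^0 = 1
    for sym in reversed(history):             # walk y_{t-1}, y_{t-2}, ...
        node = node['children'].get(sym)
        if node is None:                      # the path stops where the tree stops
            break
        depth += 1
        score += (1 - eps) ** depth * node['g']
    return 1 if score > 0 else -1

# The worked example above (eps = 1/2, so the discount is 2^{-depth}):
tree = {'g': 0, 'children': {
    -1: {'g': -1, 'children': {-1: {'g': 5, 'children': {}}}},
    +1: {'g': -2, 'children': {}},
}}
assert pst_decide(tree, [-1, -1]) == +1   # 0 - 1/2 + 5/4 = 3/4
assert pst_decide(tree, [+1, -1]) == -1   # 0 - 1/2
assert pst_decide(tree, [+1]) == -1       # 0 - 1
```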

Outline

- Prediction Suffix Trees
- Algorithm and Properties
- Results


Algorithms for Linear Prediction

- SVMs, logistic regression, ...
- Focus on the online setting.
- The algorithm maintains a hypothesis w_t. In each round:
  - it receives information x_t;
  - it outputs ŷ_t = sign(⟨w_t, x_t⟩);
  - it receives the correct output y_t;
  - it updates: w_{t+1} ← f(w_t).
- A mistake is made when y_t ⟨w_t, x_t⟩ ≤ 0.

Perceptron

- If a mistake is made at round t: w_{t+1} = w_t + α y_t x_t.
- α is a learning rate, selected to optimize various tradeoffs.

Balanced Winnow

Balanced Winnow [Littlestone, 1989]:
  θ_1 ← 0
  for t = 1, 2, ..., T do
    w_{t,i} ← e^{θ_{t,i}} / Σ_j e^{θ_{t,j}}
    ŷ_t ← ⟨w_t, x_t⟩
    if y_t ŷ_t ≤ 0 then θ_{t+1} ← θ_t + α y_t x_t
    else θ_{t+1} ← θ_t

Perceptron [Rosenblatt, 1958]:
  θ_1 ← 0
  for t = 1, 2, ..., T do
    w_t ← θ_t
    ŷ_t ← ⟨w_t, x_t⟩
    if y_t ŷ_t ≤ 0 then θ_{t+1} ← θ_t + α y_t x_t
    else θ_{t+1} ← θ_t

Important for Balanced Winnow: the features come in positive/negative pairs,

  x_t = [x^+_t, −x^+_t] = [x^+_{t,1}, ..., x^+_{t,d}, −x^+_{t,1}, ..., −x^+_{t,d}]

with x^+_{t,s} = (1 − ε)^{|s|} if s = y_{t−i}^{t−1} for some i = 1, ..., t−1, and 0 otherwise.
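The two loops above differ only in how w_t is derived from θ_t: a softmax for Balanced Winnow, the identity for Perceptron. A minimal runnable sketch (my own, assuming X already holds the [x^+, −x^+] feature pairs):

```python
import numpy as np

def run_online(X, y, alpha=0.5, winnow=True):
    """X: T x 2d array of [x+, -x+] rows; y: length-T labels in {-1, +1}."""
    theta = np.zeros(X.shape[1])
    mistakes = 0
    for x_t, y_t in zip(X, y):
        # Balanced Winnow: softmax weights; Perceptron: w_t = theta_t.
        w = np.exp(theta) / np.exp(theta).sum() if winnow else theta
        if y_t * np.dot(w, x_t) <= 0:          # mistake (ties count as mistakes)
            theta = theta + alpha * y_t * x_t  # same additive update in theta
            mistakes += 1
    return theta, mistakes
```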

Advantages

- Multiplicative updates — fast convergence.
- Can cope with many features when few of them are relevant.
- For our application it can track changes better.

Where's the catch?

- Perceptron and Winnow don't care about the sparsity of their hypothesis.
- Eventually, we have a feature for every substring of y_1^T.
- Implications:
  - we need to learn O(T^2) weights in T rounds;
  - we need to store O(T^2) numbers.
- Naively, all weights affect the decision and must be stored.

Does the decision depend on every feature?

- Recall the specific form of the features: x_t = [x^+_t, x^−_t] = [x^+_t, −x^+_t].
- Notation: w_t = [w^+_t, w^−_t], θ_t = [θ^+_t, θ^−_t]. Recall w_{t,i} = e^{θ_{t,i}} / Σ_j e^{θ_{t,j}}.
- An easy inductive argument shows that θ^−_t = −θ^+_t.
- Decision:

    ŷ_t = ⟨w_t, x_t⟩ = ⟨w^+_t − w^−_t, x^+_t⟩
        = Σ_{i=1}^d [ sinh(θ^+_{t,i}) / Σ_{j=1}^d cosh(θ^+_{t,j}) ] x^+_{t,i}
        ∝ Σ_{i=1}^d sinh(θ^+_{t,i}) x^+_{t,i}

  where sinh(x) = (e^x − e^{−x})/2 and cosh(x) = (e^x + e^{−x})/2.

- Iff θ^+_{t,i} = 0, the decision does not depend on feature i.
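A quick numeric check of this identity (my own sketch, with made-up θ and x values): the softmax weights over [θ^+, −θ^+] give exactly the sinh/cosh form, so coordinates with θ^+_{t,i} = 0 drop out of the decision.

```python
import numpy as np

theta = np.array([0.7, 0.0, -1.2])     # hypothetical theta^+ ; theta^- = -theta^+
x_plus = np.array([1.0, 0.5, 0.25])    # hypothetical discounted suffix features

w = np.exp(np.r_[theta, -theta])
w /= w.sum()                            # softmax over [theta^+, -theta^+]
x = np.r_[x_plus, -x_plus]              # paired features [x^+, -x^+]

lhs = w @ x
rhs = np.sinh(theta) @ x_plus / np.cosh(theta).sum()
assert np.isclose(lhs, rhs)
# sinh(0) = 0: coordinates with theta_i = 0 never affect the sign.
```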

Observations

- If θ_{t,s} = 0 we don't have to store it.
- Initially θ_1 = 0: the tree has only one node.
- As mistakes are made, the tree grows.
- The classic Winnow/Perceptron update quickly destroys sparsity.

Illustration

Input sequence: +1, −1, +1, −1, and the next symbol is to be predicted.
Features: x^+_{t,s} = 2^{−|s|} if s = y_{t−i}^{t−1} for some i = 1, ..., t−1, and 0 otherwise.
Tree: the root has θ = 0, its −1 child has θ = 1, its +1 child has θ = −2, and the −1 child has a −1 child with θ = −5 (off the current path).

Decision: (1/2)·sinh(1) > 0, sign → +1. The true next symbol is −1, so a mistake is made.

The classic update then adds α y_t x_{t,s} to θ_{t,s} for every suffix s of the history, inserting fresh nodes (θ = 0) at every depth and updating them by −1/4, −1/8, −1/16, ... down the path.

Summary of the Problem

- A mistake at time t inserts O(t) nodes in the tree.
- x^+_{t,s} is non-zero even when s is very long.
- Bad idea: change the definition of x^+_{t,s} to avoid this.
- We need to learn a good θ while keeping it sparse.
- Not all sparse vectors are equal.

Better Update Rule

- Adaptive bound d_t on the length of the suffixes: the depth up to which the tree will grow on round t.
- d_t grows slowly, if necessary.
- New update:

    θ_{t+1,s} = θ_{t,s} + α y_t x_{t,s}  if |s| ≤ d_t
    θ_{t+1,s} = θ_{t,s}                  otherwise

- Equivalently, θ_{t+1} = θ_t + α y_t x_t + α n_t, where n_t is a noise vector that cancels part of the update:

    n_{t,s} = −y_t x_{t,s}  if |s| > d_t
    n_{t,s} = 0             otherwise
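A sketch of the capped update (my own illustration, assuming the tree is a dict keyed by suffix tuples): only suffixes with |s| ≤ d_t are touched, so a mistake creates at most d_t + 1 nodes instead of O(t).

```python
def capped_update(theta, history, y_t, d_t, alpha=0.5, eps=0.5):
    """theta: dict mapping suffix tuples s to theta_s (missing keys mean 0)."""
    for length in range(min(d_t, len(history)) + 1):      # |s| = 0, ..., d_t
        s = tuple(history[len(history) - length:])        # s = y_{t-length}^{t-1}
        x_s = (1 - eps) ** length                         # x^+_{t,s}
        theta[s] = theta.get(s, 0.0) + alpha * y_t * x_s  # longer suffixes untouched
    return theta
```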

Setting d_t

- h_t = the length of the path from the root that matches y_{t−1}, y_{t−2}, ....
- J_t = the subset of rounds 1, ..., t in which a mistake was made.
- P_t = Σ_{i∈J_t} ||n_i||_∞ = Σ_{i∈J_t} (1 − ε)^{d_i + 1}.
- d_t will be the smallest integer such that P_t ≤ |J_t|^{2/3}.
- To guarantee that, we can set

    d_t = max( h_t, ⌈ log_{1−ε}( (P_{t−1}^3 + 2 P_{t−1}^{3/2} + 1)^{1/3} − P_{t−1} ) ⌉ − 1 )
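In code, under my reading of the formula above: the cube-root term is (P_{t−1}^{3/2} + 1)^{2/3} − P_{t−1}, the worst-case budget the new noise term (1 − ε)^{d_t + 1} must fit under.

```python
import math

def depth_bound(h_t, P_prev, eps=0.5):
    # Budget for the new noise term: (P^{3/2} + 1)^{2/3} - P, always positive.
    budget = (P_prev ** 3 + 2 * P_prev ** 1.5 + 1) ** (1 / 3) - P_prev
    # Smallest integer d with (1 - eps)^(d + 1) <= budget.
    d = math.ceil(math.log(budget) / math.log(1 - eps)) - 1
    return max(h_t, d)
```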

Algorithm Properties — Learning

Mistake Bound: if there exists a tree which, over the input sequence y_1, y_2, ..., y_T, correctly predicts all items with confidence ≥ δ, our algorithm makes at most

  max( 8 log T / δ^2, 64 / δ^3 )

mistakes.

Instead of assuming a perfect tree, assume a tree u that attains a cumulative δ-hinge loss L > 0. The mistake bound becomes

  max( 2L/δ + 8 log T / δ^2, 64 / δ^3 )

where L = Σ_{t=1}^T ℓ_t(u) and ℓ_t(u) = max(0, δ − y_t ⟨u, x_t⟩).

Algorithm Properties — Resources

- Let M_t be the number of mistakes up to round t.
- Set ε so that 1 − ε = 2^{−1/3}, i.e.

    x^+_{t,s} = 2^{−|s|/3}  if s = y_{t−i}^{t−1} for some i = 1, ..., t−1
    x^+_{t,s} = 0           otherwise

- Growth Bound: our algorithm will not grow a tree deeper than log_2 M_{T−1} + 4.

How to go about proving the properties

- Bounding how fast the tree grows is straightforward.
- Mistake bound: show that each update contributes towards a goal.
- We need a goal and a measure of progress.

Measuring Progress

- Goal: a fixed tree with good performance; for example, a vector u such that y_t ⟨u, x_t⟩ ≥ δ.
- Our adaptive tree is represented by w_t. Remember Σ_i w_{t,i} = 1.
- To be fair, assume u_i ≥ 0 and Σ_i u_i = 1.
- Measure of progress: the relative entropy between u and w_t,

    D(u || w_t) = Σ_i u_i log( u_i / w_{t,i} )

- Potential function: Φ(w_t) = D(u || w_t) ≥ 0.
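Computed numerically (a small sketch of mine; u and w_t can be any vectors on the simplex):

```python
import numpy as np

def relative_entropy(u, w):
    u, w = np.asarray(u, float), np.asarray(w, float)
    mask = u > 0                      # terms with u_i = 0 contribute 0
    return float(np.sum(u[mask] * np.log(u[mask] / w[mask])))

print(relative_entropy([0.5, 0.5, 0.0], [0.25, 0.25, 0.5]))  # log 2 ~ 0.693
```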

Proof Technique

- Upper bound the initial potential.
- Lower bound the change in the potential with each update.
- Keep the total effect of noise bounded. [Dekel et al., 2004]

Pictorially

[Figure: an animation of the potential, which starts at Φ(w_1) and can never drop below 0. For each example x_1, x_2, ... on which a mistake is made, the classic update contributes progress (the potential shrinks), the noise undoes part of it, and the rest is net progress; correctly predicted examples leave the potential unchanged. The noise accumulated so far is P_t.]

Proof

The total net progress is at most the initial potential Φ(w_1). Each mistake contributes at least a minimum amount of progress, and the total effect of the noise is at most mistakes^{2/3}, so

  (minimum progress per mistake) · mistakes − mistakes^{2/3} ≤ Φ(w_1)

Solving for the number of mistakes yields the mistake bound.

Multiclass Extension

- We maintain weights w^{(1)}, w^{(2)}, ..., w^{(k)}.
- Predict ŷ_t = argmax_i ⟨w^{(i)}, x_t⟩.
- In case of a mistake,

    θ^{(ŷ_t)}_{t+1,s} = θ^{(ŷ_t)}_{t,s} − α x_{t,s}
    θ^{(y_t)}_{t+1,s} = θ^{(y_t)}_{t,s} + α x_{t,s}

  for all s such that |s| ≤ d_t.
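A sketch of this update (my own illustration, assuming θ is a k × d array and a parallel array of suffix lengths):

```python
import numpy as np

def multiclass_update(theta, x_t, y_true, y_pred, suffix_len, d_t, alpha=0.5):
    """theta: k x d array; suffix_len: |s| for each of the d features."""
    if y_pred != y_true:
        shallow = suffix_len <= d_t                     # only suffixes |s| <= d_t
        theta[y_pred, shallow] -= alpha * x_t[shallow]  # demote the prediction
        theta[y_true, shallow] += alpha * x_t[shallow]  # promote the true class
    return theta
```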

Outline

- Prediction Suffix Trees
- Algorithm and Properties
- Results

Data

- 120 sequences of system calls from Outlook, Excel, and Firefox (40 each).
- The monitoring program records 23 different system calls.
- A typical sequence contains hundreds of thousands of system calls.

Results

Averages over the 40 sequences per application ("Perceptron" is the PST learning algorithm of [Dekel et al., 2004]):

  Outlook    Perceptron  Winnow
  % Error    5.1         4.43
  PST Size   41239       25679

  Firefox    Perceptron  Winnow
  % Error    14.86       13.88
  PST Size   21081       12662

  Excel      Perceptron  Winnow
  % Error    22.68       20.59
  PST Size   24402       15338

Moreover, Winnow made fewer mistakes and grew smaller trees on all 120 sequences.

Results

[Figure: three plots, one per application, of the total number of mistakes against the round t, comparing perceptron and winnow over sequences of roughly 800,000, 1,000,000, and 250,000 rounds.]

Related Work

- Many learning algorithms assume an a priori bound on the tree's depth [Willems et al., 1995; Ron et al., 1996; Pereira & Singer, 1999], ...
- [Dekel et al., 2004] present a perceptron algorithm similar to ours.
- [Kivinen & Warmuth, 1997] show how to compete against vectors u with ||u||_1 ≤ U.
- Sparsifying Winnow is popular (e.g. [Blum, 1997]), but without guarantees.

Summary

- Outlined the benefits of prediction suffix trees.
- Introduced an online algorithm for learning PSTs.
- It is competitive with the best fixed PST in hindsight.
- The resulting trees grow slowly, if necessary.
- On our task, it made fewer mistakes and grew smaller trees than other state-of-the-art algorithms.

References

Blum, A. (1997). Empirical support for Winnow and weighted-majority algorithms: Results on a calendar scheduling domain. Machine Learning, 26, 5–23.

Dekel, O., Shalev-Shwartz, S., & Singer, Y. (2004). The power of selective memory: Self-bounded learning of prediction suffix trees. Advances in Neural Information Processing Systems, 17.

Helmbold, D., & Schapire, R. (1997). Predicting nearly as well as the best pruning of a decision tree. Machine Learning, 27, 51–68.

Kivinen, J., & Warmuth, M. (1997). Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132, 1–63.

Littlestone, N. (1989). Mistake bounds and logarithmic linear-threshold learning algorithms. Doctoral dissertation, University of California, Santa Cruz, CA, USA.

Pereira, F., & Singer, Y. (1999). An efficient extension to mixture techniques for prediction and decision trees. Machine Learning, 36, 183–199.

Ron, D., Singer, Y., & Tishby, N. (1996). The power of amnesia: Learning probabilistic automata with variable memory length. Machine Learning, 25, 117–149.

Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65, 386–408.

Willems, F., Shtarkov, Y., & Tjalkens, T. (1995). The context-tree weighting method: Basic properties. IEEE Transactions on Information Theory, 41, 653–664.