Online Learning of Prediction Suffix Trees

Nikos Karampatziakis and Dexter Kozen

Cornell University

April 24, 2009

To appear in ICML 2009


Introduction

- Sequential prediction
- Prediction Suffix Trees
- Learning algorithm


Motivation

- Monitoring of applications in a computer system
- System calls
- Prediction model
- Some assumptions


Outline

- Prediction Suffix Trees
- Algorithm and Properties
- Results


What is the next item in the sequence?

- open(), read(), read(), ...
- G, A, T, T, A, C, ...
- +1, −1, +1, −1, +1, ...
- all, your, base, ...

Formally, predict y_t from y_1, y_2, ..., y_{t−2}, y_{t−1}:

  y_t = h(y_1^{t−1}),  where y_i^j := y_i, ..., y_j

More general problem: p(y_t | y_1^{t−1}).

Markovian assumption: y_t depends only on the recent past.


Markov Models

Zero order: p(y_t | y_1^{t−1}) = p(y_t)

First order: a tree branches on y_{t−1}, with one child for each symbol A, C, G, T; each child stores a distribution such as

  p(y_t | y_{t−1} = G)

Markov Models

Second order: a depth-two tree branches on y_{t−1} at the root and on y_{t−2} at the next level, so each of the 16 leaves stores a distribution such as

  p(y_t | y_{t−1} = C, y_{t−2} = G)

[Figure: the depth-two tree over y_{t−1} and y_{t−2} with branches A, C, G, T at every node.]

Third order, k-th order, etc.

Markov Models — Problems

- How to select the best k?
- The number of model parameters is exponential in k.
- Poor family of models: what if I want a model with 100 parameters?
- What if y_t does not depend on y_{t−2} when y_{t−1} = A, but does depend on it when y_{t−1} = T?

Prediction Suffix Trees address these problems
[Willems et al., 1995; Ron et al., 1996; Helmbold & Schapire, 1997; Pereira & Singer, 1999].

Prediction Suffix Trees

[Figure: the full depth-two tree over y_{t−1} and y_{t−2} from the previous slide, the starting point for a prediction suffix tree.]

Prediction Suffix Trees

Keep a model for each k. Use a weighted sum.

[Figure: the depth-two tree where every node stores a prediction: the root stores p(y_t), the depth-one nodes store p(y_t|A), p(y_t|C), p(y_t|G), p(y_t|T), and the leaves store p(y_t | y_{t−1}, y_{t−2}).]

Prediction Suffix Trees

Keep a model for each k. Use a weighted sum. Prune useless branches and subtrees.

[Figure: the same tree with several subtrees pruned; e.g. the T branch keeps no children and only the A child of the A branch survives.]

Assumptions implied by the pruning:

  p(y_t | y_{t−1} = T, y_{t−2}) = p(y_t | y_{t−1} = T),
  p(y_t | y_{t−1} = A, y_{t−2} ≠ A) = p(y_t | y_{t−1} = A).

Moving Away from Probabilistic Modeling

- We just want to predict y_t from y_1^{t−1}.
- Assume y_t ∈ {−1, +1}.
- Store a weight instead of a probability.
- Use the sign of a weighted sum of the values along the appropriate path.

Example

True sequence: −1, −1, +1, −1, −1, +1, ...
Discounting: a node at depth d is discounted by 2^{−d}.
Tree: the root stores 0; its −1 child stores −1, its +1 child stores −2, and the −1 child has a −1 child storing 5.

To decide, walk the path matching the most recent symbols and take the sign of the discounted sum of node values:

  Input ..., −1, −1:  0 − (1/2)·1 + (1/4)·5 = 3/4,  sign → +1
  Input ..., +1, −1:  0 − (1/2)·1 = −1/2,           sign → −1
  Input ..., +1:      0 − (1/2)·2 = −1,             sign → −1

Remarks

- Notation for the node values: g_{t,s}.
- Discounting scheme in this work:

    x^+_{t,s} = (1 − ε)^{|s|}  if s = y_{t−i}^{t−1} for some i = 1, ..., t−1
    x^+_{t,s} = 0              otherwise

- The best value of ε will be determined (much) later.
- Decision at time t: sign( Σ_s g_{t,s} x^+_{t,s} ) = sign(⟨g_t, x^+_t⟩).
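A minimal sketch of this decision rule (my own illustration, not the authors' code), assuming the tree is stored as nested dicts; with ε = 1/2 the discount equals the 2^{−d} of the worked example above:

```python
def pst_decide(tree, history, eps=0.5):
    """tree: nested dicts {'g': node value, 'children': {symbol: subtree}};
    history: y_1, ..., y_{t-1}. Returns the predicted label in {-1, +1}."""
    node, score, depth = tree, tree['g'], 0   # root: discount (1 - eps)^0 = 1
    for sym in reversed(history):             # walk y_{t-1}, y_{t-2}, ...
        node = node['children'].get(sym)
        if node is None:                      # the path stops where the tree stops
            break
        depth += 1
        score += (1 - eps) ** depth * node['g']
    return 1 if score > 0 else -1

# The worked example above (eps = 1/2, so the discount is 2^{-depth}):
tree = {'g': 0, 'children': {
    -1: {'g': -1, 'children': {-1: {'g': 5, 'children': {}}}},
    +1: {'g': -2, 'children': {}},
}}
assert pst_decide(tree, [-1, -1]) == +1   # 0 - 1/2 + 5/4 = 3/4
assert pst_decide(tree, [+1, -1]) == -1   # 0 - 1/2
assert pst_decide(tree, [+1]) == -1       # 0 - 1
```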

Outline

- Prediction Suffix Trees
- Algorithm and Properties
- Results


Algorithms for Linear Prediction

- SVMs, logistic regression, ...
- Focus on the online setting.
- The algorithm maintains a hypothesis w_t. In each round:
  - it receives information x_t;
  - it outputs ŷ_t = sign(⟨w_t, x_t⟩);
  - it receives the correct output y_t;
  - it updates: w_{t+1} ← f(w_t).
- A mistake is made when y_t ⟨w_t, x_t⟩ ≤ 0.

Perceptron

- If a mistake is made at round t: w_{t+1} = w_t + α y_t x_t.
- α is a learning rate, selected to optimize various tradeoffs.

Balanced Winnow

Balanced Winnow [Littlestone, 1989]:
  θ_1 ← 0
  for t = 1, 2, ..., T do
    w_{t,i} ← e^{θ_{t,i}} / Σ_j e^{θ_{t,j}}
    ŷ_t ← ⟨w_t, x_t⟩
    if y_t ŷ_t ≤ 0 then θ_{t+1} ← θ_t + α y_t x_t
    else θ_{t+1} ← θ_t

Perceptron [Rosenblatt, 1958]:
  θ_1 ← 0
  for t = 1, 2, ..., T do
    w_t ← θ_t
    ŷ_t ← ⟨w_t, x_t⟩
    if y_t ŷ_t ≤ 0 then θ_{t+1} ← θ_t + α y_t x_t
    else θ_{t+1} ← θ_t

Important for Balanced Winnow: the features come in positive/negative pairs,

  x_t = [x^+_t, −x^+_t] = [x^+_{t,1}, ..., x^+_{t,d}, −x^+_{t,1}, ..., −x^+_{t,d}]

with x^+_{t,s} = (1 − ε)^{|s|} if s = y_{t−i}^{t−1} for some i = 1, ..., t−1, and 0 otherwise.
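The two loops above differ only in how w_t is derived from θ_t: a softmax for Balanced Winnow, the identity for Perceptron. A minimal runnable sketch (my own, assuming X already holds the [x^+, −x^+] feature pairs):

```python
import numpy as np

def run_online(X, y, alpha=0.5, winnow=True):
    """X: T x 2d array of [x+, -x+] rows; y: length-T labels in {-1, +1}."""
    theta = np.zeros(X.shape[1])
    mistakes = 0
    for x_t, y_t in zip(X, y):
        # Balanced Winnow: softmax weights; Perceptron: w_t = theta_t.
        w = np.exp(theta) / np.exp(theta).sum() if winnow else theta
        if y_t * np.dot(w, x_t) <= 0:          # mistake (ties count as mistakes)
            theta = theta + alpha * y_t * x_t  # same additive update in theta
            mistakes += 1
    return theta, mistakes
```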

Advantages

- Multiplicative updates — fast convergence.
- Can cope with many features when few of them are relevant.
- For our application it can track changes better.

Where's the catch?

- Perceptron and Winnow don't care about the sparsity of their hypothesis.
- Eventually, we have a feature for every substring of y_1^T.
- Implications:
  - we need to learn O(T^2) weights in T rounds;
  - we need to store O(T^2) numbers.
- Naively, all weights affect the decision and must be stored.

Does the decision depend on every feature?

- Recall the specific form of the features: x_t = [x^+_t, x^−_t] = [x^+_t, −x^+_t].
- Notation: w_t = [w^+_t, w^−_t], θ_t = [θ^+_t, θ^−_t]. Recall w_{t,i} = e^{θ_{t,i}} / Σ_j e^{θ_{t,j}}.
- An easy inductive argument shows that θ^−_t = −θ^+_t.
- Decision:

    ŷ_t = ⟨w_t, x_t⟩ = ⟨w^+_t − w^−_t, x^+_t⟩
        = Σ_{i=1}^d [ sinh(θ^+_{t,i}) / Σ_{j=1}^d cosh(θ^+_{t,j}) ] x^+_{t,i}
        ∝ Σ_{i=1}^d sinh(θ^+_{t,i}) x^+_{t,i}

  where sinh(x) = (e^x − e^{−x})/2 and cosh(x) = (e^x + e^{−x})/2.

- Iff θ^+_{t,i} = 0, the decision does not depend on feature i.
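A quick numeric check of this identity (my own sketch, with made-up θ and x values): the softmax weights over [θ^+, −θ^+] give exactly the sinh/cosh form, so coordinates with θ^+_{t,i} = 0 drop out of the decision.

```python
import numpy as np

theta = np.array([0.7, 0.0, -1.2])     # hypothetical theta^+ ; theta^- = -theta^+
x_plus = np.array([1.0, 0.5, 0.25])    # hypothetical discounted suffix features

w = np.exp(np.r_[theta, -theta])
w /= w.sum()                            # softmax over [theta^+, -theta^+]
x = np.r_[x_plus, -x_plus]              # paired features [x^+, -x^+]

lhs = w @ x
rhs = np.sinh(theta) @ x_plus / np.cosh(theta).sum()
assert np.isclose(lhs, rhs)
# sinh(0) = 0: coordinates with theta_i = 0 never affect the sign.
```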

Observations

- If θ_{t,s} = 0 we don't have to store it.
- Initially θ_1 = 0: the tree has only one node.
- As mistakes are made, the tree grows.
- The classic Winnow/Perceptron update quickly destroys sparsity.

Illustration

Input sequence: +1, −1, +1, −1, and the next symbol is to be predicted.
Features: x^+_{t,s} = 2^{−|s|} if s = y_{t−i}^{t−1} for some i = 1, ..., t−1, and 0 otherwise.
Tree: the root has θ = 0, its −1 child has θ = 1, its +1 child has θ = −2, and the −1 child has a −1 child with θ = −5 (off the current path).

Decision: (1/2)·sinh(1) > 0, sign → +1. The true next symbol is −1, so a mistake is made.

The classic update then adds α y_t x_{t,s} to θ_{t,s} for every suffix s of the history, inserting fresh nodes (θ = 0) at every depth and updating them by −1/4, −1/8, −1/16, ... down the path.

Summary of the Problem

- A mistake at time t inserts O(t) nodes in the tree.
- x^+_{t,s} is non-zero even when s is very long.
- Bad idea: change the definition of x^+_{t,s} to avoid this.
- We need to learn a good θ while keeping it sparse.
- Not all sparse vectors are equal.

Better Update Rule

- Adaptive bound d_t on the length of the suffixes: the depth up to which the tree will grow on round t.
- d_t grows slowly, if necessary.
- New update:

    θ_{t+1,s} = θ_{t,s} + α y_t x_{t,s}  if |s| ≤ d_t
    θ_{t+1,s} = θ_{t,s}                  otherwise

- Equivalently, θ_{t+1} = θ_t + α y_t x_t + α n_t, where n_t is a noise vector that cancels part of the update:

    n_{t,s} = −y_t x_{t,s}  if |s| > d_t
    n_{t,s} = 0             otherwise
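A sketch of the capped update (my own illustration, assuming the tree is a dict keyed by suffix tuples): only suffixes with |s| ≤ d_t are touched, so a mistake creates at most d_t + 1 nodes instead of O(t).

```python
def capped_update(theta, history, y_t, d_t, alpha=0.5, eps=0.5):
    """theta: dict mapping suffix tuples s to theta_s (missing keys mean 0)."""
    for length in range(min(d_t, len(history)) + 1):      # |s| = 0, ..., d_t
        s = tuple(history[len(history) - length:])        # s = y_{t-length}^{t-1}
        x_s = (1 - eps) ** length                         # x^+_{t,s}
        theta[s] = theta.get(s, 0.0) + alpha * y_t * x_s  # longer suffixes untouched
    return theta
```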

Setting d_t

- h_t = the length of the path from the root that matches y_{t−1}, y_{t−2}, ....
- J_t = the subset of rounds 1, ..., t in which a mistake was made.
- P_t = Σ_{i∈J_t} ||n_i||_∞ = Σ_{i∈J_t} (1 − ε)^{d_i + 1}.
- d_t will be the smallest integer such that P_t ≤ |J_t|^{2/3}.
- To guarantee that, we can set

    d_t = max( h_t, ⌈ log_{1−ε}( (P_{t−1}^3 + 2 P_{t−1}^{3/2} + 1)^{1/3} − P_{t−1} ) ⌉ − 1 )
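In code, under my reading of the formula above: the cube-root term is (P_{t−1}^{3/2} + 1)^{2/3} − P_{t−1}, the worst-case budget the new noise term (1 − ε)^{d_t + 1} must fit under.

```python
import math

def depth_bound(h_t, P_prev, eps=0.5):
    # Budget for the new noise term: (P^{3/2} + 1)^{2/3} - P, always positive.
    budget = (P_prev ** 3 + 2 * P_prev ** 1.5 + 1) ** (1 / 3) - P_prev
    # Smallest integer d with (1 - eps)^(d + 1) <= budget.
    d = math.ceil(math.log(budget) / math.log(1 - eps)) - 1
    return max(h_t, d)
```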

Algorithm Properties — Learning

Mistake Bound: if there exists a tree which, over the input sequence y_1, y_2, ..., y_T, correctly predicts all items with confidence ≥ δ, our algorithm makes at most

  max( 8 log T / δ^2, 64 / δ^3 )

mistakes.

Instead of assuming a perfect tree, assume a tree u that attains a cumulative δ-hinge loss L > 0. The mistake bound becomes

  max( 2L/δ + 8 log T / δ^2, 64 / δ^3 )

where L = Σ_{t=1}^T ℓ_t(u) and ℓ_t(u) = max(0, δ − y_t ⟨u, x_t⟩).

Algorithm Properties — Resources

- Let M_t be the number of mistakes up to round t.
- Set ε so that 1 − ε = 2^{−1/3}, i.e.

    x^+_{t,s} = 2^{−|s|/3}  if s = y_{t−i}^{t−1} for some i = 1, ..., t−1
    x^+_{t,s} = 0           otherwise

- Growth Bound: our algorithm will not grow a tree deeper than log_2 M_{T−1} + 4.

How to go about proving the properties

- Bounding how fast the tree grows is straightforward.
- Mistake bound: show that each update contributes towards a goal.
- We need a goal and a measure of progress.

Measuring Progress

- Goal: a fixed tree with good performance; for example, a vector u such that y_t ⟨u, x_t⟩ ≥ δ.
- Our adaptive tree is represented by w_t. Remember Σ_i w_{t,i} = 1.
- To be fair, assume u_i ≥ 0 and Σ_i u_i = 1.
- Measure of progress: the relative entropy between u and w_t,

    D(u || w_t) = Σ_i u_i log( u_i / w_{t,i} )

- Potential function: Φ(w_t) = D(u || w_t) ≥ 0.
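Computed numerically (a small sketch of mine; u and w_t can be any vectors on the simplex):

```python
import numpy as np

def relative_entropy(u, w):
    u, w = np.asarray(u, float), np.asarray(w, float)
    mask = u > 0                      # terms with u_i = 0 contribute 0
    return float(np.sum(u[mask] * np.log(u[mask] / w[mask])))

print(relative_entropy([0.5, 0.5, 0.0], [0.25, 0.25, 0.5]))  # log 2 ~ 0.693
```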

Proof Technique

- Upper bound the initial potential.
- Lower bound the change in the potential with each update.
- Keep the total effect of noise bounded. [Dekel et al., 2004]

Pictorially

[Figure: an animation of the potential, which starts at Φ(w_1) and can never drop below 0. For each example x_1, x_2, ... on which a mistake is made, the classic update contributes progress (the potential shrinks), the noise undoes part of it, and the rest is net progress; correctly predicted examples leave the potential unchanged. The noise accumulated so far is P_t.]

Proof

The total net progress is at most the initial potential Φ(w_1). Each mistake contributes at least a minimum amount of progress, and the total effect of the noise is at most mistakes^{2/3}, so

  (minimum progress per mistake) · mistakes − mistakes^{2/3} ≤ Φ(w_1)

Solving for the number of mistakes yields the mistake bound.

Multiclass Extension

- We maintain weights w^{(1)}, w^{(2)}, ..., w^{(k)}.
- Predict ŷ_t = argmax_i ⟨w^{(i)}, x_t⟩.
- In case of a mistake,

    θ^{(ŷ_t)}_{t+1,s} = θ^{(ŷ_t)}_{t,s} − α x_{t,s}
    θ^{(y_t)}_{t+1,s} = θ^{(y_t)}_{t,s} + α x_{t,s}

  for all s such that |s| ≤ d_t.
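A sketch of this update (my own illustration, assuming θ is a k × d array and a parallel array of suffix lengths):

```python
import numpy as np

def multiclass_update(theta, x_t, y_true, y_pred, suffix_len, d_t, alpha=0.5):
    """theta: k x d array; suffix_len: |s| for each of the d features."""
    if y_pred != y_true:
        shallow = suffix_len <= d_t                     # only suffixes |s| <= d_t
        theta[y_pred, shallow] -= alpha * x_t[shallow]  # demote the prediction
        theta[y_true, shallow] += alpha * x_t[shallow]  # promote the true class
    return theta
```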

Outline

- Prediction Suffix Trees
- Algorithm and Properties
- Results

Data

- 120 sequences of system calls from Outlook, Excel, and Firefox (40 each).
- The monitoring program records 23 different system calls.
- A typical sequence contains hundreds of thousands of system calls.

Results

Averages over the 40 sequences per application ("Perceptron" is the PST learning algorithm of [Dekel et al., 2004]):

  Outlook    Perceptron  Winnow
  % Error    5.1         4.43
  PST Size   41239       25679

  Firefox    Perceptron  Winnow
  % Error    14.86       13.88
  PST Size   21081       12662

  Excel      Perceptron  Winnow
  % Error    22.68       20.59
  PST Size   24402       15338

Moreover, Winnow made fewer mistakes and grew smaller trees on all 120 sequences.

Results

[Figure: three plots, one per application, of the total number of mistakes against the round t, comparing perceptron and winnow over sequences of roughly 800,000, 1,000,000, and 250,000 rounds.]

Related Work

- Many learning algorithms assume an a priori bound on the tree's depth [Willems et al., 1995; Ron et al., 1996; Pereira & Singer, 1999], ...
- [Dekel et al., 2004] present a perceptron algorithm similar to ours.
- [Kivinen & Warmuth, 1997] show how to compete against vectors u with ||u||_1 ≤ U.
- Sparsifying Winnow is popular (e.g. [Blum, 1997]), but without guarantees.

Summary

- Outlined the benefits of prediction suffix trees.
- Introduced an online algorithm for learning PSTs.
- It is competitive with the best fixed PST in hindsight.
- The resulting trees grow slowly, if necessary.
- On our task, it made fewer mistakes and grew smaller trees than other state-of-the-art algorithms.

References

Blum, A. (1997). Empirical support for Winnow and weighted-majority algorithms: Results on a calendar scheduling domain. Machine Learning, 26, 5–23.

Dekel, O., Shalev-Shwartz, S., & Singer, Y. (2004). The power of selective memory: Self-bounded learning of prediction suffix trees. Advances in Neural Information Processing Systems, 17.

Helmbold, D., & Schapire, R. (1997). Predicting nearly as well as the best pruning of a decision tree. Machine Learning, 27, 51–68.

Kivinen, J., & Warmuth, M. (1997). Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132, 1–63.

Littlestone, N. (1989). Mistake bounds and logarithmic linear-threshold learning algorithms. Doctoral dissertation, University of California, Santa Cruz, CA, USA.

Pereira, F., & Singer, Y. (1999). An efficient extension to mixture techniques for prediction and decision trees. Machine Learning, 36, 183–199.

Ron, D., Singer, Y., & Tishby, N. (1996). The power of amnesia: Learning probabilistic automata with variable memory length. Machine Learning, 25, 117–149.

Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65, 386–408.

Willems, F., Shtarkov, Y., & Tjalkens, T. (1995). The context-tree weighting method: Basic properties. IEEE Transactions on Information Theory, 41, 653–664.