Calculation of the Hessian for the Log-Log Frequency Distribution

Jason D. M. Rennie
[email protected]
June 2, 2005 (updated July 6, 2005)

Abstract

We calculate the diagonal entries of the Hessian for our log-log frequency models.

1 Introduction

We have discussed two parameterizations of the log-log term frequency model: (1) a model with a bias parameter for each word [2, 4], and (2) a model with an exponent parameter for each word [3]. Here we simply calculate additional derivatives for the two parameterizations.

2 Frequency Count Distribution

We calculate the diagonal entries of the Hessian for the length-conditional log-log term frequency count distribution.

2.1 Bias Per Word

Define $a$ and $\{b_i\}$ as parameters of the distribution. Let $x_{ij}$ be the frequency count of word $i$ in document $j$. Let $l_j = \sum_i x_{ij}$ be the length of document $j$. Assuming independence of word frequencies and documents, and conditioning on the length of each document, the data negative log-likelihood is
$$J = \sum_{i=1}^d \sum_{j=1}^n \left\{ \log\left[ \sum_{x=0}^{l_j} (x + b_i)^a \right] - a \log(x_{ij} + b_i) \right\}. \quad (1)$$
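For concreteness, Eq. (1) can be evaluated directly. The following Python sketch uses an illustrative data layout (a documents-by-words count matrix) and illustrative names that are not from the note itself:

```python
import numpy as np

def count_nll(a, b, X):
    """Negative log-likelihood of Eq. (1) for the bias-per-word model.

    X is an (n_docs, n_words) array of counts, b a per-word bias vector,
    a the shared exponent.  Names and layout are assumptions made for
    this sketch, not definitions from the note.
    """
    J = 0.0
    for j in range(X.shape[0]):
        l_j = X[j].sum()                  # document length l_j
        x = np.arange(l_j + 1)            # support x = 0, ..., l_j
        for i in range(X.shape[1]):
            Z = np.sum((x + b[i]) ** a)   # length-conditional normalizer
            J += np.log(Z) - a * np.log(X[j, i] + b[i])
    return J
```

For a one-word, one-document corpus with $x_{11} = 1$, $b_1 = 1$, $a = -1$, the normalizer is $1^{-1} + 2^{-1} = 3/2$ and $J = \log(3/2) + \log 2 = \log 3$, which the function reproduces.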

The partial derivatives are
$$\frac{\partial J}{\partial a} = \sum_{i=1}^d \sum_{j=1}^n \left\{ \frac{\sum_{x=0}^{l_j} (x + b_i)^a \log(x + b_i)}{\sum_{x=0}^{l_j} (x + b_i)^a} - \log(x_{ij} + b_i) \right\} \quad (2)$$
$$= \sum_{i=1}^d \sum_{j=1}^n \left\{ E_{P_{ij}}[\log(x + b_i)] - E_{\hat{P}_{ij}}[\log(x + b_i)] \right\}, \quad (3)$$
$$\frac{\partial J}{\partial b_i} = \sum_{j=1}^n \left\{ \frac{\sum_{x=0}^{l_j} (x + b_i)^a \frac{a}{x + b_i}}{\sum_{x=0}^{l_j} (x + b_i)^a} - \frac{a}{x_{ij} + b_i} \right\} \quad (4)$$
$$= \sum_{j=1}^n \left\{ E_{P_{ij}}\!\left[\frac{a}{x + b_i}\right] - E_{\hat{P}_{ij}}\!\left[\frac{a}{x + b_i}\right] \right\}. \quad (5)$$
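The expectation forms (3) and (5) can be checked against finite differences of Eq. (1). This sketch (illustrative names; Eq. (1) is restated compactly so the check is self-contained) computes the model distribution $P_{ij}$ explicitly and verifies both gradients numerically:

```python
import numpy as np

def nll(a, b, X):
    # Eq. (1), restated compactly for the finite-difference check.
    J = 0.0
    for j in range(X.shape[0]):
        x = np.arange(X[j].sum() + 1)
        for i in range(X.shape[1]):
            J += np.log(np.sum((x + b[i]) ** a)) - a * np.log(X[j, i] + b[i])
    return J

def grad(a, b, X):
    # Eqs. (2) and (4): model expectation minus data term.
    dJ_da, dJ_db = 0.0, np.zeros_like(b)
    for j in range(X.shape[0]):
        x = np.arange(X[j].sum() + 1)
        for i in range(X.shape[1]):
            w = (x + b[i]) ** a
            p = w / w.sum()                         # model distribution P_ij
            dJ_da += p @ np.log(x + b[i]) - np.log(X[j, i] + b[i])
            dJ_db[i] += p @ (a / (x + b[i])) - a / (X[j, i] + b[i])
    return dJ_da, dJ_db
```

Central differences of `nll` agree with `grad` to several decimal places on small random inputs, which is a quick way to catch sign or normalization slips in derivations like these.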

The diagonal elements of the Hessian are
$$\frac{\partial^2 J}{\partial a \partial a} = \sum_{i=1}^d \sum_{j=1}^n \left\{ \frac{\sum_{x=0}^{l_j} (x + b_i)^a \log^2(x + b_i)}{\sum_{x=0}^{l_j} (x + b_i)^a} - \left( \frac{\sum_{x=0}^{l_j} (x + b_i)^a \log(x + b_i)}{\sum_{x=0}^{l_j} (x + b_i)^a} \right)^2 \right\} \quad (6)$$
$$= \sum_{i=1}^d \sum_{j=1}^n \left\{ E_{P_{ij}}\!\left[\log^2(x + b_i)\right] - E_{P_{ij}}[\log(x + b_i)]^2 \right\}, \quad (7)$$
$$\frac{\partial^2 J}{\partial b_i \partial b_i} = \sum_{j=1}^n \left\{ \frac{\sum_{x=0}^{l_j} (x + b_i)^a \frac{a(a-1)}{(x + b_i)^2}}{\sum_{x=0}^{l_j} (x + b_i)^a} - \left( \frac{\sum_{x=0}^{l_j} (x + b_i)^a \frac{a}{x + b_i}}{\sum_{x=0}^{l_j} (x + b_i)^a} \right)^2 + \frac{a}{(x_{ij} + b_i)^2} \right\} \quad (8)$$
$$= \sum_{j=1}^n \left\{ E_{P_{ij}}\!\left[\frac{a(a - 1)}{(x + b_i)^2}\right] - E_{P_{ij}}\!\left[\frac{a}{x + b_i}\right]^2 + E_{\hat{P}_{ij}}\!\left[\frac{a}{(x + b_i)^2}\right] \right\}. \quad (9)$$

Note that the $a$ entry is a variance, so it is guaranteed to be non-negative. The $\{b_i\}$ entries may be negative if $a < 0$.
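Both diagonal entries can likewise be verified against second finite differences of the objective. The sketch below (illustrative names; the compact restatement of Eq. (1) is an assumption of this sketch) computes Eqs. (7) and (9) and also confirms that the $a$ entry, being a variance, is non-negative:

```python
import numpy as np

def nll(a, b, X):
    # Eq. (1), restated compactly for the finite-difference check.
    return sum(np.log(np.sum((np.arange(X[j].sum() + 1) + b[i]) ** a))
               - a * np.log(X[j, i] + b[i])
               for j in range(X.shape[0]) for i in range(X.shape[1]))

def hess_diag(a, b, X):
    # Eqs. (7) and (9): diagonal Hessian entries, bias-per-word model.
    H_aa, H_bb = 0.0, np.zeros_like(b)
    for j in range(X.shape[0]):
        x = np.arange(X[j].sum() + 1)
        for i in range(X.shape[1]):
            w = (x + b[i]) ** a
            p = w / w.sum()                          # model distribution P_ij
            lg = np.log(x + b[i])
            H_aa += p @ lg ** 2 - (p @ lg) ** 2      # a variance: >= 0
            H_bb[i] += (p @ (a * (a - 1) / (x + b[i]) ** 2)
                        - (p @ (a / (x + b[i]))) ** 2
                        + a / (X[j, i] + b[i]) ** 2)
    return H_aa, H_bb
```

With $a < 0$ the `H_bb` entries can indeed come out negative on small examples, consistent with the note's remark.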

2.2 Exponent Per Word

Define $\{a_i\}$ and $b$ as parameters of the distribution. Let $x_{ij}$ be the frequency count of word $i$ in document $j$. Let $l_j = \sum_i x_{ij}$ be the length of document $j$. Assuming independence of word frequencies and documents, and conditioning on the length of each document, the data negative log-likelihood is
$$J = \sum_{i=1}^d \sum_{j=1}^n \left\{ \log\left[ \sum_{x=0}^{l_j} (x + b)^{a_i} \right] - a_i \log(x_{ij} + b) \right\}. \quad (10)$$

The partial derivatives are
$$\frac{\partial J}{\partial a_i} = \sum_{j=1}^n \left\{ \frac{\sum_{x=0}^{l_j} (x + b)^{a_i} \log(x + b)}{\sum_{x=0}^{l_j} (x + b)^{a_i}} - \log(x_{ij} + b) \right\} \quad (11)$$
$$= \sum_{j=1}^n \left\{ E_{P_{ij}}[\log(x + b)] - E_{\hat{P}_{ij}}[\log(x + b)] \right\}, \quad (12)$$
$$\frac{\partial J}{\partial b} = \sum_{i=1}^d \sum_{j=1}^n \left\{ \frac{\sum_{x=0}^{l_j} (x + b)^{a_i} \frac{a_i}{x + b}}{\sum_{x=0}^{l_j} (x + b)^{a_i}} - \frac{a_i}{x_{ij} + b} \right\} \quad (13)$$
$$= \sum_{i=1}^d \sum_{j=1}^n \left\{ E_{P_{ij}}\!\left[\frac{a_i}{x + b}\right] - E_{\hat{P}_{ij}}\!\left[\frac{a_i}{x + b}\right] \right\}. \quad (14)$$

The diagonal elements of the Hessian are
$$\frac{\partial^2 J}{\partial a_i \partial a_i} = \sum_{j=1}^n \left\{ \frac{\sum_{x=0}^{l_j} (x + b)^{a_i} \log^2(x + b)}{\sum_{x=0}^{l_j} (x + b)^{a_i}} - \left( \frac{\sum_{x=0}^{l_j} (x + b)^{a_i} \log(x + b)}{\sum_{x=0}^{l_j} (x + b)^{a_i}} \right)^2 \right\} \quad (15)$$
$$= \sum_{j=1}^n \left\{ E_{P_{ij}}\!\left[\log^2(x + b)\right] - E_{P_{ij}}[\log(x + b)]^2 \right\}, \quad (16)$$
$$\frac{\partial^2 J}{\partial b \partial b} = \sum_{i=1}^d \sum_{j=1}^n \left\{ \frac{\sum_{x=0}^{l_j} (x + b)^{a_i} \frac{a_i(a_i - 1)}{(x + b)^2}}{\sum_{x=0}^{l_j} (x + b)^{a_i}} - \left( \frac{\sum_{x=0}^{l_j} (x + b)^{a_i} \frac{a_i}{x + b}}{\sum_{x=0}^{l_j} (x + b)^{a_i}} \right)^2 + \frac{a_i}{(x_{ij} + b)^2} \right\} \quad (17)$$
$$= \sum_{i=1}^d \sum_{j=1}^n \left\{ E_{P_{ij}}\!\left[\frac{a_i(a_i - 1)}{(x + b)^2}\right] - E_{P_{ij}}\!\left[\frac{a_i}{x + b}\right]^2 + E_{\hat{P}_{ij}}\!\left[\frac{a_i}{(x + b)^2}\right] \right\}. \quad (18)$$

Note that the $\{a_i\}$ Hessian elements are each a variance; they are each guaranteed to be non-negative. We show in [1] that if $b$ is held fixed, the optimization surface is convex. The $b$ element may be negative if some of the $\{a_i\}$ are negative (which is the usual case).
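The variance form of Eq. (16) makes the non-negativity claim easy to spot-check. This sketch (illustrative names; the shared bias `b` is a scalar here) evaluates one $(a_i, a_i)$ entry as $\mathrm{Var}_{P_{ij}}[\log(x + b)]$ and confirms it is non-negative for arbitrary small inputs:

```python
import numpy as np

def hess_ai_ai(a, b, X, i):
    """Eq. (16): the (a_i, a_i) Hessian entry of the exponent-per-word
    model, written as a variance of log(x + b) under the model
    distribution P_ij.  Names and data layout are assumptions of this
    sketch."""
    H = 0.0
    for j in range(X.shape[0]):
        x = np.arange(X[j].sum() + 1)
        w = (x + b) ** a[i]
        p = w / w.sum()                    # model distribution P_ij
        lg = np.log(x + b)
        H += p @ lg ** 2 - (p @ lg) ** 2   # Var[log(x + b)] >= 0
    return H
```

Because each summand is a variance, the sum is non-negative regardless of the sign of $a_i$, which is exactly why fixing $b$ leaves a well-behaved problem in the $\{a_i\}$.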

3 Frequency Rate Distribution

3.1 Bias Per Word

Define $f_i(x, l) = \left(\frac{x}{l} + b_i\right)^a$. Then the data negative log-likelihood for a set of documents is
$$J = \sum_{i,j} \left\{ \log[f_i(0, l_j) - f_i(l_j + 1, l_j)] - \log[f_i(x_{ij}, l_j) - f_i(x_{ij} + 1, l_j)] \right\}, \quad (19)$$

where $l_j$ denotes the length of document $j$ and $x_{ij}$ denotes the frequency of word $i$ in document $j$. First we calculate derivatives of $f_i$ with respect to the parameters:
$$\frac{\partial f_i(x, l)}{\partial a} = f_i(x, l) \log\left(\frac{x}{l} + b_i\right), \quad (20)$$
$$\frac{\partial f_i(x, l)}{\partial b_i} = f_i(x, l) \frac{a}{\frac{x}{l} + b_i}. \quad (21)$$
The objective derivatives are
$$\frac{\partial J}{\partial a} = \sum_{i,j} \left( \frac{\frac{\partial f_i(0, l_j)}{\partial a} - \frac{\partial f_i(l_j + 1, l_j)}{\partial a}}{f_i(0, l_j) - f_i(l_j + 1, l_j)} - \frac{\frac{\partial f_i(x_{ij}, l_j)}{\partial a} - \frac{\partial f_i(x_{ij} + 1, l_j)}{\partial a}}{f_i(x_{ij}, l_j) - f_i(x_{ij} + 1, l_j)} \right), \quad (22)$$
$$\frac{\partial J}{\partial b_i} = \sum_{j} \left( \frac{\frac{\partial f_i(0, l_j)}{\partial b_i} - \frac{\partial f_i(l_j + 1, l_j)}{\partial b_i}}{f_i(0, l_j) - f_i(l_j + 1, l_j)} - \frac{\frac{\partial f_i(x_{ij}, l_j)}{\partial b_i} - \frac{\partial f_i(x_{ij} + 1, l_j)}{\partial b_i}}{f_i(x_{ij}, l_j) - f_i(x_{ij} + 1, l_j)} \right). \quad (23)$$

Next, we calculate second derivatives of $f_i$:
$$\frac{\partial^2 f_i}{\partial a \partial a} = f_i(x, l) \log^2\left(\frac{x}{l} + b_i\right), \quad (24)$$
$$\frac{\partial^2 f_i}{\partial b_i \partial b_i} = f_i(x, l) \frac{a(a - 1)}{\left(\frac{x}{l} + b_i\right)^2}. \quad (25)$$

Finally, we calculate the second derivatives of the objective,
$$\frac{\partial^2 J}{\partial a \partial a} = \sum_{i,j} \left( \frac{\frac{\partial^2 f_i(0, l_j)}{\partial a \partial a} - \frac{\partial^2 f_i(l_j + 1, l_j)}{\partial a \partial a}}{f_i(0, l_j) - f_i(l_j + 1, l_j)} - \frac{\left( \frac{\partial f_i(0, l_j)}{\partial a} - \frac{\partial f_i(l_j + 1, l_j)}{\partial a} \right)^2}{(f_i(0, l_j) - f_i(l_j + 1, l_j))^2} - \frac{\frac{\partial^2 f_i(x_{ij}, l_j)}{\partial a \partial a} - \frac{\partial^2 f_i(x_{ij} + 1, l_j)}{\partial a \partial a}}{f_i(x_{ij}, l_j) - f_i(x_{ij} + 1, l_j)} + \frac{\left( \frac{\partial f_i(x_{ij}, l_j)}{\partial a} - \frac{\partial f_i(x_{ij} + 1, l_j)}{\partial a} \right)^2}{(f_i(x_{ij}, l_j) - f_i(x_{ij} + 1, l_j))^2} \right), \quad (26)$$
$$\frac{\partial^2 J}{\partial b_i \partial b_i} = \sum_{j} \left( \frac{\frac{\partial^2 f_i(0, l_j)}{\partial b_i \partial b_i} - \frac{\partial^2 f_i(l_j + 1, l_j)}{\partial b_i \partial b_i}}{f_i(0, l_j) - f_i(l_j + 1, l_j)} - \frac{\left( \frac{\partial f_i(0, l_j)}{\partial b_i} - \frac{\partial f_i(l_j + 1, l_j)}{\partial b_i} \right)^2}{(f_i(0, l_j) - f_i(l_j + 1, l_j))^2} - \frac{\frac{\partial^2 f_i(x_{ij}, l_j)}{\partial b_i \partial b_i} - \frac{\partial^2 f_i(x_{ij} + 1, l_j)}{\partial b_i \partial b_i}}{f_i(x_{ij}, l_j) - f_i(x_{ij} + 1, l_j)} + \frac{\left( \frac{\partial f_i(x_{ij}, l_j)}{\partial b_i} - \frac{\partial f_i(x_{ij} + 1, l_j)}{\partial b_i} \right)^2}{(f_i(x_{ij}, l_j) - f_i(x_{ij} + 1, l_j))^2} \right). \quad (27)$$
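Equations (26) and (27) follow a single quotient-rule pattern, $\frac{A''}{A} - \left(\frac{A'}{A}\right)^2$ applied to each $\log$ term, so they are easy to sanity-check numerically. The sketch below (illustrative names; only the $a$-$a$ entry is shown) assembles Eq. (26) from the derivative formulas (20) and (24) and compares it against a finite difference of Eq. (19):

```python
import numpy as np

def f(x, l, a, bi):
    # f_i(x, l) = (x/l + b_i)^a, as defined in Section 3.1.
    return (x / l + bi) ** a

def rate_nll(a, b, X):
    # Eq. (19): difference-of-f negative log-likelihood.
    J = 0.0
    for j in range(X.shape[0]):
        l = X[j].sum()
        for i in range(X.shape[1]):
            x = X[j, i]
            J += (np.log(f(0, l, a, b[i]) - f(l + 1, l, a, b[i]))
                  - np.log(f(x, l, a, b[i]) - f(x + 1, l, a, b[i])))
    return J

def d2_log_diff(x_lo, x_hi, l, a, bi):
    # d^2/da^2 of log[f(x_lo, l) - f(x_hi, l)] via Eqs. (20), (24)
    # and the quotient-rule pattern of Eq. (26): A''/A - (A'/A)^2.
    def fd(x):
        fx = f(x, l, a, bi)
        lg = np.log(x / l + bi)
        return fx, fx * lg, fx * lg ** 2   # f, df/da, d^2f/da^2
    f0, d0, s0 = fd(x_lo)
    f1, d1, s1 = fd(x_hi)
    A, dA, d2A = f0 - f1, d0 - d1, s0 - s1
    return d2A / A - (dA / A) ** 2

def rate_hess_aa(a, b, X):
    # Eq. (26), assembled from the per-pair terms above.
    H = 0.0
    for j in range(X.shape[0]):
        l = X[j].sum()
        for i in range(X.shape[1]):
            H += (d2_log_diff(0, l + 1, l, a, b[i])
                  - d2_log_diff(X[j, i], X[j, i] + 1, l, a, b[i]))
    return H
```

The same `d2_log_diff` helper, with the $b_i$ derivatives of Eqs. (21) and (25) substituted, gives the Eq. (27) entries.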

3.2 Exponent Per Word

Define $f_i(x, l) = \left(\frac{x}{l} + b\right)^{a_i}$. Then the data negative log-likelihood for a set of documents is
$$J = \sum_{i,j} \left\{ \log[f_i(0, l_j) - f_i(l_j + 1, l_j)] - \log[f_i(x_{ij}, l_j) - f_i(x_{ij} + 1, l_j)] \right\}, \quad (28)$$
where $l_j$ denotes the length of document $j$ and $x_{ij}$ denotes the frequency of word $i$ in document $j$. First we calculate derivatives of $f_i$ with respect to the parameters:
$$\frac{\partial f_i(x, l)}{\partial a_i} = f_i(x, l) \log\left(\frac{x}{l} + b\right), \quad (29)$$
$$\frac{\partial f_i(x, l)}{\partial b} = f_i(x, l) \frac{a_i}{\frac{x}{l} + b}. \quad (30)$$

The objective derivatives are
$$\frac{\partial J}{\partial a_i} = \sum_{j} \left( \frac{\frac{\partial f_i(0, l_j)}{\partial a_i} - \frac{\partial f_i(l_j + 1, l_j)}{\partial a_i}}{f_i(0, l_j) - f_i(l_j + 1, l_j)} - \frac{\frac{\partial f_i(x_{ij}, l_j)}{\partial a_i} - \frac{\partial f_i(x_{ij} + 1, l_j)}{\partial a_i}}{f_i(x_{ij}, l_j) - f_i(x_{ij} + 1, l_j)} \right), \quad (31)$$
$$\frac{\partial J}{\partial b} = \sum_{i,j} \left( \frac{\frac{\partial f_i(0, l_j)}{\partial b} - \frac{\partial f_i(l_j + 1, l_j)}{\partial b}}{f_i(0, l_j) - f_i(l_j + 1, l_j)} - \frac{\frac{\partial f_i(x_{ij}, l_j)}{\partial b} - \frac{\partial f_i(x_{ij} + 1, l_j)}{\partial b}}{f_i(x_{ij}, l_j) - f_i(x_{ij} + 1, l_j)} \right). \quad (32)$$

Next, we calculate second derivatives of $f_i$:
$$\frac{\partial^2 f_i}{\partial a_i \partial a_i} = f_i(x, l) \log^2\left(\frac{x}{l} + b\right), \quad (33)$$
$$\frac{\partial^2 f_i}{\partial b \partial b} = f_i(x, l) \frac{a_i(a_i - 1)}{\left(\frac{x}{l} + b\right)^2}. \quad (34)$$

Finally, we calculate the second derivatives of the objective,
$$\frac{\partial^2 J}{\partial a_i \partial a_i} = \sum_{j} \left( \frac{\frac{\partial^2 f_i(0, l_j)}{\partial a_i \partial a_i} - \frac{\partial^2 f_i(l_j + 1, l_j)}{\partial a_i \partial a_i}}{f_i(0, l_j) - f_i(l_j + 1, l_j)} - \frac{\left( \frac{\partial f_i(0, l_j)}{\partial a_i} - \frac{\partial f_i(l_j + 1, l_j)}{\partial a_i} \right)^2}{(f_i(0, l_j) - f_i(l_j + 1, l_j))^2} - \frac{\frac{\partial^2 f_i(x_{ij}, l_j)}{\partial a_i \partial a_i} - \frac{\partial^2 f_i(x_{ij} + 1, l_j)}{\partial a_i \partial a_i}}{f_i(x_{ij}, l_j) - f_i(x_{ij} + 1, l_j)} + \frac{\left( \frac{\partial f_i(x_{ij}, l_j)}{\partial a_i} - \frac{\partial f_i(x_{ij} + 1, l_j)}{\partial a_i} \right)^2}{(f_i(x_{ij}, l_j) - f_i(x_{ij} + 1, l_j))^2} \right), \quad (35)$$
$$\frac{\partial^2 J}{\partial b \partial b} = \sum_{i,j} \left( \frac{\frac{\partial^2 f_i(0, l_j)}{\partial b \partial b} - \frac{\partial^2 f_i(l_j + 1, l_j)}{\partial b \partial b}}{f_i(0, l_j) - f_i(l_j + 1, l_j)} - \frac{\left( \frac{\partial f_i(0, l_j)}{\partial b} - \frac{\partial f_i(l_j + 1, l_j)}{\partial b} \right)^2}{(f_i(0, l_j) - f_i(l_j + 1, l_j))^2} - \frac{\frac{\partial^2 f_i(x_{ij}, l_j)}{\partial b \partial b} - \frac{\partial^2 f_i(x_{ij} + 1, l_j)}{\partial b \partial b}}{f_i(x_{ij}, l_j) - f_i(x_{ij} + 1, l_j)} + \frac{\left( \frac{\partial f_i(x_{ij}, l_j)}{\partial b} - \frac{\partial f_i(x_{ij} + 1, l_j)}{\partial b} \right)^2}{(f_i(x_{ij}, l_j) - f_i(x_{ij} + 1, l_j))^2} \right). \quad (36)$$

References

[1] J. D. M. Rennie. A class of convex functions. http://people.csail.mit.edu/~jrennie/writing, May 2005.

[2] J. D. M. Rennie. Learning a log-log term frequency model. http://people.csail.mit.edu/~jrennie/writing, May 2005.

[3] J. D. M. Rennie. A role-reversal in the log-log model. http://people.csail.mit.edu/~jrennie/writing, May 2005.

[4] J. D. M. Rennie. Using a log-log distribution to model term frequency rates. http://people.csail.mit.edu/~jrennie/writing, May 2005.
