Calculation of the Hessian for the Log-Log Frequency Distribution

Jason D. M. Rennie
[email protected]
June 2, 2005 (updated July 6, 2005)

Abstract

We calculate the diagonal entries of the Hessian for our log-log frequency models.
1 Introduction
We have discussed two parameterizations of the log-log term frequency model: (1) a model with a bias parameter for each word [2, 4], and (2) a model with an exponent parameter for each word [3]. Here we simply calculate additional derivatives for the two parameterizations.
2 Frequency Count Distribution

We calculate the diagonal entries of the Hessian for the length-conditional log-log term frequency count distribution.
2.1 Bias Per Word
Define a and {b_i} as the parameters of the distribution. Let x_{ij} be the frequency count of word i in document j. Let l_j = \sum_i x_{ij} be the length of document j. Assuming independence of word frequencies and documents, and conditioning on the length of each document, the negative log-likelihood of the data is

J = \sum_{i=1}^d \sum_{j=1}^n \left[ \log\left( \sum_{x=0}^{l_j} (x + b_i)^a \right) - a \log(x_{ij} + b_i) \right].    (1)
The partial derivatives are

\frac{\partial J}{\partial a} = \sum_{i=1}^d \sum_{j=1}^n \left\{ \frac{\sum_{x=0}^{l_j} (x + b_i)^a \log(x + b_i)}{\sum_{x=0}^{l_j} (x + b_i)^a} - \log(x_{ij} + b_i) \right\}    (2)
= \sum_{i=1}^d \sum_{j=1}^n \left\{ E_{P_{ij}}[\log(x + b_i)] - E_{\hat{P}_{ij}}[\log(x + b_i)] \right\},    (3)

\frac{\partial J}{\partial b_i} = \sum_{j=1}^n \left\{ \frac{\sum_{x=0}^{l_j} (x + b_i)^a \frac{a}{x + b_i}}{\sum_{x=0}^{l_j} (x + b_i)^a} - \frac{a}{x_{ij} + b_i} \right\}    (4)
= \sum_{j=1}^n \left\{ E_{P_{ij}}\!\left[\frac{a}{x + b_i}\right] - E_{\hat{P}_{ij}}\!\left[\frac{a}{x + b_i}\right] \right\}.    (5)

The diagonal elements of the Hessian are

\frac{\partial^2 J}{\partial a \partial a} = \sum_{i=1}^d \sum_{j=1}^n \left\{ \frac{\sum_{x=0}^{l_j} (x + b_i)^a \log^2(x + b_i)}{\sum_{x=0}^{l_j} (x + b_i)^a} - \left( \frac{\sum_{x=0}^{l_j} (x + b_i)^a \log(x + b_i)}{\sum_{x=0}^{l_j} (x + b_i)^a} \right)^2 \right\}    (6)
= \sum_{i=1}^d \sum_{j=1}^n \left\{ E_{P_{ij}}[\log^2(x + b_i)] - E_{P_{ij}}[\log(x + b_i)]^2 \right\},    (7)

\frac{\partial^2 J}{\partial b_i \partial b_i} = \sum_{j=1}^n \left\{ \frac{\sum_{x=0}^{l_j} (x + b_i)^a \frac{a(a-1)}{(x + b_i)^2}}{\sum_{x=0}^{l_j} (x + b_i)^a} - \left( \frac{\sum_{x=0}^{l_j} (x + b_i)^a \frac{a}{x + b_i}}{\sum_{x=0}^{l_j} (x + b_i)^a} \right)^2 + \frac{a}{(x_{ij} + b_i)^2} \right\}    (8)
= \sum_{j=1}^n \left\{ E_{P_{ij}}\!\left[\frac{a(a-1)}{(x + b_i)^2}\right] - E_{P_{ij}}\!\left[\frac{a}{x + b_i}\right]^2 + E_{\hat{P}_{ij}}\!\left[\frac{a}{(x + b_i)^2}\right] \right\}.    (9)
Note that the a element is a variance; it is guaranteed to be non-negative. The {b_i} elements may be negative if a < 0.
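The derivation above is easy to check numerically. The sketch below is not part of the original note: the variable names and the tiny random dataset are illustrative, and NumPy is assumed. It implements Eqs. (1), (2), and (7) and compares them against finite differences.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 4                                        # words, documents
x = rng.integers(1, 5, size=(d, n)).astype(float)  # counts x_ij >= 1
l = x.sum(axis=0)                                  # lengths l_j
a = -1.3
b = rng.uniform(0.5, 1.5, size=d)                  # bias b_i per word

def J(a, b):
    """Negative log-likelihood, Eq. (1)."""
    total = 0.0
    for i in range(d):
        for j in range(n):
            xs = np.arange(l[j] + 1) + b[i]        # x + b_i for x = 0..l_j
            total += np.log(np.sum(xs ** a)) - a * np.log(x[i, j] + b[i])
    return total

def dJ_da(a, b):
    """Eq. (2)/(3): E_{P_ij}[log(x+b_i)] - log(x_ij+b_i), summed over i, j."""
    total = 0.0
    for i in range(d):
        for j in range(n):
            xs = np.arange(l[j] + 1) + b[i]
            p = xs ** a / np.sum(xs ** a)          # the distribution P_ij(x)
            total += p @ np.log(xs) - np.log(x[i, j] + b[i])
    return total

def d2J_da2(a, b):
    """Eq. (6)/(7): sum over i, j of Var_{P_ij}[log(x+b_i)]."""
    total = 0.0
    for i in range(d):
        for j in range(n):
            xs = np.arange(l[j] + 1) + b[i]
            p = xs ** a / np.sum(xs ** a)
            total += p @ np.log(xs) ** 2 - (p @ np.log(xs)) ** 2
    return total

eps = 1e-4
print(abs(dJ_da(a, b) - (J(a + eps, b) - J(a - eps, b)) / (2 * eps)))
print(abs(d2J_da2(a, b) - (J(a + eps, b) - 2 * J(a, b) + J(a - eps, b)) / eps ** 2))
print(d2J_da2(a, b) >= 0)   # a sum of variances, hence non-negative
```

The last line confirms the observation above: the a entry is a sum of per-(i, j) variances of log(x + b_i), so it cannot be negative.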
2.2 Exponent Per Word
Define {a_i} and b as the parameters of the distribution. Let x_{ij} be the frequency count of word i in document j. Let l_j = \sum_i x_{ij} be the length of document j. Assuming independence of word frequencies and documents, and conditioning on the length of each document, the negative log-likelihood of the data is

J = \sum_{i=1}^d \sum_{j=1}^n \left[ \log\left( \sum_{x=0}^{l_j} (x + b)^{a_i} \right) - a_i \log(x_{ij} + b) \right].    (10)
The partial derivatives are

\frac{\partial J}{\partial a_i} = \sum_{j=1}^n \left\{ \frac{\sum_{x=0}^{l_j} (x + b)^{a_i} \log(x + b)}{\sum_{x=0}^{l_j} (x + b)^{a_i}} - \log(x_{ij} + b) \right\}    (11)
= \sum_{j=1}^n \left\{ E_{P_{ij}}[\log(x + b)] - E_{\hat{P}_{ij}}[\log(x + b)] \right\},    (12)

\frac{\partial J}{\partial b} = \sum_{i=1}^d \sum_{j=1}^n \left\{ \frac{\sum_{x=0}^{l_j} (x + b)^{a_i} \frac{a_i}{x + b}}{\sum_{x=0}^{l_j} (x + b)^{a_i}} - \frac{a_i}{x_{ij} + b} \right\}    (13)
= \sum_{i=1}^d \sum_{j=1}^n \left\{ E_{P_{ij}}\!\left[\frac{a_i}{x + b}\right] - E_{\hat{P}_{ij}}\!\left[\frac{a_i}{x + b}\right] \right\}.    (14)

The diagonal elements of the Hessian are

\frac{\partial^2 J}{\partial a_i \partial a_i} = \sum_{j=1}^n \left\{ \frac{\sum_{x=0}^{l_j} (x + b)^{a_i} \log^2(x + b)}{\sum_{x=0}^{l_j} (x + b)^{a_i}} - \left( \frac{\sum_{x=0}^{l_j} (x + b)^{a_i} \log(x + b)}{\sum_{x=0}^{l_j} (x + b)^{a_i}} \right)^2 \right\}    (15)
= \sum_{j=1}^n \left\{ E_{P_{ij}}[\log^2(x + b)] - E_{P_{ij}}[\log(x + b)]^2 \right\},    (16)

\frac{\partial^2 J}{\partial b \partial b} = \sum_{i=1}^d \sum_{j=1}^n \left\{ \frac{\sum_{x=0}^{l_j} (x + b)^{a_i} \frac{a_i(a_i - 1)}{(x + b)^2}}{\sum_{x=0}^{l_j} (x + b)^{a_i}} - \left( \frac{\sum_{x=0}^{l_j} (x + b)^{a_i} \frac{a_i}{x + b}}{\sum_{x=0}^{l_j} (x + b)^{a_i}} \right)^2 + \frac{a_i}{(x_{ij} + b)^2} \right\}    (17)
= \sum_{i=1}^d \sum_{j=1}^n \left\{ E_{P_{ij}}\!\left[\frac{a_i(a_i - 1)}{(x + b)^2}\right] - E_{P_{ij}}\!\left[\frac{a_i}{x + b}\right]^2 + E_{\hat{P}_{ij}}\!\left[\frac{a_i}{(x + b)^2}\right] \right\}.    (18)
Note that the {a_i} Hessian elements are each a variance; they are each guaranteed to be non-negative. We show in [1] that if b is held fixed, the optimization surface is convex. The b element may be negative if some of the {a_i} are negative (which is the usual case).
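The b entry, Eq. (17), can be checked the same way. The sketch below is illustrative (names and random data are not from the original note; NumPy assumed) and compares Eq. (17) to a second-order finite difference of Eq. (10), with the {a_i} drawn negative as in the usual case.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 3, 4
x = rng.integers(1, 5, size=(d, n)).astype(float)  # counts x_ij >= 1
l = x.sum(axis=0)                                  # lengths l_j
a = rng.uniform(-2.0, -0.5, size=d)                # exponent a_i per word (negative)
b = 0.8

def J(a, b):
    """Negative log-likelihood, Eq. (10)."""
    total = 0.0
    for i in range(d):
        for j in range(n):
            xs = np.arange(l[j] + 1) + b
            total += np.log(np.sum(xs ** a[i])) - a[i] * np.log(x[i, j] + b)
    return total

def d2J_db2(a, b):
    """Eq. (17): E[a_i(a_i-1)/(x+b)^2] - E[a_i/(x+b)]^2 + a_i/(x_ij+b)^2."""
    total = 0.0
    for i in range(d):
        for j in range(n):
            xs = np.arange(l[j] + 1) + b
            p = xs ** a[i] / np.sum(xs ** a[i])    # the distribution P_ij(x)
            total += (p @ (a[i] * (a[i] - 1) / xs ** 2)
                      - (p @ (a[i] / xs)) ** 2
                      + a[i] / (x[i, j] + b) ** 2)
    return total

eps = 1e-4
fd = (J(a, b + eps) - 2 * J(a, b) + J(a, b - eps)) / eps ** 2
print(abs(d2J_db2(a, b) - fd))   # should be near zero
```

Unlike the {a_i} entries, this quantity is not a variance, so nothing forces it to be non-negative when the {a_i} are negative.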
3 Frequency Rate Distribution

3.1 Bias Per Word
Define f_i(x, l) = \left( \frac{x}{l} + b_i \right)^a. Then the negative log-likelihood of the data for a set of documents is

J = \sum_{i,j} \left\{ \log[f_i(0, l_j) - f_i(l_j + 1, l_j)] - \log[f_i(x_{ij}, l_j) - f_i(x_{ij} + 1, l_j)] \right\},    (19)
where l_j denotes the length of document j and x_{ij} denotes the frequency of word i in document j. First we calculate derivatives of f_i with respect to the parameters:

\frac{\partial f_i(x, l)}{\partial a} = f_i(x, l) \log\left( \frac{x}{l} + b_i \right),    (20)
\frac{\partial f_i(x, l)}{\partial b_i} = f_i(x, l) \frac{a}{\frac{x}{l} + b_i}.    (21)

The objective derivatives are

\frac{\partial J}{\partial a} = \sum_{i,j} \left( \frac{\frac{\partial f_i(0, l_j)}{\partial a} - \frac{\partial f_i(l_j + 1, l_j)}{\partial a}}{f_i(0, l_j) - f_i(l_j + 1, l_j)} - \frac{\frac{\partial f_i(x_{ij}, l_j)}{\partial a} - \frac{\partial f_i(x_{ij} + 1, l_j)}{\partial a}}{f_i(x_{ij}, l_j) - f_i(x_{ij} + 1, l_j)} \right),    (22)

\frac{\partial J}{\partial b_i} = \sum_j \left( \frac{\frac{\partial f_i(0, l_j)}{\partial b_i} - \frac{\partial f_i(l_j + 1, l_j)}{\partial b_i}}{f_i(0, l_j) - f_i(l_j + 1, l_j)} - \frac{\frac{\partial f_i(x_{ij}, l_j)}{\partial b_i} - \frac{\partial f_i(x_{ij} + 1, l_j)}{\partial b_i}}{f_i(x_{ij}, l_j) - f_i(x_{ij} + 1, l_j)} \right).    (23)
Next, we calculate second derivatives of f_i:

\frac{\partial^2 f_i}{\partial a \partial a} = f_i(x, l) \log^2\left( \frac{x}{l} + b_i \right),    (24)
\frac{\partial^2 f_i}{\partial b_i \partial b_i} = f_i(x, l) \frac{a(a - 1)}{\left( \frac{x}{l} + b_i \right)^2}.    (25)
Finally, we calculate the second derivatives of the objective:

\frac{\partial^2 J}{\partial a \partial a} = \sum_{i,j} \left( \frac{\frac{\partial^2 f_i(0, l_j)}{\partial a \partial a} - \frac{\partial^2 f_i(l_j + 1, l_j)}{\partial a \partial a}}{f_i(0, l_j) - f_i(l_j + 1, l_j)} - \frac{\left( \frac{\partial f_i(0, l_j)}{\partial a} - \frac{\partial f_i(l_j + 1, l_j)}{\partial a} \right)^2}{(f_i(0, l_j) - f_i(l_j + 1, l_j))^2} - \frac{\frac{\partial^2 f_i(x_{ij}, l_j)}{\partial a \partial a} - \frac{\partial^2 f_i(x_{ij} + 1, l_j)}{\partial a \partial a}}{f_i(x_{ij}, l_j) - f_i(x_{ij} + 1, l_j)} + \frac{\left( \frac{\partial f_i(x_{ij}, l_j)}{\partial a} - \frac{\partial f_i(x_{ij} + 1, l_j)}{\partial a} \right)^2}{(f_i(x_{ij}, l_j) - f_i(x_{ij} + 1, l_j))^2} \right),    (26)

\frac{\partial^2 J}{\partial b_i \partial b_i} = \sum_j \left( \frac{\frac{\partial^2 f_i(0, l_j)}{\partial b_i \partial b_i} - \frac{\partial^2 f_i(l_j + 1, l_j)}{\partial b_i \partial b_i}}{f_i(0, l_j) - f_i(l_j + 1, l_j)} - \frac{\left( \frac{\partial f_i(0, l_j)}{\partial b_i} - \frac{\partial f_i(l_j + 1, l_j)}{\partial b_i} \right)^2}{(f_i(0, l_j) - f_i(l_j + 1, l_j))^2} - \frac{\frac{\partial^2 f_i(x_{ij}, l_j)}{\partial b_i \partial b_i} - \frac{\partial^2 f_i(x_{ij} + 1, l_j)}{\partial b_i \partial b_i}}{f_i(x_{ij}, l_j) - f_i(x_{ij} + 1, l_j)} + \frac{\left( \frac{\partial f_i(x_{ij}, l_j)}{\partial b_i} - \frac{\partial f_i(x_{ij} + 1, l_j)}{\partial b_i} \right)^2}{(f_i(x_{ij}, l_j) - f_i(x_{ij} + 1, l_j))^2} \right).    (27)
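The chain of definitions for the rate model composes easily into code. The sketch below is not from the original note (names and the random data are illustrative; NumPy assumed): it evaluates Eq. (19) and its first and second derivatives with respect to a, Eqs. (22) and (26), and verifies both against finite differences.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 3, 4
x = rng.integers(1, 5, size=(d, n)).astype(float)   # counts x_ij >= 1
l = x.sum(axis=0)                                   # lengths l_j
a = -1.5
b = rng.uniform(0.5, 1.5, size=d)                   # bias b_i per word

def fgh(i, xx, ll, a):
    """f_i(x, l) = (x/l + b_i)^a and its first and second a-derivatives,
    per Eqs. (20) and (24)."""
    r = xx / ll + b[i]
    fv = r ** a
    return fv, fv * np.log(r), fv * np.log(r) ** 2

def J(a):
    """Negative log-likelihood, Eq. (19)."""
    total = 0.0
    for i in range(d):
        for j in range(n):
            f0 = fgh(i, 0, l[j], a)[0]
            f1 = fgh(i, l[j] + 1, l[j], a)[0]
            fx = fgh(i, x[i, j], l[j], a)[0]
            fy = fgh(i, x[i, j] + 1, l[j], a)[0]
            total += np.log(f0 - f1) - np.log(fx - fy)
    return total

def dJ_da(a):
    """Gradient with respect to a, Eq. (22)."""
    total = 0.0
    for i in range(d):
        for j in range(n):
            f0, g0, _ = fgh(i, 0, l[j], a)
            f1, g1, _ = fgh(i, l[j] + 1, l[j], a)
            fx, gx, _ = fgh(i, x[i, j], l[j], a)
            fy, gy, _ = fgh(i, x[i, j] + 1, l[j], a)
            total += (g0 - g1) / (f0 - f1) - (gx - gy) / (fx - fy)
    return total

def d2J_da2(a):
    """Hessian entry for a, Eq. (26)."""
    total = 0.0
    for i in range(d):
        for j in range(n):
            f0, g0, h0 = fgh(i, 0, l[j], a)
            f1, g1, h1 = fgh(i, l[j] + 1, l[j], a)
            fx, gx, hx = fgh(i, x[i, j], l[j], a)
            fy, gy, hy = fgh(i, x[i, j] + 1, l[j], a)
            total += ((h0 - h1) / (f0 - f1) - ((g0 - g1) / (f0 - f1)) ** 2
                      - (hx - hy) / (fx - fy) + ((gx - gy) / (fx - fy)) ** 2)
    return total

eps = 1e-4
print(abs(dJ_da(a) - (J(a + eps) - J(a - eps)) / (2 * eps)))
print(abs(d2J_da2(a) - (J(a + eps) - 2 * J(a) + J(a - eps)) / eps ** 2))
```

With a < 0, f_i is decreasing in x, so every difference f_i(x, l) - f_i(x + 1, l) inside the logarithms is positive.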
3.2 Exponent Per Word
Define f_i(x, l) = \left( \frac{x}{l} + b \right)^{a_i}. Then the negative log-likelihood of the data for a set of documents is

J = \sum_{i,j} \left\{ \log[f_i(0, l_j) - f_i(l_j + 1, l_j)] - \log[f_i(x_{ij}, l_j) - f_i(x_{ij} + 1, l_j)] \right\},    (28)

where l_j denotes the length of document j and x_{ij} denotes the frequency of word i in document j. First we calculate derivatives of f_i with respect to the parameters:

\frac{\partial f_i(x, l)}{\partial a_i} = f_i(x, l) \log\left( \frac{x}{l} + b \right),    (29)
\frac{\partial f_i(x, l)}{\partial b} = f_i(x, l) \frac{a_i}{\frac{x}{l} + b}.    (30)
The objective derivatives are

\frac{\partial J}{\partial a_i} = \sum_j \left( \frac{\frac{\partial f_i(0, l_j)}{\partial a_i} - \frac{\partial f_i(l_j + 1, l_j)}{\partial a_i}}{f_i(0, l_j) - f_i(l_j + 1, l_j)} - \frac{\frac{\partial f_i(x_{ij}, l_j)}{\partial a_i} - \frac{\partial f_i(x_{ij} + 1, l_j)}{\partial a_i}}{f_i(x_{ij}, l_j) - f_i(x_{ij} + 1, l_j)} \right),    (31)

\frac{\partial J}{\partial b} = \sum_{i,j} \left( \frac{\frac{\partial f_i(0, l_j)}{\partial b} - \frac{\partial f_i(l_j + 1, l_j)}{\partial b}}{f_i(0, l_j) - f_i(l_j + 1, l_j)} - \frac{\frac{\partial f_i(x_{ij}, l_j)}{\partial b} - \frac{\partial f_i(x_{ij} + 1, l_j)}{\partial b}}{f_i(x_{ij}, l_j) - f_i(x_{ij} + 1, l_j)} \right).    (32)

Next, we calculate second derivatives of f_i:

\frac{\partial^2 f_i}{\partial a_i \partial a_i} = f_i(x, l) \log^2\left( \frac{x}{l} + b \right),    (33)
\frac{\partial^2 f_i}{\partial b \partial b} = f_i(x, l) \frac{a_i(a_i - 1)}{\left( \frac{x}{l} + b \right)^2}.    (34)
Finally, we calculate the second derivatives of the objective:

\frac{\partial^2 J}{\partial a_i \partial a_i} = \sum_j \left( \frac{\frac{\partial^2 f_i(0, l_j)}{\partial a_i \partial a_i} - \frac{\partial^2 f_i(l_j + 1, l_j)}{\partial a_i \partial a_i}}{f_i(0, l_j) - f_i(l_j + 1, l_j)} - \frac{\left( \frac{\partial f_i(0, l_j)}{\partial a_i} - \frac{\partial f_i(l_j + 1, l_j)}{\partial a_i} \right)^2}{(f_i(0, l_j) - f_i(l_j + 1, l_j))^2} - \frac{\frac{\partial^2 f_i(x_{ij}, l_j)}{\partial a_i \partial a_i} - \frac{\partial^2 f_i(x_{ij} + 1, l_j)}{\partial a_i \partial a_i}}{f_i(x_{ij}, l_j) - f_i(x_{ij} + 1, l_j)} + \frac{\left( \frac{\partial f_i(x_{ij}, l_j)}{\partial a_i} - \frac{\partial f_i(x_{ij} + 1, l_j)}{\partial a_i} \right)^2}{(f_i(x_{ij}, l_j) - f_i(x_{ij} + 1, l_j))^2} \right),    (35)

\frac{\partial^2 J}{\partial b \partial b} = \sum_{i,j} \left( \frac{\frac{\partial^2 f_i(0, l_j)}{\partial b \partial b} - \frac{\partial^2 f_i(l_j + 1, l_j)}{\partial b \partial b}}{f_i(0, l_j) - f_i(l_j + 1, l_j)} - \frac{\left( \frac{\partial f_i(0, l_j)}{\partial b} - \frac{\partial f_i(l_j + 1, l_j)}{\partial b} \right)^2}{(f_i(0, l_j) - f_i(l_j + 1, l_j))^2} - \frac{\frac{\partial^2 f_i(x_{ij}, l_j)}{\partial b \partial b} - \frac{\partial^2 f_i(x_{ij} + 1, l_j)}{\partial b \partial b}}{f_i(x_{ij}, l_j) - f_i(x_{ij} + 1, l_j)} + \frac{\left( \frac{\partial f_i(x_{ij}, l_j)}{\partial b} - \frac{\partial f_i(x_{ij} + 1, l_j)}{\partial b} \right)^2}{(f_i(x_{ij}, l_j) - f_i(x_{ij} + 1, l_j))^2} \right).    (36)
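For completeness, the remaining distinct Hessian entry, Eq. (36), admits the same finite-difference check. As before, this sketch is illustrative rather than part of the original note (names and data are made up; NumPy assumed).

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 3, 4
x = rng.integers(1, 5, size=(d, n)).astype(float)  # counts x_ij >= 1
l = x.sum(axis=0)                                  # lengths l_j
a = rng.uniform(-2.0, -0.5, size=d)                # exponents a_i (negative)

def f(i, xx, ll, b):
    """f_i(x, l) = (x/l + b)^{a_i}."""
    return (xx / ll + b) ** a[i]

def J(b):
    """Negative log-likelihood, Eq. (28)."""
    total = 0.0
    for i in range(d):
        for j in range(n):
            total += (np.log(f(i, 0, l[j], b) - f(i, l[j] + 1, l[j], b))
                      - np.log(f(i, x[i, j], l[j], b) - f(i, x[i, j] + 1, l[j], b)))
    return total

def d2J_db2(b):
    """Eq. (36), built from the f_i derivatives in Eqs. (30) and (34)."""
    total = 0.0
    for i in range(d):
        for j in range(n):
            def derivs(xx):
                # f_i and its first/second b-derivatives at (x, l_j)
                r = xx / l[j] + b
                fv = r ** a[i]
                return fv, fv * a[i] / r, fv * a[i] * (a[i] - 1) / r ** 2
            f0, g0, h0 = derivs(0)
            f1, g1, h1 = derivs(l[j] + 1)
            fx, gx, hx = derivs(x[i, j])
            fy, gy, hy = derivs(x[i, j] + 1)
            total += ((h0 - h1) / (f0 - f1) - ((g0 - g1) / (f0 - f1)) ** 2
                      - (hx - hy) / (fx - fy) + ((gx - gy) / (fx - fy)) ** 2)
    return total

b, eps = 0.9, 1e-4
fd = (J(b + eps) - 2 * J(b) + J(b - eps)) / eps ** 2
print(abs(d2J_db2(b) - fd))   # should be near zero
```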
References

[1] J. D. M. Rennie. A class of convex functions. http://people.csail.mit.edu/~jrennie/writing, May 2005.

[2] J. D. M. Rennie. Learning a log-log term frequency model. http://people.csail.mit.edu/~jrennie/writing, May 2005.

[3] J. D. M. Rennie. A role-reversal in the log-log model. http://people.csail.mit.edu/~jrennie/writing, May 2005.

[4] J. D. M. Rennie. Using a log-log distribution to model term frequency rates. http://people.csail.mit.edu/~jrennie/writing, May 2005.