Calculation of the Hessian for the Log-Log Frequency Distribution

Jason D. M. Rennie
[email protected]
June 2, 2005 (updated July 6, 2005)

Abstract

We calculate the diagonal entries of the Hessian for our log-log frequency models.
1 Introduction
We have discussed two parameterizations of the log-log term frequency model: (1) a model with a bias parameter for each word [2, 4], and (2) a model with an exponent parameter for each word [3]. Here we simply calculate additional derivatives for the two parameterizations.
2 Frequency Count Distribution

We calculate the diagonal entries of the Hessian for the length-conditional log-log term frequency count distribution.
2.1 Bias Per Word
Define a and {b_i} as the parameters of the distribution. Let x_{ij} be the frequency count of word i in document j. Let l_j = \sum_i x_{ij} be the length of document j. Assuming independence of word frequencies and documents, and conditioning on the length of each document, the negative log-likelihood of the data is

J = \sum_{i=1}^d \sum_{j=1}^n \left[ \log\left( \sum_{x=0}^{l_j} (x + b_i)^a \right) - a \log(x_{ij} + b_i) \right].    (1)
The partial derivatives are

\frac{\partial J}{\partial a} = \sum_{i=1}^d \sum_{j=1}^n \left\{ \frac{\sum_{x=0}^{l_j} (x + b_i)^a \log(x + b_i)}{\sum_{x=0}^{l_j} (x + b_i)^a} - \log(x_{ij} + b_i) \right\}    (2)
= \sum_{i=1}^d \sum_{j=1}^n \left\{ E_{P_{ij}}[\log(x + b_i)] - E_{\hat{P}_{ij}}[\log(x + b_i)] \right\},    (3)

\frac{\partial J}{\partial b_i} = \sum_{j=1}^n \left\{ \frac{\sum_{x=0}^{l_j} (x + b_i)^a \frac{a}{x + b_i}}{\sum_{x=0}^{l_j} (x + b_i)^a} - \frac{a}{x_{ij} + b_i} \right\}    (4)
= \sum_{j=1}^n \left\{ E_{P_{ij}}\!\left[\frac{a}{x + b_i}\right] - E_{\hat{P}_{ij}}\!\left[\frac{a}{x + b_i}\right] \right\}.    (5)

The diagonal elements of the Hessian are

\frac{\partial^2 J}{\partial a \partial a} = \sum_{i=1}^d \sum_{j=1}^n \left\{ \frac{\sum_{x=0}^{l_j} (x + b_i)^a \log^2(x + b_i)}{\sum_{x=0}^{l_j} (x + b_i)^a} - \left( \frac{\sum_{x=0}^{l_j} (x + b_i)^a \log(x + b_i)}{\sum_{x=0}^{l_j} (x + b_i)^a} \right)^2 \right\}    (6)
= \sum_{i=1}^d \sum_{j=1}^n \left\{ E_{P_{ij}}[\log^2(x + b_i)] - E_{P_{ij}}[\log(x + b_i)]^2 \right\},    (7)

\frac{\partial^2 J}{\partial b_i \partial b_i} = \sum_{j=1}^n \left\{ \frac{\sum_{x=0}^{l_j} (x + b_i)^a \frac{a(a-1)}{(x + b_i)^2}}{\sum_{x=0}^{l_j} (x + b_i)^a} - \left( \frac{\sum_{x=0}^{l_j} (x + b_i)^a \frac{a}{x + b_i}}{\sum_{x=0}^{l_j} (x + b_i)^a} \right)^2 + \frac{a}{(x_{ij} + b_i)^2} \right\}    (8)
= \sum_{j=1}^n \left\{ E_{P_{ij}}\!\left[\frac{a(a-1)}{(x + b_i)^2}\right] - E_{P_{ij}}\!\left[\frac{a}{x + b_i}\right]^2 + E_{\hat{P}_{ij}}\!\left[\frac{a}{(x + b_i)^2}\right] \right\}.    (9)
Note that the a element is a variance; it is guaranteed to be non-negative. The {b_i} elements may be negative if a < 0.
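The derivation above is easy to check numerically. The sketch below is not part of the original note: the variable names and the tiny random dataset are illustrative, and NumPy is assumed. It implements Eqs. (1), (2), and (7) and compares them against finite differences.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 4                                        # words, documents
x = rng.integers(1, 5, size=(d, n)).astype(float)  # counts x_ij >= 1
l = x.sum(axis=0)                                  # lengths l_j
a = -1.3
b = rng.uniform(0.5, 1.5, size=d)                  # bias b_i per word

def J(a, b):
    """Negative log-likelihood, Eq. (1)."""
    total = 0.0
    for i in range(d):
        for j in range(n):
            xs = np.arange(l[j] + 1) + b[i]        # x + b_i for x = 0..l_j
            total += np.log(np.sum(xs ** a)) - a * np.log(x[i, j] + b[i])
    return total

def dJ_da(a, b):
    """Eq. (2)/(3): E_{P_ij}[log(x+b_i)] - log(x_ij+b_i), summed over i, j."""
    total = 0.0
    for i in range(d):
        for j in range(n):
            xs = np.arange(l[j] + 1) + b[i]
            p = xs ** a / np.sum(xs ** a)          # the distribution P_ij(x)
            total += p @ np.log(xs) - np.log(x[i, j] + b[i])
    return total

def d2J_da2(a, b):
    """Eq. (6)/(7): sum over i, j of Var_{P_ij}[log(x+b_i)]."""
    total = 0.0
    for i in range(d):
        for j in range(n):
            xs = np.arange(l[j] + 1) + b[i]
            p = xs ** a / np.sum(xs ** a)
            total += p @ np.log(xs) ** 2 - (p @ np.log(xs)) ** 2
    return total

eps = 1e-4
print(abs(dJ_da(a, b) - (J(a + eps, b) - J(a - eps, b)) / (2 * eps)))
print(abs(d2J_da2(a, b) - (J(a + eps, b) - 2 * J(a, b) + J(a - eps, b)) / eps ** 2))
print(d2J_da2(a, b) >= 0)   # a sum of variances, hence non-negative
```

The last line confirms the observation above: the a entry is a sum of per-(i, j) variances of log(x + b_i), so it cannot be negative.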
2.2 Exponent Per Word
Define {a_i} and b as the parameters of the distribution. Let x_{ij} be the frequency count of word i in document j. Let l_j = \sum_i x_{ij} be the length of document j. Assuming independence of word frequencies and documents, and conditioning on the length of each document, the negative log-likelihood of the data is

J = \sum_{i=1}^d \sum_{j=1}^n \left[ \log\left( \sum_{x=0}^{l_j} (x + b)^{a_i} \right) - a_i \log(x_{ij} + b) \right].    (10)
The partial derivatives are

\frac{\partial J}{\partial a_i} = \sum_{j=1}^n \left\{ \frac{\sum_{x=0}^{l_j} (x + b)^{a_i} \log(x + b)}{\sum_{x=0}^{l_j} (x + b)^{a_i}} - \log(x_{ij} + b) \right\}    (11)
= \sum_{j=1}^n \left\{ E_{P_{ij}}[\log(x + b)] - E_{\hat{P}_{ij}}[\log(x + b)] \right\},    (12)

\frac{\partial J}{\partial b} = \sum_{i=1}^d \sum_{j=1}^n \left\{ \frac{\sum_{x=0}^{l_j} (x + b)^{a_i} \frac{a_i}{x + b}}{\sum_{x=0}^{l_j} (x + b)^{a_i}} - \frac{a_i}{x_{ij} + b} \right\}    (13)
= \sum_{i=1}^d \sum_{j=1}^n \left\{ E_{P_{ij}}\!\left[\frac{a_i}{x + b}\right] - E_{\hat{P}_{ij}}\!\left[\frac{a_i}{x + b}\right] \right\}.    (14)

The diagonal elements of the Hessian are

\frac{\partial^2 J}{\partial a_i \partial a_i} = \sum_{j=1}^n \left\{ \frac{\sum_{x=0}^{l_j} (x + b)^{a_i} \log^2(x + b)}{\sum_{x=0}^{l_j} (x + b)^{a_i}} - \left( \frac{\sum_{x=0}^{l_j} (x + b)^{a_i} \log(x + b)}{\sum_{x=0}^{l_j} (x + b)^{a_i}} \right)^2 \right\}    (15)
= \sum_{j=1}^n \left\{ E_{P_{ij}}[\log^2(x + b)] - E_{P_{ij}}[\log(x + b)]^2 \right\},    (16)

\frac{\partial^2 J}{\partial b \partial b} = \sum_{i=1}^d \sum_{j=1}^n \left\{ \frac{\sum_{x=0}^{l_j} (x + b)^{a_i} \frac{a_i(a_i - 1)}{(x + b)^2}}{\sum_{x=0}^{l_j} (x + b)^{a_i}} - \left( \frac{\sum_{x=0}^{l_j} (x + b)^{a_i} \frac{a_i}{x + b}}{\sum_{x=0}^{l_j} (x + b)^{a_i}} \right)^2 + \frac{a_i}{(x_{ij} + b)^2} \right\}    (17)
= \sum_{i=1}^d \sum_{j=1}^n \left\{ E_{P_{ij}}\!\left[\frac{a_i(a_i - 1)}{(x + b)^2}\right] - E_{P_{ij}}\!\left[\frac{a_i}{x + b}\right]^2 + E_{\hat{P}_{ij}}\!\left[\frac{a_i}{(x + b)^2}\right] \right\}.    (18)
Note that the {a_i} Hessian elements are each a variance; they are each guaranteed to be non-negative. We show in [1] that if b is held fixed, the optimization surface is convex. The b element may be negative if some of the {a_i} are negative (which is the usual case).
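The b entry, Eq. (17), can be checked the same way. The sketch below is illustrative (names and random data are not from the original note; NumPy assumed) and compares Eq. (17) to a second-order finite difference of Eq. (10), with the {a_i} drawn negative as in the usual case.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 3, 4
x = rng.integers(1, 5, size=(d, n)).astype(float)  # counts x_ij >= 1
l = x.sum(axis=0)                                  # lengths l_j
a = rng.uniform(-2.0, -0.5, size=d)                # exponent a_i per word (negative)
b = 0.8

def J(a, b):
    """Negative log-likelihood, Eq. (10)."""
    total = 0.0
    for i in range(d):
        for j in range(n):
            xs = np.arange(l[j] + 1) + b
            total += np.log(np.sum(xs ** a[i])) - a[i] * np.log(x[i, j] + b)
    return total

def d2J_db2(a, b):
    """Eq. (17): E[a_i(a_i-1)/(x+b)^2] - E[a_i/(x+b)]^2 + a_i/(x_ij+b)^2."""
    total = 0.0
    for i in range(d):
        for j in range(n):
            xs = np.arange(l[j] + 1) + b
            p = xs ** a[i] / np.sum(xs ** a[i])    # the distribution P_ij(x)
            total += (p @ (a[i] * (a[i] - 1) / xs ** 2)
                      - (p @ (a[i] / xs)) ** 2
                      + a[i] / (x[i, j] + b) ** 2)
    return total

eps = 1e-4
fd = (J(a, b + eps) - 2 * J(a, b) + J(a, b - eps)) / eps ** 2
print(abs(d2J_db2(a, b) - fd))   # should be near zero
```

Unlike the {a_i} entries, this quantity is not a variance, so nothing forces it to be non-negative when the {a_i} are negative.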
3 Frequency Rate Distribution

3.1 Bias Per Word
Define f_i(x, l) = \left( \frac{x}{l} + b_i \right)^a. Then the negative log-likelihood of the data for a set of documents is

J = \sum_{i,j} \left\{ \log[f_i(0, l_j) - f_i(l_j + 1, l_j)] - \log[f_i(x_{ij}, l_j) - f_i(x_{ij} + 1, l_j)] \right\},    (19)
where l_j denotes the length of document j and x_{ij} denotes the frequency of word i in document j. First we calculate derivatives of f_i with respect to the parameters:

\frac{\partial f_i(x, l)}{\partial a} = f_i(x, l) \log\left( \frac{x}{l} + b_i \right),    (20)
\frac{\partial f_i(x, l)}{\partial b_i} = f_i(x, l) \frac{a}{\frac{x}{l} + b_i}.    (21)

The objective derivatives are

\frac{\partial J}{\partial a} = \sum_{i,j} \left( \frac{\frac{\partial f_i(0, l_j)}{\partial a} - \frac{\partial f_i(l_j + 1, l_j)}{\partial a}}{f_i(0, l_j) - f_i(l_j + 1, l_j)} - \frac{\frac{\partial f_i(x_{ij}, l_j)}{\partial a} - \frac{\partial f_i(x_{ij} + 1, l_j)}{\partial a}}{f_i(x_{ij}, l_j) - f_i(x_{ij} + 1, l_j)} \right),    (22)

\frac{\partial J}{\partial b_i} = \sum_j \left( \frac{\frac{\partial f_i(0, l_j)}{\partial b_i} - \frac{\partial f_i(l_j + 1, l_j)}{\partial b_i}}{f_i(0, l_j) - f_i(l_j + 1, l_j)} - \frac{\frac{\partial f_i(x_{ij}, l_j)}{\partial b_i} - \frac{\partial f_i(x_{ij} + 1, l_j)}{\partial b_i}}{f_i(x_{ij}, l_j) - f_i(x_{ij} + 1, l_j)} \right).    (23)
Next, we calculate second derivatives of f_i:

\frac{\partial^2 f_i}{\partial a \partial a} = f_i(x, l) \log^2\left( \frac{x}{l} + b_i \right),    (24)
\frac{\partial^2 f_i}{\partial b_i \partial b_i} = f_i(x, l) \frac{a(a - 1)}{\left( \frac{x}{l} + b_i \right)^2}.    (25)
Finally, we calculate the second derivatives of the objective:

\frac{\partial^2 J}{\partial a \partial a} = \sum_{i,j} \left( \frac{\frac{\partial^2 f_i(0, l_j)}{\partial a \partial a} - \frac{\partial^2 f_i(l_j + 1, l_j)}{\partial a \partial a}}{f_i(0, l_j) - f_i(l_j + 1, l_j)} - \frac{\left( \frac{\partial f_i(0, l_j)}{\partial a} - \frac{\partial f_i(l_j + 1, l_j)}{\partial a} \right)^2}{(f_i(0, l_j) - f_i(l_j + 1, l_j))^2} - \frac{\frac{\partial^2 f_i(x_{ij}, l_j)}{\partial a \partial a} - \frac{\partial^2 f_i(x_{ij} + 1, l_j)}{\partial a \partial a}}{f_i(x_{ij}, l_j) - f_i(x_{ij} + 1, l_j)} + \frac{\left( \frac{\partial f_i(x_{ij}, l_j)}{\partial a} - \frac{\partial f_i(x_{ij} + 1, l_j)}{\partial a} \right)^2}{(f_i(x_{ij}, l_j) - f_i(x_{ij} + 1, l_j))^2} \right),    (26)

\frac{\partial^2 J}{\partial b_i \partial b_i} = \sum_j \left( \frac{\frac{\partial^2 f_i(0, l_j)}{\partial b_i \partial b_i} - \frac{\partial^2 f_i(l_j + 1, l_j)}{\partial b_i \partial b_i}}{f_i(0, l_j) - f_i(l_j + 1, l_j)} - \frac{\left( \frac{\partial f_i(0, l_j)}{\partial b_i} - \frac{\partial f_i(l_j + 1, l_j)}{\partial b_i} \right)^2}{(f_i(0, l_j) - f_i(l_j + 1, l_j))^2} - \frac{\frac{\partial^2 f_i(x_{ij}, l_j)}{\partial b_i \partial b_i} - \frac{\partial^2 f_i(x_{ij} + 1, l_j)}{\partial b_i \partial b_i}}{f_i(x_{ij}, l_j) - f_i(x_{ij} + 1, l_j)} + \frac{\left( \frac{\partial f_i(x_{ij}, l_j)}{\partial b_i} - \frac{\partial f_i(x_{ij} + 1, l_j)}{\partial b_i} \right)^2}{(f_i(x_{ij}, l_j) - f_i(x_{ij} + 1, l_j))^2} \right).    (27)
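The chain of definitions for the rate model composes easily into code. The sketch below is not from the original note (names and the random data are illustrative; NumPy assumed): it evaluates Eq. (19) and its first and second derivatives with respect to a, Eqs. (22) and (26), and verifies both against finite differences.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 3, 4
x = rng.integers(1, 5, size=(d, n)).astype(float)   # counts x_ij >= 1
l = x.sum(axis=0)                                   # lengths l_j
a = -1.5
b = rng.uniform(0.5, 1.5, size=d)                   # bias b_i per word

def fgh(i, xx, ll, a):
    """f_i(x, l) = (x/l + b_i)^a and its first and second a-derivatives,
    per Eqs. (20) and (24)."""
    r = xx / ll + b[i]
    fv = r ** a
    return fv, fv * np.log(r), fv * np.log(r) ** 2

def J(a):
    """Negative log-likelihood, Eq. (19)."""
    total = 0.0
    for i in range(d):
        for j in range(n):
            f0 = fgh(i, 0, l[j], a)[0]
            f1 = fgh(i, l[j] + 1, l[j], a)[0]
            fx = fgh(i, x[i, j], l[j], a)[0]
            fy = fgh(i, x[i, j] + 1, l[j], a)[0]
            total += np.log(f0 - f1) - np.log(fx - fy)
    return total

def dJ_da(a):
    """Gradient with respect to a, Eq. (22)."""
    total = 0.0
    for i in range(d):
        for j in range(n):
            f0, g0, _ = fgh(i, 0, l[j], a)
            f1, g1, _ = fgh(i, l[j] + 1, l[j], a)
            fx, gx, _ = fgh(i, x[i, j], l[j], a)
            fy, gy, _ = fgh(i, x[i, j] + 1, l[j], a)
            total += (g0 - g1) / (f0 - f1) - (gx - gy) / (fx - fy)
    return total

def d2J_da2(a):
    """Hessian entry for a, Eq. (26)."""
    total = 0.0
    for i in range(d):
        for j in range(n):
            f0, g0, h0 = fgh(i, 0, l[j], a)
            f1, g1, h1 = fgh(i, l[j] + 1, l[j], a)
            fx, gx, hx = fgh(i, x[i, j], l[j], a)
            fy, gy, hy = fgh(i, x[i, j] + 1, l[j], a)
            total += ((h0 - h1) / (f0 - f1) - ((g0 - g1) / (f0 - f1)) ** 2
                      - (hx - hy) / (fx - fy) + ((gx - gy) / (fx - fy)) ** 2)
    return total

eps = 1e-4
print(abs(dJ_da(a) - (J(a + eps) - J(a - eps)) / (2 * eps)))
print(abs(d2J_da2(a) - (J(a + eps) - 2 * J(a) + J(a - eps)) / eps ** 2))
```

With a < 0, f_i is decreasing in x, so every difference f_i(x, l) - f_i(x + 1, l) inside the logarithms is positive.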
3.2 Exponent Per Word
Define f_i(x, l) = \left( \frac{x}{l} + b \right)^{a_i}. Then the negative log-likelihood of the data for a set of documents is

J = \sum_{i,j} \left\{ \log[f_i(0, l_j) - f_i(l_j + 1, l_j)] - \log[f_i(x_{ij}, l_j) - f_i(x_{ij} + 1, l_j)] \right\},    (28)

where l_j denotes the length of document j and x_{ij} denotes the frequency of word i in document j. First we calculate derivatives of f_i with respect to the parameters:

\frac{\partial f_i(x, l)}{\partial a_i} = f_i(x, l) \log\left( \frac{x}{l} + b \right),    (29)
\frac{\partial f_i(x, l)}{\partial b} = f_i(x, l) \frac{a_i}{\frac{x}{l} + b}.    (30)
The objective derivatives are

\frac{\partial J}{\partial a_i} = \sum_j \left( \frac{\frac{\partial f_i(0, l_j)}{\partial a_i} - \frac{\partial f_i(l_j + 1, l_j)}{\partial a_i}}{f_i(0, l_j) - f_i(l_j + 1, l_j)} - \frac{\frac{\partial f_i(x_{ij}, l_j)}{\partial a_i} - \frac{\partial f_i(x_{ij} + 1, l_j)}{\partial a_i}}{f_i(x_{ij}, l_j) - f_i(x_{ij} + 1, l_j)} \right),    (31)

\frac{\partial J}{\partial b} = \sum_{i,j} \left( \frac{\frac{\partial f_i(0, l_j)}{\partial b} - \frac{\partial f_i(l_j + 1, l_j)}{\partial b}}{f_i(0, l_j) - f_i(l_j + 1, l_j)} - \frac{\frac{\partial f_i(x_{ij}, l_j)}{\partial b} - \frac{\partial f_i(x_{ij} + 1, l_j)}{\partial b}}{f_i(x_{ij}, l_j) - f_i(x_{ij} + 1, l_j)} \right).    (32)

Next, we calculate second derivatives of f_i:

\frac{\partial^2 f_i}{\partial a_i \partial a_i} = f_i(x, l) \log^2\left( \frac{x}{l} + b \right),    (33)
\frac{\partial^2 f_i}{\partial b \partial b} = f_i(x, l) \frac{a_i(a_i - 1)}{\left( \frac{x}{l} + b \right)^2}.    (34)
Finally, we calculate the second derivatives of the objective:

\frac{\partial^2 J}{\partial a_i \partial a_i} = \sum_j \left( \frac{\frac{\partial^2 f_i(0, l_j)}{\partial a_i \partial a_i} - \frac{\partial^2 f_i(l_j + 1, l_j)}{\partial a_i \partial a_i}}{f_i(0, l_j) - f_i(l_j + 1, l_j)} - \frac{\left( \frac{\partial f_i(0, l_j)}{\partial a_i} - \frac{\partial f_i(l_j + 1, l_j)}{\partial a_i} \right)^2}{(f_i(0, l_j) - f_i(l_j + 1, l_j))^2} - \frac{\frac{\partial^2 f_i(x_{ij}, l_j)}{\partial a_i \partial a_i} - \frac{\partial^2 f_i(x_{ij} + 1, l_j)}{\partial a_i \partial a_i}}{f_i(x_{ij}, l_j) - f_i(x_{ij} + 1, l_j)} + \frac{\left( \frac{\partial f_i(x_{ij}, l_j)}{\partial a_i} - \frac{\partial f_i(x_{ij} + 1, l_j)}{\partial a_i} \right)^2}{(f_i(x_{ij}, l_j) - f_i(x_{ij} + 1, l_j))^2} \right),    (35)

\frac{\partial^2 J}{\partial b \partial b} = \sum_{i,j} \left( \frac{\frac{\partial^2 f_i(0, l_j)}{\partial b \partial b} - \frac{\partial^2 f_i(l_j + 1, l_j)}{\partial b \partial b}}{f_i(0, l_j) - f_i(l_j + 1, l_j)} - \frac{\left( \frac{\partial f_i(0, l_j)}{\partial b} - \frac{\partial f_i(l_j + 1, l_j)}{\partial b} \right)^2}{(f_i(0, l_j) - f_i(l_j + 1, l_j))^2} - \frac{\frac{\partial^2 f_i(x_{ij}, l_j)}{\partial b \partial b} - \frac{\partial^2 f_i(x_{ij} + 1, l_j)}{\partial b \partial b}}{f_i(x_{ij}, l_j) - f_i(x_{ij} + 1, l_j)} + \frac{\left( \frac{\partial f_i(x_{ij}, l_j)}{\partial b} - \frac{\partial f_i(x_{ij} + 1, l_j)}{\partial b} \right)^2}{(f_i(x_{ij}, l_j) - f_i(x_{ij} + 1, l_j))^2} \right).    (36)
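For completeness, the remaining distinct Hessian entry, Eq. (36), admits the same finite-difference check. As before, this sketch is illustrative rather than part of the original note (names and data are made up; NumPy assumed).

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 3, 4
x = rng.integers(1, 5, size=(d, n)).astype(float)  # counts x_ij >= 1
l = x.sum(axis=0)                                  # lengths l_j
a = rng.uniform(-2.0, -0.5, size=d)                # exponents a_i (negative)

def f(i, xx, ll, b):
    """f_i(x, l) = (x/l + b)^{a_i}."""
    return (xx / ll + b) ** a[i]

def J(b):
    """Negative log-likelihood, Eq. (28)."""
    total = 0.0
    for i in range(d):
        for j in range(n):
            total += (np.log(f(i, 0, l[j], b) - f(i, l[j] + 1, l[j], b))
                      - np.log(f(i, x[i, j], l[j], b) - f(i, x[i, j] + 1, l[j], b)))
    return total

def d2J_db2(b):
    """Eq. (36), built from the f_i derivatives in Eqs. (30) and (34)."""
    total = 0.0
    for i in range(d):
        for j in range(n):
            def derivs(xx):
                # f_i and its first/second b-derivatives at (x, l_j)
                r = xx / l[j] + b
                fv = r ** a[i]
                return fv, fv * a[i] / r, fv * a[i] * (a[i] - 1) / r ** 2
            f0, g0, h0 = derivs(0)
            f1, g1, h1 = derivs(l[j] + 1)
            fx, gx, hx = derivs(x[i, j])
            fy, gy, hy = derivs(x[i, j] + 1)
            total += ((h0 - h1) / (f0 - f1) - ((g0 - g1) / (f0 - f1)) ** 2
                      - (hx - hy) / (fx - fy) + ((gx - gy) / (fx - fy)) ** 2)
    return total

b, eps = 0.9, 1e-4
fd = (J(b + eps) - 2 * J(b) + J(b - eps)) / eps ** 2
print(abs(d2J_db2(b) - fd))   # should be near zero
```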
References

[1] J. D. M. Rennie. A class of convex functions. http://people.csail.mit.edu/~jrennie/writing, May 2005.

[2] J. D. M. Rennie. Learning a log-log term frequency model. http://people.csail.mit.edu/~jrennie/writing, May 2005.

[3] J. D. M. Rennie. A role-reversal in the log-log model. http://people.csail.mit.edu/~jrennie/writing, May 2005.

[4] J. D. M. Rennie. Using a log-log distribution to model term frequency rates. http://people.csail.mit.edu/~jrennie/writing, May 2005.