Conditional ML Estimation Using Rational Function Growth Transform
Ciprian Chelba and Alex Acero
Microsoft Research, Redmond, U.S.A.
Abstract
Conditional Maximum Likelihood (CML) estimation of probability models by means of the rational function growth transform (RFGT) [GKNN91]. Model parameter values (probabilities) remain properly normalized at each iteration. We discriminatively train a Naïve Bayes (NB) classifier; the same procedure is at the basis of discriminative training of HMMs in speech recognition [Nor91].

✔ A simple modification of the re-estimation equations increases convergence speed significantly over a straightforward implementation.

✔ Reduces the class error rate over the ML classifier by 40% relative.
Conditional Naïve Bayes Models

• Conditional probability $P(y|x)$, $y \in \mathcal{Y}$, $x \in \mathcal{X}$
• Feature set $\mathcal{F} = \{ f_k(x) : \mathcal{X} \rightarrow \{0,1\},\ \bar{f}_k(x) = 1 - f_k(x),\ k = 1 \ldots F \}$

Naïve Bayes model for the feature vector and the predicted variable $(f(x), y)$:

$$P(y|x;\theta) = Z(x;\theta)^{-1} \cdot \theta_y \cdot \prod_{k=1}^{F} \theta_{ky}^{f_k(x)}\, \bar{\theta}_{ky}^{\bar{f}_k(x)} \qquad (1)$$

where: $\bar{f}_k(x) = 1 - f_k(x)$; $\theta_y \geq 0\ \forall y \in \mathcal{Y}$, $\sum_{y \in \mathcal{Y}} \theta_y = 1$; $\theta_{ky} \geq 0$, $\bar{\theta}_{ky} \geq 0$, $\theta_{ky} + \bar{\theta}_{ky} = 1\ \forall k = 1 \ldots F,\ y \in \mathcal{Y}$; and $Z(x;\theta) = \sum_{y} P(f(x), y)$ is a normalization term.
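To make Eq. (1) concrete, here is a minimal sketch of the conditional NB score for binary feature vectors; the parameter names mirror the notation above, and the toy dimensions and values are illustrative, not from the paper.

```python
import numpy as np

def conditional_nb(theta_y, theta_ky, f):
    """Eq. (1): P(y|x) = Z(x)^-1 * theta_y * prod_k theta_ky^f_k * (1-theta_ky)^(1-f_k).

    theta_y  : (Y,)   class priors, sums to 1
    theta_ky : (F, Y) per-feature Bernoulli parameters, theta_ky + bar_theta_ky = 1
    f        : (F,)   binary feature vector f(x)
    Returns the (Y,) vector of P(y|x; theta).
    """
    f = f[:, None]                                   # (F, 1) for broadcasting
    # Work in log space for numerical stability.
    log_joint = (np.log(theta_y)
                 + (f * np.log(theta_ky)
                    + (1 - f) * np.log(1 - theta_ky)).sum(axis=0))
    log_joint -= log_joint.max()                     # guard against underflow
    p = np.exp(log_joint)
    return p / p.sum()                               # divide by Z(x; theta)

# Toy example: F = 3 binary features, Y = 2 classes (illustrative values only).
theta_y = np.array([0.6, 0.4])
theta_ky = np.array([[0.7, 0.2],
                     [0.1, 0.8],
                     [0.5, 0.5]])
print(conditional_nb(theta_y, theta_ky, np.array([1, 0, 1])))
```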
Equivalence with Exponential Models

Setting: $f_k(x, y) = f_k(x) \cdot \delta(y)$; $\lambda_{ky} = \log\!\left(\frac{\theta_{ky}}{\bar{\theta}_{ky}}\right)$; $\lambda_{0y} = \log\!\left(\theta_y \cdot \prod_{k=1}^{F} \bar{\theta}_{ky}\right)$ and $f_0(x, y) = f_0(y)$

☞ log-linear model, i.e., maximum entropy probability estimation [BPP96]:

$$P(y|x;\lambda) = Z(x;\lambda)^{-1} \cdot \exp\!\left(\sum_{k=0}^{F} \lambda_{ky} f_k(x, y)\right) \qquad (2)$$

where:
• $\lambda_{ky}$ are free real-valued parameters and $Z(x;\lambda)$ is the normalization term
• $\delta(\cdot)$ is the Kronecker delta operator: $\delta(y, y_i) = 1$ for $y = y_i$, 0 otherwise
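A quick numerical check of this equivalence under the parameter mapping stated above; a sketch, reusing the toy values from the previous block (again illustrative, not from the paper).

```python
import numpy as np

def nb_to_loglinear(theta_y, theta_ky):
    """Map NB parameters to log-linear weights: lam_ky = log(theta_ky / bar_theta_ky),
    lam_0y = log(theta_y * prod_k bar_theta_ky), with f_0(y) = 1 as a per-class bias."""
    lam_ky = np.log(theta_ky) - np.log(1 - theta_ky)             # (F, Y)
    lam_0y = np.log(theta_y) + np.log(1 - theta_ky).sum(axis=0)  # (Y,)
    return lam_0y, lam_ky

def loglinear(lam_0y, lam_ky, f):
    """Eq. (2): P(y|x) = Z(x)^-1 * exp(sum_k lam_ky * f_k(x, y))."""
    score = lam_0y + f @ lam_ky          # (Y,)
    score -= score.max()                 # numerical stability
    p = np.exp(score)
    return p / p.sum()

theta_y = np.array([0.6, 0.4])
theta_ky = np.array([[0.7, 0.2], [0.1, 0.8], [0.5, 0.5]])
f = np.array([1.0, 0.0, 1.0])
lam_0y, lam_ky = nb_to_loglinear(theta_y, theta_ky)
# Matches conditional_nb(theta_y, theta_ky, f) from the previous sketch.
print(loglinear(lam_0y, lam_ky, f))
```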
Rational Function Growth Transform for CML Estimation of Naïve Bayes Models

Estimate $\theta = \{\theta_y, \theta_{ky}, \bar{\theta}_{ky} : y \in \mathcal{Y},\ k = 1 \ldots F\} \in \Theta$ such that the conditional likelihood of the training data $T = \{(x_1, y_1) \ldots (x_T, y_T)\}$ is maximized:

$$\theta^* = \arg\max_{\theta} H(T;\theta), \qquad H(T;\theta) = \prod_{j=1}^{T} P(y_j|x_j)$$

$H(T;\theta)$ is a ratio of two polynomials with real coefficients, each defined over the set $\Theta$ of probability distributions:

$$\Theta = \left\{\theta : \theta_y \geq 0\ \forall y \in \mathcal{Y} \text{ and } \sum_{y} \theta_y = 1;\ \theta_{ky} \geq 0,\ \bar{\theta}_{ky} \geq 0 \text{ and } \theta_{ky} + \bar{\theta}_{ky} = 1,\ \forall y \in \mathcal{Y},\ \forall k\right\}$$

✔ iteratively estimate the model parameters using the growth transform for rational functions on the domain $\Theta$ [GKNN91].
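For completeness, expanding $Z(x;\theta)$ from Eq. (1) makes the ratio-of-polynomials claim explicit (this step is implicit in the original):

$$H(T;\theta) = \prod_{j=1}^{T} \frac{\theta_{y_j} \prod_{k=1}^{F} \theta_{ky_j}^{f_k(x_j)}\, \bar{\theta}_{ky_j}^{\bar{f}_k(x_j)}}{\sum_{y \in \mathcal{Y}} \theta_{y} \prod_{k=1}^{F} \theta_{ky}^{f_k(x_j)}\, \bar{\theta}_{ky}^{\bar{f}_k(x_j)}}$$

Since $f_k(x_j) \in \{0,1\}$, every factor is a monomial in the entries of $\theta$, so both the numerator and the denominator multiply out to polynomials with real coefficients.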
Smoothing and Initialization

Model parameters $\theta$ are initialized with maximum likelihood estimates smoothed using MAP:

$$\lambda_y = \frac{\#(y)}{\#(y) + \alpha}, \qquad \bar{\lambda}_y = 1 - \lambda_y \qquad (3)$$

$$\theta_{ky} = \lambda_y \cdot \frac{\#(k, y)}{\#(y)} + \bar{\lambda}_y \cdot \frac{1}{2}$$

The model in Eq. (1) is re-parameterized:

$$P(y|x;\theta) = Z(x;\theta)^{-1} \cdot \theta_y \cdot \prod_{k=1}^{F} \left(\lambda_y \theta_{ky} + \bar{\lambda}_y \cdot \frac{1}{2}\right)^{f_k(x)} \left(\lambda_y \bar{\theta}_{ky} + \bar{\lambda}_y \cdot \frac{1}{2}\right)^{\bar{f}_k(x)} \qquad (4)$$

✔ the conditional likelihood $H(T;\theta) = \prod_{j=1}^{T} P(y_j|x_j)$ is still a ratio of polynomials
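A minimal sketch of this MAP-smoothed initialization, assuming count arrays gathered from the training data; the array names and the value of alpha are illustrative.

```python
import numpy as np

def map_init(count_ky, count_y, alpha=1.0):
    """Eq. (3): lambda_y = #(y) / (#(y) + alpha), then the smoothed
    initialization theta_ky = lambda_y * #(k,y)/#(y) + (1 - lambda_y) * 1/2.

    count_ky : (F, Y) feature-class co-occurrence counts #(k, y)
    count_y  : (Y,)   class counts #(y)
    """
    lam_y = count_y / (count_y + alpha)                          # (Y,)
    theta_ky = lam_y * (count_ky / count_y) + (1.0 - lam_y) * 0.5
    theta_y = count_y / count_y.sum()   # ML estimate of the class prior
    return theta_y, theta_ky, lam_y

# Toy counts: F = 3 features, Y = 2 classes (illustrative values only).
count_ky = np.array([[30.0, 5.0], [4.0, 20.0], [15.0, 12.0]])
count_y = np.array([50.0, 25.0])
theta_y, theta_ky, lam_y = map_init(count_ky, count_y)
print(theta_ky)
```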
Model Parameter Updates

The re-estimation equations take the form:

$$\hat{\theta}_{ky} = N_{ky}^{-1}\, \theta_{ky} \left[ \frac{\partial \log H(T;\theta)}{\partial \theta_{ky}} + C_\theta \right] \qquad (5)$$

where $C_\theta > 0$ satisfies:

$$\frac{\partial \log H(T;\theta)}{\partial \theta_{ky}} + C_\theta > \epsilon, \quad \forall k \text{ and } y$$

with $\epsilon > 0$ suitably chosen; see [GKNN91] and [CA04] for details. Calculating the partial derivatives in Eq. (5) we obtain:

$$\hat{\theta}_{ky} = N_{ky}^{-1}\, \theta_{ky} \left[ 1 + \beta_\theta \cdot \frac{\lambda_y}{\lambda_y \theta_{ky} + \bar{\lambda}_y \cdot \frac{1}{2}} \sum_{i=1}^{T} f_k(x_i)\, [\delta(y, y_i) - P(y|x_i;\theta)] \right] \qquad (6)$$

• $N_{ky}$: normalization constant that ensures $\hat{\theta}_{ky} + \hat{\bar{\theta}}_{ky} = 1$
• $\beta_\theta = 1/C_\theta$
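A minimal sketch of one pass of the update in Eq. (6); beta is held constant here (the refinements below replace it with a per-class value), and all helper names are mine, not from the paper.

```python
import numpy as np

def rfgt_update(theta_y, theta_ky, lam_y, X, y, beta=0.1):
    """One RFGT pass over the theta_ky parameters per Eq. (6), binary features.

    X : (T, F) 0/1 feature matrix with rows f(x_i);  y : (T,) class indices.
    beta = 1/C_theta must be small enough to keep every bracket positive
    (the C_theta condition above); a constant is used here for simplicity.
    """
    Y = theta_y.shape[0]
    # Smoothed Bernoulli parameters of the re-parameterized model, Eq. (4);
    # note (lam*theta + (1-lam)/2) + (lam*bar_theta + (1-lam)/2) = 1.
    s = lam_y * theta_ky + (1.0 - lam_y) * 0.5               # (F, Y)
    # Posterior P(y|x_i; theta) under Eq. (4), computed in log space.
    log_p = np.log(theta_y) + X @ np.log(s) + (1 - X) @ np.log(1 - s)
    log_p -= log_p.max(axis=1, keepdims=True)
    post = np.exp(log_p)
    post /= post.sum(axis=1, keepdims=True)                  # (T, Y)
    resid = np.eye(Y)[y] - post                # delta(y, y_i) - P(y|x_i)
    # Eq. (6) for theta_ky, and its mirror image for bar_theta_ky.
    upd = theta_ky * (1 + beta * (lam_y / s) * (X.T @ resid))
    upd_bar = (1 - theta_ky) * (1 + beta * (lam_y / (1 - s)) * ((1 - X).T @ resid))
    # N_ky: renormalize so that each pair sums to one.
    return upd / (upd + upd_bar)

# Toy run: T = 4 examples, F = 3 features, Y = 2 classes (illustrative only).
X = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]], dtype=float)
y = np.array([0, 1, 1, 0])
print(rfgt_update(np.array([0.5, 0.5]), np.full((3, 2), 0.5),
                  np.array([0.8, 0.8]), X, y))
```

The update for $\theta_y$ is analogous and omitted to keep the sketch focused on Eq. (6).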
Algorithmic Refinements

☞ Dependency on Training Data Size: normalize

$$\beta_\theta = \frac{\gamma_\theta}{T}$$

☞ Dependency on Class Count: make Eq. (6) dimensionally correct

$$\beta_\theta = \frac{\zeta_\theta(y)}{\#(y)}$$

– Considerable convergence speedup
– Setting $\zeta_\theta(y) = \frac{\#(y)}{T} \cdot \gamma_\theta$ results in identical updates under either $\gamma$ or $\zeta$.
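In code, the class-count refinement only changes how beta is fed to the sketch above; a hypothetical helper, where `zeta` plays the role of $\zeta_\theta(y)$:

```python
# Per-class beta of the class-count refinement: beta_theta = zeta / #(y).
# Passing the resulting (Y,) array as `beta` to rfgt_update broadcasts
# one step size per class y.
def per_class_beta(count_y, zeta=1.0):
    return zeta / count_y
```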
Experiments: Setup

✔ ATIS II + III type A utterances (can be interpreted without context)
• type A utterances: TRAIN/DEV/TEST: 5,822/410/914 (74,442/5,326/10,673 words)
• class label assigned from the SQL query (14 classes)
• word vocabulary: 780; OOV rate 0.24%

Experiments: Convergence Speed

✔ about 10-fold speed-up, as measured by the increase in log-likelihood and the decrease in class error rate (CER) on held-out data

[Figure: three panels vs. Iteration Number: "Conditional Likelihood: Training Data" (LogL), "Conditional Likelihood: Held-Out Data" (LogL), and "Class Error Rate: Held-Out Data" (CER); curves: total (min=0.1), ML count (min=0.1), ML count (min=0.075)]

Experiments: Classification Performance

Training Objective Function   Smoothing      Class Error Rate (%)
ML                            smoothed       11.3
CML                           not smoothed    8.5
CML                           smoothed        6.7
MaxEnt                        smoothed        4.9

Conclusions

✔ The CML-trained classifier reduces CER by 40% relative over its ML counterpart.
✔ Although equivalent to MaxEnt, it does not achieve the same performance:
– different smoothing (Gaussian prior for MaxEnt [CR00])
– the objective function (conditional log-likelihood) is not convex in the parameter values for the Naïve Bayes (NB) model, whereas it is convex for the MaxEnt model
– one extra free parameter in NB versus MaxEnt
Acknowledgements

Thanks to Milind Mahajan and Asela Gunawardana for useful comments and discussions.
References

[BPP96] A. L. Berger, S. A. Della Pietra, and V. J. Della Pietra. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39–72, March 1996.

[CA04] Ciprian Chelba and Alex Acero. Conditional maximum likelihood estimation of Naïve Bayes probability models. Technical report, to appear, Microsoft Research, Redmond, WA, 2004.

[CR00] Stanley F. Chen and Ronald Rosenfeld. A survey of smoothing techniques for maximum entropy models. IEEE Transactions on Speech and Audio Processing, 8(1):37–50, 2000.
[GKNN91] P. S. Gopalakrishnan, Dimitri Kanevsky, Arthur Nadas, and David Nahamoo. An inequality for rational functions with applications to some statistical estimation problems. IEEE Transactions on Information Theory, 37(1):107–113, January 1991.

[Nor91] Y. Normandin. Hidden Markov Models, Maximum Mutual Information Estimation and the Speech Recognition Problem. PhD thesis, McGill University, Montreal, 1991.