Optimality of Kernel Density Estimation of Prior Distribution in Bayes Network

Hengqing Tong¹, Yanfang Deng¹, and Ziling Li¹

¹ Department of Mathematics, Wuhan University of Technology, Wuhan, Hubei, 430070, P.R. China.
Tel: 00-86-27-62321378; Fax: 00-86-27-87651213; E-mail: [email protected]. Supported by the National Natural Science Foundation of China (2006, 30570611) and the Innovation Fund of the Ministry of Science & Technology of China (2004, 02C26214200218).
Abstract. The key problem of inductive learning in a Bayes network is the estimation of the prior distribution. This paper adopts a general naive Bayes model to handle continuous variables and proposes a kind of kernel function constructed from orthogonal polynomials, which is used to estimate the density function of the prior distribution in a Bayes network. The paper then studies the optimality of the resulting kernel estimators of the density and its derivatives: when the sample is fixed, the estimators remain continuous and smooth, and when the sample size tends to infinity, they keep good convergence rates.

Keywords. Bayes network, prior distribution, kernel density estimation, orthogonal polynomial, optimality.
1 Introduction
A Bayes network describes the correlations between variables in graphical form and can process incomplete data sets with noise. It is widely used in speech recognition, industrial control, economic forecasting, medical diagnosis, and so on. Its ability to express probability under uncertainty and its capacity for incremental learning of a synthesized prior distribution have attracted wide attention. As a constrained Bayes network, the naive Bayes model is built on the assumption that the characteristic attributes are conditionally independent given the class attribute. It reduces the complexity of the network structure exponentially, but its variables must be discrete (Kohavi R, 1997; Zhang H, 2003; Tang Y, 2002; Sanchis A, 2003). On the other hand, many real-world problems involve continuous variables, and if we model continuous data directly it is hard to describe the real correlations between continuous variables (Geiger D, 1994). Therefore, researchers usually apply a discretization pretreatment before modeling. Li Xing-sheng and Li De-yi (2003) proposed a discretization approach for continuous attributes based on density distribution clustering. Quinlan (1993) presented a partial discretization based on MDL to achieve an equilibrium between the precision and the complexity of the algorithm. Dougherty, Kohavi and Sahami (1995) investigated the overall discretization of continuous attributes based on information entropy, which can avoid noise disturbance due to the lack of partial data. All of this work is significant, and in each case the prior distribution must be estimated. Generally, kernel density estimation is adopted, but it has not been investigated in depth in this setting. Mathematical statistics offers thorough results on kernel density estimation, and kernel estimators constructed from orthogonal polynomials have been proposed, for which convergence rates can be obtained; however, few articles discuss the construction of the orthogonal polynomials themselves. Even when a construction is given, it suits only large samples: it keeps the convergence rate, but small samples are not taken into account, in the sense that the resulting function lacks continuity and smoothness on the whole space. This paper proposes a kind of orthogonal polynomial that is continuous and smooth on the whole space and uses it to construct kernel functions for estimating the prior distribution in a Bayes network. When the sample is fixed, the estimators of the density and its derivatives keep continuity and smoothness, and when the sample size tends to infinity, they keep good convergence rates.

The essence of the classification problem is to find a mapping from the characteristic attribute space $\{X_1,\dots,X_n\}$ to the class attribute $C$. Bayes classification regards the posterior probability as the classification indicator; namely, the class label that maximizes the output conditional probability is taken as the target value. If the characteristic attributes are all discrete, the conditional independence assumption of naive Bayes is $P(x_1,\dots,x_n \mid c) = \prod_{i=1}^{n} P(x_i \mid c)$, where $P(\cdot)$ denotes a discrete probability, and $x_i$ and $c$ denote sample values of the attributes $X_i$ and $C$ respectively. The Bayes decision rule is then
$$c^* = \arg\max_{c\in C} P(c \mid x_1,\dots,x_n) = \arg\max_{c\in C} P(c)\prod_{i=1}^{n} P(x_i \mid c) \quad (1)$$
However, when the attribute set includes continuous variables, the situation is different. Suppose the continuous attribute set is $\{X_1,\dots,X_m\}$. Then
$$P(c \mid x_1,\dots,x_n) = \frac{P(c)\prod_{i=1}^{m} p(x_i \mid c)\prod_{j=m+1}^{n} P(x_j \mid c)}{p(x_1,\dots,x_m \mid x_{m+1},\dots,x_n)\,P(x_{m+1},\dots,x_n)} \quad (2)$$
where $p(x_1,\dots,x_m \mid x_{m+1},\dots,x_n)\,P(x_{m+1},\dots,x_n)$ is a constant independent of the class attribute, so it need not be considered in the maximization. The Bayes decision rule then becomes (Li Baice, Yuan Senmiao, Wang Liming, 2005)
$$c^* = \arg\max_{c\in C} P(c \mid x_1,\dots,x_n) = \arg\max_{c\in C} P(c)\prod_{i=1}^{m} p(x_i \mid c)\prod_{j=m+1}^{n} P(x_j \mid c) \quad (3)$$
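A minimal sketch of how decision rule (3) might be organized in code, assuming the class prior and the class-conditional estimates are supplied as callables; all names are illustrative rather than taken from the paper, log-probabilities are used only to avoid numerical underflow, and zero-probability corner cases are ignored.

```python
import math

def naive_bayes_decide(x_cont, x_disc, prior, cond_density, cond_prob, classes):
    """Decision rule (3): c* = argmax_c P(c) * prod_i p(x_i|c) * prod_j P(x_j|c).

    prior(c)              -- estimated class probability P(c)
    cond_density(i, v, c) -- estimated density p(x_i = v | c) for continuous attribute i
    cond_prob(j, v, c)    -- estimated probability P(x_j = v | c) for discrete attribute j
    """
    best_c, best_log_score = None, -math.inf
    for c in classes:
        log_score = math.log(prior(c))
        for i, v in enumerate(x_cont):            # continuous attributes X_1, ..., X_m
            log_score += math.log(cond_density(i, v, c))
        for j, v in enumerate(x_disc):            # discrete attributes X_{m+1}, ..., X_n
            log_score += math.log(cond_prob(j, v, c))
        if log_score > best_log_score:
            best_c, best_log_score = c, log_score
    return best_c
```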
Suppose that, given $C = c$, the observed sequence of a continuous attribute $X_i$ is $\{\tilde x_1,\dots,\tilde x_n\}$. Then the estimator of the conditional probability density is
$$\hat p(x_i \mid c) = \frac{1}{nh}\sum_{j=1}^{n} L_d\!\left(\frac{x_i - \tilde x_j}{h}\right),\quad i = 1, 2, \dots, m \quad (4)$$
where $L_d(\cdot)$ is a kernel function and the constant $h$ is the corresponding bandwidth. What we need to study is the construction of the kernel function for the density and its optimality. Generally speaking, the Bartlett kernel $K(x) = \frac{3}{4}\left(1 - x^2\right)$, $|x| < 1$, can be used here. We can also choose a kernel function built from orthogonal polynomials; then we should investigate its construction and improve its optimality.
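As a concrete illustration, the following sketch evaluates estimator (4) with the Bartlett kernel mentioned above; the rule-of-thumb bandwidth and the synthetic data are assumptions made only for the example.

```python
import numpy as np

def bartlett_kernel(u):
    """Bartlett kernel K(x) = 3/4 (1 - x^2) for |x| < 1, zero elsewhere."""
    return np.where(np.abs(u) < 1.0, 0.75 * (1.0 - u**2), 0.0)

def conditional_density_estimate(x, class_samples, h):
    """Estimator (4): p_hat(x | c) = (1/(n h)) * sum_j K((x - x_j) / h)."""
    class_samples = np.asarray(class_samples, dtype=float)
    return bartlett_kernel((x - class_samples) / h).sum() / (class_samples.size * h)

# Illustrative usage on synthetic data with a rule-of-thumb bandwidth (an assumption).
samples = np.random.default_rng(0).normal(size=200)
h = 1.06 * samples.std() * samples.size ** (-1 / 5)
print(conditional_density_estimate(0.0, samples, h))
```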
2 Constructions of Kernel Functions by Orthogonal Polynomials
If a kernel density estimate has jump discontinuities, the parameter estimate jumps abruptly as the current sample point $x$ varies continuously, which makes it difficult to use. Many references either neglect the construction of kernel functions or propose kernel functions that are not continuous or smooth. Lin (1975, pp. 155–164) proposed the following kernel estimators of a density and its derivative:
$$\hat f_n(x) = \frac{1}{na_n}\sum_{i=1}^{n} K_0\!\left(\frac{x - X_i}{a_n}\right) \quad (5)$$
$$\hat f'_n(x) = \frac{1}{na_n^2}\sum_{i=1}^{n} K_1\!\left(\frac{x - X_i}{a_n}\right) \quad (6)$$
He further constructed orthogonal polynomials for these kernels. With three moment conditions imposed on $0 \le u \le 1$, his construction gives
$$K_0(u) = 30u^2 - 36u + 9 = 30(u - 0.6)^2 - 1.8 \quad (7)$$
$$K_1(u) = -180u^2 + 192u - 36 = -180\left(u - \tfrac{8}{15}\right)^2 + 15.2 \quad (8)$$
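A quick arithmetic check (worked out here, not in the original) shows that these kernels satisfy the moment conditions but do not vanish at the ends of their support:
$$\int_0^1 K_0(u)\,du = 10 - 18 + 9 = 1,\qquad \int_0^1 u\,K_0(u)\,du = 7.5 - 12 + 4.5 = 0,$$
yet $K_0(0) = 9$ and $K_0(1) = 3$; similarly $K_1(0) = -36$ and $K_1(1) = -24$, so both kernels are truncated to nonzero values at $u = 0$ and $u = 1$.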
Obviously, neither $\hat f_n(x)$ nor $\hat f'_n(x)$ is continuous in $x$. We try to find a kind of orthogonal polynomial that remains continuous after truncation, so that the kernel estimators of the density and its derivative are continuous and still keep good convergence rates. First of all, we consider the univariate kernel function built from orthogonal polynomials. The continuity and smoothness of the orthogonal polynomials we propose differ from those Lin constructed, while the orthogonality is the same.

Let $H$ be a numerical determinant of order $r$ whose elements from the first row to the $(r-1)$-th row are $d_{ij} = \frac{1}{i+j}$ and whose last row consists entirely of 1, that is,
$$H = \begin{vmatrix} 1/2 & 1/3 & \cdots & 1/(r+1) \\ 1/3 & 1/4 & \cdots & 1/(r+2) \\ \cdots & \cdots & \cdots & \cdots \\ 1/r & 1/(r+1) & \cdots & 1/(2r-1) \\ 1 & 1 & \cdots & 1 \end{vmatrix} \quad (9)$$
Then the elements of the first row of $H$ are replaced by $d_{1j} = (u/u_0)^j$, $j = 1, 2, \dots, r$, and we obtain
$$H_0(u) = \begin{vmatrix} u/u_0 & (u/u_0)^2 & \cdots & (u/u_0)^r \\ 1/3 & 1/4 & \cdots & 1/(r+2) \\ \cdots & \cdots & \cdots & \cdots \\ 1/r & 1/(r+1) & \cdots & 1/(2r-1) \\ 1 & 1 & \cdots & 1 \end{vmatrix} \quad (10)$$
Similarly, the elements of the $(d+1)$-th row of $H$ are replaced by $d_{d+1,j} = (u/u_0)^j$, $j = 1, 2, \dots, r$, and the univariate determinant $H_d(u)$ of order $r$, $d = 1, \dots, r-2$, is obtained:
$$H_d(u) = \begin{vmatrix} 1/2 & 1/3 & \cdots & 1/(r+1) \\ \cdots & \cdots & \cdots & \cdots \\ u/u_0 & (u/u_0)^2 & \cdots & (u/u_0)^r \\ \cdots & \cdots & \cdots & \cdots \\ 1/r & 1/(r+1) & \cdots & 1/(2r-1) \\ 1 & 1 & \cdots & 1 \end{vmatrix}\quad\text{(the replaced row is the $(d+1)$-th)} \quad (11)$$
Let
$$K_0(u) = \begin{cases} H_0(u)/(u_0 H) & 0 \le u \le u_0 \\ 0 & \text{otherwise} \end{cases} \quad (12)$$
which is a kernel function constructed from orthogonal polynomials with finite support $[0, u_0]$. Because $H_0(0) = H_0(u_0) = 0$, $K_0(u)$ is continuous on the whole real axis. It is easy to verify that it satisfies the orthogonality criteria:
$$\int_0^{u_0} u^l K_0(u)\,du = \frac{1}{H u_0}\int_0^{u_0} u^l H_0(u)\,du = \frac{u_0^{l-1}}{H}\begin{vmatrix} \int_0^{u_0}\!\left(\frac{u}{u_0}\right)^{l+1}\!du & \int_0^{u_0}\!\left(\frac{u}{u_0}\right)^{l+2}\!du & \cdots & \int_0^{u_0}\!\left(\frac{u}{u_0}\right)^{l+r}\!du \\ 1/3 & 1/4 & \cdots & 1/(r+2) \\ \cdots & \cdots & \cdots & \cdots \\ 1/r & 1/(r+1) & \cdots & 1/(2r-1) \\ 1 & 1 & \cdots & 1 \end{vmatrix} = \begin{cases} 1 & l = 0 \\ 0 & l \neq 0,\ 1 \le l \le r-2 \\ c & l = r-1 \end{cases} \quad (13)$$
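The construction (9)–(13) is simple enough to reproduce symbolically. The sketch below, written with SymPy for the case $r = 3$, $u_0 = 1$, builds $H$ and $H_0(u)$, forms $K_0(u)$ as in (12), and checks the endpoint values and the first two moments in (13); the variable names are illustrative only.

```python
import sympy as sp

u = sp.symbols('u')
r, u0 = 3, 1

# Numerical determinant H of order r: rows 1..r-1 have entries 1/(i+j), the last row is all ones (9).
rows = [[sp.Rational(1, i + j) for j in range(1, r + 1)] for i in range(1, r)] + [[1] * r]
H = sp.Matrix(rows).det()

# H_0(u): replace the first row of H by (u/u0)^j, j = 1..r (10).
rows0 = [[(u / u0) ** j for j in range(1, r + 1)]] + rows[1:]
H0 = sp.Matrix(rows0).det()

# K_0(u) = H_0(u) / (u0 * H) on [0, u0], zero elsewhere (12).
K0 = sp.expand(H0 / (u0 * H))
print(sp.factor(K0))                         # proportional to u * (u - 3/5) * (u - 1)
print(K0.subs(u, 0), K0.subs(u, u0))         # 0 0 : the kernel vanishes at both endpoints
print(sp.integrate(K0, (u, 0, u0)))          # 1   : the l = 0 moment in (13)
print(sp.integrate(u * K0, (u, 0, u0)))      # 0   : the l = 1 moment in (13)
```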
At the truncation points we also have $K_0(0) = K_0(u_0) = 0$, so $K_0$ is an orthogonal and continuous kernel function. When $r = 3$ and $u_0 = 1$, $K_0(u) = Cu\left(u - \frac{3}{5}\right)(u - 1)$; since $K_0(0) = K_0(1) = 0$, it is continuous on the whole $x$ axis. Let
$$L_d(u) = \begin{cases} d!\,H_d(u)/\big(u_0^{d+1} H\big) & 0 \le u \le u_0 \\ 0 & \text{otherwise} \end{cases} \quad (14)$$
Substituting (14) into (4) yields the kernel estimator of the conditional probability density used in (3). We can also construct kernel estimators of the prior distribution in the Bayes network, and of its derivatives, from $L_d(u)$ in (14):
$$\hat p^{(d)}(x_i \mid c) = \frac{1}{nh^{d+1}}\sum_{j=1}^{n} L_d\!\left(\frac{x_i - \tilde x_j}{h}\right) \quad (15)$$
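Continuing the symbolic sketch above, the derivative kernel $L_1$ from (14) can be generated the same way (replace the second row of $H$, as in (11)) and plugged into (15); again the names and the data layout are assumptions made only for illustration.

```python
import numpy as np
import sympy as sp

u = sp.symbols('u')
r, u0, d = 3, 1, 1

rows = [[sp.Rational(1, i + j) for j in range(1, r + 1)] for i in range(1, r)] + [[1] * r]
H = sp.Matrix(rows).det()
rows_d = [list(row) for row in rows]
rows_d[d] = [(u / u0) ** j for j in range(1, r + 1)]              # replace the (d+1)-th row, cf. (11)
Hd = sp.Matrix(rows_d).det()
Ld_expr = sp.expand(sp.factorial(d) * Hd / (u0 ** (d + 1) * H))   # (14), valid on [0, u0]
Ld = sp.lambdify(u, Ld_expr, 'numpy')

def conditional_density_derivative(x, samples, h):
    """Estimator (15): p_hat^(d)(x | c) = (1/(n h^{d+1})) * sum_j L_d((x - x_j) / h)."""
    samples = np.asarray(samples, dtype=float)
    t = (x - samples) / h
    vals = np.where((t >= 0) & (t <= u0), Ld(t), 0.0)             # L_d vanishes outside [0, u0]
    return vals.sum() / (samples.size * h ** (d + 1))
```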
The multivariate kernel function is constructed from orthogonal polynomials as
$$K_{t;t_1,\dots,t_p}(u) = K_{t_1}(u_1) K_{t_2}(u_2) \cdots K_{t_p}(u_p) \quad (16)$$
where $K_{t_i}(u_i)$, $i = 1, \dots, p$, are ordinary univariate kernel functions satisfying
$$\begin{cases} |K_{t_i}(u_i)| \le C & u_i \in (0, u_t) \\ K_{t_i}(u_i) = 0 & \text{otherwise} \end{cases} \quad (17)$$
$$\frac{1}{t_i!}\int_0^{u_t} u_i^{\,l}\, K_{t_i}(u_i)\,du_i = \begin{cases} 1 & l = t_i \\ 0 & l \neq t_i,\ \text{but } 0 \le l \le r-2 \end{cases} \quad (18)$$
and $K_{t_i}(u_i) \in C(\mathbb{R})$. The construction of such a univariate kernel function is as follows: the elements of the $(t_i+1)$-th row of $H$ in (9) are replaced by $h_{t_i+1,j} = (u_i/u_t)^j$, and a determinant $H_{t_i}(u_i)$ of order $r$ is obtained:
$$H_{t_i}(u_i) = \begin{vmatrix} 1/2 & 1/3 & \cdots & 1/(r+1) \\ \cdots & \cdots & \cdots & \cdots \\ u_i/u_t & (u_i/u_t)^2 & \cdots & (u_i/u_t)^r \\ \cdots & \cdots & \cdots & \cdots \\ 1/r & 1/(r+1) & \cdots & 1/(2r-1) \\ 1 & 1 & \cdots & 1 \end{vmatrix} \quad (19)$$
Let
$$K_{t_i}(u_i) = \begin{cases} t_i!\,H_{t_i}(u_i)/\big(u_t^{t_i+1} H\big) & 0 \le u_i \le u_t \\ 0 & \text{otherwise} \end{cases} \quad (20)$$
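A brief sketch of the product construction (16), reusing univariate kernels like those built above; `univariate_kernels` is an assumed name for a list of callables $K_{t_1},\dots,K_{t_p}$. Because each factor is continuous and compactly supported, the product inherits continuity on $\mathbb{R}^p$.

```python
import numpy as np

def product_kernel(univariate_kernels):
    """Multivariate kernel (16): K_{t;t_1,...,t_p}(u) = K_{t_1}(u_1) * ... * K_{t_p}(u_p)."""
    def K(u):
        u = np.asarray(u, dtype=float)
        value = 1.0
        for K_ti, ui in zip(univariate_kernels, u):
            value *= K_ti(ui)
        return value
    return K
```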
3 Continuity and Smoothness of Kernel Density Estimation of Prior Distribution in Bayes Network
First of all, notice that $L_d(u)$ is a polynomial of degree $r$ for every $d = 1, 2, 3, \dots$. That is to say, the degree of the estimator of the density derivative built from $L_d(u)$ in (14) does not change with the order of differentiation. Secondly, in this construction the smoothness of the kernel density estimate is reflected by the estimator of the density derivative. Because $H_d(0) = H_d(u_0) = 0$, $L_d(u)$ is continuous on the whole space even though it has finite support. The estimator of the density derivative constructed from it therefore keeps good continuity, which ensures the smoothness of the kernel density estimate. $L_d(u)$ also satisfies the orthogonality criteria:
$$\frac{1}{d!}\int_0^{u_0} u^l L_d(u)\,du = \frac{1}{H u_0^{d+1}}\int_0^{u_0} u^l H_d(u)\,du = \frac{u_0^{l-d-1}}{H}\begin{vmatrix} 1/2 & 1/3 & \cdots & 1/(r+1) \\ \cdots & \cdots & \cdots & \cdots \\ \int_0^{u_0}\!\left(\frac{u}{u_0}\right)^{l+1}\!du & \int_0^{u_0}\!\left(\frac{u}{u_0}\right)^{l+2}\!du & \cdots & \int_0^{u_0}\!\left(\frac{u}{u_0}\right)^{l+r}\!du \\ \cdots & \cdots & \cdots & \cdots \\ 1/r & 1/(r+1) & \cdots & 1/(2r-1) \\ 1 & 1 & \cdots & 1 \end{vmatrix} = \begin{cases} 1 & l = d \\ 0 & l \neq d,\ 1 \le l \le r-2 \\ c & l = r-1 \end{cases} \quad (21)$$
The multivariate kernel function $K_{t;t_1,\dots,t_p}(u)$ constructed in (16) satisfies
$$\begin{cases} \left|K_{t;t_1,\dots,t_p}(u)\right| \le C & u \in (0, u_t)^p \,\hat=\, D \\ K_{t;t_1,\dots,t_p}(u) = 0 & \text{otherwise} \end{cases} \quad (22)$$
$$\frac{1}{t_1!\cdots t_p!}\int_D u_1^{i_1} u_2^{i_2} \cdots u_p^{i_p}\, K_{t;t_1,\dots,t_p}(u)\,du = \begin{cases} 1 & i_1 = t_1,\dots,i_p = t_p \\ 0 & \text{otherwise, but } 0 \le i_1,\dots,i_p \le r-2 \end{cases} \quad (23)$$
and $K_{t;t_1,\dots,t_p}(u) \in C(\mathbb{R}^p)$. Suppose the prior distribution in the Bayes network has a $p$-variate density $f(x)$, $x = (x_1,\dots,x_p)'$. The mixed partial derivatives of $f(x)$ are $\frac{\partial^t f(x)}{\partial x_1^{t_1}\cdots\partial x_p^{t_p}}$, $t = t_1 + \cdots + t_p$. Using the multivariate kernel function $K_{t;t_1,\dots,t_p}(u)$, $u = (u_1,\dots,u_p)'$, the estimators of these derivatives are
$$\hat f_n(t; t_1,\dots,t_p; x) = \frac{1}{na_n^{p+t}}\sum_{j=1}^{n} K_{t;t_1,\dots,t_p}\!\left(\frac{x^{(j)} - x}{a_n}\right) \quad (24)$$
where
$$a_n = n^{-\frac{1}{2r+p}} \quad (25)$$
$\hat f_n$ is continuous on the whole space, namely,
$$\hat f_n(t; t_1,\dots,t_p; x) \in C(\mathbb{R}^p) \quad (26)$$
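A sketch of estimator (24) with the bandwidth choice (25); `K_multi` can be any product kernel built as in (16), for instance with the helper shown earlier, and the sample-array layout is an assumption made for the example.

```python
import numpy as np

def multivariate_estimate(x, samples, K_multi, r, t=0):
    """Estimator (24) with bandwidth a_n = n^{-1/(2r+p)} from (25).

    x       -- evaluation point, array of length p
    samples -- array of shape (n, p) holding the observations x^(1), ..., x^(n)
    K_multi -- multivariate kernel K_{t;t_1,...,t_p} as in (16)
    t       -- total derivative order t = t_1 + ... + t_p
    """
    samples = np.asarray(samples, dtype=float)
    x = np.asarray(x, dtype=float)
    n, p = samples.shape
    a_n = n ** (-1.0 / (2 * r + p))                        # bandwidth (25)
    total = sum(K_multi((xj - x) / a_n) for xj in samples)
    return total / (n * a_n ** (p + t))
```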
We have thus constructed kernel estimators of the univariate density, the multivariate density, and their derivatives, and verified their continuity and smoothness.
4 Convergence of Kernel Density Estimation of Density and Derivatives of Prior Distribution in Bayes Network
Because of the orthogonality of $L_d(u)$, the convergence rate of the kernel estimator of the univariate density derivative of the prior distribution constructed from $L_d(u)$ in the Bayes network is easily obtained. Let
$$\hat f^{(d)}(x) = \frac{1}{nh^{d+1}}\sum_{j=1}^{n} L_d\!\left(\frac{\tilde x_j - x}{h}\right) \quad (27)$$
Suppose that the derivatives of every order are locally bounded. Then
$$E\hat f^{(d)}(x) = \frac{1}{nh^{d+1}}\sum_{j=1}^{n} E\,L_d\!\left(\frac{\tilde x_j - x}{h}\right) = \frac{1}{h^{d+1}}\int_{-\infty}^{+\infty} L_d\!\left(\frac{y - x}{h}\right) f(y)\,dy = \frac{1}{h^{d}}\int_0^{u_0} L_d(u)\, f(x + hu)\,du$$
$$= \frac{1}{h^{d}}\int_0^{u_0} L_d(u)\left[f(x) + huf'(x) + \frac{h^2u^2}{2!}f''(x) + \cdots + \frac{h^du^d}{d!}f^{(d)}(x) + \cdots + \frac{h^{r-1}u^{r-1}}{(r-1)!}f^{(r-1)}(\xi)\right]du = f^{(d)}(x) + O\!\left(h^{r-1}\right) \quad (28)$$
So
$$E\hat f^{(d)}(x) - f^{(d)}(x) = O\!\left(h^{r-1}\right) \quad (29)$$
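The behaviour in (29) can be probed numerically. The sketch below uses the closed form $L_1(u) = -60u + 180u^2 - 120u^3$ on $[0,1]$, worked out here from (14) for $r = 3$, $u_0 = 1$, to estimate the derivative of a standard normal density at a point; the distribution, evaluation point, and bandwidth rule are illustrative assumptions rather than choices made in the paper.

```python
import numpy as np

def L1(u):
    """Derivative kernel from (14) for r = 3, u_0 = 1 (closed form derived here, illustrative)."""
    return np.where((u >= 0) & (u <= 1), -60 * u + 180 * u**2 - 120 * u**3, 0.0)

def f_hat_deriv(x, samples, h):
    """Estimator (27) with d = 1: (1/(n h^2)) * sum_j L_1((x_j - x) / h)."""
    return L1((samples - x) / h).sum() / (samples.size * h**2)

rng = np.random.default_rng(0)
x0, r = 0.5, 3
true_deriv = -x0 * np.exp(-x0**2 / 2) / np.sqrt(2 * np.pi)    # derivative of the N(0,1) density
for n in (500, 5000, 50000):
    h = n ** (-1.0 / (2 * r + 1))                              # bandwidth in the spirit of (25), p = 1
    errs = [abs(f_hat_deriv(x0, rng.standard_normal(n), h) - true_deriv) for _ in range(50)]
    print(n, np.mean(errs))    # the average error should shrink as n grows, in line with Theorem 1
```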
The mixed partial derivatives of the multivariate density $f(t)$ are
$$f^{(s)}(t) = f^{(s)}(s_1,\dots,s_p; t) = \frac{\partial^s f(t)}{\partial t_1^{s_1}\cdots\partial t_p^{s_p}} \quad (30)$$
where $t = (t_1,\dots,t_p)'$, $s_1 + \cdots + s_p = s$, $s = 0, 1, 2, \dots$. Using the multivariate kernel function (16), the estimators of $f^{(s)}(t)$ are
$$\hat f_n^{(s)}(t) = \frac{1}{n\alpha_n^{p+s}}\sum_{j=1}^{n} \tilde K_s\!\left(\frac{t^{(j)} - t}{\alpha_n}\right) \quad (31)$$
where $\alpha_n = n^{-\frac{1}{2r+p}}$, with $r+2$ being the dimension of the orthogonal polynomial space in which the kernel functions are constructed. In the following we study the convergence of $\hat f_n^{(s)}(t)$. Suppose the partial derivatives $f^{(r)}(t)$ are locally bounded; namely, there exists a bound $f_\varepsilon^{(r)}(t)$, independent of how the $r$ differentiations are distributed over the components $r_1,\dots,r_p$ of $t$, such that when $t \in X_t$ and $t + \varepsilon \in X_t$,
$$\sup_{0 \le \|\xi\| \le \varepsilon} \left|f^{(r)}(t + \xi)\right| \le f_\varepsilon^{(r)}(t) \quad (32)$$
where $X_t$ is the sample space of $t$. For the same reason, suppose that $f(t)$ is locally bounded. $E_n(\cdot)$ denotes mathematical expectation with respect to the $n$ samples.

THEOREM 1. Suppose $f^{(r)}(t)$ and $f(t)$ are locally bounded. Then
$$\left|E_n \hat f_n^{(s)}(t) - f^{(s)}(t)\right| = O\!\left(n^{-\frac{r-s}{2r+p}}\right) f_\varepsilon^{(r)}(t) \quad (33)$$
$$E_n\left|\hat f_n^{(s)}(t) - f^{(s)}(t)\right|^2 = O\!\left(n^{-\frac{2(r-s)}{2r+p}}\right)\left\{\left[f_\varepsilon^{(r)}(t)\right]^2 + f_\varepsilon^{(0)}(t)\right\} \quad (34)$$

Proof: Let $u = \frac{y-t}{\alpha_n}$.
Since $dy = \alpha_n^p\,du$ and $t^{(1)},\dots,t^{(n)}$ are i.i.d.,
$$E_n\big(\hat f_n^{(s)}(t)\big) = \frac{1}{\alpha_n^{p+s}}\int_{X_t} \tilde K_s\!\left(\frac{y - t}{\alpha_n}\right) f(y)\,dy = \frac{1}{\alpha_n^{s}}\int_{D_0} \tilde K_s(u)\, f(t + \alpha_n u)\,du \quad (35)$$
From the multivariate Taylor formula and the orthogonality conditions of the kernel functions, we obtain
$$E_n\big(\hat f_n^{(s)}(t)\big) = f^{(s)}(t) + \alpha_n^{r-s}\int_{D_0} \tilde K_s(u) \sum_{i_1+\cdots+i_p=r} \frac{u_1^{i_1}\cdots u_p^{i_p}}{i_1!\cdots i_p!}\, f^{(r)}(t + \xi)\,du \quad (36)$$
where $0 \le \|\xi\| \le \alpha_n\sqrt{p}$. Because $f^{(r)}(t)$ is locally bounded and the kernel function as well as the integration region are bounded, (33) is obtained. Moreover, from
$$\mathrm{Var}\big\{\hat f_n^{(s)}(t)\big\} \le \frac{1}{n\alpha_n^{p+2s}}\int_{D_0}\big[\tilde K_s(u)\big]^2 f(t + \alpha_n u)\,du \le O\!\left(n^{-\frac{2(r-s)}{2r+p}}\right) f_\varepsilon^{(0)}(t) \quad (37)$$
and
$$E_n\left|\hat f_n^{(s)}(t) - f^{(s)}(t)\right|^2 = \mathrm{Var}\big\{\hat f_n^{(s)}(t)\big\} + \left|E_n\big(\hat f_n^{(s)}(t)\big) - f^{(s)}(t)\right|^2 \quad (38)$$
we know that (34) holds. When $s = 1$, each of $\frac{\partial f(t)}{\partial t_1},\dots,\frac{\partial f(t)}{\partial t_p}$ is denoted $f^{(1)}(t)$; arrange these elements into a vector $\frac{\partial f(t)}{\partial t}$. Similarly, $\hat f_n^{(1)}(t)$ may denote $p$ kinds of kernel estimators $\hat f_{n1}^{(1)}(t),\dots,\hat f_{np}^{(1)}(t)$; we also arrange them into a vector $\tilde f_n^{(1)}(t)$. From (31) we
know that their kernel functions are constructed differently and their orthogonality conditions differ. From Theorem 1,
$$E_n\left\|\tilde f_n^{(1)}(t) - \frac{\partial f(t)}{\partial t}\right\|^2 = E_n\left\{\left|\hat f_{n1}^{(1)}(t) - \frac{\partial f(t)}{\partial t_1}\right|^2 + \cdots + \left|\hat f_{np}^{(1)}(t) - \frac{\partial f(t)}{\partial t_p}\right|^2\right\} \quad (39)$$
Moreover, we consider quadratic forms. For a positive definite matrix $\Delta$ we have
$$E_n\left(\tilde f_n^{(1)}(t) - \frac{\partial f(t)}{\partial t}\right)'\Delta\left(\tilde f_n^{(1)}(t) - \frac{\partial f(t)}{\partial t}\right) \le C\,E_n\left\|\tilde f_n^{(1)}(t) - \frac{\partial f(t)}{\partial t}\right\|^2 = O\!\left(n^{-\frac{2(r-1)}{2r+p}}\right)\left\{\big[f_\varepsilon^{(r)}(t)\big]^2 + f_\varepsilon^{(0)}(t)\right\} \quad (40)$$
Suppose $1 < 2\delta < 2$. By the Jensen and Hölder inequalities we have
$$E|\eta|^{2\delta} \le 2\left[\big(\mathrm{Var}(\eta)\big)^{\delta} + |E(\eta)|^{2\delta}\right] \quad (41)$$
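A one-line justification of (41), spelled out here for completeness: by convexity of $x \mapsto x^{2\delta}$ (since $2\delta > 1$), $E|\eta|^{2\delta} = E\big|(\eta - E\eta) + E\eta\big|^{2\delta} \le 2^{2\delta-1}\big(E|\eta - E\eta|^{2\delta} + |E\eta|^{2\delta}\big)$; by Jensen's inequality with $\delta < 1$, $E|\eta - E\eta|^{2\delta} \le \big(E|\eta - E\eta|^{2}\big)^{\delta} = \big(\mathrm{Var}(\eta)\big)^{\delta}$; and $2^{2\delta-1} \le 2$, which gives (41).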
Then we obtain Lemma 1.

LEMMA 1. Suppose $\frac{1}{2} < \delta < 1$. Then
$$E_n\left|\tilde f_n^{(s)}(t) - f^{(s)}(t)\right|^{2\delta} = O\!\left(n^{-\frac{2\delta(r-s)}{2r+p}}\right)\left\{\big[f_\varepsilon^{(r)}(t)\big]^{2\delta} + \big[f_\varepsilon^{(0)}(t)\big]^{\delta}\right\} \quad (42)$$
$$E_n\left\|\tilde f_n^{(1)}(t) - \frac{\partial f(t)}{\partial t}\right\|^{2\delta} \le E_n\left\{\left|\hat f_{n1}^{(1)}(t) - \frac{\partial f(t)}{\partial t_1}\right| + \cdots + \left|\hat f_{np}^{(1)}(t) - \frac{\partial f(t)}{\partial t_p}\right|\right\}^{2\delta}$$
$$\le p\,E_n\left\{\left|\hat f_{n1}^{(1)}(t) - \frac{\partial f(t)}{\partial t_1}\right|^{2\delta} + \cdots + \left|\hat f_{np}^{(1)}(t) - \frac{\partial f(t)}{\partial t_p}\right|^{2\delta}\right\} = O\!\left(n^{-\frac{2\delta(r-1)}{2r+p}}\right)\left\{\big[f_\varepsilon^{(r)}(t)\big]^{2\delta} + \big[f_\varepsilon^{(0)}(t)\big]^{\delta}\right\} \quad (43)$$

Thus the convergence rates of the kernel estimators of the multivariate density and its partial derivatives constructed from orthogonal polynomials have been established. This paper has considered the estimation of the prior distribution in a Bayes network and introduced its basic expressions in Section 1. Section 2 gave detailed constructions of the orthogonal polynomials (14) and (16) and used them to build kernel estimators of the univariate density, the multivariate density, and their derivatives. Section 3 showed that these estimators keep continuity and smoothness when the sample is fixed, and Section 4 proved that they keep good convergence rates when the number of samples tends to infinity. This completes our study of the optimality of kernel density estimation of the prior distribution in a Bayes network.
References
1. Kohavi, R., Becker, B., Sommerfield, D.: Improving simple Bayes. In: Proceedings of the European Conference on Machine Learning (1997) 78–87
2. Zhang, H., Ling, C.: Numeric mapping and learnability of naive Bayes. Applied Artificial Intelligence 17(5) (2003) 507–518
3. Sanchis, A., Juan, A., Vidal, E.: Improving utterance verification using a smoothed naive Bayes model. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (2003) 592–595
4. Yager, R.R.: An extension of the naive Bayesian classifier. Information Sciences 176(5) (2006) 577–588
5. Strahov, E., Yan, V.: Universal results for correlations of characteristic polynomials: Riemann–Hilbert approach. Communications in Mathematical Physics 241 (2003) 343–382
6. Freilikher, V., Kanzieper, E., Yurkevich, I.: Theory of random matrices with strong level confinement: orthogonal polynomial approach. Physical Review E 54 (1996) 210–219
7. Gajek, L.: On improving density estimators which are not bona fide functions. Ann. Statist. 11 (1986) 1612–1618
8. Roder, H., Silver, R.N., Drabold, D.A., Dong, J.J.: Kernel polynomial method for a nonorthogonal electronic-structure calculation of amorphous diamond. Physical Review B 55 (1997) 15382–15385
9. Li, J., Gupta, S.S.: Empirical Bayes tests based on kernel sequence estimation. Statistica Sinica 12 (2002) 1061–1072
10. Jones, M.C.: On kernel density derivative estimation. Communications in Statistics – Theory and Methods 23 (1994) 2133–2139
11. Lin, P.E.: Rates of convergence in empirical Bayes estimation problems: continuous case. Ann. Statist. 3 (1975) 155–164
12. Girolami, M.: Orthogonal series density estimation and the kernel eigenvalue problem. Neural Computation 14 (2002) 669–688
13. Messer, K.: A comparison of a spline estimate to its equivalent kernel estimate. Ann. Statist. 19 (1991) 817–829
14. Parzen, E.: On estimation of a probability density function and mode. Ann. Math. Statist. 33 (1962) 1065–1076
15. Tong, H.Q.: Convergence rates for empirical Bayes estimators of parameters in multi-parameter exponential families. Communications in Statistics – Theory and Methods 25 (1996) 1089–1098
16. Vanlessen, M.: Universal behavior for averages of characteristic polynomials at the origin of the spectrum. Communications in Mathematical Physics 253 (2005) 535–560