Cluster-Weighted Modeling as a basis for Fuzzy Modeling

Madasu Hanmandlu
Dept. of Electrical Engineering, I.I.T. Delhi, New Delhi-110016, India.
[email protected]

Vamsi Krishna Madasu
School of IT & EE, University of Queensland, QLD 4072, Australia.
[email protected]

Shantaram Vasikarla
Information Technology Dept., American InterCon. University, Los Angeles, CA 90066, U.S.A.
[email protected]
Abstract

Cluster-Weighted Modeling (CWM) is emerging as a versatile tool for modeling dynamical systems. It is a mixture density estimator built around local models. To be specific, the input and output regions are treated as Gaussians serving as local models. These local models are linked by a linear or non-linear function through a mixture of their densities. The present work shows a connection between CWM and the Generalized Fuzzy Model (GFM), thus paving the way for utilizing the concepts of probability theory in the fuzzy domain, which has already emerged as a versatile tool for solving problems in uncertain dynamic systems.
1. Introduction

Cluster-Weighted Modeling, introduced by Gershenfeld et al. [1], is a versatile approach for deriving a functional relationship between input and output data by using a mixture of expert clusters. Each cluster is localized to a Gaussian input region and has its own local trainable model. The CWM algorithm uses expectation-maximization (EM) to find the optimal location of the clusters in the input space and to solve for the parameters of the local models [2]. CWM can be used as a modeling tool that allows one to characterize and predict systems of arbitrary dynamic character [3]. The framework employed in CWM is concerned with density estimation around Gaussian kernels containing simple local models that describe the system dynamics of a data subspace. In the simplest case, where only one kernel is required, the framework boils down to a simple model that is linear in the coefficients. In the most complex case, we may need non-Gaussian, discontinuous, high-dimensional and chaotic models. In between, CWM covers a wide range of models, each of which is characterized by a different local model. We can also create globally non-linear models with transparent local structures
through the embedding of past practice and mature techniques in the general non-linear framework.

Fuzzy modeling has evolved over the years for dealing with problems of dynamic systems. Recently, the Generalized Fuzzy Model (GFM) was proposed in [7]; it generalizes the existing fuzzy models, viz., the Compositional Rule of Inference (CRI) model and the Takagi-Sugeno (TS) model. In this paper, we show a strong connection between CWM and GFM. So far, GFM has lacked a sound mathematical footing. With this connection, GFM gains a strong foothold and can assimilate the strong points of the probabilistic framework.

The organization of the paper is as follows. Section 2 gives the concept of CWM, the use of EM in estimating density functions, and the model estimation. Section 3 briefly reviews the fuzzy models. Section 4 establishes the equivalence between CWM and GFM. Finally, conclusions are drawn in Section 5.
2. Cluster-Weighted Modeling

It is hard to capture local behavior with global beliefs. For example, if a smooth curve has some discontinuities, then in trying to fit the discontinuity we may miss the smoothness. Hence the need for a proper choice of fitting function, so that the transition from a low-dimensional space to a high-dimensional one is easily achieved. These considerations suggest that for capturing local behavior we need to estimate the density using local rather than global functions. Kernel density estimation adopts this approach by placing a Gaussian at each data point, but this requires retention of every point in the model. A better approach is to find important points and fit a smaller number of local functions that can model larger neighborhoods. Mixture models, preferably involving Gaussians, can achieve this. These models lead to the splitting of a dataset into a set of clusters. An example is an unsupervised learning algorithm, which must learn by itself where to place the local functions.
A Gaussian Mixture Model (GMM) in D dimensions can be expressed by factoring the density over multivariate Gaussians. It can thus be represented by the following expression:

$$p(\mathbf{x}) = \sum_{m=1}^{M} p(\mathbf{x}, c_m) = \sum_{m=1}^{M} p(\mathbf{x} \mid c_m)\, p(c_m) = \sum_{m=1}^{M} \frac{\left|\mathbf{C}_m\right|^{-1/2}}{(2\pi)^{D/2}} \exp\!\left[-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_m)^T \mathbf{C}_m^{-1}(\mathbf{x}-\boldsymbol{\mu}_m)\right] p(c_m) \qquad (1)$$

where M is the number of clusters and $p(\mathbf{x} \mid c_m)$ refers to the mth Gaussian with mean $\boldsymbol{\mu}_m$ and covariance $\mathbf{C}_m$, which can be calculated from:

$$\boldsymbol{\mu}_m = \int \mathbf{x}\, p(\mathbf{x} \mid c_m)\, d\mathbf{x} = \int \mathbf{x}\, \frac{p(c_m \mid \mathbf{x})\, p(\mathbf{x})}{p(c_m)}\, d\mathbf{x} \approx \frac{1}{N_m\, p(c_m)} \sum_{n=1}^{N_m} \mathbf{x}_n\, p(c_m \mid \mathbf{x}_n) \qquad (2)$$

where $N_m$ is the number of data points in the mth cluster. Similarly,

$$\mathbf{C}_m \approx \frac{1}{N_m\, p(c_m)} \sum_{n=1}^{N_m} (\mathbf{x}_n - \boldsymbol{\mu}_m)(\mathbf{x}_n - \boldsymbol{\mu}_m)^T\, p(c_m \mid \mathbf{x}_n) \qquad (3)$$

The expansion weights $p(c_m)$ are

$$p(c_m) = \int_{-\infty}^{\infty} p(\mathbf{x}, c_m)\, d\mathbf{x} = \int_{-\infty}^{\infty} p(c_m \mid \mathbf{x})\, p(\mathbf{x})\, d\mathbf{x} \approx \frac{1}{N_m} \sum_{n=1}^{N_m} p(c_m \mid \mathbf{x}_n) \qquad (4)$$

The posterior probability $p(c_m \mid \mathbf{x})$ is defined by

$$p(c_m \mid \mathbf{x}) = \frac{p(\mathbf{x}, c_m)}{p(\mathbf{x})} = \frac{p(\mathbf{x} \mid c_m)\, p(c_m)}{\sum_{m=1}^{M} p(\mathbf{x} \mid c_m)\, p(c_m)} \qquad (5)$$

The next problem is to fit a mixture of Gaussians to random data uniformly distributed over the interval [0, 1]. This fitting requires a proper overlap of multiple Gaussians. What we need is an expansion of the density around models that can locally describe more complex behavior. The goal is to capture the functional dependence as part of the density estimate for a system. Assuming N observations, we have $\{y_i, \mathbf{x}_i\}_{i=1}^{N}$, where $\mathbf{x}_i$ is the input and $y_i$ is the measured output; $\mathbf{x} \in \Re^D$ is a D-dimensional vector and $y \in \Re^1$ is a scalar. For example, the input could be a vector of past logged values of a signal and y could be a predicted value of the signal, if the system considered is a predictor. We partition the input-output data into M clusters $(\mathbf{x}_m, y_m)$, $m = 1 \ldots M$, by using some clustering algorithm. Let $p(y, \mathbf{x})$ be the joint density for the system, such that $p(y \mid \mathbf{x})$ yields the quantity of interest. We expand this density in terms of clusters, each described by a weight $p(c_m)$, a domain of influence in the input space $p(\mathbf{x} \mid c_m)$, and a functional dependence in the output space $p(y \mid \mathbf{x}, c_m)$. The local models are shown by solid lines in Fig. 1, whereas the functional dependence is shown by bold lines.

Fig. 1: Cluster-Weighted Modeling

Now, $p(y, \mathbf{x})$ is defined by

$$p(y, \mathbf{x}) = \sum_{m=1}^{M} p(y, \mathbf{x}, c_m) = \sum_{m=1}^{M} p(y, \mathbf{x} \mid c_m)\, p(c_m) = \sum_{m=1}^{M} p(y \mid \mathbf{x}, c_m)\, p(\mathbf{x} \mid c_m)\, p(c_m) \qquad (6)$$

where $p(c_m)$ is a number that measures the fraction of the dataset described by the cluster.
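To make (1)-(5) concrete, the following Python sketch (our own illustration; the helper names `gaussian_pdf`, `mixture_density` and `cluster_posteriors` are not from the original work) evaluates the mixture density and the cluster posteriors for a small two-cluster example.

```python
import numpy as np

def gaussian_pdf(x, mu, cov):
    """Multivariate Gaussian density, the mth component in eq. (1)."""
    D = len(mu)
    diff = x - mu
    norm = np.sqrt(np.linalg.det(cov)) * (2 * np.pi) ** (D / 2)
    return np.exp(-0.5 * diff @ np.linalg.solve(cov, diff)) / norm

def mixture_density(x, means, covs, weights):
    """p(x) = sum_m p(x|c_m) p(c_m), eq. (1)."""
    return sum(w * gaussian_pdf(x, mu, cov)
               for mu, cov, w in zip(means, covs, weights))

def cluster_posteriors(x, means, covs, weights):
    """p(c_m|x), eq. (5): joint densities normalized by p(x)."""
    joints = np.array([w * gaussian_pdf(x, mu, cov)
                       for mu, cov, w in zip(means, covs, weights)])
    return joints / joints.sum()

# Two clusters in D = 2 (illustrative parameters)
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
covs = [np.eye(2), 0.5 * np.eye(2)]
weights = [0.6, 0.4]
x = np.array([2.5, 2.8])
print(mixture_density(x, means, covs, weights))
print(cluster_posteriors(x, means, covs, weights))  # sums to 1
```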
If the input term is taken as D-dimensional, then it is expressed by separable Gaussians:

$$p(\mathbf{x} \mid c_m) = \prod_{d=1}^{D} \frac{1}{\sqrt{2\pi\sigma_{m,d}^2}} \exp\!\left[\frac{-(x_d - \mu_{m,d})^2}{2\sigma_{m,d}^2}\right] \qquad (7)$$

or, using the full covariance matrix as in (1),

$$p(\mathbf{x} \mid c_m) = \frac{\left|\mathbf{C}_m\right|^{-1/2}}{(2\pi)^{D/2}} \exp\!\left[-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_m)^T \mathbf{C}_m^{-1}(\mathbf{x} - \boldsymbol{\mu}_m)\right] \qquad (8)$$
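As a quick check of the two input-density forms, the sketch below (our own helper names, illustrative values) evaluates (7) and confirms that it agrees with the full-covariance form (8) when $\mathbf{C}_m$ is diagonal.

```python
import numpy as np

def separable_gaussian(x, mu, sigma2):
    """Eq. (7): product of per-dimension 1-D Gaussians (diagonal C_m)."""
    return np.prod(np.exp(-(x - mu) ** 2 / (2 * sigma2))
                   / np.sqrt(2 * np.pi * sigma2))

def full_cov_gaussian(x, mu, cov):
    """Eq. (8): full-covariance multivariate Gaussian."""
    D = len(mu)
    diff = x - mu
    norm = np.sqrt(np.linalg.det(cov)) * (2 * np.pi) ** (D / 2)
    return np.exp(-0.5 * diff @ np.linalg.solve(cov, diff)) / norm

x, mu, sigma2 = np.array([0.5, -1.0]), np.zeros(2), np.array([1.0, 2.0])
assert np.isclose(separable_gaussian(x, mu, sigma2),
                  full_cov_gaussian(x, mu, np.diag(sigma2)))
```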
Note that the covariance matrix lets one cluster capture a linear relationship that would otherwise require many separate clusters to describe. The output term is also taken to be Gaussian, but incorporating its dependence on the input:
$$p(y \mid \mathbf{x}, c_m) = \frac{1}{\sqrt{2\pi\sigma_{m,y}^2}} \exp\!\left[\frac{-[y - f(\mathbf{x}, \boldsymbol{\alpha}_m)]^2}{2\sigma_{m,y}^2}\right] \qquad (9)$$

The mean of this Gaussian is a function $f$ that depends on $\mathbf{x}$ and a set of parameters $\boldsymbol{\alpha}_m$. So, the conditional expected value of y is:
$$\langle y \mid \mathbf{x} \rangle = \int y\, p(y \mid \mathbf{x})\, dy = \int y\, \frac{p(y, \mathbf{x})}{p(\mathbf{x})}\, dy = \frac{\sum_{m=1}^{M} \left[\int y\, p(y \mid \mathbf{x}, c_m)\, dy\right] p(\mathbf{x} \mid c_m)\, p(c_m)}{\sum_{m=1}^{M} p(\mathbf{x} \mid c_m)\, p(c_m)} = \frac{\sum_{m=1}^{M} f(\mathbf{x}, \boldsymbol{\alpha}_m)\, p(\mathbf{x} \mid c_m)\, p(c_m)}{\sum_{m=1}^{M} p(\mathbf{x} \mid c_m)\, p(c_m)} \qquad (10)$$

We observe that the expected y is a superposition of all the local functionals $f(\mathbf{x}, \boldsymbol{\alpha}_m)$, where the weight of each contribution depends on the posterior probability that the input was generated by a particular cluster. The denominator ensures that the sum of the weights of all contributions equals unity. In the expected output (10), the Gaussians control the interpolation among the local functions instead of serving directly as the basis for functional approximation. This means that the function $f$ can be chosen to reflect the local relationship between $\mathbf{x}$ and $y$, which could be linear, and even one cluster could model its behavior.

2.1. Expectation-Maximization (EM)

The Expectation-Maximization algorithm [8] is used to estimate the probability density of a set of given data. In order to model the probability density of the data, a Gaussian Mixture Model is used: the probability density is modeled as the weighted sum of a number of Gaussian distributions. In other words, EM is typically used to compute maximum likelihood estimates given incomplete samples. Since we fit the local model parameters $\boldsymbol{\alpha}_m$ of the function $f(\mathbf{x}, \boldsymbol{\alpha}_m)$ in CWM and then find the remaining cluster parameters in charge of the global weighting, we can use a variant of the EM algorithm [8]. It is an iterative search that maximizes the model likelihood given a data set and initial conditions. We start with a set of initial values for the cluster parameters and then enter the iterations in the Expectation step.

E-step: In the E-step, we take the current cluster parameters as given and evaluate the posterior probabilities that relate each cluster to each data point. These posteriors can be interpreted as the probability that a particular data point was generated by a particular cluster, or as the normalized responsibility of a cluster for a point. We calculate the posteriors using the forward probabilities:

$$p(c_m \mid y, \mathbf{x}) = \frac{p(y, \mathbf{x}, c_m)}{p(y, \mathbf{x})} = \frac{p(y \mid \mathbf{x}, c_m)\, p(\mathbf{x} \mid c_m)\, p(c_m)}{\sum_{m=1}^{M} p(y, \mathbf{x}, c_m)} = \frac{p(y \mid \mathbf{x}, c_m)\, p(\mathbf{x} \mid c_m)\, p(c_m)}{\sum_{m=1}^{M} p(y \mid \mathbf{x}, c_m)\, p(\mathbf{x} \mid c_m)\, p(c_m)} \qquad (11)$$

where the sum over densities in the denominator causes clusters to interact, fight over points, and specialize in the data they best explain.
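To make the prediction rule (10) and the E-step posteriors (11) concrete, here is a minimal Python sketch. It assumes linear local models of the form (24) below, and represents each cluster as a dict with keys `p`, `mu_x`, `cov_x`, `mu_y`, `var_y`, and `alpha` (our own illustrative layout, not a prescribed data structure).

```python
import numpy as np

def gaussian_pdf(x, mu, cov):
    """Input-domain Gaussian p(x|c_m), eq. (8)."""
    D = len(mu)
    diff = x - mu
    norm = np.sqrt(np.linalg.det(cov)) * (2 * np.pi) ** (D / 2)
    return np.exp(-0.5 * diff @ np.linalg.solve(cov, diff)) / norm

def local_f(x, c):
    """Linear local model f(x, alpha_m), eq. (24)."""
    return c['mu_y'] + c['alpha'] @ (x - c['mu_x'])

def predict(x, clusters):
    """Conditional expectation <y|x>, eq. (10)."""
    w = np.array([gaussian_pdf(x, c['mu_x'], c['cov_x']) * c['p']
                  for c in clusters])
    f = np.array([local_f(x, c) for c in clusters])
    return (w * f).sum() / w.sum()

def e_step(x, y, clusters):
    """Posteriors p(c_m | y, x), eq. (11)."""
    joints = []
    for c in clusters:
        resid = y - local_f(x, c)
        p_y = (np.exp(-resid ** 2 / (2 * c['var_y']))
               / np.sqrt(2 * np.pi * c['var_y']))        # eq. (9)
        joints.append(p_y * gaussian_pdf(x, c['mu_x'], c['cov_x']) * c['p'])
    joints = np.array(joints)
    return joints / joints.sum()

# Two-cluster example in D = 1 (illustrative parameters)
clusters = [
    {'p': 0.5, 'mu_x': np.array([0.0]), 'cov_x': np.eye(1),
     'mu_y': 0.0, 'var_y': 0.1, 'alpha': np.array([1.0])},
    {'p': 0.5, 'mu_x': np.array([4.0]), 'cov_x': np.eye(1),
     'mu_y': 2.0, 'var_y': 0.1, 'alpha': np.array([-1.0])},
]
x = np.array([1.0])
print(predict(x, clusters))
print(e_step(x, 1.0, clusters))
```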
M-step: In the M-step, we take the current data distribution as given and find the cluster parameters that maximize the likelihood of the data. The new estimate for the unconditional cluster probabilities is

$$p(c_m) = \int p(y, \mathbf{x}, c_m)\, dy\, d\mathbf{x} = \int p(c_m \mid y, \mathbf{x})\, p(y, \mathbf{x})\, dy\, d\mathbf{x} \approx \frac{1}{N_m} \sum_{n=1}^{N_m} p(c_m \mid y_n, \mathbf{x}_n) \qquad (12)$$

where $(\mathbf{x}_n, y_n)$ is the input-output data set in the mth cluster. Here, the idea is that an integral over a density can be approximated by an average over variables drawn from the density. Next, we compute the expected input mean of each cluster, which is the estimate of the new cluster means. These are used to update the cluster weights. The new means are obtained as

$$\boldsymbol{\mu}_m^{new} = \int \mathbf{x}\, p(\mathbf{x} \mid c_m)\, d\mathbf{x} = \int \mathbf{x}\, p(y, \mathbf{x} \mid c_m)\, dy\, d\mathbf{x} = \int \mathbf{x}\, \frac{p(c_m \mid y, \mathbf{x})\, p(y, \mathbf{x})}{p(c_m)}\, dy\, d\mathbf{x} \qquad (13)$$

Since $\int \mathbf{x}\, p(\mathbf{x})\, d\mathbf{x} \approx \frac{1}{N_m} \sum_{n=1}^{N_m} \mathbf{x}_n$, we can approximate the above function by

$$\boldsymbol{\mu}_m^{new} \approx \frac{1}{N_m\, p(c_m)} \sum_{n=1}^{N_m} \mathbf{x}_n\, p(c_m \mid y_n, \mathbf{x}_n) \qquad (14)$$

Further, using (12),

$$\boldsymbol{\mu}_m^{new} = \frac{\sum_{n=1}^{N_m} \mathbf{x}_n\, p(c_m \mid y_n, \mathbf{x}_n)}{\sum_{n=1}^{N_m} p(c_m \mid y_n, \mathbf{x}_n)} \qquad (15)$$

which is the cluster-weighted expectation value for the mth cluster. In the above, we have used the sampling trick to evaluate the integrals and guide the cluster updates. This permits clusters to respond both to where the data are in the input space and to their models in the output space. A cluster won't move to describe nearby data if they are better described by another cluster's model. If two clusters' models work equally well, they will separate to better describe where the data are. The cluster-weighted expectations, denoted $\langle \cdot \rangle_m$, are also used to update the variances,

$$\sigma_{m,new}^2 = \frac{\sum_{n=1}^{N_m} (\mathbf{x}_n - \boldsymbol{\mu}_m^{new})^2\, p(c_m \mid y_n, \mathbf{x}_n)}{\sum_{n=1}^{N_m} p(c_m \mid y_n, \mathbf{x}_n)} \quad \text{or} \quad \sigma_{m,new}^2 = \left\langle (\mathbf{x}_n - \boldsymbol{\mu}_m^{new})^2 \right\rangle_m \qquad (16)$$

or the covariances,

$$[c_{xx}]_{ij} = [\mathbf{C}_m]_{ij} = \left\langle (x_i - \mu_{m,i})(x_j - \mu_{m,j}) \right\rangle_m \qquad (17)$$

2.2. Model Estimation

The model parameters are chosen to maximize the cluster-weighted log-likelihood:

$$L = \ln \prod_{n=1}^{N_m} p(y_n, \mathbf{x}_n) \qquad (18)$$

Let

$$J = \frac{\partial L}{\partial \alpha_{m,l}} = \frac{\partial}{\partial \alpha_{m,l}} \ln \prod_{n=1}^{N_m} p(y_n, \mathbf{x}_n) = \sum_{n=1}^{N_m} \frac{\partial}{\partial \alpha_{m,l}} \left[\ln p(y_n, \mathbf{x}_n)\right] = \sum_{n=1}^{N_m} \frac{1}{p(y_n, \mathbf{x}_n)} \frac{\partial p(y_n, \mathbf{x}_n)}{\partial \alpha_{m,l}} \qquad (19)$$
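Before expanding the derivative in (19), it may help to see the M-step collected in code. The sketch below (a hypothetical helper of our own; `post[n, m]` holds the E-step posteriors $p(c_m \mid y_n, \mathbf{x}_n)$ over the whole data set) applies the weight update (12), the mean update (15), the covariance update (17), and the output-mean update that appears below as (28).

```python
import numpy as np

def m_step(X, Y, post):
    """One M-step. X: (N, D) inputs, Y: (N,) outputs,
    post[n, m] = p(c_m | y_n, x_n) from the E-step."""
    new_clusters = []
    for m in range(post.shape[1]):
        r = post[:, m]                                  # responsibilities
        p_c = r.mean()                                  # weights, eq. (12)
        mu_x = (r[:, None] * X).sum(0) / r.sum()        # input means, eq. (15)
        dX = X - mu_x
        cov_x = np.einsum('n,ni,nj->ij', r, dX, dX) / r.sum()  # eq. (17)
        mu_y = (r * Y).sum() / r.sum()                  # output means, eq. (28)
        new_clusters.append({'p': p_c, 'mu_x': mu_x,
                             'cov_x': cov_x, 'mu_y': mu_y})
    return new_clusters

# Example: random data with soft assignments to two clusters
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
Y = X[:, 0] + rng.normal(scale=0.1, size=100)
post = rng.dirichlet([1, 1], size=100)                  # rows sum to 1
print(m_step(X, Y, post)[0]['p'])
```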
But we know that

$$\frac{\partial p(y_n, \mathbf{x}_n)}{\partial \alpha_{m,l}} = \frac{p(y_n, \mathbf{x}_n, c_m)}{p(y_n \mid \mathbf{x}_n, c_m)} \cdot \frac{\partial p(y_n \mid \mathbf{x}_n, c_m)}{\partial \alpha_{m,l}} \qquad (20)$$

since only the mth joint term of $p(y_n, \mathbf{x}_n)$ depends on $\boldsymbol{\alpha}_m$, and $p(y_n, \mathbf{x}_n, c_m) = p(y_n \mid \mathbf{x}_n, c_m)\, p(\mathbf{x}_n \mid c_m)\, p(c_m)$. Therefore, using (9), we have

$$J = \sum_{n=1}^{N_m} \frac{p(y_n, \mathbf{x}_n, c_m)}{p(y_n, \mathbf{x}_n)} \cdot \frac{y_n - f(\mathbf{x}_n, \boldsymbol{\alpha}_m)}{\sigma_{m,y}^2} \cdot \frac{\partial f(\mathbf{x}, \boldsymbol{\alpha}_m)}{\partial \alpha_{m,l}} = \frac{1}{\sigma_{m,y}^2} \sum_{n=1}^{N_m} p(c_m \mid y_n, \mathbf{x}_n) \left[y_n - f(\mathbf{x}_n, \boldsymbol{\alpha}_m)\right] \frac{\partial f(\mathbf{x}, \boldsymbol{\alpha}_m)}{\partial \alpha_{m,l}} \qquad (21)$$

Let

$$J' = \frac{\sigma_{m,y}^2}{N_m\, p(c_m)} \cdot J \qquad (22)$$

So, we have from [5],

$$J' = \frac{1}{N_m\, p(c_m)} \sum_{n=1}^{N_m} p(c_m \mid y_n, \mathbf{x}_n) \left[y_n - f(\mathbf{x}_n, \boldsymbol{\alpha}_m)\right] \frac{\partial f(\mathbf{x}_n, \boldsymbol{\alpha}_m)}{\partial \alpha_{m,l}} = \frac{\sum_{n=1}^{N_m} p(c_m \mid y_n, \mathbf{x}_n)\, (y_n - f(\mathbf{x}_n, \boldsymbol{\alpha}_m))\, \frac{\partial f(\mathbf{x}_n, \boldsymbol{\alpha}_m)}{\partial \alpha_{m,l}}}{\sum_{n=1}^{N_m} p(c_m \mid y_n, \mathbf{x}_n)} = \left\langle \{y - f(\mathbf{x}, \boldsymbol{\alpha}_m)\} \frac{\partial f(\mathbf{x}, \boldsymbol{\alpha}_m)}{\partial \alpha_{m,l}} \right\rangle_m \qquad (23)$$

Linear Function Fitting

For linear output models, we assume the output function relative to the input centre as follows:

$$f(\mathbf{x}, \boldsymbol{\alpha}_m) = \mu_{m,y} + \sum_{d=1}^{D} \alpha_{m,d}\, (x_d - \mu_{m,d}) \qquad (24)$$

where $\alpha_{m,d}$ is the dth element of $\boldsymbol{\alpha}_m$. This form was first conjectured in [4]. The actual form is shown in the Appendix by adapting a result from [10]. The optimal parameters $\boldsymbol{\alpha}_m$ of the cluster can be obtained by equating $J'$ to zero. That is,

$$J' = 0 \Rightarrow \left\langle \{y - f(\mathbf{x}, \boldsymbol{\alpha}_m)\} \frac{\partial f(\mathbf{x}, \boldsymbol{\alpha}_m)}{\partial \alpha_{m,l}} \right\rangle_m = 0$$

$$\Rightarrow \left\langle \left[(y - \mu_{m,y}) - \sum_{d=1}^{D} \alpha_{m,d}(x_d - \mu_{m,d})\right] (x_l - \mu_{m,l}) \right\rangle_m = 0$$

$$\Rightarrow \left\langle (y - \mu_{m,y})(x_l - \mu_{m,l}) \right\rangle_m = \sum_{d=1}^{D} \alpha_{m,d} \left\langle (x_l - \mu_{m,l})(x_d - \mu_{m,d}) \right\rangle_m$$

$$\Rightarrow c_{yx_l} = \sum_{d=1}^{D} \alpha_{m,d}\, [c_{xx}]_{ld} \quad \text{for } l = 1, \ldots, D \qquad (25)$$

The above yields a set of linear equations for $l = 1 \ldots D$, $d = 1 \ldots D$:

$$\begin{aligned} c_{yx_1} &= \alpha_{m,1}[c_{xx}]_{1,1} + \alpha_{m,2}[c_{xx}]_{1,2} + \ldots + \alpha_{m,D}[c_{xx}]_{1,D} \\ &\;\;\vdots \\ c_{yx_D} &= \alpha_{m,1}[c_{xx}]_{D,1} + \alpha_{m,2}[c_{xx}]_{D,2} + \ldots + \alpha_{m,D}[c_{xx}]_{D,D} \end{aligned} \qquad (26)$$

In matrix form, the above equations appear as

$$\begin{bmatrix} c_{yx_1} \\ \vdots \\ c_{yx_D} \end{bmatrix} = \begin{bmatrix} [c_{xx}]_{1,1} & \cdots & [c_{xx}]_{1,D} \\ \vdots & & \vdots \\ [c_{xx}]_{D,1} & \cdots & [c_{xx}]_{D,D} \end{bmatrix} \begin{bmatrix} \alpha_{m,1} \\ \vdots \\ \alpha_{m,D} \end{bmatrix}$$

$$\mathbf{C}_{yx} = \mathbf{C}_{xx} \cdot \boldsymbol{\alpha}_m \;\Rightarrow\; \boldsymbol{\alpha}_m = \mathbf{C}_{xx}^{-1} \mathbf{C}_{yx} \qquad (27)$$

The means of the output clusters are updated using the new means,

$$\mu_{m,y}^{new} = \langle y_n \rangle_m \qquad (28)$$

and the output variances by

$$\sigma_{m,y}^{2,new} = \left\langle \{f(\mathbf{x}, \boldsymbol{\alpha}_m) - \mu_{m,y}^{new}\}^2 \right\rangle_m \qquad (29)$$

where

$$c_{yx_l} = \left\langle (y - \mu_{m,y})(x_l - \mu_{m,l}) \right\rangle_m \qquad (30)$$

If the input vector elements are independent, then $[c_{xx}]_{l,d} = 0$ for all $l \neq d$ in (26). Hence, we have $[c_{xx}]_{d,d} = \sigma_{m,d}^2$. Therefore, (27) is simplified to

$$\alpha_{m,d} = \frac{c_{yx_d}}{\sigma_{m,d}^2} \quad \forall\; d = 1, \ldots, D \qquad (31)$$
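A minimal sketch of the linear parameter solve: the cluster-weighted covariances of (17) and (30) feed the linear system (26)-(27), with the diagonal shortcut (31) noted in a comment. The function name and data are our own, illustrative only.

```python
import numpy as np

def fit_linear_model(X, Y, r, mu_x, mu_y):
    """Eqs. (25)-(27): cluster-weighted covariances and the solve
    alpha_m = C_xx^{-1} c_yx. r[n] = p(c_m | y_n, x_n)."""
    w = r / r.sum()                      # normalized responsibilities
    dX = X - mu_x                        # deviations from the cluster mean
    dY = Y - mu_y
    C_xx = np.einsum('n,ni,nj->ij', w, dX, dX)        # [c_xx]_{ld}, eq. (17)
    c_yx = np.einsum('n,ni->i', w, dY[:, None] * dX)  # c_{yx_l}, eq. (30)
    return np.linalg.solve(C_xx, c_yx)  # alpha_m, eq. (27)
    # If the inputs are independent, C_xx is diagonal and eq. (31) applies:
    # alpha_m[d] = c_yx[d] / C_xx[d, d]

# Single-cluster demo on synthetic linear data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
Y = 1.5 * X[:, 0] - 0.5 * X[:, 1] + 0.1 * rng.normal(size=200)
r = np.ones(200)                         # one cluster owning all points
alpha = fit_linear_model(X, Y, r, X.mean(0), Y.mean())
print(alpha)                             # approximately [1.5, -0.5]
```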
Non-linear Function Fitting

Here, we use non-linear models with linear coefficients $\boldsymbol{\alpha}_m$ but non-linear basis functions $f_d(\cdot)$, as against the linear functions in (24):

$$f(\mathbf{x}, \boldsymbol{\alpha}_m) = \sum_{d=1}^{D} \alpha_{m,d} \cdot f_d(x_d - \mu_{m,d}) \qquad (32)$$

Now making use of (23) will lead to

$$\left\langle \{y - f(\mathbf{x}, \boldsymbol{\alpha}_m)\}\, f_l(x_l - \mu_{m,l}) \right\rangle_m = \left\langle y \cdot f_l(x_l - \mu_{m,l}) \right\rangle_m - \sum_{d=1}^{D} \alpha_{m,d} \left\langle f_l(x_l - \mu_{m,l}) \cdot f_d(x_d - \mu_{m,d}) \right\rangle_m = c_{yx_l} - \sum_{d=1}^{D} \alpha_{m,d}\, [c_{xx}]_{ld} \qquad (33)$$

Equating (33) to zero will yield the same equations as (25), so we can determine $\boldsymbol{\alpha}_m$ from (27). After estimating the functional $f(\mathbf{x}, \boldsymbol{\alpha}_m)$ and $p(c_m \mid y, \mathbf{x})$ and $p(c_m)$ for each cluster, we substitute these in (10) to get the expected output.

3. Fuzzy Models

Before showing the connection between CWM and the recently proposed Generalized Fuzzy Model (GFM), a brief discussion of the fuzzy models is presented here.

3.1. The Compositional Rule of Inference (CRI) Model

Each rule of a fuzzy system based on the CRI-model maps the fuzzy subsets in the input space $A^k \subset R^D$ to a fuzzy subset in the output space $B^k \subset R$, and has the form

$$R^k: \text{IF } x_1 \text{ is } A_1^k \wedge x_2 \text{ is } A_2^k \wedge \ldots \wedge x_D \text{ is } A_D^k \text{ THEN } y \text{ is } B^k \qquad (34)$$

with k = 1, 2, …, K, K being the number of rules. The defuzzified output of the CRI-model is given by

$$y^o = \sum_{k=1}^{K} \frac{\mu^k(\mathbf{x}^k)\, v_k}{\sum_{j=1}^{K} \mu^j(\mathbf{x}^j)\, v_j} \cdot b_k \qquad (35)$$

where $\mu^k(\mathbf{x}^k)$ is the membership function of the fuzzy set $A^k$ and $b_k$ is the centroid. Let $\phi^k(y)$ be the membership function of $B^k \subset R$ in the output space. $\phi^k(y)$ can be a convex function of any shape with area $v_k$ and centroid $b_k$ such that

$$v_k = \int_y \phi^k(y)\, dy \quad \& \quad b_k = \frac{\int_y y \cdot \phi^k(y)\, dy}{\int_y \phi^k(y)\, dy} \qquad (36)$$

In (35) we can see that $v_k$ is a weight to the firing strength of a rule before its normalization, and hence $v_k$ is defined as the index of fuzziness of the consequent membership function $B^k$.

3.2. The Takagi-Sugeno (TS) Model

Rules of the TS-model are of the following form:

$$R^k: \text{IF } \mathbf{x}^k \text{ is } A^k \text{ THEN } y \text{ is } f^k(\mathbf{x}^k) \qquad (37)$$

A linear form of $f^k(\mathbf{x}^k)$ is as follows:

$$f^k(\mathbf{x}^k) = b_{k0} + b_{k1} x_1 + \ldots + b_{kD} x_D \qquad (38.a)$$

A non-linear form of $f^k(\mathbf{x}^k)$ is as follows:

$$f^k(\mathbf{x}^k) = b_{k0} + b_{k1} f_1(x_1) + b_{k2} f_2(x_2) + \ldots + b_{kD} f_D(x_D) \qquad (38.b)$$

This form defines a locally valid model on the support of the Cartesian product of the fuzzy sets constituting the premise parts. The overall output of the TS-model is defined as

$$y^o = \sum_{k=1}^{K} \frac{\mu^k(\mathbf{x}^k)}{\sum_{j=1}^{K} \mu^j(\mathbf{x}^j)}\, f^k(\mathbf{x}^k) \qquad (39)$$
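The two defuzzification rules can be compared with a small numeric sketch (all names and values illustrative, not from the original paper):

```python
import numpy as np

def cri_output(mu, v, b):
    """CRI defuzzified output, eq. (35): centroids b_k weighted by
    firing strengths mu_k and indices of fuzziness v_k."""
    w = mu * v
    return (w * b).sum() / w.sum()

def ts_output(mu, f):
    """TS defuzzified output, eq. (39): local models f_k(x) weighted
    by normalized firing strengths."""
    return (mu * f).sum() / mu.sum()

mu = np.array([0.7, 0.3])   # rule firing strengths mu^k(x^k)
v = np.array([1.0, 2.0])    # indices of fuzziness v_k
b = np.array([0.2, 0.9])    # consequent centroids b_k
f = np.array([0.25, 0.85])  # local model outputs f^k(x^k)
print(cri_output(mu, v, b), ts_output(mu, f))
```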
3.3. The Generalized Fuzzy Model (GFM)

The CRI-model exhibits the property of fuzziness around the fixed centroid of the consequent part, while the TS-model gives a varying singleton for the consequent part in each fuzzy rule. To combine both of these properties, Azeem et al. [7] introduced a rule of the form

$$R^k: \text{IF } \mathbf{x}^k \text{ is } A^k \text{ THEN } y \text{ is } B^k(f^k(\mathbf{x}^k), v_k) \qquad (40)$$

where $B^k(f^k(\mathbf{x}^k), v_k)$ are the linguistic labels of the local linear (or non-linear) regression of inputs, $f^k(\mathbf{x}^k)$, with index of fuzziness $v_k$ describing the qualitative state of the output variable y. The defuzzified output for the GFM is given by

$$y^o = \sum_{k=1}^{K} \frac{\mu^k(\mathbf{x}^k) \cdot v_k}{\sum_{j=1}^{K} \mu^j(\mathbf{x}^j) \cdot v_j}\, f^k(\mathbf{x}^k) \qquad (41)$$

4. Equivalence of CWM and GFM

Comparing equation (10) with (41), we observe that both forms are similar. This means that the defuzzified output is the conditional output in statistical terms. Assuming that the input clusters have been determined along with the associated output, we have estimates of the prior probabilities or weights, the local models, the input variances and the input-output covariances. Using these, we can estimate the parameters of the function. In GFM, the clusters (M) correspond to the rules (K), the local models correspond to the membership functions of the inputs, and the weights $v_k$ correspond to the strengths of the rules or indices of fuzziness. If the functions (24) and (38.a) are assumed to be the same, we need to bring them to the same form. Equation (38.a) can be written as

$$f^k(\mathbf{x}^k) = b_{k0} + \sum_{d=1}^{D} b_{kd}\, x_d \qquad (42.a)$$

Equation (38.b), where $f^k(\mathbf{x}^k)$ is non-linear, can be rewritten as

$$f^k(\mathbf{x}^k) = b_{k0} + \sum_{d=1}^{D} b_{kd}\, f_d(x_d) \qquad (42.b)$$

We will now bring (24) to the form of (42.a) by the following:

$$f(\mathbf{x}, \boldsymbol{\alpha}_m) = \mu_{m,y} - \sum_{d=1}^{D} \alpha_{m,d}\, \mu_{m,d} + \sum_{d=1}^{D} \alpha_{m,d}\, x_d \qquad (43)$$

On comparing (42.a) and (43), we get two conditions:

$$b_{k0} = \mu_{m,y} - \sum_{d=1}^{D} \alpha_{m,d}\, \mu_{m,d} \qquad (44)$$

$$b_{kd} = \alpha_{m,d} \qquad (45)$$

Rewriting (32) to the form of (42.b), we have

$$f(\mathbf{x}, \boldsymbol{\alpha}_m) = \mu_{m,y} - \sum_{d=1}^{D} \alpha_{m,d} \cdot f_d(\mu_{m,d}) + \sum_{d=1}^{D} \alpha_{m,d} \cdot f_d(x_d) \qquad (46)$$

In the above equations, we have assumed that it is possible to separate the function $f_d(x_d - \mu_{m,d})$ into $f_d(x_d)$ and $f_d(\mu_{m,d})$. If this is not the case, we have a direct correspondence.

4.1. Conditions for equivalence

The conditional output $\langle y \mid \mathbf{x} \rangle$ using CWM bears a similarity with the defuzzified output $y^o$ of GFM under the following conditions:

i) The number of clusters M in CWM is equal to the number of rules K, i.e., M = K.
ii) The priors or weights $p(c_m)$ are equal to the strengths of the rules $v_k$, i.e., $p(c_m) = v_k$.
iii) The parameters of GFM are obtained from (27), (44) and (45).

The equivalence of the CRI and TS models with GFM is as follows: when the strengths of all the rules in (41) are the same, we get the TS-model output (39); when the function is constant, i.e., not a function of $\mathbf{x}$, we get the CRI-model output (35). Hence, the CRI-model is not a statistically valid model, unlike the GFM and TS models. This study has thus established the statistical relevance of fuzzy models, so we can now find new applications for them.
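Under conditions i)-iii), identifying $\mu^k(\mathbf{x}^k)$ with $p(\mathbf{x} \mid c_m)$ and $v_k$ with $p(c_m)$ makes (41) coincide with (10). A minimal numeric sketch of this identification (values are illustrative, not results from the paper):

```python
import numpy as np

def gfm_output(mu, v, f):
    """GFM defuzzified output, eq. (41)."""
    w = mu * v
    return (w * f).sum() / w.sum()

# Identify mu^k(x^k) with p(x|c_m) and v_k with p(c_m); then (41)
# reproduces the CWM conditional expectation (10).
p_x_given_c = np.array([0.05, 0.30])  # p(x|c_m) at some test input x
p_c = np.array([0.6, 0.4])            # cluster weights = rule strengths v_k
f_local = np.array([1.2, -0.7])       # f(x, alpha_m) = f^k(x^k)

cwm = (p_x_given_c * p_c * f_local).sum() / (p_x_given_c * p_c).sum()
gfm = gfm_output(p_x_given_c, p_c, f_local)
assert np.isclose(cwm, gfm)
print(cwm, gfm)
```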
5. Conclusions

An overview of Cluster-Weighted Modeling (CWM) has been presented. The maximum likelihood estimation of the parameters of the function linking the local models is derived, and a linear form is derived for this function; the non-linear form can fit any type of model. The parameters of the local models are related to the input variances and input-output covariances. The expected output of CWM is shown to be similar to the defuzzified output of the Generalized Fuzzy Model (GFM), and the conditions for their equivalence are given. Because of this equivalence, a statistical framework for the fuzzy model is established through CWM. As a consequence, it is now possible to determine the parameters of a fuzzy model by the EM algorithm, thus making the learning process much simpler. In this work, we have not touched upon clustering using the EM algorithm. Applications of the present work have also not been explored, as this work is intended mainly to provide a statistical basis for fuzzy modeling.
6. References

1. N. Gershenfeld, B. Schoner and E. Metois, "Cluster-Weighted Modeling for time-series analysis", Nature, vol. 397, 329-332, 1999.
2. N. Gershenfeld, The Nature of Mathematical Modeling, Cambridge University Press, New York, 1999.
3. N. Gershenfeld, B. Schoner and E. Metois, "Cluster-Weighted Modeling: Probabilistic Time Series Prediction, Characterization and Synthesis", in Nonlinear Dynamics and Statistics, Alistair Mees (Ed.), Birkhaeuser, Boston, 2000.
4. D. V. Prokhorov, L. A. Feldkamp, and T. M. Feldkamp, "A New Approach to Cluster-Weighted Modeling", in Proceedings of IJCNN'01, vol. 3, 1669-1674, 2001.
5. L. A. Feldkamp, D. V. Prokhorov, and T. M. Feldkamp, "Cluster-Weighted Modeling for multi-clusters", in Proceedings of IJCNN'01, vol. 3, 1710-1714, 2001.
6. M. F. Azeem, M. Hanmandlu and N. Ahmad, "Generalization of Adaptive Neuro-Fuzzy Inference Systems", IEEE Trans. Neural Networks, vol. 11, no. 6, 1332-1346, 2000.
7. M. F. Azeem, M. Hanmandlu and N. Ahmad, "Unification of CRI and TS Models", to appear in Soft Computing.
8. A. P. Dempster, N. M. Laird and D. B. Rubin, "Maximum Likelihood from Incomplete Data via the EM Algorithm", J. R. Statist. Soc. B, vol. 39, 1-38, 1977.
9. M. F. Azeem, M. Hanmandlu and N. Ahmad, "Structure Identification of Generalized Adaptive Neuro-Fuzzy Inference Systems", IEEE Trans. Fuzzy Systems, in press.
10. Ming-Tao Gan and M. Hanmandlu, "Model-Dependency of a Rule-Based Fuzzy System", communicated to IEEE Trans. Fuzzy Systems.
Appendix

Consider the integral for a single input-output pair in the mth cluster,

$$\int_{-\infty}^{+\infty} y\, p(x_1, y)\, dy = \int_{-\infty}^{+\infty} y\, \frac{p(m)}{2\pi \left|\mathbf{C}\right|^{1/2}} \exp\!\left\{-\frac{1}{2} \begin{bmatrix} x_1 - \mu_{m,1} & y - \mu_{m,y} \end{bmatrix} \mathbf{C}^{-1} \begin{bmatrix} x_1 - \mu_{m,1} \\ y - \mu_{m,y} \end{bmatrix}\right\} dy \qquad (A.1)$$

where $\mathbf{C} = \begin{bmatrix} c_{xx} & c_{xy} \\ c_{yx} & c_{yy} \end{bmatrix}$.

The above leads to the following, as proved in [10]:

$$\int_{-\infty}^{+\infty} y\, p(x_1, y)\, dy = \frac{p(m)}{\sqrt{2\pi c_{xx}}} \exp\!\left\{-\frac{1}{2}(x_1 - \mu_{m,1})\, c_{xx}^{-1}\, (x_1 - \mu_{m,1})\right\} f(x_1) \qquad (A.2)$$

where

$$f(x_1) = \mu_{m,y} + (x_1 - \mu_{m,1})\, c_{xx}^{-1}\, c_{xy} = \mu_{m,y} + (x_1 - \mu_{m,1}) \frac{c_{xy_1}}{[c_{xx}]_{1,1}}$$

Extending the above to the multi-input, single-output case leads to

$$f(\mathbf{x}) = \mu_{m,y} + \sum_{d=1}^{D} (x_d - \mu_{m,d}) \frac{c_{xy_d}}{\sigma_{m,d}^2} \qquad (A.3)$$

where $x_d$ is the dth input and $\sigma_{m,d}^2 = [c_{xx}]_{d,d}$ and $\mu_{m,d}$ are the corresponding variance and mean, respectively. Thus (A.3) verifies the form of (24), which we assumed for $f(\mathbf{x}, \boldsymbol{\alpha}_m)$. The derivation of (A.3) is given in [10].