International Conference on Mathematical and Statistical Modeling in Honor of Enrique Castillo. June 28-30, 2006
Functional Networks for Classification and Regression Problems
Beatriz Lacruz* and Ana Pérez-Palomares. Department of Statistical Methods, University of Zaragoza.
Rosa E. Pruneda Department of Applied Mathematics, University of Castilla-La Mancha.
Abstract

In this paper we use functional networks to model linear and nonlinear relations among variables. In particular, our method allows us to discover adequate transformations of the response and/or the explanatory variables in multiple linear regression. Applying this method to a heteroscedastic linear problem, we can estimate all the parameters involved in the model. Furthermore, we tackle the estimation of classification functions. The proposed approach is compared with other statistical methods for classification and regression problems, and its performance is illustrated by a simulation study and by real-life data sets.
Key Words: Functional networks, regression, heteroscedasticity, classification.
1 Introduction
Functional networks are a very useful general framework for modelling a wide range of probabilistic, statistical, mathematical and engineering problems. This paradigm was introduced by Castillo et al. (1998), and some of its applications can be found in Castillo and Gutiérrez (1998), Castillo et al. (1999), Castillo et al. (2000), Castillo et al. (2001a), Castillo et al. (2001b), Castillo et al. (2003), Iglesias et al. (2004), Pruneda et al. (2005) and Iglesias et al. (2005), among others. In this paper we apply functional network models to regression and classification problems. Let $X = (X_1, \ldots, X_k)$ be a vector of explanatory variables and let $Y$ be a response variable; we consider the following relationship among these variables:

$f(Y) = h(X_1, \ldots, X_k) + \varepsilon$.  (1.1)

*Correspondence to: Beatriz Lacruz. Department of Statistical Methods. University of Zaragoza. (Spain)
Both $Y$ and $X$ are observable variables, and $\varepsilon$ is a random error with zero mean. In regression problems, $f$ can be a known or unknown invertible function (in particular, it can be the identity function) and $h$ is an unknown function; the main goal is to estimate $h$ and/or $f$. In classification problems, $f$ is a known function, usually the identity function or the inverse of the sigmoidal function, and $h$ is an unknown function; the main goal is to estimate $h$. Then, to approximate $h$ in both regression and classification problems, we propose to use the following functional network models:

1. The Generalized Associativity model, which leads to the additive model (see Castillo et al. (1998, p. 104) for further details); that is, $h$ is approximated by a sum of arbitrary functions of each predictor:

$h(X_1, \ldots, X_k) = h_1(X_1) + h_2(X_2) + \cdots + h_k(X_k)$.  (1.2)
2. The Separable model, which considers a more general form for $h$:

$h(X_1, \ldots, X_k) = \sum_{r_1=1}^{q_1} \cdots \sum_{r_k=1}^{q_k} \beta_{r_1 \ldots r_k} \phi_{1r_1}(X_1) \cdots \phi_{kr_k}(X_k)$,  (1.3)
where $\beta_{r_1 \ldots r_k}$ are unknown parameters and the sets of functions $\Phi_j = \{\phi_{jr_j}(X_j),\ r_j = 1, 2, \ldots, q_j\}$, $j = 1, 2, \ldots, k$, are linearly independent.

Figure 1 and Figure 2 show these functional network models for approximating (1.1). Once we have chosen the functional network model, we need to estimate $f$ (if it is an unknown function) and $h_1, h_2, \ldots, h_k$ in the additive functional network model, and $\beta_{r_1 \ldots r_k}$ in the separable model. The functions $f, h_1, h_2, \ldots, h_k$ are approximated by linear combinations of linearly independent functions $\phi_{jr_j}$; that is,

$f(Y) \simeq \sum_{r_0=1}^{q_0} \alpha_{r_0} \phi_{0r_0}(Y)$,  (1.4)

$h_j(X_j) \simeq \sum_{r_j=1}^{q_j} \beta_{jr_j} \phi_{jr_j}(X_j)$, $j = 1, \ldots, k$,  (1.5)
Figure 1: Additive functional network model (the inputs $x_1, \ldots, x_k$ are transformed by $h_1, \ldots, h_k$, summed, and passed through $f^{-1}$ to give $y$).
Figure 2: Separable functional network model ($k = 2$ and $q_1 = q_2 = q$).
and the problem is reduced to estimating the parameters $\alpha_{r_0}$ and $\beta_{jr_j}$. The sets of linearly independent functions can be some of the following:

1. Polynomial family: $\Phi = \{1, x, x^2, \ldots, x^q\}$  (1.6)

2. Exponential family: $\Phi = \{1, e^x, e^{-x}, e^{2x}, e^{-2x}, \ldots, e^{qx}, e^{-qx}\}$  (1.7)
3. Fourier series family: $\Phi = \{1, \sin x, \cos x, \sin(2x), \cos(2x), \ldots, \sin(qx), \cos(qx)\}$.  (1.8)
In this paper we develop the estimation process for regression and classification problems. Section 2 is dedicated to regression problems: we apply the proposed methodology to discover transformations of the response and/or the explanatory variables and to estimate the parameters of heteroscedastic linear models. In Section 3 we apply this technique to estimate the classification function in a binary classification problem. Both applications are illustrated with examples, and the results are compared with those obtained from other statistical techniques. Finally, some concluding remarks and open research lines are detailed in Section 4.
2 Functional networks for regression problems
If we substitute (1.4) in (1.1) and choose one of the proposed functional network models, (1.2) or (1.3), to approximate $h$, we obtain:

1. The approximated additive model:

$\sum_{r_0=1}^{q_0} \alpha_{r_0} \phi_{0r_0}(Y) = \sum_{r_1=1}^{q_1} \beta_{1r_1} \phi_{1r_1}(X_1) + \cdots + \sum_{r_k=1}^{q_k} \beta_{kr_k} \phi_{kr_k}(X_k) + \varepsilon$.  (2.1)
2. The approximated separable model:

$\sum_{r_0=1}^{q_0} \alpha_{r_0} \phi_{0r_0}(Y) = \sum_{r_1=1}^{q_1} \cdots \sum_{r_k=1}^{q_k} \beta_{r_1 \ldots r_k} \phi_{1r_1}(X_1) \cdots \phi_{kr_k}(X_k) + \varepsilon$.  (2.2)
Both approximations can be unified using matrix notation, resulting in a model that is linear in the parameters:

$\alpha\tilde{Y} = \beta\tilde{X} + \varepsilon$,  (2.3)

where $\alpha$ and $\beta$ are the corresponding vectors of parameters and $\tilde{Y}$ and $\tilde{X}$ are matrices of transformed data.
We propose to apply the least squares criterion to estimate $\alpha$ and $\beta$. Note that the inclusion of parameters on both sides of (2.3) leads to an identifiability problem. Then, we have to minimize

$\min_{\alpha,\beta} (\alpha\tilde{Y} - \beta\tilde{X})^T (\alpha\tilde{Y} - \beta\tilde{X})$  (2.4)

subject to some restriction on the parameters. More details can be found in Castillo et al. (2005).

There are two direct applications of this methodology. First, it allows us to discover transformations of the response and/or the explanatory variables. Second, it allows us to estimate the expectation and the variance of a linear model with heteroscedasticity without using the residuals of an ordinary least squares regression (as other techniques do).
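As a sketch under simplifying assumptions, the constrained minimization (2.4) can be reduced to a single ordinary least squares problem by fixing one coefficient of $\alpha$. The paper's restriction $y_0^T\alpha = c_1$ plays the same normalizing role, so the helper below is illustrative, not the authors' exact procedure:

```python
import numpy as np

def fit_functional_network(Y_tilde, X_tilde, c1=1.0):
    """Minimize ||Y_tilde @ alpha - X_tilde @ beta||^2 subject to
    alpha[0] = c1, which rules out the trivial solution alpha = beta = 0."""
    q0 = Y_tilde.shape[1]
    # Move the fixed term c1 * Y_tilde[:, 0] to the right-hand side and
    # solve one OLS problem for the remaining unknowns (alpha[1:], beta).
    A = np.hstack([Y_tilde[:, 1:], -X_tilde])
    theta, *_ = np.linalg.lstsq(A, -c1 * Y_tilde[:, 0], rcond=None)
    alpha = np.concatenate([[c1], theta[:q0 - 1]])
    beta = theta[q0 - 1:]
    return alpha, beta
```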
2.1 Discovering transformations in multiple regression
Transformations of the response and/or the explanatory variables are useful to obtain linearity or normality, or to stabilize the variance in a linear regression model. The transformation of the response variable is usually determined using the Box-Cox transformation, which is stated as follows:

$y(\lambda) = \begin{cases} (y^\lambda - 1)/\lambda & \text{if } \lambda \neq 0, \\ \log y & \text{if } \lambda = 0, \end{cases}$  (2.5)

where $\lambda$ is determined by theoretical considerations (for instance, when $y$ is Poisson or binomial distributed), guessed by examining OLS residual plots, or estimated by maximum likelihood using a two-step procedure (see, for example, Atkinson (1985, p. 85) for further details). Our proposal is to apply the approximated additive model given in (2.1) to discover simultaneously possible transformations of both the response and the explanatory variables.

Example 2.1. To illustrate this procedure we present the stack loss data set, given in Table 1, which was provided by Brownlee (1965) and has been extensively studied in the regression literature, as cited in Atkinson (1985, p. 266-268). It consists of observations from 21 days of operation of a plant for the oxidation of ammonia as a stage in the production of nitric acid.
Table 1: Stack Loss Data

Case   Stack Loss Y   Air Flow X1   Temperature X2   Acid X3
  1         42             80             27            89
  2         37             80             27            88
  3         37             75             25            90
  4         28             62             24            87
  5         18             62             22            87
  6         18             62             23            87
  7         19             62             24            93
  8         20             62             24            93
  9         15             58             23            87
 10         14             58             18            80
 11         14             58             18            89
 12         13             58             17            88
 13         11             58             18            82
 14         12             58             19            93
 15          8             50             18            89
 16          7             50             18            86
 17          8             50             19            72
 18          8             50             19            79
 19          9             50             20            80
 20         15             56             20            82
 21         15             70             20            91
The objective is to predict the stack loss, which is ten times the percentage of ingoing ammonia escaping unconverted. The explanatory variables are: the rate of operation of the plant, measured by the air flow ($X_1$); the inlet temperature of the cooling water ($X_2$) circulating through coils in the counter-current absorption tower where the nitric oxides produced are absorbed; and the concentration of acid in the tower, measured as $10 \times (\text{acid concentration} - 50)$ ($X_3$). The consensus of opinion is that $X_3$ is not relevant and that there are several outliers and leverage points, in particular observation 21. The combination of the Box-Cox and Box-Tidwell approaches leads to the conclusion that some transformation of the response variable should be taken and that the linear model should contain quadratic terms in $X_2$, whether or not observation 21 is eliminated. In particular, when observation 21 is eliminated, the logarithmic transformation is suggested as the best choice (see Atkinson (1985, p. 129-136) and the references therein).
Figure 3: $\hat{f}(y)$ and the logarithmic transformation $\log y$, plotted over $y$.
We have applied the approximated additive model with the family of polynomials of degree 3 as the linearly independent functions. We have included the three explanatory variables and all the observations. After a search process for the best model based on $R^2_{adj}$, we have obtained the following model:

$9.514y - 22.949y^2 + 18.934y^3 = -4.067 + 11.458x_1 - 8.091x_2^2 + 5.355x_2 + \varepsilon$,  (2.6)

which eliminates variable $X_3$, includes the quadratic term in $x_2$ and a transformation of the response variable. Figure 3 shows that the left-hand side of (2.6) approximates the logarithmic function very well, as expected. Other real-life and simulated data sets have been used in Castillo et al. (2005).
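A sketch of this fit on the Table 1 data, reusing the illustrative fit_functional_network helper sketched after (2.4). Rescaling all variables to (0, 1) is our assumption (it matches the axis range of Figure 3), not a step the paper states explicitly:

```python
import numpy as np

# Stack loss data from Table 1, rescaled to (0, 1).
y  = np.array([42, 37, 37, 28, 18, 18, 19, 20, 15, 14, 14,
               13, 11, 12,  8,  7,  8,  8,  9, 15, 15]) / 100
x1 = np.array([80, 80, 75, 62, 62, 62, 62, 62, 58, 58, 58,
               58, 58, 58, 50, 50, 50, 50, 50, 56, 70]) / 100
x2 = np.array([27, 27, 25, 24, 22, 23, 24, 24, 23, 18, 18,
               17, 18, 19, 18, 18, 19, 19, 20, 20, 20]) / 100

poly = lambda v, q: np.column_stack([v**d for d in range(1, q + 1)])
Y_tilde = poly(y, 3)                   # candidate transformation f(y), degree 3
X_tilde = np.column_stack([np.ones_like(y), x1, poly(x2, 2)])  # form of (2.6)
alpha, beta = fit_functional_network(Y_tilde, X_tilde)  # helper sketched above
print("f(y) coefficients (up to scale):", alpha)
```

Plotting `Y_tilde @ alpha` against $\log y$ then reproduces the comparison of Figure 3 up to location and scale.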
2.2 Heteroscedastic Linear Models
A linear model with heteroscedasticity is stated as follows:

$y = X\beta + \varepsilon$,  (2.7)
where $E[\varepsilon] = 0$ and $E[\varepsilon\varepsilon^T] = \Sigma$, with $\Sigma = \mathrm{diag}(\sigma_1^2, \sigma_2^2, \ldots, \sigma_n^2)$. The BLUE estimator of the parameters is

$\hat{\beta}_{GLS} = (X'\Sigma^{-1}X)^{-1} X'\Sigma^{-1}y$,  (2.8)

which is called the generalized least squares estimator. Since the estimation of $\Sigma$ is required, some structure is imposed on the variance. It is usually assumed that $\sigma_i$ is a function of a linear combination of a set of explanatory variables given by the $s$-vector $z_i$. For example, $\sigma_i^2 = (z_i'\alpha)^2$, or $\sigma_i^2 = \exp(z_i'\alpha)$, which is the well-known multiplicative heteroscedasticity. The usual way to proceed consists of fitting the linear model (2.7) by the ordinary least squares method and then fitting a function of its residuals over $z_i$ in order to estimate $\alpha$. For example, if we assume $\sigma_i^2 = (z_i'\alpha)^2$, one method consists of fitting $|e_i| = z_i^T\alpha + v_i$; Harvey and Theil proved consistency properties for this method. Under multiplicative heteroscedasticity, the fitted model is $\log e_i^2 = z_i^T\alpha + v_i$. See Judge et al. (1980, ch. 11) for further details.

We propose to fix a general structure for the inverse of the standard deviation, that is,
$\dfrac{1}{\sigma_i} = f(z_i)$,  (2.9)

and then to approximate $f$ using the additive (1.2) or the separable (1.3) functional network model. For example, using the additive model with the polynomial family we have:

$f(z_i) \approx \alpha_0 + \sum_{r_1=1}^{q_1} \alpha_{1r_1} Z_1^{r_1} + \cdots + \sum_{r_s=1}^{q_s} \alpha_{sr_s} Z_s^{r_s}$.  (2.10)
Therefore, we have an expression

$\Sigma^{-1/2} \approx P(\alpha)$,  (2.11)

which is used to transform the linear model (2.7) into

$P(\alpha)y = P(\alpha)X\beta + \varepsilon^*$.  (2.12)
The least squares criterion leads to a system of nonlinear equations. Then, we propose to replace the problem

$\min_{\alpha,\beta} (P(\alpha)Y - P(\alpha)X\beta)^T (P(\alpha)Y - P(\alpha)X\beta)$,  (2.13)
subject to $y_0^T\alpha = c_1$, by

$\min_{\alpha,\beta^\star} (Y^\star\alpha - W\beta^\star)^T (Y^\star\alpha - W\beta^\star)$,  (2.14)

subject to $y_0^T\alpha = c_1$, where $Y^\star$ and $W$ contain powers of the variables $X$ and $Z$, respectively. Then, we substitute $\hat{\alpha}$ in (2.12) and apply OLS again to obtain $\hat{\beta}$.

Example 2.2. In order to assess the performance of the proposed method we have simulated 100 Monte Carlo samples from the model

$Y = 20 + 3X_1 + 2X_2 + \varepsilon$,  (2.15)
where $X_1$ and $X_2$ are independent and uniformly distributed, and $\varepsilon$ is normally distributed with zero expectation and independent of $X_1$ and $X_2$. We have tried several structures for the variance. We have computed the sample mean and the sample standard deviation of the coefficient estimators, and we have compared the results with those obtained by the ordinary least squares criterion ($\hat{\beta}_i$'s with high variance are expected); by Harvey and Theil's method, fitting $|e_i| = z_i^T\alpha + v_i$ (good results are expected when the structure of the standard deviation is a polynomial); and by the multiplicative heteroscedasticity method, fitting $\log e_i^2 = z_i^T\alpha + v_i$ (good results are expected when the structure of the variance is exponential). Tables 2, 3 and 4 show the results obtained when a polynomial, an exponential and a more general structure of the variance are considered. In all cases, $f$ has been approximated by a polynomial of degree 2.

Table 2: Estimators of the coefficients of model (2.15) by ordinary least squares (OLS), Harvey-Theil (H-T), multiplicative heteroscedasticity (MH) and functional networks (FN) methods when $Var[\varepsilon] = (X_1 + X_1^2)^2$.
           OLS                H-T                MH                 FN
         Mean     Std.     Mean     Std.     Mean     Std.     Mean     Std.
β̂_0    20.185   4.152   19.999   1.801   20.115   1.680   20.112   1.623
β̂_1     3.047   1.141    3.072   0.750    3.066   0.670    2.988   0.828
β̂_2     1.929   1.102    1.966   0.255    1.931   0.449    1.931   0.397
Table 3: Estimators of the coefficients of model (2.15) by ordinary least squares (OLS), Harvey-Theil (H-T), multiplicative heteroscedasticity (MH) and functional networks (FN) methods when $Var[\varepsilon] = \exp(X_1^2)$ (multiplicative heteroscedasticity).
           OLS                 H-T                MH                 FN
         Mean      Std.     Mean     Std.     Mean     Std.     Mean     Std.
β̂_0   1072.5   7280.1    15.39    380.8    12.07    147.4    19.69    1.713
β̂_1   -231.1   2896.5    5.676   22.003    9.082    85.91    3.260    2.530
β̂_2   -265.9  22645.1    2.642    9.671    0.450    40.39    2.047    0.245
Table 4: Estimators of the coefficients of model (2.15) by ordinary least squares (OLS), Harvey-Theil (H-T), multiplicative heteroscedasticity (MH) and functional networks (FN) methods when $Var[\varepsilon] = 1/(X_1 + X_1^2)^2$.
           OLS                H-T                MH                 FN
         Mean     Std.     Mean     Std.     Mean     Std.     Mean     Std.
β̂_0    19.948   0.327   20.004   0.766   19.990   0.100   19.996   0.126
β̂_1     2.992   0.210    2.993   0.298    3.004   0.063    2.998   0.077
β̂_2     2.024   0.158    1.998   0.141    2.004   0.032    2.006   0.045
Our method provides very good results in all the cases considered. In the example with an exponential variance structure we have obtained even better results than the other methods; note that in this case a strong heteroscedasticity is present. Our method presents two main advantages. First, it allows us to study different heteroscedastic structures in a unified way. Second, it estimates the covariance matrix from the data without using OLS residuals.
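To make the two-stage procedure concrete, here is a minimal simulation sketch in the spirit of Example 2.2 with the Table 2 variance structure. The uniform range and the normalization $\alpha_0 = 1$ are our assumptions (the paper's restriction $y_0^T\alpha = c_1$ serves the same purpose):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x1, x2 = rng.uniform(0.5, 1.5, n), rng.uniform(0.5, 1.5, n)
y = 20 + 3*x1 + 2*x2 + (x1 + x1**2) * rng.standard_normal(n)  # Var = (x1+x1^2)^2
X = np.column_stack([np.ones(n), x1, x2])

# Stage 1: with 1/sigma_i ~ sum_r alpha_r z_i^r (z = x1, degree 2), write
# P(alpha)y = P(alpha)X beta and replace each product alpha_r * beta_j by a
# new parameter beta*, giving the linear problem (2.14); alpha_0 = 1.
Z = np.column_stack([np.ones(n), x1, x1**2])        # z^r, r = 0, 1, 2
Ystar = Z * y[:, None]                              # columns z^r * y
W = np.hstack([Z[:, [r]] * X for r in range(3)])    # columns z^r * x_j
A = np.hstack([Ystar[:, 1:], -W])
theta, *_ = np.linalg.lstsq(A, -Ystar[:, 0], rcond=None)
alpha = np.concatenate([[1.0], theta[:2]])

# Stage 2: transform the model by P(alpha_hat) and re-fit by OLS; the
# overall scale of P(alpha) cancels out of beta_hat.
p = Z @ alpha
beta, *_ = np.linalg.lstsq(X * p[:, None], y * p, rcond=None)
print("beta_hat:", beta)  # should be close to (20, 3, 2)
```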
3 Functional networks for classification problems
We will consider the following model for a classification problem:

$Y = g(h(X_1, \ldots, X_k)) + \varepsilon$,  (3.1)

where $Y$ is a variable in $[0, 1]$, $g$ is the logistic function

$g(x) = \dfrac{1}{1 + e^{-x}}$,  (3.2)
and $\varepsilon$ is a random error with zero mean. We can approximate $h$ using the right-hand side of the approximated additive (2.1) or separable (2.2) functional network models.

Example 3.1. To evaluate the accuracy of this approximation we have simulated 200 data sets — of size 100 to estimate the classification function and of size 50 to test the estimated function — from the model $y \sim Ber(g(h(x)))$. We have considered the following classification functions: $h(X) = \log X$, $h(X_1, X_2) = \log X_1 + X_2$, $h(X_1, X_2) = \log(X_1 + X_2)$ and $h(X_1, X_2) = \sqrt{(X_1^2 + X_2^2)^3}$, with $X_1$ and $X_2$ independent and uniformly distributed variables. Note that this is a way to simulate a nonlinear logistic model. We have used the maximum likelihood criterion to estimate the parameters and the polynomial family as the set of linearly independent functions; this is equivalent to applying polynomial logistic regression with or without interactions. We have computed the goodness of approximation of the estimated function based on the least squares error, as well as the training and test error rates. Finally, the mean and the coefficient of variation (CV) of these measures have been compared with those obtained from other classification techniques: linear discriminant analysis (LD), quadratic discriminant analysis (QD), linear logistic regression (LL) and neural networks (NN; a one-layer two-neuron perceptron). Tables 5, 6, 7 and 8 show the results obtained. We conclude that the proposed method approximates the classification function better than any other method considered, including the neural network model. The training error rate is only improved by the neural network model in the second example (Table 6). However, the proposed method provides the best test error rate in all the examples. Furthermore, estimating each function of an additive model separately is only possible with the proposed method: Figure 4 shows the approximation of $h_1(x_1) = \log x_1$ when the classification function $h(x_1, x_2) = \log x_1 + x_2$ is considered.
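As a sketch of the setup of Example 3.1 for $h(X_1, X_2) = \log X_1 + X_2$, the fit amounts to polynomial logistic regression, the equivalence noted above. The uniform range and the large C used to approximate unpenalized maximum likelihood in scikit-learn are our assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
n_train, n_test = 100, 50
X = rng.uniform(0.1, 4.0, size=(n_train + n_test, 2))
h = np.log(X[:, 0]) + X[:, 1]                   # classification function
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-h)))   # y ~ Ber(g(h(x)))

# Polynomial family as the linearly independent functions; maximum
# likelihood fitting is then polynomial logistic regression.  A large C
# (weak regularization) approximates plain maximum likelihood.
basis = PolynomialFeatures(degree=3, include_bias=False)
Phi = basis.fit_transform(X)
clf = LogisticRegression(C=1e6, max_iter=5000).fit(Phi[:n_train], y[:n_train])
print("training error rate:", 1 - clf.score(Phi[:n_train], y[:n_train]))
print("test error rate:    ", 1 - clf.score(Phi[n_train:], y[n_train:]))
```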
Table 5: Training and test error rates and goodness of approximation for the estimated classification function with $h(X) = \log X$.
      Training Error Rate (%)   Test Error Rate (%)   Goodness of Approximation
        Mean       CV             Mean       CV           Mean       CV
FN      27.0      0.154           30.0      0.226         6.125     1.575
LL      30.0      0.153           30.9      0.230         30.76     0.727
LD      37.2      0.148           39.8      0.170         233.8     0.461
QD      36.7      0.137           44.5      0.193         65.33     0.463
NN      30.0      0.161           44.4      0.198         52.74     0.261
Table 6: Training and test error rates and goodness of approximation for the estimated classification function with $h(X_1, X_2) = \log X_1 + X_2$.
      Training Error Rate (%)   Test Error Rate (%)   Goodness of Approximation
        Mean       CV             Mean       CV           Mean       CV
FN      24.0      0.191           10.0      0.310         52.97     2.531
LL      20.0      0.184           16.4      0.304         70.71     0.935
LD      32.2      0.140           54.4      0.145         277.1     0.068
QD      32.4      0.156           30.9      0.260         273.1     0.906
NN      18.1      0.256           30.8      0.263         185.6     0.059
Table 7: Training and test error rates and goodness of approximation for the estimated classification function with $h(X_1, X_2) = \log(X_1 + X_2)$.
      Training Error Rate (%)   Test Error Rate (%)   Goodness of Approximation
        Mean       CV             Mean       CV           Mean       CV
FN      29.0      0.138           38.0      0.186         7.643     1.269
LL      32.6      0.134           35.6      0.186         26.97     0.807
LD      38.1      0.116           59.9      0.103         64.83     0.031
QD      35.7      0.116           51.9      0.161         49.60     0.393
NN      29.7      0.170           51.9      0.162         34.28     0.209
Table 8: Training and test error rates and goodness of approximation for the estimated classification function with $h(X_1, X_2) = \sqrt{(X_1^2 + X_2^2)^3}$.
      Training Error Rate (%)   Test Error Rate (%)   Goodness of Approximation
        Mean       CV             Mean       CV           Mean       CV
FN      32.0      0.151           38.0      0.194         1.583     1.641
LL      36.7      0.143           37.4      0.223         36.28     0.320
LD      44.9      0.084           66.5      0.090         59.9      0.004
QD      41.0      0.119           56.9      0.141         57.87     0.257
NN      33.2      0.155           56.7      0.147         21.98     0.175
Figure 4: $\hat{h}_1(x_1)$ and $\log x_1$, plotted over $x_1$ (goodness of approximation = 70.06).
Other real-life and simulated data sets have been used in Pruneda et al. (2005).
4 Concluding remarks and open lines
Functional network models allow us to represent regression and classification problems in a unified way. In regression problems, our main contribution is, on the one hand, a method for simultaneously discovering transformations of the response and/or the explanatory variables and, on the other hand, a new approach which allows us to study different heteroscedastic structures. Our method provides good approximations on real and simulated data sets. In classification problems, the proposed method is equivalent to the logistic regression technique, with the classification function approximated via linear combinations of linearly independent functions. We have shown that this technique approximates the classification functions considered better than any of the other methods examined: linear logistic regression, linear and quadratic discriminant analysis, and neural network models. Furthermore, good levels of accuracy for both training and test data sets have been obtained in all the cases. Our future work will address the probabilistic properties of the estimators in regression problems. Furthermore, we need to implement advanced search techniques, such as evolutionary algorithms, in order to select parsimonious models. Finally, other comparable methods, such as support vector machines, will be tried in the near future.
5 Acknowledgements
First of all, we want to warmly thank Professor Enrique Castillo for guiding us in our research. We are also grateful to Professor Ali Hadi, who has contributed to the development of this regression model. Thanks also to the DGI, Ministerio de Ciencia y Tecnología/FEDER (project BFM2003-05695), Junta de Comunidades de Castilla-La Mancha (project PAI05-044) and Gobierno de Aragón (grupo consolidado Stochastic Models) for partial support of this work.
References

Atkinson, A. C. (1985). Plots, Transformations and Regression. An Introduction to Graphical Methods of Diagnostic Regression Analysis. Clarendon Press, Oxford.

Brownlee, K. A. (1965). Statistical Theory and Methodology in Science and Engineering. Wiley, New York, 2nd ed.

Castillo, E., Cobo, A., Gutiérrez, J. M., and Pruneda, E. (1998). An Introduction to Functional Networks with Applications. Kluwer Academic Publishers, New York.

Castillo, E., Cobo, A., Gutiérrez, J. M., and Pruneda, E. (1999). Working with differential, functional and difference equations using functional networks. Applied Mathematical Modeling, 23:89-107.

Castillo, E., Cobo, A., Gutiérrez, J. M., and Pruneda, E. (2000). Functional networks: A new network-based methodology. Computer-Aided Civil and Infrastructure Engineering, 15(2):90-106.

Castillo, E. and Gutiérrez, J. M. (1998). Nonlinear time series modeling and prediction using functional networks. Extracting information masked by chaos. Physics Letters A, 244:71-84.

Castillo, E., Gutiérrez, J. M., Hadi, A. S., and Lacruz, B. (2001a). Some applications of functional networks in statistics and engineering. Technometrics, 13:395-400.

Castillo, E., Hadi, A. S., and Lacruz, B. (2001b). Optimal transformations in multiple linear regression using functional networks. Lecture Notes in Computer Science, 2084(Part I):316-324.

Castillo, E., Hadi, A. S., Lacruz, B., and Pruneda, R. E. (2003). Functional network models in statistics. Monografías del Seminario Matemático García Galdeano, 27:177-184.

Castillo, E., Hadi, A. S., Lacruz, B., and Pruneda, R. E. (2005). Semi-parametric nonlinear regression and transformation using functional networks. Submitted for publication.
Iglesias, A., Arcaya, B., Cotos, J. M., and Varela, J. (2005). Optimisation of fishing predictions by means of artificial neural networks, ANFIS, functional networks and remote sensing images. Expert Systems with Applications, 29(2):356-363.

Iglesias, A., Echevarría, G., and Gálvez, A. (2004). Functional networks for B-spline surface reconstruction. Future Generation Computer Systems, 20:1337-1353.

Judge, G. G., Griffiths, W. E., Hill, R. C., Lütkepohl, H., and Lee, T. C. (1980). The Theory and Practice of Econometrics. John Wiley and Sons.

Pruneda, R. E., Lacruz, B., and Solares, C. (2005). A first approach to solve classification problems based on functional networks. Lecture Notes in Computer Science, 3697:313-318.