International Journal of Basic & Applied Sciences IJBAS-IJENS Vol: 11 No: 02
Infants' Weight Growth Model in Surabaya (Indonesia) By Using Weighted Spline Regression
I Nyoman Budiantara1, Jerry Dwi Trijoyo Purnomo2
1,2 Statistics Department, Faculty of Mathematics and Natural Science, Institut Teknologi Sepuluh Nopember, Surabaya, Indonesia 60111
Abstract--
Spline regression is a nonparametric regression model that has become a center of attention in statistical modeling. Splines have several advantages: they are flexible, have a good statistical interpretation, adapt to data that follow no particular pattern, and are very useful for modeling data whose pattern changes across sub-intervals. The relationship between age and weight of infants in Surabaya (Indonesia) tends to change at certain ages. Before 4 months of age, infants' weight rises very quickly. From 4 to 14 months, weight continues to increase, but not as fast as before 4 months. After 14 months of age, weight rises slowly. Infants' weight growth in Surabaya also has unequal variance. The older the infants, the heavier they are. Based on this pattern, a weighted spline model is designed to model the relationship between infants' weight and age in the city of Surabaya. The weighted spline model obtained fits the data very well, with a coefficient of determination of 99.56%.
Index Terms--
Nonparametric Regression, Weighted Spline, Infants, Surabaya, Indonesia.
I. INTRODUCTION
Consider paired data $(t_j, y_j)$ that follow the nonparametric regression model $y_j = f(t_j) + \varepsilon_j$, $j = 1, 2, \dots, n$. The function $f$ is the regression curve and $\varepsilon_j$ is a random error, assumed independent and normally distributed with mean zero and variance $\sigma^2$. If the data tend to follow a linear, quadratic or cubic pattern, the suitable approach is parametric regression of linear, quadratic or cubic form ([9],[16]). In practice, however, the form of the relationship between the predictor variable $t_j$ and the response variable $y_j$ is often unknown. If a parametric regression model is nevertheless imposed on such data, the conclusions can be misleading. Nonparametric regression is suitable for data whose regression curve is unknown, or for which information about the shape of the data is incomplete ([3], [4]). Nonparametric regression models that often receive attention from researchers are kernel ([19], [23]), spline ([3], [8], [9], [17], [18], [21], [22], [25], [30], [32]), Fourier series [1] and wavelets [2]. The nonparametric approach is highly flexible, because the data are expected to find their own form of regression curve estimate without being influenced by the subjective judgment of the researcher [16].
In many cases the response variable has a linear relationship with one predictor variable but an unknown relationship with another. In these circumstances, Wahba [30] and Budiantara and Subanar [6] suggest a semiparametric regression approach. Suppose the predictor variables are $t_j$ and $x_j = (x_{1j}, \dots, x_{pj})'$, and the response variable is $y_j$, $j = 1, 2, \dots, n$. The data $(t_j, x_j, y_j)$ are assumed to follow the semiparametric regression model $y_j = g(x_j, t_j) + \varepsilon_j$, where the semiparametric function is $g(x_j, t_j) = x_j'\beta + f(t_j)$, $j = 1, 2, \dots, n$. Here $f(t_j)$ is the nonparametric component of the regression curve, $\beta = (\beta_1, \dots, \beta_p)' \in R^p$ is the vector of unknown parameters, $x_j'\beta$ is the parametric component, and $\varepsilon_j$ is a random error, assumed independent with zero mean and variance $\sigma^2$. Several researchers, such as Budiantara [5], Chen and Shiau [11], Engle et al. [15] and He and Shi [20], have developed partial spline estimators for the semiparametric regression curve $g(x_j, t_j)$, while Kayri and Zirhhoglu [23] and Speckman [29] used a kernel approach to estimate it.

Among the nonparametric regression models above, the spline has a good statistical and visual interpretation. The spline model is obtained by optimizing Penalized Least Squares (PLS) and is highly flexible [30]. The spline also handles smooth data and functions well, and is excellent at handling data whose behavior changes within certain sub-intervals [9]. Infants grow very quickly from birth until about one year of age, after which growth slows with increasing age. The spline approach is therefore an appropriate model for this kind of data pattern.

Several researchers have investigated growth curves for children under five. Cooper [13] states that the data used to estimate the growth curve at the National Center for Health Statistics (NCHS) in Australia cover children aged 0-12 months at intervals of 3 months. Ooki [27] used the spline approach to identify whether growth differed between twins aged 0-6 years living in various regions of Japan and twins living in the Tokyo metropolitan area. The growth curves of weight against age and of height against age did not differ between the two groups. Chen [10] proposed a parametric polynomial quantile regression approach to design growth cards (KMS). The growth curve obtained by Chen is not preferred because it is too smooth, so it cannot represent local data patterns or patterns that change very sharply.

Researchers who have worked intensively on splines include Wu and Zhang [33] and Diggle et al. [14] for spline models of longitudinal data, Wahba [30] for the original spline model, Cox and O'Sullivan [12] for M-type spline models, Oehlert [26] for the relaxed spline model, and Budiantara and Subanar [6] for the partial spline, among others. On closer inspection, the spline models developed by these researchers rest on three very specific assumptions: (i) the spline estimator is obtained by solving a PLS (Penalized Least Squares) optimization; (ii) the spline basis used is very specific; and (iii) statistical inference requires the strong assumption that the variance of the random error is equal for every observation (homoscedasticity). Budiantara et al. [8] developed a heteroscedastic nonparametric spline regression model, but only for longitudinal data. In many practical cases, such as the growth pattern of infants, it is very difficult to expect a spline nonparametric regression model for cross-section data to be homoscedastic. In the infants' growth model for Surabaya, the variance is unequal (heteroscedastic). Therefore, this paper develops a weighted spline nonparametric regression model to handle the heteroscedasticity problem.

1110902-0808 IJBAS-IJENS © April 2011 IJENS
II. PROCEDURE
A. Weighted Spline Regression
Consider the heteroscedastic nonparametric regression model

$$ y_i = f(t_i) + \sigma_i \varepsilon_i, \quad t_i \in [a,b], \quad i = 1, 2, \dots, n. \qquad (1) $$

The random errors $\varepsilon_i$ are assumed independent with $E(\varepsilon_i) = 0$ and $E(\varepsilon_i \varepsilon_j) = \delta_{ij}$ (Kronecker delta). The regression curve $f$ is assumed to lie in the Sobolev space $W_2^m[a,b]$ ([25],[30]). The estimate of the nonparametric regression curve $f$ is obtained by optimizing the Weighted Penalized Least Squares (WPLS) criterion:

$$ \min_{f \in W_2^m[a,b]} \left\{ n^{-1} \sum_{i=1}^{n} \sigma_i^{-2} \left( y_i - f(t_i) \right)^2 + \lambda \int_a^b \left( f^{(m)}(t) \right)^2 dt \right\}, \qquad (2) $$

where $\lambda > 0$ is the smoothing parameter that controls the trade-off between goodness of fit and smoothness of the function. Wahba [30] and Budiantara and Purnomo [9] show that the solution of the WPLS optimization, the weighted smoothing spline estimator $\hat f_\lambda(t)$, is a linear estimator in the observations and can be written as

$$ \hat f_\lambda(t) = \sum_{i=1}^{n} b_i(\lambda)\, y_i \qquad (3) $$

for coefficients $b_i(\lambda)$. Alternatively, one can work in a spline space and use the truncated form of the spline basis:

$$ 1,\; t,\; t^2,\; \dots,\; t^p,\; (t-\xi_1)_+^p,\; (t-\xi_2)_+^p,\; \dots,\; (t-\xi_m)_+^p, \qquad (4) $$

with the truncated function given by

$$ (t_i-\xi_j)_+^p = \begin{cases} (t_i-\xi_j)^p, & t_i \ge \xi_j \\ 0, & t_i < \xi_j. \end{cases} $$

Every function $f$ in the spline space can be written as

$$ f(t) = \sum_{j=0}^{p} \beta_j t^j + \sum_{k=1}^{m} \beta_{k+p} (t-\xi_k)_+^p. \qquad (5) $$

If the heteroscedastic nonparametric regression curve in equation (1) is approximated by a spline function of degree $p$ with knot points $\{\xi_1, \xi_2, \dots, \xi_m\}$ as in equation (5), the spline regression model is obtained:

$$ y_i = \sum_{j=0}^{p} \beta_j t_i^j + \sum_{k=1}^{m} \beta_{k+p} (t_i-\xi_k)_+^p + \sigma_i \varepsilon_i, \qquad (6) $$

where $t_i \in [a,b]$, $i = 1, 2, \dots, n$. The heteroscedastic spline regression model (6) can be expressed in matrix form:

$$ y = T(\xi)\beta + \varepsilon, \quad \xi = (\xi_1, \xi_2, \dots, \xi_m)', \qquad (7) $$

with the vectors $y$, $\beta$, $\varepsilon$ and the matrix $T(\xi)$ defined as

$$ y = (y_1, y_2, \dots, y_n)', \quad \beta = (\beta_0, \beta_1, \dots, \beta_p, \beta_{p+1}, \dots, \beta_{p+m})', \quad \varepsilon = (\sigma_1\varepsilon_1, \sigma_2\varepsilon_2, \dots, \sigma_n\varepsilon_n)', $$

$$ T(\xi) = \begin{pmatrix}
1 & t_1 & \cdots & t_1^p & (t_1-\xi_1)_+^p & (t_1-\xi_2)_+^p & \cdots & (t_1-\xi_m)_+^p \\
1 & t_2 & \cdots & t_2^p & (t_2-\xi_1)_+^p & (t_2-\xi_2)_+^p & \cdots & (t_2-\xi_m)_+^p \\
\vdots & \vdots & & \vdots & \vdots & \vdots & & \vdots \\
1 & t_n & \cdots & t_n^p & (t_n-\xi_1)_+^p & (t_n-\xi_2)_+^p & \cdots & (t_n-\xi_m)_+^p
\end{pmatrix}. $$

The random error vector $\varepsilon$ in the heteroscedastic spline regression model (7) has independent components with $E(\varepsilon) = 0$ and variance-covariance matrix $W = \mathrm{diag}(\sigma_1^2, \sigma_2^2, \dots, \sigma_n^2)$. The estimator of the regression curve $f$ is obtained through the parameter estimator $\hat\beta(\xi)$, which is found by the Weighted Least Squares (WLS) method from the optimization

$$ \min_{\beta \in R^{p+m+1}} \varepsilon' W^{-1} \varepsilon = \min_{\beta \in R^{p+m+1}} \left( y - T(\xi)\beta \right)' W^{-1} \left( y - T(\xi)\beta \right). \qquad (8) $$

Setting the partial derivatives with respect to $\beta$ to zero gives the estimator

$$ \hat\beta(\xi) = \left( T'(\xi) W^{-1} T(\xi) \right)^{-1} T'(\xi) W^{-1} y. \qquad (9) $$
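As a rough illustration of the truncated spline basis and the WLS estimator, the sketch below builds the design matrix $T(\xi)$ and computes the coefficients with NumPy. The function names, the noise-free example data and the variance function are assumptions made for this example and do not come from the paper. The coefficients are obtained from a least-squares solve on rows rescaled by $1/\sigma_i$, which is algebraically the same minimization as the WLS criterion but avoids forming a matrix inverse explicitly.

```python
import numpy as np


def truncated_power_basis(t, knots, degree):
    """Return T(xi) with columns 1, t, ..., t^p, (t - xi_1)_+^p, ..., (t - xi_m)_+^p.

    Assumes degree >= 1.
    """
    t = np.asarray(t, dtype=float)
    knots = np.asarray(knots, dtype=float)
    poly = np.vander(t, degree + 1, increasing=True)              # 1, t, ..., t^p
    trunc = np.clip(t[:, None] - knots[None, :], 0.0, None) ** degree
    return np.hstack([poly, trunc])


def wls_spline_coefficients(t, y, knots, degree, variances):
    """Minimise (y - T b)' W^{-1} (y - T b) with W = diag(variances)."""
    T = truncated_power_basis(t, knots, degree)
    scale = 1.0 / np.sqrt(np.asarray(variances, dtype=float))     # diagonal of W^{-1/2}
    beta_hat, *_ = np.linalg.lstsq(T * scale[:, None],
                                   np.asarray(y, dtype=float) * scale,
                                   rcond=None)
    return beta_hat, T


# Hypothetical, noise-free example data (not the Surabaya data).
t = np.linspace(0.0, 24.0, 121)
beta_true = np.array([2.0, 1.5, -0.1, 0.08, 0.02])
y = truncated_power_basis(t, [4.0, 14.0], 2) @ beta_true
variances = 0.5 + 0.1 * t                                         # assumed variance function

beta_hat, T = wls_spline_coefficients(t, y, [4.0, 14.0], 2, variances)
fitted = T @ beta_hat
```

Because the example data are noise-free and generated from the same basis, `beta_hat` should reproduce `beta_true` up to rounding error, whatever positive variances are supplied.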
The estimator of the regression curve is then defined as

$$ \hat f(t) = T(\xi)\hat\beta(\xi) = T(\xi)\left( T'(\xi) W^{-1} T(\xi) \right)^{-1} T'(\xi) W^{-1} y = A(\xi)\, y, $$

with the matrix $A(\xi) = T(\xi)\left( T'(\xi) W^{-1} T(\xi) \right)^{-1} T'(\xi) W^{-1}$. It can be inferred from equation (9) that the estimator $\hat\beta(\xi)$ is a linear estimator in the observations. It is also an unbiased estimator of $\beta$:

$$ E[\hat\beta(\xi)] = E\left[ \left( T'(\xi) W^{-1} T(\xi) \right)^{-1} T'(\xi) W^{-1} y \right] = \left( T'(\xi) W^{-1} T(\xi) \right)^{-1} T'(\xi) W^{-1} E[y] = \left( T'(\xi) W^{-1} T(\xi) \right)^{-1} T'(\xi) W^{-1} T(\xi)\beta = \beta. $$

The weighted spline estimator of the regression curve is strongly influenced by the knot points $\xi = (\xi_1, \xi_2, \dots, \xi_m)'$. The best spline estimator is the one with optimal knots, which are obtained by the Generalized Cross Validation (GCV) method ([7], [30], [31]).

B. Infants' Growth Chart Model in Surabaya
This section gives an application of nonparametric regression models, specifically the weighted spline. According to the National Family Planning Coordinating Board (BKKBN) in East Java Province (Indonesia), the dominant variable affecting the weight of infants under five is age. In general, the growth of infants is not constant; the growth pattern changes at certain ages. From birth until the age of 6 months babies grow rapidly, whereas after 6 months growth is slower. This general pattern is also seen in children under five in the city of Surabaya. A plot of weight ($y$), in kilograms, against age ($t$), in months, for children under five in Surabaya is given in Figure 1.

To model the relationship between age and weight of these infants, a truncated polynomial spline is used:

$$ f(t) = \sum_{j=0}^{p} \beta_j t^j + \sum_{k=1}^{m} \beta_{k+p} (t-\xi_k)_+^p, $$

for various values of $p$, the degree of the spline, and of $m$, the number of knot points. The optimal knot points of the spline model are selected with the Generalized Cross Validation (GCV) method. The GCV function is given by

$$ GCV(\xi_1, \dots, \xi_m) = \frac{ n^{-1} \left[ y - \hat f_{\xi_1, \dots, \xi_m}(t) \right]' \left[ y - \hat f_{\xi_1, \dots, \xi_m}(t) \right] }{ \left[ n^{-1}\, \mathrm{trace}\left( I - A(\xi_1, \dots, \xi_m) \right) \right]^2 }. $$

The optimal knot points are obtained from the optimization ([7], [30], [31]):

$$ \min_{\xi_1 \in R, \dots, \xi_m \in R} GCV(\xi_1, \dots, \xi_m) = \min_{\xi_1 \in R, \dots, \xi_m \in R} \frac{ n^{-1} \left[ y - \hat f_{\xi_1, \dots, \xi_m}(t) \right]' \left[ y - \hat f_{\xi_1, \dots, \xi_m}(t) \right] }{ \left[ n^{-1}\, \mathrm{trace}\left( I - A(\xi_1, \dots, \xi_m) \right) \right]^2 }. $$
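The GCV criterion and the knot search can be sketched as follows. This is a minimal illustration, assuming a finite grid of candidate knots searched exhaustively; the paper does not state how its search was carried out, and the function names, candidate grid, variance function and noise-free data are all hypothetical. The hat matrix $A(\xi)$ is built from a QR factorisation of the rescaled design matrix, which equals $T(T'W^{-1}T)^{-1}T'W^{-1}$ when $T$ has full column rank.

```python
import itertools
import numpy as np


def truncated_power_basis(t, knots, degree):
    """Columns 1, t, ..., t^p, (t - xi_1)_+^p, ..., (t - xi_m)_+^p (degree >= 1)."""
    t = np.asarray(t, dtype=float)
    knots = np.asarray(knots, dtype=float)
    poly = np.vander(t, degree + 1, increasing=True)
    trunc = np.clip(t[:, None] - knots[None, :], 0.0, None) ** degree
    return np.hstack([poly, trunc])


def hat_matrix(t, knots, degree, variances):
    """A(xi) = T (T' W^{-1} T)^{-1} T' W^{-1}, with W = diag(variances)."""
    T = truncated_power_basis(t, knots, degree)
    scale = 1.0 / np.sqrt(np.asarray(variances, dtype=float))   # diagonal of W^{-1/2}
    Q, _ = np.linalg.qr(T * scale[:, None])                     # W^{-1/2} T = Q R
    return (Q / scale[:, None]) @ (Q.T * scale[None, :])        # W^{1/2} Q Q' W^{-1/2}


def gcv(t, y, knots, degree, variances):
    """GCV = n^{-1} ||y - A y||^2 / (n^{-1} trace(I - A))^2."""
    n = len(y)
    A = hat_matrix(t, knots, degree, variances)
    resid = y - A @ y
    return (resid @ resid / n) / (np.trace(np.eye(n) - A) / n) ** 2


def best_knots(t, y, candidates, n_knots, degree, variances):
    """Exhaustive search over all knot combinations from a candidate grid."""
    combos = itertools.combinations(candidates, n_knots)
    return min(combos, key=lambda k: gcv(t, y, k, degree, variances))


# Hypothetical noise-free data generated with knots at 4 and 14.
t = np.linspace(0.0, 24.0, 97)
variances = 0.5 + 0.1 * t
y = truncated_power_basis(t, [4.0, 14.0], 2) @ np.array([2.0, 1.5, -0.1, 0.08, 0.02])
candidates = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
selected = best_knots(t, y, candidates, 2, 2, variances)
trace_A = np.trace(hat_matrix(t, [4.0, 14.0], 2, variances))
```

With real, noisy data the minimum is generally not unique to within rounding, which is why several models in a comparison can share the same GCV value to three decimals.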
Visually, the weights of infants in the city of Surabaya show unequal variance: as age increases, the variance also grows. The Glejser test for unequal variance, at a significance level of 5%, confirms that heteroscedasticity is present. The appropriate spline model in this case is therefore the weighted spline. The first step is to determine the optimal weights, which are obtained from a linear regression of the variance of weight on age. This gives the variance estimate $\hat\sigma_i^2 = 0.38 + 0.084\, t_i$. Using this estimate, the age and weight data of infants under five are modeled with weighted spline models (linear, quadratic and cubic) with one, two and three knot points, and the best weighted spline model is selected with the GCV method. Table I summarizes the selection of optimal knots in the weighted spline models and the corresponding GCV values.
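The weighting step can be sketched as below. The variance function uses the coefficients 0.38 and 0.084 reported above; everything else is an assumption for illustration, in particular the reading that the sample variance of weight at each distinct age is regressed on age, the simulated data, and the function names.

```python
import numpy as np


def estimated_variance(age_months, intercept=0.38, slope=0.084):
    """Variance function reported in the text: sigma_i^2 = 0.38 + 0.084 * t_i."""
    return intercept + slope * np.asarray(age_months, dtype=float)


def fit_variance_line(ages, weights_kg):
    """Regress the sample variance of weight at each distinct age on age.

    This is one possible reading of the procedure, not necessarily the authors'.
    """
    ages = np.asarray(ages, dtype=float)
    weights_kg = np.asarray(weights_kg, dtype=float)
    distinct = np.unique(ages)
    sample_var = np.array([weights_kg[ages == a].var(ddof=1) for a in distinct])
    slope, intercept = np.polyfit(distinct, sample_var, deg=1)
    return intercept, slope


# Hypothetical data: 50 simulated infants at each age from 0 to 60 months.
rng = np.random.default_rng(0)
ages = np.repeat(np.arange(0, 61), 50)
simulated = 3.0 + 0.2 * ages + rng.normal(0.0, np.sqrt(estimated_variance(ages)))
intercept_hat, slope_hat = fit_variance_line(ages, simulated)

# WLS weights are the reciprocals of the estimated variances (diagonal of W^{-1}).
wls_weights = 1.0 / estimated_variance(ages)
```

Under the reported variance function, an observation at birth receives weight 1/0.38 while one at 60 months receives 1/5.42, so older infants influence the fit less per observation.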
Fig. 1. Infants' weight in Surabaya (weight against age).

Figure 1 shows clearly visible changes in the growth pattern of infants in the city of Surabaya at particular age intervals, so a model that can change its behavior between sub-intervals is needed to describe the relationship between age and weight.

TABLE I
SUMMARY OF KNOT POINT SELECTION IN WEIGHTED SPLINE MODELS USING GCV

                 Quadratic Spline             Cubic Spline
                 Knot points    GCV           Knot points    GCV
One Knot         8              0.055         8              0.049
Two Knots        4; 14          0.048         5; 15          0.048
Three Knots      2; 8; 23       0.049         3; 8; 23       0.048

Table I shows that, among the weighted spline models considered (one, two and three knot points), three models share the smallest GCV value of 0.048: the weighted quadratic spline with two knots (t = 4 and t = 14), the weighted cubic spline with two knots (t = 5 and t = 15), and the weighted cubic spline with three knots (t = 3, t = 8 and t = 23). The most parsimonious (simplest) of these is selected, namely the weighted quadratic spline with two knot points, located at 4 months and 14 months of age. The weighted spline model is given by

$$ \hat f(t) = 2.22 + 1.69t - 0.16t^2 + 0.15(t-4)_+^2 + 0.01(t-14)_+^2, $$

or, written piecewise,

$$ \hat f(t) = \begin{cases}
2.22 + 1.69t - 0.16t^2, & 0 \le t < 4 \\
4.64 + 0.48t - 0.01t^2, & 4 \le t < 14 \\
6.79 + 0.17t \pm 0.001t^2, & t \ge 14.
\end{cases} $$

In the last piece the quadratic coefficient is close to zero, since the rounded estimates give $-0.16 + 0.15 + 0.01 = 0.00$; only its magnitude, 0.001, can be read from the source.
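The link between the truncated form and the piecewise form of the fitted model can be checked numerically. The sketch below uses the three-decimal estimates reported in Table II (2.222, 1.688, -0.162, 0.151, 0.011) and the knots at 4 and 14 months; the function names are hypothetical. Each truncated term $b(t-k)_+^2$ adds $bk^2 - 2bkt + bt^2$ to the polynomial once $t \ge k$.

```python
import numpy as np

# Estimates from Table II: intercept, age, age^2, (age-4)_+^2, (age-14)_+^2.
BETA = np.array([2.222, 1.688, -0.162, 0.151, 0.011])
KNOTS = (4.0, 14.0)


def fitted_weight(t):
    """Evaluate the fitted weighted quadratic spline in truncated form."""
    t = np.asarray(t, dtype=float)
    value = BETA[0] + BETA[1] * t + BETA[2] * t ** 2
    for b, k in zip(BETA[3:], KNOTS):
        value = value + b * np.clip(t - k, 0.0, None) ** 2
    return value


def piecewise_coefficients():
    """Return (constant, linear, quadratic) coefficients on each segment."""
    current = BETA[:3].copy()
    segments = [current.copy()]
    for b, k in zip(BETA[3:], KNOTS):
        # b (t - k)^2 = b k^2 - 2 b k t + b t^2
        current = current + np.array([b * k * k, -2.0 * b * k, b])
        segments.append(current.copy())
    return segments


segments = piecewise_coefficients()
ages = np.array([2.0, 9.0, 30.0])          # one age inside each segment
from_truncated = fitted_weight(ages)
from_pieces = np.array([np.polyval(seg[::-1], a) for seg, a in zip(segments, ages)])
```

With these rounded inputs the middle segment comes out as roughly 4.638 + 0.480t - 0.011t^2, matching the piecewise form in the text, while the quadratic coefficient of the last segment is zero to three decimals.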
Furthermore, hypothesis tests for the parameters of the weighted quadratic spline model with two knots are carried out at a significance level of 5%. The hypotheses are:

$H_0: \beta_j = 0$
$H_1: \beta_j \ne 0$, for $j = 0, 1, 2, 3, 4$.

TABLE II
HYPOTHESIS TEST FOR THE PARAMETERS IN THE MODEL OF WEIGHTED QUADRATIC SPLINE

Parameter                              Parameter Estimation    t_h       Conclusion
$\beta_0$ (intercept)                  2.222                   24.27     significant
$\beta_1$ (age)                        1.688                   28.08     significant
$\beta_2$ (age$^2$)                    -0.162                  -19.43    significant
$\beta_3$ ((age $- \xi_1)_+^2$)        0.151                   17.30     significant
$\beta_4$ ((age $- \xi_2)_+^2$)        0.011                   17.83     significant
Note: t(0.025; 596) = 1.96

It can be concluded from Table II that $H_0$ is rejected for every parameter, since each $|t_h|$ exceeds 1.96. Thus all of the parameters in the weighted spline model are significant. The weighted quadratic spline model with two knots, at t = 4 months and t = 14 months of age, is shown in Figure 2. The model has a coefficient of determination ($R^2$) of 99.56%. This $R^2$ value indicates that the weighted quadratic spline model with two knot points is an appropriate model for describing the relationship between age and weight of infants in Surabaya.

Fig. 2. Weighted quadratic spline model with two knot points, t = 4 and t = 14 (weight against age).

III. CONCLUSION
Given the heteroscedastic nonparametric regression model

$$ y_i = f(t_i) + \sigma_i \varepsilon_i, \quad t_i \in [a,b], \quad i = 1, 2, \dots, n, $$

the random errors $\varepsilon_i$ are independent with $E(\varepsilon_i) = 0$ and $E(\varepsilon_i \varepsilon_j) = \delta_{ij}$ (Kronecker delta). The heteroscedastic nonparametric regression curve $f$ is approximated by a spline function of degree $p$ with knot points $\{\xi_1, \xi_2, \dots, \xi_m\}$:

$$ f(t) = \sum_{j=0}^{p} \beta_j t^j + \sum_{k=1}^{m} \beta_{k+p} (t-\xi_k)_+^p. $$

The estimator of the weighted spline regression curve is obtained from the WLS optimization:

$$ \min_{\beta \in R^{p+m+1}} \varepsilon' W^{-1} \varepsilon = \min_{\beta \in R^{p+m+1}} \left( y - T(\xi)\beta \right)' W^{-1} \left( y - T(\xi)\beta \right). $$

The following conclusions are obtained:
1. The weighted spline estimator can be expressed in the form $\hat f(t) = T(\xi)\hat\beta(\xi) = T(\xi)\left( T'(\xi) W^{-1} T(\xi) \right)^{-1} T'(\xi) W^{-1} y = A(\xi)\, y$ for a matrix $A(\xi)$. The weighted spline estimator is a linear estimator in the observations.
2. The growth of infants' weight in the city of Surabaya has a specific pattern that changes at certain ages. Before the age of 4 months, infants' weight grows rapidly. From 4 months to 14 months, growth is not as fast as before. After the age of 14 months, weight grows more slowly than below 14 months.
3. The pattern of infants' weight growth in Surabaya (Indonesia) has unequal variance. The older the infants, the heavier they are.
4. The pattern of infants' weight growth in Surabaya (Indonesia) is modeled using a weighted quadratic spline with two knot points (t = 4 and t = 14). This spline model is appropriate for these data and has a coefficient of determination $R^2$ = 99.56%.
As a development of the weighted spline technique, especially for modeling infants' growth, future research has a good opportunity to create infants' growth cards using a quantile spline approach based on the weighted spline.

ACKNOWLEDGEMENTS
This paper is part of the research funded by the Professor Grant of the Institute of Technology (ITS), Surabaya, Indonesia, in 2010. The authors thank the institute for the opportunity and the funding given to this research.
V. REFERENCES
[1]. A. Antoniadis, G. Gregoire, and I. W. McKeague. Wavelet Methods for Curve Estimation, Journal of the American Statistical Association, 89, 1340-1353, 1994.
[2]. A. Antoniadis, J. Bigot, and T. Sapatinas. Wavelet Estimators in Nonparametric Regression: A Comparative Simulation Study, Journal of Statistical Software, 6, 1-83, 2001.
[3]. H. Becher, G. Kauermann, P. Khomski, and B. Kouyate. Using Penalized Splines to Model Age and Season of Birth Dependent Effects of Childhood Mortality Risk Factors in Rural Burkina Faso, Biometrical Journal, 51, 110-122, 2009.
[4]. I. N. Budiantara, Subanar, and Z. Soejoeti. Weighted Spline Estimator, Bulletin of the International Statistical Institute, 51, 333-334, 1997.
[5]. I. N. Budiantara. Estimator Spline Terbobot Dalam Regresi Semiparametrik, Majalah Ilmu Pengetahuan dan Teknologi (IPTEK), 10, 103-109, 1999.
[6]. I. N. Budiantara, and Subanar. Weighted Spline Estimator in Partially Linear Models, Proc. International Conference on Mathematics and Its Applications 1999, Yogyakarta, Indonesia, 1999.
[7]. I. N. Budiantara. Metode U, GML, CV dan GCV Dalam Regresi Nonparametrik Spline, Majalah Ilmiah Himpunan Matematika Indonesia (MIHMI), 6, 41-45, 2000.
[8]. I. N. Budiantara, B. Lestari, and A. Islamiyati. Weighted Spline Estimator in Heteroscedastic Nonparametric Regression for Longitudinal Data, IndoMS International Conference on Mathematics and Its Applications 2009 (IICMA 2009), Gadjah Mada University, October 12-13, 2009.
[9]. I. N. Budiantara, and J. D. T. Purnomo. Kartu Menuju Sehat (KMS) of Babies in The Province of East Java by Using Weighted Spline Approach, International Conference, Institut Teknologi Bandung, Indonesia, 2009.
[10]. C. Chen. Growth Charts of Body Mass Index (BMI) with Quantile Regression, 2002. Downloaded Dec. 12, 2006.
[11]. H. Chen, and J. J. H. Shiau. Data Driven Efficient Estimators for a Partially Linear Model, The Annals of Statistics, 22, 211-237, 1994.
[12]. D. D. Cox, and F. O'Sullivan. Penalized Type Estimator for Generalized Nonparametric Regression, 1983, Journal of Multivariate Analysis, 56, 185-206, 1996.
[13]. E. Cooper. Analysis of Growth Data for Breastfed Infants and its Relevance to Breastfeeding, The University of Wollongong, Australia, 2003.
[14]. P. J. Diggle, P. Heagerty, K. Y. Liang, and S. L. Zeger. Analysis of Longitudinal Data, Oxford University Press, Oxford, 2002.
[15]. R. F. Engle, C. W. J. Granger, J. Rice, and A. Weiss. Semiparametric Estimates of the Relation Between Weather and Electricity Sales, Journal of the American Statistical Association, 81, 310-320, 1986.
[16]. R. L. Eubank. Spline Smoothing and Nonparametric Regression, Marcel Dekker, New York, 1988.
[17]. P. J. Green, and B. W. Silverman. Nonparametric Regression and Generalized Linear Models, Chapman & Hall, London, 1994.
[18]. C. Gu. Multivariate Spline Regression, in M. G. Schimek (ed.), Smoothing and Regression: Approaches, Computation and Application, New York, 2000.
[19]. W. Hardle. Applied Nonparametric Regression, Cambridge University Press, New York, 1990.
[20]. X. He, and Shi. Bivariate Tensor Product B-Spline in a Partly Linear Model, Journal of Multivariate Analysis, 58, 162-181, 1996.
[21]. C. C. Holmes, and B. K. Mallick. Bayesian Regression with Multivariate Linear Splines, Journal of the Royal Statistical Society, Series B, 63, 3-18, 2001.
[22]. J. Z. Huang, and L. Liu. Polynomial Spline Estimation and Inference of Proportional Hazards Regression Models with Flexible Relative Risk Form, Biometrics, 62, 793-802, 2006.
[23]. M. Kayri, and G. Zirhhoglu. Kernel Smoothing Function and Choosing Bandwidth for Nonparametric Regression Methods, Ozean Journal of Applied Sciences, 2, 49-60, 2009.
[24]. R. Koenker, P. Ng, and S. Portnoy. Quantile Smoothing Splines, Biometrika, 81, 673-680, 1994.
[25]. B. Lestari, I. N. Budiantara, S. Sunaryo, and M. Mashuri. Spline Estimator in Multi-Response Nonparametric Regression Model with Unequal Correlation of Errors, Journal of Mathematics and Statistics, 6, 327-332, 2010.
[26]. G. W. Oehlert. Relaxed Boundary Smoothing Spline, The Annals of Statistics, 20, 1146-1160, 1992.
[27]. S. Ooki. Construction of Japanese Database on Child Twins and their Families, Journal of Epidemiology, 14, 215-225, 2004.
[28]. P. Shi, and G. Li. On the Rate Convergence of Minimum L1-Norm Estimates in a Partly Linear Models, Communication in Statistics, Theory and Methods, 23, 175-196, 1994.
[29]. P. Speckman. Kernel Smoothing in Partial Linear Model, Journal of the Royal Statistical Society, Series B, 50, 413-436, 1988.
[30]. G. Wahba. Spline Models for Observational Data, SIAM, Pennsylvania, 1990.
[31]. Y. Wang. Spline Smoothing Models With Correlated Errors, Journal of the American Statistical Association, 93, 341-348, 1998.
[32]. Y. Wang, W. Guo, and M. B. Brown. Smoothing Spline for Bivariate Data with Applications to Association Between Hormones, Statistica Sinica, 10, 377-397, 2000.
[33]. H. Wu, and J. T. Zhang. Nonparametric Regression Methods for Longitudinal Data Analysis: Mixed-Effects Modeling Approaches, John Wiley and Sons, New York, 2006.