Statistical Downscaling to Predict Monthly Rainfall Using ... - hikari

Applied Mathematical Sciences, Vol. 9, 2015, no. 108, 5361 - 5369 HIKARI Ltd, www.m-hikari.com http://dx.doi.org/10.12988/ams.2015.56434

Statistical Downscaling to Predict Monthly Rainfall Using Linear Regression with L1 Regularization (LASSO)

Agus M Soleh
Department of Statistics, Faculty of Mathematics and Natural Sciences, Bogor Agricultural University (IPB)

Aji H. Wigena, Anik Djuraidah and Asep Saefuddin
Department of Statistics, Faculty of Mathematics and Natural Sciences, Bogor Agricultural University (IPB)

Copyright © 2015 Agus M Soleh, Aji H. Wigena, Anik Djuraidah and Asep Saefuddin. This article is distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Regression with L1 regularization (the lasso) was compared with principal component regression (PCR) in Statistical Downscaling (SDS) modeling to predict monthly rainfall. SDS modeling uses ill-conditioned (highly correlated, multicollinear) covariates, a problem that can be addressed with selection or shrinkage methods. In this study, we used two GCMs with different characteristics as covariates (CMIP5 and GPCP version 2.2). The results show that the lasso gave better results (smaller RMSE and RMSEP) than PCR for GPCP version 2.2, and performed as well as PCR for CMIP5 covariates.

Keywords: statistical downscaling, L1 regularization, lasso, rainfall, general circulation models

1 Introduction

Rainfall is an important variable in agriculture. Several methods have been used to estimate rainfall in Indonesia, one of which is Statistical Downscaling (SDS) modeling ([1, 2, 3, 4, 5]). SDS modeling is an example of a modeling problem in which the p covariates are generally numerous and not independent. Downscaling [6] is a process of developing the relationship between several variables that represent a large scale and some variables that represent a small scale. SDS modeling, in turn, is a technique that uses a statistical model to analyze the relationship between large-scale (global) data and small-scale (local) data. The large-scale data are the output of a General Circulation Model (GCM) and the small-scale data are the rainfall in certain areas. The GCM output is climate data (precipitation, temperature, etc.) resulting from the modeling of a wide range of satellite output, located on a grid domain with a coarse resolution of 2.5° × 2.5° (± 300 km²) and measured in time units. The GCM output is therefore generally not independent, because of its spatial and temporal dependence. Statistical models using the GCM output are built with variable selection (such as stepwise regression) and dimension reduction (such as principal component analysis).

Variable selection is an important issue in regression, especially for problems where the number of variables is large and the variables are not independent. Tibshirani [7] proposed the lasso (least absolute shrinkage and selection operator), now a popular method for variable selection and shrinkage. The lasso adds an L1 penalty ($\sum_j |\beta_j| \le t$) to the objective function of the regression model. The advantage of shrinkage is that it prevents the overfitting that occurs due to collinearity of the covariates or high dimension [8]. There is no closed form for the parameter estimates, so parameter estimation relies on convex optimization. Tibshirani [7] used quadratic programming, and Efron et al. [9] proposed LAR (Least Angle Regression), which calculates a path of coefficients more efficiently. Applications to SDS modeling show that the lasso performs better than stepwise regression ([4, 5]). This article discusses the use of the lasso for SDS modeling to estimate the monthly rainfall in Indramayu, compared with Principal Component Regression (PCR).

2 Regression with L1 Regularization (LASSO)

Consider the least squares objective function in multiple linear regression,

$$\min \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2,$$

where $x_{ij}$, $j = 1, 2, \ldots, p$, are the covariates and $y_i$ is the response for the i-th observation, $i = 1, 2, \ldots, n$. The least squares method with an L1 penalty (regularization), called the lasso, is written in Lagrangian form as

$$\underset{\beta_j}{\operatorname{argmin}} \left\{ \Big( y - \sum_{j=1}^{p} \beta_j x_j \Big)^T \Big( y - \sum_{j=1}^{p} \beta_j x_j \Big) + \lambda \sum_{j=1}^{p} |\beta_j| \right\}, \qquad \lambda \ge 0. \tag{1}$$

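Shrinkage methods [8] are generally not invariant to the relative scaling of the covariates, so the covariates should be standardized. Differentiating Equation 1 with respect to $\beta_j$ gives the following solutions:

$$\hat{\beta}_j = \begin{cases} \dfrac{x_j^T r_{-j}}{\|x_j\|^2} - \dfrac{\lambda}{2\|x_j\|^2}, & \lambda < 2 x_j^T r_{-j} \\[2ex] \dfrac{x_j^T r_{-j}}{\|x_j\|^2} + \dfrac{\lambda}{2\|x_j\|^2}, & -\lambda > 2 x_j^T r_{-j} \\[2ex] 0, & \lambda \ge |2 x_j^T r_{-j}| \end{cases} \tag{2}$$

where $r_{-j} = y - \sum_{k \ne j} \beta_k x_k$.

Proof. Let $f(\beta_k, \lambda) = (y - \sum_{k=1}^{p} \beta_k x_k)^T (y - \sum_{k=1}^{p} \beta_k x_k) + \lambda \sum_{k=1}^{p} |\beta_k|$. Differentiate $f(\beta_k, \lambda)$ with respect to $\beta_j$ and set the derivative to zero:

$$\begin{aligned}
0 = \frac{\partial}{\partial \beta_j} f(\beta_k, \lambda) &= \frac{\partial}{\partial \beta_j} \left\{ y^T y - 2 y^T \sum_{k=1}^{p} \beta_k x_k + \Big( \sum_{k=1}^{p} \beta_k x_k \Big)^T \Big( \sum_{k=1}^{p} \beta_k x_k \Big) + \lambda \sum_{k=1}^{p} |\beta_k| \right\} \\
&= -2 y^T x_j + 2 x_j^T \sum_{k=1}^{p} \beta_k x_k + \lambda \operatorname{sign}(\beta_j).
\end{aligned}$$

Dividing by 2 and separating the j-th term from the sum:

$$\begin{aligned}
0 &= -x_j^T y + x_j^T \sum_{k=1}^{p} \beta_k x_k + \frac{\lambda}{2} \operatorname{sign}(\beta_j) \\
&= x_j^T \Big( \sum_{k=1}^{p} \beta_k x_k - y \Big) + \frac{\lambda}{2} \operatorname{sign}(\beta_j) \\
&= \beta_j x_j^T x_j + x_j^T \Big( \sum_{k \ne j} \beta_k x_k - y \Big) + \frac{\lambda}{2} \operatorname{sign}(\beta_j) \\
&= \beta_j x_j^T x_j - x_j^T \Big( y - \sum_{k \ne j} \beta_k x_k \Big) + \frac{\lambda}{2} \operatorname{sign}(\beta_j).
\end{aligned}$$

Note that $x_j^T x_j = \|x_j\|^2$, then, after dividing by $\|x_j\|^2$:

$$\begin{aligned}
0 &= \beta_j \|x_j\|^2 - x_j^T \Big( y - \sum_{k \ne j} \beta_k x_k \Big) + \frac{\lambda}{2} \operatorname{sign}(\beta_j) \\
0 &= \beta_j - \frac{x_j^T \big( y - \sum_{k \ne j} \beta_k x_k \big)}{\|x_j\|^2} + \frac{\lambda \operatorname{sign}(\beta_j)}{2 \|x_j\|^2}.
\end{aligned}$$

Let $r_{-j} = y - \sum_{k \ne j} \beta_k x_k$, then the solution for $\beta_j$ is:

$$\hat{\beta}_j = \frac{x_j^T r_{-j}}{\|x_j\|^2} - \frac{\lambda}{2 \|x_j\|^2} \operatorname{sign}(\beta_j).$$

Since $\lambda \ge 0$ and $\|x_j\|^2 > 0$, a nonzero solution must have the sign of $x_j^T r_{-j}$ equal to $\operatorname{sign}(\beta_j)$. Taking $\operatorname{sign}(\beta_j) = +1$ gives the first case, which is consistent only when $2 x_j^T r_{-j} > \lambda$; taking $\operatorname{sign}(\beta_j) = -1$ gives the second case, which is consistent only when $2 x_j^T r_{-j} < -\lambda$. When $|2 x_j^T r_{-j}| \le \lambda$ neither sign is consistent, so $\hat{\beta}_j = 0$. Equation 2 is therefore proved.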
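Equation 2 updates one coefficient while the others are held fixed, so one way to use it is to cycle over j = 1, ..., p until the coefficients stop changing (coordinate descent). The Python sketch below illustrates that idea only. It is not the LAR algorithm of the lars package used in this study, and the function names, the stopping rule and the toy data are illustrative assumptions. It also assumes standardized covariates and a centered response, so no intercept is fitted.

```python
import numpy as np


def lasso_coordinate_update(x_j, r_minus_j, lam):
    """Single-coefficient update following the three cases of Equation 2."""
    rho = float(x_j @ r_minus_j)    # x_j^T r_{-j}
    norm_sq = float(x_j @ x_j)      # ||x_j||^2
    if 2.0 * rho > lam:             # case 1: positive coefficient
        return rho / norm_sq - lam / (2.0 * norm_sq)
    if 2.0 * rho < -lam:            # case 2: negative coefficient
        return rho / norm_sq + lam / (2.0 * norm_sq)
    return 0.0                      # case 3: coefficient set to zero


def lasso_coordinate_descent(X, y, lam, n_sweeps=200, tol=1e-10):
    """Cycle the Equation 2 update over all covariates (illustrative helper)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    beta = np.zeros(X.shape[1])
    for _ in range(n_sweeps):
        max_change = 0.0
        for j in range(X.shape[1]):
            # Partial residual r_{-j} = y - sum_{k != j} beta_k x_k
            r_minus_j = y - X @ beta + X[:, j] * beta[j]
            new_value = lasso_coordinate_update(X[:, j], r_minus_j, lam)
            max_change = max(max_change, abs(new_value - beta[j]))
            beta[j] = new_value
        if max_change < tol:        # assumed stopping rule
            break
    return beta


# Toy check with an orthonormal design, where each update is exact:
# the coefficients are y_j shrunk toward zero by lam / 2.
beta_demo = lasso_coordinate_descent(np.eye(3), np.array([3.0, -2.0, 0.4]), lam=1.0)
print(beta_demo)  # values 2.5, -1.5 and 0.0
```

With an orthonormal design the partial residuals do not interact, so the result can be checked by hand against Equation 2; with correlated covariates, such as GCM grid cells, several sweeps are needed.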

3 Application to Predict Monthly Rainfall

The SDS modeling in this study used two types of data: monthly precipitation from GCM output as the covariates, and monthly rainfall from 11 rainfall stations in Indramayu (January 1981 to May 2014) as the responses. Based on the seasonal forecast map, Indramayu is divided into four season zones (ZOM), i.e., ZOM 77, 78, 79, and 80. This study used monthly rainfall representing three ZOM: ZOM 77 (Kr Anyar, Pusakanegara and Tulang Kacang), ZOM 78 (Dempet, Indramayu, Juntinyuat and Losarang), and ZOM 79 (Gegesik, Karangkendal, Krangken and Sukadana). Precipitation from two GCM outputs was used as covariates: the multi-model ensemble of the Coupled Model Intercomparison Project Phase 5 (CMIP5) [10] under the moderate climate change scenario RCP (Representative Concentration Pathways) 4.5, and the Global Precipitation Climatology Project (GPCP) [11] version 2.2, with a domain of 7 × 7 grid cells (49 grids). The CMIP5 data can be downloaded via http://pcmdi9.llnl.gov/esgf-web-fe/ and the GPCP data are provided by the NOAA/OAR/ESRL PSD, Boulder, Colorado, USA, from their web site at http://www.esrl.noaa.gov/psd/. The area covered by the GCM outputs studied is 101.25° E - 116.25° E longitude and 13.75° S - 1.25° N latitude with a 2.5° × 2.5° grid. Indramayu lies directly under the selected domain.
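The stated bounds and spacing are consistent with the 7 × 7 domain, which can be checked with a few lines of Python. The sketch treats the stated bounds as the coordinates of the outermost grid points, which is an assumption; the variable and function names are illustrative.

```python
import numpy as np

# Domain bounds and spacing as stated in the text, in degrees.
# South latitudes are written as negative numbers.
LON_MIN, LON_MAX = 101.25, 116.25
LAT_MIN, LAT_MAX = -13.75, 1.25
STEP = 2.5


def grid_axis(start, stop, step):
    """Evenly spaced grid coordinates from start to stop, inclusive."""
    n_points = int(round((stop - start) / step)) + 1
    return np.linspace(start, stop, n_points)


lons = grid_axis(LON_MIN, LON_MAX, STEP)
lats = grid_axis(LAT_MIN, LAT_MAX, STEP)

# One row per grid point: (longitude, latitude).
lon_grid, lat_grid = np.meshgrid(lons, lats)
grid_points = np.column_stack([lon_grid.ravel(), lat_grid.ravel()])

print(len(lons), len(lats), grid_points.shape)  # 7 7 (49, 2)
```

Each of the 49 grid points then supplies one precipitation series, that is, one covariate in the regression.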

3.1 The Characteristics of GCMs

The covariates in the GCM output are highly correlated. In CMIP5, about 91.24% of the covariate pairs (1073 of 1176 pairs) had a correlation greater than 0.6, and these correlations were positive. In GPCP, about 23.55% of the covariate pairs (277 of 1176 pairs) had a correlation greater than 0.6, and 308 of the 1176 pairs (26.19%) had negative correlations. Principal Component Analysis (PCA) is often used as a pre-processing technique to obtain latent variables (principal components, PCs), which are mutually orthogonal linear combinations of the covariates. The number of PCs was determined using three approaches: the scree plot, the cumulative proportion of variance, and the eigenvalues.
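The pair counts follow from the 49 grid covariates, which form 49 × 48 / 2 = 1176 distinct pairs. A minimal Python sketch of such a tally is given below. It uses synthetic data in place of the GCM output, it reads "correlation greater than 0.6" as the signed correlation (an assumption), and the function name is illustrative.

```python
import numpy as np


def summarize_pairs(X, threshold=0.6):
    """Count covariate pairs, pairs above the threshold, and negative pairs."""
    corr = np.corrcoef(X, rowvar=False)          # columns are covariates
    upper = np.triu_indices_from(corr, k=1)      # each pair counted once
    r = corr[upper]
    return {
        "n_pairs": int(r.size),
        "n_high": int(np.sum(r > threshold)),    # signed correlation, assumed
        "n_negative": int(np.sum(r < 0)),
    }


# Synthetic stand-in: 401 monthly values (January 1981 to May 2014)
# for 49 grid covariates. These are random numbers, not GCM data.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(401, 49))
summary = summarize_pairs(X_demo)
print(summary["n_pairs"])  # 1176
```

Applied to the actual CMIP5 and GPCP matrices, the same function would return the counts quoted in the text, provided the threshold is interpreted the same way.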

Figure 1: Screeplot of the principal component analysis of the CMIP5 and the GPCP outputs

The scree plots obtained from the CMIP5 and GPCP outputs are presented in Figure 1. The scree plots for both GCMs suggested taking three PCs. The cumulative proportion of variance for the first three PCs of the CMIP5 output was already above 90% (97.30%), but the first three PCs of the GPCP output reached only 76.70% (Table 1). For GPCP, eight PCs were needed for the cumulative proportion of variance to exceed 90%. The eigenvalues of the three PCs from the CMIP5 output and of the eight PCs from the GPCP output were all greater than 1. Therefore, further analysis used three PCs as covariates from the CMIP5 output and eight PCs as covariates from the GPCP output.
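The cumulative-variance criterion can be applied directly to the values reported in Table 1. The Python sketch below finds the smallest number of components whose cumulative proportion reaches a target; the 90% target comes from the text, while the function name and data layout are illustrative assumptions.

```python
import numpy as np

# Cumulative proportion of variance (%) for components 1 to 10,
# copied from Table 1.
CUM_PROP = {
    "CMIP5": [88.14, 96.04, 97.31, 98.53, 98.93, 99.29, 99.41, 99.53, 99.59, 99.64],
    "GPCP": [45.87, 69.97, 76.70, 81.24, 84.77, 87.00, 88.56, 90.08, 91.33, 92.26],
}


def n_components_for(cum_prop, target=90.0):
    """Smallest number of leading components reaching the target (%)."""
    reached = np.nonzero(np.asarray(cum_prop) >= target)[0]
    if reached.size == 0:
        raise ValueError("target not reached by the listed components")
    return int(reached[0]) + 1


for name, cum_prop in CUM_PROP.items():
    print(name, n_components_for(cum_prop))
# CMIP5 2
# GPCP 8
```

On these numbers the 90% rule alone is already met by two CMIP5 components, so the three components retained for CMIP5 rest on the scree plot and eigenvalue criteria as well; for GPCP the rule gives the eight components used in the study.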

Table 1: Proportion of cumulative variance for each of the GCM outputs

                      CMIP5                     GPCP
Component    Eigen    Proportion       Eigen    Proportion
1            13.33    88.14%           16.82    45.87%
2             3.99    96.04%           12.19    69.97%
3             1.60    97.31%            6.44    76.70%
4             1.57    98.53%            5.30    81.24%
5             0.90    98.93%            4.67    84.77%
6             0.85    99.29%            3.71    87.00%
7             0.49    99.41%            3.10    88.56%
8             0.48    99.53%            3.06    90.08%
9             0.37    99.59%            2.78    91.33%
10            0.31    99.64%            2.40    92.26%

3.2 Estimation Models

The regression models with the lasso were estimated using the lars package in the R statistical computing software. The selected variables were determined by the Mallows' Cp statistic that was close to the expected number of parameters. Smaller RMSE (Root Mean Square Error) and RMSEP (Root Mean Square Error of Prediction) values were used to identify the best method. These values were obtained by

$$\text{RMSE or RMSEP} = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{n}}.$$

RMSE uses all the data, while RMSEP uses 10-fold cross validation. The RMSE and RMSEP values of each model are presented in Table 2. The RMSE values obtained from least squares (LS) were always smaller than those of the other methods, but its RMSEP values tended to be higher. The lasso gave a smaller RMSE than PCR for both types of GCM output in all models. This suggests that the prediction model using the lasso was better than PCR. However, the RMSEP values obtained from 10-fold cross validation differed according to the characteristics of the covariates used. With the CMIP5 output as covariates, the RMSEP values did not show either method (PCR or lasso) to be superior: six models had smaller RMSEP values using PCR, while five models had smaller RMSEP values using the lasso. In contrast, the lasso provided the smallest RMSEP values for all models using covariates from the GPCP output.
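The difference between RMSE (all data) and RMSEP (10-fold cross validation) can be sketched in Python as follows. The sketch uses ordinary least squares as the model and synthetic data. The random fold assignment and the pooling of all held-out predictions into a single RMSEP are assumptions, since the text does not state how the folds were formed or combined. All function names are illustrative.

```python
import numpy as np


def rmse(y, y_hat):
    """Root mean square error between observed and predicted values."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))


def fit_least_squares(X, y):
    """Least squares fit with an intercept; returns [b0, b1, ..., bp]."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def predict(coef, X):
    return coef[0] + X @ coef[1:]


def rmsep_kfold(X, y, k=10, seed=0):
    """RMSEP from k-fold cross validation, pooling held-out predictions."""
    n = len(y)
    order = np.random.default_rng(seed).permutation(n)
    y_hat = np.empty(n)
    for test_idx in np.array_split(order, k):
        train_idx = np.setdiff1d(order, test_idx)
        coef = fit_least_squares(X[train_idx], y[train_idx])
        y_hat[test_idx] = predict(coef, X[test_idx])
    return rmse(y, y_hat)


# Synthetic example (not the rainfall data): many covariates, few rows.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(120, 30))
y_demo = X_demo[:, 0] - 0.5 * X_demo[:, 1] + rng.normal(size=120)

in_sample = rmse(y_demo, predict(fit_least_squares(X_demo, y_demo), X_demo))
cross_validated = rmsep_kfold(X_demo, y_demo, k=10)
print(in_sample, cross_validated)
```

Replacing `fit_least_squares` and `predict` with a PCR or lasso fit would give the corresponding RMSEP under the same fold assignment, which keeps the comparison between methods on equal terms.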

4 Conclusion

Linear regression estimated with the lasso can be considered as a method for estimating SDS models. In this study, the lasso was superior to PCR when few covariate pairs were highly correlated, as in the GPCP version 2.2 output. For covariates with high, positive correlations, as in the CMIP5 output, the lasso was as good as PCR, as indicated by the RMSEP values.

Acknowledgements. The author would like to thank the Directorate General of Higher Education, Ministry of Education and Culture, Republic of Indonesia, which provided the scholarships "Beasiswa Pendidikan Pasca Sarjana (BPPS) 2010" and "Beasiswa Peningkatan Kualitas Publikasi Internasional (Sandwich-like) 2014".

Table 2: The value of RMSE and RMSEP using each model and GCMs output

Rainfall                     RMSE                  RMSEP (10-fold cross validation)
Estimation Model     LS      PCR     lasso         LS      PCR     lasso

CMIP5
Kr Anyar           102.92  118.52  104.65        131.43  120.02  128.39
Pusakanegara        76.81   86.78   80.28         94.95   87.56   87.22
Tulang Kacang      120.12  135.56  126.08        153.62  138.19  142.48
Dempet              93.17  102.48   97.74        112.75  104.38  105.39
Indramayu          112.34  125.01  114.95        139.41  127.01  129.90
Juntinyuat          91.44  101.62   94.88        111.20  102.77  104.69
Losarang            88.33   98.90   92.46        108.33   99.48   98.49
Gegesik             79.79   90.58   86.32        101.01   92.42   89.30
Karangkendal        84.47   91.91   89.73        101.85   93.91   93.22
Krangken            78.07   84.75   81.64         94.70   85.76   87.18
Sukadan             78.99   89.54   82.83         94.00   90.54   88.08
Pegaden            104.19  120.14  110.36        121.14  121.61  115.31

GPCP version 2.2
Kr Anyar           106.96  121.19  113.75        124.95  124.70  120.46
Pusakanegara        83.04   95.35   87.58         97.01   98.30   93.27
Tulang Kacang      126.52  137.56  131.91        150.88  143.33  142.9
Dempet              95.33  108.31   98.91        110.90  111.39  106.82
Indramayu          115.95  131.07  119.89        128.72  134.29  127.45
Juntinyuat          95.21  103.83   99.37        109.26  106.31  104.77
Losarang            94.49  104.71   98.09        109.95  108.31  105.52
Gegesik             78.80   89.32   80.92         94.36   91.52   91.32
Karangkendal        83.42   93.28   87.75        100.77   96.46   94.01
Krangken            76.74   85.62   79.96         89.02   87.89   86.40
Sukadan             80.51   89.62   84.41         93.91   92.81   90.32
Pegaden            103.29  113.35  108.27        124.90  116.9   115.48
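The RMSEP comparison between PCR and the lasso described in Section 3.2 can be tallied from Table 2 with a short Python sketch. It assumes the RMSEP values in Table 2 are listed station by station in the order LS, PCR, lasso, first for CMIP5 and then for GPCP. It also restricts the tally to the 11 stations named in Section 3, so the Pegaden row, which appears only in Table 2, is left out. Both choices are assumptions of this sketch, and the function name is illustrative.

```python
# Stations in the order of Table 2, without Pegaden (see the note above).
STATIONS = [
    "Kr Anyar", "Pusakanegara", "Tulang Kacang", "Dempet", "Indramayu",
    "Juntinyuat", "Losarang", "Gegesik", "Karangkendal", "Krangken", "Sukadan",
]

# RMSEP values for PCR and the lasso, copied from Table 2.
RMSEP = {
    "CMIP5": {
        "PCR": [120.02, 87.56, 138.19, 104.38, 127.01, 102.77,
                99.48, 92.42, 93.91, 85.76, 90.54],
        "lasso": [128.39, 87.22, 142.48, 105.39, 129.90, 104.69,
                  98.49, 89.30, 93.22, 87.18, 88.08],
    },
    "GPCP": {
        "PCR": [124.70, 98.30, 143.33, 111.39, 134.29, 106.31,
                108.31, 91.52, 96.46, 87.89, 92.81],
        "lasso": [120.46, 93.27, 142.9, 106.82, 127.45, 104.77,
                  105.52, 91.32, 94.01, 86.40, 90.32],
    },
}


def tally_wins(pcr, lasso):
    """Count the stations where each method has the smaller RMSEP."""
    return {
        "PCR": sum(p < l for p, l in zip(pcr, lasso)),
        "lasso": sum(l < p for p, l in zip(pcr, lasso)),
    }


for gcm, values in RMSEP.items():
    print(gcm, tally_wins(values["PCR"], values["lasso"]))
# CMIP5 {'PCR': 6, 'lasso': 5}
# GPCP {'PCR': 0, 'lasso': 11}
```

Under these assumptions the tally reproduces the counts given in the text: six against five for CMIP5, and the lasso ahead at every station for GPCP.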

References

[1] A. H. Wigena, Statistical Downscaling Modeling using Projection Pursuit Regression for Rainfall Prediction, Dissertation, Bogor (ID): Institut Pertanian Bogor, 2006.

[2] D. J. Vimont, D. S. Battisti, R. L. Naylor, Downscaling Indonesian precipitation using large-scale meteorological fields, Int. J. Climatol., 30 (2010), 1706-1722. http://dx.doi.org/10.1002/joc.2010

[3] Sutikno, Setiawan, H. Purnomoadi, Statistical Downscaling Output GCM Modeling with Continuum Regression and Pre-Processing PCA Approach, IPTEK J. Tech. Sci., 21 (2010), 109-117. http://dx.doi.org/10.12962/j20882033.v21i3.41

[4] D. Hammami, T. S. Lee, T. B. M. J. Ouarda, J. Lee, Predictor selection for downscaling GCM data with LASSO, J. Geophys. Res., 117 (2012). http://dx.doi.org/10.1029/2012jd017864

[5] L. Gao, K. Schulz, M. Bernhardt, Statistical Downscaling of ERA-Interim Forecast Precipitation Data in Complex Terrain Using LASSO Algorithm, Advances in Meteorology, 2014 (2014), 1-16. http://dx.doi.org/10.1155/2014/472741

[6] R. E. Benestad, D. Chen, I. Hanssen-Bauer, Empirical-Statistical Downscaling, World Scientific Publishing Co. Pte. Ltd., Singapore, 2008. http://dx.doi.org/10.1142/6908

[7] R. Tibshirani, Regression Shrinkage and Selection via the LASSO, J. R. Statist. Soc. (B), 58 (1996), 267-288.

[8] T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning, Data Mining, Inference, and Prediction, Second Edition, Springer, 2009. http://dx.doi.org/10.1007/978-0-387-84858-7

[9] B. Efron, T. Hastie, I. Johnstone, R. Tibshirani, Least Angle Regression, Ann. Statist., 32 (2004), 407-499. http://dx.doi.org/10.1214/009053604000000067

[10] K. E. Taylor, R. J. Stouffer, G. A. Meehl, An Overview of CMIP5 and the Experiment Design, Bull. Amer. Meteor. Soc., 93 (2012), 485-498. http://dx.doi.org/10.1175/bams-d-11-00094.1

[11] R. F. Adler, G. C. Huffman, A. Chang, R. Ferraro, P. Xie, J. Janowiak, B. Rudolf, U. Schneider, S. Curtis, D. Bolvin, A. Gruber, J. Susskind, P. Arkin, The Version-2 Global Precipitation Climatology Project (GPCP) Monthly Precipitation Analysis (1979-Present), J. Hydrometeor., 4 (2003), 1147-1167. http://dx.doi.org/10.1175/1525-7541(2003)004<1147:tvgpcp>2.0.co;2

Received: June 29, 2015; Received: August 14, 2015
