INTERNATIONAL JOURNAL OF CLIMATOLOGY Int. J. Climatol. 27: 831–836 (2007) Published online 6 December 2006 in Wiley InterScience (www.interscience.wiley.com) DOI: 10.1002/joc.1449
Binary Logistic Regression Models for short term prediction of premonsoon convective developments over Kolkata (India)
S. Dasgupta (a)* and U. K. De (b)
(a) Department of Statistics, St. Xavier's College, 30 Park Street, Kolkata 700016, India
(b) Atmospheric Science Research Group, School of Environmental Studies, Jadavpur University, Kolkata 700032, India
Abstract: Scientists have identified certain dynamic and thermodynamic parameters that are significant for the occurrence of convective developments in the premonsoon period (March–May) in Kolkata (India). In this paper Binary Logistic Regression Models have been considered for the prediction of these convective developments from prior knowledge of the values of the significant parameters. Analyses have been carried out for lead times of 12 h. Initially the evening radiosonde information (1200 UTC) has been considered, as these convective developments occur in Kolkata mostly toward the evening. The performance of the model for similar developments in the morning has also been taken up at a later point. As some of the identified parameters are correlated, the dimension of the problem has been reduced and uncorrelated explanatory variables generated for the Logistic Regression Model. Two conventional variable selection methods, namely forward and backward selection, have been used for this purpose. A comparative study of the two variable selection procedures in the light of the predictive power of the resulting models has been presented. The probable causes underlying the poor performance of the models in the prediction of morning convection over the city have been explored and possible remedies suggested at the end. Copyright 2006 Royal Meteorological Society
KEY WORDS premonsoon; convective development; logistic regression; covariates; forward selection; backward selection
Received 21 June 2006; Revised 14 September 2006; Accepted 24 September 2006
* Correspondence to: S. Dasgupta, Department of Statistics, St. Xavier's College, 30 Park Street, Kolkata 700016, India. E-mail: [email protected]

INTRODUCTION
Over the past few decades, statistical models have increasingly drawn the attention of scientists analyzing data from the atmospheric sciences. Dasgupta and De (2001) have shown how a simple probabilistic description of the thunderstorm phenomenon can be provided with the help of two-state Markov chains. Attempts have also been made to forecast the occurrence of thunderstorms over a region using the techniques of principal component analysis and linear discriminant analysis (Ghosh et al., 1999, 2004). The present work is an attempt to set up a new and objective forecasting tool for convection events, which release the convective instability that builds up in a tropical atmosphere.

In Gangetic West Bengal, situated in the eastern part of India, the premonsoon season (March–May) is typically marked by the formation of convective developments of different forms, which include not only severe local storms or wind squalls but also thundershowers, dry thunderstorms and lightning activity alone. Many of these events fall under the category of 'Norwesters', a popular name for such convection events. Typical Norwesters are a result of the confluence of dry northwesterly and hot moist southeasterly winds. They are characterized by violent storms in many places and have manifold effects on human society. As a result, the study of these developments finds an important place in meteorological research.

As mentioned earlier, the aim of the present work is to set up objective forecasting tools for premonsoon convective events. The region of interest is Kolkata, situated in the heart of Gangetic West Bengal. Over the years scientists have identified certain dynamic and thermodynamic parameters as significant for the occurrence of premonsoon convective developments over Kolkata. An attempt has been made to develop suitable statistical models for the prediction of these events with the identified parameters as covariates of the model.

The Binary Logistic Regression Model (Kleinbaum, 1994), based on the logistic function, is generally used to study the nature of dependence of a dichotomous response variable (Y) on a number of explanatory variables (X1, X2, ..., Xk), which may be either discrete or continuous. Although it is used extensively in epidemiology, the use of Logistic Regression in meteorology is of recent origin. Sanchez et al. (1998a) have applied this model to the short term forecast of hail risk in the province of Leon in the northwestern Iberian Peninsula of Spain. Other contemporary authors have applied this model to predict rainfall-rate, hurricanes and
lightning activity (Chiu and Kadem, 1990; Gray et al., 1992; Crosby et al., 1995; Hess et al., 1995; Elsner et al., 1996; Sanchez et al., 1998b; Mazany et al., 2002). To the best of the present authors' knowledge, however, there is hardly any reference to the application of this model in the context of Indian climatology (Paranjpe and Gore, 1991). The present work is concerned with the development of Binary Logistic Regression Models for an objective forecast of the risk of occurrence of premonsoon convective developments in and around Kolkata 12 h in advance, using standard variable selection techniques for dimension reduction.
DATA
This work is based on the vertical atmospheric profile available from the morning (0000 UTC) and evening (1200 UTC) radiosonde observations at the Dum Dum meteorological station of Kolkata. The information also relates to the details of convective developments during the premonsoon season (March–May) of the 12-year period 1985–1996, as available from the two surface observatories of Kolkata at Alipore (22.53°N, 88.33°E) and Dum Dum (22.65°N, 88.45°E). Because complete information is not available for all days covered by the 12-year period, the analyses have been based on samples of effective sizes 403 days and 363 days, respectively, for the morning and evening analysis.

It is to be noted that a convection event at either the Alipore or the Dum Dum station is considered as a convection event for Kolkata. If there is convection at both Alipore and Dum Dum, it is still considered as a single convection event for Kolkata, regardless of the nature and intensities of convection at the two places, which are not needed in the initial phase of the analysis.

The study has been confined up to the 500 hPa level of the atmosphere, as the importance of this level has been stressed by many scientists, such as Showalter (1953), Galway (1956), Darkow (1968), Fujita et al. (1970) and Miller (1972), and the level of cloud development may be taken around 500 hPa (Kessler, 1982). The atmosphere up to 500 hPa has been divided into four layers, viz. (1000–850) hPa, (850–700) hPa, (700–600) hPa and (600–500) hPa. From the works of various scientists (Kuo, 1965; Betts, 1974; Kessler, 1982; Williams and Renno, 1993) five basic parameters have been identified as significant for the occurrence of thunderstorms. The values of these atmospheric variables in each of the atmospheric layers considered generate an initial set of 20 significant parameters for the study. They are:

X1 = (θes − θe) at 1000 hPa
X2 = (P − P_LCL) at 1000 hPa
X3 = δθe/δz at (1000–850) hPa
X4 = δθes/δz at (1000–850) hPa
X5 = δu/δz at (1000–850) hPa
X6 = (θes − θe) at 850 hPa
X7 = (P − P_LCL) at 850 hPa
X8 = δθe/δz at (850–700) hPa
X9 = δθes/δz at (850–700) hPa
X10 = δu/δz at (850–700) hPa
X11 = (θes − θe) at 700 hPa
X12 = (P − P_LCL) at 700 hPa
X13 = δθe/δz at (700–600) hPa
X14 = δθes/δz at (700–600) hPa
X15 = δu/δz at (700–600) hPa
X16 = (θes − θe) at 600 hPa
X17 = (P − P_LCL) at 600 hPa
X18 = δθe/δz at (600–500) hPa
X19 = δθes/δz at (600–500) hPa
X20 = δu/δz at (600–500) hPa

where
θes: saturated equivalent potential temperature
θe: equivalent potential temperature
P: pressure at the reference level
P_LCL: pressure at the lifting condensation level
z: vertical height
u: resultant horizontal wind speed expressed in meters per second
δθe/δz: convective instability of the atmospheric layer
δθes/δz: conditional instability of the atmospheric layer
δu/δz: vertical shear of the horizontal wind

The values of the variables (θes − θe) and (P − P_LCL) at the lower boundaries of each atmospheric layer (i.e., at 1000, 850, 700 and 600 hPa) have been taken as the representative values for the corresponding layers. The parameter P_LCL for surface parcels was considered by Kuo (1965) as the cloud base, and hence (P − P_LCL) can be taken as a forcing factor for the saturation of a parcel. The thermodynamic parameter (θes − θe) introduced by Betts (1974) is a measure of the unsaturation or humidity of the atmosphere. It is now well established that thunderstorms are strongly favored by convective instability, abundant moisture at low levels, strong wind shear and a dynamic lifting mechanism that can release the instability (Kessler, 1982). The vertical shear of the horizontal winds has to match the value of the convective instability for proper development of a large convective cloud (Asnani, 1992). Williams and Renno (1993) have emphasized conditional instability for supporting electrification and lightning.
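The indexing of the 20 parameters follows a regular pattern: five basic parameters in each of four layers. A minimal Python sketch of this bookkeeping is given below. The names PARAMETERS, LAYERS and build_variables are hypothetical and are introduced only for illustration; they are not part of the original analysis.

```python
# Hypothetical helper: rebuild the labels X1..X20 from 5 parameters x 4 layers.
# Each entry is (symbol, at_level): at_level is True when the parameter is
# evaluated at the lower boundary of the layer, False when it is a layer gradient.
PARAMETERS = [
    ("theta_es - theta_e", True),
    ("P - P_LCL", True),
    ("d(theta_e)/dz", False),
    ("d(theta_es)/dz", False),
    ("du/dz", False),
]
LAYERS = [(1000, 850), (850, 700), (700, 600), (600, 500)]  # hPa


def build_variables():
    """Return a dict mapping 'X1'..'X20' to a text description."""
    variables = {}
    index = 1
    for lower, upper in LAYERS:
        for symbol, at_level in PARAMETERS:
            where = f"{lower} hPa" if at_level else f"({lower}-{upper}) hPa"
            variables[f"X{index}"] = f"{symbol} at {where}"
            index += 1
    return variables


variables = build_variables()
for name, description in variables.items():
    print(name, "=", description)
```

With this layout the parameter (θes − θe) occupies positions X1, X6, X11 and X16, one per layer.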
METHODOLOGY
The thunderstorm, which is predominant in the premonsoon season in Gangetic West Bengal, is a phenomenon that is associated with atmospheric convection on hot summer days in a tropical region. It is in fact a powerful agent that releases the convective instability that continually builds up in the tropical atmosphere. Thunderstorms develop from cumulo-nimbus clouds or aggregation of such clouds and differ from ordinary shower clouds as they produce thunder and lightning (Pettersen, 1959).
In the present paper an attempt has been made to develop Binary Logistic Regression Models for the prediction of premonsoon convective developments over Kolkata in the 12 h following 0000 UTC and 1200 UTC, using standard methods for selecting the covariates of the model from the initial set of 20 atmospheric variables.

Development of the model
Let (X_1, X_2, ..., X_m) denote a subset of m uncorrelated variables. With the ith day, i = 1(1)n (i.e., i = 1, 2, ..., n), we associate a variable Y_i such that

$$Y_i = \begin{cases} 1 & \text{if a convective development occurs on the } i\text{th day} \\ 0 & \text{if a convective development does not occur on the } i\text{th day} \end{cases}$$

and the observed value of X_j on the ith day is denoted by X_{ji}, j = 1(1)m, i = 1(1)n. The mathematical framework thus consists of a dichotomous response variable Y together with explanatory variables, which have no restrictions on their nature. The general multiple linear regression model, given by

$$E(Y_i \mid X_1, X_2, \ldots, X_m) = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_m X_{mi}, \quad i = 1(1)n,$$

is not appropriate here, as the function $\beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_m X_{mi}$ is unbounded and does not provide estimates of Y that are bounded between 0 and 1. A more appropriate model is the classical Binary Logistic Regression Model for dichotomous response, where the explicit regression equation is given by

$$E(Y_i \mid X_1, X_2, \ldots, X_m) = \left\{1 + e^{-(\beta_0 + \beta_1 X_{1i} + \cdots + \beta_m X_{mi})}\right\}^{-1}.$$

In the present context the model serves as a tool for forecasting the risk of occurrence of premonsoon convective developments. Suppose that in a given sample of n days there are t days on which a convective development occurs and n − t fair-weather days. The probability of the observed sample is proportional to

$$\{\Pr(Y_i = 1 \mid X_1, X_2, \ldots, X_m)\}^{t}\,\{\Pr(Y_i = 0 \mid X_1, X_2, \ldots, X_m)\}^{n-t}.$$

Given (X_1, X_2, ..., X_m), the above expression is a function of (β_0, β_1, ..., β_m) only and is called the likelihood function of θ, where θ = (β_0, β_1, ..., β_m). It is denoted by L(θ). The maximum likelihood equations for estimating (β_0, β_1, ..., β_m) are obtained by maximizing L(θ) with respect to θ. In practice the fitting process required by the model involves repeated numerical solution of equations to estimate the coefficients β_j, j = 1(1)m (Scarborough, 1966).

Variable selection methods
It is well known that the variables (θes − θe) and (P − P_LCL) are highly correlated in all the layers of the atmosphere. The 95% normal confidence intervals for the means of the two variables were constructed on the basis of the data, separately for convective and fair-weather days. The degree of separation of a particular variable in a given atmospheric layer is defined as:

$$\text{Degree of separation} = \frac{\text{absolute distance between the confidence intervals on convective and fair-weather days in the layer}}{\text{length of the confidence interval on convective days in that layer}} \times 100$$

Here 'absolute distance' stands for the absolute value of the difference between the upper confidence limit on convective days and the lower confidence limit on fair-weather days. The variable with the higher separability, indicated by a higher value of the percentage in the above formula, was retained in the analysis. The variable selection procedures were then applied to the reduced set of 16 atmospheric variables to yield a final set of covariates for the Logistic Regression Model.

Forward selection. In forward selection the variables are entered into the model in a stepwise manner. The variable that is entered first is the one that provides maximum discrimination between the binary situations of the model, as ascertained by a suitable statistical criterion (Sharma, 1996). The variable that is entered in the next step is the one that adds the maximum amount of additional discriminating power to the model as measured by the same criterion. The procedure continues until no further variables are entered into the model.

Backward selection. The backward selection rule begins with all the variables in the model. At each step one variable is removed, namely the one whose removal causes the least decrease in the discriminating power of the model as measured by a statistical criterion (Sharma, 1996). The procedure continues until no further variables can be removed.

The parameter estimates of the models under the two different variable selection procedures are obtained using the unconditional maximum likelihood approach, as the number of parameters in the models is small compared to the number of subjects, i.e. the number of days under study.
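The iterative maximum likelihood fit mentioned above can be illustrated by a minimal Python sketch for the simplest case of an intercept and a single covariate. It uses Newton-Raphson iteration, which is one common way of solving the likelihood equations numerically; the paper itself relied on the SPSS package, so this is an illustration under stated assumptions and not the authors' procedure. The function names and the small data set are hypothetical.

```python
import math


def logistic(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic_1d(x, y, iterations=25):
    """Newton-Raphson maximum likelihood fit of P(Y=1 | x) = logistic(b0 + b1*x)."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        p = [logistic(b0 + b1 * xi) for xi in x]
        # Score vector: gradient of the log-likelihood with respect to (b0, b1).
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * xi for xi, yi, pi in zip(x, y, p))
        # Information matrix (negative Hessian), a symmetric 2 x 2 matrix.
        w = [pi * (1.0 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        # Newton step using the explicit inverse of the 2 x 2 information matrix.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Hypothetical illustrative data: one covariate and a binary response.
x = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
y = [0, 0, 1, 0, 1, 0, 1, 1]

b0, b1 = fit_logistic_1d(x, y)
fitted = [logistic(b0 + b1 * xi) for xi in x]
print("b0 =", round(b0, 3), "b1 =", round(b1, 3))
```

At convergence the fitted probabilities satisfy the likelihood equations, so the residuals y − p sum to zero. With m covariates the same iteration applies, with the 2 × 2 system replaced by an (m + 1) × (m + 1) linear system.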
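The degree of separation defined under 'Variable selection methods' can be sketched in Python as follows. Two assumptions are made here that the text does not state explicitly: the 95% normal confidence interval is taken as mean ± 1.96 standard errors, and overlapping intervals are assigned a separation of zero. The function names are hypothetical.

```python
import math
import statistics


def normal_ci95(values):
    """95% normal confidence interval for the mean (assumed: mean +/- 1.96 SE)."""
    mean = statistics.mean(values)
    half_width = 1.96 * statistics.stdev(values) / math.sqrt(len(values))
    return mean - half_width, mean + half_width


def separation_from_intervals(ci_convective, ci_fair):
    """Degree of separation (%) between two confidence intervals.

    The gap between the intervals is divided by the length of the interval on
    convective days. Overlapping intervals give 0 (an assumption of this sketch).
    """
    lo_c, hi_c = ci_convective
    lo_f, hi_f = ci_fair
    gap = max(0.0, max(lo_c, lo_f) - min(hi_c, hi_f))
    return 100.0 * gap / (hi_c - lo_c)


def degree_of_separation(convective_values, fair_values):
    """Degree of separation (%) computed directly from the two samples."""
    return separation_from_intervals(
        normal_ci95(convective_values), normal_ci95(fair_values)
    )


# Hypothetical intervals: convective days (10, 14), fair-weather days (17, 20).
print(separation_from_intervals((10.0, 14.0), (17.0, 20.0)))  # 75.0
```

When the interval on convective days lies entirely below the interval on fair-weather days, the gap equals the absolute difference between the upper confidence limit on convective days and the lower confidence limit on fair-weather days, as in the definition above.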
Validation of the model
The models developed on the basis of the existing samples were applied for the prediction of premonsoon convection events over Kolkata using a different sample comprising the premonsoon weather information of the three consecutive years 1997, 1998 and 1999. Owing to partial availability of data, the test sample consists of 128 days for the morning analysis, of which on 84 days there were no incidents of convection. Some form of convection was noted on the remaining 44 days. The corresponding figures for the evening analysis were 118, 64 and 54. In line with the works of earlier researchers, convection was predicted on a particular day if the probability of a convective development on that day as suggested by the model exceeded 0.5. A value of the said probability that falls short of 0.5 indicates fair weather or absence of convection. Using these binary predictions, the predictive powers of the models were assessed separately for the morning and evening analysis with the help of performance measures for 2 × 2 contingency tables, viz. hit-rate and false-alarm rate (Mason, 2003).
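The two performance measures can be sketched in Python as below. The sketch assumes that the hit-rate is the percentage of observed convection days that were forecast, and that the false-alarm rate is the percentage of fair-weather days on which convection was forecast. The function names are hypothetical, and the counts in the example are not given in the paper: they are illustrative values chosen to match the 44 convective and 84 fair-weather days of the morning test sample.

```python
def contingency(probabilities, observed, threshold=0.5):
    """Return (hits, misses, false_alarms, correct_negatives) for a 2 x 2 table."""
    hits = misses = false_alarms = correct_negatives = 0
    for p, y in zip(probabilities, observed):
        predicted = p > threshold  # convection is forecast when p exceeds the threshold
        if y == 1 and predicted:
            hits += 1
        elif y == 1:
            misses += 1
        elif predicted:
            false_alarms += 1
        else:
            correct_negatives += 1
    return hits, misses, false_alarms, correct_negatives


def hit_rate(hits, misses):
    """Percentage of observed events that were correctly forecast."""
    return 100.0 * hits / (hits + misses)


def false_alarm_rate(false_alarms, correct_negatives):
    """Percentage of non-events for which an event was forecast."""
    return 100.0 * false_alarms / (false_alarms + correct_negatives)


# Assumed counts for illustration: 44 convective days and 84 fair-weather days.
print(round(hit_rate(8, 36), 1))          # 18.2
print(round(false_alarm_rate(6, 78), 1))  # 7.1
```

The function contingency builds the table from model probabilities and observed outcomes, so the same code applies to any threshold other than 0.5.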
RESULTS AND DISCUSSION
The degrees of separation of the variables (θes − θe) and (P − P_LCL) on convective and fair-weather days were calculated for all atmospheric layers, separately for the evening (1200 UTC) and morning (0000 UTC) analysis (Tables I and II). (P − P_LCL), owing to its greater separability, was retained and (θes − θe) dropped from the analysis.

Table I. Separation of (θes − θe) and (P − P_LCL) for 1200 UTC data.

                 Degree of separation
Variable         Layer 1 (%)   Layer 2 (%)   Layer 3 (%)   Layer 4 (%)
(θes − θe)       75.31         0             0             0
(P − P_LCL)      79.50         0             0             0

Table II. Separation of (θes − θe) and (P − P_LCL) for 0000 UTC data.

                 Degree of separation
Variable         Layer 1 (%)   Layer 2 (%)   Layer 3 (%)   Layer 4 (%)
(θes − θe)       0             18.92         24.45         69.64
(P − P_LCL)      0             38.06         67.31         81.19

Forward (LR) and backward (LR) options available within the SPSS package were applied to the reduced set of 16 atmospheric variables to determine the covariates of the final model (Tables III and IV). The models were fitted after reducing the dimensionality.

Table III. Results of variable selection methods for 1200 UTC data.

Variable selection technique   Covariates of the model
Forward selection              X2, X18, X20
Backward selection             X2, X12, X13, X17, X19, X20

Table IV. Results of variable selection methods for 0000 UTC data.

Variable selection technique   Covariates of the model
Forward selection              X3, X9, X17, X19
Backward selection             X9, X12, X13, X14, X19

The predictive powers of the two models, based on the binary predictions explained under section 'METHODOLOGY', were compared in the light of different performance measures (Tables V and VI). In the evening analysis, the model developed on the basis of backward selection gives better results, whereas in the morning analysis forward selection performs better, although both models perform rather poorly as far as correct prediction of morning convection is concerned (Table VI).

Table V. Predictive power of the models based on out of sample predictions for 1200 UTC data.

Variable selection technique   Hit-rate (%)   False-alarm rate (%)
Forward selection              62.3           30.8
Backward selection             73.6           30.8

Table VI. Predictive power of the models based on out of sample predictions for 0000 UTC data.

Variable selection technique   Hit-rate (%)   False-alarm rate (%)
Forward selection              18.2           7.1
Backward selection             16.0           7.1

To ascertain the probable causes underlying the poor performance of the models for morning convection, we checked the weather information of all the misclassified convection days. Interestingly, on most of these days the convective developments were of rather weak intensity, giving either poor or no rain. Perhaps the models were not sensitive enough to detect such weak developments. To test the validity of this conjecture we deleted the days of weak convection from the data set. The remaining observations were reanalyzed on similar lines, computing afresh the performance measures of the models under the different variable selection methods (Table VII). It is seen from Table VII that the predictive power of the models increases moderately if the days of weak convection are ignored. It may therefore be said that the accurate prediction of morning convection 12 h in advance still poses a problem.

Table VII. Predictive power of the models for 0000 UTC data after deleting days of weak convection.

Variable selection technique   Hit-rate (%)   False-alarm rate (%)
Forward selection              26.7           0.0
Backward selection             33.3           7.1

A more obvious solution was attempted by varying the threshold probability from the conventional value of 0.5. The results are presented in Tables VIII and IX. It is seen that for both variable selection techniques the hit-rate as well as the false-alarm rate increases as the threshold probability decreases. Striking a balance between the observed hit-rates and false-alarm rates, optimality in some sense is seen to be achieved at p = 0.35.

Table VIII. Effect of varying threshold probability on the model with forward selection (0000 UTC).

Threshold probability   Hit-rate (%)   False-alarm rate (%)
0.5                     18.2           7.1
0.4                     36.2           17.8
0.35                    47.7           33.3
0.3                     61.4           46.4

Table IX. Effect of varying threshold probability on the model with backward selection (0000 UTC).

Threshold probability   Hit-rate (%)   False-alarm rate (%)
0.5                     16.0           7.14
0.4                     38.6           19.0
0.35                    52.3           32.14
0.3                     65.9           40.4
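The effect of lowering the threshold probability can be reproduced qualitatively with a short Python sketch. The probabilities and outcomes below are hypothetical values invented for illustration and are not the data of this study; the function name rates is likewise an assumption. The sketch only shows why both measures can only rise or stay level as the threshold is lowered: every day forecast as convective at a higher threshold remains so at a lower one.

```python
def rates(probabilities, observed, threshold):
    """Return (hit-rate %, false-alarm rate %) for a given threshold probability."""
    events = [p for p, y in zip(probabilities, observed) if y == 1]
    non_events = [p for p, y in zip(probabilities, observed) if y == 0]
    hits = sum(p > threshold for p in events)
    false_alarms = sum(p > threshold for p in non_events)
    return 100.0 * hits / len(events), 100.0 * false_alarms / len(non_events)


# Hypothetical model probabilities and observed outcomes (1 = convection).
probabilities = [0.62, 0.45, 0.38, 0.33, 0.28, 0.55, 0.41, 0.36, 0.31, 0.12, 0.08, 0.20]
observed = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

thresholds = (0.5, 0.4, 0.35, 0.3)
sweep = {t: rates(probabilities, observed, t) for t in thresholds}
for t in thresholds:
    hr, far = sweep[t]
    print(f"threshold {t}: hit-rate {hr:.1f}%, false-alarm rate {far:.1f}%")
```

Choosing a working threshold then amounts to trading additional hits against additional false alarms, which is the balance referred to above.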
CONCLUSION
The present paper reveals that Binary Logistic Regression Models can serve as a handy and objective forecasting tool for premonsoon convective developments over Kolkata up to 12 h in advance. Backward selection and forward selection are the better variable selection techniques for the evening and morning analysis, respectively. The poor performance of the models in identifying morning convection may be attributed to the low intensity of the convection events during this period. When the days of weak convection are removed from the analysis the predictive power of the models improves, although not drastically. A more or less satisfactory performance of the models in the morning analysis is observed when the threshold probability is lowered to 0.35 instead of the usual 0.5. Another rational alternative may be to segregate the observed days into three categories instead of two, viz. (1) days of major/significant convection, (2) days of minor/weak convection and (3) fair-weather days. One can then use a mode of analysis amenable to the treatment of a trichotomous response.

ACKNOWLEDGEMENTS
The authors would like to thank the India Meteorological Department for providing data and valuable references. The first author is personally indebted to the University Grants Commission, Government of India and Reverend Father P.C. Mathew, Principal, St. Xavier's College, Kolkata, for sanctioning leave under Faculty Improvement Program, in order to pursue research activity in full earnest. The authors are also grateful to the anonymous referees, whose valuable suggestions and comments have greatly improved the exposition of this paper.

REFERENCES
Asnani GC. 1992. Tropical Meteorology 2. Asnani GC: India.
Betts AK. 1974. Thermodynamic classification of tropical convective soundings. Monthly Weather Review 108: 1046–1053.
Chiu LS, Kadem B. 1990. Estimating the exceedance probability of rain rate by logistic regression. Journal of Geophysical Research-Atmospheres 95: 2217–2227.
Crosby DS, Ferraro RR, Wu H. 1995. Estimating the probability of rain in an SSM/I FOV using logistic regression. Journal of Applied Meteorology 34: 2476–2480.
Darkow GL. 1968. The total energy environment of severe storms. Journal of Applied Meteorology 7: 199–205.
Dasgupta S, De UK. 2001. Markov Chain models for pre-monsoon thunderstorms in Calcutta, India. Indian Journal of Radio & Space Physics 30: 138–147.
Elsner JB, Lehmiller TB, Kimberlain TB. 1996. Objective classification of Atlantic Basin hurricanes. Journal of Climate 9: 2880–2889.
Fujita TT, Bradbury DL, van Thullenar CF. 1970. Palm Sunday tornadoes of April 11, 1965. Monthly Weather Review 98: 26–29.
Galway JG. 1956. The lifted index as a predictor of latent instability. Bulletin of the American Meteorological Society 37: 528–529.
Ghosh S, Sen PK, De UK. 1999. Identification of significant parameters for the prediction of pre-monsoon thunderstorms at Calcutta, India. International Journal of Climatology 19: 673–681.
Ghosh S, Sen PK, De UK. 2004. Classification of thunderstorm and non-thunderstorm days in Calcutta (India) on the basis of LDA. Atmosfera 17: 1–12.
Gray WM, Landsea CW, Mielke PW, Berry KJ. 1992. Predicting Atlantic seasonal hurricane activity 6–11 months in advance. Weather and Forecasting 7: 440–455.
Hess JC, Elsner JB, La Seur NE. 1995. Improving seasonal hurricane predictions for the Atlantic Basin. Weather and Forecasting 10: 425–432.
Kessler E. 1982. Thunderstorm Morphology and Dynamics. US Department of Commerce: USA.
Kleinbaum DG. 1994. Logistic Regression. Springer Verlag: New York.
Kuo HL. 1965. On formation and intensification of tropical cyclones through latent heat release by cumulus convection. Journal of Atmospheric Sciences 22: 40–63.
Mason IB. 2003. Forecast Verification: A Practitioner's Guide in Atmospheric Science, Jolliffe IT, Stephenson DB (eds). John Wiley and Sons: England.
Mazany RA, Businger S, Gutman SI, Roeder W. 2002. A lightning index that utilizes GPS integrated precipitable water vapor. Weather and Forecasting 17: 1034–1048.
Miller RG. 1972. Notes on analysis and severe storm forecasting procedures of the air force global weather control. AFGWC Technical Report 200. (Rev) Air Weather Service: US Air Force.
Paranjpe SA, Gore AP. 1991. A parsimonious model for prediction of monsoon rainfall in India. Current Science 60: 446–448.
Pettersen S. 1959. Introduction to Meteorology. McGraw Hill: USA.
Sanchez JL, de La Fuente MT, Castro A. 1998a. A logistic regression model for short-term prediction of hail risk in Spain. Physics Chemistry Earth 23: 645–648.
Sanchez JL, Fraile MT, de La Fuente MT, Marcos JL. 1998b. Discriminant Analysis applied to forecasting of thunderstorms. Meteorology and Atmospheric Physics 68: 187–195.
Scarborough JB. 1966. Numerical Analysis. IBH Publishing Co Pvt. Ltd: Oxford.
Sharma S. 1996. Applied Multivariate Techniques. John Wiley and Sons: USA.
Showalter AK. 1953. A stability index for thunderstorm forecasting. Bulletin of the American Meteorological Society 34: 250–252.
Williams E, Renno N. 1993. An analysis of the conditional instability of the tropical atmosphere. Monthly Weather Review 121: 23–26.