Institute of Mathematical Statistics Collections, Volume 7

Nonparametrics and Robustness in Modern Statistical Inference and Time Series Analysis: A Festschrift in honor of Professor Jana Jurečková

J. Antoch, M. Hušková and P. K. Sen, Editors

Institute of Mathematical Statistics Beachwood, Ohio, USA

Institute of Mathematical Statistics Collections

The production of the Institute of Mathematical Statistics Collections is managed by the IMS Office: Marten Wegkamp, Executive Secretary and Elyse Gustafson, Executive Director.

Library of Congress Control Number: 2010937067
International Standard Book Number: 978-0-940600-80-5
International Standard Serial Number: 1939-4039
Copyright © 2010 Institute of Mathematical Statistics
All rights reserved
Printed in Lithuania

Contents

Preface
Jaromír Antoch, Marie Hušková and Pranab K. Sen . . . . . v
Contributors of this volume . . . . . vii
Life and Work of Jana Jurečková: An Appreciation
Jaromír Antoch, Marie Hušková and Pranab K. Sen . . . . . 1
Nonparametric comparison of ROC curves: Testing equivalence
Jaromír Antoch, Luboš Prchal and Pascal Sarda . . . . . 12
The unbearable transparency of Stein estimation
Rudolf Beran . . . . . 25
On the estimation of cross-information quantities in rank-based inference
Delphine Cassart, Marc Hallin and Davy Paindaveine . . . . . 35
Estimation of irregular probability densities
Lieven Desmet, Irène Gijbels and Alexandre Lambert . . . . . 46
Measuring directional dependency
Yadolah Dodge and Iraj Yadegari . . . . . 62
On a paradoxical property of the Kolmogorov–Smirnov two-sample test
Alexander Y. Gordon and Lev B. Klebanov . . . . . 70
MCD-RoSIS – A robust procedure for variable selection
Charlotte Guddat, Ursula Gather and Sonja Kuhnt . . . . . 75
A note on reference limits
Jing-Ye Huang, Lin-An Chen and Alan H. Welsh . . . . . 84
Simple sequential procedures for change in distribution
Marie Hušková and Ondřej Chochola . . . . . 95
A class of multivariate distributions related to distributions with a Gaussian component
Abram M. Kagan and Lev B. Klebanov . . . . . 105

Locating landmarks using templates
Jan Kalina . . . . . 113
On the asymptotic distribution of the analytic center estimator
Keith Knight . . . . . 123
Rank tests for heterogeneous treatment effects with covariates
Roger Koenker . . . . . 134
A class of minimum distance estimators in AR(p) models with infinite error variance
Hira L. Koul and Xiaoyu Li . . . . . 143
Integral functionals of the density
David M. Mason, Elizbar Nadaraya and Grigol Sokhadze . . . . . 153
Qualitative robustness and weak continuity: the extreme unction?
Ivan Mizera . . . . . 169
Asymptotic theory of the spatial median
Jyrki Möttönen, Klaus Nordhausen and Hannu Oja . . . . . 182
Second-order asymptotic representation of M-estimators in a linear model
Marek Omelka . . . . . 194
Extremes of two-step regression quantiles
Jan Picek and Jan Dienstbier . . . . . 204
Is ignorance bliss: Fixed vs. random censoring
Stephen Portnoy . . . . . 215
The Theil–Sen estimator in a measurement error perspective
Pranab K. Sen and A. K. Md. Ehsanes Saleh . . . . . 224
The Lasso with within group structure
Sara van de Geer . . . . . 235
Nonparametric estimation of residual quantiles in a conditional Koziol–Green model with dependent censoring
Noël Veraverbeke . . . . . 245
Robust error-term-scale estimate
Jan Ámos Víšek . . . . . 254

Preface

In the broader domain of statistics and probability theory, Jana Jurečková is a distinguished researcher, especially in Europe and among women researchers. Her fundamental research contributions stemmed from the inspiring ideas of her advisor Jaroslav Hájek and cover the evolving areas of nonparametrics, general asymptotic theory and robust statistics, as well as applications in sampling theory, econometrics and environmetrics. She has provided longstanding and exemplary leadership in academia; the Czech(oslovak) school of statistics has indeed benefited from her professional acumen. Jana has had an illustrious career in the Department of Probability and Mathematical Statistics at the Faculty of Mathematics and Physics, Charles University in Prague, for over forty years. It was thought that at this juncture of her career, Jana should be honored and recognized for her standing in her professional field.

The idea of launching this Festschrift was essentially due to the three editors of this volume, and the spontaneous response from her colleagues far beyond the boundaries of her native country has been truly inspiring. Jana has a long list of collaborators and former advisees too. Because of time and space constraints, not all of them could be invited to contribute to this Festschrift; a chosen handful of contributors with professional standing span the broad area of Jana's research interests. Besides an article of appreciation of Jana's life and work by the three editors, there are twenty-four articles by more than forty co-authors. All these articles have gone through the usual peer review in strict adherence to the high standards of the IMS Lecture Notes and Collections series. To all the reviewers we express our deep gratitude for their most timely job. To the contributors we are truly grateful for their willingness to contribute in a relatively short time, and to many of them for refereeing some of the other submitted articles. We could not have reached this stage of the Festschrift without their support and interest.

Last but not least, our profound thanks are due to the Department of Probability and Mathematical Statistics at the Faculty of Mathematics and Physics, Charles University in Prague, for their enthusiastic support in all respects. We would like to thank Blanka Anfilová for all administrative assistance and Iva Marešová for her meticulous work on the electronic preparation of the manuscript during the reviewing stage as well as the final phase. We are indeed very grateful to the IMS Editorial Office for their constant assistance from the initiation of this project to its very completion. Initial contact with Professor Anirban DasGupta (editor of the IMS Collections series at that time) facilitated the negotiation. In particular, IMS Executive Director Elyse Gustafson deserves our most sincere appreciation for her untiring efforts to have this volume released on time, and the same applies to Production Manager Geri Mattson for carrying out publication-related tasks expeditiously and efficiently. We conclude this note with a toast to Jana Jurečková for her consent to this project and occasional help too.


Jaromír Antoch, Charles University, Czech Republic
Marie Hušková, Charles University, Czech Republic
Pranab K. Sen, University of North Carolina, U.S.A.


Contributors to this volume

Antoch, J., Charles University in Prague
Beran, R., University of California
Cassart, D., Université Libre de Bruxelles
Chen, L.-A., National Chiao Tung University, Taiwan
Chochola, O., Charles University
Desmet, L., Katholieke Universiteit Leuven
Dienstbier, J., Charles University in Prague
Dodge, Y., Université de Neuchâtel
Gather, U., Technische Universität Dortmund
Gijbels, I., Katholieke Universiteit Leuven
Gordon, A. Y., University of North Carolina, Charlotte
Guddat, Ch., Technische Universität Dortmund
Hallin, M., Université Libre de Bruxelles
Huang, J.-Y., National Institute of Technology, Taiwan
Hušková, M., Charles University in Prague
Kagan, A. M., University of Maryland
Kalina, J., Charles University in Prague
Klebanov, L. B., Charles University in Prague
Knight, K., University of Toronto
Koenker, R., University of Illinois
Koul, H. L., Michigan State University
Kuhnt, S., Technische Universität Dortmund
Lambert, A., Université catholique de Louvain
Mason, D. M., University of Delaware
Mizera, I., University of Alberta
Möttönen, J., University of Helsinki
Nadaraya, E., Tbilisi State University
Nordhausen, K., University of Tampere
Oja, H., University of Tampere
Omelka, M., Charles University in Prague
Paindaveine, D., Université Libre de Bruxelles
Picek, J., Technical University of Liberec
Portnoy, S., University of Illinois
Prchal, L., Charles University in Prague


Saleh, E., Carleton University, Ottawa
Sarda, P., Université Toulouse
Sen, P. K., University of North Carolina, Chapel Hill
Sokhadze, G., Tbilisi State University
van de Geer, S., ETH Zurich
Veraverbeke, N., Hasselt University
Víšek, J. Á., Charles University in Prague
Welsh, A. H., Australian National University, Canberra
Li, X., Michigan State University
Yadegari, I., Tarbiat Modares University of Tehran


IMS Collections
Nonparametrics and Robustness in Modern Statistical Inference and Time Series Analysis: A Festschrift in honor of Professor Jana Jurečková
Vol. 7 (2010) 1–11
© Institute of Mathematical Statistics, 2010
DOI: 10.1214/10-IMSCOLL701

Life and Work of Jana Jurečková: An Appreciation

Jaromír Antoch¹, Marie Hušková¹ and Pranab K. Sen²

Charles University in Prague, and University of North Carolina at Chapel Hill

Abstract: Professor Jana Jurečková has extensively contributed to many areas in statistics and probability theory. Her contributions are highlighted here with reverent appreciation and admiration from all her colleagues, advisees and research collaborators.

¹Department of Probability and Mathematical Statistics, Charles University, Prague, Sokolovská 83, CZ–186 75 Prague 8, Czech Republic. e-mails: {antoch,huskova}@karlin.mff.cuni.cz
²Departments of Biostatistics and of Statistics and Operations Research, University of North Carolina, Chapel Hill, NC 27599-7420, USA. e-mail: [email protected]
AMS 2000 subject classifications: 60, 62.
Keywords and phrases: asymptotics, Jaroslav Hájek, nonparametrics, rank estimator, regression rank scores, robustness, tail-behavior, uniform asymptotic linearity.

Professor Jana Jurečková has extensively contributed to many areas in statistics and probability theory. This article is an appreciation of her work that draws on feedback from her colleagues, collaborators and advisees.

Jana (Přistoupilová) Jurečková was born on September 20, 1940, in Prague, Czechoslovakia. She spent a larger part of her childhood in Roudnice nad Labem (central Bohemia). She had her school and college education in Prague. She graduated (MSc.) from Charles University in Prague in 1962, and earned her Ph.D. (CSc.) degree in Statistics in 1967 from the Czechoslovak Academy of Sciences. Her dissertation advisor was Professor Jaroslav Hájek, and the principal theme of her thesis was R-estimation based on the uniform asymptotic linearity of linear rank statistics in the parameter of interest. This work not only provided a rigorous justification of asymptotics for rank estimators in linear models but also opened up a novel and elegant approach to the study of asymptotic properties of nonparametric tests and estimates. In later years Jana has tremendously expanded the domain of this basic theme far beyond linear rank statistics.

Professor Hájek was inspirational for the career development not only of a number of his advisees but also of many other colleagues in Prague and abroad. All three editors of this collection have benefited a lot from Professor Hájek's vision and his professional acumen. Jana was no less fortunate than others, joining the Department of Probability and Mathematical Statistics at the Faculty of Mathematics and Physics, Charles University in Prague, in 1964, even before her dissertation work was completed. It is interesting to note that Professor Hájek formed a very active and bright group of researchers, including some outstanding women members, in the Department of Probability and Mathematical Statistics at a time when the mathematical sciences used to have a far smaller representation of women researchers and teachers.


With a deep sense of devotion and professional acumen, Jana has been associated with her alma mater for more than forty-three years. In 1982 Jana passed her habilitation, in 1984 she defended her DrSc. scientific degree and, finally, in 1992 she was appointed by the president of Czechoslovakia to the rank of full professor. At present she holds a pivotal position at the Jaroslav Hájek Center of Theoretical and Applied Statistics, established under the auspices of the Ministry of Education, Czech Republic.

Jana has published extensively (more than 120 scientific articles), mostly in the leading journals of statistics and probability theory, and has coauthored a number of monographs too. We enclose herewith Jana's list of publications. Her research interests cover a wide area of statistical inference, including (but not limited to): a) statistical procedures based on ranks; b) robust statistical procedures based on so-called M-statistics and L-statistics; c) statistical procedures based on extreme sample observations; d) tail behavior and its application in statistics; e) asymptotic methods of mathematical statistics; f) finite sample behavior of estimates and tests.

During 1967–1976 her research attention mostly focused on nonparametrics, and the impact of Professor Hájek's outstanding research was well reflected in Jana's work. After the demise of Jaroslav Hájek in 1974 she faced a harder way to excel in research, and undertook a much broader research field. In the late 1970s she exploited the relationship of R-, M- and L-estimators using the same uniform asymptotic linearity she had developed in her dissertation. In collaboration with P. K. Sen, in 1981, she exploited the moment convergence of R-, L- and M-estimators in a broader sequential context. Berry–Esseen bounds for the normal approximation of the distribution of rank statistics were also studied in detail. Gradually, she became more interested in robust inference, and has contributed extensively in this field. Her research, some in collaboration with others, culminated in the 1996 Jurečková–Sen book.

Regression rank scores and their use in statistical inference have been a favorite topic of Jana's research. Her work with Gutenbrunner, Portnoy and Koenker is especially noticeable. Moving beyond the independence assumption was a natural follow-up, and in that respect, in later years, her work with Hallin, Koul and others is noteworthy. Shrinkage estimation in a robust setup also bears Jana's imprint from time to time. Quantile regression is another topic of Jana's research interest. Extreme value distributions and tail behavior of robust statistics have been extensively studied by Jana. In addition, she has always delved into adaptive robust inference and its computational aspects, with later work with Jan Picek bearing testimony to this.

Jana has co-edited a number of monographs, the most noteworthy being the first one, in 1978: Contributions to Statistics: Essays in memory of Jaroslav Hájek. She has co-authored a number of monographs, some at an advanced level and some at intermediate levels with some emphasis on data analysis. They are all cited in the enclosed list of publications. Jana has an impressive list of places that she has visited from time to time: University of North Carolina, Chapel Hill; University of Illinois at Urbana-Champaign; Université Libre de Bruxelles; Université P. Sabatier, Toulouse; Université de Neuchâtel; Carleton University in Ottawa; Limburgs University Center at Diepenbeek; Humboldt University in Berlin; University of Bordeaux; and University of Rome. Jana can communicate well in Russian, French, German and English, in addition to her mother tongue. No wonder she has had collaborators from all five continents and in diverse setups.


Jana has a most impressive list of extensive international research collaborations, including Pranab K. Sen (University of North Carolina), Roger Koenker and Stephen Portnoy (University of Illinois), Ehsanes Saleh (Carleton University, Ottawa), Marc Hallin (Université Libre de Bruxelles), Xavier Milhaud (University P. Sabatier, Toulouse), Hira Koul (Michigan State University), Yadolah Dodge (Université de Neuchâtel), Paul Janssen and Noël Veraverbeke (Hasselt University), Lev B. Klebanov (St. Petersburg, Charles University) and Keith Knight (University of Toronto), among others, as well as a number of her own advisees all over the world. She has had long fruitful discussions with Abram M. Kagan (St. Petersburg, University of Maryland), Alan Welsh (The Australian National University), Witting (University of Freiburg) and Ivan Mizera (University of Alberta), among others. Note that Jana has supervised the doctoral dissertations of more than 15 advisees; the list is appended here. Many of them have attained significant professional recognition. Jana has had important international collaborative research grants, co-sponsored by the Czech National Grant Agency, the National Science Foundation in the USA and NSERC in Canada.

Jana is an elected member of the International Statistical Institute, a Fellow of the Institute of Mathematical Statistics, and a member of the Bernoulli Society (where she served on the council as well as on its European regional committee). Since 2003 she has been an elected fellow of the Learned Society of the Czech Republic, the most prestigious Czech scientific society. During 2000 she visited Belgian universities for six months under a Francqui Foundation distinguished faculty position. She has served on the editorial boards of a number of leading statistics journals, including the Annals of Statistics (1992–1997), Journal of the American Statistical Association (2006–2008), Sankhya (2006–2011), Sequential Analysis (1982–2002), and Statistics (1980–1993). She has also served on the review panel of (US) NSF grants and for many other research-sponsoring agencies in the Czech Republic and elsewhere.

Jana has organized or co-organized a number of important international meetings. Let us mention the series of successful workshops on Perspectives in Modern Statistical Inference (1998, 2002, 2005), the series of conferences on L1-statistical procedures (1987, 1992, 1997, 2002) and ICORS 2010, among others. With Jaromír Antoch and Tomáš Havránek, Jana started in 1980 the very successful series of biennial ROBUST conferences, which has impacted the Czech statistical community in a profound way.

Even now Jana is as active, as energetic and as persuasive in basic research and professional development as at the time when she started her scientific career. We wish her a very long and even more prosperous life in the future.


PH.D. GUIDANCE

Jana served as the adviser or co-adviser and supervised the doctoral dissertations of the following persons at Charles University in Prague (Cornelius Gutenbrunner at Freiburg University and Hana Kotoučková at Masaryk University, Brno).

Jaromír Antoch (1982): Behavior of the Location Estimators from the Point of View of Large Deviations (adviser V. Dupač)
Cornelius Gutenbrunner (1985): Zur Asymptotik von Regression Quantil Prozessen und daraus abgeleiteten Statistiken
Jan Hanousek (1990): Robust Bayesian Type Estimators
Marek Malý (1991): The Asymptotics for Studentized k-step M-Estimators of Location
Bohumír Procházka (1992): Trimmed Estimates in the Nonlinear Regression Model
Ivan Mizera (1993): Weak Continuity and Identifiability of M-Functionals
Jan Picek (1996): Testing Linear Hypotheses Based on Regression Rank Scores
Ivo Müller (1996): Robust Methods in the Linear Calibration Model
Jan Svatoš (2000): M-estimators in Linear Model for Irregular Densities
Alena Fialová (2001): Estimating and Testing Pareto Tail Index
Marek Omelka (2006): Second Order Properties of some M- and R-estimators
Martin Schindler (2008): Inference Based on Regression Rank Scores
Hana Kotoučková (2009): History of Robust Mathematical-Statistical Methods

MONOGRAPHS AND TEXTBOOKS

Robuste statistische Methoden in linearen Modellen. In: K. M. S. Humak: Statistische Methoden der Modellbildung II, 195–255 (Chapter 2). Akademie-Verlag, Berlin, 1983. English translation: Nonlinear Regression, Functional Relations and Robust Methods (Bunke, H. and Bunke, O., eds.), 104–158 (Chapter 2). J. Wiley, New York, 1989.
Robust Statistical Inference: Asymptotics and Interrelations (co-author P. K. Sen). J. Wiley, New York, 1996.
Adaptive Regression (co-author Y. Dodge). Springer-Verlag, New York, 2000.
Robust Statistical Methods (textbook, in Czech). Karolinum, Publishing House of Charles University in Prague, 2001.
Robust Statistical Methods with R (co-author J. Picek). Chapman & Hall/CRC, 2005.


Publications of Jana Jurečková

Jurečková, J. (1969). Asymptotic linearity of a rank statistic in regression parameter. Ann. Math. Statist. 40, 1889–1900.
Jurečková, J. (1971). Nonparametric estimate of regression coefficients. Ann. Math. Statist. 42, 1328–1338.
Jurečková, J. (1971). Asymptotic independence of rank test statistic for testing symmetry on regression. Sankhya A 33, 1–18.
Jurečková, J. (1972). An asymptotic theorem of nonparametrics. Coll. Math. Soc. J. Bolyai (Proc. 9th European Meeting of Statisticians), pp. 373–380.
Jurečková, J. (1973). Almost sure uniform asymptotic linearity of rank statistics in regression parameter. Trans. 6th Prague Conf. on Inform. Theory, Random Processes and Statist. Decis. Functions, pp. 305–313.
Jurečková, J. (1973). Central limit theorem for Wilcoxon rank statistics process. Ann. Statist. 1, 1046–1060.
Jurečková, J. (1973). Asymptotic behaviour of rank and signed rank statistics from the point of view of applications. Proc. 1st Prague Symp. on Asympt. Statist., Vol. 1, pp. 139–155.
Jurečková, J. (1975). Nonparametric estimation and testing linear hypotheses in the linear regression model. Math. Operationsforsch. Statist. 6, 269–283.
Jurečková, J. (1975). Asymptotic comparison of maximum likelihood and a rank estimate in simple linear regression model. Comment. Math. Univ. Carolinae 16, 87–97.
Jurečková, J. and Puri, M. L. (1975). Order of normal approximation of rank statistics distribution. Ann. Probab. 3, 526–533.
Jurečková, J. (1977). Asymptotic relations of least-squares estimate and of two robust estimates of regression parameter vector. Trans. 7th Prague Conf. on Inform. Theory, Random Processes and Statist. Decis. Functions A, pp. 231–237.
Jurečková, J. (1977). Locally optimal estimates of location. Comment. Math. Univ. Carolinae 18, 599–610.
Jurečková, J. (1977). Asymptotic relations of M-estimates and R-estimates in linear regression model. Ann. Statist. 5, 464–472.
Jurečková, J. (1978). Bounded-length sequential confidence intervals for regression and location parameters. Proc. 2nd Prague Symp. on Asympt. Statist., pp. 239–250.
Jurečková, J. (1979). Finite-sample comparison of L-estimators of location. Comment. Math. Univ. Carolinae 20, 507–518.
Jurečková, J., ed. (1979). Contributions to Statistics. Jaroslav Hájek Memorial Volume. Academia, Prague and Reidel, Dordrecht.
Jurečková, J. (1979). Nuisance medians in rank testing scale. Contributions to Statistics – J. Hájek Memorial Volume (Jurečková, J., ed.), pp. 109–117.
Jurečková, J. (1980). Asymptotic representation of M-estimators of location. Math. Operationsforsch. Statist., Ser. Statistics 11, 61–73.
Jurečková, J. (1980). Rate of consistency of one-sample tests of location. J. Statist. Planning Infer. 4, 249–257.
Jurečková, J. (1980). Robust statistical inference in linear regression model. Proc. 3rd Intern. Summer School on Probab. and Statistics, Varna 1978, pp. 141–166.
Jurečková, J. (1980). Robust estimation in linear regression model. Banach Centre Publications 6, 168–174.


Jurečková, J. and Sen, P. K. (1981). Sequential procedures based on M-estimators with discontinuous score functions. J. Statist. Planning Infer. 5, 253–266.
Jurečková, J. and Sen, P. K. (1981). Invariance principles for some stochastic processes related to M-estimators and their role in sequential statistical inference. Sankhya A 43, 190–210.
Jurečková, J. (1981). Tail behavior of location estimators. Ann. Statist. 9, 578–585.
Hušková, M. and Jurečková, J. (1981). Second order asymptotic relations of M-estimators and R-estimators in two-sample location model. J. Statist. Planning Infer. 5, 309–328.
Jurečková, J. (1981). Tail behaviour of location estimators in non-regular cases. Comment. Math. Univ. Carolinae 22, 365–375.
Jurečková, J. and Sen, P. K. (1982). Simultaneous M-estimator of the common location and scale-ratio in the two-sample problem. Math. Operationsforsch. Statist., Ser. Statistics 13, 163–169.
Jurečková, J. and Sen, P. K. (1982). M-estimators and L-estimators of location: Uniform integrability and asymptotically risk-efficient sequential version. Comm. Statist. C 1, 27–56.
Jurečková, J. (1982). Tests of location and criterion of tails. Coll. Math. Soc. J. Bolyai 32, 469–478.
Jurečková, J. (1983). Robust estimators of location and regression parameters and their second order asymptotic relations. Trans. 9th Prague Conf. on Inform. Theory, Random Processes and Statist. Decis. Functions, pp. 19–32. Academia, Prague.
Jurečková, J. (1983). Asymptotic behavior of M-estimators of location in non-regular cases. Statistics & Decisions 1, 323–340.
Jurečková, J. (1983). Winsorized least-squares estimator and its M-estimator counterpart. Contributions to Statistics: Essays in Honour of Norman L. Johnson (Sen, P. K., ed.), pp. 237–245. North Holland.
Jurečková, J. (1983). Trimmed polynomial regression. Comment. Math. Univ. Carolinae 24, 597–607.
Jurečková, J. (1983). Robust estimators and their relations. Acta Univ. Carolinae – Math. et Phys. 24, 49–59.
Jurečková, J. (1984). Regression quantiles and trimmed least squares estimator under a general design. Kybernetika 20, 345–357.
Jurečková, J. (1984). Rates of consistency of classical one-sided tests. Robustness of Statistical Methods and Nonparametric Statist. (Rasch, D. and Tiku, M. L., eds.), pp. 60–62. Deutscher Verlag der Wissenschaften, Berlin.
Jurečková, J. and Víšek, J. Á. (1984). Sensitivity of Chow–Robbins procedure to the contamination. Sequential Analysis 3, 175–190.
Jurečková, J. and Sen, P. K. (1984). On adaptive scale-equivariant M-estimators in linear models. Statistics & Decisions 2, Suppl. Issue No. 1, 31–46.
Behnen, K., Hušková, M., Jurečková, J. and Neuhaus, G. (1984). Two-sample linear rank tests and their Bahadur efficiencies. Proc. 3rd Prague Symp. on Asympt. Statist. 1, pp. 103–117.
Jurečková, J. (1984). M-, L- and R-estimators. Handbook of Statistics Vol. 4 (Krishnaiah, P. R. and Sen, P. K., eds.), pp. 464–485 (Chapter 21). Elsevier Sci. Publishers.
Jurečková, J. (1985). Representation of M-estimators with the second order asymptotic distribution. Statistics & Decisions 3, 263–276.


Janssen, P., Jurečková, J. and Veraverbeke, N. (1985). Rate of convergence of one- and two-step M-estimators with applications to maximum likelihood and Pitman estimators. Ann. Statist. 13, 1222–1229.
Jurečková, J. (1985). Robust estimators of location and their second-order asymptotic relations. Celebration of Statistics. The ISI Centenary Volume (Atkinson, A. C. and Fienberg, S. E., eds.), pp. 377–392. Springer-Verlag, New York.
Antoch, J. and Jurečková, J. (1985). Trimmed least squares estimator resistant to leverage points. Comp. Statist. Quarterly 4, 329–339.
Jurečková, J. (1985). Sequential confidence intervals based on robust estimators. Sequential Methods in Statistics. Banach Centre Publications 16, 309–319.
Jurečková, J. (1985). Tail-behavior of L-estimators and M-estimators. Proc. 4th Pannonian Symp. 1, pp. 205–217.
Hušková, M. and Jurečková, J. (1985). Asymptotic representation of R-estimators of location. Proc. 4th Pannonian Symp. 1, 145–165.
Jurečková, J. (1985). Linear statistical inference based on L-estimators. Linear Statistical Inference (Calinski, T. and Klonecki, W., eds.), pp. 88–98. Lecture Notes in Statistics 15, Springer-Verlag.
Jurečková, J. (1986). Asymptotic representation of L-estimators and their relations to M-estimators. Sequential Analysis 5, 317–338.
Jurečková, J. and Kallenberg, W. C. M. (1987). On local inaccuracy rates and asymptotic variances. Statistics & Decisions 5, 139–158.
Jurečková, J. and Sen, P. K. (1987). A second order asymptotic distributional representation of M-estimators with discontinuous score functions. Ann. Probab. 15, 814–823.
Jurečková, J. and Portnoy, S. (1987). Asymptotics for one-step M-estimators in regression with application to combining efficiency and high breakdown point. Comm. Statist. A 16, 2187–2199.
Jurečková, J. and Sen, P. K. (1987). An extension of Billingsley's uniform boundedness theorem to higher dimensional M-processes. Kybernetika 23, 382–387.
Jurečková, J., Kallenberg, W. C. M. and Veraverbeke, N. (1988). Moderate and Cramér-type deviation theorems for M-estimators. Statist. Probab. Letters 6, 191–199.
Dodge, Y. and Jurečková, J. (1988). Adaptive combination of least squares and least absolute deviations estimators. Statist. Analysis Based on L1-Norm (Dodge, Y., ed.), pp. 275–284. North Holland.
Dodge, Y. and Jurečková, J. (1988). Adaptive combination of M-estimator and L1-estimator in the linear model. Optimal Design and Analysis of Experiments (Dodge, Y., Fedorov, V. V. and Wynn, H. P., eds.), pp. 167–176. Elsevier Sci. Publ., Amsterdam.
Jurečková, J. (1989). Consistency of M-estimators in linear model generated by non-monotone and discontinuous ψ-functions. Probab. and Math. Statist. 10, 1–10.
Jurečková, J. and Sen, P. K. (1989). Uniform second order asymptotic linearity of M-estimators in linear models. Statistics & Decisions 7, 263–276.
Jurečková, J., Saleh, A. K. M. E. and Sen, P. K. (1989). Regression quantiles and improved L-estimation in linear models. Probability, Statistics and Design of Experiments (Bahadur, R. R., ed.), pp. 405–418. Wiley Eastern Ltd., New Delhi.
Jurečková, J. (1989). Consistency of M-estimators of vector parameters. Proc. 4th Prague Conf. on Asympt. Statist. (Mandl, P. and Hušková, M., eds.), pp. 305–312. Charles University Press, Prague.


Jurečková, J. and Saleh, A. K. M. E. (1990). Robustified version of Stein's multivariate location estimation. Statist. and Probab. Letters 9, 375–380.
Jurečková, J. and Sen, P. K. (1990). Effect of the initial estimator on the asymptotic behavior of one-step M-estimator. Ann. Inst. Statist. Math. 42, 345–357.
Jurečková, J. and Welsh, A. H. (1990). Asymptotic relations between L- and M-estimators in the linear model. Ann. Inst. Statist. Math. 42, 671–698.
He, X., Jurečková, J., Koenker, R. and Portnoy, S. (1990). Tail behavior of regression estimators and their breakdown points. Econometrica 58, 1195–1214.
Jurečková, J. (1991). Confidence sets and intervals. Handbook of Sequential Analysis (Ghosh, B. K. and Sen, P. K., eds.), pp. 269–281 (Chapter 11). M. Dekker, New York.
Dodge, Y., Jurečková, J. and Antoch, J. (1991). Adaptive combination of least squares and least absolute deviations estimators: Computational aspects. Comp. Statist. & Data Analysis 12, 87–100.
Dodge, Y. and Jurečková, J. (1991). Flexible L-estimation in the linear model. Comp. Statist. & Data Analysis 12, 211–220.
Jurečková, J. (1991). Comments to the paper "Nonparametrics: retrospectives and perspectives" by P. K. Sen. Nonpar. Statist. 1, 49–50.
Jurečková, J. (1992). Estimation in a linear model based on regression rank scores. Nonpar. Statist. 1, 197–203.
Gutenbrunner, C. and Jurečková, J. (1992). Regression rank scores and regression quantiles. Ann. Statist. 20, 305–330.
Jurečková, J. (1992). Uniform asymptotic linearity of regression rank scores process. Nonparametric Statistics and Related Topics (Saleh, A. K. M. E., ed.), pp. 217–228. Elsevier Sciences Publishers.
Jurečková, J. (1992). Tests of Kolmogorov–Smirnov type based on regression rank scores. Trans. 11th Prague Conf. on Inform. Theory, Random Proc. and Statist. Decis. Functions, Vol. B (Víšek, J. Á., ed.), pp. 41–49. Academia, Prague & Kluwer Acad. Publ.
Dodge, Y. and Jurečková, J. (1992). A class of estimators based on adaptive convex combinations of two estimation procedures. L1-Statist. Analysis and Related Methods (Dodge, Y., ed.), pp. 31–45. North Holland.
Gutenbrunner, C., Jurečková, J., Koenker, R. and Portnoy, S. (1993). Tests of linear hypotheses based on regression rank scores. Nonpar. Statist. 2, 307–331.
Jurečková, J. and Sen, P. K. (1993). Asymptotic equivalence of regression rank scores estimators and R-estimators in linear models. Statistics and Probability: A Raghu Raj Bahadur Festschrift (Ghosh, J. K., Mitra, S. K., Parthasarathy, K. R. and Prakasa Rao, B. L. S., eds.), pp. 279–292. Wiley Eastern Limited Publishers.
Jurečková, J. and Sen, P. K. (1993). Regression rank scores scale statistics and studentization in linear models. Asymptotic Statistics [Proc. 5th Prague Symp.] (Mandl, P. and Hušková, M., eds.), pp. 111–121. Physica-Verlag, Heidelberg.
Jurečková, J. and Milhaud, X. (1993). Shrinkage of maximum likelihood estimator of multivariate location. Asymptotic Statistics [Proc. 5th Prague Symp.] (Mandl, P. and Hušková, M., eds.), pp. 303–318. Physica-Verlag, Heidelberg.
Jurečková, J. and Procházka, B. (1994). Regression quantiles and trimmed least squares estimator in nonlinear regression model. Nonpar. Statist. 3, 201–222.
Jurečková, J., Koenker, R. and Welsh, A. H. (1994). Adaptive choice of trimming proportions. Ann. Inst. Statist. Math. 46, 737–755.


Jurečková, J. (1995). Regression rank scores: Asymptotic linearity and RR-estimators. Proceedings of MODA 4 (Kitsos, C. P. and Müller, W. G., eds.), pp. 193–203. Physica-Verlag, Heidelberg.
Dodge, Y. and Jurečková, J. (1995). Estimation of quantile density function based on regression quantiles. Statist. Probab. Letters 23, 73–78.
Jurečková, J. (1995). Affine- and scale-equivariant M-estimators in linear model. Probability and Math. Statist. 15, 397–407.
Jurečková, J. and Malý, M. (1995). The asymptotics for studentized k-step M-estimators of location. Sequential Analysis 14 (3), 225–245.
Jurečková, J. (1995). Trimmed mean and Huber's estimator: Their difference as a goodness-of-fit criterion. J. of Statistical Science 29, 31–35.
Jurečková, J., ed. (1997). Environmental Statistics and Earth Science. Environmetrics 7 (5) (special issue).
Dodge, Y. and Jurečková, J. (1997). Adaptive choice of trimming proportion in trimmed least squares estimation. Statist. and Probab. Letters 33, 167–170.
Hallin, M., Jurečková, J., Kalvová, J., Picek, J. and Zahaf, T. (1997). Nonparametric tests in AR models with applications to climatic data. Environmetrics 8, 651–660.
Jurečková, J. and Klebanov, L. B. (1997). Inadmissibility of robust estimators with respect to L1 norm. L1-Statistical Procedures and Related Topics (Dodge, Y., ed.). IMS Lecture Notes – Monograph Series 31, 71–78.
Jurečková, J. and Sen, P. K. (1997). Asymptotic representations and interrelations of robust estimators and their applications. Handbook of Statistics 15 (Maddala, G. S. and Rao, C. R., eds.), pp. 467–512. North Holland.
Jurečková, J. (1998). Characterization and admissibility in invariant models. Prague Stochastics '98 (Hušková, M., Lachout, P. and Víšek, J. Á., eds.), pp. 275–278. JČMF, Praha.
Hallin, M., Jurečková, J. and Milhaud, X. (1998). Characterization of error distributions in time series regression models. Statist. and Probab. Letters 38, 335–345.
Hušková, M. and Jurečková, J. (1998). Jaroslav Hájek and his impact on the theory of rank tests. Collected Works of Jaroslav Hájek with Commentary (Hušková, M., Beran, R. and Dupač, V., eds.), pp. 15–20. J. Wiley.
Jurečková, J. and Klebanov, L. B. (1998). Trimmed, Bayesian and admissible estimators. Statist. and Probab. Letters 42, 47–51.
Jurečková, J. and Sen, P. K. (1998). Partially adaptive rank and regression rank scores tests in linear models. Applied Statistical Science IV (Ahmad, E., Ahsanullah, M. and Sinha, B. K., eds.), pp. 1–12.
Picek, J. and Jurečková, J. (1998). Application of rank tests for detection of dependence in time series (in Czech). ROBUST'98 (Antoch, J. and Dohnal, G., eds.), pp. 149–160. JČMF, Praha.
Jurečková, J. (1999). Equivariant estimators and their asymptotic representations. Tatra Mountains Mathematical Publications 17, 1–9.
Jurečková, J. (1999). Regression rank scores tests against heavy-tailed alternatives. Bernoulli 5, 659–676.
Hallin, M., Jurečková, J., Picek, J. and Zahaf, T. (1999). Nonparametric tests of independence of two autoregressive time series based on autoregression rank scores. J. Statist. Planning Infer. 75, 319–330.
Hallin, M. and Jurečková, J. (1999). Optimal tests for autoregressive models based on autoregression rank scores. Ann. Statist. 27, 1385–1414.


Jurečková, J. and Milhaud, X. (1999). Characterization of distributions in invariant models. J. Statist. Planning Infer. 75, 353–361.
Portnoy, S. and Jurečková, J. (1999). On extreme regression quantiles. Extremes 2 (3), 227–243.
Jurečková, J. (2000). Tests of tails based on extreme regression quantiles. Statist. & Probab. Letters 49, 53–61.
Kalvová, J., Jurečková, J., Picek, J. and Nemešová, I. (2000). On the order of autoregressive (AR) model in temperature series. Meteorologický časopis 3, 19–23.
Jurečková, J. and Sen, P. K. (2000). Goodness-of-fit tests and second order asymptotic relations. J. Statist. Planning Infer. 91, 377–397.
Jurečková, J., Koenker, R. and Portnoy, S. (2001). Tail behavior of the least squares estimator. Statist. & Probab. Letters 55, 377–384.
Jurečková, J. and Picek, J. (2001). A class of tests on the tail index. Extremes 4 (2), 165–183.
Picek, J. and Jurečková, J. (2001). A class of tests on the tail index using the modified extreme regression quantiles. ROBUST'2000 (Antoch, J. and Dohnal, G., eds.), pp. 217–226. Union of Czech Mathematicians and Physicists, Prague.
Jurečková, J. and Sen, P. K. (2001). Asymptotically minimum risk equivariant estimators. Data Analysis from Statistical Foundations – Festschrift in honour of the 75th birthday of D. A. S. Fraser (Saleh, A. K. M. E., ed.), pp. 329–343. Nova Science Publ., Inc., Huntington, New York.
Jurečková, J., Koenker, R. and Portnoy, S. (2001). Estimation of Pareto index based on extreme regression quantiles. Preprint #22, Charles University, Department of Probability and Math. Statistics.
Jurečková, J., Picek, J. and Sen, P. K. (2002). A goodness-of-fit test with nuisance parameters: Numerical performance. J. Statist. Planning Infer. 102 (2), 337–347.
Jurečková, J. (2002). L1 derivatives, score functions and tests. Statistical Data Analysis Based on the L1 Norm and Related Methods (Dodge, Y., ed.), pp. 183–189. Birkhäuser, Basel.
Dodge, Y. and Jurečková, J. (2002). Adaptive combinations of tests. Goodness-of-Fit Tests and Model Validity (Huber-Carol, C., Balakrishnan, N., Nikulin, M. S. and Mesbah, M., eds.), pp. 411–422. Birkhäuser, Boston.
Jurečková, J. (2003). Statistical tests on tail index of a probability distribution, with a discussion. Metron LXI (2), 151–190.
Jurečková, J. (2003). "Statistical tests for comparison of two data-sets", with a discussion (in Czech). Statistika 3, 1–23.
Jurečková, J. and Milhaud, X. (2003). Derivative in the mean of a density and statistical applications. Mathematical Statistics and Applications: Festschrift for Constance van Eeden (Moore, M., Léger, C. and Froda, S., eds.), IMS Lecture Notes 42, pp. 217–232.
Jurečková, J., Picek, J. and Sen, P. K. (2003). Goodness-of-fit tests with nuisance regression and scale. Metrika 58, 235–258.
Sen, P. K., Jurečková, J. and Picek, J. (2003). Goodness-of-fit test of Shapiro–Wilk type with nuisance regression and scale. Austrian J. of Statist. 32 (1&2), 163–177.
Jurečková, J. and Picek, J. (2004). Estimates of the tail index based on nonparametric tests. Statistics for Industry and Technology (Hubert, M., Pison, G., Struyf, A. and Van Aelst, S., eds.), pp. 141–152. Birkhäuser, Basel.


Fialová, A., Jurečková, J. and Picek, J. (2004). Estimating Pareto tail index based on sample means. Revstat 2 (1), 75–100.
Jurečková, J. and Picek, J. (2005). Two-step regression quantiles. Sankhya 67 (2), 227–252.
Jurečková, J. and Sen, P. K. (2006). Robust multivariate location estimation, admissibility and shrinkage phenomenon. Statistics & Decisions 24, 273–290.
Jurečková, J. and Saleh, A. K. M. E. (2006). Rank tests and regression rank scores tests in measurement error models. KPMS Preprint 54, Charles University in Prague.
Hallin, M., Jurečková, J. and Koul, H. L. (2007). Serial autoregression rank score statistics. In: Advances in Statistical Modelling and Inference. Essays in Honor of Kjell A. Doksum (Vijay Nair, ed.), pp. 335–362. World Scientific, Singapore (invited paper).
Jurečková, J. and Picek, J. (2007). Shapiro–Wilk type test of normality under nuisance regression and scale. Comp. Statist. & Data Analysis 51 (10), 5184–5191.
Jurečková, J. (2007). Remark on extreme regression quantile I. Sankhya 69, 87–100.
Jurečková, J. (2007). Remark on extreme regression quantile II. Bull. of the Intern. Statist. Institute, Proceedings of the 56th Session, Section CPM026.
Jurečková, J. (2008). Regression rank scores in nonlinear models. In: Beyond Parametrics in Interdisciplinary Research: Festschrift in honor of Professor Pranab K. Sen (Balakrishnan, N., Peña, E. A. and Silvapulle, M. J., eds.). Institute of Mathematical Statistics Collections 1, 173–183.
Pavlopoulos, H., Picek, J. and Jurečková, J. (2008). Heavy tailed durations of regional rainfall. Applications of Mathematics 53, 249–265.
Jurečková, J., Kalina, J., Picek, J. and Saleh, A. K. M. E. (2009). Rank tests of linear hypothesis with measurement errors both in regressors and responses. KPMS Preprint 66, Charles University in Prague.
Jurečková, J., Koul, H. L. and Picek, J. (2009). Testing the tail index in autoregressive models. Annals of the Institute of Statistical Mathematics 61, 579–598.
Jurečková, J. and Picek, J. (2009). Minimum risk equivariant estimators in linear regression model. Statistics & Decisions 27, 1001–1019.
Jurečková, J. and Omelka, M. (2010). Estimator of the Pareto index based on nonparametric test. Communications in Statistics – Theory and Methods 39, 1536–1551.
Jurečková, J., Picek, J. and Saleh, A. K. M. E. (2010). Rank tests and regression rank scores tests in measurement error models. Computational Statistics and Data Analysis, in print.
Jurečková, J. (2010). Nonparametric regression based on ranks. In: Lexicon of Statistical Science (Lovrić, M., ed.). Springer, to appear.
Jurečková, J. (2010). Adaptive linear regression. In: Lexicon of Statistical Science (Lovrić, M., ed.). Springer, to appear.

IMS Collections
Nonparametrics and Robustness in Modern Statistical Inference and Time Series Analysis: A Festschrift in honor of Professor Jana Jurečková
Vol. 7 (2010) 12–24
© Institute of Mathematical Statistics, 2010
DOI: 10.1214/10-IMSCOLL702

Nonparametric comparison of ROC curves: Testing equivalence*

Jaromír Antoch¹, Luboš Prchal¹,² and Pascal Sarda²

Charles University in Prague and Université Paul Sabatier Toulouse

Abstract: The problem of testing the equivalence of two ROC curves is addressed. A transformation of the corresponding ROC curves, which motivates a test statistic based on a distance of two empirical quantile processes, is suggested, its asymptotic distribution is found, and a simulation scheme is proposed that enables us to find critical values.

*Work was supported by grant GAČR 201/09/0755 and research project MSM 0021620839.
¹Charles University in Prague, Department of Probability and Mathematical Statistics, Sokolovská 83, CZ – 186 75 Praha 8, Czech Republic. e-mails: [email protected], [email protected]
²Université Paul Sabatier, Institut de Mathématiques de Toulouse, UMR 5219, 118 route de Narbonne, F – 310 62 Toulouse cedex, France. e-mail: [email protected]
AMS 2000 subject classifications: Primary 62G05; secondary 62H30, 62P99.
Keywords and phrases: ROC curves, binary classification, kernel ROC curves.

1. Introduction

Receiver operating characteristic (ROC) curves are popular and widely used tools that can help to summarize the overall performance of diagnostic methods and/or classifiers assigning individuals g ∈ G = G0 ∪ G1, G0 ∩ G1 = ∅, into one of the groups G0 or G1. Typically, the G1 individuals hold a feature of interest and are referred to as positives, while the G0 individuals are without the feature and are referred to as negatives. Assume that a suitable diagnostic measure Y is available. By convention, larger values of Y are supposed to be more indicative for an individual to belong to G1, so that if Y ≥ t, where t ∈ R is a fixed threshold, then the individual is assigned to G1. On the contrary, if Y < t, then it is assigned to G0.

Let us introduce the probabilities F0(t) = P(Y ≤ t | G0) and F1(t) = P(Y ≤ t | G1). It is evident that F0(t) and F1(t), as functions of t, are distribution functions of the diagnostic variable Y for the G0 and G1 groups, so that we can denote the corresponding random variables by Y0 and Y1. With this notation in mind, one possible way is to define the ROC function as the mapping ρ(· ; F0, F1), where

(1)  ρ(· ; F0, F1) : R → [0, 1] × [0, 1],  t ↦ (1 − F0(t), 1 − F1(t)).

In other words, it is a curve in the unit square [0, 1] × [0, 1] consisting of the values 1 − F1(t) on the vertical axis plotted against 1 − F0(t) on the horizontal axis for all t ∈ R. We refer the readers to the monographs Zhou et al. [32] and Pepe [23] for the properties and applications of ROC curves.

In practice, ROC curves are often used to compare several diagnostic methods (classifiers). It is usually accepted that the method with the corresponding ROC curve closest to the point (0, 1) is the best one for the particular problem. However, this oversimplified rule is not easily applicable in practice, because ROC curves in many applications are mostly non-convex and the effect on the analysis can be non-trivial. Some examples are presented in this paper. Figure 1 displays three plots, each with a pair of ROC curves corresponding to different association measures suitable for collocation extraction. It illustrates three typical situations that we come across.
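To see the construction in (1) concretely, the following minimal sketch (our own illustration, not part of the paper; the helper name empirical_roc and the simulated normal scores are assumptions) evaluates the empirical counterpart of the ROC points (1 − F̂0(t), 1 − F̂1(t)) at all pooled observed thresholds.

```python
import numpy as np

def empirical_roc(y0, y1):
    """Empirical ROC points (1 - F0hat(t), 1 - F1hat(t)) of display (1).

    y0: scores of the negative group G0; y1: scores of the positive group G1.
    The curve is evaluated at every pooled observed score t.
    """
    thresholds = np.sort(np.concatenate((y0, y1)))
    # 1 - Fhat(t) is the proportion of sample values strictly above t.
    fpr = (y0[None, :] > thresholds[:, None]).mean(axis=1)
    tpr = (y1[None, :] > thresholds[:, None]).mean(axis=1)
    return fpr, tpr

# Toy use with simulated normal scores:
rng = np.random.default_rng(1)
fpr, tpr = empirical_roc(rng.normal(0.0, 1.0, 200), rng.normal(1.0, 1.0, 200))
```

Plotting tpr against fpr then yields curves of the kind shown in Figure 1.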

[Figure 1: three panels of ROC curve pairs, true positive rate (tpr) on the vertical axis against false positive rate (fpr) on the horizontal axis.]

Fig. 1. Examples of ROC curves for several linguistic measures described in Pecina and Schlesinger [22].

First, everyone would agree that the solid curve in Figure 1a outperforms the dashed one. Figure 1b seems to be the opposite case, because both association measures provide, at least optically, equivalent ROC curves. Finally, the situation in Figure 1c is not at all clear. On one hand, the solid line is much closer to the point (0, 1). On the other hand, the curves are crossing and it is not at all clear which of them we should prefer. In all three cases, nevertheless, a natural question arises: Are these ROC curves significantly different?

Several methods exist for testing the equivalence of two ROC curves. The pioneering work, proposed for normally distributed variables, was Greenhouse and Mantel [11], later extended by Wieand et al. [30] and Beam and Wieand [2]. The most widely used current approach is based on the AUC (area under the curve), proposed by Bamber [1] and developed further by, e.g., Hanley and McNeil [13] and DeLong et al. [6]. A totally different approach to testing is based on a permutation principle suggested by Venkatraman and Begg [29]. Additional parametric methods, mainly connected to the binormal ROC curves and their transformations, have also been developed. We refer to Zhou et al. [32] for a review of parametric ROC curve modeling.

In practice it is usual that we do not have any a priori information about the form of the underlying distribution of Y. In such a case a parametric approach is not appropriate. Since we often deal with curves possibly crossing each other as in Figure 1c, the AUC test does not work, because crossing curves may have the same AUC but represent diagnostic methods with completely different properties. Moreover, in the case of large sample sizes, large numbers of considered ROC curves disqualify the use of the permutation principle or other resampling techniques, because they are unsupportable from a computational point of view.

All of these considerations motivated us to suggest a new test of equivalence of two ROC curves. The basic idea is to transform the testing problem and consider the methods separately in groups G0 and G1 rather than to compare the ROC curves themselves. We believe that this alternative approach covers a large field of ROC settings and might open new perspectives on ROC curve analysis as a whole. It leads to a test statistic based on the difference between the quantile processes associated with the diagnostic variables of each group, and enables us to determine the asymptotic distribution under the null hypothesis of ROC curves equivalence. These points are discussed in Section 2, where a more precise setting for ROC curves and their estimators is presented as well.

Regarding estimation of F0(t) and F1(t), we use the empirical cumulative distribution function (CDF). The main competitor of the empirical CDF is the smooth kernel CDF estimator, which possesses some theoretical and "visual" advantages for CDF and ROC curve estimation; for details see, e.g., Falk [9] or Zhou et al. [33]. However, in the case of a large data sample the possible advantage of the kernel ROC curve appears to be completely negligible. On the other hand, estimating ROC curves and testing their equivalence are totally different tasks. In our experience, the kernel estimator does not substantially improve the testing procedure, whereas the empirical CDF estimator is easier to apply. Nevertheless, in other practical situations the kernel approach can be useful, at least as an alternative to the empirical CDF. It is shown that all theoretical results remain true when testing is based on either the empirical or the kernel estimators.

The rest of the paper is organized as follows. Section 2 contains the hypothesis formulation, a description of the test procedure, a discussion of finding critical values, and the use of the kernel estimators instead of the empirical ones. The proofs of the theoretical results formulated in Section 2 are given in the Appendix.

2. Test of equivalence of two ROC curves

2.1. Hypothesis formulation

Let Y be a diagnostic variable with distribution functions F0(t) and F1(t), and let Y0 and Y1 denote the corresponding random variables as introduced in Section 1 above formula (1). Denote, according to (1), the ROC curve associated with Y by

(2)  ROC_Y = { r ∈ [0, 1]² : ∃ t ∈ R, ρ(t; F0, F1) = r }.

Moreover, assume that:

(C1) Y0 and Y1 have continuous distributions with densities f0(t) and f1(t) such that f0(t) > 0 and f1(t) > 0 on the same interval I_Y ⊆ R, and the densities are equal to zero outside I_Y.

(C2) Y0 and Y1 are independent.

Remarks. (i) Model assumption (C1) on the supports assures a one-to-one mapping between the thresholds and the ROC points in the unit square [0, 1] × [0, 1]. This technically simplifies the notation used later, but it can be relaxed if one properly takes into account the relationship between t and the ROC curve. (ii) Assumption (C2) means that the diagnostic variable Y0 keeps only the information assuring that negatives belong to G0, while the diagnostic variable Y1 keeps only the information assuring that positives belong to G1.

Let us introduce another diagnostic variable Z with distribution functions G0(t) and G1(t), denoting the corresponding ROC curve by

(3)  ROC_Z = { r ∈ [0, 1]² : ∃ t ∈ R, ρ(t; G0, G1) = r },

and assume that Z0 and Z1 also satisfy conditions (C1) and (C2) with densities g0(t) and g1(t) on some interval I_Z. Our main goal is to compare these two ROC curves; more precisely, we aim to test the equivalence of ROC_Y and ROC_Z.
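As a quick numerical illustration of this equivalence notion (our own sketch, not from the paper): if Z arises from Y through one common increasing transformation applied in both groups, the two ROC curves coincide, because an empirical ROC curve depends on the scores only through their ordering. The snippet assumes the empirical_roc helper sketched in the Introduction.

```python
import numpy as np

rng = np.random.default_rng(2)
y0, y1 = rng.normal(0.0, 1.0, 300), rng.normal(1.0, 1.0, 300)
# One increasing map applied in both groups; here Z is built from the same
# draws purely to visualize the population identity, whereas in the testing
# problem Z would of course be an independent sample.
z0, z1 = np.exp(y0), np.exp(y1)

fpr_y, tpr_y = empirical_roc(y0, y1)
fpr_z, tpr_z = empirical_roc(z0, z1)
assert np.allclose(fpr_y, fpr_z) and np.allclose(tpr_y, tpr_z)
```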


Taking into account the definition of ROC curves, the equivalence of ROC_Y and ROC_Z means that for any particular point r_Y ∈ ROC_Y there exists an "identical" point r_Z ∈ ROC_Z, i.e. r_Y = r_Z. Equivalently, for any threshold t_Y ∈ I_Y the equivalence of the curves assures that we can find a threshold t_Z ∈ I_Z such that ρ(t_Y; F0, F1) = ρ(t_Z; G0, G1). This allows us to express the ROC equivalence in terms of distribution functions, i.e.

(4)  ROC_Y ≡ ROC_Z ⟺ ∀ t_Y ∈ I_Y ∃ t_Z ∈ I_Z : F0(t_Y) = G0(t_Z) & F1(t_Y) = G1(t_Z).

Due to (C1), all the considered distribution functions are strictly increasing on I_Y and I_Z, respectively, so that there exist increasing transformation functions τ0, τ1 : I_Y → I_Z relating the distribution functions separately in the groups G0 and G1. Define the functions τ0(t) and τ1(t) such that F0(t) = G0(τ0(t)) and F1(t) = G1(τ1(t)), i.e.,

(5)  τ0(t) = G0⁻¹(F0(t))  and  τ1(t) = G1⁻¹(F1(t))  ∀ t ∈ I_Y.

ROC curves consist of the values of the distribution functions evaluated simultaneously at the same thresholds. Therefore, they are equivalent if and only if the groups G0 and G1 are related by the same threshold transformation, τ0(t) ≡ τ1(t). Hence, we may formulate the null hypothesis of the equivalence of two ROC curves as

(H)  τ0(t) = τ1(t)  ∀ t ∈ I_Y,

which we aim to test against the alternative

(A)  ∃ J_Y ⊆ I_Y, J_Y ≠ ∅, such that τ0(t) ≠ τ1(t) ∀ t ∈ J_Y.

Before deriving a test statistic, let us have a closer look at the transformations used. First, notice that the original problem of comparing two ROC curves has been transformed into the problem of comparing the behavior of the diagnostic methods involved on G_0 and G_1. Indeed, in order to have identical ROC curves it is not necessary that the diagnostic methods behave in exactly the same manner, but only that their behavior globally agrees both on the "positive" and the "negative" parts of the population. Globally means that both methods correctly recognize the same proportion of G_0 and G_1 individuals, even though not necessarily the same individuals. Moreover, the transformations are not only technical tools but provide an interesting diagnostic approach as well. They have been studied extensively, e.g., by Doksum [7] and Doksum and Sievers [8], who proposed confidence regions and statistical inference about their shape. To get insight into this concept, the upper row of plots in Figure 2 displays the empirical estimators

(6)   τ̂_0(t) = Ĝ_0^{-1}(F̂_0(t))   and   τ̂_1(t) = Ĝ_1^{-1}(F̂_1(t)),   ∀ t ∈ I_Y,

of the transformation functions for the three ROC pairs presented in Figure 1. The empirical CDFs F̂_k(t) and Ĝ_k(t), k = 0, 1, are based on the samples Y_{01}, ..., Y_{0 n^Y_0}, Y_{11}, ..., Y_{1 n^Y_1}, Z_{01}, ..., Z_{0 n^Z_0}, and Z_{11}, ..., Z_{1 n^Z_1}, with a total sample size n = n_0 + n_1 = n^Y_0 + n^Z_0 + n^Y_1 + n^Z_1. The quantile functions used are defined as Ĝ_k^{-1}(u) = inf{ t : Ĝ_k(t) > u }, k = 0, 1. We clearly see almost identical transformations in the central plot, as expected in the case of equivalent ROC curves, while τ̂_0(t) and τ̂_1(t) have rather different


forms in the other two cases. Another point of view is presented in the lower plots of Figure 2, where the transformation functions are plotted one against the other. Under the null hypothesis the resulting cloud of points should lie along the straight line with unit slope. In the central plot we see that a majority of points, with respect to the supports of the transformations, touches the line, indicating ROC equivalence, while the points in the other plots are considerably far away from the expected null hypothesis line.
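As a computational aside, the empirical transformations (6) reduce to composing an empirical CDF with an empirical quantile function. The following minimal sketch (in Python; the array names Y0, Y1, Z0, Z1 and t_grid are illustrative, not from the paper) shows one way to evaluate them on a grid of thresholds:

```python
import numpy as np

def ecdf(sample):
    """Empirical CDF of a sample, returned as a callable."""
    x = np.sort(np.asarray(sample))
    return lambda t: np.searchsorted(x, t, side="right") / x.size

def empirical_quantile(u, sample):
    """Empirical quantile G^{-1}(u) = inf{t : G(t) > u}: for u in
    [k/n, (k+1)/n) this is the (k+1)-th order statistic."""
    z = np.sort(np.asarray(sample))
    k = np.minimum(np.floor(np.asarray(u) * z.size).astype(int), z.size - 1)
    return z[k]

# the empirical transformations of (6), on a grid of thresholds:
# tau0_hat = empirical_quantile(ecdf(Y0)(t_grid), Z0)
# tau1_hat = empirical_quantile(ecdf(Y1)(t_grid), Z1)
```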

Fig. 2. Transformation functions corresponding to the ROC curves plotted in Figure 1. The upper plots present the form of τ_0(t) (solid lines) and τ_1(t) (dashed lines) as functions of the threshold t, while the lower plots show the transformation τ_0(t) plotted against τ_1(t).

2.2. Test statistic

As illustrated by the graphs in Figure 2, the transformation functions τ_0(t) and τ_1(t) indicate (non)equivalence of two ROC curves. Therefore, we suggest basing a decision on the distance between them. Precisely, we suggest a test statistic of the form

(7)   T_n = n ∫_{I*_Y} ( τ̂_0(t) − τ̂_1(t) )² dt,

where the integral is over a closed interval I*_Y ⊆ I_Y such that the densities g_0(s) and g_1(s) are positive and finite for all s in the images of τ_0(t) and τ_1(t), t ∈ I*_Y, i.e.

(C3)   0 < g_0(τ_0(t)) < ∞   and   0 < g_1(τ_1(t)) < ∞   ∀ t ∈ I*_Y.

There is a lack of symmetry between the CDFs F(x) and G(x) in the definition of T_n, inherited from the genesis of ROC curves. Our numerical calculations, both with real and simulated data, show, however, that its influence on the p-values is quite negligible, especially when the sample size is large. As expected, the test statistic T_n should be small under the null hypothesis and increase with the growing difference between τ_0(t) and τ_1(t) under the alternative. Hence,

Nonparametric comparison of ROC curves: Testing equivalence

17

if an appropriate critical value c(α) is available, the decision rule rejects the null hypothesis whenever T_n > c(α). Theorem 2.1 stated below establishes the asymptotic distribution of T_n under the null hypothesis (H).

Theorem 2.1. Assume the setting described in Subsection 2.1 and the test statistic T_n defined by (7). Let conditions (C1)–(C3) hold and let Y_0, Y_1, Z_0 and Z_1 be mutually independent. Let n_0 and n_1 tend to infinity in such a way that n^Y_0/n_0 → κ_0 and n^Y_1/n_1 → κ_1 with κ_0, κ_1 ∈ (0, 1), and that n/n_0 → ν_0 and n/n_1 → ν_1 with 1/ν_0, 1/ν_1 ∈ (0, 1). Then, under the null hypothesis (H), the test statistic T_n converges in distribution, as n → ∞, to an infinite weighted sum of independent χ²_1 variables η²_1, η²_2, ..., i.e.

(8)   T_n  →_D  T^B = Σ_{j=1}^∞ λ_j η²_j,

where {λ_j} are the eigenvalues of the covariance operator of the zero-mean Gaussian process B(t) with covariance structure

(9)   cov(B(s), B(t)) = c_0 · F_0(s)(1 − F_0(t)) / [ g_0(τ_0(s)) g_0(τ_0(t)) ] + c_1 · F_1(s)(1 − F_1(t)) / [ g_1(τ_1(s)) g_1(τ_1(t)) ],   s ≤ t ∈ I*_Y,

with c_0 = ν_0/(κ_0(1 − κ_0)) and c_1 = ν_1/(κ_1(1 − κ_1)).

Proof. Postponed to Appendix A.

The asymptotic distribution of T_n is stated in Theorem 2.1 for independent realizations of independent diagnostic variables Y and Z. However, this condition is not always realistic in practice. We believe that the above test procedure behaves well for weakly dependent variables. When strong dependence is suspected, however, we suggest the following two-step approach. The first step consists of determining separately critical values based on the limit processes of F̂_k(t) and Ĝ_k^{-1}(t), k = 0, 1 (see Appendix A). A critical value for T_n can then be obtained using a Bonferroni inequality, as derived in Horváth et al. [14]. Of course, the accuracy of this procedure, and more generally the problem of dependence between diagnostic variables, would warrant a deep study of its own.

Taking into account the genesis of the test statistic, which is data dependent, its power against any alternative is of natural interest. The following theorem assures the consistency of the suggested test.

Theorem 2.2. Assume the setting and assumptions of Theorem 2.1 and the test statistic T_n defined by (7). Then this test is consistent against any alternative for which the conditions of Theorem 2.1 are satisfied.

Proof. Postponed to Appendix A.

2.3. Critical values

We have seen that the distribution of the test statistic can be approximated by the distribution of an infinite weighted sum of χ²_1 variables, T^B = Σ_{j=1}^∞ λ_j η²_j. As a practical matter, several problems have to be solved. First, we need to estimate the unknown eigenvalues {λ_j}. Second, even if the eigenvalues were known, we would


need to set an appropriate cut-off point and consider only a finite approximation of (8). Finally, even the finite approximation of T^B may still be quite complex, and great attention has to be paid to obtaining reliable critical values.

We start with estimating the eigenvalues of the covariance operator, say Γ, of the limit process B(t). The covariance operator is a kernel operator whose kernel is formed by the covariance structure (9) of the underlying process, i.e.,

(10)   Γξ(t) = ∫_{I*_Y} cov(B(s), B(t)) ξ(s) ds,   ξ ∈ L²(I*_Y).

Therefore, estimators of the eigenvalues of Γ can be based on an estimate of cov(B(s), B(t)). For that purpose, we suggest the plug-in estimator

ĉov(B(s), B(t)) = c_0 · F̂_0(s)(1 − F̂_0(t)) / [ ĝ_0(τ̂_0(s)) ĝ_0(τ̂_0(t)) ] + c_1 · F̂_1(s)(1 − F̂_1(t)) / [ ĝ_1(τ̂_1(s)) ĝ_1(τ̂_1(t)) ],

where s, t ∈ {t_1, ..., t_p} ⊂ I*_Y, F̂_k(t), k = 0, 1, are the empirical CDFs, τ̂_k(t) are given by (6), and ĝ_k(t) stands for a kernel estimator of the density g_k(t); for details see, e.g., Silverman [26]. The covariance operator Γ can then be approximated by its discrete estimated version

(11)   Γ̂_{n,p} = ( ω_i ĉov(B(t_i), B(t_j)) )_{i,j=1}^p,

where ω_i stands for the weights of the numerical quadrature replacing the theoretical integration in (10) by a discrete summation over {t_1, ..., t_p}. One possibility is to use ω_i = t_i − t_{i−1}. The spectral decomposition of the matrix Γ̂_{n,p} then provides consistent estimators λ̂_j of the asymptotic eigenvalues λ_j.

The values of the c_i's are in practice determined by the data, as seen in Theorem 2.1. The real problem can arise when the proportion of G_0 elements (and therefore also of G_1 elements) is extreme, i.e., very close to zero or to n. Regarding the value of p, it follows from our calculations that it is preferable to keep the grid of values t_i as dense as possible in order to be able to estimate Γξ(t) well. We used p = 10³ in our calculations.

Theorem 2.3. Assume that the kernel density estimators ĝ_k(t) are based on continuous, bounded, compactly supported kernels and on bandwidths {h_k} such that h_k → 0 and h_k n^Z_k / log(n^Z_k) → ∞ as n^Z_k → ∞, k = 0, 1. Then, under the conditions of Theorem 2.1,

|λ̂_j − λ_j|  →_P  0   as n → ∞,   j = 1, 2, ....

Proof. Postponed to Appendix A.

Suppose that the described estimation procedure results in J positive eigenvalues, which allows approximating the infinite representation of T^B by its first J components, i.e.,

(12)   T^B ≈ Σ_{j=1}^J λ̂_j η²_j ≡ S^J,

where η²_1, ..., η²_J stand for independent χ² variables with one degree of freedom. In our calculations we set J in such a way that all eigenvalues larger than 10⁻¹⁰ were used.
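To make the eigenvalue step concrete, here is a minimal numerical sketch of (11), assuming precomputed arrays F0, F1 (empirical CDFs on the grid) and g0tau, g1tau (kernel density estimates evaluated at τ̂_0(t_i) and τ̂_1(t_i)); all names and defaults are illustrative, not from the paper:

```python
import numpy as np

def eigenvalues_cov_operator(F0, F1, g0tau, g1tau, t_grid, c0, c1, tol=1e-10):
    """Eigenvalues of the discretized covariance operator (11) built from the
    plug-in covariance estimate of (9) on the grid t_1 < ... < t_p."""
    F0, F1 = np.asarray(F0), np.asarray(F1)
    g0tau, g1tau = np.asarray(g0tau), np.asarray(g1tau)
    p = len(t_grid)
    i, j = np.meshgrid(np.arange(p), np.arange(p), indexing="ij")
    s, t = np.minimum(i, j), np.maximum(i, j)           # enforce s <= t in (9)
    cov = (c0 * F0[s] * (1.0 - F0[t]) / (g0tau[s] * g0tau[t])
           + c1 * F1[s] * (1.0 - F1[t]) / (g1tau[s] * g1tau[t]))
    # quadrature weights w_i = t_i - t_{i-1}, one of the choices in the text
    w = np.diff(t_grid, prepend=2 * t_grid[0] - t_grid[1])
    sw = np.sqrt(w)
    # symmetrized weighting has the same spectrum as the discretized operator
    lam = np.linalg.eigvalsh(sw[:, None] * cov * sw[None, :])
    return np.sort(lam[lam > tol])[::-1]                # keep lambda_j > 1e-10
```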


As the distribution of S^J is not explicitly known, we can perform a Monte Carlo simulation to obtain the desired critical value. The simulation scheme is straightforward:

1. FOR k = 1 : K
2.    Simulate J independent χ²_1 variables η²_1, ..., η²_J.
3.    Calculate the value of S^J and store it as S^J_k.
4. ENDFOR

Once the sample S^J_1, ..., S^J_K is available, we form the standard empirical distribution and quantile functions and use the estimated quantiles instead of the unknown exact ones. If extreme quantiles are required, more sophisticated rare-event methods based on properly tuned importance sampling or saddlepoint approximation should be used to obtain reliable results.

Concerning computational costs, performing sufficiently many (K ≈ 10⁶) simulations for J ≈ 1000 components is feasible on a standard "home" computer in a couple of seconds. We point out that taking squares of standard normal variables is considerably faster, mainly for large J, than a direct simulation of χ² variables, especially if a matrix language such as Matlab is available. Notice that far fewer simulations are required to get critical values for the test statistic (8) at standard α-levels; typically K = 10⁴ is enough. However, in our context one needs reasonably exact p-values for small values of p, making it necessary to run a large number of simulations in order to obtain a reliable estimate of the tail of the distribution.
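A minimal sketch of this scheme (in Python rather than Matlab; the chunking and the function name are illustrative choices, made so that K ≈ 10⁶ replicates with J ≈ 1000 components fit in memory):

```python
import numpy as np

def simulate_SJ(lam, K=10**6, chunk=10**4, rng=None):
    """Monte Carlo sample of S^J = sum_j lambda_j * eta_j^2 (steps 1-4 above),
    obtained by squaring standard normals instead of drawing chi^2 variates."""
    rng = np.random.default_rng(rng)
    lam = np.asarray(lam)
    out = np.empty(K)
    for start in range(0, K, chunk):          # chunked to bound memory use
        k = min(chunk, K - start)
        out[start:start + k] = (rng.standard_normal((k, lam.size)) ** 2) @ lam
    return out

# usage: critical value at level alpha, and Monte Carlo p-value for observed T_n
# SJ = simulate_SJ(lam_hat)                   # lam_hat: estimated eigenvalues
# c_alpha = np.quantile(SJ, 1 - alpha)
# p_value = (SJ >= T_n).mean()
```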

Kac and Siegert [16] have shown that the characteristic function of T^B takes the form

ψ_{T^B}(ς) = E exp{iςT^B} = Π_{j=1}^∞ (1 − 2iςλ_j)^{−1/2},   ς ∈ ℝ,

so that the inversion formula of Gil-Pelaez [10] provides the distribution function of T^B, i.e.,

(13)   P(T^B ≤ s) = H_{T^B}(s) = 1/2 − (1/π) ∫_0^∞ ℑ( e^{−iςs} ψ_{T^B}(ς) / ς ) dς,   s ≥ 0,

where ℑ(z) stands for the imaginary part of a complex number z ∈ ℂ. If T^B is approximated by S^J, Imhof [15] suggested representing its distribution function as

(14)   P(S^J < s) = 1/2 − (1/π) ∫_0^∞ sin θ(s, u) / ( u ρ(u) ) du,

where 2θ(s, u) = Σ_{j=1}^J arctan(λ̂_j u) − su and ρ(u) = Π_{j=1}^J (1 + λ̂_j² u²)^{1/4}. In practice, the integration in (14) has to be carried out over a finite range 0 ≤ u ≤ U. Imhof [15] claims that the truncation error is satisfactorily small and provides an explicit upper bound for it in terms of U and the λ̂_j. However, our numerical experiments show that the integration of (14) must be performed extremely carefully, with either a very fine step of the order 10⁻⁶ or rather tricky weighting. We point out that a naive use of numerical quadrature often leads to values of the distribution function greater than one, which is, of course, unacceptable. As one does not obtain an adequate precision gain relative to the computational costs of Imhof's procedure, simulation turns out to be the most favorable option in practice.
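For completeness, a sketch of Imhof's inversion (14) by brute-force midpoint quadrature follows; the truncation point U and the step are user-chosen (the text warns a step as fine as 10⁻⁶ may be needed), and the final clipping guards against the above-one artifact just mentioned. All names and defaults are illustrative:

```python
import numpy as np

def imhof_cdf(s, lam, U=50.0, step=1e-3, chunk=5000):
    """Imhof's representation (14):
    P(S^J < s) = 1/2 - (1/pi) * int_0^U sin(theta(s,u)) / (u * rho(u)) du,
    evaluated by the midpoint rule on a fine grid, processed in chunks."""
    lam = np.asarray(lam)
    u_all = np.arange(step / 2, U, step)            # midpoints of the u-grid
    total = 0.0
    for u in np.array_split(u_all, max(1, len(u_all) // chunk)):
        x = np.outer(u, lam)                        # u_i * lambda_j
        theta = 0.5 * (np.arctan(x).sum(axis=1) - s * u)
        log_rho = 0.25 * np.log1p(x ** 2).sum(axis=1)
        total += np.sum(np.sin(theta) * np.exp(-log_rho) / u)
    val = 0.5 - total * step / np.pi
    return float(min(max(val, 0.0), 1.0))           # clip to [0, 1]
```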


2.4. Kernel estimator

The methodology described above is based on the empirical estimators of the distribution and quantile functions F̂_k(t) and Ĝ_k^{-1}(p), k = 0, 1. Evidently, to estimate the CDFs F_k(t), k = 0, 1, one might use kernel estimators instead, i.e.,

(15)   F̃_k(t) = (1/n^Y_k) Σ_{i=1}^{n^Y_k} H( (t − Y_{ki}) / h_k ),   t ∈ ℝ, k = 0, 1,

where H(·) is an appropriate cumulative kernel function and the bandwidth parameters h_0 and h_1 control the smoothness of the estimators. Analogously, kernel estimators G̃_k^{-1}(p) might be used to estimate the quantile functions G_k^{-1}(p), where G̃_k^{-1}(p) = inf{ t : G̃_k(t) > p }, p ∈ (0, 1), k = 0, 1. Combining these two kernel estimators and following the ideas of Section 2.1, we naturally arrive at the kernel analogue of the empirical transformation functions (6), i.e., at

(16)   τ̃_k(t) = G̃_k^{-1}(F̃_k(t)),   t ∈ I_Y, k = 0, 1.

Consequently, in the definition (7) one can replace the empirical transformations τ̂_k(t) with the kernel ones τ̃_k(t) and obtain the kernel analogue of the test statistic T_n. As one might expect, both Theorem 2.1 and Theorem 2.3 hold for the kernel-type test statistic as well (see Appendix A for a formal proof). Hence, in practice, the test procedure may follow the same lines for both the empirical and the kernel estimators.

It is well known that kernel estimators offer some advantages over their empirical analogues. The most important is probably the fact that kernel smoothing typically yields a better "visual" effect, as it provides a continuous curve in the ROC square instead of the discrete points of an empirical ROC curve. On the other hand, if the smoothing parameters are not properly chosen, the kernel-type test statistic may lead to irrelevant and unreliable results.

The kernel CDF estimator was first proposed and studied by Nadaraya [20]. Concerning kernel ROC curves, one finds proposals in, e.g., Zhou et al. [33] or Lloyd [19]; the latter paper has been followed up by an interesting paper of Hall et al. [12]. Later, Prchal [24] suggested an automatic procedure that, by means of a data transformation, improves the accuracy of kernel ROC curves.
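A minimal sketch of (15)–(16) with the Gaussian cumulative kernel H = Φ (one possible, purely illustrative choice of H; the grid, bandwidths and names are likewise illustrative):

```python
import numpy as np
from scipy.stats import norm

def kernel_cdf(t, sample, h):
    """Kernel CDF estimator (15) with H = Phi (Gaussian cumulative kernel)."""
    t, sample = np.atleast_1d(t), np.asarray(sample)
    return norm.cdf((t[:, None] - sample[None, :]) / h).mean(axis=1)

def kernel_quantile(p, sample, h, grid):
    """Smoothed quantile inf{t in grid : G_tilde(t) > p}; the grid is assumed
    to cover the support of the sample, so G_tilde is increasing on it."""
    G = kernel_cdf(grid, sample, h)
    idx = np.searchsorted(G, np.atleast_1d(p), side="right")
    return grid[np.minimum(idx, grid.size - 1)]

# kernel analogue (16) of the transformation functions:
# tau0_tilde = kernel_quantile(kernel_cdf(t_grid, Y0, h0), Z0, h0z, z_grid)
```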

Appendix A: Proofs

Theorem 2.1 is stated for the test statistic T_n defined by (7), which is based on the empirical estimators of the distribution and quantile functions. However, we provide its proof for a more general class of estimators, satisfying conditions (P1) and (P2) listed below.

Let Y_1, ..., Y_m and Z_1, ..., Z_n be i.i.d. samples with respective continuous distribution functions F(t) and G(t) such that the supports of their densities are real intervals I_Y and I_Z. Let I*_Y ⊆ I_Y be a closed interval such that 0 < g(G^{-1}(F(t))) < ∞ for all t ∈ I*_Y. Let F̂(t) and Ĝ(t) be estimators of F(t) and G(t), and let Ĝ^{-1}(u) = inf{ t : Ĝ(t) > u } be an estimator of the quantile function G^{-1}(u), such that

(P1)   sup_{t ∈ I*_Y} | F̂(t) − F(t) |  →  0   a.s.,

(P2)   √m ( F̂(t) − F(t) ) →_D W_1(F(t))   and   √n ( Ĝ(t) − G(t) ) →_D W_2(G(t)),   ∀ t ∈ I*_Y,

where W_1 and W_2 stand for independent Brownian bridges.

The first step in proving Theorem 2.1 concerns the weak convergence of an estimated quantile process.

Lemma A.1. Let m and n tend to infinity such that m/(n + m) → κ ∈ (0, 1). Then, under the conditions (P1) and (P2),

(17)   √(m + n) ( Ĝ^{-1}(F̂(t)) − G^{-1}(F(t)) )  →_D  [κ(1 − κ)]^{−1/2} · W(F(t)) / g(G^{-1}(F(t))),   t ∈ I*_Y,

where W(s), s ∈ [0, 1], denotes a Brownian bridge defined on [0, 1].

Proof. First, notice that √(m + n) ( Ĝ^{-1}(F̂(t)) − G^{-1}(F(t)) ) can be decomposed as

(18)   √(m + n) ( Ĝ^{-1}(F̂(t)) − G^{-1}(F̂(t)) ) + [ ( G^{-1}(F̂(t)) − G^{-1}(F(t)) ) / ( F̂(t) − F(t) ) ] · √(m + n) ( F̂(t) − F(t) ),   t ∈ I*_Y.

The second term, using (P1), (P2) and the same arguments as in the proof of Theorem 4.1 of Doksum [7], converges in distribution to

κ^{−1/2} · W_1(F(t)) / g(G^{-1}(F(t))),   ∀ t ∈ I*_Y.

Further, from (P2) and (3.4) in Ralescu and Puri [25] we can deduce that

(19)   sup_{u = F(t), t ∈ I*_Y} | √(m + n) ( Ĝ^{-1}(u) − G^{-1}(u) ) − U(u) |  →_P  0,

where U(u) ≡ [ (1 − κ)^{1/2} g(G^{-1}(u)) ]^{−1} V(u) and V stands for a Brownian bridge independent of W_1. Note that when Ĝ(·) is the empirical distribution function, (19) can be deduced from results stated by Kiefer (1970, 1972); see Theorems 4.3.2 and 5.2.1 in Csörgő and Révész [5]. Together with (P1) and continuity arguments we obtain

sup_{t ∈ I*_Y} | √(m + n) ( Ĝ^{-1}(F̂(t)) − G^{-1}(F̂(t)) ) − U(F̂(t)) + U(F̂(t)) − U(F(t)) |
   ≤ sup_{0 ≤ u ≤ 1} | √(m + n) ( Ĝ^{-1}(u) − G^{-1}(u) ) − U(u) | + sup_{t ∈ I*_Y} | U(F̂(t)) − U(F(t)) |,

which converges to 0 in probability.

Proof of Theorem 2.1

Proof. If F̂(t) stands for the empirical CDF estimator, property (P1) is satisfied due to the well-known Glivenko–Cantelli theorem, whereas the proof of (P2) can be found, e.g., in Billingsley [4]. Hence Lemma A.1 holds for this case and, with the continuity of the L² norm with respect to the Skorokhod topology, it assures

(20)   T_n  →_D  ∫_{I*_Y} B²(t) dt,


where B(t), t ∈ I*_Y, is a zero-mean Gaussian process with the covariance structure given by (9). As E B²(t) < ∞ for all t ∈ I*_Y, B(t) admits the Karhunen–Loève decomposition

B(t) = Σ_{j=1}^∞ √λ_j η_j v_j(t),

where the η_j are independent standard normal random variables and {v_j} is the orthonormal system of eigenfunctions corresponding to the eigenvalues {λ_j} of the covariance operator Γ of B(t), t ∈ I*_Y. It follows from Kac and Siegert [16] that

(21)   ∫_{I*_Y} B²(t) dt = ∫_{I*_Y} ( Σ_{j=1}^∞ √λ_j η_j v_j(t) )² dt = Σ_{j=1}^∞ λ_j η²_j,

which assures the statement of Theorem 2.1.

Proof of Theorem 2.2

Proof. We have shown above that, under the assumptions of Theorem 2.1, Lemma A.1 holds. Thus

(22)   n ∫_{I*_Y} ( τ̂_0(t) − τ_0(t) − τ̂_1(t) + τ_1(t) )² dt  →_D  ∫_{I*_Y} B²(t) dt,

where B(t), t ∈ I*_Y, is a zero-mean Gaussian process with the covariance structure given by (9). Under an alternative hypothesis and for a given critical value t_α, the probability of rejecting the null hypothesis is P(T_n > t_α). Using (22) we obtain

lim_{n→∞} P(T_n > t_α) = 1,

which proves the consistency of the test.

Remark. As pointed out in Section 2.4, Theorem 2.1 remains valid when the kernel CDF estimators are used. Indeed, property (P1) is due to Nadaraya [20], while Nixdorf [21] has shown (P2).

Proof of Theorem 2.3

Proof. According to the Glivenko–Cantelli theorem one has, for k = 0, 1,

(23)   sup_{s,t ∈ I*_Y} | F̂_k(s)(1 − F̂_k(t)) − F_k(s)(1 − F_k(t)) |
       = sup_{s,t ∈ I*_Y} | (1 − F_k(t))(F̂_k(s) − F_k(s)) + F̂_k(s)(F_k(t) − F̂_k(t)) |  →  0   a.s.

Further, Bertrand-Retali [3] has shown that

(24)   sup_{t ∈ I*_Y} | ĝ_k(t) − g_k(t) |  →  0   a.s.

For the validity of (25)

for every c > 0,

(3.1)   liminf_{n→∞} inf_{ξ̂} sup_{|ξ|² ≤ nc} R_n(ξ̂, ξ) ≥ r(c),

the infimum being taken over all estimators ξ̂. This result follows easily from the preceding considerations. Indeed, as Stein [10] pointed out, the estimation problem is invariant under the orthogonal group, which is compact. By the Hunt–Stein theorem,

(3.2)   inf_{ξ̂} sup_{|ξ|² ≤ nc} R_n(ξ̂, ξ) = inf_{ξ̂_I} sup_{|ξ|² ≤ nc} R_n(ξ̂_I, ξ),

the infimum on the right side being taken only over orthogonally equivariant estimators ξ̂_I. Using the first bulleted result in the previous subsection on the fixed length model, with ρ_0 = n^{1/2} c^{1/2},

(3.3)   inf_{ξ̂_I} sup_{|ξ|² ≤ nc} R_n(ξ̂_I, ξ) ≥ inf_{ξ̂_I} sup_{|ξ|² = nc} R_n(ξ̂_I, ξ) = sup_{|ξ|² = nc} R_n[ ξ̂_E(n^{1/2} c^{1/2}), ξ ].

Because of (2.8), the right side of (3.3) converges to r(c), thereby establishing (3.1). This result is actually an instance of Pinsker's [9] theorem on the estimation of ξ; see Beran and Dümbgen [5] for a relevant statement of the latter. The argument above pursues ideas broached in Section 3 of Stein [10] rather than ideas in Pinsker's later, more general study of the problem through Bayes estimators.

To construct estimators that achieve the lower bound (3.1) for every c > 0, it suffices to construct a good estimator ρ̂ of |ξ| from X and then form the adaptive estimators

(3.4)   ξ̂_E(ρ̂) = ρ̂ A_n(ρ̂ |X|) μ̂,   ξ̂_{AE}(ρ̂) = (ρ̂² / |X|) μ̂.
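As a concrete illustration (a hypothetical sketch, not code from the paper): taking μ̂ = X/|X|, the unit vector in the direction of the data, as in the fixed-length model of the earlier part of the paper (an assumption here, since that part is not reproduced above), and anticipating the variance-based choices of ρ̂² discussed just below, ξ̂_{AE} can be computed as follows.

```python
import numpy as np

def xi_AE(X, d=2.0):
    """Adaptive estimator xi_AE(rho_hat) of (3.4), with rho_hat^2 = [|X|^2 - n + d]_+
    and mu_hat assumed to be X/|X|.  With d = 2 this is the positive-part
    James-Stein estimator."""
    n = X.size
    X2 = float(np.dot(X, X))
    rho2 = max(X2 - n + d, 0.0)
    return (rho2 / X2) * X      # (rho^2/|X|) * mu_hat = (rho^2/|X|^2) * X
```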

The following local asymptotic minimax result governs estimation of |ξ|²:

• In the full N(ξ, I) model, for every finite b > 0,

(3.5)   lim_{c→∞} liminf_{n→∞} inf_{ρ̂} sup_{| |ξ|²/n − b | ≤ n^{−1/2} c} n^{−1} E(ρ̂² − |ξ|²)² ≥ 2 + 4b,

the infimum being taken over all estimators ρ̂. If ρ̂² = |X|² − n + d or [|X|² − n + d]_+, where d is a constant, then

(3.6)   lim_{n→∞} sup_{| |ξ|²/n − b | ≤ n^{−1/2} c} n^{−1} E(ρ̂² − |ξ|²)² = 2 + 4b

for every finite c > 0. For a proof, see Beran [2]. A related treatment for estimators of |ξ| was given by Hasminski and Nussbaum [7].

If ρ̂² is [|X|² − n + 2]_+, then ξ̂_{AE}(ρ̂) coincides with the positive-part James–Stein estimator and ξ̂_E(ρ̂) is defined. The James–Stein estimator ξ̂_S is ξ̂_{AE}(ρ̂) when ρ̂² = |X|² − n + 2; this definition works formally even when |X|² − n + 2 is negative. For such ρ̂, the asymptotic risks of the adaptive estimators in (3.4) are readily found:

• In the full N(ξ, I) model with ρ̂² = |X|² − n + d or [|X|² − n + d]_+, the following holds for every finite c > 0:

(3.7)   lim_{n→∞} sup_{|ξ|² ≤ nc} | R_n(ξ̂_{AE}(ρ̂), ξ) − r(|ξ|²/n) | = 0.

Consequently,

(3.8)   lim_{n→∞} sup_{|ξ|² ≤ nc} R_n(ξ̂_{AE}(ρ̂), ξ) = r(c),

so that ξ̂_{AE}(ρ̂) achieves the asymptotic minimax bound (3.1) for every finite c > 0. The same conclusions hold for ξ̂_E(ρ̂) when ρ̂² = [|X|² − n + d]_+.

This result entails, in particular, that the James–Stein estimator ξ̂_S and the positive-part James–Stein estimator are both asymptotically minimax for ξ on balls about the origin. Such is not the case for the classical estimator X, because

(3.9)   lim_{n→∞} sup_{|ξ|² ≤ nc} R_n(X, ξ) = 1 > r(c)

for every c > 0.

4. Stein confidence sets

Remark (viii) on p. 205 of Stein [10] briefly stated: "Nevertheless it seems clear that we shall obtain confidence sets which are appreciably smaller geometrically than the usual disks centered at the sample mean vector." A method for constructing such confidence balls was described in the penultimate paragraph of Stein [12], in connection with a general conjecture. We describe how, asymptotically in n, Stein's method yields geometrically smaller confidence sets for ξ that are centered at the James–Stein estimator ξ̂_S.

Consider confidence balls for ξ centered at estimators ξ̂ = ξ̂(X),

(4.1)   C(ξ̂, d̂) = { x : |ξ̂ − x| ≤ d̂ }.


The radius d̂ = d̂(X) is such that the coverage probability P(C(ξ̂, d̂) ∋ ξ) under the model is exactly or asymptotically α. The geometrical size of C(ξ̂, d̂), viewed as a set-valued estimator of ξ, is measured by the geometrical risk

(4.2)   G_n(C(ξ̂, d̂), ξ) = n^{−1/2} E sup_{x ∈ C(ξ̂, d̂)} |x − ξ| = n^{−1/2} E|ξ̂ − ξ| + n^{−1/2} E(d̂).

This geometrical risk extends to confidence sets the quadratic risk criterion that supports Stein point estimation.

The classical confidence ball for ξ is

(4.3)   C_C = C(X, χ_n^{−1}(α)),

where the square of χ_n^{−1}(α) is the α-th quantile of the chi-squared distribution with n degrees of freedom. C_C is a ball centered at X whose squared radius for large n is approximately n + (2n)^{1/2} Φ^{−1}(α); here Φ^{−1} denotes the quantile function of the standard normal distribution. From this and (4.2):

• For every α ∈ (0, 1) and every c > 0,

(4.4)   P(C_C ∋ ξ) = α   for every ξ,
(4.5)   lim_{n→∞} sup_{|ξ|² ≤ nc} | G_n(C_C, ξ) − 2 | = 0.

Stein confidence balls for ξ have the form (4.1), with the James–Stein estimator ξ̂_S as center. To construct suitable critical values d̂ in this case, consider the root

(4.6)   D_n(X, ξ) = n^{−1/2} { |ξ̂_S − ξ|² − [ n − (n − 2)²/|X|² ] },

which compares the loss of the James–Stein estimator with an unbiased estimator of its risk. By orthogonal invariance, the distribution of D_n(X, ξ) depends on ξ only through |ξ|² and can thus be written as H_n(|ξ|²). Let ⇒ designate weak convergence of distributions. The triangular array central limit theorem implies:

• Suppose that lim_{n→∞} |ξ|²/n = a < ∞. Then

(4.7)   H_n(|ξ|²) ⇒ N(0, σ²(a)),

where

(4.8)   σ²(t) = 2 − 4t/(1 + t)² ≥ 1.

It follows from (3.6) that ρ̂² = [|X|² − n + 2]_+ is a good estimator of |ξ|², in the sense that lim_{n→∞} sup_{|ξ|² ≤ nc} P[ |ρ̂²/n − |ξ|²/n| > ε ] = 0 for every c > 0 and ε > 0. This and (4.7) motivate approximating H_n(|ξ|²) by N(0, σ²(ρ̂²/n)). The latter approximation and the definition (4.6) of D_n(X, ξ) suggest the asymptotic Stein confidence ball

(4.9)   C_{SA} = C(ξ̂_S, d̂_A(α)),

where

(4.10)   d̂_A(α) = [ n − (n − 2)²/|X|² + n^{1/2} σ(ρ̂²/n) Φ^{−1}(α) ]_+^{1/2}.
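A small numerical sketch of (4.9)–(4.10) (hypothetical code, not from the paper; it treats σ(ρ̂²/n) as the standard deviation obtained from (4.8), which is how we read the display above):

```python
import numpy as np
from scipy.stats import norm

def stein_ball(X, alpha=0.95):
    """Asymptotic Stein confidence ball C_SA = C(xi_S, d_A(alpha)), (4.9)-(4.10)."""
    n = X.size
    X2 = float(np.dot(X, X))
    center = (1.0 - (n - 2) / X2) * X                 # James-Stein estimator xi_S
    a = max(X2 - n + 2.0, 0.0) / n                    # rho_hat^2 / n
    sigma = np.sqrt(2.0 - 4.0 * a / (1.0 + a) ** 2)   # sigma(a), from (4.8)
    d2 = max(n - (n - 2) ** 2 / X2 + np.sqrt(n) * sigma * norm.ppf(alpha), 0.0)
    return center, np.sqrt(d2)                        # center and radius d_A(alpha)
```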

Asymptotic analysis establishes


• For every α ∈ (0, 1) and every c > 0,

(4.11)   lim_{n→∞} sup_{|ξ|² ≤ nc} | P(C_{SA} ∋ ξ) − α | = 0

and

(4.12)   lim_{n→∞} sup_{|ξ|² ≤ nc} | G_n(C_{SA}, ξ) − r_{SA}(|ξ|²/n) | = 0,

where

(4.13)   r_{SA}(t) = 2[ t/(1 + t) ]^{1/2} < 2.

Like the classical confidence ball centered at X, the Stein confidence ball C_{SA} centered at ξ̂_S has the correct asymptotic coverage probability α, uniformly over large compact balls about the shrinkage point ξ = 0. Comparing (4.12) with (4.5), the geometrical risk of C_{SA} is asymptotically smaller than that of C_C, particularly when ξ is near 0.

Obtaining valid bootstrap critical values for Stein confidence sets requires care, because the naive bootstrap fails. Define the constrained length estimator of ξ by

(4.14)   ξ̂_{CL} = [ 1 − (n − 2)/|X|² ]_+^{1/2} X.

The triangular array central limit theorem implies:

• Suppose that lim_{n→∞} |ξ|²/n = a < ∞. Then, for σ² defined in (4.8),

(4.15)   H_n(|ξ̂_{CL}|²) ⇒ N(0, σ²(a)),

while

(4.16)   H_n(|X|²) ⇒ N(0, σ²(1 + a)),   H_n(|ξ̂_S|²) ⇒ N(0, σ²(a²/(1 + a))),

the weak convergences all being in probability. See Beran [1] for proof details.

In view of (4.7), the bootstrap distribution estimator Ĥ_B = H_n(|ξ̂_{CL}|²) converges weakly in probability to H_n(|ξ|²), as desired, while the naive bootstrap distribution estimators H_n(|X|²) and H_n(|ξ̂_S|²) do not. Let d̂_B(α) be the α-th quantile of Ĥ_B. Conclusions (4.11) and (4.12) continue to hold for the bootstrap Stein confidence ball C_{SB} = C(ξ̂_S, d̂_B(α)). Further analysis reveals that both the asymptotic and the bootstrap forms of the Stein confidence ball have coverage errors of order O(n^{−1/2}), and that coverage accuracy of order O(n^{−1}) is achieved by a prepivoted bootstrap construction of the confidence ball radius. See Beran [1] for details.
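A sketch of the bootstrap construction (hypothetical code, not from the paper; it approximates H_n(|ξ̂_{CL}|²) by resampling the root (4.6) from X* ~ N(ξ*, I) with |ξ*|² = |ξ̂_{CL}|², and converts the quantile of the root into the implied ball radius, mirroring the passage from (4.6) to (4.10)):

```python
import numpy as np

def bootstrap_stein_radius(X, alpha=0.95, B=2000, rng=None):
    """Radius of the bootstrap Stein ball C_SB = C(xi_S, d_B(alpha)); naive
    recentering at |X|^2 or |xi_S|^2 would fail, per (4.16)."""
    rng = np.random.default_rng(rng)
    n = X.size
    X2 = float(np.dot(X, X))
    xi_star = np.sqrt(max(1.0 - (n - 2) / X2, 0.0)) * X   # constrained length (4.14)
    D = np.empty(B)
    for b in range(B):                                    # bootstrap the root (4.6)
        Xb = xi_star + rng.standard_normal(n)
        Xb2 = float(np.dot(Xb, Xb))
        xiS_b = (1.0 - (n - 2) / Xb2) * Xb                # James-Stein at X*
        D[b] = (np.sum((xiS_b - xi_star) ** 2)
                - (n - (n - 2) ** 2 / Xb2)) / np.sqrt(n)
    q = np.quantile(D, alpha)
    return np.sqrt(max(n - (n - 2) ** 2 / X2 + np.sqrt(n) * q, 0.0))
```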

5. Multiple Stein shrinkage

The James–Stein estimator is often viewed as a curiosity of little practical use. The semifinal paragraph on p. 198 of Stein [10] addressed this point and showed how to resolve it: "A simple way to obtain an estimator which is better for most practical purposes is to represent the parameter space ... as an orthogonal direct sum of two or more subspaces, also of large dimension and apply spherically symmetric estimators separately in each." The geometric asymptotic reasoning in Stein's paper extends readily to multiple shrinkage.

Let O = [O_1 | O_2 | ... | O_s] be a specified n × n orthogonal matrix partitioned into s submatrices {O_k : 1 ≤ k ≤ s} such that O_k is n × n_k, each n_k ≥ 1, and Σ_{k=1}^s n_k = n. Define P_k = O_k O_k′. The {P_k : 1 ≤ k ≤ s} are orthogonal projections on ℝ^n, are mutually orthogonal, and sum to I_n. The mean vector ξ and the data vector X can then be expressed as sums, ξ = Σ_{k=1}^s P_k ξ and X = Σ_{k=1}^s P_k X, the summands in each case being mutually orthogonal.

Consider the candidate multiple shrinkage estimators

(5.1)   ξ̂(a) = Σ_{k=1}^s a_k P_k X,   a ∈ [0, 1]^s,

where a = (a_1, a_2, ..., a_s). These form the closure of the class of candidate penalized least squares estimators

(5.2)   argmin_{ξ ∈ ℝ^n} [ |X − ξ|² + Σ_{k=1}^s λ_k |P_k ξ|² ],   λ_k ≥ 0, 1 ≤ k ≤ s.

Let τ_k = n^{−1} tr(P_k) = n_k/n and let w_k = n^{−1} |P_k ξ|². Then the normalized quadratic risk n^{−1} E|ξ̂(a) − ξ|² is

(5.3)   R(ξ̂(a), ξ) = Σ_{k=1}^s r(a_k, τ_k, w_k),

where r(a_k, τ_k, w_k) = (a_k − ã_k)²(τ_k + w_k) + τ_k ã_k, with ã_k = w_k(τ_k + w_k)^{−1}. Let ã = (ã_1, ã_2, ..., ã_s). The oracle multiple shrinkage estimator that minimizes risk is clearly ξ̃_{MS} = ξ̂(ã), and the oracle risk is

(5.4)   R(ξ̃_{MS}) = Σ_{k=1}^s τ_k w_k (τ_k + w_k)^{−1}.

Unfortunately, ξ̃_{MS} depends on the unknown {w_k}. Let ŵ_k = w̆_{k+}, where w̆_k = n^{−1}|P_k X|² − τ_k and w̆_{k+} is the positive part of w̆_k. Note that ŵ_k is non-negative like w_k and satisfies the inequality |ŵ_k − w_k| ≤ |w̆_k − w_k|. Replacing w_k with ŵ_k in the oracle estimator just described yields the multiple shrinkage estimator

(5.5)   ξ̂_{MS} = Σ_{k=1}^s ŵ_k (τ_k + ŵ_k)^{−1} P_k X.

Plugging {ŵ_k} into (5.4) also yields an estimator for the risk of ξ̂_{MS},

(5.6)   R̂(ξ̂_{MS}) = Σ_{k=1}^s τ_k ŵ_k (τ_k + ŵ_k)^{−1}.
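As a sketch (hypothetical code, not from the paper), (5.5) and (5.6) are immediate to compute once the projections P_k are fixed:

```python
import numpy as np

def multiple_shrinkage(X, projections):
    """Multiple shrinkage estimator (5.5) and its plug-in risk estimate (5.6);
    `projections` is a list of the mutually orthogonal projection matrices P_k."""
    n = X.size
    xi_hat = np.zeros(n)
    risk_hat = 0.0
    for P in projections:
        tau = np.trace(P) / n                    # tau_k = n^{-1} tr(P_k)
        PX = P @ X
        w = max(np.dot(PX, PX) / n - tau, 0.0)   # w_hat_k = [n^{-1}|P_k X|^2 - tau_k]_+
        xi_hat += (w / (tau + w)) * PX           # term of (5.5)
        risk_hat += tau * w / (tau + w)          # term of (5.6)
    return xi_hat, risk_hat
```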

Asymptotically in n, the following holds:

• For every finite c > 0 and fixed integer s,

(5.7)   lim_{n→∞} sup_{n^{−1}|ξ|² ≤ c} | R(ξ̂_{MS}, ξ) − R(ξ̃_{MS}, ξ) | = 0.


Moreover, for V equal to either the loss n^{−1}|ξ̂_{MS} − ξ|² or the risk R(ξ̂_{MS}, ξ),

(5.8)   lim_{n→∞} sup_{n^{−1}|ξ|² ≤ c} E| R̂(ξ̂_{MS}) − V | = 0.

Thus, the risk of the multiple shrinkage estimator ξ̂_{MS} converges to the best risk achievable over the candidate class, and its plug-in risk estimator converges to its actual risk or loss. Stein [11] improved on ξ̂_{MS} through an exact risk analysis for finite n and described an application to the estimation of means in ANOVA models. The foregoing development is extended in Beran [4] to multiple affine shrinkage of a data matrix X, with first application to MANOVA models.

A much larger class of candidate estimators is generated by including, for each value of n, every possible selection of the column dimensions n_1, n_2, ..., n_s. Redefine ξ̂_{MS} and ξ̃_{MS} to minimize, respectively, estimated risk and risk over this larger class of candidate estimators. Convergences (5.7) and (5.8) continue to hold, by applying the analysis in Beran and Dümbgen ([5], p. 1832) of bounded total variation shrinkage.

6. Adaptive symmetric linear estimators

Larger than the class of candidate multiple shrinkage estimators is the class of candidate symmetric linear estimators

(6.1)   ξ̂(A(t)) = A(t) X,   t ∈ T,

where {A(t) : t ∈ T} is a family of n × n positive semidefinite matrices indexed by t. This class of estimators includes penalized least squares estimators with multiple quadratic penalties, running weighted means, nested submodel fits in regression, and more. Let {λ_k(t) : 1 ≤ k ≤ s} denote the distinct eigenvalues of A(t) and let {P_k(t) : 1 ≤ k ≤ s} denote the associated eigenprojections; here s ≤ n may depend on n. Then

(6.2)   ξ̂(A(t)) = Σ_{k=1}^s λ_k(t) P_k(t) X,   t ∈ T,

represents ξ̂(A(t)) as a candidate multiple shrinkage estimator.

If the index set T is not too large, in the covering number sense of modern empirical process theory, it may be possible to find t̂ = t̂(X) ∈ T such that the risk of the adaptive estimator ξ̂(A(t̂)) converges to the smallest risk achievable over the candidate class (6.2) as n tends to infinity. See Beran and Dümbgen [5] and Beran [3] for instances of such asymptotics. Such results link the profound insights and results in Stein [10] with modern theory for regularized estimators of high-dimensional parameters, estimators that have proved their value in practice.

7. Envoi

Gauss offered two brief justifications for the method of least squares. The first was what we now call the maximum likelihood argument. The second, mentioned years later in a letter to Bessel, was the concept of risk and the start of what we now call the Gauss–Markov theorem.


Stein’s prophetic work [10] revealed that neither maximum likelihood estimators nor unbiased estimators necessarily have low risk when the dimension of the parameter space is not small. Despite the wonderfully transparent asymptotic geometry in his paper—geometry that extends readily to useful multiple shrinkage estimators and to the construction of confidence balls around these—many found his insights unbearable and labelled his findings paradoxical. Few contemporaries appear to have read his paper [10] carefully. Modern regularization estimators that reduce risk through beneficial multiple shrinkage have made manifest the fundamental nature of Stein’s achievement. References [1] Beran, R. (1995). Stein confidence sets and the bootstrap. Statistica Sinica 5 109–127. [2] Beran, R. (1996). Stein estimation in high dimensions: a retrospective. In Madan Puri Festschrift (E. Brunner and M. Denker, eds.) 91–110. VSP, Zeist. [3] Beran, R. (2007). Adaptation over parametric families of symmetric linear estimators. Journal of Statistical Planning and Inference (Special Issue on Nonparametric Statistics and Related Topics) 137 684–696. [4] Beran, R. (2008). Estimating a mean matrix: boosting efficiency by multiple affine shrinkage. Annals of the Institute of Statistical Mathematics 60 843–864. ¨ mbgen, L. (1998). Modulation of estimators and confidence [5] Beran, R. and Du sets. Annals of Statistics 26 1826–1856. [6] Efron, B. and Morris, C. (1973). Stein’s estimation rule and its competitors — an empirical Bayes approach. Journal of the American Statistical Association 68 117–130. [7] Hasminski, R. Z. and Nussbaum, M. (1984). An asymptotic minimax bound in a regression problem with an increasing number of nuisance parameters. In Proceedings of the Third Prague Symposium on Asymptotic Statistics (P. Mandl and M. Huˇskov´ a, eds.) 275–283. Elsevier, New York. [8] James, W. and Stein, C. (1961). Estimation with quadratic loss. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability (J. Neyman, ed.) 1 361–380. University of California Press. [9] Pinsker, M. S. (1980). Optimal filtration of square-integrable signals in Gaussian white noise. Problems of Information Transmission 16 120–133. [10] Stein, C. (1956). Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability (J. Neyman, ed.) 1 197–206. University of California Press. [11] Stein, C. (1966). An approach to the recovery of inter-block information in balanced incomplete block designs. In Festschrift for Jerzy Neyman (F. N. David, ed.) 351–364. Wiley, New York. [12] Stein, C. (1981) Estimation of the mean of a multivariate normal distribution. Annals of Statistics. 9 1135–1151. [13] Stigler, S. M. (1990). A Galtonian perspective on shrinkage estimators. Statistical Science 5 147–155. [14] Watson, G. S. (1983). Statistics on Spheres. Wiley-Interscience, New York.

IMS Collections
Nonparametrics and Robustness in Modern Statistical Inference and Time Series Analysis: A Festschrift in honor of Professor Jana Jurečková
Vol. 7 (2010) 35–45
© Institute of Mathematical Statistics, 2010
DOI: 10.1214/10-IMSCOLL704

On the estimation of cross-information quantities in rank-based inference

Delphine Cassart¹, Marc Hallin¹,* and Davy Paindaveine¹,†

Université Libre de Bruxelles

Abstract: Rank-based inference and, in particular, R-estimation, is a red thread running through Jana Jurečková's entire scientific career, starting with her dissertation in 1967, where she laid the foundations of an extension to linear regression of the R-estimation methods that had recently been proposed by Hodges and Lehmann [13]. Cross-information quantities in that context play an essential role. In location/regression problems, these quantities take the form ∫₀¹ ϕ(u) ϕ_g(u) du, where ϕ is a score function and ϕ_g(u) := g′(G^{-1}(u))/g(G^{-1}(u)) is the log-derivative of the unknown actual underlying density g computed at the quantile G^{-1}(u); in other models, they involve more general scores. Such quantities appear in the local powers of rank tests and the asymptotic variance of R-estimators. Estimating them consistently is a delicate problem that has been extensively considered in the literature. We provide here a new, flexible, and very general method for that problem, which furthermore applies well beyond the traditional case of regression models.

1. Introduction

1.1. Asymptotic linearity and the foundations of R-estimation

The 1969 volume of the Annals of Mathematical Statistics is rightly famous for two pathbreaking papers (Jurečková [15]; Koul [17]) that opened the door to R-estimation procedures in linear regression models. Both papers were their author's first publication. Both were addressing, with different mathematical tools, in slightly different contexts, and under different assumptions, the same essential problem: the uniform asymptotic linearity of residual rank-based statistics in a regression parameter.

The idea of using rank-based test statistics in order to construct point estimators and confidence regions had been proposed, in 1963, by Hodges and Lehmann [13],

* Académie Royale de Belgique and CentER, Tilburg University. Research supported by the Sonderforschungsbereich "Statistical modelling of nonlinear dynamic processes" (SFB 823) of the German Research Foundation (Deutsche Forschungsgemeinschaft) and a Discovery Grant of the Australian Research Council. The financial support and hospitality of ORFE and the Bendheim Center at Princeton University, where part of this work was completed, is gratefully acknowledged. Marc Hallin is also a member of ECORE, the association between CORE and ECARES.
† Research supported by a Mandat d'Impulsion Scientifique of the Fonds National de la Recherche Scientifique, Communauté française de Belgique. Davy Paindaveine is also a member of ECORE, the association between CORE and ECARES.
¹ ECARES, Université Libre de Bruxelles, Avenue F.D. Roosevelt, 50, CP114, B-1050 Brussels, Belgium; e-mail: [email protected]; [email protected]; [email protected]; url: http://homepages.ulb.ac.be/~dpaindav

AMS 2000 subject classifications: Primary 62G99; secondary 62G05, 62G10.
Keywords and phrases: ranks, rank tests, R-estimation, cross-information, local power, asymptotic variance.



in the context of one- and two-sample location models. The potential applications of that idea in a much broader context were clear, and immediately triggered a surge of activity with the objective of extending the new technique to more general models. The analysis of variance case was very soon developed by Lehmann himself (Lehmann [23]; see also Sen [29]), very much along the same lines as in his original paper with Hodges. But the simple and multiple regression cases were considerably more difficult, the main obstacle to the desired result being a uniform asymptotic linearity property of the rank statistics to be used in the (regression) parameters. That result was more challenging than expected; it is missing, for instance, in Adichie [1]. It was successfully established, simultaneously and independently, in 1967, in two doctoral dissertations, one by Jana Jurečková (in Czech, defended in Prague; advisor Jaroslav Hájek), the other one by Hira Koul (defended in Berkeley; advisor Peter Bickel). Although essentially addressing the same issue, the two contributions (Jurečková [15]; Koul [17]) have little overlap: ranks and Hájek projection methods on the one hand, signed ranks and Billingsley-style weak convergence techniques on the other. Both were published in the same 1969 issue of the Annals of Mathematical Statistics.

Those uniform asymptotic linearity results paved the way for a complete theory of rank-based estimation in linear models and their extensions to parametric regression and time series, both linear and nonlinear; see the monographs by Puri and Sen [26], Jurečková and Sen [16], or Koul [18, 19] for systematic expositions. This modest contribution to the subject is a tribute to Jana Jurečková's pioneering work in the domain.

1.2. Cross-information quantities

Denoting by Q(ϑ₀) some rank-based test statistic for a two-sided null hypothesis of the form ϑ = ϑ₀, an R-estimator ϑ̃ of ϑ is usually defined as a minimizer of Q(ϑ), that is, ϑ̃ := argmin_ϑ Q(ϑ). Under appropriate regularity conditions,

and irrespective of the model under study, the asymptotic performances of the R-estimator ϑ̃ and of the related rank test typically are the same. More specifically, the local powers of rank tests are monotone functions of quantities of the form

(1.1)   ( ∫₀¹ ϕ(u) ϕ_g(u) du )²,

whereas the related R-estimators are asymptotically normal, with asymptotic variances proportional to the inverse of the same quantity. Here ϕ is the score function defining the rank-based statistic Q(ϑ) from which the R-estimator is constructed, while, in the context of location and regression, ϕ_g(u) := g′(G^{-1}(u))/g(G^{-1}(u)) is the log-derivative of the unknown actual underlying density g (with distribution function G) of the error terms underlying the model, computed at G^{-1}(u). All usual score functions ϕ being themselves of the form ϕ_f for some reference density f, the integral in (1.1) generally is of the form

J(f; g) := ∫₀¹ ϕ_f(u) ϕ_g(u) du = ∫_{−∞}^{∞} [ f′(F^{-1}(G(z))) / f(F^{-1}(G(z))) ] [ g′(z) / g(z) ] g(z) dz.

Under that form, and since

I_f := J(f; f) = ∫_{−∞}^{∞} ( f′(z)/f(z) )² f(z) dz   and   I_g := J(g; g) = ∫_{−∞}^{∞} ( g′(z)/g(z) )² g(z) dz


are Fisher information quantities (for location), J(f; g) clearly can be interpreted as a cross-information quantity, which explains the terminology and the notation we are using throughout, although ϕ_f and ϕ_g in the sequel need not be log-derivatives of probability densities.

That relation between rank tests and R-estimators extends to the multiparameter case, with information and cross-information quantities entering the definition of information and cross-information matrices. It also extends to more general models, much beyond the case of linear regression, where information and cross-information quantities still take the form (1.1) but involve scores ϕ_f and ϕ_g that are no longer location scores; the notation J(g) will be used in a generic way for an integral of the form (1.1) where ϕ is the score of the rank statistic under study, and ϕ_g the log-derivative of the unknown actual density g with respect to the appropriate parameter of interest.

1.3. One-step R-estimation

An alternative to the classical Hodges–Lehmann argmin definition of an R-estimator was considered recently, for the estimation of the shape matrix of elliptical observations, by Hallin, Oja, and Paindaveine [12]. That method, which is directly connected to Le Cam's one-step approach to estimation problems, actually extends to a very broad range of uniformly locally asymptotically normal (ULAN) models, and is based on the local linearization of a rank-based version of the central sequence of the family. Such a linearization, in a sense, revives, in the context of Le Cam's asymptotic theory of statistical experiments, an old idea that goes back to van Eeden and Kraft [31] and Antille [2]. The same idea has also been exploited by McKean and Hettmansperger [24], still in the traditional linear model setting, and in the slightly different approach initiated by Jaeckel [14] (which involves the argmin of a function that is not purely rank-based).

One-step estimators avoid some of the computational problems related to argmins of discrete-valued and possibly non-convex objective functions of (in the multiparameter case) several variables. Under their original form (as proposed by van Eeden and Kraft), however, they fail to achieve the same optimality bounds (parametric or nonparametric) as their argmin counterparts. McKean and Hettmansperger [24], in the context of linear models with symmetric noise, and Hallin, Oja, and Paindaveine [12], in the context of shape matrix estimation, solve that problem by introducing an estimated cross-information factor in the linearization step. Although different from (1.1) (since the scores ϕ_f and ϕ_g are those related to shape parameters), the cross-information quantity for shape plays exactly the same role in the asymptotic covariance matrix of R-estimators of shape as (1.1) does in the asymptotic variance of R-estimators of location or in the asymptotic covariance matrix of R-estimators of regression coefficients.

Whether entering as an essential ingredient in some one-step form of estimation or not, cross-information quantities explicitly appear in the asymptotic variances of R-estimators, and thus need to be estimated. Now, the trouble with cross-information quantities is that, being expectations, under the unspecified actual density g, of a function which itself depends on that unknown g, they are not easily estimated.
That difficulty may well be one of the main reasons why R-estimation, despite all its attractive theoretical features, never really made its way to everyday practice.


1.4. Estimation of cross-information quantities

A vast literature has been devoted to the problem of estimating (1.1) in the context of linear models with i.i.d. errors (except for Hallin, Oja, and Paindaveine [12], more general cross-information quantities, to the best of our knowledge, have not been considered so far). Four approaches, mainly, have been investigated.

(a) McKean and Hettmansperger [24] estimate J(f; g) as the ratio of a (1 − α) confidence interval to the corresponding standard normal interquantile range; that idea can be traced back to Lehmann [23] and Sen [29], and requires the arbitrary choice of a confidence level (1 − α), which has no consequence in the limit but, for finite n, may have quite an impact (Aubuchon and Hettmansperger [3], in the same context, propose using interquartile ranges or median absolute deviations from the median). A similar idea, along with powerful higher-order methods leading to most interesting distributional results, is exploited by Omelka [25], but it requires the same choice of a confidence level (1 − α).

(b) Some other authors (Antille [2]; Jurečková and Sen [16], p. 321) rely on the asymptotic linearity property of rank statistics, evaluating the consequence of an O(n^{−1/2}) perturbation of ϑ₀ on the test statistic for H₀ : ϑ = ϑ₀. This again involves an arbitrary choice, that of the amplitude cn^{−1/2}, c ∈ ℝ₀ (in the multiparameter case, cn^{−1/2}, c ∈ ℝ^k \ {0}), of the perturbation. Again, different values of c (or of the vector c) lead, for finite n, to completely different estimators; asymptotically this has no impact, but finite-n results can be quite dramatically affected.

(c) More sophisticated methods involving window or kernel estimates of g (hence performing poorly under small and moderate sample sizes) have been considered, for Wilcoxon scores, by Schuster [27] and Schweder [28] (see also Cheng and Serfling [7]; Koul, Sievers and McKean [20]; Bickel and Ritov [5]; Fan [8]; and, in a more general setting, Section 4.5 of Koul [19]). Instead of a confidence level (1 − α) or a deviation c, a kernel and a bandwidth have to be selected. Density estimation methods, moreover, are somewhat antinomic to the spirit of rank-based methods: if estimated densities are to be used, indeed, using them all the way by considering semiparametric tests based on estimated scores (in the spirit of Bickel et al. [4]) seems more coherent than considering ranks.

(d) Finally, jackknifing and the bootstrap have also been utilized in this context: see George and Osborne [9] and George et al. [10] for an investigation of that approach and some empirical findings.

The approach proposed in Hallin, Oja, and Paindaveine [12] is of a different nature. It is based on the asymptotic linearity of a rank-based central sequence, hence requires uniform local asymptotic normality in the Le Cam sense, and consists in solving a local linearized likelihood equation. It does not involve any arbitrary choices and, irrespective of the dimension of the parameter of interest, its implementation involves one-dimensional optimization only. However, it can only handle information quantities entering as a scalar factor in the information matrix of a given model or, in the case of a block-diagonal information matrix, in some diagonal block thereof. This places a restriction on the quantities to be estimated, and rules out some cases, such as the information quantity for skewness derived in Cassart et al. [6].
In this contribution, we propose a generalization of the Hallin, Oja, and Paindaveine method that does not require uniform local asymptotic normality,


and can accommodate much more general situations, including that of Cassart et al. [6].

2. Consistent estimation of cross-information quantities

Let P^{(n)} := { P^{(n)}_{ϑ;g} | ϑ ∈ Θ, g ∈ F } be a family (actually, a sequence of them, indexed by n ∈ ℕ) of probability measures over some observation space (usually ℝ^n, equipped with its Borel σ-field), indexed by a k-dimensional parameter ϑ ∈ ℝ^k and a univariate probability density g; ϑ ranges over some open subset Θ of ℝ^k, and g over some broad class of densities F. Associated with that observation, assume that there exists an n-tuple (Z₁^{(n)}(ϑ), ..., Z_n^{(n)}(ϑ)) of residuals such that Z₁^{(n)}(ϑ₀), ..., Z_n^{(n)}(ϑ₀) under P^{(n)}_{ϑ;g} are independent and identically distributed with density g iff ϑ = ϑ₀.

Denoting by R_i^{(n)}(ϑ) the rank of Z_i^{(n)}(ϑ) among Z₁^{(n)}(ϑ), ..., Z_n^{(n)}(ϑ), the vector R^{(n)}(ϑ) := (R₁^{(n)}(ϑ), ..., R_n^{(n)}(ϑ)) under P^{(n)}_{ϑ;g} is uniformly distributed over the n! permutations of {1, ..., n}, irrespective of g, a distribution-freeness property which serves as the starting point of rank tests and R-estimation of ϑ in the family P^{(n)}. Our goal is to estimate consistently a cross-information quantity J(g) > 0 that enters the picture through the following assumption.

Assumption (A). There exists a sequence S^{(n)}(ϑ) of k-dimensional R^{(n)}(ϑ)-measurable statistics such that, under P^{(n)}_{ϑ;g},

(i) S^{(n)}(ϑ), n ∈ ℕ, is uniformly tight and asymptotically bounded away from the origin; more precisely, for all ε > 0, there exist δ_ε > 0, M_ε and N_ε such that, for all n ≥ N_ε,

P^{(n)}_{ϑ;g}[ δ_ε ≤ ‖S^{(n)}(ϑ)‖ ≤ M_ε ] ≥ 1 − ε

(uniformity here is with respect to n, not ϑ);

(ii) there exists a continuous mapping ϑ ↦ Υ^{−1}(ϑ), where Υ^{−1}(ϑ) is a full-rank k × k matrix, such that

(2.1)   S^{(n)}(ϑ + n^{−1/2} t^{(n)}) = S^{(n)}(ϑ) − J(g) Υ^{−1}(ϑ) t^{(n)} + o_P(1)   as n → ∞,

for any bounded sequence t^{(n)} ∈ ℝ^k.

We will also need

Assumption (B). A root-n consistent estimator ϑ̂^{(n)} of ϑ is available, such that, under P^{(n)}_{ϑ;g}, S^{(n)}(ϑ̂^{(n)}) is asymptotically bounded away from zero: for all ε > 0, there exist δ_ε and N_ε such that

P^{(n)}_{ϑ;g}[ ‖S^{(n)}(ϑ̂^{(n)})‖ ≥ δ_ε ] ≥ 1 − ε

for all n ≥ N_ε.

Note that part (i) of Assumption (A) is rather mild, as it is satisfied as soon as S^{(n)}(ϑ) under P^{(n)}_{ϑ;g} converges in distribution to a random vector that has no atom at the origin. As for part (ii), it does not require the asymptotic linearity (2.1) to be uniform. Similarly, Assumption (B) requires that S^{(n)}(ϑ̂^{(n)}) asymptotically


has no atom at 0. The statistic S^{(n)} indeed is to provide, via its local behavior (2.1), an estimator for J(g), not a test statistic, nor (through some estimating equation) an estimator for ϑ: Assumption (B) thus explicitly rules out an estimator that would be obtained as ϑ̂^{(n)} = argmin_ϑ ‖S^{(n)}(ϑ)‖.

In order to control for the uniformity of local behaviors, a discretized version ϑ̂_#^{(n)} of ϑ̂^{(n)} will be considered in theoretical asymptotic statements. Such a version can be obtained, for instance, by letting

(ϑ̂_#^{(n)})_i := (cn^{1/2})^{−1} sign((ϑ̂^{(n)})_i) ⌈ cn^{1/2} |(ϑ̂^{(n)})_i| ⌉,   i = 1, ..., k,

for some arbitrary discretization constant c > 0. This discretization trick, which is due to Le Cam, is quite standard in the context of one-step estimation. While retaining root-n consistency, discretized estimators indeed enjoy the important property of asymptotic local discreteness; that is, as n → ∞, they take only a bounded number of distinct values in ϑ-centered balls with O(n^{−1/2}) radius. In fixed-n practice, however, such discretizations are irrelevant (one cannot work with an infinite number of decimal values, and c can be chosen arbitrarily large). The reason why discretization is required in asymptotic statements is that (see, for instance, Lemma 4.4 of Kreiss [21]) (2.1) then also holds with n^{1/2}(ϑ̂_#^{(n)} − ϑ) substituted for t^{(n)}, yielding

S^{(n)}(ϑ̂_#^{(n)}) = S^{(n)}(ϑ) − n^{1/2} J(g) Υ^{−1}(ϑ)(ϑ̂_#^{(n)} − ϑ) + o_P(1)

as n → ∞ under P^{(n)}_{ϑ;g}. This stochastic form of (2.1) in a sense takes care of uniformity problems.

We now describe the construction of our estimator of J(g). For any λ ∈ ℝ⁺, define

(2.2)   ϑ_λ^{(n)} := ϑ̂_#^{(n)} + n^{−1/2} λ Υ(ϑ̂_#^{(n)}) S^{(n)}(ϑ̂_#^{(n)}).

When λ ranges over the positive real line, ϑ_λ^{(n)} for fixed n thus moves, monotonically with respect to λ, along a half-line with origin ϑ̂_#^{(n)}. Note that any ϑ_λ^{(n)}, once discretized into ϑ_{λ#}^{(n)}, provides a new root-n consistent and asymptotically locally discrete estimator of ϑ to which (2.1) applies. It follows that

(2.3)   S^{(n)}(ϑ_{λ#}^{(n)}) − S^{(n)}(ϑ̂_#^{(n)}) = −λ J(g) S^{(n)}(ϑ̂_#^{(n)}) + o_P(1),

still as n → ∞ under P^{(n)}_{ϑ;g}. Moreover, ϑ_{λ#}^{(n)} also can serve as the starting point for an iteration of the type (2.2), yielding, for any μ ∈ ℝ⁺, a further root-n consistent estimator of the form

(2.4)   ϑ_{λ#}^{(n)} + n^{−1/2} μ Υ(ϑ_{λ#}^{(n)}) S^{(n)}(ϑ_{λ#}^{(n)}).

From (2.4) we thus obtain, for all λ > 0,

(2.5)   (S^{(n)}(ϑ_{λ#}^{(n)}))′ Υ′(ϑ_{λ#}^{(n)}) Υ(ϑ̂_#^{(n)}) S^{(n)}(ϑ̂_#^{(n)})
(2.6)      = (1 − λJ(g)) (S^{(n)}(ϑ̂_#^{(n)}))′ Υ′(ϑ_{λ#}^{(n)}) Υ(ϑ̂_#^{(n)}) S^{(n)}(ϑ̂_#^{(n)}) + o_P(1)
(2.7)      = (1 − λJ(g)) (S^{(n)}(ϑ̂_#^{(n)}))′ Υ′(ϑ) Υ(ϑ) S^{(n)}(ϑ̂_#^{(n)}) + o_P(1).


The intuition behind our method lies in the fact that (2.5), which is the scalar product of the increments in (2.2) and (2.4), is, up to o_P(1)'s, the decreasing linear function (2.7) of λ: since Υ has full rank, the quadratic form in (2.7) indeed is positive definite. That function takes positive values for λ close to zero, and changes sign at λ = J^{−1}(g). Let therefore (c is an arbitrary discretization constant that plays no role in practical implementations)

(2.8)   λ_−^{(n)} := min{ λ ∈ c^{−1}ℕ : (S^{(n)}(ϑ_{λ+1/c, #}^{(n)}))′ Υ′(ϑ_{λ+1/c, #}^{(n)}) Υ(ϑ̂_#^{(n)}) S^{(n)}(ϑ̂_#^{(n)}) < 0 }

and λ_+^{(n)} := λ_−^{(n)} + 1/c. Define J^{(n)}(g) := (λ^{(n)})^{−1}, where λ^{(n)} is based on a linear interpolation between λ_−^{(n)} and λ_+^{(n)}, namely

λ^{(n)} := λ_−^{(n)} + (λ_+^{(n)} − λ_−^{(n)})
   × (S^{(n)}(ϑ_{λ_−#}^{(n)}))′ Υ′(ϑ_{λ_−#}^{(n)}) Υ(ϑ̂_#^{(n)}) S^{(n)}(ϑ̂_#^{(n)})
   / { [ (S^{(n)}(ϑ_{λ_−#}^{(n)}))′ Υ′(ϑ_{λ_−#}^{(n)}) − (S^{(n)}(ϑ_{λ_+#}^{(n)}))′ Υ′(ϑ_{λ_+#}^{(n)}) ] Υ(ϑ̂_#^{(n)}) S^{(n)}(ϑ̂_#^{(n)}) }

= λ_−^{(n)} + (1/c)
   × (S^{(n)}(ϑ_{λ_−#}^{(n)}))′ Υ′(ϑ_{λ_−#}^{(n)}) Υ(ϑ̂_#^{(n)}) S^{(n)}(ϑ̂_#^{(n)})
   / { [ (S^{(n)}(ϑ_{λ_−#}^{(n)}))′ Υ′(ϑ_{λ_−#}^{(n)}) − (S^{(n)}(ϑ_{λ_+#}^{(n)}))′ Υ′(ϑ_{λ_+#}^{(n)}) ] Υ(ϑ̂_#^{(n)}) S^{(n)}(ϑ̂_#^{(n)}) };



we have the following result (see the Appendix for the proof).

Proposition 2.1. Let Assumptions (A) and (B) hold. Then J^{(n)}(g) = J(g) + o_P(1) as n → ∞, under P^{(n)}_{ϑ;g}.

As already mentioned, discretizing the estimators is a mathematical device which is needed in the proofs of asymptotic results but makes little sense in a fixed-n practical situation, as a very large discretization constant can be chosen. In practice, still assuming that Assumptions (A) and (B) hold, we recommend directly computing (J^{(n)}(g))^{−1} as

(J^{(n)}(g))^{−1} := λ^{(n)} := inf{ λ : (S^{(n)}(ϑ_λ^{(n)}))′ Υ′(ϑ_λ^{(n)}) Υ(ϑ̂^{(n)}) S^{(n)}(ϑ̂^{(n)}) < 0 }.
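A minimal sketch of this direct computation (hypothetical code; S and Upsilon stand for user-supplied implementations of the rank-based statistic S^{(n)}(·) and of the matrix Υ(·) of Assumption (A), and the grid parameters are illustrative):

```python
import numpy as np

def estimate_J(theta_hat, S, Upsilon, n, lam_max=50.0, lam_step=0.01):
    """Walk along the half-line (2.2) and return J^(n)(g) = 1/lambda at the
    first lambda where the scalar product (2.5), which behaves like
    (1 - lambda * J(g)) times a positive quantity, turns negative."""
    S0 = np.asarray(S(theta_hat))
    U0S0 = Upsilon(theta_hat) @ S0                  # Upsilon(theta_hat) S(theta_hat)
    for lam in np.arange(lam_step, lam_max, lam_step):
        theta_lam = theta_hat + lam * U0S0 / np.sqrt(n)      # (2.2)
        if np.asarray(S(theta_lam)) @ Upsilon(theta_lam).T @ U0S0 < 0.0:
            return 1.0 / lam
    return 1.0 / lam_max                            # no sign change on the grid
```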



ˆ (n) are arbitrarily ˆ (n) and ϑ Indeed, for large values of the discretization constant c, ϑ # (n)

(n)

close, as well as λ− and λ+ defined in (2.8). 3. Conclusion Proposition 2.1 establishes the consistency of the proposed estimator of crossinformation quantities. Consistency indeed is the only property required from estimators of cross-information quantities—be it in the construction of a one-step R-estimator ϑ of ϑ or in the estimation of its asymptotic variance (with the pur

42

D. Cassart, M. Hallin, and D. Paindaveine

pose, for instance, of computing asymptotically valid confidence regions for ϑ ). We do not provide (and, to the best of our knowledge, nobody, in that context, ever has) any indication about the consistency rates and asymptotic distribution of J (n) (g) as an estimator of J (g)—even less about its optimality. While they have no impact on the asymptotic behavior of ϑ , the choices of (i) the sequence of rank-based

ˆ (n), and (iii) the discretization conϑ), (ii) the initial estimator statistics S(n) (ϑ ϑ stant c are likely to affect its finite-sample performances. However, the magnitude of such effects can be expected to be negligible when compared to the estimation error ( ϑ − ϑ ) itself.
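For concreteness, here is a minimal numerical sketch (ours, not from the paper) of the practical computation rule displayed above. The callables S and Upsilon, the argument names, and the scan parameters are all assumptions of the illustration, not part of the original method description:

```python
import numpy as np

def estimate_J(S, Upsilon, theta_hat, n, step=0.05, lam_max=50.0):
    """Estimate the cross-information J(g) by scanning lambda for the sign
    change of the quadratic form in (2.5)-(2.7) and linearly interpolating.
    S(theta) and Upsilon(theta) are user-supplied callables (hypothetical
    interface) returning the rank-based statistic (a vector) and the
    full-rank matrix, respectively."""
    U0 = Upsilon(theta_hat) @ S(theta_hat)      # increment direction at theta_hat

    def qform(lam):
        theta_lam = theta_hat + lam * U0 / np.sqrt(n)
        return S(theta_lam) @ Upsilon(theta_lam).T @ U0

    lam_minus, q_minus = 0.0, qform(0.0)        # positive for lambda near zero
    lam = step
    while lam <= lam_max:
        q = qform(lam)
        if q < 0:                               # first sign change found
            # interpolate between (lam_minus, q_minus) and (lam, q)
            lam_star = lam_minus + step * q_minus / (q_minus - q)
            return 1.0 / lam_star
        lam_minus, q_minus = lam, q
        lam += step
    raise RuntimeError("no sign change found; increase lam_max")
```

The scan starts at a small positive λ, where the quadratic form is positive, and stops at its first sign change, mimicking the discretized definition (2.8) with c playing the role of 1/step.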

Appendix A: Proof of Proposition 2.1

To start with, let us show that $\lambda^{(n)}_-$, defined in (2.8), hence also $\lambda^{(n)}_+$, is $O_P(1)$ under $P^{(n)}_{\vartheta;g}$. Assume therefore it is not: then there exist $\epsilon > 0$ and a sequence $n_i \uparrow \infty$ such that, for all $L \in \mathbb{R}$ and $i$, $P^{(n_i)}_{\vartheta;g}\big[\lambda^{(n_i)}_- > L\big] > \epsilon$. This implies, for arbitrarily large L, that
$$P^{(n_i)}_{\vartheta;g}\Big[\big(S^{(n_i)}(\hat{\vartheta}^{(n_i)}_{L\#})\big)'\,\Upsilon'\big(\hat{\vartheta}^{(n_i)}_{L\#}\big)\,\Upsilon\big(\hat{\vartheta}^{(n_i)}_{\#}\big)\,S^{(n_i)}\big(\hat{\vartheta}^{(n_i)}_{\#}\big) > 0\Big] > \epsilon,$$
hence, in view of (2.7),
$$P^{(n_i)}_{\vartheta;g}\Big[(1 - L\,\mathcal{J}(g))\,\big(S^{(n_i)}(\hat{\vartheta}^{(n_i)}_{\#})\big)'\,\Upsilon'(\vartheta)\,\Upsilon(\vartheta)\,S^{(n_i)}\big(\hat{\vartheta}^{(n_i)}_{\#}\big) + \zeta^{(n_i)} > 0\Big] > \epsilon$$
for all $i$, where $\zeta^{(n)}$, $n \in \mathbb{N}$, is some $o_P(1)$ sequence. For $L > (\mathcal{J}(g))^{-1}$, this entails, for all $i$,
$$P^{(n_i)}_{\vartheta;g}\Big[0 < \big(S^{(n_i)}(\hat{\vartheta}^{(n_i)}_{\#})\big)'\,\Upsilon'(\vartheta)\,\Upsilon(\vartheta)\,S^{(n_i)}\big(\hat{\vartheta}^{(n_i)}_{\#}\big) < |\zeta^{(n_i)}|/(L\,\mathcal{J}(g) - 1)\Big] > \epsilon,$$
which contradicts Assumption (B) that $S^{(n)}(\hat{\vartheta}^{(n)}_{\#})$ is bounded away from zero. It follows that $\lambda^{(n)}_-$ is $O_P(1)$ under $P^{(n)}_{\vartheta;g}$; actually, we have shown the stronger result that, for any $L > (\mathcal{J}(g))^{-1}$, $\lim_{n\to\infty} P^{(n)}_{\vartheta;g}\big[\lambda^{(n)}_- > L\big] = 0$.

In view of Assumption (B), for all $\eta > 0$, there exist $\delta_\eta > 0$ and an integer $N_\eta$ such that
$$P^{(n)}_{\vartheta;g}\Big[\big(S^{(n)}(\hat{\vartheta}^{(n)}_{\#})\big)'\,\Upsilon'\big(\hat{\vartheta}^{(n)}_{\#}\big)\,\Upsilon\big(\hat{\vartheta}^{(n)}_{\#}\big)\,S^{(n)}\big(\hat{\vartheta}^{(n)}_{\#}\big) \ge \delta_\eta\Big] \ge 1 - \eta/2$$
for all $n \ge N_\eta$. In view of (2.4), the fact that $\lambda^{(n)}_-$ and $\lambda^{(n)}_+$ are $O_P(1)$, and Assumption (A), for all $\eta > 0$ and $\varepsilon > 0$, there exists an integer $N_{\varepsilon,\delta} \ge N_\eta$ such that, for all $n \ge N_{\varepsilon,\delta}$ (with $\lambda^{(n)}_\pm$ standing for either $\lambda^{(n)}_-$ or $\lambda^{(n)}_+$),
$$P^{(n)}_{\vartheta;g}\Big[(1 - \mathcal{J}(g)\lambda^{(n)}_\pm)\,\big(S^{(n)}(\hat{\vartheta}^{(n)}_{\#})\big)'\,\Upsilon'\big(\hat{\vartheta}^{(n)}_{\#}\big)\,\Upsilon\big(\hat{\vartheta}^{(n)}_{\#}\big)\,S^{(n)}\big(\hat{\vartheta}^{(n)}_{\#}\big) \in \big(S^{(n)}(\hat{\vartheta}^{(n)}_{\lambda_\pm\#})\big)'\,\Upsilon'\big(\hat{\vartheta}^{(n)}_{\lambda_\pm\#}\big)\,\Upsilon\big(\hat{\vartheta}^{(n)}_{\#}\big)\,S^{(n)}\big(\hat{\vartheta}^{(n)}_{\#}\big) \pm \varepsilon\Big] \ge 1 - \eta/2.$$




It follows that for all $\eta > 0$, $\varepsilon > 0$ and $n \ge N_{\varepsilon,\delta}$, letting $\delta = \delta_\eta$,
$$P^{(n)}_{\vartheta;g}\big[A^{(n)}_{\varepsilon,\delta}\big] := P^{(n)}_{\vartheta;g}\Big[(1 - \mathcal{J}(g)\lambda^{(n)}_\pm)\,\big(S^{(n)}(\hat{\vartheta}^{(n)}_{\#})\big)'\,\Upsilon'\big(\hat{\vartheta}^{(n)}_{\#}\big)\,\Upsilon\big(\hat{\vartheta}^{(n)}_{\#}\big)\,S^{(n)}\big(\hat{\vartheta}^{(n)}_{\#}\big) \in \big(S^{(n)}(\hat{\vartheta}^{(n)}_{\lambda_\pm\#})\big)'\,\Upsilon'\big(\hat{\vartheta}^{(n)}_{\lambda_\pm\#}\big)\,\Upsilon\big(\hat{\vartheta}^{(n)}_{\#}\big)\,S^{(n)}\big(\hat{\vartheta}^{(n)}_{\#}\big) \pm \varepsilon \ \text{and}\ \big(S^{(n)}(\hat{\vartheta}^{(n)}_{\#})\big)'\,\Upsilon'\big(\hat{\vartheta}^{(n)}_{\#}\big)\,\Upsilon\big(\hat{\vartheta}^{(n)}_{\#}\big)\,S^{(n)}\big(\hat{\vartheta}^{(n)}_{\#}\big) \ge \delta\Big] \ge 1 - \eta.$$
Next, denote by $\hat{D}^{(n)}$, $D^{(n)}$ and $D^{(n)}_\pm$ the graphs of the mappings
$$\lambda \mapsto \big(S^{(n)}(\hat{\vartheta}^{(n)}_{\lambda^{(n)}_-\#})\big)'\,\Upsilon'\big(\hat{\vartheta}^{(n)}_{\lambda^{(n)}_-\#}\big)\,\Upsilon\big(\hat{\vartheta}^{(n)}_{\#}\big)\,S^{(n)}\big(\hat{\vartheta}^{(n)}_{\#}\big) - c(\lambda - \lambda^{(n)}_-)\Big[\big(S^{(n)}(\hat{\vartheta}^{(n)}_{\lambda^{(n)}_-\#})\big)'\,\Upsilon'\big(\hat{\vartheta}^{(n)}_{\lambda^{(n)}_-\#}\big) - \big(S^{(n)}(\hat{\vartheta}^{(n)}_{\lambda^{(n)}_+\#})\big)'\,\Upsilon'\big(\hat{\vartheta}^{(n)}_{\lambda^{(n)}_+\#}\big)\Big]\,\Upsilon\big(\hat{\vartheta}^{(n)}_{\#}\big)\,S^{(n)}\big(\hat{\vartheta}^{(n)}_{\#}\big),$$
$$\lambda \mapsto (1 - \mathcal{J}(g)\lambda)\,\big(S^{(n)}(\hat{\vartheta}^{(n)}_{\#})\big)'\,\Upsilon'\big(\hat{\vartheta}^{(n)}_{\#}\big)\,\Upsilon\big(\hat{\vartheta}^{(n)}_{\#}\big)\,S^{(n)}\big(\hat{\vartheta}^{(n)}_{\#}\big),$$
and
$$\lambda \mapsto (1 - \mathcal{J}(g)\lambda)\,\big(S^{(n)}(\hat{\vartheta}^{(n)}_{\#})\big)'\,\Upsilon'\big(\hat{\vartheta}^{(n)}_{\#}\big)\,\Upsilon\big(\hat{\vartheta}^{(n)}_{\#}\big)\,S^{(n)}\big(\hat{\vartheta}^{(n)}_{\#}\big) \pm \varepsilon,$$
respectively. These graphs take the form of four random straight lines, intersecting the horizontal axis at $\lambda^{(n)}$ (our estimator of $(\mathcal{J}^{(n)}(g))^{-1}$), $\lambda_0 := (\mathcal{J}(g))^{-1}$, $\lambda^+_0$ and $\lambda^-_0$, respectively. Since $D^{(n)}_\pm$ and $D^{(n)}$ are parallel, with a negative slope, we have that
$$\lambda^-_0 \le \lambda_0 \le \lambda^+_0.$$
Under $A^{(n)}_{\varepsilon,\delta}$, that common slope has absolute value at least $\mathcal{J}(g)\delta$, which implies that
$$\lambda^+_0 - \lambda^-_0 \le 2\varepsilon/(\mathcal{J}(g)\delta).$$
Still under $A^{(n)}_{\varepsilon,\delta}$, for λ values between $\lambda^{(n)}_-$ and $\lambda^{(n)}_+$, $\hat{D}^{(n)}$ is lying between $D^{(n)}_-$ and $D^{(n)}_+$, which entails
$$\lambda^-_0 \le \lambda^{(n)} \le \lambda^+_0.$$
Summing up, for all $\eta > 0$ and $\varepsilon > 0$, there exist $\delta = \delta_\eta > 0$ and $N = N_{\varepsilon\mathcal{J}(g)\delta/2,\,\delta}$ such that, for any $n \ge N$, with $P^{(n)}_{\vartheta;g}$ probability larger than $1 - \eta$,
$$|\lambda^{(n)} - \lambda_0| \le \lambda^+_0 - \lambda^-_0 \le \varepsilon.$$



Acknowledgement. We gratefully acknowledge the insightful comments by two anonymous referees.


IMS Collections
Nonparametrics and Robustness in Modern Statistical Inference and Time Series Analysis: A Festschrift in honor of Professor Jana Jurečková
Vol. 7 (2010) 46–61
© Institute of Mathematical Statistics, 2010
DOI: 10.1214/10-IMSCOLL705

Estimation of irregular probability densities∗

Lieven Desmet1,†, Irène Gijbels2,† and Alexandre Lambert3

Katholieke Universiteit Leuven and Université catholique de Louvain

Abstract: This paper deals with nonparametric estimation of an unknown density function which possibly is discontinuous or non-differentiable in an unknown finite number of points. Estimation of such irregular densities is accomplished by viewing the problem as a regression problem and applying recent techniques for estimation of irregular regression curves. Moreover, the method can deal with estimation of densities that have an irregularity at the endpoint(s) of their support. A simulation study compares the performance of the proposed method with those of other methods available in the literature. A further illustration on real data is provided.

∗ This research was supported by the IAP research network P6/03, Federal Science Policy, Belgium.
† The first and second author gratefully acknowledge financial support from the GOA/07/04-project of the Research Fund KULeuven.
1 This work was part of the doctoral research of the first author carried out at the Katholieke Universiteit Leuven; e-mail: [email protected]
2 Katholieke Universiteit Leuven, Department of Mathematics and Leuven Statistics Research Center (LStat), Box 2400, Celestijnenlaan 200B, B-3001 Leuven (Heverlee), Belgium; e-mail: [email protected]
3 This work was initiated during the doctoral research of the third author carried out while at the Institut de Statistique, Université catholique de Louvain; e-mail: [email protected]
AMS 2000 subject classifications: Primary 62G07; secondary 62G08.
Keywords and phrases: density estimation, irregularities, local linear fitting, variance stabilization.

1. Introduction

Consider a random variable X with unknown density function $f_X$. Based on an i.i.d. sample $X_1, X_2, \dots, X_n$ from X, a well-known nonparametric estimator for $f_X$ is the kernel density estimator
(1)
$$\hat{f}_n(x) = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{h}\, K\Big(\frac{x - X_i}{h}\Big),$$
with K a kernel function and h > 0 a bandwidth parameter. When $f_X(\cdot)$ is continuous at x, then $\hat{f}_n(x)$ is a consistent estimator of $f_X(x)$. By contrast, in points of discontinuity the estimate will typically smooth out the discontinuous behaviour and will not be consistent (see e. g. [20] and [27]). A particular example here is the case of a density with support $[0, +\infty[$ (for example an exponential density) which is discontinuous at the endpoint 0 of its support. See for example [11]. Several approaches for obtaining consistent estimates of densities at such discontinuous endpoints or boundary points have been proposed in the literature: a reflection method of [25], transformation methods as in [21] and kernel methods with specially adapted kernels for the boundary points, as in [17]. There is also a vast literature on detection of locations of discontinuity points in density or regression functions (see e. g. [6], [28], [12], among others, and references therein).

An important issue in kernel density estimation is the choice of the bandwidth. Global and local bandwidth selection procedures have been studied. See [27] and references therein. Papers on local bandwidth selection in kernel density estimation include [23], [24] and [18], among others. See [5] for a comparative study on bandwidth selectors.

In this paper we consider the more general problem of estimating $f_X$ when this function possibly exhibits discontinuities, in the function itself or in its derivative, at certain (unknown) locations in the interior or at the boundary of its support. If the density is continuous but not differentiable at a point x, then the estimate (1) will be consistent, but the rate of convergence is slower than at points of continuity. To deal with estimation of densities that possibly show irregularities of the jump type (i. e. a discontinuity in the function itself) or of the peak type (i. e. a discontinuity in the derivative), we first view the density estimation problem as a regression problem and then apply the technique developed by [10] for regression functions with jump and/or peak irregularities to the resulting regression problem. Of importance is to link the density estimation problem with the regression problem to see how properties of the regression estimation context lead to properties of the resulting density estimator. Viewing density estimation as a regression problem is not new, and has been used in for example [7] and [19] for, respectively, estimation of densities at boundaries and densities at points of discontinuity. The contribution of this paper consists of dealing with estimation of irregular densities showing jump or peak irregularities at unknown locations. The proposed method also leads to consistent estimation at (discontinuous) boundary points. The method relies on local linear fits. The merits of techniques based on local linear fitting for estimating regression curves and surfaces with irregularities have been largely proven in [13], [14], [11], [9] and [8].

The paper is organized as follows. In Section 2 we recall how binning of the data leads to a regression problem, and we briefly discuss important properties of this regression problem. Section 3 provides insights into how irregularities in the density $f_X$ have an impact on the regression problem. The proposed estimation procedure is discussed in Section 4. The finite sample performance of the method is investigated via a simulation study in Section 5, which also includes comparisons with existing methods, and a real data example.

2. Density estimation formulated in a regression context

2.1. Data binning

Define an interval [a, b] such that essentially no data points $X_i$ fall outside it. Partition the interval [a, b] into N subintervals $\{I_k;\ k = 1, \dots, N\}$ of equal length $(b-a)/N$. More precisely, let $I_k = \big[a + (k-1)\frac{b-a}{N},\ a + k\frac{b-a}{N}\big[$, for $k = 1, \dots, N-1$, and let the last bin be $I_N = \big[a + \frac{N-1}{N}(b-a),\ b\big]$. Denote by $C_k$ the number of observations in the bin $I_k$, $k = 1, \dots, N$. The bin counts $(C_1, \dots, C_N)$ behave like a multinomial distribution with n trials and probabilities $(\beta_1/N, \dots, \beta_N/N)$, where
$$\beta_k := N \int_{a + (b-a)\frac{k-1}{N}}^{a + (b-a)\frac{k}{N}} f_X(x)\, dx, \qquad k = 1, \dots, N.$$
Denote by $x_k = a + \frac{b-a}{N}\big(k - \frac{1}{2}\big)$ the center of the bin $I_k$, $k = 1, \dots, N$.
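As an illustration (ours, not part of the paper), the binning construction translates directly into a few lines of Python; the interval, the number of bins and the variable names below are arbitrary choices:

```python
import numpy as np

def bin_data(x, a, b, N):
    """Bin a sample into N equal-length bins on [a, b] and return the bin
    centers x_k, the bin counts C_k, and the regression responses
    C_k / ((b - a) m), whose mean approximates f_X(x_k)."""
    edges = np.linspace(a, b, N + 1)
    counts, _ = np.histogram(x, bins=edges)          # bin counts C_k
    centers = edges[:-1] + (b - a) / (2 * N)         # bin midpoints x_k
    m = x.size / N                                   # expected points per bin
    responses = counts / ((b - a) * m)               # approx f_X(x_k) on average
    return centers, counts, responses

# Example: standard exponential sample (Model (d) of Section 5)
rng = np.random.default_rng(0)
xk, Ck, Yk = bin_data(rng.exponential(size=1024), a=-1.0, b=8.0, N=128)
```

The responses $C_k/((b-a)m)$ anticipate the regression formulation derived next.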


Then, asymptotically, for N = N(n) tending to infinity with n, we have that $\beta_k \approx (b-a) f_X(x_k)$. Since the counts $C_k \sim \text{Binomial}(n, \beta_k/N)$, it holds that $E(C_k) = n\beta_k/N = m\beta_k$, with $m = n/N$, and $\text{Var}(C_k) = n\,\frac{\beta_k}{N}\big(1 - \frac{\beta_k}{N}\big) = m\beta_k\big(1 - \frac{\beta_k}{N}\big)$, and hence asymptotically, as N tends to infinity,
$$E\{C_k/((b-a)m)\} \approx f_X(x_k) \quad\text{and}\quad \text{Var}\{C_k/((b-a)m)\} \approx f_X(x_k)/((b-a)m).$$
Estimating $f_X(x)$ can thus be viewed as a heteroscedastic nonparametric regression problem where the regression curve (the mean regression function) is $f_X(x)$ and the conditional variance function is $\sigma^2(x) \approx f_X(x)/m$, with the data set $\{(x_k, C_k/((b-a)m)),\ k = 1, \dots, N\}$ as the sample. We will assume that $m \to \infty$ as $n \to \infty$, meaning that the number of data per bin also increases as the total number of data increases.

For future developments it is convenient to treat the bin counts as Poisson variables. Indeed, the variables $C_k \sim \text{Binomial}(n, \beta_k/N)$ behave asymptotically like Poisson variables with parameter $m\beta_k$ (recall, as $n \to \infty$, we have that $N \to \infty$).

A widely used approach to diminish heteroscedasticity is to apply a variance-stabilizing transformation to the bin counts, which in some sense normalizes their variance to a constant value. Strictly speaking, the local linear fitting procedure does not require the conditional variance to be constant, but its consistency properties are established under continuity. This is not guaranteed when starting from densities with jumps, as these will show up in the conditional variance. Due to the variance stabilizing transformation we need, however, not worry about this. See Section 3.

2.2. Variance stabilizing transformations

It was suggested already by [2] that the square root of a Poisson variable (say $X \sim \text{Poisson}(\lambda)$ with $\lambda > 0$) has a distribution that is closer to the normal distribution than the original variable. The variance is approximately 1/4 when λ is large. This idea was further explored in [1], in particular by considering transformations of the type $\sqrt{X + c}$ with $c \ge 0$.

The behaviour of the expectation and the variance of the transformed Poisson random variable $\sqrt{X + c}$, for $\lambda \to \infty$, can be obtained via Taylor expansion. The following result can be found in for example [4].

Lemma 1. Assume $X \sim \text{Poisson}(\lambda)$ and $c \ge 0$ is a constant. Then it holds:
$$E\big(\sqrt{X + c}\big) = \lambda^{1/2} + \frac{4c - 1}{8}\,\lambda^{-1/2} - \frac{16c^2 - 24c + 7}{128}\,\lambda^{-3/2} + O(\lambda^{-5/2}),$$
$$\text{Var}\big(\sqrt{X + c}\big) = \frac{1}{4} + \frac{3 - 8c}{32}\,\lambda^{-1} + \frac{32c^2 - 52c + 17}{128}\,\lambda^{-2} + O(\lambda^{-3}).$$
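A quick simulation (ours, not from the paper) illustrates Lemma 1 for the choice c = 1/4 adopted below: the variance of the root-transformed counts settles near 1/4 while the raw variance grows with λ:

```python
import numpy as np

rng = np.random.default_rng(1)
for lam in [2, 8, 32, 128]:
    x = rng.poisson(lam, size=200_000)
    v_raw = x.var()                      # grows like lambda
    v_root = np.sqrt(x + 0.25).var()     # should approach 1/4 (Lemma 1, c = 1/4)
    print(f"lambda={lam:4d}  Var(X)={v_raw:8.2f}  Var(sqrt(X+1/4))={v_root:.4f}")
```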

In [1] it was proposed to take c = 3/8 in order to get a constant variance and nearly constant bias, but [4] argue that the choice c = 1/4 is better for minimizing the first order bias $E(\sqrt{X + c}) - \sqrt{\lambda}$ while still stabilizing the variance equally well (for λ large enough). In this paper we opt for the choice c = 1/4.

2.3. Asymptotic properties of the transformed bin counts

In [4] the behaviour of the transformed bin counts as stochastic variables was studied in detail. That paper establishes an explicit decomposition of the transformed bin counts in a deterministic term directly related to the (square root of the) density in the corresponding grid points, a deterministic o(1) term and a stochastically small


random variable. This result extends Lemma 1 and applies it to the binning case where the bin counts $C_k$ are assumed Poisson variables with parameter $m\beta_k$.

Proposition 1. With notations as before, $Y_k = \sqrt{C_k + \frac{1}{4}}$, we have
(2)
$$Y_k = \sqrt{m\beta_k} + \varepsilon_k + \frac{1}{2}\, Z_k + \xi_k, \qquad k = 1, 2, \dots, N,$$
where the $Z_k$ are i.i.d. N(0, 1) variables, the $\varepsilon_k$ are constants that are $O\big((m\beta_k)^{-3/2}\big)$, the quantity $\sum_{k=1}^{N} \varepsilon_k^2$ is O(1), and the $\xi_k$ are independent and stochastically small variables. More precisely, we have: $E|\xi_k|^{\ell} \le c_\ell\,(m\beta_k)^{-\ell/2}$ and $P(|\xi_k| > \alpha) \le (\alpha^2 m\beta_k)^{-\ell/2}$, where $\ell > 0$, $\alpha > 0$ and $c_\ell > 0$ is a constant (depending on ℓ only).

The authors in [4] rely on this regression model to estimate $f_X(\cdot)$ using wavelet block thresholding techniques. The simulation study in Section 5 includes a comparison with this method. The above result is of course an asymptotic result requiring that $m\beta_k \to \infty$. Note that then the $\varepsilon_k$ are o(1) quantities and the $\xi_k$ are $o_P(1)$. It is thus important that $\beta_k > 0$ while $m \to \infty$. In other words, the result is not applicable for $\beta_k = 0$, as the parameter of a Poisson variable cannot be 0. Consequently, the finite sample behaviour of any estimate using this model could be bad in regions where the true density is zero or close to zero. Therefore, we need to assume, on the domain under consideration, that $\inf f_X(x) > 0$.

3. Variance stabilization and irregularities

We now turn to the situation that the unknown density $f_X$ is continuous and twice differentiable except at a finite (unknown) number of points in which the density function itself or its derivative is discontinuous. A point s is called a jump irregularity when $f_X(s+) = f_X(s-) + d$ with $f_X(s-) > 0$, $f_X(s+) > 0$, and $d \ne 0$. A point s is called a peak irregularity when $f'_X(s+) = f'_X(s-) + d^*$, with $d^* \ne 0$ and $f_X(s+) = f_X(s-) > 0$, where $f'_X$ denotes the first derivative of $f_X$. We assume that the second order derivatives of $f_X$ at all regular points (i. e. points at which $f_X$ is continuous and twice differentiable) are uniformly bounded.

We now investigate the impact of such irregularities on the regression problem related to the transformed counts. The following result shows how the asymptotic variance changes with the grid point $x_k$. It is an immediate consequence of Lemma 1.

Corollary 1. Let $C_k$ be the bin counts and suppose that $x_k$ and $x_{k+1}$ are in the interior of the support of $f_X$. Then the asymptotic difference in variance over these neighbouring grid points behaves like:
$$\Delta\text{Var}_k := \text{Var}\Big(\sqrt{C_k + \tfrac{1}{4}}\Big) - \text{Var}\Big(\sqrt{C_{k+1} + \tfrac{1}{4}}\Big) = \frac{1}{m}\,\frac{3 - 8c}{32}\Big(\frac{1}{\beta_k} - \frac{1}{\beta_{k+1}}\Big) + o(1/m).$$
Proof. Apply Lemma 1 to the variables $C_k$ that are distributed as Poisson variables with parameter $m\beta_k$. Then we have $\text{Var}(\sqrt{C_k + c}) = \frac{1}{4} + \frac{3 - 8c}{32\, m\beta_k} + o(1/m)$. The result follows by rewriting this equation in the neighbouring point with index k + 1 and taking the difference. □

From the result in Corollary 1 we get insight into the effect of the variance stabilisation on the behaviour of the conditional variance function in the regression


problem, and more particularly into how this variance changes with the x-coordinate. We first study $\Delta\text{Var}_k$, with $x_k$ and $x_{k+1}$ interior points of the support of $f_X$, for different situations, namely that the interval $]x_k, x_{k+1}[$: (S1) does not contain any irregularity point; (S2) contains a jump irregularity point s; and (S3) contains a peak irregularity point s. The findings can be summarized as follows:

(S1). We have that $|f_X(x_k) - f_X(x_{k+1})| = O(1/N)$, and since asymptotically $\beta_k \to (b-a) f_X(x_k)$, we have that $\big(\frac{1}{\beta_k} - \frac{1}{\beta_{k+1}}\big) \to 0$ as well as $1/m \to 0$. Therefore $\Delta\text{Var}_k$ vanishes asymptotically, or in other words the variance in smooth regions of the density $f_X$ tends to behave like a constant.

(S2). In this situation we have $f_X(x_k) = f_X(s-) + O(1/N)$ and $f_X(x_{k+1}) = f_X(s-) + d + O(1/N)$, and a first order approximation of the $\big(\frac{1}{\beta_k} - \frac{1}{\beta_{k+1}}\big)$ term is given by $d\,\big[(b-a)\, f_X(s-)(f_X(s-) + d)\big]^{-1}$. However, since $1/m \to 0$, the quantity $\Delta\text{Var}_k$ will converge to zero, although more slowly than in situations (S1) and (S3).

(S3). In this case, an analysis similar to the one in (S1) applies, and the difference in variance $\Delta\text{Var}_k$ vanishes asymptotically.

The case when the unknown density shows a jump discontinuity at an endpoint of its support is discussed in Section 4.2.

4. Proposed estimation procedure

4.1. Jumps and peaks preserving fit

In [10] a nonparametric method for estimating regression curves with jump and/or peak irregularities using local linear fitting was proposed. The aim is to apply this method to the regression model obtained from the binned and transformed data. The requirement of homoscedastic errors in [10] can be relaxed, since it is sufficient to have a continuous (locally constant) conditional variance for the method to work. From the result in (2) and the discussion in Section 3 we know that the regression model for $(x_k, Y_k)$ has a less heteroscedastic conditional variance, and the effect of irregularities in the interior of the support vanishes asymptotically. We need to assume that $m = n/N \to \infty$; thus the number of observations per bin grows as the number of observations grows.

From the transformed bin counts $Y_k = \sqrt{C_k + \frac{1}{4}}$ we can effectively estimate

the function $g(\cdot)$ that relates to the original density $f_X(\cdot)$ as follows: $g(x) \approx \sqrt{m(b-a) f_X(x) + \frac{1}{4}}$. Once an estimate for $g(\cdot)$ is obtained, we recover an estimate for $f_X(\cdot)$ by applying an inverse transformation. In summary, the estimation procedure reads as follows (a schematic implementation is sketched at the end of Section 4):

• Step 1. Binning step: set up the grid of N equal-length intervals and calculate the bin counts $C_k$, $k = 1, \dots, N$.
• Step 2. Root transform: put $Y_k = \sqrt{C_k + \frac{1}{4}}$ and treat $(x_k, Y_k)$, $k = 1, \dots, N$, as the new equispaced sample for a nonparametric regression problem.
• Step 3. Apply the jump and peak preserving local linear fit of [10] to obtain an estimate $\hat{g}(\cdot)$ of $g(\cdot)$.
• Step 4. Perform an inverse transformation and renormalization
(3)
$$\hat{f}_X(\cdot) = S\,\Big(\hat{g}^2(\cdot) - \frac{1}{4}\Big)_+,$$


where $z_+ = \max(z, 0)$ and S is a normalization constant.

The jump and peak preserving local linear fitting method of [10] consists of fitting three local linear models, using observations in a centered, a right and a left neighbourhood of the point. In the presence of a jump or peak irregularity, one of the three fits will outperform the other two, and this fit is selected in a data driven way using an appropriate diagnostic quantity. We now provide details of this estimation algorithm in Step 3.

Let $K_c$ be a bounded symmetric kernel density function supported on the interval [−1/2, 1/2], and let h > 0 be the bandwidth parameter. The (conventional) local linear estimate for g(x) is obtained by weighted least-squares minimization:
(4)
$$(\hat{a}_{c,0}(x), \hat{a}_{c,1}(x)) = \arg\min_{a_0, a_1} \sum_{k=1}^{N} \big[Y_k - a_0 - a_1(x_k - x)\big]^2\, K_c\Big(\frac{x_k - x}{h}\Big).$$
Starting from this conventional kernel $K_c$, one then considers one-sided versions $K_\ell(x) = K_c(x)\, I\{x \in [-1/2, 0[\}$ and $K_r(x) = K_c(x)\, I\{x \in [0, 1/2]\}$, which via a weighted least-squares minimization as in (4), but with $K = K_\ell$, respectively $K = K_r$, lead to the left local linear estimate, respectively the right local linear estimate, denoted by $(\hat{a}_{j,0}(x), \hat{a}_{j,1}(x))$ with $j = \ell, r$ respectively. Consider the Residual Sum of Squares (RSS) of the three fits, defined as:
(5)
$$\text{RSS}_j(x) = \sum_{k=1}^{N} \big[Y_k - \hat{a}_{j,0}(x) - \hat{a}_{j,1}(x)(x_k - x)\big]^2\, K_j\Big(\frac{x_k - x}{h}\Big), \qquad j = c, \ell, r.$$
Then an important diagnostic quantity is
(6)
$$\text{diff}(x) = \max\Big\{\frac{\text{RSS}_c(x)}{w_c(x)} - \frac{\text{RSS}_\ell(x)}{w_\ell(x)},\ \frac{\text{RSS}_c(x)}{w_c(x)} - \frac{\text{RSS}_r(x)}{w_r(x)}\Big\},$$
where $w_j(x) = \sum_{k=1}^{N} K_j\big(\frac{x_k - x}{h}\big)$, for $j = c, \ell, r$. The peak and jump preserving local linear regression estimator is then given by

(7)
$$\hat{g}(x) = \begin{cases} \hat{a}_{c,0}(x) & \text{if } \text{diff}(x) < u, \\ \hat{a}_{r,0}(x) & \text{if } \text{diff}(x) \ge u \text{ and } \frac{\text{RSS}_r(x)}{w_r(x)} < \frac{\text{RSS}_\ell(x)}{w_\ell(x)}, \\ \hat{a}_{\ell,0}(x) & \text{if } \text{diff}(x) \ge u \text{ and } \frac{\text{RSS}_r(x)}{w_r(x)} > \frac{\text{RSS}_\ell(x)}{w_\ell(x)}, \\ \big(\hat{a}_{\ell,0}(x) + \hat{a}_{r,0}(x)\big)/2 & \text{if } \text{diff}(x) \ge u \text{ and } \frac{\text{RSS}_r(x)}{w_r(x)} = \frac{\text{RSS}_\ell(x)}{w_\ell(x)}, \end{cases}$$

where u > 0 is a suitably chosen threshold value. Together with good choices of the parameters h and u involved, this leads to the following practical estimation algorithm.

Consider a grid of bandwidths $h_{\text{grid}} := (h_1, \dots, h_M)$. Iterate over these bandwidths and put $h := h_q$, $q = 1, \dots, M$. For this bandwidth:
• Calculate estimates $\hat{a}_{j,0}(x)$ and $\hat{a}_{j,1}(x)$ for $j = \ell, r, c$.
• Obtain $d := \sup_x |\hat{a}_{r,0}(x) - \hat{a}_{\ell,0}(x)|$ and $d^* := \sup_x |\hat{a}_{r,1}(x) - \hat{a}_{\ell,1}(x)|$.
• Put $u_{\max} := \frac{1}{2}\Big(d^2\, \frac{C_0^c(0)}{v_{0,c}} + d^{*2}\, \frac{C_2^c(0)}{v_{0,c}}\, h^2\Big)$, with $v_{0,c} = \int_{-1/2}^{1/2} K_c^2(t)\, dt$ and with $C_0^c(0)$ and $C_2^c(0)$ constants that only depend on K (see [10] for details).
• Put $u_{\text{grid}} := (0.001\, u_{\max},\ 0.01\, u_{\max},\ 0.1\, u_{\max},\ u_{\max})$. Now iterate over the threshold values and put $u := u_p$, $p = 1, \dots, 4$.
• For the combination of h and u values at hand, calculate $\hat{g}^*_{-k}(x_k)$ as in (7), but leaving out the k-th observation itself.
• Calculate $\sum_{k=1}^{N} \big[Y_k - \hat{g}^*_{-k}(x_k)\big]^2$.
• Retain the value of u that yields the minimum for the sum in the former step and associate it with $h_q$ by putting it $\hat{u}_q$.

Repeat the above procedure for each bandwidth and look for the bandwidth $h_q$ (and associated threshold $\hat{u}_q$) that yields the lowest value for the sum.

q ) that yields the lowest value for the sum. Calculate the final estimate with (7) from the couple (h, u) obtained as above. For a detailed study of this jump and peak preserving estimator, in a general regression context, see [10]. From this and previous studies we need to impose conditions on how the bandwidth decreases as N → ∞. More precisely, we need to impose that h ∼ (log N )2/5 N −1/5 , which can be translated to a condition on n depending on the relation between N and n. From the discussion in Section 3 it is already clear that the above estimation procedure can deal with estimation of irregular densities at the interior of their support. We now show that the method can also handle a non-smooth behaviour of the density at an unknown boundary. 4.2. Densities with discontinuity at the boundary As mentioned before a boundary point can be seen as a potential jump in the regression function to be estimated with the jump and peak preserving local linear fit of Section 4.1. In practice, we take a large enough binning interval (extending to the left of the smallest and to the right of the largest observation) and consider the unknown density as a function defined on this whole interval (coinciding with the density on its support and with value zero outside of the support). Let s be a boundary point of the support of fX , and suppose that fX (·) is discontinuous in s, i. e. fX (s−) = 0, and fX (s+) = dB > 0, and we have uniformly bounded derivatives up to the second order outside of s. Then to the left of s the bin counts have variance zero (since they remain zero themselves) and to the right of s we see the variance converging to 1/4. Therefore, asymptotically, the jump discontinuity in the variance cannot be resolved by a variance stabilizing transformation. The proposed method however can deal with this situation in an automatic way. The jump and peak preserving estimator from Section 4.1 will select the suitable one-sided local linear fit in the neighbourhood of the boundary, and hence will estimate the jump correctly. The argumentation for this is in two steps: first we analyse this problem in the regression context in Lemma 2 and then we apply this to the density estimation setting. Lemma 2 Consider a regression model Yi = m(xi ) + εi where m(·) is an unknown function such that m(x) = 0 for x < s and m(s+) = d > 0 (and m has continuous second order derivatives outside of s), the errors have constant variance σ 2 for


$x_i > s$ (and are 0 for $x_i < s$), with $E\varepsilon^4 < \infty$. Assume the kernel K is uniform Lipschitz continuous and $h \to 0$, $\frac{nh}{\log n} \to \infty$ as $n \to \infty$. Then asymptotically, we have the following behaviour of the residual sum of squares quantities, in points $x = s + \tau h$ near the jump point s.

For $-1/2 < \tau \le 0$:
$$\frac{\text{RSS}_c(x)}{w_c(x)} = d^2\, \frac{C_0^c(\tau)}{v_{0,c}} + \frac{v_{0,c}^{\tau,+}}{v_{0,c}}\, \sigma^2 + o(1) \ \text{a.s.}, \qquad \frac{\text{RSS}_r(x)}{w_r(x)} = d^2\, \frac{C_0^r(\tau)}{v_{0,r}} + \frac{v_{0,r}^{\tau,+}}{v_{0,r}}\, \sigma^2 + o(1) \ \text{a.s.}, \qquad \frac{\text{RSS}_\ell(x)}{w_\ell(x)} = o(1) \ \text{a.s.}$$
For $0 < \tau < 1/2$:
$$\frac{\text{RSS}_c(x)}{w_c(x)} = d^2\, \frac{C_0^c(\tau)}{v_{0,c}} + \frac{v_{0,c}^{\tau,+}}{v_{0,c}}\, \sigma^2 + o(1) \ \text{a.s.}, \qquad \frac{\text{RSS}_r(x)}{w_r(x)} = \sigma^2 + o(1) \ \text{a.s.}, \qquad \frac{\text{RSS}_\ell(x)}{w_\ell(x)} = d^2\, \frac{C_0^\ell(\tau)}{v_{0,\ell}} + \frac{v_{0,\ell}^{\tau,+}}{v_{0,\ell}}\, \sigma^2 + o(1) \ \text{a.s.},$$
where the asymptotic remainder terms are uniform in x, and with $v_{0,j}^{\tau,+} := \int_{-\tau}^{1/2} K_j^2(t)\, dt$ for $j = \ell, r, c$.

The proof of Lemma 2 is omitted here, and can be found in [8]. We cannot immediately apply this result to our density estimation setting, where the responses $Y_k$ are obtained from transformed bin counts. However, asymptotically we do have conditions as in the lemma: for $x_k < s$ we have $C_k = 0$, $Y_k = 0.5$ and $\text{Var}\,Y_k = 0$, whereas for $x_k > s$, asymptotically $Y_k = \sqrt{m\beta_k} + \frac{1}{2} Z_k + o_P(1)$, with $Z_k$ standard normal variables as in (2). The jump d and the quantity $\sigma^2$ in the lemma then correspond to $(\sqrt{m(b-a)d_B} - 0.5)$, respectively 1/4, in our setting. Asymptotically, as $m \to \infty$, the contribution of the jump increases unboundedly. Therefore, considering (7) and the definition of diff(x) in (6), we have for $-1/2 < \tau \le 0$ that $\text{diff}(x) = \frac{\text{RSS}_c(x)}{w_c(x)} - \frac{\text{RSS}_\ell(x)}{w_\ell(x)}$, which increases asymptotically above the threshold values, and clearly $\frac{\text{RSS}_r(x)}{w_r(x)} > \frac{\text{RSS}_\ell(x)}{w_\ell(x)}$, so the left estimate will be selected. Now for $0 < \tau < 1/2$ we will see $\text{diff}(x) = \frac{\text{RSS}_c(x)}{w_c(x)} - \frac{\text{RSS}_r(x)}{w_r(x)}$ increase above the threshold, and since $\frac{\text{RSS}_\ell(x)}{w_\ell(x)} > \frac{\text{RSS}_r(x)}{w_r(x)}$, the right estimate will be selected.
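Before turning to the numerical study, the following sketch (ours, with a uniform kernel and fixed (h, u) instead of the data-driven selection of Section 4.1, and with illustrative names throughout) makes Steps 1–4 and the selection rule (7) concrete:

```python
import numpy as np

def local_fit(xg, y, x0, h, side):
    """Weighted local linear fit at x0 with a uniform kernel on [-1/2, 1/2];
    side selects the centered ('c'), left ('l') or right ('r') half-window."""
    t = (xg - x0) / h
    if side == 'c':
        w = (np.abs(t) <= 0.5).astype(float)
    elif side == 'l':
        w = ((t >= -0.5) & (t < 0.0)).astype(float)
    else:
        w = ((t >= 0.0) & (t <= 0.5)).astype(float)
    if w.sum() < 2:                                 # empty window: unusable fit
        return 0.0, np.inf, 1.0
    X = np.column_stack([np.ones_like(xg), xg - x0])
    WX = X * w[:, None]
    beta = np.linalg.solve(WX.T @ X, WX.T @ y)      # weighted normal equations
    rss = np.sum(w * (y - X @ beta) ** 2)
    return beta[0], rss, w.sum()                    # a_{j,0}(x0), RSS_j, w_j

def irregular_density(sample, a, b, N, h, u):
    """Steps 1-4 of Section 4.1 with fixed (h, u); returns grid and density."""
    edges = np.linspace(a, b, N + 1)
    Ck, _ = np.histogram(sample, bins=edges)        # Step 1: bin counts
    xg = edges[:-1] + (b - a) / (2 * N)
    Yk = np.sqrt(Ck + 0.25)                         # Step 2: root transform
    g = np.empty(N)
    for i, x0 in enumerate(xg):                     # Step 3: selection rule (7)
        ac, rc, wc = local_fit(xg, Yk, x0, h, 'c')
        al, rl, wl = local_fit(xg, Yk, x0, h, 'l')
        ar, rr, wr = local_fit(xg, Yk, x0, h, 'r')
        if max(rc / wc - rl / wl, rc / wc - rr / wr) < u:
            g[i] = ac
        elif rr / wr < rl / wl:
            g[i] = ar
        elif rr / wr > rl / wl:
            g[i] = al
        else:
            g[i] = 0.5 * (al + ar)
    f = np.maximum(g ** 2 - 0.25, 0.0)              # Step 4: unroot, truncate
    return xg, f / (f.sum() * (b - a) / N)          # renormalize to integrate to 1

rng = np.random.default_rng(0)
xg, fhat = irregular_density(rng.exponential(size=2048), -1.0, 8.0, 256, h=0.4, u=0.05)
```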

5. Numerical analysis

5.1. Simulation study

The proposed estimation method is applied to five test densities with jump and/or peak irregularities in the interior or with a discontinuous boundary.

Model (a) is a discontinuous density defined from two different exponential densities:
$$f_X(x) = 0.5 \exp(x)\, I\{x < 0\} + 5 \exp(-10x)\, I\{x \ge 0\}.$$
Model (b) is a discontinuous density which is a mixture of two different normal densities and was considered in [15]:
$$f_X(x) = 0.5\, f_{N(0, (\frac{10}{3})^2)}(x)\, I\{x < 0\} + 0.5\, f_{N(0, (\frac{2}{3})^2)}(x)\, I\{x \ge 0\}.$$
Model (c) is the claw density defined in [22] (their model #10). It can be seen as a convex combination of normal densities:
$$f_X(x) = \frac{1}{2}\, f_{N(0,1)}(x) + \frac{1}{10}\Big(f_{N(-1, (\frac{1}{10})^2)}(x) + f_{N(-\frac{1}{2}, (\frac{1}{10})^2)}(x) + f_{N(0, (\frac{1}{10})^2)}(x) + f_{N(\frac{1}{2}, (\frac{1}{10})^2)}(x) + f_{N(1, (\frac{1}{10})^2)}(x)\Big).$$
Strictly speaking this is a smooth model, but it is challenging. Model (d) is the standard exponential density (so with a discontinuity at the boundary). Model (e) is a density with a discontinuity in the first derivative:
$$f_X(x) = 5 \exp(-|10x|).$$
All these models have unbounded support (on at least one side), and are shown in Figure 1.

Fig. 1. The five test models.

An illustration of the effect of the variance stabilization is provided in Figure 2. One hundred samples of size n = 16384 are generated from each model. Each sample is binned into N = 256 bins. In each gridpoint we thus have a sample of bin counts and transformed bin counts of size 100 (from the 100 repetitions), from which the sample standard deviations are then calculated.

Fig. 2. Variance stabilization in each model. Black circles indicate standard deviations based on bin counts $C_k$; grey crosses show standard deviations based on transformed bin counts $\sqrt{C_k + \frac{1}{4}}$, for a large value m = 64.

As can be seen from Figure 2 the original bin counts are strongly heteroscedastic and the standard deviations follow the shape of the density itself, as explained in Section 2.1. This greatly improves when taking a transformation: peaks get largely


suppressed and discontinuities also diminish. However, in those regions where the density was already small (near zero), we still have small values after transformation, and hence the standard deviation is still far from the theoretical value. In addition, a discontinuity at the boundary still gives rise to a discontinuity of magnitude 0.5 in the standard deviation of the transformed values (see Model (d)). However, this does not cause any problem, as explained in Section 4.2.

In this simulation study we include a comparison with a variety of other methods, such as standard kernel density estimation methods (see the estimator in (1)) with different bandwidth selection strategies, as well as methods developed for densities with irregularities, such as wavelet thresholding and a histogram method combined with a suitable selection of the number of bins. An overview of the considered estimators and their short notation is given in Table 1.

Table 1. Overview of estimators

Name | Method | Input data | Main smoothing parameter
$\hat{f}_1$ | proposed estimator | binned, transformed | global bandwidth (cross-validation)
$\hat{f}_2$ | kernel | raw data | global bandwidth: Sheather–Jones solve-the-equation (ste)
$\hat{f}_3$ | kernel | raw data | global bandwidth: Sheather–Jones direct plug-in (dpi)
$\hat{f}_4$ | conventional local linear | binned, transformed | local bandwidth
$\hat{f}_5$ | wavelet | raw data | thresholding
$\hat{f}_6$ | wavelet | binned, transformed | blocked thresholding
$\hat{f}_7$ | histogram | raw data | number of bins: penalized max. likelihood (Hellinger distance)
$\hat{f}_8$ | histogram | raw data | number of bins: penalized max. likelihood ($L_2$ distance)

56

L. Desmet, I. Gijbels, and A. Lambert

were also done for 10 and 100 times this value. The reported results are for λ∗ = 10 × 4.50524. • Methods f7 and f8 are histogram methods developed by [3] where the number of bins is selected by maximization of a maximum likelihood criterium (respectively based on Hellinger or L2 distance) over a grid of values namely from 10 to 100 (steps of 2) or from 100 to 800 (steps of 10). In the simulation study hundred replications were performed and in one replication a sample of size n was generated from the given distribution. For the methods that are based on regression, data were binned over a number N of bins: for samples sizes n 2048, 1024 and 512, the number of bins N are respectively 512, 256 and 128. We now summarize the simulation results. For saving space we only present plots for Model (a) and sample size n = 1024. These pictures provide information on the performance of each method, including its variability. For each method we present pointwise 10% and 90% quantiles and median values calculated from the 100 estimation values. For increasing the visibility at the peak irregularity at the point zero, we add short horizontal segments at that location.

−5

−4

−3

−2

−1

0

1

5 0

1

2

3

4

5 4 3 2 1 0

0

1

2

3

4

5

6

f3

6

f2

6

f1

−5

−4

−3

−2

−1

0

1

−5

−4

−3

−2

−1

0

1

Fig. 3. 10% and 90% percentiles (dotted lines), median (black solid line) and true model (thick grey line). Left panel: f1 , middle panel: f2 and right panel: f3 .

Figure 3 presents the results for the proposed jump and peak preserving local linear method ($\hat{f}_1$) and for the global bandwidth kernel methods ($\hat{f}_2$ and $\hat{f}_3$). From this figure it can be seen that $\hat{f}_1$ shows reasonably low bias and low variance (except near the irregularity, where the gap between quantiles is larger). The estimates $\hat{f}_2$, $\hat{f}_3$ have higher variance in the smooth regions and both underestimate the irregularity (unlike for $\hat{f}_1$, the true model value falls outside the 10% to 90% quantile interval). In general we noticed that the cross-validation procedure selects significantly larger bandwidths than the Sheather–Jones bandwidth selectors. However, bias is still reduced thanks to one-sided estimation in the jump and peak preserving procedure. Outside of the irregularities, variance is kept low thanks to the larger bandwidth. Using a local bandwidth parameter (estimate $\hat{f}_4$) introduces some artifacts, as can be seen from Figure 4. This happens in all models except in Model (c). The artifacts are related to jumps in the local bandwidth selection taking place in the transition from flat regions (large selected bandwidth) to regions with higher density values (where more reasonable, smaller bandwidth values are selected). Local bandwidth selection around irregularities behaves as one would expect, as can be seen in the right panel of Figure 4. Across all models, the variance of $\hat{f}_4$ is comparable to that of $\hat{f}_1$ or slightly larger. In Model (a) the variance is larger for $\hat{f}_4$ than for $\hat{f}_1$.

Fig. 4. Left panel: 10% and 90% quantiles (dotted lines), median (black solid line) and true model (thick grey line) for $\hat{f}_4$. Right panel: selected local bandwidth 10%, 50% and 90% quantiles.

The bias for $\hat{f}_4$ is comparable with that of $\hat{f}_2$ and $\hat{f}_3$. For results on Model (a) for the wavelet threshold method (estimate $\hat{f}_5$ of [15]), see Figure 5 (left panel). In terms of bias this wavelet method does a rather poor job, in particular in Models (a), (c), (d) and (e), where the true model values at irregular points fall outside of the band delimited by the 10% and 90% quantiles (not all plots are shown here). The variability is also quite large in certain models.

Fig. 5. Left panel: 10% and 90% quantiles (dotted lines), median (black solid line) and true model (thick grey line) for $\hat{f}_5$. Right panel: same for $\hat{f}_6$.

The blocked wavelet thresholding estimate $\hat{f}_6$ is based on squaring the estimate


obtained from the binned transformed data. This approach introduces a systematic bias in the baseline (bin counts of zero are transformed to a value of 0.5; squaring and rescaling still yields a non-zero value). Especially in Models (b), (c) and (d) this effect was visible (due to the scale of these models). In general the performance of this blocked wavelet estimate $\hat{f}_6$ was rather poor. A possible explanation is again the bias in the baseline, which in turn causes bias in other regions when doing the normalization step.

The histogram methods of [3] (estimates $\hat{f}_7$ and $\hat{f}_8$) perform quite well. See Figure 6 for results for Model (a), showing a better performance for $\hat{f}_8$ than for $\hat{f}_7$ at the discontinuity location. In general the variant $\hat{f}_8$, based on an $L_2$ measure, selected a larger number of bins (resulting in better bias properties but a larger variance). Except for Model (e) the bias is indeed quite good. For these models, the method based on $L_2$ outperforms the recommended one, both in terms of bias and MISE (see also Table 2).

Fig. 6. Left panel: 10% and 90% quantiles (dotted lines), median (black solid line) and true model (thick grey line) for $\hat{f}_7$. Right panel: same for $\hat{f}_8$.

Table 2. MISE values for n = 2048 and n = 512.

 | Model (a) 2048 / 512 | Model (b) 2048 / 512 | Model (c) 2048 / 512 | Model (d) 2048 / 512 | Model (e) 2048 / 512
$\hat{f}_1$ | 0.01929 / 0.0973 | 0.001084 / 0.002156 | 0.006986 / 0.01169 | 0.00593 / 0.01375 | 2.122 / 2.407
$\hat{f}_2$ | 0.05620 / 0.08888 | (0.02926) / (0.1216) | 0.01032 / 0.05013 | 0.01254 / 0.02253 | 2.474 / 2.516
$\hat{f}_3$ | 0.05609 / 0.1273 | 0.001887 / 0.002853 | 0.008586 / 0.01456 | 0.01153 / 0.01961 | 2.472 / 2.501
$\hat{f}_4$ | 0.04827 / 0.08178 | 0.001222 / 0.002719 | 0.006762 / 0.02012 | 0.02916 / 0.01974 | 2.558 / 2.4851
$\hat{f}_5$ | 0.04907 / 0.1257 | 0.0009915 / 0.002741 | 0.016394 / 0.04696 | 0.009218 / 0.06274 | 2.598 / 2.827
$\hat{f}_6$ | 0.07452 / 0.1041 | 0.002454 / 0.003847 | 0.01740 / 0.01826 | 0.02551 / 0.03288 | 3.407 / 3.503
$\hat{f}_7$ | 0.06238 / 0.1448 | 0.001023 / 0.003562 | 0.01195 / 0.03666 | 0.007062 / 0.01451 | 2.812 / 3.124
$\hat{f}_8$ | 0.02697 / 0.09854 | 0.0007281 / 0.002595 | 0.01055 / 0.02696 | 0.005029 / 0.01128 | 2.521 / 2.412

Estimating irregular densities

59

has the best performance in many models (for example in the challenging Model (e)) or it has very competitive performance. If it is outperformed, then this is by f8 . The latter estimate has good to very good performance in Models (a), (b) and (d). The proposed estimate f1 is doing quite well overall, far better than f3 , f5 and f6 . As for specific methods: among the Sheather-Jones global bandwidth methods, f3 (direct plug-in, with larger selected bandwidths) shows better MISE (some values for f2 were unreliable due to convergence problems and therefore put between parentheses). It is not surprising that f3 is doing well in smooth models such as model (c), however from pictures its inconsistency at jumps and unsatisfactory behaviour at peaks is clearly visible (see Figure 3 for Model (a)). For the local bandwidth type kernel estimate f4 , note the low value for Model (c) and the high value for Model (d) (n = 2048), probably due to the artifacts mentioned before. Finally, in the histogram methods f8 outperforms f7 also in terms of MISE (the former method selects generally a larger number of bins). The effect of sample size is also clearly visible: MISE values are generally larger for the smaller sample size, in line with a general decline in variance and bias performance noticed for smaller sample size. 5.2. Data example: call center data

0.04 0.00

density

0.08

The data example concerns data gathered between January 1st and December 31st of 1999 in the call-center of “Anonymous Bank” in Israel. We gratefully acknowledge Prof. Avisham Mandelbaum and Dr. Ilan Guedj from Technion University at Haifa for making the data freely accessible. The dataset, organized per month, contains some 20000–40000 records on phone calls made to the call center. Among many other features recorded we focus on the time the call entered the system. We use data for the month of May, concerning 39553 phonecalls.

0

5

10

15

20

time 24 hours

Fig. 7. Black solid line: proposed estimate fˆ1 , black dotted line: kernel estimate f3 .


In Figure 7 the data are plotted together with two density estimates: the proposed estimator $\hat{f}_1$ and the kernel density estimate $\hat{f}_3$ based on the Sheather–Jones direct plug-in bandwidth (of value 0.264; the solve-the-equation strategy yields a bandwidth of 0.297, and $\hat{f}_2$ is very similar to $\hat{f}_3$). The bandwidth selected in the cross-validation procedure was 0.72. This results in a smooth curve except for some peak features. In contrast, the estimate $\hat{f}_3$, based on a smaller bandwidth, produces a rather wiggly curve (probably too wiggly to reflect the true underlying density). The estimate $\hat{f}_1$ shows a smoothly ascending curve (starting shortly after 7am, the time at which the call center begins to be staffed), leading to a peak between 10 and 11, when people seem to be most keen on thinking about banking. After the peak, the density decreases to a plateau in the early afternoon and then descends further to reach a minimum around 8pm. After this, the density increases again, peaking around 10pm, which may be related to phone rates in Israel, which change at that time. The call center stops being staffed at midnight.

References

[1] Anscombe, F. J. (1948). The transformation of Poisson, binomial and negative-binomial data. Biometrika 35 246–254.
[2] Bartlett, M. S. (1936). The square root transformation in the analysis of variance. Journal of the Royal Statistical Society, Supplement 3 68.
[3] Birgé, L. and Rozenholc, Y. (2006). How many bins should be put in a regular histogram. ESAIM Probability and Statistics 10 24–45.
[4] Brown, L., Cai, T., Zhang, R., Zhao, L. and Zhou, H. (2010). The Root-Unroot algorithm for density estimation as implemented via wavelet block thresholding. Probability Theory and Related Fields 146 401–433.
[5] Cao, R., Cuevas, A. and González Manteiga, W. (1994). A comparative study of several smoothing methods in density estimation. Computational Statistics & Data Analysis 17 153–176.
[6] Couallier, V. (1999). Estimation non paramétrique d'une discontinuité dans une densité. C. R. Acad. Sci. Paris 329 633–636.
[7] Cheng, M.-Y., Fan, J. and Marron, J. S. (1997). On automatic boundary corrections. The Annals of Statistics 25 1691–1708.
[8] Desmet, L. (2009). Local linear estimation of irregular curves with applications. Doctoral dissertation, Statistics Section, Department of Mathematics, Katholieke Universiteit Leuven, Belgium.
[9] Desmet, L. and Gijbels, I. (2009). Local linear fitting and improved estimation near peaks. The Canadian Journal of Statistics 37 453–475.
[10] Desmet, L. and Gijbels, I. (2009). Curve fitting under jump and peak irregularities using local linear regression. Communications in Statistics–Theory and Methods, to appear.
[11] Gijbels, I. (2008). Smoothing and preservation of irregularities using local linear fitting. Applications of Mathematics 53 177–194.
[12] Gijbels, I. and Goderniaux, A.-C. (2004). Bandwidth selection for change point estimation in nonparametric regression. Technometrics 46 76–86.
[13] Gijbels, I., Lambert, A. and Qiu, P. (2006). Edge-preserving image denoising and estimation of discontinuous surfaces. IEEE Transactions on Pattern Analysis and Machine Intelligence 28 1075–1087.
[14] Gijbels, I., Lambert, A. and Qiu, P. (2007). Jump-preserving regression and smoothing using local linear fitting: a compromise. The Annals of the Institute of Statistical Mathematics 59 235–272.
[15] Herrick, D. R. M., Nason, G. P. and Silverman, B. W. (2001). Some new methods for wavelet density estimation. Sankhyā Series A 63 394–411.
[16] Herrmann, E. (1997). Local bandwidth choice in kernel regression estimation. Journal of Computational and Graphical Statistics 6 35–54.
[17] Jones, M. C. and Foster, P. J. (1993). Generalized jackknifing and higher order kernels. Journal of Nonparametric Statistics 3 81–94.
[18] Park, B. U., Jeong, S.-O., Jones, M. C. and Kang, K. H. (2003). Adaptive variable location kernel density estimators with good performance at boundaries. Journal of Nonparametric Statistics 15 61–75.
[19] Lambert, A. (2005). Nonparametric estimations of discontinuous curves and surfaces. Doctoral dissertation, Institut de Statistique, Université catholique de Louvain, Louvain-la-Neuve, Belgium.
[20] Leibscher, E. (1990). Kernel estimators for probability densities with discontinuities. Statistics 21 185–196.
[21] Marron, J. S. and Ruppert, D. (1994). Transformations to reduce boundary bias in kernel density estimation. Journal of the Royal Statistical Society, Series B 56 653–671.
[22] Marron, J. S. and Wand, M. P. (1992). Exact mean integrated squared error. The Annals of Statistics 20 712–736.
[23] Mielniczuk, J., Sarda, P. and Vieu, P. (1989). Local data-driven bandwidth choice for density estimation. Journal of Statistical Planning and Inference 23 53–69.
[24] Schucany, W. R. (1989). Locally optimal window widths for kernel density estimation with large samples. Statistics & Probability Letters 7 401–405.
[25] Schuster, E. F. (1985). Incorporating support constraints into nonparametric estimators of densities. Communications in Statistics–Theory and Methods 14 1123–1136.
[26] Sheather, S. J. and Jones, M. C. (1991). A reliable data-based bandwidth selection method for kernel density estimation. Journal of the Royal Statistical Society, Series B 53 683–690.
[27] Wand, M. P. and Jones, M. C. (1995). Kernel Smoothing. Chapman and Hall, London.
[28] Wu, J. S. and Chu, C. K. (1993). Kernel type estimators of jump points and values of regression function. The Annals of Statistics 21 1545–1566.

IMS Collections
Nonparametrics and Robustness in Modern Statistical Inference and Time Series Analysis: A Festschrift in honor of Professor Jana Jurečková
Vol. 7 (2010) 62–69
© Institute of Mathematical Statistics, 2010
DOI: 10.1214/10-IMSCOLL706

Measuring directional dependency

Yadolah Dodge1 and Iraj Yadegari2

Abstract: In this article we propose new methods for finding the direction of dependency between two random variables which are related by a linear function.

1. Introduction

The concepts of regression and correlation were discovered by Francis Galton and Karl Pearson at the turn of the 20th century. The Galton–Pearson correlation coefficient is probably the most frequently used statistical tool in applied sciences, and up to now different interpretations for it have been provided. Rodgers and Nicewander [8] provided thirteen interpretations for it. Rovine and von Eye [9], and Falk and Well [5], present a collection of algebraic and geometric interpretations of the correlation coefficient. An elegant property of the correlation coefficient, similar to the characterization of a given random variable by its mean and variance, can be found in Nelsen [7], who shows that the correlation coefficient is equal to the ratio of a difference and a sum of two moments of inertia about certain lines in the plane. Dodge and Rousson [1] provided four new asymmetric interpretations, in the case of a symmetric error in the linear relationship of two variables, including the cube of the correlation coefficient. Using the relationship found in their paper, and assuming the existence of a linear relation between two random variables, they determined the direction of dependence in the linear regression model. That is, they provided a model on the basis of which one can make a distinction between dependent and independent variables in a linear regression. The directional dependence between two variables, when they follow Laplace distributions, was studied by Dodge and Whittaker [3] using a graphical model approach. Muddapur [6] arrives at the same relationship and found yet another formula between the correlation coefficient and the ratio of two coefficients of kurtosis. However, the author does not indicate how it could be used in determining the direction of dependence between two variables in simple linear regression. Dodge and Yadegari [4] presented five new asymmetric faces of the correlation coefficient. One of these formulas relates the fourth power of the correlation coefficient to the ratio of the coefficients of excess kurtosis of the response and explanatory variables. Also, they showed that, in regression through the origin, the coefficient of correlation is equal to the ratio of the coefficient of variation of the explanatory variable to that of the response variable. Thus, the coefficient of variation of the response variable is larger than the coefficient of variation of the explanatory variable.

1 Institute of Statistics, University of Neuchâtel, Switzerland, e-mail: [email protected]
2 Islamic Azad University, Kermanshah, Iran, e-mail: [email protected]
AMS 2000 subject classifications: Primary 62J05; secondary 62M10.
Keywords and phrases: asymptotic interpretation of the correlation coefficient, causality, correlation coefficient, kurtosis coefficient, linear regression, response variable coefficient of variation, skewness coefficient.


In Section 2 we review some asymmetric formulas for the correlation coefficient; in Section 3 the concept of directional dependency between two variables is presented, and procedures for determining the direction of dependency between response and explanatory variables in linear regression are discussed. In Section 4 we provide asymmetric measures of directional dependency in linear regression.

2. Some asymmetric faces of the correlation coefficient

Rodgers and Nicewander [8], Rovine and von Eye [9], Falk and Well [5] and Nelsen [7] provided different faces of the correlation coefficient, which were discussed by Dodge and Rousson [1, 2] and Dodge and Yadegari [4]. We also present a new face of the correlation coefficient. Later we use some of these formulas for determining the direction of dependency between two variables. Let us consider two random variables X and Y that are related by
(2.1)
$$Y = \alpha + \beta X + \varepsilon,$$
where the skewness and the excess kurtosis coefficients of the random variables X and Y are not zero, α is the intercept, β is the slope parameter, and ε is an error variable that is independent of X and has a normal distribution with zero mean and fixed variance. The correlation coefficient between two random variables X and Y is defined as
(2.2)
$$\rho = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y},$$
where Cov(X, Y) is the covariance between X and Y, and $\sigma_X^2$ and $\sigma_Y^2$ are the variances of X and Y, respectively. Under the linear model (2.1) we have
(2.3)
$$\rho = \beta\, \frac{\sigma_X}{\sigma_Y}.$$
Since X is independent of ε, starting with (2.1) we can write
$$\sigma_Y^2 = \beta^2 \sigma_X^2 + \sigma_\varepsilon^2,$$
and using (2.3) we have
(2.4)
$$1 - \rho^2 = \Big(\frac{\sigma_\varepsilon}{\sigma_Y}\Big)^2.$$
Afterwards we easily obtain
(2.5)
$$\frac{Y - \mu_Y}{\sigma_Y} = \rho\, \frac{X - \mu_X}{\sigma_X} + (1 - \rho^2)^{1/2}\, \frac{\varepsilon - \mu_\varepsilon}{\sigma_\varepsilon}.$$

2.1. Cube of the correlation coefficient

The classical notion of skewness is given in the univariate case by the standardized third central moment. The coefficient of skewness of X is
(2.6)
$$\gamma_X = E\Big(\frac{X - \mu_X}{\sigma_X}\Big)^3.$$
Dodge and Rousson [1] have proved that, under the assumption of symmetry of the error variable and under model (2.1), the cube of the correlation coefficient is


equal to the ratio of the skewness of the response variable to the skewness of the explanatory variable. We can derive it in the same way. Taking the third power of both sides of (2.5) and taking expectations, we have

γ_Y = ρ³γ_X + (1 − ρ²)^{3/2} γ_ε,

where γ_ε is the skewness coefficient of the error variable. If the error variable is symmetric, γ_ε = 0, then

(2.7)  ρ³ = γ_Y/γ_X,

as long as γ_X ≠ 0.

2.2. The 4th power of the correlation coefficient

The coefficient of excess kurtosis of a random variable X is defined by

(2.8)  κ_X = E[((X − μ_X)/σ_X)⁴] − 3.

Dodge and Yadegari [4] showed that, under the assumption of symmetry of the error variable and under model (2.1), the 4th power of the correlation coefficient is equal to the ratio of the kurtosis of the response variable to the kurtosis of the explanatory variable. Taking the 4th power of both sides of (2.5) and taking expectations, after simplification and using (2.4), we have

κ_Y = ρ⁴κ_X + (1 − ρ²)²κ_ε.

If κ_ε = 0, we have (as long as κ_X ≠ 0)

(2.9)  ρ⁴ = κ_Y/κ_X.

This formula has a natural interpretation: add a symmetric error to an explanatory variable and you get a response variable with less kurtosis. Also, the fourth power of the correlation may be described as the percentage of kurtosis which is preserved by a linear model.

2.3. The 5th power of the correlation coefficient

If we assume that X and Y are asymmetric, taking the fifth power of both sides of (2.5) and taking expectations we obtain

(2.10)  E[((Y − μ_Y)/σ_Y)⁵] = ρ⁵ E[((X − μ_X)/σ_X)⁵] + C_3^5 [ρ³(1 − ρ²)γ_X + ρ²(1 − ρ²)^{3/2}γ_ε] + (1 − ρ²)^{5/2} E[((ε − μ_ε)/σ_ε)⁵],

where C_n^m = m!/(n!(m − n)!). If we assume that E[((ε − μ_ε)/σ_ε)³] = E[((ε − μ_ε)/σ_ε)⁵] = 0, then from (2.7) and (2.10) we have

(2.11)  E[((Y − μ_Y)/σ_Y)⁵] − C_3^5 γ_Y = ρ⁵ (E[((X − μ_X)/σ_X)⁵] − C_3^5 γ_X).


Hence, we obtain a new expression for the correlation coefficient:

(2.12)  ρ⁵ = (E[((Y − μ_Y)/σ_Y)⁵] − C_3^5 γ_Y) / (E[((X − μ_X)/σ_X)⁵] − C_3^5 γ_X).

This formula represents another asymmetric face of the correlation coefficient.

2.4. The ratio of excess kurtosis to skewness

Dividing equation (2.9) by equation (2.7) we obtain

(2.13)  ρ = (κ_Y/γ_Y) / (κ_X/γ_X).

Equation (2.13) signifies that we can express the correlation coefficient as a ratio of a function of Y to the same function of X. This ratio is an asymmetric function of the excess kurtosis and skewness coefficients of the dependent and independent random variables.

2.5. Asymmetric function of the joint distribution

Another asymmetric formula for ρ under model (2.1) may be obtained by introducing the higher order correlations

ρ_ij(X, Y) = E[((X − μ_X)/σ_X)^i ((Y − μ_Y)/σ_Y)^j].

We can obtain a beautiful formula for ρ as

(2.14)  ρ = ρ_12(X, Y)/ρ_21(X, Y).

Result (2.14) shows a different asymmetric face of correlation which comes from the joint distribution of X and Y (Dodge and Rousson [1, 2]).

2.6. The ratio of two coefficients of variation

The coefficient of variation of a random variable X, denoted by CV_X, is defined as

(2.15)  CV_X = σ_X/μ_X.

The correlation coefficient can also be expressed as the ratio of two coefficients of variation of random variables related by a linear regression forced through the origin (Dodge and Yadegari [4]). Let us consider two random variables X and Y that are related by the regression model

(2.16)  Y = βX + ε,

where ε is an error variable with zero mean and fixed variance that is independent of X, and β ∈ R is a constant. In the model (2.16) we have μ_Y = βμ_X, and then

(2.17)  ρ = CV_X/CV_Y.

From equation (2.17) we conclude that the coefficient of variation of the response variable will always be greater than the coefficient of variation of the explanatory variable.
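These faces are easy to check by simulation. The following minimal Python sketch (not part of the paper; the sample size, parameter values and the use of SciPy's moment estimators are illustrative assumptions) compares the sample versions of both sides of (2.7), (2.9) and (2.17) under a model of the form (2.16):

```python
# Monte Carlo check of the asymmetric faces (2.7), (2.9) and (2.17).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 2_000_000                            # large n so moment estimates are stable
beta = 0.8
x = rng.exponential(scale=1.0, size=n)   # skewed predictor with excess kurtosis
eps = rng.normal(scale=0.5, size=n)      # symmetric error, independent of x
y = beta * x + eps                       # model (2.16); also (2.1) with intercept 0

rho = np.corrcoef(x, y)[0, 1]
print(rho**3, stats.skew(y) / stats.skew(x))          # (2.7): rho^3 = gamma_Y/gamma_X
print(rho**4, stats.kurtosis(y) / stats.kurtosis(x))  # (2.9): rho^4 = kappa_Y/kappa_X
cv = lambda z: z.std() / z.mean()
print(rho, cv(x) / cv(y))                             # (2.17): rho = CV_X/CV_Y
```

Each printed pair should agree up to simulation error, illustrating that the right-hand sides are indeed asymmetric in X and Y.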


3. Determining direction of dependence

Consider the situation in which a linear relationship exists between two random variables X and Y in the following form:

(3.1)  Y = α + βX + ε.

In (3.1) the random variable Y is a linear function of the random variable X, and X is assumed to be independent of the error variable ε. In this situation we say that the response variable Y depends on the variable X, and the direction of dependence is from X to Y. Equation (3.1) can also be thought of as a causal relationship between an explanatory variable (cause) and a response variable (effect). If X causes Y, then we select the model (3.1). On the other hand, if Y causes X, then we select the model

(3.2)  X = α′ + β′Y + ε′.

In (3.2) the error variable ε′ is independent of the explanatory variable Y. In both models (3.1) and (3.2) we assume that the error variable has a normal distribution with zero mean and fixed variance. If we wish to investigate the direction of dependence, we may hesitate between model (3.1) and model (3.2). To answer such a question, Dodge and Rousson [1] and Dodge and Yadegari [4] proposed methods for determining the direction of dependence in linear regression based on the assumption that the skewness or kurtosis coefficient of the error variable is zero. In what follows, we turn the problem of determining the direction of dependence into the problem of comparing two dependent variances or two dependent coefficients of skewness, kurtosis or variation.

3.1. Using the joint distribution

Dodge and Rousson [2] established an asymmetric face of the correlation coefficient for which no assumption about the error variable is needed (except its independence of the explanatory variable):

(3.3)  ρ_XY = ρ_12(X, Y)/ρ_21(X, Y).

This formula can be obtained from the joint distribution. They used formula (3.3) to determine the direction of dependence between X and Y. Since |ρ_XY| ≤ 1,

(3.4)  ρ_12²(X, Y) ≤ ρ_21²(X, Y).

Thus, Y is a response variable. A similar argument can be given for the linear regression dependence of X on Y. Then ρ_12²(X, Y) ≤ ρ_21²(X, Y) implies Y is the response variable and ρ_12²(X, Y) ≥ ρ_21²(X, Y) implies X is the response variable.

3.2. Comparing skewness coefficients

Dodge and Rousson [2] showed that, under the assumption of symmetry of the error variable and under model (3.1), the cube of the correlation coefficient is equal to the ratio of the skewness of the response variable to the skewness of the explanatory variable:

(3.5)  ρ_XY³ = γ_Y/γ_X,


(as long as γ_X ≠ 0). They used formula (3.5) to determine the direction of dependence between X and Y. Since |ρ_XY| ≤ 1,

(3.6)  γ_Y² ≤ γ_X².

Thus, the direction of dependence is from X to Y (Y is a response variable). A similar argument can be given for the linear regression dependence of X on Y. Then γ_X² ≥ γ_Y² implies Y is the response variable and γ_X² ≤ γ_Y² implies X is the response variable.

3.3. Comparing kurtosis coefficients

Dodge and Yadegari [4] gave another method that works in both symmetric and asymmetric situations. Under model (2.1), the fourth power of the correlation coefficient is equal to the ratio of the kurtosis of the response variable to the kurtosis of the explanatory variable,

(3.7)  ρ⁴ = κ_Y/κ_X,

where κ_X and κ_Y are the kurtosis coefficients of X and Y, respectively (as long as κ_X ≠ 0). Since ρ⁴ ≤ 1,

(3.8)  κ_Y ≤ κ_X.

This shows that the kurtosis of the response variable is always smaller than the kurtosis of the explanatory variable. Then, for a given ρ_XY, κ_X ≥ κ_Y implies Y is the response variable and κ_X ≤ κ_Y implies X is the response variable. We can similarly use the ratio (2.13) and the 5th power of the correlation coefficient (2.12) to assess the direction of dependence in a linear regression.

3.4. Comparing coefficients of variation

Now consider the situation in which a linear relationship exists between two random variables X and Y in the following form:

(3.9)  Y = βX + ε.

If X causes Y, then we select the model (3.9). On the other hand, if Y causes X, then we select the model

(3.10)  X = β′Y + ε′.

In (3.10) the error variable ε′ is independent of the explanatory variable Y. In both models (3.9) and (3.10) we assume that the error variable has zero mean and fixed variance. Under the assumptions of model (3.9) we can conclude, as in (2.17), that

(3.11)  ρ = CV_X/CV_Y.

Thus, the coefficient of variation of the response variable is larger than the coefficient of variation of the explanatory variable.


3.4.1. Special case (comparing variables)

Let us consider two random variables X and Y for which a linear relationship exists between them in the following form:

(3.12)  Y = X + ε

or

(3.13)  X = Y + ε′.

Under model (3.12) we have ρ² = σ_X²/σ_Y² (obtained from (2.3) when β = 1), and then σ_Y² > σ_X²; under model (3.13) we obtain σ_Y² < σ_X². Thus the variance of the explanatory variable is always smaller than the variance of the response variable. Then σ_Y² > σ_X² implies Y is the response variable and σ_Y² < σ_X² implies X is the response variable.

4. Measures of the directional dependency

We say that the direction of dependence is from X to Y, denoted by X → Y, if a linear relationship exists between the random variables X and Y in the following form:

(4.1)  Y = α + βX + ε,

where α is the intercept, β is the slope parameter, and ε is an error variable that is independent of X and has a normal distribution with zero mean and a fixed variance. For measuring the amount of asymmetric dependence between X and Y we cannot use the Galton–Pearson correlation coefficient, because it is a symmetric measure of dependence between two random variables. In situations where we have asymmetric measures of dependence, we can present new procedures for determining the direction of dependence. Using the skewness and kurtosis coefficients, in this section we propose two new asymmetric measures of dependence to distinguish the response from the explanatory variable. Let us consider two random variables X and Y that are related by the linear relationship (4.1). We define a skewness-based directional correlation coefficient as

(4.2)  S(X → Y) = γ_X²/(γ_X² + γ_Y²).

Here are some properties of this measure:

1. 0 < S(X → Y) < 1.
2. S(Y → X) = 1 − S(X → Y).
3. If γ_Y² ≤ γ_X², then S(Y → X) ≤ S(X → Y).
4. If γ_X² = γ_Y², then S(X → Y) = S(Y → X) = 1/2.
5. If γ_Y² < γ_X², then 1/2 < S(X → Y) < 1.
6. If γ_Y² > γ_X², then 0 < S(X → Y) < 1/2.

Thus, S(X → Y ) > S(Y → X) implies Y is the response variable and S(X → Y ) < S(Y → X) implies X is the response variable.
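The measure S, and the kurtosis-based analogue K defined below, are straightforward to estimate from data by plugging in sample moments. A minimal Python sketch (not the authors' code; the plug-in estimators and SciPy usage are illustrative assumptions) follows:

```python
# Sample versions of the directional measures S (4.2) and K (4.3, below).
import numpy as np
from scipy import stats

def S_measure(x, y):
    gx, gy = stats.skew(x), stats.skew(y)
    return gx**2 / (gx**2 + gy**2)

def K_measure(x, y):
    kx, ky = stats.kurtosis(x), stats.kurtosis(y)   # excess kurtosis
    return kx**2 / (kx**2 + ky**2)

rng = np.random.default_rng(1)
x = rng.exponential(size=100_000)
y = 2.0 + 0.8 * x + rng.normal(scale=0.5, size=x.size)  # true direction X -> Y

for name, m in (("S", S_measure), ("K", K_measure)):
    s = m(x, y)
    print(f"{name}(X->Y) = {s:.3f};", "suggests X -> Y" if s > 0.5 else "suggests Y -> X")
```

Since the error is symmetric here, (2.7) and (2.9) imply γ_Y² < γ_X² and κ_Y² < κ_X², so both estimated measures should exceed 1/2, pointing from X to Y.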


We can use the kurtosis coefficients to introduce another asymmetric measure of dependence between two random variables, which measures the directional dependence. Under the model (4.1), we define a kurtosis-based measure of the directional dependence as

(4.3)  K(X → Y) = κ_X²/(κ_X² + κ_Y²).

Here are some properties of the kurtosis-based directional correlation:

1. 0 < K(X → Y) < 1.
2. K(Y → X) = 1 − K(X → Y).
3. If κ_X = κ_Y, then K(X → Y) = K(Y → X) = 1/2.
4. If κ_Y² < κ_X², then 1/2 < K(X → Y) ≤ 1.
5. If κ_Y² ≤ κ_X², then K(Y → X) ≤ K(X → Y).
6. If κ_Y² > κ_X², then 0 ≤ K(X → Y) < 1/2.

Thus, K(X → Y) > K(Y → X) implies Y is the response variable and K(X → Y) < K(Y → X) implies X is the response variable.

References

[1] Dodge, Y. and Rousson, V. (2000). Direction dependence in a regression line. Commun. Stat. Theory Methods 29 1957–1972.
[2] Dodge, Y. and Rousson, V. (2001). On asymmetric property of the correlation coefficient in the regression line. Am. Stat. 55 51–54.
[3] Dodge, Y. and Whittaker, J. (2000). The information for the direction of dependence in L1 regression. Commun. Stat. Theory Methods 29 1945–1955.
[4] Dodge, Y. and Yadegari, I. (2009). On direction of dependence. Metrika 72 139–150.
[5] Falk, R. and Well, A. D. (1997). Faces of the correlation coefficient. J. Statistics Education [Online] 5 (3).
[6] Muddapur, M. (2003). Dependence in a regression line. Commun. Stat. Theory Methods 32 2053–2057.
[7] Nelsen, R. B. (1998). Regression lines, and moments of inertia. Am. Stat. 52 343–345.
[8] Rodgers, J. L. and Nicewander, W. A. (1988). Thirteen ways to look at the correlation coefficient. Am. Stat. 42 59–66.
[9] Rovine, M. J. and von Eye, A. (1997). A 14th way to look at a correlation coefficient: Correlation as the proportion of matches. Am. Stat. 51 42–46.

IMS Collections
Nonparametrics and Robustness in Modern Statistical Inference and Time Series Analysis: A Festschrift in honor of Professor Jana Jurečková
Vol. 7 (2010) 70–74
© Institute of Mathematical Statistics, 2010
DOI: 10.1214/10-IMSCOLL707

On a paradoxical property of the Kolmogorov–Smirnov two-sample test

Alexander Y. Gordon¹ and Lev B. Klebanov²

University of North Carolina at Charlotte and Charles University at Prague

Abstract: The two-sample Kolmogorov–Smirnov test can lose power as the size of one sample grows while the size of the other sample remains constant. In this case, a paradoxical situation takes place: the use of additional observations weakens the ability of the test to reject the null hypothesis when it is false.

1. Biasedness of the Kolmogorov goodness-of-fit test

We start with partially known results on the biasedness of the Kolmogorov goodness-of-fit test (see [1]). Let us recall some definitions. Suppose that X_1, ..., X_n are independent and identically distributed (i.i.d.) random variables (observations) with (unknown) distribution function (d.f.) F. Based on the observations, one needs to test the hypothesis

H_0: F = F_0,

where F_0 is a fixed d.f.

Definition 1.1. For a specific alternative hypothesis, a test is said to be unbiased if the probability of rejecting the null hypothesis (a) is greater than or equal to the significance level when the alternative is true, and (b) is less than or equal to the significance level when the null hypothesis is true (i.e. the test is of level α). A test is said to be biased for an alternative hypothesis if (a) fails while (b) remains true (i.e. for this alternative the test remains of level α).

Below we will consider tests with the following properties:

1. For a distance d in the space of d.f.'s, we reject the null hypothesis H_0 if d(G_n, F_0) > δ_α, where G_n is the sample d.f. of X_1, ..., X_n and δ_α satisfies the inequality

(1.1)  P{d(G_n, F_0) > δ_α} ≤ α.

¹ Department of Mathematics and Statistics, University of North Carolina at Charlotte, 9201 University City Blvd, Charlotte, NC 28223, USA, e-mail: [email protected]
² Department of Probability and Statistics, Charles University, Sokolovská 83, Prague, 18675, Czech Republic, e-mail: [email protected]
AMS 2000 subject classifications: Primary 62G10.
Keywords and phrases: Kolmogorov goodness-of-fit test, Kolmogorov–Smirnov two-sample test, unbiasedness.


2. The test is distribution-free, i.e., the probability P_F{d(G_n, F) > δ_α} does not depend on the continuous d.f. F.

We call such tests distance-based. Denote by B(F, δ) the closed ball of radius δ > 0 centered at F in the metric space of all d.f.'s with the distance d. Let F_0 be a continuous d.f. and let δ_α be defined to satisfy (1.1).

Theorem 1.1. Suppose that for some α > 0 there exists a continuous d.f. F_a such that

(1.2)  B(F_a, δ_α) ⊂ B(F_0, δ_α)

and

(1.3)  P_{F_a}{G_n ∈ B(F_0, δ_α) \ B(F_a, δ_α)} > 0.

Then the distance-based test is biased for the alternative F_a.

Proof. Let X_1, ..., X_n be a sample from F_a and let G_n be the corresponding sample d.f. Then P_{F_a}{G_n ∈ B(F_a, δ_α)} ≥ 1 − α. In view of (1.2) and (1.3) we have P_{F_a}{G_n ∈ B(F_0, δ_α)} > 1 − α, that is, P_{F_a}{d(G_n, F_0) > δ_α} < α.

Note that Theorem 1.1 is not a consequence of the result of [2], because the alternative distribution in [2] is an n-dimensional distribution and, therefore, the observations X_1, ..., X_n are not i.i.d. random variables.

Consider now the Kolmogorov goodness-of-fit test. Clearly, it is a distance-based test for the uniform distance

(1.4)  d(F, G) = sup_x |F(x) − G(x)|.

Let us show that there are F_0 and F_a such that (1.2) holds. Without loss of generality we may choose

F_0(x) = 0 for x < 0,  F_0(x) = x for 0 ≤ x < 1,  F_0(x) = 1 for x ≥ 1.

For a fixed n, we define δα so that (1.1) is true. The ball B(F0 , δα ) with δα = 0.2 is shown in Figure 1. Its center – the function F0 – is shown in black, while the lower and upper “boundaries” of the ball are shown in gray.

Fig 1. The ball B(F_0, δ_α).

Fig 2. The ball B(F_a, δ_α).

Consider now the following d.f.:

F_a(x) = 0 for x < δ_α/2,  2x − δ_α for δ_α/2 ≤ x < δ_α,  x for δ_α ≤ x < 1 − δ_α,  2x − (1 − δ_α) for 1 − δ_α ≤ x < 1 − δ_α/2,  1 for x ≥ 1 − δ_α/2.
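The inclusion of the balls can also be verified numerically. A minimal sketch (not from the paper; the grid, the tolerance and δ_α = 0.2 are illustrative assumptions) checks pointwise that the band of d.f. values within δ_α of F_a, clipped to [0, 1] since d.f.'s take values there, lies inside the band around F_0:

```python
# Numerical check that B(F_a, d_a) is contained in B(F_0, d_a) for d(.,.) in (1.4).
import numpy as np

d_a = 0.2
x = np.linspace(-0.5, 1.5, 20001)

F0 = np.clip(x, 0.0, 1.0)                       # uniform d.f. on [0, 1]
Fa = np.select(
    [x < d_a / 2, x < d_a, x < 1 - d_a, x < 1 - d_a / 2],
    [0.0 * x, 2 * x - d_a, x, 2 * x - (1 - d_a)],
    default=1.0,
)

upper_ok = np.all(np.minimum(Fa + d_a, 1.0) <= F0 + d_a + 1e-12)
lower_ok = np.all(np.maximum(Fa - d_a, 0.0) >= F0 - d_a - 1e-12)
print("B(Fa, d_a) contained in B(F0, d_a):", bool(upper_ok and lower_ok))  # True
```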

Comparing Figures 1 and 2, we see that B(F_a, δ_α) ⊂ B(F_0, δ_α), and therefore the Kolmogorov test is biased for the alternative F_a.

2. Biasedness of the Kolmogorov–Smirnov two-sample test for substantially different sizes of the samples, and the paradox

Let us turn to the two-sample problem. Suppose that we have two samples X_1, ..., X_m and Y_1, ..., Y_n, where all observations are independent. We also suppose that all

On a paradoxical property

73

X_i's have the same d.f. F and all Y_j's the same d.f. G. We suppose that both F and G are continuous functions. The null hypothesis is now H_0: F = G. It is clear that, without loss of generality, we may assume

(2.5)  G(x) = 0 for x < 0,  G(x) = x for 0 ≤ x < 1,  G(x) = 1 for x ≥ 1.

In addition, we suppose that

(2.6)  supp F ⊂ [0, 1] and F is absolutely continuous.
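Within this setup, the phenomenon discussed in the next paragraph can be probed by simulation. A minimal Monte Carlo sketch (not from the paper; the sample sizes, significance level and the use of SciPy's ks_2samp are illustrative assumptions) estimates the rejection rate under the alternative F = F_a from Section 1, with n fixed and m growing:

```python
# Rejection rate of the two-sample KS test under the alternative F_a
# (delta = 0.2) as m grows with n fixed; it need not increase with m.
import numpy as np
from scipy.stats import ks_2samp

def sample_Fa(rng, size, d=0.2):
    # Draw from F_a by inverting its piecewise-linear d.f.
    u = rng.uniform(size=size)
    x = u.copy()
    lo, hi = u < d, u > 1 - d
    x[lo] = (u[lo] + d) / 2.0
    x[hi] = (u[hi] + 1 - d) / 2.0
    return x

rng = np.random.default_rng(2)
n, reps, alpha = 10, 4000, 0.05
for m in (10, 100, 1000):
    rej = sum(
        ks_2samp(sample_Fa(rng, m), rng.uniform(size=n)).pvalue < alpha
        for _ in range(reps)
    )
    print(f"m = {m:5d}: rejection rate ~ {rej / reps:.3f}")
```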

From the results of Section 1 we see that, for an arbitrary fixed n and sufficiently large m, the two-sample Kolmogorov–Smirnov test is biased (for the alternative F = F_a ≠ G, with F_a as given in Section 1), because for m → ∞ we obtain in the limit the Kolmogorov goodness-of-fit test. In Section 3 we show that in the case where m = n the Kolmogorov–Smirnov test is unbiased, at least for small values of α, for any alternative satisfying (2.6). However, for the same values of α and fixed n, the test will no longer be unbiased if m is large enough. In other words, the power of the test for some alternatives will be smaller for m ≫ n than for m = n. This means, paradoxically, that using the Kolmogorov–Smirnov test one cannot benefit from the additional information contained in a much larger sample: on the contrary, instead of gaining power, the test loses it. The situation here is in some sense similar to that in statistical estimation theory when non-convex loss functions are used (see, for example, [3]).

3. On the unbiasedness of the two-sample Kolmogorov–Smirnov test for samples of the same size

Here we will show that in the case where m = n the Kolmogorov–Smirnov test is unbiased, at least for small values of α, for any alternative satisfying (2.6).

Theorem 3.1. For m = n there exists α ∈ (0, 1) such that the Kolmogorov–Smirnov test is unbiased for any alternative (2.6).

Proof. Recall that the Kolmogorov–Smirnov statistic is of the form

D_n = sup_x |F_n(x) − G_n(x)|,

where F_n and G_n are the sample d.f.'s based on the samples X_j and Y_j (j = 1, ..., n), respectively. Clearly, under the hypothesis H_0 the distribution of the Kolmogorov–Smirnov statistic is discrete, and therefore for some α ∈ (0, 1) the event D_n > δ_α is equivalent to the event D_n = 1. The latter event takes place if and only if

(3.7)  max(X_1, ..., X_n) < min(Y_1, ..., Y_n) or max(Y_1, ..., Y_n) < min(X_1, ..., X_n).

The probability of the event (3.7) equals

(3.8)  n ∫_0^1 [F^n(x)(1 − x)^{n−1} + (1 − F(x))^n x^{n−1}] dx.

In (3.8) we suppose that Y1 has d.f. (2.5) and X1 has d.f. F (x).


It is easy to see that, for any x (0 < x < 1), the function y^n(1 − x)^{n−1} + (1 − y)^n x^{n−1} has a minimum in y (0 < y < 1) at the point y = x. Therefore, the integral (3.8) attains its minimum in F for F(x) ≡ x. This minimum equals

n ∫_0^1 z^{n−1}(1 − z)^{n−1} dz = n Γ²(n)/Γ(2n),

which can also easily be seen from combinatorial considerations. The integral represents the probability of rejecting the null hypothesis, and it is minimal when F = G, i.e., when the null hypothesis is true.

Note that in the case m = n = 2, Theorem 3.1 establishes the unbiasedness of the Kolmogorov–Smirnov test for any alternative satisfying (2.6), because other values of δ_α lead to a trivial result. We believe that in the case m = n the test is unbiased for any α and any continuous alternative.

4. Concluding remarks

It has been shown that for the two-sample Kolmogorov–Smirnov test a paradoxical situation takes place: one cannot use the additional information contained in a very large sample if the second sample is relatively small. This paradoxical situation occurs not only for the Kolmogorov–Smirnov test. A similar paradox occurs, e.g., for the Cramér–von Mises two-sample test (see [4], where the biasedness of the Cramér–von Mises goodness-of-fit test is proved). We believe that a new approach is needed for handling the case of substantially different sample sizes.

Acknowledgement

The second named author was supported by Grant MSM 002160839 of the Ministry of Higher Education of the Czech Republic.

References

[1] Massey, F. J., Jr. (1950). A note on the power of a non-parametric test. Ann. Math. Statist. 21 440–443.
[2] Thompson, R. (1979). Bias and monotonicity of goodness-of-fit tests. J. Amer. Statist. Assoc. 74 875–876.
[3] Klebanov, L., Rachev, S. and Fabozzi, F. (2009). Robust and Non-Robust Models in Statistics. Nova, New York.
[4] Thompson, R. (1966). Bias of the one-sample Cramér–von Mises test. J. Amer. Statist. Assoc. 61 246–247.

IMS Collections
Nonparametrics and Robustness in Modern Statistical Inference and Time Series Analysis: A Festschrift in honor of Professor Jana Jurečková
Vol. 7 (2010) 75–83
© Institute of Mathematical Statistics, 2010
DOI: 10.1214/10-IMSCOLL708

MCD-RoSIS – A robust procedure for variable selection∗

Charlotte Guddat¹, Ursula Gather¹ and Sonja Kuhnt¹

TU Dortmund University

Abstract: Consider the task of estimating a regression function for describing the relationship between a response and a vector of p predictors. Often only a small subset of all given candidate predictors actually affects the response, while the rest might inhibit the analysis. Procedures for variable selection aim to identify the true predictors. A method for variable selection when the dimension p of the regressor space is much larger than the sample size n is Sure Independence Screening (SIS). The number of predictors is to be reduced to a value less than the number of observations before conducting the regression analysis. As SIS is based on nonrobust estimators, outliers in the data might lead to the elimination of true predictors. Hence, a robustified version of SIS called RoSIS, which is based on robust estimators, was proposed. Here, we give a modification of RoSIS by using the MCD estimator in the new algorithm. The new procedure MCD-RoSIS leads to better results, especially under collinearity. In a simulation study we compare the performance of SIS, RoSIS and MCD-RoSIS w.r.t. their robustness against different types of data contamination as well as different degrees of collinearity.

1. Introduction

In the analysis of high dimensional data, the curse of dimensionality (Bellman [1]) is a phenomenon which hinders an accurate modeling of the relation between a response variable Y ∈ R and a p-dimensional vector of predictors X = (X_1, ..., X_p)ᵀ ∈ Rᵖ. There are essentially two ways to handle the problem: we either use a regression method that is able to cope with high dimensional data, or we apply a dimension reduction technique that projects the p-dimensional predictor onto a subspace of lower dimension K ≪ p, followed by a usual regression procedure. For the latter approach, Li [10] proposed the model

(1.1)  Y = f(b_1ᵀX, ..., b_KᵀX, ε),

where f: R^K → R is an unknown link function to be estimated from observations (x_iᵀ, y_i)ᵀ, i = 1, ..., n, and ε is an error term that is independent of X. The vectors b_i, i = 1, ..., K, are called effective dimension reduction (edr) directions; they span a K-dimensional subspace S_{Y|X} assumed to be the central subspace in the sense of Cook [2, 3].

∗ This work was partially supported by the German Science Foundation (DFG, SFB 475, “Reduction of complexity in multivariate data structures”, and SFB 823, “Statistical modelling of nonlinear dynamic processes”).
¹ Faculty of Statistics, TU Dortmund University, 44221 Dortmund, Germany, e-mails: guddat,gather,[email protected]
AMS 2000 subject classifications: Primary 62G35, 62J99.
Keywords and phrases: Variable selection, dimension reduction, regression, outliers, robust estimation.


Under model (1.1) the projection of X onto S_{Y|X} captures all relevant information that is given by the original data. In this paper we further restrict the link function by assuming a linear model Y = bᵀX + ε with b ∈ Rᵖ. Commonly, variable selection is conducted simultaneously with the regression analysis; it is part of the model selection (Li et al. [11]; Cox and Snell [4]). Here, we focus on variable selection as a pre-step to the regression and assume model (1.1). A special case of dimension reduction arises if all edr directions are projections onto one component of X each. Hence, out of the p predictors at hand, only K_VS canonical unit vectors b_i ∈ Rᵖ, i = 1, ..., K_VS, K_VS ≪ p, are classified as being relevant and are solely used in the following regression analysis.

These days we face a situation more difficult than the one described above more and more often: the sample size n can be much smaller than the dimension p of the regressor space. Meeting this challenge is an important part of current research. Fan and Lv [6] provide a procedure for variable selection especially for this situation. They can even show that their method, Sure Independence Screening (SIS), possesses the sure screening property. That is, after the selection of n − 1 or n/log(n) variables by SIS, all true predictors are in the chosen subset with a very high probability when some conditions are fulfilled. However, SIS is based on nonrobust estimators, so that outliers in the data might influence the selection of predictors negatively, i.e. variables with an effect on Y are not extracted or noise variables are selected as being relevant. Hence, Gather and Guddat [7] provide a robust version of SIS called RoSIS (Robust Sure Independence Screening). Here, we suggest a further modification, which results in the new procedure MCD-RoSIS, being in many situations even more robust than RoSIS and also working better under collinearity. We show this by a simulation study in which we replace observations by outliers in the response as well as in the predictors and vary the sample size and the dimension of the regressor space. We also investigate different degrees of collinearity.

2. SIS and RoSIS

Sure Independence Screening (SIS; Fan and Lv [6]) is a procedure for variable selection that is constructed for situations with p ≫ n. Assuming the linear model, the method is based on the determination of the pairwise covariances of each standardized predictor Z_j, j = 1, ..., p, with the response. The aim is to reduce the number of predictors to a value K_SIS which is smaller than the sample size n. Therefore, those variables whose pairwise covariances with Y are among the absolutely largest are selected for the following regression analysis. The empirical version of Z_j = (X_j − μ_j)/σ_j results from the substitution of the expectation μ_j and the variance σ_j² of X_j by the corresponding arithmetic mean X̄_j and the empirical variance s_j², j = 1, ..., p, respectively. For the estimation of the covariances Cov(Z_j, Y), j = 1, ..., p, the empirical covariance is used. All these estimators are, as we know, sensitive to outliers. Hence, it is possible that outliers lead to an underestimation of the relation between a true predictor and Y or to an overestimation of the relation between a noise variable and Y, respectively. In the case of a strong deviation between the true and the estimated covariance, the elimination of a true predictor results.
To avoid this, Gather and Guddat [7] introduce a robust version of SIS which is based on a robust standardization of the predictors and a robust estimation of the covariances using the Gnanadesikan–Kettenring estimator (Gnanadesikan and Kettenring [8]), employing the robust tau-estimate for


estimating the univariate scale (Maronna and Zamar [12]). First comparisons of this new method, Robust Sure Independence Screening (RoSIS), with SIS have shown promising results (Gather and Guddat [7]). However, as previous results indicate that the Gnanadesikan–Kettenring estimator is not the best choice under collinearity, for example, we suggest a version of RoSIS which employs the Minimum Covariance Determinant (MCD) estimator (Rousseeuw [14]), which copes with this situation much better. We call this version MCD-RoSIS and, for a better distinction, refer to RoSIS in the following as GK-RoSIS. After a robust standardization and the estimation of the pairwise covariances by the MCD estimator, the resulting values are ordered by their absolute size. Those predictors belonging to the K_SIS largest results are selected for the following analysis. The number K_SIS is to be chosen smaller than the sample size; e.g. Fan and Lv [6] suggest K_SIS = n − 1 or K_SIS = n/log(n).

Definition 2.1. Let {(X_1ᵀ, Y_1)ᵀ, ..., (X_nᵀ, Y_n)ᵀ} be a sample of size n in R^{p+1}, where p ≫ n, and let K_SIS ∈ {1, ..., n} be given. MCD-RoSIS selects the variables as follows:

(i) Robust standardization of the observations of the predictors by median and MAD.
(ii) Robust estimation of the pairwise covariances Cov(Z_j, Y) by ω̂_rob,j = C_MCD({z_{1,j}, ..., z_{n,j}}, {y_1, ..., y_n}), j = 1, ..., p, by means of the MCD estimator.
(iii) Ordering of the estimated values by their absolute size: |ω̂_rob,j_1|_(1) ≤ |ω̂_rob,j_2|_(2) ≤ ... ≤ |ω̂_rob,j_p|_(p).
(iv) Selection of K_SIS variables: U = {Z_j : |ω̂_rob,j_{K_S}|_(K_S) ≤ |ω̂_rob,j|, 1 ≤ j ≤ p}, where K_S = p − K_SIS + 1; that is, the K_SIS variables with the largest |ω̂_rob,j| are retained.
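A minimal sketch of Definition 2.1 in Python follows (not the authors' implementation, which was in R; scikit-learn's MinCovDet stands in for the MCD estimator and the MAD consistency constant 1.4826 is an illustrative assumption):

```python
# Sketch of MCD-RoSIS (Definition 2.1) using bivariate MCD covariances.
import numpy as np
from sklearn.covariance import MinCovDet

def mcd_rosis(X, y, k_sis):
    """Return the indices of the k_sis predictors selected by MCD-RoSIS."""
    # (i) robust standardization by median and MAD
    med = np.median(X, axis=0)
    mad = 1.4826 * np.median(np.abs(X - med), axis=0)
    Z = (X - med) / mad
    # (ii) pairwise robust covariances Cov(Z_j, Y) via the bivariate MCD
    omega = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        pair = np.column_stack([Z[:, j], y])
        omega[j] = MinCovDet(random_state=0).fit(pair).covariance_[0, 1]
    # (iii)-(iv) keep the k_sis absolutely largest covariances
    return np.argsort(-np.abs(omega))[:k_sis]

# toy illustration: n = 50, p = 200, three true predictors
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 200))
y = 5 * X[:, 0] + 5 * X[:, 1] + 5 * X[:, 2] + rng.normal(size=50)
print(sorted(mcd_rosis(X, y, k_sis=10)))  # should contain 0, 1, 2
```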

In the following section we examine to what extent SIS, GK-RoSIS and MCD-RoSIS are robust against large aberrant data points by means of a simulation study, and we compare the performance of these methods in different situations regarding the dimension p, the sample size n, the types of outliers, as well as the degree of collinearity.

3. Comparison of SIS and MCD-RoSIS

In order to examine the effect of outliers on the correct selection of predictors, we simulate different outlier scenarios. We look at the effect of outliers in the predictor variables and in the response variable while we vary the dimension p, the sample size n, as well as the degree of collinearity. The following subsection contains a detailed description of the data generating processes. All simulations are carried out using the free software R [13]. We look at three different models. The setup is the same as the one Fan and Lv [6] chose for checking the performance of SIS. The n observations of the p predictors X_1, ..., X_p are generated from a multivariate normal distribution N(0, Σ) with covariance matrix Σ = (σ_ij) ∈ R^{p×p} having the entries σ_ii = 1, i = 1, ..., p, and σ_ij = ρ, i ≠ j. The observations of ε are drawn from an independent standard normal distribution. The response is assigned according to the model Y = f(X) + ε, where f(X) is the link function chosen as presented in Model 1 through Model 3.


Model 1:  Y = 5X_1 + 5X_2 + 5X_3 + ε,
Model 2:  Y = 5X_1 + 5X_2 + 5X_3 − 15ρ^{1/2}X_4 + ε, where Cov(X_4, X_j) = ρ^{1/2}, j = 1, 2, 3, 5, ..., p,
Model 3:  Y = 5X_1 + 5X_2 + 5X_3 − 15ρ^{1/2}X_4 + X_5 + ε, where Cov(X_4, X_j) = ρ^{1/2}, j = 1, 2, 3, 5, ..., p, and Cov(X_5, X_j) = 0, j = 1, 2, 3, 4, 6, ..., p.
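For concreteness, a minimal sketch of the data generating process for Model 1 (not the authors' R code; numpy is an assumed substitute, and Models 2 and 3 modify the X_4/X_5 columns and the link function analogously):

```python
# Equicorrelated Gaussian predictors and the Model 1 response.
import numpy as np

def simulate_model1(n, p, rho, rng):
    sigma = np.full((p, p), rho)
    np.fill_diagonal(sigma, 1.0)        # sigma_ii = 1, sigma_ij = rho
    X = rng.multivariate_normal(np.zeros(p), sigma, size=n)
    y = 5 * X[:, 0] + 5 * X[:, 1] + 5 * X[:, 2] + rng.normal(size=n)
    return X, y

X, y = simulate_model1(n=50, p=100, rho=0.5, rng=np.random.default_rng(0))
```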

The models are taken from the simulations of Fan and Lv [6]. The link function in Model 1 is linear in three predictors and a noise term. The second link function includes a fourth predictor which has correlation ρ^{1/2} with all the other p − 1 candidate predictors, but is uncorrelated with the response. Hence, SIS can pick all true predictors only by chance. In the third model a fifth variable is added that is uncorrelated with the other p − 1 predictors and that has the same correlation with Y as the noise has. Depending on ρ, X_5 has weaker marginal correlation with Y than X_6, ..., X_p and hence has a lower priority of being selected by SIS. We consider dimensions of p = 100 and 1000; the sample size is set to n = 50 and 70; collinearity is varied by ρ = 0, 0.1, 0.5, 0.9. The number of repetitions is 200. We apply SIS, GK-RoSIS and MCD-RoSIS to each generated data set for the selection of n − 1 variables.

For contaminating the data we replace 10% of the simulated observations by values which lie on the boundary of specific tail regions according to the notion of α-outliers (Davies and Gather [5]). For a contamination of the response we replace y_i by f(x_i) + z_{1−α/2}, with z_{1−α/2} the (1 − α/2)-quantile of the error distribution and α = 1 − 0.999^{1/n} depending on the sample size n, keeping x_i as it is. Concerning contamination of X we distinguish between two different directions. We place outliers in the X_1- or in the X_1 + X_2 + X_3-direction by choosing a contamination such that xᵀΣ⁻¹x = χ²_{0.999^{1/n},p}, with χ²_{0.999^{1/n},p} the corresponding quantile of the χ²-distribution with p degrees of freedom. For the X_1-direction we keep the values x_{i,2}, ..., x_{i,p} and use the largest solution of the equation with respect to the first entry of x as replacement for x_{i,1}. For the X_1 + X_2 + X_3-direction we insert x_{i,4}, ..., x_{i,p}, set the first three entries of x equal, and take the largest solution as replacement for x_{i,1}, x_{i,2}, x_{i,3}.

As the goal of a method for variable selection is to detect the predictors which have an influence on the response, a natural measure of performance is the number of correctly selected as well as the number of falsely selected predictors. As we fix the number of variables to be selected at K_SIS = n − 1, it is sufficient to look at the number of correctly selected variables.

In the following we briefly summarize the resulting performance of SIS, GK-RoSIS and MCD-RoSIS. Generally, we found that the new method MCD-RoSIS identifies all true predictors in almost 100% of the cases in all settings when the data are contaminated in one of the X-directions, while the classical procedure SIS fails here very often. Especially under high collinearity, or when the dimension p is large, the performance of SIS is very bad. In these situations SIS is in many cases unable to identify any of the predictors. GK-RoSIS mostly works better than SIS, but not as well as MCD-RoSIS.
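A sketch of the X_1-direction contamination rule just described (not the authors' R code; numpy/scipy and the treatment of the quadratic are illustrative assumptions):

```python
# Replace x_1 by the largest root of x' Sigma^{-1} x = chi2-quantile,
# keeping x_2, ..., x_p fixed, following the alpha-outlier description above.
import numpy as np
from scipy.stats import chi2

def contaminate_x1(x, sigma_inv, n):
    c = chi2.ppf(0.999 ** (1.0 / n), df=x.size)   # chi^2_{0.999^{1/n}, p}
    a = sigma_inv[0, 0]                            # coefficient of x_1^2
    b = 2.0 * sigma_inv[0, 1:] @ x[1:]             # coefficient of x_1
    d = x[1:] @ sigma_inv[1:, 1:] @ x[1:] - c      # constant term
    x_new = x.copy()
    x_new[0] = (-b + np.sqrt(b * b - 4 * a * d)) / (2 * a)  # largest root
    return x_new

# usage: sigma_inv = np.linalg.inv(sigma) for the equicorrelation matrix above
```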

Fig 1. SIS, GK-RoSIS and MCD-RoSIS in Model 1 with p = 100, n = 70, ρ = 0.5. (Panels: uncontaminated, Y-direction, X_1-direction, (X_1 + X_2 + X_3)-direction; percentage of runs against the number of correctly selected variables.)

Comparing the procedures when the data are uncontaminated or contaminated in the Y-direction, we have to distinguish between the models. While for Model 1 MCD-RoSIS is only almost as good as SIS, it is, generally speaking, the better choice for Models 2 and 3. GK-RoSIS is roughly on the same level as SIS but suffers strongly from high collinearity. Figure 1 shows the performance of SIS, GK-RoSIS and MCD-RoSIS for Model 1 with parameters p = 100, n = 70 and ρ = 0.5. As described before, all three procedures perform similarly well for uncontaminated data and when outliers are present in the response. For the situations with outliers in X the superiority of MCD-RoSIS is obvious. Concerning Model 2, Figure 2 shows the case of parameters p = 100, n = 70 and ρ = 0.9. In all data situations SIS and GK-RoSIS correctly select all predictors in around 50–60% of the cases, whereas MCD-RoSIS has a rate of more than 95%. In Figure 3 we find the results for Model 3 with parameters p = 1000, n = 50 and ρ = 0.1. This model includes a predictor that has only a very small correlation with the response. That is why SIS is not able to identify this variable X_5 even when the data are generated from the assumed model. Clearly, MCD-RoSIS finds more true predictors. To complement the treated parameter situations, the Table compares all methods, data situations and models for the parameters p = 1000, n = 50 and ρ = 0. For all other simulation results see Guddat et al. [9].

We have seen that MCD-RoSIS and GK-RoSIS are the better procedures for variable selection when outliers in X are present, while MCD-RoSIS is at most a little weaker in the uncontaminated situations.

Fig 2. SIS, GK-RoSIS and MCD-RoSIS in Model 2 with p = 100, n = 70, ρ = 0.9. (Same panel layout as Figure 1.)

It has also turned out that GK-RoSIS suffers from collinearity, as it shows inferior results in the respective contamination situations. The reason presumably lies in the fact that the Gnanadesikan–Kettenring estimator is based on univariate scale estimators. We have also observed that MCD-RoSIS is more suitable, even for uncontaminated data, when true predictors have only a small or no correlation with the response.

At first sight it is a little unexpected that the robustified procedures do not perform generally better when there is contamination in the Y-direction. The reason is that the size of α-outliers depends on the dimension. As the response is one-dimensional, the magnitude of outlying observations in this direction is comparatively small. Hence, the application of robust estimators in the algorithm for variable selection is not yet beneficial. But the superiority of MCD-RoSIS increases along with the magnitude of the outliers. Altogether, we can conclude that MCD-RoSIS is a very good alternative for variable selection in high dimensional settings.

4. Summary

We provide a robustified version of Sure Independence Screening (SIS), introduced by Fan and Lv [6], which is a procedure for variable selection when the number of predictors is much larger than the sample size. The aim is the reduction of the dimension to a value which is smaller than the sample size, such that usual regression methods are applicable. We modify the algorithm by using robust estimators. To be precise, we employ the median and the MAD for standardization, as well as the MCD covariance estimator for the identification of the important variables. This leads to the new procedure MCD Robust Sure Independence Screening (MCD-RoSIS).

Fig 3. SIS, GK-RoSIS and MCD-RoSIS in Model 3 with p = 1000, n = 50, ρ = 0.1. (Same panel layout as Figure 1.)

In a simulation study we compare the performance of the classical procedure SIS and of the robustified versions GK-RoSIS and MCD-RoSIS in different scenarios. We observe that MCD-RoSIS is the better choice for variable selection under strong contamination of the data. But we can also see that MCD-RoSIS is at least almost as good as the classical procedure in the uncontaminated situations. GK-RoSIS is better than SIS in many contaminated situations, but it is also very sensitive to collinearity. In the case of predictors that have only a small correlation with the response, MCD-RoSIS more often finds all true predictors, even when the data are uncontaminated. Under comparatively small deviations the robustified procedure is not always the better choice; in these situations the behavior corresponds to that in the uncontaminated case. Obviously, as in other data situations, the outliers must be of some size for the use of robust estimators to be profitable.


Table. Simulation results for p = 1000, n = 50, ρ = 0: relative frequency of the number of correctly selected predictors.

Model 1                            No. of correctly selected predictors
Contamination        Method         0      1      2      3
uncontaminated       SIS          0.000  0.000  0.010  0.990
                     GK-RoSIS     0.020  0.130  0.225  0.625
                     MCD-RoSIS    0.025  0.020  0.005  0.950
Y-direction          SIS          0.000  0.000  0.015  0.985
                     GK-RoSIS     0.010  0.115  0.280  0.595
                     MCD-RoSIS    0.030  0.030  0.005  0.935
X1-direction         SIS          0.000  0.005  0.865  0.130
                     GK-RoSIS     0.035  0.130  0.365  0.470
                     MCD-RoSIS    0.000  0.000  0.000  1.000
(X1+X2+X3)-dir.      SIS          0.590  0.120  0.080  0.210
                     GK-RoSIS     0.245  0.175  0.115  0.465
                     MCD-RoSIS    0.000  0.000  0.000  1.000

Model 2                            No. of correctly selected predictors
Contamination        Method         0      1      2      3      4
uncontaminated       SIS          0.000  0.000  0.010  0.940  0.050
                     GK-RoSIS     0.015  0.130  0.230  0.605  0.020
                     MCD-RoSIS    0.010  0.035  0.000  0.005  0.950
Y-direction          SIS          0.000  0.000  0.015  0.940  0.045
                     GK-RoSIS     0.005  0.120  0.265  0.565  0.045
                     MCD-RoSIS    0.020  0.030  0.010  0.010  0.930
X1-direction         SIS          0.000  0.005  0.820  0.170  0.005
                     GK-RoSIS     0.035  0.120  0.375  0.450  0.020
                     MCD-RoSIS    0.000  0.000  0.000  0.000  1.000
(X1+X2+X3)-dir.      SIS          0.560  0.145  0.085  0.195  0.015
                     GK-RoSIS     0.245  0.175  0.115  0.435  0.030
                     MCD-RoSIS    0.000  0.000  0.000  0.000  1.000

Model 3                            No. of correctly selected predictors
Contamination        Method         0      1      2      3      4      5
uncontaminated       SIS          0.000  0.000  0.015  0.830  0.150  0.005
                     GK-RoSIS     0.015  0.100  0.260  0.545  0.080  0.000
                     MCD-RoSIS    0.020  0.010  0.005  0.000  0.010  0.955
Y-direction          SIS          0.000  0.000  0.025  0.825  0.145  0.005
                     GK-RoSIS     0.005  0.115  0.295  0.500  0.085  0.000
                     MCD-RoSIS    0.005  0.010  0.005  0.005  0.020  0.955
X1-direction         SIS          0.000  0.010  0.735  0.215  0.040  0.000
                     GK-RoSIS     0.050  0.075  0.430  0.385  0.060  0.000
                     MCD-RoSIS    0.000  0.000  0.000  0.000  0.000  1.000
(X1+X2+X3)-dir.      SIS          0.495  0.200  0.080  0.175  0.050  0.000
                     GK-RoSIS     0.205  0.190  0.160  0.380  0.065  0.000
                     MCD-RoSIS    0.000  0.000  0.000  0.000  0.000  1.000

References

[1] Bellman, R. E. (1961). Adaptive Control Processes. Princeton University Press.
[2] Cook, R. D. (1994). On the interpretation of regression plots. J. Amer. Statist. Assoc. 89 177–189.
[3] Cook, R. D. (1998). Regression Graphics: Ideas for Studying Regressions Through Graphics. Wiley, New York.
[4] Cox, D. R. and Snell, E. J. (1974). The choice of variables in observational studies. Appl. Statist. 23 51–59.
[5] Davies, P. L. and Gather, U. (1993). The identification of multiple outliers (with discussion and rejoinder). J. Amer. Statist. Assoc. 88 782–792.
[6] Fan, J. Q. and Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space (with discussion and rejoinder). J. Roy. Stat. Soc. B 70 849–911.


[7] Gather, U. and Guddat, C. (2008). Comment on “Sure independence screening for ultrahigh dimensional feature space” by Fan, J. Q. and Lv, J. J. Roy. Stat. Soc. B 70 893–895.
[8] Gnanadesikan, R. and Kettenring, J. (1972). Robust estimates, residuals, and outlier detection with multiresponse data. Biometrics 28 81–124.
[9] Guddat, C., Gather, U. and Kuhnt, S. (2010). MCD-RoSIS – A robust procedure for variable selection. Discussion Paper, SFB 823, TU Dortmund, Germany.
[10] Li, K.-C. (1991). Sliced inverse regression for dimension reduction (with discussion). J. Amer. Statist. Assoc. 86 316–342.
[11] Li, L., Cook, R. D. and Nachtsheim, C. J. (2005). Model-free variable selection. J. Roy. Stat. Soc. B 67 285–299.
[12] Maronna, R. A. and Zamar, R. H. (2002). Robust estimates of location and dispersion for high-dimensional datasets. Technometrics 44 307–317.
[13] R Development Core Team (2008). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org.
[14] Rousseeuw, P. J. (1984). Least median of squares regression. J. Amer. Statist. Assoc. 79 871–880.

IMS Collections
Nonparametrics and Robustness in Modern Statistical Inference and Time Series Analysis: A Festschrift in honor of Professor Jana Jurečková
Vol. 7 (2010) 84–94
© Institute of Mathematical Statistics, 2010
DOI: 10.1214/10-IMSCOLL709

A note on reference limits

Jing-Ye Huang¹,∗ Lin-An Chen²,† and A. H. Welsh³,§,‡

National Taichung Institute of Technology∗, National Chiao Tung University† and The Australian National University‡

Abstract: We introduce a conceptual framework within which the problem of setting reference intervals is one of estimating population parameters. The framework enables us to broaden the possibilities for inference by showing how to create confidence intervals for population intervals. We propose a new kind of interval (the γ-mode interval) as the population parameter of interest and show how to estimate and make optimal inference about this interval. Finally, we clarify the relationship between our reference intervals and other types of intervals.

¹ Department of Statistics, National Taichung Institute of Technology, 129 Sanmin Rd., Sec. 3, Taichung, Taiwan, e-mail: [email protected]
² Institute of Statistics, National Chiao Tung University, 1001 Ta Hsueh Rd., Hsinchu, Taiwan, e-mail: [email protected]
³ Centre for Mathematics and its Applications, The Australian National University, Canberra ACT 0200, Australia, e-mail: [email protected]
§ Research supported by Australian Research Council DP0559135.
AMS 2000 subject classifications: Primary 62F25; secondary 62G15.
Keywords and phrases: confidence interval, coverage interval, inter-fractile interval, mode interval, reference interval, reference limits, tolerance interval.

1. Introduction

Reference limits are fundamentally important in clinical chemistry, toxicology, environmental health, metrology (the study of measurement), quality control, engineering and industry (Holst & Christensen [9]), and there are published standards for their statistical methodology; see for example the International Standards Organisation (ISO 3534-1, 1993; 3534-2, 1993), the International Federation of Clinical Chemists (IFCC) (Solberg [19, 20], Peticlerc & Solberg [16], Dybkær & Solberg [4]), the National Committee for Clinical Laboratory Standards (NCCLS C28-A2 [12]) and the International Union of Pure and Applied Chemistry (IUPAC) (Poulsen, Holst & Christensen [17]). The purpose of this paper is to discuss reference limits from a more statistical perspective.

Suppose that we have a sample X_1, ..., X_n of size n ≥ 1 of independent observations from the distribution F(·; θ) with unknown parameter θ. The reference limit problem is to use the sample to construct an interval for an unobserved statistic w = w(Z_1, ..., Z_m), m ≥ 1, which has distribution function F_w(·; θ) when Z_1, ..., Z_m have the same distribution F(·; θ) as X_1, ..., X_n. The statistic w is often the sample mean Z̄ = m⁻¹∑_{i=1}^m Z_i or, when m = 1, a single observation, but the general formulation is useful.

The IFCC standard (γ-content) reference interval for w is an estimate of the inter-fractile interval C_{w,γ}^{if}(θ) = [F_w⁻¹{(1 − γ)/2; θ}, F_w⁻¹{(1 + γ)/2; θ}], often with γ = 0.95. This target interval ensures that the intuitive requirement that a reference


interval represent a specified proportion of the central values obtained in the reference population is satisfied. The standard requires either applying a known (possibly identity) transformation to the data, estimating the normal version of C_{w,γ}^{if}(θ) and then retransforming to obtain the interval, or a nonparametric approach which estimates C_{w,γ}^{if}(θ) directly. The IFCC recommends that the reference interval be reported with 1 − α (usually 1 − α = 0.95) confidence intervals for the endpoints of C_{w,γ}^{if}(θ).

It is useful to see how the IFCC standard works in a simple example. Suppose that w is a single observation from an exponential distribution with mean θ and the estimation sample is from the same distribution. For the parametric approach, there is no transformation (not depending on θ) that produces exact normality, but we can apply transformations which stabilise the variance (g(x) = log(x)) or symmetrise the distribution (g(x) = x^{1/3}). In either case, let A_g = n⁻¹∑_{i=1}^n g(X_i) and S_g² = (n − 1)⁻¹∑_{i=1}^n {g(X_i) − A_g}² be the sample mean and variance of the transformed data. Then the IFCC 95% reference interval is [g⁻¹(A_g − 1.96S_g), g⁻¹(A_g + 1.96S_g)]. Since E{g(X)} ≈ g(θ) and var{g(X)} ≈ θ²g′(θ)², the reference interval is estimating [0.1408θ, 7.099θ] when we use the logarithmic transformation and [0.0416θ, 4.5194θ] when we use the cube root transformation. The actual coverage of these intervals is 0.868 (length 6.9582θ) and 0.948 (length 4.478θ), respectively. The nonparametric approach produces an estimate of the 95% inter-fractile interval [0.0253θ, 3.6889θ] (length 3.664θ). None of these intervals includes the region around zero, the region of highest probability for the exponential distribution.

The exponential example shows that we need a conceptual framework to evaluate reference intervals and unambiguous, interpretable methods for constructing reference intervals with desirable properties. Our approach, developed in Section 2, is to treat underlying population intervals (such as C_{w,γ}^{if}(θ) or [μ_w − kσ_w, μ_w + kσ_w]) as parameters and then consider estimating and making inference about them. In this framework, reference intervals are ‘point estimates’ of underlying intervals, so we can use well-established ideas to evaluate and interpret them. The only new issue is that the unknown parameter is an interval rather than a familiar vector. The treatment of an interval as an unknown parameter is arguably implicit in the statistical literature (for example in Carroll & Ruppert [1]), but it is useful to make it explicit in the present context because it enables us to separate discussion of the choice of parameter from discussion of alternative estimators and methods of inference. As we discuss in Section 3, it also allows us to relate reference intervals to well-known intervals for future observations such as prediction and tolerance intervals.

In this paper, we also propose that reference intervals be based on a new γ-content interval C_{w,γ}(θ), defined in Section 2, which we call the γ-mode interval, rather than the inter-fractile interval C_{w,γ}^{if}(θ). The γ-mode interval is the same as the inter-fractile interval when w has a unimodal, symmetric distribution; it is a more appropriate and useful interval when w has an asymmetric distribution which cannot be transformed directly to normality.
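The numbers in the exponential example are easy to reproduce. A small numerical check (not from the paper; the delta-method endpoint formulas are spelled out under the stated approximations) follows:

```python
# Endpoints and exact coverages of the transformation-based IFCC intervals
# for a single observation from an exponential distribution with mean theta.
import numpy as np

# log transform: endpoints theta * exp(-/+1.96), since g'(theta) = 1/theta
log_lo, log_hi = np.exp(-1.96), np.exp(1.96)
# cube-root transform: endpoints theta * (1 -/+ 1.96/3)^3
cub_lo, cub_hi = (1 - 1.96 / 3) ** 3, (1 + 1.96 / 3) ** 3
# exact coverage of [a*theta, b*theta] under Exp(mean theta): e^{-a} - e^{-b}
cover = lambda a, b: np.exp(-a) - np.exp(-b)

print(f"log:        [{log_lo:.4f}t, {log_hi:.4f}t], coverage {cover(log_lo, log_hi):.3f}")
print(f"cube root:  [{cub_lo:.4f}t, {cub_hi:.4f}t], coverage {cover(cub_lo, cub_hi):.3f}")
iqr_lo, iqr_hi = -np.log(0.975), -np.log(0.025)   # 95% inter-fractile interval
print(f"inter-fractile: [{iqr_lo:.4f}t, {iqr_hi:.4f}t], length {iqr_hi - iqr_lo:.3f}t")
```

This reproduces [0.1408θ, 7.099θ] with coverage 0.868, [0.0417θ, 4.519θ] with coverage 0.948, and the inter-fractile interval [0.0253θ, 3.689θ], and confirms that none of them includes zero.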
For a single observation from an exponential distribution with mean θ, the 95%-mode interval is [0, 2.9957θ], which is shorter than the other intervals we examined and includes the mode of the distribution. The γ-mode interval contains the highest density points in the sample space and so has the highest-density property used as a starting point by Eaton et al. [5] for their discussion of multivariate reference intervals for the multivariate normal distribution. Even for the multivariate normal distribution, multivariate reference intervals are difficult to obtain; for recent results see, for example, Trost [21] and Eaton et al. [5].

We define the intervals and present some results on optimal confidence intervals in Section 2. We discuss in detail the relationship between reference and confidence


intervals for γ-mode intervals and prediction and tolerance intervals in Section 3. We illustrate the methodology and explore the relationships between the different kinds of intervals further in the Gaussian and Gamma cases in Sections 4 and 5, respectively. We restrict ourselves to these simple cases so that we can obtain explicit results and make comparisons with other methods in the literature: the results can be extended to other statistics w and other models, such as regression and generalized linear models, in which one or more model parameters are functions of known covariates.

Although our present focus is on parametric methods, we have developed a nonparametric approach (using order statistics) for the case when w is a single observation (so F_w = F). However, that approach is difficult to apply with complex, structured data or when w is a more general statistic, it is less efficient than the parametric methods, and its confidence intervals perform poorly in small samples (because tail quantiles are difficult to estimate). Parametric methods overcome these difficulties at the cost of requiring more careful model examination (including diagnostics) and consideration of robustness. At least when the model holds, parametric and nonparametric methods should estimate the same interval. This is the case with our methodology, but not with the IFCC method, where parametric estimation can lead to estimating a different interval from the one we have specified (which we can interpret as bias) and does not necessarily yield efficient estimators (in the sense that their variance is larger than necessary).

2. Definitions and results

A random interval Ĉ = [â, b̂] is an unbiased estimator of a nonrandom interval C(θ) = [a(θ), b(θ)] if E_θ[Length{(Ĉ ∩ C(θ)^c) ∪ (Ĉ^c ∩ C(θ))}] = 0, and a consistent estimator of C(θ) if Pr_θ[Length{(Ĉ ∩ C(θ)^c) ∪ (Ĉ^c ∩ C(θ))} > ε] → 0 for all ε > 0. That is, the length of the region in which the intervals do not overlap has expectation zero or tends to zero in probability. We can show that an interval is unbiased or consistent if λâ + (1 − λ)b̂ is unbiased or consistent for λa(θ) + (1 − λ)b(θ), 0 ≤ λ ≤ 1. Thus the discussion of separate maximum likelihood and uniformly minimum variance unbiased estimation of the endpoints of the normal inter-fractile interval in Trost [21] immediately applies to estimation of that interval as a single parameter.

A 100(1 − α)% confidence interval for C(θ) is a realisation of a random interval Ĉ_α = [â_α, b̂_α] which satisfies

P_θ(â_α ≤ a(θ) < b(θ) ≤ b̂_α) = P_θ{Ĉ_α ⊇ C(θ)} = 1 − α  for all θ.

To develop an optimality theory based on the concept of uniformly most accurate (UMA) confidence intervals, we define a 100(1 − α)% confidence interval Ĉ_α for C(θ) to be type I UMA if

P_θ{Ĉ_α ⊇ C(θ′)} ≤ P_θ{Ĉ_α* ⊇ C(θ′)}  for all θ′ < θ,

and type II UMA if

P_θ{Ĉ_α ⊇ C(θ′)} ≤ P_θ{Ĉ_α* ⊇ C(θ′)}  for all θ′ > θ,

for any other 100(1 − α)% confidence interval Ĉ_α* for C(θ). A 100(1 − α)% confidence interval Ĉ_α for C(θ) is unbiased if

P_θ{Ĉ_α ⊇ C(θ′)} ≤ 1 − α  for all θ′ ≠ θ,


and UMA unbiased if it is unbiased and

P_θ{Ĉ_α ⊇ C(θ′)} ≤ P_θ{Ĉ_α* ⊇ C(θ′)}  for all θ′ ≠ θ,

for any other 100(1 − α)% unbiased confidence interval Ĉ_α* for C(θ).

The following theorem shows how to construct optimal confidence intervals for a wide class of fixed intervals, including many of the intervals of interest to us.

Theorem 2.1. Consider the interval C(θ) = [a(θ), b(θ)], where θ is a scalar unknown parameter and a and b are increasing functions of θ. Let T̂ = [θ̂_1, θ̂_2] be an interval with θ̂_1 < θ̂_2 and define Ĉ_T = [a(θ̂_1), b(θ̂_2)].

i) If T̂ is a 100(1 − α)% confidence interval for θ, then Ĉ_T is a 100(1 − α)% confidence interval for C(θ).
ii) If T̂ is a 100(1 − α)% unbiased confidence interval for θ, then Ĉ_T is a 100(1 − α)% unbiased confidence interval for C(θ).
iii) If T̂ is a 100(1 − α)% UMA unbiased confidence interval for θ, then Ĉ_T is a 100(1 − α)% UMA unbiased confidence interval for C(θ).

Proof. As a and b are monotone increasing, we have that for any θ′

{Ĉ_T ⊇ C(θ′)} = {a(θ̂_1) ≤ a(θ′) < b(θ′) ≤ b(θ̂_2)} ⇔ {θ̂_1 ≤ θ′ ≤ θ̂_2},

so

P_θ{Ĉ_T ⊇ C(θ′)} = P_θ(θ̂_1 ≤ θ′ ≤ θ̂_2).

The results i) and ii) follow from the definitions of confidence intervals and unbiased confidence intervals. For iii), suppose that Ĉ_α* is a 100(1 − α)% confidence interval for C(θ) and consider the set T̂* = {θ′ : C(θ′) ⊂ Ĉ_α*}. Then, for any θ′, P_θ(θ′ ∈ T̂*) = P_θ{C(θ′) ⊂ Ĉ_α*}, so setting θ′ = θ, we see that T̂* is a 100(1 − α)% confidence set for θ, and considering θ′ ≠ θ, we see that T̂* is an unbiased 100(1 − α)% confidence set for θ whenever Ĉ_α* is an unbiased 100(1 − α)% confidence interval for C(θ). Since T̂ is a 100(1 − α)% UMA unbiased confidence set for θ,

P_θ{Ĉ_T ⊇ C(θ′)} = P_θ(θ̂_1 ≤ θ′ ≤ θ̂_2) ≤ P_θ(θ′ ∈ T̂*) = P_θ{Ĉ_α* ⊇ C(θ′)},

and the result obtains.

The theorem can be applied with a and b decreasing if we reparametrize the model and write a and b as increasing functions of the transformed parameter. A slightly different approach is required for the case where one endpoint of the interval C(θ) is known.

Theorem 2.2. Consider the interval C(θ) = [a, b(θ)], where a is known and b is a monotone increasing function of a scalar unknown parameter θ, or C(θ) = [a(θ), b], where a is a monotone increasing function of a scalar unknown parameter θ and b is known. Let T̂ = (−∞, θ̂_2] be an upper interval or T̂ = [θ̂_1, ∞) be a lower interval according to whether a is known or b is known, and define Ĉ_T = [a, b(θ̂_2)], if a is known, or Ĉ_T = [a(θ̂_1), b], if b is known.

i) If T̂ is a 100(1 − α)% upper/lower confidence interval for θ, then Ĉ_T is a 100(1 − α)% confidence interval for C(θ) with a/b known.
ii) If T̂ is a 100(1 − α)% UMA upper/lower confidence interval for θ, then Ĉ_T is a 100(1 − α)% type I/type II UMA confidence interval for C(θ) with a/b known.

88

Jing-Ye Huang et al.

Proof. The proof is similar to that of Theorem 1 using the relations {a ≤ b(θ) ≤ b(θˆ2 )} ⇔ {θ ≤ θˆ2 } when a is known and {a(θˆ1 ) ≤ a(θ) ≤ b} ⇔ {θˆ1 ≤ θ} when b is known. A much simpler but more restricted theory for optimal confidence intervals based directly on the length or the length on the log scale can be constructed in particular cases. Theorem 2.3. Suppose that the interval C(θ) = [a(θ), b(θ)] is a location interval so that a(θ) = θ + k1 and b(θ) = θ + k2 or a scale interval so that a(θ) = k1 θ and b(θ) = k2 θ with k1 , k2 = 0. Then in the location/scale case, if Tˆ is the shortest/logshortest 100(1−α)% confidence interval for θ, it follows that CˆT is the shortest/logshortest 100(1 − α)% confidence interval for C(θ). Proof. For the location family, the length of Cˆ is length(CˆT ) = b(θˆ2 ) − a(θˆ1 ) = length(Tˆ) + k2 − k1 and for the scale family lengthlog (CˆT ) = log b(θˆ2 ) − log a(θˆ1 ) = lengthlog (Tˆ) + log k2 − log k1 and the result follows from the fact that k2 − k1 and log k2 − log k1 are fixed. The intuitive meaning of the above results is that good confidence intervals for C(θ) are obtained from good confidence intervals for θ. Not surprisingly, the case in which θ is a vector parameter is much more difficult to handle; exact intervals can only be constructed in particular cases (for an example, see Section 4) but we can construct asymptotic intervals. The above results apply to any kind of interval; we now turn our attention to a particular type of interval. A γ-content interval for w is a nonrandom interval Cw,γ (θ) = [aw,γ (θ), bw,γ (θ)] which satisfies Prθ {w ∈ Cw,γ (θ)} = Fw {bw,γ (θ); θ} − Fw {aw,γ (θ); θ} = γ. Note that Cw,γ (θ) is non-random so tolerance intervals are not γ-content intervals in this sense; see Section 3 for further discussion. A reference interval for w is an estimate of a γ-content interval for w. A confidence interval for an interval captures the uncertainty in estimating the interval and provides an estimate with the same content as the interval with confidence 1 − α. i. e. a 1 − α confidence interval for a γ-content interval is a γ-content interval with confidence 1 − α. Consider the class of γ-content intervals Cw,γ,δ (θ) = [Fw−1 (δ; θ), Fw−1 (γ + δ; θ)], 0 < δ < 1 − γ, where δ is a location constant to be chosen by the user. These intervals include the inter-fractile intervals when δ = (1−γ)/2 but are more flexible. A γ-mode interval is the shortest interval in the class Cw,γ,δ (θ), namely Cw,γ (θ) = Cw,γ,δ∗ (θ), where δ ∗ = δ ∗ (γ, θ) = argδ min0 0 is also the θχ22κ /2 distribution. The mean Z¯ of m ≥ 1 independent observations from this distribution has a θχ22mκ /2m distribution so the γ-mode interval for the mean of m > 1/κ observations is  −1  −1 ∗ ∗ (4) CZ,γ ¯ (θ) = θG2mκ {δ (κ)} /2m, θG2mκ {γ + δ (κ)} /2m , −1 where δ ∗ (κ) = argδ inf0 1, when m = 1, (4) is also the γ-mode interval for a single observation. However, when κ = 1 (i. e. the exponential distribution), (4) with κ = 1 gives the γ-mode interval

92

Jing-Ye Huang et al.

for the sample mean of m ≥ 2 observations but the γ-mode interval for a single observation is (5)

CZ,γ (θ) = [0, −θ log(1 − γ)].

The mode 0 is always in this interval. 5.2. The reference interval Suppose that X1 , . . . , Xn are independent γ(κ, θ) random nvariables.¯The maximum likelihood estimator κ ˆ of κ satisfies ψ(κ)−log(κ) = n−1 i log(Xi /X) with ψ(·) the ¯ 2 / n (Xi − X) ¯ 2. digamma function and, the method of moments estimator, κ ˆ = nX i ¯ κ. If κ is known, both estimators are obtained In either case, we estimate θ by X/ˆ by replacing κ ˆ by κ in (4). The maximum likelihood estimator of (5) is CˆZ,γ (θ) = ¯ log(1 − γ)]. [0, −X 5.3. Confidence intervals Suppose initially that the shape parameter κ > 1 is known so the γ-mode interval is (4). Choose g and h to satisfy 1 − α = Pr(g < χ22nκ < h). Then, from Theorem 2.1, a 100(1 − α)% confidence interval for (4) with m = 1 is   ¯ −1 (δ ∗ )/h, nXG ¯ −1 (γ + δ ∗ )/g . (6) nXG 2κ 2κ From Theorem 2.1, for the UMA unbiased confidence interval, g and h also satisfy   G2nκ (g) = G2nκ (h); from Theorem 2.3, for the log-shortest confidence interval   ¯ based on the pivot 2nX/θ, g and h also satisfy gG2nκ (g) = hG2nκ (h). A two-sided γ-level 100(1 − α)% tolerance interval for the gamma distribution ¯ 1 , Xc ¯ 2 ), with known shape parameter was given by Guenther [7]. The interval is (Xc where, for large 2nκ, c1 and c2 satisfy the two equations G2κ (hc2 /n) − G2κ (hc1 /n) = G2κ (gc2 /n) − G2κ (gc1 /n) = γ, where g and h satisfy 1 − α = Pr(g < χ22nκ < h). The tolerance interval is close to but not the same as the confidence interval for the γ-mode interval. If κ = 1, the γ-mode interval for a single observation is (5) and from Theorem 2.2, a 100(1 − α)% type II UMA confidence interval for (5) is (7)

¯ log(1 − γ)/G−1 (α)]. [0, −2nX 2n

The confidence interval (7) is constructed as a two-sided interval but is numerically the same as the one-sided γ-level 100(1 − α)% tolerance interval. The two-sided γ-level 100(1 − α)% tolerance interval obtained by Goodman & Madansky [6] by controlling both tails like Owen [13], is the same as the 100(1 − α)% confidence interval for the inter-fractile interval, namely   ¯ log {(1 + γ)/2} /G−1 (1 − α/2), −2nX ¯ log {(1 − γ)/2} /G−1 (α/2) . (8) −2nX 2n 2n We argue that the confidence interval for the mode interval is the more meaningful interval and question the value of the standard two-sided tolerance interval (8) which omits the highest density region. Prediction intervals can be constructed

Reference limits

93

from the normalized spacings between order statistics but these do not relate in a simple way to the estimated γ mode interval. When the shape parameter κ is also unknown, exact intervals are not available. However, it is straightforward to use large sample approximations based on Taylor series expansions of the endpoints of the reference interval to construct approximate confidence intervals for (4). References [1] Carroll, R. J. and Ruppert, D. (1991). Prediction and tolerance intervals with transformation and/or weighting. Technometrics 33 197–210. [2] Chen, L.-A. and Hung, N.-H. (2006). Extending the discussion on coverage intervals and statistical coverage intervals. Metrologia 43 L43–L44. [3] Chen, L.-A., Huang, J.-Y., and Chen, H.-C. (2007). Parametric coverage interval. Metrologia 44 L7–L9. [4] Dybkær, R. and Solberg, H. E. (1987). International Federation of Clinical Chemistry (IFCC). Approved recommendation (1987) on the theory of reference values. Part 6. Presentation of observed values related to reference values. Clinica Chimica Acta 170 33–42; J. Clinical Chemistry and Clinical Biochemistry 25 657–662. [5] Eaton, M. L, Muirhead, R. J., and Pickering, E. H. (2006). Assessing a vector of clinical observations. J. Statist. Plan. Inf. 136 3383–3414. [6] Goodman, L. A. and Madansky, A. (1962). Parameter-free and nonparametric tolerance limits: the exponential case. Technometrics 4 75–96. [7] Guenther, W. C. (1972). Tolerance intervals for univariate distributions. Naval Research Logistics Quarterly 19 310–333. [8] Guttman, I. (1970). Statistical tolerance regions: Classical and Bayesian. Griffin, London. [9] Holst, E. and Christensen, J. M. (1992). Intervals for the description of the biological level of a trace element in a reference population. The Statistician 41 233–242. [10] Howe, W. G. (1969). Two-sided tolerance limits for normal populations some improvements. J. Amer. Statist. Assoc. 64 610–620. [11] Krishnamoorthy, K. and Mathew, T. (2009). Statistical Tolerance Regions: Theory, Application and Computation. Wiley, New York. [12] NCCLS C28-A2 (1995). How to define and determine reference intervals in the clinical laboratory: Approved Guideline. Second edition. Villanova, PA, National Committee for Clinical Laboratory Standards. [13] Owen, D. B. (1964). Control of percentage in both tails of the normal distribution. Technometrics 6 377–387. [14] Patel, J. K. (1986). Tolerance limits – A review. Communications in Statistics – Theory and Methods 15 2719–2762. [15] Paulson, E. (1943). A note on tolerance limits. Ann. Math. Statist. 14 90–93. [16] Petitclerc, C. and Solberg, H. E. (1987). International Federation of Clinical Chemistry (IFCC). Approved recommendation (1987) on the theory of reference values. Part 2. Selection of individuals for the production of reference values. Clinica Chimica Acta 170 1–12; J. Clinical Chemistry and Clinical Biochemistry 25 639–644. [17] Poulsen, O. M., Holst, E., and Christensen, J. M. (1997). Calculation and application of coverage intervals for biological reference values (technical report). Pure and Applied Chemistry 69 1601–1611.

94

Jing-Ye Huang et al.

[18] Proschan, F. (1953). Confidence and tolerance intervals for the normal distribution. J. Amer. Statist. Assoc. 48 550–564. [19] Solberg, H. E. (1987). International Federation of Clinical Chemistry (IFCC). Approved recommendation (1986) on the theory of reference values. Part 1. The concept of reference values. Annales de Biologie Clinique 45 237– 241; Clinica Chimica Acta 165 111–118; J. Clinical Chemistry and Clinical Biochemistry 25 337–42. [20] Solberg, H. E. 1987. International Federation of Clinical Chemistry (IFCC). Approved recommendation (1987) on the theory of reference values. Part 5. Statistical treatment of collected reference values. Determination of reference limits. Clinica Chimica Acta 170 13–32; J. Clinical Chemistry and Clinical Biochemistry 25 645–656. [21] Trost, D. C. (2006). Multivariate probability-based detection of druginduced hepatic signals. Toxicol. Review 25 37–54. [22] Wald, A. (1943). An extension of Wilks’ method for setting tolerance limits. Ann. Math. Statist. 14 45–55. [23] Wald, A. and Wolfowitz, J. (1946). Tolerance limits for a normal distribution. Ann. Math. Statist. 17 208–218. [24] Willink, R. (2004). Coverage intervals and statistical coverage intervals. Metrologia 41 L5–L6. [25] Wilks, S. S. (1941). Determination of sample sizes for setting tolerance limits. Ann. Math. Statist. 12 91–96.

IMS Collections Nonparametrics and Robustness in Modern Statistical Inference and Time Series Analysis: A Festschrift in honor of Professor Jana Jureˇ ckov´ a Vol. 7 (2010) 95–104 c Institute of Mathematical Statistics, 2010  DOI: 10.1214/10-IMSCOLL710

Simple sequential procedures for change in distribution Marie Huˇ skov´ a∗,1 and Ondˇ rej Chochola1 Charles University of Prague, Department of Statistics Abstract: A simple sequential procedure is proposed for detection of a change in distribution when a training sample with no change is available. Its properties under both null and alternative hypothesis are studied and possible modifications are discussed. Theoretical results are accompanied by a simulation study.

1. Introduction We assume that the observations X1 , . . . , Xn , . . . are arriving sequentially, Xi has a continuous distribution function Fi , i = 1, 2, . . . and the first m observations have the same distribution function F0 , i. e., F1 = . . . = Fm = F0 , where F0 is unknown. X1 , . . . , Xm are usually called training data. We are interested in testing the null hypothesis H 0 : Fi = F 0 ,

∀ i ≥ m,

against the alternative hypothesis HA : there exists k ∗ ≥ 0 such that Fi = F0 , 1 ≤ i ≤ m + k ∗ , Fi = F 0 , m + k ∗ < i < ∞,

F0 = F 0 .

In case of independent observations there are no particular assumptions on the distribution functions Fi except their continuity. In case of dependent observations certain dependency among observations is assumed. Such a problem was considered by [2, 11] and [12]. Mostly such testing problems concern a change in finite dimensional parameter, see, [3, 8, 1] among others. They developed and studied sequential tests for a change in parameters in regression models. Our test procedure is described by the stopping rule: (1)

τm,N = inf{1 ≤ k ≤ N : |Q(m, k)| ≥ c qγ (k/m)}

∗ Acknowledgement. The work of the first author has been supported by grants GACR ˇ 201/09/J006, MSM 0021620839 and LC06024. The work of the second author has been supported by grants SVS 261315/2010 and GAUK 162310. 1 Address: MFF UK Sokolovsk´ a 83, CZ – 186 75 Praha 8, Czech Republic e-mail: [email protected], e-mail: [email protected] AMS 2000 subject classifications: 62G20, 62E20, 62L10. Keywords and phrases: change in distribution, sequential monitoring.

95

96

M. Huˇskov´ a and O. Chochola

with inf ∅ := ∞ and either N = ∞ or N = N (m) with limm→∞ N (m)/m = ∞. Q(m, k) is a detector depending on X1 , . . . , Xm+k , k = 1, 2, . . ., qγ (t), t ∈ (0, ∞) is a boundary function with γ ∈ [0, 1/2) (a tuning parameter) and c is a suitably chosen positive constant. We require under H0 for α ∈ (0, 1) (fixed) and, under HA ,   (2) lim PH0 τm,N < ∞ = α, m→∞

and, under HA , (3)

  lim PHA τm,N < ∞ = 1.

m→∞

The request (2) means that the test has asymptotically level α and (3) corresponds to consistency of the test. We usually choose the detectors Q(m, k)’s and the boundary function qγ (·), then constant c has to fulfill under H0   |Q(m, k)| lim P max ≥ c = α. m→∞ 1≤k≤N qγ (k/m) In the present paper we choose (4)

Q(m, k) =

1 √

m+k

σ m m i=m+1

(Fm (Xi ) − 1/2),

k = 1, 2, . . . ,

m is a where Fm is an empirical distribution function based on X1 , . . . , Xm and σ suitable standardization based on X1 , . . . , Xm . We put (5)

qγ (t) = (1 + t)(t/(1 + t))γ , t ∈ (0, ∞),

0 ≤ γ < 1/2.

Two sets of assumptions on the joint distribution of Xi ’s are considered. One set assumes that {Xi }i are independent random variables and Xi has continuous distribution function Fi , i = 1, 2, . . . ., i. e., under H0 they are independent identically distributed (i.i.d.) with common unknown continuous distribution function F0 . The other set of conditions admits dependent observations. Notice that the detector Q(m, k) can be expressed through empirical distribution function based on X1 , . . . , Xm and observations Xm+1 , . . . , Xm+k . Different test procedures for our problem based on empirical distribution functions were proposed by [11] and [2]. In these papers there are rather strict restrictions on N and independent observations are assumed. The paper [12] focuses the sequential detection of a change in the error distribution in time series. The studied procedure is based on empirical distribution functions of residuals. One can develop rank based procedures along the above lines but we do not pursue it here. Certain class of rank based are considered in [12] while U -statistics based sequential procedures are studied in [5] and [6]. The rest of the paper is organized as follows. Section 2 contain theoretical results together with discussions. Section 3 presents results of a simulation study. The proofs are in Section 4. 2. Main Results Here we formulate assertions on limit behavior of our test procedure under both null hypothesis as well under some alternatives and discuss various consequences. Under the null hypothesis we consider two sets of assumptions:

Sequential procedure for change in distribution

97

(H1 ) {Xi }i are independent identically distributed (i.i.d.) random variables, Xi has continuous distribution function F0 . (H2 ) {Xi }i is a strictly stationary α- mixing sequence with {α(i)}i such that for all δ > 0 (6)

P (|X1 − X1+i | ≤ δ) ≤ D1 δ,

(7)

α(i) ≤ D2 i−(1+η)3 ,

i = 1, 2, . . . ,

i = 1, 2, . . .

for some positive constants η, D1 , D2 . Xi has continuous distribution function F0 . Here the coefficient α(i)’s are defined as α(i) = sup |P (A ∩ B) − P (A)P (B)| A,B

where sup is taken over A ∈ σ(Xj , j ≤ n) and A ∈ σ(Xj , j ≥ n + i). Next the assertion on limit behavior of the functional of Q(m, k) under H0 is stated. Theorem 1 2 (I) Let the sequence {Xi }i fulfill the assumption (H1 ) and put σ m = 1/12. Then     |Q(m, k)| |W (t)| (8) lim P sup ≤ x ≤ x = P sup m→∞ tγ 0≤t≤1 1≤k 0 (10)

P (|Xm+k∗ +1 − Xm+k∗ +1+i | ≤ δ) ≤ D3 δ,

(11)

α0 (i) ≤ D4 i−(1+κ)3 ,

for some positive constants κ, D3 , D4 . Also

i = 1, 2, . . . ,

i = 1, 2, . . . F0 (x) dF 0 (x) = 1/2 is assumed.

98

M. Huˇskov´ a and O. Chochola

Alternative hypotheses cover a change in parameters like location but also a change in the shape of distribution. Additionally, alternative (A.2) is sensitive w.r.t. a change in dependence among observations. Theorem 2 Let {Xi }i fulfill either (A1 ) or (A2 ), let k ∗ < N η for some 0 ≥ η < 1, let (5) be satisfied. Then, as m → ∞, sup 1≤k 0}, where 1 Vm (t) = √ m



m+mt

(Fm (Xi ) − 1/2)

i=m+1

is the same as of {Zm (t), t > 0} with m+mt m  1 

k

Zm (t) = √ (F0 (Xi ) − 1/2) − (F0 (Xj ) − 1/2) . m j=1 m i=m+1

Moreover, as m → ∞ the process



√1 m



to a Gaussian process in a certain sense distribution to N (0, σ 2 ), where σ2 =



m+mt i=m+1 (F0 (Xi ) − 1/2), t > 0 converges m and √1m j=1 (F0 (Xj ) − 1/2) converges in



1 cov{F0 (X1 ), F0 (Xj+1 )}. +2 12 j=1

In case of independent observations σ 2 = 1/12 while for dependent ones the second term in σ 2 is generally nonzero and also unknown. As an estimator of σ 2 we use the estimator (13)

2  σ m = R(0) +2

Λm

m (k), w(k/Λm )R

k=1

(14)

m (k) = 1 R n

n−k

i=1

(Fm (Xi ) − 1/2)(Fm (Xi+k ) − 1/2),

Sequential procedure for change in distribution

99

where w(·) is a weight function. Usual choices are either w1 (t) = 1I{0 ≤ t ≤ 1/2} + 2(1 − t){1/2 < t ≤ 1} or w2 (t) = 1 − tI{0 ≤ t ≤ 1}. The weight w1 (·) is called the flat top kernel, while w2 (·) is the Bartlett kernel. Theorem 3 Let the sequence {Xi }i fulfill the assumption (H1 ) and let Λm → ∞,

Λm (log m)−β → 0

for some β > 2. Then, as m → ∞, 2 − σ 2 = oP (1). σ m

Proof. It is omitted since it very similar to the proof of Theorem 1 (II). 3. Simulations In this section we report the results of a small simulation study that is performed in order to check the finite sample performance of the monitoring procedure considered in the previous section. The simulations were performed using the R software. All results are obtained for the level α = 5% where the critical values c were set using the limit distribution as indicated in (12). Unfortunately the explicit form for the distribution of sup0≤t≤1 |W (t)|/tγ is known only for γ = 0 otherwise the simulated critical values are used. They are reported in [8] for example. We choose three different length of the training data m = 50, 100 and 500 to asses the ap2 proximation based on asymptotics. The estimate σ m is set to 1/12 for independent observations and it is calculated according to (13) with flat top kernel for dependent ones. We also comment on a common situation when we do not have the apriori in2 formation about the independence and the estimate of σm is calculated also for the independent observations. The symbol tk stands for t-distribution with k degrees of freedom. The empirical sizes of the procedure under the null hypothesis are based on 10 000 replications and monitoring period of length 10 000. They are reported in Table 1 for both independent and dependent observations, where dependent ones form an AR(1) sequence with a coefficient ρ. Since the procedure make use of the empirical distribution function it is convenient also for distributions with heavier tails. Two such examples are shown in the table, as well as a skewed distribution (demeaned Log-normal one). We use different values of a tuning constant γ and since we will later examine an early change, we are mostly interested in γ close to 1/2. We can see that for independent observations the level is kept and the prolongation of the training period has no significant effect. This is not the case when we do not make use of the independence information (figures are not reported here). The reason is that we need more data to estimate σ 2 precisely enough and therefore the prolongation will bring the empirical size closer to the required level. Similar reasoning holds for dependent observations as well. For γ in question (0.49), the results are satisfactory. Typically, the results for more regular distributions (e. g. normal one) are better than those reported here.

100

M. Huˇskov´ a and O. Chochola

Table 1 Empirical sizes for 5% level for different distribution of errors being either independent (ρ = 0) or forming AR(1) sequence with coefficient ρ. ρ 0

0.2

0.4

m\γ 50 100 500 50 100 500 50 100 500

t1 0 4.7 4.6 4.5 9.4 7.5 5.7 12.1 10.9 8.8

0.25 4.5 4.7 4.5 9.0 7.6 5.8 12.2 11.0 9.6

t4 0.45 2.9 3.4 4.2 6.7 5.7 5.3 8.7 8.3 8.8

0.49 1.7 2.2 3.0 4.6 3.8 4.1 5.6 5.9 6.6

0 4.4 4.7 4.2 8.6 6.6 5.0 10.3 9.0 6.7

0.25 4.3 4.3 4.4 8.7 6.4 5.3 10.4 9.3 7.2

0.45 2.8 3.2 3.8 6.6 5.3 4.7 7.6 6.9 6.4

0.49 1.7 2.0 2.8 4.6 3.8 3.5 5.2 4.8 4.9

0 4.3 4.7 4.4 9.0 7.5 5.2 11.0 8.9 7.2

LN(0,1)-e−1/2 0.25 0.45 4.1 3.0 4.3 3.1 4.5 4.0 8.8 6.7 7.5 5.8 5.6 5.1 10.9 7.9 8.8 6.4 7.4 6.8

0.49 1.7 2.0 3.0 4.6 3.9 3.8 5.4 4.2 5.0

Now we focus on alternatives. We take k ∗ = 0, i. e. the change occurs right after the end of training period. Therefore we use γ = 0.49, which is the most convenient choice for an early change. The maximal length of the monitoring period is 500 and the number of replications is 2500. Table 2 summarizes results for stopping times for independent observation when change is in the location with zero location before the change and μ0 afterwards. For comparison there are k ∗ = 0 and also k ∗ = 9. The latter case leads to a small increase in the delay of detection, otherwise the results are analogous, so we will report only results for k ∗ = 0 onwards. The detection delays are quite small even for a smaller change. The prolongation of the training period leads mainly to reducing extremes of the delay. However when we do not have the apriori information about the independence i. e. the estimate of σ 2 need to be calculated, the delays are monotonically decreasing in m. The results are generally a bit worse even for the largest m (figures are not reported here). In some simulations where the max value equals to 500 the change was not detected, however this is quite rare in this setting. The results for dependent observations are shown in Table 3. In the the upper part there are stopping times for a unit change in mean, when errors form AR(1) sequence. For dependent observations the positive impact of increased m is clearly visible. With an increasing dependence amongst the data, the performance of the procedure is worsening. However the results for m = 500 are satisfactory even with ρ = 0.4. The lower part of the table presents the results for change in distribution of innovations from t4 to demeaned Log-normal one. The procedure detects the change for larger m, however the performance is not satisfactory. This pair of distributions was chosen because it fulfills the requirement on F0 and F 0 as described in (A1 ). That requirement excludes the possibility of change from a symmetric distribution to another symmetric one. Simulations confirmed that the procedure is insensitive to this type of change. Table 4 shows the results for a change in variance of independent observations. Due to the requirement of (A1 ) we choose two skewed distributions, Log-normal and χ22 ones, which were again demeaned. We consider doubling either the variance or the standard deviation. The results are generally better for Log-normal distribution because it is more skewed. One can see an improvement in delay with an increasing m. A longer training period is crucial mainly for a smaller change.

Sequential procedure for change in distribution

101

Table 2

Summary of the stopping times for independent observations with different distributions when change in location of μ0 occurs, 2 σ m = 1/12 and k ∗ = 0 (if not stated otherwise). μ0

\m Min. 1st Qu. Median Mean 3rd Qu. Max. Min. 1st Qu. Median Mean 3rd Qu. Max.

1

0.5

50 5 8 11 12 15 52 5 14 22 27 34 153

t4 100 5 9 13 15 18 54 5 19 31 36 47 197

500 5 10 13 14 17 46 6 18 26 29 37 110

50 4 9 16 21 28 126 4 18 38 60 77 500

t1 100 4 12 20 24 31 124 4 25 48 69 91 500

500 4 12 19 23 29 100 4 22 38 46 64 250

LN(0,1)-e−1/2 50 100 500 5 4 5 11 14 13 13 18 15 14 18 16 16 22 18 33 42 30 5 4 7 35 52 32 59 85 44 73 99 45 96 131 57 464 500 128

50 24 30 34 35 39 81 25 44 55 60 71 205

t4 , k∗ =9 100 500 18 24 24 31 27 34 28 35 32 39 64 67 18 27 36 45 47 53 52 55 62 64 214 124

Table 3

Summary of the stopping time for errors forming AR(1) process. Upper part – change in mean of +1 occurs, lower part-change in distribution of innovations, k∗ = 0 for both. distribution

t4

LN(0,1)-e−1/2

t4 ↓ LN(0,1)-e−1/2

\m Min. 1st Qu. Median Mean 3rd Qu. Max. Min. 1st Qu. Median Mean 3rd Qu. Max. Min. 1st Qu. Median Mean 3rd Qu. Max.

50 5 8 11 12 15 52 1 9 14 23 23 500 1 59 500 328 500 500

ρ=0 100 5 9 13 15 18 54 2 10 13 15 18 75 2 49 159 247 500 500

500 5 10 13 14 17 46 4 10 13 13 16 30 6 43 83 106 141 500

50 2 37 63 120 123 500 3 19 34 90 76 500 2 201 500 383 500 500

ρ = 0.2 100 4 34 50 61 71 500 5 19 27 38 41 500 5 106 500 339 500 500

500 11 32 42 45 56 168 8 19 24 26 31 67 7 70 145 193 276 500

50 3 56 141 234 500 500 4 38 113 230 500 500 4 500 500 423 500 500

ρ = 0.4 100 6 47 78 125 140 500 8 33 58 116 122 500 6 311 500 395 500 500

500 10 42 60 66 83 365 9 32 45 51 62 324 7 109 262 283 500 500

4. Proofs We focus on the proofs for independent observations and give modifications needed for dependent ones. The line of both proofs is the same, however for dependent observations it is more technical. Proof of Theorem 1. (I) The detector Q(m, k) can be decomposed into two summands: √ σ m mQ(m, k) = J1 (m, k) + J2 (m, k),

102

M. Huˇskov´ a and O. Chochola Table 4

Summary of the stopping time for independent observations with different distributions when a change in a standard deviation 2 (multiplied by κ0 ) occurs, σ m = 1/12 and k ∗ = 0. \κ0

\m Min. 1st Qu. Median Mean 3rd Qu. Max.

50 1 14 50 151 221 500

2 100 1 14 39 72 89 500

LN(0,1)-e−1/2 √ 500 3 14 31 44 61 365

50 1 56 500 333 500 500

where J1 (m, k) = m+k

J2 (m, k) =

2 100 2 35 158 239 500 500

χ22 − 2 500 3 33 75 113 153 500

2 100 1 25 93 193 440 500

50 1 25 500 282 500 500

500 3 24 59 91 125 500

50 1 500 500 397 500 500



2 100 2 119 500 360 500 500

500 3 60 173 223 396 500

m+k m 1

h(Xj , Xi ), m i=m+1 j=1

(F0 (Xi ) − 1/2) − k/m

i=m+1

m

(F0 (Xi ) − 1/2)

i=1

with h(Xj , Xi ) = I{Xj ≤ Xi }−E(I{Xj ≤ Xi }|Xi )−E(I{Xj ≤ Xi }|Xj )+EI{Xj ≤ Xi }. Since given X1 , . . . , Xm term J1 (m, k) can be expressed as the sum of independent random variables with zero mean and since for i = j E(h(Xj , Xi )|Xi ) = E(h(Xj , Xi )|Xj ) = Eh(Xj , Xi ) = 0 we get by the H´ ajek -R´enyi inequality for any q > 0:  E P ( max √ 1≤k≤N

 |J1 (m, k)| ≥ q|X , . . . , X ) 1 m m(1 + k/m)(k/(m + k))γ

≤ q −2

N

k=1

≤ q −2 D

m 

k=1

E



m j=1

2 h(Xj , Xi )

m3 (1 + k/m)2 (k/(m + k))2γ

m−2+2γ k −2γ +

N

 k −2 = q −2 O(m−1 )

k=m+1

for some D > 0. The last relation holds true for any N integer and therefore |J (m,k)| √ 2 the limit behavior of max1≤k≤N |Q(m,k)| . The qγ (k/m) is the same as max1≤k≤N mqγ (k/m) proof can be finished along the line of Theorem 2.1 in [8]. (II) The proof follows the same line as above but due to dependence modifications are needed. Notice that α-mixing of {Xi }i implies α-mixing of {φ(Xi )}i for any measurable function φ with the same mixing coefficient as the original sequence. Then by Lemma 3.3 in [4] we get that there is a positive constant D such that for h(·, ·) defined above |E(h(Xi1 , Xi2 )h(Xi3 , Xi4 )| ≤ D(α(i))2/3−ξ

Sequential procedure for change in distribution

103

for any ξ > 0, where i = min(i(2) − i(1) , i(4) − i(3) ) with i(1) ≤ i(2) ≤ i(3) ≤ i(4) . Then after some standard calculations we get that EJ1 (m, k)2 ≤ Dmk for some D > 0 and hence by Theorem B.4 in [9] we get that also under present assumptions P ( max

1≤k≤N

|J1 (m, k)| ≥ q) ≤ q −2 O(m−1 (log N )2 ). (1 + k/m)(k/(m + k))γ

The proof is then again finished along the line of Theorem 2.1 in [8] but instead of Koml´os-Major-Tusn´ ady results we use Theorem 4 in [10]. Proof of Theorem 2 Going through the proof of Theorem 1(I) we find that if in J2 (m, k) we replace 1/2 by EF (Xi ) and denote this by J2A (m, k) then even under our alternative |J A (m, k)| max √ 2 = OP (1), 1≤k≤N mqγ (k/m) Moreover, max

1≤k≤N

|J1 (m, k)| = oP (1). max √ 1≤k≤N mqγ (k/m)

| max(0, k − k ∗ )| √ → ∞. mqγ (k/m)

To prove part (II) we proceed similarly. References ´ th, L., Huˇ ´ , M., and Kokoszka, P. (2006). Change[1] Aue, A., Horva skova point monitoring in linear models. Econometrics Journal 9 373–403. [2] Bandyopadhyay, U. and Mukherjee, A. (2007). Nonparametric partial sequential test for location shift at an unknown time point. Sequential Analysis 26 99-113. [3] Chu, C.-S., Stinchcombe, M., and White, H. (1996). Monitoring structural change. Econometrica 64 1045–1065. [4] Dehling, H. and Wendler, M. (2010). Central limit theorem and the bootstrap for U -statistics of strongly mixing data. Journal of Multivariate Analysis 101 126–137. [5] Gombay, E. (1995). Nonparametric truncated sequential change-point detection. Statistics & Decisions 13 71–82. [6] Gombay, E. (2004). U-statistics in sequential tests and change detection. Sequential Analysis 23 254–274. [7] Gombay, E. (2008). Weighted logrank statistics in sequential tests. Sequential Analysis 27 97–104. ´ th, L., Huˇ ´ , M., Kokoszka, P., and Steinebach, J. (2004). [8] Horva skova Monitoring changes in linear models. Journal of Statistical Planning and Inference 126 225–251. [9] Kirch C. (2006). Resampling Methods for the change analysis of dependent data. PhD. Thesis, University of Cologne, Cologne. http://kups.ub.unikoeln.de/volltexte/2006/1795/. [10] Kuelbs, J. and Philipp, W. (1980). Almost sure invariance principles for partial sums of mixing B-valued random variables. Ann. Probab. 8 6 1003–1036.

104

M. Huˇskov´ a and O. Chochola

[11] Lee S., Lee Y. and Na O. (2009). Monitoring distributional changes in autoregressive models. Commun. Statist. Theor. Meth. 38 2969-2982. [12] Lee S., Lee Y. and Na O. (2009). Monitoring parameter change in time series. Journal of Multivariate Analysis 100 715–725. [13] Mukherjee, A. (2009). Some rank-based two-phase procedures in sequential monitoring of exchange rate. Sequential Analysis 28 137-162.

IMS Collections Nonparametrics and Robustness in Modern Statistical Inference and Time Series Analysis: A Festschrift in honor of Professor Jana Jureˇ ckov´ a Vol. 7 (2010) 105–112 c Institute of Mathematical Statistics, 2010  DOI: 10.1214/10-IMSCOLL711

A class of multivariate distributions related to distributions with a Gaussian component Abram M. Kagan1 and Lev B. Klebanov2 Abstract: A class of random vectors (X, Y), X ∈ Rj , Y ∈ Rk with characteristic functions of the form h(s, t) = f (s)g(t) exp{s Ct} where C is a (j × k)-matrix and prime stands for transposition is introduced and studied. The class contains all Gaussian vectors and possesses some of their properties. A relation of the class to random vectors with Gaussian components is of a particular interest. The problem of describing all pairs of characteristic functions f (s), g(t) such that h(s, t) is a characteristic function is open.

1. Introduction In the paper we study properties of random vectors (X, Y) taking values in Rm , m = j + k with characteristic functions h(s, t) = E exp i{s X + t Y} of the form (1)

h(s, t) = f (s)g(t) exp{s Ct}.

Here s ∈ Rj , t ∈ Rk , C is a (j × k)-matrix, prime stands for transposition, and f (s), g(t) are the (marginal) characteristic functions of X and Y. The class of m-variate distributions with characteristic functions (1) includes all Gaussian distributions and, trivially, all distributions of independent X and Y (for the latter C = 0). The dependence between X and Y is, in a sense, concentrated in the matrix C and it seems natural to call this form of dependence Gaussian-like. Note that if E(|X|2 ) < ∞, E(|Y 2 ) < ∞, −C is the covariance matrix of X and Y, −C = cov(X, Y). We call the distributions with characteristic functions (1) GL-distributions. When f (s), g(t) are characteristic functions, (1) is, in general, not a characteristic function. For example, in case of j = k = 1 if f (s) = sin s/s is the characteristic function of a uniform distribution on (−1, 1), then for any characteristic function g(t) (1) is not a characteristic function unless C = 0 (if C = 0, h(s, t) is unbounded). In the next section it is shown that if f (s), g(t) have Gaussian components, (1) is a characteristic function for all C with sufficiently small elements. We know of no other examples of f (s), g(t) when h(s, t) is a characteristic function. Note 1 Department of Mathematics, University of Maryland, College Park, MD 20742, USA. e-mail: [email protected] 2 Department of Probability and Statistics, MFM, Charles University, Sokolovsk´ a 83, 18675, Prague, Czech Republic. e-mail: [email protected] AMS 2000 subject classifications: Primary 60E05, 60E10 Keywords and phrases: Gaussian-like dependence, uncorrelatedness and independence, Fr´ echet classes.

105

106

A. M. Kagan, L. B. Klebanov

in passing that the absence of Gaussian components plays an important role in problem of the arithmetic of characteristic functions (see, e. g., [3]). The vectors (X, Y) with characteristic functions (1) have some nice properties. 2. Properties of the GL-distributions Proposition 1. If (X1 , Y1 ), (X2 , Y2 ) are independent random vectors having GL-distributions and a, b constants, (X, Y) = a(X1 , Y1 ) + b(X2 , Y2 ) also has a GL-distribution. Proposition 2. If (X, Y) has a GL-distribution and X1 (resp. Y1 ) is a subvector of X (resp. Y), then (X1 , Y1 ) also has a GL-distribution. Proof. Assuming X1 (resp. Y1 ) consisting of the first j1 (resp. k1 ) components of X (resp. Y) and denoting C1 the submatrix of the first j1 rows and k1 columns of the matrix C from the characteristic function (1) of (X, Y), s1 (resp. t1 ) the vector of the first j1 (resp. k1 ) components of s (resp. t), the characteristic function of X1 , Y1 ) is h1 (s1 t1 ) = f1 (s1 )g1 (t1 ) exp{s1 C1 t1 } with f1 (s1 ) = f (s1 , 0), g1 (t1 ) = g(t1 , 0). Proposition 3. Let (X, Y) have a GL-distribution and E(|X|2 ) < ∞, E(|Y|2 ) < ∞. If linear forms L1 = a X, L2 = b Y where a ∈ Rj , b ∈ Rk are constant vectors, are uncorrelated, they are independent. Proof. In the characteristic function (1), −C = cov(X, Y) whence cov(L1 , L2 ) = −a Cb. Thus, uncorrelatedness of L1 and L2 means a Cb = 0. But then for u, v ∈ R E exp{i(uL1 + vL2 )} = f (ua)g(vb) exp{uva Cb} = f (ua)g(vb).

Proposition 3 is related to Vershik’s (see, [5]) characterization of of Gaussian vectors. Let Z be an m-variate random vector with covariance matrix V of rank ≥ 2. If any two uncorrelated linear forms a Z, b Z are independent, Z is a Gaussian vector [5]. The reverse is a well known property of Gaussian vectors. The property stated in Proposition 3 is not characteristic of the random vectors with GL-distributions. However, if to assume additionally that (X, Y) are the vectors of the first and second components, respectively, of independent (not necessarily identically distributed) bivariate random vectors (X1 , Y1 ), . . . , (Xn , Yn ), the GLdistributions are characterized by “uncorrelatedness of a X and b Y implies their independence” property. The following result holds. Theorem 2.1. If E(Xj2 + Yj2 ) < ∞, j = 1, . . . , n and any two uncorrelated linear forms L 1 = a 1 X1 + . . . + a n Xn , L 2 = b 1 Y 1 + . . . + b n Y n are independent, then (i) cov(Xj , Yj ) = 0 implies independence of Xj and Yj (a trivial part), (ii) if, additionally, #{i : cov(Xi , Yi ) = 0} ≥ 3, the characteristic function hj (s, t) of any uncorrelated (Xj , Yj ) in a vicinity of s = t = 0 has the form of (2)

hj (s, t) = fj (s)gj (t) exp{Cj st}

A class of multivariate distributions

107

for some constant Cj , (iii) if neither of those hj (s, t) vanishes, (2) holds for all s, t ∈ R. Proof. See [1] Theorem 2.1 and the next result also proved in [1] show that some characteristic properties of the Gaussian distributions, after being modified for the setup of partitioned random vectors, become characteristic properties of the GL-distributions. Theorem 2.2. If (X1 , Y1 ), . . . , (Xn , Yn ) is a sample of size n ≥ 3 from a bivariate ¯ of the first components is independent of the population and the sample mean X ¯ vector of the residuals (Y1 − Y , . . . , Yn − Y¯ ) of the second components and (not ¯ . . . , Xn − X), ¯ then the population characteristic or) Y¯ is independent of (X1 − X, function h(s, t) in a vicinity of s = t = 0 has the form (3)

h(s, t) = f (s)g(t) exp{Cst}

for some C. If h(s, t) does not vanish, (3) holds for all s, t ∈ R. The next two properties demonstrate the role of Gaussian components in GLdistributions. Recall that a random vector ξ with values in Rs has a Gaussian component if (4)

ξ =η+ζ

where η and ζ are independent random vectors and ζ has an s-variate Gaussian distribution. In terms of characteristic functions, if f (u), u ∈ Rs is the characteristic function of ξ, (4) is equivalent to (5)

f (u) = f1 (u) exp{−u V u/2}

where V is a Hermitian (s × s)-matrix and f1 (u) is a characteristic function. In view of (5), they say also that f (u) has a Gaussian component. Theorem 2.3. If f (s), s ∈ Rj , g(t), t ∈ Rk are characteristic functions having Gaussian components and C = [crq ] is a (j × k)-matrix, then for sufficiently small |crq |, r = 1, . . . , j; q = 1, . . . , k the function h(s, t) = f (s)g(t) exp{s Ct}. is the characteristic function of a random vector (X, Y) with values in Rm , m = j + k. Plainly, h(s, 0) = f (s), h(0, t) = g(t) are the (marginal) characteristic functions of X and Y. Note that if F(F, G) is the Fr´echet class of m-variate distribution functions H(x, y) with H(x, ∞) = F (x), H(∞, y) = G(y), Theorem 2.3 means that if X ∼ F (x) and Y ∼ G(y) have Gaussian components, the class F(F, G) contains H(x, y) with the characteristic function h(s, t) = exp{i(s x + t y)} dH(x, y) Rm

of the form (3) for all C with sufficiently small elements.

108

A. M. Kagan, L. B. Klebanov

Proof. By assumption, f (s) = f1 (s) exp{−s V1 s/2}, g(t) = g1 (t) exp{−t V2 t/2} where V1 , V2 are (j ×j) and (k ×k) Hermitian matrices respectively, and f1 (s), g1 (t) are characteristic functions. Let now ζ  = (ζ1 , ζ2 ) be an m-dimensional Gaussian vector with mean vector zero and covariance matrix + , V1 C V = C  V2 where Vi is the covariance matrix of ζi , i = 1, 2 and C = [crq ] = cov(ζ1 , ζ2 ) is a (j × k)-matrix. The matrix + V =

V1 0

0 V2

,

is positive definite. Hence, for all sufficiently small |crq | (their smallness is determined by V1 , V2 ) the matrix + , + , V1 C 0 C V = + (6) C  V2 C 0 is also positive definite so that (6) is Hermitian and may be chosen as a covariance matrix. Indeed, the property of a matrix to be positive definite is determined by positivity of a (finite) number of submatrices and plainly is preserved under small additive perturbations as in (6). Now one sees that the function (3) rewritten as . 1    h(s, t) = f1 (s)g1 (t) exp − (s V1 s − 2s Ct + t V2 t) 2 is a product of three characteristic functions, f1 (s), g1 (t) and . 1    ϕ(s, t) = exp − (s V1 s − 2s Ct + t V2 t) , 2 the latter being the characteristic function of an m-variate Gaussian distribution N (0, V ), and thus is a characteristic function itself. Remark. In case of j = k = 1, the smallness of |C| required in Theorem 2.3 can be quantified. Namely, if the variances of the Gaussian components ζ1 and ζ2 are σ12 and σ22 , suffice to assume |C| < σ1 σ2 . In this case, C = ρσ1 σ2 for some ρ, |ρ| < 1 and . 1 h(s, t) = f1 (s)g1 (t) exp − (σ12 s2 − 2ρσ1 σ2 st + σ22 t2 ) 2 with the third factor on the right being the characteristic function of a bivariate Gaussian distribution. Theorem 2.4. If (X, Y) has a GL-distribution with C = 0 and X is a Gaussian vector, then any linear form b Y either is independent of X or has a Gaussian component.

A class of multivariate distributions

109

Proof. Fix b ∈ Rk . If for any a ∈ Rj , a Cb = 0, then for any u ∈ R E exp{iu(a X + b Y)} = h(ua) = f (ua)g(ub) exp{u2 a Cb} = f (ua)g(ub). Thus, in this case b Y is independent of any a X implying independence of b Y and X. Indeed, for any u ∈ R, u = 0 and v ∈ Rj , E exp{i(v X + ub Y)} = E exp{iu(a X + b Y)} = f (ua)g(ub) = f (v)g(ub). Suppose now that there exists an a ∈ Rj such that a Cb = 0. Then, denoting V the covariance matrix of X, E exp{iu(a X + b Y)} = g(ub) exp{−

u2  (a V a − 2a Cb)}. 2

One can always choose |b| large enough (replacing, if necessary, b with λb) so that a V a − 2a Cb = −σ 2 < 0. Now g(ub) = h(ua, ub) exp{−σ 2 u2 /2} and since h(ua, ub) is a characteristic function, the random variable b Y with the characteristic function g(ub) has a Gaussian component. As a direct corollary of Theorem 2.4 note that in case j = k = 1, if (X, Y ) has a GL-distribution and X is Gaussian, either Y is independent of X (in which case its distribution may be arbitrary) or it has a Gaussian component. Cram´er classical theorem (see, e. g., Linnik and Ostrovskii (1977)) claims that the components of a Gaussian random vector are necessarily Gaussian (the components of a Poisson random variable are necessarily Poisson and the components of the sums of independent Poisson and Gaussian random variables are necessarily of the same form so that the above is not a characteristic property of the Gaussian distribution). A corollary of Theorems 2.3 and 2.4 shows that the class of GLdistributions is not closed with respect to deconvolution. Corollary 1. There exist independent bivariate vectors (X1 , Y1 ), (X2 , Y2 ) whose distributions are not GL while their sum (X1 + Y1 , X2 + Y2 ) has a GL-distribution. Proof. There are examples of independent random variables Y1 , Y2 without Gaussian components whose sum Y1 + Y2 has a Gaussian component. In [4] was shown that independent identically distributed random variables Y1 , Y2 with the charac2 teristic function f (t) = (1 − t2 )e−t /2 have no Gaussian component while their sum 2 Y1 + Y2 whose characteristic function is (1 − t2 )2 e−t has a Gaussian component 2 with the characteristic function e−t /4 . Il’inskii [2] showed that any non-trivial (i. e., with ab = 0) linear combination aY1 + bY2 of the above Y1 , Y2 has a Gaussian component. It leads to that any vector (aX1 + bY1 , aX2 + bY2 ) with ab = 0, Y1 , Y2 from Il’inski’si example and Gaussian X1 , X2 has a GL-distribution. Let now (X1 , Y1 ), (X2 , Y2 ) be independent random vectors with Gaussian first components and such that Xi and Yi , i = 1, 2 are not independent (their dependence may be arbitrary). Due to Theorem 2.4, in case of j = k = 1 the distributions of the vectors (Xi , Yi ), i = 1, 2 are not GL. At the same time, both components of their sum (X, Y ) = (X1 + X2 , Y1 + Y2 ) have Gaussian components so that due to Theorem 2.3 the vector (X, Y ) has a GL-distribution.

110

A. M. Kagan, L. B. Klebanov

Combining Theorems 2.1 and 2.4 leads to a characterization of distributions with a Gaussian component by a property of linear forms. Corollary 2. Let (X1 , Y1 ), . . . , (Xn , Yn ), n ≥ 3 be independent random vectors with Gaussian first components. Assume that for i = 1, . . . , n E|Yi |2 < ∞, cov(Xi , Yi ) = 0 and the characteristic functions hi (s, t) of (Xi , Yi ) do not vanish. Then uncorrelatedness of pairs L1 = a1 X1 + . . . + an Xn , L2 = b1 Y1 + . . . + bn Yn in the first and second components is equivalent to their independence if and only if Y1 , . . . , Yn have Gaussian components. Proof. From Theorem 2.1 (assuming E(Xi ) = 0), 2 2

hi (s, t) = e−σi s

/2

gi (t) exp{Ci st}

where σi2 = E(Xi2 ), gi (t) is the characteristic function of Yi and Ci = −cov(Xi , Yi ) = 0. Then by Theorem 2.4 Yi has a Gaussian component. For the sufficiency part see Proposition 3. To the best of the authors’ knowledge, it is the first example of characterization of distributions with Gaussian components. For simplicity, let us consider the case of two-dimensional vector (X, Y ) with a GL-distribution. Hypothesis Vector (X, Y ) has GL-distribution if and only if both X and Y have Gaussian components. To support this Hypothesis note that it is true for infinitely divisible characteristic function h(s, t). This fact is rather simple, and its proof follows from L´evy Chinchine representation for infinitely divisible characteristic functions. Let us give another example of characterization of distributions with a Gaussian component, supporting the Hypothesis. To this aim consider a set ξ1 , . . . , ξn of independent random variables, and two sets a1 , . . . , an , b1 , . . . , bn of real constants. Denote (7)

J = {j : aj bj = 0}, J1 = {1, . . . , n} \ J.

Theorem 2.5. Let X=

n

a j ξj , Y =

j=1

n

bj ξj .

j=1

Denote by h(s, t) the characteristic function of the pair (X, Y ) and suppose that the set J = ∅. The pair (X, Y ) has a GL-distribution if and only if all ξj with j ∈ J have Gaussian distribution. In this case (8)

h(s, t) = f (s)g(t) exp{cst},

where both f and g have Gaussian components or are Gaussian. Proof. Let us calculate h(s, t). We have (9) h(s, t) = E exp{isX + itY } = E exp

⎧ n ⎨



j=1

⎫ ⎬ i(saj + tbj )ξj



=

n  j=1

hj (saj + tbj ),

A class of multivariate distributions

111

where hj is the characteristics function of ξj (j = 1, . . . , n). From (8) and (9) it follows that (10)

n 

hj (saj + tbj ) = f (s)g(t) exp{cst}.

j=1

The equation (10) is very similar to that appearing in known Skitovich–Darmois Theorem. The same method shows us that the functions hj with j ∈ J are characteristic functions of Gaussian distributions. Therefore, the functions f (s) and g(t) are represented as the products of Gaussian characteristic functions (hj with j ∈ J) and some other functions (hj with j ∈ J1 ). Reverse statement is trivial. GL-distributions may be of some interest for the theory of statistical models. Let F (x) and G(y) be a j- and k-variate distribution functions with |x|2 dF (x) < ∞, |y|2 dG(y) < ∞. Does there exist an m-variate, m = j + k, distribution function H(x, y) with marginals F and G such that if (X, Y) ∼ H, the covariance matrix cov(X, Y) is a given (j × k)-matrix C? In other words, is it possible to assume as a statistical model the triple (F, G; C)? Since for any variables ξ, η with σξ2 = var(ξ) < ∞, ση2 = var(η) < ∞, |cov(ξ, η)| ≤ σξ ση ,

(11)

the elements of C must satisfy conditions `a la (11). Even in case of j = k = 1, (11) is not always (i. e., not for all F, G) sufficient. Proposition 4. If X ∼ F, Y ∼ G have Gaussian components with covariance matrices V1 , V2 , there exist models (F, G; C) for all C with sufficiently small elements, their smallness is determined by V1 and V2 . Proof. As shown in Theorem 2.3, the function h(s, t) = f (s)g(t) exp{±s Ct} where f (s), g(t) are the characteristic functions of X and Y, is for all C with sufficiently small elements the characteristic function of a distribution H(x, y) with marginals F and G. Simple calculation shows that cov(X, Y) = ∓C. Certainly, the presence of Gaussian components in X and Y is an artificial condition for the existence of a model (F, G; C). And besides, the statistician would prefer to work with the distribution function or the density and not with the characteristic function. H. Furstenberg, Y. Katznelson and B.Weiss (private communication) showed that if j- and k-dimensional random vectors X ∼ F, Y ∼ G with finite second moments are such that for any unit vectors a ∈ Rj , b ∈ Rk E(|a X|) > A, E(|b Y|) > A, then for all sufficiently small (depending on A) absolute values of the elements of an (j × k)-matrix C there exists an m = (j + k)-variate distribution H(x, y) with marginals F (x) and G(y) and cov(X, Y) = C. Their proof is based on the convexity of the Fr´echet class F(F, G) that allows constructing the required H as a convex combination of Hrq ∈ F(F, G) where for a given pair(r, q), m xr yq dHrq (x , y) = ± R

112

A. M. Kagan, L. B. Klebanov

for some  > 0 while for all other pairs (r , q  ) = (r, q), m xr yq dHrq (x , y) = 0. R

The resulting H, though given in an explicit form, is not handy for using in applications and would be interesting to construct (in case of absolutely continuous F and G) an absolutely continuous H with the required property. Acknowledgement The first author worked on the paper when he was visiting Department of Statistics, the Hebrew University in Jerusalem as a Forchheimer professor. The work of the second author was supported by the Grant MSM 002160839 from the Ministry of Education of Czech Republic and by Grant IAA 101120801 from the Academy of Sciences of Czech Republic. References [1] Bar-Lev, S. K., Kagan, A. M. (2009). Bivariate distributions with Gaussiantype dependence structure. Comm. in Statistics - Theory and Methods, 38, 2669-2676. [2] Il’inskii, A.I. (2010) On a question by Kagan. J. Math. Physics, Analysis and Geometry, accepted for publication. [3] Linnik, Yu. V. (1960). Decomposition of Probability Distributions. Amer. Math. Soc. [4] Linnik, Yu. V., Ostrovskii, I. V. (1977). Decomposition of Random Variables and Vectors. Amer. Math. Soc. [5] Vershik, A. M. (1964). Some characteristic properties of Gaussian stochastic processes. Theor. Probab. Applic., 9, 390-394.

IMS Collections Nonparametrics and Robustness in Modern Statistical Inference and Time Series Analysis: A Festschrift in honor of Professor Jana Jureˇ ckov´ a Vol. 7 (2010) 113–122 c Institute of Mathematical Statistics, 2010  DOI: 10.1214/10-IMSCOLL712

Locating landmarks using templates∗ Jan Kalina Charles University in Prague and Academy of Sciences of the Czech Republic Abstract: This paper examines different approaches to classification and discrimination applied to two-dimensional (2D) grey-scale images of faces. The database containing 212 standardized images is divided to a training set and a validation set. The aim is the automatic localization of the mouth. We focus on template matching and compare the results with standard classification methods. We discuss the choice of a suitable template and inspect its robustness aspects. While methods of image analysis are well-established, there exists a popular belief that statistical methods cannot handle this task. We ascertain that simple methods are successful even without a prior reduction of dimension and feature extraction. Template matching and linear discriminant analysis turn out to give very reliable results.

1. Introduction The aim of this paper is to locate landmarks in two-dimensional (2D) grey-scale images of faces, to examine some aspects of template matching including the construction of templates and robustness aspects, and to compare different methods for locating landmarks. In contrary to standard approaches, we want to examine methods applied to raw data, without a prior reduction of dimension and feature extraction. There exists a popular belief that statistical methods cannot handle this task. We refer to [13] giving a survey of 181 recent articles on face detection and face recognition, which is still not an exhaustive survey but rather a study of selected remarkable specific approaches. Existing methods of image analysis are complicated combinations of ad hoc methods of mathematics, statistics and informatics as well as heuristic ideas which are tailor-made to suit the particular data and the particular task. These black boxes are far too complex to implement for users of the methods in all areas of applications. We point out that these reliable methods are based on extremely simple features, albeit organized in a cascade (see [10]), and furthermore simple templates are used also in complicated situations, for example in the spaces with the reduced dimension (see [9]). Our aim is also to compare template matching with methods of multivariate statistics; these turn out to yield successful results for standardized images. Reduction of dimension becomes unnecessary when very fast computers are available to analyze raw data and template matching has a clear interpretation and can be implemented routinely. Charles University in Prague, KPMS MFF, Sokolovsk´ a 83, 186 75 Praha 8 and Institute of Computer Science, Academy of Sciences of the Czech Republic, Pod Vod´ arenskou vˇ eˇ z´ı 2, 182 07 Praha 8, Czech Republic. e-mail: [email protected] url: http://www.euromise.org/homepage/people/kalina.html ∗ This work is supported by Jaroslav H´ ajek Center for Theoretical and Applied Statistics, project LC 06024 of the Ministry of Education, Youth and Sports of the Czech Republic. AMS 2000 subject classifications: Primary 62H30; secondary 62H35. Keywords and phrases: classification and discrimination, high-dimensional data. 113

114

J. Kalina

Possible applications of detecting objects in images include also face detection for forensic anthropology, secret service or military applications, but also other applications on images with other objects than faces (weather prediction from satellite images, automatic robot vision) or even detection of events in financial time series (fraud detection). We work with the database of images from the Institute of Human Genetics, University Clinic in Essen, Germany, which was acquired as a part of grants BO 1955/2-1 and WU 314/2-1 of the German Research Council (DFG). It contains 212 grey-scale images of faces of size 192 × 256 pixels. We divide them to a training database of 124 images and a validation database with 88 images. A grey value in the interval [0,1] corresponds to each pixel, where low values are black and large values white. The images were taken under standardized conditions always with the person sitting straight in front of the camera looking in it. While the size of the head can differ only slightly, the heads are often rotated by a small angle and the eyes are not in a perfectly horizontal position in such images. For example there are no images with closed eyes, hair over the face covering the eyes or other nuisance effects. The database does not include images with a three-dimensional rotation (a different pose). The Institute of Human Genetics is working on interesting problems in the genetic research using images of faces. The ambitions of the research are to classify automatically genetic syndromes from a picture of a face; to examine the connection between the genetic code and the size and shape of facial features; and also to visualize a face based only on its biometric measures. Some of the results are described in the papers by [12], [9] and [1]. All such procedures require as the first step the localization of landmarks, although this is not their primary aim. The landmarks are defined as points of correspondence (exactly defined biologically or geometrically) on each object that matches between and within populations (see [2] or [3]). Examples of landmarks include the soft tissue points located on inner and outer commisure of each eye fissure, the points located at each labial commisure, the midpoints of the vermilion line of the upper and lower lip (see [4]). The team of genetics researchers uses two approaches to locate 40 landmarks in each face as follows [1]. One possibility is the manual identification, carefully performed by an anthropologist trained in this field. Another approach used at the institute is a semi-automatic procedure based on [12]. This starts with a twodimensional wavelet transformation of the images and uses templates in the space of the wavelet coefficients. However it turns out to be very sensitive to slight rotations of the face. This is the motivation for our study of template matching and its robustness. Chapter 2 is devoted to template matching applied to locating the mouth in images of the training database. We study robustness to local modifications or different lighting conditions. Chapter 3 compares different methods of classification analysis for the same task. 2. Locating the mouth using template matching We describe our construction of templates and apply them with the aim to localize the mouth in the training database with 124 images of faces. Template matching is a tailor made method for object detection in grey-scale images using an ideal object with the ideal shape in the typical form, particularly applicable to locating

Locating landmarks using templates

115

Fig 1. An image from the database. Every image is a matrix of 192 × 256 pixels.

faces or their landmarks in a single image. [13] gives a list of references on template matching. The template is placed on every possible position in the image and the similarity is measured between the template and each part of the image, namely the grey value of each pixel of the template is compared with the grey value of the corresponding pixel of the image. The standard solution is to compute the Pearson product-moment correlation coefficient r to compare all grey values of the image ignoring the coordinates of the pixels. In the following text we consider the Pearson product-moment correlation coefficient r (shortly called correlation coefficient) and the weighted Pearson product-moment correlation coefficient (shortly called weighted correlation coefficient). 2.1. Construction of templates In the references [13] or [10] we have found no instructions on a sophisticated construction of templates. We construct the set of mouth templates in the following way. Starting with a particular mouth with a typical appearance, we compute the Pearson product-moment correlation coefficient between this mouth of size 27 × 41 pixels and every possible rectangular area of the size 27 × 41 pixels of every image of the training set. In 16 images the maximal correlation coefficient between the template and the image exceeds 0.85 and this largest correlation coefficient is obtained each time in the mouth. The symmetrized average of the grey values of the 16 mouths is used as the first template. The process of averaging removes individual characteristics and retains typical properties of objects. The procedure was then repeated with such initial mouth, which did not have the correlation coefficient with any of the previous mouth templates above 0.80. Some of the initial templates are rectangles including just the mouth itself and the nearest neighbourhood, others go as far downwards as to the chin. Nonstandard mouths are also included as initial templates, for example not horizontal, open with visible teeth, smiling or luminous lips after using lipstick. Therefore we subjectively select different sizes of the templates. Altogether a set of 13 mouth templates of different sizes was constructed. All the 13 templates together lead to correct locating the mouth in every of 124 examined images, when the correlation coefficient is used as the measure of association between the template and the image. Based on these templates we created a new set of templates. We selected one particular template and averaged such mouths, which have the correlation coefficient with it over 0.80. The symmetrized mean becomes one of the new templates. Then we selected another of the previous templates, symmetrized it and performed

116

J. Kalina

Fig 2. Left: one of the templates for the mouth. Right: a mouth with a plaster.

the same procedure. The selection of templates from the set of 13 templates was subjective and we have tried to select templates, which would be very different from those selected in previous steps. When the number of these new templates reached 7, it was possible to locate all the mouths in the whole database. Therefore our final set includes 7 mouth templates with different sizes, namely two templates with a beard and five without it. One of the templates is shown in Figure 2 (left). This template has the size 21 × 51 pixels. It locates the mouth in 99 % images of the training database, when using the correlation coefficient r as the measure of similarity between the template and the image. It is also the best template in the following sense. In a particular image the separation between the mouth and all non-mouths can be measured in the form max{r(template, mouth); all positions of the mouth} (2.1) . max{r(template, non-mouth); all non-mouths} The worst separation (2.1) over all the 124 images is a measure of the quality of a template. The best such result is obtained for the non-bearded template in Figure 2 (left). 2.2. Results of the template matching The references on image analysis (for example [6] or [7]) describe the Pearson product-moment correlation coefficient as the standard and only recommendable measure of similarity between the template and the image. The importance of the lips or the central area of the template can be underlined properly if the weighted Pearson product-moment correlation coefficient n ¯W )(yi − y¯W ) i=1 wi (xi − x (2.2) rw (x, y) = ! . n n ¯W )2 ] j=1 [wj (yj − y¯W )2 ] i=1 [wi (xi − x is used with radial weights wR . Let both the template and the weights be matrices R of size n1 × n2 pixels. The idea is to define the radial weight wij of a pixel with coordinates [i, j] inversely proportional to its distance from the midpoint [i0 , j0 ]. Formally let us firstly define 1 ∗ (2.3) wij = . 2 (i − i0 ) + (j − j0 )2 If n1 and n2 are odd numbers, then wi∗0 j0 is not defined and we define additionally wi∗0 j0 = 1. The radial weights wR are defined as (2.4)

∗ wij R = n1 n2 wij k=1

l=1

∗ wk

,

i = 1, . . . , n1 , j = 1, . . . , n2 .

Locating landmarks using templates

117

Table 1 Percentages of images with the correctly located mouth using different templates. Comparison of the Pearson product-moment correlation coefficient, weighted Pearson product-moment correlation coefficient with radial weights and Spearman’s rank correlation coefficient. The templates have different sizes. Template with description All 7 templates 1. Non-bearded 2. Non-bearded 3. Non-bearded 4. Non-bearded 5. Non-bearded 6. Bearded 7. Bearded

r 1.00 0.99 0.93 0.94 0.92 0.95 0.91 0.62

rw 1.00 0.99 0.94 0.91 0.69 0.96 1.00 0.78

rS 0.94 0.83 0.80 0.82 0.83 0.60 0.50 0.43

Size of the template 21 × 51 27 × 41 21 × 41 21 × 41 26 × 41 26 × 56 29 × 56

Weighted correlation coefficient with equal weights corresponds to classical Pearson correlation coefficient without weighting. Now we examine the performance of particular mouth templates in locating the mouth over the training set of 124 images of the database using the classical correlation coefficient r, weighted correlation coefficient rw with radial weights and Spearman’s rank correlation rS as the similarity measures between the template and the image. The results are summarized in Table 1 as percentages of correctly localized mouths over the database with 124 images. The top of the table gives results with 7 templates from Section 2.1. Further, the table contains results of locating the mouth with just one template at the time. The template in Figure 2 (left) with radial weights yields the best results over non-bearded templates in terms of the separation (2.1), where the correlation coefficient r is replaced by weighted correlation rw with radial weights. The improvement in locating the mouth with radial weights compared to equal weights is remarkable in images with a different size or rotation of the face. Other attempts to define templates or other combinations of several templates were less successful. Spearman’s rank correlation coefficient rS has a low performance in locating the mouth. The validation set contains 88 images taken under the same conditions as the training set. The set of 7 templates locate the mouth correctly in 100 % of images of the validation set with both equal and radial weights. The non-bearded template has the performance 100 % also for both equal and radial weights for the weighted correlation coefficient. Robust modifications of the correlation coefficient in the context of image analysis of templates were inspected by [8]; the best performance was obtained with a weighted Pearson product-moment correlation coefficient with weights determined by the least weighted squares regression [11]. The next section 2.3 studies robustness aspects of template matching. Although the literature is void of discussions about robustness aspects in the image analysis context, we will see in Section 3 that also some non-robust classification methods perform very successfully in comparison with template matching with the weighted Pearson product-moment correlation coefficient r. 2.3. Robustness of the results An important aspect of the methods for locating objects in images is their robustness with respect to violations of the standardized conditions. This study goes beyond the study of sensitivity to asymmetry of the image by [8].

118

J. Kalina

To examine the local sensitivity of the classical and weighted correlation coefficient, we study the effect of a small plaster similarly with Figure 2 (right). Grey values in a rectangle of size 3 × 5 pixels are set to 1. Every mouth in the database is modified in this way placing the plaster always on the same position to the bottom right corner of the mouth, below the midpoint of the mouth by 7 to 9 rows and on the right from the midpoint by 16 to 20 columns. We use the set of 7 templates and different weights to search for the mouth in such modified images. Equal weights localize the mouth correctly in 88 % out of the 124 images. Radial weights wR are robust to such plaster and locates the mouth correctly in 100 % of images. Now we study theoretical aspects of the robustness of the template matching. ¯w for the weighted means of the template t and We need the notation t¯w and x n an image (mouth or non-mouth) x respectively, for example x ¯w = i=1 wi xi . The 2 weighted variance Sw (x; w) of x with weights w is defined by 2 (x) = Sw

(2.5)

n

wi (xi − x ¯ w )2

i=1 2 and an analogous notation Sw (t) is used for the weighted variance of grey values of the template t with weights w. The weighted covariance Sw (x, t) between x and t equals n

(2.6) Sw (x, t) = wi (xi − x ¯w )(ti − t¯w ). i=1

The following practical theorem studies the robustness of rw (x, t) with respect to an asymmetric modification of the image, for example a part of the image can have a different illumination, in the matrix notation x∗ = (x∗ij )i,j with x∗ij = xij for j < j0 and x∗ij = xij + ε for j ≥ j0 for some j0 for every i. We study how adding a constant ε to a part of the image effects the weighted correlation coefficient of such image with the original template and original weights. T Here the notation x+ε with x = (x1 , . . . , xn )T stands for (x1 +ε, x2 +ε, . . . , xn +ε) . We also use the following notation. The image x is divided to two parts and I or II denote the sum over the pixels of the first or second part, respectively. Dividing x to three parts, the sums over particular parts are denoted by the image I, II and III . Theorem 2.1. Let t denote the template, x the image and w the weights. We assume these matrices to have the same size. Then the following formulas are true. 1. For x = (x1 , x2 )T and x∗ = (x1 , x2 + ε)T , rw (x∗ , t) = Sw (x, t) + ε II wi ti − εv2 t¯w  , (2.7) = 2 (x) + v (1 − v )ε2 + 2ε(2v − 1)( Sw (t) Sw ¯w ) 2 2 2 II wi xi − v2 x where v2 = II wi . 2. For x = (x1 , x2 )T and x∗ = (x1 + ε, x2 − ε)T , rw (x∗ , t) = Sw (x, t) + ε( I wi ti − II wi ti ) − εv t¯w  , (2.8) = 2 (x) + ε2 (1 − v)2 − 2εv x Sw (t) Sw ¯w + 2ε( I wi xi − II wi xi ) where v =

I

wi −

II

wi .

Locating landmarks using templates

119

3. For x = (x1 , x2 , x3 )T and x∗ = (x1 , x2 + ε, x3 − ε)T , rw (x∗ , t) = Sw (x, t) + εt¯w (w3 − w2 ) + ε( II wi ti − III wi ti )  , (2.9) = 2 (x) + t + ε2 [w + w − (w + w )2 ] Sw (t) Sw 2 3 2 3 where w2 = II wi and w3 = III wi and  



(2.10) t=ε w i xi − w i xi + x ¯w (w3 − w2 ) . II

III

4. Let ε denote a matrix of the same size as x containing constants (εij )ij . Then (2.11)

rw (x + ε, t) =

S (x, t) + Sw (t, ε)  w . 2 (x) + S 2 (ε) + 2S (x, ε) Sw (t) Sw w w

For the special case with the symmetric mouth, symmetric template and symmetric weights we can formulate the following corollary of Theorem 2.1, where we ∗ can express rw (x, t) as a function of rw (x, t). In this special case the weighted corre∗ lation coefficient rw (x, t) always decreases compared to rw (x, t), and the theorem expresses the level of the decrease and thus proves the template matching to be reasonably robust to small modifications of the template. Theorem 2.2. Let us consider a particular template t, image x and weights w. We assume that all these matrices have the same size and are symmetric along the vertical axis. Then the following formulas are true. 1. Let t, x and w have an even number of columns. Let us perform the following modification x∗ of the mouth x. Grey values on one side of the axis are equal to those of x and the remaining are increased by ε compared to those from x. Then the weighted correlation coefficient between the template and the modified mouth x∗ can be expressed by Sw (x) (2.12) rw (x∗ , t) = rw (x, t) ! . 2 (x) + ε2 Sw 4 2. Let t, x and w have an even number of columns. Let us perform the following modification x∗ of the mouth x. Grey values on one side of the axis are increased by ε and the remaining are decreased by ε compared to those from x. Then the weighted correlation coefficient between the template and the modified mouth x∗ can be expressed by Sw (x) (2.13) rw (x∗ , t) = rw (x, t)  . 2 (x) + ε2 Sw 3. Let us perform the following modification x∗ of the mouth x. For a specific number k in {0, 1, . . . , j/2}, grey values in columns 1, . . . , k are increased by ε and in columns k − j + 1, . . . , k are decreased by ε compared to those from x. The remaining grey values are equal to those in x. Then the weighted correlation coefficient between the template and the modified mouth x∗ can be expressed by Sw (x) (2.14) rw (x∗ , t) = rw (x, t)  , 2 Sw (x) + 2vε2 n k where v = i=1 j=1 wij .

120

J. Kalina

Table 2 Percentages of correctly classified images using different classification methods implemented in R software. The classification rule is learned over the training data set with 124 images and further applied to the validation set with 88 images. The template matching uses 7 templates with radial weights. Classification method Linear discriminant analysis Support vector machines Hierarchical clustering Classification tree Neural network – multilayer Neural network – Kohonen Template matching

Results over the training set validation set 1.00 1.00 0.90 0.85 0.53 0.97 0.90 1.00 1.00 0.98 0.96 1.00 1.00

R library neural e1071 cluster tree neural kohonen -

3. Locating the mouth using classification methods This section compares classification methods applied to locating the mouth in the original images. This has not been inspected in this context without the usual prior steps of dimension reduction and feature extraction because of a high computational complexity. Locating the mouth in the whole images without a preliminary reduction of dimension is a task with an enormous computational complexity. Therefore we consider the mouth and only one non-mouth from every image of the training set with 124 images, always with the size 21 × 51 pixels; this is the size of the template in Figure 2 (left). We select such non-mouth which has the largest correlation coefficient with the template in Figure 2 (left). A shifted mouth was not considered to be a non-mouth, so the non-mouths are required be at least five pixels distant (in the Euclidean sense) from the mouth. All mouths and non-mouths are selected in such position that the correlation coefficient with the template in Figure 2 (left) is larger than the correlation coefficient between the template and the same image (mouth or non-mouth) shifted aside; this ensures the images to have centered in the same way, treating the fact that the midpoint of the template does not correspond to the midpoint of the lips. Such training database for the next work contains 248 images (a group of 124 mouths and a group of 124 non-mouths) with the aim to classify these images to groups. We apply linear discriminant analysis, support vector machines, hierarchical clustering, classification trees and neural networks to this task. These methods were selected as standard for classification analysis (see [5]). We point out that the dimension of the data much larger than the number of data. Now we discuss the results of particular methods summarized in Table 2, which describes the results of the classification over the training set with 248 images. The resulting classification rule was further used on the validation set to examine the reliability of the classification rules, which had been learned over the described training set. The validation set was created from the original validation database of 88 images in the same way again as a set containing the mouth and only one non-mouth from each image in the same way as before, so it contains 176 images (88 mouths and 88 non-mouths). We use additional libraries of the R software (http://cran.r-project.org) for the computation of standard classification methods; the libraries are listed in Table 2. The linear discriminant analysis yielding 100 % correct results consists in computing the classification score and classifying based on the inner product of the image with the score. The classification yields correct results without error. In-

Locating landmarks using templates

121

fluential values of the score appear in the top corners. This corresponds to the intuition, because the top corners have the lowest variability in the images of both mouths and non-mouths. Results of the support vector machines classifier with a radial basis kernel were not convincing, although the classification is based on 136 support vectors, which indicates the complexity of this classification problem. Such classification rule is based on 136 closest images to the nonlinear boundary between the group of mouths and the group of non-mouths. The hierarchical clustering with the average linkage method with the Euclidean distance measure giving two clusters as the output yields poor output. One cluster contained 58 non-mouths and the other contained 190 remaining images, namely 66 non-mouths and all 124 mouths. The method is not able to classify correctly such worst non-mouths which visually resemble a mouth. While there is a much larger variability among the non-mouths than among mouths, the method perceives the mouths to be a large and rather heterogeneous group. Non-mouths very different from mouths are classified as non-mouths, while problematic non-mouths are classified as mouths. Hierarchical clustering is an agglomerative (bottom-up) method starting with individual objects as individual clusters and merges recursively a selected pair of clusters into a single cluster; therefore it does not allow to classify a new observation from the validation set. The classification tree is based only on 6 pixels, which can be found outside lips. It relies too strongly on specific properties of the training set and can hardly be accepted as a practical classification rule. For neural networks we use two different approaches. The multilayer perceptron networks with 4 neurons as an example of supervised methods yields 100 % correct results in classifying the images as mouths or non-mouths. Kohonen self-organizing maps are an example of unsupervised methods based on mapping the multivariate data down onto a two-dimensional grid, while its size is a selectable parameter. We were not able to find any value of this size, for which 100 % correct results would be obtained. The validation set also contains one atypical face. This is an older lady with an unusually big mouth, which is at the same time affected by small rotation, nonsymmetry and a light grimace. Nevertheless the classifiers either localize the mouth correctly in this image, or they fail also in several other faces (Table 2). 4. Conclusions The aim of this work was to study different methods for the automatic localization of the mouth in two-dimensional grey-scale images of faces. Standard approaches start with an initial transformation of the image, for example Procrustes superimposition or even principal components analysis used in the right circumstances. These reduce the dimension of the image, so that the ultimate analysis is done on shape and shape alone. However the templates applied to raw data have not been examined from the statistical point of view. Chapter 2 of this papers describes our approach to the construction of templates. A set of 7 mouth templates is able to localize the mouth in all 124 images of the training database; here the weighted Pearson product-moment correlation coefficient was used with radial weights. It is presented theoretically how this weighted correlation coefficient varies for distorted images. Chapter 3 presents an experiment comparing different classification methods. 
Classification trees are rather controversial for these data; they are based on a very

122

J. Kalina

small number of pixels. This instability could be solved by using large patches (e.g. patch mean) or some other features (e.g. Haar-like features) rather than pixel intensities. Neural networks represent a black box, for which we are not able to analyze the result in a transparent and explanatory way. Results of support vector machines (SVM) and hierarchical clustering were not satisfactory. The SVM depend on several parameters to be tuned to perform optimally; an inexperienced practitioner using default parameter settings would however not obtain successful results. Therefore we praise template matching, linear discriminant analysis and multilayer neural networks, which yielded correct results in 100 % of images of both the training and validation databases. Non-robust methods turn out to be able to attain the best results, which is the case of the template matching and linear discriminant analysis. At the same time template matching and linear discriminant analysis allow for a nice and clear interpretation. The author is thankful to two anonymous referees for valuable comments and tips for improving the paper. References ¨ hringer, S., Vollmar, T., Tasse, C., Wu ¨ rtz, R. P., Gillessen[1] Bo Kaesbach, G., Horsthemke, B., and Wieczorek, D. (2006). Syndrome identification based on 2D analysis software. Eur. J. Human Genet. 14 1082– 1089. [2] Bookstein, F. L. (1991). Morphometric tools for landmark data. Geometry and biology. Cambridge University Press, Cambridge. [3] Dryden, I. L. and Mardia, K. V. (1999). Statistical shape analysis. John Wiley, New York. [4] Farkas, L. (1994). Anthropometry of the head and face. Raven Press, New York. ¨ rdle, W. and Simar, L. (2003). Applied multivariate statistical analysis. [5] Ha Springer, Berlin. [6] Jain, A. K. (1989): Fundamentals of digital image processing. Prentice-Hall, Englewood Cliffs. [7] James, M. (1987): Pattern recognition. BSP Professional books, Oxford. [8] Kalina, J. (2007). Locating the mouth using weighted templates. Journal of Applied Mathematics, Statistics and Informatics 3 111–125. ¨ rtz, R. P., Malsburg von der, C., and [9] Loos, H. S., Wieczorek, D., Wu Horsthemke, B. (2003). Computer-based recognition of dysmorphic faces. Eur. J. Human Genet. 11 555–560. [10] Viola P. and Jones M.J. (2004). Robust real-time face detection. Int. Journal of Comp. Vision 57 137–154. ´ (2001). Regression with high breakdown point. In J. Antoch, [11] V´ıˇ sek, J. A. G. Dohnal (Eds.): ROBUST 2000, Proceedings of the 11-th summer school ˇ ˇ JCMF, Neˇctiny, September 11-15, 2000, JCMF and Czech Statistical Society, Prague, 324–356. ¨ ger, N., and Malsburg von der, C. [12] Wiskott, L., Fellous, J. M., Kru (1997). Face recognition by elastic bunch graph matching. IEEE Trans. Pattern Anal. Machine Intel. 19 775–779. [13] Yang, M.-H., Kriegman, D. J., and Ahuja, N. (2002) Detecting faces in images: A survey. IEEE Trans. Pattern Anal. and Machine Intel. 24 34–58.

IMS Collections Nonparametrics and Robustness in Modern Statistical Inference and Time Series Analysis: A Festschrift in honor of Professor Jana Jureˇ ckov´ a Vol. 7 (2010) 123–133 c Institute of Mathematical Statistics, 2010  DOI: 10.1214/10-IMSCOLL713

On the asymptotic distribution of the analytic center estimator Keith Knight∗ University of Toronto Abstract: The analytic center estimator is defined as the analytic center of the so-called membership set. In this paper, we consider the asymptotics of this estimator under fairly general assumptions on the noise distribution.

1. Introduction Consider the linear regression model Yi = xTi β + εi

(1.1)

(i = 1, · · · , n)

where xi is a vector of covariates (of length p) whose first component is always 1, β is a vector of unknown parameters and ε1 , · · · , εn are i.i.d. random variables with |εi | ≤ γ0 where it is assumed that γ0 is known. We will not necessarily require that the bound γ0 be tight although there are advantages in estimation if it is known that the noise is “boundary visiting” in the sense that P (|εi | ≤ γ0 − ) < 1 for all  > 0. Given the bound γ0 on the absolute errors, we can define the so-called membership set (Schweppe [19]; Bai et al., [4])   (1.2) Sn = φ : −γ0 ≤ Yi − xTi φ ≤ γ0 for all i = 1, · · · , n , which contains all parameter values consistent with the assumption that |εi | ≤ γ0 . There is a considerable literature on estimation based on the membership set in different settings; see, for example, Milanese and Belforte [16], M¨akil¨ a [15], Tse et al. [21], and Ak¸cay et al. [3]. The membership set Sn in (1.2) is a bounded convex polyhedron and we can  is use some measure of its center to estimate β. The analytic center estimator β n defined to be the maximizer of the concave objective function gn (φ)

=

n

  ln γ02 − (Yi − xTi φ)2

i=1

(1.3)

=

n



 ln(γ0 − Yi + xTi φ) + ln(γ0 + Yi − xTi φ) .

i=1 ∗ This

research was supported by a grant from the Natural Sciences and Engineering Research Council of Canada 1 Department of Statistics, University of Toronto, 100 St. George St., Toronto, ON M5S 3G3 Canada e-mail: [email protected] AMS 2000 subject classifications: Primary 62J05; secondary 62G20. Keywords and phrases: analytic center estimator, set membership estimation, Poisson processes. 123

124

K. Knight

 is the analytic center (Sonnevend, [20]) of the membership set Sn . The idea β n is that the logarithmic function essentially acts as a barrier function that forces the estimator away from the boundary of Sn and thus makes the constraint that the estimator must lie in Sn redundant. In certain applications, the analytic center estimator is computationally convenient since it can be computed efficiently in “online” applications, more so other estimators based on the membership set such as the Chebyshev center or the maximum volume inscribed ellipsoid estimators. Bai et al. (2000) derive some convergence results for the analytic center estimator but do not give its limiting distribution. In addition, Bai et al. [4], Ak¸cay [2], and Kitamura et al. [11] discuss properties of the membership set, showing under different conditions that the membership set shrinks to a single point as the sample size increases.  satisfies The maximizer of gn in (1.3) lies in the interior of Sn and hence β n (1.4)

n

i=1

 Yi − xTi β n x = 0. 2  )2 i γ0 − (Yi − xTi β n

The “classical” approach to asymptotic theory is to approximate (1.4) by a linear √  √  n(β n − β) via function of n(β n − β) and derive the limiting distribution of this approximation. However, expanding (1.4) in a Taylor series around β, it is easy to see that if the distribution of {εi } has a sufficiently large concentration of probability in a neighbourhood of ±γ0 then asymptotic normality will not hold. Intuitively, we should have a faster convergence rate in such cases but a different approach is needed to prove this. In this paper, we will consider the asymptotic distributions of both the membership set and the analytic center estimator under the assumption that the noise distribution is regularly varying at the boundaries ±γ0 of the error distribution. In section 2, we provide some of the necessary technical foundation for section 3 where we derive the asymptotics of the membership set and the analytic center estimator. 2. Technical preliminaries Define F to be the distribution function of {εi }; we then define non-decreasing functions G1 and G2 on [0, 2γ0 ] by (2.1)

G1 (t)

=

1 − F (γ0 − t)

(2.2)

G2 (t)

=

F (−γ0 + t).

We will assume that both G1 and G2 are regularly varying at 0 with the same parameter of regular variation α and that G1 and G2 are “balanced” in a neighbourhood of 0. More precisely, for each x > 0, lim t↓0

Gk (tx) = xα Gk (t)

and lim t↓0

for k = 1, 2

G1 (t) =κ G1 (t) + G2 (t)

where 0 < κ < 1. Thus for some sequence of constants {an } with an → ∞ and some α > 0, we have (2.3) (2.4)

lim nG1 (t/an )

=

κtα

lim nG2 (t/an )

=

(1 − κ)tα

n→∞ n→∞

Analytic center estimator

125

where 0 < κ < 1. The parameter α describes the concentration of probability mass close to the endpoints ±γ0 ; this concentration increases as α becomes smaller. The type of convergence as well as the rate of convergence are determined by α. If α > 2, we can approximate the left hand side of (1.4) by a linear function and obtain asymptotic normality using the classical argument. On the other hand, when α < 2, the limiting distribution is determined by the the errors lying close to the endpoints ±γ0 ; in particular, given the conditions (2.1) – (2.4) on the distribution F of {εi }, it is straightforward to derive a point process convergence result for the number of {εi } lying within O(a−1 n ) of ±γ0 . We will make the following assumptions about the errors {εi } and the design {xi }: (A1) {εi } are i.i.d. random variables on [−γ0 , γ0 ] with distribution function F where G1 and G2 defined in (2.1) and (2.2) satisfy (2.3) and (2.4) for some sequence {an }, α > 0, and 0 < κ < 1. (A2) There exists a probability measure μ on Rp such that for each set B with μ(∂B) = 0, n 1

lim I(xi ∈ B) = μ(B). n→∞ n i=1 Moreover, the mass of μ is not concentrated on a lower dimensional subspace of Rp . Under conditions (A1) and (A2), it is easy to verify that the point process (2.5)

Mn (A × B)

=

n

I {an (γ0 − εi ) ∈ A, −xi ∈ B}

i=1

+

n

I {an (γ0 + εi ) ∈ A, xi ∈ B}

i=1

converges in distribution with respect to the vague topology on measures (Kallenberg, [10]) to a Poisson process M whose mean measure is given by - . (2.6) E[M (A × B)] = α tα−1 dt μ ¯(B) A

where (2.7)

μ ¯(B) = κμ(−B) + (1 − κ)μ(B).

We can represent the points of the limiting Poisson process M in terms of two independent sequences of i.i.d. random variables {Ei } and {X i } where {Ei } are exponential with mean 1 and {X i } have the measure μ ¯ defined in (2.7). For a given value of α, we then define (2.8)

Γi = E1 + · · · + Ei

for i ≥ 1.

The points of the Poisson process M in (2.5) (with mean measure given in (2.6)) 1/α are then represented by {(Γi , X i ) : i ≥ 1}. In the case where the support of {xi } (and of the limiting measure μ) is unbounded, we need to make some additional assumptions; note that (A3) and (A4) below hold trivially (given (A1) and (A2)) if {xi } are bounded.

126

K. Knight

(A3) G1 and G2 defined in (2.1) and (2.2) satisfy n {G1 (t/an ) + G2 (t/an )} = tα {1 + rn (t)} where for any u,

max |rn (xTi u)| → 0.

1≤i≤n

(A4) For the measure μ defined in (A2), 1

xi α → n i=1 n

xα μ(dx) < ∞.

Moreover, 1 max xi α → 0. n 1≤i≤n  maximizes a concave objective function or, equivalently, As stated above, β n minimizes a convex objective function. The key tool that will be used in deriving the  is the notion of epi-convergence in distribution (Geyer, limiting distribution of β n [9]; Pflug, [18]; Knight, [12]; Chernozhukov, [7]; Chernozhukov and Hong, [8]) and point process convergence for extreme values (Kallenberg, [10]; Leadbetter et al, [13]). 3. Asymptotics It is instructive to first consider the asymptotic behaviour of the membership set as a random set. Define a centered and rescaled version of Sn defined in (1.2): Sn (3.1)

an (Sn − β) n 2   u : an (εi − γ0 ) ≤ uT xi ≤ an (εi + γ0 ) . =

=

i=1

Sn

Note that is closely related to the point process Mn defined in (2.5). The following result describes the asymptotic behaviour of {Sn } as a sequence of random closed sets using the topology induced by Painlev´e–Kuratowski convergence (Molchanov, 2005). Since we have a finite dimensional space, it follows that the Painlev´e–Kuratowski topology coincides with the Fell (hit or miss) topology (Beer, d

[6]); thus Sn −→ S  if P (Sn ∩ K = ∅) → P (S  ∩ K = ∅) for all compact sets K such that P (S  ∩ K = ∅) = P (S  ∩ int K = ∅) . It turns out that the convexity of the random sets {Sn } provides a very simple sufficient condition for checking convergence in distribution. Lemma 3.1. Assume the model (1.1) and conditions (A1) – (A4). If Sn is defined as in (3.1) then (3.2)

Sn −→ S  = d

∞  2

1/α

u : u T X i ≤ Γi



i=1

where {Γi }, {X i } are independent sequences with Γi defined in (2.8) and {X i } i.i.d. with distribution μ ¯ defined in (2.7).

Analytic center estimator

127

Proof. First, note that S  has an open interior with probability 1. To see this, define S 

=

∞  2

1/α

u : uX i  ≤ Γi

i=1

3

=

1/α

Γ u : u ≤ min i i X i 



4

and note that S  ⊂ S  . Using the properties of the Poisson process M whose mean measure is defined in (2.6), we have ) *   1/α Γi P min ¯(dx) > r = exp −rα xα μ i X i  for r ≥ 0. Thus S  contains an open set with probability 1 and therefore so must S  . The fact that S  contains an open set makes proof of convergence in distribution very simple; we simply need to show that P (u1 ∈ Sn , · · · , uk ∈ Sn ) → P (u1 ∈ S  , · · · , uk ∈ S  ) for any u1 , · · · , uk . Defining x+ = xI(x > 0) and x− = −xI(x < 0), we then have P (u1 ∈ Sn , · · · , uk ∈ Sn )    . n  −1 T −1 T 1 − G2 an max (uj xi )+ − G1 an min (uj xi )− = 1≤j≤k

i=1

1≤j≤k



  3 α  α 4  T T μ(dx) + κ min uj x exp − (1 − κ) max uj x

=

4 3  α T μ ¯(dx) exp − max uj x

=

1≤j≤k

1≤j≤k



+

1≤j≤k



+ 

P (u1 ∈ S , · · · , uk ∈ S ) ,

which completes the proof. Note that S  is bounded with probability 1; this follows since for any u = 0, P (X Ti u > 0) ≥ min(κ, 1 − κ) > 0, hence P (X Ti u > 0 infinitely often) = 1. Thus with probability 1, for each u ∈ S  there exists j such that such that 0 < X Tj u ≤ Γj and so for t sufficiently large tu ∈ S  . Lemma 3.1 says that points in the membership set lie within Op (a−1 n ) of β and  (or indeed any estimator based on the therefore the analytic center estimator β n  − β = Op (a−1 ). Since an = n1/α L(n) it follows membership set) must satisfy β n n that we have a faster than Op (n−1/2 ) convergence rate when α < 2. On the other hand, if α > 2 then n1/2 /an → ∞; fortunately, in these cases, it is typically possible to achieve Op (n−1/2 ) convergence. Theorem 3.1. Assume the model (1.1) and conditions (A1) – (A4) for some α ≥ 2 and assume that E[(γ0 − εi )−1 ] = E[(γ0 + εi )−1 ].  maximizes (1.3). Suppose that β n

128

K. Knight



 − β) −→ N (0, σ 2 C −1 ) where n(β n 5 2 E[(γ02 + ε2i )/(γ02 − ε2i )2 ] σ 2 = Var[εi /(γ02 − ε2i )] and C = xxT μ(dx).

(i) If α > 2 then

d

(ii) If α = 2 then (1)

bn

 − β) −→ N (0, C −1 ) (β n d

(2) bn (1)

where {bn } satisfies

n 1 γ02 + ε2i p 2 − ε2 )2 −→ 1 (1) (γ bn i=1 0 i

(2)

and {bn } satisfies n 1

(2) bn i=1

γ02

εi d xi −→ N (0, C −1 ). − ε2i

The proof of Theorem 3.1 is standard and will not be given here. Note that conditions (A2) – (A4) are much stronger than necessary for Theorem 3.1 to hold. For example, we need only assume that 1

xi xTi n i=1

→ C

1 max xi 2 n 1≤i≤n

→ 0

n

and

for asymptotic normality to hold. More generally, Theorem 3.1 also holds in the case where the bounds ±γ0 are overly conservative in the sense that for some  > 0, P (−γ0 +  ≤ εi ≤ γ0 − ) = 1. In this case, if the model (1.1) contains an intercept (that is, one element of xi is always 1) then we can rewrite the model (1.1) as Yi

=

θ + xTi β + (εi − θ)

=

xTi β  + εi

(i = 1, · · · , n)

where εi = εi −θ. Then there exists θ such that {εi } satisfies the moment conditions in Theorem 3.1 and so the proof of Theorem 3.1 will go through as before.  is highly dependent on the limiting When α < 2, the limiting behaviour of β n Poisson process M (with mean measure given in by (2.6)). In particular, the sequences of random variables {(γ0 − εi )−1 } and {(γ0 + εi )−1 } lie in the domain of a stable law with index α and so it is not surprising to have non-Gaussian limiting distributions. Theorem 3.2. Assume the model (1.1) and conditions (A1) – (A4) for some 0 <  maximizes (1.3). Define {Γi } and {X i } as in Lemma α < 2 and assume that β n  3.1 and S as in (3.2).

Analytic center estimator

129

d  − β) −→ (a) If α < 1 then an (β U where U maximizes n * ) ∞

X Ti u ln 1 − 1/α Γi i=1

over u ∈ S  . (b) If α = 1 and na−1 n E

+

 ,   εi  εi  ≤ an  →0 I  γ 2 − ε2  γ02 − ε2i 0 i

 − β) −→ U where U maximizes then an (β n * 3 ) ) *4 ∞ ∞



X Ti u X Ti u X Ti u 1/α −  −E I(Γi ≥ 1) 1/α 1/α 1/α Γi Γi Γi i=1 i=1 d

over u ∈ S  where (x) = ln(1 − x) + x. (c) If 1 < α < 2 and E[(γ0 − εi )−1 ] = E[(γ0 + εi )−1 ] d  − β) −→ then an (β U where U maximizes n * 3 *4 ) ) ∞ ∞



X Ti u X Ti u X Ti u −  −E 1/α 1/α 1/α Γi Γi Γi i=1 i=1

over u ∈ S  where (x) = ln(1 − x) + x.  − β) maximizes the concave function Proof. an (β n Zn (u) =

n

 ln 1 +

i=1

xTi u an (γ0 − εi )



 + ln 1 −

xTi u an (γ0 + εi )

.

subject to u ∈ Sn defined in (3.1). Since the limiting objective function is finite on an open set (since S  contains an open set with probability 1), it suffices to show finite dimensional weak convergence of Zn . Note that we can write (for u ∈ Sn ), Zn (u) =

  xT u Mn (dw × dx) ln 1 − w

where Mn is defined in (2.5). For α < 1, we approximate ln(1 + xT u/w) by a sequence of bounded functions {gm (w, x; u)}. Following Lepage et al. [14], we have gm (w, x; u)

d

−→ →



i=1 ∞

1/α

gm (Γi

, X i ; u) 1/α

ln(1 − X Ti u/Γi

as n → ∞ )

with probability 1 as m → ∞

i=1

and  , +     T   ln(1 + x u/w) − gm (w, x; u) Mn (dw × dx) >  = 0. lim lim sup P  m→∞ n→∞

130

K. Knight

For 1 ≤ α < 2, a similar argument works by writing ln(1 + xT u/w) = xT u/w + (xT u/w) and applying the argument used for α < 1 to

n - 

 − (x u/w) Mn (dw × dx) = T

i=1

xTi u an (γ0 − εi )



 +

xTi u an (γ0 + εi )

. .

The result now follows by noting that, in each case, the limiting objective function Z has a unique maximizer on the set S  ; to see this, note that Z is strictly concave on S  and that as u → ∂S  , Z(u) → −∞. In Theorem 3.2, note that no moment condition is needed when α < 1. In this  − β), U , can be interpreted as the analytic center of the case, the limit of an (β n  random set S , and thus P (U ∈ int S  ) = 1. In contrast, we require a moment condition for 1 ≤ α < 2 (such as E[(γ0 − εi )−1 ] = E[(γ0 + εi )−1 ]

(3.3)

for α > 1) in order to have P (U ∈ int S  ) = 1. What happens if the moment condition, for example (3.3), fails? Theorem 3.3 below states that the limiting dis − β) is concentrated the vertices of the limiting membership tribution of an (β n  set S . Theorem 3.3. Assume the model (1.1) and conditions (A1) – (A4) for some α ≥ 1  maximizes (1.3). Define S  as in (3.2) with {Γi } and {X i } as and assume that β n in Lemma 3.1. If for some (non-negative) sequence {bn } (bn = n for α > 1) b−1 n

n



 p (γ0 − εi )−1 − (γ0 + εi )−1 −→ ω = 0

i=1 d  − β) −→ U where U maximizes then an (β n ω uT x μ(dx) subject to u ∈ S  .

 − β) maximizes Proof. an (β n   . n -  xTi u xTi u an

+ ln 1 − ln 1 + Zn (u) = bn i=1 an (γ0 − εi ) an (γ0 + εi ) for u ∈ Sn . Defining (x) = ln(1 − x) + x as before, we have (for u ∈ Sn ), Zn (u)

=

n  1 T  x u (γ0 − εi )−1 − (γ0 + εi )−1 bn i=1 i   . n -  xTi u an

xTi u + +  − bn i=1 an (γ0 − εi ) an (γ0 + εi )

n  1 T  xi u (γ0 − εi )−1 − (γ0 + εi )−1 + op (1) bn i=1 p −→ ω uT x μ(dx)

=

Analytic center estimator

131

noting that an = o(bn ) and applying the results of Adler and Rosalsky [1]. Since S  is bounded, the linear function ω uT x μ(dx) has a finite maximum on S  . Uniqueness follows from the assumption that the measure μ puts zero mass on lower dimensional subsets. For α > 1, ω = E[(γ0 − εi )−1 − (γ0 + εi )−1 ] while for α = 1, ω is typically first moment of an appropriately truncated version of (γ0 − εi )−1 − (γ0 + εi )−1 where the truncation depends on the slowly varying component of the distribution function  − β) depends on ω only F near ±γ0 . Note that the limiting distribution of an (β n via its sign. Like κ, ω is a measure of the relative weight of the distribution F near its endpoints ±γ0 . However, they are not necessarily related in the sense that for a given value of κ, ω can be positive or negative; for example, κ > 1/2 does not imply that ω > 0. The following implication of Theorem 3.3 is interesting: Even  lies in the interior of Sn (and thus an (β  − β) lies in the interior of S  ), though β n n n the limiting distribution is concentrated on the boundary of S  . It is also interesting to compare the limiting distribution of the analytic center estimator to those of other estimator, for example, the least squares estimator constrained to the membership set and the Chebyshev center estimator. The con minimizes strained least squares estimator β n n

(Yi − xTi φ)2

i=1

depend on whether or not E(εi ) = subject to φ ∈ Sn . The asymptotics of β n

0. If E(εi ) = 0 then an (β n − β) converges in distribution to the maximizer of E(εi ) xT u μ(dx) subject to u ∈ S  similar to the result of Theorem 3.3 with ω defined differently; note that this result holds for any α > 0. Moreover, for α ≥ 1,  and the constrained least squares estimator β

the analytic center estimator β n n have the same limiting distribution if both ω and E(εi ) are non-zero and have the same sign. On the other hand, when E(εi ) = 0, the type of limiting distribution

has the same depends on α, specifically whether or not α < 2. If α ≥ 2 then β n limiting distribution as the unconstrained least squares estimator; for example, for α > 2, we have √ d

− β) −→ n(β N (0, Var(εi )C −1 ) n

− β) converges in distriwhere C is defined as in Theorem 3.1. For α < 2, an (β n T  bution to the maximizer of W u subject to u ∈ S where W ∼ N (0, C) and W is independent of the Poisson process defining S  . Similarly, we can derive the asymptotics for the Chebyshev center estimator, defined as the center of largest radius ball (in the Lr norm) contained within Sn ;

maximizes δ subject to the constraints β n xTi φ + xi q δ −xTi φ + xi q δ

≤ ≤

Yi + γ 0 γ0 − Yi

for i = 1, · · · , n for i = 1, · · · , n

, Δn ) is the solution of this linear program where q is such that r−1 +q −1 = 1. If (β n d

− β), an Δn ) −→ (U , Δ0 ) where the limit maximizes δ subject to then (an (β n

1/α

uT X i + δX i q ≤ Γi

for i ≥ 1.

Note that P (U ∈ int S  ) = 1 without any moment conditions. The downside of the Chebyshev center estimator is that it is somewhat computationally more complex than the analytic center estimator.

132

K. Knight

Acknowledgements The author would like to thank the referees for their very useful comments on this paper. References [1] Adler, A. and Rosalsky, A. (1991). On the weak law of large numbers for normed weighted sums of I.I.D. random variables. International Journal of Mathematics and Mathematical Sciences. 14 191–202. [2] Akc ¸ ay, H. (2004). The size of the membership-set in a probabilistic framework. Automatica 40 253–260. [3] Akc ¸ ay, H., Hjalmarsson, H. and Ljung, L. (1996). On the choice of norm in system identification. IEEE Transactions on Automatic Control. 41 1367–1372. [4] Bai, E. W., Cho, H. and Tempo, R. (1998). Convergence properties of the membership set. Automatica 34 1245–1249. [5] Bai, E. W., Fu, M., Tempo, R., and Ye, Y. (2000). Convergence results of the analytic center estimator. IEEE Transactions on Automatic Control 45 569–572. [6] Beer, G. (1993). Topologies on Closed and Closed Convex Sets. Kluwer, Dordrecht. [7] Chernozhukov, V. (2005). Extremal quantile regression. Annals of Statistics 33 806–839. [8] Chernozhukov, V. and Hong, H. (2004). Likelihood estimation and inference in a class of non-regular econometric models. Econometrica 77 1445–1480. [9] Geyer, C. J. (1994). On the asymptotics of constrained M-estimation. Annals of Statistics 22 1993–2010. [10] Kallenberg, O. (1983). Random Measures. (third edition) Akademie-Verlag. [11] Kitamura, W., Fujisaki, Y., and Bai, E.W. (2005). The size of the membership set in the presence of disturbance and parameter uncertainty. Proceedings of the 44th IEEE Conference on Decision and Control 5698–5703. [12] Knight, K. (2001). Limiting distributions of linear programming estimators. Extremes 4 87–104. ´n, H. (1983). Extremes [13] Leadbetter, M. R., Lindgren, G. and Rootze and Related Properties of Random Sequences and Processes. Springer, New York. [14] Lepage, R., Woodroofe, M., and Zinn, J. (1981). Convergence to a stable distribution via order statistics. Annals of Probability 9 624–632. ¨ kila ¨ , P. M. (1991). Robust identification and Galois sequences. Interna[15] Ma tional Journal of Control 54 1189–1200. [16] Milanese, M. and Belforte, G. (1982). Estimation theory and uncertainty intervals evaluation in the presence of unknown but bounded errors – linear families of models and estimators. IEEE Transactions on Automatic Control 27 408–414. [17] Molchanov, I. (2005). Theory of Random Sets. Springer, London. [18] Pflug, G. Ch. (1995). Asymptotic stochastic programs. Mathematics of Operations Research 20 769–789. [19] Schweppe, F. C. (1968). Recursive state estimation: unknown but bounded errors and system inputs. IEEE Transactions on Automatic Control 13 22–28.

Analytic center estimator

133

[20] Sonnevend, G. (1985). An analytic center for polyhedrons and new classes of global algorithms for linear (smooth convex) programming. In Lecture Notes in Control and Information Sciences 84 866–876. [21] Tse, D. N. C., Daleh, M. A. and Tsitsiklis, J. N. (1993). Optimal asymptotic identification under bounded disturbances. IEEE Transactions on Automatic Control 38 1176–1190.

IMS Collections Nonparametrics and Robustness in Modern Statistical Inference and Time Series Analysis: A Festschrift in honor of Professor Jana Jureˇ ckov´ a Vol. 7 (2010) 134–142 c Institute of Mathematical Statistics, 2010  DOI: 10.1214/10-IMSCOLL714

Rank tests for heterogeneous treatment effects with covariates Roger Koenker∗ Abstract: Employing the regression rankscore approach of Gutenbrunner and Jureˇ ckov´ a [2] we consider rank tests designed to detect heterogeneous treatment effects concentrated in the upper tail of the conditional response distribution given other covariates.

1. Introduction Heterogeneous treatment response has long been recognized as an essential feature of randomized controlled experiments. The Neymann [11] framework of “potential outcomes” foreshadows modern developments by Rubin [13] and others acknowledging the right of each experimental subject to have a distinct response to treatment. Statistical inference based on ranks has played an important role in these developments. Lehmann [9] describes several heterogeneous treatment effect models and derives locally optimal rank tests for them. Rosenbaum [12] has reemphasized the relevance of heterogeneity of treatment effects in biomedical applications and stressed the rank based approach to inference. He et al. [5] have recently proposed tests based on “expected shortfall” designed to detect response in the upper or lower tail of the response distribution after adjusting for covariate effects. Rank tests for the treatment-control model have focused almost exclusively on the two sample problem without considering possibly confounding covariate effects. In this paper we will describe some new rank tests designed for several heterogeneous treatment effect models. The tests employ the regression rankscores introduced by Gutenbrunner and Jureˇckov´ a [2] and therefore are able to cope with additional covariate effects. 2. Quantile Treatment Effects For the two sample setting Lehmann [10] introduced a general model of treatment response in the following way: Suppose the treatment adds the amount Δ(x) when the response of the untreated subject would be x. Then the distribution G of the treatment responses is that of the random variable X+Δ(X) where X is distributed according to F . Department of Economics, 410 David Kinley Hall, 1407 W. Gregory, MC-707, Urbana, IL 61801, USA. e-mail: [email protected] ∗ Partially supported by NSF Grant SES 08-50060. The author would like to thank Xuming He and Ya-Hui Hsu for valuable conversations on the subject of this paper. AMS 2000 subject classifications: Primary 62G10; secondary 62J05. Keywords and phrases: regression rankscores, rank test, quantile treatment effect 134

Heterogeneous treatment effects

135

Thus, F (x) = G(x+Δ(x)) so Δ(x) is the horizontal distance between the control distribution, F , and the treatment distribution, G, Δ(x) = G−1 (F (x)) − x. Plotting Δ(x) versus x yields what is sometimes called the “shift plot.” For present purposes we find it more convenient to evaluate Δ(x) at x = F −1 (τ ) and define the quantile treatment effect as δ(τ ) = G−1 (τ ) − F −1 (τ ) The average treatment effect can be obtained by simply integrating: 1 ¯ δ(τ ) dτ = (G−1 (τ ) − F −1 (τ )) dτ ≡ μ(G) − μ(F ), δ= 0

for some τ0 and τ1 in (0, 1). But mean treatment may obscure many important features of δ(τ ). Only in the pure location shift case do we not lose something by the aggregation. We now consider three simple models of the quantile treatment effect. Partial Location Shift: Rather than assuming that the treatment induces a constant effect δ(τ ) = δ0 over the entire distribution we may instead consider a partial form of the location shift restricted to an interval δ(τ ) = δ0 I(τ0 < τ < τ1 ). Thus, the shift may occur only in the upper tail, or near the median, or of course, over all of (0, 1). Partial Scale Shift: Similarly, we may consider treatment effects that correspond to scale shifts of the control distribution over a restricted range, δ(τ ) = δ0 I(τ0 < τ < τ1 )F −1 (τ ). Imagine stretching the right tail of the control distribution beyond some specified τ0 quantile, while leaving the distribution below F −1 (τ0 ) unperturbed. Lehmann Alternatives: expressed as G(x) = F (x)γ

The family of Lehmann (1953) alternatives may be or

1 − G(x) = (1 − F (x))1/γ ,

and has been widely considered in the literature in part perhaps because it is closely associated with the Cox proportional hazard model. In the two sample version of the Cox model, when 1/γ = k, an integer, the treatment distribution is that of a random variable taking the minimum of k trials from the control distribution. The quantile treatment effect for the Cox form of the Lehmann alternative is easily seen to be, (1)

δ(τ ) = F −1 (1 − (1 − τ )γ ) − F −1 (τ ).

Rosenbaum [12] and Conover and Salsburg [1] argue that the Lehmann family offers an attractive model for two sample treatment-control experiments in which

136

R. Koenker

a substantial fraction of subjects fail to respond to treatment, but the remainder exhibit a significant response. Each of the foregoing semi-parametric alternatives are intended to capture to some degree the idea that the treatment strongly influences the response, but in some restrictive way that makes conventional tests for a full location shift unsatisfactory. As in the motivating example of He et al. [5] involving treatments for rheumatoid arthritis there is a need for a more targeted approach capable of detecting a more localized effect. 3. Rank Tests for QTEs We very briefly review some general theory of rank tests in the regression setting based on the regression rankscores introduced by Gutenbrunner and Jureˇckov´a [2]. For further details see, Gutenbrunner et al. [3] or Koenker [7]. Consider the linear quantile regression model (2)

QY |X,Z (τ |x, z) = x β(τ ) + zδ(τ ).

We have a binary treatment variable, z, and p other covariates, denoted by the vector x. We would like √ to test the hypothesis H0 : δ(τ ) ≡ 0 versus local alternatives Hn : δn (τ ) = δ0 (τ )/ n in the presence of other covariate effects represented by the linear predictor x β(τ ) terms. Of course, in the two sample setting the latter term is simply an intercept. We will write X to denote the matrix with typical row xi of the observed covariates. Under the null hypothesis the regression rankscores are defined as, a ˆ(τ ) = argmax {a y|X a = (1 − τ )X 1,

a ∈ [0, 1]n }

This n-vector constitutes the dual solution to the quantile regression problem

ˆ ) = argmin β(τ ρτ (yi − x i β). ˆ ˆ The function a ˆi (τ ) = 1 when yi > x ˆi (τ ) = 0 when yi < x i β(τ ) and a i β(τ ) and integrating, 1 ˆbi = a ˆi (τ ) dτ i = 1, . . . , n, 0

yields “ranks” of the observations. In the two sample setting these a ˆi (τ )’s are exactly the rankscores of H´ajek (1965). Generalizing, we may consider integrating with another score function to obtain, 1 ˆbϕ = a ˆi (τ ) dϕ(τ ). i 0

ˇ ak [4] the choice of ϕ is dictated by the form of the As described in H´ajek and Sid´ alternative Hn . When δ0 (τ ) is of the pure location shift form δ0 (τ ) = δ0 , there are three classical options for ϕ: normal (van der Waerden) scores ϕ(τ ) = Φ−1 (τ ), Wilcoxon scores ϕ(τ ) = τ , and sign scores ϕ(τ ) = |τ − 12 |. These choices are optimal under iid error models (3)

yi = x i β + ui

Heterogeneous treatment effects

137

when the ui ’s are Gaussian, logistic and double exponential, respectively. In this form the model is a special case of (2) in which the coordinates of β(τ ) are all independent of τ except for the “intercept” component that takes the form β0 (τ ) = Fu−1 (τ ), the quantile function of the iid errors. For simplicity of exposition, we will maintain this iid error model in the next subsection, with the understanding that eventually it may be relaxed. 4. Noncentralities and Scores Choice of the score function, ϕ can be motivated by examining the noncentrality parameter of the corresponding rank tests under local alternatives. Our test statistic is −1 2 Tnϕ = s n Qn sn /A (ϕ) ˆ) (z − zˆ), zˆ = PX z, the projection of z onto the where sn = (z − zˆ) ˆbϕ n , Qn = (z − z space spanned by the x covariates, and A2 (ϕ) = (ϕ(t)− ϕ) ¯ 2 dt, with ϕ¯ = ϕ(t) dt. Theorem 1. (Gutenbrunner, Jureˇ a, Koenker and Portnoy) Under the local √ckov´ alternative, Hn : δn (u) = δ0 (u)/ n to the null model (3), Tn is asymptotically χ21 (η) with noncentrality parameter 1 2 − 12 η = [Qn A (ϕ)] f (F −1 (u))δ0 (u) dϕ(u). 0

A general strategy for selecting score functions, ϕ, is to optimize this noncentrality parameter given choices of δ0 (u) and f . In the case of location shift, δ0 (u) = δ0 , 1 η = δ0 f (F −1 (u)) dϕ(u) 0



=

1

−δ0 0

f  −1 (F (u))ϕ(u) du, f

and optimal performance of the test is achieved by choosing ϕ(u) = f  /f (F −1 (u)), thereby achieving the same asymptotic efficiency as the likelihood ratio test. In the case of partial location shifts we may consider trimmed score functions of the form, ϕ(u) =

f  −1 (F (u))I(τ0 < u < τ1 ). f

In particular we will consider the trimmed Wilcoxon scores ϕ(u) = uI(τ0 < u < τ1 ) in the next section. Hettmansperger [6] has previously considered symmetrically trimmed Wilcoxon tests motivated by robustness considerations. It is important to emphasize that the optimal score functions depend on both on the density f and the form of the treatment response δ0 (u), however following conventional practice in rank statistics we will focus on the latter dependence and attempt to select tests that are robust to the former. For scale shift alternatives we have local alternatives of the form √ δn (u) = δ0 F −1 (u)/ n and noncentrality parameter 1

η = [Qn A2 (ϕ)]− 2 δ0



f (F −1 (u))F −1 (u) dϕ(u)

138

R. Koenker

and again integrating by parts we have optimal score functions of the form, ϕ(u) = −(1 + F −1 (u) ·

f  −1 (F (u))) f

which for the Gaussian distribution yields ϕ(u) = (Φ−1 (u))2 − 1. Again, we may consider partial scale shifts and obtain restricted forms. Finally, for alternatives √ of the Lehmann type (1) we will consider localized versions with γn = 1 + γ0 / n, so expanding, √ √ δn (u) = γ0 (f (F −1 (u)))−1 [−(1 − u) log(1 − u)]/ n + o(1/ n), uniformly for u ∈ [, 1 − ] for some  > 0. Again integrating by parts in the noncentrality expression we have, 1 η = −[Qn A2 (ϕ)]− 2 γ0 [(1 − u) log(1 − u)] dϕ(u) 2 − 12 = −[Qn A (ϕ)] γ0 [log(1 − u) + 1]ϕ(u) du, so the optimal score function is ϕ(u) = log(1 − u) + 1. (An alternative derivation of this result can be found in Conover and Salsburg [1]). An apparent advantage of this class of alternatives is that the score function is independent of the error distribution F . 5. Simulation Evidence Throughout this section we will consider models that under the null hypothesis take the form, yi = β0 + xi β1 + vi with vi iid from some distribution, F , with Lebesgue density, f . The covariate, x will be standard normal. Three families of alternatives will be considered, one from each of the three general classes already discussed: Location Shift Scale Shift Lehmann Shift

δn (u) = γn I(τ0 < u < τ1 ) δn (u) = γn F −1 (u)I(τ0 < u < τ1 ) δn (u) = F −1 (1 − (1 − u)γn ) − F −1 (u) √ where in the location √ and scale shift cases, γn = γ0 / n while in the Lehmann case γn = 1 + γ0 / n. Having specified quantile functions for the alternatives, it is straightforward to generate data according to these specifications. Under the alternatives we have, yi = β0 + xi β1 + zi δn (Ui ) + F −1 (Ui ), where the Ui are iid U [0, 1] random variables. The treatment indicator, zi is generated as Bernoulli with probability 1/2 throughout the simulations. A convenient property of the regression rankscores is that they are invariant to the parameter, β, so we can take β = 0 for purposes of generating the data for the simulations. Of course, test statistics are based on inclusion of the covariate, xi in estimation of the rankscores under the null model. Dependence between xi and the treatment indicator is potentially a serious problem. Asymptotically, this is seen in the appearance of Qn in the noncentrality parameter. But to keep things simple, we will maintain independence of x and z mimicking full randomization of treatment.

Heterogeneous treatment effects

139

We consider the following collection of tests for “treatment effect:” T N S W[τ0 , τ1 ] H[τ0 , τ1 ] L

Student t-test Normal (van der Waerden) rank test Sign (median) rank test Trimmed Wilcoxon rank test Trimmed normal scale rank test Lehmann Alternative rank test

All the rank tests are computed as described in Section 3, following Gutenbrunner et al. [3]. The piecewise linearity of the a ˆi (u) functions can be exploited, so

1

ˆbϕ = i

a ˆi (u) dϕ(u) = 0

J

a ˆi (τj ) − a ˆi (τj−1 ) j=1

τj − τj−1

τj

ϕ(u) du. τj−1

The last integral can be computed in closed form for all of our examples. See the function ranks in [8] for further details.

Location T N W[0,1] S W[.6,.95] H[0,1] H[.5,1] L Scale T N W[0,1] S W[.6,.95] H[0,1] H[.5,1] L Lehmann T N W[0,1] S W[.6,.95] H[0,1] H[.5,1] L

n=50

γ0 = 0 n=100 n=500

n=50

γ0 = 0.5 n=100 n=500

n=50

γ0 = 1 n=100 n=500

0.0518 0.0540 0.0559 0.0678 0.0547 0.0363 0.0300 0.0460

0.0560 0.0561 0.0576 0.0649 0.0514 0.0432 0.0434 0.0529

0.0523 0.0516 0.0524 0.0542 0.0527 0.0467 0.0514 0.0531

0.1234 0.1133 0.1045 0.0752 0.2906 0.1473 0.2211 0.1846

0.1448 0.1359 0.1188 0.0510 0.3667 0.2179 0.3376 0.2612

0.1566 0.1531 0.1262 0.0536 0.4504 0.2538 0.3844 0.2970

0.3212 0.2030 0.1693 0.0519 0.5341 0.3882 0.6654 0.4481

0.3402 0.2577 0.1982 0.0460 0.7156 0.4926 0.7827 0.5744

0.4468 0.4090 0.3070 0.0534 0.9175 0.7166 0.9055 0.7831

0.0496 0.0506 0.0531 0.0698 0.0536 0.0346 0.0318 0.0460

0.0569 0.0573 0.0565 0.0580 0.0554 0.0412 0.0440 0.0539

0.0514 0.0507 0.0500 0.0520 0.0491 0.0453 0.0475 0.0493

0.1033 0.0903 0.0798 0.0709 0.1665 0.1118 0.1385 0.1307

0.1382 0.1123 0.0894 0.0473 0.2077 0.2026 0.3205 0.2175

0.1593 0.1451 0.0974 0.0553 0.2635 0.3093 0.4873 0.3282

0.2671 0.1557 0.1290 0.0569 0.3610 0.2561 0.4336 0.2999

0.2984 0.1867 0.1395 0.0507 0.4593 0.3817 0.6418 0.4208

0.4277 0.3368 0.2066 0.0562 0.6506 0.7460 0.9326 0.7556

0.0545 0.0559 0.0568 0.0717 0.0555 0.0366 0.0336 0.0500

0.0534 0.0547 0.0544 0.0594 0.0514 0.0364 0.0433 0.0529

0.0488 0.0493 0.0507 0.0570 0.0508 0.0483 0.0468 0.0474

0.3866 0.3719 0.3700 0.3145 0.3802 0.0459 0.2709 0.3892

0.4347 0.4215 0.4093 0.2830 0.4402 0.0841 0.4081 0.4808

0.5420 0.5379 0.5093 0.3698 0.5512 0.1397 0.5494 0.6111

0.7795 0.7388 0.7273 0.5395 0.7885 0.0812 0.7111 0.8034

0.8618 0.8403 0.8291 0.6520 0.8601 0.1022 0.8240 0.8920

0.9612 0.9585 0.9457 0.8246 0.9662 0.3149 0.9616 0.9823

Table 1 Rejection Frequencies for Several Rank Tests: Nominal level of significance for all tests is 0.05, table entries are each based on 10,000 replications, all models have standard normal iid errors under the null and local alternatives with the indicated γ0 parameters.

In Table 1 we report results of a simulation with Gaussian F . Entries in the table represent empirical rejection frequencies for 10,000 replications. There are three sample sizes, three settings of the local alternative parameter, γ0 , and three distinct forms for the alternative hypothesis. Eight tests are evaluated: two versions of the Wilcoxon test one trimmed, one untrimmed; and two of the normal scale test

140

R. Koenker

one trimmed, one untrimmed. The first three columns of the table evaluate size of the test. These entries generally lie with experimental sampling accuracy for the nominal 0.05 level of the tests. Power of the tests for γ0 = 0.5 and γ0 = 1 are reported in the next six columns.

Location T N W[0,1] S W[.6,.95] H[0,1] H[.5,1] L Scale T N W[0,1] S W[.6,.95] H[0,1] H[.5,1] L Lehmann T N W[0,1] S W[.6,.95] H[0,1] H[.5,1] L

n=50

γ0 = 0 n=100 n=500

n=50

γ0 = 0.5 n=100 n=500

n=50

γ0 = 1 n=100 n=500

0.0447 0.0498 0.0534 0.0645 0.0502 0.0354 0.0304 0.0445

0.0491 0.0532 0.0538 0.0448 0.0523 0.0421 0.0419 0.0486

0.0455 0.0477 0.0493 0.0537 0.0507 0.0496 0.0519 0.0488

0.0718 0.0756 0.0791 0.0647 0.1746 0.0757 0.0930 0.0964

0.0883 0.0889 0.0860 0.0586 0.2314 0.1038 0.1520 0.1314

0.0962 0.1054 0.1013 0.0493 0.3133 0.1246 0.1720 0.1575

0.1521 0.1160 0.1132 0.0544 0.3774 0.1693 0.2677 0.2055

0.1880 0.1627 0.1438 0.0507 0.5304 0.2411 0.3839 0.3043

0.2047 0.2187 0.1991 0.0520 0.7467 0.3226 0.4714 0.3987

0.0494 0.0566 0.0598 0.0713 0.0542 0.0339 0.0293 0.0469

0.0475 0.0518 0.0538 0.0491 0.0532 0.0426 0.0415 0.0510

0.0498 0.0505 0.0521 0.0531 0.0507 0.0497 0.0508 0.0512

0.0753 0.0750 0.0707 0.0656 0.1289 0.0741 0.0776 0.0919

0.1075 0.0902 0.0805 0.0642 0.1688 0.1246 0.1902 0.1448

0.1163 0.0976 0.0802 0.0496 0.1952 0.1709 0.2493 0.1815

0.1430 0.1068 0.0975 0.0582 0.2568 0.1315 0.1831 0.1647

0.2264 0.1529 0.1237 0.0528 0.3697 0.2521 0.4136 0.3004

0.2937 0.2157 0.1527 0.0545 0.5176 0.4378 0.6404 0.4724

0.0436 0.0497 0.0536 0.0703 0.0525 0.0366 0.0336 0.0468

0.0459 0.0488 0.0495 0.0468 0.0513 0.0435 0.0412 0.0488

0.0465 0.0474 0.0490 0.0560 0.0504 0.0490 0.0473 0.0509

0.2645 0.3319 0.3361 0.2698 0.3447 0.0393 0.2144 0.3417

0.4320 0.4286 0.4129 0.3174 0.4540 0.0894 0.4222 0.4884

0.5146 0.5270 0.4979 0.3462 0.5401 0.1377 0.5439 0.6057

0.4851 0.6928 0.6994 0.5261 0.7158 0.0320 0.4772 0.7082

0.7550 0.8440 0.8360 0.6585 0.8699 0.0927 0.8293 0.8982

0.9345 0.9551 0.9426 0.8242 0.9578 0.2932 0.9614 0.9799

Table 2 Rejection Frequencies for Several Rank Tests: Nominal level of significance for all tests is 0.05, table entries are each based on 10,000 replications, all models have iid Student t3 errors under the null and local alternatives with the indicated γ0 parameters.

The restricted location shift alternative is specified as δn (u) = γn I(0.6 < u < 1) so there is no signal at the median and the poor performance of the sign test reflects this handicap. The other classical tests of global location shift also perform rather badly, even worse than the global normal scale test. The best performance is achieved by the trimmed Wilcoxon test, but the trimmed normal scale tests is also quite a strong contender. The restricted scale shift alternative is specified as δn (u) = γn Φ−1 (u)I(0.5 < u < 1) so again there is no signal at the median and the sign test is a disaster. The Student t test, the Wilcoxon, and the normal scores tests perform even worse than their lack-luster showing for the location shift alternative. Here, not surprisingly given that it was designed for this situation, the trimmed normal scale test is the clear winner. The Lehmann alternative affords an opportunity for all the tests to demonstrate some strength; these alternatives combine features of global location and scale shift with a more pronounced effect in the right tail so all the tests have something to offer. Again, not surprisingly, the Lehmann test designed for this situation is the clear winner, but the classical location shift tests are not far behind. Only the global

Heterogeneous treatment effects

141

normal scale test is poor in this case. The banal conclusion that may be drawn from Table 1 seems to be that it pays to know what the alternative is before choosing a test. But if we delve slightly deeper we may be led to the conclusion that the Lehmann alternatives are quite adequately countered by traditional rank tests, while the asymmetric forms of the Wilcoxon and normal scale tests are better for stronger forms of asymmetric response captured in the partial location and scale shift alternatives. Before jumping to such conclusions, however, it would be prudent to consider whether the normality assumption that underlies all of the simulation results of Table 1 is critical. Table 2 reports simulation results for an almost identical experimental setup except that Gaussian error is replaced everywhere by Student t3 error. Most of the features of the two tables are very similar. Especially in the partial location shift setting one sees even worse performance of the classical global rank tests and the t test. Performance of the Lehmann test deteriorates somewhat for both the location and scale alternatives under Student errors, but remains strong for the Lehmann alternative. 6. Conclusions Rank tests continue to play an important role in many domains of statistical application like survival analysis, but their potential value in the context of linear models remains under-appreciated. The regression rankscore methods of Gutenbrunner and Jureˇckov´ a [2] have opened a wide vista of new opportunities for rank based inference in the regression setting. More targeted inference is particularly important in the context of heterogeneous treatment models. We have taken a few steps in this direction, but there are interesting new paths ahead. References [1] Conover, W. and Salsburg, D. (1988). Locally most powerful tests for detecting treatment effects when only a subset of patients can be expected to ‘respond’ to treatment. Biometrics 44 189–196. ˇkova ´ , J. (1992). Regression quantile and re[2] Gutenbrunner, C. and Jurec gression rank score process in the linear model and derived statistics. Ann. Statist. 20 305–330. ˇkova ´ , J., Koenker, R., and Portnoy, S. [3] Gutenbrunner, C., Jurec (1993). Tests of linear hypotheses based on regression rank scores. J. Nonparametric Statistics 2 307–331. ˇ a ´ jek, J. and Sid ´ k, Z. (1967). Theory of Rank Tests. Academia, Prague. [4] Ha [5] He, X., Hsu, Y.-H., and Hu, M. (2009). Detection of treatment effects by covariate adjusted expected shortfall. Preprint. [6] Hettmansperger, T. (1968). On the trimmed Mann–Whitney statistics. Annals of Math. Stat. 39 1610–1614. [7] Koenker, R. (2005). Quantile Regression. Cambridge Univ. Press, Cambridge. [8] Koenker, R. (2009). quantreg. R package version 4.45, available from http://CRAN.R-project.org/package=quantreg. [9] Lehmann, E. (1953). The power of rank tests. Ann. Math. Stat. 24 23–43. [10] Lehmann, E. (1974). Nonparametrics: Statistical Methods Based on Ranks. Holden-Day.

142

R. Koenker

[11] Neyman, J. (1923). On the application of probability theory to agricultural experiments. Essay on Principles. Section 9. Statistical Science 5 465–472. (In translation from the original Polish.) [12] Rosenbaum, P. R. (2007). Confidence intervals for uncommon but dramatic responses to treatment. Biometrics 63 1164–1171. [13] Rubin, D. B. (1978). Bayesian inference for causal effects: The role of randomization. The Annals of Statistics 6 34–58.

IMS Collections Nonparametrics and Robustness in Modern Statistical Inference and Time Series Analysis: A Festschrift in honor of Professor Jana Jureˇ ckov´ a Vol. 7 (2010) 143–152 c Institute of Mathematical Statistics, 2010  DOI: 10.1214/10-IMSCOLL715

A class of minimum distance estimators in AR(p) models with infinite error variance∗ Hira L. Koul1 and Xiaoyu Li Michigan State University Abstract: In this note we establish asymptotic normality of a class of minimum distance estimators of autoregressive parameters when error variance is infinite, thereby extending the domain of their applications to a larger class of error distributions that includes a class of stable symmetric distributions having Pareto-like tails. These estimators are based on certain symmetrized randomly weighted residual empirical processes. In particular they include analogs of robustly weighted least absolute deviation and Hodges–Lehmann type estimators.

1. Introduction When modeling extremal events one often comes across autoregressive time series with infinite variance innovations, cf. Embrecht, K¨ uppelberg and Mikosch [10]. Assessing distributional properties of classical inference procedures in these time series models is thus important. Weak and strong consistency with some convergence rate of the least square (LS) estimator of the autoregressive parameter vector in such models are discussed in Kanter and Steiger [13], Hannan and Kanter [12], and Knight [14]) while Davis and Resnick [4, 5] discuss its limiting distribution. Strong consistency and convergence rate of the least absolute deviation (LAD) estimator are considered separately by Gross and Steiger [11], and An and Chen [1]. Davis, Knight and Liu [6] and Davis and Knight [5] discuss consistency and asymptotic distributions of the LAD and M-estimators in autoregressive models of a known order p when error distribution is in the domain of attraction of a stable distribution of index α ∈ (0, 2). Knight [15] proves asymptotic normality of a class of M -estimators in a dynamic linear regression model where the errors have infinite variance but the exogenous regressors satisfy the standard assumptions. Ling [18] discusses asymptotic normality of a class of weighted LAD estimators. Minimum distance (m.d.) estimation method consists of obtaining an estimator of a parameter by minimizing some dispersion or pseudo distance between the data and the underlying model. For a stationary autoregressive time series of a known order p with i.i.d. symmetric innovations a class of m.d. estimators was proposed in Koul [16]. This class of estimators is obtained by minimizing a class of certain integrated squared differences between randomly weighted empirical processes of residuals and negative residuals. More precisely, let p be a known positive integer ∗ Research

in part supported by the USA NSF DMS Grant 0704130. State University, East Lansing, USA. e-mail: [email protected] AMS 2000 subject classifications: Primary 62G05; secondary 62M10, 62G20. Keywords and phrases: asymptotic normality, Pareto-like tails distributions.

1 Michigan

143

144

H. L. Koul and Xiaoyu Li

and consider the linear autoregressive process {Xi } obeying the model (1.1)

Xi = ρ1 Xi−1 + ρ2 Xi−2 + · · · + ρp Xi−p + εi , 

i = 0, ±1, ±2, · · · ,

for some ρ := (ρ1 , · · · , ρp ) ∈ R , where the innovations {εi } are i.i.d. r.v.’s from a continuous distribution function (d.f.) F , symmetric around zero, not necessarily known otherwise. We shall also assume {Xi } is a strictly stationary solution of the equations (1.1). Some sufficient conditions for this to exist in the case of some heavy tail error distributions are given in the next section. Here, and in the sequel, by stationary we mean strictly stationary. Let Yi−1 := (Xi−1 , · · · , Xi−p ) . Because of the assumed symmetry of the innovation d.f. F , Xi − ρ Yi−1 and −Xi + ρ Yi−1 have the same distribution for each i = 1, · · · , n. Using this fact, the following class of m.d. estimators was proposed in Koul [16]. 6 n  6 −1/2

Kh+ (t) := h(Yi−1 ) I(Xi ≤ x + t Yi−1 ) 6n p

i=1

62 6 −I(−Xi < x − t Yi−1 ) 6 dG(x), ρ+ h

:= argmin{Kh+ (t); t ∈ Rp }.

Here h is a measurable function from Rp to Rp with its components hk , k = 1, · · · , p, G is a nondecreasing right continuous function on R having left limits, possibly inducing a σ-finite measure on R, and  ·  stands for the usual Euclidean norm. A large subclass of the estimators ρ+ h , as h and G vary, is known to be robust against additive innovation outliers, cf. Dhar [8]. The class of estimators ρ+ h , when h(x) = x and as G varies, have desirable asymptotic relative efficiency properties. Moreover, for h(x) = x, ρ+ h becomes the LAD estimator when G is degenerate at zero while for G(x) ≡ x, it is an analog of the Hodges–Lehmann estimator. Asymptotic normality of these estimators under a broad set of conditions on h, G and F was established in Koul (16, 17, chapter 7). These conditions included the condition of finite error variance. Main reason for having this assumption was to ensure stationarity of the underlying process {Xi } satisfying (1.1). Given the importance of heavy tail error distributions and robustness properties of these m.d. estimators, it is desirable to extend the domain of their applications to autoregressive time series with heavy tail errors. We now establish asymptotic normality of these estimators here under similar general conditions in which not only the error variance is not finite but also even the first moment may not be finite. In the next section, we first state general conditions for asymptotic normality of these estimators. Then we give a set of sufficient and easy to verify conditions that imply these general conditions. Among the new results is the asymptotic normality of a class of analogs of robust Hodges–Lehmann type estimators of the autoregressive parameters when error distribution has infinite variance. We also give examples of several functions h and G that satisfy the assumed conditions. In the last section another class of m.d. estimators based on residual ranks is discussed briefly to be used when errors may not have a symmetric distribution. 2. Main result To describe our main result we now state the needed assumptions, most of which are the same as in Koul [17]. (2.1)

Either

e h(y)y  e ≥ 0,

Or

e h(y)y  e ≤ 0,

∀ y, e ∈ Rp , e = 1.

A class of minimum distance estimators in AR(p) models

(2.2)

(a) 0 < E(|hk (Y0 )| Y0 ) < ∞,

∀ 1 ≤ k ≤ p.

145

(b) Eh(Y0 )2 < ∞.

In the following assumptions b is any positive finite real number. (2.3) Eh(Y0 )2 |F (x + n−1/2 (v  Y0 + aY0 )) − F (x)| dG(x) = o(1), ∀ v ≤ b, a ∈ R. There exists a constant k ∈ (0, ∞), such that for all δ > 0, v ≤ b and 1 ≤ k ≤ p,   lim inf P

(2.4)

n

n

−1/2

n

n

 −1/2  h± v Yi−1 + δni ) k (Yi−1 ) F (x + n

i=1 k=1

 2 −F (x + n−1/2 v  Yi−1 − δni ) dG(x) ≤ kδ 2 = 1, − + where δni := n−1/2 δYi−1 , h+ k := max(0, hk ), hk := hk − hk .

6 n  6 −1/2

h(Yi−1 ) F (x + n−1/2 v  Yi−1 ) − F (x) 6n

(2.5)

i=1

−n

62 6 v Yi−1 f (x) 6 dG(x) = op (1),

−1/2 

∀ v ≤ b.

The d.f. F has Lebesgue density f satisfying the following. ∞ 2 (2.6)(a) 0 < f dG < ∞, (b) 0 < f dG < ∞, (c) (1 − F ) dG < ∞. 0

Assumption of stationarity replaces the assumption of finite error variance (7.4.7)(b) of Koul [17]. We are now ready to state our main result. Theorem 2.1 Assume the autoregressive process given at (1.1) exists and is strictly stationary. In addition, assume the functions h, G, F satisfy assumptions (2.1) – (2.6) and that G and F are symmetric around zero. Then, n where Bn := n−1 :=

(ρ+ h

n i=1

Sn+

1/2

− ρ) = − B n

.−1

2

f dG

Sn+ + op (1),

 h(Yi−1 )Yi−1 , and

n−1/2

n

  h(Yi−1 ) I(εi ≤ x) − I(−εi < x) f (x) dG(x)

i=1

=

n−1/2

n

h(Yi−1 )[ψ(−εi ) − ψ(εi )],



i=1

Consequently,   Var(ψ(ε)) B −1 HB −1 , n1/2 (ρ+ h − ρ) →d N 0, 2 2 ( f dG) where B := Eh(Y0 )Y0 and H := Eh(Y0 )h(Y0 ) .

x

ψ(x) :=

f dG. −∞

146

H. L. Koul and Xiaoyu Li

The existence of ρ+ h under the finite variance assumption has been discussed in Dhar [9]. Upon a close inspection one sees that this proof does not require the finiteness of any error moment but only the stationarity of the process and assumptions (2.1), (2.2)(b) and (2.6)(c). Also note that (2.1), (2.2)(a) and the Ergodic Theorem implies the existence of B −1 , and Bn−1 for all n. In view of the stationarity of the process {Xi }, the details of the proof of Theorem 2.1 are very similar to that of Theorem 7.4.5 in Koul [17] and are left out for an interested reader. 3. Some stronger assumptions and Examples In this section we shall now discuss some easy to verify sufficient conditions for (2.3) to (2.5). In particular, we shall show that the above theorem is applicable to robust LAD and analogs of robust Hodges–Lehmann type estimators. First, consider (2.4) and (2.5). As shown in Koul [17], under the finite error variance assumption, (2.2)(a), (2.4) and (2.5) are implied by (2.2)(b), (2.6)(a) and the assumption |f (x + s) − f (x)|2 dG(x) → 0, s → 0. (3.1) We shall now show that (2.4) and (2.5) continue to hold under (2.2)(a), (2.6)(a) and (3.1) when {Xi } is stationary, without requiring the error variance to be finite. First, consider (2.4). Recall δni := n−1/2 δYi−1 . Then, the r.v.’s inside the probability statement of (2.4) equals to 

n−1/2

=

n

h± k (Yi−1 )

i=1 n

n



δni −δni

f (x + n−1/2 v  Yi−1 + s) ds

2 dG(x)

± n−1 h± k (Yi−1 )hk (Yj−1 )

i=1 j=1



· =

δni −δni

δ 2 n−2



δnj

f (x + n−1/2 v  Yi−1 + s)f (x + n−1/2 v  Yj−1 + t) dsdt dG(x)

−δnj n

n

± Yi−1 Yj−1 h± k (Yi−1 )hk (Yj−1 )

i=1 j=1

1 δni δnj





δni



−δni

δnj



−δnj

f (x + n−1/2 v  Yi−1 + s)

·f (x + n−1/2 v  Yj−1 + t) dG(x) dsdt n  2

Yi−1  |hk (Yi−1 )| δ 2 n−1 i=1

1 1≤i,j≤n δni δnj



δni



δnj

max

−δni

−δnj



f (x + n−1/2 v  Yi−1 + s)

−1/2 



·f (x + n v Yj−1 + t) dG(x) dsdt n  2

Yi−1  |hk (Yi−1 )| 4δ 2 n−1 

i=1

1 × max 1≤i≤n 2δni



δni −δni



f 2 (x + n−1/2 v  Yi−1 + s) dG(x)

1/2 2 ds

A class of minimum distance estimators in AR(p) models

→p

147

   2  f 2 dG . 4δ 2 E Y0 |hk (Y0 )|

The above last claim is implied by the Ergodic Theorem which uses (2.2)(a), and the fact that under (2.6)(a) and (3.1), the second factor in the last but one bound above tends, in probability, to a finite and positive limit f 2 dG. The argument for verifying (2.5) is similar. Let bni := n−1/2 b||Yi−1 . Then, 6 n  62 6 −1/2

6 h(Yi−1 ) F (x+n−1/2 v  Yi−1 )−F (x)−n−1/2 v  Yi−1 f (x) 6 dG(x) 6n i=1

p  n



−1/2 n hk (Yi−1 ) = k=1

≤ b2 n−2

i=1 p n n



n−1/2 v  Yi−1

{f (x + s) − f (x)} ds

2 dG(x)

0

Yi−1 Yj−1 |hk (Yi−1 )||hk (Yj−1 )|

k=1 i=1 j=1

bni bnj   1 |f (x+s)−f (x)||f (x+t)−f (x)| dG(x) dsdt 1≤i,j≤n bni bnj −b ni −bnj p  n 2



n−1 Yi−1 |hk (Yi−1 )| ≤ 4b2 × max

k=1



i=1

1 × max 1≤i≤n 2bni →p 0.



bni −bni



|f (x + s) − f (x)|2 dG(x)

1/2

2 ds

The last but one inequality follows from the Cauchy–Schwarz inequality, |f (x + s) − f (x)||f (x + t) − f (x)| dG(x)  1/2  1/2 |f (x + s) − f (x)|2 dG(x) ≤ |f (x + t) − f (x)|2 dG(x) , while the last claim follows from (2.2)(a), Ergodic Theorem, and (3.1). Now we turn to the verification of (2.3). First, consider the case when G is a finite measure. In this case, by the Dominated Convergence Theorem, (2.2)(b) and the continuity of F readily imply (2.3). Of special interest among finite measures G is the measure degenerate at zero. Now assume that the distribution of Y0 is continuous. Then, because F is continuous, the joint distribution of Yi−1 , Xi , 1 ≤ i ≤ n, is continuous for all n, and hence, n 6

62 6 6 Kh+ (t) := 6 h(Yi−1 )sign(Xi − t Yi−1 )6 ,

∀ t ∈ Rp , w.p. 1,

i=1

and the corresponding m.d. estimator, denoted by ρ+ h,LAD , becomes an analog of the LAD estimator. Note also that now (3.1) is equivalent to the continuity of f at zero, ψ(x) of Theorem 2.1 equals f (0)I(x > 0) and Var(ψ(ε)) = f 2 (0)/4, where ε is the innovation variable having d.f. F . We summarize asymptotic normality result for ρ+ h,LAD in the following

148

H. L. Koul and Xiaoyu Li

Corollary 3.1 Assume the stationary AR(p) model (1.1) and assumptions (2.1), (2.2) hold. In addition, assume that the symmetric error density f is continuous at 0 and f (0) > 0. Then,  B −1 HB −1  , B := Eh(Y0 )Y0 , H := Eh(Y0 )h(Y0 ) . n1/2 (ρ+ h,LAD − ρ) →d N 0, 4f 2 (0) Note that this result does not require finiteness of any error moment. Examples of h that satisfy (2.1) and (2.2) include the weight function (3.2) (3.3)

h(y) = h1 (y) := yI(y ≤ c) + c(y/y2 )I(y > c),

c > 0,

h(y) = h2 (y) := y/(1 + y2 ).

Note that both are bounded functions and trivially satisfy (2.1). Moreover, continuity of Y0 implies that h1 satisfies (2.2), because for all 1 ≤ k ≤ p,  |Y |    0k 0 < cE ≤ E |h1k (Y0 )| Y0  I(Y0  > c) Y0    ≤ E Y0 2 I(Y0  ≤ c) + cI(Y0  > c) ≤ c2 + c. Similarly, h2 also satisfies (2.2), because for all 1 ≤ k ≤ p,  |Y |Y     0k 0 0 n1/2 β|s|) dG(x) ds → 0, ∀ 0 < β < ∞.

A class of minimum distance estimators in AR(p) models

149

Now we shall show that (3.5) and (3.6) implies (2.3). Then, by the Fubini Theorem, Eh(Y0 )2 |F (x + n−1/2 (v  Y0 + aY0 )) − F (x)| dG(x) ≤

2

n−1/2 (b+|a|) Y0



=

C E f (x + s) dG(x) ds −n−1/2 (b+|a|) Y0 f (x + s) P (Y0  > n1/2 c−1 |s|) dG(x) ds C2



0, by (3.6).

To summarize, we have shown (2.2)(a), (2.6)(a) and (3.1) imply (2.4) and (2.5) for general h and G, while (3.6) implies (2.3) for bounded h and a σ-finite G. Verification of (3.6) is relatively easy if the following two assumptions hold. (3.7) G is absolutely continuous with dG(x) = γ(x) dx, where γ is bounded, i. e., γ∞ := supx∈R |γ(x)| < ∞, (3.8) EY0  < ∞. For, then, by Fubini’s Theorem, the left hand side of (3.6) is bounded above by     γ∞ f (x + s) dx P Y0  ≥ n1/2 c−1 |s| ds = 2n−1/2 cγ∞ EY0  → 0. Among the G satisfying (3.7) is the Lebesgue measure dG(x) ≡ dx, where γ(x) ≡ 1. For this G, (2.6) and (3.1) are implied by (2.6)(a) and E|ε| < ∞, and the ψ(x) of Theorem 2.1 equals to F (x), where F is the d.f. of ε, so that Var(ψ(ε) = Var(F (ε)) = 1/12. Moreover, Kh+ (t)

=

n−1

p

n

n

  hk (Xi−1 )hk (Xj−1 ) Xi + Xj − (Yi−1 + Yj−1 ) t

k=1 i=1 j=1

  −Xi − Xj − (Yi−1 − Yj−1 ) t , + and the corresponding ρ+ h , denoted by ρh,HL , is a robust analog of the Hodges– Lehmann type estimator, when h is bounded. Note that for bounded h, (2.2) is implied by (3.8). Because of the importance of this class of estimators we summarize their asymptotic normality result in the following corollary.

Corollary 3.2 Assume the stationary AR(p) model (1.1) holds. In addition, suppose h is bounded and satisfies (2.1), and the error d.f. F is symmetric around zero and satisfies f 2 (x) dx < ∞, and E|ε| < ∞. Then, (3.8) holds, and  − ρ) → N 0, n1/2 (ρ+ d h,HL

B −1 HB −1  , 12( f 2 (x) dx)2

where B := Eh(Y0 )Y0 and H := Eh(Y0 )h(Y0 ) . Perhaps it is worth emphasizing that none of the above mentioned literature dealing with the various estimators in AR(p) models with infinite error variance include this class of estimators.

150

H. L. Koul and Xiaoyu Li

It is thus apparent from the above discussion that asymptotic normality holds for some members of the above class of m.d. estimators without requiring finiteness of any moments, and for some other members requiring only the first error moment to be finite. If one still does not wish to assume (3.8), then it may be possible to verify (3.6) for some heavy tail error densities. We do not do this but now will give an example of a large class of strictly stationary processes satisfying (1.1) and for which this condition holds but which has infinite variance. Recall that a d.f. F of the error variable ε is said to have a Pareto-like tails of index α if for some α > 0, 0 ≤ a ≤ 1, 0 < C < ∞, xα (1 − F (x)) → aC,

(3.9)

xα F (−x) → (1 − a)C,

x → ∞.

From Brockwell and Davis [2], p. 537, Proposition 13.3.2, it follows that if 1 − ρ1 x − ρ2 x2 − · · · − ρp xp = 0, |x| ≤ 1, and if F satisfies (3.9), then {Xi } satisfying (1.1) exists and is strictly stationary and invertible. Now, (3.9) readily implies xα P (|ε| > x) → C, as x → ∞, and hence E|ε|δ < ∞, for δ < α, E|ε|δ = ∞, for δ ≥ α. Suppose 1 < α < 2. Then E|ε| < ∞, and Var(ε) = ∞. Thus we have a large class of strictly stationary AR(p) processes with finite first moment and infinite variance. In particular these processes satisfy (3.8). We summarize the above discussion in the following corollary. Corollary 3.3 Assume the autoregressive model (1.1) holds with the error d.f. F having Pareto-like tail of index 1 < α < 2. In addition, suppose (2.1) holds, G has a bounded Lebesgue density, h is bounded, F has square integrable Lebesgue density, and both F and G are symmetric around zero. Then, the conclusion of Theorem 2.1 holds for the class of m.d. estimators ρ+ h. This still leaves open the problem of obtaining asymptotic distribution of a suitably standardized ρ+ h when a stationary solution to (1.1) exists with the error d.f. having Pareto-like tail of index α ≤ 1. 4. M.D. estimators when F is not symmetric Here we shall describe an asymptotic normality result of a class of minimum distance estimators when F may not be symmetric and when in (1.1) error variance may be infinity. Let Ri (t) denote the rank of Xi − t Yi−1 among Xj − t Yj−1 , j = 1, · · · , n, n −1 ¯ hn := n i=1 h(Yi−1 ), and define the randomly weighted empirical process of residual ranks Zh (t, u)

:=

n−1/2

¯ n )[I(Ri (t) ≤ nu) − u], (h(Yi−1 ) − h

u ∈ [0, 1],

i=1

Kh (t)

n

1

Zh (t, u)2 dL(u),

:=

ρ h := argmin{Kh (t); t ∈ Rp },

0

where L is a d.f. on [0, 1]. See Koul [17] for a motivation on using the dispersion Kh . It is an analog of the classical Cram´er – von Mises statistic useful in regression and autoregressive models. The following proposition describes the asymptotic normality of ρ h . Proposition 4.1 Assume the process satisfying (1.1) is strictly stationary with the error d.f. F having uniformly continuous Lebesgue density f and finite first

A class of minimum distance estimators in AR(p) models

151

moment. In addition, assume L is a d.f. on [0, 1], (3.5) holds, and the following n hold with Y¯n−1 := n−1 i=1 Yi−1 . ¯ n )(Yi−1 − Y¯n−1 ) e ≥ 0, (4.1) Either e (h(Yi−1 ) − h ¯ n )(Yi−1 − Y¯n−1 ) e ≤ 0, ∀ i = 1, · · · , n, e ∈ Rp , e = 1. Or e (h(Yi−1 ) − h Let F −1 (u) := inf{x; F (x) ≥ u}, q(u) := f (F −1 (u)), 0 ≤ u ≤ 1. Then, - ρh − ρ) = − Cn rn1/2 (

.−1

1

q 2 dL

S n + op (1),

0

where Cn := n−1 S n

:=

1

n

i=1 (h(Yi−1 )

n−1/2

0

=

n

¯ n )(Yi−1 − Y¯n−1 ) , and −h

  ¯ n ) I(F (εi ) ≤ u) − u q(u) dL(u) (h(Yi−1 ) − h

i=1

−n−1/2

n



ϕ(u) du], 0

i=1



1

¯ n )[ϕ(εi ) − (h(Yi−1 ) − h

u

ϕ(u) :=

q dL. 0

  1 Consequently, n1/2 ( ρh −ρ) →d N 0, τ 2 C −1 GC −1 , where τ 2 := Var(ϕ(ε))/( 0 q 2 dL)2 and       C := E h(Y0 ) − Eh(Y0 ) Y0 , G := E h(Y0 ) − Eh(Y0 ) h(Y0 ) − Eh(Y0 ) . The proof of this claim is similar to that of the asymptotic normality of an analogous estimator θˆmd discussed in chapter 8 of the monograph by Koul [17] in the case of finite variance, hence not given here. Note that again for bounded h, ρ h are robust against innovation outliers. A useful member of this class is obtained when L(u) ≡ u. In this case Kh (t) = −2n

−2

p

n

n

  ¯ nk )(hk (Yj−1 ) − h ¯ nk )Ri (t) − Rj (t), (hk (Yi−1 ) − h

k=1 i=1 j=1

¯ nk := n−1 h

n

hk (Yi−1 ),

1 ≤ k ≤ p.

i=1

In the case of finite variance and when h(x) ≡ x, the asymptotic variance of the corresponding estimator is smaller than that of the LAD (Hodges–Lehmann) estimator at logistic (double exponential) errors. It is thus interesting to note that the above asymptotic normality of the robust analogs of this estimator holds even when error variance may be infinite. Note that when L(u) ≡ u, the corresponding τ 2 = 1/[12( f 3 (x) dx)2 ]. Acknowledgement. Authors would like to thank the two anonymous referees for some constructive comments. References [1] An, H. Z. and Chen, Z. G. (1982). On convergence of LAD estimates in autoregression with infinite variance. J. Multiv. Anal. 12 335–345.

152

H. L. Koul and Xiaoyu Li

[2] Brockwell, P. J. and Davis, R. A. (1991). Time Series: Theory and Methods. Second edition. New York, Springer. [3] Davis, R. A. and Knight, K. (1987). M-estimation for autoregressions with infinite variance. Stoc. Proc. & Appl. 40 145–180 North-Holland. [4] Davis, R. A. and Resnick, S. I. (1985). Limit theory for moving average of random variables with regularly varying tail probabilities. Ann. Probab. 13 179–195. [5] Davis, R. A. and Resnick, S. I. (1986). Limit theory for the sample covariance and correlation functions of moving averages. Ann. Statist. 14 533-558. [6] Davis, R. A., Knight, K. and Liu, J. (1992). M -estimation for autoregressions with infinite variance. Stoc. Proc. Appl. 40 1 145–180. [7] Denby, L. and Martin, D. (1979). Robust estimation of the first order autoregressive parameter. J. Amer. Statist. Assoc. 74 140–146. [8] Dhar, S. K. (1991). Minimum distance estimation in an additive effects outliers model. Ann. Statist. 19 205–228. [9] Dhar, S. K. (1993). Computation of certain minimum distance estimators in AR[k] model. J. Amer. Statist. Assoc. 88 278–283. ¨ ppelberg, C., and Mikosch, T. (2001). Modelling [10] Embrecht, P., Klu extremal events for insurance and finance. Springer-Verlag, Berlin – Heidelberg. [11] Gross, S. and Steiger, W. L. (1979). Least absolute deviation estimates in autoregression with infinite variance. J. Appl. Probab. 16 104–116. [12] Hannan, E. J. and Kanter, M. (1977). Autoregressive processes with infinite variance. Adv. Appl. Probab. 6 768–783. [13] Kanter, M. and Steiger, W. L. (1974). Regression and autoregression with infinite variance. Advances in Appl. Probability 6 768–783. [14] Knight, K. (1987). Rate of convergence of centred estimates of autoregressive parameters for infinite variance autoregression. J. Time Ser. Anal. 8 51–60. [15] Knight, K. (1993). Estimation in dynamic linear regression models with infinite variance errors. Econometric Theory 9 570–588. [16] Koul, H. L. (1986). Minimum distance estimation and goodness-of-fit tests in first-order autoregression. Ann. Statist. 14 1194–1213. [17] Koul, H. L. (2002). Weighted Empirical Processes in Dynamic Nonlinear Models. Second edition. New York, Springer. [18] Ling, S. Q. (2005). Self-weighted least absolute deviation estimation for infinite variance autoregressive models. J.R. Statist. Soc. B 67 381-393.

IMS Collections Nonparametrics and Robustness in Modern Statistical Inference and Time Series Analysis: A Festschrift in honor of Professor Jana Jureˇ ckov´ a Vol. 7 (2010) 153–168 c Institute of Mathematical Statistics, 2010  DOI: 10.1214/10-IMSCOLL716

Integral functionals of the density David M. Mason*,1 Elizbar Nadaraya2 and Grigol Sokhadze3 University of Delaware and Tbilisi State University Abstract: We show how a simple argument based on an inequality of McDiarmid yields strong consistency and central limit results for plug-in estimators of integral functionals of the density.

1. Introduction Let X be a random variable with cumulative distribution function F having density f. Let us consider a general class of integral functionals of the form   (1.1) T (f ) = Φ f (0) (x), f (1) (x), . . . , f (k) (x) dx, IR

with k ≥ 0, where f (0) = f and f (j) denotes the jth derivative of f , for j = 1, . . . , k, if k ≥ 1, and Φ is a smooth function defined on IRk+1 . Under suitable regular conditions, which will be specified below, T (f ) is finite. Some special cases of (1.1) are  2 (1.2) (i) f (k) (x) dx. φ (f (x)) f (x) dx, (ii) Φ (f (x)) dx and (iii) IR

IR

IR

The estimation of integral functionals of the density and its derivatives has been studied by a large number statisticians over many decades. Such integral functionals frequently arise in nonparametric procedures such as bandwidth selection in density estimation and in location and regression estimation using rank statistics. For good sources of references to current and past research literature in this area along with statistical applications consult Nadaraya [9], Levit [7], and Gin´e and Mason [5]. We shall be studying plug-in estimators of T (f ). These estimators are obtained by replacing f (j) , for j = 0, . . . , k, by kernel estimators based on a random sample of X1 , , . . . , Xn , n ≥ 1, i.i.d. X, defined as follows. Let K (·) be a kernel defined on IR with properties soon to be stated. For h > 0 and each x ∈ IR define the function on IR Kh (x − ·) = h−1 K ((x − ·) /h) . 1 Statistics Program, University of Delaware, 206 Townsend Hall, Newark, DE 19716, USA, E-mail: [email protected] 2 Department of Mathematical Statistics, Tbilisi State University 2, Universitet str., Tbilisi, 0143, Republic of Georgia, E-mail: [email protected] 3 Department of Probability Theory and Mathematical Statistics, Tbilisi State University 2, Universitet str., Tbilisi, 0143, Republic of Georgia, E-mail: [email protected] ∗ Research partially an NSF Grant. AMS 2000 subject classifications: Primary 62E05, 62E20; secondary 62F07. Keywords and phrases: Integral functionals, kernel density estimators, inequalities, consistency and central limit theorem.

153

154

D. M. Mason, E. Nadaraya and G. Sokhadze

The kernel estimator of f based on X1 , , . . . , Xn , n ≥ 1, and a sequence of positive constants h = hn converging to zero, is 1

fhn (x) = Khn (x − Xi ), for x ∈ IR, n i=1 n

and the kernel estimator of f (j) , for j = 1, . . . , k, is 1 (j) (j) fhn (x) = K (x − Xi ), for x ∈ IR, n i=1 hn n

where Khn is the jth derivative of Khn . Note that Khn = hn−j−1 K (j) , where K (j) is (0) (0) the jth derivative of K. We shall often write fh (x) = fh (x) and Kh (x) = Kh (x). Also we denote the expectation of these estimators as (j)

(1.3)

(j)

(j) (j) fh (x) = E fh (x), for j = 0, . . . , k.

Our plug-in estimator of T (f ) is   (1) (k)  (1.4) T (fh ) = Φ fh (x), fh (x), . . . , fh (x) dx. IR

The goal of this paper is to show how a simple argument based on an inequality of McDiarmid yields a useful representation for T (fh ). This means that it can be written as a sum of i.i.d. random variables plus a remainder term that converges to zero at a good stochastic rate. This will permit us to establish a nice strong consistency result and central limit theorem for T (fh ). In the process we shall generalize and extend the results and methods of Mason [8] to multivariate integral functionals and estimators of the form (1.1) and (1.4). The [8] paper dealt solely with the special case in example (i). In a paper closely related to this one, [5] investigated the Levit [7] estimator of integral functionals of the density: (1.5) Θ(F ) = ϕ(x, F (x), f (x), . . . , f (k) (x)) dF (x) , IR

which is formed by replacing in (1.5) the cumulative distribution function F by the empirical distribution function Fn and the f (j) by modified kernel estimators. They used very powerful U–statistics inequalities to obtain uniform in bandwidth type consistency and central limit results for the Levit estimator. These are results that hold uniformly in an ≤ h ≤ bn , where an and bn are suitable sequences of positive constants converging to zero.   With a lot more effort, we could derive analog results here for T fh using the methods in [5], as well as the modern empirical process tools developed in Einmahl and Mason [4] and Dony, Einmahl and Mason [2] in their work on uniform in bandwidth consistency of kernel type estimators. However, such an endeavour is well beyond the scope of the present paper. We should point out that one cannot   extend our approach to handle the addition of x, F and Fn into T (f ) and T fh without imposing moment conditions on F and Φ. The reason is that one has to integrate with respect to dx instead of dF (x). Our representation theorem is stated and proved in Section 2. In Section3 we  use it to derive a strong consistency result and central limit theorem for T fh . We conclude by applying our central limit theorem to the three examples in (1.2).

Integral functionals

155

2. A representation theorem Before we state our representation theorem, we shall gather together our basic assumptions along with some of their implications that will be used throughout this paper. Assumptions on the density f . (F.i) The density function f is continuously differentiable up to order k ≥ 1, if k ≥ 1.   (F.ii) For some constant M > 0, supx∈IR f (j) (x) ≤ M for j = 0, . . . , k. (F.iii) For each j = 0, . . . , k, f (j) ∈ L1 (IR). Assumptions on the kernel K. (K.i) (K.ii)

IR

|K| (x) dx = κ < ∞.

IR

K(x) dx = 1.

(K.iii) The kernel K is k + 1-times continuously differentiable. (K.iv) For some D > 0, supx∈IR |K (j) (x)| ≤ D < ∞, j = 0, . . . , k + 1. (K.v) For each j = 0, . . . , k, lim|x|→∞ K (j) (x) = 0 and K (j) ∈ L1 (IR). We shall repeatedly use the fact following by integration by parts that under our assumptions on f and K, that for j = 0, . . . , k, (2.1)     x−y x−y (j) −j−1 (j) −1 f (y) dy = h f (j) (y) dy. fh (x) = h K K h h IR IR For j = 0, 1, . . . , k, set gn,h (x) = hj fh (x) = (j)

(j)

1 (j) K ((x − Xi ) /h). nh i=1 n

Our assumptions on f and K permit us to apply Theorem 2 of [2] to get for some h0 > 0, every c > 0 and each j = 0, 1, . . . , k, with probability 1,  √  (j)  (j) nh gn,h (x) − Egn,h (x)  =: Gj (c) < ∞. (2.2) lim sup sup sup n→∞ c log n ≤h≤h0 x∈R | log h| ∨ log log n n This implies that as long as hn converges to zero at a rate such that hn ≥ (c log n) /n for some c > 0, for each j = 0, 1, . . . , k, with probability 1, * )   | log hn | ∨ log log n  (j)  (j) (2.3) sup fhn (x) − fhn (x) = O . √ 1/2+j x∈R nhn To see this, notice that h

−j

(j) Egn,h

(x) =

(j) fh (x)

=

h IR

−j−1

 K

(j)

x−y h

 f (y) dy,

156

D. M. Mason, E. Nadaraya and G. Sokhadze (j)

where fh (x) is as in (1.3). Now by applying the formula (2.1) we get (j) fh (x)

=h

−1



K IR

x−y h

which, in turn, by the change of variables v =

 f (j) (y) dy,

x−y h

or y = x − hv

(2.4)

K (v) f (j) (x − hv) dv.

= IR

From (2.4) we get via (K.i) and (F.ii) that    (j)  (2.5) sup fh (x) ≤ κM , 0 ≤ j ≤ k. x∈IR

Therefore as long as  √ | log hn | ∨ log log n/( nh1/2+k ) → 0, as n → ∞, (2.6) n we can infer from (2.3) that with probability 1 for all large enough n    (1) (k) fh (x), fh (x), . . . , fh (x) : x ∈ IR ⊂ C,

(2.7)

where C is any open convex set such that (2.8)

[−κM, κM ]

k+1

⊂ C.

Assumptions on Φ (Φ.i) Φ(0, . . . , 0) = 0. (Φ.ii) The function Φ possesses all derivatives up to second order on an open convex k+1 . set C containing [−κM, κM ] (Φ.iii) The second order derivatives of Φ are uniformly bounded on C by a constant BΦ > 0. For j = 0, . . . , k, let (2.9)

Φj (y0 , y1 , . . . , yk ) =

∂Φ(y0 , y1 , . . . , yk ) ; ∂yj

and for 0 ≤ i, j ≤ k set Φi,j (y0 , y1 , . . . , yk ) =

∂ 2 Φ(y0 , y1 , . . . , yk ) . ∂yi ∂yj

Our assumptions on Φ say that for all 0 ≤ i, j ≤ k, (2.10)

sup {|Φi,j | (y0 , y1 , . . . , yk ) : (y0 , y1 , . . . , yk ) ∈ C} ≤ BΦ .

  We shall first verify that T (f ), T (fh ) and T fh are finite.

Integral functionals

157

Notice that by Taylor’s theorem for each (y0 , y1 , . . . , yk ) ∈ C for some y k ∈ C    k  k



 1  |Φ| (y0 , y1 , . . . , yk ) =  Φj (0, 0, . . . , 0) yj + Φi,j ( yk ) yi yj  , 2 i,j=0 IR  j=0  which for some constant AΦ is



≤ AΦ ⎝

k

j=0

|yj | +

k

⎞ |yj |

2⎠

.

j=0

This implies using (2.10) that for any k+1 bounded measurable functions ϕ0 ,. . . , ϕk in L1 (IR) taking values in C, |Φ| (ϕ0 (x) , ϕ1 (x) , . . . , ϕk (x)) dx < ∞. (2.11) IR

From the assumptions on f and K we can easily infer that the functions f (j) and (j) fh , j = 0, . . . , k are bounded and in L1 (IR) . This when combined with (2.5) and (2.11) implies that both T (f ) and T (fhn ) are finite. Similarly, the assumptions on (j) K imply that each fh is bounded and in L1 (IR) , which in combination with (2.7) and (2.11) gives, with probability 1, that the estimator T (fhn ) is finite for all n sufficiently large. Next we shall represent the difference T (fhn ) − T (fhn ) as a sum of i.i.d. random variables Sn (hn ) plus a remainder term Rn . By Taylor’s formula we can write T (fhn ) − T (fhn ) = Sn (hn ) + Rn ,

(2.12)

where for any h > 0, Sn (h) is the sum of i.i.d. random variables k  

(1) (k) (j) (j) (h) = Φj (fh (x), fh (x), . . . , fh (x)) fh (x) − fh (x) dx; (2.13) Sn j=0

IR

and Rn is the remainder term k    1

(i) (i) (j) (j) (2.14) Rn = Φi,j ( yk (x)) fhn (x) − fhn (x) fhn (x) − fhn (x) dx, 2 i,j=0 IR with y k (x) on the line joining (1) (k) (1) (k) (fh (x), fh (x), . . . , fh (x)) and (fh (x), fh (x), . . . , fh (x)).

Here is our representation theorem. It determines the size of the stochastic remainder term Rn in the representation (2.12). Our consistency result and central limit theorem for T fh will follow from it. Theorem 2.1. Assume the above conditions on the density f , the kernel K and the function Φ. Then for any positive sequence h = hn ≤ 1 converging to zero at the rate (2.6) the remainder term in the representation (2.12) satisfies, with probability 1,   (2.15) Rn = O log n/(nh2k+1 ) . n Moreover, (2.16)

  ) . Rn = Op 1/(nh2k+1 n

158

D. M. Mason, E. Nadaraya and G. Sokhadze

Remark 2.2. We call (2.15) a strong representation and (2.16) a weak representation. Proof of Theorem 2.1. Applying standard inequalities, we get from (2.10), (2.5) and (2.7) that for some CΦ > 0, with probability 1 for all large n, |Rn | ≤ CΦ

(2.17)



k  IR j=0

(j) (j) fhn (x) − fhn (x)

2 dx.

Let Wk be the Sobolev space of functions g having continuous derivatives of order up to k ≥ 1, each in L2 (IR) , with the Sobolev norm 7 8 k 8

gk = 9 |g (j) (x)|2 dx. IR

j=0

The space Wk has the inner product $g1 , g2 %k =

k

j=0

(j)

(j)

g1 (x)g2 (x) dx. IR

Set rn (k) = fhn − fhn 2k . We see that with this notation, |Rn | ≤ CΦ rn (k). Next set 1 Yi = Yi (x) = {Khn (x − Xi ) − fhn (x)} , n where fh (x) = E fh (x). Then n

n

n

Yi (x) =

i=1

n 1

{Khn (x − Xi ) − fhn (x)} = fhn (x) − fhn (x). n i=1

Therefore

62 6 n 6 6

6 6 Yi 6 . rn (k) = 6 6 6

(2.18)

i=1

k

Let us now estimate the  · k norm of the function gi = gi (x) = n1 Khn (x − Xi ) for each i = 1, . . . , n. We have ⎛ ⎞1/2  k 2

1 (j) gi k = ⎝ Khn (x − Xi ) dx⎠ 2 n IR j=0 ⎛

⎞1/2  2 k 

1 x − X 1 i =⎝ 2 K (j) dx⎠ n j=0 IR hj+1 hn n ⎛ ⎞1/2  2  k x − X 1 ⎝ 1 x − X i i⎠ K (j) = d n j=0 h2j+1 hn hn IR n ⎛ ≤⎝

k 

K (j) (u)

j=0

IR

2

⎞1/2 du⎠

 5 ! . n h2k+1 n

Integral functionals

159

Therefore  5 ! 2k+1 =: Dn /2. n hn gi k ≤ Kk

(2.19)

Note that (K.iv) and (K.v) imply that K2k is finite. Observe that (2.19) yields the bound, Yi k ≤ gi k + Egi k ≤ Dn .

(2.20)

We shall control the size of rn (k) using McDiarmid’s inequality, which for convenience we state here. McDiarmid’s inequality (See Devroye [1]) Let Y1 , . . . , Yn be independent random variables taking values in a set A and assume that the function H : An → IR, satisfies for each i = 1, . . . , n and some ci , |H(y1 , . . . , yi−1 , yi , yi+1 , . . . , yn ) − H(y1 , . . . , yi−1 , y, yi+1 , . . . , yn )| ≤ ci .

sup y1 ,...,yn ,y,∈A

then for every t > 0, ) P {|H(Y1 , . . . , Yn ) − EH(Y1 , . . . , Yn )| ≥ t} ≤ 2 exp −2t

2

n 5

* c2i

.

i=1

Applying McDiarmid’s inequality, in our situation, with 6 6 n 6

6 6 6 H(Y1 , . . . , Yn ) = 6 Yi 6 6 6 i=1

k

and ci = 2Dn , for i = 1, . . . , n, which comes from (2.20), we obtain for every t > 0, 6 6 6  4 3 6 n  2 2k+1  n 6

6   6 6 t nhn 6 6 6  6 . Yi 6 − E 6 Yi 6  ≥ t ≤ 2 exp − (2.21) P 6 6 6 6  6 2K2k i=1

k

i=1

k



√ Setting t = 2 log n/ nh2k+1 into the probability bound in (2.21), we get via the n Borel–Cantelli lemma that with probability 1, 6 6 6 6 * ) √ n n 6

6

6 6 log n 6 6 6 6 (2.22) . Yi 6 = E 6 Yi 6 + O  6 6 6 6 6 nh2k+1 n i=1

k

i=1

k

Furthermore, by Jensen’s inequality, 6 * 6 6 n ) 6 n  k n

6 6 2 6 62

2 6 6 6 6 (j) E6 Yi (x) dx, E Yi 6 ≤E6 Yi 6 = 6 6 6 6 IR i=1

k

i=1

k

i=1 j=0

that is, 6 * ) 6 n n k 6 6 2  2 1

6 6 (j) (j) E6 Yi 6 ≤ 2 E Kh (x − Xi ) − fh (x) dx 6 6 n i=1 j=0 IR i=1 k

160

D. M. Mason, E. Nadaraya and G. Sokhadze



 2  n k x − Xi 1

1 (j) E K dx n2 i=1 j=0 IR hn hj+1 n

 2  n

k

1 x − Xi (j) ≤ E K dx hn n2 h2k+2 n i=1 j=0 IR =

n

k  2  x − y 

1 (j) f (y) dydx, K 2 hn n2 h2k+2 n i=1 j=0 IR

which by using Fubini’s theorem is seen to   (2.23) = K2k / nh2k+1 . n From (2.18), (2.22) and (2.23) we conclude for sequence h = hn con any positive   verging to zero at the rate (2.6) that Rn = O log n/ nh2k+1 , a.s. n The proof of (2.18) follows similar lines. Therefore we have proved our main result.  3. Applications of the representation theorem 3.1. Consistency As our first  application of Theorem 2.1 we shall establish a strong consistency result  for T fh . Theorem 3.1. Assume the conditions of Theorem 2.1. If a positive sequence h = hn ≤ 1 is chosen so that   (3.1) log n/ nh2k+1 → 0, n then with probability 1, we have, as n → ∞, T (fhn ) → T (f ) .

(3.2)

Proof of Theorem 3.1. First, by Theorem 2.1 and (3.1), (3.3)

T (fhn ) − T (fhn ) = Sn (hn ) + Rn with Rn = o(1), a.s.

Let X1 , . . . , Xn be i.i.d. with density f. Recall the definition of Φj in (2.9) and set for i = 1, . . . , n,

(3.4)

Zi (hn ) :=

k

j=0

(1)

IR

(k)

(j)

Φj (fhn (x), fhn (x), . . . , fhn (x))Khn (x − Xi ) dx.

and for future reference write for any h > 0 and X with density f , (3.5)

Z (h) :=

k

j=0

(1)

(k)

(j)

Φj (fh (x), fh (x), . . . , fh (x))Kh (x − X) dx. IR

Integral functionals

161

In this notation we can write Sn (hn ) = n−1

(3.6)

n

{Zi (hn ) − EZi (hn )} .

i=1

Keeping in mind that (2.5) implies    (1) (k) k+1 (3.7) fh (x), f h (x), . . . , f h (x) : x ∈ IR ⊂ [−κM, κM ] and that we can infer from the assumptions on Φ that for some DΦ > 0,   k+1 ≤ DΦ sup |Φj | (y0 , y1 , . . . , yk ) : (y0 , y1 , . . . , yk ) ∈ [−κM, κM ] we get that for 1 ≤ i ≤ n, |Zi (hn )| ≤

k

IR

j=0

≤ DΦ

   (j)  (1) (k) |Φj | (fhn (x), fhn (x), . . . , fhn (x)) Khn  (x − Xi ) dx

  k  k  



 (j)   (j)  x − Xi −j−1 (x − X dx ) dx = D h Khn  K  i Φ n hn IR j=0 IR j=0 = DΦ

k

h−j n

   (j)  K  (u) du ≤ Lh−k n IR

j=0

for some L > 0. Therefore we can apply Hoeffding’s inequality [6] to get, √ . 2 log nL ≤ 2 exp (−2 log n) , P |Sn (hn )| > √ k nhn from which we readily conclude using the Borel–Cantelli lemma that, with probability 1, !  (3.8) Sn (hn ) = O log n/ (nh2k ) . n ! Thus whenever

log n nh2k n

= o(1), then, with probability 1,

(3.9)

Sn (hn ) = o (1) .

Next we shall show that T (fh ) → T (f ). Recall by (2.4), for each j = 0, . . . , k, (j) fhn (x) = K (v) f (j) (x − hn v) dv, IR

which by (F.ii), (K.i) and the dominated convergence theorem implies that for each j = 0, . . . , k, (j) fhn (x) → f (j) (x) for a.e. x ∈ IR. Thus for a.e. x ∈ IR, as → ∞, (3.10)

(1)

(k)

Φ(fhn (x), fhn (x), . . . , fhn (x)) → Φ(f (x), f (1) (x), . . . , f (k) (x)).

162

D. M. Mason, E. Nadaraya and G. Sokhadze

Write for each j = 0, . . . , k,         (j) |K| (v) f (j)  (x − hv) dv and g (j) = κ f (j)  , ghn (x) = IR

   (j)  (j) where κ is as in (K.i). Clearly for each j = 0, . . . , k, fhn  ≤ ghn , and (3.11)

(j)

ghn (x) → g (j) (x) for a.e. x ∈ IR.

Notice that for each n ≥ 1 and j = 0, . . . , k,    (j)  (j) (3.12) ghn (x) dx = g  (x) dx. IR

IR

Also since Φ (0, . . . , 0) = 0 and Φ is assumed to be differential with continuous derivatives Φj on C, where C satisfies (2.8), we get by (3.7) and the mean value theorem that for some MΦ > 0, (3.13)

(1)

(k)

|Φ| (fhn (x), fhn (x), . . . , fhn (x)) ≤ MΦ

k

(j)

ghn (x), for all x ∈ IR.

j=0

From (3.10), (3.11), (3.12) and (3.13), we readily that as n → ∞, (1) (k) T (fh ) = Φ(fhn (x), fhn (x), . . . , fhn (x)) dx → T (f ) , IR

using a standard convergence result that is stated, for instance, as problem 12 on p. 102 of Dudley [3]. It says that if fn and gn are integrable functions for a measure μ with |fn | ≤ gn , such that as n → ∞, fn (x) → f (x) and gn (x) → g (x) for almost all x. Then gn dμ → g dμ < ∞, implies that fn dμ → f dμ.   Therefore whenever T fh − T (fh ) = o (1) a.s., we have (3.14)

  T fh − T (f ) → 0 a.s.

n Now (3.1), i. e., nhlog 2k+1 → 0, implies n which imply (3.14).

log n nh2k n

→ 0. Thus both (3.3) and (3.9) hold, 

Remark 3.2. In the case k = 0, Theorem 3.1 generalizes the first part of Theorem 2 in [8] from k = 0 to k ≥ 0 and to a larger class of functions Φ. Moreover, the proof of Theorem 3.1 completes that of the first part of Theorem 2 of [8]. A final easy step showing that T (fh ) → T (f ) is missing there. 3.2. Central limit theorem Inthis section we shall use Theorem 2.1 to establish a central limit theorem for T fh . Before stating and proving our result, we must first introduce some additional assumptions and then derive a limiting variance needed in its formulation. Assumptions on the density f .

Integral functionals

163

(F.iv) Assume that for some 0 < M < ∞, |f (x)| ≤ M for x ∈ IR, and if k ≥ 1 then f is 2k-times continuously differentiable and its derivatives f (j) satisfy for x ∈ IR, |f (j) (x)| ≤ M < ∞, j = 1, . . . , 2k. Assumptions on the kernel K. We assume conditions (K.i)-(K.v) on the kernel. Assumptions on Φ. Φ : Rk+1 → R, k ≥ 0, such that Φ (0, . . . , 0) = 0 and all of its partial derivatives in y0 , . . . , yk , ∂ m0 ∂ mk Φ(y0 , . . . , yk ), m0 . . . ∂y0 ∂ykm0 where m0 + m1 + · · · + mk = j, 0 ≤ j ≤ k + 1, k+1

are continuous on an open convex set C containing [−κM, κM ] uniformly bounded on C by a constant BΦ > 0.

and they are

Preliminaries to calculating a variance Let p and q be m times continuously differentiable functions such that for each 0≤j≤m lim p(j) (±v) q (m−j) (±v) = 0.

(3.15)

v→∞

We shall be use the formula following from integration by parts and (3.15): m (m) p (v) q (v) dv = (−1) p (v) q (m) (v) dv. (3.16) R

R

Set

(j) fh (y

+ hu) =

h IR

−j−1

 K

(j)

 y−t + u f (t) dt, h

which by the change of variable v = y−t h + u or t = y + h (u − v) h−j K (j) (v) f (y + h (u − v)) dv. = IR

Applying, in turn, the formula (3.16) we get (j) (3.17) fh (y + hu) = K (v) f (j) (y + h (u − v)) dv. IR

Notice from (3.17), (F.iv) and (K.i), we get from the bounded convergence theorem that for every 0 ≤ j ≤ 2k and a.e. y ∈ IR (3.18)

(j)

fh (y + hu) → f (j) (y).

Let Ψ be a function from IRk+1 → IR satisfying the assumptions on Φ and set

164

D. M. Mason, E. Nadaraya and G. Sokhadze

(3.19)

  Ψ (y) = Ψ f (y), f (1) (y), . . . , f (k) (y)

and

  (1) (k) Ψ (y, h) = Ψ fh (y), fh (y), . . . , fh (y) .

(3.20) Notice that we have

  (1) (k) Ψ (y + hu, h) = Ψ fh (y + hu), fh (y + hu), . . . , fh (y + hu) .

(3.21)

Clearly by (3.18), Ψ (y + hu, h) → Ψ (y). Let for j = 0, . . . , k, Ψj (y0 , y1 , . . . , yk ) =

∂Ψ(y0 , y1 , . . . , yk ) . ∂yj

Further set for j = 0, . . . , k,   Ψj (y) = Ψj f (y), f (1) (y), . . . , f (k) (y)   (1) (k) Ψj (y, h) = Ψj fh (y), fh (y), . . . , fh (y) .

and Note that we have

  (1) (k) Ψj (y + hu, h) = Ψj fh (y + hu), fh (y + hu), . . . , fh (y + hu) .

We see that

⎞ ⎛ k

dΨ (y + hu, h) (j+1) Ψj (y + hu, h) fh (y + hu)⎠ . = h⎝ du j=0

Write Ψ(1) (y0 , y1 , . . . , yk+1 ) =

k

Ψj (y0 , y1 , . . . , yk ) yj+1,

j=0

and observe that Ψ(1) (y) := We see that

where

  d Ψ (y) = Ψ(1) f (y) , . . . , f (k+1) (y) . dy

dΨ (y + hu, h) = hΨ(1) (y + hu, h) , du   (k+1) Ψ(1) (y + h, h) = Ψ(1) fh (y + hu) , . . . , fh (y + hu) .

We shall write

  (k+1) Ψ(1) (y, h) = Ψ(1) fh (y) , . . . , fh (y) .

Now for m ≥ 1 set (m−1)

Ψj

(y0 , y1 , . . . , yk+m−1 ) =

d (m−1) Ψ (y0 , y1 , . . . , yk+m−1 ) , 0 ≤ j ≤ k + m − 1. dyj

Integral functionals (0)

Here Ψ(0) = Ψ and Ψj

165

= Ψj . Also let

Ψ(m) (y0 , y1 , . . . , yk+m ) =

k+m−1

(m−1)

Ψj

(y0 , y1 , . . . , yk+m−1 ) yj+1 ,

j=0

and note that Ψ(m) (y) :=

  dm Ψ (y) = Ψ(m) f (y) , . . . , f (k+m) (y) . m dy

Set

  (k+m) (y + hu) Ψ(m) (y + h, h) = Ψ(m) fh (y + hu) , . . . , fh

and

  (k+m) (y) . Ψ(m) (y, h) = Ψ(m) fh (y) , . . . , fh

We readily get that dm Ψ (y + hu, h) = hm Ψ(m) (y + hu, h) dum

(3.22) and, as h & 0, (3.23)

h−m

dm Ψ (y + hu, h) dm (m) (m) = Ψ (y + hu, h) → Ψ (y) = Ψ (y) . dum dy m

Computation of limit variance We are now prepared to compute our limiting variance. Let Φj (x) and Φj (x, h) be defined exactly as Ψj (x) and Ψj (x, h). Recall the definition of Sn (h) in (2.13) and that of Z (h) in (3.5). We can write Sn (h) =

k

IR

j=0

and Z (h) =

  (j) (j) Φj (x, h) fh (x) − fh (x) dx.

k

Φj (x, h) h−j−1 K (j)



IR

j=0

x−X h

 dx.

Thus we see that if Z1 (h), . . . , Zn (h) are i.i.d. Z (h), then Sn (h) =d n−1

n

(Zi (h) − EZi (h)) .

i=1

Now EZ (h) =

k +

j=0

(3.24)

=

IR

IR

 K

(j)

IR

k +

j=0

Φj (x, h) h

−j−1

Φj (y + hu, h) h IR

−j

x−y h



, dx f (y) dy

, K

(j)

(u) du f (y) dy.

166

D. M. Mason, E. Nadaraya and G. Sokhadze

Note that we get from (3.16) and (3.22), the identity (j) j −j (j) (3.25) Φj (y + hu, h) h K (u) du = (−1) Φj (y + hu, h)K (u) du, IR

IR

and from (3.23) we conclude that for a.e. y ∈ IR and all u, as h & 0,   dj (1) (k) (j) f (y), f Φ (y), . . . , f (y) =: Φj (y) . j h h dy j

(j)

Φj (y + hu, h) →

(3.26) Set (3.27)

μk (y) =

k

j

(j)

(−1) Φj (y) .

j=0

Note that our assumptions imply that for some B > 0 and all h > 0 and j = 0, . . . , k,      (j)  (j)   max sup Φj (y + hu, h) ≤ B and thus Φj (y + hu, h) K(u) ≤ B |K| (u) . 0≤j≤k u,y

Therefore by (3.26) and the dominated convergence theorem as h & 0 (j) (j) (j) Hh (y) := Φj (y + hu, h) K (u) du → Φj (y) . IR

   (j)  Now Hh (y) ≤ B IR |K| (u) du = Bκ. Hence by the bounded convergence theorem (j) (j) Hh (y) f (y) dy → Φj (y) f (y) dy. IR

IR

This of course implies that as h → 0, ⎫ ⎧ ⎨

k ⎬ j (j) EZ (h) → (−1) Φj (y) f (y) dy = μk (y) f (y) dy = Eμk (X) . ⎭ IR ⎩ IR j=0

Next write for 0 ≤ j, m ≤ k,     x−y z−y K (m) dxdz Φj (x, h) Φm (z, h) h−2−j−m K (j) γj,m (y) = h h IR2 = Φm (y + hu, h) Φj (y + hv, h) h−j−m K (j) (u) K (m) (v) dvdu. IR2

Similarly we see that Eγj,m (X) → (−1)

(j)

m+j

Φ(m) m (y) Φj (y) f (y) dy. IR

Therefore since EZ 2 (h) =

k

k

j=0 m=0

γj,m (y) f (y) dy, IR

we conclude that as h → 0, EZ (h) → 2

k

k

j=0 m=0

(j)

(m) Φm (y) Φj (y) (−1) IR

m+j

f (y) dy = Eμ2k (X) .

Integral functionals

167

Clearly the same proof shows that as h → 0, EZ 4 (h) → Eμ4k (X) . Also it is readily verified that Eμk (X), Eμ2k (X) and Eμ4k (X) are finite under the conditions on Φ and f . In summary, we get Lemma Under the above assumptions for any sequence of positive numbers hn → 0, as n → ∞, (3.28)

nV ar (Sn (hn )) = V ar (Z(hn )) → V ar (μk (f (X))) =: σ 2 (f ) < ∞

and EZ 4 (hn ) → Eμ4k (X) < ∞.

(3.29)

Part (2.16) of Theorem 2.1 and the above lemma, combined with Lyapunov’s central limit theorem, yield the next result. Theorem 3.3. Under the above assumptions imposed in this subsection on the density f , the √ kernelK and function Φ, if a positive sequence h = hn ≤ 1 is chosen so that 1/ nh2k+1 → 0 then n    √  (3.30) n T (fh ) − T (fh ) →d N 0, σ 2 (f ) . In the next subsection, we shall discuss smoothness conditions that permit the replacement of T (fh ) by T (f ) in (3.30). 3.3. Three examples of the application of Theorem 3.3 In this subsection we apply Theorem 3.3 to the three examples (i), (ii) and (iii) in (1.2). In the first two k = 0, so in √ addition to the smoothness conditions in our central limit theorem, we require nhn → ∞. In example (i), μ0 (f (x)) = d Φ1 (f (x)) , where Φ1 (x) = dx (φ (x) x) , giving σ 2 (f ) = V ar (Φ1 (f (X))) . This matches with the second part of Theorem 2 of [8]. In example (ii), one gets that μ0 (f (x)) = Φ (f (x)) and σ 2 (f ) = V ar (Φ (f (X))) . This agrees with Theorem 3 of [5]. Note that example (i) is a special√case of (ii). To apply Theorem 3.3 to example (iii) we must choose  hn such that nhk+1 → ∞. In this case μk (f (x)) = n (2k) 2 2f (x), and σ (f ) = V ar 2f (2k) (X) , which is in agreement with Theorem 4 of [5]. Let us now briefly discuss conditions under which we can replace T (fh ) by T (f ) in (3.30). Towards this end, we cite here Proposition 1 of [5], which, in turn, was motivated by Proposition 1 of [7]. Proposition 3.4. Assume that K is integrable, has compact support, and for some integer s ≥ 1, (3.31) K(u) du = 1, uk K(u) du = 0 for k = 1, . . . , s, IR

IR

and let H be a non-negative measurable function. Then there is a constant CK > 0 such that, for every s times continuously differentiable function g satisfying for some h0 > 0, Lg > 0, 0 < α ≤ 1,     (3.32) sup |h|−α g (s) (x + h) − g (s) (x) =: Lg H(x), for every x ∈ IR, |h|≤h0

168

D. M. Mason, E. Nadaraya and G. Sokhadze

one has, for all 0 < h ≤ h0 and every x ∈ IR,      1 x−u  ≤ hs+α CK Lg H(x).  du − g (x) (3.33) g (u) K  h h IR Therefore if our kernel K also has compact support and satisfies (3.31) with s = 1 and our density f fulfills condition (3.32) with α = 1 and H ∈ L1 (IR), then for all h > 0 small enough, for every Φ, which is Lipschitz on [−κM, κM ], there exists a constant B > 0 such that     2   Φ (fh (x)) dx − Φ (f (x)) dx ≤ h B H(x) dx.  IR

IR

Thus if we have both and (ii) that (3.34)



IR

√ nhn → ∞ and nh2n → 0, we can conclude in examples (i)

   √    n T fhn − T (f ) →d N 0, σ 2 (f ) .

We can also apply this proposition to example (iii) for any k √≥ 1. Here, in order to able to replace T (fh ) by T (f ), we require that both nh2k+1 → ∞ and n √ bes+α nhn → 0, where s and α satisfy (3.32) and (3.33). Acknowledgements The authors thank the two referees for a careful reading of the manuscript. References [1] Devroye, L. (1991). Exponential inequalities in nonparametric estimation. Nonparametric functional estimation and related topics (Spetses, 1990), 31–44, NATO Adv. Sci. Inst. Ser. C Math. Phys. Sci., 335, Kluwer Acad. Publ., Dordrecht. [2] Dony, J., Einmahl, U. and Mason, D. M. (2006). Uniform in bandwidth consistency of local polynomial regression function estimators. Austrian J. Statistics 35 105–120. [3] Dudley, R. M. (1989). Real Analysis and Probability, Chapman & Hall Mathematics Series, New York. [4] Einmahl, U. and Mason, D. M. (2005). Uniform in bandwidth consistency of kernel-type function estimators. Ann. Statist. 33 1380–1403. ´, E. and Mason, D. M.(2008) Uniform in bandwidth estimation of inte[5] Gine gral functionals of the density function. Scand. J. Statist. 35 739–761. [6] Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables, J. Amer. Statist. Assoc. 58 13–30. [7] Levit, B. Ya (1978). Asymptotically efficient estimation of nonlinear functionals. (Russian) Problems Inform. Transmission 14 65–72. [8] Mason, D. M. (2003). Representations for integral functionals of kernel density estimators. Austrian J. Statistics 32 131–142. [9] Nadaraya, E. A. (1989). Nonparametric estimation of probability densities and regression curves. (Translated from the Russian) Mathematics and its Applications (Soviet Series), 20. Kluwer Academic Publishers Group, Dordrecht, 1989; Russian original: Tbilis. Gos. Univ., Tbilisi , 1983.

IMS Collections Nonparametrics and Robustness in Modern Statistical Inference and Time Series Analysis: A Festschrift in honor of Professor Jana Jureˇ ckov´ a Vol. 7 (2010) 169–181 c Institute of Mathematical Statistics, 2010  DOI: 10.1214/10-IMSCOLL717

Qualitative robustness and weak continuity: the extreme unction? Ivan Mizera1,∗ University of Alberta Abstract: We formulate versions of Hampel’s theorem and its converse, establishing the connection between qualitative robustness and weak continuity in full generality and under minimal assumptions.

1. Qualitative robustness The definition of qualitative robustness was given by Hampel [11]. Suppose that tn is a sequence of statistics (estimators or test statistics), that, for each sample size n, describe a procedure. Let P be a probability measure that identifies the stochastic model we believe that underlies the data, and let LP (tn ) be the distribution of tn under this stochastic model; Hampel [11] implicitly views the data as independent, identically distributed random elements of some sampling space X (assumed to be complete separable metric space), with P then the common distribution of these random elements, a member of P(X ), the space of all probability measures on X (defined on the Borel σ-field generated by the topology of X ). Let π denote the Prokhorov metric on P(X ), as defined in Huber [15]; see also Section 3 below. Definition 1. Let P be a probability measure from P(X ). A procedure tn is called qualitatively robust at P if for any  > 0 there is δ > 0 such that (1)

π(P, Q) ≤ δ implies π(LP (tn ), LQ (tn )) < 

for all sufficiently large n. The fact that (qualitative) “robustness is related to some form of continuity”, as we can read, for instance, on page 72–73 of Maronna et al. [18], became a part of universal statistical knowledge. It was demonstrated already by Hampel [11] for procedures representable by functionals on the space P(X ), the procedures that can be summarized in terms of a functional, T , defined on a subset of P(X ) rich enough to guarantee that for any relevant collection of xi ’s, (2)

tn (x1 , . . . , xn ) = T (Δx1 ,...,xn ),

where Δx1 ,...,xn stands for the empirical probability supported by the points x1 , x2 , . . . , xn (the probability allocating mass 1/n to every of the xi ’s). For procedures representable by functionals, qualitative robustness is essentially equivalent to ∗ Research

supported by the Natural Sciences and Engineering Research Council of Canada. of Mathematical and Statistical Sciences, University of Alberta, Edmonton, Alberta, T6G 2G1, Canada. e-mail: [email protected] AMS 2000 subject classifications: Primary 62F35, 62B99; secondary 60K35, 60B10. Keywords and phrases: qualitative robustness, weak continuity, consistency. 1 Department

169

170

I. Mizera

weak continuity, the continuity with respect to the weak convergence of probability measures—as defined, for instance, by Billingsley [1]. Possible subtleties arising in this context can be illustrated on a very simple (and already discussed elsewhere) example: median. We define the estimator as the value of t where the graph of the function 1

sign(Xi − t) n i=1 n

(3)

ψ(t) =

crosses zero level. This happens when ψ(t) = 0; but there may be no such t, as ψ is not continuous. Nonetheless, given that ψ is nondecreasing, we can complete its graph by vertical segments connecting the jumps, and then take t giving the location where such augmented graph intersects the horizontal coordinate axis. Such a provision takes care of jumps, but still leaves possible ambiguity: sometimes there may be not one, but several t such that ψ(t) = 0. (Note that if ψ happens to cross zero level at a jump, then the corresponding location is unique.) To finalize the definition of median, we have to adopt some “ambiguity resolution stance”. Roughly, there are three possibilities. (i) Ignore: that is, consider median defined not for all n-tuples of data, but only for those for which it is defined uniquely. In statistical science, such a strategy is often vindicated by the fact that data configurations yielding ambiguous results happen to be rather a mathematical than practical phenomenon, especially if the underlying stochastic model implies their occurrence with probability zero. While this point of view is pertinent, for instance, for the Huber estimator—which in theory can yield non-unique result, but in practice seldom will—for the median, however, the ambiguity is bound to occur for most of data configurations with even n. (ii) View the definition as set-valued: instead of uniquely defined median, consider a median set—in this case always a closed interval, due to the monotonicity of ψ. This strategy is likely to be successful if it can be pursued without invoking too much of non-standard mathematics—simply as an attitude that instead of ignoring ambiguous data configurations, one can rather admit an occasional possibility of multiple solutions and still maintain some theoretical control over these, as pointed out by Portnoy and Mizera [23] in the discussion of Ellis [8]. (iii) Consider a suitable selection: that is, define the median as a point selected in some specific way from the median set. The often used alternative is the midpoint of the median interval—but minimum or maximum could be considered too. The selection strategy may be naturally suggested by the implementation of the method, when a specific algorithm returns some particular, and thus unique, solution. A functional representation of the median can be obtained via the straightforward extension of (3): we define the median functional, T (P ), to be the location t where the graph of (4) ψP (t) = sign(x − t)P (dx) crosses the zero level (using the same provision as above to define what this precisely means). Any “ambiguity resolution stance” mentioned above can be directly generalized to this situation.

Qualitative robustness and weak continuity

171

A standard argument shows that if Pn → P weakly, then ψPn (t) → ψP (t) for every continuity point t of ψ; further analytical argument based on monotonicity yields that every limit point of a sequence tn of points giving locations where ψPn crosses zero level is a location where ψP crosses zero level. In the terminology introduced below, T is weakly semicontinuous at every P , and weakly continuous at every P for which it is uniquely defined. Therefore, by Theorem 6.2 of Huber [15], or by our Theorem 1, median is qualitatively robust at every P for which T is uniquely defined. The justification of this step—from continuity to qualitative robustness—is the theme of this note, and we will return to it in the next section. Let us illustrate now why uniqueness is necessary, on a simple example (capturing nevertheless the essence of behavior for any P yielding a non-degenerate median interval): let P be concentrated with equal mass 1/2 in two points, −1, and +1. Fix  = 1/4, say. Let + Q− α and Qα be concentrated on {−1, +1} with corresponding probabilities 1/2 + α, 1/2 − α and 1/2 − α, 1/2 + α, respectively; given δ > 0, we can always choose α > 0 + so that both π(P, Q− α ) < δ and π(P, Qα ) < δ. A standard probabilistic argument yields that we can find N such that for any n > N , the probability of median being +1 is bounded from above by 1/4, if we sample from Q− α ; and the same bound takes place for the probability of median being −1, if we sample from Q+ α . Consequently, π(LQ1 (tn ), LQ2 (tn )) > 1/4 for n > N . Given that we arrived to this for fixed  and arbitrary δ, we conclude that median cannot be qualitatively robust at P . Note that to reach this conclusion, we essentially do not need to know how the estimator is defined in ambiguous situations: albeit data configurations with equal number of −1’s and +1’s are possible, they occur only with small probability, which further decreases to 0 for n → ∞. The fact that qualitative robustness requires uniqueness of T at P does not depend, and would not change with an adopted “ambiguity resolution stance”. The probabilistic behavior of a sample of size n from P is not relevant either: except for the data configuration with equal number of −1’s and +1’s, which occurs with small probability ηn (tending to 0 with growing n), there are only two possible cases, by symmetry each occurring with probability 1 − ηn /2: either −1’s or +1 are in majority, and the median is then unambiguously equal to −1 or +1. The situation is somewhat different in the functional setting, where the weak continuity of T depends on the adopted “ambiguity resolution stance”. If we take in the example above T (P ) to be 0, the midpoint of the median set [−1, 1], then we loose continuity: for α → 0 we have T (Q− α ) = −1 for all α > 0, which suddenly jumps to 0 for P (which corresponds to α = 0). If we adopt the set-valued definition of T , then we have a set-valued weak semicontinuity at P : for any sequence Qn converging weakly to P , the sets T (Qn ) are eventually contained in an -neighborhood of the set T (P ). It might be tempting to consider this as a, possibly extended, definition of robustness, as indicated in Section 1.4 of Huber [15]: “We could take this . . . as our definition and call a . . . statistical functional T robust if it is weakly continuous. However, following Hampel [11], we prefer to adopt a slightly more general definition.”

Indeed, Definition 1 has an advantage that it is directly based on the procedures rather than on their functional representations, whose existence and form may not be always that clear and intuitive as in our median example, and whose scope is limited to permutation-invariant, exchangeable situations—while Definition 1 exhibits a clear potential for extensions to situations structured beyond such a framework; and also, as we have seen, essentially does not depend on the adopted

172

I. Mizera

“ambiguity resolution stance”. Thus, adopting the Hampel [11]’s Definition 1 of the qualitative robustness, we would like to revisit now how it relates to the weak continuity of T at P , in situations when T is uniquely defined at P , but possibly may not be so elsewhere. 2. Weak continuity Definition 2. A functional T is called weakly continuous at P , if for any  > 0 there is δ > 0 such that (5)

π(P, Q) ≤ δ implies d(θ, τ ) < 

for any value θ and τ of T at P and Q, respectively. The appearance of the word “any” above means that the definition is formulated for set-valued T , without explicitly mentioning this fact; the value of T is considered to be a subset of X . Of course, univalued T (with values that are singletons, sets consisting of precisely one element) are a special case. For a set-valued functional T , we can also define weak semicontinuity of T at P by the requirement that for any  > 0 there is δ > 0 such that π(P, Q) < δ implies that T (Q) ⊆ T (P ) , the set T (P ) containing all points within  distance from the set T (P ). This seems to be equivalent to Definition 2, but is not: T is weakly continuous at P , if and only if it is weakly semicontinuous and univalued at P . As mentioned above, Hampel [11] pointed out that weak continuity at P implies qualitative robustness at P . However, his Theorem 1 and its Corollary required also an additional assumption of global pointwise continuity of all tn : every tn had to be continuous as a function of the vector (x1 , x2 , . . . , xn ), for all such vectors. Although Hampel [11] gives also a version (Theorem 1a) which weakens this assumption and allows exceptions from the pointwise continuity if those occur under zero probability P , verifying his condition can be in general burdensome. For instance, the condition of pointwise global continuity holds true, and is not difficult to verify for every data vector, if we define the median as the midpoint of the median interval. However, when exploring, in Mizera and Volauf [20], the same topic for a multivariate generalization of the median called the Tukey median, we realized that this route would lead to serious complications. First, specifying the appropriate selection from a convex set in Rk is not that straightforward for k > 1; second, we realized that the Tukey median may not be always continuous—so, third, we would have to show that such configurations occur with probability zero under P yielding the weak continuity of the Tukey median. Such complications are not necessary: Huber [15], while giving the result the name of Hampel, also noted that weak continuity of T at P is all what is needed. A somewhat related global version was given already by Hampel [11]: weak continuity at an empirical probability Δx1 ,...,xn implies the pointwise continuity of tn at (x1 , . . . , xn ); therefore, if weak continuity is postulated at all P , the pointwise continuity then follows. The Hampel [11]’s proof suggests that some version of local pointwise continuity, or even local boundedness would suffice; but it is not obvious how such a condition would have to be formalized. So, we could use Theorem 6.2, Section 2.6 of Huber [15] to conclude that the Tukey median is qualitatively robust whenever it is weakly continuous—if not for the following. Huber [15]’s formulation and proof uses for the first π in (1) the L´evy metric, instead of the Prokhorov one. This means that Theorem 6.2 is formally

Qualitative robustness and weak continuity

173

valid only for X = R; that is good enough for our median example, but does not apply to the Tukey median, when X = Rk . Actually, Huber [15] allows tn to assume values in Rk ; but P is clearly restricted to P(R). We did not consider this minor detail to be of major importance; it is clear that Huber [15] envisioned the broad validity of his Theorem 6.2—only for educational or practical reasons he preferred the simple argument based on the uniformity in the Glivenko–Cantelli theorem (with a direct consequence for the L´evy metric) to possibly more technical treatment required for the general case (which can be nowadays carried in the language of the modern theory of empirical processes, which Huber [15] pioneered in his works). Thus, writing Mizera and Volauf [20], we believed that we could limit our focus to continuity questions, their statistical consequences for robustness being well known. However, the reviewers of Mizera and Volauf [20] did not initially share this view—until we introduced in the revised version a theorem, which up to some technical details is identical with Theorem 1 below. Its proof, however, was considerably out of scope of Mizera and Volauf [20]; in lieu of it we rather promised that “the proof of the theorem will appear elsewhere in the literature”—hoping that somebody (a referee or anybody else) would argue that this is not really necessary, because the theorem appears to be an obvious consequence of Huber [15], Hampel [11], or some other reference. However, it seems that our hope has not materialized, and it is time to fulfill our promise now. Before formulating the theorem and showing how the original proof of Huber [15] can be altered to cover rigorously also the multidimensional case, we need to discuss one formal subtlety. Thinking of our functionals and procedures as of set-valued mappings, we are not completely sure whether we may still speak in a mathematically consistent manner about their distribution. There are ways to formalize the notion of law for set-valued random functions—however, we would prefer to stay away from this level of abstraction. In practice, a lot of procedures consist of functions yielding unique values with probability one—we will call such set-valued functions lawful, as we can speak about their distributions without ambiguities. For instance, the 1 regression estimator is lawful as long as the distribution of covariates is continuous. However, the case of median—as well as that of the Tukey median—is different; the median is not lawful for even n, unless we consider its lawful version: a univalued selection from the estimator, that is a univalued function picking always one value from the set of all possible ones. This resembles the selection strategy for the “ambiguity resolution stance”, with one important distinction: now the selection does not have to be deterministic, but may be also randomized: a lawful version of the sample median may be a point selected at random according to the uniform distribution on the median interval. We stress that lawful versions are introduced exclusively for “law enforcement”, to ensure that the symbol L(tn ) in the definition of qualitative robustness is well-defined; as far as other aspects are concerned, we will consider functionals in their original deterministic expression. Theorem 1. Suppose that a procedure tn is represented by a functional T . If T is weakly continuous at P , then any lawful version of tn is qualitatively robust at P . 
The proof—which is that of Huber [15], only the argument using the L´evy metric is replaced by a more general one—is given in Section 3. The rest of this section presents the converse to Theorem 1, to make this note self-contained; we essentially follow Hampel [11], the proofs are given in Section 3. The appropriate formulation of the converse requires some insights into the nature how the procedure is represented by a functional. We remark that the general

174

I. Mizera

question of representability by functionals may involve some delicate aspects; Hampel [11] and Huber [15] addressed the question to some extent; see also Mizera [19]. For example, such representation exist only when the tn ’s exhibit some mutual consistency—if an empirical probability for a given n arises as an empirical probability for some other n, the corresponding tn should yield the same result. Again, we do not want to go into more depth than needed here. Developing all the theory in the set-valued context, we have to include an appropriate definition of convergence in probability: for the purposes of Definition 3, we say that a sequence of random sets En converge to E in probability, if for any selected subsequence xn ∈ En , the distance of xn to E converges to 0 in probability. In the set-valued terminology, this may be called rather “upper convergence”, but for the present purpose, the name and definition are good enough; the interesting cases will be those when E = {x} is a singleton, and then the term “convergence” is justified, and means that xn converges to x in probability for any sequence tn selected from the En ’s. Definition 3. A representation of a procedure tn by a functional T is called consistent at P , if tn converges in probability to T (P ) whenever the data are independent and identically distributed according to the law P . Proposition 1. If a procedure tn is represented by a functional T weakly continuous at P , then this representation is consistent at P . Definition 4. A representation of a procedure tn by a functional T is called regular, if (i) it is consistent for every P in the domain of T ; and (ii) for every P and every τ ∈ T (P ), there is a sequence Pν of empirical probabilities weakly converging to P , the functional T is univalued at every Pν , and T (Pν ) converges to τ . The following result serves as a “prototype” of the converse part of Hampel’s theorem. It can be used for disproving qualitative robustness in nonregular cases— in particular, when T is not univalued at P . Proposition 2. Suppose that a procedure tn is represented by a functional T . If + − + there are Q− ν , Qν such that (i) both Qν and Qν weakly converge (in n) to P ; (ii) − + T (Qν ) converges to θ and T (Qν ) to τ , where θ = τ ; (iii) T is univalued at every Q− ν − + and Q+ ν ; (iv) the representation of tn by T is at every Qν and Qν consistent—then no lawful version of tn is qualitatively robust at P . The converse to Theorem 1 is formulated for regular representations. Theorem 2. Suppose that the representation of a procedure tn by a functional T is regular. If some lawful version of tn is qualitatively robust at P , then T is weakly continuous (in particular, uniquely defined) at P . 3. Proofs We assume that S is a Polish space, a complete and separable metric space with a metric d. For E ⊂ S, E  denotes the -fattening of E, the set of all x ∈ S within  distance from E. The Prokhorov metric, π(P, Q), is defined as the infimum of all  > 0 such that P (E) ≤ Q(E  ) +  for all measurable E. It is uniformly equivalent to the bounded Lipschitz metric β, (6)

2 2 π (P, Q) ≤ β(P, Q) ≤ 2π(P, Q). 3

Qualitative robustness and weak continuity

175

The bounded Lipschitz metric is defined as     β(P, Q) = sup  f dP − f dQ, f ∈BL(S)

where BL(S) stands for the set of all real functions on S satisfying sup |f (u)| + sup u∈S

u,v∈S

|f (u) − f (v)| ≤ 1; d(u, v)

in particular, |f | ≤ 1 for all f from BL(S). A set F is called totally bounded, if for any  > 0 there is a finite collection of -balls, balls with radius  in metric , covering F ; the symbol N (, F, ) then denotes the minimal cardinality of such a collection, the -covering number of F in metric . Symbols LpE denote the usual metrics on spaces of functions defined on E. Let X1 , X2 , . . . , Xn be independent random variables, each with the distribution Q; it can be arranged that all Xi are defined on the same probability space (Ω, S, PQ ) (depending on Q). Let Qn be the (random) empirical probability measure supported by the random variables Zi ; note that the distribution of Qn depends on Q. Lemma 1. Let K be a totally bounded subset of S. For every  > 0,   + ,   (7) PQ sup  f dQn − f dQ > 48 f ∈BL(K  )

tends to 0 uniformly in all Q ∈ P(K  ). Proof. Proceeding as in the proof of Theorem 6 of Dudley et al. [7], we obtain an upper bound for (7), (8)

2

2

−18n 2N (6, BL(K  ), L1Qn ) e−18n ≤ 2N (6, BL(K  ), L∞ . K ) e

The inequality, obtained by approximating the functions in BL(K  ) by stepwise functions and using their analytical properties, (9)

N (6, BL(K



), L∞ K )

 ≤

1 2

N (2,K , d)

and the fact that N (2, K  , d) ≤ N (, K, d) together imply, given the total boundedness of K, that the covering numbers in (8) are bounded uniformly in n. Hence the expressions in (8) and consequently in (7) tend to 0, uniformly in Q.   Lemma 2. For fixed E ⊆ S and any  > 0, the sequence PQ Qn (E) > 2 converges to 0 as n → ∞, uniformly in all Q ∈ P(S) such that Q(E) ≤ . Proof. Use the Chebyshev inequality for the Bernoulli sequence of independent events with p = Q(E) ≤ ,     PQ Qn (E) ≥ 2 = PQ Qn (E) − p ≥ 2 − p  p(1 − p)  1 ≤ PQ |Qn (E) − p| ≥  ≤ ≤ . n2 n The lemma follows.

176

I. Mizera

Lemma 3. Let P ∈ P(S). of S such that + (10) PQ sup

For any  > 0, there exists a totally bounded subset K    

f ∈BL(S)

K

f dQn −

K

 ,  f dQ > 96 → 0,

uniformly in all Q ∈ P(S) such that π(P, Q) ≤ /2. Proof. Given  > 0, choose a compact subset K of S such that P (K) ≥ 1 − /2; here we use the fact that a probability measure on a Polish space is tight, in the terminology of Theorem 1.4 of Billingsley [1]. Fix η > 0 and choose n0 such that (7) in Lemma 1 is bounded by η/3 for all n ≥ n0 . Choose n1 such that (11)

n1 (1 − ) ≥ n0 , 2

4 η ≤ , n1 (1 − ) 3

and

η 1 ≤ ; 2 2304 n1  3

note that the first inequality also implies n1 ≥ n0 . Let Q be an element from P(S) such that π(P, Q) ≤ /2; then    1 − ≤ P (K) ≤ Q(K /2 ) + ≤ Q(K  ) + 2 2 2 and consequently 1 −  ≤ Q(K  ). Let NQ be the (random) number of Xi ∈ K  ; let QK  denote the conditional probability on K  defined by QK  (E) = Q(E ∩ K  )/Q(K  ). Using again the Chebyshev argument as in Lemma 2, for the Bernoulli series of events with p = Q(K  ) ≥ 1 − , we obtain, using the first two inequalities in (11), that for any n ≥ n1 ,       PQ NQ ≤ n0 ≤ PQ NQ ≤ 12 n1 (1 − ) ≤ PQ NQ ≤ 12 n(1 − )  , +  NQ  p (12) 4(1 − p) 4 η ≤ ≤ PQ  − p  ≥ ≤ ≤ , n 2 np n1 (1 − ) 3 uniformly in Q. The Chebyshev inequality yields once again, now together with the third inequality in (11), that for n ≥ n1 ,  + ,  NQ  p 1 p(1 − p) η  ≤ ≤ (13) PQ  − p > 48 ≤ ≤ , 2 2 2 n 48 n 2034 n1  2034 n1  3 again uniformly in Q. Dividing the expression within (10) by p = Q(K  ), we obtain that for n ≥ n1 ,   , +  1

 96 1   sup  f (Xi ) − f dQ ≥ PQ p K p f ∈BL(S) np X ∈K  i   , +  1

 96 1

  ≤ PQ sup  f (Xi ) − f (Xi ) ≥ NQ 2p f ∈BL(S) np X ∈K  Xi ∈K  i   , +  96  1

1 ≥ + PQ sup  f dQ f (Xi ) −  Q(K  ) 2p f ∈BL(S) NQ X ∈K  i    + ,  1

 NQ   = PQ  f (Xi ) ≥ 48 − p sup  n f ∈BL(S) NQ X ∈K  i   , +  1

 48 sup  + PQ f (Xi ) − f dQK   ≥ p f ∈BL(K  ) NQ X ∈K  i    , + + ,  1

  NQ  ≤ PQ  sup  f (Xi ) − f dQK   > 48 . − p > 48 + PQ n f ∈BL(K  ) NQ  Xi ∈K

Qualitative robustness and weak continuity

177

By (13), the left-hand expression is dominated by η/3; the right-hand one can be written as    + , ∞

 1

      sup  PQ f (Xi ) − f dQK  > 48  NQ = m PQ [NQ = m] f ∈BL(K  ) NQ  m=1 Xi ∈K

which can be split to two sums: the first is dominated by

  PQ [NQ = m] = PQ NQ ≤ n0 ≤ 13 η,

m≤n

by (12); the second is    + ,

  1

    sup  PQ f (Xi ) − f dQK   > 48  NQ = m PQ [NQ = m] f ∈BL(K  ) NQ X ∈K  m>n i 

 , + ∞

1 m     sup  P QK  f (Zi ) − f dQK  > 48 PQ [NQ = m] = f ∈BL(K  ) m i=1 m>n ≤ 13 η



PQ [NQ = m] ≤ 13 η,

m>n

where Z1 , Z2 , . . . , Zm are independent random variables (different for each m), each with distribution QK  , so that Lemma 1 applies. As η was arbitrary, the lemma follows. Lemma 4. For any α, η > 0, there exists δ > 0 and ν such that   PQ π(Qn , P ) > α < η

(14)

whenever n ≥ ν and π(P, Q) < δ. Proof. Given P and α, choose  < α2 /12 such that 96 ≤ α2 /3. As in the proof of Lemma 3, we take a compact K such that Q(K  ) ≥ 1 −  whenever π(Q, P ) < δ = /2. By (6), we obtain     PQ π(Qn , P ) > α ≤ PQ β(Qn , P ) > 23 α2 + ≤ PQ sup |f | dQn + sup f ∈BL(S)

+

+ PQ

f ∈BL(S)

S\K 

  sup 

f ∈BL(S)



K

f dQn −

  ≤ PQ Qn (S \ K  ) > 14 α2 + PQ

+

K

, S\K 

|f | dQ >

1 2 3α

 ,  f dQ> 13 α2

  sup 

f ∈BL(S)

K

f dQn − K

 ,   f dQ> 96(1 − ) 

By Lemma 3, there exists n1 such that the second term is bounded by η/2 for n ≥ n1 . Since Q(S \ K  ) ≤  < α2 /12, Lemma 2 yields n2 such that for n ≥ n2 ,     PQ Qn (S \ K  ) > 14 α2 ≤ PQ Qn (S \ K  ) > 16 α2 ≤ 12 η. Setting ν = max{n1 , n2 } concludes the proof.

178

I. Mizera

Proof of Theorem 1. Let  be the metric on the range of T . Given  > 0, weak continuity of T at P yields α such that (τ, T (P )) < /3 whenever τ ∈ T (Q) and π(P, Q) < α. Setting η to /3 and taking ν and δ yielded by Lemma 4, we obtain that if n ≥ ν and π(P, Q) ≤ δ, then   PQ (τ, T (P )) > 13  < 13  whenever τ ∈ T (Qn ). The Strassen theorem — see Huber [15], Chapter 2, Theorem 3.7, or also the original paper Strassen [26]—then gives (15)

π(LQ (tn ), δT (P ) ) ≤ 13 ;

here δT (P ) stands, in the spirit of the notation introduced above, for the point (Dirac) measure concentrated in T (P ). Using (15) once again for Q = P and then combining both inequalities, we obtain the desired result: if π(P, Q) ≤ δ, then π(LQ (tn ), LP (tn )) <  for n ≥ ν, uniformly in Q. Proof of Proposition 1. The proposition follows from the Varadarajan theorem, stating that when the data are independently sampled from P , the corresponding empirical probability measures converge weakly to P with probability one. The consistency then follows from the weak continuity of T at P . Proof of Proposition 2. Let  be the metric on the range of T , and suppose that (θ, τ ) =  > 0. Suppose that some lawful version of tn is qualitatively robust at P . + Given /4, we may pick Q− , Q+ , out of Q− ν and Qν satisfying assumptions (ii), (iii), and (iv), such that T is univalued, and the representation of tn by T is consistent at both Q− and Q+ ; by qualitative robustness, we can pick them so that for some n1 (16)

π(LQ− (tn ), LP (tn )) ≤ 14 ,

(17)

π(LQ+ (tn ), LP (tn )) ≤ 14 ,

for all n ≥ n1 . The consistency at Q− and Q+ yields n2 such that for all n ≥ n2 , (18) (19)

 1  − 1 PQ− (T (Q− n ), T (Q )) ≥ 4  < 4 ,  1  − 1 PQ− (T (Q− n ), T (Q )) ≥ 4  < 4 .

Take n ≥ max{n1 , n2 }. Applying the Strassen theorem to (18) and (19) (given that T is univalued at Q− and Q+ ), we obtain that (20)

π(LQ− (tn ), δT (Q− ) ) < 14 ,

(21)

π(LQ+ (tn ), δT (Q+ ) ) < 14 .

Combining (20), (21) with (16) and (17) yields that d(θ, τ ) = π(δT (Q− ) , δT (Q+ ) ) < , a contradiction. Proof of Theorem 2. Suppose that θ, τ ∈ T (P ), θ = τ . By the regularity of T , there + are Q− ν and Qν that satisfy the assumptions of Proposition 2. Hence θ = τ . The same argument yields that θ must be equal to the limit (possibly in a one-point compactification of the range of T ) of any other sequence T (Pν ) such that Pν → P . Hence, T has a unique limit at P , equal to T (P ).

Qualitative robustness and weak continuity

179

4. Final remarks After the introduction by Hampel [11], which reappeared in the more settled form in Hampel, Ronchetti, Rousseeuw and Stahel (1986), and the influential treatment by Huber [15], all in the context of estimation and independent sampling, qualitative robustness was extended to hypothesis testing framework by Lambert [16] and Rieder [24]; dependent data models of time series flavor were considered by Papantoni-Kazakos [21], Boente et al. [2]; some further theoretical aspects were addressed by Cuevas [3]. It seems that despite these developments, its use for evaluating robustness was not too intense: a few relevant references are Rieder [25], Good and Smith [9], Cuevas and Sanz [4], Machado [17], and He and Wang [13]. The fade-out citation pattern is indicated by the only 21st century exception retrieved from scholar.google.com, Daouia and Ruiz-Gazen [5]. As the name indicates, and the definition clearly shows, qualitative robustness does not provide any “quantitative” appraisal: the procedure is judged either not robust or robust—and in the latter case we do not know “how much”. The rush for “more” and “most” robust methods might have been the reason that other robustness criteria gained more following. Nevertheless, given the multitude of “desirable features” considered in the screening of aspiring data-analytic techniques, qualitative robustness may be just enough to draw a dividing line in the territory of robustness—especially in complex situations where classical criteria modeled in standard circumstances may loose steam. In the universe of mathematical sciences, qualitative robustness is similar to the notion of stability used in the theory of differential equations: a small change in initial conditions still renders the new solution staying in a tube enclosing the original one. Interestingly, the translation of “qualitatively robust at P ” to “solution exists, is unique, and depends continuously on the data”, discussed in this note, corresponds exactly to what in applied mathematics is called well-posed problem in the sense of Hadamard [10]. Indeed, continuous dependence on the data is essential for any procedure, in particular for its numerical implementation—which is always based on approximation; it ensures the stability of an algorithm. A referee pointed out that among the references given above, we might have missed some that, like Hildebrand and M¨ uller [14], refer more generally to “robustness” or “continuity” without mentioning explicitly qualitative robustness. In the similar spirit, we may see witness a resurrection of the term (likely under a different name) in learning theory — see Poggio, Rifkin, Mukherjee, and Niyogi (2004). Of course, numerical stability requires only pointwise continuity; qualitative robustness goes a step further, requiring continuity with respect to the distribution underlying the data. Some may argue that it goes too far, indicating that continuity violated by statistical procedures otherwise in common use may be too stringent a requirement. In the context of well-posedness in the sense of Hadamard [10], the usual mode of requiring continuity is “in some reasonable topology”. 
All this indicates that the most important aspect of qualitative robustness, and robustness theory in general, lies at its very start—as pointed out by Davies [6], and put down already by Huber [15], “It is by no means clear whether different metrics give rise to equivalent robustness notions; to be specific we work with L´ evy metric for F and the Prokhorov metric for L(Tn ).”

We remark that such a choice might came out as natural under the influence of Billingsley [1] in the times of Hampel [11]; the question is whether it still remains

180

I. Mizera

such. Of course, the relationship between qualitative robustness and continuity discussed in this note indicates that it is only the induced topology, not a particular metric, that matters for qualitative robustness. References [1] Billingsley, P. (1968). Convergence of Probability Measures. Wiley, New York. [2] Boente, G., Fraiman, R. and Yohai, V. J. (1987). Qualitative robustness for stochastic processes. Ann. Statist. 15 1293–1312. [3] Cuevas, A. (1988). Quantitative robustness in abstract inference. J. Statist. Planning Inference 18 277–289. [4] Cuevas, A. and Sanz, P. (1989). A class of qualitatively robust estimates. Statistics 20 509–520. [5] Daouia, A. and Ruiz–Gazen, A. (2006). Robust nonparametric frontier estimators: Qualitative robustness and influence function. Statistica Sinica 16 1233–1253. [6] Davies, P. L. (1993). Aspects of robust linear regression. Ann. Statist. 21 1843–1899. ´, E. and Zinn, J. (1991). Uniform and universal [7] Dudley, R. M. Gine Glivenko-Cantelli classes. J. Theoret. Probab. 4 485–510. [8] Ellis, S. P. (1998). Instability of least squares, least absolute deviation and least median of squares linear regression. Statist. Sci. 13 337–350. [9] Good, I. J. and Smith, E. P. (1986). An additive algorithm analogous to the singular decomposition or a comparison of polarization and multiplicative models: An example of qualitative robustness. Commun. Statist. B 15 545–569. [10] Hadamard, J. (1902). Sur les probl`emes aux d´eriv´ees et leur signification physique. Princeton University Bulletin 49–52. [11] Hampel, F. R. (1971). A general qualitative definition of robustness. Ann. Math. Statist. 42 1887–1896. [12] Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J. and Stahel, W. A. (1986). Robust Statistics: The Approach Based on Influence Functions. Wiley, New York. [13] He, X. and Wang, G. (1997). Qualitative robustness of S ∗ -estimators of multivariate location and dispersion. Statistica Neerlandica 51 257–268. ¨ ller, C. H. (2007). Outlier robust corner[14] Hildebrand, M. and Mu preserving methods for reconstructing noisy images. Ann. Statist. 35 132–165. [15] Huber, P. J. (1981). Robust Statistics. Wiley, New York. [16] Lambert, D. (1982). Qualitative robustness of tests. J. Amer. Statist. Assoc. 77 352–357. [17] Machado, J. A. F. (1993). Robust model selection and M-estimation. Econometric Theory 9 478–493. [18] Maronna, R. A., Martin, R. D., and Yohai, V. J. (2006). Robust Statistics: Theory and Methods. Wiley, New York. [19] Mizera, I. (1995). A remark on existence of statistical functionals. Kybernetika 31 315–319. [20] Mizera, I. and Volauf, M. (2002). Continuity of halfspace depth contours and maximum depth estimators: Diagnostics of depth-related methods. Journal of Multivariate Analysis 83 365–368. [21] Papantoni-Kazakos, P. (1984). Some aspects of qualitative robustness in time series. In Robust and Nonlinear Time Series Analysis,(J. Franke, W.

Qualitative robustness and weak continuity

[22] [23]

[24] [25] [26]

181

H¨ardle and D. Martin, eds.) Lecture Notes in Statistics 26 218–230. SpringerVerlag, New York. Poggio, T., Rifkin, R., Mukherjee, S., and Niyogi, P. (2004). General conditions for predictivity in learning theory. Nature 428 419–422. Portnoy, S. and Mizera, I. (1998). Comment of “Instability of least squares, least absolute deviation and least median of squares linear regression”. Statistical Science 13 344–347. Rieder, H. (1982). Qualitative robustness of rank tests. Ann. Statist. 10 205– 211. Rider, H. (1983). Continuity properties of rank procedures. Statist. Decisions 1 341–369. Strassen, V. (1976). The existence of probability measures with given marginals. Ann. Math. Statist. 36423–439.

IMS Collections Nonparametrics and Robustness in Modern Statistical Inference and Time Series Analysis: A Festschrift in honor of Professor Jana Jureˇ ckov´ a Vol. 7 (2010) 182–193 c Institute of Mathematical Statistics, 2010  DOI: 10.1214/10-IMSCOLL718

Asymptotic theory of the spatial median Jyrki M¨ ott¨ onen1,∗ , Klaus Nordhausen2,3 and Hannu Oja2 University of Helsinki, University of Tampere and Tampere University Hospital Abstract: In this paper we review, prove and collect the results on the limiting behavior of the regular spatial median and its affine equivariant modification, the transformation retransformation spatial median. Estimation of the limiting covariance matrix of the spatial median is discussed as well. Some algorithms for the computation of the regular spatial median and its different modifications are described. The theory is illustrated with two examples.

1. Introduction For a set of p-variate data points y1 , . . . , yn , there are several versions of multivariate median and related multivariate sign test proposed and studied in the literature. For some reviews, see Small [23], Chaudhuri and Sengupta [6] and Niinimaa and n Oja [17]. The so called spatial median which minimizes the sum i=1 |yi − μ| with a Euclidean norm | · | has a very long history, Gini and Galvani [8] and Haldane [10] for example have independently considered the spatial median as a generalization of the univariate median. Gover [9] used the term mediancenter. Brown [3] has developed many of the properties of the spatial median. This minimization problem is also sometimes known as the Fermat-Weber location problem, see Vardi and Zhang [25]. Taking if μ ˆ solves the the gradient of the objective function, one sees that equation ni=1 {U(yi − μ ˆ )} = 0 with spatial sign U(y) = |y|−1 y, then μ ˆ is the observed spatial median. The spatial sign test for H : μ = 0 based on the sum of 0 n spatial signs, i=1 U(yi ) was considered by M¨ ott¨ onen and Oja [14], for example. The spatial median is unique, if the dimension of the data cloud is greater than one, see Milasevic and Ducharme [13]. The so called Weiszfeld algorithm for the computation of the spatial median has a simple iteration step, namely μ ← μ +{ ni=1 |yi − μ|−1 }−1 ni=1 {U(yi − μ)}. The algorithm may fail sometimes, however, but a slightly modified algorithm which converges quickly and monotonically is described by Vardi and Zhang [25]. One drawback of the spatial median (the spatial sign test) is the lack of equivariance (invariance) under affine transformations of the data. The performance of the spatial median as well as the spatial sign test then may be poor compared to affine equivariant and invariant procedures if there is a significant deviance from ∗ Corresponding

author of Social Research, University of Helsinki P.O.Box 68, 00014 University of Helsinki, Finland, e-mail: [email protected] 2 Tampere School of Public Health, University of Tampere, 33014 University of Tampere, Finland, e-mail: [email protected]; [email protected] 3 Department of Internal Medicine, Tampere University Hospital, P.O.Box 2000, 33521 Tampere, Finland AMS 2000 subject classifications: Primary 62H12; secondary 62G05, 62G20. Keywords and phrases: asymptotic normality, Hettmansperger–Randles estimate, multivariate location, spatial sign, spatial sign test, transformation retransformation. 1 Department

182

Asymptotic theory of the spatial median

183

a spherical symmetry. Chakraborty et al. [4] proposed and investigated an affine equivariant modification of the spatial median constructed using an adaptive transformation and retransformation (TR) procedure. An affine invariant modification of the spatial sign test was also proposed. Randles [19] used Tyler’s transformation [24] to construct an affine invariant modification of the spatial sign test. Later Hettmansperger and Randles [11] proposed an equivariant modification of the spatial median, again based on Tyler’s transformation; this estimate is known as the Hettmansperger–Randles (HR) estimate. In this paper we review and collect the results on the limiting behavior of the regular spatial median and its affine equivariant modification, the transformation retransformation spatial median. In Section 2 some auxiliary results and tools for asymptotic studies are given. Asymptotic theory for the regular spatial median is reviewed in Section 3. Estimation of the limiting covariance matrix of the spatial median is discussed in Section 4. Section 5 considers the transformation retransformation spatial median. The paper ends with some discussion on the algorithms for the computation of the spatial median in Section 6 and two examples in Section 7. Many of the results can be collected from Arcones [1], Bai et al. [2], Brown [3], Chakraborty et al. [4], Chaudhuri [5], M¨ ott¨onen et al. [14] and Rao [20]. See also Nevalainen et al. [16] for the spatial median in the case of cluster correlated data. For the proofs in this paper it is crucial that the dimension p > 1. For the properties of the univariate median, see Section 2.3 in Serfling [22], for example. 2. Auxiliary results Let y = 0 and μ be any p-vectors, p > 1. Write also r = |y| and u = |y|−1 y. Then accuracies of different (constant, linear and quadratic) approximations of function μ → |y − μ| around the origin are given by (A1) (A2) (A3)

||y − μ| − |y|| ≤ |μ|,  −1 |μ|2 and ||y − μ| − |y| + u μ| ≤ 2r   |y − μ| − |y| + u μ − μ (2r)−1 [Ip − uu ]μ ≤ C1 r−1−δ |μ|2+δ for all 0 < δ < 1, where C1 does not depend on y or μ.

Similarly, the accuracies of constant and linear approximations of unit vector |y − μ|−1 (y − μ) around the origin are given by    y−μ y  (B1)  |y−μ| − |y|  ≤ 2r−1 |μ| and    y−μ  y (B2)  |y−μ| − |y| − 1r [Ip − uu ]μ ≤ C2 r−1−δ |μ|1+δ for all 0 < δ < 1, where C2 does not depend on y or μ. For these and similar results, see Arcones [1] and Bai et al. [2]. Lemma 1. Assume that the density function f (y) of the p-variate continuous random vector y is bounded. If p > 1 then E{|y|−α } exists for all 0 ≤ α < 2. The following key result for convex processes is Lemma 4.2 in Davis et al. [7] and Theorem 1 in Arcones [1]. Theorem 1. Let Gn (μ), μ ∈ Rp , be a sequence of convex stochastic processes, and let G(μ) be a convex (limit) process in the sense that the finite dimensional distriˆ μ ˆ 1, μ ˆ 2 , . . . be random variables butions of Gn (μ) converge to those of G(μ). Let μ,

184

J. M¨ ott¨ onen, K. Nordhausen and H. Oja

such that ˆ = inf G(μ) and Gn (μ ˆ n ) = inf Gn (μ), n = 1, 2, . . . G(μ) μ

μ

ˆ ˆ n →d μ. Then μ 3. Spatial median Let y be a p-variate random vector with cdf F , p > 1. The spatial median of F minimizes the objective function D(μ) = E{|y − μ| − |y|}. Note that no moment assumptions are needed in the definition as ||y−μ|−|y|| ≤ |μ| but for the asymptotic theory we assume that (C1) The p-variate density function f of y is continuous and bounded. (C2) The spatial median of the distribution of y is zero and unique. We next define vector and matrix valued functions + , y yy yy 1 U(y) = Ip − , and B(y) = , A(y) = |y| |y| |y|2 |y|2 for y = 0 and, by convention, U(0) = 0 and A(0) = B(0) = 0. We write also A = E {A(y)} and B = E {B(y)} . The expectation defining B clearly exists and is bounded (|B(y)|2 = tr(B(y) B(y)) = 1). Our assumption implies that E(|y|−1 ) < ∞ and therefore also A exists and is bounded. Auxiliary result (A3) in Section 2 then implies Lemma 2. Under assumptions (C1) and (C2), D(μ) = 12 μ Aμ + o(|μ|2 ). See also Lemma 19 in Arcones [1]. Let Y = (y1 , . . . , yn ) be a random sample from a p-variate distribution F . Write Dn (μ) = ave{|yi − μ| − |yi |}. The function Dn (μ) as well as D(μ) are convex and bounded. Boundedness follows from (A1). The sample spatial median μ ˆ is defined as ˆ = μ(Y) ˆ μ = arg min Dn (μ). ˆ is unique if the observations do not fall on a line. Under assumption The estimate μ ˆ is unique with probability one. As D(μ) is the limiting process of Dn (μ), (C1) μ Theorem 1 implies that μ ˆ →P 0. The statistic Tn = T(Y) = ave {U(yi )} is the spatial sign test statistic for testing the null hypothesis that the spatial median is zero. As μ is assumed to be a zero vector, the multivariate central limit theorem implies that √ Lemma 3. nTn →d Np (0, B). The approximation (A3) in Section 1 implies that  n + ,,  n n + 

1 1 yi yi yi   −1/2 1 Ip − μ μ| − |yi |} − √ μ−μ  {|yi − n 2   n i=1 2|yi | |yi | n i=1 |yi | i=1

Asymptotic theory of the spatial median



n

|μ|2+δ

C1 n(2+δ)/2

i=1

ri1+δ

185

→P 0, for all μ,

and we get Lemma 4. Under assumptions (C1) and (C2),   √ 1 nTn − Aμ μ →P 0. nDn (n−1/2 μ) − 2   Now apply Theorem 1 with Gn (μ) = nDn (n−1/2 μ) and G(μ) = z − 12 Aμ μ where z ∼ Np (0, B). We then obtain (A is positive definite) √ ˆ →d Np (0, A−1 BA−1 ). Theorem 2. Under assumptions (C1) and (C2), nμ √ y →d It is well known that, if E(yi ) = 0 and the second moments exist, also n¯ Np (0, Σ) where Σ is the covariance matrix of yi . The asymptotic relative efficiency of the with respect to the sample mean is then given by  spatial median  det (Σ)/det A−1 BA−1 . The spatial median has good efficiency properties even in the multivariate normal model. M¨ott¨ onen et al [15] for example calculated the asymptotic relative efficiencies e(p, ν) of the multivariate spatial median with respect to the mean vector in the p-variate tν,p distribution case (t∞,p is the p-variate normal distribution). In the 3-variate and 10-variate cases, for example, the asymptotic relative efficiencies are e(3, 3) = 2.162, e(10, 3) = 2.422,

e(3, 10) = 1.009, e(10, 10) = 1.131,

e(3, ∞) = 0.849, e(10, ∞) = 0.951.

4. Estimation of the covariance matrix of the spatial median ˆ one natuFor a practical use of the normal approximation of the distribution of μ rally needs an estimate for the asymptotic covariance matrix A−1 BA−1 . We estimate A and B separately. Recall that we assume that the true value μ = 0. Write,   as before, 1 yy yy A(y) = Ip − and B(y) = . |y| |y|2 |y|2 ˆ = A(Y) = ave {A(yi − μ)} ˆ = B(Y) = ave {B(yi − μ)} ˆ ˆ . Then write A and B ˆ and B ˆ converge in probability to the We will show that, under our assumptions, A population values A = E {A(yi )} and B = E {B(yi )} , respectively: ˆ →P B. ˆ →P A and B Theorem 3. Under assumptions (C1) and (C2), A that the true spatial median μ = 0. By Theorem 2, √ Proof We thus assume ˜ = ave {A(yi )} and B ˜ = ave {B(yi )} . Then by the law of ˆ = Op (1). Write A nμ ˜ →P B. Our auxiliary result (B1) implies that ˜ →P A and B large numbers A    (y − μ)(y − μ) |μ| yy   ≤4 − , ∀ y = 0, μ,   2 2 |y − μ| |y| |y| ˜ ˆ − B| ˜ ≤ 1 n {4|μ|/|y ˆ and therefore by Slutsky’s theorem |B i |} →P 0. As B →P i=1 n ˆ B, also B →P B. ˆ →P A. We play with three positive constants, “large” δ1 , We now prove that A √ ˆ < δ1 / n. (This is true “small” δ2 and “small” δ3 . For a moment, we assume that |μ|

186

J. M¨ ott¨ onen, K. Nordhausen and H. Oja

with  a probability that can bemade close to one with  large δ1 .) Next we write I1i = δ δ 2 2 ˆ < √n , I2i = I √n ≤ |yi − μ| ˆ < δ3 and I3i = I { |yi − μ| ˆ ≥ δ3 } . I |yi − μ| Then =

1

ˆ (A(yi ) − A(yi − μ)) n i=1

=

1

1

ˆ + ˆ (I1i · [A(yi ) − A(yi − μ)]) (I2i · [A(yi ) − A(yi − μ)]) n i=1 n i=1

+

1

ˆ . (I3i · [A(yi ) − A(yi − μ)]) n i=1

n

˜ −A ˆ A

n

n

n

The first average is zero with probability n n   2 δ2p cp M δ22 cp M ≥ 1− → e−cp M δ2 , P (I11 = . . . = I1n = 0) ≥ 1 − p/2 n n where M = supy f (y) < ∞ and cp is the volume of the p-variate unit ball. (The first average is thus zero with a probability that can be made close to one with small choices of δ2 > 0.) For the second average, one gets 1

ˆ |I2i · [A(yi ) − A(yi − μ)]| n i=1 n

ˆ 1 6I2i |μ| 1 6I2i δ1 ≤ ˆ i| n i=1 |yi − μ||y n i=1 δ2 |yi | n



n

which converges to a constant which can be made as close to zero as one wishes with small δ3 > 0. Finally, also the third average 1

ˆ |I3i · [A(yi ) − A(yi − μ)]| n i=1 n

ˆ 1 6I3i |μ| 1 6I3i δ1 ≤ √ ˆ i| n i=1 |yi − μ||y n n i=1 δ3 |yi | n



n

converges to zero in probability for all choices of δ1 and δ3 , and the proof follows. ˆ can be approximated Theorems 2 and 3 thus   suggest that the distribution of μ 1 ˆ −1 ˆ ˆ −1 by Np μ, n A BA . Approximate 95 % confidence ellipsoids for μ are given   ˆB ˆ −1 A(μ ˆ ˆ A ˆ ≤ χ2p,.95 , where χ2p,.95 is the 95 % quantile of by μ : n(μ − μ) − μ) a chi square distribution with p degrees of freedom. Also, by Slutsky’s theorem, under the null hypothesis H0 : μ = 0 the squared version of the test statistic ˆ −1 Tn →d χ2 . Q2 = nTn B p 5. Transformation retransformation spatial median Shifting the data cloud, naturally shifts the spatial median by the same constant, ˆ n a + Y) = a + μ(Y), ˆ that is, μ(1 It is also easy to see that rotating the data  ˆ ˆ cloud also rotates the spatial median correspondingly, that is, μ(YO ) = Oμ(Y), for all orthogonal p × p matrices O. Unfortunately, the estimate is not equivariant under heterogeneous rescaling of the components, and therefore not fully affine equivariant.

Asymptotic theory of the spatial median

187

A fully affine equivariant version of the spatial median can be found using the so called transformation retransformation estimation technique. First, a positive definite p × p scatter matrix S = S(Y) is a matrix valued sample statistic which is affine equivariant in the sense S(1n a + YB ) = BS(Y)B for all p-vectors a and all nonsingular p × p matrices B. Let S−1/2 be any matrix which satisfies S−1/2 S(S−1/2 ) = Ip . The procedure is then as follows. 1. 2. 3. 4.

Take any scatter matrix S = S(Y). Transform the data matrix: Y(S−1/2 ) . −1/2  ˆ Find the spatial median for the standardized data matrix μ(Y(S ) ). 1/2 −1/2  ˜ ˆ Retransform the estimate: μ(Y) = S μ(Y(S ) ).

˜ This median μ(Y) utilizing “data driven” transformation S−1/2 is known as the transformation retransformation (TR) spatial median. (See Chakraborty et al. [4] for other type of data driven transformations.) Then the affine equivariance follows: Theorem 4. Let S = S(Y) be any scatter matrix. Then the transformation retrans−1/2  ˜ ˆ formation spatial median μ(Y) = S1/2 μ(Y(S ) ) is affine equivariant, that is,   ˜ n a + YB ) = a + Bμ(Y). ˜ μ(1 The proof easily follows from the facts that the regular spatial median is shift and orthogonally equivariant and that (S(1n a + YB ))−1/2 = O(S(Y))−1/2 for some orthogonal matrix O. In the following we assume (without loss of generality) that the population value of S is √ Ip , and that S = S(Y) is a root-n consistent estimate of Ip . We write Δ = n(S−1/2 − I) = Op (1) and Y∗ = Y(S−1/2 ) . Then we have the following result for the test statistic. Lemma 5. Let Y be a random sample from a symmetric distribution satisfying (C1) and (C2). (By a symmetry we mean that −yi and yi have √ the same distribution.) Assume also that scatter matrix S = S(Y) satisfies n(S − Ip ) = Op (1). √ Then n(T(Y∗ ) − T(Y)) →P 0. √ Proof Our assumptions imply that also Δ = n(S−1/2 − Ip ) = Op (1). Thus S−1/2 = Ip + n−1/2 Δ where Δ is bounded in probability. Using auxiliary result (B2) in Section 2 we obtain 1

1

1

√ U(S−1/2 yi ) − √ Ui = (Δ − Ui ΔUi )Ui + oP (1) n i=1 n i=1 n i=1 n

n

n

where Ui = U(yi ), i = 1, . . . , n. For |Δ| < M , the second term in the expansion converges uniformly in probability to zero due to its linearity with respect to the elements of Δ and due to the symmetry of the distribution Ui . (E(Ui ) = n of n E(Ui ΔUi Ui ) = 0) . Therefore n−1/2 i=1 U(S−1/2 yi )−n−1/2 i=1 Ui →P 0 and the proof follows. 2 We also have to show that A(Y∗ ) and A(Y) both converge to A, and similarly with B(Y∗ ) and B(Y): Lemma 6. Let Y be a random sample from a distribution√ satisfying (C1) and (C2). Assume also that scatter matrix S = S(Y) satisfies n(S − Ip ) = Op (1). A(Y∗ ) − A(Y) →P 0 and B(Y ∗ ) − B(Y) →P 0.

188

J. M¨ ott¨ onen, K. Nordhausen and H. Oja

Proof Again S−1/2 = Ip + n−1/2 Δ where Δ = Op (1). Suppose that Δ ≤ M . (P (Δ ≤ M ) → 1 as M → ∞.) Write yi∗ = (Ip − n−1/2 Δ)yi . Then     ∗ ∗  1  yi y i 1 yi yi  1  |Ip − (Ip − n−1/2 Δ)−1 |   .  |y∗ |2 − |yi |2  ≤ √n |Δ| and  |y∗ | − |yi |  ≤ |yi | i i The first inequality gives |B(Y ∗ ) − B(Y)| ≤ together imply that       1 1 yi yi yi∗ yi∗     |yi | Ip − |yi |2 − |y∗ | Ip − |y∗ |2  i i Then |A(Y∗ ) − A(Y)| ≤

1 n

n i=1



1 |yi |



3M √ n

√1 |Δ| n



→ 0. The two inequalities

1 |yi |

+ o(n−1/2 )





 3M −1/2 √ + o(n ) . n

→P 0.



Using Lemmas 5 and 6 and the auxiliary results in Section 2 we then get Theorem 5. Let Y be a random sample from a symmetric distribution √ satisfying (C1) and (C2). Assume also that scatter matrix S = S(Y) satisfies n(S − Ip ) = √ √ ˜ ˆ Op (1). Then nμ(Y) and nμ(Y) have the same limiting distribution. Proof Write again S−1/2 = Ip +n−1/2 Δ, and yi∗ = (Ip −n−1/2 Δ)yi , i = 1, . . . , n, and Y∗ = (y1∗ , . . . , yn∗ ) . Then our auxiliary results imply that that  + ,,  n n n + 

1 1 yi∗  yi∗ yi∗    ∗ −1/2 ∗ 1 μ μ| − |yi |} − √  {|yi − n ∗| μ − μ n ∗ | Ip − |y∗ |2   |y 2|y n i i i i=1 i=1 i=1 ≤

C1 n(2+δ)/2

n

|μ|2+δ |(Ip − n−1/2 Δ)−1 |1+δ i=1

|yi∗ |1+δ

→P 0

√ √ ˆ ∗ ) and nμ(Y) ˆ nμ(Y Thus Lemmas 5 and 6 together with Theorem 1 imply that √ √ ˜ ˆ ∗ ), the result follows have the same limiting distribution. As nμ(Y) = S1/2 nμ(Y from Slutsky’s theorem.  ˜ can in the symmetric case be Based on the results  above, the  distribution of μ ˆ −1 B ˆ SA ˆ −1 (S1/2 ) with ˜ , where Cov(μ) ˜ = n1 S1/2 A approximated by Np μ, Cov(μ) S S      ei ei ei ei 1 ˆ ˆ and BS = ave |ei |2 calculated from the standardAS = ave |ei |2 Ip − |ei |2 ˜ i = 1, . . . , n. ized residuals ei = S−1/2 (yi − μ), The stochastic convergence and the limiting normality of the spatial median did not require any moment assumptions. Therefore, for the transformation, a scatter matrix with weak assumptions should be used as well. It is an appealing idea to link also the spatial median with the Tyler’s transformation. This was proposed by Hettmansperger and Randles [11]:

ˆ be a p-vector and S > 0 a symmetric p × p matrix, and define Definition 1. Let μ ˆi = S−1/2 (yi − μ), ˆ e i = 1, . . . , n. The Hettmansperger–Randles (HR) estimate of ˆ and S which simultaneously satisfy location and scatter are the values of μ ave {U(ˆ ei )} = 0 and p ave {U(ˆ ei )U(ˆ ei ) } = Ip .

Asymptotic theory of the spatial median

189

Note that the HR estimate is not a TR estimate as the location vector and scatter matrix are in fact estimated simultaneously. This pair of estimates was first mentioned in Tyler [24]. Hettmansperger and Randles [11] developed the properties of these estimates. They showed that the HR estimate has a bounded influence function and a positive breakdown point. The distribution of the HR location estimate can be approximated by   1 1/2 ˆ −2 1/2 Np 0, S AS S np ˆ S = ave(A(S−1/2 (yi − μ))) ˆ and S is Tyler’s scatter matrix. where A 6. Computation of the spatial median The spatial median can often be computed using the following two steps: Step 1: Step 2:

ei ← yi − μ, i = 1, . . . , n  n  n −1 −1 μ← μ + i=1 |ei | i=1 U(ei )

provided an initial estimate for μ. The above algorithm may fail in case of ties or when an estimate falls on a data point. Assume then that the distinct data points are y1 , . . . , ym with multiplicities w1 , . . . , wm (w1 + . . . + wm = n). The algorithm by Vardi and Zhang [25] then uses the steps: Step 1: Step 2: Step 3:

ei ← yi − μ, i = 1, . . . , m c ← ( ei =0 wi )/| ei =0 wi U(ei )| −1  −1 w |e | μ ← μ + max (0, 1 − c) ei =0 i i ei =0 wi U(ei )

Furthermore many other approaches can be used to solve this non-smooth optimization problem. For example H¨ossjer and Croux [12] suggest a steepest descent algorithm combined with stephalving and discuss also some other algorithms. We prefer however the above algorithm since it seems efficient and can be easily combined with the HR approach with the following steps: Step 1: Step 2: Step 3:

ei ← S−1/2 (yi − μ), i = 1, . . . , n  n −1 1/2 n μ← μ + {|ei |−1 } S i=1 i=1 {U(ei )} n  S ← (p/n) S1/2 {U(e )U(e ) } S1/2 . i i i=1

There are actually two ways to implement the algorithm. The first one is just to repeat these three steps 1, 2 and 3 until convergence. The second one is first (i) to repeat steps 1 and 2 until convergence, and then (ii) repeat steps 1 and 3 until convergence. Finally (i) and (ii) are repeated until convergence. The second version is sometimes considered faster and more stable, see Hettmansperger and Randles[11] and the references therein. Both versions of the algorithm are easy to implement and the computation is fast even in high dimensions. Unfortunately, there is no proof for the convergence of the algorithms so far, although in practice they always seem to work. There is no proof for the existence or uniqueness of the HR estimate either. In practice, this is not a problem, however. One can start with any initial root-n consistent estimates, then repeat the above steps for location and scatter, and stop after k iterations. If, in the spherical case around the origin, the initial location and shape estimates, ˆ and S are root-n consistent, that is, say μ √ √ ˆ = OP (1) and n(S − Ip ) = OP (1) nμ

190

J. M¨ ott¨ onen, K. Nordhausen and H. Oja

and tr(S) = p then the k-step estimate using the single loop version of the above algorithm (obtained after k iterations) satisfies   k  k  √ √ 1 p √ 1 1 ˆk = ˆ + 1− nμ nμ n ave{ui } + oP (1) −1 p p E(ri ) p − 1 and √

k √ 2 n(S − Ip ) p+2  k   2 p + 2√ 1− n (p · ave{ui ui } − Ip ) + oP (1). p+2 p 

n(Sk − Ip )

= +

Asymptotically, the k-step estimate behaves as a linear combination of the initial pair of estimates and Hettmansperger–Randles estimate. The larger k, the more similar is the distribution to that of the HR estimate. More work is needed, however, to carefully consider the properties of this k-step HR-estimate. 7. Examples

−0.2

0.0

0.2

0.4

−2 −1

0

1

2

3

4

0.1

0.3

−0.4

0.2 0.0

0.0

0.2

0.4

0.4 −0.3

−0.1

y_1

−0.4

2

3

4

−0.4

−0.2

−0.2

y_2

−2 −1

0

1

y_3

−0.3

−0.1

0.1

0.3

−0.4

−0.2

0.0

0.2

0.4

sample mean vector spatial median equivariant spatial median

Fig 1. The sample mean vector, the spatial median and the HR location estimate with corresponding bivariate 95% confidence ellipsoids for a simulated dataset from a non-spherical 3-variate t3 distribution.

In this section we compare the mean vector, the regular spatial median, and the HR location estimate for simulated and real datasets. First, the simulated data with

Asymptotic theory of the spatial median

1

2

3

4

5

22

24

26

28

30

32

12

14

0

191

5 4 3

3

4

5

4

6

8

10

Lagged Salinity

2 1 0

28

30

32

0

1

2

Trend

22

24

26

Discharge

4

6

8

10

12

14

0

1

2

3

4

5

sample mean vector spatial median equivariant spatial median

Fig 2. Salinity data with the sample mean vector, the spatial median and the HR location estimate with corresponding bivariate 95% confidence ellipsoids. Two outliers are marked with a darker colour.

sample size n = 200 was generated from a 3-variate spherical t distribution with 3 degrees of freedom. In the case of a spherical distribution, the regular spatial median and the affine equivariant HR location estimate are behaving in a very similar way. To illustrate the differences between these two estimates in a non-spherical case, the third component was multiplied by 10. The three location estimates with their bivariate 95% confidence ellipsoids are presented in Figure 1. The mean vector is less accurate due to the heavy tails of the distribution. For non-spherical data, the equivariant HR location estimate is more efficient than the spatial median as seen in the Figure. If the measurement units for the components are the same, however, as in the case of the repeated measures, and heterogeneous rescaling is not natural, then of course the spatial median may be preferable. To illustrate the robustness properties of the three estimates we consider the three variables “Lagged Salinity”, “Trend” and “Discharge” in the Salinity dataset discussed in Rousseeuw and Leroy [21]. There are two clearly visible outliers among the 28 observations. As seen from Figure 2, the mean vector and the corresponding confidence ellipsoid are clearly affected by these outliers. The HR estimate seems a bit more accurate than the spatial median due to the different scales of the marginal variables. Estimation of the spatial median and HR estimate and their covariances is implemented in the R package MNM [18].

Acknowledgements. We thank the two referees for their valuable comments on the earlier version of the paper.

192

J. M¨ ott¨ onen, K. Nordhausen and H. Oja

References

[1] Arcones, M. A. (1998). Asymptotic theory for M-estimators over a convex kernel. Econometric Theory 14 387–422. [2] Bai, Z. D., Chen, X. R., Miao, B. Q., and Rao, C. R. (1990). Asymptotic theory of least distances estimate in multivariate linear models. Statistics 21 503–519. [3] Brown, B. M. (1983). Statistical uses of the spatial median. Journal of the Royal Statistical Society, Series B 45 25–30. [4] Chakraborty, B, Chaudhuri, P., and Oja, H. (1998). Operating transformation re-transformation on spatial median and angle test. Statistica Sinica 8 767–784. [5] Chaudhuri, P. (1992). Multivariate location estimation using extension of R-estimates through U -statistics type approach. Annals of Statistics 20 897– 916. [6] Chaudhuri, P. and Sengupta, D. (1993). Sign tests in multidimension: Inference based on the geometry of data cloud. Journal of the American Statistical Society 88 1363–1370. [7] Davis, R. A., Knight, K., and Liu, J. (1992). M-estimation for autoregression with infinite variance. Stochastic Processes and Their Applications 40 145–180. [8] Gini, C. and Galvani, L. (1929). Di talune estensioni dei concetti di media ai caratteri qualitative. Metron 8. [9] Gower, J. S. (1974). The mediancentre. Applied Statistics 2 466–470. [10] Haldane, J. B. S. (1948). Note on the median of the multivariate distributions. Biometrika 35 414–415. [11] Hettmansperger, T. P. and Randles, R. H. (2002). A practical affine equivariant multivariate median. Biometrika 89 851–860. ¨ ssjer, O. and Croux, C. (1995). Generalizing univariate signed rank [12] Ho statistics for testing and estimating a multivariate location parameter. Journal of Nonparametric Statistics 4 293–308. [13] Milasevic, P. and Ducharme, G. R. (1987). Uniqueness of the spatial median. Annals of Statistics 15 1332–1333. ¨ tto ¨ nen, J. and Oja, H. (1995). Multivariate spatial sign and rank meth[14] Mo ods. Journal of Nonparametric Statistics 5 201–213. ¨ tto ¨ nen, J., Oja, H., and Tienari, J. (1997). On the efficiency of mul[15] Mo tivariate spatial sign and rank tests. Annals of Statistics 25 542–552. [16] Nevalainen, J., Larocque, D., and Oja, H. (2007). On the multivariate spatial median for clustered data. Canadian Journal of Statistics 35 215-231. [17] Niinimaa, A. and Oja, H. (1999). Multivariate median. In: Encyclopedia of Statistical Sciences (Update Volume 3). Eds. by Kotz, S., Johnson, N. L. and Read, C. P., Wiley. ¨ tto ¨ nen, J., and Oja, H. (2009). MNM: Multivariate [18] Nordhausen, K. Mo Nonparametric Methods. An Approach Based on Spatial Signs and Ranks. R package version 0.95-1. [19] Randles, R. H. (2000). A simpler, affine equivariant multivariate, distribution-free sign test. Journal of the American Statistical Association 95 1263– 1268. [20] Rao, C. R. (1988). Methodology based on the L1 -norm in statistical inference. Sankhy¯ a Ser. A 50 289–313.

Asymptotic theory of the spatial median

193

[21] Rousseeuw, P. J. and Leroy, A. M. (1987). Robust Regression and Outlier Detection. John Wiley & Sons, New York. [22] Serfling, R. J. (1980). Approximation Theorems of Mathematical Statistics. Wiley, New York. [23] Small, C. G. (1990). A survey of multidimensional medians. International Statistical Review 58 263–277. [24] Tyler, D. E. (1987). A distribution-free M -estimator of multivariate scatter. Annals of Statistics 15 234–251. [25] Vardi, Y. and Zhang, C.-H. (2000). The multivariate L1 -median and associated data depth. The Proceedings of the National Academy of Sciences USA (PNAS) 97 1423–1426.

IMS Collections Nonparametrics and Robustness in Modern Statistical Inference and Time Series Analysis: A Festschrift in honor of Professor Jana Jureˇ ckov´ a Vol. 7 (2010) 194–203 c Institute of Mathematical Statistics, 2010  DOI: 10.1214/10-IMSCOLL719

Second-order asymptotic representation of M-estimators in a linear model Marek Omelka1,∗ Charles University in Prague Abstract: The asymptotic properties of fixed-scale as well as studentized M -estimators in linear models with fixed carriers are studied. A two term von Mises expansion (second order asymptotic representation) is derived and verified. Possible applications of this result are shortly discussed.

1. Introduction Suppose that observations Y = (Y1 , . . . , Yn )T follow a linear model (1.1)

Yi = β1 xi1 + . . . + βp xip + ei = β T xi + ei ,

i = 1, . . . , n,

where β = (β1 , . . . , βp )T is a vector of unknown parameters, xi = (xi1 , . . . , xip )T (i = 1, . . . , n) are rows of a known matrix Xn , and e1 , . . . , en are independent, identically distributed random variables with an unknown cumulative distribution function (cdf) F . Given an absolutely continuous loss function ρ, a fixed scale (studentized) M ˆ of the parameter β is defined as a solution of the minimisation estimator β n n

  ρ Yi − tT xi := min,

) or

i=1

n

 ρ

Yi −tT xi Sn

*



:= min ,

i=1

where Sn is an estimator of scale. If the function ρ is differentiable with ψ = ρ being continuous, then the estimaˆ may be found as a solution of the system of equations tor β n (1.2)

n

) T

xi ψ(Yi − b xi ) = 0

or

i=1

n

* T xi xi ψ( Yi −b ) Sn

=0 .

i=1

As the defining equation (1.2) gives more flexibility to tune properties of M ˆ is usually defined as a carefully chosen estimators by a choice of a function ψ, β n root of (1.2). ∗ The

work was supported by the grant MSM 0021620839. Omelka, Department of Probability and Mathematical Statistics, Faculty of Mathematics and Physics, Charles University, Sokolovsk´ a 83, 186 75 Prague, Czech Republic. e-mail: [email protected] AMS 2000 subject classifications: Primary 62G05; secondary 62F05. Keywords and phrases: M-estimator, empirical processes. 1 Marek

194

SOAR of M-estimators

195

It is well known (see e. g. Jureˇckov´ a and Sen [14]) that provided some standard ˆ admits the following regularity assumptions are met, then the M -estimator β n representation (1.3)



ˆ − β) = n(β n

−1 Vn √ γ1 n

n

xi ψ(ei ) + Rn ,

(or (3.4)) ,

i=1

n with γ1 = E ψ  (e1 ) and Vn = n1 i=1 xi xT i , where the remainder term Rn is of order op (1). The equation (1.3) is sometimes called the first order asymptotic repˆ or a Bahadur-Kiefer ˆ or asymptotic linearity of β resentation of the estimator β n n representation. Let us recall that the interest in the behaviour of the remainder term Rn goes back to the work of Bahadur [6] and Kiefer [15], where a similar expansion for a sample quantile was considered. Provided that the function ψ and the distribution of the errors F are sufficiently smooth, in Jureˇckov´a and Sen [12] it was proved that Rn = Op ( √1n ). The asymptotic distribution of the random vari√ able n Rn was studied by Boos [7] for the special case of a location model and by Jureˇckov´a and Sen [13] for an M -estimator of a general scalar parameter. The case of a discontinuous (score) function ψ = ρ was treated in Jureˇckov´a and Sen [11]. Many interesting results about the distributional as well as almost sure behavior of the reminder term Rn can be found in the work of Arcones. Among others let us mention results for U -quantiles in Arcones [1], multivariate location M -estimators in Arcones and Mason [5], and the two dimensional spatial medians in Arcones [3]. Important contributions to the study of the behavior of the reminder term Rn in the context of a linear model (1.1) are the results of Jureˇckov´a and Sen [12] from which the OP -rate for a general M -estimator of β can be deduced. Arcones [2] considered Lp -regression estimators (i. e. ρ(x) = |x|p , p ≥ 1) and found the almost sure behavior of Rn . Further, Arcones [4] and Knight [16] focused on the least absolute deviation regression estimator (i. e. ρ(x) = |x|) and derived the limiting distribution of n1/4 Rn . Our paper extends the results of Boos [7], and Jureˇckov´a and Sen [13] in the following way. We derive a two term von Mises expansion (a second order asymptotic representation ) of the M -estimator in the linear model (1.1) and we rigorously (2) verify that the second term of the von Mises expansion Tn satisfies √1 |T(2) n − Rn |2 = op ( n ),

where | · |2 stands √ for the Euclidean norm. That yields not only the asymptotic distribution of n Rn , but it also enables a finer comparison of an M -estimator with another estimator (e. g. an R-estimator) that is asymptotically equivalent. Moreover, our approach can be easily modified to verify higher order von Mises expansions of one-step M -estimators that were derived in Welsh and Ronchetti [19] in a heuristic way. In Section 2, we state some auxiliary results on asymptotic behaviour of M processes, which may be of independent interest. In Section 3, we derive a two term von Mises expansions of an M -estimator. We finish with a short discussion of possible applications of our results. The proofs are to be found in Omelka [18]. 2. Auxiliary results In this section some auxiliary results concerning the asymptotic behaviour of certain processes associated with M -estimation in the model (1.1) are stated. It is useful to distinguish whether an M -estimator is studentized or not.

196

M. Omelka

2.1. Fixed scale Let {cin , i = 1, . . . , n} and {xin , i = 1, . . . , n} be triangular arrays of scalars and vectors in Rp respectively, and t = (t1 , . . . , tp )T . Our interest is in the (fixed scale) M -process (2.1)

n

Mn (t) =

 cin ψ(ei −

tT√xin ) n

− ψ(ei ) +

tT√xin n

 ψ  (ei ) ,

i=1

where t ∈ T = {s ∈ R : |s|2 ≤ M } and M is an arbitrarily large but fixed constant. We will make the following assumptions: p

X.1

1 2 c = O(1), n i=1 in n

X.2

max1≤i≤n |cin | √ = 0, n→∞ n lim

1

|xin |22 = O(1), n i=1 n

lim

n→∞

X.3 lim max

n→∞ 1≤i≤n

X.4

max1≤i≤n |xin |2 √ = 0, n

|cin | |xin |2 √ = 0, n

1 2 c |xin |22 = O(1), n i=1 in n

Bn2 =

as n → ∞.

While assumptions X.1 – 3 are analogous to the assumptions used in Jureˇckov´a [9] to deal with Wilcoxon rank process, the last assumption X.4 is purely for convenience. If Bn2 = O(1) were not satisfied, we would work with the process Mn (t) = Mn (t) and derive analogous results. Bn In Section 3 we will substitute xij (j = 1, . . . , p) for cin to find the second ˆ . For cin = |xin |2 , order asymptotic distributions of the regression M -estimator β n assumptions X.1 – 4 may be summarised as XX.1 1

|xin |42 = O(1), n i=1 n

(2.2)

max1≤i≤n |xin |22 √ = 0. n→∞ n lim

For notational simplicity, in the following we will write simply ci and xi instead of cin and xin . The distribution function F of the errors in the model (1.1) and the function ψ used to construct an M -estimator through (1.2) are assumed to satisfy the following regularity conditions. Fix. 1 ψ is absolutely continuous with a derivative ψ  such that E[ψ  (e1 )]2 < ∞. Fix. 2 The (random) function p(t) = ψ  (e1 +t) is continuous in the quadratic mean at the point zero, that is lim E [p(t) − p(0)] = lim E [ψ  (e1 + t) − ψ  (e1 )] = 0. 2

2

t→0

t→0

SOAR of M-estimators

197

Fix. 3 The second derivative of the function λ(t) = E ψ(e1 + t) is finite and continuous at the point 0. Inspecting Fix. 1 – 3 one sees that the more is assumed about the function ψ, the less is needed to be assumed about F and the other way around. In robust statistics it is quite common to put restrictive conditions on the function ψ, as the distribution F of the errors is generally unknown. For instance if the function ψ is twice differentiable, then it is not difficult to verify that assumptions Fix. 1 – 3 are met if both ψ  and ψ  are bounded and ψ  is continuous F -almost everywhere. This includes e. g. Tukey’s biweight function ψ(x) = x(1 −

x2 2 k2 )

I{|x| ≤ k}.

An important class of ψ functions which do not posses a second derivative everywhere are piecewise linear functions. This class includes e. g. Huber’s function ψ(x) = max{min{x, k}, −k}. Assumptions Fix. 1 – 3 are satisfied provided that: A.1 ψ is a continuous piecewise linear function with the derivative ψ  (x) = αj ,

for rj < x ≤ rj+1 , j = 0, . . . , k,

where α0 , α1 , . . . , αk are real numbers, α0 = αk = 0 and −∞ = r0 < r1 < . . . < rk < rk+1 = ∞. A.2 The cdf F is absolutely continuous with a derivative which is continuous at the points r1 , . . . , rk . Note that assumption A.1 trivially implies Fix. 1 and A.2 ensures both Fix. 2 and Fix.3. Many of the following results (in particular for studentized M -estimators) simplify significantly if the distribution of the errors is symmetric. For the sake of later reference let us state this assumption explicitly. Sym The distribution of the errors is symmetric and the ψ-function is antisymmetric, that is F (x) = 1 − F (−x) and ψ(x) = −ψ(−x) for all x ∈ R. Put γ2 for the second derivative of the function λ(t) = E ψ(e1 + t) at the point 0. k That is γ2 = j=1 αj [f (rj+1 ) − f (rj )] in the case of a piecewise linear ψ and γ2 = E ψ  (e1 ) for a sufficiently smooth and integrable ψ. Note that if Sym holds then γ2 = 0. n Theorem 1. Put Wc,n = n1 i=1 ci xi xT i . If X.1 – 4 and Fix. 1 – 3 hold, then     (2.3) E sup Mn (t) − γ22 tT Wc,n t = o(1). t∈T

Later it will be useful to rewrite the statement of Theorem 1 (with the help of Chebychev’s inequality) as (2.4)

n

ci ψ(ei −

i=1

T t√ xi ) n



n

ci ψ(ei ) +

i=1

n

c i xi

i=1 T

= − √t n

n

i=1

uniformly in t ∈ T .

T γ√ 1t n

ci xi [ψ  (ei ) − γ1 ] +

γ2 2

tT Wc,n t + op (1)

198

M. Omelka

2.2. Studentized M-processes As the M -estimator is not in general scale invariant, in practice it is usually studentized. To investigate properties of the studentized M -estimators, it is useful to study the asymptotic properties of the ‘studentized’ M -process

Mn (t, u) =

n

  −1/2 ci ψ e−u n (ei −

T t√ xi )/S n



− ψ(ei /S)

i=1 T

+ St √xni ψ  (ei /S) +

√u ei n S

 ψ  (ei /S) ,

where (t, u) ∈ T = {(s, v) : |s|2 ≤ M, |v| ≤ M } (⊂ Rp+1 ) with M being an arbitrarily large but fixed constant. As the studentization brings in perturbations in scale, more restrictive assumptions on the function ψ and the distribution of the errors than in the fixed scale case are needed. St.1 ψ is absolutely continuous with a derivative ψ  such that   2 E ψ  eS1 < ∞. 1 +t St.2 The (random) function p(t, v) = ψ  ( eSe v ) is continuous in the quadratic mean at the point (0, 0), that is

lim (t,v)→(0,0)

2

E [p(t, v) − p(0, 0)] =

lim (t,v)→(0,0)

  1 +t   2 E ψ  eSe = 0. − ψ  eS1 v

1 +t St.3 The function λ(t, v) = E ψ( eSe v ) is twice differentiable and the second partial derivatives are continuous and bounded in a neighbourhood of the point (0, 0).

If the function ψ is twice differentiable almost everywhere then it is not difficult to show that assumptions St. 1 – 3 are met if the following functions ψ  (x), x ψ  (x), ψ  (x), x ψ  (x) and x2 ψ  (x) are bounded and continuous F -almost everywhere. If ψ is a piecewise linear function, then the assumptions St.1-3 are met provided A.1-2 hold with the only modification that the points r1 , . . . , rk in A.2 are replaced by the points S r1 , . . . , S rk . Before we proceed, it will be useful to introduce the following notation. Let the e1  e1 +t 1 +t partial derivatives of the functions λ(t, v) = E ψ( eSe v ) and δ(t, v) = E S ψ ( Sev ) be indicated by subscripts. Put (2.5)

 γ1 = λt (0, 0) =  γ2 = λtt (0, 0) =

1 S

1 S2

E ψ

E ψ 

,

   γ1e = −λv (0, 0) = E eS1 ψ  eS1 ,

,

   γ2e = δt (0, 0) = E Se12 ψ  eS1 ,

 e  1

S

 e1  S

γ2ee = −δv (0, 0)

  2   = E eS1 ψ  eS1 .

The formulas in the brackets are for the case of ψ sufficiently smooth and appropriately integrable. We do not give formulas for the case of a piecewise linear ψ as they are rather complicated in general case. According to the assumptions St. 1 – 3 all these quantities are finite. Note that λtv (0, 0) = γ1 + γ2e and λvv (0, 0) = γ1e + γ2ee .

SOAR of M-estimators

199

Theorem 2. If X.1-4 and St. 1 – 3 hold, then (2.6)

  E sup Mn (t, u) − (t,u)∈T

γ2 2

tT Wc,n t



(γ2e +γ1 )u tT n

n

ci x i −

(γ2ee +γ1e ) u2 2n

i=1

n 

 ci  = o(1), i=1

where Wc,n was defined in Theorem 1. n Remark 1. Note that if i=1 ci = 0, the last term (corresponding to small perturbations in scale) on the left-hand side of (2.6) vanishes. If assumption Sym (of symmetry) is satisfied, then γ2 = γ1e = γ2ee = 0 and even the second term on the left-hand side of (2.6) disappears. Thus under assumption Sym Theorem 2 implies that (2.7)

n

 −1/2 ci ψ e−u n (ei −

i=1

T t√ xi )/S n





n

ci ψ(ei /S) +

T γ√ 1t n

i=1 T

= − √t n

n

ci x i

1 S

 ψ  (ei /S) − γ1 −

n

c i xi

i=1 √u n

i=1

n

ci

 ei S



ψ  (ei /S)

i=1

+

(γ2e +γ1 )u tT n

n

ci xi + op (1),

i=1

uniformly in (t, u) ∈ T . 3. Second order asymptotic representation of M-estimators In Section 2 technical results on approximation of linear processes associated with M -estimation in linear models were presented. One of the possible applications of these results is deriving a two term von Mises expansion of M -estimators defined in (1.2). 3.1. First order asymptotic representation (FOAR) Deriving the second order asymptotic representation of a fixed scale M -estimator is very straightforward provided one √is allowed to substitute the parameter t in ˆ − β). To justify this substitution the the asymptotic expansion (2.4) with n(β n √ ˆ − β) = Op (1). That is ˆ has to be n-root consistent, that is √n(β estimator β n n guaranteed by the following two assumptions: Fix. 4 (St.4) The function h(t) = E ρ(e1 − t) (or h(t) = E ρ( e1S−t )) has a unique minimum at t = 0, that is for every δ > 0: inf |t|>δ h(t) > h(0). n XX.2 V = limn→∞ Vn , where Vn = n1 i=1 xi xT i and V is a positive definite p × p matrix. With the help of Fix. 4, XX.2 and Theorem 1 which implies   n   T   √1

t√ xi (3.1) sup  n xi [ψ(ei − n ) − ψ(ei )] + γ1 Vn t = op (1),   t∈T i=1

200

M. Omelka

one can use the technique of the proof of Theorem 5.5.1 of Jureˇckov´a and Sen [14] ˆ of system of equations (1.2) such that to show that there exists a root β n √ ˆ − β) = Op (1). n(β n

(3.2)

√ ˆ Now inserting n(β n − β) for the parameter t in (3.1) gives the first order asymptotic representation (1.3). 3.1.1. FOAR for a studentized M-estimator To be able to be as explicit as possible we will concentrate on models (1.1) that include an intercept,√that is xi1 = 1 for i = 1, . . . , n. Let us also assume the scale estimator Sn to be n-consistent, that is there exists a finite positive constant S such that √ Sn (3.3) n( S − 1) = Op (1). Similarly as for a fixed scale M -estimator one can derive the first order asymptotic representation (3.4)



ˆ − β) = n(β n

−1 Vn √ γ1 n

n

xi ψ

 ei  S



γ1e γ1



n( SSn − 1) u1 + op (1),

i=1

where u1 = (1, 0, . . . , 0)T ∈ Rp and γ1 , γ1e are defined in (2.5) of Section 2.2. ˆ does not depend on the asymptotic Note that the FOAR of the slope part of β n distribution of the scale estimator Sn . This holds true also for the intercept provided the assumption of symmetry Sym is satisfied, which implies γ1e = 0. 3.2. Second order asymptotic representation (SOAR) 3.2.1. SOAR for a fixed-scale M-estimator For our convenience l = 1, . . . , p n let us restate expansion (2.4) for the vector case. For p p p put Wnl = n1 i=1 xli xi xT i and let Wn be a bilinear form from R × R to R given by Wn (t, s) = (tT Wn1 s, . . . , tT Wnp s)T . Corollary 1. Assume XX.1 and Fix. 1 – 3, then it holds uniformly in t ∈ T (3.5)

n

i=1

xi ψ(ei −

T t√ xi ) n



n

xi ψ(ei ) + γ1



n Vn t

i=1 T

= − √t n

n

 x i xT i [ψ (ei ) − γ1 ] +

γ2 2

Wn (t, t) + op (1).

i=1

The proof follows by applying Theorem 1 to each of the coordinate separately. ˆ − β) can be substituted for t in (3.5). ˆ satisfies (3.2), √n(β As the estimator β n n The first order asymptotic representation (1.3) and some algebraic manipulations

SOAR of M-estimators

201

yield (3.6)



ˆ − β) − n(β n

−1 Vn √ γ1 n

i=1 n

3 =

−1 Vn √ γ1 n

− √1n

n

i=1

+

−1 γ2 Vn √ 2 γ1 n

xi ψ(ei ) 43  x i xT i [ψ (ei )

) −1 Vn √ γ1 n

Wn

n

−1 Vn √ γ1 n

− γ1 ]

xi ψ(ei ),

−1 Vn √ γ1 n

i=1

n

4 xi ψ(ei )

i=1 n

*

xi ψ(ei )

+ op ( √1n ).

i=1

If the symmetry assumption Sym is satisfied, then the second term on the righthand side vanishes and both factors in the first term are asymptotically normal as well as asymptotically independent. This is in agreement with the results of Jureˇckov´a and Sen [13] where the asymptotic distribution of the second term in the von Mises expansion is shown to be a product of two normal distributions. 3.2.2. SOAR for a studentized M-estimator √ If n-consistency of Sn as expressed by (3.3) holds and assumptions St.1-4 and XX.1-2 are satisfied, one can proceed very similarly as for the fixed scale M estimators. Informally speaking, the second order asymptotic √ ˆ representation √ for studentized M -estimators may be found by substituting n(β n log( SSn ) n − β) for t, for u and xi for ci in (2.7). But as the resulting expression is rather long, we will write it down only when the assumption of symmetry Sym holds. After some algebra we get (3.7)



ˆ − β) − n(β n

−1 Vn √ γ1 n

=

xi ψ(ei /S)

i=1

3 − √1n

n

−1 Vn √ γ1 n

n

43

 xi x T i [ψ (ei /S)

−1 Vn √ γ1 n

− γ1 ]

i=1



4 xi ψ(ei /S)

i=1

3 √1 n

n



n( SSn

− 1)

−1 Vn √ γ1 n

n

xi

 ei S



ψ (ei /S)

4



i=1

+

γ2e +γ √ 1 γ1 n



3

n( SSn

− 1)

−1 Vn √ γ1 n

n

4 xi ψ(ei /S)

+ op ( √1n ).

i=1

Inspecting (3.7) it may be of interest to note that although the first order asymptotic distribution of a studentized M -estimator of the slope parameters does not depend on the asymptotic distribution of Sn , the second order distribution does, even if the assumption Sym is satisfied. Thus when excluding artificial or pathological examples, the studentized M -estimator cannot be asymptotically equivalent of second order with an R-estimator or a fixed scale M -estimator. 4. Conclusions We have presented a way how to derive a second order asymptotic representation of an M -estimator in a linear model with fixed carriers. This representation may be

202

M. Omelka

ˆ with another estimator that is asympused e. g. to compare the M -estimator β n ˆ totically equivalent to β n . This may be for example a one-step M -estimator (see e. g. Welsh and Ronchetti [19]) or an appropriate R-estimator (see Huˇskov´a and Jureˇckov´a [8] and Jureˇckov´ a [10]). For instance, it is well known that if ψ(x) is proportional to (F (x) − 12 ), then the fixed-scale M -estimator is asymptotically equivalent to an R-estimator based on the Wilcoxon scores. Our results can be used for a finer comparison of those estimators. The second order asymptotic results also proved to be useful when investigating ‘Rao Score type’ confidence interval, see Omelka [17]. Acknowledgements The author wish to express his thanks to Prof. Jana Jureˇckov´a for her encouragement, guidance and support when supervising his PhD thesis. The author is also thankful to two anonymous referees for their remarks and comments. References [1] Arcones, M. A. (1996). The Bahadur–Kiefer representation for U -quantiles. Ann. Statist. 24 1400–1422. [2] Arcones, M. A. (1996). The Bahadur–Kiefer Representation of Lp Regression Estimators. Econometric Theory 12 257–283. [3] Arcones, M. A. (1998). The Bahadur–Kiefer representation of two dimensional spatial medians. Ann. Inst. Stat. Math 50 71–86. [4] Arcones, M. A. (1998). Second order representations of the least absolute deviation regression estimator. Ann. Inst. Stat. Math 50 87–117. [5] Arcones, M. A. and Mason, D. M. (1997). A general approach to Bahadur– Kiefer representations for M -estimators. Mathematical Methods of Statistics 6 267–292. [6] Bahadur, R. R. (1966). A note on quantiles in large samples. Ann. Math. Statist. 37 577–580. [7] Boos, D. D. (1977). Comparison of L- and M -estimators using the second term of the von Mises expansion. Tech. rep., North Carolina State University, Raleigh, North Carolina. ´ , M. and Jurec ˇkova ´ , J. (1981). Second order asymptotic relations [8] Huˇ skova of M-estimators and R-estimators in two-sample location model. J. Statist. Plann. Inference 5 309–328. ˇkova ´ , J. (1973). Central limit theorem for Wilcoxon rank statistics [9] Jurec process. Ann. Statist. 1 1046–1060. ˇkova ´ , J. (1977). Asymptotic Relations of M-estimates and R-estimates [10] Jurec in linear regression model. Ann. Statist. 5 464–472. ˇkova ´ , J. and Sen, P. K. (1989). A second-order asymptotic distri[11] Jurec butional representation of M -estimators with discontinuous score functions. Annals of Probability 15 814–823. ˇkova ´ , J. and Sen, P. K. (1989). Uniform second order asymptotic [12] Jurec linearity of M -statistics in linear models. Statist. Dec. 7 263–276. ˇkova ´ , J. and Sen, P. K. (1990). Effect of the initial estimator on [13] Jurec the asymptotic behavior of one-step M-estimator. Ann. Inst. Statist. Math. 42 345–357.

SOAR of M-estimators

203

ˇkova ´ , J. and Sen, P. K. (1996). Robust Statistical Procedures: Asymp[14] Jurec totics and Interrelations. Wiley, New York. [15] Kiefer, J. (1967). On Bahadur’s representation of sample quantiles. Ann. Statist. 38 1323–1342. [16] Knight, K. (1997). Asymptotics for L1 regression estimators under general conditions. Technical Report 9716, Dept. Statistics, Univ. Toronto. [17] Omelka, M. (2006). An alternative method for constructing confidence intervals from M -estimators in linear models. In: Proceedings of Prague Stochastics 2006, 568–578. [18] Omelka, M. (2006). Second order properties of some M -estimators and R-estimators. Ph.D. thesis, Charles University in Prague, available at http://www.karlin.mff.cuni.cz /˜omelka/Soubory/omelka thesis.pdf. [19] Welsh, A. H. and Ronchetti, E. (2002). A journey in single steps: robust one-step M-estimation in linear regression. J. Statist. Plann. Inference 103 287–310.

IMS Collections Nonparametrics and Robustness in Modern Statistical Inference and Time Series Analysis: A Festschrift in honor of Professor Jana Jureˇ ckov´ a Vol. 7 (2010) 204–214 c Institute of Mathematical Statistics, 2010  DOI: 10.1214/10-IMSCOLL720

Extremes of two-step regression quantiles∗ Jan Picek1 and Jan Dienstbier1,2 Technical University of Liberec and Charles University in Prague Abstract: The article deals with estimators of extreme value index based on twostep regression quantiles in the linear regression model. Two-step regression quantiles can be seen as a possible generalization of the quantile idea and as an alternative to regression quantiles. We derive the approximation of the tail quantile function of errors. Following Drees (1998) we consider a class of smooth functionals of the tail quantile function as a tool for the construction of estimators in the linear regression context. Pickands, maximum likelihood and probability weighted moments estimators are illustrated on simulated data.

1. Introduction Let E1 , . . . , En , n ∈ N be independent and identically distributed random variables with a common distribution function function F belonging to some max-domain of attraction of an extreme-value distribution Gγ for some parameter γ ∈ R, i.e there exists a function a(t) with a constant sign such that for any x > 0 and some γ ∈ R (1.1)

F −1 (1 − tx) − F −1 (1 − t) x−γ − 1 = . t→0 a(t) γ lim

The relation (1.1) is equivalent to the Fisher–Tippet result: If for some distribution function Gγ (x) and sequences of real numbers a(n) > 0 and b(n), n ∈ N, limn→∞ F n (a(n)x + b(n)) = Gγ (x) for every continuity point x of G, then Gγ is the extreme value distribution, i. e. Gγ (x) = exp(−(1 + γx)−1/γ ), γ = 0, and the case γ = 0 is interpreted as the limit γ → 0. The problem of estimating the so-called extreme value index γ, which determines the behavior of the distribution function F in its upper tail, has received much attention in the literature, see e. g. [3] and references cited there. More attention has been paid to estimators that are based on a certain number of upper order statistics. They are usually scale invariant but not invariant under a shift of the data, see [1] for some examples. However, one of the challenging ideas of the recent advances in the field of statistical modeling of extreme events has been the development of models with timedependent parameters or more generally models incorporating covariates. ∗ The research was supported by the Ministry of Education of the Czech Republic under research project LC06024. 1 Department of Applied Mathematics, Technical University of Liberec, Studentsk´ a 2, CZ-461 17 Liberec, Czech Republic, e-mail: [email protected]; [email protected] 2 Department of Statistics, Charles University in Prague, Sokolovska 83, CZ-186 75 Prague 8, Czech Republic, AMS 2000 subject classifications: Primary 62G30, 62G32; secondary 62J05. Keywords and phrases: two-step regression quantile, R-estimator, extreme value index, tail function.

204

Extremes of two-step regression quantiles

205

Therefore, in the present paper we aim at extending the general result given in Drees (1998) to linear regression. Consider the following linear model (1.2)

Y = β0 1n + Xβ + E,

where Y = (Y1 , . . . , Yn ) is a vector of observations, X is an (n × p) known design matrix with rows xi = (xi1 , . . . , xip ) , i = 1, . . . n, 1n = (1, . . . , 1) ∈ Rn , E = (E1 , . . . , En ) is a vector of i. i. d. errors with an unknown distribution function F , β0 and β = (β1 , . . . , βp ) are the unknown parameters. The outline of this paper is as follows. Section 2 describes the construction of the two-step regression quantiles. In Section 3 the estimation of the extremes of the twostep extreme regression quantiles is given. Following Drees (1998) we establish the approximation for the tail quantile function of residual and we show the consistency and the asymptotic distribution of functionals of the tail quantile function in Section 4. The simulation study is contained in Section 5. 2. Two-step regression quantiles Jureˇckov´a and Picek [9] proposed an alternative of the α-regression quantiles sugˆ (α) be gested by Koenker and Basset [12] in the model (1.2) as follows: Let β nR an appropriate R-estimate of the slope parameter β and let β˜n0 denote [nα]-order   ˜ (α) := β˜n0 , β  (α) ˆ (α), then the vector β statistic of the residuals Yi − x β i

nR

n

nR

is called the two-step α-regression quantile. The initial R-estimator of the slope parameters is constructed as an inverse of the rank test statistic calculated in the Hodges-Lehmann manner, see [11]: Denote p Rni (Y − Xb) the rank of Yi − x i b among (Y1 − x1 b, . . . , Yn − xn b), b ∈ R , i = 1, . . . , n. Note that Rni (Y−Xb) is also the rank of Yi −b0 −xi b among (Y1 −b0 (α)− x 1 b, . . . , Yn − b0 (α) − xn b) for any α ∈ (0, 1) because the ranks are translation invariant. Consider the vector Sn (b) = (Sn1 (b), . . . , Snp (b)) of the linear rank statistics, where   n

Rni (Y − Xb) (2.1) Snj (b) = , b ∈ Rp , j = 1, . . . , p. xij ϕα n + 1 i=1 ˆ and ϕα = α − I[x < 0], x ∈ R. Then the estimator β nR is defined as  = argmin β nR b∈Rp Sn (b)1 ,

(2.2) where S1 =

p j=1

|Sj | is the L1 norm of S, see [6]; or  = argmin β nR b∈Rp Dn (b),

(2.3) where (2.4)

Dn (b) =

n

i=1

(Yi − xi b)ϕα



Rni (Y − Xb) n+1



is the Jaeckel’s measure of rank dispersion, see [5].  β nR estimates only the slope parameters and the computation is invariant of the size of the intercept.

206

J. Picek and J. Dienstbier

Assume the following conditions on distribution function F of errors and on X in model (1.2): (A1) F has a continuous density f that is positive on the support of F and has   2 f (x) finite Fisher’s information, i. e. 0 < dF (x) < ∞. f (x)   −1 n xi = 0. (A2) limn→∞ max1≤i≤n x i k=1 xk xk n = D∗ , where x∗i = (1, xi1 , . . . , xip ) , i = 1, . . . , n, (A3) limn→∞ n−1 i=1 x∗i x∗ i and D∗ is a positively definite (p + 1) × (p + 1) matrix. Under conditions (A1) – (A3), the R-estimator (2.2) and (2.3) admits the following asymptotic representation, 1  − β) n 2 (β nR 1

= n− 2 (f (F −1 (α))−1 D−1

(2.5)

n

n

  xi α − I[Ei < F −1 (α)] + op (n−1/4 ),

i=1

where D = limn→∞ Dn , Dn = n1 i=1 xi x i , for details see [10]. The solutions of (2.2) and (2.3) are generally not unique, nevertheless the asymptotic representation (2.5) applies to any of such solution; e. g. we can take the center of gravity of the set of all solutions. Jureˇckov´a and Picek showed in [9] that the two-step regression quantiles are asymptotically equivalent to the regression quantiles suggested by Koenker and Basset in [12]. The α-regression quantile is obtained as a solution of the minimization 3 n 4

p ˆ (α) := argmin (2.6) β ρα (Yi − b0 − x b), b0 (α) ∈ R, b ∈ R n

(b0 ,b)

i

i=1

with the loss function given by ρα (x) = |x|(αI[x > 0] + (1 − α)I[x < 0]), x ∈ ˆ (α) is the vector β(α) = (β0 + R. The population counterpart of the vector β n −1 F (α), β1 , . . . , βp ) . The difference between empirical regression quantile and its theoretical population counterpart is OP (n−3/4 ) under general conditions on X and F , see e. g. Theorem 7.4.1. in [10]. 3. Extremes of two-step quantiles ˆn:n , which they The authors of [9] also considered the extreme two-step quantile E define as the maximum of the residuals   ˆn:n = max{Y1 − x β E 1 nR , . . . , Yn − xn β nR }

(3.1)

 calculated with respect to an appropriate R-estimate β nR of β. Under suitable ˆ conditions (see [9]) En:n is a consistent estimate of En:n + β0 and ˆn:n − En:n − β0 | = Op (n−δ ) |E

(3.2)

as n → ∞, 0 < δ
E1:n +β0 +un Notice that E implies  ) ≤ E1:n + β0 + un < E ˆD = E1:n + β0 + xD (β − β ˆ1:n . E nR 1 1   ˆ1:n is the smallest observation among E ˆi , i = 1, . . . , n , therefore it canHence, E ˆD . not be greater than E 1 ˆ2:n ≤ E2:n + β0 + un because E ˆ2:n > E2:n + β0 + un leads to Similarly, E  ) ≤ E2:n + β0 + un < E ˆD = E2:n + β0 + xD (β − β ˆ2:n E nR 2 2 and

 ) ≤ E2:n + β0 + un < E ˆ2:n . ˆD = E1:n + β0 + xD (β − β E nR 1 1

If we proceed analogously, we get (3.7)

ˆi,n ≤ Ei,n + β0 + un , E

i = 1, . . . , n.

ˆn:n ≥ On the other hand, it holds for the highest two-step ordered residual E ˆ En:n + β0 − un , because En:n < En:n + β0 − un implies  ) ≥ En:n + β0 − un > E ˆD = En:n + β0 + xD (β − β ˆn:n . E nR n n We get by the similar arguments as in (3.7) (3.8)

ˆi,n ≥ Ei,n + β0 − un , E

i = 1, . . . , n.

Finally, un = Op (n−δ ) together with (3.8) and (3.7) imply (3.6). 4. Estimators of extreme value index Suppose for a while we have a simple location model, i. e. β = 0 in (1.2). Many estimators of γ that are based on upper order statistics considered can be represented (at least approximately) as a smooth functionals T (Qn ) of the empirical tail quantile function   kn −1 Qn (t) := Fn 1− t = Xn−[kn t]:n , t ∈ [0, 1], n with Fn−1 denoting the empirical distribution function and Xi:n the ith order statistic of the i.i.d. sample. Note that Qn depends on the (kn + 1) largest order statistics (1 ≤ kn < n). Drees in [4] studied the asymptotic behaviour of such estimators. Consider the general regression model (1.2) and the largest order statistics of the ˆk:n > 0. Then define the tail quantile residuals. Let any k ∈ N be such that E function of the residuals as follows ˆ n,k (t) := E ˆn−[kt]:n . Q ˆ n,k is the consistent estimate of the empirical tail function of the Observe that Q errors Qn,k (t) = En−[kt]:n in the sense of Lemma 3.1. We shall provide an approxiˆ n,k for the intermediate sequences of k(n). mation of Q

Extremes of two-step regression quantiles

209

Suppose that the distribution function F in (1.2) satisfies (1.1). To obtain the ˆ n,k , however, it is useful to impose stronger condition concerning approximation of Q the second order approximations of the tails (4.1)

lim

F −1 (1−tx)−F −1 (1−t) a(t)



x−γ −1 γ

A(t)

t→0

= K(x),

where a is the function related to (1.1), A(t) is a function of constant sign and K is some function that is not a multiple of the (x−γ − 1)/γ. It can be shown that there is some ρ = 0 such that K(x) = zγ−ρ = (xρ−γ −1)/(γ−ρ), which for the cases ρ = 0 and γ = 0 is understood to be equal to the limit of zγ−ρ , as γ → 0 or ρ → 0, respectively, see [3] for details. The so-called second-order condition (4.1) naturally arises when discussing the bias of the estimators of γ, see [3] or [1]. Under second order condition (4.1) one can establish following uniform approximation of the tail quantile function. Theorem 4.1. Suppose that the distribution function F of errors in (1.2) satisfies (4.1) for some γ ∈ R and ρ ≤ 0. Suppose that the assumptions of Lemma 3.1 are fulfilled. Then we can define a sequence of Wiener processes {Wn (t)}t≥0 such that for suitable chosen functions A and a and each ε > 0,    ˆ −1 1 − nk − β0  1 1 γ+ 2 +ε  Qn,k (t) − F sup t − zγ (t) − k − 2 t−(γ+1) Wn (t)   a(k/n) t∈(0,1]       k K(t)  = oP k −1/2 + |A(k/n)| , +A (4.2) n n → ∞, provided k = k(n) → ∞, k/n → 0 and



kA(k/n) = O(1)

Proof. Immediately follows from (3.6) and the approximation of the tail quantile function derived in Theorem 2.1 of [4]. Following [4] we consider the class of smooth statistical functionals of the estimated ˆ n,k ) for fixed parameter values γ. We are going empirical tail quantile function T (Q to describe the properties of the functionals on space of functions that are close to ˆ n,k ). Since F −1 (1 − t) diverges as t → 0 the tail quantile function (or its estimate Q for γ > 0, we introduce weighted space H of real functions on the interval [0, 1] which are smooth and similar to the tail quantile function  .  (log log(3/t))1/2 h(t)  (4.3) H := h : [0, 1] → [0, ∞] h ∈ C[0, 1], lim , t ∈ [0, 1] . t↓0 t1/2 For each γ ∈ R and h ∈ H we define seminorm on the space of real functions on the unit interval by zγ,h := tγ h(t)|z(t)|. In the view of Theorem 4.1 (4.4)

Dγ,h :=

 .  z : [0, 1] → R lim tγ h(t)z(t) = 0, (tγ h(t)z(t))t∈[0,1] ∈ D[0, 1] t↓0

equipped with the weighted supremum seminorm zγ,h is the suitable space in ˆ n,k can be established. Furthermore, let Cγ,h := which weak convergence of Q  z ∈ Dγ,h |z|(0,1] ∈ C(0, 1] be a subset of continuous functions on (0, 1] of Dγ,h . We shall formulate the key theorem showing the consistence and asymptotical norˆ n,k . mality of a broad class of functionals of Q

210

J. Picek and J. Dienstbier

Theorem 4.2. Suppose that for γ ∈ R and some h ∈ H the functional T : span(Dγ,h , 1) → R satisfies (i) (ii) (iii) (iv)

T|Dγ,h is B(Dγ,h , B(R)-measurable (where B denotes the Borel-σ-field), T (az + b) = T (z), for all z ∈ Dγ,h , a > 0, b ∈ R, T (zγ ) = T (1/γ(x−γ − 1)) = γ T|Dγ,h is Hadamard differentiable tangentially to Cγ,h ⊂ Dγ,h , at zγ with a derivative Tγ , i. e. for some signed measure νT,γ it holds for all 0 < εn → 0 and all yn ∈ Dγ,h such that yn → y ∈ Cγ,h (4.5)

T (zγ − εyn ) − T (zγ ) lim = Tγ (y) = εn →0 εn



1

y dνT,γ . 0

Then under the assumptions of Theorem 4.1 provided that ˆ n,k ) → γ (i) T (Q 1/2 ˆ n,k ) − γ)) → N (μT,γ,ρ , σT,γ ), where (ii) L(kn (T (Q

μT,γ,ρ

:=



kA(k/n) → λ

1

zγ−ρ dνT,γ  1  γ−1 Var t W (t) dνT,γ (t) 0

σT,γ

:=



1

0



1

(st)γ−1 min(s, t) dνT,γ (s) dνT,γ (t)

= 0

0

Proof. Follows from Theorem 4.1 similarly as the proof of Theorem 3.2 in [4]. Theorem 4.2 assures that any location and scale invariant estimator of γ is consistent even if it is calculated from estimated residuals instead of the unobservable errors in (1.2). Moreover, as have been shown in [4] practically all location and scale invariant estimators of γ belongs to the class satisfying the assumptions of Theorem 4.2. Example 4.1. (i) Pickands estimator of γ is generated by the functional   1 z(1/4) − z(1/2) (4.6) TPick (z) = I[(z(1/4) − z(1/2))(z(1/2) − z) > 0]. log log 2 z(1/2) − z(1) (ii) Generalized probability weighted moment can be regarded as + , z dv1 (4.7) TPWM (z) = I z dv2 = 0 z dv2 for suitable finite signed Borel measures vi on [0,1], see [4]. Since the larger observation approximately follow the Generalized Pareto (GP) distribution, if we apply the maximum likelihood procedure to the observations exceeding a given high threshold using GP distribution, we obtain an estimator of extreme value index (i. e. the shape parameter GP distribution). The maximum likelihood estimator is location and scale invariant, details see [3]. Note that we could also give the similar results of Theorem 4.1 and Lemma 3.1  ˆ if we would replace β nR by any other suitable estimator of the slope parameter β n fulfilling ˆ − β = OP (n−1/2 ). β n

Extremes of two-step regression quantiles

211

 Nevertheless, we focus on β nR because it estimates only the slope parameters in (1.2) and the computation is invariant of the size of the intercept. But primarily we would like to stress that the nature of the two-step regression quantiles and their relation to the regression quantiles of Koenker and Basset, which makes their properties an interesting subject to study. The studied two-step regression α-quantile is asymptotically equivalent and numerically very close to the regression α-quantile and the maximal two-step regression quantile coincides with the maximal regression quantile as it was already mentioned. That is important if we have proved some results for the two-step regression quantiles only. While there were described asymptotic properties of the maximal regression quantile, see [15], [8] and others, only [2] studied the properties of the extreme and intermediate regression quantiles for different sequences of αn but only in the pointwise sense. Theorem 4.1 gives immediately the uniform approximation of the tails of the twostep quantiles, which enables to base the tail modelling fully on the quantile function ˆ n,k (t). of the residuals Q The intrinsic connections between the regression quantiles and the two-step regression are important in the case that the assumptions are violated. There exist various interesting results showing the stability of regression quantiles even under dependency and heterogenity of the conditional distribution of the errors, for some overview see [13]. In this context, the extreme two-step regression quantiles can be observed as an interesting pattern for working with extreme regression quantiles. On the other hand, the previously described method are directly applicable for some real case studies where the independence of the errors is assumed. We can refer e. g. the Condroz dataset presented in [1] considering calcium level and pH level of the soil in different regions. We could find other examples e. g. in the climatology, where the most widely-used method for dealing with the problem of dependency is declustering. That approach is presented in [14], where authors proposed a methodology for estimating high quantiles of distributions of daily temperature, based on the peaks-over-threshold analysis with a time-dependent threshold expressed in terms of regression quantiles. 5. Numerical Illustration In order to check how the estimators of extreme value index perform in the linear regression model we have conducted a simulation study. We considered the model Yi = β0 + x i β + Ei ,

i = 1, . . . , n,

where the errors Ei , i = 1, . . . , n, were simulated from the Burr, Generalized Pareto and Pareto distributions with the following parameter values: sample size n = 400, β0 = 2, β = (β1 , β2 ) = (−1, 2), α = 0.5. Concerning the regression matrix we generated two columns (x11 , . . . , xn1 ) and (x12 , . . . , xn2 ) as two independent samples from the uniform distributions R(0, 10) and R(−5, 15), respectively. The  was computed by minimizing Jaeckel’s objective function (2.4). R-estimator β R 10 000 replications of the model were simulated for each combination of the pa (0.5) were calculated. rameters and then the residuals based on the R-estimator β R For the sake of comparison, the values of Pickands, maximum likelihood, and probability weighted moments estimator were computed for k - the varying fraction of ordered residuals. In Figures 1 – 3 we plotted the median, the 10 %-, 25 %-, 75 %- and 90 %- quantiles of sample of 10 000 estimated values of extreme index by three considered estimators against the intermediate sequences k in the regression model. For the sake of

212

J. Picek and J. Dienstbier

1.5 1.0 0.5 −0.5

0.0

Values of the estimated extreme index

1.0 0.5 0.0 −0.5

Values of the estimated extreme index

1.5

comparison, the same procedures were performed on the (normally unobservable) errors to see how much is lost by estimating the regression coefficients. Notice that the performance of the estimators practically depends only on the distribution of errors and not on the structure of regression matrix.

20

40

60

80

100

20

40

k

60

80

100

k

2.0 1.5 1.0 0.0

0.5

Values of the estimated extreme index

1.5 1.0 0.5 0.0

Values of the estimated extreme index

2.0

Fig 1. The median, the 10 %-, 25 %-, 75 %- and 90 %- quantiles in the sample of 10 000 estimated values of extreme index by Pickands (solid), maximum likelihood (dotted) and probability weighted moments estimators (dashed) for Generalized Pareto distribution of errors with the shape parameter γ = 0.5 (denoted by the horizontal line) in the regression model (left) and in the location model with unobserved errors (right).

20

40

60 k

80

100

20

40

60

80

100

k

Fig 2. The median, the 10 %-, 25 %-, 75 %- and 90 %- quantiles in the sample of 10 000 estimated values of extreme index by Pickands (solid), maximum likelihood (dotted) and probability weighted moments estimators (dashed) for Pareto distribution of errors the with shape parameter γ = 1 (denoted by the horizontal line) in the regression model (left) and in the location model with unobserved errors (right).

0.5

1.0

213

−0.5

0.0

Values of the estimated extreme index

0.5 0.0 −0.5

Values of the estimated extreme index

1.0

Extremes of two-step regression quantiles

20

40

60

80

k

100

20

40

60

80

100

k

Fig 3. The median, 10%-, 25%-, 75%- and 90%- quantiles in the sample of 10 000 estimated values of extreme index by Pickands (solid), maximum likelihood (dotted) and probability weighted moments estimators (dashed) for Burr distribution of errors with the shape parameter γ = 0.2 (denoted by the horizontal line) in the regression model (left) and in the location model with unobserved errors (right).

The simulation study indicated: (i) Results are affected by the specification of different values of k but the estimators give quite stable results for a suitable choice of fraction k. We see that the variance will be smallest for highest values of k. (ii) The Pickands estimator, compared to the other estimators, shows a much larger variability. On the other hand, the maximum likelihood estimator is biased for the Pareto distribution. It is considered on the basis of the theoretical result that the threshold excesses have a corresponding approximate distribution within the Generalized Pareto family (see e. g. [1]). Hence, it seems that asymptotic result does not work properly in our situation. (iii) The R-estimator is a solution of the optimization problem (2.2) in such a way it depends on initial values for the parameters to be optimized over. It seems from our simulation experiment that the resulting value of minimization does not depend (or depends very weakly) on the initial points. However, an unsuitable choice is the time expensive and it may complicate the computation considerably. (iv) As we have verified on a considerably larger simulation experiment, the properties of the two-step regression quantiles are very weakly affected by the chosen α and by the form of the matrix.

Acknowledgments

The authors thank three referees for their careful reading and for their comments, which helped to improve the text.

214

J. Picek and J. Dienstbier

References [1] Beirlant, J., Goegebeur, Y., Teugels, J., and Segers, J. (2004). Statistics of Extremes, Theory and Applications. Wiley, Chichester. [2] Chernozhukov, V. (2005). Extremal Quantile Regression. Ann. Math. Statist. 33 (2) 806–839. [3] de Haan, L. and Ferreira, A. (2006). Extreme Value Theory, An Introduction. Springer, New York. [4] Drees, H. . (1998) On Smooth Statistical Tail Functionals. Scandinavian Journal of Statistics 25 187–210. [5] Jaeckel, L. A. (1972). Estimating regression coefficients by minimizing the dispersion of the residuals. Ann. Math. Statist. 43 1449-1459. ˇkova ´ , J. (1971). Nonparametric estimate of regression coefficients. Ann. [6] Jurec Math. Statist. 42 1328–1338. ˇkova ´ , J. (1977). Asymptotic relation of M-estimates and R-estimates in [7] Jurec the linear regression model. Ann. Statist. 5 464–472. ˇkova ´ , J. (2007). Remark on extreme regression quantile. Sankhy¯ [8] Jurec a 69 87–100. ˇkova ´ , J. and Picek, J. (2005). Two-step regression quantiles. Sankhy¯ [9] Jurec a 227–252. ˇkova ´ , J. and Sen, P. K. (1996). Robust Statistical Procedures: Asymp[10] Jurec totics and Inter-Relations. J. Wiley, New York. [11] Hodges, J. L. and Lehmann, E. L. (1963). Estimation of location based on rank tests. Ann. Math. Statist. 34 598–611. [12] Koenker, R. and Bassett, G. (1978). Regression quantiles. Econometrica 46 33–50. [13] Koul, H. L. (2002). Weighted Empirical Processes in Dynamic Nonlinear Models. (Second Edition) Springer, New York. ´ , J., Picek, J. and Beranova ´ , R. (2010). Estimating extremes in [14] Kysely climate change simulations using the peaks-over-threshold method with a nonstationary threshold. Glob. Planet. Change, doi:10.1016/j.gloplacha.2010.03.006. ˇkova ´ , J. (1999). On extreme regression quantiles. [15] Portnoy, S. and Jurec Extremes 2 (3) 227–243.

IMS Collections Nonparametrics and Robustness in Modern Statistical Inference and Time Series Analysis: A Festschrift in honor of Professor Jana Jureˇ ckov´ a Vol. 7 (2010) 215–223 c Institute of Mathematical Statistics, 2010  DOI: 10.1214/10-IMSCOLL721

Is ignorance bliss: Fixed vs. random censoring Stephen Portnoy∗ Department of Statistics, University of Illinois Abstract: While censored data is sufficiently common to have generated an enormous field of applied statistical research, the basic model for such data is also sufficiently non-standard to provide ample surprises to the statistical theorist, especially one who is too quick to assume regularity conditions. Here we show that estimators of the survival quantile function based on assuming additional information about the censoring distribution behave more poorly than estimators (like the inverse of Kaplan–Meier) that discard this information. This phenomenon will be explored with special emphasis on the Powell estimator, which assumes that all censoring times are observed.

1. Introduction In many situations where censored observations appear, it is not unreasonable unreasonable to assume that the censoring values are known for all observations (even the uncensored ones). For example, one of the earliest approaches to censored regression quantiles was introduced by work of Powell in the mid 1980’s. Powell [10] assumed that the censoring values were constant, thus positing observations of the form Y ∗ = min(Y, c) (where Y ∗ is observed and Y is the possibly unobserved survival time that is assumed to obey some linear model). More generally, we may be willing to assume that we observe a sample of censoring times {ci } and a sample of censored responses Yi∗ = min(Yi , ci ), a model that could apply to a single sample. In this case, one could use the empirical distributions of the {Yi∗ } and {ci } and take the ratio of empirical survival functions to estimate the survival function of Y . This is asymptotically equivalent to applying the Powell method on a single sample. Despite some optimality claims in Newey and Powell [5], it turns out that the Kaplan–Meier estimate is better (asymptotically, and by simulations in finite samples) even though it does not use the full sample of {ci } values. More generally, even in multiple regression settings, the censored regression quantile estimators (Portnoy [7]) are better in simulations than Powell’s estimator, even for the constant censoring situation for which Powell’s estimator was developed. Remarkably, in the one sample case, replacing the empirical function of {ci } by the true survival function (assuming it is known) yields an even less efficient estimator. Thus, it appears that discarding what appears to be pertinent information improves the estimators. Here we will try to quantify and explain this conundrum. Department of Statistics, University of Illinois, 725 S. Wright St., Champaign IL 61801, U. S. A. e-mail: [email protected] ∗ Research partially supported by NSF Award DMS-0604229 AMS 2000 subject classifications: Primary 62N02, 62J05; secondary 62B10 Keywords and phrases: regression quantiles, conditional quantile, Powell estimator. 215

216

S. Portnoy

The basic phenomenon was first brought to my attention by Roger Koenker at coffee some time ago. A specific example concerned asymptotics for estimators of the survival quantile function in a single censored sample. Though the basic asymptotic results all appeared in the standard literature, the computations were combined as problems on a take-home exam for advanced econometric students to emphasize the rather surprising result that the more information an estimator incorporates, the poorer the asymptotic behavior. In fact, a recent treatment closely related to the one sample case of Section 2 below appears in Wang and Li [11]. Specifically, consider the one-sample model: Yi are i.i.d. with cdf F (x), ci i.i.d. with cdf G(x), and we observe Yi∗ = min{Yi , ci }. The problem is to estimate Q(τ ) ≡ F −1 (τ ) (nonparametrically). There are a number of estimators that converge in distribution at rate n−1/2 to an asymptotic normal distribution with mean Q(τ ) and asymptotic variance: (1)

a V ar ≡

F (ξτ )(1 − F (ξτ )) v(ξτ ) f 2 (ξτ )

ξτ = Q(τ ) = F −1 (τ ),

where f (x) is the density for F (x), and where v(x) depends on the estimator. The most classical estimator of the quantile function is the inverse of the Kaplan– Meier estimator. This inversion is trivial since the Kaplan–Meier estimator is monotonic; and its asymptotic variance is well-known to have dF (w) (1 − F (x)) x (2) vKM (x) = . 2 F (x) 0 (1 − F (w)) (1 − G(w)) There are (at least) two alternatives that are appropriate if all the ci -values are observed. The first is of particular interest here and was developed by Powell [10]: ˆ ) to minimize the following (non-convex) define the quantile function estimate, Q(τ objective function over β: (3)

n

ρτ (min{Yi∗ , ci } − min{β, ci }).

i=1

This was originally introduced for fixed (constant) censoring in linear models for the conditional median, but it was quickly recognized that the definition worked ˆ ) is given by (1) whenever all ci -values were known. The asymptotic variance of Q(τ with (4)

vP OW (x) = (1 − G(x))−1 .

An alternative with exactly the same asymptotic variance is the “synthetic” estimator (for example, see Leurgans [3]). Note that the c.d.f. for the observed value min{Yi , ci } is H(x) = 1 − (1 − F (x))(1 − G(x)). Thus, if all {ci } are known, ˆ for the observations, and G ˆ for the censoring we can use the empiric c.d.f.’s (H times) to estimate F: (5)

ˆ ˆ Fˆ (x) = 1 − (1 − H(x))/(1 − G(x)).

This function can be inverted (with perhaps some difficulty because of possible nonmonotonicity) to provide an estimate of the quantile function, whose asymptotic variance can be shown directly to coincide with that of the Powell estimator. To provide complete notation, define vGˆ (x) ≡ vP OW (x).

Is ignorance bliss: Fixed vs. random censoring

217

Finally, suppose we actually know G; that is, we have additional information. ˆ by G in (5) and invert. Here Then we can replace G vG (x) =

1 − (1 − F (x))(1 − G(x)) . F (x)(1 − G(x))

For estimation of the median, ξ = Q(1/2) = F −1 (1/2), vKM

=

vP OW = vGˆ vG

= =

0

ξ

dF (w) (1 − F (w))2 (1 − G(w))

(1 − G(ξ))−1 (1 + G(ξ))/(1 − G(ξ))

For τ = 1/2, it is immediate that: vKM (ξτ ) ≤ vP ow (ξτ ) = vGˆ (ξτ ) ≤ vG (ξτ ). These inequalities hold for all τ : vGˆ (ξτ ) ≤ vG (ξτ ) since (1 − G(x)) ≤ 1 in the numerator of vG (x). To show vKM (ξτ ) ≤ vGˆ (ξτ ), note that (1 − G(x)) ≥ (1 − G(w)) in the denominator of the integral in (2), and the integral can be computed directly to provide a cancellation of the initial factors. To provide some specific calculations where the integral in vKM can be computed, let 1 − F (x) = e−x and let G have density g(x) = c e−c(x−a) for x ≥ a. Figure 1 shows efficiencies for median estimators with respect to the asymptotic variance of the Kaplan–Meier estimator for a = 1.8 as a function of c. The unobservable estimate, med{Y }, is also plotted for comparison. Note that it is only slightly more efficient than Kaplan–Meier.

0.6

eff

0.8

1.0

Efficiency: F = pexp, G = pexp(1.8,c)

0.4

Pow=Ghat true−G med(Y)

2

4

0

2

4

6

8

6

8

0.3 0.1

cen prob

0.5

0

c

Fig 1. One Sample Efficiencies for exponential distributions.

Several remarks can be offered.

218

S. Portnoy

• Newey and Powell [5] establish asymptotic optimality of the Powell estimator among all “regular” estimators. Unfortunately, their regularity conditions preclude estimators like the Kaplan–Meier estimator that does not admit an asymptotic expansion whose second-order term is independent of the firstorder term. Thus, the fact that the Powell estimator performs more poorly than the Kaplan–Meier estimator does not contradict their result. • By convex optimization (specifically, the Generalized Method of Moment Spaces – see Collins and Portnoy [1]), it is possible to find the range of values for the efficiency of the Powell estimator for any given amount of censoring. Specifically, if p is the probability of censoring, then the efficiency of the Powell estimator is greater than (1 − 2p), and this efficiency can be attained. This bound is plotted in the lower panel of Figure 1. • When a is nearly log(2) (the median for the negative exponential distribution), the probability of censoring is nearly .5, and serious computational difficulties can occur in samples of size 50 or bigger for Powell and the “synthetic” estimators. For the Powell estimator, the problem seems to be the multimodality of the objective function, an issue that will be discussed for regression quantiles later. For synthetic estimators, the ratio of survival functions is not monotonic, which may lead to computational problems in the inversion. 2. Regression comparisons Since the Powell estimator was intended for the case of linear regression quantiles, the results of the previous section may not seem unduly surprising. Nonetheless, we show here that the message is remarkably similar in the regression case. Unfortunately, current asymptotic theory for quantile regression estimator that require only conditional independence of the duration and censoring variables do not admit tractable formulas for asymptotic variances (see Portnoy and Lin [9], and Peng– Huang [6]). Thus we will restrict to simulation comparisons. Since the methods of Portnoy [7] and Peng–Huang [6] appear to be quite similar, we will also focus on comparisons between the Powell method and the CRQ (“censored regression quantiles”) method of Portnoy [7]. It is important to note that CRQ requires all estimable regression quantiles to be linear (in the parameters). The Powell estimator does not impose this requirement, positing a linear model only for the quantile of interest. Thus will we consider cases where the conditional quantiles are not linear, expecting that the Powell estimator should do better in such cases. Thus, it is surprising that even for moderately large samples (n = 400), the CRQ method still outperforms the Powell estimator, often quite substantially. This appears to be due to computational difficulties associated with fact that the Powell method is fitting a nonlinear response function (max{xi β, ci }), and so its objective function turns out to be multimodal. Because of the computational difficulties involved in minimizing the Powell objective function, we will restrict to cases with a single explanatory variable, for which the Powell estimator can be computed by exhaustive search. Specifically, the Powell estimator can be taken to be an “elemental” estimator; that is, one interpolating exactly p observations (at least when observations are in general position). 
This holds for the same reason as for ordinary regression quantiles: if an optimal solution is not elemental, the linear parameters can be changed without increasing the objective function until p observations are interpolated. Thus, for simple linear

Is ignorance bliss: Fixed vs. random censoring

219

regression, we will employ an algorithm that exhaustively examines all “n-choose-2” elemental solutions and finds the one minimizing the Powell objective function n

(6)

ρτ (min{Yi∗ , ci } − min{xi β, ci }).

i=1

While there are approximate algorithms that are much faster (especially in larger problems), these methods depend strongly on a “starting value”, and will have rather different distributional properties (depending on the starting value). The Powell estimator will be compared with results from the R-function crq using the default “grid” algorithm of Portnoy [7] as implemented in the quantreg R-package (Koenker [2]). To be specific, we consider the following design for a simulation experiment with three models (two of which are heteroscedastic and nonlinear), two error distributions, and three choices for sample size (n = 50, 100, 200). In each case, we take 1000 replications with the pairs (xi , Yi ) i.i.d., and take constant censoring with c = 10. For all cases, we resample xi ∼ U nif (0, 4) in each replication. The Models are: Linear: Y = 5 + 2x + ε Nonlinear: Y = 2.5x + max(x, 2) ε Heavy Nonlin: Y = x + 4 max(x, 2) ε

−20

−10

0

y

10

20

30

The error distributions are either ε ∼ N (0, 1) or ε has a location shift of a negative exponential distribution with density f (x) = exp{−(x + a)} where a ≈ −.69 is chosen to provide med(ε) = 0.

0

1

2

3

4

x

Fig 2. Deciles for heavy nonlinear model (Normal errors).

Note that only the first model has all linear regression quantiles; and thus CRQ would be consistent only for this model. The conditional median is linear in all

220

S. Portnoy

eff

0.6

0.7

0.8

0.9

three models; and so the Powell estimator should be consistent in all cases. A plot of the conditional quantiles for the case of “heavy” non-linearity is given in Figure 2. Figures 3 and 4 provide the results of the simulations expressed as the ratios of the median absolute errors for the CRQ estimator over the Powell estimator. Ratios of mean squared errors showed much less efficiency for the Powell estimator.

0.5

iid light nl heavy nl normal neg ext 50

100

150

200

n

Fig 3. Efficiency for intercept: MAE(CRQ)/MAE(Powell).

The following conclusions seem quite clear from the plots: • The one-sample story appears to hold for regression. • Heavy nonlinearity hurts Powell (computational problems) more than CRQ (bias) for n ≤ 200. For larger n, the bias may become more serious. Even if we believe only the median is linear, CRQ seems to be better for moderate nonlinearity and sample size. One possible reason that CRQ seems so good concerns the fact that the CRQ estimator weights each censored observation depending on the quantile crossing the observation. Since each weight applies only to a small number of observations, the accuracy in estimating the weights may not be very crucial. Also, most censoring occurs near the median, where nonlinearity is smaller. Some further complementary simulation experiments were run. One used the approximate algorithm for the Powell estimator given in the “quantreg” R-Package (see Koenker [2]). This algorithm is based on work of Fitzenberger and attempts to find a local minimum of the Powell objective function (with the starting value defaulting to the naive regression quantile estimator that ignores censoring). This does correct the worst problems with the Powell estimator in the case of heavy censoring; and in fact this version of the Powell estimator slightly outperforms CRQ for estimating the slope parameter when n = 50. In all other cases, even this version is less efficient than CRQ with efficiencies varying from .6 to .95 over the range of cases in the simulation experiment above. Since this algorithm can

221

eff

0.5

0.6

0.7

0.8

0.9

Is ignorance bliss: Fixed vs. random censoring

0.3

0.4

iid light nl heavy nl normal neg ext 50

100

150

200

n

Fig 4. Efficiency for slope: MAE(CRQ)/MAE(Powell).

differ from the formal Powell estimator, it is not clear what asymptotic properties it has. Nonetheless, the simulations suggest that even this version is not preferable to CRQ. Finally, a simulation experiment was run with an alternative estimator suggested by work in Lindgren [4]. This method is based on binning the data (by x-values) into M bins, applying Kaplan–Meier to the data in each bin, and fitting the resulting quantiles by linear least squares. Here we choose M = 8 bins equally spaced for x ∈ (0, 4). Such proposals appear regularly in the literature, but binning difficulties (the curse of dimensionality) seriously degrade such methods beyond the case of simple linear regression. In any event, this estimator performed only slightly better than the Powell estimator, and clearly suffered in comparison with CRQ. 3. Inconsistency of the Powell estimator As noted above, if only the quantile of interest is linear, the Powell estimator can remain consistent while CRQ is inconsistent. However, the conditions for consistency for these estimators differ in nontrivial ways. The author has obtained several examples where the Powell estimator is inconsistent while CRQ remains consistent (Portnoy [8]). The basic idea is that the use of a nonlinear fit in the Powell estimator permits breakdown in cases where standard regression quantile methods (RQ and CRQ) maintain breakdown robustness. In fact, CRQ can be consistent even though some lower conditional quantiles are nonlinear: specifically, when the lower quantiles are below all censored observations. The examples do appear to violate conditions for known consistency results, and so do not suggest any error in the proof of consistency for the Powell estimator. They do emphasize that the nonlinear nature of the Powell objective function does impose additional regularity conditions. Though the examples of inconsistency are somewhat pathological, they do sug-

222

S. Portnoy

gest cases where fitting a nonlinear response function (viz., the Powell estimator) leads to a (very) incorrect estimate of the true regression line. In fact, the following finite sample simulated example shows that Powell’s estimator may be extremely poor even though the data do not appear unreasonable and the CRQ estimates appear quite reasonable. Specifically, we consider an example where x ∼ Unif(0, 4) and Y0 ∼ 5 + x + 4 max{x, 2} N (0, 1). Here censoring is at the constant value, c = 10, and so we observe Y = min(Y0 , 10). The specific data may be generated in R as follows:

0 −5

y

5

10

# generate powell-crq examples set.seed(23894291) for(i in 1:92) { x 2x + 2 log(pT ) ≤ 2pT exp −(x + log(pT )) 1≤j≤p 1≤t≤T

= 2 exp [−x] . Furthermore, by the inequality of Wallace [9], for all a > 0,  ,   + T0 2 a − log(1 + a) . IP χj ≥ T (1 + a) ≤ exp − 2

242

S. van de Geer

We now use that

a2 . 2(1 + a)  2 , + a T0 2 . IP(χj ≥ T0 (1 + a)) ≤ exp − 4 1+a a − log(1 + a) ≥

This gives

"

Insert a= Then

4x 4x + . T0 T0

a2 4x , ≥ 1+a T0 "    4x 4x IP χ2j ≥ T0 1 + ≤ exp[−x]. + T0 T0

so

Finally, apply the union bound to arrive at   IP max χ2j /T0 ≥ ξ02 ≤ exp[−x]. 1≤j≤p

) * Proof of Lemma 4.1. ˆ − β T Σβ| ≤ Σ ˆ − Σ∞ β21 , |β T Σβ and βj 1 ≤ Hence β1 =

p

 T0 βj 2 + R(T0 )W βj 2 ,

. p

 √ βj 1 ≤ T0 βj 2 + T0 / nW βj 2 ,

j=1

j=1

 √ where we use R(T0 ) ≤ T0 / n. Finally, invoke T0 /n ≤ λ and T0 /n ≤ λμ. Proof of Lemma 5.1. Let β be some vector in R(S). Then pen(βS ) = pen1 (βS ) + pen2 (βS ) ≤ 4pen1 (βS ), and pen(β) = pen1 (βS ) + pen1 (βS c ) + pen(β) ≤ 4pen1 (βS ). Define ˆ − Σ∞ |S|/φ2 (S). η 2 := nλ2 Σ Then, since φ(S) ≤ 1, and |S| ≥ 1, λ2 βj 22 = λ2 βj 2Σˆ ≤ λ2 βj 2Σj + η 2 (λβj 2 + λμW βj 2 )2 . j

It follows that pen1 (βS ) = λ

p

j=1

βj 2 ≤ λ

j∈S

βj Σj + ηpen(βS )

) *

Lasso with group structure

243

 |S|λβΣ /φ(S) + 4ηpen1 (βS ) )  *  λβ2Σˆ + φ(S)ηpen(β)/ |S| + 4ηpen1 (βS ) ≤ |S| φ(S)  ≤ λ |S|βΣˆ + 8ηpen1 (βS ). ≤

The assumption 8η ≤ gives

1 2

 pen1 (βS ) ≤ 2λ |S|βΣˆ /φ(S). ) *

Proof of Theorem 6.1. Throughout, we assume we are on T . We have for all β, ˆ ≤ 2T X(βˆ − β)/n + pen(β) + β − β 0 2ˆ βˆ − β 0 2Σˆ + pen(β) Σ 1 pen(βˆ − β) + pen(β) + β − β 0 2Σˆ . 4 It follows that for all S and for β = βS , ≤

3 3 βˆ − β 0 2Σˆ + pen1 (βˆS c ) + pen2 (βˆ − βS ) 4 4 5 pen1 (βˆS − βS ) + 2pen2 (βS ) + βS − β 0 2Σˆ . 4

≤ Case i) If

pen1 (βˆS − βS ) ≥ βˆ − β 0 2Σˆ + 2pen2 (βS ),

we get (8.1)

4βˆ − β 0 2Σˆ + 3pen1 (βˆS c ) + 3pen2 (βˆ − βS ) ≤ 9pen1 (βˆS − βS ).

So we then have βˆ − βS ∈ R(S). We therefore can apply Lemma 5.1, to find that when S ∈ S(Σ), from (8.1), 4βˆ − β 0 2Σˆ + 3pen(βˆ − βS ) ≤ 12pen1 (βˆS − βS ) 

≤ 24λ

16λ2 |S| |S|βˆ − βS Σˆ /φ(S) ≤ 3βˆ − βS 2Σˆ + 2 . φ (S)

Hence βˆ − β 0 2Σˆ + 3pen(βˆ − βS ) ≤ so also

16λ2 |S| , φ2 (S)

16λ2 |S| . βˆ − β 0 2Σˆ + pen(βˆ − βS ) ≤ 2 φ (S)

244

S. van de Geer

Case ii) If

pen1 (βˆS − βS ) < β − β 0 2Σˆ + 2pen2 (βS ),

we obtain 4βˆ − β 0 2Σˆ + 3pen1 (βˆS c ) + 3pen2 (βˆ − βS ) ≤ 9βS − β 0 2Σˆ + 18pen2 (βS ), and hence 4βˆ − β 0 2Σˆ + 3pen(βˆ − βS ) ≤ 12β − β 0 2Σˆ + 24pen2 (βS ). This gives βˆ − β 0 2Σˆ + pen(βˆ − βS ) ≤ 4β − β 0 2Σˆ + 8pen2 (βS ). ) *

References [1] Bickel, J., Ritov, Y., and Tsybakov, A. (2009). Simultaneous analysis of Lasso and Dantzig selector. Annals of Statistics 37 1705–1732. [2] Koltchinskii, V. (2009). Sparsity in penalized empirical risk minimization. Annales de l’Institut Henri Poincar´e, Probabilit´es et Statistiques 45 7–57. [3] Koltchinskii, V. and Yuan, M. (2008). Sparse recovery in large ensembles of kernel machines. In Conference on Learning Theory, COLT 29–238. ¨ hlmann, P. (2009). High-dimensional [4] Meier, L., van de Geer, S., and Bu additive modeling. Annals of Statistics 37 3779–3821. [5] Ravikumar, P., Liu, H., Lafferty, J., and Wasserman, L. (2008). SpAM: sparse additive models. Advances in neural information processing systems 20 1201–1208. [6] Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. J. Roy. Statist. Soc. Ser. B 58 1 267–288. [7] van de Geer, S. (2007). The deterministic Lasso. In JSM proceedings, (see also http://stat.ethz.ch/research/research reports)/2007/140. Amer. Statist. Assoc. ¨ hlmann, P. (2009). On the conditions used to [8] van de Geer, S. and Bu prove oracle results for the Lasso. Electronic Journal of Statistics 1360–1392. [9] Wallace, D. L. (1959). Bounds for normal approximations of student’s t and the chi-square distributions. Ann. Math. Statist. 30 1121–1130. [10] Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal Royal Statistical Society Series B 68 1 49.

IMS Collections Nonparametrics and Robustness in Modern Statistical Inference and Time Series Analysis: A Festschrift in honor of Professor Jana Jureˇ ckov´ a Vol. 7 (2010) 245–253 c Institute of Mathematical Statistics, 2010  DOI: 10.1214/10-IMSCOLL724

Nonparametric estimation of residual quantiles in a conditional Koziol–Green model with dependent censoring No¨ el Veraverbeke Hasselt University Abstract: This paper discusses nonparametric estimation of quantiles of the residual lifetime distribution. The underlying model is a generalized Koziol– Green model for censored data, which accomodates both dependent censoring and covariate information.

1. Introduction Consider a fixed design regression model where for each design point (covariate) x ∈ [0, 1] there is a nonnegative response variable Yx , called lifetime or failure time. As in the case in many clinical or industrial trials, Yx is subject to random right censoring by a nonnegative censoring variable Cx . The observed random variables at the design point x are Zx = min(Yx , Cx ) and δx = I(Yx ≤ Cx ). Let us denote by Fx , Gx and Hx the distribution functions of Yx , Cx and Zx respectively. The main goal is to estimate the distribution function Fx (t) = P (Yx ≤ t) (and functionals of it) from independent data (Z1 , δ1 ), . . . , (Zn , δn ) at fixed design points 0 ≤ x1 ≤ . . . ≤ xn ≤ 1. Here Zi = min(Yi , Ci ) and δi = I(Yi ≤ Ci ). Note that at the design points xi we write Yi , Ci , Zi , δi instead of Yxi , Cxi , Zxi , δxi . The classical assumption of independence between Yx and Cx leads to the well known product-limit estimator of Beran [1]), which is the extension of the estimator of Kaplan and Meier [10] to the covariate case. However the assumption of independence between lifetime and censoring time is not always satisfied in practice and we should rather work with a more general assumption about the association between Yx and Cx . As in Zheng and Klein [18], Rivest and Wells [15] and Braekers and Veraverbeke [16] we will work with an Archimedean copula model for Yx and Cx . See Nelsen [12] for information on copulas. It means that, for each x ∈ [0, 1] we assume (1.1)

¯ ¯ P (Yx > t1 , Cx > t2 ) = ϕ−1 x (ϕx (Fx (t1 )) + ϕx (Gx (t2 )))

for all t1 , t2 , where ϕx is a known generator function depending on x in a general ¯ x = 1 − Gx . We recall that for each x, ϕx : [0, 1] → [0, +∞] way, and F¯x = 1 − Fx , G Hasselt University, Belgium. e-mail: [email protected] AMS 2000 subject classifications: Primary 62N01; secondary 62N02, 62G08. Keywords and phrases: Archimedean copula, asymptotic normality, dependent censoring, fixed design, Koziol–Green model, quantiles, residual lifetime. 245

246

N. Veraverbeke

is a continuous, convex, strictly decreasing function with ϕx (1) = 0. In the random right censorship model there is an extensive literature on an important submodel initiated by Koziol and Green [11]. It is a submodel obtained by imposing an extra assumption on the distribution functions Fx and Gx . In this way it is a type of informative censoring. In the case of independence between Yx and Cx , the Koziol–Green assumption is (1.2)

¯ x (t) = (F¯x (t))βx G

for all t ≥ 0, where βx > 0 is some constant depending in a general way on x. This extra assumption leads to an estimator for the survival function that is more efficient than the Kaplan-Meier estimator. See Cheng and Lin [4] in the case without covariates and Veraverbeke and Cadarso Suarez [17] in the regression case. In order to generalize (1.2) to the dependent censoring case, we recall that for continuous Fx , (1.2) is equivalent to (1.3)

Zx and δx are independent.

Translating property (1.3) into the model (1.1) gives that it is equivalent to the assumption (1.4)

¯ x (t)) = βx ϕx (F¯x (t)) ϕx (G

for all t ≥ 0 and for some βx > 0. Let us consider condition (1.4) for some examples of Archimedean copula models. For the independence case (ϕx (t) = − log t), (1.4) coincides with (1.2). For ¯ x (t) = the Gumbel copula (ϕx (t) = (− log t)α , α ≥ 1), condition (1.4) becomes G 1/α βx −α ¯ (Fx (t)) . For the Clayton copula (ϕx (t) = t − 1, α > 0), (1.4) becomes ¯ x (t) = (1 + βx (Fx (t)−α − 1))−1/α . This becomes (1.2) as α → 0. G In this paper we focus on nonparametric estimation of the median (or any other quantile) of the conditional residual lifetime in the above model. The conditional residual lifetime distribution is defined as Fx (y | t) = P (Yx − t ≤ y | Yx > t), i. e. the distribution of the residual lifetime, conditional on survival upon a given time t and at a given value of the covariate x. For any distribution function F , we denote by TF the right endpoint of the support of F . Then, for 0 < y < TFx , we have that Fx (y | t) =

Fx (t + y) − Fx (t) . 1 − Fx (t)

We define, for 0 < p < 1, the p-th quantile of Fx (y | t): (1.5)

Qx (t) = Fx−1 (p | t)

= inf{y | Fx (y | t) ≥ p} = −t + Fx−1 (p + (1 − p)Fx (t))

where for any 0 < q < 1 we write Fx−1 (q) = inf{y | Fx (y) ≥ q} for the q-th quantile of Fx . The paper is organized as follows. In Section 2 we discuss estimation of Fx and Fx−1 . We deal with residual quantiles in Sections 3 and 4. Some concluding remarks are in Section 5.

Residual quantiles

247

2. Estimation of the conditional distribution function and quantile function Estimation of Qx (t) on the basis of observations (Zi , δi ), i = 1, . . . , n, will be done −1 by replacing Fx and Fx−1 in (1.5) by corresponding empirical versions Fxh and Fxh where Fxh is the estimator studied in Braekers and Veraverbeke [2] and Gaddah and Braekers [8]. The derivation of this estimator goes as follows. From (1.1) we ¯ x (t)) = ϕx (F¯x (t)) + ϕx (G ¯ x (t)). Combining this with assumption have that ϕx (H ¯ x (t)) = (1 + βx )ϕx (G ¯ x (t)), or with γx = 1 = P (δx = 1): (1.4) gives ϕx (H 1+βx (2.1)

¯ F¯x (t) = ϕ−1 x (γx (ϕx (Hx (t))).

In order to estimate F¯x (t) at some fixed x ∈]0, 1[, we will use the idea that observations (Zi , δi ) with xi close to x give the largest contribution to the estimator. Therefore we will smooth in the neighborhood of x by using Gasser-M¨ uller type weights defined by xi

1 wni (x; hn ) = cn (x; hn )

1 K hn



x−z hn

 dz

(i = 1, . . . , n)

xi−1

 x dz, x0 = 0, K is a known probability density where cn (x; hn ) = 0 n h1n K x−z hn function and h = {hn } is a positive bandwidth sequence, tending to 0 as n → ∞. The estimator Fxh (t) of Fx (t) is now obtained by replacing γx and Hx (t) in (2.1) by the following empirical versions 

γxh =

n i=1

Hxh (t) =

wni (x; hn )δi n

wni (x; hn )I(Zi ≤ t).

i=1

Hence the estimator is given by (2.2)

¯ F¯xh (t) = ϕ−1 x (γxh ϕx (Hxh (t))).

To formulate some results on this estimator we need to introduce some further notations and some regularity conditions. First some notations: for the design points x1 , . . . , xn we write Δn = min1≤i≤n (xi − ¯ n = max1≤i≤n (xi − xi−1 ) and for the kernel K we write K2 = xi−1 ) and Δ 2 ∞ ∞ ∞ 2 K 2 K (u) du, μK 1 = −∞ u K(u) du, μ2 = −∞ u K(u) du. −∞ On the design and on the kernel, we will assume the following regularity conditions: ¯ n = O(n−1 ), Δ ¯ n − Δ = o(n−1 ) (C1) xn → 1, Δ n (C2) K is a probability density function with finite support [−M, M ] for some M > 0, μK 1 = 0, and K is Lipschitz of order 1. The results also require typical smoothness conditions on the elements of the model. For a fixed 0 < T < TFx , ∂ ∂2 (C3) F˙x (t) = ∂x Fx (t), F¨x (t) = ∂x 2 Fx (t) exist and are continuous in (x, t) ∈ [0, 1] × [0, T ] ∂ ∂2 (C4) β˙ x = ∂x βx , β¨x = ∂x 2 βx exist and are continuous in x ∈ [0, 1].

248

N. Veraverbeke

The generator ϕx of the Archimedean copula has to satisfy 2

∂ ∂ ϕx (v), ϕx (v) = ∂v (C5) ϕx (v) = ∂v 2 ϕx (v) are Lipschitz continuous in the x∂3  direction, ϕx (v) = ∂v3 ϕx (v) ≤ 0 exists and is continuous in (x, v) ∈ [0, 1]×]0, 1].

Below we will use asymptotic representations for the estimator Fxh and the corre−1 sponding quantile estimator Fxh . The representation for Fxh in Lemma 1 is taken −1 from Theorem 2 in Braekers and Veraverbeke [3]. The representation for Fxh (pn ) in Lemma 2 is formulated for random pn , tending to a fixed p as n → ∞ at a certain rate. The proof of Lemma 2 is not given since it parallels that of a similar result in Gijbels and Veraverbeke ([9], Theorem 2.1). Lemma 1. nh5n log n

Assume (C1) – (C5) in [0, T ] with T < TFx , hn → 0,

log n nhn

→ 0,

= O(1). Then, for t < TFx , Fxh (t) = Fx (t) +

n

wni (x; hn )gx (Zi , δi , t) + rn (x, t)

i=1

where gx (Zi , δi , t)

=

¯ x (t)) −ϕx (H {I(δi = 1) − δx }  ¯ ϕx (Fx (t))

+

γx

¯ x (t)) ϕx (H {I(Zi ≤ t) − Hx (t)} ϕx (F¯x (t))

and, as n → ∞, sup |rn (x, t)| = O((nhn )−1 log n) a.s. 0≤t≤T

Lemma 2. nh5n log n

Assume (C1) – (C5) in [0, T ] with T < TFx , hn → 0,

= O(1). Assume that

Fx−1 (p)

< T and that

fx (Fx−1 (p))

log n nhn

= o(1),

> 0, where fx = Fx .

If {pn } is a sequence of random variables (0 < pn < 1) with pn −p = OP ((nhn )−1/2 ), then as n → ∞, −1 Fxh (pn ) = Fx−1 (p) +

1

(pn fx (Fx−1 (p))

− Fxh (Fx−1 (p))) + oP ((nhn )−1/2 ).

3. Estimation of quantiles of the conditional residual lifetime From (1.5) it follows that the obvious estimator for Qx (t) is given by (3.1)

−1 (p + (1 − p)Fxh (t)) Qxh (t) = −t + Fxh

where Fxh is the estimator in (2.2). Denote qx = p + (1 − p)Fx (t) and qxh = p + (1 − p)Fxh (t). We have the following asymptotic normality result.

Residual quantiles

249

Theorem 1. Assume (C1) – (C5) in [0, T ] with T < TFx . Assume that Fx−1 (qx ) < T and that fx (Fx−1 (qx )) > 0. (a) If nh5n → 0 and (log n)2 /(nhn ) → 0: (nhn )1/2 (Qxh (t) − Qx (t)) → N (0; σx2 (t)) (b) If hn = Cn−1/5 for some C > 0: (nhn )1/2 (Qxh (t) − Qx (t)) → N (βx (t); σx2 (t)). Here σx2 (t) =

+

+ ¯ x (t)) ϕx (H ¯ x (Fx−1 (qx )) ,2 ϕx (H 1 − γx K22 − (1 − p) γx ϕx (F¯x (t)) fx2 (Fx−1 (qx )) ϕx (F¯x (Fx−1 (qx ))   ¯ x (t)) ϕ 2 (H 2 Hx (t)(1 − Hx (t)) γx (1 − p)2 x 2 ¯ ϕx (Fx (t))  ¯ x (Fx−1 (qx )) ϕx2 (H Hx (Fx−1 (qx ))(1 − Hx (Fx−1 (qx ))) + 2 ϕx (F¯x (Fx−1 (qx ))

,. ¯ x (t)) ϕx (H ¯ x (Fx−1 (qx )) ϕx (H −1 Hx (t)(1 − Hx (Fx (qx )) −2(1 − p)  ¯ ϕx (Fx (t)) ϕx (F¯x (Fx−1 (qx ))

βx (t) = (1 − p)bx (t) + bx (Fx−1 (qx )) with . ¯ x (t)) ¯ x (t)) 1 5/2 K −ϕx (H γx ϕx (H ¨ Hx (t) . γ¨x + bx (t) = C μ2 (3.2) 2 ϕx (F¯x (t)) ϕx (F¯x (t)) Proof.

Using Lemma 2 first and then Lemma 1, we have that

Qxh (t) − Qx (t) = =

1 [q fx (Fx−1 (qx )) xh

=

1 fx (Fx−1 (qx ))

n

1 (q −1 fx (Fxh (qx )) xh

− Fxh (Fx−1 (qx ))) + oP ((nhn )−1/2 )

− qx − (Fxh (Fx−1 (qx )) − Fx (Fx−1 (qx )))] + oP ((nhn )−1/2 )

wni (x; hn )[(1 − p)gx (Zi , δi , t) − gx (Zi , δi , Fx−1 (qx ))]

i=1

+oP ((nhn )−1/2 ). From this asymptotic representation it is now standard to derive the asymptotic normality results. It also uses the expressions for covariance and bias functions as in Gaddah and Braekers [8]. Note. In the case of independent censoring we have that ϕx (t) = − log t and the expression for the asymptotic variance simplifies to 

K 22 2 1−γx 2 ¯2 −1 2 γx (1 − p) ln (1 − p)Fx (t) fx (Fx (qx ))  . 1 Hx (Fx−1 (qx )) 2 2 ¯ 2− γx (t) (1−p)1/γx − Hx (t) + γx (1 − p) Fx

250

N. Veraverbeke

If there are no covariates this leads to a (corrected) formula in Cs¨org˝o [6]. And if there is no censoring (γx = 1), we also recognize the formula of Cs¨org˝o and Cs¨org˝o [7]: p(1 − p)F¯ (t) . f 2 (p + (1 − p)F¯ (t)) 4. Estimation of quantiles of the duration of old age In many situations it is necessary to replace the t in Qx (t) by some estimator tˆ. The variable t is then considered as an unknown parameter, usually the starting point of “old age”. For example, t could be defined through the proportion of retired people in the population under study, that is t = Fx−1 (p0 ) for some known p0 . The −1 unknown t could then be estimated by tˆ = Fxh (p0 ). Let tˆ be some general estimator for t and consider the estimator (3.1) with t replaced by tˆ: −1 Qxh (tˆ) = −tˆ + Fxh (p + (1 − p)Fxh (tˆ)).

The next theorem gives an asymptotic representation for Qxh (tˆ)−Qx (t). It requires a stronger form of condition (C3): ∂2 ˙ (C3’) F˙ x (t), F¨x (t), F ”x (t) = ∂t 2 Fx (t), Fx (t) = in (x, t) ∈ [0, 1] × [0, T ].

∂2 ∂x∂t Fx (t)

exist and are continuous

Theorem 2. Assume (C1) (C2) (C3’) (C4) (C5) in [0, T ] with T < TFx , Fx−1 (qx ) < T , nh5 fx (Fx−1 (qx )) > 0. Assume hn → 0, (log n)2 /(nhn ) → 0, log nn = O(1). Also assume that tˆ − t = OP ((nhn )−1/2 ). Then, as n → ∞, fx (t) −1 x (Fx (qx ))

Qxh (tˆ) − Qx (t) = (−1 + (1 − p) f +f

1

−1 x (Fx (qx ))

n

)(tˆ − t)

wni (x; hn ){(1 − p)gx (Zi , δi , t) − gx (Zi , δi , Fx−1 (qx ))}

i=1

+oP ((nhn )−1/2 ). Proof. Denote qˆxh = p + (1 − p)Fxh (tˆ). Then qˆxh − qx = (1 − p)(Fxh (tˆ) − Fx (t)] −1 and Qxh (tˆ) − Qx (t) = −(tˆ − t) + (Fxh (ˆ qxh ) − Fx−1 (qx )). Now write Fxh (tˆ) − Fxh (t) = {[Fxh (tˆ) − Fxh (t)] − [Fx (tˆ) − Fx (t)]} (4.1) +{Fxh (t) − Fx (t)} + {Fx (tˆ) − Fx (t)}. To the first term on the right hand side we can apply a modulus of continuity result analogous to the one in Veraverbeke [16]. The proof in the present situation goes along the same lines and therefore it is not given here. It requires condition (C3’). To the second term in the right hand side of (4.1) we apply our Lemma 1 and to the third term we apply a first order Taylor expansion. This gives that qˆxh − qx = (1 − p){fx (t)(tˆ − t) +

n

wni (x; hn )gx (Zi , δi , t)} + oP ((nhn )−1/2 ).

i=1

This, together with Lemma 2, leads to the asymptotic representation for Qxh (tˆ) − Qx (t).

Residual quantiles

251

−1 Example. If t = Fx−1 (p0 ) and tˆ = Fxh (p0 ) for some known p0 , we can apply Lemma 2 to tˆ − t and from Theorem 2 we obtain that . n

gx (Zi , δi , Fx−1 (p0 )) gx (Zi , δi , Fx−1 (qx )) Qxh (tˆ) − Qx (t) = − wni (x; hn ) fx (Fx−1 (p0 )) fx (Fx−1 (qx )) i=1

+

oP ((nhn )−1/2 ).

Bias and variance of the main term can be calculated and we obtain by standard arguments the following result. −1 (p0 ), q = p + (1 − p)p0 . Assume (C1) (C2) Corollary. Let t = Fx−1 (p0 ), tˆ = Fxh (C3’) (C4) (C5) in [0, T ] with T < TFx , hn → 0, Fx−1 (q) < T , fx (Fx−1 (q)) > 0, fx (Fx−1 (p0 )) > 0.

(a) If nh5n → 0 and (log n)2 /(nhn ) → 0: d ˜x2 (t)) (nhn )1/2 (Qxh (tˆ) − Qx (t)) → N (0; σ

(b) If hn = Cn−1/5 for some C > 0: d ˜x2 (t)) (nhn )1/2 (Qxh (tˆ) − Qx (t)) → N (β˜x (t); σ

Here

σ ˜x2 (t)

= K22

1−γx 2 γx



1

+

+γx2 (1 − p0 )2− γx

β˜x (t)

(1−p) ln(1−p0 ) fx (Fx−1 (p0 ))

Hx (Fx−1 (p0 )) fx2 (Fx−1 (p0 ))



2(1−p)Hx (Fx−1 (p0 )) fx (Fx−1 (p0 ))fx (Fx−1 (q))

=

bx (Fx−1 (p0 )) fx (Fx−1 (p0 ))





1−p)(1−p0 ) ln((1−p)(1−p0 )) fx (Fx−1 (q))

+

(1−p)

2

2− 1 γx

Hx (Fx−1 (q)) fx2 (Fx−1 (q))



bx (Fx−1 (q)) , fx (Fx−1 (q))

with bx (t) as in (3.2).

5. Some concluding remarks We developed asymptotic theory for nonparametric estimation of residual quantiles of the lifetime distribution in the Koziol–Green model of right random censorship. The possible dependence between responses and censoring times is modeled by a copula. There are several remarks in order before this can be applied to real data examples. (1) The model assumes that the Archimedean copula is known and also that the generator depends on the covariate. We remark that, due to the censoring, it is not possible to estimate the generator ϕx using only the data (Zi , δi ), i = 1, . . . , n. As can be seen in Braekers and Veraverbeke [2, 3] and Gaddah and Braekers [8], a good suggestion is to choose a reasonable ϕx by looking at the graph of a dependence measure for Yx and Cx . One could for example take Kendall’s tau (τ (x)), which is related to the generator via the simple formula τ (x) = 1 + 4 (ϕx (t)/ϕx (t)) dt.

252

N. Veraverbeke

(2) The expressions for asymptotic bias and variance are explicit but require a lot of further estimation of unknown quantities. In order to avoid this, we suggest the following bootstrap procedure. For i = 1, . . . , n obtain Zi∗ from Hxi g (t) and independently, δi∗ from a Bernoulli distribution with parameter γxi g , where Hxi g (t) and γxi g are defined as in Section 2, but with a bandwidth g = {gn } that is typically asymptotically larger than h = {hn }, n ∗ ∗ i. e. gn /hn → ∞ as n → ∞. Next calculate γxhg = i=1 wni (x; hn )δi and ∗ ∗ n ∗ ∗ Hxhg (t) = i=1 wni (x; hn )I(Zi∗ ≤ t) and use F xhg (t) = ϕ−1 x (γxhg ϕx (H xhg (t)) as a bootstrap version of F xh (t). (3) Also the choice of the bandwidth is an important practical issue. For this, we propose to use the above bootstrap scheme and to minimize asymptotic mean squared error expression over a large number of bootstrap samples. (4) Alternative approaches to the copula model could be explored. For example one could assume conditional independence of Y and C, given that the (random) covariate X equals x. Residual quantiles could be defined and studied starting from Neocleous and Portnoy [13] and El Ghouch and Van Keilegom [5]. These authors developed non- and semiparametric estimators based on the nonparametric censored regression quantiles of Portnoy [14]. Acknowledgements This research was supported by the IAP Research Network P6/03 of the Belgian Science Policy and by the Research Grant MTM2008-03129 of the Spanish Ministerio de Ciencia e Innovaci´ on. References [1] Beran, R. (1981). Nonparametric regression with randomly censored survival data. Technical Report, Univ. California, Berkeley. MR [2] Braekers, R. and Veraverbeke, N. (2005). A copula-graphic estimator for the conditional survival function under dependent censoring. Canad. J. Statist. 33 429–447. [3] Braekers, R. and Veraverbeke, N. (2008). A conditional Koziol–Green model under dependent censoring. Statist. Probab. Letters 78 927–937. [4] Cheng, P. E. and Lin, G. D. (1987). Maximum likelihood estimation of a survival function under the Koziol–Green proportional hazards model. Statist. Probab. Letters. 5 75–80. [5] El Gouch, A. and Van Keilegom, I. (2009). Local linear quantile regression with dependent censored data. Statistica Sinica 19 1621–1640. ¨ rgo ˝ , S. (1987). Estimating percentile residual life under random censor[6] Cso ship. In: Contributions to Stochastics (W. Sandler, ed.) 19–27. Physica-Verlag, Heidelberg. ¨ rgo ˝ , M. and Cso ¨ rgo ˝ , S. (1987). Estimation of percentile residual life. [7] Cso Oper. Res. 35 598–606. [8] Gaddah, A. and Braekers, R. (2009). Weak convergence for the conditional distribution function in a Koziol–Green model under dependent censoring. J. Statist. Planning Inf. 139 930–943.

Residual quantiles

253

[9] Gijbels, I. and Veraverbeke, N. (1988). Weak asymptotic representations for quantiles of the product-limit estimator. J. Statist. Planning Inf. 18 151– 160. [10] Kaplan, E. L. and Meier, P. (1958). Nonparametric estimation from incomplete observations. J. Amer. Statist. Assoc. 53 457–481. [11] Koziol, J. A. and Green, S. B. (1976). A Cram´er-von Mises statistic for randomly censored data. Biometrika 63 465–474. [12] Nelsen, R. B. (2006). An Introduction to Copulas. Springer-Verlag, New York. [13] Neocleous, T. and Portnoy, S. (2009). Partially linear censored quantile regression. Lifetime Data Analysis 15 357–378. [14] Portnoy, S. (2003). Censored regression quantiles J. Amer. Statist. Assoc. 98 1001–1012. [15] Rivest, L. and Wells, M. T. (2001). A martingale approach to the copulagraphic estimator for the survival function under dependent censoring. J. Multivariate Anal. 79 138–155. [16] Veraverbeke, N. (2006). Regression quantiles under dependent censoring. Statistics 40 117–128. ´ rez, C. (2000). Estimation of the [17] Veraverbeke, N. and Cadarso Sua conditional distribution in a conditional Koziol–Green model. Test 9 97–122. [18] Zhang, M. and Klein, J. P. (1995). Estimates of marginal survival for dependent competing risks based on an assumed copula. Biometrika 82 127–138.

IMS Collections Nonparametrics and Robustness in Modern Statistical Inference and Time Series Analysis: A Festschrift in honor of Professor Jana Jureˇ ckov´ a Vol. 7 (2010) 254–267 c Institute of Mathematical Statistics, 2010  DOI: 10.1214/10-IMSCOLL725

Robust error-term-scale estimate ´ Jan Amos V´ıˇ sek1,∗ Faculty of Social Sciences, Charles University and Institute of Information Theory and Automation Abstract: A scale-equivariant and regression-invariant estimator of the variance of error terms in the linear regression model is proposed and its consistency proved. The estimator is based on (down)weighting the order statistics of the squared residuals which corresponds to the consistent and scale- and regression-equivariant estimator of the regression coefficients. A small numerical study demonstrating the behaviour of the estimator under the various types of contamination is included.

Let N denote the set of all positive integers, R the real line and Rp the p-dimensional   Euclidean space. For a sequence of (p+1)-dimensional random vectors {(Xi , ei ) }∞ i=1 , for any n ∈ N and some fix β 0 ∈ Rp the linear regression model will be considered in the form p

 0 (1) Yi = Xi β + ei = Xij βj0 + ei , i = 1, 2, . . . , n or Y = Xβ 0 + e. j=1

To put the introduction which follows in the proper context let us assume:   ∞ Conditions C1 The sequence (Xi , ei ) i=1 is sequence of independent and identically distributed (p + 1)-dimensional random variables, distributed according to distribution functions (d.f.) FX,e (x, r) = FX (x) · Fe (r) where Fe (r) = F (rσ −1 ). Moreover, F (r) is absolutely continuous with density f (r) bounded by U and 2 IEFe e1 = 0, varFe (e1 ) = σ 2 . Finally, IEFX X1  < ∞. Remark 1 The assumption that the (parent) d.f. F (r) is continuous is not only technical assumption. Possibility that the error terms in regression model are discrete r.v.’s implies problems with treating response variable and it requires special considerations, similar to those which we carry out when studying binary or limited response variable, see e. g. in Judge et al. [16]. Absolute continuity is then a technical assumption. Without the density, even bounded density, we have to assume that F (r) is Lipschitz and it would bring a more complicated form of all what follows. A general goal of regression analysis is to fit a model (1) to the data. The analysis usually starts with estimating the regression coefficients βj ’s, continues by the estimation of the variance σ 2 of the error terms ei ’s (sometimes both steps run ∗ Research

ˇ number 402/09/0557. was supported by grant of GA CR Macroeconomics and Econometrics, Inst. of Economic Studies, Fac. of Social Sciences, Charles University, Opletalova ulice 26, 110 01 Praha 1 and Dept. Econometrics, Institute of Information Theory and Automation, Academy of Sciences of the Czech Republic. e-mail: [email protected] AMS 2000 subject classifications: Primary 62J02; secondary 62F35 Keywords and phrases: robustness, weighting the order statistics of squared residuals, consistency of the scale estimator. 1 Dept.

254

Robust error-term-scale estimate

255

simultaneously, Marazzi [22]), then it includes a validation of the assumptions, etc. The present paper is devoted to the (robust) estimation√of σ 2 . In the classical LSˆ 2 ) for studentization of analysis we need the estimate of σ (usually assumed as σ the estimates of regression coefficients in order to establish the significance of the explanatory variables. In the robust analysis we employ it at first for studentizing the residuals, in the case when the properties of our estimate depends on the absolute magnitude of residuals, e. g. as in the case of M -estimators. So the estimation of the variance of error terms (in the case of the homoscedasticity of error terms) is one of standard (and important) steps of regression analysis. But it need not be a very simple task. As early as in 1975 Peter Bickel [3] showed that to achieve the scale- and regressionequivariance of the M -estimates of regression coefficients the studentization of residuals has to be performed by a scale-equivariant and regression-invariant estimate of the scale of error terms. A proposal of such an estimator by Jana Jureˇckov´a and Pranab Kumar Sen [18] is based on regression scores. The idea is derived from the regression quantiles of Roger Koenker and Gilbert Bassett [21] and the evaluation utilizes standard methods of the stochastic linear programming, see Jureˇckov´a, Picek [17]. As the regression quantiles are based on L1 metric (they are in fact M estimators of the quantiles of d. f. of error terms, provided we know β 0 ), they can cope with outliers but can be significantly influenced by the presence of leverage points in the data, see Maronna, Yohai [23]. We propose an alternative estimator of σ 2 based on L2 -metric. In fact, our proposal generalizes an LT S-based scale estimator studied by Croux and Rousseeuw [8]. Of course, by a decision how many order statistics of the squared residuals will be taken into account one can adapt the estimator to the contamination level. We shall return to this problem at the end of paper in Conclusions. Croux–Rousseeuw estimator was also tested on the economic data by Bramanti and Croux [6]. Later, there appeared the paper by Pison et al. [24] proposing a correction of the estimator for small samples. Our estimator can be also accommodated to the level and to the character of contamination by selecting an appropriate estimator of regression coefficients (we shall discuss the topic at the end of this section). Similarly as in the classical regression analysis, the evaluation of the estimator proposed here represents the step which follows the estimation of regression coefficients. We assume that the respective estimator of regression coefficients is scale- and regression-equivariant and consistent. Nowadays the robust statistics offer a whole range of such estimators. Let us recall e. g. the least median of squares (LM S) (Rousseeuw [26]), the least trimmed squares (LT S) (Hampel et al. [11]), the least weighted squares (LW S) (V´ıˇsek [40]) or the instrumental weighted variables (IW V ) (V´ıˇsek [41]), to give some among many others (instrumental weighted variables is the robustified version of classical instrumental variables which became in the past (say) three decades the main estimating method in econometrics, being able to cope with the broken orthogonality condition, see Judge et al. [16], Stock, Trebbi [30] or Wooldridge [44]). There are nowadays also quick and reliable algorithms for evaluation of the estimates. 
The research for such algorithms started at very early days of robust statistics (Rousseeuw, Leroy [29]) and it brought a lot of results, see e. g. Marazzi [22]). The research significantly intensified when Thomas Hettmansperger and Simon Sheather [14] discovered a high sensitivity of LM S with respect to a small shift of data (one datum among 80 was changed less than 10% but the estimates

´ V´ıˇsek J. A.

256

changed surprisingly about hundreds – or for some coefficients, even thousands – percents). Fortunately, there appeared a new algorithm by Boˇcek, Lachout [5], based on a modification of the simplex method, which showed that the results by Hettmansperger and Sheather were achieved due to a wrong algorithm they used, see V´ıˇsek [35]. The algorithm by Boˇcek and Lachout is (to the knowledge of present author) still superior in the sense of the minimization of corresponding order statistic. Later also an algorithm returning a tight approximation to LT S was proposed ˇ ıˇzek, V´ıˇsek (V´ıˇsek [34, 35]) and included into XPLORE, see H˝ardle et al. [12] or C´ [9]). Several variants of this algorithm was studied for various situations and improved especially for utilization for very large data sets, e. g. Agull´o [1], Hawkins [13], Rousseeuw, Driessen [27, 28] and also by Hofmann et al. [15] – for deep theoretical study of the algorithms see Klouda [20]. Recently, the algorithm was generalized for evaluating LW S as well as for IW V , see V´ıˇsek [39]. Although Hettmansperger’s and Sheather’s results appeared misleading, an evaluation of LT S by an exact algorithm (searching through all corresponding subsamples) for their correct and damaged data (the data are nowadays referred to as Engine Knock Data, Hettmansperger, Sheather [14]) showed that the two respective estimates of regression coefficients are about hundreds percents different. It “has broken down” a statistical folklore that the robust methods with the high breakdown point – although losing (a lot of) efficiency – can reliably indicate (at least rough) idea about the underlying model. An explanation (by academic data) is given by the next three figures. First two of them indicate that a small change of observation given by the tiny circle (the change may be even arbitrary small – if closer to the intersection of the two lines) can cause a large change of the fitted model, if we use unconsciously an estimator with high breakdown point. The last figure demonstrates that LT S and LM S can give mutually orthogonal models. The observations drawn by circles are taken into account by both estimators while the observations given by ‘+’ and ‘x’ are considered only by LT S and LM S, respectively. In both cases the curiosities appeared due to the zero-one object function, or in other words, due to the fact that the estimators too much rely on some points and completely reject some others. Hence, some other pairs of estimators with high breakdown point can presumably exhibit a similar behaviour. 12

12 100

10

Decreasing model

10

8

8

6

6

4

4

2

2

0

0

90

Increasing model

LTS

80 70 60 50

LMS

40 30 20

−2 −20

−10

0

10

20

30

40

50

−2 −20

10 0 −10

0

10

20

30

40

50

0

20

40

60

80

A shock caused at the first moment by Hettmansperger’s and Sheather’s results has also began studies of the sensitivity of robust procedures with respect to (small) changes in the data, which in fact continued the studies by Chatterjee and Hadi [7] or Zv´ara [45]. It appeared that the estimator with discontinuous object function suffer by large sensitivity with respect of deleting even one point, see V´ıˇsek [33, 36, 37]. That is why we offer in the numerical study in the last section as the robust estimator of regression coefficient the least weighted squares (LW S) with continuous object function.

Robust error-term-scale estimate

257

Weighting the order statistics of squared residuals Let us start with recalling definitions of notions we shall need later. Definition 1 The estimator of regression coefficients, is said to be scale-equivariant (regression-equivariant) if for any c ∈ R+ , b ∈ Rp , Y ∈ Rn and X – matrix of type n × p – we have   ˆ ˆ X) ˆ + Xb, X) = β(Y, ˆ X) + b . (2) β(cY, X) = cβ(Y, β(Y Definition 2 The estimator σ ˆ 2 of the variance σ 2 of error terms is said to be scale-equivariant (regression-invariant) if for any c ∈ R+ , b ∈ Rp , Y ∈ Rn and X – matrix of type n × p   σ ˆ 2 (cY, X) = c2 σ ˆ 2 (Y, X) σ ˆ 2 (Y + Xb, X) = σ ˆ 2 (Y, X) . Now we are going to give a proposal of estimator of variance σ 2 of error terms ei ’s  2 (see (1)). Let for any β ∈ Rp ri (β) = Yi − Xi β denote the i-th residual and r(h) (β) the h-th order statistic among the squared residuals, i. e. we have 2 2 2 r(1) (β) ≤ r(2) (β) ≤ · · · ≤ r(n) (β).

Finally, let w(u) be a weight function w : [0, 1] → [0, 1] and put γ = r2 f (r)dr.

w (F (|r|)) ·

Remark 2 Under Conditions C1 the d. f. Fe (r) has the density fe (r) = σ −1 f (r · σ −1 ) and hence sup fe (r) ≤ σ −1 · U. (3) r∈R −1

Denote Ue = σ · U . Further, v 2 f (v)dv = γ · σ 2 , i. e. (4)

γ −1 ·

w (Fe (|r|)) · r2 · fe (r)dr = σ 2 ·

  w F (|v| · σ −1 ) ·

w (Fe (|r|)) · r2 · fe (r)dr = σ 2 .

Definition 3 Let βˆ(n) be an estimator of regression coefficients. Then put   n

i−1 2 −1 1 2 r(i) (5) w (βˆ(n) ). σ ˆ(n) = γ · n i=1 n 2 needs to be adjusted to the parent d. f. F (r) by γ. Remark 3 The estimator σ ˆ(n) It is similar as e. g. mean absolute deviation, see Hampel et al. [11] and Rousseeuw, Leroy [29].

We will need some conditions on the weight function. Conditions C2 The weight function w(u) is continuous nonincreasing, w : [0, 1] → [0, 1] with w(0) = 1. Moreover, w(u) is Lipschitz in absolute value, i. e. there is L such that for any pair u1 , u2 ∈ [0, 1] we have |w(u1 ) − w(u2 )| ≤ L · |u1 − u2 |. ˇ ak [10] for any i ∈ {1, 2, . . . , n} and any β ∈ Rp let us Following H´ajek and Sid´ define regression ranks as (6)

π(β, i) = j ∈ {1, 2, . . . , n}



2 ri2 (β) = r(j) (β).

´ V´ıˇsek J. A.

258

Let us denote the empirical distribution function (e.d.f.) of the absolute value of residual as n n   1

1  (n) Fβ (r) = (7) I {|rj (β)| < r} = I |Yj − Xj β| < r . n j=1 n j=1 Due to (6), ri2 (β) is the π(β, i)-th smallest value among the squared residuals, i. e. |ri (β)| is the π(β, i)-th smallest value among the absolute values of the residuals. Hence e. d. f. has at |ri (β)| its π(β, i)-th jump (of magnitude n1 ), nevertheless due to the sharp inequality in the definition of e. d. f. (see (7)) we have (n)

Fβ (|ri (β)|) =

(8) Then we have from (5)

1

· w n i=1 n

2 σ ˆ(n)

(9)

−1

=

γ

=

γ −1 ·

π(β, i) − 1 . n



π(β, i) − 1 n

 ri2 (β)

n  1  (n) ˆ ˆ w Fβˆ (|ri (β)|) ri2 (β). n i=1

Putting moreover         Fβ (r) = P |Y1 − X1 β| < r = P |e1 − X1 β − β 0 | < r , (10) 2 we can give key lemmas for reaching the consistency of σ ˆ(n) .

Lemma 1 Let Conditions C1 hold. Then for any ε > 0 there is Kε and nε ∈ N so that for all n > nε 4* )3  √  (n)  (11) > 1 − ε. P ω∈Ω: sup n Fβ (r) − Fβ (r) < Kε r∈R+ , β∈Rp

For the proof see V´ıˇsek [38] (the proof is based on generalization of result by Kolmogorov and Smirnov). An alternative way how to prove (11) is to employ Skorohod ˇ ep´an [31] for the method and e. g. Portnoy [25], embedding (see Breiman [4] or Stˇ Jureˇckov´a, Sen [19] or V´ıˇsek [42] for examples of employing this technique). Lemma 2 Under Conditions C1 there is K < ∞ so that for any pair β (1) , β (2) ∈ Rp 6 6   we have 6 6 sup Fβ (1) (r) − Fβ (2) (r) ≤ K · 6β (1) − β (2) 6 . r∈R

We have            Fβ (r) = P e1 − X1 β − β 0  < r = I{s − x β − β 0  < r}dFX,e (x, s)

Proof:

(see (10)). Then

  sup Fβ (1) (r) − F(β (2) ) (r) r∈R

          ≤ sup I{|s−x β (1) −β 0 | < r}−I{|s−x β (2) −β 0 | < r} fe (s)ds dFX (x). r∈R

Further, recalling that supr∈R fe (r) ≤ Ue (see Remark 2), we have           I |s − x β (1) − β 0 | < r − I |s − x β (2) − β 0 | < r  fe (s)ds  

Robust error-term-scale estimate



max{−r+x



(β (1) −β 0 ),−r+x (β (2) −β 0 )}

min{−r+x (β (1) −β 0 ),−r+x (β (2) −β 0 )}

+



max{r+x



259

fe (s)ds



(β (1) −β 0 ),r+x (β (2) −β 0 )}

min{r+x (β (1) −β 0 ),r+x (β (2) −β 0 )}

fe (s)ds

     ≤ 2 · Ue · x β (1) − β (2)  .

Hence putting K = 2 · Ue · IE X1 , for any β (1) , β (2) ∈ Rp we have       (1)  x β − β (2)  fX (x)dx sup Fβ (1) (r) − Fβ (2) (r) ≤ 2 · Ue   r∈R 6 6 6 6 6 6 6 6 ≤ 2 · Ue · IE X1  · 6β (1) − β (2) 6 ≤ K · 6β (1) − β (2) 6 . Lemma 3 Let Conditions C1 and C2 hold. Then there is K < ∞ so that for any pair β (1) , β (2) ∈ Rp and any i = 1, 2, . . . , n we have 6   6       6 6      w Fβ 0 ri (β (1) ) − w Fβ 0 ri (β (2) )  ≤ K · 6β (1) − β (2) 6 · Xi  . Proof:

Let us recall once again that            Fβ (r) = P e1 − X1 β  < r = I{s − x β  < r}fe (s)ds dFX (x)

and that supr∈R fe (r) ≤ Ue (see Remark 2). Then            Fβ 0 ri (β (1) ) − Fβ 0 ri (β (2) )                 ≤ I |s − x β 0 | < ri (β (1) ) − I |s − x β 0 | < ri (β (2) )  fe (s)ds dFX (x). Further         I |s − x β 0 | < ri (β (1) ) − I |s − x β 0 | < ri (β (2) )  fe (s)ds   max{−|ri (β (1) )|+x β 0 ,−|ri (β (2) )|+x β 0 } fe (s)ds ≤ min{−|ri (β (1) )|+x β 0 ,−|ri (β (2) )|+x β 0 } max{|ri (β (1) )|+x β 0 ,|ri (β (2) )|+x β 0 } fe (s)ds + min{|ri (β (1) )|+x β 0 ,|ri (β (2) )|+x β 0 } 6   6 6   6 ≤ 2 · Ue · ri (β (1) ) − ri (β (2) ) ≤ 2 · Ue · Xi  · 6β (1) − β (2) 6     where we have used |a| − |b| ≤ |a − b|. Hence putting K = 2 · L · Ue , we have 6 6        6 6       w Fβ 0 ri (β (1) ) − w Fβ 0 ri (β (2) )  ≤ K · 6β (1) − β (2) 6 Xi  . Assertion 1 We have n  n n 6  6

6 62

6  2 ˆ  6 6 6 2 (12) |ei | · Xi  + 6β 0 − βˆ6 · Xi  . ri (β) − e2i  ≤ 2 · 6β 0 − βˆ6 · i=1

i=1

i=1

´ V´ıˇsek J. A.

260

Proof. Straightforward steps give
$$\left|r_i^2(\hat\beta) - e_i^2\right| \;=\; \left|\left(e_i - X_i^\top(\hat\beta - \beta^0)\right)^2 - e_i^2\right| \;\le\; 2\cdot|e_i|\cdot\|X_i\|\cdot\left\|\hat\beta - \beta^0\right\| + \|X_i\|^2\cdot\left\|\hat\beta - \beta^0\right\|^2.$$

Conditions C3. The estimator of regression coefficients $\hat\beta^{(n)}$ is scale- and regression-equivariant and consistent.

Corollary 1. Under Conditions C1 and C3 we have

(13)
$$\frac{1}{n}\sum_{i=1}^{n}\left|r_i^2(\hat\beta) - e_i^2\right| = o_p(1)
\qquad\text{and hence also}\qquad
\frac{1}{n}\sum_{i=1}^{n} r_i^2(\hat\beta) = O_p(1).$$

Proof. Under Conditions C1 we have $\mathrm{E}\left\{|e_1|\cdot\|X_1\|\right\} < \infty$ as well as $\mathrm{E}\,\|X_1\|^2 < \infty$. Hence $\frac{1}{n}\sum_{i=1}^{n}|e_i|\cdot\|X_i\| = O_p(1)$ and also $\frac{1}{n}\sum_{i=1}^{n}\|X_i\|^2 = O_p(1)$. As $\|\hat\beta - \beta^0\| = o_p(1)$, applying Assertion 1 we obtain the first part of (13). Then
$$\frac{1}{n}\sum_{i=1}^{n} r_i^2(\hat\beta) \;\le\; \frac{1}{n}\sum_{i=1}^{n}\left|r_i^2(\hat\beta) - e_i^2\right| + \frac{1}{n}\sum_{i=1}^{n} e_i^2 = O_p(1).$$

Theorem 1. Let Conditions C1, C2 and C3 hold. Then the estimator $\hat\sigma^2_{(n)}$ is weakly consistent, scale-equivariant and regression-invariant.

Proof. Fix $\varepsilon > 0$ and, according to Lemma 1, find $K_\varepsilon > 0$ and $n_\varepsilon \in N$ so that for any $n > n_\varepsilon$ we have

(14)
$$P\left(\omega\in\Omega:\; \sup_{r\in R^+,\,\beta\in R^p} \sqrt{n}\left|F^{(n)}_\beta(r) - F_\beta(r)\right| < K_\varepsilon\right) > 1 - \varepsilon.$$

Denote the set

(15)
$$B_n = \left\{\omega\in\Omega:\; \sup_{r\in R^+,\,\beta\in R^p} \sqrt{n}\left|F^{(n)}_\beta(r) - F_\beta(r)\right| < K_\varepsilon\right\}.$$

Then for any $\omega \in B_n$ we have
$$\left|\gamma\cdot\hat\sigma^2_{(n)} - \frac{1}{n}\sum_{i=1}^{n} w\left(F_{\hat\beta}(|r_i(\hat\beta)|)\right) r_i^2(\hat\beta)\right| \;=\; \left|\frac{1}{n}\sum_{i=1}^{n}\left[w\left(F^{(n)}_{\hat\beta}(|r_i(\hat\beta)|)\right) - w\left(F_{\hat\beta}(|r_i(\hat\beta)|)\right)\right] r_i^2(\hat\beta)\right|$$
$$\le\; \frac{1}{n}\sum_{i=1}^{n}\left|w\left(F^{(n)}_{\hat\beta}(|r_i(\hat\beta)|)\right) - w\left(F_{\hat\beta}(|r_i(\hat\beta)|)\right)\right| r_i^2(\hat\beta) \;\le\; L\cdot\sup_{r\in R^+,\,\beta\in R^p}\sqrt{n}\left|F^{(n)}_\beta(r) - F_\beta(r)\right|\cdot n^{-\frac32}\sum_{i=1}^{n} r_i^2(\hat\beta).$$

Due to (13) we have $n^{-\frac32}\sum_{i=1}^{n} r_i^2(\hat\beta) = o_p(1)$ and hence, due to (14),

(16)
$$\gamma\cdot\hat\sigma^2_{(n)} \;-\; \frac{1}{n}\sum_{i=1}^{n} w\left(F_{\hat\beta}(|r_i(\hat\beta)|)\right)\cdot r_i^2(\hat\beta) \;=\; o_p(1).$$

Now, taking into account Condition C2, we have

(17)
$$\left|\frac{1}{n}\sum_{i=1}^{n}\left[w\left(F_{\hat\beta}(|r_i(\hat\beta)|)\right) - w\left(F_{\beta^0}(|r_i(\hat\beta)|)\right)\right]\cdot r_i^2(\hat\beta)\right| \;\le\; L\cdot\frac{1}{n}\sum_{i=1}^{n}\left|F_{\hat\beta}(|r_i(\hat\beta)|) - F_{\beta^0}(|r_i(\hat\beta)|)\right|\cdot r_i^2(\hat\beta).$$

Now, employing Lemma 2, we have (writing for a while $r_i$ instead of $r_i(\hat\beta)$)

(18)
$$\frac{1}{n}\sum_{i=1}^{n}\left|F_{\hat\beta}(|r_i|) - F_{\beta^0}(|r_i|)\right|\cdot r_i^2 \;\le\; \sup_{r\in R}\left|F_{\hat\beta}(r) - F_{\beta^0}(r)\right|\cdot\frac{1}{n}\sum_{i=1}^{n} r_i^2 \;\le\; K\cdot\left\|\hat\beta - \beta^0\right\|\cdot\frac{1}{n}\sum_{i=1}^{n} r_i^2.$$

Under Condition C1, due to the consistency of $\hat\beta$, (18) is $o_p(1)$. Similarly, employing Lemma 3 and once again Conditions C1 and C2, we have (remember that $r_i(\beta^0) = e_i$)

(19)
$$\frac{1}{n}\sum_{i=1}^{n} w\left(F_{\beta^0}(|r_i(\hat\beta)|)\right)\cdot r_i^2(\hat\beta) \;-\; \frac{1}{n}\sum_{i=1}^{n} w\left(F_{\beta^0}(|e_i|)\right)\cdot r_i^2(\hat\beta) \;=\; o_p(1).$$

Employing Corollary 1, due to Conditions C1, C2 and C3 we have (for $\|\hat\beta - \beta^0\| \le 1$)

(20)
$$\left|\frac{1}{n}\sum_{i=1}^{n} w\left(F_{\beta^0}(|e_i|)\right)\cdot\left[r_i^2(\hat\beta) - e_i^2\right]\right| \;\le\; 2\cdot\left\|\hat\beta - \beta^0\right\|\cdot\frac{1}{n}\sum_{i=1}^{n}\left(|e_i|\cdot\|X_i\| + \|X_i\|^2\right) \;=\; o_p(1).$$

Finally, (16), (17), (19) and (20) imply that

(21)
$$\gamma\cdot\hat\sigma^2_{(n)} \;=\; \frac{1}{n}\sum_{i=1}^{n} w\left(F_{\beta^0}(|e_i|)\right)\cdot e_i^2 + o_p(1).$$

Taking into account (4), the weak consistency of $\hat\sigma^2_{(n)}$ follows from (21).

The scale-equivariance and the regression-invariance of $\hat\sigma^2_{(n)}$ follow directly from two facts. Firstly, the estimator $\hat\sigma^2_{(n)}$ is based on the squared residuals of the estimator $\hat\beta$ of the regression coefficients; as $\hat\beta$ is scale- and regression-equivariant, the residuals are scale-equivariant and regression-invariant, see (2). Secondly, since the weights depend on the empirical d.f., they are scale- and regression-invariant.

Conditions C4. The estimator of regression coefficients $\hat\beta^{(n)}$ is scale- and regression-equivariant and $\sqrt{n}$-consistent.

Corollary 2. Under Conditions C1 and C4

(22)
$$n^{-\frac12}\sum_{i=1}^{n}\left|r_i^2(\hat\beta) - e_i^2\right| \;=\; O_p(1).$$

Proof. Similarly as in (12) we have

(23)
$$n^{-\frac12}\sum_{i=1}^{n}\left|r_i^2(\hat\beta) - e_i^2\right| \;\le\; 2\cdot\sqrt{n}\left\|\beta^0 - \hat\beta\right\|\cdot\frac{1}{n}\sum_{i=1}^{n}|e_i|\cdot\|X_i\| \;+\; \sqrt{n}\left\|\beta^0 - \hat\beta\right\|^2\cdot\frac{1}{n}\sum_{i=1}^{n}\|X_i\|^2.$$

Using similar arguments as in the proof of Corollary 1, we conclude the proof.

Theorem 2. Let Conditions C1, C2 and C4 hold. Then the estimator $\hat\sigma^2_{(n)}$ is $\sqrt{n}$-consistent.

Proof. Similarly as above, (13) and (14) yield

(24)
$$\sqrt{n}\cdot\gamma\cdot\hat\sigma^2_{(n)} \;-\; \frac{1}{\sqrt n}\sum_{i=1}^{n} w\left(F_{\hat\beta}(|r_i(\hat\beta)|)\right)\cdot r_i^2(\hat\beta) \;=\; O_p(1).$$

Employing Lemma 2 and Conditions C1, C2 and C4, we have

(25)
$$\left|\frac{1}{\sqrt n}\sum_{i=1}^{n}\left[w\left(F_{\hat\beta}(|r_i(\hat\beta)|)\right) - w\left(F_{\beta^0}(|r_i(\hat\beta)|)\right)\right] r_i^2(\hat\beta)\right| \;\le\; L\cdot\sqrt{n}\,\sup_{r\in R}\left|F_{\hat\beta}(r) - F_{\beta^0}(r)\right|\cdot\frac{1}{n}\sum_{i=1}^{n} r_i^2(\hat\beta) \;=\; O_p(1).$$

Similarly, utilizing Lemma 3 and once again Conditions C1 and C2, we have

(26)
$$\frac{1}{\sqrt n}\sum_{i=1}^{n} w\left(F_{\beta^0}(|r_i(\hat\beta)|)\right)\cdot r_i^2(\hat\beta) \;-\; \frac{1}{\sqrt n}\sum_{i=1}^{n} w\left(F_{\beta^0}(|r_i(\beta^0)|)\right)\cdot r_i^2(\hat\beta) \;=\; O_p(1).$$

Using Corollary 2, due to Conditions C1, C2 and C4 we have (for $\|\beta^0 - \hat\beta\| \le 1$)

(27)
$$\left|\frac{1}{\sqrt n}\sum_{i=1}^{n} w\left(F_{\beta^0}(|e_i|)\right)\cdot\left[r_i^2(\hat\beta) - e_i^2\right]\right| \;\le\; 2\cdot\sqrt{n}\left\|\beta^0 - \hat\beta\right\|\cdot\frac{1}{n}\sum_{i=1}^{n}\left(|e_i|\cdot\|X_i\| + \|X_i\|^2\right) \;=\; O_p(1).$$

Finally, (24), (25), (26) and (27) imply that
$$\sqrt{n}\cdot\gamma\cdot\left(\hat\sigma^2_{(n)} - \sigma^2\right) \;=\; \frac{1}{\sqrt n}\sum_{i=1}^{n}\left[w\left(F_{\beta^0}(|e_i|)\right)\cdot e_i^2 - \gamma\cdot\sigma^2\right] + O_p(1)$$
and the $\sqrt{n}$-consistency of $\hat\sigma^2_{(n)}$ follows from the Central Limit Theorem and Remark 2.
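As a quick numerical sanity check of the equivariance claims of Theorem 1, the following sketch (our own illustration) uses OLS, which is scale- and regression-equivariant, in the role of $\hat\beta$, and assumes the hypothetical `weighted_scale_estimate` from the first sketch together with some weight function `w` and adjusting constant `gamma` already in scope.

```python
import numpy as np

# Scale-equivariance: sigma2(c*Y) = c^2 * sigma2(Y);
# regression-invariance: sigma2(Y + X @ b) = sigma2(Y).
rng = np.random.default_rng(2)
n, p = 100, 3
X = rng.normal(scale=3.0, size=(n, p))
Y = X @ np.array([1.5, 4.3, -3.2]) + rng.normal(scale=np.sqrt(2.0), size=n)

def sigma2_hat(X, Y):
    beta = np.linalg.lstsq(X, Y, rcond=None)[0]   # OLS as an equivariant beta-hat
    return weighted_scale_estimate(Y - X @ beta, w, gamma)

c, b = 3.0, np.array([1.0, -2.0, 0.5])
print(np.isclose(sigma2_hat(X, c * Y), c**2 * sigma2_hat(X, Y)))  # True
print(np.isclose(sigma2_hat(X, Y + X @ b), sigma2_hat(X, Y)))     # True
```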



In the next section we offer a numerical study of the proposed scale estimator $\hat\sigma^2$. We shall use $\hat\beta^{(LWS,n,w)}$, given as the solution of the extremal problem
$$\hat\beta^{(LWS,n,w)} \;=\; \arg\min_{\beta\in R^p}\;\sum_{i=1}^{n} w\left(\frac{i-1}{n}\right) r^2_{(i)}(\beta),$$
see Víšek [35], in the role of the robust, scale- and regression-equivariant estimator of the regression coefficients.
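The extremal problem above has no closed-form solution; the estimates in the study below are computed by the algorithm of Víšek [39]. Purely as an illustration of the definition, the following sketch uses a common rank-based reweighting heuristic (all names are ours): in each step, the observation with the $i$-th smallest squared residual receives the weight $w((i-1)/n)$ and weighted least squares is re-solved. Such a heuristic typically stabilizes at a fixed point of the weighted normal equations but is not guaranteed to find the global minimum.

```python
import numpy as np

def lws_estimate(X, Y, w, n_iter=50):
    """Least weighted squares via iterative rank-based reweighting:
    an illustrative sketch, not the algorithm of [39]."""
    n = len(Y)
    rank_weights = w(np.arange(n) / n)            # w((i-1)/n) for i = 1..n
    beta = np.linalg.lstsq(X, Y, rcond=None)[0]   # OLS starting point
    for _ in range(n_iter):
        r2 = (Y - X @ beta) ** 2
        obs_w = np.empty(n)
        obs_w[np.argsort(r2)] = rank_weights      # i-th smallest r2 gets w((i-1)/n)
        Xw = X * obs_w[:, None]
        beta_new = np.linalg.solve(X.T @ Xw, Xw.T @ Y)
        if np.allclose(beta_new, beta):
            break
        beta = beta_new
    return beta
```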

We shall need the following conditions:

Conditions C5. There is only one solution of

(28)
$$\mathrm{E}\left\{w\left(F_\beta(|r(\beta)|)\right)\, X_1\left(e_1 - X_1^\top\left(\beta - \beta^0\right)\right)\right\} \;=\; 0,$$

namely $\beta^0$ (equation (28) is understood as a vector equation in $\beta \in R^p$).

Conditions NC1. The derivative $f_e'(r)$ exists and is bounded in absolute value by $B_e < \infty$. The derivative $w'(\alpha)$ exists and is Lipschitz of the first order (with the corresponding constant $J_w < \infty$).

Theorem 3. Under Conditions C1, C2 and C5, $\hat\beta^{(LWS,n,w)}$ is consistent, scale- and regression-equivariant. Similarly, under Conditions C1, C2, C5 and NC1, $\hat\beta^{(LWS,n,w)}$ is $\sqrt{n}$-consistent.

The proof can be found in Víšek [40, 43].

Hence $\hat\beta^{(LWS,n,w)}$ can be used as the estimator considered above in the construction of $\hat\sigma^2$.

Numerical study

The model (1) was employed with the coefficients given in the first row of the tables presented below. The explanatory variables were generated as a sample from a 3-dimensional normal population with zero means and diagonal covariance matrix (diagonal elements equal to 9). The error terms were generated as normal with zero mean and variance equal to 2. We generated 100 datasets, each of them containing 100 observations. As the robust, scale- and regression-equivariant estimator we used $\hat\beta^{(LWS,n,w)}$, see the end of the previous section. For processing a mild contamination (see below) the weight function was given as

(29)
$$w(u) = 1 \;\text{ for }\; u\in[0, 0.8], \qquad w(u) = 20\cdot(0.8 - u) + 1 \;\text{ for }\; u\in[0.8, 0.85], \qquad w(u) = 0 \;\text{ otherwise.}$$

For processing a heavy contamination (see again below) we began with a weight function of the type (29), but with the upper bound of the first interval equal to 0.4 (instead of 0.8) and with a much gentler slope. Then we increased the upper bound of the interval $[0, 0.4]$ step by step (with step 0.01). The estimates of the error-term scale and of the regression coefficients remained stable until the upper bound exceeded 0.45, where they lost their stability. Hence we used
$$w(u) = 1 \;\text{ for }\; u\in[0, 0.45], \qquad w(u) = 2.5\cdot(0.45 - u) + 1 \;\text{ for }\; u\in[0.45, 0.85], \qquad w(u) = 0 \;\text{ otherwise.}$$
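In code, the weight functions used in the study all share the same piecewise-linear shape; a minimal sketch follows (the factory name is ours, and the quoted slopes make each $w$ continuous, reaching 0 at the end of its descent interval):

```python
import numpy as np

def make_weight(c1, c2, slope):
    """Piecewise linear weight of the type (29): equal to 1 on [0, c1],
    decreasing as slope*(c1 - u) + 1 on [c1, c2], and 0 beyond c2."""
    def w(u):
        u = np.asarray(u, dtype=float)
        return np.where(u <= c1, 1.0,
               np.where(u <= c2, slope * (c1 - u) + 1.0, 0.0))
    return w

w_mild = make_weight(0.80, 0.85, 20.0)   # the weight function (29)
w_heavy = make_weight(0.45, 0.85, 2.5)   # accommodated to heavy contamination
```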

As a benchmark we offer the results of the ordinary least squares $\hat\beta^{(OLS,n)}$ and of the least weighted squares $\hat\beta^{(LWS,n,w)}$ for data without any contamination (the first table). The subsequent tables collect the results of the estimation of the model by $\hat\beta^{(OLS,n)}$ and $\hat\beta^{(LWS,n,w)}$ under various types of contamination, specified in the captions of the tables (inside the frames). The estimates were evaluated by the algorithm discussed in Víšek [39] and implemented in MATLAB (the implementation is available on request).

Every table contains in its first row the true values of the regression model. The second and the third rows contain the empirical means of the hundred $\hat\beta^{(OLS,n)}$'s and $\hat\beta^{(LWS,n,w)}$'s, respectively, evaluated for the (above mentioned) 100 datasets. The type and level of contamination is given in the first line of the respective frame. The adjusting constant $\gamma$ was evaluated by numerical integration. Finally,
$$\hat\sigma^2_{OLS} \;=\; \frac{1}{n-p}\sum_{i=1}^{n} r_i^2\left(\hat\beta^{(OLS,n)}\right)
\qquad\text{and}\qquad
\hat\sigma^2_{LWS} \;=\; \gamma^{-1}\cdot\frac{1}{n}\sum_{i=1}^{n} w\left(\frac{i-1}{n}\right) r^2_{(i)}\left(\hat\beta^{(LWS,n,w)}\right).$$
The results of estimating the variance of the error terms by these estimators are given on the second and on the third line of the frames, respectively.

Regression without contamination. For this case we started with the weight function given in (29) and shifted the interval [0.8, 0.85] to the right, step by step (with step 0.01), as long as the results remained stable; finally we used $w(u) = 1$ for $u\in[0, 0.95]$ and $w(u) = 20\cdot(0.95 - u) + 1$ for $u\in[0.95, 1]$.

    $\hat\sigma^2_{OLS}$ = 1.99(.0641)        $\hat\sigma^2_{LWS}$ = 1.99(.0647)

    $\beta^0$                  1.5            4.3            −3.2
    $\hat\beta^{(OLS,n)}$      1.49(.0040)    4.28(.0039)    −3.20(.0060)
    $\hat\beta^{(LWS,n,w)}$    1.49(.0042)    4.28(.0044)    −3.20(.0063)

Regression with mild contamination. Contamination: for the first 5 observations we changed $Y_i$ to $2\cdot Y_i$ (first frame), or $Y_i$ to $2\cdot Y_i$ and $X_i$ to $2\cdot X_i$ (second frame). (Let us recall that the true values of the coefficients are in the first row of each frame, while the second and the third rows contain $\hat\beta^{(OLS,n)}$ and $\hat\beta^{(LWS,n,w)}$, respectively; variances of the estimates are in parentheses.)

    $Y_i$ changed to $2\cdot Y_i$:
    $\hat\sigma^2_{OLS}$ = 7.17(9.91)        $\hat\sigma^2_{LWS}$ = 2.30(.059)

    $\beta^0$                  1.5            4.3            −3.2
    $\hat\beta^{(OLS,n)}$      1.55(.016)     4.43(.017)     −3.33(.022)
    $\hat\beta^{(LWS,n,w)}$    1.49(.007)     4.30(.006)     −3.20(.007)

    $Y_i$ changed to $2\cdot Y_i$ and $X_i$ to $2\cdot X_i$:
    $\hat\sigma^2_{OLS}$ = 72.52(1775.0)     $\hat\sigma^2_{LWS}$ = 2.29(0.082)

    $\beta^0$                  1.5            4.3            −3.2
    $\hat\beta^{(OLS,n)}$      1.04(.490)     3.17(.646)     −2.29(.644)
    $\hat\beta^{(LWS,n,w)}$    1.49(.007)     4.30(.006)     −3.21(.007)
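For concreteness, one replication of the mild-contamination experiment can be outlined as follows, reusing the hypothetical helpers `make_weight`, `lws_estimate` and `weighted_scale_estimate` sketched earlier. The adjusting constant $\gamma$ is obtained by numerical integration, assuming, as the use of (4) in the proof of Theorem 1 suggests, that $\mathrm{E}\,w\!\left(F_{\beta^0}(|e_1|)\right) e_1^2 = \gamma\,\sigma^2$.

```python
import numpy as np
from scipy import stats, integrate

rng = np.random.default_rng(1)
n, p = 100, 3
beta0 = np.array([1.5, 4.3, -3.2])
sd_e = np.sqrt(2.0)

# one dataset of the design: X ~ N(0, 9 I_3), e ~ N(0, 2)
X = rng.normal(scale=3.0, size=(n, p))
Y = X @ beta0 + rng.normal(scale=sd_e, size=n)
Y[:5] *= 2.0                                   # mild contamination: Y_i -> 2*Y_i

w = make_weight(0.80, 0.85, 20.0)              # weight function (29)

# adjusting constant gamma from E w(F(|e|)) e^2 = gamma * sigma^2,
# where F(r) = P(|e_1| < r) = 2 Phi(r / sd_e) - 1
integrand = lambda s: float(w(2.0 * stats.norm.cdf(abs(s), scale=sd_e) - 1.0)) \
                      * s**2 * stats.norm.pdf(s, scale=sd_e)
gamma = integrate.quad(integrand, -np.inf, np.inf)[0] / sd_e**2

beta_lws = lws_estimate(X, Y, w)
sigma2_lws = weighted_scale_estimate(Y - X @ beta_lws, w, gamma)
print(beta_lws, sigma2_lws)                    # close to beta0 and to 2
```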

Regression with heavy contamination but with an inappropriate weight function. Contamination: for the first 45 observations we changed $Y_i$ to $2\cdot Y_i$ (first frame), or $Y_i$ to $2\cdot Y_i$ and $X_i$ to $2\cdot X_i$ (second frame).

    $Y_i$ changed to $2\cdot Y_i$:
    $\hat\sigma^2_{OLS}$ = 34.90(26.26)      $\hat\sigma^2_{LWS}$ = 31.23(20.76)

    $\beta^0$                  1.5            4.3            −3.2
    $\hat\beta^{(OLS,n)}$      2.16(.072)     6.18(.091)     −4.61(.010)
    $\hat\beta^{(LWS,n,w)}$    1.89(.125)     5.41(.283)     −4.06(.181)

    $Y_i$ changed to $2\cdot Y_i$ and $X_i$ to $2\cdot X_i$:
    $\hat\sigma^2_{OLS}$ = 237.52(1144.6)    $\hat\sigma^2_{LWS}$ = 214.1(1097.1)

    $\beta^0$                  1.5            4.3            −3.2
    $\hat\beta^{(OLS,n)}$      −.77(.206)     −2.14(.248)    1.67(.211)
    $\hat\beta^{(LWS,n,w)}$    −1.1(.176)     −3.03(.579)    2.26(.432)

Regression with heavy contamination and an accommodated weight function. Contamination: for the first 45 observations we changed $Y_i$ to $2\cdot Y_i$ (first frame), or $Y_i$ to $2\cdot Y_i$ and $X_i$ to $2\cdot X_i$ (second frame).

    $Y_i$ changed to $2\cdot Y_i$:
    $\hat\sigma^2_{OLS}$ = 333.99(24.66)     $\hat\sigma^2_{LWS}$ = 1.89(0.057)

    $\beta^0$                  1.5            4.3            −3.2
    $\hat\beta^{(OLS,n)}$      2.16(.087)     6.18(.101)     −4.60(.104)
    $\hat\beta^{(LWS,n,w)}$    1.54(.112)     4.52(.682)     −3.34(.394)

    $Y_i$ changed to $2\cdot Y_i$ and $X_i$ to $2\cdot X_i$:
    $\hat\sigma^2_{OLS}$ = 232.4(899.9)      $\hat\sigma^2_{LWS}$ = 2.62(0.104)

    $\beta^0$                  1.5            4.3            −3.2
    $\hat\beta^{(OLS,n)}$      −.71(.188)     −2.1(.244)     1.54(.242)
    $\hat\beta^{(LWS,n,w)}$    1.5(.109)      4.20(.736)     −3.14(.41)

Conclusions of the numerical study. It is clear that outliers alone have only a small influence on the estimates, while the "combined" contamination (simultaneously by outliers and leverage points) has a much larger one. Nevertheless, both $\hat\beta^{(LWS,n,w)}$ and $\hat\sigma^2_{LWS}$ coped with the contamination quite well, provided the weight function was properly accommodated to the level of contamination. In practice we do not know the level of contamination. Then we may follow a (rather general) rule: start with the "highest possible" robustness of $\hat\sigma^2_{(n)}$ and of $\hat\beta^{(LWS,n,w)}$, and decrease their robustness gradually until the estimates lose their stability, see e.g. Benáček and Víšek [2].

Acknowledgement. We would like to thank the two anonymous referees. Their comments indicated very precisely what needed to be modified to make the text clearer and easier to understand.

References

[1] Agulló, J. (2001). New algorithms for computing the least trimmed squares regression estimators. Computational Statistics and Data Analysis 36 425–439.
[2] Benáček, V. and Víšek, J. Á. (2002). Determining factors of trade specialization and growth of a small economy in transition. Impact of the EU opening-up on Czech exports and imports. IIASA, Austria, IR series no. IR-03-001 1–41.
[3] Bickel, P. J. (1975). One-step Huber estimates in the linear model. J. Amer. Statist. Assoc. 70 428–433.
[4] Breiman, L. (1968). Probability. Addison-Wesley Publishing Company, London.
[5] Boček, P. and Lachout, P. (1993). Linear programming approach to LMS-estimation. Memorial volume of Comput. Statist. & Data Analysis 19 129–134.
[6] Bramanti, M. C. and Croux, C. (2007). Robust estimators for the fixed effects panel data model. The Econometrics Journal 10 321–540.
[7] Chatterjee, S. and Hadi, A. S. (1988). Sensitivity Analysis in Linear Regression. J. Wiley & Sons, New York.
[8] Croux, C. and Rousseeuw, P. J. (1992). A class of high-breakdown scale estimators based on subranges. Communications in Statistics – Theory and Methods 21 1935–1951.
[9] Čížek, P. and Víšek, J. Á. (2000). The least trimmed squares. User Guide of Explore.
[10] Hájek, J. and Šidák, Z. (1967). Theory of Rank Tests. Academic Press, New York.
[11] Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J. and Stahel, W. A. (1986). Robust Statistics – The Approach Based on Influence Functions. J. Wiley & Sons, New York.
[12] Härdle, W., Hlávka, Z. and Klinke, S. (2000). XploRe Application Guide. Springer Verlag, Heidelberg.
[13] Hawkins, D. M. (1994). The feasible solution algorithm for least trimmed squares regression. Computational Statistics and Data Analysis 17 185–196.
[14] Hettmansperger, T. P. and Sheather, S. J. (1992). A cautionary note on the method of Least Median Squares. The American Statistician 46 79–83.
[15] Hofmann, M., Gatu, C. and Kontoghiorghes, E. J. (2010). An exact least trimmed squares algorithm for a range of coverage values. J. of Computational and Graphical Statistics 19 191–204.
[16] Judge, G., Griffiths, W. E., Hill, R. C., Lütkepohl, H. and Lee, T. C. (1982). Introduction to the Theory and Practice of Econometrics. J. Wiley & Sons, New York.
[17] Jurečková, J. and Picek, J. (2006). Robust Statistical Methods with R. Chapman & Hall, New York.
[18] Jurečková, J. and Sen, P. K. (1984). On adaptive scale-equivariant M-estimators in linear models. Statistics and Decisions 2, Suppl. Issue No. 1.
[19] Jurečková, J. and Sen, P. K. (1993). Regression rank scores scale statistics and studentization in linear models. Proc. Fifth Prague Symposium on Asymptotic Statistics, Physica Verlag, 111–121.
[20] Klouda, K. (2007). Algorithms for computing robust regression estimates. Diploma Thesis, Faculty of Nuclear Sciences and Physical Engineering, Czech Technical University, Prague.
[21] Koenker, R. and Bassett, G. (1978). Regression quantiles. Econometrica 46 33–50.
[22] Marazzi, A. (1992). Algorithms, Routines and S Functions for Robust Statistics. Wadsworth & Brooks/Cole Publishing Company, Belmont.
[23] Maronna, R. A. and Yohai, V. J. (1981). The breakdown point of simultaneous general M-estimates of regression and scale. J. of Amer. Statist. Association 86 (415) 699–704.
[24] Pison, G., Van Aelst, S. and Willems, G. (2002). Small sample corrections for LTS and MCD. Metrika 55 111–123.
[25] Portnoy, S. (1983). Tightness of the sequence of empiric c.d.f. processes defined from regression fractiles. In Robust and Nonlinear Time-Series Analysis (J. Franke, W. Härdle, D. Martin, eds.), 231–246. Springer-Verlag, New York.
[26] Rousseeuw, P. J. (1984). Least median of squares regression. J. Amer. Statist. Association 79 871–880.
[27] Rousseeuw, P. J. and Driessen, K. (2000). An algorithm for positive-breakdown regression based on concentration steps. In Data Analysis: Scientific Modeling and Practical Application (W. Gaul, O. Opitz, M. Schader, eds.), 335–346. Springer-Verlag, Berlin.
[28] Rousseeuw, P. J. and Driessen, K. (2002). Fast-LTS in Matlab, code revision 20/04/2006.
[29] Rousseeuw, P. J. and Leroy, A. M. (1987). Robust Regression and Outlier Detection. J. Wiley, New York.
[30] Stock, J. H. and Trebbi, F. (2003). Who invented instrumental variable regression? Journal of Economic Perspectives 17 177–194.
[31] Štěpán, J. (1987). Teorie pravděpodobnosti (Probability Theory – in Czech). Academia, Praha.
[32] Víšek, J. Á. (1994). A cautionary note on the method of the Least Median of Squares reconsidered. Trans. Twelfth Prague Conference on Information Theory, Statistical Decision Functions and Random Processes, Academy of Sciences of the Czech Republic, 254–259.
[33] Víšek, J. Á. (1996). Sensitivity analysis of M-estimates. Annals of the Institute of Statistical Mathematics 48 469–495.
[34] Víšek, J. Á. (1996). On high breakdown point estimation. Computational Statistics 137–146.
[35] Víšek, J. Á. (2000). Regression with high breakdown point. In Robust 2000 (J. Antoch and G. Dohnal, eds.), Union of the Czech Mathematicians and Physicists, 324–356.
[36] Víšek, J. Á. (2002). Sensitivity analysis of M-estimates of nonlinear regression model: Influence of data subsets. Annals of the Institute of Statistical Mathematics 54 261–290.
[37] Víšek, J. Á. (2006). The least trimmed squares. Sensitivity study. In Proc. Prague Stochastics 2006 (M. Hušková and M. Janžura, eds.), matfyzpress, 728–738.
[38] Víšek, J. Á. (2006). Kolmogorov–Smirnov statistics in multiple regression. In Proc. ROBUST 2006 (J. Antoch and G. Dohnal, eds.), Union of the Czech Mathematicians and Physicists, 367–374.
[39] Víšek, J. Á. (2006). Instrumental weighted variables – algorithm. In Proc. COMPSTAT 2006, 777–786.
[40] Víšek, J. Á. (2009). Consistency of the least weighted squares under heteroscedasticity. Submitted to Kybernetika.
[41] Víšek, J. Á. (2009). Consistency of the instrumental weighted variables. Annals of the Institute of Statistical Mathematics 61 543–578.
[42] Víšek, J. Á. (2010). Empirical distribution function under heteroscedasticity. To appear in Statistics.
[43] Víšek, J. Á. (2010). Weak √n-consistency of the least weighted squares under heteroscedasticity. Submitted to Acta Universitatis Carolinae – Mathematica et Physica.
[44] Wooldridge, J. M. (2001). Econometric Analysis of Cross Section and Panel Data. MIT Press, Cambridge, Massachusetts.
[45] Zvára, K. (1989). Regresní analýza (Regression Analysis – in Czech). Academia, Praha.