DRAFT

Posterior predictive outlier detection using sample reweighting

Alan M. Zaslavsky and Eric T. Bradlow

November 4, 1997

Abstract

In a Bayesian model, we define an outlier as an observation which is "surprising" relative to its predictive distribution, under the model, given the remainder of the data. Hence "outlyingness" can be measured by the posterior predictive p-value of any interesting scalar summary of the (possibly multivariate) observation. For this calculation, we exclude the case of interest from the data, analogously to studentization of regression residuals. When Bayesian inference about the parameters is conducted by drawing a sample from their posterior distribution, as with a Markov chain Monte Carlo sampler, the p-value can be calculated by reweighting the sample to reflect deletion of the target observation and then drawing from the predictive distribution. Therefore the case-deletion weighting methods of Bradlow and Zaslavsky (1997a) are useful. A variety of outlier checks are illustrated using hierarchical models for two data sets: a standard linear hierarchical model for rat growth and a complex ordinal model for survey data with nonignorable missing responses.

1 Introduction

1.1 Definitions

An outlier is an observation whose value is "surprising." This statement has a natural interpretation in Bayesian models, in which all defined values have posterior distributions.

[Footnote: Alan M. Zaslavsky is Associate Professor of Statistics, Department of Health Care Policy, Harvard Medical School, Boston, MA. Eric T. Bradlow is Assistant Professor of Marketing and Statistics, Wharton School, University of Pennsylvania, Philadelphia, PA.]

We define the "surprisingness" of an observation or collection of observations $y_i$ (where $i$ may be the index of a single datum or a set of indices denoting any subset of the complete data $Y$) by the posterior predictive p-value for a scalar statistic $S_i(y_i)$. This is the probability that a new realization $y_i^{\rm rep}$ of that set of observations would yield a value of $S_i(y_i^{\rm rep})$ equal to or exceeding the observed value, conditional on $Y_{(-i)}$, the vector of all of the data except $y_i$:

$p_i = P(S_i(y_i^{\rm rep}) \geq S_i(y_i) \mid Y_{(-i)}).$   (1)

Under the model, this is the quantile of the realized value $S_i(y_i)$ in its case-deleted predictive distribution and, if the densities are continuous, has a uniform distribution on $[0,1]$. An unusually small value of $p_i$ then constitutes evidence that the model is deficient in predicting $y_i$. (This corresponds to a one-tailed test; in situations where extreme values of $S_i(y_i)$ in either direction represent outlyingness, we can define a second statistic $S_i'(y_i) = -S_i(y_i)$, or equivalently search for values of $p_i$ that are close to either 0 or 1.) We index the check statistic $S_i$ to allow for the possibility that the statistic is calculated differently for different parts of the data, but we suppress this index when the same function is used for every unit.

The dependence of $y_i$ on the other data $Y_{(-i)}$ is usually through a set of parameters $\theta$, in which case (1) can be rewritten as

$p_i = E_\theta P(S_i(y_i^{\rm rep}) > S_i(y_i) \mid \theta) = \int P(S_i(y_i^{\rm rep}) > S_i(y_i) \mid \theta) \, [\theta \mid Y_{(-i)}] \, d\theta,$   (2)

where $[A \mid B]$ is used generically to represent a conditional density. The integral is the expectation with respect to the posterior distribution of $\theta$, the parameter vector, given the data $Y_{(-i)}$ with $y_i$ omitted. Typically, in a hierarchical model there will be a partition $\theta = (\theta_i, \theta_{(-i)})$ such that $\theta_i$ is particularly tied to $y_i$, but depends on $Y_{(-i)}$ only through $\theta_{(-i)}$; then (2) may be rewritten as

$p_i = E_{\theta_{(-i)}} \left( E_{\theta_i} P(S_i(y_i^{\rm rep}) > S_i(y_i) \mid \theta) \mid \theta_{(-i)} \right) = \int \int P(S_i(y_i^{\rm rep}) > S_i(y_i) \mid \theta) \, [\theta_i \mid \theta_{(-i)}] \, [\theta_{(-i)} \mid Y_{(-i)}] \, d\theta_i \, d\theta_{(-i)}.$   (3)

It may also be interesting or convenient to condition the predictive distribution of $S_i(y_i^{\rm rep})$ on another statistic (possibly vector-valued), $T_i(y_i^{\rm rep})$. (Again we index $T_i$ to allow for nonuniform data structures.) This arises in situations where there is a partial ordering in the data space which can be made into a complete ordering by conditioning. An important class of examples consists of models for data with one or more ordinal or interval-scaled outcomes for each unit and missing data (or a variable number of observations per unit). While it may be fairly easy to construct a summary measure $S_i(y_i)$ for the ordinal or interval outcomes, such as the sum of the scores, there is no natural way to order outcomes with different patterns of nonmissing observations. We can condition the predictive distribution on the pattern of missing data, in order to obtain an unambiguous ordering. (In Section 3 we discuss the alternatives of conditioning or not conditioning on $T_i(y_i)$ when drawing $\theta$.)

1.2 Outliers and influential observations

Outlyingness of $S_i(y_i)$ is defined with respect to its predictive distribution given all the data except $y_i$. This exclusion corresponds to "studentization" of residuals in ordinary linear regression. It is essential to detection of outliers which are highly influential on $\theta$, specifically on their own predicted values, or "high-leverage" outliers in the language of regression. For this reason the techniques of Bradlow and Zaslavsky (1997a) for calculation of case-deleted distributions by reweighting a sample from the posterior distribution become relevant. Another relationship of this work to Bradlow and Zaslavsky (1997a) is that, as in that article, we measure outlyingness along directions specified by predetermined scalar functions $S_i$ rather than by global comparison of distributions; this is important when $y_i$ is multidimensional.

In general, the inference in outlier detection concerns predictions in data space and therefore is quite different from that in influence analysis, which concerns parameter distributions. Nonetheless, influence measures can also be used to define measures of outlyingness. As in Bradlow and Zaslavsky (1997a) we may define influence by the posterior expectation of a function of the parameters,

$S_i(y_i) = E[h(\theta) \mid Y_{(-i)}, y_i]$   (4)

for some scalar $h(\theta)$, and regard this as a function of $y_i$. Alternatively, outlyingness may be defined by a measure of the distance between the full-data and case-deleted distributions of $\theta$ or of some subvector $\theta_A$, $[\theta_A \mid Y_{(-i)}]$ and $[\theta_A \mid Y]$. For example, the Kullback-Leibler divergence can be estimated from a sample by substituting $h(\theta) = \log\left( [\theta_A \mid Y_{(-i)}] / [\theta_A \mid Y_{(-i)}, y_i] \right)$ in (4). The divergence measure describes an observation as outlying when it has an unusually large influence (relative to other potential observations) on the parameter distribution, while (4) looks for influence in a predetermined direction. This approach is useful when interesting directions can be identified in the parameter space but there is no obvious or canonical way of summarizing the data $y_i$ without reference to the underlying model. In particular, in hierarchical models, it is natural to define the outlier statistic using a function $h(\theta_i)$ of the parameter corresponding to the data $y_i$. Several examples appear in Section 4.

Although outlier metrics based on parameter distributions are similar to those used in analysis of case influence, the analysis is different. Defining $S_i(y_i)$ as

in (4), influence analysis looks at the substantive importance of the difference $E[h(\theta) \mid Y] - E[h(\theta) \mid Y_{(-i)}] = S_i(y_i) - E\,S_i(y_i^{\rm rep})$, where the second term $E[h(\theta) \mid Y_{(-i)}] = E_{y_i \mid Y_{(-i)}}\left( E[h(\theta) \mid Y] \right)$ integrates over the predictive distribution of $y_i$. Outlier analysis using an influence measure such as (4) refers $S_i(y_i)$ to the same predictive distribution of $S_i(y_i^{\rm rep})$.

An observation can be outlying and still not be influential. It is also possible for an observation to be influential but not very outlying, if it has (in some sense) very high leverage because of the values of its design covariates (those which are not random under the model). On the other hand, any observation which is "surprisingly influential" given its design values is also outlying under the corresponding influence-defined measure (4).

1.3 Outlier detection and global model checks

Posterior predictive checks were originally developed in the context of global model-checking (Rubin 1984), i.e., checking whether a model is properly specified. In this context, we regard a posterior predictive check as global if the number of checks remains fixed as the amount of data increases. Outlier detection, on the other hand, refers to detection of deficiencies in the model that are identifiable in particular subsets of the data corresponding to some natural unit, for example individual respondents to a survey or cases in a clinical study. Therefore the number of outlier checks grows in proportion to the amount of data. Our outlier checks are not strictly posterior predictive checks, because the check distribution conditions on a subset of the data $Y_{(-i)}$ rather than the full data $Y$; subsetting is sensible for outlier detection but has no natural analog for global model checks.

In general, a predictive check for outlier detection can be converted into a global check of model fit by defining some statistic that summarizes the excess of outliers, such as a count of cases for which $p_i < C$. The evidence that there are excessively many outliers can be assessed more informally by a quantile plot of the ordered $p_i$ against their expectations under the model, i.e., uniformly spaced values on $[0,1]$, as illustrated in Section 4. The probability plot is not very good at highlighting extreme values of $p_i$; therefore it should be supplemented with direct examination of the tail cases. Also, the p-values can be transformed to emphasize the extremes; for example, a plot of $\Phi^{-1}(p_i)$ against expected normal quantiles closely resembles the traditional normal quantile plot for residuals. For models with highly discrete outcomes, the uniform distribution is not necessarily an adequate approximation, and it might be necessary to form a reference distribution for the $p_i$ by simulating new data sets under the model if a global check of model fit is desired.

2 Background

One part of the Bayesian literature on outliers proceeds by extending a baseline model to include a model for outliers. Usually this is done by specifying an epsilon-contaminated sampling distribution, i.e., a mixture model with a component for large deviations. Some of this research is primarily concerned with reducing the influence of outliers on parameter estimates ("robustification"), and some also detects outliers by calculating the posterior probability that each observation comes from the "large deviation" component of the mixture. Box and Tiao (1968) propose a model of this type for the normal distribution, assuming that the number of scale-contaminated outliers is binomially distributed; they approximate the posterior distribution of the index set of the observations from the outlier component. Verdinelli and Wasserman (1991) use a Gibbs sampler to draw parameters under a t-error shift-contamination model. Sharples (1990) extends this approach by specifying a location- or scale-contamination model at both stages of a normal hierarchical model, with an emphasis on "robustifying" the model rather than detecting outliers. West (1984) discusses more generally the use of long-tailed error distributions in regression and hierarchical models, but when these models are not specified as discrete mixtures, they do not imply a definition of outliers.

Another research approach defines outliers without reference to an outlier model; this is also a feature of our work. Several papers along these lines define outlyingness in the parameter space, examining the posterior distributions of parameters which are closely related to particular observations or groups of observations. Chaloner and Brant (1988) compare the prior and posterior probabilities that the error term of a normal linear model is large, $|\epsilon_i| > K$. If the posterior expectation of the number of large residuals is much larger than the prior expectation under the model, this suggests lack of model fit. A similar approach was applied earlier by Zellner and Moulton (1985) in model selection. Weiss and Lazaro (1992) and Weiss, Cho, and Yanuzzi (1997) detect outliers in random effects models for repeated measures data, using a related approach based on comparison of prior and posterior distributions of parameters at various levels of the model. For a general approach to sensitivity analysis, which includes outlier detection as a special case, see Weiss (1996).

The conditional predictive ordinate (CPO) of Geisser (1980) is a predictive density evaluated at the observed value; Geisser proposes using it to identify the most outlying observations. The CPO has the defect that comparisons of the CPO for different observations are not meaningful if the predictive distributions have different (or incomparable) scales. Generally, the CPO can be used to order possible data values, and therefore tail areas of the CPO, proposed as a diagnostic by Geisser (1987), are a special case of our approach. For a scalar observation with a unimodal predictive density, ordering possible data values by the CPO is equivalent to ordering them by the value of the observation, except that the tails are combined in a particular way (not necessarily appropriate if the distribution is asymmetrical). In normal linear regression, Chaloner and Brant (1988, sec. 4) point out that tail areas of the CPO can be interpreted as p-values obtained by referring the standardized residual to the t distribution; in the next section, we show that this is also a posterior predictive outlier check.

Some papers, such as Peng and Dey (1995), refer to outliers but are actually, according to our definitions, about case influence analysis. Section 1.2 discusses the relationship between these topics.

3 Algorithms

In some simple problems, there are closed-form expressions for $p_i$, which sometimes correspond to standard regression diagnostics. In the standard Bayesian normal regression problem $y_i \sim N(x_i'\beta, \sigma^2)$, with $(n+1) \times k$ regressor matrix $X$ and uniform priors for $\beta$ and $\log\sigma$, the predictive check distribution is $y_i \mid Y_{(-i)} \sim (1 + x_i'(X_{(-i)}'X_{(-i)})^{-1}x_i)^{1/2}\, s\, t_{n-k} + x_i'\hat\beta_{(-i)}$, where $t_{n-k}$ has the t distribution with $n-k$ degrees of freedom, $s$ and $\hat\beta_{(-i)}$ are the usual estimators of $\sigma$ and $\beta$ based on the $n$ observations in $Y_{(-i)}$, and $X_{(-i)}$ is the regressor matrix with row $i$ deleted (Gelman et al. 1995, p. 239). A straightforward calculation shows that the corresponding outlier p-value is identical to that obtained by referring the studentized residual $(y_i - x_i'\hat\beta)/(s\sqrt{1 - h_{ii}})$, where $H = X(X'X)^{-1}X'$, to the t distribution with $n-k$ degrees of freedom, if $s$ is estimated using only $Y_{(-i)}$ (rather than $Y$ as in standard regression diagnostics).
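To make the closed-form check concrete, the following sketch (our illustration, assuming numpy and scipy; the function name and argument layout are ours) computes this case-deleted predictive p-value for one observation:

```python
import numpy as np
from scipy import stats

def case_deleted_outlier_pvalue(X, y, i):
    """Posterior predictive outlier p-value for case i in the normal
    regression model y ~ N(X beta, sigma^2) with uniform priors on
    (beta, log sigma): refer y_i to its case-deleted predictive
    t distribution."""
    n_plus_1, k = X.shape
    mask = np.ones(n_plus_1, dtype=bool)
    mask[i] = False
    X_del, y_del = X[mask], y[mask]           # data with case i removed
    n = X_del.shape[0]
    # Case-deleted least-squares fit (posterior mean of beta given Y_(-i))
    beta_hat = np.linalg.lstsq(X_del, y_del, rcond=None)[0]
    resid = y_del - X_del @ beta_hat
    s = np.sqrt(resid @ resid / (n - k))      # case-deleted estimate of sigma
    # Predictive scale inflates s by the leverage of x_i on the deleted fit
    xi = X[i]
    scale = s * np.sqrt(1.0 + xi @ np.linalg.solve(X_del.T @ X_del, xi))
    t_stat = (y[i] - xi @ beta_hat) / scale
    # Upper-tail p-value under the t_{n-k} predictive distribution
    return stats.t.sf(t_stat, df=n - k)
```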

In general, simulation algorithms for calculation of posterior predictive outlier checks are based on either (2) or (3). The straightforward translation of (2) into an algorithm is to draw a sample from the density $[\theta \mid Y_{(-i)}]$, then draw samples from $[y_i^{\rm rep} \mid \theta]$, and estimate $p_i$ by counting. For complex models, it may be difficult to draw from $[\theta \mid Y_{(-i)}]$, and the computational expense of repeating the process for each $y_i$ is likely to be prohibitive.

We therefore consider algorithms that make use of a single sample $\theta^{(1)}, \theta^{(2)}, \ldots$ from the full-data posterior distribution of the parameters $[\theta \mid Y]$, obtained from a Markov chain Monte Carlo sampler or any other sampling method. Such an algorithm, also based on (2), draws a sample from $[\theta \mid Y_{(-i)}]$ by importance-reweighting the original sample by case-deletion weights $w_n \propto [\theta^{(n)} \mid Y_{(-i)}] / [\theta^{(n)} \mid Y] = [y_i \mid \theta^{(n)}, Y_{(-i)}]^{-1}$. Then for each $\theta$ in the new sample, the algorithm draws from $[y_i^{\rm rep} \mid \theta, Y_{(-i)}]$. On the other hand, Bradlow and Zaslavsky (1997a) illustrate that this reweighting scheme may be very inefficient, and in fact Peruggia (1997) demonstrates that even in a simple linear regression problem this reweighting scheme sometimes yields weights with infinite variance when $y_i$ is a highly influential observation. Bradlow and Zaslavsky (1997a) show that the density $[\theta_{(-i)} \mid Y_{(-i)}]$ can be simulated by a variety of importance-reweighting schemes and that in many cases one of these schemes will be quite efficient even though weighting by $[y_i \mid \theta^{(n)}, Y_{(-i)}]^{-1}$ is not practical. This suggests an algorithm based on (3), which first draws from $[\theta_{(-i)} \mid Y_{(-i)}]$, then from $[\theta_i \mid \theta_{(-i)}, Y_{(-i)}]$, and finally from $[y_i^{\rm rep} \mid \theta, Y_{(-i)}]$.

If $P(S_i(y_i^{\rm rep}) > S_i(y_i) \mid \theta)$ can be calculated easily, either of these algorithms may be made more efficient by averaging this probability over draws of $\theta$ (either by importance resampling or by calculating the weighted average over the entire sample $\theta^{(1)}, \theta^{(2)}, \ldots$) instead of drawing $y_i^{\rm rep}$ and counting. This is similar to the "Rao-Blackwellization" of influence measures described in Bradlow and Zaslavsky (1997a, sec. 2.1).

Further gains in efficiency can be obtained, when the draws $\theta^{(1)}, \theta^{(2)}, \ldots$ are autocorrelated (as with output from an MCMC sampler), using weighted systematic sampling. Assign to each draw $\theta^{(n)}$ in the original sample of $N$ draws an interval $[b_{n-1}, b_n)$, where $b_n = \sum_{n'=1}^{n} w_{n'}$ is the cumulative sum of the deletion weights, and $b_0 = 0$, $b_N = 1$. To obtain a sample of $M$ draws, select a random start $0 \leq Z < 1/M$ and draw the $\theta^{(n)}$ whose intervals contain the values $Z, Z + 1/M, \ldots, Z + (M-1)/M$. This is similar to systematic probability-proportional-to-size sampling in surveys (Sarndal, Swensson, and Wretman 1992, pp. 96-97).
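A sketch of this resampling step (again ours, under the same caveats as the previous code block) might look like the following; it returns indices into the original sample:

```python
import numpy as np

def systematic_resample(weights, M, rng=None):
    """Weighted systematic sampling: select M indices with probability
    proportional to `weights`, preserving the order of the (possibly
    autocorrelated) original MCMC draws."""
    rng = rng or np.random.default_rng()
    w = np.asarray(weights, dtype=float)
    b = np.cumsum(w) / w.sum()          # right endpoints b_1, ..., b_N (b_N = 1)
    z = rng.uniform(0.0, 1.0 / M)       # random start 0 <= Z < 1/M
    grid = z + np.arange(M) / M         # Z, Z + 1/M, ..., Z + (M-1)/M
    # Draw n is selected once for each grid point falling in [b_{n-1}, b_n)
    idx = np.searchsorted(b, grid, side='right')
    return np.minimum(idx, len(b) - 1)  # guard against rounding at the top end
```

Because the systematic grid walks through the sample in order, the selected draws are spread across the chain, which is what makes the method attractive for autocorrelated output.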

These algorithms can be modified to condition the outlier check on a statistic $T_i(y_i)$. One approach is simply to draw values repeatedly from $[y_i^{\rm rep} \mid \theta]$ until one is obtained for which the condition $T_i(y_i^{\rm rep}) = T_i(y_i)$ is satisfied. This is obviously inapplicable if $T_i$ is not discrete, and even if $T_i$ is discrete, it may be highly inefficient if the realized value $T_i(y_i)$ has small predictive probability. More efficient implementations exploit the structure of the data and probability distributions. In Section 4.2, we use the conditional independence (given $\theta$) of the scalar components of $y_i$ to sample efficiently from a distribution conditional on the pattern of missing data.

There are actually two subtly different approaches to drawing $y_i^{\rm rep}$ conditional on $T_i(y_i)$, whose differences are clarified by looking carefully at the corresponding algorithms. The conditional approach described above ignores any evidence about case $i$ in drawing $\theta$; algorithmically, each draw of $\theta$ produces one draw of $y_i^{\rm rep}$, regardless of how many draws are rejected before $T_i(y_i^{\rm rep}) = T_i(y_i)$ is satisfied. Another approach is equivalent to partitioning $y_i$ by a transformation $y_i \to (T_i(y_i), z_i)$ and redefining the data for the case to consist of only the component $z_i$, i.e., drawing from $[z_i^{\rm rep} \mid T_i(y_i), Y_{(-i)}]$. An algorithm for this check is to draw $\theta$ from $[\theta \mid Y_{(-i)}]$, draw $y_i^{\rm rep}$, and then discard both $\theta$ and $y_i^{\rm rep}$ whenever $T_i(y_i^{\rm rep}) \neq T_i(y_i)$. In this case, the rejection step implements the conditioning of $\theta$ on $T_i(y_i)$ as well as $Y_{(-i)}$; for this reason it differs from the previous algorithm, which only discards the draw of $y_i^{\rm rep}$. The two definitions of the check distribution have slightly different meanings, because the second (more conditional) version uses the information in $T_i(y_i)$ more heavily and will produce different results when that information is influential on $\theta$ and hence on the predictive distribution of $y_i$.

When $S(y_i)$ is an influence statistic, i.e., a function of the posterior distribution as in (4), then reweighting must be applied a second time to estimate the new conditional expectation (4). For a given draw $y_i^{\rm rep}$, each draw in the original sample from $[\theta \mid Y]$ is given weight proportional to $[\theta \mid Y_{(-i)}, y_i^{\rm rep}] / [\theta \mid Y_{(-i)}, y_i]$, which in a hierarchical model is equivalent to $[y_i^{\rm rep} \mid \theta] / [y_i \mid \theta]$. Although this calculation is slower than when $S$ is a simple function of $y_i$, it can be speeded by calculating $S$ using a subsample of the original sample and repeatedly reweighting it. For positively autocorrelated draws such as those generated by an MCMC sampler, as noted above, a systematic subsample may provide a close approximation to the results obtained with the full sample. This approach is helped by the fact that repeated reweighting of the same sample sometimes produces more accurate comparisons than would be obtained by comparing independent samples (Bradlow and Zaslavsky 1997a, sec. 5.1). Some imprecision in this calculation is acceptable because only the ordering of the posterior expectations is used in the calculation of $p_i$. Furthermore, even if the ordering calculated using a subsample is not exactly the same as that which would be obtained using the original sample, the p-values that are obtained are correct for that ordering and therefore are a valid measure of "surprise."

4 Examples

We consider two examples in this paper, both of which involve hierarchical models. The first example is a hierarchical linear model applied to a data set describing the growth of 30 rats, with measurements at five time points for each rat (Gelfand, Hills, Racine-Poon, and Smith 1990). The second example is a complex hierarchical model for ordinal survey data with "NA" responses; for each item, each subject either provided an ordinal rating or declined to respond (Bradlow 1994, Bradlow and Zaslavsky 1997b). A summary of each model and of the methods used to calculate case-deleted weights for diagnosing influence appears in Bradlow and Zaslavsky (1997a, Sec. 3-4).

4.1 Example I: A Random Coefficient Growth Curve Model

In this data set, the unit can be defined as either the weight of one rat $i$ at one time point $j$ or the vector of observations for one rat. With the first definition, we refer $S^{(1)}(y_{ij}) = y_{ij}$ to its predictive distribution given the data with only $y_{ij}$ deleted; we call the corresponding p-value $p^{(1)}_{ij}$.

Several checks can be defined for each rat, using various functions of the five-component vector of weights $y_i$. The model describes each rat's growth by its own slope and mean parameters, and it is natural to summarize each rat's data by the sample versions of these quantities, i.e., the sample mean and slope of the weights at the five time points for each rat. These yield the statistics $S^{(2)}(y_i) = \bar{y}_i$ and $S^{(3)}(y_i) = \hat{b}(y_i)$. Another way to summarize the data is by looking at the effect of each observation on the posterior means of the regression parameters for case $i$, for the mean ($\alpha_i$) and slope ($\beta_i$). We define $S^{(4)}(y_i) = E[\alpha_i \mid Y_{(-i)}, y_i]$ and $S^{(5)}(y_i) = E[\beta_i \mid Y_{(-i)}, y_i]$. Finally, each component $y_{ij}$ can be checked for outlyingness after deleting the entire rat's data; because $S^{(6j)}(y_i) = y_{ij}$ is a function of $y_i$, it is a legitimate check statistic for $y_i$, but the distribution against which it is checked is different from that used in the calculation of $p^{(1)}_{ij}$, because here the entire data vector for a rat is deleted.

For each case, we reweighted samples from the full-data posterior distribution for case deletion as described in Bradlow and Zaslavsky (1997a, Sec. 3) and then drew new values of the rat weights from the predictive distribution of $y_i$ (or of $y_{ij}$ for the single-observation checks, $S^{(1)}$ and $S^{(6j)}$). We used 200 draws to evaluate each of $p^{(1)}_{ij}$, $p^{(2)}_i$, $p^{(3)}_i$, and $p^{(6j)}_i$, and 100 draws for $p^{(4)}_i$ and $p^{(5)}_i$.
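For concreteness, here is a small sketch (ours) of the sample summaries underlying $S^{(2)}$ and $S^{(3)}$, assuming `times` holds the five measurement ages and `y` one rat's five weights:

```python
import numpy as np

def rat_summaries(times, y):
    """Sample mean and least-squares slope of one rat's growth record,
    the statistics S2 and S3 checked against their case-deleted
    predictive distributions."""
    times = np.asarray(times, dtype=float)
    y = np.asarray(y, dtype=float)
    s2 = y.mean()                       # sample mean level
    tc = times - times.mean()
    s3 = (tc @ y) / (tc @ tc)           # least-squares slope b-hat
    return s2, s3
```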

Because of the symmetry between high and low values in this model, we calculate both upper- and lower-tailed p-values for each diagnostic and report whichever of the two is more interesting (i.e., smaller).

Quantile plots for each of the sets of outlier diagnostics appear in Figure 1. In general the $p_i$ in each panel fall fairly close to the expected quantiles (diagonal line). In each plot, there are a few cases that are close to the extremes (0 or 1), and we examined these with larger simulation samples to determine whether they are in fact outlying relative to their predictive distributions. The rats with extreme p-values are summarized in Table 1. The values of $p^{(6j)}_9$ at each of the five time points are outlying (high), and the number of such outlying cases is beyond what would be expected with 150 observations under the model. Rat 14 is also outlying (high) at time 4, and moderately so at times 2 and 3. These rats are also outlying on the two measures of mean which treat the rat as the unit, those based on the sample mean, $p^{(2)}_9 \approx .0020$ and $p^{(2)}_{14} \approx .0236$, and those based on the posterior mean of $\alpha_i$, $p^{(4)}_9 \approx .05$ and $p^{(4)}_{14} \approx .03$. The realized value of $p^{(2)}_9$ is moderately surprising even as the extreme of a sample of 30 ($p = .058$).

Rat 2 is somewhat outlying on the slope measures, $p^{(3)}_2 \approx .0210$ and $p^{(5)}_2 \approx .03$, although none of the individual observations for rat 2 is particularly unusual. These p-values are not surprising as extreme values from a sample of 30.

Figure 2 plots the two measures of outlying overall level, $p^{(2)}_i$ based on a sample mean and $p^{(4)}_i$ based on the posterior mean of the mean parameter, and similarly the two measures of outlying slope, $p^{(3)}_i$ and $p^{(5)}_i$. As might be expected, the two

measures are closely related (with a few discrepancies), but the computational effort required for the sample measures is much less.

Figure 3 shows the relationship between $p^{(1)}_{ij}$ and $p^{(6j)}_i$. The difference between these diagnostics is that $p^{(6j)}_i$ indicates whether an individual observation (one rat at one time) is outlying when all the data for the rat are deleted, but $p^{(1)}_{ij}$ indicates whether the observation is "surprising" even after using the other observations in its prediction. We find that essentially the same observations are outlying by both diagnostics. Most of these are from rat 9, which was also found to be outlying due to its extremely high mean; due to shrinkage of the mean, the individual values are surprising even after the others are taken into account. Another outlier is rat 25 at the first time point, when it was the lightest of all the rats.

4.2 Example II: A Random Effects Item Response Model

The second example is based on a data set derived from a customer satisfaction survey conducted by the DuPont Corporation. The survey consisted of 20 items concerning satisfaction with various dimensions of DuPont's performance in one product line, and 102 questionnaires were returned. Each item had a 1 to 10 ordinal response scale; if a customer did not respond to a particular item, it was coded as an "NA" (no answer) response. Bradlow (1994) and Bradlow and Zaslavsky (1997b) describe a hierarchical model for these data with item and person parameters for satisfaction. There are also person parameters for general propensity to respond (saliency) and propensity (responsiveness) to respond when the potential response falls in an "indifference zone" corresponding to scores 4 to 7 on the 1-10 ordinal scale.

We now propose a number of check statistics that can be used to define outlier diagnostics for this model. These posterior predictive checks are similar to some of those in Bradlow and Zaslavsky (1997b) for global model-checking, with differences that have to do with the different purpose of the outlier checks, as discussed in Section 1.3. In Bradlow and Zaslavsky (1997b) we show that there are a variety of formulations of the model checks corresponding to different hypothetical schemes for regenerating the data, each of which conditions on the posterior distribution of a different subvector of the parameter vector. For outlier detection, the diagnostic is uniquely defined by the definition of $i$ and $S$ (and $T$ if relevant). As noted in the previous section, however, there is more than one way of defining the units and hence the index $i$.

(a) Define $i = (j,k)$ to be the index of the response by one person to one item, and let $S^{(1)}(y_i) = I(y_i = {\rm NA})$. The check looks for NA responses that appear when they are predicted to be unlikely, i.e., for which $P(y_{jk} = {\rm NA})$ is small. (The opposite tail would correspond to non-NA values when an NA is very probable, but the predicted probability of an NA is never over .56, so there are no outliers in that tail.) The form of this diagnostic allowed efficient calculation by averaging the probability of NA over the distribution of $\theta$, as suggested in Section 3, rather than by sampling and counting NA values.

(b) Again letting $i$ index single responses, let $S^{(2)}(y_i) = y_i$, conditioning on $T(y_i) = I(y_i \neq {\rm NA})$, the indicator for a non-NA value. This check looks simply for extreme values of the ordinal response. There are two possible versions of this, one parallel to $S^{(6j)}$ of Section 4.1, for which the entire case is deleted before resampling the parameter, and the other parallel to $S^{(1)}$, for which only a single value is deleted. We used the former in our calculations. Although $S^{(1)}$ and $S^{(2)}$ are functions of the same data, we need two distinct checks because we have no natural ordering of the extended response set 1, 2, ..., 10, NA.

(c) Let $i$ index all observations from a particular respondent. Define $S^{(3)}(y_i)$ as the number of NAs from respondent $i$. This extends $S^{(1)}$ to multivariate observations, detecting customers who are surprisingly unresponsive to the set of items. (A sketch of this check appears below.)

(d) Define $S^{(4)}(y_i)$ as the mean of the non-NA observations of $y_i$, which detects customers who are unusually satisfied or dissatisfied. To make this statistic interpretable, we condition on the pattern of the NAs for that respondent, i.e., $T(y_i)$ is the entire pattern of NAs. Otherwise, we would be comparing means based on different subsets of the items; this is not entirely sensible because different items have different overall satisfaction levels. This statistic extends $S^{(2)}$ to a multivariate observation. A host of other statistics could be defined similarly, such as the maximum of the ordinal values, the minimum of the ordinal values, the range or variance of the ordinal values, and so forth. For all of these, the same rationale applies for conditioning on the pattern of NA responses.

(e) Because the items also constitute a collection, we can identify items, rather than persons, with either unusual ordinal responses or unusual patterns of NA responses. Let $i$ index all of the responses to a particular item. Then check statistics can be defined analogously to those described at (c) and (d) above.

(f) Statistics constructed from posterior distributions of parameters, like those based on $\alpha_i$ and $\beta_i$ in the rat growth example, are less arbitrary than those of types (c) and (d). For example, although the sum of the ordinal responses appears plausible as a summary of overall satisfaction, it does not necessarily agree with the ranking of possible response vectors implied by the ranking of posterior means of the person satisfaction parameter. This approach gives a check which is more tightly tied to the model for satisfaction, although there is a large cost in additional computation.

Of the diagnostics listed above, we did not pursue the item diagnostics (e) because the items were regarded as a predetermined set. We also did not pursue diagnostics (f) based on influence measures, because preliminary analyses had shown a close relationship between $\bar{y}_j$ and influence on the satisfaction parameter for person $j$.
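As an illustration of diagnostic (c), here is a hedged sketch (ours, again assuming a user-supplied simulator `draw_responses(theta, rng)` that returns a full response vector with NAs coded as `np.nan`, and case-deletion weights computed as in Section 3):

```python
import numpy as np

def na_count_pvalue(thetas, weights, y_resp, draw_responses, rng=None):
    """Diagnostic (c): upper-tail p-value for the number of NA responses
    from one respondent, relative to its case-deleted predictive
    distribution estimated by reweighting the full-data posterior sample."""
    rng = rng or np.random.default_rng()
    w = np.asarray(weights, dtype=float)
    w /= w.sum()
    n_na_obs = np.isnan(y_resp).sum()
    # Weighted Monte Carlo estimate of P(#NA_rep >= #NA_obs | Y_(-i))
    exceed = np.array([np.isnan(draw_responses(th, rng)).sum() >= n_na_obs
                       for th in thetas])
    return float(w @ exceed)
```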

Several individual NA responses, examined by diagnostic (a), had low estimated probabilities of NA (minimum .003, and six with probabilities below .01). In each of these cases, the area of competence of the respondent (technical or sales) matched the topic area of the item, and hence the model predicted a low probability of an NA response. There are also particular reasons why some of the subjects involved would be expected to give ordinal scores, e.g., very high scores on other items, suggesting strongly held opinions. It is difficult to tell whether the extreme p-values are surprising as part of an ensemble, because due to the extreme discreteness of the outcome they cannot be referred to a uniform reference distribution. Stratifying by predicted probability of NA, we find that among the 158 person-items with predicted probability of NA below .005, the predicted total number of NAs is .52 but three are observed. The probability of observing three or more NAs is less than .002, so there is some evidence that these are outliers. The diagnostic of Hosmer and Lemeshow (1980) was calculated, using eight strata by probability of NA. The value of the diagnostic was 31.4, due to the surplus of NAs in the $P(y_{jk} = {\rm NA}) < .005$ stratum and a deficit in the $.1 < P(y_{jk} = {\rm NA}) < .2$ stratum, suggesting some lack of fit in the model for NA responses.

The diagnostics (b) for single ordinal responses indicated no surprisingly high ordinal ratings; the five smallest upper p-values (estimated on 200 draws each) were .010, .065, .065, .070, and .075, out of 1920 ordinal responses (excluding 120 NAs). This was predictable due to the large probabilities of responding in the higher response categories. The smallest lower-tail p-values (calculated with 1200 draws) also were not surprisingly small, compared to the expected order statistics of the uniform distribution. Nonetheless, the extreme cases identified by this check are interesting; several of them are extremely low scores (1 to 3 on a 10-point scale);

DRAFT of the improved model predictions for the ordinal values that are possible when the data on NA values are made available to the model, i.e. when  is drawn conditional on the pattern of missingness. Of the ten values p < :01 (out of 408 possible with 102 cases, two tails, and two statistics), ve were estimated as 0 based on 5000 draws. This signi cantly more than would be expected by chance if the

p-values were uniformly distributed, and even more surprising if the discreteness of the statistics is considered. Nine cases were identi ed by one or the other of the ordinal response diagnostics (including one that was outlying on both mean and range). Three cases had surprisingly low means. One of these was case 96, with a very low mean response of 3.5, the lowest of the ordinal response means (for which the mean across all cases was 8.23). This case and one other with a moderately low mean was identi ed as an in uential observation by Bradlow and Zaslavsky (1997a), and the remaining case was not. There were ve cases with unusually small ranges, including the single case for which all responses were identical, and four out of the ve for which only two adjacent categories were used. This suggests that some respondents may have a tendency to give very similar responses to all items and that this tendency is not captured by the model, which assumes a similar underlying response variance for every subject. Finally, two cases had a surprisingly large range (responses including both extremes of the scale). This may be another response behavior favored by some respondents but not described by the model.

5 Summary and conclusions

We have illustrated the applicability of an approach to outlier detection, based on a version of the posterior predictive p-value, to two complex hierarchical models. The algorithms are computationally feasible and the results are readily interpretable.

Outlier detection should not be a pro forma exercise, but rather an integral part of the iterative process of model fitting, model testing, and model improvement. Testing and rejecting outliers, although common in applied practice, is not entirely consistent (except when there is external evidence to indicate that an observation is affected by a known defect in the experimental process) with the Bayesian approach, which favors comprehensive modeling strategies rather than ad hoc approaches. This is not inconsistent with being concerned about the sensitivity of inferences to unusually influential observations, which we have explored for our examples in another paper (Bradlow and Zaslavsky 1997a). Here, though, we are more concerned with the implications of our findings for model modification.

The rat growth data set provides some evidence that the normal models used are not entirely adequate. An obvious direction for model modification would be to use a longer-tailed distribution at one or the other of the stages of the hierarchy. The fact that most of the outliers at individual time points are also outliers on the full-set measures for a rat suggests that the large deviations are appearing in the distributions of the parameters $\alpha_i$ and $\beta_i$.

Several model modifications are suggested by the outliers detected in the DuPont survey data. There is evidence of miscalibration of the model for NA responses, which might be improved by a longer-tailed latent variable distribution (producing more NAs at the extremes). A modification of this sort was examined by Bradlow (1994), who found that it had little impact on the inferences that were of interest, such as regression parameters and estimated satisfaction rankings. In this case, lack of fit in the model has little practical import. There is some evidence for two diametrically opposite patterns in the ordinal responses, neither of which is predicted by the model. Some subjects gave ratings clustered at both extremes of the scale, while others gave a set of ratings that were nearly identical across items. The model might be expanded by including a mixture component for respondents who exhibit these unusual response behaviors. We would conjecture that such a model would reduce the influence of these responses on parameter estimates, reflecting the fact that these patterns are inconsistent with the hypothesized process of consideration of each item by each respondent. A few of these cases were found to be highly influential (Bradlow and Zaslavsky 1997a), and model modification along these lines would be a principled basis for reducing this influence.

6 References

Box, G. E. P., and Tiao, G. C. (1968), "A Bayesian approach to some outlier problems," Biometrika, 55, 119-129.

Bradlow, E. T. (1994), Analysis of Ordinal Survey Data with "No Answer" Responses, Doctoral Dissertation, Harvard University.

Bradlow, E. T., and Zaslavsky, A. M. (1997a), "Case influence analysis in Bayesian inference," Journal of Computational and Graphical Statistics, 6, 314-331.

Bradlow, E. T., and Zaslavsky, A. M. (1997b), "A hierarchical latent variable model for ordinal data with 'No Answer' responses," unpublished manuscript.

Chaloner, K., and Brant, R. (1988), "A Bayesian approach to outlier detection and residual analysis," Biometrika, 75, 651-659.

Geisser, S. (1980), Discussion of a paper by G. E. P. Box, Journal of the Royal Statistical Society, Series A, 143, 416-417.

Geisser, S. (1987), "Influential observations, diagnostics, and discordancy tests," Journal of Applied Statistics, 14, 133-142.

Gelfand, A. E., Hills, S. E., Racine-Poon, A., and Smith, A. F. M. (1990), "Illustration of Bayesian inference in normal data models using Gibbs sampling," Journal of the American Statistical Association, 85, 972-985.

Gelman, A., Carlin, J. B., Stern, H., and Rubin, D. B. (1995), Bayesian Data Analysis, London: Chapman and Hall.

Hosmer, D. W., and Lemeshow, S. (1980), "Goodness of fit tests for the multiple logistic regression model," Communications in Statistics, Part A - Theory and Methods, 9, 1043-1069.

Peng, F., and Dey, D. K. (1995), "Bayesian analysis of outlier problems using divergence measures," The Canadian Journal of Statistics, 23, 199-213.

Peruggia, M. (1997), "On the variability of case-deletion importance sampling weights in the Bayesian linear model," Journal of the American Statistical Association, 92, 199-207.

Rubin, D. B. (1984), "Bayesianly justifiable and relevant frequency calculations for the applied statistician," Annals of Statistics, 12, 1151-1172.

Sarndal, C.-E., Swensson, B., and Wretman, J. (1992), Model Assisted Survey Sampling, New York: Springer-Verlag.

Sharples, L. D. (1990), "Identification and accommodation of outliers in general hierarchical models," Biometrika, 77, 445-453.

Verdinelli, I., and Wasserman, L. (1991), "Bayesian analysis of outlier problems using the Gibbs sampler," Statistics and Computing, 1, 105-117.

Weiss, R. E. (1996), "An approach to Bayesian sensitivity analysis," Journal of the Royal Statistical Society, Series B, 58, 739-750.

Weiss, R. E., Cho, M., and Yanuzzi, M. (1997), "On Bayesian calculations for mixture likelihoods and priors," unpublished manuscript.

Weiss, R. E., and Lazaro, C. G. (1992), "Residual plots for repeated measures," Statistics in Medicine, 11, 115-124.

West, M. (1984), "Outlier models and prior distributions in Bayesian linear regression," Journal of the Royal Statistical Society, Series B, 46, 431-439.

Zellner, A., and Moulton, B. R. (1985), "Bayesian regression diagnostics with applications to international consumption and income data," Journal of Econometrics, 29, 187-211.

Table 1: Data for rats with extreme p-values on one or more checks. Entries for $p^{(1)}_{ij}$ and $p^{(6j)}_i$ are given for each of the five time points ($j = 1, \ldots, 5$).

Rat # (i) | $p^{(1)}_{ij}$            | $p^{(2)}_i$ | $p^{(3)}_i$ | $p^{(4)}_i$ | $p^{(5)}_i$ | $p^{(6j)}_i$
2         | .185 .455 .330 .290 .085  | .3648       | .0210       | .39         | .03         | .1962 .4816 .3476 .3542 .1962
9         | .068 .006 .005 .003 .008  | .0020       | .0478       | .05         | .05         | .0394 .0022 .0024 .0004 .4816
14        | .145 .050 .025 .010 .090  | .0236       | .1376       | .03         | .11         | .0824 .0370 .0300 .0118 .3476
Draws     | 200                       | 5000        | 5000        | 100         | 100         | 5000 (1000 for case 9)

[Figure 1: Quantile plots of $p_i$ against the Uniform(0,1) distribution (ranks of $p_i$) for six posterior predictive check diagnostics, for the rat data set. Panels: (a) $S^{(1)}$ (using a single time point); (b) $S^{(2)}$ (using $\bar{y}$); (c) $S^{(3)}$ (using $\hat{b}$); (d) $S^{(4)}$ (using $\alpha$); (e) $S^{(5)}$ (using $\beta$); (f) $S^{(6)}$ (using a single time point).]

[Figure 2: Relationships between corresponding diagnostics based on sample values and posterior means, for the rat data: diagnostic 4 ($\alpha$) against diagnostic 2 ($\bar{y}$), and diagnostic 5 ($\beta$) against diagnostic 3 ($\hat{b}$).]

[Figure 3: Relationship between diagnostics $S^{(1)}$ (deleting only a single observation) and $S^{(6)}$ (deleting the entire case), for the rat data.]