Hierarchical Models for Employment Decisions

Joseph B. Kadane

George G. Woodworth



October 3, 1997

Abstract

Federal law prohibits discrimination in employment decisions against persons in certain protected categories. The common method for measuring discrimination involves a comparison of some aggregate statistic for protected and non-protected individuals. This approach is open to question when employment decisions are made over an extended time period. We show how to use hierarchical proportional hazards models (Cox regression models) to analyze such data. When decisions are made at one time, the proportional hazards model reduces to the familiar doubly constrained hypergeometric model.

Key words: Age Discrimination; Bayesian Analysis; Hierarchical Model; Proportional Hazards Model

*Joseph B. Kadane is Leonard J. Savage Professor of Statistics and Social Sciences, Department of Statistics, Carnegie Mellon University, Pittsburgh, PA 15213. George G. Woodworth is Professor, Department of Statistics and Actuarial Science and Department of Preventive Medicine, University of Iowa, Iowa City, IA 52242. The research of Joseph B. Kadane was partially supported by NSF Grant DMS-9303557. We thank Michael Finkelstein, Joseph Gastwirth, Bruce Levin, and John Rolph for their helpful comments on an earlier draft.

1 Introduction

Federal law forbids discrimination against employees or applicants because of an employee's race, sex, religion, national origin, age (40 or older), or handicap. General discrimination law (say, discrimination by race or sex) offers two somewhat distinct legal theories. A disparate treatment case involves policies that on their face treat individuals differently depending


on their (protected) group membership, like a rule prohibiting women from being firefighters. A disparate impact case, however, permits evidence that a facially neutral policy (say, a height requirement for firefighters) has the effect of making it relatively more difficult for women than men to obtain such employment. If the data show a pattern of unfavorable actions (firing, failure to hire, failure to promote, low raises, etc.) disproportionately against the protected group, this can establish or help to establish a prima facie case against the defendant. A prima facie case does not establish the defendant's liability. Instead it shifts the burden of producing evidence to the defendant to explain the business necessity of the disproportionately adverse actions taken against the protected group. In such a case, the employer would have to justify the requirement in terms of the needs of the job, and the fact finder would have to determine whether the justification is a pretext for discrimination or not (Gastwirth (1992) and Kadane and Mitchell (1998)). Race and sex discrimination cases fall under Title VII of the Civil Rights Act of 1964, whose provisions permit a prima facie case to be made by statistical evidence that members of the protected class are more likely to experience the adverse outcome of an employment decision. However, age discrimination cases are heard under the Age Discrimination in Employment Act of 1967, whose provisions allow differential treatment of employees based on "reasonable factors other than age", which could be interpreted as barring a disparate impact age discrimination case. The Supreme Court in Hazen Paper v. Biggins (123 L. Ed. 2d 338, 113 S. Ct. 1701 (1993)) explicitly declined to decide this matter. Various courts and judges have discussed it (Judge Greenberg in DiBiase v. SmithKline Beecham Corp., 48 F.3d 719 (1995), Judge Posner in Finnegan et al. v. Transworld Airlines, 967 F.2d 1161 (7th Circuit, 1992), and the references cited there). However this legal debate is resolved, we expect that statistical evidence of how an employer's policies affect older workers will continue to be relevant, for the following reason. Federal Rule of Evidence 401 defines relevant evidence as evidence that has "any tendency to make the existence of any fact that is of consequence to the determination of the action more probable or less probable than it would be without the evidence". The issue in disparate treatment cases is establishing the intent of the employer. If an analysis shows that the facially neutral policy of the employer did differentially harm older workers, that reasonably makes it more probable that the employer intended the harm. Thus we expect our analyses to continue to be relevant, regardless of the fate of the doctrine of disparate impact in age

discrimination cases. In this paper we advocate the use of Bayesian analysis of employment decisions, which raises a second legal issue concerning the relevance of such analysis to age discrimination cases. The rules on what constitutes admissible expert testimony in U.S. Courts have changed. Under the Frye rule (Frye v. United States, 54 App. D.C. 46, 47, 293 F 1013, 1014 (1923)), expert opinion based on a scientific technique is inadmissible unless the technique is "generally accepted" as reliable in the scientific community. Congress adopted new Federal Rules of Evidence in 1975. Rule 702 provides: "If scientific, technical or other specialized knowledge will assist the trier of fact to understand the evidence or to determine a fact in issue, a witness qualified as an expert by knowledge, skill, experience, training or education, may testify thereto in the form of an opinion or otherwise". In the case of Daubert, et ux. etc. et al. v. Merrell Dow Pharmaceuticals, Inc. (509 U.S. 579; 113 S. Ct. 2786 (1993)), the Supreme Court unanimously held that Federal Rule 702 superseded the Frye test. The Daubert decision, continuing with dicta (of lesser standing than holdings) of seven Supreme Court justices, goes on to define "scientific ... knowledge": "The adjective `scientific' implies a grounding in the methods and procedures of science. Similarly, the word `knowledge' connotes more than subjective belief or unsupported speculation". Thus the Daubert decision might be read as casting doubt on the admissibility of Bayesian analyses in Federal Court, since the priors (and likelihoods) are intended to express subjective belief. We think that this reading of Daubert is occasioned by a misinterpretation of what Bayesian statisticians mean by subjectivity. The alternative, i.e., claims of objectivity in the sense that anyone who disagrees is either a fool or a knave, is without basis, and appears to be an attempt at proof by verbal intimidation. But to hold, as we do, that every model (including frequentist models) reflects and expresses subjective opinion is not to hold that every such opinion has an equal claim on the attention of a reader or a court. For an analysis to be most useful it should be persuasive to a trier of fact (judge or jury) that an analysis done with his or her own model (likelihood and prior) would result in similar conclusions. This can be done with a combination of arguments based on reasons for the chosen likelihood and prior (other data, scientific theory, etc.) and robustness (the conclusions would be similar with other models "not too far" from the one analyzed). Thus our view is that an analysis ought neither to be admissible nor inadmissible because it uses a subjective Bayesian approach. Instead its admissibility ought to depend on its persuasiveness in explaining, with a

combination of specific arguments and robustness, why a trier of fact's conclusions might be similar to, and hence influenced by, the analysis offered. We believe that a properly grounded and explained Bayesian model, of the kind proposed here, is both admissible and relevant in age discrimination cases. We confine our discussion to binary employment decisions such as hiring, job assignment, promotion, layoff, or termination. The outcome of such a decision is either favorable or unfavorable to the employee, who may or may not be in a legally protected class. We use age discrimination in termination decisions to illustrate our ideas because age discrimination cases dominate our experience. Age discrimination over an extended time period is more difficult to model, both because the same individual can over time move from the unprotected to the protected class, and because, unlike gender or race, age is a continuous characteristic and consequently the hazard rate may vary within the protected class. Following Finkelstein and Levin (1994), we find that proportional hazards (Cox regression) models provide the flexibility to deal with these issues, yet reduce to familiar hypergeometric, and product of hypergeometric, models for certain data configurations.

2 Proportional Hazards Models

Suppose that a firm makes one or more employment decisions (e.g., terminations) at times $t_1, \ldots, t_p$. Individuals who could have been affected by decisions at time $t_i$ are said to be at risk and collectively constitute the risk set at that time. Minimum data for making a prima facie case consist of cross-tabulations of each risk set by status (protected/not protected) and by outcome (adverse/favorable); see for example Table 1 of Kadane (1990). However, in the case of age discrimination, it is desirable to know the exact ages of each member of the risk set at each termination; see for example the appendix to Finkelstein and Levin (1994). Perhaps it is most convenient to obtain flow data for each individual who was at risk at one or more of the decision times under analysis. Flow data consist of the beginning and ending dates of each employee's period of employment, that employee's birth date, and the reason for separation from employment (if it occurred). We have seen no examples in which employees were re-hired for non-overlapping terms, but such cases could easily be handled by entering one data record for each distinct period of employment.

What is required is sufficiently detailed information to identify who was at risk at the time of each decision, and to determine age and outcome for each person at risk. Suppose that we have data on all individuals whose periods of employment include a subset of the interval $[t_1, t_p]$. Employees who leave employment at or before $t_p$ will have a reason for separation, for example death, voluntary retirement or resignation, layoff with no subsequent recall, or involuntary termination. Table 1 is a fragment of a data set gathered in a hypothetical age discrimination case. Data were obtained on all persons employed by the firm at any time between 01/01/94 and 01/31/96. Entry Date is the later of 01/03/1994 or the date of hire. The first record is censored in the sense that, since that employee was still in the work force as of 1/31/96, we are unable to determine the cause of his or her eventual separation from the firm (involuntary termination, death, retirement, etc.). Such data are obtained from the employer by the plaintiff in the pretrial discovery phase. It is generally necessary for the plaintiff's attorney to justify the need for obtaining data over a particular time period; for example, it might be the period from the imposition of a particular policy to the end of the plaintiff's employment. Frequently the defendant can convince the court to narrow the scope of the data provided, arguing for example that retrieving records more than five years old or linking records involving employee transfers between divisions would be burdensome.

  Employee ID   Birth Date   Entry Date   Separation Date   Reason
  01            05/23/48     07/27/84     .                 .
  02            12/17/31     01/03/84     11/20/84          Inv Term
  03            03/14/48     06/29/84     07/27/84          Inv Term
  04            02/26/40     10/05/84     06/07/85          Resigned
  ...           ...          ...          ...               ...

Table 1: Flow data for the period January 1, 1994 to December 31, 1996.

Prior to Finkelstein and Levin (1994), we suspect that most statisticians would have aggregated the flow data into a small number of two-by-two contingency tables. When there are a few mass terminations, as in Kadane (1990), the aggregation is natural; otherwise the analyst would probably aggregate to monthly, quarterly, or annual data depending on the number of terminations, e.g., for quarterly data, the risk set would be all individuals on the work force for all or part of a given quarter and adverse outcomes would be recorded for individuals involuntarily terminated during that quarter.
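To make the data-handling step concrete, here is a minimal sketch (not the authors' code) of turning flow records like those in Table 1 into a risk set and the counts used below; the field names, the age-40 protected-class rule, and the helper functions are illustrative.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class FlowRecord:
    emp_id: str
    birth: date
    entry: date
    separation: Optional[date]   # None if still employed at the end of the study period
    reason: Optional[str]        # e.g. "Inv Term", "Resigned", or None if censored

def age_on(rec: FlowRecord, day: date) -> float:
    return (day - rec.birth).days / 365.25

def risk_set(records, day: date):
    """Everyone employed on `day`: entered on or before it and not yet separated."""
    return [r for r in records
            if r.entry <= day and (r.separation is None or r.separation >= day)]

def two_by_two(records, day: date, terminated_ids):
    """Counts (N_i, n_i, k_i, x_i) for one decision time: at risk, protected (age 40+),
    terminated, and terminated-and-protected."""
    at_risk = risk_set(records, day)
    N = len(at_risk)
    n = sum(age_on(r, day) >= 40 for r in at_risk)
    k = sum(r.emp_id in terminated_ids for r in at_risk)
    x = sum(r.emp_id in terminated_ids and age_on(r, day) >= 40 for r in at_risk)
    return N, n, k, x
```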

Let $N_i$, $n_i$, $k_i$, $x_i$, $1 \le i \le p$, denote at time $t_i$ the number of employees at risk, the number of those who were in the protected group, the total number terminated, and the number of protected employees who were terminated, respectively. Employment decisions are an almost unique example of a genuinely doubly conditioned two by two table. Given the risk set, the number of protected employees is clearly fixed, and as a matter of law, the employer is not liable for the number of persons terminated, only for whom it chose to terminate given the total. (See Kadane (1990), section 3.2, for further discussion of this point.) It is assumed that employment decisions made at time $t_i$ are conditionally independent of decisions made at other times given the risk sets and number of terminations. Let $e_{isr}$ be the expected number of employees of status $s$ (1 = protected, 0 = not protected) who experience outcome $r$ (1 = involuntary termination, 0 = other) at time $t_i$. Let $\theta_i = (e_{i11} \cdot e_{i00})/(e_{i01} \cdot e_{i10})$ be the odds ratio at time $t_i$. In this notation, the likelihood function is

$$\prod_{i=1}^{p} \frac{\binom{n_i}{x_i}\binom{N_i - n_i}{k_i - x_i}\,\theta_i^{x_i}}{\sum_{x=0}^{k_i}\binom{n_i}{x}\binom{N_i - n_i}{k_i - x}\,\theta_i^{x}}, \qquad (1)$$

where for any positive integer $a$, $\binom{a}{b} = 0$ if integer $b$ is not in the interval $[0, a]$.
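For illustration, a single factor of likelihood (1) can be evaluated directly; the sketch below (Python, with hypothetical helper names) is not production code, but it shows how one decision time's 2 x 2 table enters the likelihood.

```python
from math import comb, exp

def hypergeom_term(N, n, k, x, theta):
    """One factor of likelihood (1): the extended (noncentral) hypergeometric probability
    of x protected terminations among k, given N at risk of whom n are protected, at odds
    ratio theta.  math.comb(a, b) returns 0 when b > a, which matches the convention
    stated after (1) for the arguments that arise here (0 <= x <= k)."""
    numer = comb(n, x) * comb(N - n, k - x) * theta ** x
    denom = sum(comb(n, j) * comb(N - n, k - j) * theta ** j for j in range(k + 1))
    return numer / denom

# Example: the day-603 table from Case K (Table 2): N=293, n=170, k=10, x=9,
# evaluated at theta = 1 (no effect) and at theta = exp(1.59), roughly the
# posterior mean log odds ratio reported for that day.
print(hypergeom_term(293, 170, 10, 9, 1.0))
print(hypergeom_term(293, 170, 10, 9, exp(1.59)))
```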

Kadane (1990), in a fantasy cross-examination, explores the weakness of the philosophical underpinnings of the use of significance tests to quantify strength of evidence in litigation. He suggests that the conventional frequentist explanation of significance tests does not produce probability statements relevant to the particular case at trial. He argues that Bayesian analysis, in contrast, answers the relevant legal question, "namely, what the probability is that the [employer's] policy discriminated against people in the protected class". Kadane considered two models for the prior distribution of the odds ratios. In the homogeneous, single parameter, case he gave the log odds, $L = \log(\theta)$, a normal distribution with zero mean and fixed variance. He computed the posterior probability of adverse impact ($L > 0$) of the employer's policy on the protected class for various values of the prior variance. For inhomogeneous odds ratios, he assumed independent distributions for

the log odds, $L_i = \log(\theta_i)$, for each wave of terminations and computed the probability of adverse impact separately for each wave of terminations, again for various values of the prior variance. The assumption of independent odds ratios is rather strong; it seems more likely that an employer's policy at one round of terminations would bear some relationship to previous policy. Consequently, in the next section we propose a hierarchical prior model in which the log odds are assumed to be neither independent nor identical.

Finkelstein and Levin (1994) showed how to use proportional hazards (Cox regression) models to deal with disaggregated employment decision data and to incorporate variations in log odds within the protected class. Cox (1972) considers a group of individuals at risk for a particular type of failure. Individuals can enter the risk set (be hired) at any time and leave the risk set at any time either by failure (involuntary termination) or for other reasons (death, voluntary resignation, re-assignment, retirement). Let $T_j$ be the time to failure for the $j$th employee, let $S_j(t) = P(T_j \ge t)$ be the survival function, and let $\lambda_j(t)$ be the corresponding hazard function. Let $R_i$ be the risk set, $D_i$ the set of individuals who were involuntarily terminated, and $C_i$ the censored individuals at time $t_i$, that is, employees who left the risk set for reasons other than involuntary termination or who were still at risk at the end of the period of observation. The conditional likelihood, given that the individuals in $(D_i \cup C_i)$ failed no earlier than time $t_i$, is

$$\prod_{i=1}^{p} \prod_{j \in D_i} \lambda_j(t_i).$$

Conditioning on the failure times $t_i$, on the numbers of individuals in $D_i$, and on the risk sets $(R_i)$, the likelihood becomes

$$\prod_{i=1}^{p} \frac{\prod_{j \in D_i} \lambda_j(t_i)}{\sum_{\pi \in P_i} \prod_{j \in D_i} \lambda_{\pi(j)}(t_i)},$$

where $P_i$ is the set of all distinct permutations of the individuals in $R_i$. Two permutations $\pi$ and $\pi'$ are distinct if they produce distinct sets of failures ($\{\pi(j) \in D_i\} \ne \{\pi'(j) \in D_i\}$). The proportional hazards model assumes that $\lambda_j(t) = \lambda_0(t)\exp(z_j^T(t)\beta)$; consequently the conditional likelihood is

$$\prod_{i=1}^{p} \frac{\prod_{j \in D_i} \exp(z_j^T(t_i)\beta)}{\sum_{\pi \in P_i} \prod_{j \in D_i} \exp(z_{\pi(j)}^T(t_i)\beta)}. \qquad (2)$$

In its simplest form, the analysis of employment decisions involves comparing the termination rate of protected employees (e.g., age 40 and above) with that of unprotected individuals. In this case $z_j(t)$ is a binary variable indicating whether or not employee $j$ was in the protected class at time $t$, $\beta$ is the log-odds on termination of protected vs. unprotected employees, and equation (2) reduces to a special case of equation (1) with $\theta_i = \exp(\beta)$. Finkelstein and Levin (1994) suggested setting $z_j(t) = (\mathrm{age}_j(t) - 40)^+$, where $\mathrm{age}_j(t)$ is the age of employee $j$ at time $t$. In this case, $\beta$ is the increase in the log odds per year of age over 40.
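A small sketch of one factor of the conditional likelihood (2) for a single decision time, using the Finkelstein-Levin covariate $(\mathrm{age}_j(t) - 40)^+$. It enumerates the distinct candidate failure sets directly, which is exact but practical only when the number of terminations at that time is small; the function name and the example risk set are illustrative.

```python
from itertools import combinations
from math import exp

def cox_term(ages, terminated_idx, beta):
    """One factor of conditional likelihood (2) at a single decision time with scalar
    covariate z_j = max(age_j - 40, 0).  `ages` lists the ages of everyone in the risk
    set; `terminated_idx` gives the positions of those involuntarily terminated.  The
    denominator sums over every distinct set of len(terminated_idx) potential failures."""
    z = [max(a - 40.0, 0.0) for a in ages]
    k = len(terminated_idx)
    numer = exp(beta * sum(z[j] for j in terminated_idx))
    denom = sum(exp(beta * sum(z[j] for j in subset))
                for subset in combinations(range(len(ages)), k))
    return numer / denom

# Hypothetical risk set of five employees, the two oldest terminated.
print(cox_term([32, 38, 45, 52, 61], terminated_idx=(3, 4), beta=0.05))
```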

2.1 Hierarchical Priors

Often, the analysis involves terminations taking place over several years, in which case it may be more appropriate to allow $\beta$ to vary over time. Here, the conditional likelihood, given the risk sets, termination times, and numbers terminated, is

$$L(\beta) = \prod_{i=1}^{p} \frac{\prod_{j \in D_i} \exp(z_j^T(t_i)\beta_i)}{\sum_{\pi \in P_i} \prod_{j \in D_i} \exp(z_{\pi(j)}^T(t_i)\beta_i)}, \qquad (3)$$

where $\beta_i$ is the hazard parameter at time $t_i$. We model $\beta_i = \beta(t_i)$ as values of an $r$-dimensional Gaussian process (the hazard process) $\beta(t)$. Hereafter we consider the scalar case ($r = 1$). A hierarchical model for the prior distribution of $\beta = (\beta_1 \cdots \beta_p)^T$ has the form $f(\beta) = g(\beta \mid \gamma)h(\gamma)$, where $g$ is a $p$-variate normal density with mean vector $\mu(\gamma)$ and covariance matrix $\Sigma(\gamma)$, both of which are functions of a vector of hyperparameters $\gamma$ with density $h(\gamma)$. It is desirable to choose a prior distribution which is likely to be widely acceptable. For this purpose we suggest a smoothness prior (see Gersch 1982 and the references cited there).

2.2 Smoothness Priors

Let $\beta(t)$ be the Gaussian process generating $\beta_1 \cdots \beta_p$. A smoothness prior requires that we have an opinion about the second derivative of $\beta(t)$. Wahba (1978) introduced the representation

$$\beta(t) = \beta_0 + \beta_1 t + \tau^{-.5}\int_0^t (t - u)\, dW(u), \qquad (4)$$


where $t$ has been rescaled to the unit interval, $W(u)$ is a Wiener process on $0 \le u \le 1$, and $\beta_0$ and $\beta_1$ have diffuse prior distributions, say normal with zero means and small known precision $\varepsilon$. This representation is natural, since both Bayes and maximum likelihood estimates of $\beta_1 \cdots \beta_p$ are points on a cubic smoothing spline. From (4), Kohn and Ansley (1987) derived the state space representation

$$\beta_i = \beta_{i-1} + d_i \beta'_{i-1} + d_i u_i, \qquad \beta'_i = \beta'_{i-1} + v_i, \qquad 2 \le i \le p, \qquad (5)$$

where $d_i = t_i - t_{i-1}$, $\beta_1 = \beta_0$, $\beta'_1 = \beta_1$, $\beta'_i = d\beta(t)/dt$ at $t_i$, and $(u_i, v_i)$, $2 \le i \le p$, are independent bivariate normal vectors with zero means and covariance matrices

$$\frac{d_i}{6\tau}\begin{bmatrix} 2 & 3 \\ 3 & 6 \end{bmatrix}. \qquad (6)$$

The Kohn-Ansley state space representation (5) is easy to write as a directed graphical model of the type implemented in the general-purpose Bayesian package BUGS (Spiegelhalter, et al. 1996a); however, we found that the resulting Markov chain mixes very slowly and therefore seek a better conditioned representation. The finite difference analog of (4) is

$$\beta_1 = \beta_1, \qquad \beta_2 = \beta_1 + \beta'_2(t_2 - t_1), \qquad \beta_j = \beta_1 + \beta'_2(t_j - t_1) + \sum_{i=3}^{j} (t_j - t_{i-1})\beta''_i,$$

where $\beta'_i = (\beta_i - \beta_{i-1})/d_i$, $\beta''_i = \beta'_i - \beta'_{i-1}$, and $\beta''$ is the $(p-2) \times 1$ vector of second differences. From (5) and (6) we obtain the tridiagonal variance-covariance matrix $\Lambda$ of $(\beta_1, \beta'_2, \beta''_3, \ldots, \beta''_p)^T$,

$$\Lambda_{11} = 1/\varepsilon; \quad \Lambda_{22} = 1/\varepsilon + d_2/3\tau; \quad \Lambda_{jj} = (d_j + d_{j+1})/3\tau, \;\; 3 \le j \le p; \quad \Lambda_{j,j+1} = \Lambda_{j+1,j} = d_j/3\tau, \;\; 2 \le j \le p - 1, \qquad (7)$$

where $\varepsilon$ is the prior precision of the components of the initial state vector $(\beta_1, \beta'_1)$, and the smoothness parameter $\tau$ scales the prior precision of the second difference vector $\beta''$. See Spiegelhalter, et al. (1996b) for the use of a similar approach.
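One way to check one's understanding of (5)-(6) is to simulate the hazard process prior; the sketch below does so with the decision times rescaled to the unit interval. The values of $\varepsilon$ and $\tau$ and the function name are illustrative, not the paper's.

```python
import numpy as np

def simulate_beta_prior(times, tau, eps, seed=None):
    """Draw one realization of (beta_1, ..., beta_p) from the smoothness prior by running
    the state-space recursion (5) with innovation covariance (6).  `eps` is the small
    precision of the diffuse initial level and slope; `times` should be rescaled to [0, 1]."""
    rng = np.random.default_rng(seed)
    t = np.asarray(times, dtype=float)
    beta = np.empty(len(t))
    level = rng.normal(0.0, eps ** -0.5)        # beta_1, diffuse
    slope = rng.normal(0.0, eps ** -0.5)        # beta'_1, diffuse
    beta[0] = level
    for i in range(1, len(t)):
        d = t[i] - t[i - 1]
        cov = (d / (6.0 * tau)) * np.array([[2.0, 3.0], [3.0, 6.0]])   # equation (6)
        u, v = rng.multivariate_normal([0.0, 0.0], cov)
        level = level + d * slope + d * u       # level update in (5)
        slope = slope + v                       # slope update in (5)
        beta[i] = level
    return beta

# e.g. simulate_beta_prior(np.linspace(0, 1, 46), tau=0.0025, eps=1e-4, seed=1)
```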

To improve the condition of the prior distribution, we express $\beta = Uz$ as a linear combination of orthogonal basis vectors, where $u_1 = \mathbf{1}$ (the $p \times 1$ vector of 1's), $u_2 = t - \bar{t}\,\mathbf{1}$, and $u_3, \ldots, u_p$ are eigenvectors corresponding to the non-zero eigenvalues of $M^T \Lambda''^{-1} M$, sorted from smallest (least precise) to largest; $M$ is the second difference operator, i.e., $\beta'' = M\beta$, and $\Lambda''$ is the covariance matrix of $\beta''$. Clearly $z_2$ is the slope and $z_1$ the mean intercept of the graph of $\beta$ over the observation times. It can be shown that $z_3, \ldots, z_p$ are linear functions of the second differences of $\beta$ and are independent with zero means and precisions $\tau\lambda_i$, where $\lambda_i$ is the eigenvalue corresponding to $u_i$, $3 \le i \le p$. In applications $\varepsilon$, the precision of the initial state vector $(\beta_1, \beta'_1)$, is small both in absolute terms and compared to the smoothness parameter $\tau$. In that case it can be shown that the prior correlation between $z_1$ and $z_2$ is approximately $1/\sqrt{5}$ and their prior conditional precisions given $z_3, \ldots, z_p$ are less than $\varepsilon$. Thus, since $z_1$ and $z_2$ have vague priors and are generally well identified by the data, the specification of their prior distribution is not critical and we treat them as independent, zero mean normal variates with precision $\varepsilon$. Thus up to an additive constant the log posterior distribution of $(\beta, \tau)$ is

$$l(\beta, \tau) = l(\beta) - \frac{\varepsilon}{2}(z_1^2 + z_2^2) - \frac{\tau}{2}\sum_{j=3}^{p} \lambda_j z_j^2 + \ln(g(\tau)), \qquad (8)$$

where $\beta = Uz$, $l(\beta)$ is the log likelihood, and $g(\tau)$ is the prior distribution of the smoothness parameter $\tau$. In applications with unequal data spacing we find that the precisions, $\lambda_3, \ldots, \lambda_p$, of the principal components vary over several orders of magnitude, effectively forcing the last few components of $z$ to zero. To avoid numerical instability, we sometimes find it necessary to drop the last few terms. The goal is to estimate the log odds ratios $\beta_i$ and to compute the probability that the employer's policy discriminated against members of the protected class at time $t_i$; i.e.,

$$P(\beta_i > 0 \mid \text{Data}) = \frac{\int_{\beta_i > 0} \exp(l(\beta, \tau))\, d\beta\, d\tau}{\int \exp(l(\beta, \tau))\, d\beta\, d\tau}. \qquad (9)$$

Closed form integration in (9) is not feasible. Due to the high dimensionality of the parameter space, numerical quadrature is out of the question and the Laplace approximation (Kass, Tierney, and Kadane 1988) is computationally infeasible. For these reasons we chose to approximate moments and tail areas of the posterior distribution by Markov chain Monte Carlo methods (Tierney 1994) implemented in the BUGS package (Spiegelhalter, Thomas, and Best 1996a).
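Whatever sampler produces the posterior draws (BUGS here), the quantities reported in Tables 2 and 3 are simple summaries of the retained iterations. A sketch, assuming the sampled log odds ratios are stacked in an iterations-by-p array:

```python
import numpy as np

def posterior_summaries(beta_draws, burn_in=500):
    """Approximate the posterior mean, standard deviation, and P(beta_i > 0 | Data) of
    equation (9) from MCMC output, discarding the first `burn_in` iterations as in the
    runs reported in Tables 2 and 3."""
    kept = np.asarray(beta_draws)[burn_in:]
    return {
        "mean": kept.mean(axis=0),
        "stdev": kept.std(axis=0, ddof=1),
        "P[+]": (kept > 0).mean(axis=0),
    }
```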

The posterior distribution is sensitive to the smoothness parameter $\tau$, which scales the precision of the second differences of the log-odds ratio process $\beta(t)$. We think that the best way to form an opinion about $\tau$ is to note that the variance of its second difference, $\beta'' = \beta(t + 2d) - 2\beta(t + d) + \beta(t)$, is $2d^3/3\tau$ (see (7)). We illustrate this with two examples.

3 Examples

The data in these examples come from two real cases (we will call them Case K and Case W) in which an individual was suing a former employer for age discrimination in his or her dismissal.

3.1 Case K

In this case flow data for all individuals employed by the defendant at any time during the period 3/23/90 through 1/27/95 were available to the statistical expert. During that period 96 employees were involuntarily terminated, 79 of whom were age 40 or above at the time of termination. Involuntary terminations less than one week apart, with no intervening changes in the work force, were treated as simultaneous, producing 46 distinct termination times. The 46 $2 \times 2$ contingency tables are displayed in Table 2, along with the posterior means, standard deviations, and probabilities of adverse impact (positive log odds ratio) at each time point. The earliest and latest involuntary terminations were 1554 days apart, a span of almost five years. Over such a long period it is almost certainly the case that the log odds ratio was not constant. A smoothness prior places essentially no restriction on the size of the linear drift of the log odds over this period; however, since the data contain very little information about the smoothness of the log odds ratio process, the prior distribution of the smoothness hyperparameter must be specified with some care. We think that a total change greater than 8 in the log odds ratio (about a 3000-fold change in the odds ratio) is unlikely. If that entire change were concentrated in half of the observation period, the second difference based on the midpoint and endpoints would be 16. If that number represents two standard deviations, then, since in this case $d = .5$, we have $2d^3/3\tau \le 64$, or $.001 \le \tau$. We selected a gamma prior for $\tau$ with shape parameter 2 and mean .0025, which places about 2.5% of its mass outside the range $.001 \le \tau$. Shape parameters smaller than 2 produce numerical instability.
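The arithmetic behind this choice can be written out as a small helper (illustrative names; the gamma prior itself is specified separately): a judgment about the largest plausible second difference of the log odds ratio, treated as two prior standard deviations, is converted to a value of $\tau$ through the relation $\mathrm{var}(\beta'') = 2d^3/3\tau$.

```python
def tau_from_judgment(d, two_sd_second_difference):
    """Back the smoothness parameter out of an elicited bound: if a second difference of
    the log odds ratio over spacing d is judged to be at most `two_sd_second_difference`,
    treated as two prior standard deviations, then var = (bound / 2) ** 2 and
    tau = 2 * d ** 3 / (3 * var)."""
    var = (two_sd_second_difference / 2.0) ** 2
    return 2.0 * d ** 3 / (3.0 * var)

# Case K judgment: a total change of 8 concentrated in half of the (rescaled) observation
# period gives a second difference of 16 with d = 0.5, hence tau of roughly 0.0013.
print(tau_from_judgment(d=0.5, two_sd_second_difference=16.0))
```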

The statistical expert would be able to report that for terminations between days 580 and 1359, the probability exceeds .95 that the employer's policy at that time discriminated against employees aged 40 and above. The plaintiff in Case K had been dismissed on day 764; the calculations show that under the model, it is virtually certain that at the time, persons over 40 were more likely to be fired than were persons under 40. The case settled before trial. The use of a hierarchical model would enable the expert to particularize his statement to the plaintiff's case.

   t     N    n    k   x      mean     stdev    P[+]
   1    190  102    1   1  -1.02511   2.86596   0.3455
 173    208  110    1   0  -0.18502   1.69847   0.4470
 404    273  150    1   1   0.86237   1.24707   0.7660
 442    273  151    1   1   1.01801   1.16489   0.8130
 505    283  159    1   0   1.25531   1.02212   0.8980
 526    283  159    2   1   1.33219   0.97332   0.9215
 533    284  161    1   1   1.35753   0.95702   0.9290
 553    286  164    1   0   1.42781   0.91297   0.9455
 565    289  167    1   0   1.46800   0.88900   0.9590
 580    287  167    1   1   1.51681   0.86095   0.9655
 603    293  170   10   9   1.58836   0.81929   0.9805
 659    290  165    1   1   1.75572   0.74470   0.9965
 666    290  165    1   1   1.77326   0.73823   0.9965
 694    287  163    9   8   1.83634   0.71842   0.9985
 701    278  155    4   3   1.85113   0.71523   0.9985
 708    276  152    1   1   1.86538   0.71284   0.9990
 729    272  150    1   1   1.90427   0.70867   0.9975
 750    270  147    2   2   1.94064   0.70407   0.9975
 764    269  143    1   1   1.96343   0.70278   0.9980
 782    269  141    1   1   1.98984   0.70022   0.9990
 795    266  138    2   2   2.00795   0.69981   0.9995
 844    264  136    6   6   2.05602   0.71719   0.9980
 855    258  130    4   4   2.06394   0.72427   0.9980
 861    245  120    2   1   2.06783   0.72920   0.9980
 925    247  121    1   0   2.09506   0.79215   0.9955
 953    245  120    1   1   2.09184   0.81188   0.9965
 978    242  118    4   4   2.08186   0.82627   0.9975
1006    236  113    3   3   2.06325   0.83785   0.9975
1015    234  111    6   5   2.05654   0.84150   0.9970
1023    227  105    1   1   2.04992   0.84469   0.9970
1035    227  105    1   1   2.03825   0.84848   0.9960
1093    230  105    1   1   1.96348   0.85244   0.9920
1100    226  101    2   1   1.95302   0.85241   0.9925
1104    224  101    1   1   1.94684   0.85234   0.9930
1111    223  100    4   2   1.93560   0.85204   0.9925
1139    220   97    1   1   1.88624   0.84877   0.9925
1198    217   93    1   1   1.77452   0.82774   0.9865
1212    216   91    1   1   1.74789   0.81878   0.9855
1222    215   90    3   3   1.72966   0.81186   0.9855
1251    215   89    2   2   1.67673   0.79675   0.9840
1282    213   89    1   1   1.61819   0.79653   0.9825
1317    210   89    1   1   1.55095   0.83004   0.9730
1320    207   88    1   0   1.54510   0.83526   0.9730
1359    207   89    1   0   1.46912   0.95015   0.9535
1373    204   89    1   1   1.44271   1.01296   0.9335
1555    205   90    2   1   1.12597   2.64089   0.6600

Table 2: Case K. Aggregated flow data: t = time (days), N = workforce, n = protected class, k = terminated, x = terminated of protected. Posterior distribution of log odds ratios: mean, standard deviation, and probability that the log odds ratio is positive. Monte Carlo: two 1500-run simulations, discarding the first 500.

3.2 Case W

Data were made available to the plaintiff on all individuals who were in the workforce at any time during the study period. Dates of hire and separation and reason for separation were available, as well as age in years at entry into the data set (the first day of the study period or the date of hire) and at separation. The data request was made before the expert statistician was retained and birth dates had not been asked for, so there was some uncertainty about whether the handful of employees near the protected age (40 and older) at a given time were or were not in the protected class.

   t     N   n    k   x  CEO    mean    stdev    P[+]
 627    193  55    1   0   O    0.013   0.707   0.518
 649    192  55    8   0   O    0.010   0.605   0.516
 689    183  56   15   8   O    0.004   0.461   0.504
 729    169  48    1   0   O   -0.004   0.390   0.505
 775    169  49    4   0   O   -0.013   0.409   0.500
 823    164  47    1   0   O   -0.023   0.492   0.486
 866    163  47    5   1   O   -0.030   0.577   0.484
 967    157  45    6   5   O   -0.042   0.716   0.477
1104    148  43    3   1   O   -0.018   0.743   0.493
1185    145  43    2   0   O    0.013   0.675   0.522
1272    151  42    2   0   O    0.065   0.568   0.551
1349    151  42   19   6   O    0.128   0.476   0.611
1387    128  32    1   0   O    0.165   0.453   0.644
1452    125  31    4   0   O    0.237   0.456   0.705
1515    115  36    3   1   N    0.323   0.502   0.750
1538    110  34    1   1   N    0.359   0.527   0.763
1562    110  33    1   0   N    0.398   0.554   0.773
1631    103  32    3   0   N    0.521   0.623   0.807
1706    100  33   12   6   N    0.672   0.680   0.848
1727     86  26    2   2   N    0.717   0.692   0.858
1762     83  24    3   2   N    0.796   0.708   0.874
1853     75  22    1   1   N    1.012   0.724   0.920
2011     74  22    2   2   N    1.395   0.741   0.972
2052     70  19    1   0   N    1.497   0.788   0.970
2193     69  19    1   0   N    1.855   1.220   0.932
2226     68  19    1   1   N    1.942   1.383   0.920

Table 3: Case W. Aggregated flow data: t = time (days), N = workforce, n = protected class, k = terminated, x = terminated of protected, Old or New CEO. Smoothness prior is described in the text. Posterior distribution of the log odds ratios: mean, standard deviation, probability of adverse impact. Monte Carlo: four runs of 1500, discarding 500 from each.


We have not attempted to incorporate that uncertainty into this analysis and resolved the ambiguities by assigning uncertain cases to the unprotected class. At about the midpoint of the observation period, a new CEO was hired. The plaintiff called witnesses who testified that the new CEO had made remarks suggesting bias against older workers. Over the course of the study period the workforce was reduced by about two thirds; 103 employees were involuntarily terminated in the process. Aggregated flow data, along with posterior means, standard deviations, and probabilities of discrimination, are reported in Table 3. The smoothness parameter $\tau$ had a gamma prior distribution with shape parameter 4 and mean .0015. It is clear that the log odds ratio was close to zero before the new CEO arrived and that it increased substantially after his arrival. The plaintiff was laid off on day 1706 and was terminated three months later. The case was settled out of court.

4 Discussion

A standard criticism of Bayesian analyses is that the prior assumptions are arbitrary. One response is, "Compared to what?" Bayesian analyses can be explained to a jury in less convoluted ways than frequentist analyses and make explicit the sensitive assumptions in an analysis, rather than covering them with a mantle of false objectivity. The most commonly used frequentist approach assumes that the odds ratio is constant, whereas our hierarchical model recognizes that it is likely to change over time and includes the constant odds ratio as a special case. To some, the need to think carefully about the prior distribution of the smoothness parameter may seem fatally to open the analysis to attack by opposing counsel on the grounds of arbitrariness. To that we respond that the assumption of a constant odds ratio is not only arbitrary but implausible on its face, and that a more realistic analysis has a better chance of prevailing.

References

[1] Cox, D. R. (1972), "Regression Models and Life-Tables," Journal of the Royal Statistical Society, Ser. B, 34, 187-220.


[2] Finkelstein, Michael O., and Levin, Bruce (1994), "Proportional Hazard Models for Age Discrimination Cases," Jurimetrics Journal, 34, 153-171.

[3] Gastwirth, J. (1992), "Employment Discrimination: A Statistician's Look at Analysis of Disparate Impact Claims," Law and Inequality: A Journal of Theory and Practice, Vol. XI, No. 1, December 1992.

[4] Gersch, W. (1982), "Smoothness Priors," in Encyclopedia of Statistical Sciences, Vol. 8, eds. Samuel Kotz, Norman L. Johnson, and Campbell B. Read, New York: John Wiley & Sons, pp. 518-526.

[5] Kadane, Joseph B. (1990), "A Statistical Analysis of Adverse Impact of Employer Decisions," Journal of the American Statistical Association, 85, 925-933.

[6] Kadane, J., and Mitchell, C. (1998), "Statistics in Proof of Employment Discrimination Cases," in Controversies in Civil Rights: The Civil Rights Act of 1964 in Perspective, ed. B. Grofman, University of Virginia Press, forthcoming.

[7] Kass, R. E., Tierney, L., and Kadane, J. B. (1988), "Asymptotics in Bayesian Computation," in Bayesian Statistics 3, eds. J. M. Bernardo, M. H. DeGroot, D. V. Lindley, and A. F. M. Smith, 161-278.

[8] Kohn, R., and Ansley, C. F. (1987), "A New Algorithm for Spline Smoothing Based on Smoothing a Stochastic Process," SIAM Journal on Scientific and Statistical Computing, Vol. 8, No. 1, pp. 33-48.

[9] Spiegelhalter, D. J., Thomas, A., and Best, N. G. (1996a), BUGS: Bayesian inference Using Gibbs Sampling, Version 0.50, MRC Biostatistics Unit, Cambridge, UK.

[10] Spiegelhalter, D. J., Thomas, A., and Best, N. G. (1996b), BUGS 0.5 Examples, Volume 2, Chapter 10, MRC Biostatistics Unit, Cambridge, UK.

[11] Tierney, L. (1994), "Markov Chains for Exploring Posterior Distributions (with discussion)," The Annals of Statistics, 22, 1701-1762.

[12] Wahba, G. (1978), "Improper Priors, Spline Smoothing and the Problem of Guarding Against Model Errors in Regression," Journal of the Royal Statistical Society, Series B, 40, 364-372.