JOURNAL OF PERSONALITY ASSESSMENT, 81(3), 209–219 Copyright © 2003, Lawrence Erlbaum Associates, Inc.

STATISTICAL DEVELOPMENTS AND APPLICATIONS

Diagnosing Tests: Using and Misusing Diagnostic and Screening Tests

David L. Streiner
Kunin-Lunenfeld Applied Research Unit, Baycrest Centre for Geriatric Care
Department of Psychiatry, University of Toronto

Tests can be used either diagnostically (i.e., to confirm or rule out the presence of a condition in people suspected of having it) or as a screening instrument (determining who in a large group of people has the condition, often when those people are unaware of it or unwilling to admit to it). Tests that may be useful and accurate for diagnosis may actually do more harm than good when used as a screening instrument. The reason is that the proportion of false negatives may be high when the prevalence is high, and the proportion of false positives tends to be high when the prevalence of the condition is low (the usual situation with screening tests). The first aim of this article is to discuss the effects of the base rate, or prevalence, of a disorder on the accuracy of test results. The second aim is to review some of the many diagnostic efficiency statistics that can be derived from a 2 × 2 table, including the overall correct classification rate, kappa, phi, the odds ratio, positive and negative predictive power and some variants of them, and likelihood ratios. In the last part of this article, I review the recent Standards for Reporting of Diagnostic Accuracy guidelines (Bossuyt et al., 2003) for reporting the results of diagnostic tests and extend them to cover the types of tests used by psychologists.

Within the past few years, diagnostic and screening tests have been the focus of many articles in the popular press. On one hand, some governments and blood-collection agencies have been criticized and sued for not adequately screening blood and blood products for HIV and hepatitis. On the other hand, recent meta-analyses have cast doubt on the usefulness of both breast self-examination (Baxter & the Canadian Task Force on Preventive Health Care, 2001) and mammography (Olsen & Gotzsche, 2001) in younger women for preventing breast cancer, and a court decision has thrown out the polygraph, or lie detector, as evidence in criminal cases (Committee to Review the Scientific Evidence on the Polygraph, 2003; United States v. Scheffer, 1998). These reports have generated considerable uncertainty and confusion, and give rise to four questions: (a) What is the difference between diagnostic and screening tests?, (b) Under what circumstances are each of them useful?, (c) When can they do more harm than good?, and (d) What should be the minimum criteria for reporting studies about tests? Diagnostic and screening tests are similar in that they are used to detect the presence or absence of some attribute in

people. In some cases, the question is to determine how much of the attribute a person has (e.g., aptitude and intelligence tests, university or graduate school admissions exams), whereas in clinical settings, the people are often either unaware of whether they have it (e.g., tuberculosis or Tay-Sachs disease) or may be unwilling to admit its presence (e.g., using illicit drugs or having passed secrets to foreign governments). The difference between them depends on the way they are used and not how they are developed. Diagnostic tests are used when the person is suspected of having the attribute, and the purpose is to confirm this or rule it out, whereas screening tests, as the name implies, are given more broadly, primarily to large groups of asymptomatic people, in which the aim is to determine which (if any) of them have the attribute in question.1

1 The term screening test can be used to denote a briefer version of a diagnostic test, such as the Brief Symptom Inventory (Derogatis & Spencer, 1982). In this article, the term refers only to the way a test is used—that is, for screening purposes—and not to its length.

The groups who are given screening tests can range from everyone in the population (i.e., mass screening) to a more individualized, case-finding approach of people at high risk (Nielsen & Lang, 1999). Both diagnostic and screening tests can use a variety of formats—assays of blood, urine, or other bodily components; X rays; paper-and-pencil questionnaires; measures of physiological functioning; voice prints; and many others. Some screening tests have become so useful that they are applied routinely, such as testing for the presence of phenylketonuria (PKU) in newborns. If undetected, the absence of the enzyme that processes the amino acid phenylalanine can lead to severe mental retardation, seizures, and hyperactivity. However, if the child is put on a phenylalanine-restricted diet, then normal development is possible. In this article, I focus primarily on situations in which there is a dichotomous outcome, such as the presence or absence of a condition. I assume that tests that result in a continuum have been dichotomized into "present" or "absent," although I also touch briefly on using multiple cut points.

VALIDATING A TEST

To determine when diagnostic and screening tests are and are not useful, it is necessary to understand how they are developed and validated. The usual starting point is to assemble two groups of people, one of which is known to have the attribute and the other composed of people who are known to not have it. This raises two issues that are really opposite sides of the same coin: How can we form these groups if we have not yet developed the test, and why develop a new test if we already have one that we can use to form the groups? There are a number of reasons for replacing existing tests with new ones, including decreased cost or discomfort, improved accuracy, and greater timeliness. For example, the "gold standard" for detecting TB is the chest X ray. However, it suffers from a number of shortcomings. It exposes the person to ionizing radiation, and it requires expensive equipment and highly paid technicians and radiologists to take and interpret the pictures. Consequently, it has been replaced by the tuberculin skin test for screening purposes, despite the fact that the latter procedure sometimes produces false negative results in people with a defective immune system (as well as false positive results for reasons I discuss later). The gold standard for the diagnosis of Alzheimer's disease is a brain biopsy. However, because this can be done only after the patient has died, there has been considerable effort to develop tests that can be used while the person is still alive, ranging from brain imaging to psychological test batteries (e.g., Chen et al., 2000). There are some instances in which there is no existing test, and the researcher must resort to inducing the attribute experimentally. For example, if the purpose is to develop a screening test for illicit drugs, the researcher may give some people known amounts of the substance and others a placebo to determine if the new assay can differentiate between these groups. Many studies that try to validate the polygraph use the paradigm of telling participants in one group that they are to assume the role of someone who stole something, and to improve motivation, they can earn extra money if they fool the polygraph operator (e.g., Raskin & Hare, 1978). In the area of personality tests, scales to detect malingering or "faking good" are often validated by telling the respondents to answer as if they had a serious psychological problem or were in perfect psychological health (e.g., Bury & Bagby, 2002).

PROPERTIES OF A TEST

To return to the story line, once the two groups have been formed, both are given the test to be validated, resulting in the situation shown in Table 1. That is, there are four possible results: People who actually have the attribute were correctly detected by the new test (Cell A, true positives) or missed by it (Cell C, false negatives), whereas people who do not have the attribute were erroneously labeled as having it (Cell B, false positives) or correctly labeled as not having it (Cell D, true negatives). As an example, I use data from Rice and Harris's (1995) study validating a test of proneness to violence, the Violence Risk Appraisal Guide (VRAG). Their sample consisted of men who had committed a violent crime and who were followed to determine which ones would commit another violent crime within 3½ years. For reasons that I explain in the following sections, I do a bit of violence myself with their data and assume that there were 200 men, half of whom committed a violent crime within that time frame and half of whom did not.2,3 Using the optimal cut point on the VRAG, one can fill in the cells, as in Table 2.

2 Because all of the subsequent calculations deal with proportions and ratios, the actual number of people is immaterial. One hundred in each group is used to simplify the calculations, but the results will be the same if there were 1,000 in each group, or 259, or any other number.
3 It is recognized that some people in this latter group may actually have committed some violent act but have not been caught. The effect of this "classification error" of the gold standard is beyond the scope of this article, but see Fleiss (1981) for a full discussion of it.
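The tallying itself is easy to automate. A minimal Python sketch of how the four cells of Table 1 are filled in once both known groups have taken the test; the scores and cut point below are invented for illustration and are not the VRAG data:

```python
# Hypothetical illustration: tally the four cells of Table 1 from scored
# test results in two groups of known status. Scores and cut point are invented.

def tabulate(scores_with_attribute, scores_without_attribute, cut_point):
    """Return (A, B, C, D): true positives, false positives, false negatives, true negatives."""
    a = sum(s >= cut_point for s in scores_with_attribute)      # true positives
    c = len(scores_with_attribute) - a                           # false negatives
    b = sum(s >= cut_point for s in scores_without_attribute)   # false positives
    d = len(scores_without_attribute) - b                        # true negatives
    return a, b, c, d

# Invented example: 10 people known to have the attribute, 10 known not to.
has_attribute = [12, 15, 9, 14, 11, 16, 8, 13, 12, 10]
no_attribute = [7, 5, 11, 6, 4, 9, 5, 8, 3, 6]

print(tabulate(has_attribute, no_attribute, cut_point=10))   # (8, 1, 2, 9)
```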

TABLE 1
Classification of Results From a Validity Study of a Diagnostic Test

                              Gold Standard
Result of New Test      Present               Absent                Row Total
  Present               A (True Positive)     B (False Positive)    A + B
  Absent                C (False Negative)    D (True Negative)     C + D
  Column total          A + C                 B + D                 N = A + B + C + D

TABLE 2
Hypothetical Results From a Test of Proneness to Violence

Result of the Violence        Committed a Violent Act
Risk Appraisal Guide       Yes          No           Row Total
  Violent                  81 (A)       40 (B)       121
  Not violent              19 (C)       60 (D)       79
  Column total             100          100          200

Note. Data modified from Rice and Harris (1995). Prevalence = 100/200 = .500; sensitivity = 81/100 = .810; specificity = 60/100 = .600; positive predictive power (PPP) = 81/121 = .669; negative predictive power (NPP) = 60/79 = .759; incremental PPP = .669 – .50 = .169; incremental NPP = .759 – .50 = .259; quality PPP = (.669 – .50)/(1 – .50) = .338; quality NPP = (.759 – .50)/(1 – .50) = .518; likelihood ratio+ = .810/(1 – .600) = 2.025; likelihood ratio– = .600/(1 – .810) = 3.158; kappa = (141 – 100)/(200 – 100) = .410; odds ratio = (81)(60)/(40)(19) = 6.395; phi = [(81)(60) – (40)(19)]/√[(100)(100)(121)(79)] = .419; pretest odds+ = .50/(1 – .50) = 1.000; pretest odds– = (1 – .50)/.50 = 1.000; posttest odds+ = 1.000 × 2.025 = 2.025; posttest odds– = 1.000 × 3.158 = 3.158.

Column-Based Indexes

A number of attributes of the test, collectively known as diagnostic efficiency statistics, can be derived from these numbers. The first three—sensitivity, specificity, and the likelihood ratio—are conditional on the population. Because the 2 × 2 tables I work with, by tradition, have the population attributes as columns and the test results as rows, these three indexes are usually referred to as column based. The sensitivity of a test is defined as the proportion of people who have the attribute who are detected by the test. In Table 2, this would be

Sensitivity = A/(A + C) = 81/(81 + 19) = .810.  (1)

That is, 81% of people who later committed an act of violence were correctly picked up by the VRAG. The specificity of a test is the proportion of people without the attribute who are correctly labeled by the test, or

Specificity = D/(B + D) = 60/(40 + 60) = .600,  (2)

so that 60% of nonviolent people are accurately identified as being so by the test. One can combine the sensitivity and specificity into a single number called the likelihood ratio (LR+), which is defined as

LR+ = Sensitivity/(1 – Specificity) = True Positive Rate/False Positive Rate.  (3)

The LR+ is another index of the accuracy of the test and tells what the odds are that a positive test result has come from a person who has the attribute. When the LR+ is 1, the test is useless and does not contribute to making a diagnosis. For the VRAG, the LR+ is .81/(1 – .60) = 2.025, meaning that a positive test result is twice as likely for those who are violent as for those who are not. The equivalent formula for a negative test result is

LR– = Specificity/(1 – Sensitivity) = True Negative Rate/False Negative Rate,  (4)

which is .60/(1 – .81) = 3.158, indicating that a score below the cut point is three times as likely to have come from a person who is nonviolent.4 One advantage of LR+ and LR– is that when used with scales that have a continuous outcome, they can be calculated for a number of different cut points. This more closely corresponds to how we use such tests; the higher the score, the more likely it is to have come from someone with the disorder. Later, I discuss how the LR+ and LR– can be used to reflect the probability of having the trait.

Sensitivity, specificity, and the LRs are generally seen as fixed properties of the test (Sackett, Haynes, Guyatt, & Tugwell, 1991; Streiner & Norman, 1996). That is, as long as the test is used with similar groups of people, these attributes should not change. However, if the test is used with people who have different amounts of the trait in question, then sensitivity and specificity will have to be recalculated; for example, a test validated on inpatients with severe depression will likely have different properties when used with outpatients with dysthymia.

4 Note that some textbooks, such as Sackett, Haynes, Guyatt, and Tugwell (1991), have defined LR– as (1 – Sensitivity)/Specificity. This gives the LR of a negative test result, given that the person has the attribute, as opposed to a negative test result for someone without the attribute. The definition in the text is used because it makes the relationships among LR– and other diagnostic efficiency indexes much more direct.
5 The terms predictive power and predictive value are both used; the former primarily in the psychological literature, the latter in the medical literature.

Row-Based Indexes

Once a test has been validated and put into general use, it is generally used by itself, and we will not have any other test result against which we can assess it. That means that we must turn from being concerned with sensitivity and specificity—that is, the proportion of those who do and do not have the condition who are correctly classified—to the proportion of those people labeled by the test as having or not having the trait who in fact do and do not have it. In other words, we are interested in the proportions across the two rows of the table rather than down the columns (again, by tradition, the row-based indexes refer to the test results). These two test attributes are referred to as the positive predictive power (PPP) and the negative predictive power (NPP).5 They are defined in the next two equations using the data from Table 2:

Positive Predictive Power (PPP) = A/(A + B) = 81/(81 + 40) = .669,  (5)

and

Negative Predictive Power (NPP) = D/(C + D) = 60/(19 + 60) = .759.  (6)

In other words, of those who were predicted to commit a violent act according to their score on the VRAG, 67% actually went on to be violent, and of those labeled by the test as nonviolent, 76% were in fact not violent. Again, the VRAG seems to perform quite well.
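For readers who want to check such figures, a minimal Python sketch of Equations 1 through 6 using the Table 2 counts (the variable layout is mine, not part of any standard package):

```python
# Column- and row-based diagnostic efficiency statistics from the Table 2 counts.
A, B, C, D = 81, 40, 19, 60

sensitivity = A / (A + C)                   # .810  (Equation 1)
specificity = D / (B + D)                   # .600  (Equation 2)
lr_pos = sensitivity / (1 - specificity)    # 2.025 (Equation 3)
lr_neg = specificity / (1 - sensitivity)    # 3.158 (Equation 4, the article's definition)
ppp = A / (A + B)                           # .669  (Equation 5)
npp = D / (C + D)                           # .759  (Equation 6)

print(round(sensitivity, 3), round(specificity, 3), round(lr_pos, 3),
      round(lr_neg, 3), round(ppp, 3), round(npp, 3))
```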

Table-Based Indexes

It is also possible to derive a large number of indexes based on the table as a whole. Although these vary with regard to magnitude and their minimum and maximum values, Kraemer et al. (1999) showed how many of them are actually related to each other. Perhaps the simplest is the hit rate, which is also referred to as the overall correct classification (OCC). This is just the proportion of correct decisions, or (A + D)/N, which for Table 2 is (81 + 60)/200 = .71. The major problem with the OCC is that it does not account for agreement that may occur just by chance. That is, even if the diagnosis is made by flipping a coin, sometimes it will be correct, and failure to take this into consideration will inflate the apparent accuracy of the test. The most widely used statistic that corrects for chance agreement is Cohen's kappa (κ; Cohen, 1960), which is

κ = (No – Ne)/(N – Ne),  (7)

where No is the observed number of correct agreements, Ne the number of agreements expected by chance, and N is the total sample size.6 For these data, this works out to

κ = [(81 + 60) – (60.5 + 39.5)]/[200 – (60.5 + 39.5)] = (141 – 100)/(200 – 100) = .410,  (8)

which is considerably lower than the .71 found when I did not correct for chance agreement. Another way of writing the formula for κ, which Kraemer et al. (1999) referred to as "weighted kappa" or κK, is7

κK = (AD – BC)/[(A + B)(B + D)K + (C + D)(A + C)K′],  (9)

where K′ = (1 – K). When sensitivity and specificity are equally important, K = K′ = ½, and κ = κK. When the only concern is sensitivity, K = 1, and K = 0 when the only concern is specificity.

6 The expected number for cells A and D is calculated with the formula [(A + B)(A + C) + (C + D)(B + D)]/N. It is also possible to compute κ using proportions rather than counts, using the formula (po – pe)/(1 – pe), in which case the expected proportion is [(A + B)(A + C) + (C + D)(B + D)]/N².
7 This is a somewhat unfortunate choice of terms, as Cohen (1968) previously used the same one, which he abbreviated as κw, to refer to κ scaled for partial agreement.

A statistic more familiar to psychologists is the phi coefficient (φ), defined as

φ = (AD – BC)/[(A + B)(C + D)(A + C)(B + D)]½,  (10)

that is, using the same data,

φ = [(81)(60) – (40)(19)]/[(100)(100)(121)(79)]½ = .419.  (11)

The use of φ is justified for a number of reasons. It is the Pearson correlation8 for dichotomous data and is consequently a legitimate effect size indicating the strength of the association; because of this, φ² is related to the proportion of variance accounted for (R²) in regression equations (Kraemer et al., 1999); and Nφ² is equal to the familiar χ². One advantage of the more complicated equation for κK (Equation 9) is that it can be used to show the relationship between κ and φ:

φ = [κ(K = 0) × κ(K = 1)]½.  (12)

8 Phi (and, for that matter, the point biserial correlation) are merely computational simplifications of Pearson's r, devised in the days before computers obviated the need for such shortcuts.

Although relatively rarely encountered in psychological journals, what is referred to as the relative odds or odds ratio (OR),

OR = AD/BC = (81)(60)/[(40)(19)] = 6.395,  (13)

is often used in medical journals. Its main disadvantages are that its upper limit is unbounded (that is, if either Cell B or C has a value of zero, the upper limit is infinity) and that many people misinterpret relative odds as if it were relative risk (Streiner, 1998). For these and other reasons, Sackett, Deeks, and Altman (1996) recommended against its use. One reason for its continuation, though, is that it enters into calculations with the LR, as I discuss a bit later.
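The table-based indexes can be checked the same way. The sketch below reproduces the values in the note to Table 2 and numerically confirms the relationships in Equations 9 and 12; it is plain Python written for this article's numbers, not library code:

```python
import math

# Table-based indexes from the Table 2 counts.
A, B, C, D = 81, 40, 19, 60
N = A + B + C + D

occ = (A + D) / N                                             # .705, the hit rate (reported as .71)
expected = ((A + B) * (A + C) + (C + D) * (B + D)) / N        # agreements expected by chance
kappa = ((A + D) - expected) / (N - expected)                 # .410  (Equation 8)
phi = (A * D - B * C) / math.sqrt((A + B) * (C + D) * (A + C) * (B + D))   # .419 (Equation 11)
odds_ratio = (A * D) / (B * C)                                # 6.395 (Equation 13)

# Kraemer's weighted kappa (Equation 9) and its link to kappa and phi (Equation 12).
def kappa_k(k):
    return (A * D - B * C) / ((A + B) * (B + D) * k + (C + D) * (A + C) * (1 - k))

assert abs(kappa_k(0.5) - kappa) < 1e-9                       # K = K' = 1/2 recovers kappa
assert abs(math.sqrt(kappa_k(0) * kappa_k(1)) - phi) < 1e-9   # Equation 12 recovers phi

print(round(occ, 3), round(kappa, 3), round(phi, 3), round(odds_ratio, 3))
```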

THE EFFECT OF PREVALENCE ON THE PROPERTIES OF A TEST

As Meehl and Rosen (1955) pointed out nearly half a century ago, PPP and NPP are not like sensitivity and specificity in that they are not fixed properties of a test.

Rather, they are highly dependent on the prevalence or base rate of the condition in the population being tested [using the notation in Table 1, the prevalence is (A + C)/N]. This has direct consequences regarding the use to which any test is put. In this example, as in many validation studies, the prevalence is 50%; that is, half of the people had the condition (in this case, violence) and half did not. This is done because statistical tests are most efficient in these circumstances. However, if the test is used for screening purposes, the prevalence would be lower and with many conditions (e.g., HIV/AIDS or schizophrenia) considerably lower than 50%. How would the test perform under these circumstances? There are two equivalent ways to determine this: by using formulae and by redrafting the table. The formulae, known collectively as Bayes's theorem, were named after the Reverend Thomas Bayes and first published in 1763, 2 years after his death.9 For the PPP, it is

PPP = P(Dx | T+) = [P(Dx) × P(T+ | Dx)] / {[P(Dx) × P(T+ | Dx)] + [P(D̄x) × P(T+ | D̄x)]},  (14)

where P(Dx | T+) is the probability of having the diagnosis (Dx) given that the person has a positive test result (T+); P(Dx) is the probability of the diagnosis in the group being tested (i.e., its prevalence or base rate); P(D̄x) = 1 – P(Dx); P(T+ | Dx) is the probability of a positive test result given that the person has the diagnosis (i.e., the sensitivity); and P(T+ | D̄x) is the probability of a positive test result given that the person does not have the diagnosis (i.e., 1 – specificity). An equivalent way of writing the formula, using the definitions for each of the terms, is

PPP = (Prevalence × Sensitivity) / {(Prevalence × Sensitivity) + [(1 – Prevalence) × (1 – Specificity)]}.  (15)

Similarly, the formula for the NPP is

NPP = P(D̄x | T–) = [P(D̄x) × P(T– | D̄x)] / {[P(D̄x) × P(T– | D̄x)] + [P(Dx) × P(T– | Dx)]},  (16)

where P(D̄x | T–) is the probability that the person does not have the diagnosis given that the test result is negative (T–); P(T– | D̄x) is the probability of a negative test result given that the person does not have the diagnosis (i.e., specificity); and P(T– | Dx) is the probability of a negative test result given that the diagnosis is present (i.e., 1 – sensitivity), or

NPP = [(1 – Prevalence) × Specificity] / {[(1 – Prevalence) × Specificity] + [Prevalence × (1 – Sensitivity)]}.  (17)

Using the formulae, it is possible to determine how the VRAG will work in other situations (ignoring the fact that if the sample's characteristics change, the data for sensitivity and specificity may not hold). For example, after a number of instances of school children shooting and sometimes killing classmates and teachers, there was a call to screen all students for their proneness to violence. Assume that a school board decides to use the VRAG for this purpose,10 and also assume that the prevalence of violence is (an overly inflated estimate of) 5% [i.e., P(Dx) = .05]. Keeping the same values for the sensitivity and specificity, the PPP is

PPP = (.05 × .81) / [(.05 × .81) + (.95 × .40)] = .096.  (18)

In other words, of all the VRAG results that say "violent," only 9.6% will have been from children who are likely to actually become violent; the other 90.4% would be false positives from people who will not do anything violent. This is a universal, immutable law of tests: As the prevalence drops, so does the PPP, whereas the proportion of false positives increases. If the actual prevalence is 1% rather than 5%, the PPP is 2%, meaning that 98% of positive test results are wrong and are actually false positives. Can the test be used to identify nonviolent people? For this, we have to use the NPP:

NPP = (.95 × .60) / [(.95 × .60) + (.05 × .19)] = .984.  (19)

When the prevalence is low, the NPP is high; therefore, the test can be used to rule out a condition. In this case, over 98% of nonviolent students are so identified by the test. I note, though, that a probability of .984 is only marginally higher than the rate of nonviolence, which is .950. Later, I discuss ways of quantifying the increase in information provided by positive and negative test results. The alternative method of deriving these figures is to redraft the table, keeping the same numbers for the sensitivity and specificity but making the column totals reflect a 5% prevalence. To keep the numbers whole, assume that 10,000 people have been evaluated, although as mentioned in footnote 2, any number can be used with identical results. In Table 3, we begin by using 500 and 9,500 as the column totals.

9 This proves, perhaps, that it is possible to first perish and then publish.
10 I emphasize that the authors of the VRAG, who are aware of the base rate problem and the different nature of the populations, never intended it to be used in this way (G. T. Harris, personal communication, January 27, 2003).
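A short sketch of Equations 15 and 17 as Python functions (my own helpers, not from any package) reproduces these prevalence effects:

```python
# PPP and NPP as functions of sensitivity, specificity, and prevalence
# (Equations 15 and 17), using the VRAG values from the text.
def ppp(sens, spec, prev):
    return (prev * sens) / (prev * sens + (1 - prev) * (1 - spec))

def npp(sens, spec, prev):
    return ((1 - prev) * spec) / ((1 - prev) * spec + prev * (1 - sens))

for prev in (0.50, 0.05, 0.01):
    print(prev, round(ppp(0.81, 0.60, prev), 3), round(npp(0.81, 0.60, prev), 3))
# 0.50 -> PPP .669, NPP .759
# 0.05 -> PPP .096, NPP .984
# 0.01 -> PPP .020, NPP .997
```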

TABLE 3
Hypothetical Results When the Prevalence Is 5%

Result of the Violence        Committed a Violent Act
Risk Appraisal Guide       Yes          No           Row Total
  Violent                  405 (A)      3,800 (B)    4,205
  Not violent              95 (C)       5,700 (D)    5,795
  Column total             500          9,500        10,000

Note. Prevalence = 500/10,000 = .050; sensitivity = 405/500 = .810; specificity = 5,700/9,500 = .600; positive predictive power (PPP) = 405/4,205 = .096; negative predictive power (NPP) = 5,700/5,795 = .984; incremental PPP = .096 – .05 = .046; incremental NPP = .984 – .95 = .034; quality PPP = (.096 – .05)/(1 – .05) = .049; quality NPP = (.984 – .95)/(1 – .95) = .672; likelihood ratio+ = .810/(1 – .600) = 2.025; likelihood ratio– = .600/(1 – .810) = 3.158; kappa = (6,105 – 5,715.5)/(10,000 – 5,715.5) = .091; odds ratio = (405)(5,700)/(3,800)(95) = 6.395; phi = [(405)(5,700) – (3,800)(95)]/√[(500)(9,500)(4,205)(5,795)] = .181; pretest odds+ = .05/(1 – .05) = .053; pretest odds– = (1 – .05)/.05 = 19.000; posttest odds+ = .053 × 2.025 = .107; posttest odds– = 19.000 × 3.158 = 60.000.

We multiply 500 times the sensitivity for Cell A and 9,500 times the specificity for Cell D. By subtraction, we get Cells B and C, and by addition, the row totals. Using Equation 5, the PPP is 405/4,205 = .096, and using Equation 6, the NPP is 5,700/5,795 = .984, which are the same figures calculated using Bayes's theorem. Without going through the calculations, I can state the other universal, immutable rule of test results: Assuming that a test is reasonably accurate, then when the prevalence of a condition is high, (a) the PPP is also high and (b) the NPP is low. Thus, I summarize the two rules:

1. When the prevalence is low, a test is best used to rule out a condition but not to rule it in.
2. When the prevalence of a condition is high, a test is best used to rule it in but not to rule it out.

I add a third rule:

3. Tests work best when the prevalence is 50%.

The LRs can also be used to calculate the PPP and NPP using the following formulae:

Posttest Odds = Pretest Odds × LR,  (20)

where, for positive test results,

Pretest Odds+ = Prevalence / (1 – Prevalence),  (21)

and

PPP = Posttest Odds+ / (Posttest Odds+ + 1),  (22)

whereas for negative test results

Pretest Odds– = (1 – Prevalence) / Prevalence,  (23)

and

NPP = Posttest Odds– / (Posttest Odds– + 1).  (24)

Again assuming a prevalence of 5%, then to detect the presence of violence, the pretest odds+ are 0.05/0.95 = 0.053 (Equation 21). With an LR+ of 2.025 that was calculated previously, this means that the posttest odds+ are 0.053 × 2.025 = 0.107 by Equation 20, and the PPP is 0.107/1.107 = 0.097, which is the same value (within rounding error) obtained using Bayes's theorem. To detect the absence of violence, the pretest odds– are (1 – .05)/.05 = 19.000 by Equation 23. In Equation 4, we found an LR– of 3.158; therefore, the posttest odds– are 19.000 × 3.158 = 60.000, and the NPP (Equation 24) is 60.000/61.000 = .984, which agrees with the number derived from Equation 6. Table 4 summarizes the effects of prevalence on the various statistics mentioned in this article. The reader can see the magnitude of the effects by applying the equations to the numbers in Tables 2 and 3.
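The odds route gives the same answers as Bayes's theorem; a brief sketch of Equations 20 through 24 under the article's definitions (in particular, its LR– = Specificity/(1 – Sensitivity) from Equation 4):

```python
# Pretest odds -> posttest odds -> predictive power (Equations 20-24),
# using the article's definitions of LR+ and LR-.
sens, spec, prev = 0.81, 0.60, 0.05

lr_pos = sens / (1 - spec)                  # 2.025
lr_neg = spec / (1 - sens)                  # 3.158

pretest_pos = prev / (1 - prev)             # 0.053  (Equation 21)
posttest_pos = pretest_pos * lr_pos         # 0.107  (Equation 20)
ppp = posttest_pos / (posttest_pos + 1)     # 0.096  (Equation 22)

pretest_neg = (1 - prev) / prev             # 19.000 (Equation 23)
posttest_neg = pretest_neg * lr_neg         # 60.0   (Equation 20)
npp = posttest_neg / (posttest_neg + 1)     # 0.984  (Equation 24)

print(round(ppp, 3), round(npp, 3))         # matches the Bayes' theorem values
```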

TABLE 4
Statistics Affected and Not Affected by the Prevalence

Affected by Prevalence                      Not Affected by Prevalence
PPP                                         Sensitivity
NPP                                         Specificity
Incremental PPP                             Odds ratio
Incremental NPP                             LR+
Quality PPP                                 LR–
Quality NPP
Kappa (κ)
Phi (φ)
Pretest odds (positive and negative)
Posttest odds (positive and negative)

Note. PPP = positive predictive power; NPP = negative predictive power; LR+ = likelihood ratio of a positive result; LR– = likelihood ratio of a negative result.

INCREMENTAL VALIDITY

In the example in which the prevalence of violence was 5%, we found that the VRAG accurately detected 98.4% of the children who would not be violent. At first glance, this would appear to make it a good test for ruling out proneness to violence. However, we have to bear in mind that, if we did not use the test and simply said everyone was nonviolent, we would be correct 95% of the time (i.e., the rate of nonoccurrence of violence, or 1 – Prevalence). Thus, it is not sufficient to look just at the NPP (or PPP) but rather at the increase in the predictive values over what would be expected by going with the base rate.

Gibertini, Brandenburg, and Retzlaff (1986) referred to this as the "incremental" PPP, or IPPP, and the incremental NPP, or INPP, which are defined as

IPPP = PPP – P(Dx),  (25)

INPP = NPP – P(D̄x).  (26)

When the IPPP or INPP is equal to zero, the PPP or NPP is exactly equal to chance expectations. The difficulty with these indexes, though, is that they are difficult to interpret. A variation of them, proposed by Kraemer (1992), rescales the IPPP and INPP by dividing by the maximum possible range, which is 1 – P(Dx) for Equation 25 and 1 – P(D̄x) for Equation 26. The advantage is that the ratio is again zero when the test adds nothing but assumes a maximum value of 1.00 when there are no diagnostic errors. For this example, then, Kraemer's (1992) "quality of the NPP" is

QNPP = [NPP – P(D̄x)] / [1 – P(D̄x)] = (.9836 – .95)/(1 – .95) = .672,  (27)

meaning that there is a 67% increase in diagnostic value by using the test. As Hsu (2002) pointed out, this is similar to the chance correction in Cohen's kappa. Although these equations quantify the amount of information added by using the test, they do not (and cannot) address the issue of the trade-off with societal costs. On the positive side, we can be more assured that those students labeled by the test as nonviolent actually will not commit an act of violence. On the negative side, (a) we are still not 100% sure about them; (b) there is the danger of the test being misinterpreted to label those with high scores as violent, despite the very high false positive rate; and (c) there is the real dollar cost of administering and scoring the test, which would likely require resources being diverted from other programs.
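A small sketch of Equations 25 through 27 applied to the Table 3 figures (the helper names are mine):

```python
# Incremental and quality predictive power (Equations 25-27) at 5% prevalence.
prev = 0.05
ppp = 405 / 4205                  # .096, from Table 3
npp = 5700 / 5795                 # .984, from Table 3

ippp = ppp - prev                 # .046  (Equation 25)
inpp = npp - (1 - prev)           # .034  (Equation 26)

qppp = ippp / (1 - prev)          # .049  quality of the PPP
qnpp = inpp / prev                # .672  quality of the NPP (Equation 27)

print(round(ippp, 3), round(inpp, 3), round(qppp, 3), round(qnpp, 3))
```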

SCREENING FOR LOW PREVALENCE CONDITIONS

As seen previously, the major problem with screening tests is that if the prevalence of the condition is low, most of the positive results will in fact be false positives. Assuming that the results of the VRAG are valid, then it may be a useful test in assessing the propensity toward violence of specific groups. This in fact is its true purpose; Rice and Harris (1995) intended it to be used with prisoners who have a history of violence, and the proportion who go on to commit another act of violence is roughly 30%. However, when used to screen for violence in groups in which the expectation is low, then it likely does more harm than good by falsely labeling a large number of nonviolent people. Again, this is true for all screening tests and is one of the bases of objections to mass screening for illicit drugs among potential hires or trying to determine who has revealed secret information by giving a polygraph exam to all employees. There are times, though, when it is necessary to screen for low-prevalence conditions, such as PKU in neonates or HIV or hepatitis C in blood, and when the cost of missing cases is high. In these circumstances, diagnostic tests are often used sequentially. The first test has a high sensitivity and uses a cut point to ensure that very few cases are missed, even though this results in a large number of false positives. However, the prevalence of the condition in this subsequent screened sample is closer to the ideal 50%, so that a second test, with high specificity and a cut point set to weed out false positives, performs better. Even so, there is a likelihood of false positive cases remaining, with all of the resultant problems of labeling and perhaps unnecessary interventions (e.g., Bergman & Stamm, 1967).
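To make the sequential strategy concrete, here is a hypothetical two-stage sketch; the first-stage values are the VRAG figures used throughout, but the second test's sensitivity and specificity are invented purely for illustration:

```python
# Hypothetical two-stage screening: a sensitive first test, then a more specific
# second test applied only to first-stage positives. Second-stage values are invented.
def ppp(sens, spec, prev):
    return (prev * sens) / (prev * sens + (1 - prev) * (1 - spec))

prev = 0.05
stage1_ppp = ppp(sens=0.81, spec=0.60, prev=prev)         # ~.10 after stage 1
# Among stage-1 positives the "prevalence" is now stage1_ppp (~10%), so a
# second, more specific test performs far better on this enriched group.
stage2_ppp = ppp(sens=0.90, spec=0.95, prev=stage1_ppp)   # ~.66 after stage 2

print(round(stage1_ppp, 3), round(stage2_ppp, 3))
```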

USING SCREENING TESTS FOR EARLY DETECTION

The major reason for the existence of screening tests is to detect disorders before clinical signs appear. The assumption, pictured in Figure 1, is that earlier detection can lead to early intervention, thus preventing the disorder from progressing. If a cancerous tumor in the breast is diagnosed early or an enlarged prostate gland is detected before it becomes cancerous, for example, the hope is that conservative surgery will prevent problems from developing later. Similarly, detecting children with previously undiagnosed obsessive–compulsive disorder (OCD; e.g., Bamber, Tamplin, Park, Kyte, & Goodyer, 2002) may enable a therapy program to intervene before the problem becomes more serious. However, the use of screening tests for early detection depends on two assumptions: (a) that the natural history of the disorder is that these previously undetected problems will in fact become major ones if untreated and (b) that a treatment exists and is effective.

FIGURE 1 Hypothesized progression of a disorder without (top) and with (bottom) screening.


One of the problems highlighted by Olsen and Gotzsche (2001) in their review of mammography is called the length-time bias. This refers to the possibility that screening tests may detect slowly progressive forms of the disorder, which have better prognoses, rather than the more serious forms that develop later and are picked up once clinical symptoms appear (Kramer & Brawley, 2000). That is, it assumes that the disorder is not a unitary entity that, once started, inevitably leads to the same endpoint. Using the example of OCD, it may be that the type of the disorder detected by screening in childhood evolves into "normal" obsessiveness and compulsiveness and not into the more crippling disorder seen later in life. In other words, some of the false positives are not simply a random subset of people without the disorder but represent a group that resembles the true positives in a way that "fools" the test. The result of early screening, then, may be a number of people receiving unnecessary therapy (or having unnecessary biopsies, in the case of mammography) without in any way affecting the natural history of the more serious form of the disorder.

The second possible problem with screening tests is referred to as the lead-time (or zero-time shift) bias (Feinleib & Zelen, 1969). This occurs when the screening test is able to detect early forms of the disorder, but the treatments are unable to affect the outcome. This results in two phenomena. First, it gives the erroneous impression with life-threatening disorders that people are surviving longer because there is a greater interval between detection and death. However, this is due only to the second phenomenon: that people are aware of their disorder for a longer time. The conclusion is that unless psychologists are sure that disorders that are detected early (a) are indeed the precursors of the more serious form seen once clinical symptoms become manifest and (b) can be effectively treated, mass screening programs may do more harm than good.
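Lead-time bias is easy to see in a toy simulation. The assumption below—that screening moves detection 3 years earlier while the age at death is unchanged—is mine, chosen only to illustrate the mechanism:

```python
import random

# Toy model of lead-time bias: screening detects the disorder 3 years earlier,
# treatment changes nothing, yet "survival after detection" looks longer.
random.seed(1)
cases = [{"clinical_dx_age": random.uniform(60, 70)} for _ in range(10_000)]
for c in cases:
    c["death_age"] = c["clinical_dx_age"] + random.uniform(2, 6)   # unchanged by screening
    c["screen_dx_age"] = c["clinical_dx_age"] - 3                  # detected 3 years sooner

surv_clinical = sum(c["death_age"] - c["clinical_dx_age"] for c in cases) / len(cases)
surv_screened = sum(c["death_age"] - c["screen_dx_age"] for c in cases) / len(cases)
print(round(surv_clinical, 1), round(surv_screened, 1))   # ~4.0 vs ~7.0 years
```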

REPORTING STUDIES OF DIAGNOSTIC TESTS

Diagnostic and screening tests can have a major impact on the lives of the people who are assessed. They can influence whether a person is offered a job; which treatments, if any, a patient receives; and whether a parent can have custody of, or even access to, his or her children. Despite this, the quality of articles that report on the development and validity of these tests is, at best, mediocre (Reid, Lachs, & Feinstein, 1995). In an attempt to remedy this situation, there have been a few articles, primarily in the medical literature, that have outlined criteria for establishing and reporting on the validity of diagnostic tests (e.g., Guyatt, Tugwell, Feeny, Haynes, & Drummond, 1986; Jaeschke, Guyatt, & Sackett, 1994a, 1994b; Riegelman, 2000). These were brought together by a group of (again, primarily medical) journal editors and researchers in the Standards for Reporting of Diagnostic Accuracy (STARD) statement (Bossuyt et al., 2003).

STARD consists of a checklist of 25 items, with the implicit message that articles that do not meet these criteria will not be published in the journals that subscribe to it. Broadly speaking, the principles fall within six categories:

1. Identification of the article.
2. Description of the participants.
3. Description of the index diagnostic test and the reference or gold standard.
4. An indication of how the tests were administered.
5. Reporting of the results.
6. A discussion of the use of the test.

What follows is a summary of the STARD statement, which I have modified to make more applicable to the spectrum of diagnostic tests used by psychologists.

1. Identification of the article: The article should clearly state the purpose of the study, such as determining the validity of the test, seeing how well it works with a specific group of patients, or comparing a number of similar tests. Simply using terms such as diagnostic value or clinical utility in the abstract is rarely sufficient because not enough information is given to help readers determine whether the article may be useful for their purposes. The guidelines also say that the key words sensitivity and specificity should be used to help in electronic searches of Medline; I would add diagnostic efficiency to this for the PsycINFO database.

2. Description of the participants: As has been pointed out in many texts and articles about psychometric theory (e.g., Nunnally, 1970; Streiner & Norman, 1995), one does not validate a test but rather a use to which the test is put. This means that a test that is valid for one group of people or in one setting may not necessarily be valid with other people or in different contexts. Consequently, the study must describe who was included in the study and who was excluded; how the people were recruited (e.g., were they patients coming to a counseling center, referred because of a specific problem or the results of a previous test, students in an introductory psychology class, and so forth); whether the participants were all people who met the criteria or a subset of them; the presence of comorbid disorders; and demographic data. Ideally, a table or flowchart would show how many participants were recruited and the number dropped at each stage because they did not meet the inclusion criteria or were unable to complete the test. This information is necessary so that the reader can determine the nature of the people for whom the test has been validated and the generalizability of the findings.

3. Description of the index diagnostic test and the reference or gold standard: The evaluation of a new test is highly dependent on the accuracy of the gold standard. In many cases, though, the reference has properties more akin to pyrites ("fool's gold") than real gold. For example, the Diagnostic Interview Schedule (Robins, Helzer, Croughan, & Ratcliff, 1981) was validated by comparing it to psychiatrists' diagnoses (e.g., Helzer et al., 1985), yet we know that the reliability and validity of clinician-assigned diagnoses are relatively poor (e.g., Miller, Dasher, Collins, Griffiths, & Brown, 2001). This error of classification imposes an upper limit on the psychometric properties of the new test, which can only partially be corrected for statistically (Fleiss, 1981). In other situations, the reference may consist of a continuous scale such as the Beck Depression Inventory (Beck, Steer, Ball, & Ranieri, 1996), which is then dichotomized to form "depressed" and "non-depressed" groups.11 In such instances, the rationale for the cut point(s) or categories must be justified, and it must be verified whether these were determined before or after the data were gathered. If the cut point was set after the fact (e.g., a median split or upper and lower terciles), it may artificially inflate the validity of the new test while decreasing the likelihood that another study will be able to replicate the results.

11 Note that dichotomization is poor practice because it can dramatically reduce effect sizes (e.g., MacCallum, Zhang, Preacher, & Rucker, 2002; Streiner, 2002).

4. An indication of how the tests were administered and scored: There should be a clear description of how both the new test and the reference were administered and scored. This can be an issue even for self-administered scales. For example, a recently developed quality of life scale for epilepsy (Ronen, Streiner, Rosenbaum, & the Canadian Pediatric Epilepsy Network, 2003) is meant to be used by children as young as 6 years of age. At the lower age range, though, some children need help reading and perhaps understanding the items. The description of the scale should indicate (a) whether help can be given and (b) if so, whether this may consist of simply reading the item aloud to the child or if paraphrasing is permitted. If either the new or the reference scale has to be administered by an examiner, the article should describe what expertise is required (e.g., an advanced degree in psychology, a background in mental health, or no expertise) and what training, if any, needs to be done. Similarly, if scoring consists of more than just adding up the point value of each item and requires the examiner to make a judgment about the answer or behavior, such as with various intelligence scales or the Brief Psychiatric Rating Scale (Overall & Gorham, 1962), this should be described either in the article or in a manual, in enough detail to allow others to use the scale. For some types of scales, especially self-report ones, respondents may omit some items or endorse two or more options. The article should describe how these missing data are handled: the entire scale dropped from the analysis, the final score given as a proportion of questions answered, the missing items replaced by the mean, and so forth. Finally, if the test result requires clinical interpretation as opposed to looking up the score in a table, the study should minimize criterion contamination by blinding clinicians to the results of the other test. In Meyer's (2002) review of 43 studies comparing diagnoses from the Diagnostic Interview Schedule (Robins et al., 1981) and the Composite International Diagnostic Interview (Robins et al., 1988), he found a direct relationship between the mean weighted kappas (κs) of agreement and the degree of criterion contamination. When contamination was high, κ was .68, dropping to .51 with moderately high contamination, .39 with moderately low contamination, and .25 when it was low. It should be borne in mind, though, that criterion contamination can also occur when both tests are self-report measures. The respondents may answer the second scale not only on the basis of the questions themselves but also influenced by their recall of how they responded to similar items on the first test, to appear consistent. This is one reason that Campbell and Fiske (1959), in their seminal article on the multitrait-multimethod matrix, recommended validating a test with a criterion that is "maximally different" in terms of format, such as a self-report scale against a performance task or an observer-completed scale.

5. Reporting the results: Perhaps the first rule of reporting results should be "get it right." Kessel and Zimmerman (1993) reviewed 26 studies from leading journals that reported the findings of diagnostic test performance. Of these, 9 (34.6%) had at least one error in the calculation of diagnostic efficiency statistics, and 3 (11.5%) used unconventional (i.e., wrong) definitions of the terms. Because of this, they recommended that the article should include a 2 × 2 table, much like Table 2 in this article, giving the actual numbers so that others can check the accuracy of the calculations. Other minimum reporting requirements for the reliability of self-administered scales would include the test–retest reliability and the interval between the two test administrations. Tests that require an examiner should report the interrater reliability. Some structured and semistructured interviews report the reliability of scoring, in which the second scorer views the interview either through a one-way mirror or via a videotape (e.g., Hesselbrock, Stabenau, Hesselbrock, Mirkin, & Meyer, 1982). Although it is necessary to establish interscorer reliability, it is not a substitute for two separate administrations. With many instruments, characteristics of the interviewer may affect the responses that are given (e.g., Finkel, Guterbock, & Borg, 1991). Furthermore, because of often complicated skip patterns, judgments made by the examiner during the course of the interview may determine whether certain questions or even entire sections are given or not.

Measures of internal consistency, such as Cronbach's alpha or the mean interitem correlation, should be reported for scales in which this is a desirable property. Some tests, though, consist of items that are not expected to have high internal consistency, and these indexes should not be calculated for them (Streiner, 2003). The latest set of recommendations of the American Psychological Association regarding the reporting of statistical results states that, in addition to point estimates of a parameter such as a reliability coefficient, articles should also report confidence intervals (Wilkinson & the Task Force on Statistical Inference, 1999). When validity studies involve comparing two or more groups, it is rarely sufficient to report the results of a t test or analysis of variance. Study results can be statistically significant but clinically trivial, especially when the sample size is large. It is better to report the results in a way that reflects the clinical utility of the scale, such as the amount of variance accounted for by it, group overlap, or misclassification rates.

6. Discussion of the use of the test: Diagnostic tests are used in many different ways. Some may be useful for clinical decision making, whereas others, because of the nature of the test itself or low reliability, may be used only for research purposes. Validity testing may show that some tests are accurate enough to place people within broad groups but do not show sufficient sensitivity to change to track improvement or deterioration for individual people. Other attributes of the test, such as the length of time it takes to administer or the skill level required of the examiner, may make some tests appropriate only for research purposes or for occasional use, whereas others can be given to patients every session to measure progress (or the lack of it). The article should point out how the test would be used in practice, with the recognition that this may change as more studies are done with it.
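As one concrete illustration of the reporting minimums in item 5 (the 2 × 2 table and an interval around each point estimate), the sketch below lays out the Table 2 counts and attaches Wilson 95% confidence intervals to sensitivity and specificity; the choice of the Wilson interval is mine—the recommendation is only that some interval accompany the point estimate:

```python
import math

def wilson_ci(successes, n, z=1.96):
    """Wilson score 95% confidence interval for a proportion."""
    p = successes / n
    centre = (p + z * z / (2 * n)) / (1 + z * z / n)
    half = (z / (1 + z * z / n)) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return centre - half, centre + half

A, B, C, D = 81, 40, 19, 60                  # the 2 x 2 counts from Table 2
sens, sens_ci = A / (A + C), wilson_ci(A, A + C)
spec, spec_ci = D / (B + D), wilson_ci(D, B + D)
print(f"sensitivity {sens:.3f} (95% CI {sens_ci[0]:.3f}-{sens_ci[1]:.3f})")
print(f"specificity {spec:.3f} (95% CI {spec_ci[0]:.3f}-{spec_ci[1]:.3f})")
```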

DISCUSSION

When they are used properly, diagnostic and screening tests can be powerful tools in the psychologist's tool kit. There is little doubt that they have been extremely useful in such areas as identifying children with intellectual problems as well as those who could benefit from enriched educational programs, aiding in the diagnosis of psychological and behavioral problems, helping people make vocational choices, predicting who will succeed as a Peace Corps volunteer, and a range of others. However, this presupposes that the tests themselves have been developed with sufficient concern for issues of reliability and validity and that they are used only for the purposes that were intended (or in areas in which subsequent studies have shown that they can be used validly). There are literally thousands of tests that have been developed (e.g., Goldman, Mitchell, & Egelson, 1997), and it is safe to say that few of them would meet all, or even most, of the STARD criteria for reporting reliability and validity. Even with well-developed and widely used tests such as the Minnesota Multiphasic Personality Inventory–2 (Greene, 2000) or the Millon Clinical Multiaxial Inventory–III (Millon, Millon, & Davis, 1994), validity is an incremental process. Over time, tests are used with different populations (e.g., ethnic minorities or those from other cultures) and in ways not foreseen by the original developers. This is not only to be expected but desired, as it extends the potential usefulness of the instruments. However, before these validity studies are conducted and ideally replicated, clinical decisions based on the scales must be highly tentative.

In this article, I have focused on another way in which many tests are used without having been validated: taking them out of the clinic, where they are used on an individual basis, into the community, where they are used for case finding or mass screening. This is often a highly dubious proposition, as readers have seen. Scales that perform well with people for whom there is a strong suspicion that they have the trait being measured (i.e., the prevalence is close to 50%) will nearly always perform poorly in trying to identify people when the prevalence is low. Indeed, Cadman et al. (1984) showed that even a well-developed test designed for screening purposes misidentified too many children in kindergarten to be clinically useful. This is not a criticism of the test as much as of the use to which it is put. This is a lesson that was taught nearly 50 years ago (Meehl & Rosen, 1955) but one that requires continual repetition.

REFERENCES

Bamber, D., Tamplin, A., Park, R. J., Kyte, Z. A., & Goodyer, I. M. (2002). Development of a short Leyton obsessional inventory for children and adolescents. Journal of the American Academy of Child & Adolescent Psychiatry, 41, 1246–1252.
Baxter, N., & the Canadian Task Force on Preventive Health Care. (2001). Preventive health care, 2001 update: Should women be routinely taught breast self-examination to screen for breast cancer? Canadian Medical Association Journal, 164, 1837–1846.
Beck, A. T., Steer, R. A., Ball, R., & Ranieri, W. F. (1996). Comparison of Beck Depression Inventories–IA and –II in psychiatric outpatients. Journal of Personality Assessment, 67, 588–597.
Bergman, A. B., & Stamm, S. J. (1967). The morbidity of cardiac nondisease in schoolchildren. New England Journal of Medicine, 18, 1008–1013.
Bossuyt, P., Reitsma, J. B., Bruns, D. E., Gatsonis, C. A., Glasziou, P. P., Irwig, L. M., et al. (2003). Towards complete and accurate reporting of studies of diagnostic accuracy: The STARD initiative. Annals of Internal Medicine, 138, 40–44. Retrieved February 24, 2003, from http://www.consort-statment.org/stardstatement.htm
Bury, A. S., & Bagby, R. M. (2002). The detection of feigned uncoached and coached posttraumatic stress disorder with the MMPI–2 in a sample of workplace accident victims. Psychological Assessment, 14, 472–484.
Cadman, D., Chambers, L. W., Walter, S. D., Feldman, W., Smith, K., & Ferguson, R. (1984). The usefulness of the Denver Developmental Screening Test to predict kindergarten problems in a general community population. American Journal of Public Health, 74, 1093–1097.
Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56, 81–105.
Chen, P., Ratcliff, G. D., Belle, S. H., Cauley, J. A., DeKosky, S. T., & Ganguli, M. (2000).
Cognitive tests that best discriminate between asymptomatic AD and those who remain nondemented. Neurology, 55, 1847–1853.

Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37–46.
Cohen, J. (1968). Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin, 70, 213–220.
Committee to Review the Scientific Evidence on the Polygraph. (2003). The polygraph and lie detection. Washington, DC: National Academy Press.
Derogatis, L. R., & Spencer, M. S. (1982). The Brief Symptom Inventory (BSI): Administration, scoring, and procedures manual—1. Baltimore: Johns Hopkins University Press.
Feinleib, M., & Zelen, M. (1969). Some pitfalls in the evaluation of screening programs. Archives of Environmental Health, 19, 412–415.
Finkel, S. E., Guterbock, T. M., & Borg, M. J. (1991). Race-of-interviewer effects in a preelection poll: Virginia 1989. Public Opinion Quarterly, 55, 313–330.
Fleiss, J. L. (1981). Statistical methods for rates and proportions (2nd ed.). New York: Wiley.
Gibertini, M., Brandenburg, N., & Retzlaff, P. (1986). The operating characteristics of the Millon Clinical Multiaxial Inventory. Journal of Personality Assessment, 50, 554–567.
Goldman, B. A., Mitchell, D. F., & Egelson, P. (1997). Directory of unpublished experimental mental measures (Vol. 7). Washington, DC: American Psychological Association.
Greene, R. (2000). MMPI–2: An interpretive manual (2nd ed.). Boston: Allyn & Bacon.
Guyatt, G. H., Tugwell, P. X., Feeny, D. H., Haynes, R. B., & Drummond, M. A. (1986). A framework for clinical evaluation of diagnostic technologies. Canadian Medical Association Journal, 134, 587–594.
Helzer, J. E., Robins, L., McEvoy, L., Spitznagel, E. L., Stoltzman, R., Farmer, A., et al. (1985). A comparison of clinical and Diagnostic Interview Schedule diagnoses. Archives of General Psychiatry, 42, 657–666.
Hesselbrock, V., Stabenau, J., Hesselbrock, M., Mirkin, P., & Meyer, R. (1982). A comparison of two interview schedules: The Schedule for Affective Disorders and Schizophrenia—Lifetime and the National Institute for Mental Health Diagnostic Interview Schedule. Archives of General Psychiatry, 39, 674–677.
Hsu, L. M. (2002). Diagnostic validity statistics and the MCMI–III. Psychological Assessment, 14, 410–422.
Jaeschke, R., Guyatt, G., & Sackett, D. L. (1994a). Users' guides to the medical literature: III. How to use an article about a diagnostic test. A. Are the results of the study valid? Journal of the American Medical Association, 271, 389–391.
Jaeschke, R., Guyatt, G., & Sackett, D. L. (1994b). Users' guides to the medical literature: III. How to use an article about a diagnostic test. B. What are the results and will they help me in caring for my patients? Journal of the American Medical Association, 271, 703–707.
Kessel, J. B., & Zimmerman, M. (1993). Reporting errors in studies of the diagnostic performance of self-administered questionnaires: Extent of problem, recommendations for standardized presentation of results, and implications for the peer review process. Psychological Assessment, 5, 395–399.
Kraemer, H. C. (1992). Evaluating medical tests: Objective and quantitative guidelines. Newbury Park, CA: Sage.
Kraemer, H. C., Kazdin, A. E., Offord, D. R., Kessler, R. C., Jensen, P. S., & Kupfer, D. J. (1999). Measuring the potency of risk factors for clinical or policy significance. Psychological Methods, 4, 257–271.
Kramer, B. S., & Brawley, O. W. (2000). Cancer screening. Hematology and Oncology Clinics of North America, 14, 831–848.
MacCallum, R. C., Zhang, S., Preacher, K. J., & Rucker, D. D. (2002). On the practice of dichotomization of quantitative variables. Psychological Methods, 7, 19–40.
Meehl, P. E., & Rosen, A. (1955). Antecedent probability and the efficiency of psychometric signs, patterns, or cutting scores. Psychological Bulletin, 52, 194–216.
Meyer, G. J. (2002). Implications of information gathering methods for a refined taxonomy of psychopathology. In L. E. Beutler & M. Malik (Eds.), Rethinking the DSM: A psychological perspective (pp. 69–105). Washington, DC: American Psychological Association.


Miller, P. R., Dasher, R., Collins, R., Griffiths, P., & Brown, F. (2001). Inpatients diagnostic assessments: 1. Accuracy of structured vs. unstructured interviews. Psychiatry Research, 105, 255–264.
Millon, T., Millon, C., & Davis, R. (1994). Millon Clinical Multiaxial Inventory–III. Minneapolis, MN: National Computer Systems.
Nielsen, C., & Lang, R. S. (1999). Principles of screening. Medical Clinics of North America, 83, 1323–1337.
Nunnally, J. C. (1970). Introduction to psychological measurement. New York: McGraw-Hill.
Olsen, O., & Gotzsche, P. C. (2001). Screening for breast cancer with mammography. Cochrane Database of Systematic Reviews, CD001877.
Overall, J. E., & Gorham, D. R. (1962). The Brief Psychiatric Rating Scale. Psychological Reports, 10, 799–812.
Raskin, D. C., & Hare, R. D. (1978). Psychopathy and detection of deception in a prison population. Psychophysiology, 15, 126–136.
Reid, M. C., Lachs, M. S., & Feinstein, A. R. (1995). Use of methodological standards in diagnostic test research: Getting better but still not good. Journal of the American Medical Association, 274, 645–651.
Rice, M. E., & Harris, G. T. (1995). Violent recidivism: Assessing predictive validity. Journal of Consulting and Clinical Psychology, 63, 737–748.
Riegelman, R. K. (2000). Studying a study and testing a test (4th ed.). New York: Lippincott Williams & Wilkins.
Robins, L., Helzer, J. E., Croughan, J., & Ratcliff, K. S. (1981). National Institute of Mental Health Diagnostic Interview Schedule: Its history, characteristics, and validity. Archives of General Psychiatry, 38, 381–389.
Robins, L. N., Wing, J., Wittchen, H. U., Helzer, J. E., Babor, T. F., Burke, J., et al. (1988). The Composite International Diagnostic Interview: An epidemiologic instrument suitable for use in conjunction with different diagnostic systems and in different cultures. Archives of General Psychiatry, 45, 1069–1077.
Ronen, G. M., Streiner, D. L., Rosenbaum, P., & the Canadian Pediatric Epilepsy Network. (2003). Health-related quality of life in childhood epilepsy: The development of self-report and proxy-response measures. Epilepsia, 44, 598–612.
Sackett, D. L., Deeks, J. J., & Altman, D. G. (1996). Down with odds ratios! Evidence-Based Medicine, 1, 164–166.
Sackett, D. L., Haynes, R. B., Guyatt, G. H., & Tugwell, P. (1991). Clinical epidemiology: A basic science for clinical medicine (2nd ed.). Boston: Little, Brown.
Streiner, D. L. (1998). Risky business: Making sense of estimates of risk. Canadian Journal of Psychiatry, 43, 411–415.
Streiner, D. L. (2002). Breaking up is hard to do: The heartbreak of dichotomizing continuous data. Canadian Journal of Psychiatry, 47, 262–266.
Streiner, D. L. (2003). Being inconsistent about consistency: When coefficient alpha does and doesn't matter. Journal of Personality Assessment, 80, 217–222.
Streiner, D. L., & Norman, G. R. (1995). Health measurement scales: A practical guide to their development and use (2nd ed.). Oxford, England: Oxford University Press.
Streiner, D. L., & Norman, G. R. (1996). PDQ epidemiology (2nd ed.). Toronto, Ontario, Canada: Decker.
United States v. Edward G. Scheffer, 523 U.S. Supreme Court 303 (1998).
Wilkinson, L., & the Task Force on Statistical Inference. (1999). Statistical methods in psychology journals: Guidelines and explanations. American Psychologist, 54, 594–604.

David L. Streiner
Kunin-Lunenfeld Applied Research Unit
Baycrest Centre for Geriatric Care
3560 Bathurst Street
Toronto, Ontario, Canada M6A 2E1
E-mail: [email protected]

Received February 11, 2003
Revised April 2, 2003
