Journal of Clinical Epidemiology 53 (2000) 931–939
Choice of effect measure for epidemiological data

S.D. Walter*
Department of Clinical Epidemiology and Biostatistics, McMaster University, Hamilton, Ontario, Canada L8N 3Z5

Received 5 January 1999; received in revised form 11 November 1999; accepted 26 January 2000

* Corresponding author. Tel: 905-525-9140, ext. 22338; fax: 905-529-3012. E-mail address: [email protected] (S.D. Walter).
Abstract

The debate concerning the choice of effect measure for epidemiologic data has been renewed in the literature, and it suggests some continuing disagreement between the pertinent clinical and statistical criteria. In this article, some defining characteristics of the main choices of effect measure [risk difference (RD), relative risk (RR), and odds ratio (OR)] for binary data are presented and compared, with consideration of both the clinical and statistical perspectives. Relationships of these measures to the relative risk reduction (RRR) and number needed to treat (NNT) are also discussed. A numerical comparison of models of constant RD, RR, and OR is made to assess when and by how much they might differ in practice. Typically the models show only small numerical differences, unless extreme extrapolation is involved. The RD and RR models can predict impossible event rates, either less than zero or greater than 100%. Each measure has potential theoretical justification. RD and RR may enjoy some advantages for communication of risk, but OR may be preferred for data analysis. A clear distinction should be maintained between the objectives of data analysis and subsequent risk communication, and different effect measures may be needed for each. © 2000 Elsevier Science Inc. All rights reserved.

Keywords: Effect measure; Epidemiology; Statistics
1. Introduction

When reporting the results of studies, clinicians and statisticians both strive for simple summaries of the data. Simplicity in reporting enhances communication to consumers of the results. One way of achieving simplicity is by using an effect measure that is generalizable to most or all of the participants in the study. For instance, when reporting the results of a clinical trial one might use an effect measure such as the relative risk of the outcome (e.g., death) under an experimental treatment compared to controls. Ideally, the overall relative risk would be constant in all strata of the study, and thus applicable to all subgroups of patients. A similar principle is found in meta-analyses that aggregate data from several studies. Here one might report the overall effect size, together with an investigation of heterogeneity in the effect between studies.

Statistical or mathematical criteria have often governed the choice of effect measure, even though the results might then be less interpretable to clinicians or patients. A particular instance has been the frequent use of the odds ratio, a measure that is thought to be less interpretable by consumers.

In this article, we first review the most frequently used effect measures, and highlight their main strengths and weaknesses according to various authorities. We then carry out some numerical evaluations to determine the likely magnitude of difference between models of constancy in these effect measures. These evaluations indicate the situations in which the choice between alternative effect measures would be most critical. We also briefly examine the question of heterogeneity on an empirical basis.

2. Methods

2.1. Definition of effect measures

We limit attention to situations where we wish to summarize the difference between two groups with respect to a binary outcome, with event rates P1 and P2. Typical examples include: a clinical trial with treated and control groups, with an outcome event such as cure or death; a cohort study with participants defined as exposed or unexposed to a risk factor, and the outcome being the development of a disease; or a case-control study in which case and control groups are compared with respect to the exposure to a binary risk factor. For convenience, we will emphasize the situation where P2 represents the event rate in a control group, and P1 the event rate in a treated group, but we can easily generalize our comparisons to the other designs. Frequently used effect measures with binary data of this type are as follows:
Risk difference = RD = P2 − P1
Relative risk = RR = P1 / P2
Relative risk reduction = RRR = (P2 − P1) / P2
Odds ratio = OR = [P1 / (1 − P1)] / [P2 / (1 − P2)]
Number needed to treat = NNT = 1 / (P2 − P1)

As an example, suppose the event rates are P1 = 0.3 and P2 = 0.4 in the treated and control groups, respectively. Then we have: RD = 0.4 − 0.3 = 0.1; RR = 0.3/0.4 = 0.75; RRR = (0.4 − 0.3)/0.4 = 0.25; OR = (0.3/0.7)/(0.4/0.6) = 0.64; NNT = 1/(0.4 − 0.3) = 10.

RD compares the outcome rates on an arithmetic scale, RR does the same on the multiplicative scale, while OR uses the odds scale. RRR expresses the treatment effect relative to the event rate in the controls. NNT represents the expected number of individuals that one has to treat in the experimental group in order to prevent (on average) one event, compared to the expected number of events when using the control therapy.

By manipulation of the expressions above, we may note that RRR = 1 − RR. Also, NNT = 1/RD. Hence, RRR does not provide any additional statistical information compared to RR, and NNT provides no additional information beyond RD. They may, however, have some advantages in terms of clinical usefulness [1,2]. For instance, NNT is a measure of the clinical effort required in order to achieve one additional beneficial outcome in a series of patients.

2.2. Commentary on the usefulness of the odds ratio

The odds ratio (OR) has been one of the most popular effect measures for statistical analysis, but a number of recent publications have been strongly critical of this measure. It is instructive to review some of this criticism, and some of the reaction that it has provoked.

For instance, Sinclair and Bracken [3] claimed that OR is not of value, indicating:

Because the control group's risk affects the value of the OR, the OR cannot substitute for the RR in conveying clinically important information to physicians. . . . The value of a typical estimate for reduction in relative odds does not correspond to that for reduction in RR.

The implication of this argument is that the OR is somehow "wrong" and that the RR is "right." Implicitly, Sinclair and Bracken are saying that they dislike the OR because it provides a different numerical assessment of effect compared to RR. They go further when they claim that "The OR magnifies the apparent size of the treatment effect. . . . [relative to RR]." Here, and in many other points in their article, these authors argue that a clinician is inclined to think on the RR scale, and may be led astray if he/she inadvertently interprets an OR as a RR estimate. To do so would then "exaggerate" the effect of treatment.
This type of argument does not recognize that OR and RR are, by their very definitions, different measures. They may sometimes be close numerically, but it is clearly an untenable position to routinely equate the two. Nevertheless, Sinclair and Bracken suggest that clinicians will typically fail to recognize the distinction. They even point out that “the bounds of the confidence interval around RR may exclude the point estimate of the OR.” Now, normally one would not expect the coverage of the confidence interval for one parameter to bear any relation to the point estimate of a second, distinct parameter. Yet by this statement, Sinclair and Bracken imply that some users of RR and OR would indeed expect such a relationship to hold true. In a similar vein, Davies et al. [4] discuss how the OR “misleads” when its numerical value differs from that of RR. They describe quantitatively and qualitatively how the two measures will differ, emphasizing the potential error when OR is interpreted as RR. They note, as do Sinclair and Bracken, that OR “overstates the case when interpreted as a relative risk,” but they also conclude that “there is no point at which the degree of overstatement is likely to lead to qualitatively different judgements about the study.” Zhang and Yu [5] propose a method to “correct” the odds ratio in order to improve its properties as an estimator of relative risk. A somewhat different theme is taken up by Rothwell [6], who aims for simplicity of reporting study results by requiring the effect measure to be approximately constant for patients at different levels of baseline risk. He states: Analyses based on relative odds are difficult to interpret because the relative odds reduction will inevitably vary with baseline risk, whereas the RRR is unaffected unless there is genuine variation in the relative treatment effect. The perspective of this author is immediately evident from the claim that the relative odds (or OR) will inevitably vary with baseline risk. In other words, he would not admit the possibility that OR might actually be constant in a given set of data. In contrast, he expects RRR (and hence RR) to remain constant across subgroups unless there is variation in the treatment effect. So the “effect” is, de facto, measured by the value of RR. A similar argument is made by Morabia et al. [7], who point to the nonequivalence of interaction on the RR and OR scales, describing it as a “fallacy.” Criticism of OR simply because it may not “agree” with RR seems unjustified; one can equally well turn that argument around and claim that RR is a poor measure because it does not agree with OR. Furthermore, there is little if any actual evidence that physicians do indeed misinterpret ORs as RRs, in the way that is implied by these various authors. Finally, whether a given measure is constant or variable across subgroups is largely an empirical question. More recently, Sackett [8] has also argued that OR has limited clinical usefulness, especially at the bedside. He recognizes that OR and RR may be numerically different: “In many trials, ORs are not even similar to RRs,” a fact that is
clearly true, but, again, not one which necessarily means that RR is to be preferred for data analysis. Implicitly, Sackett (like Rothwell) strives for simplicity in reporting results, by seeking a measure that is constant for patients at different levels of baseline risk. He notes: "If RRR is constant for different event rates, then OR is not constant." Sackett declares a preference for the RRR (which, we may recall, is a function only of RR), and for NNT as clinically useful measures. In addition to the simple relationship between NNT and RD, one may also express NNT in terms of RR as follows:

NNT = 1 / [P2 (1 − RR)]   (1)

Thus, if RR = constant, one may "calibrate" NNT on a patient-specific basis, as long as one can characterize an individual's baseline risk P2. One can take the overall value of RR, as estimated in a study, and apply it to patients at different levels of baseline risk for whom P2 can be estimated. All this is, however, founded on the belief (or hope) that RR is indeed constant, a fact that need not be true in any given situation.

Drawing a distinction with the relatively simple relationship of eq. (1) between RR and NNT, Sackett notes that there is no correspondingly simple conversion from OR to NNT. This deficiency of OR is one of the main reasons for Sackett's negativism towards it as a clinically useful measure.

Several correspondents objected strongly to Sackett's position [9–11]. Olkin [9] argued against RR, saying "risk ratios have fundamental flaws as a measure." Similarly, Senn [10] indicated "In recommending RR in preference to the OR or the (even better) log-OR, Sackett et al. are in danger of sacrificing truth on the altar of presumed relevance and simplicity."

In an earlier publication reviewing various possible effect measures for use in meta-analyses, Fleiss [12] wrote:

The OR is not prone to the artifactual appearance of interaction across studies due to the influence on other measures of association or effect of varying marginal frequencies or to constraints on one or the other sample proportion. On the basis of this and all of its other positive features, the OR is recommended as the measure of choice for measuring effect or association.

Here Fleiss too is seeking a measure with the "simplicity" characteristic. Specifically, he desires a measure that is constant across studies (i.e., not demonstrating study by treatment interactions). He strongly recommends OR as the preferred measure, and suggests that other measures (including RR and RD) may demonstrate spurious interactions, because of certain artifacts, as we will explore below in more detail.

To complete this review, we may go back to Cox's classic book [13] on the analysis of binary data. Cox provides four criteria that may be applied in choosing an effect measure:

1. "it is reasonable to require that if successes and failures are interchanged the measure of the difference between groups is either unaltered or at most changed in sign."
2. "It is desirable to work with a measure that remains stable over a range of conditions."
3. "If a particular measure has an especially clear-cut practical meaning, for example an explicit economic interpretation, we may decide to use it."
4. "A measure for which the statistical theory of inference is simple is, other things being equal, a good thing."

In his first criterion, Cox requires that the "labeling" of positive or negative outcomes in the study (or the events and nonevents) is arbitrary, and should not fundamentally affect the measure of treatment benefit. Criterion 2 raises, once again, the idea of simplicity in reporting; here Cox requires a measure that remains the same when the overall event rate changes. In criterion 3, we have the idea that content-matter considerations may dictate the choice of measure. In the context of clinical usefulness, for example, this criterion might direct the choice to be RR or NNT, according to the comfort level of the consumers (clinicians and patients) in using those measures. Finally, criterion 4 introduces the notion that ease of estimation is also a factor in the decision. Statistical properties have undoubtedly influenced the choice of measure in the past, and might explain the popularity of the OR in particular. It is interesting that Cox places this criterion last, perhaps implying that the mathematical properties are less important than the others.

3. Properties of effect measures

We now briefly review some properties of RD, RR, and OR in terms of their simplicity for interpretation, their statistical estimation, and their possible motivating models. The characteristics are summarized in Table 1.

Table 1
Properties of risk difference (RD), relative risk (RR), and odds ratio (OR)

Measure                                                        RD    RR    OR
Simple measure?                                                Yes   Yes   No
Symmetric (measure unaffected by labelling of study groups)?   Yes   No    Yes
Predicted event rates restricted to [0, 1] if measure
  is assumed constant?                                         No    No    Yes
Unbiased estimate available?                                   Yes   No    No
Efficient estimation in small samples?                         No    No    Yes
Motivating biological model available?                         Yes   Yes   Yes

3.1. Risk difference (RD)

Although we recognize that "simplicity" is a qualitative property, most would agree that RD is a simple measure, and is therefore easily understood. It satisfies Cox's first criterion of symmetry, because if the study groups are interchanged, RD simply changes sign. Thus, in our numerical example, a RD advantage of 10% in the mortality rates of the treated vs. controls just becomes a 10% deficit in terms of the controls relative to the treatment group.

A model of constant RD can, however, predict impossible event rates (i.e., outside the range [0, 1]). For example, if the model was RD = 0.1 for all subjects, then for individuals whose expected mortality rate under the control treatment is less than 10%, the predicted mortality rate under the new treatment would be negative.

An unbiased estimate of RD can be obtained from sample data, based on the difference of two independent binomial variables. The choice of RD as the effect measure can be motivated by a model in which there is a Poisson process for adverse events, with a possibly different rate for the process in each group [14]. Individuals who have experienced one or more adverse event "hits" in the study period are defined to have higher risk of the outcome. This model leads naturally to RD as the appropriate measure to express the effect of treatment.

Note that if the model RD = constant is valid, then NNT is also constant. Under this assumption, NNT then relates therapeutic effort to clinical yield for patients at various levels of baseline risk.

3.2. Relative risk (RR)

RR is also a simple and generally well-understood measure. However, it fails Cox's first criterion of symmetry. In the numerical example, RR = 0.75 when the ratio is formulated in terms of the treated event rate (0.3) over the control event rate (0.4). If instead of using mortality as the outcome, one had elected to study the corresponding survival rates, the ratio would be 0.7/0.6 = 1.167. This is not equivalent to the reciprocal of 0.75.

The problem of asymmetry of RR can have a profound effect on one's view of the study. In the hypothetical study 1 shown in Table 2, RR for mortality is 0.50 (treated vs. controls), which in most situations would be thought of as a large treatment effect. In contrast, RR for the equivalent survival rates is 1.01, which most observers would think of as a very small effect. Thus, we have fundamentally different impressions of the data, depending on whether we choose to analyze death or survival. This is a distinction that should make no difference at all.

In the summary data from hypothetical study 2 in Table 2, the death rates are much higher than before, but RR for death is the same (0.50) as in study 1. The RR for survival is now 3.00, which would usually be thought of as "large."
Table 2
Effect of analyzing "positive" or "negative" outcomes on relative risk (RR); two examples

                        Treated    Controls
Study 1
  Death rate            1%         2%
  Survival rate         99%        98%
  RR (for death)        0.50
  RR (for survival)     1.01
Study 2
  Death rate            40%        80%
  Survival rate         60%        20%
  RR (for death)        0.50
  RR (for survival)     3.00

Indeed, it is larger than 2.00, the reciprocal of the RR for death. Hence, even though RR for death is the same in the two studies, different conclusions are suggested if RR for survival is adopted as the measure.

This ambiguity in summarizing the data with RR has long been recognized [15] and is, in my opinion, very troubling. Sinclair and Bracken [3] have suggested that one can avoid this ambiguity by adopting the convention of only reporting unfavorable events. I find this an unrealistic proposal, because there are many practical situations where one routinely reports the favorable event. For instance, in analyzing "remission of symptoms," one would usually regard the (favorable) elimination of symptoms as the outcome. Finally, there are numerous circumstances where an optimistic orientation might be required for risk communication. For example, surgeons may tell their prospective patients that they have a 99% chance of surviving an operation, rather than telling them about the 1% risk of death, even though the two statements contain the same statistical information. Similarly, one would probably discuss the chances of a favorable "return to work" outcome with patients in a rehabilitation program.

Like RD, a model of constant RR can also predict impossible event rates outside [0, 1]. For example, suppose one assumes RR = 2.00 for an outcome in the treated vs. the control group. Then an individual with P2 = 0.6 would have a predicted outcome rate of 0.6 × 2.00 = 1.2 if he were in the treated group. The range limitations imposed on RR (and, as seen earlier, on RD) are taken by Greenland [16] to indicate "purely logical reasons for disbelieving constancy of the difference or ratio of proportions."

An unbiased estimate of RR cannot be obtained in general, although the bias is small in large samples. RR arises as a natural effect measure under a Poisson model in which "hits" correspond to protective events [14]. Individuals with no "hit" are then defined to be at higher risk for the outcome.

If RR = constant, then NNT is inversely proportional to P2, as can be seen in eq. (1). If this model is correct, one can use the value of RR as reported in a study, and apply it to individual patients at different baseline risks P2.

3.3. Odds ratio (OR)

The OR is undeniably the most difficult measure to intuit, so it is likely to be less useful than RD or RR for communicating risk. OR meets Cox's criterion of symmetry. In our example, OR = 0.64 when considering the event rate in the treated vs. controls. If the study groups are interchanged, one obtains OR = (0.4/0.6)/(0.3/0.7) = 1.56, which is simply the reciprocal of 0.64 (apart from rounding error). Hence, if log(OR) is used, only a change of sign is required. Also, OR is the only one of the three measures that is guaranteed to avoid impossible predicted event rates under a model in which the measure is constant.

Like RR, no unbiased estimate of OR is available, but the bias will be small in reasonably large samples. Greenland [16] has also noted that estimation of OR is highly efficient in small samples, whereas RD and RR are inefficiently estimated.
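Before turning to further arguments for OR, the symmetry contrasts just described can be made concrete with a short sketch. The code below is illustrative only (the function names are mine, not from the original paper); it evaluates the running example P1 = 0.3, P2 = 0.4 under group interchange and outcome relabeling.

```python
# Illustrative sketch: the running example (P1 = 0.3 treated, P2 = 0.4 controls)
# and the symmetry behaviour of each measure, as summarized in Table 1.

def rd(p1, p2):
    return p2 - p1                                   # risk difference, control minus treated

def rr(p1, p2):
    return p1 / p2                                   # relative risk, treated over control

def odds_ratio(p1, p2):
    return (p1 / (1 - p1)) / (p2 / (1 - p2))

p1, p2 = 0.3, 0.4
print(round(rd(p1, p2), 3), round(rr(p1, p2), 3), round(odds_ratio(p1, p2), 3))
# 0.1, 0.75, 0.643

# Interchanging the study groups: RD changes sign; OR inverts, so log(OR) changes sign.
print(round(rd(p2, p1), 3), round(odds_ratio(p2, p1), 3))   # -0.1, 1.556 (= 1/0.643)

# Relabeling the outcome (death -> survival): RR is NOT simply inverted.
print(round(rr(1 - p1, 1 - p2), 3))                 # 0.7/0.6 = 1.167, not 1/0.75 = 1.333
```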
There are a number of other reasons to support OR as the choice of effect measure. First, consider a model in which the two groups being compared have underlying normal distributions for some quantity of interest; for example, one might measure serum creatine kinase levels in subjects who may or may not have experienced a myocardial infarction [17]. One can set a diagnostic threshold above which an individual is defined to be a "case" as opposed to a noncase. Under this model the OR is approximately constant for any choice of cut-point [18]. (It is exactly constant if the underlying distributions are logistic rather than normal; the differences between the two are small and mainly affect the extreme tails.)

Second, OR arises when using likelihood ratios. For instance, again in the context of diagnostic testing, the posttest odds of disease = OR × (pretest odds). Lachenbruch [19] has recommended OR in the context of regulatory approval of new diagnostic tests. Note, however, that few practicing physicians are familiar with likelihood ratios [20].

Third, in the 2 × 2 contingency table, OR is the single parameter in the noncentral hypergeometric distribution that determines the sampling distribution of the data. With OR, these ideas extend easily to problems with multiple factors, some of which may have more than two levels; in contrast, some of the difficulties identified with RD and RR become more extreme in these situations. Accordingly, OR is the fundamental measure associated with the analysis of multiway contingency table data, log-linear models, and logistic regression in particular. Finally, there is the well-known fact that OR alone can be estimated from any of the three basic epidemiologic study designs (retrospective, prospective, or cross-sectional).

4. Numerical evaluation

We now consider a numerical evaluation of RD, RR, and OR, to determine how widely these measures might differ in practice. Our approach is to consider various values for the outcome rate P2, and the corresponding predicted values P1 under the models RD = constant, RR = constant, and OR = constant in turn.

It is convenient to define a given set of evaluations according to the value of one of the measures: arbitrarily, we used RR as this starting point. For each set of evaluations, we also defined a special (but again arbitrary) value P2* at which the three measures would agree, and which has the effect of defining the constant values of RD and OR. A value for P2* is necessary to provide a basis for comparison of the measures; it can also be thought of as a "central" value where the bulk of the data reside and determine the overall value for each measure in turn.

As an example, suppose we fix RR = 0.75 and P2* = 0.5. Then, at P2 = 0.5 we find P1 = 0.5 × 0.75 = 0.375 under the constant RR model. These values of P1 and P2 correspond to RD = 0.5 − 0.375 = 0.125, and OR = (0.375/0.625)/(0.5/0.5) = 0.60. Accordingly, we used RD = 0.125 as the model of constant RD and OR = 0.60 as the model of constant OR when the three measures are compared at other values of P2.
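This evaluation is straightforward to reproduce. The following sketch (my own illustrative code; the equation numbers refer to the Appendix) derives the constant RD and OR implied by a chosen RR and agreement point P2*, and then tabulates the three predictions across a grid of baseline risks:

```python
# Illustrative sketch of the evaluation described above: fix RR and the agreement
# point P2*, derive the constant RD and OR implied at P2*, then predict P1 across
# a grid of baseline risks P2.

def predicted_rates(rr, p2_star, p2_grid):
    p1_star = rr * p2_star                           # all three models agree here
    rd = p2_star - p1_star                           # constant RD = P2*(1 - RR)
    odds = (p1_star / (1 - p1_star)) / (p2_star / (1 - p2_star))  # constant OR
    for p2 in p2_grid:
        p1_rr = rr * p2                              # eq. (A1)
        p1_rd = p2 - rd                              # eq. (A2)
        p1_or = odds * p2 / (1 + p2 * (odds - 1))    # eq. (A5)
        yield p2, p1_rd, p1_rr, p1_or

# Reproduce the setting of Fig. 1: RR = 0.75, P2* = 0.5, so RD = 0.125 and OR = 0.60.
for p2, p1_rd, p1_rr, p1_or in predicted_rates(0.75, 0.5, [i / 10 for i in range(1, 10)]):
    note = "" if 0 <= p1_rd <= 1 else "  <- RD model inadmissible"
    print(f"P2={p2:.1f}  P1(RD)={p1_rd:.3f}  P1(RR)={p1_rr:.3f}  P1(OR)={p1_or:.3f}{note}")
```

In this setting the RD prediction at P2 = 0.1 is −0.025, the inadmissible value visible in Fig. 1.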
Given the three constant model parameter values, one can compare the predicted values of P1 for a range of values of P2. We took P2 to range from 0.1 to 0.9. We denote the predicted values of P1 by P1(RD), P1(RR), and P1(OR), corresponding to the models of constant RD, RR, and OR.

Fig. 1 shows a typical set of results, for the assumed values RR = 0.75 and P2* = 0.5. Overall, all three predicted values of P1 are very close, differing typically by at most 10%. The constant OR model predicts higher values of P1 when P2 > P2*, and slightly lower values than the RR model when P2 < P2*. The RD model gives the lowest values of P1 when P2 < P2*, but it predicts values between those of the OR and RR models otherwise. Note that the RD model has inadmissible (negative) predicted values in the neighborhood of P2 = 0.1.

Fig. 1. Predicted event rate P1 in treated group, under models of constant RR, RD, or OR. RR = 0.75, P2* = 0.5.

Fig. 2 shows similar results, except that now P2* = 0.1. Thus, in contrast to Fig. 1, where the models agree at central values of risk P2, Fig. 2 depicts a situation where the analysis would be dominated by individuals at lower baseline risk. In Fig. 2, the RR and OR predictions are always below those from the RD model, with the RR model giving the lowest values. The RR and OR curves diverge as P2 increases, but one has to pass from a baseline of P2 = 0.1 to the opposite extreme of high baseline risk in order to induce a large difference between the RR and OR models. This would represent extreme extrapolation from the bulk of the data near P2* to individuals at totally different levels of baseline risk.

Fig. 2. Predicted event rate P1 in treated group, under models of constant RR, RD, or OR. RR = 0.75, P2* = 0.1.

In Fig. 3, P2* = 0.9. Here the RR and OR curves are most discrepant in the central range of P2, with up to about 20% differences in their predicted values of P1. But at low values of P2, the two curves are again quite close. The RD model gives inadmissible (negative) values of P1 for P2 below about 0.2.

Fig. 3. Predicted event rate P1 in treated group, under models of constant RR, RD, or OR. RR = 0.75, P2* = 0.9.

Fig. 4 exemplifies a situation where RR > 1, for instance, if the event rate in the treated group is higher than in controls. Specifically, RR = 1.5 and P2* = 0.1. Here the OR and RD curves are always quite close, but the RR curve diverges from the others as P2 increases, and achieves inadmissible values of P1 (greater than 1) for high values of P2.

Fig. 4. Predicted event rate P1 in treated group, under models of constant RR, RD, or OR. RR = 1.5, P2* = 0.1.

Finally, in Fig. 5, RR = 1.5 again, but P2* = 0.5. In this situation both the RR and RD models predict values of P1 > 1. The three curves are reasonably close when they assume admissible values of P1, with various orderings in their predicted values.

Fig. 5. Predicted event rate P1 in treated group, under models of constant RR, RD, or OR. RR = 1.5, P2* = 0.5.

5. General results

One can show that the following general condition applies to the ordering of P1(RD), P1(RR), and P1(OR):

If (RR − 1)(P2 − P2*) > 0, then P1(RR) > P1(RD) and P1(RR) > P1(OR)   (2)

See the Appendix for details of the proof of result (2). It implies that the constant RR model will predict the highest value of P1 if RR > 1 and P2 > P2*. As an illustration, if we refer to Fig. 5, the condition RR > 1 is satisfied (because RR = 1.5); thus, P1(RR) is the highest of the three model values if P2 > P2* (i.e., for P2 > 0.5). But for P2 < 0.5, condition (2) does not hold, and then P1(RR) is lower than the other two model values.

Result (2) indicates that there will be one of four possible orderings of the predicted values of P1 for a given value of P2, viz.

(i)   P1(RR) ≥ P1(OR) ≥ P1(RD)
(ii)  P1(RR) ≥ P1(RD) ≥ P1(OR)
(iii) P1(RD) ≥ P1(OR) ≥ P1(RR)
(iv)  P1(OR) ≥ P1(RD) ≥ P1(RR)   (3)
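Result (2) and the orderings in (3) are easy to spot-check numerically. A small sketch (my own code, not part of the original analysis) samples admissible parameter values and confirms that P1(RR) is the extreme prediction exactly when (RR − 1)(P2 − P2*) is positive:

```python
import random

# Numerical spot-check of result (2): P1(RR) exceeds both other predictions
# exactly when (RR - 1)(P2 - P2*) > 0, and is the smallest when the sign reverses.
random.seed(1)
checked = 0
while checked < 100_000:
    rr = random.uniform(0.1, 3.0)
    p2_star = random.uniform(0.05, 0.95)
    p2 = random.uniform(0.05, 0.95)
    if not 0 < rr * p2_star < 1:                     # keep the agreement point admissible
        continue
    sign = (rr - 1) * (p2 - p2_star)
    if abs(sign) < 1e-9:                             # skip numerically ambiguous cases
        continue
    rd = p2_star * (1 - rr)                          # constant RD implied at P2*
    odds = rr * (1 - p2_star) / (1 - rr * p2_star)   # constant OR implied at P2*, eq. (A6)
    p1_rr = rr * p2                                  # eq. (A1)
    p1_rd = p2 - rd                                  # eq. (A3)
    p1_or = odds * p2 / (1 + p2 * (odds - 1))        # eq. (A5)
    if sign > 0:
        assert p1_rr > p1_rd and p1_rr > p1_or
    else:
        assert p1_rr < p1_rd and p1_rr < p1_or
    checked += 1
print("result (2) confirmed on", checked, "sampled configurations")
```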
6. Empirical examples

Two studies of stroke prevention discussed by Rothwell [6] provide examples of how the various effect measures might vary between study strata. In both studies, participants were assessed at baseline for the risk of stroke, and this risk prediction was then used to create strata for the analysis.

In the first study (ECST, the European Carotid Surgery Trial), participants were randomized to surgical intervention or control; the basic results are shown in Table 3. For patients at low risk, there were slightly higher rates of stroke for those in the surgical group compared to control, but for higher risk patients surgery yielded lower rates of subsequent stroke. Hence, RD increases with baseline risk, and RR and OR decrease.

In the second study (the UK-TIA trial), randomization was between treatment with aspirin and control. In this instance participants at lower risk benefited more from the intervention, while the benefit for those at higher risk was much more modest (Table 3). Here, therefore, RD decreases and RR and OR increase with baseline risk.

Table 3
Examples of variation in effect measures in strata of two studies(a)

                       Stroke rate
Predicted risk         Control group   Treated group   RD       RR     OR
ECST study
  < 10%                8.4%            9.8%            −1.4%    1.16   1.18
  10–15%               14.0%           7.0%            7.0%     0.50   0.46
  > 15%                20.2%           6.1%            14.1%    0.30   0.26
UK-TIA study
  < 10%                6.7%            2.9%            3.8%     0.43   0.42
  10–15%               11.5%           8.9%            2.6%     0.77   0.75
  > 15%                15.4%           14.9%           0.5%     0.97   0.96

(a) See [6] for details.

In both studies, OR is close to RR because the stroke event rates were quite low. The data suggest that none of the three effect measures could reasonably be regarded as constant for all groups of patients in either study, and indeed that the direction of change in the measures was opposite in the two studies.

As a second example, a recent survey [21] of 115 meta-analyses found that approximately one third of them demonstrated a significant relationship of RD to P2, but only about 13% showed significant relationships of RR or OR to P2. Thus, heterogeneity may be eliminated somewhat more often by using a multiplicative or odds scale. In general, however, there is little information on the best scale to reduce or eliminate heterogeneity associated with covariables other than P2.

7. Discussion

Both clinician and statistician investigators want to summarize and represent their data as simply as possible. In choosing an effect measure, one requires an index of risk that is generalizable to a wide variety of circumstances. So, for instance, one might choose a measure because it can be shown to be applicable to patients at different underlying baseline risk; or one might find in a meta-analysis that the measure is approximately constant across different studies. Constancy of the effect measure across strata is equivalent to no interaction of the effect by strata. It is important to remember that a claim of "no interaction" is inherently dependent on the scale by which the effect is measured [14,16].

In my opinion, the choice between measures should, in the first place, be based on the findings in the data, for instance, according to how well the observations fit a model of constant RD, RR, or OR. The initial choice should not be made entirely on the basis of mathematical convenience, or on a crude or prejudicial presumption of the superiority of a measure's comprehensibility to users. Indeed, as seen in the empirical examples (cf. Table 3), none of these models may apply in actual data. A survey of meta-analyses suggested that RR and OR may be regarded as constant more often than RD. However, the potential to distinguish the performance of competing models may be low in single studies [22] unless precise estimates of event rates can be determined in all the relevant risk groups. So-called mixture models can be used as a hierarchical framework to evaluate the fit of alternative models to a given set of data (see Thomas [23] for an example).

As seen earlier, various other strengths and weaknesses pertain to the various effect measures. While RD and RR enjoy greater ease of interpretation, OR has several statistical advantages. For instance, models of constant RR or constant RD can predict risks greater than 100% or less than 0%, a feature that Greenland [16] has described as logically ruling these measures out, if the measure of choice is to be regarded as constant. In contrast, such impossible event rates cannot occur if the model is based on OR. The conclusions from an analysis based on RR can be profoundly influenced by the arbitrary choice of whether risk is expressed in terms of the "positive" event (e.g., survival) or its "negative" complement (e.g., death). This problem is avoided with RD or OR. To a close approximation, OR is the natural measure that emerges when the study groups to be compared have underlying cumulative normal response probabilities, or when population subgroups (e.g., disease cases and noncases) are compared in the evaluation of a diagnostic test, with underlying normal distributions for the test values. In turn, the normal distribution also has a strong theoretical and empirical foundation [12,13]. Generalizations to exposures and outcomes with more than two levels, and to multiway tables when several variables are involved, are more easily achieved using OR.

There should be a clear separation of the statistical data analysis from the problem of how to express risk to clinicians and patients. The data analysis should be guided at a fundamental level (if possible) by a motivating biological model, and by the empirical fit of the model to the data. Ideally the model will point to a suitable choice of effect measure. Once the analysis has been completed, one can then turn to the question of risk communication. If, for instance, clinicians and their patients feel more comfortable using RR to
guide their decisions, whereas the previous analysis of the data had used ORs, there is no technical difficulty in achieving a numerical "conversion" from one scale to the other, once a baseline level of risk P2 is established (for instance, eq. (A5) implies RR = OR / [1 + P2 (OR − 1)]).

One should recognize that measures of risk such as RR and OR are inherently different, and therefore numerical differences between them should not be surprising. So, for example, the rejection of OR by many authors just because it does not "agree" with RR in some situations seems illogical and scientifically unfounded. I feel that the discussion of the alleged tendency of OR to "mislead" (when it is interpreted as RR) itself distracts from the real issue: that of choosing a biologically meaningful measure in the first place. It is only in special situations that one would wish to routinely regard one measure as a representation of another measure. A good example of this is the well-known approximation of RR by OR in case-control studies of rare diseases [24].

In practice, biological justification of the statistical model is often difficult to achieve. We have mentioned simple Poisson processes that can lead to additive or multiplicative rate models. These ideas have been applied, for instance, to theories of multi-stage carcinogenic initiation and promotion [25]. Greenland [16] has shown how RD and RR can be used for inference from aggregated data to the individual: for example, RD can be thought of as the average change in individual risk associated with a change in treatment. In contrast, the same is not true of OR: the overall OR cannot be interpreted as the average of the risk odds for individuals in a population. The OR can nevertheless be justified in other ways, as indicated earlier.

Our numerical evaluations of models that assumed RD, RR, and OR to be constant suggested that they differ rather little, unless a very wide range of outcome probabilities is involved, and only if one engages in substantial extrapolation over the range of baseline risk P2. For a given level of baseline risk, the predicted values of risk P1 otherwise typically differ by less than 10%. This implies that it may be difficult to discern a superior model for a given set of data; two or more models may fit equally well (or badly). Furthermore, it may be difficult on substantive grounds to justify extrapolation of a given treatment effect measure over a very wide range of baseline risks. Patients at particularly high or low baseline risk may differ in other ways that affect their prognosis, to the extent that it would then be unreasonable to expect any measure of treatment effect to apply to all patients. Predicted event rates outside the acceptable 0–100% range under the constant RR and RD models are an extreme indication of the inappropriateness of such extrapolation.

That the models are numerically similar in most circumstances is essentially the same conclusion as that reached by Davies et al. [4] in their comparison of RR and OR. They found that the only case in which substantial numerical discrepancies could be induced between RR and OR was when the effect size was large and the baseline risk was high. However, one would conclude that there is an important effect using either measure in this case. Elsewhere one would also reach qualitatively similar conclusions about the treatment with either measure.

Despite this general concordance of the models, there is evidence [26–28] that the choice of effect measure can influence clinical decision making. Clinicians apparently will value treatments differently according to which scale is used to present the data to them, even though the underlying facts remain the same. Specifically, they tend to rate a given treatment as more effective when its performance is reported to them using relative risk reduction measures than when using absolute risk or NNT. Bucher [28] concludes that physicians are affected by the "predominant use of reduction of relative risk in trial reports and advertisements," and Naylor et al. [27] comment on the "under-use of summary measures that relate treatment burden to therapeutic yields" (referring to NNT). These findings illustrate the importance of maintaining a clear separation between the tasks of data analysis and interpretation by researchers, and the subsequent communication of risk by clinicians to their patients.

Prevailing practice in journals may be a strong determinant of how investigators choose to report results. Given the important effects on consumers of the scale in which effect sizes are reported, editorial policies should encourage authors to justify their choice of analytic model and summary measure, and if appropriate to report the most important findings with alternative measures. For example, adding NNT values to RR estimates as measures of therapeutic effectiveness would facilitate the interpretation of results in "a clinically relevant manner" [27]. Also, further work seems needed to determine the impact on readers and other consumers of reporting results in different ways.

In conclusion, it appears that opinions are still divided about the merits of the alternative effect measures, especially OR. RD and RR may sometimes have a biological rationale, and enjoy greater popularity in the domain of risk communication, possibly because of the simpler and more familiar mental arithmetic involved. Despite this, there are other justifications for the OR, which probably remains the leading candidate for purposes of data analysis.
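To make the conversion point raised above concrete, a brief sketch follows (my own illustration; the identity RR = OR / [1 + P2 (OR − 1)] follows from eq. (A5) and is essentially the correction of Zhang and Yu [5]). It recovers baseline-specific RR and NNT values from an overall OR.

```python
# Illustrative conversion from an analysis-scale OR to communication-scale
# RR and NNT at a specified baseline risk P2 (identity from eq. (A5)).

def rr_from_or(odds_ratio, p2):
    return odds_ratio / (1 + p2 * (odds_ratio - 1))

def nnt_from_or(odds_ratio, p2):
    p1 = rr_from_or(odds_ratio, p2) * p2             # predicted treated-group rate
    return 1 / (p2 - p1)                             # NNT = 1/RD, cf. eq. (1)

# The (rounded) OR of 0.64 from Section 2.1, applied at two baseline risks.
for p2 in (0.2, 0.4):
    print(f"P2={p2:.1f}: RR={rr_from_or(0.64, p2):.2f}, NNT={nnt_from_or(0.64, p2):.1f}")
```

At P2 = 0.4 this recovers RR of approximately 0.75 and NNT of approximately 10, the values of the running example.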
Appendix. Comparison of predicted risk in treated group, based on models of constant RD, RR, and OR

The goal of the numerical evaluations was to compare the values of the event rate in the treated group for given levels of baseline risk P2, under models in which RD, RR, and OR are assumed constant in turn. The starting point was a specification of the value of RR and a particular value P2* of the baseline risk.

First, under the model that RR = constant, the predicted risk in the treated group at a given value of P2 is

P1(RR) = RR · P2   (A1)

Second, the model RD = constant implies that

P1(RD) = P2 − RD   (A2)

for all P2. At P2* in particular, where the models of constant RR and RD are assumed to agree, we have from eq. (A1) that P1(RD) = RR · P2*. Hence, the constant value of RD is

RD = P2* − (RR · P2*) = P2* (1 − RR).

Applying this in model (A2) gives the general predicted risk under the constant RD model as

P1(RD) = P2 − P2* (1 − RR)   (A3)

Note that the prediction under the constant RD model is now given in terms of the values RR and P2* defining a particular numerical comparison. Comparing the risk predictions (A1) and (A3) under the RR and RD models gives

P1(RR) − P1(RD) = (RR · P2) − P2 + P2* (1 − RR) = (RR − 1)(P2 − P2*)

Thus,

P1(RR) > P1(RD) if RR > 1 and P2 > P2*   (A4)

or if both these conditions are false. This proves result (2) for the comparison of the RR and RD models.

We can pursue similar ideas to compare the model of constant RR with the model of constant OR. From the definition of OR we may derive that

P1 = OR · P2 / [1 + P2 (OR − 1)]   (A5)

Now using the fact that the RR and OR models are defined to agree at P2 = P2*, we have

OR · P2* / [1 + P2* (OR − 1)] = RR · P2*

which after simplification yields

OR = RR (1 − P2*) / (1 − RR · P2*)   (A6)

We now substitute from (A6) into (A5) to obtain, after some simplification:

P1(OR) = P2 · RR · (1 − P2*) / [RR (P2 − P2*) + (1 − P2)]   (A7)

Finally, comparing (A7) and (A1) shows that P1(RR) > P1(OR) if (RR − 1)(P2 − P2*) > 0, which is the same condition as (A4) governing the relative sizes of P1(RR) and P1(RD). There appears to be no general rule determining the relative sizes of P1(RD) and P1(OR). Putting these results together yields the possible orderings of risk predictions under the models of constant effect measure, shown in the set of inequalities (3).
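As a cross-check, the algebra above can also be verified symbolically; the following sketch assumes the sympy library is available.

```python
import sympy as sp

# Symbolic verification of the Appendix identities.
RR, P2, P2s = sp.symbols('RR P2 P2s', positive=True)   # P2s stands for P2*

# Constant OR implied by agreement with the RR model at P2* -- eq. (A6).
OR = RR * (1 - P2s) / (1 - RR * P2s)

# Predicted risk under the constant OR model, eq. (A5), should reduce to eq. (A7).
p1_or = OR * P2 / (1 + P2 * (OR - 1))
a7 = P2 * RR * (1 - P2s) / (RR * (P2 - P2s) + (1 - P2))
assert sp.simplify(p1_or - a7) == 0

# P1(RR) - P1(RD) should equal (RR - 1)(P2 - P2*), the condition in (A4).
p1_rr = RR * P2                                        # eq. (A1)
p1_rd = P2 - P2s * (1 - RR)                            # eq. (A3)
assert sp.expand(p1_rr - p1_rd - (RR - 1) * (P2 - P2s)) == 0

print("eqs. (A3), (A6), and (A7) check out symbolically")
```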
References

[1] Sackett DL, Cook RJ. Understanding clinical trials. BMJ 1994;309:755–76.
[2] McQuay HJ, Moore RA. Using numerical results from systematic reviews in clinical trials. Ann Intern Med 1997;126:712–20.
[3] Sinclair JC, Bracken MB. Clinically useful measures of effect in binary analyses of randomized trials. J Clin Epidemiol 1994;47:881–9.
[4] Davies HTO, Crombie IK, Tavakoli M. When can odds ratios mislead? BMJ 1998;316:989–91.
[5] Zhang J, Yu KF. What's the relative risk? A method of correcting the odds ratio in cohort studies of common outcomes. JAMA 1998;280(19):1690–1.
[6] Rothwell PM. Can overall results of clinical trials be applied to all patients? Lancet 1995;345:1616–9.
[7] Morabia A, Ten Have T, Landis R. Interaction fallacy. J Clin Epidemiol 1997;50:809–12.
[8] Sackett DL. Down with odds ratios! Evidence-Based Med 1996;1:164–6.
[9] Olkin I. Odds ratios revisited. Evidence-Based Med 1998;3:71.
[10] Senn S. Odds ratios revisited. Evidence-Based Med 1998;3:71.
[11] Walter SD. Odds ratios revisited. Evidence-Based Med 1998;3:71.
[12] Fleiss JL. Measures of effect size for categorical data. In: Cooper H, Hedges LV, editors. The Handbook of Research Synthesis. New York: Russell Sage, 1994.
[13] Cox DR. Analysis of Binary Data. London: Methuen, 1970.
[14] Walter SD, Holford TR. Additive, multiplicative and other models of disease risks. Am J Epidemiol 1978;108:341–6.
[15] Sheps MC. Shall we count the living or the dead? N Engl J Med 1958;259:1210–4.
[16] Greenland S. Interpretation and choice of effect measures in epidemiologic analyses. Am J Epidemiol 1987;125:761–8.
[17] Sackett DL, Haynes RB, Tugwell P. Clinical Epidemiology. Boston: Little, Brown, 1985.
[18] Fleiss JL. On the asserted invariance of the odds ratio. Br J Prev Soc Med 1970;24:45–6.
[19] Lachenbruch PA. The odds ratio. Controlled Clin Trials 1997;18:381–2.
[20] Reid MC, Lane DA, Feinstein AR. Academic calculations versus clinical judgements: practicing physicians' use of quantitative measures of test accuracy. Am J Med 1998;104(4):374–80.
[21] Schmid CH, Lau J, McIntosh MW, Cappelleri JC. An empirical study of the effect of control rate as a predictor of treatment efficacy in meta-analysis of clinical trials. Stat Med 1998;17:1923–42.
[22] Callas PW, Pastides H, Hosmer DW. Empirical comparisons of proportional hazards, Poisson, and logistic regression modeling of occupational cohort data. Am J Ind Med 1998;33(1):33–47.
[23] Thomas DC. General relative-risk models for survival time and matched case-control analysis. Biometrics 1981;37(4):673–86.
[24] Pearce N. What does the odds ratio estimate in a case-control study? Int J Epidemiol 1993;22:1189–92.
[25] Armitage P, Doll R. The age distribution of cancer and a multi-stage model of carcinogenesis. Br J Cancer 1954;8:1–12.
[26] Forrow L, Taylor WC, Arnold RM. Absolutely relative: how research results are summarized can affect treatment decisions. Am J Med 1992;92:121–4.
[27] Naylor CD, Chen E, Strauss B. Measured enthusiasm: does the method of reporting trial results alter perceptions of therapeutic effectiveness? Ann Intern Med 1992;117:916–21.
[28] Bucher HC, Weinbacher M, Gyr K. Influence of method of reporting study results on decision of physicians to prescribe drugs to lower cholesterol concentration. BMJ 1994;309:761–4.