
Alternative Perspectives on the Biased Foundations of Medical Technology Assessment George A. Diamond, MD, and Timothy A. Denton, MD

• Medical technology assessment seeks to improve the care of individual patients (the conventional unit of clinical practice) through evaluation studies conducted in groups of patients (the conventional unit of clinical investigation). This distinction between individuals and groups has practical relevance to the design, analysis, and clinical applicability of technology assessment studies. We define several biased perspectives about technology assessment that derive from the distinction between individuals and groups: a misguided emphasis on efficacy versus effectiveness, on statistical significance versus clinical importance, and on objective versus subjective outcomes. In each case, we contrast these alternative perspectives and speculate on their implications for health care policy. Annals of Internal Medicine. 1993;118:455-464. From Cedars-Sinai Medical Center and the School of Medicine, University of California, Los Angeles, California. For current author addresses, see end of text.

© 1993 American College of Physicians

The truths which are ultimately accepted as the first principles of a science, are really the last results of metaphysical analysis . . .
J. S. Mill, Utilitarianism

Health care costs currently account for 14% of the U.S. Gross National Product (an increase from 9.4% in 1980) and are increasing at three times the rate of the Consumer Price Index (1). The high cost of medical technology is only one of many reasons for these increases. Nevertheless, most physicians think that unnecessary use of medical technology has contributed to the rising cost of health care (2); similar views are voiced both by the general public and by elected officials (3). In response, increasing attention is directed at the process of medical technology assessment. This process, however, is built on a foundation of "metaphysical" assumptions about study design, data analysis, and practical clinical application (4). We analyze the rational basis for several of these assumptions and discuss their implications for health care policy. Our goal is to optimize the relevance of technology assessment to the care of the individual patient.

Efficacy versus Effectiveness

New medical technology, whether drug or device, is usually evaluated under optimal conditions: in highly selected patient populations, by the best trained physicians, and in academic centers of excellence. As a result, the diffusion of technology from the investigational laboratory to clinical practice is fueled more by the promise of performance than by performance itself; despite this imbalance, both these characteristics are important. The drug, device, or procedure must first have utility among a group of patients in an ideal setting (efficacy), but it must also have utility for the individual patient in a realistic clinical setting (effectiveness) (5). Reports of efficacy get published in journal articles, but many of these articles fail to emphasize the complex learning curve and limited range of applicability associated with a new technology, factors that are critical to the technology's effectiveness. Clinicians need to know more about the way a technology works in the real world, populated by real doctors, real patients, and real problems, than they do about the way it works in the Utopian world described in a journal article (4, 6). We shall illustrate this distinction with a relatively common example: the analysis of crossovers in a randomized clinical trial.

Crossovers

Unintentional treatment crossover in a clinical trial occurs when a patient allocated to one treatment group receives the treatment intended for the other group. In a trial of coronary artery bypass surgery, for example, a patient assigned to receive medical treatment might decide to have surgery during the follow-up period because of worsening symptoms, or a patient assigned to surgery might refuse the procedure and opt for medical treatment instead.

Crossovers such as these are common in trials of coronary artery bypass surgery. In the Veterans Affairs (VA) Study (7), 17% of patients allocated to medical treatment had surgery during the next 21 months, and 6% of patients allocated to surgical treatment during that same period refused the procedure. The cumulative crossover to surgery in the VA trial increased to 30% after 8 years and to 38% after 11 years (8). In the European Coronary Surgery Study (ECSS) (9), crossover rates after 5 years were 24% for medical allocations and 7% for surgical allocations. In the Coronary Artery Surgery Study (CASS) (10), crossover rates after 5 years were 24% for medical allocations and 8% for surgical allocations.

Crossover rates of this magnitude raise serious concerns about bias. In CASS, noncompliant patients crossing from medical allocation to surgical treatment were more symptomatic and had more anatomic disease than did their compliant counterparts. The annual crossover rates for those in the medical group with single-, double-, and triple-vessel disease were 2.0%, 4.2%, and 7.6%, respectively (P < 0.001) (11). In contrast, patients crossing from surgical allocation to medical treatment had fewer symptoms and less anatomic disease; 15% of patients with single-vessel disease refused surgery and crossed to medical therapy, whereas only 3% with triple-vessel disease did so (P = 0.02) (10). By inference, then, treatment crossover in the CASS was related to the magnitude of myocardial ischemia and the risk for subsequent coronary events. This bias favored medical treatment, because sicker patients allocated to medical treatment tended to cross over to surgery (thereby decreasing the proportion of events among those actually receiving medical treatment), whereas healthier patients allocated to surgical treatment tended to refuse surgery (thereby increasing the proportion of events among those actually receiving surgical treatment).

Because conclusions from clinical trials influence medical practice (11) and because this bias can lead to erroneous conclusions, the analysis and interpretation of such trials remain controversial. Several analytical methods can be used to deal with the problem. The first method is analysis by treatment received: patients are grouped according to the actual treatment administered, regardless of the initial assignment. The second method is analysis by treatment assigned: patients are grouped according to the initial treatment allocation (the intention to treat), regardless of the actual treatment administered. These alternatives are the core of a long-standing debate between clinicians and statisticians. The clinicians argue that analysis by treatment assigned is unrealistic and misleading; they claim that patients not receiving treatment should not be analyzed as if they had. The statisticians argue that analysis by treatment received is naive and improper, because it is not the administration, but the allocation, of treatment that has been randomized; they maintain that a randomized trial should analyze only randomized groups.
This debate can be resolved by recognizing that each analytic approach is best suited to a particular kind of trial (12, 13).

Explanatory Trials

According to Schwartz and Lellouch (12), an explanatory trial is aimed at efficacy and understanding. The explanatory trial seeks to verify a biological hypothesis such as: Treatment A is better than Treatment B. If we are interested in comparing the effect of coronary artery bypass surgery and coronary angioplasty on long-term survival, we might restrict the study to older patients with multivessel disease in whom survival differences will be easier to show during a reasonably short follow-up period. This design has two advantages. First, the investigator has precise control of the risk for wrongly concluding that A and B are different when they actually are not (the conventional false-positive or Type I error) by choosing the threshold for statistical significance. Second, the investigator can also control the risk for wrongly concluding that A and B are not different when they actually are (the conventional false-negative or Type II error) by determining the number of patients in the study. The disadvantage, however, is that the sample might be defined in a limited manner that restricts the inferential power of the conclusions

(for example, with respect to younger patients or those with single-vessel disease). The goal of an explanatory trial requires that one compare groups defined by the treatment they actually received. Accordingly, treatment crossovers must be excluded from analysis, even though such exclusions will bias the analysis whenever the crossovers do not occur randomly. As a result, the explanatory power of such (pseudorandomized) trials is compromised, and they can no longer be relied on to determine if one treatment is better than the other.

Pragmatic Trials

A pragmatic trial is aimed at effectiveness and decision (12). The pragmatic trial seeks to define the utility of choosing among available alternatives, thereby minimizing the probability of administering the inferior treatment: Should we recommend Treatment A or Treatment B? The disadvantage of the pragmatic trial is that it must be done in a large sample that is broadly representative with respect to the decision if the results are to be extrapolated. Its advantage is that the crossover problem is circumvented by definition. Because the result of recommending Treatment A is being compared with that of recommending Treatment B, rather than the result of administering Treatment A with that of administering Treatment B, the data can be analyzed with respect to the initial therapeutic allocation (treatment assigned) without regard for actual therapeutic administration (treatment received). Thus, although it has been claimed that " . . . analysis by treatment assigned results in . . . reducing the clinical relevance of the findings" (14), Schwartz and Lellouch (12) conclude that this analysis " . . . is precisely that of interest in practice."
Computer Simulation of Crossover

Ferguson and coworkers (15) recently did a series of computer simulations to quantify the effect of crossover from one treatment arm to another during a hypothetical randomized clinical trial comparing the event-free survival of patients receiving medical and surgical treatment of coronary artery disease. The simulation was designed so that the two treatments were actually equally effective. First, a logistic prediction model was constructed from an actual data set to predict 62 cardiac events among 598 patients who had rest-exercise radionuclide angiography (based on age, gender, resting left ventricular ejection fraction, exercise duration, and maximum exercise-induced electrocardiographic ST-segment depression). Patients were randomly assigned to one of the two hypothetical treatment arms: "surgery" and "medicine." Each patient was then selected for crossover to the other arm based on a crossover index, defined as the product of a uniform random number and the probability of an event estimated from the logistic prediction model. The 150 (25%) patients with the highest crossover index among those initially randomized to the "medicine" arm were considered crossovers to the "surgery" arm, and the 60 (10%) patients with the lowest crossover index among

15 March 1993 • Annals of Internal Medicine • Volume 118 • Number 6


those initially assigned to the "surgery" arm were considered crossovers to the "medicine" arm. These crossover rates are similar to those reported in actual trials. Twenty independent simulations were done, resulting in 2463 events among 23 920 patients (598 patients per trial times 2 treatment groups times 20 trials). The pooled data were analyzed according to "treatment assigned" and "treatment received."

The analysis by "treatment received" is summarized in Table 1. There were 1666 (12.1%) events among the 13 760 patients considered to have "received" surgical treatment, and 797 (7.8%) events among the 10 160 patients considered to have "received" medical treatment. The odds ratio for analysis by "treatment received" was 1.62 (95% CI, 1.47 to 1.76), and the 4.3% absolute difference in event rate (equivalently considered a 36% lower event rate in the medical group) was significant (P < 0.0001). Because the study was designed so that no actual difference in event rate existed between the two groups, the conclusion that survival is better among medically treated patients is incorrect.

Table 1. Medicine Compared with Surgery: Analysis by Treatment Received

Variable       Medicine   Surgery     Total
Event, n            797      1666      2463
No event, n        9363    12 094    21 457
Total, n         10 160    13 760    23 920

The analysis by "treatment assigned" is summarized in Table 2. There were 1264 (10.6%) events among the 11 960 patients "assigned" to surgical treatment and 1199 (10.0%) events among the 11 960 patients "assigned" to medical treatment. In contrast to the preceding analysis by "treatment received," the odds ratio for these data was 1.06 (CI, 0.97 to 1.15), and the 0.6% absolute difference in event rate (equivalently considered a 6% lower event rate in the medical group) was not significant (P = 0.17).

Thus, crossover bias affects the interpretation of explanatory clinical trials designed to assess efficacy using analysis of patients in terms of the treatment they receive. Analysis in terms of the initial treatment assignment minimizes the effect of this bias in pragmatic trials designed to assess effectiveness.
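The mechanics of this simulation are easy to reproduce in a few lines. The sketch below is a simplified stand-in, not the published Ferguson model: per-patient event risks are drawn from a hypothetical Beta distribution with mean 0.10 rather than from the logistic prediction model, while the crossover rule (a uniform random number times the event probability, with the 150 highest-index "medicine" patients and the 60 lowest-index "surgery" patients crossing over) follows the description above. Even with these simplified assumptions, analysis by "treatment received" manufactures a spurious advantage for medicine, while analysis by "treatment assigned" does not; the helper at the end reproduces the odds ratios of Tables 1 and 2 from the published counts.

```python
import random

def simulate_trials(n_per_arm=598, n_trials=20, n_cross_med=150,
                    n_cross_surg=60, seed=1):
    """Simulate crossover bias in a two-arm trial in which the two
    treatments are, by construction, equally effective."""
    rng = random.Random(seed)
    assigned = {"medicine": [0, 0], "surgery": [0, 0]}   # [events, patients]
    received = {"medicine": [0, 0], "surgery": [0, 0]}
    for _ in range(n_trials):
        # Hypothetical per-patient event risk, mean ~0.10 (a stand-in
        # for the published logistic prediction model).
        patients = [{"arm": arm, "risk": rng.betavariate(2, 18)}
                    for arm in ("medicine", "surgery")
                    for _ in range(n_per_arm)]
        # Crossover index: uniform random number times event probability,
        # so sicker patients are more likely to cross over.
        for p in patients:
            p["index"] = rng.random() * p["risk"]
            p["received"] = p["arm"]
        med = sorted((p for p in patients if p["arm"] == "medicine"),
                     key=lambda p: p["index"], reverse=True)
        surg = sorted((p for p in patients if p["arm"] == "surgery"),
                      key=lambda p: p["index"])
        for p in med[:n_cross_med]:      # 150 (25%) highest cross to surgery
            p["received"] = "surgery"
        for p in surg[:n_cross_surg]:    # 60 (10%) lowest cross to medicine
            p["received"] = "medicine"
        # Outcome depends only on baseline risk: treatment has no effect.
        for p in patients:
            event = rng.random() < p["risk"]
            for tally, group in ((assigned, p["arm"]), (received, p["received"])):
                tally[group][0] += event
                tally[group][1] += 1
    return assigned, received

def odds_ratio(a, b, c, d):
    """Odds ratio for one group's events/no-events (a/b) versus another's (c/d)."""
    return (a / b) / (c / d)

assigned, received = simulate_trials()
for label, tally in (("assigned", assigned), ("received", received)):
    print(label, {arm: round(ev / n, 3) for arm, (ev, n) in tally.items()})

# The published counts in Tables 1 and 2 give the same picture:
print(round(odds_ratio(1666, 12094, 797, 9363), 2))    # received: 1.62
print(round(odds_ratio(1264, 10696, 1199, 10761), 2))  # assigned: 1.06
```

By treatment assigned, the two simulated arms show nearly identical event rates; by treatment received, the "surgery" arm accumulates the high-risk crossovers and shows the higher rate, just as in Table 1.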
Table 2. Medicine Compared with Surgery: Analysis by Treatment Assigned

Variable       Medicine   Surgery     Total
Event, n           1199      1264      2463
No event, n      10 761    10 696    21 457
Total, n         11 960    11 960    23 920

Statistical Significance versus Clinical Importance

P Values

Even when a technology assessment trial is properly designed with respect to efficacy and effectiveness, investigators tend to emphasize statistical significance more than clinical importance in their analysis (16) and

to thereby canonize what Feinstein (17) calls the "uriniferous" P value. Although this problem partly stems from a subtle confusion about what the P value really means (18, 19), much of it is due to the simple fact that the statistics of hypothesis testing (P values and the associated concept of "statistical significance") were originally designed for sample sizes around 30, not 30 000. Nevertheless, investigators (20) are now suggesting that some questions of clinical interest (for example, the Women's Health Project from the National Institutes of Health [NIH]) will require assessment trials involving as many as 140 000 patients. With trials of such size, a clinically inconsequential difference in outcome is readily elevated to the lofty level of statistical significance.

Assume that two antihypertensive drugs have been evaluated in a randomized trial comprising 5000 patients. The systolic blood pressure (mean ± SD) among the 2500 patients receiving Drug A is 139 ± 30 mm Hg compared with 141 ± 30 mm Hg among the 2500 patients receiving Drug B. The difference averages 2 mm Hg, with a standard error of 0.8 mm Hg and a 95% CI ranging from 0.3 to 3.7 mm Hg. Surprising as it seems, this small difference is statistically significant (t = 2.4; P < 0.02). Now, a 2-mm Hg reduction in blood pressure may have some measurable effect on outcome in a large sample of patients, but it is not likely to be clinically important to the individual patient.

Confidence Intervals

Confidence intervals have been suggested as a way to assess both statistical significance and clinical importance (21). To assess statistical significance, one computes the confidence interval for the observed difference (usually 95%) and determines if the lower bound of this interval overlaps zero, the value for no effect. In our antihypertensive trial, the lower bound is 0.3 mm Hg, meaning that the observed difference is "statistically significant" at a P value threshold of 0.05.
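The arithmetic of the hypothetical antihypertensive trial can be checked directly. The sketch below uses the normal (z) approximation for the critical values, which is appropriate at these sample sizes, and applies a 4-mm Hg "smallest clinically important difference," which is a subjective choice, not a published standard.

```python
import math

# Hypothetical antihypertensive trial: n = 2500 per arm, SD = 30 mm Hg
# in each arm, observed means 139 and 141 mm Hg.
n, sd = 2500, 30.0
diff = 141.0 - 139.0                 # 2 mm Hg
se = sd * math.sqrt(2.0 / n)         # standard error of the difference (~0.85)
t = diff / se                        # ~2.4, P < 0.02

def ci(diff, se, z):
    """Confidence interval for a difference using a normal critical value z."""
    return (diff - z * se, diff + z * se)

lo95, hi95 = ci(diff, se, 1.96)      # ~0.3 to 3.7 mm Hg
lo99, hi99 = ci(diff, se, 2.576)     # ~-0.2 to 4.2 mm Hg

print(f"SE = {se:.2f}, t = {t:.1f}")
print(f"95% CI: {lo95:.1f} to {hi95:.1f}")
print(f"99% CI: {lo99:.1f} to {hi99:.1f}")

# With a (subjectively chosen) smallest clinically important difference of
# 4 mm Hg, the 95% CI upper bound falls below the threshold (statistically
# significant but not clinically important), whereas the 99% CI upper bound
# exceeds it while its lower bound overlaps zero.
smallest_important = 4.0
print("95% CI reaches threshold:", hi95 >= smallest_important)
print("99% CI reaches threshold:", hi99 >= smallest_important)
```

The same 2-mm Hg difference thus flips between "significant but unimportant" and "important but not significant" purely as a function of the interval width chosen.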
To assess clinical importance, however, a judgment must be made about the "smallest clinically important difference" (21). If we decide the "smallest clinically important difference" in our antihypertensive trial is 4 mm Hg, then the observed "statistically significant" difference of 2 mm Hg is not "clinically important" because the 3.7-mm Hg upper bound of the confidence interval is less than the 4-mm Hg threshold level. However, the choice of what constitutes the "smallest clinically important difference" is often subjective; if someone chose 3 mm Hg rather than 4 mm Hg as the critical threshold level, the "statistically significant" difference would also be "clinically important." On the other hand, one might choose to retain the 4-mm Hg threshold but to use a wider confidence interval as the basis for comparison. If we use a 99% CI, the observed difference remains "clinically important" because the new 4.2-mm Hg upper bound is greater than our 4-mm Hg threshold, but it is no longer "statistically significant" because the new -0.2-mm Hg lower bound now overlaps zero. The wider confidence interval corresponds to a P-value threshold of 0.01 rather than 0.05, and the observed P value of 0.02 is greater than this



new threshold. Thus, as a consequence of the subjective choices we make in attempting to infer statistical significance and clinical importance, the same 2-mm Hg difference can be interpreted quite differently.

Post-Hoc Analyses

Similar statistical tests of inference are sometimes used to do multiple post-hoc comparisons in a technology assessment trial (22). This practice is usually motivated by the investigator-author's desire to generate new hypotheses that can serve as the subject of future investigations (to get the most out of the data), but its limitations are not widely appreciated by the eventual clinician-reader. It is altogether proper that investigators assess outcome responses in meaningful subsets of the total sample of study patients. However, clinicians must be careful in their interpretation of the results of such secondary analyses, especially when the analysis was done with respect to "treatment received" (23), when the outcome variable was not identified at the outset, or when the sample size of the trial is large. How often, for instance, do we read something like the following passage in the published report of a large clinical trial?

Although there was no difference between treated and untreated groups with respect to the primary outcome variable (mortality), treatment did result in a statistically significant improvement in one of the secondary outcome variables (symptoms) among older women.

What is the clinical relevance of a statement such as this? Rather than upsetting those involved in well-respected clinical trials such as TIMI (24) and TAMI (25) or GISSI (26) and BARI (27), we shall attempt to answer this question using a whimsical hypothetical example. DELI is a large, multicenter, randomized trial designed to compare the frequency of flatulence after eating baloney and cabbage. After years of study, the investigators observe that 55% of 7230 patients eating baloney had flatulence compared with 57% of 7474 patients eating cabbage (Table 3). A conventional chi-square test (28) including all 14 704 study patients supports the conclusion that baloney is better than cabbage, in terms of producing less flatulence (chi-square = 5.9; P = 0.015).

Table 3. Flatulence in the DELI

Primary Analysis
Presence of Flatulence   Baloney   Cabbage     Total
Yes                         3981      4265      8246
No                          3249      3209      6458
Total                       7230      7474    14 704

Secondary Analysis: Men
Presence of Flatulence   Baloney   Cabbage     Total
Yes                         1778      1066      2844
No                          2133      1422      3555
Total                       3911      2488      6399

Secondary Analysis: Women
Presence of Flatulence   Baloney   Cabbage     Total
Yes                         2203      3199      5402
No                          1116      1787      2903
Total                       3319      4986      8305

Before publishing these results, however, one of the investigators recalls that baloney has a higher fat content than cabbage (27.5% compared with 0.2%, according to published tables), and she suspects that fat intolerance and flatulence are both more common in women than in men. She therefore suggests doing a secondary analysis that is stratified by gender (see Table 3). As predicted, this stratified

analysis indeed shows that cabbage is better than baloney in the women (chi-square = 4.2; P = 0.040); 64% of the women eating cabbage had flatulence compared with 66% of the women eating baloney. Common sense and experience suggest that what goes up in one subgroup goes down in the other. Consequently, we now expect that baloney must be better than cabbage in the men. Contrary to this expectation, however, cabbage is better than baloney in the men also (chi-square = 4.1; P = 0.043); 43% of the men eating cabbage had flatulence compared with 45% of the men eating baloney. In other words, cabbage is better than baloney in each subset of a sample in which baloney is better than cabbage!

This counterintuitive observation, known as the Simpson paradox, is not caused by nonrandom allocation or statistical interaction. It occurs because the observed proportions (the frequencies of flatulence) are weighted averages rather than simple averages (the weights being the actual proportions of men and women) (29, 30). The algebra is a little convoluted, but the message is straightforward: "What is true for the parts may not be true for the whole" (31).

Sometimes a study hypothesis is supported only in one post-hoc subset of the total patient sample (for instance, "treatment helps older women"). If an imaginative investigator can imbue the conclusion with biological plausibility, he or she might be tempted to report this alone and not the complementary but equally supported implausible conclusion ("treatment harms younger women"). Journal editors, reviewers, and readers might never notice this asymmetry. Murphy clearly states the important limitation of such post-hoc analyses by noting that a

. . . responsible surmise before the fact (which then has to face the tribunal of empirical testing) must be squarely based on prior data and the state of the art.
A surmise put forward after the data have been inspected does not have to meet any such demand and may be little more than a rationalization of them. Furthermore, such a surmise (which is, in fact, an interpretation of what has already happened) leads almost inevitably to conceptual gerrymandering (32).

The principal purpose of including these secondary analyses in the published report should be to show the stability of the conclusion drawn from the primary analysis, to tell us how sure we are. When the primary study hypothesis is confirmed in a variety of secondary analyses (stratified by age, gender, and severity of disease), this naturally increases our confidence about the clinical importance of the primary conclusion. And the greater the variety, the greater our confidence.
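The DELI counts make the Simpson paradox easy to verify directly. The sketch below takes its numbers from Table 3; the chi-square helper applies the Yates continuity correction, which reproduces the values quoted in the text (a plain Pearson chi-square would give slightly larger values).

```python
def flatulence_rate(yes, no):
    """Proportion of patients with flatulence."""
    return yes / (yes + no)

def yates_chi_square(a, b, c, d):
    """Continuity-corrected chi-square for a 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    num = n * (abs(a * d - b * c) - n / 2) ** 2
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return num / den

# Subgroup counts from Table 3: (flatulence yes, flatulence no).
baloney_men,   cabbage_men   = (1778, 2133), (1066, 1422)
baloney_women, cabbage_women = (2203, 1116), (3199, 1787)

# Pooled counts are the sums of the subgroups.
baloney = (baloney_men[0] + baloney_women[0], baloney_men[1] + baloney_women[1])
cabbage = (cabbage_men[0] + cabbage_women[0], cabbage_men[1] + cabbage_women[1])

# Aggregate: baloney looks better (55% vs 57% flatulence) ...
print(round(flatulence_rate(*baloney), 2), round(flatulence_rate(*cabbage), 2))
print(round(yates_chi_square(*baloney, *cabbage), 1))             # 5.9

# ... yet cabbage is better in BOTH subgroups: the Simpson paradox.
assert flatulence_rate(*cabbage_men) < flatulence_rate(*baloney_men)
assert flatulence_rate(*cabbage_women) < flatulence_rate(*baloney_women)
print(round(yates_chi_square(*baloney_men, *cabbage_men), 1))     # 4.1
print(round(yates_chi_square(*baloney_women, *cabbage_women), 1)) # 4.2
```

The reversal arises because the subgroup proportions are combined as weighted averages: women (who eat more baloney here) have more flatulence overall, so pooling shifts the comparison.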



Objective versus Subjective Outcomes

Even when a technology assessment trial is properly designed and analyzed from the scientific perspective, it is frequently less well designed from the clinical perspective. According to 1991 Heart and Stroke Facts, a popular source booklet published by the American Heart Association, 511 050 of 6 080 000 Americans with coronary heart disease died in 1988 (33). Nowhere in this booklet, however, is there any mention of the plight of the survivors: their level of disability and quality of life. Similarly, much of technology assessment is based on physician-oriented "objective" outcomes such as physical signs and test responses, rather than patient-oriented "subjective" outcomes such as quality of life and well-being. Objective outcomes are certainly easier for investigators to measure, but subjective outcomes are arguably more relevant to individual patients.

Value

Unfortunately, subjective estimates of outcome often defy explicit characterization. Despite such inherent uncertainties, little doubt exists that therapies such as coronary bypass surgery and coronary angioplasty possess value, because they affect one's (subjective) sense of well-being as well as one's (objective) state of health. Pirsig (34) argues compellingly that this so-called "value" is neither objective nor subjective:

Any person of any philosophic persuasion who sits on a hot stove will verify without any intellectual argument whatsoever that he is in an undeniably low-quality situation: that the value of his predicament is negative. This low quality is not just a vague, woolly-headed, crypto-religious, metaphysical abstraction. It is an experience. It is not a judgment about an experience. It is not a description of experience. The value itself is an experience. As such it is completely predictable. It is verifiable by anyone who cares to do so. It is reproducible. . . .
Later the person may generate some oaths to describe this low value, but the value will always come first, the oaths second. Without the primary low valuation, the secondary oaths will not follow. . . . Our culture [nevertheless] teaches us to think it is the hot stove that directly causes the oaths. It teaches that the low values are a property of the person uttering the oaths. Not so. The value is between the stove and the oaths. Between the subject and the object lies the value. This value is more immediate . . . more real than the stove. Whether the stove is the cause of the low quality or whether possibly something else is the cause is not yet absolutely certain. But that the quality is low is absolutely certain. It is the primary empirical reality from which such things as stoves and heat and oaths and self are later intellectually constructed.

Value is often viewed as mystical mind-stuff: " . . . the one thing that can't be looked up" (35). As described here, however, value seems to meet all the criteria of a conventional scientific concept. The only question then is: How shall we measure it?

The Measurement of Value

Common sense suggests that well-being represents a composite of several different aspects (mental, physical,

social). The psychometric approach to the measurement of value (36), therefore, uses detailed questionnaires to separately quantify each of these seemingly independent dimensions. The Sickness Impact Profile is a 136-item questionnaire that measures 12 different dimensions: body care, mobility, ambulation, emotional, social, alertness, communication, sleep and rest, home, work, recreation, and eating (37). In its reductionistic statistical rigor, the psychometric approach appeals more to scientists than clinicians.

In contrast, the decision-theoretic approach (36) attempts to provide an overall summary measure that integrates these various dimensions and assigns weights to them according to each patient's subjective preferences. In essence, this approach asks a single, natural, clinical question: How do you feel (physically and emotionally)? In its holistic emphasis on the individual patient, the decision-theoretic approach appeals more to clinicians than to scientists.

Critics of the decision-theoretic approach argue that aggregate measures of well-being are inappropriate, because they attempt to add apples and oranges. Advocates reply that it is not the quality of each fruit that matters but the overall quality of the entire basket (37). A useful analogy can be made with the sense of olfaction. Humans readily recognize a complex odor such as almond, but we usually cannot identify its constituent primary odors (camphoraceous, floral, and pepperminty in the case of almond [38]), and we rarely can localize the spatial origin of the odor. Compared with vision, olfaction is a low-resolution system. If our sense of vision had the same resolution as our sense of smell, the sky would just "go all birdish" whenever a duck flew overhead (39). Our sense of value is similarly low resolution. We know how we feel, but we may not know why.

Each approach to assessing value best serves a particular purpose.
The decision-theoretic approach (like the pragmatic clinical trial) sniffs out the current state of well-being, without regard for its underlying causes. In contrast, the psychometric approach (like the explanatory clinical trial) espies the causes themselves.
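As a rough illustration of the contrast, the sketch below scores one hypothetical patient both ways: the psychometric approach reports each dimension separately, while the decision-theoretic approach collapses the dimensions into a single preference-weighted summary. The dimension names, scores, and weights are invented for illustration; they are not taken from the Sickness Impact Profile or any published instrument.

```python
# Hypothetical dimension scores (0 = worst, 1 = best) for one patient,
# and that patient's own preference weights (which sum to 1).
scores  = {"physical": 0.4, "emotional": 0.8, "social": 0.6}
weights = {"physical": 0.5, "emotional": 0.3, "social": 0.2}

# Psychometric approach: report each dimension separately.
for dim, s in scores.items():
    print(f"{dim}: {s:.1f}")

# Decision-theoretic approach: one preference-weighted summary value,
# answering the single clinical question "How do you feel?"
overall = sum(weights[d] * scores[d] for d in scores)
print(f"overall value: {overall:.2f}")   # 0.56
```

The summary hides which dimension drives the result (like the low-resolution sense of smell), while the profile reveals the causes but leaves the integration to the reader.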

Analysis of Groups versus Individuals

Even when appropriate attention is given to estimating values in a technology assessment trial, its analysis is confounded by a practical philosophic dilemma. Although much of our medical knowledge derives from the sophisticated statistical analysis of groups, most of our medical care is aimed at individual patients. But is what is good for a group necessarily good for each member of the group? Obviously not. On average, surgery is unquestionably lifesaving in the face of a ruptured appendix, but sometimes a patient dies during anesthetic induction. Patients have a remarkable range of preferences with respect to their willingness to trade short-term risk for long-term gain (40) or to trade the duration of survival for the quality of that survival (41). How then do we decide what is good?



Utilitarianism

A conventional answer to that question relies on the utilitarian "greatest-happiness" principle first expressed in 1729 by Francis Hutcheson (42):

In comparing . . . actions, . . . judge thus; that in equal degrees of happiness . . . the virtue is in proportion to the number of persons to whom the happiness shall extend; . . . and in equal numbers, the virtue is as the quantity of the happiness . . . so that, that action is best, which procures the greatest happiness for the greatest numbers. . . .

Fifty years later, Jeremy Bentham (43) developed an entire "moral calculus" around this principle:

Sum up all the values of all the pleasures on the one side, and those of all the pains on the other. The balance, if it be on the side of pleasure, will give the good tendency of the act upon the whole, with respect to the interests of that individual person; if on the side of pain, the bad tendency of it upon the whole. Take an account of the number of persons whose interests appear to be concerned; and repeat the above process with respect to each. Sum up the numbers expressive of the degrees of good tendency, which the act has, with respect to each individual, in regard to whom the tendency of it is good upon the whole: do this again with respect to each individual, in regard to whom the tendency of it is bad upon the whole. Take the balance, which, if on the side of pleasure, will give the general good tendency of the act, with respect to the total number or community of individuals concerned; if on the side of pain, the general evil tendency, with respect to the same community.

This utilitarian philosophy, soon championed by J. S. Mill, has had a major influence on the ethical and legislative structure of Anglo-American society. The Health Care Financing Administration prides itself on the fact that it . . .
has devised an approach to the assessment and assurance of the quality of care [provided to Medicare beneficiaries] in which the principal criterion to be used is the utilitarian one . . . that is, whether the pattern of practice systematically benefits or harms patients (44).

And more recently, NIH director Bernadine Healy proposed a strategic plan designed to manage NIH-sponsored research " . . . in a way that will bring the greatest good to the greatest number" (45). Unfortunately, the philosophy of utilitarianism is internally inconsistent, because the literal implementation of its guiding principle (the greatest happiness for the greatest numbers) requires us to maximize two independent functions at once (a mathematical impossibility). As we shall see, the inconsistency of utilitarianism causes some troubling paradoxes with direct relevance to medical decision making.

Decision Analysis

When faced with a difficult decision under conditions of uncertainty, decision analysts advise us to organize our options into some logical structure (by constructing a decision tree), to enumerate all possible outcomes (by

calculating the expected values), and to select the optimal option (by choosing the one with the maximum expected value) (46).

Consider two mutually exclusive diseases, each of which is universally fatal if left untreated. Suppose your patient has a 60% chance of having Disease A and therefore a 40% chance of having Disease B. Suppose further that two mutually exclusive treatment options exist (patients given one treatment option forever forego the alternative). Treatment 1 will cure 50% of patients with Disease A but will not affect Disease B. Treatment 2 will cure 80% of patients with Disease B but will not affect Disease A. Which treatment should you choose?

To identify the optimal choice according to decision theory (46), we must calculate the expected value of each treatment option as the summed product of the probabilities (p) and utilities (u, here expressed in terms of "cure rate") for each possible disease state (A and B):

    Expected Value = [p(A) • u(A)] + [p(B) • u(B)]

Thus,

    Expected Value of Treatment 1 = (0.6 • 0.5) + (0.4 • 0) = 0.30
    Expected Value of Treatment 2 = (0.6 • 0) + (0.4 • 0.8) = 0.32

Here, Treatment 2 is preferred to Treatment 1 because it provides the greater expected value (0.32 compared with 0.30). But sometimes it is not so easy. Suppose again that there are two mutually exclusive treatment options. If you prescribe Treatment 1, 90% of the patients will improve by an average of 60%, and the remaining 10% will be unaffected. If you prescribe Treatment 2, 60% of the patients will improve by an average of 90%, and the remaining 40% will be unaffected.
According to decision theory, the two treatments are equivalent:

    Expected Value of Treatment 1 = (0.9 • 0.6) + (0.1 • 0) = 0.54
    Expected Value of Treatment 2 = (0.6 • 0.9) + (0.4 • 0) = 0.54

Based on the Hutcheson-Bentham-Mill utilitarian principle, however, Treatment 2 is preferred to Treatment 1 because it provides the "greatest happiness" (90% compared with 60% average improvement among those responding to treatment); whereas Treatment 1 is preferred to Treatment 2 because it serves the "greatest numbers" (90% compared with 60% of the patients having some degree of improvement). Philosophy may be ignored but not escaped; and those who most ignore it escape the least. Hence, we can escape this paradox only by recognizing that the optimal choice depends on the philosophic perspective of the decision maker. For example, a health care planner might prefer the treatment that produces the greatest average benefit (without regard for the number of patients who benefit), whereas a health care provider might prefer the treatment that benefits the greatest number of patients (without regard for the magnitude of benefit). The former epidemiologic perspective emphasizes the average benefit among treated patients (thereby favoring Treatment 2), whereas the latter clinical perspective emphasizes the likelihood that a patient in the group will derive benefit (thereby favoring Treatment 1).
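The two numerical examples above can be verified with a short script. This is a minimal sketch (the function name `expected_value` is ours, not from the source) of the probability-weighted sum that decision theory prescribes:

```python
def expected_value(probabilities, utilities):
    """Expected value: the sum over outcomes of probability times utility."""
    return sum(p * u for p, u in zip(probabilities, utilities))

# First example: Disease A (p = 0.6) versus Disease B (p = 0.4).
ev1 = expected_value([0.6, 0.4], [0.5, 0.0])  # Treatment 1 cures 50% of Disease A
ev2 = expected_value([0.6, 0.4], [0.0, 0.8])  # Treatment 2 cures 80% of Disease B
print(ev1, ev2)  # 0.30 versus 0.32: Treatment 2 wins on expected value

# Second example: proportion responding times average improvement among responders.
ev3 = expected_value([0.9, 0.1], [0.6, 0.0])  # Treatment 1: 90% improve by 60%
ev4 = expected_value([0.6, 0.4], [0.9, 0.0])  # Treatment 2: 60% improve by 90%
print(ev3, ev4)  # both 0.54: decision theory calls the treatments equivalent
```

The equality of the second pair is exactly what makes the utilitarian paradox bite: expected value alone cannot distinguish "greatest happiness" from "greatest numbers."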

15 March 1993 • Annals of Internal Medicine • Volume 118 • Number 6


Figure 1. Frequency distribution of benefit for each arm of the hypothetical DELI-2 trial (baloney compared with cabbage). The area under each distribution represents 1000 patients. The shaded areas represent the raw distributions, and the curves represent the fitted theoretical (beta) distributions. The gray bars represent patients who derived positive benefit, and the black bars represent patients who derived negative benefit. Note that the mean of the distribution on the left (0.340) is higher than that of the distribution on the right (0.240), whereas the gray region (representing positive benefit) for the distribution on the right (0.907) is larger than the corresponding gray region for the distribution on the left (0.817).

Clinical Trial Analysis

Let us see how these different perspectives affect the analysis and interpretation of a clinical trial. DELI-2 is a hypothetical, randomized clinical trial designed to determine the outcome of 1000 persons after ingestion of baloney compared with 1000 persons after ingestion of cabbage. Each person was asked to express outcome (the "benefit" associated with ingestion) along an arbitrary analogue scale (from -1 representing the worst imaginable outcome, through 0 representing a neutral outcome, to +1 representing the best imaginable outcome). Figure 1 illustrates the distribution of benefit in each of the two groups. If we take an epidemiologic perspective and focus on the group, baloney is better than cabbage because average benefit is greater among baloney eaters than among cabbage eaters (z = 8.4, P < 0.0001 using a nonparametric Mann-Whitney U test, which makes no assumptions regarding the shape of the underlying distributions; a conventional t-test [47, 48] produces the same result).
However, if we take a clinical perspective and focus on the individual patients, cabbage is better than baloney because more cabbage eaters reported positive benefit than did baloney eaters (chi-square = 33.3, P < 0.0001 based on Table 4). Thus, two conventional statistical analyses of these data produce contradictory conclusions. What is best for the group is not necessarily what is best for the individual (and we have not even addressed the clinical importance of the observed differences). Paradoxical conclusions like these can occur in real clinical trials designed to assess left ventricular function (where the epidemiologic perspective might be represented by the mean change in ejection fraction and the clinical perspective by the proportion of patients with an improvement in ejection fraction), physical performance (mean change in exercise duration versus the proportion of patients with improved exercise duration), or a variety of other outcomes. This perhaps explains why physicians do not make the same decision for an individual patient as for a group of comparable patients (49).

Table 4. Positive and Negative Benefits in DELI-2*

Benefit      Baloney    Cabbage    Total
Positive       817        907       1724
Negative       183         93        276
Total         1000       1000       2000

* "Positive benefit" is represented by the gray area of the distribution of benefit to the right of the midpoint of the x-axis in Figure 1, and "negative benefit" is represented by the black area to the left of the midpoint.

Health Care Policy Implications

Appropriateness

The implicit long-range goal of technology assessment is to improve the quality of health care while simultaneously controlling its costs: by defining what we should do, rather than what we can do. Few would disagree that medical technology should be used only when appropriate. But is the concept of "appropriateness" definable in some a priori scientific way? Or must we satisfy ourselves instead with arbitrary ad-hoc tabulations like the 488 "indications" for appropriate performance of coronary artery bypass surgery identified in the RAND-UCLA Health Services Utilization Study (50)? Some advocate "cost-effectiveness" as an optimal index of appropriateness. But are patients or physicians really likely to identify the technology with the highest cost-effectiveness as that having the greatest personal value? We think not. Patients and physicians care about effectiveness; administrators and planners
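The chi-square value cited for Table 4 can be recovered directly from the four counts. The sketch below is ours, and it assumes the Yates continuity correction (which reproduces the published statistic of 33.3; the uncorrected value is about 34.0):

```python
def yates_chi_square(table):
    """Chi-square for a 2x2 contingency table with Yates continuity correction."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    chi2 = 0.0
    for i, observed in enumerate([a, b, c, d]):
        # Expected count under independence: (row total * column total) / n.
        expected = row_totals[i // 2] * col_totals[i % 2] / n
        chi2 += (abs(observed - expected) - 0.5) ** 2 / expected
    return chi2

# Rows: positive/negative benefit; columns: baloney, cabbage (Table 4).
deli2 = [[817, 907], [183, 93]]
print(round(yates_chi_square(deli2), 1))  # 33.3
```

Note what this test does and does not use: it sees only the sign of each patient's benefit, discarding its magnitude, which is precisely why it can disagree with the rank-based comparison of means.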




care about costs. Who really cares about cost-effectiveness? These difficulties notwithstanding, determinations of appropriateness are commonly used to regulate or guide the use of our technology. Regulatory controls currently represent the centerpiece of national health policy (51). The arbitrary blanket elimination of payment for physician interpretation of electrocardiograms for Medicare beneficiaries, mandated by the U.S. Congress through the Omnibus Budget Reconciliation Act of 1990, is a particularly onerous example. Other types of regulatory control include preapproval, charge limitation, mandatory assignment of payment, and mandatory second opinions. In general, such controls are relatively difficult to initiate and administer but rank high in behavioral effectiveness (their ability to influence physicians' choices). In response to these regulatory mandates, many professional organizations such as the American College of Cardiology, the American Heart Association, and the American College of Physicians have introduced various voluntary controls in the form of "practice guidelines" (52). These too are an effective means of modifying physician behavior (53) and are much easier to initiate. But, whereas behavioral modification is necessary to resolve the current health care crisis, it is not by itself sufficient. We can no more infer clinical effectiveness from behavioral effectiveness than we can infer clinical significance from statistical significance.

Incentives

One way to shift the focus toward clinical effectiveness is to introduce suitable reimbursement incentives. It is well known that diagnostic tests have maximal value only when the probability of disease is near 0.5 and that they have very little diagnostic value when the probability is near 0 or 1 (54, 55). Thus, if reimbursement for a relatively effective but expensive test were scaled as a function of disease probability, physicians could be rewarded for more appropriate use (in patients with an intermediate disease probability) and penalized for less appropriate use (in patients with high or low probabilities). Assume that the conventional reimbursement for the electrocardiographic stress test is $200. Under the incentive system summarized in Table 5, the reimbursement would be $300 (200 x 1.5) for a patient with atypical angina (in whom the test carries substantial utility) (56) compared with $57 (200 x 0.285) for an asymptomatic patient (in whom the test has very little utility) (57). Little doubt exists that a reimbursement plan like this would influence test use. Its practicality is another matter.

Table 5. Reimbursement for Diagnostic Testing Based on Disease Probability

Symptom Class        CAD Probability*    Proportional Reimbursement† [6 • p(1 - p)]
Asymptomatic              0.05                 0.285
Atypical angina           0.50                 1.500
Typical angina            0.90                 0.540

* CAD = coronary artery disease.
† Proportional reimbursement (r) is a function of CAD probability (p) and a reward factor (k): r = kp(1 - p). If k = 6, as in this example, the physician can realize a 50% reward for optimal use, whereas total reimbursement remains unchanged (assuming the CAD probability for all referrals is broadly distributed [55]). If k = 5.4, total reimbursements could be reduced by 10% while still providing a 35% reward as an incentive for optimal use.

Laskey has suggested that clear advantages in survival or quality of life must exist for a therapy to be deemed appropriate (58). If so, reimbursement might be tied directly to such survival estimates. The estimates in Table 6 derive from a statistical prediction model with respect to the medical and surgical treatment of coronary artery disease based on data from the medical literature (59-61). The model is similar to that used by the American College of Cardiology/American Heart Association Subcommittee on Coronary Artery Bypass Graft Surgery in its development of guidelines for use of this procedure (62). Each of these survival estimates is adjusted with respect to the quality of life associated with angina pectoris, using a conventional time-tradeoff method (63, 64), and the expected benefit associated with surgical treatment is expressed as the proportional improvement in quality-adjusted survival (59, 60).

Table 6. Reimbursement for Therapy Based on Expected Benefit

            Probability of Ischemia*
Age, y      0.1            0.5            0.9
40          0.003 (1.5)    0.199 (6.4)    0.340 (7.7)
60          0.017 (1.7)    0.452 (6.1)    0.675 (7.6)
80          0.075 (1.3)    0.834 (3.7)    0.913 (4.1)

* Numbers in parentheses represent the gain in quality-adjusted survival after coronary artery bypass surgery for a patient of the indicated age with triple-vessel coronary artery disease and depressed left ventricular function (60, 63). Decimal values represent the expected benefit associated with that gain in survival (59). The probability of ischemia is estimated from the exercise stress test response (61). See text for discussion.

Reimbursement can be scaled as a function of expected benefit. For example, if p is the average expected benefit in a group of candidates for coronary artery bypass surgery (based on our prediction model), m is the prevailing monetary reimbursement for surgery, and b is the expected benefit for a particular patient (again based on our prediction model), then the adjusted monetary reimbursement m* is given by bm/p. Assume that m is $30 000 and p is 0.5.
By reference to the estimated benefits in Table 6, the adjusted reimbursement would be $180 for (relatively) inappropriate bypass surgery in a 40-year-old with a 10% probability of ischemia (b = 0.003, m* = 0.003 • 30 000/0.5) but would be $54 780 for (relatively) appropriate bypass surgery in an 80-year-old with a 90% probability of ischemia (b = 0.913, m* = 0.913 • 30 000/0.5). This reimbursement strategy is fiscally neutral: the total adjusted reimbursements will be the same as the total unadjusted reimbursements so long as the values of p and m are updated at regular intervals. In the real world, the prediction model used must reliably rank patients in order of decreasing survival, and the quality adjustments to these survival estimates must derive from individual patient preferences, not population averages. But if practical implementations of such Utopian strategies can be devised (65), reimbursement incentives like this could serve as an effective way to encourage more appropriate health care use while simultaneously containing costs.

Some say the process of clinical decision making is an art rather than a science, and health care policy, whether by guideline or regulation, is anathema to this process. Consequently, individual patients and physicians must remain free to make their choices without restriction. What is best for the group (the epidemiologic perspective) is not necessarily what is best for the individual patient (the clinical perspective). On the other hand, a precise quantitative relationship exists between the two perspectives. Individual clinical decisions can vary widely, but the aggregate average of all these individual decisions should be equivalent to any policy formally based on the axioms of decision theory; otherwise, at least some of these decisions will be inconsistent and irrational (because they fail to optimize utility). An optimal health care policy for the group is the rational starting point for an optimal health care decision for the individual patient.

Neither voluntary guidelines nor regulatory controls can succeed in isolation. The success of free-market capitalism reflects a balance between the voluntary local controls typical of a laissez-faire economy and the centralized regulation characteristic of a socialistic economy. Just as the individual consumer is the fulcrum of a balanced capitalistic economy, the individual patient must be the fulcrum of a balanced health care system. Accordingly, we believe that greater attention should be paid to developing and disseminating a variety of properly validated clinical decision aids (to help physicians better assess the individual patient's disease probability, prognosis, therapeutic preferences, and expected benefit). These decision aids can improve clinical decision making 1) by placing the limited personal experience of the individual clinician into broader perspective through comparison to a larger repository of clinical and financial information; 2) by making explicit the hidden assumptions that are implied by the decision; and 3) by alerting the decision maker whenever the decision is inconsistent with these assumptions, with the available information, or with the conventional rules of logic. The impact of these decision aids should be monitored closely and the information shared with all interested parties (patients, physicians, hospital administrators, and third-party payors). Ongoing feedback of the costs, use patterns, and health care outcomes to the individual patient and physician decision makers can yield additional benefits (66).

Our emphasis on the individual patient heralds a major paradigm shift. In a recent Time magazine poll, 91% of those surveyed agreed that "our health care system needs fundamental change," and 70% said they would be "willing to pay higher taxes" for some form of universal health insurance (1). Because voluntary efforts at controlling health care costs are likely to be only partially successful (67, 68), a variety of expedient, bureaucratic solutions might be mandated by political compromise, legislative action, and regulatory fiat. Physicians must do all they can to assure that both the art and science of medicine survive this partisan political process.

Requests for Reprints: George A. Diamond, M.D., Division of Cardiology, Cedars-Sinai Medical Center, 8700 Beverly Boulevard, Los Angeles, CA 90048.

Current Author Addresses: Drs. Diamond and Denton: Division of Cardiology, Cedars-Sinai Medical Center, 8700 Beverly Boulevard, Los Angeles, CA 90048.

References

1. Castro J. Condition: critical. Time. 1991;138(21):34-42.
2. Iglehart JK. Opinion polls on health care. N Engl J Med. 1984;310:1616-20.
3. Blendon RJ, Altman DE. Public attitudes about health-care costs. A lesson in national schizophrenia. N Engl J Med. 1984;311:613-6.
4. Cochrane AL. Effectiveness and Efficiency. Random Reflections on Health Services. London: Burgess and Son Ltd; 1972.
5. Brook RH, Williams KN, Avery AD. Quality assurance today and tomorrow: forecast for the future. Ann Intern Med. 1976;85:809-17.
6. Dans PE. Issues along the Potomac: "efficacy" and "technology transfer." South Med J. 1977;70:1225-31.
7. Murphy ML, Hultgren HN, Detre K, Thomsen J, Takaro T, Participants of the Veterans Administration Cooperative Study. Treatment of chronic stable angina. A preliminary report of survival data of the randomized Veterans Administration cooperative study. N Engl J Med. 1977;297:621-7.
8. The Veterans Administration Coronary Artery Bypass Surgery Cooperative Study Group. Eleven-year survival in the Veterans Administration randomized trial of coronary bypass surgery for stable angina. N Engl J Med. 1984;311:1333-9.
9. European Coronary Surgery Study Group. Long-term results of prospective randomized study of coronary artery bypass surgery in stable angina pectoris. Lancet. 1982;2:1173-80.
10. Coronary Artery Surgery Study (CASS): a randomized trial of coronary artery bypass surgery. Survival data. Circulation. 1983;68:939-50.
11. Lamas GA, Pfeffer MA, Hamm P, Wertheimer J, Rouleau JL, Braunwald E. Do the results of randomized clinical trials of cardiovascular drugs influence medical practice? N Engl J Med. 1992;327:241-7.
12. Schwartz D, Lellouch J. Explanatory and pragmatic attitudes in therapeutical trials. J Chronic Dis. 1967;20:637-48.
13. Feinstein AR. Clinical Epidemiology. The Architecture of Clinical Research. Philadelphia: WB Saunders; 1985:690-3.
14. Rahimtoola SH. Some unexpected lessons from large multicenter randomized clinical trials. Circulation. 1985;72:449-55.
15. Ferguson JG, Diamond GA, Swan HJC, Pollock BH, Work JW. Cross and double cross: should clinical trials be analyzed by the principle of intention to treat? J Am Coll Cardiol. 1988;11:127A.
16. Pocock SJ, Hughes MD, Lee RJ. Statistical problems in the reporting of clinical trials. A survey of three medical journals. N Engl J Med. 1987;317:426-32.
17. Feinstein AR. Tempest in a p-pot? Hypertension. 1985;7:313-8.
18. Diamond GA, Forrester JS. Clinical trials and statistical verdicts: probable grounds for appeal. Ann Intern Med. 1983;98:385-94.
19. Diamond GA, Spodick DH. CASSandRa. Am J Cardiol. 1987;59:698-701.
20. Palca J. NIH unveils plan for Women's Health Project. Science. 1991;254:792.
21. Braitman LE. Confidence intervals assess both clinical significance and statistical significance. Ann Intern Med. 1991;114:515-7.
22. Smith DG, Clemens J, Crede W, Harvey M, Gracely EJ. Impact of multiple comparisons in randomized clinical trials. Am J Med. 1987;83:545-50.
23. Lee YJ, Ellenberg JH, Hirtz DG, Nelson KB. Analysis of clinical trials by treatment actually received: is it really an option? Stat Med. 1991;10:1595-605.
24. The Thrombolysis in Myocardial Infarction (TIMI) trial. Phase I findings. TIMI Study Group. N Engl J Med. 1985;312:932-6.
25. Topol EJ, Califf RM, Kereiakes DJ, George BS. Thrombolysis and Angioplasty in Myocardial Infarction (TAMI) trial. J Am Coll Cardiol. 1987;10:65B-74B.
26. Gruppo Italiano per lo Studio della Streptochinasi nell'Infarto Miocardico (GISSI). Effectiveness of intravenous thrombolytic treatment in acute myocardial infarction. Lancet. 1986;1:397-401.
27. BARI [Bypass Angioplasty Revascularization Investigation], CABRI [Coronary Artery Bypass Revascularization Investigation], EAST [Emory Angioplasty Surgery Trial], GABI [German Angioplasty Bypass Investigation], and RITA [Randomized Intervention Treatment of Angina]: coronary angioplasty on trial [Editorial]. Lancet. 1990;335:1315-6.
28. Feinstein AR. Clinical Epidemiology. The Architecture of Clinical Research. Philadelphia: WB Saunders; 1985:148-51.
29. Simpson EH. The interpretation of interaction in contingency tables. J R Stat Soc Series B. 1951;13:238-41.
30. Hand DJ. Psychiatric examples of Simpson's paradox. Br J Psychiatry. 1979;135:90-1.
31. Wonnacott RJ, Wonnacott TH. Introductory Statistics. 4th ed. New York: John Wiley; 1985:17.
32. Murphy EA. A Companion to Medical Statistics. Baltimore: Johns Hopkins University Press; 1985:139-40.
33. 1991 Heart and Stroke Facts. Dallas: American Heart Association; 1991.
34. Pirsig RM. Lila. An Inquiry into Morals. New York: Bantam Books; 1991:66.
35. Powers R. The Gold Bug Variations. New York: William Morrow; 1991:30.
36. Kaplan RM. Health-related quality of life in cardiovascular disease. J Consult Clin Psychol. 1988;56:382-92.
37. Bush JW. Relative preferences versus relative frequencies in health-related quality of life evaluations. In: Wenger NK, Mattson ME, Furberg CD, Elinson J, eds. Assessment of Quality of Life in Clinical Trials of Cardiovascular Therapies. New York: LeJacq; 1984:118-39.
38. Amoore JE, Johnston JW Jr, Rubin M. The stereochemical theory of odor. Sci Am. 1964;210(2):42-9.
39. Dennett DC. Consciousness Explained. Boston: Little, Brown and Company; 1991:46.
40. McNeil BJ, Pauker SG, Sox HC Jr, Tversky A. On the elicitation of preferences for alternative therapies. N Engl J Med. 1982;306:1259-62.
41. McNeil BJ, Weichselbaum R, Pauker SG. Speech and survival: tradeoffs between quality and quantity of life in laryngeal cancer. N Engl J Med. 1981;305:982-7.
42. Hutcheson F. An Inquiry into the Origin of our Ideas of Beauty and Virtue; in Two Treatises. I. Concerning Beauty, Order, Harmony, Design. II. Concerning Moral Good and Evil. 3rd ed. London: Knapton J and J, Darby J, Bettesworth A, Fayram F, Pemberton J, Osborn J, Longman T, Rivington C, Clay F, Batley J, Ward A; 1729:179-80.
43. Bentham J. An Introduction to the Principles of Morals and Legislation. Oxford: Clarendon Press (reprint of "A New Edition, corrected by the author," 1823; original publication, 1789); 1907:31.
44. Krakauer H, Bailey RC. Epidemiological oversight of the medical care provided to Medicare beneficiaries. Stat Med. 1991;10:521-40.
45. Palca J. Scientists take one last swing. Science. 1992;257:20-1.
46. Weinstein MC, Fineberg HV, Elstein AS, Frazier HS, Neuhauser D, Neutra RR, et al. Clinical Decision Analysis. Philadelphia: WB Saunders; 1980:189-91.
47. Feinstein AR. Clinical Epidemiology. The Architecture of Clinical Research. Philadelphia: WB Saunders; 1985:114.
48. Glass GV, Peckham PD, Sanders JR. Consequences of failure to meet assumptions underlying the fixed effects analysis of variance and covariance. Rev Educ Res. 1972;42:237-88.
49. Redelmeier DA, Tversky A. Discrepancy between medical decisions for individual patients and for groups. N Engl J Med. 1990;322:1162-4.
50. Winslow CM, Kosecoff JB, Chassin M, Kanouse DE, Brook RH. The appropriateness of performing coronary artery bypass surgery. JAMA. 1988;260:505-9.
51. Starr P. The Social Transformation of American Medicine. New York: Basic Books; 1982:379-419.
52. ACC/AHA Subcommittee on Coronary Artery Bypass Graft Surgery. Guidelines and indications for coronary artery bypass graft surgery. J Am Coll Cardiol. 1991;17:543-89.
53. Clancy CM, Cebul RD, Williams SV. Guiding individual decisions: a randomized, controlled trial of decision analysis. Am J Med. 1988;84:283-8.
54. Diamond GA, Hirsch M, Forrester JS, Vas R, Staniloff H, Halpern S, et al. Application of information theory to clinical diagnostic testing: the electrocardiographic stress test. Circulation. 1981;63:915-21.
55. Gitler B, Fishbach M, Steingart RM. Use of electrocardiographic-thallium exercise testing in clinical practice. J Am Coll Cardiol. 1984;3:262-71.
56. Doubilet P, McNeil BJ, Weinstein MC. The decision concerning coronary angiography in patients with chest pain. A cost-effectiveness analysis. Med Decis Making. 1985;5:293-309.
57. Sox HC Jr, Littenberg B, Garber AM. The role of exercise testing in screening for coronary artery disease. Ann Intern Med. 1989;110:456-69.
58. Laskey WK. Gender differences in the management of coronary artery disease: bias or good clinical judgment? Ann Intern Med. 1992;116:869-71.
59. Barnoon S, Wolfe H. Measuring the Effectiveness of Medical Decisions. An Operations Research Approach. Springfield, Illinois: Charles C Thomas; 1972:74-109.
60. Wong JB, Sonnenberg FA, Salem DN, Pauker SG. Myocardial revascularization for chronic stable angina: analysis of the role of percutaneous transluminal coronary angioplasty based on data available in 1989. Ann Intern Med. 1990;113:852-71.
61. Diamond GA, Staniloff HM, Forrester JS, Pollock BH, Swan HJC. Computer-assisted diagnosis in the noninvasive evaluation of patients with suspected coronary artery disease. J Am Coll Cardiol. 1983;1:444-55.
62. Fisch C, Beller GA, DeSanctis RW, Dodge HT, Kennedy JW, Reeves JT, et al. ACC/AHA guidelines and indications for coronary artery bypass graft surgery. A report of the American College of Cardiology/American Heart Association Task Force on Assessment of Diagnostic and Therapeutic Cardiovascular Procedures (Subcommittee on Coronary Artery Bypass Graft Surgery). Circulation. 1991;83:1125-73.
63. Pliskin JS, Stason WB, Weinstein MC, Johnson RA, Cohn PF, McEnany MT, et al. Coronary artery bypass graft surgery: clinical decision making and cost-effectiveness analysis. Med Decis Making. 1981;1:10-28.
64. Miyamoto JM, Eraker SA. Parameter estimates for a QALY utility model. Med Decis Making. 1985;5:191-213.
65. Diamond GA, Denton TA, Forrester JS, Matloff JM. Fee-for-benefit: an ethical strategy to improve health care outcome and control costs through rational reimbursement incentives. Med Decis Making. 1992;12:347.
66. Tierney WM, Miller ME, McDonald CJ. The effect on test ordering of informing physicians of the charges for outpatient diagnostic tests. N Engl J Med. 1990;322:1499-504.
67. Lomas J, Anderson GM, Domnick-Pierre K, Vayda E, Enkin MW, Hannah WJ. Do practice guidelines guide practice? The effect of a consensus statement on the practice of physicians. N Engl J Med. 1989;321:1306-11.
68. Epstein AM. Changing physician behavior. Increasing challenges for the 1990s. Arch Intern Med. 1991;151:2147-9.

