International Journal for Quality in Health Care 2001; Volume 13, Number 3: pp. 187–196
Validating risk-adjusted surgical outcomes: chart review of process of care

JAMES GIBBS1,2, KEVIN CLARK1, SHUKRI KHURI3,4,5, WILLIAM HENDERSON1,2, KWAN HUR1 AND JENNIFER DALEY4,6

1Cooperative Studies Program Coordinating Center, The Edward Hines Jr VA Hospital, Hines, IL, USA, 2Institute for Health Services Research and Policy Studies, Northwestern University, Evanston, IL, 3Boston VA New England Health Care System, West Roxbury, MA, 4Harvard Medical School, Boston, MA, 5Brigham and Women's Hospital, Boston, MA, and 6Institute for Health Policy, Massachusetts General Hospital/Partners Health Care System, Boston, MA, USA
Abstract

Objective. The primary purpose of this study was to validate risk-adjusted surgical outcomes as indicators of the quality of surgical care at US Department of Veterans Affairs (VA) hospitals. The secondary purpose was to validate the risk-adjustment models for screening cases for quality review.

Design. We compared quality of care, determined by structured implicit chart review, for patients from hospitals with higher and lower than expected operative mortality and morbidity (hospital-level tests) and between patients with high and low predicted risk of mortality and morbidity who died or developed complications (patient-level tests).

Subjects. 739 general, peripheral vascular and orthopedic surgery cases sampled from the 44 VA hospitals participating in the National VA Surgical Risk Study.

Main outcome measures. A global rating of quality of care based on chart review.

Results. Ratings of overall quality of care did not differ significantly between patients from hospitals with higher and lower than expected mortality and morbidity. On some of the secondary measures, patient care was rated higher for hospitals with lower than expected operative mortality. At the patient level of analysis, those who died or developed complications and had a high predicted risk of mortality or morbidity were rated higher on quality of care than those with a low predicted risk of adverse outcome.

Conclusions. The absence of a relationship between most of our measures of process of care and risk-adjusted outcomes may be due to an insensitivity of chart reviews to hospital-level differences in quality of care. Site visits to National VA Surgical Risk Study hospitals with high and low risk-adjusted mortality and morbidity have detected differences on a number of dimensions of quality. The patient-level findings suggest that the risk-adjustment models are useful for screening adverse outcome cases for quality of care review.

Keywords: chart review, outcomes, quality of care, surgery
The primary purpose of this study was to validate risk-adjusted surgical outcomes as indicators of the quality of surgical care at VA hospitals (hospital-level validation). The secondary purpose was to validate the risk-adjustment models for screening cases for quality review (patient-level validation). The National VA Surgical Risk Study (NVASRS) was undertaken to develop risk models that were sufficiently adjusted for variation in patient risk factors, so that operative mortality and morbidity could be used for the comparative assessment of the quality of surgical care [1–4]. The assumption in developing risk-adjustment models has been that residual variation in risk-adjusted outcomes may be attributed to the quality of care. Persistent variation in risk-adjusted outcomes, however, may indicate patient-specific risk factors not included in the models, other factors not accounted for, or simply random variation. It is therefore necessary to validate that variation in outcomes after risk adjustment is attributable to process of care [5,6]. A site visit study [7] found differences on a number of dimensions of quality between hospitals identified by the NVASRS as outliers on risk-adjusted operative mortality and morbidity. This study used chart review measures to compare patient care between outlier hospitals.
Address reprint requests to J. Gibbs, Cooperative Studies Program Coordinating Center (151K), The Edward Hines Jr VA Hospital, Hines, IL 60141-5151, USA. E-mail: [email protected]
© 2001 International Society for Quality in Health Care and Oxford University Press
Differences between outlier hospitals on chart-based measures may further validate the risk-adjusted outcomes as indicators of quality of care.

Several chart review studies have attempted to show an association between hospital outlier status for risk-adjusted mortality and processes of care. Hannan et al. [8] compared quality of care for coronary artery bypass graft (CABG) surgery patients at high and low outlier hospitals as determined from risk models developed from prospectively collected data in New York State. Chart reviews were conducted in hospitals with higher and lower than expected risk-adjusted mortality rates, and a higher rate of quality of care problems was reported for the high outlier hospitals. Dubois et al. [9] evaluated the care received by patients with three high mortality conditions (pneumonia, stroke and acute myocardial infarction) at outlier hospitals determined from models based on hospital discharge data. Explicit chart review, based on a priori fixed criteria, revealed no differences in quality of care between high and low outlier hospitals. However, implicit chart review, which relies on the judgment of experts, indicated that deceased stroke and pneumonia patients had a 5% incidence of preventable deaths in high mortality outlier hospitals compared with 1% in low mortality outlier hospitals. The authors concluded that high outlier hospitals cared for sicker patients and may also have provided poorer care. Best and Cowper [10] used implicit chart review to compare quality of care of non-surgical patients from hospitals with high and low observed-to-expected (O/E) mortality, determined from risk models based on discharge data. They found no difference between patients from high and low outlier hospitals on ratings of preventability of death and concluded that the failure of the risk models to predict differences in care may be partially due to inaccuracy of diagnostic coding and the absence of pertinent risk data in discharge records.
Methods

Overview of the NVASRS

The NVASRS [1–4] is one of the largest and most comprehensive efforts to date to risk adjust operative outcomes. Surgical nurse reviewers prospectively collected data on preoperative patient risk factors, operative procedures and outcomes for 87 000 major operations at 44 VA tertiary care hospitals [1]. Generic risk data were collected, including general risk factors (such as the patient's functional status), comorbidities for all major organ systems and laboratory tests. Risk models were developed for all operations combined and for eight subspecialties [general surgery, neurosurgery, orthopedic surgery, ear, nose and throat surgery, plastic surgery, non-cardiac thoracic surgery, urological surgery and peripheral vascular (PV) surgery] [2,3]. Using the risk models, expected 30-day mortality and morbidity were calculated for each hospital, and O/E mortality and morbidity ratios were determined both for all operations combined and for each of the eight subspecialties.

Morbidity was scored as the presence or absence of one or more of 21 predefined postoperative complications, ranging from minor complications such as superficial wound infection to life-threatening conditions such as acute myocardial infarction and pulmonary embolism. Other methods of scoring morbidity resulted in substantially the same risk models and sets of outlier hospitals. These other scores included the total number of morbidities reported and several weighted scores, with weights based on postoperative length of stay and a number of other criteria, including expert ratings of the likelihood that a complication would result in death or permanent sequelae. Good inter-rater reliability has been demonstrated for the collection of the morbidity data [1].

The predictive power of the models developed to estimate expected mortality was very good, with a c index of 0.89 for all operations, c indices ranging from 0.86 to 0.91 for six of the subspecialties, and c indices of 0.77 and 0.79 for thoracic surgery and vascular surgery respectively. These c indices compare favorably with the c index of 0.79 for the model used to risk adjust CABG mortality in the New York State Cardiac Surgery Reporting System [11]. For morbidity, the c index for all operations was 0.78 and ranged from 0.69 to 0.79 for the eight subspecialties. Goodness of fit was tested with the Hosmer-Lemeshow statistic. In general, the observed and expected numbers of adverse events in each decile of risk were quite close. The only goodness-of-fit statistics that were significant at P ≤ 0.01 were for all operations combined and the general surgery mortality model, primarily because of the very large number of cases in these categories. Since the initial study data were collected, several hundred thousand additional cases have been added to the database, yielding essentially identical risk models that have been cross-validated with little or no degradation in their predictive validity.
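As a concrete illustration of the two diagnostics reported above, the sketch below computes a c index (equivalent to the area under the ROC curve) and a Hosmer-Lemeshow statistic for a fitted logistic risk model. This is a minimal example on simulated data, not the study's SAS code; the risk factors, coefficients and sample size are invented.

```python
# Illustrative sketch (not the study's code): the c index and the
# Hosmer-Lemeshow goodness-of-fit statistic for a logistic mortality model.
import numpy as np
from scipy.stats import chi2
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 4))    # stand-in preoperative risk factors
y = rng.binomial(1, 1 / (1 + np.exp(-(X @ np.array([1.2, 0.8, 0.5, 0.3]) - 3))))

model = LogisticRegression().fit(X, y)
p = model.predict_proba(X)[:, 1]  # predicted probability of 30-day death

# c index: probability a randomly chosen death outranks a randomly chosen survivor
print("c index:", round(roc_auc_score(y, p), 2))

# Hosmer-Lemeshow: compare observed vs expected deaths within deciles of risk
deciles = np.quantile(p, np.linspace(0, 1, 11))
groups = np.clip(np.digitize(p, deciles[1:-1]), 0, 9)
hl = 0.0
for g in range(10):
    obs = y[groups == g].sum()        # observed deaths in this risk decile
    exp = p[groups == g].sum()        # expected deaths under the model
    n = (groups == g).sum()
    hl += (obs - exp) ** 2 / (exp * (1 - exp / n))
print("H-L statistic:", round(hl, 1), "P =", round(1 - chi2.cdf(hl, df=8), 3))
```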
Study hypotheses

For hospital-level validation, we hypothesized that:

(i) Ratings of quality of care would be lower for hospitals with high O/E mortality ratios and higher for centers with low O/E mortality ratios (hypothesis one).

(ii) Quality of care ratings would be lower for hospitals with high O/E morbidity ratios and higher for centers with low O/E morbidity ratios (hypothesis two).

For patient-level validation, we hypothesized that:

(iii) Patients with low predicted risk of operative mortality who died within 30 days of surgery would be judged to have received lower quality of care than deceased patients with high predicted risk of operative mortality (hypothesis three).

(iv) Patients with low predicted risk of operative morbidity who developed complications within 30 days after surgery would be judged to have received lower quality of care than patients with high predicted risk of operative morbidity who developed complications (hypothesis four).
Hypothesis one (hospital-level mortality) was tested using charts for general surgery and PV surgery patients. We chose these subspecialties because they were relatively high volume with higher than average mortality rates. The original design specified separate tests for each of these subspecialties, but they were combined because of an insufficient number of outlier hospitals in PV surgery to provide the number of mortality cases specified by our design. Hypothesis two (hospital-level morbidity) was tested separately for general surgery and orthopedic surgery, which provided tests in two high volume but distinctly different types of surgery. One test of hypothesis three (patient-level mortality) and one test of hypothesis four (patient-level morbidity) were performed with additional charts sampled from general, PV and orthopedic surgery.

Sampling medical records

We specified a sample size sufficient to detect a difference between test group means of one-third to one-half of a point on our primary outcome measure, a five-point quality of care scale (see Table 1 for the scale). We estimated a standard deviation of 0.72 on this scale from a pilot study, and, using an alpha of 0.05 and 80% power, we calculated that 75 cases per test group were required to detect a difference of one-third of a point between test groups on our primary outcome measure. The actual number of cases reviewed per test group ranged from 64 to 81 (mean=73.9). A total of 739 cases were reviewed. Each case was independently reviewed by two reviewers, and the scores of the two reviewers were pooled for the analysis.
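The reported sample size can be checked with the standard normal-approximation formula for a two-sample comparison of means; the sketch below reproduces the calculation under the stated assumptions (SD 0.72, difference of one-third of a point, two-sided alpha 0.05, power 80%).

```python
# Sketch of the sample-size calculation reported above: cases per group
# needed to detect a 1/3-point difference on the five-point scale.
from math import ceil
from scipy.stats import norm

sd, delta, alpha, power = 0.72, 1 / 3, 0.05, 0.80
z_alpha = norm.ppf(1 - alpha / 2)   # 1.96 for a two-sided test
z_beta = norm.ppf(power)            # 0.84

n_per_group = 2 * ((z_alpha + z_beta) * sd / delta) ** 2
print(ceil(n_per_group))  # 74; the paper's 75 likely reflects a t-correction
```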
For the hospital-level tests, patients were sampled from statistically significant high and low outlier hospitals among the 44 hospitals participating in the NVASRS. For the hospital-level mortality test, the five hospitals with the lowest statistically significant O/E ratios in general surgery and PV surgery (0.49–0.60) and the three hospitals with the highest statistically significant O/E ratios in these subspecialties (1.43–1.89) were selected. Within the groups of high and low outlier hospitals, cases were stratified by 30-day postoperative vital status, and approximately equal numbers of live and deceased patients were randomly sampled from each hospital in proportion to the hospital's volume of cases. The same procedure was followed for selecting outlier hospitals and sampling patients for the hospital-level morbidity tests, except that cases were stratified by morbidity (one or more 30-day complications) and sampled to yield 50% morbidity cases. The ranges of O/E ratios for the selected low and high outlier hospitals for the morbidity test in general surgery were 0.45–0.75 and 1.32–1.62. For the orthopedic surgery test, the ranges of O/E ratios for the selected low and high outlier hospitals were 0.51–0.73 and 1.26–1.57 respectively.

For the patient-level mortality test, deceased patients from general, PV and orthopedic surgery at the NVASRS hospitals were arrayed by their predicted probability of postoperative mortality as determined by the risk models. These probabilities ranged from 0 to 0.96 (median=0.14). Patients with a probability of mortality ≤0.02 (mean=0.010, SD=0.004), the bottom 12.4% of the distribution, were randomly sampled to provide a low risk group. Patients with a probability of mortality >0.14 (mean=0.43, SD=0.219), the top 50% of the distribution, were randomly sampled to provide a higher risk group. We sampled the higher risk group from the entire upper half of the distribution rather than from the top end alone to reduce possible bias due to reviewers rating quality of care more leniently for deceased patients whose charts revealed a very high risk of dying. Likewise, for the patient-level morbidity test, morbidity cases from general, PV and orthopedic surgery were arrayed by probability of 30-day postoperative complications. The predicted probability of morbidity ranged from 0.01 to 0.99 (median=0.305). Patients with a probability of morbidity ≤0.04, the bottom 1.6% of the distribution, were randomly selected to provide a low risk group (mean=0.029, SD=0.007), and patients with a probability of morbidity >0.305, the top 50% of the distribution, were randomly sampled to yield a higher risk group (mean=0.55, SD=0.179).
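The sketch below illustrates the sampling logic just described for the patient-level mortality test; the data frame and column names are hypothetical stand-ins, and only the cutpoints come from the study.

```python
# Hypothetical sketch of the patient-level mortality sampling described above:
# adverse-outcome cases are arrayed by predicted risk; a low risk group is
# drawn from below a probability cutoff and a higher risk group from the
# entire upper half of the distribution. Data and column names are invented.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
deaths = pd.DataFrame({"pred_mortality": rng.beta(1, 4, size=2000)})  # stand-in

median_risk = deaths["pred_mortality"].median()
low_pool = deaths[deaths["pred_mortality"] <= 0.02]         # bottom of the range
high_pool = deaths[deaths["pred_mortality"] > median_risk]  # entire upper half

# drawing from the whole upper half, rather than only the extreme tail, was
# intended to limit leniency bias toward obviously moribund patients
low_group = low_pool.sample(n=min(75, len(low_pool)), random_state=0)
high_group = high_pool.sample(n=75, random_state=0)
print(len(low_group), len(high_group))
```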
Chart review

For sampled patients, we obtained the entire medical record for the index hospitalization, discharge summaries for all hospitalizations for the previous five years and outpatient records for one year before and 30 days after the index hospitalization. Charts were checked for completeness and legibility and organized into a common format for review.

We used a structured implicit chart review form developed by The RAND Corporation for diverse medical and surgical conditions [12,13]. We selected this instrument because of its demonstrated reliability and validity [14,15]. It contains several process of care scales, which have shown good inter-item reliability and satisfactory inter-rater reliability. Validity has been demonstrated by showing that the implicit review scores were related to 30-day mortality and also to scores based on explicit review. The review form elicits subjective ratings of different components of care, including aspects of the admitting workup; the use of tests and consultants; the use of common therapies; aspects of surgery; and length of stay. After considering each specific component of care, the reviewer assigned an overall rating. We eliminated several questions from the instrument that were unnecessary for our study and added questions about the involvement of the attending surgeon and anesthesia care (suggested by surgeons in a pilot study) and one question about the preventability of postoperative complications to supplement an existing item about the preventability of mortality.

Surgeon reviewers were recruited from VA Medical Centers and from Birch and Davis Associates, Inc., a firm under contract with the Veterans Health Administration to review charts for its External Peer Review Program. Twenty-one VA general surgeons and eight VA vascular surgeons were recruited. Ninety percent were board certified in their subspecialty (10% board eligible); all were clinically active (currently performing inpatient surgery), and all reported teaching activity (supervision of residents). At Birch and Davis, eight general surgeons, three vascular surgeons and nine orthopedic surgeons reviewed study charts. All were board certified.

Reviewers were trained in small group sessions of 5–6 hours, following a protocol adapted from The RAND Corporation guidelines for structured implicit review [14]. Orthopedic surgeons were trained separately from general and vascular surgeons. Training was led by one of the investigators (JD) and consisted of (i) discussing written guidelines for each question on the structured implicit review form, (ii) reviewing a practice chart with the group, (iii) each surgeon reviewing one or two practice charts on his/her own, followed by (iv) a group discussion of the practice chart(s). VA surgeon reviewers convened at a central location for three 2.5-day training and review sessions. After the training portion of the session, each reviewer was given 20 charts to complete and return before leaving the facility. Charts were assigned randomly to reviewers within subspecialties. Reviewers from Birch and Davis were trained at the company's office in Silver Spring, Maryland, with separate sessions for general/vascular surgeons and orthopedic surgeons. These reviewers took charts with them for review and periodically were sent additional charts. The number of charts completed by these reviewers ranged from 15 to 91 (mean=49).

Outcome measures

Our primary measure was a five-point scale that assessed overall quality of care. The wording of the scale is given in Table 1. In addition, nine secondary measures, listed in Tables 2 and 3, were used to test the hypotheses. For all measures, higher scores indicate better quality of care. Measure two asked reviewers whether a death or complications could have been prevented through better care, with the following response alternatives: (i) definitely could have been prevented, (ii) probably could have been prevented, (iii) probably could not have been prevented, (iv) definitely could not have been prevented. Measures 3–7 and 9 are multi-item scales that were scored by summing the scores for the individual items. Measure 8, appropriateness of the length of stay, is a single-item scale. Measure 10, the all item scale, was constructed by summing the number of items from measures 3–9 that were rated as standard or better than standard care. Because the items differed in wording and in the number of response alternatives, each item was divided into the two categories that best corresponded to standard or above standard care versus substandard care, giving each item equal weight.
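A minimal sketch of the scoring just described, using hypothetical items and an assumed per-item cutpoint; the study chose, for each item, the split that best corresponded to standard-or-better versus substandard care.

```python
# Sketch of the case scoring, with invented column names and data: pool the
# two reviewers' ratings per case, then build the all item scale by counting
# items rated as standard-or-better care.
import pandas as pd

ratings = pd.DataFrame({            # one row per (case, reviewer)
    "case_id": [1, 1, 2, 2],
    "overall": [3, 4, 2, 3],        # primary five-point scale
    "item_a":  [4, 4, 2, 3],        # illustrative items from measures 3-9
    "item_b":  [3, 5, 3, 2],
})

# scores for each case are the mean of the two independent reviews
case_scores = ratings.groupby("case_id").mean()

# all item scale: number of items rated standard or better, after splitting
# each item into two categories (>= 3 is an assumed cutpoint for these
# illustrative five-point items)
items = ["item_a", "item_b"]
case_scores["all_item"] = (case_scores[items] >= 3).sum(axis=1)
print(case_scores)
```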
Statistical analysis

Scores for each case were obtained by averaging the independent ratings of the two reviewers. The data were analyzed using the Statistical Analysis System (SAS) release 6.12. For the hospital-level tests, we used two-way analysis of variance (ANOVA) with hospitals treated as nested within high and low outlier groups, to control for variation in quality of care ratings between hospitals within outlier groups and for variation between adverse and non-adverse outcome cases. For the patient-level analyses, we used difference of means tests (t-tests).
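The two analyses can be sketched as follows; the variable names and data are illustrative stand-ins, not the study's SAS code, and the nested term is one standard way of expressing hospitals-within-groups in a linear model.

```python
# Hedged sketch: two-way ANOVA with hospitals nested within outlier groups
# (hospital-level tests) and a two-sample t-test (patient-level tests).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm
from scipy.stats import ttest_ind

rng = np.random.default_rng(2)
df = pd.DataFrame({
    "quality": rng.normal(3.2, 0.6, size=320),        # pooled case ratings
    "outlier_group": np.repeat(["high", "low"], 160),
    # hospitals relabelled h1-h4 *within* each group so the nested term is full rank
    "hospital": np.tile(np.repeat(["h1", "h2", "h3", "h4"], 40), 2),
    "adverse_outcome": np.tile([0, 1], 160),
})

# hospital-level test: hospital nested within outlier group, controlling for
# variation between adverse and non-adverse outcome cases
model = smf.ols(
    "quality ~ C(outlier_group) + C(outlier_group):C(hospital) + C(adverse_outcome)",
    data=df,
).fit()
print(anova_lm(model))

# patient-level test: simple difference of means between risk groups
high_risk = rng.normal(3.0, 0.6, 74)   # stand-in ratings
low_risk = rng.normal(2.7, 0.7, 74)
print(ttest_ind(high_risk, low_risk))
```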
Results

Inter-rater reliability

We assessed inter-reviewer agreement with the intraclass correlation coefficient, a chance-corrected index of agreement that is very similar to a weighted kappa [16,17]. For our primary measure, we obtained coefficients of 0.50 for orthopedic surgeons, 0.56 for vascular surgeons and 0.40 for general surgeons, indicating fair to good agreement [18]. Although this level of agreement is not sufficient for making judgments about an individual case, it is sufficiently high for meaningful aggregate comparisons, because mean scores across multiple records have higher reliability than single ratings [14,19].
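For reference, the sketch below shows one standard formulation of this coefficient, the Shrout-Fleiss two-way random-effects ICC for single ratings, together with the Spearman-Brown step-up that gives the higher reliability of the two-reviewer mean; whether this exact variant matches the study's computation is an assumption.

```python
# Minimal sketch, assuming the ICC(2,1) formulation: intraclass correlation
# for single ratings and its step-up for the mean of two reviewers.
import numpy as np

def icc_2_1(ratings):
    """ratings: n_cases x k_raters array of quality-of-care scores."""
    n, k = ratings.shape
    grand = ratings.mean()
    ms_rows = k * ((ratings.mean(axis=1) - grand) ** 2).sum() / (n - 1)
    ms_cols = n * ((ratings.mean(axis=0) - grand) ** 2).sum() / (k - 1)
    ss_err = (((ratings - grand) ** 2).sum()
              - (n - 1) * ms_rows - (k - 1) * ms_cols)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)

rng = np.random.default_rng(3)
true_quality = rng.normal(3.2, 0.5, size=100)            # simulated cases
pair = np.column_stack(
    [true_quality + rng.normal(0, 0.5, 100) for _ in range(2)])  # two reviewers

icc_single = icc_2_1(pair)
icc_mean = 2 * icc_single / (1 + icc_single)  # Spearman-Brown, mean of 2 raters
print(round(icc_single, 2), round(icc_mean, 2))
```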
Hospital-level analysis

Table 1 shows the distribution of ratings for the primary outcome measure. The mean rating of overall quality of care was significantly lower for deceased patients (2.91) than for survivors (3.37) in the mortality model testing sample (P < 0.01). However, for both of the morbidity test samples, the mean ratings were higher for those with complications than for those without complications (in general surgery, 3.49 versus 3.18, P=0.02, and in orthopedic surgery, 3.43 versus 3.20, P=0.08).

Table 2 shows the mean scores on the quality of care measures for general and peripheral vascular surgery cases from high and low mortality outlier hospitals. There was no difference in mean scores on the primary measure. For the secondary measures, the difference on the all item scale, which combines scores for the individual scales, was significant at P=0.04. Most of the differences on the other measures were in the predicted direction but were not statistically significant.

Table 3 shows the mean quality of care ratings for general surgery and orthopedic surgery patients from hospitals with high and low risk-adjusted operative morbidity. Only the differences on the anesthesia care measure (measure seven) were significant at less than P=0.05, and in the orthopedic sample the difference was not in the expected direction, i.e. the higher quality of care score was associated with higher rather than lower operative morbidity. Most of the differences, including those for the primary measure, were not in the predicted direction.

As another set of tests (not shown), we tested for differences on all measures after dividing each scale into two categories that best corresponded to standard or above standard care versus substandard care. For the single-item measures (numbers one, two and eight), we tested for a difference between high and low outlier hospitals in the proportion of cases rated as substandard care. For the other measures, which were multi-item scales, we tested for a difference between the mean number of items rated standard or better. These tests also failed to show differences between high and low outlier hospitals on the primary measure. However, differences between mortality outlier hospitals on three of the secondary measures (measures three, six and seven) approached significance, with P values ranging from 0.06 to 0.08. (The all item scale had already been analyzed as a summary of all the items scored as substandard versus standard or better care, and, as reported in Table 2, the scores differed between high and low mortality outlier hospitals at P=0.04.)

Table 1 Percentage distribution and mean ratings of overall quality of care1

                                 Mortality model tests           Morbidity model tests
                                 General and PV surgery          General surgery                      Orthopedic surgery
                                 Decedents      Survivors        Cases with      Cases without        Cases with      Cases without
                                 (n=77)         (n=78)           complications   complications        complications   complications
                                                                 (n=77)          (n=70)               (n=75)          (n=79)
Extreme, above standard (5)      0.0            0.0              2.9             0.0                  4.0             0.0
Above standard (4)               6.5            21.8             17.1            13.0                 28.0            11.4
Standard (3)                     58.4           66.7             58.6            67.5                 53.3            77.2
Below standard (2)               35.1           11.5             21.4            19.5                 14.7            11.4
Extreme, below standard (1)      0.0            0.0              0.0             0.0                  0.0             0.0
Total                            100%           100%             100%            100%                 100%            100%
Mean rating                      2.91           3.37             3.49            3.18                 3.43            3.20

1 'Considering everything you know about this patient, how would you rate the overall quality of care?' For the purpose of showing the distribution of ratings on this scale, ratings of cases that were not equal to a whole number due to averaging reviewer scores were rounded down to the nearest integer. However, the means shown at the bottom of the table are calculated from the unrounded mean ratings for each case, consistent with the means used to test the hypotheses.
Table 2 Mean quality of care ratings for cases from mortality outlier hospitals1

                                               General and PV surgery
                                               High outlier       Low outlier
                                               hospitals          hospitals          P value2
Primary measure (range of possible scores)
1. Overall quality of care (1–5)               3.12 (0.63,74)     3.14 (0.63,81)     0.69
Secondary measures (range of possible scores)
2. Preventability of death (1–4)               2.62 (0.69,34)     2.84 (0.69,38)     0.13
3. Initial assessment (5–25)                   19.18 (2.83,74)    19.48 (2.35,81)    0.32
4. Tests and treatments (15–45)                43.43 (2.60,74)    43.98 (1.52,81)    0.26
5. Components of care (6–30)                   22.51 (2.84,74)    22.76 (2.49,81)    0.46
6. Surgery (8–40)                              31.02 (3.87,73)    31.68 (3.79,81)    0.27
7. Anesthesia (5–25)                           18.54 (2.37,73)    18.88 (2.09,81)    0.12
8. Length of stay (1–3)                        2.68 (0.49,57)     2.66 (0.49,69)     0.80
9. Adverse consequences (6–18)                 16.85 (1.70,74)    16.80 (2.08,81)    0.60
10. All item scale (0–46)                      41.62 (4.78,74)    42.75 (3.36,81)    0.04

1 Standard deviations and number of cases, respectively, are given in parentheses.
2 P value from ANOVA.
Patient-level tests

Table 4 shows the mean quality of care ratings for the patient-level tests. For both the mortality and morbidity model tests (hypotheses three and four), adverse outcome patients with a higher predicted risk of the adverse outcome were rated higher on overall quality of care than adverse outcome patients with a lower predicted risk. On the secondary measures, three of the differences (numbers 2b, five and 10) were significant at P=0.05 or lower in the mortality test sample, whereas the only significant difference in the morbidity sample (number four) was not in the predicted direction. As for the hospital-level tests, we also tested for differences after dividing the measurement scales into two categories representing standard or above standard care versus substandard care. These tests (not shown) confirmed the results of the initial tests and showed additional significant differences at the P=0.04 level for measures seven and nine in the mortality model testing.
Discussion

We did not confirm the hospital-level hypotheses that ratings of overall quality of care would differ between hospitals with higher and lower than expected operative mortality and morbidity. The findings for several secondary measures, including a scale that combined all of the subscales, provided some evidence that lower risk-adjusted mortality is associated with better processes of care. We confirmed the patient-level hypotheses that adverse outcome patients with a low predicted risk of the adverse outcome would be judged to have received lower quality care than those with a high predicted risk. These findings were stronger for mortality than morbidity, in that predicted differences were also observed on several secondary measures in the mortality tests but not in the morbidity tests.

There are several limitations to the risk-adjustment and chart review methods that may have contributed to the failure to detect quality of care differences between outlier hospitals on most measures, even if differences existed. The NVASRS developed generic risk models because it was designed to risk adjust outcomes for assessing surgical performance at the all-operation and subspecialty levels, not at the level of individual procedures. Although the predictive power of the generic risk models was good for morbidity and very good for mortality, procedure-specific risk models (including procedure-specific complications for morbidity models) would probably provide more complete risk adjustment than generic models. Likewise, our structured implicit review instrument was generic in nature. Procedure-specific review, using procedure- or condition-specific items, might increase the precision of measurement, but this was not feasible in our study because our tests required review of cases with many different types of procedures.
Table 3 Mean quality of care ratings for cases from morbidity outlier hospitals1

                                               General surgery                                      Orthopedic surgery
                                               High outlier       Low outlier                       High outlier       Low outlier
                                               hospitals          hospitals          P value2       hospitals          hospitals          P value2
Primary measure (range of possible scores)
1. Overall quality of care (1–5)               3.24 (0.52,78)     3.22 (0.64,69)     0.54           3.40 (0.68,77)     3.23 (0.52,77)     0.24
Secondary measures (range of possible scores)
2. Preventability of complications (1–4)       2.84 (0.64,44)     2.79 (0.58,40)     0.85           3.22 (0.59,39)     3.12 (0.73,39)     0.54
3. Initial assessment (5–25)                   19.76 (2.24,78)    19.15 (2.59,69)    0.54           21.10 (2.05,77)    20.42 (2.80,77)    0.35
4. Tests and treatments (15–45)                44.21 (1.43,78)    44.07 (1.04,69)    0.47           44.49 (1.12,77)    44.70 (1.09,77)    0.16
5. Components of care (6–30)                   23.01 (2.45,78)    22.74 (2.64,69)    0.72           24.51 (2.15,77)    24.16 (2.52,77)    0.58
6. Surgery (8–40)                              32.29 (3.37,78)    32.32 (3.82,69)    0.70           33.70 (4.45,77)    33.33 (3.36,77)    0.81
7. Anesthesia (5–25)                           18.64 (2.09,78)    19.23 (2.13,69)    0.03           20.26 (2.00,76)    19.37 (1.97,77)    0.03
8. Length of stay (1–3)                        2.66 (0.48,76)     2.63 (0.51,63)     0.72           2.51 (0.56,75)     2.61 (0.45,74)     0.18
9. Adverse consequences (6–18)                 17.03 (1.81,78)    17.05 (1.55,69)    0.56           17.67 (0.73,77)    17.57 (0.73,77)    0.55
10. All item scale (0–46)                      43.42 (2.40,78)    42.80 (2.74,69)    0.16           44.00 (2.61,77)    44.23 (2.34,77)    0.44

1 Standard deviations and number of cases, respectively, are given in parentheses.
2 P value from ANOVA.
Table 4 Mean quality of care ratings for adverse outcome cases by high and low predicted probability of adverse outcome1

                                               Mortality model tests: patients who died            Morbidity model tests: patients with complications
                                               High predicted     Low predicted                    High predicted     Low predicted
                                               probability        probability        P value2      probability of    probability of      P value2
                                               of death           of death                         complication      complication
Primary measure (range of possible scores)
1. Overall quality of care (1–5)               3.00 (0.61,74)     2.68 (0.68,74)     < 0.01        3.31 (0.61,64)     3.01 (0.55,71)     < 0.01
Secondary measures (range of possible scores)
2a. Preventability of complications (1–4)      –                  –                  –             2.88 (0.53,51)     2.72 (0.69,44)     0.20
2b. Preventability of death (1–4)              3.13 (0.59,71)     2.52 (0.49,61)     < 0.01        –                  –                  –
3. Initial assessment (5–25)                   18.98 (2.75,74)    18.65 (2.71,74)    0.46          19.93 (2.36,64)    19.46 (2.55,71)    0.27
4. Tests and treatments (15–45)                43.08 (2.59,74)    42.63 (3.04,74)    0.34          44.08 (1.20,64)    44.48 (0.94,71)    0.03
5. Components of care (6–30)                   22.27 (2.48,74)    21.29 (3.25,74)    0.04          22.66 (2.44,64)    23.46 (2.53,71)    0.06
6. Surgery (8–40)                              30.06 (3.56,74)    29.55 (4.52,74)    0.44          31.13 (4.05,64)    32.14 (3.87,71)    0.14
7. Anesthesia (5–25)                           18.35 (1.98,74)    18.18 (2.38,74)    0.63          18.36 (2.24,64)    19.01 (2.37,71)    0.10
8. Length of stay (1–3)                        2.72 (0.45,51)     2.68 (0.59,45)     0.72          2.66 (0.42,56)     2.54 (0.53,71)     0.18
9. Adverse consequences (6–18)                 16.75 (1.59,74)    16.45 (1.70,74)    0.26          17.02 (1.64,64)    17.35 (1.42,71)    0.22
10. All item scale (0–46)                      42.07 (3.65,74)    40.64 (4.86,74)    0.04          43.22 (3.21,64)    43.58 (2.79,71)    0.50

1 Standard deviations and number of cases are given in parentheses.
2 Difference of means (t-tests).
Higher inter-reviewer reliability would improve measurement. We obtained fair to good chance-corrected reviewer agreement, better than most studies using implicit review [20] but not as high as desirable. However, we doubt that better reviewer agreement would change the findings significantly. A test for hospital-level differences, using a subset of cases for which there was complete reviewer agreement, also failed to reveal differences on the primary measure of quality of care.

A site visit study [7] that investigated hospitals identified by the NVASRS as outliers for all operations combined did find process and structure differences between outlier hospitals. Data were collected at 10 hospitals with significantly higher than expected mortality and morbidity O/E ratios and 10 with significantly lower than expected O/E ratios of adverse outcomes. In each set of hospitals, five were outliers for risk-adjusted mortality and five for risk-adjusted morbidity. Low outlier hospitals scored better than high outliers on ratings of a number of dimensions of quality (technology/equipment, technical competence, relationship with affiliated institutions, and overall quality of care).

If site visits detected differences in process and structure at the hospital level, why did most of our chart review measures fail to detect differences? There are several possible reasons. The site visits measured system-level indicators of quality of care, which may vary more between hospitals than the individual patient care measured by chart review. Also, well-designed site visits may be more sensitive to quality differences than chart review. Site visits allow direct observation of the surgical service and the opportunity to investigate issues that arise. Charts do not capture some important aspects of treatment, such as surgeon skill, and they vary in completeness of documentation of tests and treatment. By measuring some of the major causes of variation in patient care, site visits may account for more of the variation in the quality of patient care than chart reviews reveal. For example, inferior technology and poor communication among surgical staff, noted by site visitors, may have systematically affected the quality of the care delivered, but chart review may detect only a portion of the substandard care resulting from these factors. Chart reviews may reveal the more egregious errors associated with severe adverse outcomes but miss some of the system-related issues affecting patient outcomes.

Our patient-level findings suggest that the risk models, particularly the mortality risk models, are useful for screening adverse outcome patients for quality of care review. A possible criticism of the findings, however, is that reviewers may have been less critical of the care received by patients in the high-risk groups because they recognized high-risk patients from chart data and assumed that their prognosis was poor regardless of the quality of care. We attempted to guard against such a bias by sampling higher risk patients from the entire upper half of the risk distribution rather than from only the top portion. Evidence against a rating inflation bias is the fact that the mean rating of 3.00 for overall quality of care for deceased patients with a high probability of death in the patient-level sample was not significantly higher than the mean rating of 2.94 for deceased patients in the hospital-level sample, who were not sampled by risk status (P=0.56). Likewise, the mean rating of 3.31 for patients with complications in the patient-level sample was not significantly different from the rating of 3.35 for patients in the hospital-level sample who experienced complications (P=0.62). Nevertheless, we cannot rule out the possibility that reviewers applied different standards in judging the care received by high-risk patients. Future studies may determine whether high-risk patients who survive and do not develop complications are also rated higher on quality of care than low-risk patients. This would strengthen the findings and provide evidence that the risk models are useful for identifying exemplary care as well as poorer care.

In conclusion, except for a few secondary measures, our chart reviews failed to detect quality of care differences between hospital outliers on risk-adjusted mortality and morbidity. Differences detected by site visits to similarly selected NVASRS outlier hospitals suggest that chart reviews may be relatively insensitive to hospital-level differences in quality. Our patient-level findings suggest that the risk models are useful for screening adverse outcome cases for review.
Acknowledgements

This research was supported by grant number SDR 93–008 from the Department of Veterans Affairs, Veterans Health Administration, Health Services Research and Development Service, and also by the Office of Patient Care Services, the Office of Quality Management and the Cooperative Studies Program. An earlier version of this paper was presented at the Second Annual Chicago and Great Lakes Health Services Research Symposium, Chicago, Illinois, 12–13 March 1998.

The authors wish to thank the following persons for their contributions: Dolores Ippolito, MPH, for assistance in recruitment of chart reviewers and preparation of charts; Gay Watkins for preparation of charts; William Best, MD, for advice on study design and analysis; Bharat Thakkar, MS, for computer programming; Barbara Jones, RRA, and Verna Hightower for coordinating acquisition of chart reviews at Birch and Davis Associates, Inc.

We would also like to thank the following surgeons who performed the chart reviews. Their participation should not be construed as an endorsement of the analysis or conclusions of this study. VA surgeons: Drs Joaquin Aldrete, Gene Branum, Thomas Brothers, John Clark, Brian Cmolik, George Cowan, Richard Curl, Kathy Dalessandri, David Dries, John Eidt, Robert Esterl, Aaron Fink, Thomas Gilmore, Thomas Gouge, Glen Hunter, Stephen Lalka, Richard Liechty, Fred Littooy, Dana Lynge, Donald McConnell, George Meier, Dave Pitcher, Mark Sawicki, John Tarpley, Samuel Tisherman, Steve Wangensteen, Thomas Whitehill, Eric Wiebke, Paul Zagar. Surgeons under contract with Birch and Davis Associates, Inc. for peer review: Drs Betsy Ballard, Amir Banisar, Stuart Battle, Andrew Bender, Ira Brecher, Thomas Calhoun, Doreen DiPasquale, William Holbrook, Robert Kan, Walter Landmesser, Francis Milone, Robert Nothwanger, Mark Peterson, Nathan Price, Roger Raiford, Leo Rozmaryn, James Salander, Michael Seremetis, Carlos Silva, Bert Weisbaum.
References

1. Khuri SF, Daley J, Henderson W et al. The National Veterans Administration Surgical Risk Study: risk adjustment for the comparative assessment of the quality of surgical care. J Am Coll Surgeons 1995; 180: 519–531.
2. Khuri SF, Daley J, Henderson W et al. Risk adjustment of the postoperative mortality rate for the comparative assessment of the quality of surgical care. J Am Coll Surgeons 1997; 185: 315–327.
3. Daley J, Khuri SF, Henderson W et al. Risk adjustment of the postoperative morbidity rate for the comparative assessment of the quality of surgical care. J Am Coll Surgeons 1997; 185: 328–340.
4. Khuri SF, Daley J, Henderson W et al. The Department of Veterans Affairs' NSQIP: the first national, validated, outcome-based, risk-adjusted, and peer-controlled program for the measurement and enhancement of the quality of surgical care. Ann Surgery 1998; 228: 491–507.
5. Berwick DM. Health services research and quality of care: assignments for the 1990's. Med Care 1989; 27: 763–771.
6. Brook RH, McGlynn EA, Cleary PD. Quality of health care. Part 2: measuring quality. New Engl J Med 1996; 335: 966–970.
7. Daley J, Forbes MG, Young GJ et al. Validating risk-adjusted surgical outcomes: site visit assessment of process and structure. J Am Coll Surgeons 1997; 185: 341–351.
8. Hannan EL, Kilburn H Jr, O'Donnell JF et al. Adult open heart surgery in New York State: an analysis of risk factors and hospital mortality rates. J Am Med Assoc 1990; 264: 2768–2774.
9. Dubois RW, Rogers WH, Moxley JH et al. Special report. Hospital inpatient mortality: is it a predictor of quality? New Engl J Med 1987; 317: 1674–1680.
10. Best WR, Cowper DC. The ratio of observed-to-expected mortality as a quality of care indicator in non-surgical VA patients. Med Care 1994; 32: 390–400.
11. Hannan EL, Kilburn H Jr, Lindsey ML, Lewis R. Clinical versus administrative data bases for CABG surgery: does it matter? Med Care 1992; 30: 892–907.
12. Rubin HR, Rogers WH, Kahn KL et al. Watching the doctor-watchers: how well do peer review organization methods detect hospital care quality problems? J Am Med Assoc 1992; 267: 2349–2354.
13. Rubin HR, Rubenstein LV, Kahn KL, Sherwood M. Guidelines for Structured Implicit Review of Diverse Medical and Surgical Conditions. Santa Monica, CA: The RAND Corporation (Publication N-3066-HCFA), 1990.
14. Rubenstein LV, Kahn KL, Reinisch EJ et al. Changes in quality of care for five diseases measured by implicit review, 1981 to 1986. J Am Med Assoc 1990; 264: 1974–1979.
15. Rubenstein LV, Kahn KL, Reinisch EJ et al. Structured Implicit Review: Analysis of the Method and Quality of Care Results. Santa Monica, CA: The RAND Corporation (Publication N-3033-HCFA), 1991.
16. Fleiss JL, Cohen J. The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educational and Psychological Measurement 1973; 33: 613–619.
17. Krippendorff K. Bivariate agreement coefficients for reliability of data. In Borgatta EF, Bohrnstedt GW, eds, Sociological Methodology. San Francisco, CA: Jossey-Bass Publishers, 1970: pp. 139–150.
18. Fleiss JL. Statistical Methods for Rates and Proportions. 2nd edn. New York, NY: John Wiley & Sons, Inc., 1981: pp. 212–236.
19. McDowell I, Newell C. Measuring Health: A Guide to Rating Scales and Questionnaires. 2nd edn. New York: Oxford University Press, 1996.
20. Goldman RL. The reliability of peer assessments of quality of care. J Am Med Assoc 1992; 267: 958–960.
Accepted for publication 23 January 2001