Nephrol Dial Transplant (2010) 25: 1402–1405 doi: 10.1093/ndt/gfq046 Advance Access publication 18 February 2010

Statistical methods for the assessment of prognostic biomarkers (Part II): Calibration and re-classification

Giovanni Tripepi1, Kitty J. Jager2, Friedo W. Dekker2,3 and Carmine Zoccali1

Correspondence and offprint requests to: Giovanni Tripepi; E-mail: [email protected]

Abstract Calibration is the ability of a prognostic model to correctly estimate the probability of a given event across the whole range of prognostic estimates (for example, 30% probability of death, 40% probability of myocardial infarction, etc.). The key difference between calibration and discrimination is that the latter reflects the ability of a given prognostic biomarker to distinguish a status (died/survived, event/non-event), while calibration measures how much the prognostic estimation of a predictive model matches the real outcome probability (that is, the observed proportion of the event). Re-classification is another measure of prognostic accuracy and it reflects how much a new prognostic biomarker increases the proportion of individuals correctly re-classified as having or not having a given event compared to a previous classification based on an existing prognostic biomarker or predictive model.

Keywords: calibration; discrimination; prognostic model re-classification; risk prediction

Introduction In a previous article of this series, we discussed discrimination [1], a measure of how well a prognostic model discriminates individuals with and without the outcome of interest. Calibration, another important step in prognostic model validation, is the ability to correctly estimate the likelihood of a future event across the whole range of prognostic estimates. The concept of calibration in prognostic research is similar to that applied in clinical chemistry. As a laboratory measurement is validated against a well-recognized standard, the predictive power of a prognostic biomarker is validated against the actual (observed) occurrence of the event of interest. To better focus on the substantial difference between discrimination and calibration, we may consider the two indicators in a hypothetical sample of 100 patients with end-stage renal disease (ESRD) followed up for 3 years with an observed mortality rate of 30%. A prognostic model including a biomarker that predicts a 30% probability of death at 3 years in the whole cohort is perfectly calibrated. On the other hand, a prognostic model considering another biomarker attributing a 15% probability of death to survivors and a 16% probability to non-survivors would be perfectly discriminating (the threshold of 15% perfectly discriminates patients who die from those who survive) but very poorly calibrated because the estimated probability of death (about 15%) is much lower than the observed probability (30%).

The third step in prognostic model validation is re-classification. Such a measure is receiving increasing attention because it provides an immediate appreciation of the impact of new biomarkers in risk predictions. Re-classification quantifies how much a new prognostic biomarker increases the proportion of individuals correctly re-classified as having or not having a given event compared to a previous classification based on an existing prognostic biomarker or predictive model. Here, we focus on the Hosmer–Lemeshow test (H-L test) as a measure of calibration and net re-classification index (NRI) as a measure of re-classification.
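The contrast between discrimination and calibration in the hypothetical 100-patient cohort above can be checked numerically. The sketch below is illustrative only: the variable names and the helper `auc` are hypothetical, and it assumes that the first model assigns the same 30% risk to every patient, which the text does not state explicitly.

```python
# Hypothetical cohort from the text: 100 ESRD patients, 30 deaths at 3 years.
died = [1] * 30 + [0] * 70

# Model A (assumption: the same 30% predicted risk for every patient).
pred_a = [0.30] * 100
# Model B from the text: 16% for non-survivors, 15% for survivors.
pred_b = [0.16 if d == 1 else 0.15 for d in died]


def auc(outcomes, preds):
    """Share of (event, non-event) pairs ranked correctly; ties count 0.5."""
    events = [p for p, d in zip(preds, outcomes) if d == 1]
    non_events = [p for p, d in zip(preds, outcomes) if d == 0]
    score = 0.0
    for p_event in events:
        for p_non_event in non_events:
            if p_event > p_non_event:
                score += 1.0
            elif p_event == p_non_event:
                score += 0.5
    return score / (len(events) * len(non_events))


observed_risk = sum(died) / len(died)
mean_pred_a = sum(pred_a) / len(pred_a)
mean_pred_b = sum(pred_b) / len(pred_b)
auc_a = auc(died, pred_a)
auc_b = auc(died, pred_b)

print(f"observed risk: {observed_risk:.2f}")
print(f"model A: mean predicted {mean_pred_a:.3f}, AUC {auc_a:.2f}")
print(f"model B: mean predicted {mean_pred_b:.3f}, AUC {auc_b:.2f}")
```

Under these assumptions, model A matches the observed 30% risk but cannot rank patients (AUC 0.5), whereas model B ranks every patient correctly (AUC 1.0) while predicting an average risk of roughly half the observed one.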

Calibration Example 1 Calibration measures how much the prognostic estimate of a specific predictive model including one or more biomarkers matches the ‘real’ probability of the outcome (the observed proportion of an event in a given time period). Here, we consider a study investigating the prognostic value for death of brain natriuretic peptide (BNP) (a well-recognized biomarker of cardiovascular risk in the dialysis population) [2] in a cohort of 277 ESRD patients followed up for 4 years. During the follow-up, 112 patients died. By using the equation of a Cox model based only on BNP, the authors of this study calculated

© The Author 2010. Published by Oxford University Press on behalf of ERA-EDTA. All rights reserved. For Permissions, please e-mail: [email protected]

Downloaded from http://ndt.oxfordjournals.org by Carmine Zoccali on May 6, 2010

1 CNR-IBIM, Clinical Epidemiology and Physiopathology of Renal Diseases and Hypertension, Reggio Calabria, Italy, 2 ERA-EDTA Registry, Department of Medical Informatics, Academic Medical Center, University of Amsterdam, Amsterdam, The Netherlands and 3 Department of Clinical Epidemiology, Leiden University Medical Centre, Leiden, The Netherlands

Biomarkers in prognostic research


Table 1. Example of calibration (see text for details)

Decile of estimated       Sum of predicted    Sum of observed
probability of death      deaths              deaths
 1                        10.1                 5
 2                        11.0                 6
 3                        10.4                 5
 4                        11.1                 7
 5                        11.4                12
 6                         9.0                11
 7                        15.0                13
 8                        13.0                18
 9                        14.5                16
10                        19.6                19

By applying the H-L formula to our example (Table 1), we have:

\[
\begin{aligned}
\text{H-L test } (\chi^2) ={}& \frac{(5-10.1)^2}{10.1} + \frac{(6-11.0)^2}{11.0} + \frac{(5-10.4)^2}{10.4} + \frac{(7-11.1)^2}{11.1} + \frac{(12-11.4)^2}{11.4} \\
&+ \frac{(11-9.0)^2}{9.0} + \frac{(13-15.0)^2}{15.0} + \frac{(18-13.0)^2}{13.0} + \frac{(16-14.5)^2}{14.5} + \frac{(19-19.6)^2}{19.6} = 12
\end{aligned}
\]
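The sum can be reproduced from the Table 1 values. This is a minimal sketch of the simplified formula used in this article (observed and predicted deaths only); the function name `hl_statistic` is hypothetical, and statistical packages usually compute the H-L statistic with an additional term for non-events, which cannot be derived here because the decile sizes are not reported.

```python
# Table 1: sum of predicted and observed deaths in each decile of estimated risk.
predicted = [10.1, 11.0, 10.4, 11.1, 11.4, 9.0, 15.0, 13.0, 14.5, 19.6]
observed = [5, 6, 5, 7, 12, 11, 13, 18, 16, 19]


def hl_statistic(observed_counts, predicted_counts):
    """Sum over groups of (observed - estimated)^2 / estimated."""
    return sum(
        (obs - est) ** 2 / est
        for obs, est in zip(observed_counts, predicted_counts)
    )


chi2 = hl_statistic(observed, predicted)
print(round(chi2, 1))  # 12.0
```

The observed deaths add up to 112, the number of deaths reported for the cohort, which is a quick check that the table was read correctly.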

Fig. 1. Scatter plot of observed and predicted probabilities (see the text for details). Horizontal axis: sum of predicted probabilities of deaths; vertical axis: sum of observed deaths; the diagonal is the identity line.

A χ2 = 12 (with 8 degrees of freedom) is not statistically significant (P = 0.15), indicating satisfactory calibration of the model (that is, the proportion of deaths predicted by BNP does not significantly differ from observed deaths). The major limitation of the H-L test is that it depends on the number of subgroups being formed. Whenever the risk of a given event is below 20% [4], biostatisticians recommend calculating the H-L test by grouping individuals on the basis of deciles of estimated probabilities.
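The P value quoted for χ2 = 12 with 8 degrees of freedom can be verified without a statistics library, because the upper-tail probability of a chi-square distribution has a closed form when the degrees of freedom are even. The helper name `chi2_sf_even_df` is hypothetical; this is a sketch for checking the arithmetic, not a general-purpose routine.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability for even df.

    Uses the identity P(X > x) = exp(-x/2) * sum_{i=0}^{df/2 - 1} (x/2)^i / i!
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only holds for positive even df")
    half = x / 2.0
    terms = (half ** i / math.factorial(i) for i in range(df // 2))
    return math.exp(-half) * sum(terms)


groups = 10
df = groups - 2  # n - 2 degrees of freedom, as stated in the text
p_value = chi2_sf_even_df(12.0, df)
print(round(p_value, 2))  # 0.15
```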

Re-classification Example 2 An important issue in clinical research is that of investigating whether a new prognostic biomarker increases the proportion of individuals correctly re-classified as having or not having a given event compared to a previous classification based on existing prognostic biomarkers or predictive models. A measure of re-classification is the NRI, which takes into account the proportion of individuals with and without the event of interest who move up or down from a lower risk to a higher risk category or vice versa as a consequence of re-classification. The proportions of individuals in which previous classification and re-classified estimates coincide are not considered in the NRI calculation. The general idea underlying the NRI is that a prognostic model (for example, a model including two biomarkers: A and B), which assigns a higher probability of the relevant event to a patient who actually goes on to develop the event, is more accurate than a model (including only the biomarker A) which assigns a lower probability to the same patient.

In a community-based cohort of elderly men, Zethelius and co-workers [5] applied the NRI to assess whether the use of multiple biomarkers improves the prognostic value for cardiovascular (CV) mortality of a prediction model including standard risk factors in 661 individuals without cardiovascular disease. In Table 2, individuals who died of CV causes (Table 2a) and those who survived or died of causes other than cardiovascular disease (Table 2b) were arranged according to the predicted probability of CV death as estimated either by a model including standard risk factors (restricted model, shown in rows) or by a model including standard risk factors and four biomarkers, namely, troponin I, N-terminal pro-BNP, cystatin C and C-reactive protein (expanded model, shown in columns). In 16 individuals out


the individual predicted probability of death according to plasma levels of BNP. Patients were then grouped into deciles of estimated probabilities of death and, in each decile, the sum of predicted and observed deaths was calculated (see Table 1). A graphical representation of the performance of the prognostic model based on BNP is given in Figure 1, in which predicted deaths are plotted against the actual number of deaths. Visually, the predictive model based on BNP appears to be well calibrated because each dot in the graph is quite close to the identity line.

The H-L test is the formal statistical test for assessing calibration. It is calculated by summing, over the groups, the squared differences between the observed and predicted numbers of events, each divided by the predicted number. The H-L statistic follows a chi-square distribution with n − 2 degrees of freedom [3], with ‘n’ being the number of subgroups considered (10 subgroups in our instance). A non-significant H-L test indicates that the predicted and observed probabilities of the event do not differ significantly, which indicates a satisfactory calibration of the model. Vice versa, a significant test signals a poor calibration. The general formula of the H-L test is:

\[
\text{H-L test } (\chi^2) = \sum \left[ \frac{(\text{observed} - \text{estimated})^2}{\text{estimated}} \right]
\]

For the deciles in Table 1, the sum starts with the terms for the first two deciles and totals 12:

\[
\text{H-L test } (\chi^2) = \frac{(5-10.1)^2}{10.1} + \frac{(6-11.0)^2}{11.0} + \cdots = 12
\]


G. Tripepi et al.

Table 2. Example of re-classification (see text for details)

a. Individuals who died of cardiovascular causes (n = 54)

Model including standard      Model including standard risk factors and biomarkers
risk factors only             Low        Intermediate    High
Low                             9          5               0
Intermediate                    4         12              11
High                            0          3              10

b. Survivors/individuals who died for causes other than cardiovascular disease (n = 607)

Model including standard      Model including standard risk factors and biomarkers
risk factors only             Low        Intermediate    High
Low                           279         41               2
Intermediate                  102        137              19
High                            2         13              12

Risk categories run from low to high predicted probability of CV death; the highest category is bounded at 20%. Off-diagonal cells give the number of individuals who move up or down from a lower risk to a higher risk category or vice versa as a consequence of re-classification.

of 54 who died of CV causes (30%), re-classification was more accurate when the authors used the expanded model to predict the outcome. In fact, when the expanded model was used, five individuals considered at low risk by the standard model moved to the intermediate-risk category and 11 individuals considered at intermediate risk moved to the high-risk category. On the other hand, seven individuals out of 54 who died of CV causes (13%) moved from a higher risk category to a lower risk category (see Table 2a). Among survivors or individuals who died of causes other than cardiovascular disease (Table 2b), 117 individuals out of 607 (19%) were re-classified in a lower risk category and 62 out of 607 (10%) were re-classified in a higher risk category. The NRI is calculated as follows:

\[
\text{NRI} = \left( p_{\text{up, CV death}} - p_{\text{down, CV death}} \right) - \left( p_{\text{up, no CV death}} - p_{\text{down, no CV death}} \right)
\]

In which:
\(p_{\text{up, CV death}}\) is the proportion of individuals who died of CV cause who were re-classified to a higher risk category (0.30);
\(p_{\text{down, CV death}}\) is the proportion of individuals who died of CV cause who were re-classified to a lower risk category (0.13);
\(p_{\text{up, no CV death}}\) is the proportion of survivors/individuals who died of causes other than cardiovascular disease re-classified to a higher risk category (0.10);
\(p_{\text{down, no CV death}}\) is the proportion of survivors/individuals who died of causes other than cardiovascular disease re-classified to a lower risk category (0.19).

Thus, in Zethelius’s study, the NRI is:

\[
\text{NRI} = (0.30 - 0.13) - (0.10 - 0.19) = 0.26 \; (26\%)
\]

This result indicates that the use of four prognostic biomarkers produced a 26% net improvement in re-classification, a highly significant figure (P = 0.005). The authors concluded that, in elderly men, the simultaneous addition of four biomarkers of cardiovascular and renal abnormalities substantially improves the risk prediction of CV mortality over and above that of a predictive model based on standard risk factors.
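The NRI can also be recomputed directly from the counts in Table 2. The sketch below assumes a particular reading of the table (rows are the risk categories of the standard model, columns those of the expanded model, both ordered from low to high); the function name `moves` is hypothetical.

```python
# Table 2a and 2b as 3 x 3 cross-tabulations.
# Assumed layout: rows = standard model, columns = expanded model, low to high.
cv_death = [
    [9, 5, 0],
    [4, 12, 11],
    [0, 3, 10],
]
no_cv_death = [
    [279, 41, 2],
    [102, 137, 19],
    [2, 13, 12],
]


def moves(table):
    """Return (moved up, moved down, total) for one cross-tabulation."""
    size = len(table)
    cells = [(i, j) for i in range(size) for j in range(size)]
    up = sum(table[i][j] for i, j in cells if j > i)
    down = sum(table[i][j] for i, j in cells if j < i)
    total = sum(sum(row) for row in table)
    return up, down, total


up_event, down_event, n_event = moves(cv_death)
up_non, down_non, n_non = moves(no_cv_death)

# NRI = (p_up - p_down) among events minus (p_up - p_down) among non-events.
nri = (up_event - down_event) / n_event - (up_non - down_non) / n_non
print(up_event, down_event, n_event)  # 16 7 54
print(up_non, down_non, n_non)        # 62 117 607
print(round(nri, 2))                  # 0.26
```

Working from the raw counts rather than the rounded proportions gives the same value to two decimal places.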

Conclusions Calibration is the ability of a predictive model to match predicted and observed probabilities of a given event throughout the whole range of potential prognostic estimates. Re-classification is the gain in prognostic accuracy achieved by a predictive model when a new biomarker is added to a previous prognostic model. These features must be considered in conjunction with discrimination to assess the usefulness of a prognostic biomarker in clinical and epidemiological research. Overall, as also delineated in a previous article [1], to be considered useful in clinical practice, a new biomarker should undergo meticulous validation by testing its ability to discriminate diseased and non-diseased people, to correctly predict the outcome of interest at individual level (calibration) and to classify individuals who develop pertinent clinical outcomes better than previous biomarkers or risk equations.

Conflict of interest statement. None declared.



References
1. Tripepi G, Jager KJ, Dekker FW et al. Statistical methods in prognostic research (part I): how to assess the discriminatory power of prognostic biomarkers? Nephrol Dial Transplant 2010 (under review)
2. Mallamaci F, Tripepi G, Cutrupi S et al. Prognostic value of combined use of biomarkers of inflammation, endothelial dysfunction, and myocardiopathy in patients with ESRD. Kidney Int 2005; 67: 2330–2337
3. Hosmer DW, Lemeshow S. A goodness-of-fit test for the multiple logistic regression model. Comm Stat 1980; A10: 1043–1069
4. McGeechan K, Macaskill P, Irwig L et al. Assessing new biomarkers and predictive models for use in clinical practice. Arch Intern Med 2008; 168: 2304–2310
5. Zethelius B, Berglund L, Sundstrom J et al. Use of multiple biomarkers to improve the prediction of death from cardiovascular causes. N Engl J Med 2008; 358: 2107–2116

Received for publication: 30.5.09; Accepted in revised form: 4.1.10
