Statistical Methods in Medical Research

Statistical Methods in Medical Research http://smm.sagepub.com/

Statistical methods for multivariate meta-analysis of diagnostic tests: An overview and tutorial Xiaoye Ma, Lei Nie, Stephen R Cole and Haitao Chu Stat Methods Med Res published online 26 June 2013 DOI: 10.1177/0962280213492588 The online version of this article can be found at: http://smm.sagepub.com/content/early/2013/06/19/0962280213492588

Published by: http://www.sagepublications.com

Additional services and information for Statistical Methods in Medical Research can be found at: Email Alerts: http://smm.sagepub.com/cgi/alerts Subscriptions: http://smm.sagepub.com/subscriptions Reprints: http://www.sagepub.com/journalsReprints.nav Permissions: http://www.sagepub.com/journalsPermissions.nav

>> OnlineFirst Version of Record - Jun 26, 2013 What is This?

Downloaded from smm.sagepub.com at Bio Medical Library, University of Minnesota Libraries on June 28, 2013

XML Template (2013) [6.6.2013–5:10pm] [1–24] //blrnas3/cenpro/ApplicationFiles/Journals/SAGE/3B2/SMMJ/Vol00000/130198/APPFile/SG-SMMJ130198.3d (SMM) [PREPRINTER stage]

Original Article

Statistical methods for multivariate meta-analysis of diagnostic tests: An overview and tutorial*

Statistical Methods in Medical Research 0(0) 1–24 ! The Author(s) 2013 Reprints and permissions: sagepub.co.uk/journalsPermissions.nav DOI: 10.1177/0962280213492588 smm.sagepub.com

Xiaoye Ma,1 Lei Nie,2 Stephen R Cole3 and Haitao Chu1

Abstract In this article, we present an overview and tutorial of statistical methods for meta-analysis of diagnostic tests under two scenarios: (1) when the reference test can be considered a gold standard and (2) when the reference test cannot be considered a gold standard. In the first scenario, we first review the conventional summary receiver operating characteristics approach and a bivariate approach using linear mixed models. Both approaches require direct calculations of study-specific sensitivities and specificities. We next discuss the hierarchical summary receiver operating characteristics curve approach for jointly modeling positivity criteria and accuracy parameters, and the bivariate generalized linear mixed models for jointly modeling sensitivities and specificities. We further discuss the trivariate generalized linear mixed models for jointly modeling prevalence, sensitivities and specificities, which allows us to assess the correlations among the three parameters. These approaches are based on the exact binomial distribution and thus do not require an ad hoc continuity correction. Lastly, we discuss a latent class random effects model for meta-analysis of diagnostic tests when the reference test itself is imperfect for the second scenario. A number of case studies with detailed annotated SAS code in MIXED and NLMIXED procedures are presented to facilitate the implementation of these approaches. Keywords Meta-analysis, diagnostic test, gold standard, generalized linear mixed models

1 Introduction In the medical literature, a diagnostic test commonly refers to a medical test to classify subjects with respect to a (disease) state of interest. Accurate diagnosis plays an important role in the disease control 1

Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN, USA Division of Biometrics IV, Office of Biometrics/OTS/CDER /FDA, Silver Spring, MD, USA 3 Department of Epidemiology, University of North Carolina, Chapel Hill, NC, USA 2

*This article reflects the views of the author and should not be construed to represent FDA’s views or policies. Corresponding author: Haitao Chu, Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455, USA. Email: [email protected]



2

Statistical Methods in Medical Research 0(0)

and prevention. Diagnostic test outcomes could be dichotomous, ordinal or continuous. This article only focuses on the dichotomous outcome. The performance of a binary test is commonly measured by a pair of indices such as sensitivity and specificity. Sensitivity is defined as the probability of testing positive given a person being diseased and specificity is defined as the probability of testing negative given a person being disease-free.1,2 Other frequently used indices include positive and negative predictive values, and positive and negative diagnostic likelihood ratios.1,2 In meta-analysis of diagnostic tests, there is a great potential for heterogeneity due to differences in factors such as disease prevalence (p), study population characteristics, laboratory methods and study designs. While some study-level covariates such as mean age may explain some variation, random effects models are commonly recommended to account for other unobserved sources of variation. When a reference test can be considered a gold standard, a few methods are available to account for this heterogeneity.3–12 Specifically, random effects models including the hierarchical summary receiver operating characteristic (ROC) model3 and bivariate random effects metaanalysis on sensitivities and specificities are recommended.5,11,12 These approaches are identical in some situations.6,9,13 Some examples and extensive simulations demonstrated that bivariate randomeffects meta-analysis offers numerous advantages over separate univariate meta-analysis.14,15 In general, generalized linear mixed models (GLMMs), which use the exact binomial likelihood, often perform better than the linear mixed models (LMMs) which use a normal approximation.12,16 In addition, a trivariate generalized linear random-effects model were proposed to jointly models the disease prevalence, sensitivities and specificities.17 In practice, disease status is often measured by a reference test that is subject to non-trivial measurement error. This leads to a setting without a gold standard. When the reference test is subject to measurement error, the evaluation of diagnostic tests in a meta-analysis setting becomes more challenging. To the best of our knowledge, only a few articles have considered meta-analysis methods for diagnostic tests in the absence of a gold standard. Walter et al.18 discussed a latent class model for a meta-analysis of two diagnostic tests assuming varying prevalence, but constant sensitivity and specificity across studies. A more general latent class random effects model by Chu et al.19 assumes sensitivity and specificity of both tests and prevalence to be random effects. Sadatsafavi et al.20 presented a model where conditional dependence between tests is allowed, but beyond prevalence, only one of the sensitivity or specificity can be implemented using a random effect. Dendukuri et al.21 presented a Bayesian method for the meta-analysis of a tuberculous pleuritis diagnostic test in the absence of a gold standard. In this article, we present an overview and tutorial summarizing the pros and cons of these approaches and provide detailed case studies with annotated SAS code. The outline of this article is as follows. In Section 2, we summarize and compare different models when the referent test can be considered a gold standard. In Section 3, we introduce models in the absence of a gold standard. In Section 4, we present case studies to illustrate the approaches described in Sections 2 and 3. The annotated SAS code to implement these approaches is presented in Appendix 2.

2 Statistical methods when the reference test is a gold standard When the reference test can be considered a gold standard, let ni11 , ni00 , ni01 and ni10 be the number of true positives, true negatives, false positives and false negatives for the ith study (i ¼ 1, 2, . . ., N), respectively. Let ni1þ ¼ ni11 þ ni10 and ni0þ ¼ ni01 þ ni00 be the study-specific numbers of diseased and disease-free subjects. Then the study-specific sensitivity and specificity can be estimated as bi ¼ ni11 =ni1þ , and Sp ci ¼ ni00 =ni0þ , see Table 1 for a typical 2 2 table. Se



Ma et al.

3 Table 1. 2 2 table for the ith study. Reference test

Diagnostic test

Positive (þ) Negative (–)

Total

Positive (þ)

Negative (–)

Total

ni11 ni10 ni1þ

ni01 ni00 ni0þ

niþþ

In this section, we will first discuss the conventional summary ROC approach and a bivariate approach using LMMs. Both methods require direct calculations of study-specific sensitivities and specificities, and an ad hoc continuity correction when there are empty cells. Second, we will discuss the hierarchical summary ROC approach for jointly modeling positivity criteria and accuracy parameters, and a bivariate approach using GLMMs for jointly modeling sensitivities and specificities. At last, we will discuss a trivariate approach using GLMM for jointly modeling prevalence, sensitivities and specificities to account for the correlations among the three parameters. The hierarchical summary ROC approach, and the bivariate and trivariate approaches are based on the exact binomial distribution and thus do not require any ad hoc continuity correction.

2.1

The summary ROC method

The summary ROC curve method was first proposed by Moses et al.22 Reflecting the trade-off between sensitivity and specificity caused by implicit thresholds, this method had been widely used in diagnostic tests studies. As test threshold varies, the observed sensitivity (specificity) (Se (Sp)) estimates can form a concave shape for the ROC curve. Such a curve can be fitted by backtransforming the linear relationship between the logit transformations of Se and Sp to the ROC space: first, if some studies have ni11 ¼ 0 or ni00 ¼ 0, an ad hoc continuity correction is applied by adding 0.5 to each of the four cells of such studies. After the correction, sensitivity is estimated as bi ¼ ðni11 þ 0:5Þ=ðni1þ þ 1Þ and specificity is estimated as Sp ci ¼ ðni00 þ 0:5Þ=ðni0þ þ 1Þ for the ith Se study. Second, define variables S and D as the sum and the difference of logit transformed b i Þ þ logitðSp ci Þ and Di ¼ logitðSe b i Þ logitðSp ci Þ, sensitivity and specificity, such that Si ¼ logitðSe where logitð pÞ ¼ logðp=ð1 pÞÞ. This notation is slightly different than Moses et al.22 because the di Þ, where OR di ¼ ni11 =ni01 original transformation is on Se and 1 Sp. One can see that Si ¼ logðOR ni10 ni00 is the diagnostic odds ratio for the ith study. Third, for N studies, fit a linear regression line S ¼ a þ bD either by an ordinary least squares or by a weighted least squares method weighing di ÞÞ1 , where var logðOR di Þ ¼ by the inverse of within-study variance varðlogðOR 1=ni11 þ 1=ni10 þ 1=ni01 þ 1=ni00 .5 After fitting the regression line using either unweighted or weighted method, one can plot the summary ROC curve by the two estimated coefficients (i.e. ^ intercept a^ and slope b) h i ^ ^ 1 ^ ^ bÞ Se ¼ 1 þ ea=ð1 ðSp=ð1 SpÞÞð1þbÞ=ð1bÞ

ð1Þ

with Se on the y-axis and 1 Sp on the x-axis. To adjust for study-level covariates Z (e.g. different anatomical sites from which the diagnostic tests were obtained), one can fit a model with ^ i þ cZ ^ i ¼ a^ 0 þ b^0 Di . The ^ i ¼ ða^ þ cZ ^ i Þ þ bD Si ¼ a þ bDi þ cZi . We can then have Si ¼ a^ þ bD 0 0 summary ROC curve can be plotted according to new estimates a^ and b^ given Z.



4


The summary ROC method is easy to perform but suffers limitations. First, its interpretation is known to be problematic. Walter23 discussed the interpretation of area under the curve (AUC). A summary ROC curve located closer to the left upper corner of the ROC space will have a larger AUC, indicating better predictive accuracy of a test.23 However, the conclusion becomes unreliable when comparing tests whose summary ROC curves may cross each other. Alternative statistics, such as the partial AUC24 and the Q point25 also have limited application. Second, the model setting has di Þ, the data are reduced to one outcome measure per some drawbacks. First, because Si ¼ logðOR study: diagnostic odds ratio. Independent summaries of sensitivity and specificity are not available, which could be important in test evaluation. Also, the model is restricted in that the between-study heterogeneity can only be adjusted by study-level covariates, such that some components of the variance might not be explained. This is the reason why both Moses et al.22 and Irwig et al.26 recommended the unweighted least squares rather than the weighted, as in a fixed effect model, a few large studies may dominate the result if the between-study variation is present. In addition, in practice, study characteristics besides the cut-point effect contribute to the trade-off between sensitivity and specificity within a study,22,27 which are not incorporated in the summary ROC curves. Finally, an arbitrary continuity correction is needed to handle zero cells. Moses22 showed that continuity correction can push the summary ROC curve far from the ideal upper left corner of the ROC space, giving biased results. Moreover, there is a long-standing debate on what arbitrary number should be added to handle zero cells.28,29

2.2

A bivariate approach based on LLMs

To improve upon the summary ROC method, Reitsma11 proposed a bivariate LMM. The model proceeds as follows. First, logit transforms of the sensitivity and specificity are applied to each study. Different from the summary ROC method, they are considered by allowing variation as random 2 according to normal distributions, that is, logitðSei Þ N 0 , and logitðSpi Þ N 0 , 2 . A bivariate normaldistribution sensitivity and specificity can include possible correlation between 2 P logitðSei Þ 0 P within the study: N , , where ¼ and denotes the logitðSpi Þ 0 2 covariance between logit sensitivity and specificity. Second, to account for the sampling the 1 estimated logit sensitivity and specificity are 0 variation, bi log 2it Se log 2itðSei Þ A N assumed to be normally distributed as @ , C for study i, where log 2itðSpi Þ i ci log 2it Sp bi ¼ n1 þ n1 and Ci is a diagonal matrix with components of var logit Se i10 i11 1 1 c c c ci , and var logit Spi ¼ ni01 þ ni00 . Note that, generally speaking, ni1þ Sei , ni1þ 1 Sei , ni0þ Sp ci should be greater than or equal to five for normal approximation to be valid. ni0þ 1 Sp b i and logit Sp ci are assumed to have the following bivariate normal distribution: Consequently, logit Se 1 bi logit Se 0 @ A N , þ Ci 0 ci logit Sp 0

ð2Þ

Because the distributions of sensitivity and specificity are often skewed, one may prefer inference based on the medians rather than means as overall diagnostic test performance summaries. Based on parameter estimates, the median sensitivity and specificity can be back-transformed as



Ma et al.

5

cM ¼ logit1 ð^0 Þ. Similarly, confidence intervals for Se bM and Sp cM can be bM ¼ logit1 ð^ 0 Þ and Sp Se transformed from the confidence intervals of ^ 0 and ^0 . The correlation between sensitivity ^ bM ¼ and specificity can be estimated as . The standard errors (SEs) are SE Se ^ ^ SEðÞ SEðÞ cM ¼ ^ and SE Sp ^ based on the delta method. A summary ROC bM þ1= 1SbpM 1=Sb 1=Sp eM þ1= 1Sb eM curve can be constructed by logitðSeÞ ¼ ^ 0 þ

^ ðlogitðSpÞ ^0 Þ ^ 2

ð3Þ

In general, this approach is superior to the summary ROC model by analyzing sensitivity and specificity jointly in a bivariate LMM (BLMM). However, the bivariate approach estimates the degree of correlation between sensitivity and specificity, as well as both within- and between-study variation in the two indices separately. A drawback of this approach is that an ad hoc continuity correction is required in the presence of zero cells, as with the summary ROC approach. In addition, the normal approximation is sometimes violated in practice.12 The bivariate model can adjust for covariates by regression model for the mean vector of the bivariate normal distribution: 0 þ Zi P logitðSei Þ N , , where Zi is the study-level covariate and , are the logitðSpi Þ 0 þ Zi corresponding coefficient parameters.5 Adjusting for individual level covariates is also straightforward.

2.3

The hierarchical summary ROC approach

Rutter and Gatsonis3 proposed a hierarchical summary ROC approach, which is a simplification of 0 0 the ordinal regression model by Tosteson and Begg:30 g j ðxÞ ¼ j a x eb x , where gð:Þ is a link function, j ðxÞ is the probability of a response being in one of the ordered categories given covariates x, j is the cutoff values of each category, a is the location parameters and b is the scale parameter. The hierarchical summary ROC approach reduces the ordinal regression model to two categories (j ¼ 1, 2), x indicates true disease status (coded as 0.5 for Dþ and 0.5 for D) and j ðxÞ correspond to positive test rates: Sei and 1 Spi (false positive rate).3 The first stage of this model assumes binomial distributions of the number of positive outcomes in the ith study, i.e. ni11 Binðni1þ , Sei Þ and ni01 Binðni0þ , 1 Spi Þ. Choose gð:Þ to be a logit link, the model is written as logitðSei Þ ¼ ði þ 0:5i Þe0:5 ,

logitð1 Spi Þ ¼ ði 0:5i Þe0:5

ð4Þ

where the latter is the same as logitðSpi Þ ¼ ði 0:5i Þe0:5 . The positivity criterion i models the tradeoff between sensitivity and specificity in each study. Direct interpretations of the accuracy parameters i are that when ¼ 0, i ¼ logitðSei Þ þ logitðSpi Þ ¼ logðDORi Þ, which is independent of i . In the second stage, Rutter and Gatsonis3 allow i and ito varyacross studies. 2 Thus, i and i i 0 0 are assumed independently and normally distributed as: N , . i 0 0 2 A summary ROC curve can be derived based on solving functions in (4) as logitðSei Þ ¼ i e =2 þ e logitð1 Spi Þ



6


Another possible construction of a summary ROC curve pointed out by Chu and Guo13 is based on the bivariate normal distribution of i and i as 0:25^ 2 ^ 2 h i ^ ^ 0:5 ^ ^ ^ logitðSeÞ ¼ e0:5 0:5^ 0 þ ^0 þ e logit ð Sp Þ e 0:5 ð5Þ 0 0 0:25^2 þ ^2 In addition, Arends et al.10 discussed several choices of SROC curves. Median sensitivity n h io1 b M ¼ 1 þ exp ^0 þ 0:5^ 0 e0:5 ^ cM ¼ and specificity estimates are Se and Sp n h io1 ^ 1 þ exp ^0 0:5^ 0 e0:5 . Also, similar as the previous models, the hierarchical 2 summary i 0 þ Zi 0 ROC approach can incorporate study-level covariates by N , . i 0 þ Zi 0 2 The hierarchical summary ROC approach incorporates both within- and between-study variability and the correlation between the summary statistics by random effects i and i . Because sparse data are common in meta-analysis of diagnostic tests, an important advantage over the previous models is that the hierarchical summary ROC approach avoids the continuity correction by assuming the exact binomial distributions.3 A practical limitation of this model is that originally it was fitted using Bayesian Markov chain Monte Carlo approach implemented in BUGS, which requires some programming expertise. This approach is found to be the same as the following bivariate GLMM with alternative parameterizations in some situations.

2.4

The bivariate GLMM

Chu and Cole12 presented a bivariate GLMM to jointly analyze sensitivity and specificity using logit link. Later, the bivariate GLMM was broadened to a general link function.31 The model starts with binomial distribution assumptions and applies link functions on the probability parameters: ni11 Binðni1þ , Sei Þ, ni00 Binðni0þ , Spi Þ, gðSei Þ ¼ 0 þ i , gðSpi Þ ¼ 0 þ i ð6Þ i i are random effects following bivariate normal distribution where i and i 2

0 N , , and g(.) is a link function such as the logit, probit or complimentary 0

2 log–log link. Different link functions can be applied to sensitivity and specificity. Though to date the logit link is the most widely used in meta-analysis, Chu et al.31 argued that, for some meta-analyses, the choice of the link may affect model fit and inference. The parameters "2 , 2 , 2 estimate the between-study variances and " , " , explain possible correlations. b M ¼ logit1 ð^ 0 Þ and Sp cM ¼ logit1 ð^0 Þ. Similarly, The model gives median estimates as Se bM and Sp cM can be transformed from the confidence intervals of ^ 0 and confidence intervals for Se ^0 . Study-level covariate Z can be included as gðSei Þ ¼ 0 þ i þ Zi and gðSpi Þ ¼ 0 þ i þ Zi , where , are corresponding coefficient parameters. Different covariates could be used ^ for sensitivity and specificity. A regression line of gðSeÞ on gðSpÞ, gðSeÞ ¼ ^ 0 þ ^ ^ ½gðSpÞ ^0 , gives the summary ROC curve by transforming to the ROC space. Also, alternative choices of the regression lines can construct different summary ROC curves with corresponding interpretations.10 In addition to estimating the heterogeneity and correlation parameters, both hierarchical summary ROC and bivariate GLMM approaches have advantages over the bivariate LMM. First, GLMM not require the normal approximation to estimate the bivariate does ci and var logit Sp ci . Second, neither of the two approaches requires a continuity var logit Se correction because direct calculation of study-specific sensitivities and specificities is not involved.



Ma et al.

7

In the absence of study-level covariates, the two approaches are equivalent (with alternative parameterizations).6 Both hierarchical summary ROC and bivariate GLMM can be fitted using maximum likelihood. Several numerical methods might be used, for instance, the dual quasi-Newton optimization techniques, as implemented in the SAS NLMIXED procedure. The SEs and confidence intervals for parameters are estimated by the delta method and are reported automatically if specified in the ESTIMATE statement. To restrict the correlation coefficient in the range [1, 1] in the bivariate GLMM, one can use Fisher’s z transformation of . AUC for both hierarchical summary ROC and bivariate GLMM can be computed by numerical integration implemented in an SAS macro, which is available upon request from the first author.

2.5

The trivariate GLMM

The above approaches involving only sensitivities and specificities work best if all or the majority of the studies use case-control designs. When disease prevalence estimation is allowed, as in cohort study designs, we can derive other clinically interesting indices such as positive and negative predictive values. In this case, the test performance indices Se and Sp can be correlated with the prevalence, which is commonly termed ‘spectrum bias’.32 Such dependence is particularly of concern when the binary diagnostic outcome is based on a cutoff point on a continuous trait, thus misclassification rates could be higher among subjects with true value near the cut-point.33 To account for this potential dependence, Chu et al.17 extended the bivariate GLMM to a trivariate GLMM jointly modeling the disease prevalence, sensitivity and specificity. Recently, Li and Fine34 proposed a Pearson-type correlation coefficient to assess this dependence by an estimating equation-based regression framework. Here, we consider a trivariate GLMM based on the parameterization of i , Sei and Spi , where i is the disease prevalence in the ith study. The first level of this model assumes binomial distributions: ni1þ Binðniþþ , i Þ, ni11 Binðni1þ , Sei Þ, ni00 Binðni0þ , Spi Þ ð7Þ The parameters are modeled via link functions: gði Þ ¼ "0 þ "i , gðSei Þ ¼ 0 þ i and gðSpi Þ ¼ 0 þ i , see Table 2 a two two table accounting for disease prevalence. To consider heterogeneity and potential correlations of the three parameters, "i , i and i are assumed to be random effects with trivariate normal distribution: 0 1 00 1 1 0 2 1 " " " " " "i 0 @ i A N@@ 0 A, A where ¼ @ 2

A 0 i v2 Table 2. 2 2 table for the ith study accounting for disease prevalence. Reference test Diagnostic test Positive (þ) Negative (–) Total

Positive (þ)

Negative (–)

Total

ni11 i Sei ni10 i ð1 Sei Þ ni1þ i

ni01 ð1 i Þð1 Spi Þ ni00 ð1 i ÞðSpi Þ ni0þ 1 i

niþþ 1



8


The parameters "2 , 2 , 2 capture the between-study variance of the disease prevalence, sensitivity and specificity while " , " , represent correlations. Standard software such as SAS NLMIXED can maximize this likelihood. To avoid including unnecessary parameters, model selection criteria such as Akaike information criterion (AIC) can be b M ¼ g1 ð^ 0 Þ and Sp cM ¼ g1 ð^0 Þ. In this model, used. The medians are derived as ^ M ¼ g1 ð"^0 Þ, Se covariates can be incorporated for sensitivities, specificities and disease prevalence as was done for the bivariate GLMM.

3 Statistical methods when the reference test is not a gold standard Limited meta-analysis tools are available when the reference test is imperfect. Walter et al.18 discussed the latent class model for a meta-analysis of two diagnostic tests. Sadatsafavi et al.20 presented a latent class random effects model. However, beyond prevalence, only one of the sensitivity and specificity can be implemented as a random effect. Dendukuri et al.21 presented a Bayesian approach, which is an extension of the hierarchical summary ROC model, to adjust for different reference standards. We describe the latent class random effects model by Chu et al.19 using random effects to allow variation and correlation in sensitivity, specificity and prevalence between studies. Let ðSeBi , SpBi Þ be the pair of diagnostic accuracy parameters for the reference test while ðSeAi , SpAi Þ be the pair for the diagnostic test of interest. To construct the 2 2 table (Table 3) for such studies, both the above pairs of statistics and the disease prevalence are needed. The four counts in Table 3 follow a multinomial distribution, with the log-likelihood being: log L ¼

X

ni11 logðpi11 Þ þ ni10 logðpi10 Þ þ ni01 logðpi01 Þ þ ni00 logðpi00 Þ

ð8Þ

i

Chu et al.19 used random effects to model between- and within-study heterogeneity and potential correlations. We write this model in a form suitable for a general link function: gði Þ ¼ "0 þ "i ; gðSeAi Þ ¼ A0 þ Ai ; gðSpAi Þ ¼ A0 þ vAi ; gðSeBi Þ ¼ B0 þ Bi ; gðSpBi Þ ¼ B0 þ vBi

Table 3. 2 2 table when the reference test is not a gold standard. Reference test Positive (þ) ni11

Negative (–) ni01

Total

ni10 pi10 ¼ i ð1 SeAi ÞSeBi þ ð1 i ÞSpAi ð1 SpBi Þ ni1þ pi1þ ¼ i SeBi þ ð1 i Þð1 SpBi Þ

ni00 pi00 ¼ i ð1 SeAi Þð1 SeBi Þ þ ð1 i ÞSpAi SpBi ni0þ niþþ pi0þ ¼ i ð1 SeBi Þ þ ð1 i ÞSpBi 1

Diagnostic þ pi11 ¼ i SeAi SeBi þ ð1 i Þð1 SpAi Þð1 SpBi Þ pi01 ¼ i SeAi ð1 SeBi Þ þ ð1 i Þð1 SpAi ÞSpBi test Total

SeAi and SpAi are the sensitivity and specificity for the diagnostic test; SeBi and SpBi are sensitivity and specificity for the reference test. i is the disease prevalence in the ith study.



Ma et al.

9 0

where random effects follow a multivariate normal distribution: ð"i , Ai , Ai , Bi , Bi Þ Nð0, DÞ with variance–covariance matrix 1 0 2 " "A " A

"A " A

"B " B

"B " B B 2 A

A A A A A B A B A B A B C C B B 2A

A B A B A B A B C D¼B C C B 2 B

B B B B A @ 2B Median estimates of prevalence, sensitivities and specificities can be constructed as b AM ¼ g1 ð^ A0 Þ, Sp cAM ¼ g1 ðÂ0 Þ, Se bBM ¼ g1 ð^ B0 Þ and Sp cBM ¼ g1 ð^B0 Þ b M ¼ g1 ð"^0 Þ, Se b Variance and correlation parameter estimates can be derived from D. Covariates Zi can be adjusted by linear regressions for the mean vectors, for instance, gði Þ ¼ "0 þ "i þ Zi This latent class random effects model fills a gap in the existing models for meta-analysis with imperfect reference tests. This model can be used to evaluate the performance of both the diagnostic test of interest and the reference test while retaining all the advantages of the GLMMs. A limitation applies when fitting this model by SAS NLMIXED. One may encounter convergence problems because of the limited number of studies and relatively large number of parameters. Possible simplification of model assumptions may include letting disease prevalence be independent of sensitivities and specificities. Also, to avoid including unnecessary random effects whose variance approaches zero, one can apply a forward selection based on AIC. We will illustrate this process in Section 4.2 with an example.

4 Case study 4.1 A meta-analysis of rotator cuff tears diagnosis using ultrasound 4.1.1 Study background We demonstrate an application of the methods in Section 2 using data on ultrasound diagnosis of rotator cuff tears. Rotator cuff tears are a common reason for shoulder pain, which is the third most common musculoskeletal complaint. The incidence of partial rotator cuff tears is reported to be 13–32% in cadaveric studies, yet much of this incidence goes undiagnosed.35 Among the diagnostic tests for this disease, ultrasound is non-invasive and less expensive. However, it has lower sensitivity and specificity in detecting the disease than MRI or arthroscopic evaluation.36 We will re-analyze the data from a meta-analysis of 30 studies of diagnostic accuracy of ultrasound for rotator cuff tears in adults, performed by Smith.37 The studies compared the accuracy of ultrasound with either arthroscopic or open surgical findings as a gold standard test. The data are presented in Appendix 1. Figures 1 and 2 present the forest plots of sensitivity and specificity, respectively. In the rest of this section, we explore this example using the models discussed in Section 2. The corresponding SAS code can be found in Appendix 2 (B1–B6). 4.1.2 Summary ROC method Applying the summary ROC method, we analyze the data first by unweighted least squares, then by weighted least squares. The unweighted method gives estimates a^ ¼ 3.39, b^ ¼ 0.131 and AUC ¼ 0.911. The AUC can be interpreted as a likelihood of 91.11% that a randomly selected diseased subject will receive a more suspicious rating than a non-diseased subject. The weighted method gives estimates a^ w ¼ 3.573, b^w ¼ 0.400 and AUCw ¼ 0.910. To build the summary ROC



10

Statistical Methods in Medical Research 0(0) Se estimates 95%CI lower bound upper bound study 1.000 0.9184 0.985 Al−Shawi,2008 1.000 0.0250 1.000 Alasaarela,1998 0.651 0.2440 0.440 Brenneke and Morgan,1992 0.953 0.4920 0.786 Cullen,2007 0.692 0.2153 0.444 Ferrari,2002 0.932 0.0676 0.500 Friedman,1993 0.953 0.8477 0.910 Hedtmann and Fett,1995 0.991 0.7650 0.929 Iannotti,2005 0.932 0.0676 0.500 Kang,2009 0.889 0.6530 0.788 Kayser,2005 0.981 0.5455 0.846 Labanauskaite,2002 0.874 0.4891 0.708 Milosavljevic,2005 1.000 0.3976 1.000 Naqvi,2009 0.749 0.1922 0.462 Read et al,1998 0.963 0.2904 0.714 Roberts et al,2001 1.000 0.6306 1.000 Rutten et al,2010 0.728 0.2720 0.500 Takagishi,1996 0.882 0.3838 0.667 Teefey,2000 0.983 0.5954 0.867 Teefey,2005 0.998 0.6805 0.933 van Holsbeeck et al,1995 0.987 0.8246 0.936 Vlychou et al,2009 0.991 0.8747 0.955 Wiener and Seitz,1993 0.997 0.5550 0.900 Yen et al,2004 0.909 0.746 0.827 Summary 0.2

0.4

0.6

0.8

1

Figure 1. Forest plot for sensitivity in rotator cuff tears study.

curve, we plug in a^ and b^ ( a^ w and b^w ) into equation (1) then plot Se against 1Sp. The summary ROC curves are presented in Figure 3. 4.1.3 Bivariate LMM To fit the BLMM, we use the SAS MIXED procedure. The bivariate LMM method can provide summary estimates of sensitivity and specificity other than the summary ROC curve. Parameter estimates are: ^ ¼ 1.351, ^ ¼ 1.853, ^ 2 ¼ 1.040, ^ 2 ¼ 0.399 and ^ ¼ 0:116. The sensitivity and b M ¼ 0:794 and Sp specificity are estimated as Se cM¼ 0:865. Correlation estimate is ^ ¼ 0:18. The b ¼ 0:043 and SE Sp c ¼ 0:023. Plugging in the SEs can be calculated by delta method: SE Se estimates into equation (3), one can draw the summary ROC curve as presented in Figure 3. This model gives an AUC of 0.858. With the estimated medians, SEs and correlation coefficients, one can draw confidence and prediction regions around the median estimates. Compared with the summary ROC method, the bivariate LMM can provide summary estimates of overall sensitivity and specificity and their confidence regions. It may be more intuitive for investigators to compare different diagnostic tests. 4.1.4 Hierarchical summary ROC model The hierarchical summary ROC model is fitted using the SAS NLMIXED procedure. Estimates of the parameters are: ^0 ¼ 0:738, ^ ¼ 0:708, ^ 0 ¼ 3:887, ^ ¼ 1:045 and ^ ¼ 0:522. The median bM ¼ 0:827 with SE 0.042 and Sp cM ¼ 0:888 with SE 0.021. To draw sensitivity and specificity are Se



Ma et al.

11

study Sp estimates 95%CI lower bound upper bound Al−Shawi,2008 0.844 0.7436 0.917 Alasaarela,1998 1.000 0.8235 1.000 0.849 0.7241 0.933 Brenneke and Morgan,1992 0.7196 0.989 Cullen,2007 0.913 0.999 Ferrari,2002 0.962 0.8036 Friedman,1993 0.500 0.0126 0.987 0.0126 0.987 Hedtmann and Fett,1995 0.500 Iannotti,2005 0.696 0.4708 0.868 0.1871 0.813 Kang,2009 0.500 Kayser,2005 0.914 0.8648 0.950 0.4281 0.945 Labanauskaite,2002 0.750 1.000 Milosavljevic,2005 1.000 0.5407 0.846 0.5455 0.981 Naqvi,2009 0.999 0.8224 Read et al,1998 0.966 1.000 Roberts et al,2001 1.000 0.5904 0.814 0.4903 Rutten et al,2010 0.667 0.955 Takagishi,1996 0.891 0.7875 0.968 Teefey,2000 0.850 0.6211 0.929 0.8271 0.980 Teefey,2005 0.940 0.8345 0.987 van Holsbeeck et al,1995 0.3999 0.972 0.778 Vlychou et al,2009 0.985 Wiener and Seitz,1993 0.947 0.8690 0.997 Yen et al,2004 0.900 0.5550 0.928 0.847 Summary 0.888 0.2

0.4

0.6

0.8

1

Figure 2. Forest plot for specificity in rotator cuff tears study.

the summary ROC curve, plug in the estimates into the expected logit sensitivity given specificity as in equation (5), then transform to ROC space, as presented in Figure 3. The AUC is 0.908. 4.1.5 Bivariate GLMM method The bivariate GLMM models are fitted using the SAS NLMIXED procedure under three link functions: logit, probit and complementary log–log. The ‘estimate’ statements in the NLMIXED procedure can transform the parameter estimates to median sensitivity and specificity and carry out the estimation of SEs via delta method. Table 4 reports summary indices with SEs. When dependence is assumed in the model, the three links give comparable summary estimates. The logit link provides the smallest AIC (214.8), and thus selected as the best fitted model. However, the negative correlation estimate has a large SE. In fact, if one fits a logit link GLMM assuming independence, the AIC (213.5) is slightly smaller than the correlated model. This example does not strongly support correlation between sensitivity and specificity. To summarize estimates from bivariate models, we compare the bivariate LMM method, hierarchical summary ROC model and GLMM model using logit link. The summary ROC curves and confidence and prediction ellipses of these models are presented in Figure 3. Hierarchical summary ROC and GLMM models achieve same sensitivity and specificity median estimates and SEs, which agrees with the argument by Harbord et al.6 that the two models are the same with different parameterizations. The bivariate LMM model has lower estimates of sensitivity and specificity. The differences may be due to the continuity correction applied in bivariate LMM




0.9

1.0

(b)

0.5

Sensitivity

0.4

0.5 0.3

0.3

0.4

Sensitivity

0.6

0.6

0.7

0.7

0.8

0.8

0.9

(a)

1.0

12

0.2 0.1

observed data BLMM_median summary point BLMM_summary ROC curve BLMM_confidence region BLMM_predictive region unweighted summary ROC curve

0.0

0.0

0.1

0.2

observed data median summary point summary ROC curve confidence region predictive region

0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0

0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0

1−Specificity

1−Specificity

Figure 3. Summary median estimates and ROC curves from some of the introduced models. Panel A presents summary median Se and Sp estimates with confidence and predictive regions and summary ROC curve from the bivariate GLMM using logit link. Panel B presents summary median Se and Sp estimates with confidence and predictive regions and summary ROC curve from the BLMM and the summary ROC curve from the unweighted summary ROC method. ROC: receiver operating characteristic; GLMM: generalized linear mixed model; BLMM: bivariate linear mixed model.

Table 4. GLMM method estimates and SE. Model

Sensitivity (SE)

Specificity (SE)

q (SE)

AUC (SE)

rl (SE)

rt (SE)

AIC

2logL

logit link

0.827 (0.042) 0.826 (0.042) 0.817 (0.042) 0.801 (0.045)

0.888 (0.021) 0.887 (0.021) 0.885 (0.021) 0.882 (0.021)

0.298 (0.330) 0

0.908 (0.049) 0.902 (0.026) 0.915 (0.051) 0.925 (0.041)

1.143 (0.243) 1.138 (0.241) 0.636 (0.132) 0.560 (0.119)

0.679 (0.208) 0.678 (0.206) 0.359 (0.111) 0.284 (0.090)

214.8 213.5

204.8 205.5

215.2

205.2

215.9

205.9

logit linkindependence probit link c-log-log link

0.312 (0.325) 0.329 (0.321)

GLMM: generalized linear mixed model; SE: standard error; AUC: area under the summary receiver operating characteristics curve; AIC: Akaike information criterion. The boldfaced cells represent the best chosen model based on AIC. The ‘logit link-independence’ model uses logit link and assumes independence between sensitivity and specificity while the other models assume dependence.



Ma et al.

13

and some degrees of approximation involved in the MIXED procedure when study size is small.6 A simulation study from Chu and Cole12 demonstrated that the GLMM method provides unbiased estimates while the bivariate LMM model has biased estimates of SeM , SpM and . 4.1.6 Trivariate GLMM When the prevalence of disease is involved as in a trivariate model, case-control studies need to be excluded. All our studies included satisfy the 1st criterion in the QUADAS checklist which requires random selection of the sample.37 To successfully capture possible correlations without including unnecessary correlations, we fit models with all possible correlation combinations. The parameters and desired estimates, AIC and log-likelihoods are summarized in Table 5. The best model with the smallest AIC of 2653.8 is model I with no correlations (boldfaced estimates in Table 5). This suggests no correlations among disease prevalence, sensitivity and specificity in this example. This conclusion agrees with the bivariate GLMM and the estimated median sensitivity and specificity are similar to the estimates from bivariate GLMM method using logit link in Section 4.1.5. This example shows that, when the prevalence is weakly correlated with sensitivity and specificity, the bivariate GLMM gives very similar estimates to that from the trivariate GLMM.

Table 5. Trivariate model parameter estimates and SEs. Disease Test prevalence prevalence Model Se (SE) Sp (SE) (SE) (SE) rl (SE) rm (SE) re (SE) ql (SE) qe (SE) qe (SE) AIC I II III IV V VI VII VIII

0.826 (0.041) 0.827 (0.041) 0.826 (0.042) 0.822 (0.042) 0.823 (0.042) 0.828 (0.042) 0.823 (0.042) 0.823 (0.042)

0.887 (0.021) 0.888 (0.021) 0.877 (0.026) 0.887 (0.021) 0.879 (0.026) 0.880 (0.025) 0.888 (0.021) 0.881 (0.025)

0.448 (0.075) 0.448 (0.075) 0.448 (0.076) 0.447 (0.075) 0.447 (0.075) 0.448 (0.076) 0.447 (0.075) 0.447 (0.075)

0.433 (0.058) 0.433 (0.059) 0.438 (0.062) 0.430 (0.060) 0.435 (0.062) 0.437 (0.062) 0.429 (0.061) 0.434 (0.064)

1.123 (0.236) 1.130 (0.238) 1.138 (0.241) 1.109 (0.237) 1.117 (0.240) 1.146 (0.244) 1.114 (0.238) 1.117 (0.239)

0.671 (0.203) 0.673 (0.205) 0.715 (0.235) 0.678 (0.206) 0.702 (0.229) 0.697 (0.223) 0.681 (0.208) 0.703 (0.226)

1.387 (0.260) 1.387 (0.260) 1.396 (0.263) 1.383 (0.260) 1.381 (0.260) 1.396 (0.263) 1.384 (0.259) 1.387 (0.261)

0

2logL

0

0

2653.8 2641.8

0.311 0 (0.328) 0 0

0

2655.0 2641

0

0.401 2654.5 2640.5 (0.471) 0 2654.5 2640.5

0.176 (0.256) 0 0.122 0.328 2656.2 (0.270) (0.547) 0.247 0 -0.325 2655.9 (0.340) (0.516) 0.284 0.165 0 2655.9 (0.329) (0.256) 0.298 0.175 0.343 2657.5 (0.323) (0.256) (0.502)

2640.2 2639.9 2639.9 2639.5

SE: standard error; AIC: Akaike information criterion. Models I–VIII are trivariate GLMM with all possible combinations of correlation parameters. Model I assumes all correlation coefficient , " and " equal to 0. Models II–IV assume only one of the correlation coefficients is not 0: 6¼ 0, " 6¼ 0 and " 6¼ 0, respectively. Models V–VII assume two of the correlation coefficients are not 0: " 6¼ 0 and " 6¼ 0,

6¼ 0 and " 6¼ 0, 6¼ 0 and " 6¼ 0, respectively. Model VII assumes none of , " and " is 0.



14


Table 6. Pap test example – model selection procedure. Models

Random effects

2logL

AIC

BIC

I IIa IIb IIc IId IIe IIIa IIIb IIIc IIId IVa IVb IVc IVd IVe IVf

NA e A B A B e and lA e and lB " and vA " and vB ", A and B ", A and A ", A and B ", A , B and A B e, lA , A and qlA mA ", A , B and A B

45,277 41,882 43,329 43,398 43,520 42,838 40,510 40,888 40,894 40,520 39,777 39,762 40,506 39,777 39,752 40,503

45,287 41,894 43,341 43,410 43,532 42,850 40,524 40,902 40,908 40,534 39,793 39,778 40,522 39,795 39,770 40,521

45,297 41,906 43,353 43,423 43,544 42,863 40,539 40,917 40,922 40,548 39,810 39,795 40,539 39,814 39,789 40,540

AIC: Akaike information criterion; BIC: Bayesian information criterion. Models in levels I–IV include random effects and possible correlations denoted in the corresponding ‘random effects’ column. The procedure starts from the fixed effects model I. In level II, five possible random effects are added one at a time. Model IIa with random effect "(prevalence) has smallest AIC, thus "is carried to models in level III. The same process continued until level IV because model fitting became unstable with more random effects than level IV and AIC was not significantly reduced anymore. The boldfaced estimates represents the best model with smallest AIC in each level.

4.2

A meta-analysis of cervical cancer diagnosis using Pap smears test

In this section, we re-visit the example used by Walter et al.18 and apply the latent class random effects models. The data are collected from a meta-analysis of Papanicolaou (Pap) smears test accuracy by Fahey et al.38 The Pap smear is a quick, non-invasive and relatively inexpensive test for cervical cancer.38 Fahey et al.38 analysis consists of 59 cross-sectional studies using Pap smears as the diagnostic test and histology as the gold standard. However, Walter et al.’s18 model argued that the histology test has sensitivity of 0.97 and specificity of 0.62, revealing lack of a perfect gold standard. Hence, we will treat histology as an imperfect reference test then fit the data by the latent class random effects models in Section 3. The data are listed in Appendix 1 (Table A2) and corresponding SAS code is included in Appendix 2 (B7). When fitting the model using the SAS NLMIXED procedure, convergence problems appeared as more random effects were added. Thus we assume prevalence to be independent of sensitivities and specificities for ease of fitting and apply a forward-selection procedure to select random effects. We begin with a fixed effects model, and add random effects sequentially. The process of selection is outlined in Table 6. The final model obtained is IVe, in which random effects are considered for the disease prevalence, Pap smear test sensitivity and specificity and the specificity of the histology test. The parameter estimates of the best fitted models at each step are provided in Table 7. After adjustment for possible variation and correlations by random effects in our method, the final model IVe shows a low sensitivity for the Pap smears of 0.655 (SE ¼ 0.042) and a specificity of 0.835 (SE ¼ 0.032). However, the histology test outperforms the Pap smears with sensitivity of 0.903



Ma et al.

15

Table 7. Pap test example – fitted estimates and SEs. Model parameter estimates (SE) b Se_pap test Se AM b AM Sp_pap testa Sp b BM Se_reference Se b BM Sp_reference Sp Prevalence b M " A A B B

A A

I

IIa

IIIa

IVe

0.815(1.420)

0.750(0.006)

0.664(0.043)

0.655(0.042)

0.810(1.531)

0.795(0.010)

0.822(0.010)

0.835(0.032)

0.842(1.418)

0.858(0.010)

0.829(0.009)

0.903(0.013)

0.803(1.629)

0.900(0.009)

0.977(0.012)

0.989(0.014)

0.527(1.708) NA NA NA NA NA NA

0.588(0.061) 1.819(0.195) NA NA NA NA NA

0.712(0.050) 1.727(0.194) 1.367(0.147) NA NA NA NA

0.636(0.048) 1.467(0.164) 1.292(0.136) 1.269(0.164) NA NA 0.509(0.136)

SE: standard error. Models I, IIa, IIIa and IVe are the same models specified in Table 6.

(SE ¼ 0.013) and specificity of 0.989 (SE ¼ 0.014). Moreover, our estimates of the histology test differ from the estimates in Walter et al.’s18, suggesting a somewhat different interpretation in practice.

5 Discussion In this article, we have discussed methods for evaluating the performance of diagnostic tests for situations when the reference test can be considered a gold standard, as well as situations when it is error-prone. Under the scenario with a gold standard, we reviewed the traditional summary ROC method, bivariate LMM and the hierarchical summary ROC model. Then we focused on the random effect GLMM, because it has several advantages over the simpler methods. We showed how the bivariate GLMM can be fitted using a variety of link functions including logit, probit and complementary log–log, and extended the approach to a trivariate GLMM to jointly model prevalence, sensitivity and specificity. Under the situation with no gold standard, we built upon the latent class model proposed by Walter et al.18 by adding random effects to quantify possible correlation and variation following the methods by Chu et al.19 We worked through two empirical examples to illustrate the application of our models. We used the SAS MIXED and NLMIXED procedures to fit all models, and provide SAS code with detailed explanation in Appendix 2. The SAS macro METADAS may assist in automating the fitting of bivariate and hierarchical summary ROC models for meta-analysis of diagnostic tests.39 Several extensive simulation studies have been conducted in the literature to compare different methods. Hamza et al.40 studied the univariate exact binomial likelihood approach against the univariate approximate normal likelihood approach in different simulation settings. The size of meta-analysis varied from 10 to 100 studies and the true median sensitivity values ranged from 0.6 to 0.93. Overall, the simulations showed that the exact likelihood approach performs superior than the approximate approach in terms of bias and coverage probabilities. Riley et al.41 compared



16


the bivariate random-effects meta-analysis dealing with dependence between two outcomes to the univariate random-effects meta-analysis. Simulation studies showed that the bivariate approach has smaller mean-square error and is recommended over the univariate approach. Chu and Cole12 conducted simulations to study the bivariate GLMM and the BLMM approaches. Size of metaanalysis varied from 25 to 250, and Se/Sp was either relatively low (0.7/0.8) or relatively high (0.9/ 0.95). The bivariate GLMM was shown to yield unbiased estimates of Se, Sp and their correlation, while the BLMM gave biased results. Another paper of Chu et al.31 conducted simulations to compare different links used in bivariate GLMM with 40 meta-studies, 200 subjects in each study and median Se/Sp as 0.8/0.9. It suggested that the AUC and median Se/Sp estimates are relatively robust to the choice of link functions. The trivariate GLMM and bivariate GLMM were compared in Chu et al.17 under different correlation assumptions. The results suggested that misspecification resulting from AIC-based model selection is reasonably low in studied settings. When the reference test is imperfect, Chu et al.19 used different selection criteria deviance information criterion, AIC and Bayesian information criterion on selecting the appropriate random effects. The simulation results recommended including random effects because omitting important variability can cause inflated variance and decreased coverage. Among the models presented, the summary ROC approach is simple and widely used. However, it is limited as it does not assess the within- and between-study variations and possible correlations between Se and Sp. The bivariate LMM improves over the summary ROC approach by assuming random effects to explain both within- and between-study variations and possible correlations. The bivariate LMM can provide inferences both in terms of summary ROC curves and summary statistics of overall test performance. However, it has limitations due to the use of a continuity correction and a normal approximation. The GLMMs do not have the limitations of the above models because they assume exact binomial distributions. The bivariate GLMM, which is essentially the same as the hierarchical summary ROC model in certain situations, is recommended when research interests focus on sensitivity and specificity and there is strong suggestion of independence with disease prevalence. The trivariate GLMM will be most appropriate when there is interest in estimating positive or negative predictive values, because estimation of disease prevalence is required and correlation among prevalence and Se, Sp should not be ignored. Besides, the trivariate GLMM is most reliable when most of the studies are cohorts. When the reference test is not a gold standard, the latent class random effects model should be used to avoid biased estimates. A limitation related to the GLMMs is that the meta-analysis reported often includes a mixture of case-control and cohort studies designs. Thus using either the bivariate or the trivariate GLMM for all the studies can lead to problems. Another issue arises when fitting the trivariate GLMM and the latent class random effects models in the SAS NLMIXED procedure. The more random effects included, the longer it takes to converge. Under such situations, one can first get raw estimates of the desired parameters by fitting the data in models with fewer random effects. The raw estimates can then be used as starting values to improve convergence in a more complex model. For the latent class random effects model, one may need to apply simpler assumptions for ease of fitting. For instance, our example assumes independence between prevalence and the paired indices. However, as discussed, dependence between the indices may be expected. In the example of rotator cuff tears, we excluded seven studies having the partial verification problem to avoid biased results, though these studies might still be able to contribute to our analysis. To the best of our knowledge, multivariate methods to correct publication bias in a meta-analysis of diagnostic test settings still await for further development. A recent Bayesian approach to correct such bias by de Groot et al.42 may be applied to diagnostic tests with nominal outcomes.



Ma et al.

17

In summary, sensitivity analysis methods for meta-analysis of diagnostic tests investigating the impact of publication bias through a selection or pattern mixture model framework are yet to be developed. Funding This research was supported in part by the US Department of Health and Human Services Agency for Healthcare Research and Quality Grant R03HS020666, P01CA142538 and P30CA77598 from the US National Cancer Institute, and R21AI103012 from the US National Institute of Allergy and Infectious Diseases.

Conflict of interest None declared.

References 1. Zhou XH, Obuchowski NA and McClish DK. Statistical methods in diagnostic medicine. New York, NY: Wiley, 2002. 2. Pepe MS. The statistical evaluation of medical tests for classification and prediction. Oxford: Oxford University Press, 2003. 3. Rutter CA and Gatsonis CA. A hierarchical regression approach to meta-analysis of diagnostic test accuracy evaluations. Stat Med 2001; 20: 2865–2884. 4. Song F, Khan KS, Dinnes J, et al. Asymmetric funnel plots and publication bias in meta-analyses of diagnostic accuracy. Int J Epidemiol 2002; 31: 88–95. 5. van Houwelingen HC. Advanced methods in metaanalysis: multivariate approach and meta-regression. Stat Med 2002; 21: 589–624. 6. Harbord RM, Deeks JJ, Egger M, et al. A unification of models for meta-analysis of diagnostic accuracy studies. Biostatistics 2007; 8: 239–251. 7. Macaskill P. Empirical Bayes estimates generated in a hierarchical summary ROC analysis agreed closely with those of a full Bayesian analysis. J Clin Epidemiol 2004; 57: 925–932. 8. Mallett S, Deeks JJ, Halligan S, et al. Systematic reviews of diagnostic tests in cancer: review of methods and reporting. BMJ 2006; 333: 413. 9. Zwinderman AH and Bossuyt PM. We should not pool diagnostic likelihood ratios in systematic reviews. Stat Med 2008; 27: 687–697. 10. Arends LR, Hamza TH, van Houwelingen JC, et al. Bivariate random effects meta-analysis of ROC curves. Med Decis Making 2008; 28: 621–638. 11. Reitsma JB. Bivariate analysis of sensitivity and specificity produces informative summary measures in diagnostic reviews. J Clin Epidemiol 2005; 58: 982–990. 12. Chu H and Cole SR. Bivariate meta-analysis of sensitivity and specificity with sparse data: a generalized linear mixed model approach. J Clin Epidemiol 2006; 59: 1331–1332. 13. Chu H and Guo H. Letter to the editor: a unification of models for meta-analysis of diagnostic accuracy studies. Biostatistics 2009; 10: 201–203. 14. Riley R, Abrams K, Sutton A, et al. Bivariate randomeffects meta-analysis and the estimation of between-study correlation. BMC Med Res Methodol 2007; 7: 3. 15. Riley RD, Abrams KR, Lambert PC, et al. An evaluation of bivariate random-effects meta-analysis for the joint synthesis of two correlated outcomes. Stat Med 2007; 26: 78–97.

16. Hamza TH, van Houwelingen HC and Stijnen T. The binomial distribution of meta-analysis was preferred to model within-study variability. J Clin Epidemiol 2008; 61: 41–51. 17. Chu H, Nie L, Cole SR, et al. Meta-analysis of diagnostic accuracy studies accounting for disease prevalence: alternative parameterizations and model selection. StatMed 2009; 28: 2384–2399. 18. Walter SD, Irwig L and Glasziou PP. Meta-analysis of diagnostic tests with imperfect reference standards. J Clin Epidemiol 1999; 52: 943–951. 19. Chu H, Chen SN and Louis TA. Random effects models in a meta-analysis of the accuracy of two diagnostic tests without a gold standard. J Am Stat Assoc 2009; 104: 512–523. 20. Sadatsafavi M, Shahidi N, Marra F, et al. A statistical method was used for the meta-analysis of tests for latent TB in the absence of a gold standard, combining randomeffect and latent-class methods to estimate test accuracy. J Clin Epidemiol 2010; 63: 257–269. 21. Dendukuri N, Schiller I, Joseph L, et al. Bayesian metaanalysis of the accuracy of a test for tuberculous pleuritis in the absence of a gold standard reference. Biometrics 2012; 68: 1285–1293. 22. Moses LE, Shapiro D and Littenberg B. Combining independent studies of a diagnostic-test into a summary roc curve – data-analytic approaches and some additional considerations. Stat Med 1993; 12: 1293–1316. 23. Walter SD. Properties of the summary receiver operating characteristic (SROC) curve for diagnostic test data. Stat Med 2002; 21: 1237–1256. 24. Walter SD. The partial area under the summary ROC curve. Stat Med 2005; 24: 2025–2040. 25. Sutton AJ, Cooper NJ, Goodacre S, et al. Integration of meta-analysis and economic decision modeling for evaluating diagnostic tests. Med Decis Making 2008; 28: 650–667. 26. Irwig L, Tosteson ANA, Gatsonis C, et al. Guidelines for meta-analyses evaluating diagnostic tests. Ann Internal Med 1994; 120: 667–676. 27. Deeks JJ. Systematic reviews in health care: systematic reviews of evaluations of diagnostic and screening tests. Br Med J 2001; 323: 157–162. 28. Sweeting MJ, Sutton AJ and Lambert PC. What to add to nothing? Use and avoidance of continuity corrections in meta-analysis of sparse data. Stat Med 2004; 23: 1351–1375.



18


29. Starmer CF, Grizzle JE and Sen PK. Some reasons for not using the Yates continuity corrections in meta-analysis of sparse data. J Am Stat Assoc 1974; 69: 376–378. 30. Tosteson AN and Begg CB. A general regression methodology for ROC curve estimation. Med Decis Making 1988; 8: 204–215. 31. Chu H, Guo H and Zhou Y. Bivariate Random effects meta-analysis of diagnostic studies using generalized linear mixed models. Med Decis Making 2010; 30: 499–508. 32. Ransohoff DF and Feinstein AR. Problems of spectrum and bias in evaluating the efficacy of diagnostic tests. N Engl J Med 1978; 299: 926–930. 33. Brenner H and Gefeller O. Variation of sensitivity, specificity, likelihood ratios and predictive values with disease prevalence. Stat Med 1997; 16: 981–991. 34. Li J and Fine JP. Assessing the dependence of sensitivity and specificity on prevalence in meta-analysis. Biostatistics 2011; 12(4): 710–722. 35. Shin KM. Partial-thickness rotator cuff tears. Korean J Pain 2011; 24: 69–73. 36. Singisetti K. Shoulder ultrasonography versus arthroscopy for the detection of rotator cuff tears: analysis of errors. J Orthop Surg (Hong Kong) 2011; 19(1): 76–79.

37. Smith TO, Back T, Tomas AP and Hing CB. Diagnostic accuracy of ultrasound for rotator cuff tears in adults: a systematic review and meta-analysis. Clin Radiol 2011; 66(11): 1036–1048. 38. Fahey MT, Irwig L and Macaskill P. Meta-analysis of Pap test accuracy. Am J Epidemiol 1995; 141: 680–689. 39. Takwoingi Y and Deeks JJ. METADAS: an SAS macro for meta-analysis of diagnostic accuracy studies. 2011. 40. Hamza TH, van Houwelingen HC and Stijnen T. The binomial distribution of meta-analysis was preferred to model within-study variability. J Clin Epidemiol 2008; 61: 41–51. 41. Riley RD, Thompson JR and Abrams KR. An alternative model for bivariate random-effects meta-analysis when the within-study correlations are unknown. Biostatistics (Oxford, England) 2008; 9: 172–186. 42. de Groot JA, Dendukuri N, Janssen KJ, et al. Adjusting for partial verification or workup bias in meta-analyses of diagnostic accuracy studies. Am J Epidemiol 2012; 175: 847–853.

Appendix 1: Data for case studies

Table A1. Partial rotator cuff tears meta-analysis data. Study

Year

True positive

False positive

False negative

True negative

Al-Shawi Alasaarela Brenneke and Morgan Cullen Ferrari Friedman Hedtmann and Fett Iannotti Kang Kayser Labanauskaite Milosavljevic Naqvi Read et al. Roberts et al. Rutten et al. Takagishi Teefey Teefey van Holsbeeck et al. Vlychou et al. Wiener and Seitz Yen et al.

2008 1998 1992 2007 2002 1993 1995 2005 2009 2005 2002 2005 2009 1998 2001 2010 1996 2000 2005 1995 2009 1993 2004

65 1 11 11 8 2 121 26 2 41 11 17 4 6 5 8 10 10 13 14 44 64 9

12 0 8 2 1 0 0 7 5 16 3 0 2 1 0 12 7 3 4 3 2 4 1

1 0 14 3 10 2 12 2 2 11 2 7 0 7 2 0 10 5 2 1 3 3 1

65 19 45 21 25 0 0 16 5 171 9 6 11 28 7 24 57 17 52 47 7 71 9



Ma et al.

19

Table A2. Pap smear test meta-analysis data. Study

Typea

ni11

ni10

ni00

ni01

Alloub et al. (1989) Alons-van Kordelaar and Boon (1988) Anderson et al. (1992) Anderson et al. (1988) Anderson et al. (1992) Andrews et al. (1989) August (1991) Bigrigg et al. (1990) Bolger and Lewis (1988) Byme et al. (1988) Chomet (1987) Engineer and Misra (1987) Fletcher et al. (1990) Frisch et al. (1990) Giles et al. (1988) Giles et al. (1989) Gunderson et al. (1988) Haddad et al. (1988) Hellberg et al. (1987) Helmerhorst et al. (1987) Hirschowitz et al. (1992) Jones DED et al. (1987) Jones MH et al. (1992) Kashimura et al. (1991) Kealy (1986) Koonlng-1 et al. (1992) Koonlng-2 et al. (1992) Kwikkel et al. (1986) Lozowski et al. (1982) Maggi et al. (1989) Morrison EAB et al. (1992) Morrison BW et al. (1998) Nyirjesy (1972) Okagaki and Zelterman (1991) Oyer and Hanjanl (1986) Parker (1986) Pearlstone et al. (1992) Ramiez et al. (1990) Reid et al. (1991) Robertson et al. (1988) Schauberger et al. (1991) Shaw (1986) Singh et al. (1985) Skehan et al. (1990) Smith et al. (1987) Soost et al. (1991)

SC SC SC FU FU FU FU SC SC FU SC SC FU SC SC FU SC SC SC FU FU FU FU SC FU FU FU FU FU FU FU SC SC SC FU SC FU SC SC FU SC FU FU FU FU SC

8 31 70 65 20 35 39 567 26 38 45 71 4 2 5 38 4 87 15 41 76 10 28 3 79 61 62 284 66 40 11 23 65 1270 223 154 6 7 12 348 8 12 95 40 71 1205

23 43 121 6 19 20 111 140 12 17 15 10 36 3 3 7 16 12 65 61 11 48 28 5 13 27 16 68 20 12 1 10 42 263 74 20 12 3 11 212 11 6 2 20 20 454

84 14 25 6 4 156 271 157 18 37 48 306 5 21 182 62 31 9 15 29 12 174 77 1 182 35 49 68 44 47 2 44 13 1085 83 237 81 4 60 103 34 0 1 19 18 241

3 3 12 10 3 92 7 117 37 28 35 87 0 2 9 21 2 13 3 1 12 4 11 0 26 20 20 31 25 43 1 50 13 927 22 30 2 4 5 41 4 2 9 18 13 186 (continued)



20


Table A2. Continued Study

Typea

Soutter-1 et al. (1986) Soutter-2 et al. (1986) Spitzer et al. (1989) Stafl (1981) Syrjanen et al. (1987) Szarewski (1991) Tait et al. (1988) Tawa et al. (1988) Tay et al. (1987) Upadhyay et al. (1984) Walker et al. (1986) Wetrich (1986) Wheelock and Kamlnlski (1989)

SC SC FU SC FU SC SC SC FU SC FU FU FU

ni11 5 35 10 3 118 13 38 14 12 238 111 491 49

ni10 52 12 5 3 44 82 13 67 6 2 20 250 39

ni00 27 12 32 15 183 17 62 291 12 16 39 702 31

ni01 20 9 31 5 40 3 14 25 14 52 44 164 16

a

Type of study denotes the usage of the test clinically, SC as screening and FU as follow up.

Appendix 2: SAS codes for fitting models

B1 Unweighted summary ROC: data partial1; set partial; if tp ¼ 0 or fp ¼ 0 or fn ¼ 0 or tn ¼ 0 then do; tp ¼ tpþ0.5; fp ¼ fpþ0.5; fn ¼ fnþ0.5; tn ¼ tnþ0.5; n0 ¼ n0þ1; n1 ¼ n1þ1; end; se ¼ tp/n1; sp ¼ tn/n0; logitse ¼ log(se/(1se)); var_logitse ¼ 1/(se*(1se)*n1); logitsp ¼ log(sp/(1sp)); var_logitsp ¼ 1/(sp*(1sp)*n0); D ¼ logitseþlogitsp; S ¼ logitselogitsp; proc reg data ¼ partial1; model D ¼ S; run;

/*‘tp’ stands for true positive, ‘fp’ for false positive, ‘fn’ for false negative, ‘tn’ for true negative */ /*continuity correction on zero cells*/

/*calculate Se and Sp for each study*/ /* logit(Se) and logit(Sp) and their variances*/ /* D and S*/ /*fit linear regression model D ¼ aþbS*/

B2 Weighted summary ROC: data partial2; set partial1; w ¼ 1/(1/tpþ1/fpþ1/tnþ1/fn); proc reg data ¼ partial2; model D ¼ S; weight w; run;

/*calculate the weight for each study*/ /*fit weighted regression using the created weights*/


repeated / group ¼ rec; parms / parmsdata ¼ cov hold ¼ 4 to 63; run;

random dis non_dis / subject ¼ id type ¼ un s;

data partial3; set partial1; id ¼ _n_; dis ¼ 1; non_dis ¼ 0; logit ¼ logitse; var_logit ¼ var_logitse; recþ1; output; dis ¼ 0; non_dis ¼ 1; logit ¼ logitsp; var_logit ¼ var_logitsp; recþ1; output; run; data cov; if _n_ eq 1 then do; est ¼ 0; output; est ¼ 0; output; est ¼ 0; output; end; set partial3; est ¼ var_logit; output; keep est; run; proc mixed data ¼ partial3 method ¼ reml cl; class id; model logit ¼ dis non_dis / noint s cl covb df ¼ 1000, 1000;

B3 SAS MIXED procedure to fit bivariate LMM:

/*choose the residual (restricted) method(reml), ‘cl’ asks for confidence limits for covariance parameter estimates.*/ /*indicator variables ‘dis’ and ‘non_dis’ are explanatory variables for logit(Se) and logit(Sp). ‘covb’ asks for covariance matrix of fixed effects parameters. Large ‘df’’ approximate a t distribution to a normal distribution.*/ /* random effects corresponds to disease and non_disease status. An unstructured working covariance structure is stated to assume possible correlation of ‘dis’ and ‘non_dis’ within the same study.*/ /* ‘group ¼ rec’ statement specifies the with-in study-arm variance in each study.*/ /*‘parmsdata’ option reads in variable ‘est’ from the cov data. 60 within study-arm variances are kept constant.*/

/* build the data containing variable ‘est’ with 3 starting values for the covariance parameters of the random effects and 60 within study arm variances.*/

/* make each study have two observations, one for sensitivity, the other for specificity.*/


Ma et al. 21



random mu nv normal([0, 0], [exp(2*sigse), rho*exp(sigse)*exp(sigsp), exp(2*sigsp)]) subject ¼ id; estimate ‘‘Se’’ exp(mu0)/(1þexp(mu0)); estimate ‘‘Sp’’ exp(nu0)/(1þexp(nu0)); run;

proc nlmixed data ¼ partial4 fd cov corr df ¼ 1000 gtol ¼ 1e11; parms mu0 ¼ 1.5 nu0 ¼ 2.2 fz ¼ 0.23 sigse ¼ 0.37 sigsp ¼ 0.26; rho ¼ (exp(2*fz)1)/(1þexp(2*fz)); if Se ¼ 1 then beta ¼ mu0þmu; if Se ¼ 0 then beta ¼ nu0þnv; pred ¼ exp(beta)/(1þexp(beta)); model nybinomial(n, pred);

/* ‘tp’ and ‘tn’ are binomially distributed condition on random effects ‘mu’ and ‘nv’.*/ /*random effects ‘mu’ and ‘nv’ are bivariate normally distributed; ‘subject ¼ id’ indicates possible correlation of random effects within a study*/

/* ‘fd’ specifies that all derivatives be computed using finite difference approximations./ /* use fisher’s z transformation instead of the correlation coefficient r directly to ensure 1 1*/

/* assign starting values for parameters.*/ /* code ‘logitp’ as logit(Se) and logit(Sp) */ /*logit transform is applied to the probabilities*/ /* number of tp and tn are Bin(n, p) distributed */ /* ‘ut’ and ‘ua’ are random effects clustered within study. Independence is assumed between random effects. Exponential formed variance is to ensure positivity.*/ /*use estimate statement to get estimates of desired indices with standard errors. */

/* make each study has two records */ /*code ‘dis’ as 0.5 and ‘se’ ¼ 1 for disease patients, ‘dis’ as 0.5 and ‘se’ ¼ 0 for non-disease subjects*/

22

B5. SAS NLMIXED procedure to fit bivariate GLMM with logit link:

estimate ‘‘se’’ 1/(1þexp(((thetaþ0.5*alpha)*exp(0.5*beta)))); estimate ‘‘sp’’ 1/(1þexp((theta0.5*alpha)*exp(0.5*beta))); estimate ‘‘sigtheta’’ exp(sigtheta); estimate ‘‘sigalpha’’ exp(sigalpha); run;

data partial4; set partial; id ¼ _n_; dis ¼ 0.5; ny ¼ tp; n ¼ n1; se ¼ 1; output; dis ¼ 0.5; ny ¼ tn; n ¼ n0; se ¼ 0; output; keep id ny n dis se; run; proc nlmixed data ¼ partial4; parms theta ¼ 1 alpha ¼ 4 beta ¼ 0.6 sigtheta ¼ 0.7 sigalpha ¼ 1.7; logitp ¼ 2*dis*((thetaþut)þ(alphaþua)*dis)*exp(beta*dis); p ¼ exp(logitp)/(1þexp(logitp)); model nybinomial(n,p); random ut ua normal([0,0],[exp(2*sigtheta),0,exp(2*sigalpha)]) subject ¼ id;

B4 SAS NLMIXED procedure to fit the hierarchical summary ROC model:



proc nlmixed data ¼ partial fd df ¼ 1000 gtol ¼ 1e10; parms mu0 ¼ 0 nv0 ¼ 3 eta0 ¼ 1 sigse ¼ 0 sigsp ¼ 0 sigpi ¼ 1; logitsei ¼ mu0 þ mu; logitspi ¼ nv0 þ nv; logitpi ¼ eta0 þ eta; Sei ¼ 1/(1þexp(logitsei)); Spi ¼ 1/(1þexp(logitspi)); pi ¼ 1/(1þexp(logitpi)); logL ¼ tp * (log(pi) þ log(Sei)) þ fp * (log(1pi) þ log(1Spi)) þ fn * (log(pi) þ log(1Sei)) þ tn * (log(1pi) þ log(Spi)); model Y general(logL); random mu nv etanormal([0, 0, 0], [exp(2*sigse), 0, exp(2*sigsp), 0, 0, exp(2*sigpi)]) subject ¼ id; estimate ‘‘sigse’’ exp(sigse); estimate ‘‘sigsp’’ exp(sigsp); estimate ‘‘sigpi’’ exp(sigpi); estimate ‘‘Se’’ 1/(1þexp(mu0)); estimate ‘‘Sp’’ 1/(1þexp(nv0)); run;

B6. SAS NLMIXED procedure to fit trivariate GLMM:

/*log-likelihood for trivariate model*/ /*specify general log-likelihood function. Any variable can be used as the dependent variable in this situation.*/ /*‘mu’, ‘nv’ and ‘eta’ are the random effects corresponding to Se, Sp and prevalence. Possible correlation could exist within studies. The best model is achieved when all the correlation coefficients among the random effects ‘mu’, ‘nv’, ‘eta’ are zero.*/

/*model prevalence (‘pi’) together with Se and Sp*/

/* model I is the best fitted model with smallest AIC*/


Ma et al. 23


proc nlmixed data ¼ walter1999 cov fd; parms mua0 ¼ 0.6 nva0 ¼ 1.6 mub0 ¼ 2 nvb0 ¼ 4.5 eta0 ¼ 0.6 sigpi ¼ 0.4 sigmua ¼ 0.3 signva ¼ 0.2 sigmub ¼ 0 fz2 ¼ 0.5; SeA ¼ exp(mua0þmua)/(1þexp(mua0þmua)); SpA ¼ exp(nva0þnva)/(1þexp(nva0þnva)); SeB ¼ exp(mub0)/(1þexp(mub0)); SpB ¼ exp(nvb0)/(1þexp(nvb0)); pi ¼ exp(eta0þeta)/(1þexp(eta0þeta)); rho2 ¼ (exp(2*fz2)1)/(exp(2*fz2)þ1); p11 ¼ pi*SeA*SeBþ(1pi)*(1SpA)*(1SpB); p01 ¼ pi*SeA*(1SeB)þ(1pi)*(1SpA)*SpB; p10 ¼ pi*(1SeA)*SeBþ(1pi)*SpA*(1SpB); p00 ¼ pi*(1SeA)*(1SeB)þ(1pi)*SpA*SpB; logl ¼ n11*log(p11)þn01*log(p01)þn10*log(p10)þn00*log(p00); model y general(logl); random eta mua nvanormal([0,0,0],[exp(2*sigpi), 0, exp(2*sigmua), 0, rho2*exp(sigmua)*exp(signva),exp(2*signva)]) subject ¼ id; estimate "SeA’’ exp(mua0)/(1þexp(mua0)) estimate "SpA’’ exp(nva0)/(1þexp(nva0)); estimate "pi’’ exp(eta0)/(1þexp(eta0)); estimate ‘‘rhomuanva’’ (exp(2*fz2)1)/(exp(2*fz2)þ1); estimate "SeB’’ exp(mub0)/(1þexp(mub0)); estimate "SpB’’ exp(nvb0)/(1þexp(nvb0)); run;

/*log likelihood*/ /*the best model selected evolves four random effects. model IVe assumes only correlation between ‘mua’ and ‘nva’, i.e. sensitivity and specificity of the Pap sme test are correlated.*/

/*fisher’s z transformation; model V with correlation only between Se and Sp of the Pap smear test*/ /*expected probabilities in 2*2 table*/

/* five parameters are modeled: ‘SeA’ and ‘SpA’ for the Pap smear test, ‘SeB’ and ‘SpB’ for the histology test and ‘pi’ for disease prevalence*/

B7. SAS NLMIXED procedure to fit the latent class random effect model IVe for Pap smear test:


24 Statistical Methods in Medical Research 0(0)