Psychol. Inj. and Law (2015) 8:46–63 DOI 10.1007/s12207-015-9216-4

Advanced Administration and Interpretation of Multiple Validity Tests

Anthony P. Odland & Andrew B. Lammy & Phillip K. Martin & Christopher L. Grote & Wiley Mittenberg

Received: 12 December 2014 / Accepted: 27 January 2015 / Published online: 17 February 2015
© Springer Science+Business Media New York 2015

Abstract Professional organizations recommend inclusion of multiple performance validity tests (PVTs) and symptom validity tests (SVTs) (Bush et al., Archives of Clinical Neuropsychology, 20, 419–426, 2005; Heilbronner et al., The Clinical Neuropsychologist, 23(7), 1093–1129, 2009). However, to date, empirically driven recommendations for interpretation of multiple validity indicators are largely absent, and the generalizability of available psychometric data to clinical practice is questionable. The current aim is to provide base rate data and recommendations for interpretation of multiple validity indicators, assuming varying correlations between each PVT at a range of specificity and sensitivity rates. As an initial step, Monte Carlo methodology was validated across 24 embedded and stand-alone validity indicators in seven extant noncompensation-seeking clinical samples. Samples comprised patients with psychotic and nonpsychotic psychiatric disorders, as well as different neurological conditions. Strategies are outlined for clinical integration of base rate data for advanced administration and interpretation of multiple validity indicators. In light of the current findings, recommendations are provided to reduce false-positive rates associated with making determinations regarding noncredible test performance.

Keywords Malingering · Symptom validity · Performance validity · Base rate · Symptom exaggeration

A. P. Odland (*) · A. B. Lammy · C. L. Grote
Department of Behavioral Sciences, Rush University Medical Center, Chicago, IL, USA
e-mail: [email protected]

A. P. Odland
Sanford Neuroscience Clinic, Sanford Health, Fargo, ND, USA

A. B. Lammy
Air Force Institute of Technology, Wright-Patterson Air Force Base, OH, USA

P. K. Martin
Department of Psychiatry and Behavioral Sciences, University of Kansas Medical Center, Wichita, KS, USA

W. Mittenberg
Nova Southeastern University Center for Psychological Studies, Fort Lauderdale, FL, USA

Interest in performance validity tests (PVTs) and symptom validity tests (SVTs) has expanded significantly over the last several decades (for purposes of the present discussion, PVT and SVT are used interchangeably) (Bush et al., 2005; Heilbronner et al., 2009). Almost 25 % of articles published in both The Clinical Neuropsychologist and Archives of Clinical Neuropsychology from 2011 to 2013 related to effort and validity testing, an increase relative to the research published by these journals across the preceding 15 years (Fig. 1). Increased research interest has been accompanied by an ever-increasing number of stand-alone and embedded validity measures. By these authors' count, more than 50 embedded and stand-alone validity tests are listed in the most recent version of Neuropsychological Assessment (Lezak, Howieson, Bigler, & Tranel, 2012). As many as 28 embedded validity indicators can now be interpreted in a comprehensive neuropsychological battery consisting of the Wechsler Adult Intelligence Scales (WAIS)-IV, WMS-IV, Trail Making Test, Wisconsin Card Sorting Test, Finger Tapping, Grip Strength, CPT-II, RCFT, Controlled Oral Word Association Test (COWAT), and Animal Fluency (see Table 1 for a nonexhaustive list of selected cognitive measures). This estimate does not include stand-alone validity indicators, such as the Victoria Symptom Validity Test (VSVT) and the Test of Memory Malingering (TOMM) (Slick, 1996; Tombaugh, 1996).

[Fig. 1 Proportion of published articles across two journals of clinical neuropsychology (Archives, TCN, and their average) in which the primary focus is effort, validity testing, malingering, or individual validity indicators, plotted by publication year from 1999 to 2013 (proportions range from 0 to 40 %). Note: Archives = Archives of Clinical Neuropsychology; TCN = The Clinical Neuropsychologist. Articles relating to addresses, conference information, introductions, book reviews, test reviews, single case studies, editorial boards, grand rounds, commentary, and position papers were excluded from this analysis.]

Similar to the development of embedded validity indicators within tests of cognitive ability, novel validity indicators exist for already established freestanding validity instruments. For example, in addition to the original two TOMM validity scores (Trial 2 and retention raw score; Tombaugh, 1996), four additional indicators have been created: Trial 1 (O'Bryant, Engel, Kleiner, Vasterling, & Black, 2007), the Albany Consistency Index (Gunner, Miele, Lynch, & McCaffrey, 2012), the TOMMe10 (Denning, 2012), and the Invalid Forgetting Frequency Index (Buddin et al., 2014). If the TOMM were included within the aforementioned battery of cognitive tests, as many as 34 validity indicators would be available for interpretation. Dozens of validity indicators can thus be interpreted even in an evaluation containing few to no freestanding measures of performance validity. While this illustration may seem extreme, it underestimates the number of available indicators when measures of psychological functioning are also interpreted (e.g., the MMPI-2, MMPI-2-RF, or PAI; Ben-Porath & Tellegen, 2008/2011; Butcher, Graham, Ben-Porath, Tellegen, & Dahlstrom, 2001; Morey, 2007).

Unfortunately, increased interest in symptom validity assessment is not synonymous with the development of well-validated measures. Reported sensitivity (true-positive) and specificity (true-negative) rates may be inflated by a combination of sample-specific characteristics and small sample sizes. This issue may be particularly relevant to the validation of certain embedded PVTs. For example, of the 41 studies meeting inclusion criteria for a meta-analysis (Sollman & Berry, 2011), only two contained either an "honest" or "feigning" criterion group with a sample size larger than 100. Cross-validation on large samples (e.g., n = 300) of noncompensation-seeking individuals with varied medical/psychiatric diagnoses might yield higher false-positive rates than those originally reported in smaller samples.

Small sample sizes both lower the likelihood that significant findings represent a true effect and exaggerate actual effects when they exist (Button et al., 2013). Cross-validation of embedded PVTs using no-incentive samples selected on the basis of passing other freestanding PVTs (e.g., Victor, Boone, Serpa, Buehler, & Ziegler, 2009) is also insufficient. This procedure biases the sample, because individuals providing a credible test performance who happened to fail PVTs for reasons unrelated to incentive are excluded from the final no-incentive sample (Bilder, Sugar, & Hellemann, 2014). The resulting systematic bias is likely to inflate specificity through artificial range restriction and may produce apparently well-validated PVTs that, in clinical practice, misclassify larger than expected numbers of persons providing credible test performances.

It also becomes increasingly important that the psychometric, probabilistic, and clinical implications of multiple PVT interpretation are understood. In response to the continued need to advance interpretation of multiple PVTs, a primary aim of this study is to provide a set of flexible, empirically driven interpretive guidelines to assist with in vivo administration and interpretation of multiple PVTs. Interpretation of multiple PVTs and SVTs may heighten sensitivity (true-positive rate (TPR)) and reduce the overall false-positive rate (FPR), as well as allow clinicians to comment on both the validity of individual tests and the validity of an entire evaluation (Boone, 2007; Larrabee, 2008).

Table 1 Examples of embedded validity indicators of selected cognitive measures

Cognitive measure | Embedded validity indicator | Validated by
WAIS | RDS; Alternative RDS; Enhanced RDS; DS-ACSS; LDF-1; LDF-2; Vocabulary–Digit Span | Greiffenstein, Baker, and Gola (1994); Reese, Suhr, and Riddle (2012); Reese et al. (2012); Babikian, Boone, Lu, and Arnold (2006); Babikian et al. (2006); Babikian et al. (2006); Mittenberg, Fichera, Zielinski, and Heilbronner (1995)
WMS-IV | LM recognition; VR recognition; VR recognition; Symbol span | ACS; ACS; ACS; Young, Caron, Baughman, and Sawyer (2012)
CVLT-II | d-prime; Total recall discriminability; Long delay free recall; Forced-choice recognition; Critical item analysis | Wolfe et al. (2010); Root, Robbins, Chang, and Van Gorp (2006)
RCFT | Combination score; Immediate recall; Copy; Recognition | Lu, Boone, Cozolino, and Mitchell (2003); Lu et al. (2003); Whiteside, Wald, and Busse (2011); Whiteside et al. (2011)
Finger tapping | Dominant hand 1st 3 trials; Dominant hand; Combined hand; Actual-combined diff dominant hand | Arnold et al. (2005); Axelrod, Myers, and Davis (2014); Axelrod, Myers, and Davis (2014); Axelrod, Myers, and Davis (2014)
Verbal fluency | COWAT output reductions, easy–hard; Mean cluster size; Drop after 15 s; COWAT FAS overall word generation; Animal word generation | Silverberg, Hanks, Buchanan, Fichtenberg, and Millis (2008); Curtis, Thompson, Greve, and Bianchini (2008); Sugarman and Axelrod (2014)
CPT-II | Omission errors; Hit rate SE; Perseverations; Commission errors | Ord, Boettcher, Greve, and Bianchini (2010); Ord et al. (2010); Lange et al. (2013); Lange et al. (2013)
TMT | TMT A; TMT B; TMT A and B | Powell, Locke, Smigielski, and McCrea (2011); Busse and Whiteside (2012); Iverson, Lange, Green, and Franzen (2002)
WCST | Failure to maintain set | Larrabee (2003)

DS digit span, WAIS Wechsler Adult Intelligence Scales, RDS Reliable Digit Span, LDF-1 Longest Digit Forward 1-Trial, LDF-2 Longest Digit Forward 2-Trial, ACSS age-corrected scaled score, WMS-IV Wechsler Memory Scales—fourth edition, LM Logical Memory, VR Visual Reproduction, CVLT-II California Verbal Learning Test—second edition, RCFT Rey Complex Figure Test, COWAT Controlled Oral Word Association Test, CPT-II Continuous Performance Test—second edition, TMT Trail Making Test, WCST Wisconsin Card Sorting Test
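To give a concrete sense of how one of the most widely cited embedded indicators in Table 1 is derived, the minimal sketch below computes Reliable Digit Span from Digit Span item scores (longest forward span with both trials correct plus longest backward span with both trials correct; Greiffenstein, Baker, & Gola, 1994). The trial data are hypothetical and the function name is ours for illustration, not part of any published scoring software.

```python
def reliable_digit_span(forward_trials: dict, backward_trials: dict) -> int:
    """Reliable Digit Span (RDS): longest forward span with both trials correct
    plus longest backward span with both trials correct.

    Trials are supplied as {span_length: (trial1_correct, trial2_correct)};
    the data below are hypothetical, not taken from the article."""
    def longest_reliable(trials: dict) -> int:
        return max((span for span, (t1, t2) in trials.items() if t1 and t2),
                   default=0)
    return longest_reliable(forward_trials) + longest_reliable(backward_trials)


forward = {3: (True, True), 4: (True, True), 5: (True, False), 6: (False, False)}
backward = {2: (True, True), 3: (True, False), 4: (False, False)}
print(reliable_digit_span(forward, backward))  # 4 + 2 = 6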

Impressive specificity and sensitivity rates have been achieved with discrete numbers of PVT failures for certain combinations of measures (Larrabee, 2007; Victor et al., 2009). However, while an exact number of PVT failures (e.g., failure on two or three PVTs) may be indicative of response bias in certain studies, such a criterion is questionable in clinical practice. In actuality, various numbers and combinations of PVTs may be interpreted, each with unique psychometric properties and in different clinical contexts. The summative effect of these differences is varying degrees of diagnostic accuracy that cannot be accounted for by hard-and-fast interpretive rules (Berthelson, Mulchan, Odland, Miller, & Mittenberg, 2013).

For example, assume interpretation includes a combination of ten embedded and stand-alone PVTs, each with an average 10 % false-positive rate. Application of a fixed interpretive standard, such as failure on two PVTs, results in an overall 25 % false-positive rate. This contrasts with a 5 % aggregate false-positive rate when two out of three PVTs are failed (Berthelson et al., 2013). Advanced interpretation of multiple PVTs should therefore incorporate a flexible framework that considers both the number of PVTs administered and the number of PVT passes/failures. For instance, assuming each PVT has 90 % specificity when interpreted alone, an adequately low false-positive rate (i.e., 5 %) can be maintained at 3/5, 5/10, and 7/15 PVT failures. For a 10 % aggregate false-positive rate or less, 2/3, 3/5, and 4/10 failures are required (Berthelson et al., 2013). This observation is also consistent with the well-established finding that adding tests leads to increased rates of significant scores (Binder, Iverson, & Brooks, 2009; Brooks, Strauss, Sherman, Iverson, & Slick, 2009; Larrabee, 2014). These findings suggest that when multiple PVTs are interpreted, some number of false-positive tests is expected in those actually giving a credible performance.
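To make the arithmetic concrete, the sketch below (Python; not code from the article) estimates the aggregate false-positive rate in two ways: exactly, under the simplifying assumption that the PVTs are statistically independent, and by a small Monte Carlo simulation in which the scores underlying the PVTs share a common pairwise correlation, loosely in the spirit of Berthelson et al. (2013). The correlation value and the simulation details are illustrative assumptions, so the outputs approximate rather than reproduce the published figures.

```python
from math import comb
from statistics import NormalDist

import numpy as np


def independent_fpr(n_tests: int, n_failures: int, fpr: float) -> float:
    """Chance that a credible examinee fails at least n_failures of n_tests
    PVTs when the tests are independent and each has false-positive rate fpr."""
    return sum(comb(n_tests, k) * fpr ** k * (1 - fpr) ** (n_tests - k)
               for k in range(n_failures, n_tests + 1))


def correlated_fpr(n_tests: int, n_failures: int, fpr: float, r: float,
                   n_sims: int = 200_000, seed: int = 0) -> float:
    """Monte Carlo estimate of the same quantity when the scores underlying
    the PVTs share a common pairwise correlation r (multivariate normal model;
    a "failure" is a score below the fpr-th quantile of each test)."""
    rng = np.random.default_rng(seed)
    cov = np.full((n_tests, n_tests), r)
    np.fill_diagonal(cov, 1.0)
    scores = rng.multivariate_normal(np.zeros(n_tests), cov, size=n_sims)
    failures = (scores < NormalDist().inv_cdf(fpr)).sum(axis=1)
    return float((failures >= n_failures).mean())


# Each PVT is assumed to have 90 % specificity (a 10 % false-positive rate);
# r = 0.31 is an assumed average PVT intercorrelation of the kind Berthelson
# et al. (2013) drew from meta-analysis, used here purely for illustration.
print(independent_fpr(10, 2, 0.10))        # ~0.26: >=2 of 10 failed, independent tests
print(correlated_fpr(10, 2, 0.10, 0.31))   # ~0.25: same question, correlated tests
print(independent_fpr(3, 2, 0.10))         # ~0.03: >=2 of 3 failed, independent tests
print(correlated_fpr(3, 2, 0.10, 0.31))    # ~0.05: same question, correlated tests
```

Under these assumptions, a modest positive intercorrelation lowers the aggregate rate slightly for 2/10 failures but raises it for 2/3 failures, which is one reason independence-based shortcuts can mislead in either direction.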


With the exception of the Advanced Clinical Solutions (ACS; Pearson, 2009) effort assessment program and Berthelson et al.'s (2013) recent recommendations, however, empirically derived guidelines for multiple PVT interpretation largely do not exist. Presently, the burden of integrating performances across multiple PVTs rests largely on clinical judgment, a task requiring real-time cost/benefit analysis, including whether additional PVTs should be administered given a certain number of previous failures/passes.

Advanced administration and interpretation of multiple PVTs require understanding aggregate PVT true- and false-positive rates, which is not intuitive and requires knowledge of the interaction between PVT interscale correlations, individual PVT sensitivity/specificity rates, and the number of interpreted tests. Advanced interpretation of multiple PVTs also requires an adaptable framework that accommodates differing numbers and combinations of PVTs. Chained likelihood ratios (LRs; e.g., Larrabee, 2008; Meyers, Axelrod, & Davis, 2014a; Meyers et al., 2014b), discriminant functions (e.g., Victor et al., 2009), latent group analysis (Ortega, Wagenmakers, Lee, Markowitsch, & Piefke, 2012; Ortega, Labrenz, Markowitsch, & Piefke, 2013), and multiple regression (e.g., Wolfe et al., 2010), among other approaches, have shown impressive classification rates. However, as previously mentioned, clinical generalizability may be limited when classification rates are tied to specific combinations of PVTs/SVTs. In actual practice, various numbers and types of PVTs may be administered, and time demands may limit the relatively complex calculations needed to determine classification accuracy. Context-specific variables (e.g., the interrelatedness of a specific combination of PVTs) might also restrict application of interpretive methodologies that require assumptions of normality and/or test independence (Schutte & Axelrod, 2013). For example, binomial methods and chaining of LRs require that combinations of stand-alone and embedded PVTs are independent (see the sketch following this paragraph). However, PVTs are usually correlated, and rational selection of PVTs from different domains does not ensure test independence (Berthelson et al., 2013; Schutte & Axelrod, 2013). When positive and negative intertest correlations are averaged, resulting in an overall low correlation, the assumption of independence appears tenable. However, such averaging masks existing relationships and might inflate both posttest malingering probabilities and clinical confidence.
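As a point of reference for the independence requirement just noted, the minimal sketch below shows how chained LRs combine a prior base rate with the positive likelihood ratios of several failed PVTs, in the spirit of Larrabee (2008). The base rate and likelihood-ratio values are illustrative assumptions rather than figures from the cited studies, and the multiplication step is legitimate only when the PVTs are conditionally independent.

```python
def chained_posterior(prior: float, positive_lrs: list[float]) -> float:
    """Posterior probability of noncredible performance after several failed
    PVTs, obtained by multiplying the prior odds by each test's positive
    likelihood ratio (sensitivity / (1 - specificity)).

    The multiplication assumes the PVTs are conditionally independent, which
    is precisely the assumption questioned in the surrounding text."""
    odds = prior / (1 - prior)
    for lr in positive_lrs:
        odds *= lr
    return odds / (1 + odds)


# Illustrative values only: a 30 % base rate and three failed PVTs whose
# positive likelihood ratios are assumed to be 4, 3, and 5.
print(round(chained_posterior(0.30, [4.0, 3.0, 5.0]), 2))  # ~0.96
```

If two of the failed PVTs largely measure the same construct, their likelihood ratios double-count the same evidence, and the resulting posterior overstates the probability of noncredible performance, which is the inflation of posttest malingering probabilities described above.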


For example, in two separate studies with averaged correlations not significantly different from zero, specific intertest correlations ranged from −0.517 to 0.400 and from −0.042 to 0.478 (Larrabee, 2008; Meyers et al., 2014a; Meyers et al., 2014b). Invariably, some combinations of PVTs/SVTs in these samples were not independent, resulting in inaccurate classification rates. Combinations of embedded and/or stand-alone PVTs cannot be chained using LRs unless every included PVT is independent of every other PVT. Latent group analysis, similar to likelihood ratios, has the advantage of incorporating prior knowledge (i.e., estimated base rates of malingering) to improve diagnostic classification (Ortega et al., 2012; Ortega et al., 2013). However, computationally intensive methods are not always clinically feasible, and selecting appropriate base rates might be even more challenging. Malingering base rates may be inflated due to type I error (Berthelson et al., 2013) and vary by setting, referral source, and diagnosis (Mittenberg, Patton, Canyock, & Condit, 2002). Such statistical techniques also do not allow for real-time evaluation of the overall validity of an ongoing evaluation. While later analysis of PVT performance can improve diagnostic accuracy, it does not accommodate adjustments to a PVT battery in real time (i.e., during the evaluation). For clinicians who tailor assessment batteries in vivo, an ongoing understanding of the validity of an assessment battery serves to enhance both the efficiency and effectiveness of the neuropsychological evaluation.

Berthelson et al. (2013) presented a viable alternative to other available approaches to measuring the classification accuracy of multiple PVTs. The authors used Monte Carlo methodology, a simulation technique that repeatedly samples from a specified model to estimate probabilities with a high degree of precision, to estimate rates of false-positive errors when interpreting multiple PVTs/SVTs. Their model incorporated average PVT intercorrelations from meta-analysis to enhance the generalizability of the findings (Berthelson et al., 2013; Sollman & Berry, 2011). Assuming a 10 % false-positive rate for each SVT, for example, they demonstrated an overall 24 % false-positive rate for failure on 1/3 tests, a 5 % false-positive rate for 2/3 failures, and