Journal of Consulting and Clinical Psychology
1986, Vol. 54, No. 3, 381-385
Copyright 1986 by the American Psychological Association, Inc. 0022-006X/86/$00.75
The Cognitive Therapy Scale: Psychometric Properties

T. Michael Vallis and Brian F. Shaw
Clarke Institute of Psychiatry and University of Toronto

Keith S. Dobson
University of British Columbia
The Cognitive Therapy Scale (CTS; Young & Beck, 1980) was developed to evaluate therapist competence in cognitive therapy for depression. Preliminary data on the psychometric properties of the scale have been encouraging but not without problems. In this article, we present data on the interrater reliability, internal consistency, factor structure, and discriminant validity of the scale. To overcome methodological problems with previous work, expert raters were used, all sessions evaluated were cognitive therapy sessions, and a more adequate statistical design was used. Results indicate that the CTS can be used with only moderate interrater reliability but that aggregation can be used to increase reliability. The CTS is highly homogeneous and provides a relatively undifferentiated assessment of therapist performance, at least when used with therapists following the cognitive therapy protocol. The scale is, however, sensitive to variations in the quality of therapy.
The National Institute of Mental Health (NIMH) Treatment of Depression Collaborative Research Program is a multisite program initiated and sponsored by the Psychosocial Treatment Research Branch, Division of Extramural Research Programs, NIMH, and is funded by cooperative agreements to six participating sites. The principal NIMH collaborators are Irene Elkin, coordinator; John P. Docherty, acting branch chief; and Morris B. Parloff, former branch chief. Tracie Shea, of George Washington University, is associate coordinator. The principal investigators and project coordinators at the three participating research sites are Stuart M. Sotsky and David Glass, George Washington University; Stanley D. Imber and Paul A. Pilkonis, University of Pittsburgh; and John T. Watkins and William Leber, University of Oklahoma. The principal investigators and project coordinators at the three sites responsible for training therapists are Myrna Weissman, Eve Chevron, and Bruce J. Rounsaville, Yale University; Brian F. Shaw and T. Michael Vallis, Clarke Institute of Psychiatry; and Jan A. Fawcett and Phillip Epstein, Rush Presbyterian St. Luke's Medical Centre. Collaborators in the data management and data analysis aspects of the program are C. James Klett, Joseph F. Collins, and Roderic Gillis, of the Perry Point, Maryland, Veterans Administration Cooperative Studies Program. This work was completed as part of the NIMH Treatment of Depression Collaborative Research Program (NIMH Grant MH3823102 awarded to the second author). We would like to thank John Rush, Maria Kovacs, Jeff Young, and Gary Emery, who served as expert raters and consultants. Correspondence concerning this article should be addressed to Brian F. Shaw, Clarke Institute of Psychiatry, 250 College Street, Toronto, Ontario, Canada M5T 1R8.

As psychotherapy research continues, more and more attention is being given to therapist in-session behavior (Elkin, 1984; Schaffer, 1983). DeRubeis, Hollon, Evans, and Bemis (1982) and Luborsky, Woody, McLellan, O'Brien, and Rosenzweig (1982) have developed rating scales to assess therapy-specific therapist behavior, and these scales have been shown to clearly differentiate therapy types (e.g., cognitive therapy and interpersonal therapy). Ensuring that different therapies are unique is an important contribution, but this work does not address how competently a given therapy is implemented (Schaffer, 1983). Competency assessment provides a metric that can facilitate detection of therapist drift (Shaw, 1984) and defines a potentially important predictor variable for understanding the psychotherapy change process.

The Cognitive Therapy Scale (CTS) was developed by Young and Beck (1980) to evaluate therapist competence in implementing the cognitive therapy protocol of Beck, Rush, Shaw, and Emery (1979). The CTS is an observer-rated scale that contains 11 items divided into two subscales on a rational basis. The General Skills subscale is composed of items assessing establishment of an agenda, obtaining feedback, therapist understanding, interpersonal effectiveness, collaboration, and pacing of the session (efficient use of time). The Specific Cognitive Therapy Skills subscale items assess empiricism, focus on key cognitions and behaviors, strategy for change, application of cognitive-behavioral techniques, and quality of homework assigned. All items are rated on Likert-type 7-point scales (0-6; total score range = 0-66). Item scale values are associated with concrete, behavioral descriptors, and a detailed rating manual is available.

The CTS requires varying degrees of inference from raters. Establishing an agenda is a low-inference item. Relevant therapist behaviors include obtaining feedback from the previous session, identifying current issues, and collaborating with the patient to select one or more appropriate targets for the session. These behaviors are specific, but which of them are required to justify a given rating is unclear. At the other extreme, strategy for change is a high-inference item. The scale descriptor for the maximum score states, "Therapist followed a consistent strategy for change that seemed very promising and incorporated the most appropriate [italics added] cognitive-behavioral techniques." The rater must have considerable knowledge to make a judgment about the most appropriate strategy for a given patient at a given point in therapy.

Three recent studies have examined the interrater reliability of the CTS (Dobson, Shaw, & Vallis, 1985; Hollon et al., 1981; Young, Shaw, Beck, & Budenz, 1981). Reliability coefficients
(intraclass correlations) ranged from .54 to .96 in these studies. Methodological problems make these data difficult to interpret, however. All used a problematic statistical design, the Balanced Incomplete Block Design (Kirk, 1968).¹ Further, Young et al. selected tapes that had a restricted range, and not all of the therapists in the Dobson et al. or Hollon et al. studies followed the cognitive therapy protocol. Combining cognitive and noncognitive therapists confounds the ability of the CTS to discriminate cognitive therapy from noncognitive therapy and the ability of the CTS to discriminate between the competency with which different therapists perform cognitive therapy. Only Young et al. examined cognitive therapists exclusively, and they reported a moderate reliability coefficient. An additional concern with the Hollon et al. study is that six of the seven raters were psychology graduate students, who may not have had the requisite experience and knowledge to make highly inferential judgments. If the CTS is to be used as an index of the competency of cognitive therapists, interrater reliability should be established within a homogeneous group of cognitive therapists using experienced raters and a complete statistical design.

¹ With this design, raters do not evaluate all sessions. Instead, raters are paired so that each rater evaluates the same number of sessions as every other rater. Because raters and sessions are not crossed, reliabilities are calculated on the basis of estimated effects.

Fewer data are available on the internal consistency of the CTS. Data from Dobson et al. (1985) suggest that the scale is very homogeneous (α = .95). Similarly, Young et al. (1981), who combined the data from all three of the studies cited above and performed a factor analysis, found only two overlapping factors. Specific cognitive therapy technical skill items loaded on Factor 1, whereas interpersonal skill items loaded moderately on Factor 1 and moderately to highly on Factor 2. Together, these data suggest that the CTS is a highly homogeneous scale. Finally, Hollon et al. (1981) reported preliminary data on the concurrent and discriminant validity of the CTS. The CTS was shown to correlate with the Cognitive Therapy subscale of the Minnesota Therapy Rating Scale, a scale designed to measure adherence and not competence, but not with the Interpersonal Therapy or Pharmacotherapy subscales.

The present article presents extensive data on the CTS. Expert raters, a pure sample of cognitive therapists, and a more adequate statistical design characterized the study. Data on interrater reliability, internal consistency, factor structure, and discriminant validity are presented.

Method

Overview

The data for this paper were derived from the training phase of the cognitive therapy component of the National Institute of Mental Health (NIMH) Treatment of Depression Collaborative Research Program (TDCRP; Elkin, Parloff, Hadley, & Autry, 1985). Nine psychotherapists (Ph.D. or M.D.), three from each of three treatment centers in the United States, were trained in cognitive therapy. Training occurred over an 18-month period, during which time each therapist treated four or five patients. All patients were suffering from unipolar depression and met the Research Diagnostic Criteria for major depressive disorder (Spitzer, Endicott, & Robins, 1978). Patients were excluded if they were psychotic or suffered from bipolar affective disorder, alcoholism, or certain medical problems (see Elkin et al. for complete details). During the training period, each therapist received weekly individual and monthly group supervision from training staff (the authors). Training staff viewed videotapes of every therapy session and completed the CTS.² In addition, five consultants periodically evaluated samples of videotapes. All raters based their ratings on complete (50-min) sessions.

² Ratings were also made on the following variables: overall therapist quality; acceptability of the therapists' performance as adequate for an outcome study; patient difficulty; patient receptivity; and the probability, based on the observed session, that the therapist would become competent as a cognitive therapist and as a cognitive therapy supervisor.

Raters

A total of seven raters were involved in these analyses. All raters (6 Ph.D. and 1 M.D.) were experts in cognitive therapy and had considerable clinical experience as well as experience in training others in cognitive therapy. Three of the seven raters were cognitive therapy trainers in the TDCRP. The remaining four raters were consultants. Due to availability restrictions, five raters were involved in the interrater reliability analysis. All seven raters were involved in the internal consistency, factor analysis, and discriminant validity analyses.

Stimulus Videotapes and Rating Procedures

The various analyses involved different numbers of videotapes, selected by various means; therefore the different samples are described separately.

Interrater reliability. Each of five raters evaluated the same 10 videotapes, selected randomly from a pool of 94 tapes. This pool represented all videotapes received by the cognitive therapy component of the TDCRP over a 6-month period (January through June of 1983). Videotapes were selected so that each of the nine therapists was represented in the sample. A second tape was selected from one therapist so that 10 tapes composed the reliability sample. All raters evaluated the videotapes over a 2-day period. Raters viewed eight tapes in pairs and two tapes alone. Raters were paired so that each rater evaluated two tapes with each other rater. The order of the tapes was counterbalanced across raters.

Internal consistency and factor analysis. On four occasions, TDCRP consultants met to evaluate selected samples of videotapes. Videotapes were randomly selected from all tapes available during the period between consultants' visits so that each therapist who had seen a patient in that period was represented. At two visits, two consecutive sessions were selected from each therapist. Here the first session was randomly selected. This was done so that consultants could base their assessment on more than a single sample of a therapist's behavior. To increase reliability, all videotapes were rated by two raters, and the mean was used in the analyses. Raters were paired with each other so that each rater evaluated approximately the same number of videotapes with every other rater. For those raters involved in the ongoing monitoring of therapists (trainers), tapes were periodically selected at random and independently rated by one other rater. A total of approximately 725 videotapes were available, from which 90 were selected for analyses. To balance for individual rater characteristics, approximately equal numbers of ratings were obtained from each rater.³

³ Four raters viewed 28 videotapes, one viewed 27, one viewed 24, and one viewed 17. Two raters viewed fewer videotapes due to unavailability during some phases of the project (one trainer joined the project after it had begun operation, and one consultant did not attend two site visits).

Discriminant validity. From the same sample used in the internal consistency analysis, 53 tapes were selected. The difference in the number of videotapes selected was due to fewer tapes being available for a rater who joined the project later than the rest. This rater evaluated a subset of tapes, for which acceptable and unacceptable decisions were not made. In addition to making ratings on the CTS form, each rater also evaluated the session overall as to whether or not it was acceptable cognitive therapy. Acceptable sessions were judged as being of sufficient quality to be
a valid representation of cognitive therapy. Unacceptable sessions were those whose content was judged to be unrepresentative of cognitive therapy (regardless of quality) or whose quality was judged to be poor. Ratings of acceptability and unacceptability were made independently of ratings on the CTS. This was possible because all tapes were evaluated by two raters. One rater was randomly selected to provide the acceptable or unacceptable judgment. The CTS rating from the other rater was obtained and used in this analysis. This procedure avoided the problem of confounded ratings when the same individual makes both judgments.

Table 2
Factor Analysis of the Cognitive Therapy Scale

Item                             Factor 1   Factor 2
 1. Agenda                          .13        .81
 2. Feedback                        .72        .39
 3. Understanding                   .91        .15
 4. Interpersonal effectiveness     .79        .15
 5. Collaboration                   .82        .34
 6. Pacing                          .54        .62
 7. Empiricism                      .79        .35
 8. Focus on cognition              .79        .32
 9. Strategy for change             .71        .57
10. Implementation of strategy      .72        .54
11. Homework                        .28        .77
Eigenvalue                         7.13       0.98
% variance                        64.80       8.90

Note. N = 90.
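As a rough sketch of the kind of analysis reported here, the following illustrates a principal-components extraction followed by varimax rotation. The data are synthetic (six made-up variables driven by two latent factors) and the code is purely illustrative, not a reproduction of the study's loadings; numpy is assumed.

```python
import numpy as np

def varimax(loadings, gamma=1.0, max_iter=50, tol=1e-6):
    """Orthogonal varimax rotation (Kaiser's criterion) of a p x k loading matrix."""
    p, k = loadings.shape
    rotation = np.eye(k)
    d = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        u, s, vt = np.linalg.svd(
            loadings.T @ (rotated ** 3
                          - (gamma / p) * rotated @ np.diag(np.diag(rotated.T @ rotated)))
        )
        rotation = u @ vt
        d_old, d = d, s.sum()
        if d_old != 0 and d / d_old < 1 + tol:
            break
    return loadings @ rotation

# Synthetic ratings: six observed variables generated from two latent factors.
rng = np.random.default_rng(0)
n = 300
f1, f2 = rng.normal(size=n), rng.normal(size=n)
X = np.column_stack([f1 + 0.4 * rng.normal(size=n) for _ in range(3)]
                    + [f2 + 0.4 * rng.normal(size=n) for _ in range(3)])

# Principal components of the correlation matrix, sorted by eigenvalue.
corr = np.corrcoef(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(corr)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 2  # retain two components, as in the CTS analysis
unrotated = eigvecs[:, :k] * np.sqrt(eigvals[:k])   # loadings
pct_variance = 100 * eigvals[:k] / eigvals.sum()    # % variance per factor
rotated = varimax(unrotated)
```

Because varimax is an orthogonal rotation, each variable's communality (the sum of its squared loadings across the retained factors) is unchanged by the rotation; only the distribution of loading across factors is simplified.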
Results

Interrater Reliability

The intraclass correlation coefficient (ICC; see Shrout & Fleiss, 1979) was calculated on CTS total scores. A one-way analysis of variance (ANOVA), with tapes (10) as the independent variable and raters (five) as the replication factor, generated the appropriate sums of squares. The ANOVA indicated that the CTS scores for the 10 videotapes differed significantly, F(9, 40) = 8.26, p < .001. Comparison within these 10 tapes resulted in four subgroups. Six tapes clustered in the low range (M = 38.0); one in the low-medium range (M = 45.0); two in the medium-high range (M = 54.0); and one in the high range (M = 60.2). Therefore, restricted variance did not appear to be a problem, as it was for the Young et al. (1981) study (see Lahey, Downey, & Saal, 1983).

The reliability of a single rater for this sample was .59, which was statistically significant, F(9, 40) = 8.23, p < .01, although not as high as that reported by Dobson et al. (1985) or Hollon et al. (1981). Reliability for individual items was low to moderate, ranging from .27 (pacing) to .59 (empiricism). Unreliability in the ICC can be due to minimum variance, a differential pattern of correlations among rater pairs, or low correlation between raters (Lahey et al., 1983). That there was significant variance among the sample videotapes is an argument against a minimal variance interpretation. Examination of the pairwise correlation coefficients between all raters on CTS total scores failed to identify any particular rater as deviant. All raters demonstrated a similar range of correlations with other raters.⁴ Thus, with a group of therapists adhering to the cognitive therapy protocol and rated by experts in cognitive therapy, .59 appeared to accurately estimate the reliability of a single rater.

Increases in reliability can be achieved by increasing the number of raters (Epstein, 1979). For instance, by combining the data for two raters, the interrater reliability coefficient increased to .77. Using the Spearman-Brown prophecy formula, we found the estimated reliability of ratings combined from three raters to be .84.⁵,⁶

⁴ The range of correlations with other raters was .59 to .82 for Rater 1, .44 to .78 for Rater 2, .44 to .84 for Rater 3, .63 to .84 for Rater 4, and .53 to .82 for Rater 5.

⁵ This coefficient had to be estimated because six raters would have been required to directly calculate interrater reliability when three raters were combined.

⁶ Although this study was not designed to evaluate intrarater reliability for the five raters, test-retest data were available from the trainers. Each trainer reevaluated a sample of randomly selected videotapes that were originally rated at least 5 months previously. Reliability coefficients for the CTS total score, based on samples of 17, 11, and 10 videotapes, were .81, .96, and .68, respectively. A second retest reliability study was conducted on one trainer 18 months after the initial study. A test-retest correlation of .77 was found for a sample of 10 videotapes.

Internal Consistency

Item-total correlations were calculated between each item and the total for (a) the General Skills subscale, (b) the Cognitive Therapy Skills subscale, and (c) the overall score (Table 1). Examination of the item-total correlations in Table 1 leads one to question the division of items into subscales. All items correlated moderately to highly with both subscales and with the total score. Although each item correlated highest with its respective subscale, the discrepancy in the magnitude of the correlations with its own and with the other subscale was small. Similarly, the two subscales correlated highly with each other, r(88) = .85, p < .001.

Table 1
Item-Total Correlations for the Cognitive Therapy Scale (CTS)

Subscale and item               General skills   CT skills   CTS total
General skills
  Agenda                             .66            .48         .59
  Feedback                           .83            .74         .81
  Understanding                      .84            .74         .82
  Interpersonal effectiveness        .79            .61         .73
  Collaboration                      .88            .78         .87
  Pacing                             .77            .75         .79
Cognitive therapy skills
  Empiricism                         .77            .86         .84
  Focus on cognitions                .72            .87         .82
  Strategy for change                .81            .94         .91
  Implementation of strategy         .80            .93         .90
  Homework                           .59            .72         .68

Note. N = 90.

Factor structure. Through principal-components factor analysis with varimax rotation, we further evaluated the internal structure of the CTS. Two factors resulted from this analysis (Table 2). The first factor (64.8% of the variance) included high positive loadings for all but three items. Items from both the General Skills and Cognitive Therapy Skills subscales loaded on this factor. This factor reflected overall cognitive therapy quality,
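The one-way ANOVA-based intraclass correlation and the Spearman-Brown aggregation described in this section can be sketched in a few lines. This is an illustrative implementation of the standard ICC(1,1) and prophecy formulas; the ratings below are synthetic, not the study's data.

```python
def icc_oneway(ratings):
    """ICC(1,1): one-way random-effects intraclass correlation for a single rater.

    `ratings` is a list of lists, one inner list per target (videotape),
    each holding the k scores that target received."""
    n = len(ratings)
    k = len(ratings[0])
    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    target_means = [sum(row) / k for row in ratings]
    ms_between = k * sum((m - grand_mean) ** 2 for m in target_means) / (n - 1)
    ms_within = sum((x - m) ** 2
                    for row, m in zip(ratings, target_means)
                    for x in row) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

def spearman_brown(r, m):
    """Estimated reliability when the number of raters is multiplied by m."""
    return m * r / (1 + (m - 1) * r)

# Synthetic example: three tapes, two raters each.
icc = icc_oneway([[38, 40], [45, 47], [60, 57]])
```

As a check on the figures in the text: the prophecy formula applied to the observed two-rater coefficient of .77 (stepping the number of raters up by a factor of 1.5) gives approximately .83, close to the reported three-rater estimate of .84; applied directly to the single-rater value of .59 with three raters it gives about .81.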
composed of both nonspecific factors (such as understanding and interpersonal effectiveness) and specific cognitive therapy factors (such as empiricism, focus on central cognitions, and implementation of cognitive-behavioral interventions). The second factor (8.9% of the variance) involved the items assessing therapists' activities to structure cognitive therapy sessions (agenda, pacing, and homework). This analysis confirmed the findings on internal consistency. The scale was highly homogeneous, with only two orthogonal factors.

Discriminant validity. Differences in CTS item scores between acceptable and unacceptable sessions were analyzed by a multivariate analysis of variance (MANOVA). We used t tests to compare acceptable and unacceptable sessions on the General Skills and Cognitive Therapy Skills subscales as well as on the total scale. Table 3 presents mean scores and the results of univariate analyses. Univariate analyses were computed on item scores because the MANOVA was highly significant (Hotelling's T² = 6.54, p < .01). As reflected in Table 3, CTS scores for acceptable sessions were almost twice those of unacceptable sessions.

Table 3
Analysis of Acceptable and Unacceptable Therapy Sessions on the Cognitive Therapy Scale (CTS)

                                Acceptable sessions   Unacceptable sessions
Item                                 (n = 29)               (n = 24)           F
Agenda                                 3.14                   1.46           18.69**
Feedback                               4.00                   1.88           40.53**
Understanding                          4.69                   3.08           26.17**
Interpersonal effectiveness            4.79                   3.58           11.32*
Collaboration                          4.59                   2.71           42.18*
Pacing                                 4.28                   2.75           21.57*
Empiricism                             3.93                   2.58           29.17*
Focus on cognition                     4.76                   2.46           45.86*
Strategy for change                    4.31                   2.71           18.06*
Implementation of strategy             4.24                   1.95           66.79**
Homework                               4.31                   2.71           18.06**
General skills subtotal               25.76                  15.46            7.25*
C/B skills subtotal                   21.55                  11.92            7.53*
CTS total                             47.31                  27.28            7.90*

Note. We used t tests to compare the two groups on the General Skills and CT Skills subscales and CTS total scores.
* p < .05. ** p < .01.

A stepwise discriminant function analysis was also conducted. Only CTS items that significantly added to the separation of groups (using the maximum Rao procedure) were included in the discriminant function. These were, in the order in which they were added to the function, application of cognitive-behavioral techniques (Rao's V = 66.79, p < .0001), feedback (change in Rao's V = 7.83, p < .006), empiricism (change in Rao's V = 5.86, p < .02), interpersonal effectiveness (change in Rao's V = 3.76, p < .05), and collaboration (change in Rao's V = 5.61, p < .02). When the resulting discriminant function was used to predict group classification (acceptable or unacceptable cognitive therapy), 84.91% correct classification resulted. Three of the 29 acceptable sessions and 5 of the 24 unacceptable sessions were misclassified. Because the base rate for a dichotomous classification is 50%, the discriminant function increased correct classification by approximately 35%.

Discussion

The CTS was developed to assess therapist competency in cognitive therapy. To be a useful clinical instrument, the scale must have adequate psychometric properties, particularly rater reliability. Data from this study suggest that the CTS is used with only moderate reliability. We estimated that for a single rater, only 59% of the variance in CTS scores is attributable to differences in competency across sessions. The remaining variability (41%) is attributable to error. The major source of error likely results from different raters relying on different aspects of a therapist's behavior to make their ratings. The abundance of behavior contained in a complete therapy session no doubt contributes to this. Although a reliability coefficient of .59 is acceptable and consistent with other psychotherapy rating scales (Lahey et al., 1983), it has implications for the use of the scale. Unreliability detracts from the ability to use the scale for supervision or research. Fortunately, reliability can be bootstrapped by aggregation (Epstein, 1979). Combining two judges' ratings increased reliability to .77 in the present study. Therefore, aggregation is recommended to anyone planning to use the CTS.

The interrater reliability estimate obtained in this study was lower than that of Dobson et al. (1985) or Hollon et al. (1981),
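The classification result reported for the discriminant function can be verified with simple arithmetic; the counts below are taken from the text.

```python
# Misclassification counts reported in the text: 3 of 29 acceptable and
# 5 of 24 unacceptable sessions were misclassified.
acceptable_total, acceptable_missed = 29, 3
unacceptable_total, unacceptable_missed = 24, 5

correct = (acceptable_total - acceptable_missed) + (unacceptable_total - unacceptable_missed)
total = acceptable_total + unacceptable_total

accuracy_pct = 100 * correct / total   # 45 of 53 sessions classified correctly
gain_pct = accuracy_pct - 50           # improvement over the 50% base rate
```

This reproduces the 84.91% correct-classification figure and the roughly 35-percentage-point gain over the dichotomous base rate.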