Reviews and Overviews
The Hamilton Depression Rating Scale: Has the Gold Standard Become a Lead Weight?
R. Michael Bagby, Ph.D.; Andrew G. Ryder, M.A.; Deborah R. Schuller, M.D.; Margarita B. Marshall, B.Sc.
Objective: The Hamilton Depression Rating Scale has been the gold standard for the assessment of depression for more than 40 years. Criticism of the instrument has been increasing. The authors review studies published since the last major review of this instrument in 1979 that explicitly examine the psychometric properties of the Hamilton depression scale. The authors’ goal is to determine whether continued use of the Hamilton depression scale as a measure of treatment outcome is justified.
Method: MEDLINE was searched for studies published since 1979 that examine psychometric properties of the Hamilton depression scale. Seventy studies were identified and selected, and then grouped into three categories on the basis of the major psychometric properties examined: reliability, item-response characteristics, and validity.
Results: The Hamilton depression scale’s internal reliability is adequate, but many scale items are poor contributors to the measurement of depression severity; others have poor interrater and retest reliability. For many items, the format for response options is not optimal. Content validity is poor; convergent validity and discriminant validity are adequate. The factor structure of the Hamilton depression scale is multidimensional but with poor replication across samples.
Conclusions: Evidence suggests that the Hamilton depression scale is psychometrically and conceptually flawed. The breadth and severity of the problems militate against efforts to revise the current instrument. After more than 40 years, it is time to embrace a new gold standard for assessment of depression. (Am J Psychiatry 2004; 161:2163–2177)
The Hamilton Depression Rating Scale (1) was developed in the late 1950s to assess the effectiveness of the first generation of antidepressants and was originally published in 1960. Although Hamilton (1) recognized that the scale had “room for improvement” (p. 56) and that further revision was necessary, the scale quickly became the standard measure of depression severity for clinical trials of antidepressants (2, 3). The Hamilton depression scale has retained this function and is now the most commonly used measure of depression (3). Our objective in this article is to provide a review of the Hamilton depression scale literature published since the last major evaluation of its psychometric properties, more than 20 years ago (4). More recent reviews have appeared (3, 5–7), but they have not systematically examined the literature with regard to a broad range of measurement issues. Significant developments in psychometric theory and practice have been made since the 1950s and need to be applied to instruments currently in use. We evaluate the Hamilton depression scale in light of these current standards and conclude by presenting arguments for and against retaining, revising, or rejecting the Hamilton depression scale as the gold standard for assessment of depression.
Am J Psychiatry 161:12, December 2004
Method
Studies for the review were identified by means of MEDLINE searches for both “depression” and “Hamilton.” All studies published during the period since the last major review (January 1980 to May 2003) were considered. Studies selected for review had to be explicitly designed to evaluate empirically the psychometric properties of the instrument or to review conceptual issues related to the instrument’s development, continued use, and/or shortcomings. At least 20 published versions of the Hamilton depression scale exist, including both longer and shortened versions. This review was limited to studies that examined the original 17-item version, as the majority of the studies that evaluated the scale’s psychometrics used the 17-item version. Only a small number of studies evaluated other versions, and most of these versions contain the original 17 items. Seventy articles met the selection criteria and were categorized into three groups on the basis of the major psychometric property examined: reliability, item response, and validity. Table 1 lists the articles included in the review.
Results
Reliability
Clinician-rated instruments should demonstrate three types of reliability: 1) internal reliability, 2) retest reliability, and 3) interrater reliability. Cronbach’s alpha statistic (78) is used to evaluate internal reliability, and estimates ≥0.70
TABLE 1. Characteristics of Studies Examining the Psychometric Properties of the Hamilton Depression Rating Scale a
Study Aben et al. (8) Addington et al. (9) Addington et al. (10) Addington et al. (10) Akdemir et al. (11) Baca-García et al. (12) Bech (5) Bech et al. (13) Bech et al. (14) Berard and Ahmed (15) Berrios and Bulbena-Villarasa (16) Brown et al. (17) Carroll et al. (18) Cicchetti and Prusoff (19) Time 1 Time 2 Craig et al. (20) Daradkeh et al. (21) Deluty et al. (22) Demitrack et al. (23) Entsuah et al. (24) Sample 1 Sample 2 Sample 3 Faries et al. (25) Feinberg et al. (26) Fleck et al. (27) Fuglum et al. (28) Gastpar and Gilsdorf (29) Gibbons et al. (30) Gilley et al. (31) Sample 1 Sample 2 Sample 3 Gottlieb et al. (32) Gullion and Rush (33) Hammond (34) Hooijer et al. (35) Hotopf et al. (36) Kobak et al. (37) Koenig et al. (38) Lambert et al. (39) Lambert et al. (40) Leentjens et al. (41) Leung et al. (42) McAdams et al. (43) Maier and Philipp (44) Maier et al. (45) Sample 1 Sample 2 Maier et al. (46) Marcos and Salamero (47) Meyer et al. (48) Middelboe et al. (49) Moberg et al. (50) Mottram et al. (51) Naarding et al. (52) Sample 1 Sample 2 Sample 3 O’Brien and Glaudin (53) Sample 1 Sample 2
Year 2002 1990 1996 1996 2001 2001 1981 1992 2002 1995 1990
N 202 250 112c 89d 94 1 66 1,128 650 22 1,204 259 278
—b —b
Medical outpatients Depressed patients
86 81 32 73 70 85
—b —b 0 58 39 66
Depressed outpatients Depressed outpatients Schizophrenia inpatients Depressed inpatients Psychiatric inpatients Professionals/laypersons
865 757 450 1,658 —b 60 —b 122 370
65 64 62 —b —b 77 —b 66 72
Psychiatric patients Psychiatric patients Psychiatric patients Depressed outpatients Depressed patients Psychiatric outpatients Depressed patients Depressed patients Psychiatric patients
185 54
56 39
× ×
× ×
English English English English Flemish English English
57 43 324 100 56 49 113
37 67 67 74 —b 65 —b
× ×
× × ×
English —b English Dutch Chinese English German
38 1,850 13 63 93 101 280
55 —b 31 37 56 23 —b
Alzheimer’s disease patients Comparison subjects with normal cognition Parkinson’s disease patients Neurological patients Depressed patients Elderly medical patients Mental health professionals Primary care patients Psychiatric patients/community comparison subjects Elderly medical patients Psychiatric patients Psychiatric inpatients/outpatients Parkinson’s disease patients Psychiatric inpatients Schizophrenia outpatients Psychiatric outpatients
German German German Spanish English Danish English English
130 48 130 234 196 36 20 433
—b —b —b 76 68 64 70 73
Psychiatric inpatients Psychiatric inpatients Psychiatric inpatients Community geriatric subjects Medical outpatients Medical outpatients Geriatric consultation/liaison patients Elderly psychiatric referrals
× ×
Dutch Dutch Dutch
44 274 85
36 60 40
Stroke inpatients Alzheimer’s disease patients Parkinson’s disease patients
× × ×
English English
183 182
70 70
Psychiatric outpatients Psychiatric outpatients
× ×
Language Dutch English English English Turkish Spanish Danish Multilingual Danish English Castilian
1995 English 1981 English 1983 English English 1985 English 1997 Arabic 1986 English 1998 —b 2002 Multilingual Multilingual Multilingual 2000 —b 1981 English 1995 French 1996 Danish 1990 Multilingual 1993 English 1995 English English 1988 1998 1998 1991 1998 1999 1995 1986 1988 2000 1999 1996 1985 1988 1988 1990 2001 1994 2001 2000 2002
Psychometric Properties Examined
% of Female Subjects 46 —b 60 —b 66 100 70 —b —b 64 59
Subjects Stroke patients Schizophrenia inpatients Schizophrenia inpatients Schizophrenia inpatients Psychiatric patients Dysthymia outpatient Depressed inpatients Psychiatric patients Psychiatric patients Elderly psychiatric outpatients Psychiatric outpatients
1988
Item Reliability Response Validity × × × × × × × × × × × × × × × × × × ×
× × × × × × × ×
× ×
× × × × × × × × × × ×
× ×
× ×
× ×
× ×
× × ×
×
× × × × × × ×
× × × ×
×
TABLE 1. Characteristics of Studies Examining the Psychometric Properties of the Hamilton Depression Rating Scale a (continued)
Year 1983 2003 1997 2002 1990
Language English Danish English Italian
Psychometric Properties Examined
N 20 91 206 186
% of Female Subjects 0 74 70 62
Subjects Depressed outpatients Psychiatric and medical patients Geriatric psychiatric outpatients Depressed outpatients Depressed inpatients Psychiatric outpatients General practice outpatients Depressed outpatients Depressed inpatients/outpatients
× × × × ×
× × ×
× ×
×
×
×
Study O’Hara and Rehm (54) Olsen et al. (55) Onega and Abraham (56) Pancheri et al. (57) Paykel (58) Sample 1 Sample 2 Sample 3 Potts et al. (59) Ramos-Brieva and Cordero-Villafafila (60) Rehm and O’Hara (61) Reynolds and Kobak (62)
English English English 1990 English 1988 Spanish
101 118 167 694 135
—b —b —b 74 70
1985 English 1995 English
158 357
100 59
Riskind et al. (63) Santor and Coyne (64) Sample 1 Sample 2 Santor and Coyne (65) Sayer et al. (66) Senra Rivera et al. (67) Shain et al. (68) Smouse et al. (69) Steinmeyer and Möller (70) Steinmeyer and Möller (70) Strik et al. (71) Sample 1 Sample 2 Teri and Wagner (72) Thase et al. (73) Thompson et al. (74) Whisman et al. (75) Williams (76) Zheng et al. (77)
1987 English 2001 English English 2001 English 1993 English 2000 Castilian 1990 English 1981 English 1992 German 1992 German 2001 Dutch Dutch 1991 English 1983 English 1998 English 1989 English 1988 English 1988 Chinese
191
54
Community (symptomatic) subjects Psychiatric outpatient/nonreferred community subjects Psychiatric outpatients
316 318 732 114 52 45 —b 223e 174f
—b 70 —b 61 65 64 —b 68 68
Primary care outpatients Depressed outpatients Depressed patients Psychiatric inpatients Depressed patients Depressed adolescent inpatients Depressed patients Psychiatric inpatients Psychiatric inpatients
156 50 75 147 242 70 23 329
0 100 68 100 100 100 65 47
Medical patients Medical patients Alzheimer’s patients Depressed outpatients Psychiatric referrals Depressed outpatients Psychiatric inpatients Psychiatric inpatients/outpatients
Item Reliability Response Validity × × × × ×
× × ×
× × × × × × ×
×
× × ×
× × × × × × × × × × ×
a Studies were published between January 1980 and May 2003 and identified by means of a MEDLINE search for both “depression” and “Hamilton.”
b Not reported.
c Number of subjects providing data at time 1.
d Number of subjects providing follow-up data 3 months after admission.
e Number of subjects providing baseline (i.e., pretreatment) data.
f Number of subjects providing endpoint (week 6) data after treatment with either paroxetine or amitriptyline.
reflect adequate reliability (79, 80). The internal reliability of individual items is calculated by using corrected item-to-total correlation with Pearson’s r; items should have a correlation greater than 0.20 (79, 80). Retest reliability assesses the extent to which multiple administrations of the scale generate the same results. When scores on an instrument are expected to change in response to effective treatment, it is necessary to demonstrate that these scores remain the same in the absence of treatment. Interrater reliability assesses the extent to which multiple raters generate the same result. Although Pearson’s r is often used to compute these estimates, the preferred method is the intraclass r (81), which allows for adjustment for agreement by chance. Estimates of retest and interrater reliability should be at least 0.70 (Pearson’s r) and 0.60 (intraclass r) (82). For retest reliability of scale items, Pearson’s r >0.70 is considered acceptable (83).
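The reliability statistics described above can be illustrated with a minimal Python sketch. It is not taken from any of the reviewed studies: the function names and the ratings are hypothetical, and the intraclass coefficient shown is the one-way random-effects, single-rater form, which is only one of several intraclass variants and is assumed here for illustration.

```python
from statistics import mean, variance


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def cronbach_alpha(scores):
    """Cronbach's alpha; `scores` holds one list of item ratings per subject."""
    k = len(scores[0])
    item_vars = [variance([row[j] for row in scores]) for j in range(k)]
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def corrected_item_total(scores, j):
    """Correlate item j with the sum of the remaining items (item j excluded)."""
    item = [row[j] for row in scores]
    rest = [sum(row) - row[j] for row in scores]
    return pearson_r(item, rest)


def icc_oneway(ratings):
    """One-way random-effects ICC, single rater; one list of rater scores per subject."""
    n, k = len(ratings), len(ratings[0])
    grand = mean(v for row in ratings for v in row)
    ms_between = k * sum((mean(row) - grand) ** 2 for row in ratings) / (n - 1)
    ms_within = sum((v - mean(row)) ** 2 for row in ratings for v in row) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Hypothetical ratings: 5 subjects x 3 items
scores = [[0, 1, 0], [1, 1, 2], [2, 2, 1], [3, 2, 3], [4, 3, 4]]
alpha = cronbach_alpha(scores)
print(f"alpha = {alpha:.2f}, meets 0.70 criterion: {alpha >= 0.70}")
for j in range(3):
    r = corrected_item_total(scores, j)
    print(f"item {j}: corrected item-total r = {r:.2f}, meets 0.20 criterion: {r > 0.20}")

# Hypothetical total scores from two raters for 4 subjects
pairs = [[10, 12], [18, 17], [25, 27], [7, 6]]
print(f"ICC = {icc_oneway(pairs):.2f}")
```

The item is removed from the total before correlating so that it does not inflate its own correlation, which is what "corrected" refers to in the item-to-total statistic.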
Internal Reliability
Table 2 summarizes the results from studies examining internal reliability of the total Hamilton depression scale. Estimates ranged from 0.46 to 0.97, and 10 studies reported estimates ≥0.70. Table 3 summarizes the studies that examined internal reliability at the item level. The majority of Hamilton depression scale items show adequate reliability. Six items met the reliability criteria in every sample (guilt, middle insomnia, psychic anxiety, somatic anxiety, gastrointestinal, general somatic), and an additional six items met the criteria in all but one sample (depressed mood, suicide, early insomnia, late insomnia, work and interests, hypochondriasis). Loss of insight was the item with the most variable findings, suggesting a potential problem with this item.
Interrater Reliability
Total Hamilton depression scale interrater reliabilities are displayed in Table 2. Pearson’s r ranged from 0.82 to
TABLE 2. Studies Reporting Reliability Estimates for the Total 17-Item Hamilton Depression Rating Scale a
Study Addington et al. (9) Addington et al. (10) Akdemir et al. (11) Baca-García et al. (12) Cicchetti and Prusoff (19) Time 1 Time 2 Craig et al. (20) Deluty et al. (22) Demitrack et al. (23) Fuglum et al. (28) Gastpar and Gilsdorf (29) Gilley et al. sample 1 (31) Gottlieb et al. (32) Hammond (34) Kobak et al. (37) Koenig et al. (38) Leung et al. (42) Maier et al. (45) Sample 1 Sample 2 Time 1 Time 2 McAdams et al. (43) Meyer et al. (48) Middelboe et al. (49) O’Hara and Rehm (54) Expert raters Novice raters Pancheri et al. (57) Potts et al. (59) Ramos-Brieva and Cordero-Villafafila (60) Rehm and O’Hara (61) Study 1 Study 2 Reynolds and Kobak (62) Riskind et al. (63) Shain et al. (68) Teri and Wagner (72) Whisman et al. (75) Williams (76) Zheng et al. (77)
Year 1990 1996 2001 2001 1983
Internal Reliability (Cronbach’s alpha)
Interrater Reliability (Pearson’s r) 0.82
Interrater Reliability Retest Reliability (Intraclass r) (Pearson’s r) 0.93
0.87–0.98b 0.97
0.75
0.85 0.46 0.82
1985 1986 1998 1996 1990 1995 1988 1998 1999 1995 1999 1988
0.95 0.96 0.65–0.79b 0.81
0.86 0.48 0.92
0.99 0.46 0.91
0.98 0.97 0.94 0.70
1996 2001 1994 1983
0.72 0.70 0.77 0.57–0.80b 0.75 0.91 0.76
2002 1990 1988 1985
0.90 0.82 0.72 0.76
1995 1987 1990 1991 1989 1988 1988
0.92 0.78–0.91b 0.91–0.96b
0.92 0.73
0.96 0.97 0.65–0.97b 0.85 0.81
0.71
0.92
a Estimates are from studies published between January 1980 and May 2003 that measured psychometric properties of the Hamilton depression scale. Studies were identified by means of a MEDLINE search for both “depression” and “Hamilton.”
b Range over multiple pairs of raters.
0.98, and the intraclass r ranged from 0.46 to 0.99. Some investigators provided evidence that the skill level or expertise of the interviewer and the provision of structured queries and scoring guidelines affect reliability (19, 23, 35, 54). Across studies, the best-estimate mean interrater reliability for studies reporting higher levels of interviewer skill and the use of expert raters, structured queries, and scoring guidelines did not differ significantly from that for other studies (z=0.81, n.s.). At the individual item level, interrater reliability is poor for many items. Cicchetti and Prusoff (19) assessed reliability before treatment initiation and 16 weeks later at trial end. Only early insomnia was adequately reliable before treatment, and only depressed mood was adequately reliable after treatment. Thirteen items had coefficients
0.70 was considered acceptable. Acceptable correlations are shown in boldface type.
along this continuum as they improve. Continued use of items insensitive to change underestimates the strength of actual treatment effects and makes it necessary to have larger samples to demonstrate that an effect is statistically significant. Falsely identifying patients as not having changed represents an additional source of “noise” and weakens the “signal” of a true treatment effect. A pragmatic implication of such lack of sensitivity is that new compounds shown to be promising in the laboratory may appear spuriously ineffective in clinical trials. A related issue concerns the extent to which a severity score actually measures a single unidimensional syndrome. To summarize a syndrome with a single score requires a precise understanding of what that score represents. The implicit assumption is that the severity score represents a single dimension (84); if depression is heterogeneous, interpretation of a single summed score is unclear. If, for example, items assessing psychological and physical symptoms were only loosely related, a single score would not distinguish between two potentially different groups of depressed patients—one group whose symptoms were primarily psychological and another group with primarily vegetative symptoms. Any effects of
an intervention targeting only one of these aspects would be harder to detect. Gibbons et al. (85) presented a strategy for identifying a unidimensional set of items from a psychiatric rating scale and evaluating the extent to which these items adequately measure the full range of depression severity. Subsequently, a subset of Hamilton depression scale items that would measure a single dimension of depression across a wide range of severity was developed (30). This subset included depressed mood, which was sensitive at low levels; work/interests, psychic anxiety, and loss of libido, which were sensitive at mild levels; somatic anxiety, psychomotor agitation, and guilt, which were sensitive at moderate levels; and suicide, which was sensitive at severe levels. These items were proposed as a psychometrically stronger form of the full Hamilton depression scale. Santor and Coyne (64, 65) used item response theory to examine the functioning of the full Hamilton depression scale and its individual items. In one of these studies (65) they examined individual Hamilton depression scale item performance in a combined sample of primary care patients and depressed patients from the National Institute of Mental Health Treatment of Depression Collaborative Research Program. One expects different item ratings at
Scale Item
Retardation  Agitation  Psychic Anxiety  Somatic Anxiety  Gastrointestinal  General Somatic  Loss of Libido  Hypochondriasis  Weight Loss  Loss of Insight  Mean

0.24         0.24       0.42             0.35             0.33              0.29             0.29            0.34             0.29         0.06             0.29
0.31         0.35       0.36             0.29             0.37              0.34             0.30            0.36             0.26         0.06             0.32

0.03         0.07       0.39             0.34             0.28              0.32             0.05            0.34             –0.04        0.25             0.16
0.40         0.40       0.64             0.58             0.53              0.55             0.55            0.23             0.11         0.27             0.47

0.33         0.37       0.53             0.41             0.52              0.25             0.27            0.33             0.40         0.45             0.38
0.33         0.20       0.33             0.47             0.63              0.50             0.39            0.43             0.49         0.23             0.40
0.21         0.25       0.50             0.42             0.41              0.44             0.23            0.16             0.42         –0.07            0.35
0.14         0.18       0.54             0.46             0.27              0.38             0.13            0.33             0.25         0.16             0.34

0.39         0.20       0.19             0.34             0.43              0.30             0.39            0.29             0.57         –0.02            0.37
0.26         0.32       0.40             0.45             0.51              0.42             0.59            –0.04            0.06         –0.03            0.40

0.46         0.89       0.67             0.57             0.34              0.57             0.39            0.76             0.58         0.63             0.64
0.75         0.97       0.88             0.84             0.95              0.92             0.94            0.89             1.00         1.00             0.90

0.54         0.51       0.52             0.88             0.75              0.55             0.77            0.35             0.51         0.37             0.59
0.69         0.52       0.71             0.82             0.73              0.68             0.69            0.64             0.54         0.22             0.72

0.85         0.66       0.80             0.79             0.71              0.66             0.76            0.79             0.08         0.79             0.70
0.32         0.11       0.78             0.66             0.59              0.61             0.70            0.55             0.58         0.00             0.54
different levels of depression severity, with zeroes more common at mild levels of overall depression and higher item scores more common with more severe overall depression. Moreover, whereas most items on the Hamilton depression scale are, overall, sensitive to depression severity, 12 items had at least one problematic response option (the five items that had no such problems were depressed mood, guilt, suicide, work/interests, and psychic anxiety) (64). For example, the likelihood of receiving a rating of 1 on the insomnia items was essentially the same regardless of the overall severity of depression, but the likelihood of receiving a rating of 4 on somatic anxiety was very low even when overall depression was severe. These findings confirm that the rating scheme is not ideal for many items on the Hamilton depression scale, with the unfortunate effect of decreasing the capacity of the Hamilton depression scale to detect change (6, 7).
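The idea of a problematic response option can be sketched in Python. This is a crude binned approximation, not the item response theory procedure used by Santor and Coyne: the function name, the severity bands, and the ratings are all hypothetical and chosen only to show an option whose likelihood does not change with overall severity.

```python
from collections import Counter


def option_proportions(item_ratings, total_scores, bands):
    """Return {band: {rating: proportion}} for one item.

    `bands` is a list of inclusive (low, high) total-score ranges.
    """
    counts = {band: Counter() for band in bands}
    for rating, total in zip(item_ratings, total_scores):
        for low, high in bands:
            if low <= total <= high:
                counts[(low, high)][rating] += 1
                break
    result = {}
    for band, counter in counts.items():
        n = sum(counter.values())
        result[band] = {opt: c / n for opt, c in sorted(counter.items())} if n else {}
    return result


# Hypothetical severity bands and data (not cutoffs or data from the studies reviewed)
bands = [(0, 13), (14, 22), (23, 52)]
totals = [5, 8, 10, 12, 15, 17, 19, 21, 25, 28, 30, 35]
ratings = [0, 1, 0, 1, 1, 2, 1, 2, 2, 1, 2, 1]

curves = option_proportions(ratings, totals, bands)
for band in bands:
    print(band, curves[band])
```

In this made-up data set, a rating of 1 is given to half of the subjects in every band, so that option carries no information about severity. This is the pattern reported for the insomnia items.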
Rasch Analysis
Additional efforts to analyze the performance of individual Hamilton depression scale items and to identify an underlying single dimension of depression severity have benefited from a technique known as Rasch analysis, a method similar to item response theory. Rasch analysis proposes an ideal underlying dimension based on mathematical and theoretical reasoning about the construct that is being measured and then assesses the extent to which actual data correspond to this ideal. This approach was first applied to the Hamilton depression scale by Bech et al. (86), who confirmed that six items previously shown to have properties associated with unidimensionality (87) could be combined to create a shorter scale that met the formal Rasch criteria. This six-item scale was thus proposed as a better measure than the full Hamilton depression scale for assessing depression severity along a single dimension; the six-item scale is composed of items for depressed mood, guilt, work/interests, psychomotor retardation, psychic anxiety, and general somatic symptoms (87). The unidimensionality of this six-item subscale has since been confirmed in two studies that used Rasch methods (13, 14). Maier and Philipp (44) used Rasch analysis to confirm unidimensionality for a subset of Hamilton depression scale items. The resulting scale was similar to that obtained by Bech et al. (86). In another study that used Rasch analysis (46), six items were found to be problematic: suicide, psychomotor agitation, somatic anxiety, general somatic symptoms, hypochondriasis, and loss of insight.
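The "ideal underlying dimension" can be made concrete with a minimal Python sketch of the dichotomous Rasch model. This is a simplification assumed for illustration: Hamilton depression scale items are rated on several levels, so the studies cited would have needed a polytomous variant, and the item difficulties below are hypothetical values rather than published estimates.

```python
import math


def rasch_probability(theta, difficulty):
    """P(symptom endorsed) for severity `theta` and item `difficulty`, both in logits."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))


def expected_total(theta, difficulties):
    """Expected number of endorsed items at severity `theta`."""
    return sum(rasch_probability(theta, b) for b in difficulties.values())


# Hypothetical item difficulties on a logit scale
difficulties = {"depressed mood": -1.5, "work/interests": -0.5, "guilt": 0.5}

for theta in (-2.0, 0.0, 2.0):
    probs = {item: round(rasch_probability(theta, b), 2) for item, b in difficulties.items()}
    print(theta, probs, round(expected_total(theta, difficulties), 2))
```

Under this model the items keep the same order of difficulty at every severity level and the expected total rises with severity, which is the sense in which a summed score can stand for a single dimension. Items whose observed ratings depart from this pattern are the ones a Rasch analysis flags as problematic.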
TABLE 4. Studies Reporting Estimates of Convergent Validity of the 17-Item Hamilton Depression Rating Scale, Compared With Other Depression Measures a
r
Study Akdemir et al. (11) Berard and Ahmed (15) Brown et al. (17) Carroll et al. (18) Craig et al. (20) Feinberg et al. (26) Gottlieb et al. (32) Low-severity group High-severity group Hotopf et al. (36) Kobak et al. (37) Leung et al. (42) Maier et al. total sample (46) Olsen et al. (55) Rehm and O’Hara (61) Senra Rivera et al. (67) Time 1 Time 2 Whisman et al. (75) Time 1 Time 2 Zheng et al. (77)
Year 2001 1995 1995 1981 1985 1981 1988
Beck Depression Inventory 0.48 0.48 0.70–0.85b 0.60
Brief Psychiatric Rating Scale
Carroll Rating Scale for Depression
Center for Epidemiologic Studies Depression Scale
Clinical Global Impression Scale 0.56
Global Assessment Scale
0.71 0.56
0.65 0.77
0.75
0.89 0.57 1998 1999 1999 1988 2003 1985 2000
0.77 0.89
0.73
–0.86 0.70 0.92
1989 0.27 0.67 1988
0.41 0.68 –0.47
a Estimates are from studies published between January 1980 and May 2003 that measured psychometric properties of the Hamilton depression scale. Studies were identified by means of a MEDLINE search for both “depression” and “Hamilton.”
b Multiple assessments over an 8-month period.
Validity
Validity of psychiatric rating scales such as the Hamilton depression scale comprises 1) content, 2) convergent, 3) discriminant, 4) factorial, and 5) predictive validity. Content validity is assessed by examining scale items to determine correspondence with known features of a syndrome. Convergent validity is adequate when a scale shows Pearson’s r values of at least 0.50 in correlations with other measures of the same syndrome. Discriminant validity is established by showing that groups differing in their diagnostic status can be separated by using the scale. Predictive validity for symptom severity measures such as the Hamilton depression scale is determined by a statistically significant (p