Spurious Precision: Procedural Validity of Diagnostic Assessment in Psychotic Disorders

Patrick D. McGorry, Ph.D., F.R.A.N.Z.C.P., Cathy Mihalopoulos, B.B.S., Lisa Henry, M.Psych., Jenepher Dakis, F.R.A.N.Z.C.P., Henry J. Jackson, Ph.D., Michael Flaum, M.D., Susan Harrigan, Grad.Dip.Appl.Sci., Dean McKenzie, B.A., Jayashri Kulkarni, F.R.A.N.Z.C.P., and Robert Karoly, F.R.A.N.Z.C.P.

Objective: Very few studies have quantified the level of agreement among alternative diagnostic procedures that use a common set of fixed operational criteria. The authors examined the procedural validity of four independent methods of assigning DSM-III-R diagnoses of psychotic disorders. Method: The study was conducted as a satellite study to the DSM-IV Field Trial for Schizophrenia and Related Psychotic Disorders. The setting was the National Health and Medical Research Council Schizophrenia Research Unit's Early Psychosis Prevention and Intervention Centre, which focuses on first-episode psychosis. Consecutively admitted patients (N=50) were assessed by independent raters who used four different procedures to determine a DSM-III-R diagnosis. These procedures were 1) the diagnostic instrument developed for the DSM-IV field trial, 2) the Royal Park Multidiagnostic Instrument for Psychosis, 3) the Munich Diagnostic Checklists, and 4) a consensus DSM-III-R diagnosis assigned by a team of clinician researchers who were expert in the use of diagnostic criteria. Results: Concordance between pairs of diagnostic procedures was only moderate. Percent agreement, however, ranged from 66% to 76%, with converse misclassification rates of 24%-34% (assuming one procedure to be "correct"). Conclusions: Despite the introduction of operationally defined diagnostic criteria, there remained an appreciable level of differential classification or misclassification arising from variability in the method of assigning diagnoses rather than the criteria themselves. These findings have significant research and clinical implications: such misclassification may impede neurobiological research and have harmful clinical effects on patients with first-episode psychosis. (Am J Psychiatry 1995; 152:220-223)

Presented at the 7th Biennial Winter Workshop on Schizophrenia, Les Diablerets, Switzerland, Jan. 23-28, 1994. Received March 7, 1994; revision received June 21, 1994; accepted Aug. 4, 1994. From the Early Psychosis Prevention and Intervention Centre and the National Health and Medical Research Council Schizophrenia Research Unit, Royal Park Hospital. Address reprint requests to Dr. McGorry, Royal Park Hospital, Park St., Parkville, Victoria 3052, Australia.

There is no doubt that the development of sets of operational criteria, and assessment procedures to apply these criteria, has resulted in substantial improvements in diagnostic reliability. Nevertheless, as pointed out by Winokur et al. (1), serious questions still remain about the reliability and validity of these sets of operational criteria in research studies. These questions arise from the fact that even when much care, time, and effort are put into the assignment of operational criteria, misreadings, misinterpretations, and idiosyncratic use of the criteria are common. The extent of this problem is often concealed by statements within research reports
that provide little or no information about the process of diagnostic assignment.

For any of the major diagnostic systems, such as DSM-III-R, a number of assessment procedures have been developed to collect the psychopathological data and apply the diagnostic criteria. The procedures vary in how and when the data are collected and may or may not specify how the criteria are to be precisely applied. There is ample scope for many of the sources of unreliability, including information variance and the process of actually interpreting and applying the criteria, which are often broadly and imprecisely defined (e.g., how prominent do prominent affective symptoms need to be?), to influence the ultimate diagnosis. Spitzer and Williams (2, p. 1039) introduced the term "procedural validity" to focus upon this issue and defined it as follows:

   The term "procedural validity" can be used . . . whenever the question being asked concerns the extent to which a new diagnostic procedure yields results similar to the results of an established diagnostic procedure that is used as a criterion (for example, a diagnostic evaluation, criteria that were rated by means of a structured interview, and other ratings). Procedural validity . . . speaks only to the issue of the validity of the diagnostic procedure and not to the validity of the diagnostic categories and criteria themselves.

Procedural validity was felt to be a more satisfactory term than concurrent validity in the operational age because it focused upon the procedure rather than a combination of the procedure and the operational definition. To consider it as a form of validity, however, does require that the assumption be made that some procedures are more valid than others. This is undoubtedly true, although criteria for assessing this are somewhat elusive. The present study was conducted as part of our international collaborative research in diagnosis and classification of psychotic disorders (3-5) and, specifically, as a satellite study to our participation as a field trial center, one of two non-U.S. sites, for the DSM-IV Field Trial for Schizophrenia and Related Psychotic Disorders (3). The principal aim was to examine the procedural validity of four methods of assigning DSM-III-R diagnoses of psychotic disorders in order to gain an estimate of the rate of potential misclassification (6) in a research environment.
METHOD

Fifty consecutively admitted patients with first-episode psychotic illness treated at the Early Psychosis Prevention and Intervention Centre (7), a specialist program of Royal Park Hospital, which has a catchment area of 800,000 people, were recruited for the study during 1992. The mean age for the study group was 26.3 years (SD=6.8, range=18-45). There were more men (N=31, 62%) than women (N=19, 38%); the majority (N=42, 84%) had never married, and 56% (N=28) were unemployed at the time of index assessment. Mean number of years of education was 11.1 (SD=2.2). Organic etiology and mental retardation were exclusion criteria. Written informed consent was obtained from all subjects. The study group approximated an incidence sample of first-episode psychosis for a defined area of Melbourne. The study formed part of the multicenter DSM-IV Field Trial for Schizophrenia and Related Psychotic Disorders (3), which examined the reliability and concordance of three alternative sets of options for diagnosing DSM-IV psychotic disorders plus the criteria from DSM-III, DSM-III-R, and ICD-10.
Our study followed the basic methodology of the parent study, with the first 25 interviews using a test-retest reliability design and the second 25 interviews an interrater design. One interviewer (J.D.) interviewed all 50 patients by using the field trial instrument developed for the parent study. The first 25 patients were interviewed on a separate occasion by a second interviewer, using the field trial instrument, within 48 hours of the initial interview (test-retest design). The second 25 patients were each interviewed by a pair of interviewers on a single occasion (interrater design). All interviews were conducted during the recovery phase of the initial psychotic episode, and the position of primary interviewer for interrater agreement was alternated. The field trial instrument enabled raters to apply six sets of diagnostic criteria: DSM-III, DSM-III-R, ICD-10, and three alternative options for DSM-IV psychotic disorders (3). This diagnostic procedure was compared
with three other methods of assigning DSM-III-R diagnoses for psychotic disorders that were independently employed by other members of the research team with the same subjects. The diagnostic procedures used were as follows:

1. DSM-IV field trial instrument (3). The field trial instrument was divided into two sections: the first section comprised a semistructured interview for the assessment of psychopathology; the second section comprised a sequence of checklists of the content of the six sets of operational criteria. The links between the two sections were not especially tight and allowed for some rater judgment and, therefore, flexibility, and, also, error. The diagnoses assigned by the rater who interviewed all 50 patients on completion of the field trial instrument were used for this study.
2. Royal Park Multidiagnostic Instrument for Psychosis (8, 9). This is a comprehensive psychopathological assessment tool designed principally for the assessment of a first psychotic episode. It is validity oriented and uses serial interviews and multiple information sources to construct a psychopathological database for the episode diagnosis, to which 14 different systems of operational diagnostic criteria, including DSM-III-R, are applied. This tool is a routine assessment in our unit and is employed by research psychologists highly experienced in its use. If there can be a gold standard in studies of procedural validity, and this is questionable (10), in our unit the Royal Park Multidiagnostic Instrument for Psychosis would best represent this for first-episode psychosis. The assessment was carried out independently of the field trial instrument assessment, and the rater was blind to the field trial instrument ratings and diagnosis.
3. Munich Diagnostic Checklists for DSM-III-R (11, 12). This procedure consists of a set of pocket-sized lists of the criteria for each diagnostic category. It was developed as an aid to systematic diagnostic evaluation under routine clinical conditions with time limits. The presentation is clear and easy to follow. It involves a blend of simple checklist and flowchart format with a series of boxes to record ratings. In the present study, these were completed by the senior psychiatry resident who was responsible for the routine clinical assessment and management of the patient. The Munich Diagnostic Checklists for DSM-III-R represent a brief but systematic approach to DSM-III-R diagnosis, which might be more efficient than a more complex procedure, yet more rigorous than merely applying the diagnostic criteria from memory or by reference to a manual.
4. Consensus diagnosis. A diagnostic procedure of this nature has been proposed as a criterion measure for studies of procedural validity (2); however, there are significant assumptions involved that may not be valid (1). The consensus procedure involved a detailed case presentation of each patient by the treating psychiatry resident to a stable team of four experienced clinicians (P.D.M., H.J.J., J.K., R.K.). A minimum of three clinicians was required to be present for each case. Each clinician completed ratings for the six criteria sets included in the field trial instrument, including DSM-III-R. This process was carried out independently, without discussion, after each case was presented. The first 25 cases were rated in a series of fortnightly meetings over a period of several months, while the second 25 cases were rated on a single day at the end of the study.
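Each of the four procedures yields one categorical diagnosis per subject, so any pair of procedures can be cross-classified before agreement statistics are computed. The following sketch illustrates that cross-tabulation step with invented data; the category labels are hypothetical placeholders (the eight DSM-III-R categories used in the study are not enumerated at this point in the text).

```python
# Illustrative sketch only: category labels and diagnoses are invented,
# not the study data.

CATEGORIES = [
    "schizophrenia", "schizophreniform", "schizoaffective",
    "brief reactive psychosis", "major depression with psychosis",
    "mania with psychosis", "delusional disorder", "psychosis NOS",
]

def cross_tabulate(diag_a, diag_b):
    """Return a square count matrix: rows are procedure A's diagnoses,
    columns are procedure B's diagnoses for the same subjects."""
    idx = {c: i for i, c in enumerate(CATEGORIES)}
    table = [[0] * len(CATEGORIES) for _ in CATEGORIES]
    for a, b in zip(diag_a, diag_b):
        table[idx[a]][idx[b]] += 1
    return table

# Toy example with three subjects assessed by two procedures
a = ["schizophrenia", "mania with psychosis", "schizophrenia"]
b = ["schizophrenia", "schizophreniform", "schizophrenia"]
t = cross_tabulate(a, b)
print(t[0][0])  # 2: both procedures assigned schizophrenia to two subjects
```

The diagonal of such a matrix holds the agreements; everything off the diagonal is differential classification between the pair of procedures.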
RESULTS

The DSM-III-R profile of the study group, determined by the field trial instrument, was as follows: schizophrenia, N=24 (48%); schizophreniform disorder, N=7 (14%); schizoaffective disorder, N=2 (4%); brief reactive psychosis, N=2 (4%); major depressive episode with psychotic features, N=4 (8%); manic episode with psychotic features, N=10 (20%); and psychotic disorder not otherwise specified, N=1 (2%). Reliability for the Royal Park Multidiagnostic Instrument for Psychosis and field trial instrument procedures was assessed and found to be satisfactory.

Levels of agreement for pairs of the four diagnostic procedures are presented in table 1. Each of the four procedures was used to assign a diagnosis from one of eight DSM-III-R categories of psychotic disorder for each of the 50 subjects. Comparisons of diagnostic assignment resulted in an 8x8 matrix for each pair of procedures. Cohen's unweighted nominal kappa (13) is presented as an index of agreement between each pair of procedures; percent agreement is also given. A total of six pairwise comparisons were possible among the four diagnostic procedures. The standard error of each of the six kappas (14) is also reported.

TABLE 1. Agreement Between Pairs of Four Procedures Used to Diagnose 50 Patients With a First Episode of Psychotic Illness

                                                                   Unadjusted
  Procedure Pair                                     Kappa    SE   Agreement (%)
  DSM-IV field trial instrument and Royal Park
    Multidiagnostic Instrument for Psychosis          0.67   0.08       76
  DSM-IV field trial instrument and DSM-III-R
    consensus diagnosis                               0.64   0.08       74
  DSM-IV field trial instrument and Munich
    Diagnostic Checklists for DSM-III-R               0.53   0.09       66
  DSM-III-R consensus diagnosis and Munich
    Diagnostic Checklists for DSM-III-R               0.64   0.08       74
  DSM-III-R consensus diagnosis and Royal Park
    Multidiagnostic Instrument for Psychosis          0.65   0.08       74
  Munich Diagnostic Checklists for DSM-III-R and
    Royal Park Multidiagnostic Instrument
    for Psychosis                                     0.65   0.08       74

Kappa values were moderate and statistically comparable for all pairwise comparisons, as indicated by the standard errors. Percent agreement was also modest. Because comparisons that involve the standard error for kappa do not take into account possible correlation between the kappas, pairwise comparisons among agreements were made by using McNemar's test for correlated proportions (15). This test has previously been used to compare percent agreement in studies of psychiatric diagnosis (16). McNemar's test indicated that the difference between the highest (76%) and lowest (66%) percent agreement was not statistically significant (chi-square=1.45, df=1, p>0.10, one-tailed), thereby confirming similar levels of agreement between pairs of diagnostic procedures. Full diagnostic concordance among all four procedures occurred in only 54% (N=27) of the subjects. The data were also analyzed separately for the first and second halves of the cohort in view of the change in methodology for the consensus procedure, which was condensed into a single day for the last 25 subjects. This revealed that the Munich Diagnostic Checklists for DSM-III-R achieved higher levels of agreement with all other procedures for the second half than for the first. The magnitude of this difference was a 12%-20% increase in percent agreement and a 0.2-0.3 increase in kappa.
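As a concrete illustration of the indices used above, the following sketch computes percent agreement, Cohen's unweighted kappa (13), and a continuity-corrected McNemar chi-square (15) from invented data. The contingency table and the discordant counts are hypothetical and are not the study data; in particular, the discordant counts underlying the reported chi-square of 1.45 are not given in the paper.

```python
# Illustrative sketch only: the table and discordant counts are invented.

def agreement_and_kappa(table):
    """Percent agreement and Cohen's unweighted kappa from a square
    cross-classification of two procedures' categorical diagnoses."""
    k = len(table)
    n = sum(sum(row) for row in table)
    p_obs = sum(table[i][i] for i in range(k)) / n  # observed agreement
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[i][j] for i in range(k)) for j in range(k)]
    # agreement expected by chance, from the marginal totals
    p_exp = sum(row_tot[i] * col_tot[i] for i in range(k)) / n ** 2
    return p_obs, (p_obs - p_exp) / (1 - p_exp)

def mcnemar_chi2(b, c):
    """McNemar's chi-square with continuity correction for correlated
    proportions: b and c count the subjects on which only one of the
    two procedure pairs being compared agreed."""
    return (abs(b - c) - 1) ** 2 / (b + c)

# Toy 3-category table for 50 subjects (rows: procedure A; columns: procedure B)
table = [[20, 2, 1],
         [3, 15, 1],
         [1, 2, 5]]
p_obs, kappa = agreement_and_kappa(table)
print(f"agreement = {100 * p_obs:.0f}%, kappa = {kappa:.2f}")  # agreement = 80%, kappa = 0.67

# Hypothetical discordant counts for two correlated percent agreements
print(f"McNemar chi-square = {mcnemar_chi2(9, 4):.2f} (df=1)")  # 1.23
```

Note that kappa discounts the agreement expected from the marginal diagnostic frequencies alone, which is why it runs well below raw percent agreement throughout table 1.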
DISCUSSION

In contrast to much research that has compared the concordance of competing sets of operational criteria
(17) in making psychiatric diagnoses, the aim of the present study was to evaluate the performance of alternative procedures in assigning a common set of operational criteria. No more than modest levels of agreement were found among the four procedures used for assigning DSM-III-R psychotic disorders. It could be argued that the concordance is satisfactory and sufficient to cross-validate the procedures, particularly the simpler, time-efficient Munich Diagnostic Checklists for DSM-III-R method. If we take the Royal Park Multidiagnostic Instrument for Psychosis as the criterion measure, because of its emphasis on clinical validity, comprehensive data collection, and nondiscretionary application of diagnostic decision rules, the Munich Diagnostic Checklists for DSM-III-R performed as well as the consensus procedure and the field trial instrument. The better concordance of the Munich Diagnostic Checklists for DSM-III-R in the second half of the study can be explained by a training effect in the least experienced raters (the senior psychiatry residents) rather than the change in methodology for the consensus procedure. While it can be argued that one procedure is more valid and entitled to be regarded as the criterion against which others may be judged, there is at present little consensus about the grounds for establishing such a hierarchy (10). Furthermore, such a step is unnecessary to understand the most important implications of these data. The relative lack of concordance, even between the most systematic pair of procedures, the Royal Park Multidiagnostic Instrument for Psychosis and the field trial instrument, was reflected in levels of percent disagreement ranging from 24% to 34%. The contributing factors include virtually all the standard sources of unreliability described by Spitzer and Williams (2), notably variance in information, interpretation, and occasion.
Some of these relate to the timing of assessments within an episode, the data sources that can be included, and the mode of applying the criteria to the clinical data set. The problem of criterion variance has been reduced but not solved by the advent of operational criteria, as illustrated by Winokur et al. (1). In this study, we attempted to standardize as many procedural aspects as possible, e.g., the timing of assessments and the time period considered, the latter being enhanced by the fact that the study group consisted purely of first-episode psychotic patients. We believe that even if other diagnostic procedures had been used, the general findings would have been similar, namely, that a significant lack of concordance would have been demonstrated among competing procedures aiming to assign diagnoses according to a single operationalized system. There are important implications of this level of misclassification or differential classification for research and clinical practice. As recognized by Kendler (6), a substantial yet poorly appreciated level of error continues to cloud interpretation of research data. While nowhere near as serious a problem as in the pre-operational era, this residual error makes it more difficult to
discern biological correlates of key diagnostic categories and to link them with treatment response and patterns of outcome. Genetic studies are particularly affected by this source of error. For research purposes, it is clearly important to standardize the diagnostic procedure as well as specifying the operational criteria as clearly as possible or, alternatively, to require researchers to specify exactly how their operational diagnoses were assigned (1). Blacker (18) has discussed the question of misclassification in research in a clear and sophisticated manner and made practical suggestions for minimizing the problem. From a clinical perspective, in our specific experience with several hundred patients with first-episode psychosis, misclassification can lead to iatrogenic effects because of the prognosis-based diagnostic categories with which we continue to work in psychosis. If even rigorous operational procedures in a research context can cross-sectionally misclassify people into a diagnostic group with an inherently pessimistic set of expectations linked to it, e.g., schizophrenia, this should be of concern to clinicians. Combined with the complexity and temporal instability of psychopathology in early psychosis (19), the present study indicates the need for circumspection in diagnostic and psychoeducational approaches to patients with first-episode psychosis and to their families.
REFERENCES

1. Winokur G, Zimmerman M, Cadoret R: 'Cause the Bible tells me so. Arch Gen Psychiatry 1988; 45:683-684
2. Spitzer RL, Williams JBW: Classification of mental disorders and DSM-III, in Comprehensive Textbook of Psychiatry, 3rd ed, vol 1. Edited by Kaplan HI, Freedman AM, Sadock BJ. Baltimore, Williams & Wilkins, 1980
3. American Psychiatric Association: Report From the DSM-IV Field Trial for Schizophrenia and Related Psychotic Disorders. Iowa City, University of Iowa, 1992
4. Ellis PM, Welch G, Purdie GL, Mellsop GW: Australasian field trials of the mental and behavioural disorders section of the draft ICD-10. Aust NZ J Psychiatry 1990; 24:313-321
5. Ellis P, McGorry P, Ungvari G, Chaplin R, Chapman M, Collings S, Hantz P, Little J, Mellsop G, Purdie G, Richards J, Silfverskiold P: Australasian field trials of the multi-axial version of the draft ICD-10 (mental and behavioural disorders section), in Proceedings of the Australian Society for Psychiatric Research 1993 Annual Scientific Meeting. Sydney, Australian Society for Psychiatric Research, 1993
6. Kendler KS: The impact of diagnostic misclassification on the pattern of familial aggregation and coaggregation of psychiatric illness. J Psychiatr Res 1987; 21:55-91
7. McGorry P: Early psychosis prevention and intervention centre. Australasian Psychiatry 1993; 1:32-34
8. McGorry PD, Copolov DL, Singh BS: Royal Park Multidiagnostic Instrument for Psychosis, part I: rationale and review. Schizophr Bull 1990; 16:501-515
9. McGorry PD, Singh BS, Copolov DL, Kaplan I, Dossetor CR, van Riel RJ: Royal Park Multidiagnostic Instrument for Psychosis, part II: development, reliability, and validity. Schizophr Bull 1990; 16:517-536
10. Faraone SV, Tsuang MT: Measuring diagnostic accuracy in the absence of a "gold standard." Am J Psychiatry 1994; 151:650-657
11. Hiller W, von Bose M, Dichtl G, Agerer D: Reliability of checklist-guided diagnoses for DSM-III-R affective and anxiety disorders. J Affect Disord 1990; 20:245-247
12. Hiller W, Zaudig M, Mombour W: The development of diagnostic checklists for use in routine clinical care: a guideline designed to assess DSM-III-R diagnoses. Arch Gen Psychiatry 1990; 47:782-784
13. Cohen J: A coefficient of agreement for nominal scales. Educ Psychol Meas 1960; 20:37-46
14. Fleiss JL, Cohen J, Everitt BS: Large sample standard errors of kappa and weighted kappa. Psychol Bull 1969; 72:323-327
15. Siegel S, Castellan NJ: Nonparametric Statistics for the Behavioral Sciences. New York, McGraw-Hill, 1988
16. McKenzie DP, Clarke DM, Low LH: A method of constructing parsimonious diagnostic and screening tests. Int J Methods Psychiatr Res 1992; 2:71-79
17. McGorry PD, Singh BS, Connell S, McKenzie D, van Riel R, Copolov DL: Diagnostic concordance in functional psychosis revisited: a study of inter-relationships between alternative concepts of psychotic disorder. Psychol Med 1992; 22:367-378
18. Blacker D: Reliability, validity and the effects of misclassification in psychiatric research, in Research Designs and Methods in Psychiatry. Edited by Fava M, Rosenbaum JF. Amsterdam, Elsevier, 1992
19. McGorry PD: The influence of illness duration on syndrome clarity and stability in functional psychosis: does the diagnosis emerge and stabilize with time? Aust NZ J Psychiatry 1994; 28:607-619