Spurious Precision: Procedural Validity of Diagnostic Assessment in Psychotic Disorders

Patrick D. McGorry, Ph.D., F.R.A.N.Z.C.P., Cathy Mihalopoulos, B.B.S., Lisa Henry, M.Psych., Jenepher Dakis, F.R.A.N.Z.C.P., Henry J. Jackson, Ph.D., Michael Flaum, M.D., Susan Harrigan, Grad.Dip.Appl.Sci., Dean McKenzie, B.A., Jayashri Kulkarni, F.R.A.N.Z.C.P., and Robert Karoly, F.R.A.N.Z.C.P.

Objective: Very few studies have quantified the level of agreement among alternative diagnostic procedures that use a common set of fixed operational criteria. The authors examined the procedural validity of four independent methods of assigning DSM-III-R diagnoses of psychotic disorders. Method: The research was conducted as a satellite study to the DSM-IV Field Trial for Schizophrenia and Related Psychotic Disorders. The setting was the National Health and Medical Research Council Schizophrenia Research Unit's Early Psychosis Prevention and Intervention Centre, which focuses on first-episode psychosis. Consecutively admitted patients (N=50) were assessed by independent raters who used four different procedures to determine a DSM-III-R diagnosis. These procedures were 1) the diagnostic instrument developed for the DSM-IV field trial, 2) the Royal Park Multidiagnostic Instrument for Psychosis, 3) the Munich Diagnostic Checklists, and 4) a consensus DSM-III-R diagnosis assigned by a team of clinician researchers who were expert in the use of diagnostic criteria. Results: Concordance between pairs of diagnostic procedures was only moderate. Percent agreement, however, ranged from 66% to 76%, with converse misclassification rates of 24%-34% (assuming one procedure to be "correct"). Despite the introduction of operationally defined DSM-III-R diagnoses, there remained an appreciable level of differential classification or misclassification arising from variability in the method of assigning the diagnostic criteria rather than the criteria themselves. Conclusions: These findings have significant research and clinical implications. Such misclassification may impede neurobiological research and have harmful clinical effects on patients with first-episode psychosis. (Am J Psychiatry 1995; 152:220-223)

Presented at the 7th Biennial Winter Workshop on Schizophrenia, Les Diablerets, Switzerland, Jan. 23-28, 1994. Received March 7, 1994; revision received June 21, 1994; accepted Aug. 4, 1994. From the Early Psychosis Prevention and Intervention Centre and the National Health and Medical Research Council Schizophrenia Research Unit, Royal Park Hospital. Address reprint requests to Dr. McGorry, Royal Park Hospital, Park St., Parkville, Victoria 3052, Australia.

There is no doubt that the development of sets of operational criteria, and assessment procedures to apply these criteria, has resulted in substantial improvements in diagnostic reliability. Nevertheless, as pointed out by Winokur et al. (1), serious questions still remain about the reliability and validity of these sets of operational criteria in research studies. These questions arise from the fact that even when much care, time, and effort are put into the assignment of operational criteria, misreadings, misinterpretations, and idiosyncratic use of the criteria are common. The extent of this problem is often concealed by statements within research reports that provide little or no information about the process of diagnostic assignment.

For any of the major diagnostic systems, such as DSM-III-R, a number of assessment procedures have been developed to collect the psychopathological data and apply the diagnostic criteria. The procedures vary in how and when the data are collected and may or may not specify how the criteria are to be precisely applied. There is ample scope for many of the sources of unreliability, including information variance and the process of actually interpreting and applying the criteria, which are often broadly and imprecisely defined (e.g., how prominent do prominent affective symptoms need to be?), to influence the ultimate diagnosis. Spitzer and Williams (2, p. 1039) introduced the term "procedural validity" to focus upon this issue and defined it as follows:

The term "procedural validity" can be used . . . whenever the question being asked concerns the extent to which a new diagnostic procedure yields results similar to the results of an established diagnostic procedure that is used as a criterion. Procedural validity of the diagnostic evaluation procedure . . . speaks only to the issue of the validity of the evaluation procedure and not to the validity of the diagnostic categories themselves.

Procedural validity was felt to be a more satisfactory term than concurrent validity in the operational age because it focused upon the procedure rather than a combination of the procedure and the operational definition. To consider it as a form of validity, however, does require that the assumption be made that some procedures are more valid than others. This is undoubtedly true, although criteria for assessing this are somewhat elusive.

The present study was conducted as part of our international collaborative research in diagnosis and classification of psychotic disorders (3-5) and, specifically, as a satellite study to our participation as a field trial center, one of two non-U.S. sites, for the DSM-IV Field Trial for Schizophrenia and Related Psychotic Disorders (3). The principal aim was to examine the procedural validity of four methods of assigning DSM-III-R diagnoses of psychotic disorders in order to gain an estimate of the rate of potential misclassification (6) in a research environment.

METHOD

Fifty consecutively admitted patients with first-episode psychotic illness treated at the Early Psychosis Prevention and Intervention Centre (7), a specialist program of Royal Park Hospital, which has a catchment area of 800,000 people, were recruited for the study during 1992. The mean age for the study group was 26.3 years (SD=6.8, range=18-45). There were more men (N=31, 62%) than women (N=19, 38%), the majority (N=42, 84%) had never married, and 56% (N=28) were unemployed at the time of index assessment. Mean number of years of education was 11.1 (SD=2.2). Organic etiology and mental retardation were exclusion criteria. Written informed consent was obtained from all subjects. The study group approximated an incidence sample of first-episode psychosis for a defined area of Melbourne.

The study formed part of the multicenter DSM-IV Field Trial for Schizophrenia and Related Psychotic Disorders (3), which examined the reliability and concordance of three alternative sets of options for diagnosing DSM-IV psychotic disorders plus the criteria from DSM-III, DSM-III-R, and ICD-10.

Our study followed the basic methodology of the parent study, with the first 25 interviews using a test-retest reliability design and the second 25 interviews an interrater design. One interviewer (J.D.) interviewed all 50 patients by using the field trial instrument developed for the parent study. The first 25 patients were interviewed on a separate occasion by a second interviewer, using the field trial instrument, within 48 hours of the initial interview (test-retest design). The second 25 patients were each interviewed by a pair of interviewers on a single occasion (interrater design). All interviews were conducted during the recovery phase of the initial psychotic episode, and the position of primary interviewer for interrater agreement was alternated. The field trial instrument enabled raters to apply six sets of diagnostic criteria: DSM-III, DSM-III-R, ICD-10, and three alternative options for DSM-IV psychotic disorders (3). This diagnostic procedure was compared with three other methods of assigning DSM-III-R diagnoses for psychotic disorders that were independently employed by other members of the research team with the same subjects. The diagnostic procedures used were as follows:

1. DSM-IV field trial instrument (3). The field trial instrument was divided into two sections: the first section comprised a semistructured interview for the assessment of psychopathology; the second comprised a sequence of checklists for the six sets of operational criteria, which were rated on completion of the interview by means of the content of the interview and other ratings. The links between the two sections were not especially tight and therefore allowed for some flexibility and rater judgment and, also, error. The diagnoses assigned by the rater who interviewed all 50 patients with the field trial instrument were used for this study.

2. Royal Park Multidiagnostic Instrument for Psychosis (8, 9). This is a comprehensive psychopathological assessment tool designed principally for the assessment of a first psychotic episode. It is validity oriented and uses serial interviews and multiple information sources to construct a psychopathological database for the episode, to which 14 different systems of operational diagnostic criteria, including DSM-III-R, are applied. This tool is a routine assessment in our unit and is employed by research psychologists highly experienced in its use. If there can be a gold standard in studies of procedural validity, and this is questionable (10), in our unit the Royal Park Multidiagnostic Instrument for Psychosis would best represent this for first-episode psychosis. The assessment was carried out independently of the field trial instrument assessment, and the rater was blind to the field trial instrument ratings and diagnosis.

3. Munich Diagnostic Checklists for DSM-III-R (11, 12). This procedure consists of a set of pocket-sized lists of the criteria for each diagnostic category. It was developed as an aid to systematic diagnostic evaluation under routine clinical conditions with time limits. The presentation is clear and easy to follow. It involves a blend of simple checklist and flowchart format with a series of boxes to record ratings. In the present study, these were completed by the senior psychiatry resident who was responsible for the routine clinical assessment and management of the patient. The Munich Diagnostic Checklists for DSM-III-R represent a brief but systematic approach to DSM-III-R diagnosis, which might be more efficient than a more complex procedure, yet more rigorous than merely applying the diagnostic criteria from memory or by reference to a manual.


4. Consensus diagnostic procedure. A procedure of this nature has been proposed as a criterion measure for studies of procedural validity (2); however, there are significant assumptions involved that may not be valid (1). The consensus procedure involved a detailed presentation of each case by the treating psychiatry resident to a stable team of four experienced clinicians (P.D.M., H.J.J., J.K., R.K.). A minimum of three clinicians was required to be present for each case. Each clinician completed ratings for the six criteria sets included in the field trial instrument, including DSM-III-R. This process was carried out independently, without discussion, after each case was presented. The first 25 cases were rated in a series of fortnightly meetings over a period of several months, while the second 25 cases were rated on a single day at the end of the study.

RESULTS

The DSM-III-R profile of the study group, determined by the field trial instrument, was as follows: schizophrenia, N=24 (48%); schizophreniform disorder, N=7 (14%); schizoaffective disorder, N=2 (4%); brief reactive psychosis, N=2 (4%); major depressive episode with psychotic features, N=4 (8%); manic episode with psychotic features, N=10 (20%); and psychotic disorder not otherwise specified, N=1 (2%). Reliability for the Royal Park Multidiagnostic Instrument for Psychosis and field trial instrument procedures was assessed and found to be satisfactory.

Levels of agreement for pairs of the four diagnostic procedures are presented in table 1. Each of the four procedures was used to assign a diagnosis from one of eight DSM-III-R categories of psychotic disorder for each of the 50 subjects.

TABLE 1. Agreement Between Pairs of Four Procedures Used to Diagnose 50 Patients With a First Episode of Psychotic Illness

Procedure Pair                                                      Kappa    SE    Unadjusted Agreement (%)
DSM-IV field trial instrument and Royal Park Multidiagnostic
  Instrument for Psychosis                                           0.67   0.08            76
DSM-IV field trial instrument and DSM-III-R consensus diagnosis      0.64   0.08            74
DSM-IV field trial instrument and Munich Diagnostic Checklists
  for DSM-III-R                                                      0.53   0.09            66
DSM-III-R consensus diagnosis and Munich Diagnostic Checklists
  for DSM-III-R                                                      0.64   0.08            74
DSM-III-R consensus diagnosis and Royal Park Multidiagnostic
  Instrument for Psychosis                                           0.65   0.08            74
Munich Diagnostic Checklists for DSM-III-R and Royal Park
  Multidiagnostic Instrument for Psychosis                           0.65   0.08            74
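The kappa, standard error, and agreement figures in table 1 are standard quantities. A minimal sketch of how they can be computed for two diagnostic procedures follows; it is illustrative only, not the original analysis. The category labels and ratings are hypothetical, and the standard error shown is a simple approximation rather than the full large-sample formula of Fleiss, Cohen, and Everitt (14) cited in the text.

    # Illustrative sketch only (not the authors' code): Cohen's unweighted kappa,
    # percent agreement, and an approximate standard error for two procedures
    # assigning nominal diagnostic categories. Data below are hypothetical.
    from collections import Counter
    import math

    def kappa_and_agreement(ratings_a, ratings_b):
        n = len(ratings_a)
        # observed proportion of agreement
        p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
        count_a, count_b = Counter(ratings_a), Counter(ratings_b)
        # chance agreement from the marginal frequencies of each procedure
        p_e = sum(count_a[c] * count_b[c] for c in set(ratings_a) | set(ratings_b)) / n ** 2
        kappa = (p_o - p_e) / (1 - p_e)
        # simple approximate standard error; the paper reports standard errors
        # based on the fuller large-sample formula of Fleiss, Cohen, and Everitt (14)
        se = math.sqrt(p_o * (1 - p_o) / (n * (1 - p_e) ** 2))
        return kappa, se, 100 * p_o

    procedure_1 = ["schizophrenia", "mania", "schizophreniform", "schizophrenia", "depression"]
    procedure_2 = ["schizophrenia", "mania", "schizoaffective", "schizophrenia", "depression"]
    k, se, pct = kappa_and_agreement(procedure_1, procedure_2)
    print(f"kappa={k:.2f}, SE={se:.2f}, agreement={pct:.0f}%")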

Comparisons of diagnostic assignment resulted in an 8x8 matrix for each pair of procedures. Cohen's unweighted nominal kappa (13) is presented as an index of agreement between each pair of procedures; percent agreement is also given. A total of six pairwise comparisons were possible among the four diagnostic procedures. The standard error of each of the six kappas (14) is also reported. Kappa values were moderate and statistically comparable for all pairwise comparisons, as indicated by the standard errors. Percent agreement was also modest. Because comparisons that involve the standard error for kappa do not take into account possible correlation between the kappas, pairwise comparisons among agreements were made by using McNemar's test for correlated proportions (15). This test has previously been used to compare percent agreement in studies of psychiatric diagnosis (16). McNemar's test indicated that the difference between the highest (76%) and lowest (66%) percent agreement was not statistically significant (χ²=1.45, df=1, p>0.10, one-tailed), thereby confirming similar levels of agreement between pairs of diagnostic procedures. Full diagnostic concordance among all four procedures occurred in only 54% (N=27) of the subjects.

The data were also analyzed separately for the first and second halves of the cohort in view of the change in methodology for the consensus procedure, which was condensed into a single day for the last 25 subjects. This revealed that the Munich Diagnostic Checklists for DSM-III-R achieved higher levels of agreement with all other procedures for the second half than for the first. The magnitude of this difference was a 12%-20% increase in percent agreement and a 0.2-0.3 increase in kappa.
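The McNemar comparison of correlated percent agreements described above can be sketched as follows. Again, this is illustrative only: the agreement indicators are hypothetical, and no continuity correction is applied because the text does not state whether one was used.

    # Illustrative sketch only: McNemar's test for two correlated proportions of
    # agreement measured on the same subjects. Each list holds 1 if the given
    # procedure pair agreed on a subject and 0 otherwise (hypothetical data).
    def mcnemar_chi2(agree_pair_a, agree_pair_b):
        b = sum(1 for x, y in zip(agree_pair_a, agree_pair_b) if x == 1 and y == 0)
        c = sum(1 for x, y in zip(agree_pair_a, agree_pair_b) if x == 0 and y == 1)
        # chi-square statistic with df = 1, based only on the discordant counts
        return (b - c) ** 2 / (b + c) if (b + c) else 0.0

    pair_highest = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]   # e.g., field trial vs. Royal Park
    pair_lowest  = [1, 0, 0, 1, 1, 0, 0, 1, 1, 1]   # e.g., field trial vs. Munich checklists
    print(f"McNemar chi-square = {mcnemar_chi2(pair_highest, pair_lowest):.2f}")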

DISCUSSION

In contrast to much research that has compared the concordance of competing sets of operational criteria (17) in making psychiatric diagnoses, the aim of the present study was to evaluate the performance of alternative procedures in assigning a common set of operational criteria. No more than modest levels of agreement were found among the four procedures used for assigning DSM-III-R psychotic disorders. It could be argued that the concordance is satisfactory and sufficient to cross-validate the procedures, particularly the simpler, time-efficient Munich Diagnostic Checklists for DSM-III-R method. If we take the Royal Park Multidiagnostic Instrument for Psychosis as the criterion measure, because of its emphasis on clinical validity, comprehensive data collection, and nondiscretionary application of diagnostic decision rules, the Munich Diagnostic Checklists for DSM-III-R performed as well as the consensus procedure and the field trial instrument. The better concordance of the Munich Diagnostic Checklists for DSM-III-R in the second half of the study can be explained by a training effect in the least experienced raters (the senior psychiatry residents) rather than the change in methodology for the consensus procedure.

While it can be argued that one procedure is more valid and entitled to be regarded as the criterion against which others may be judged, there is at present little consensus about the grounds for establishing such a hierarchy (10). Furthermore, such a step is unnecessary to understand the most important implications of these data. The relative lack of concordance, even between the most systematic pair of procedures, the Royal Park Multidiagnostic Instrument for Psychosis and the field trial instrument, was reflected in levels of percent disagreement ranging from 24% to 34%. The contributing factors include virtually all the standard sources of unreliability described by Spitzer and Williams (2), notably variance in information, interpretation, and occasion. Some of these relate to the timing of assessments within an episode, the data sources that can be included, and the mode of applying the criteria to the clinical data set. The problem of criterion variance has been improved but not solved by the advent of operational criteria, as illustrated by Winokur et al. (1). In this study, we attempted to standardize as many procedural aspects as possible, e.g., the timing of assessments and the time period considered, the latter being enhanced by the fact that the study group consisted purely of first-episode psychotic patients. We believe that even if other diagnostic procedures had been used, the general findings would have been similar, namely, that a significant lack of concordance would have been demonstrated among competing procedures aiming to assign diagnoses according to a single operationalized system.

There are important implications of this level of misclassification or differential classification for research and clinical practice. As recognized by Kendler (6), a substantial yet poorly appreciated level of error continues to cloud interpretation of research data. While nowhere near as serious a problem as in the pre-operational era, this residual error makes it more difficult to discern biological correlates of key diagnostic categories and to link them with treatment response and patterns of outcome. Genetic studies are particularly affected by this source of error. For research purposes, it is clearly important to standardize the diagnostic procedure as well as specifying operational criteria as clearly as possible, or, alternatively, to require researchers to specify exactly how their operational diagnoses were assigned (1). Blacker (18) has discussed the question of misclassification in research in a clear and sophisticated manner and made practical suggestions for minimizing the problem. From a clinical perspective, in our specific experience with several hundred patients with first-episode psychosis, misclassification can lead to iatrogenic effects because of the prognosis-based diagnostic categories with which we continue to work in psychosis. If even rigorous operational procedures in a research context can cross-sectionally misclassify people into a diagnostic group with an inherently pessimistic set of expectations linked to it, e.g., schizophrenia, this should be of concern to clinicians. Combined with the complexity and temporal instability of psychopathology in early psychosis (19), the present study indicates the need for circumspection in diagnostic and psychoeducational approaches to patients with first-episode psychosis and to their families.

REFERENCES

1. Winokur G, Zimmerman M, Cadoret R: 'Cause the Bible tells me so. Arch Gen Psychiatry 1988; 45:683-684
2. Spitzer RL, Williams JBW: Classification of mental disorders and DSM-III, in Comprehensive Textbook of Psychiatry, 3rd ed, vol 1. Edited by Kaplan HI, Freedman AM, Sadock BJ. Baltimore, Williams & Wilkins, 1980
3. American Psychiatric Association: Report From the DSM-IV Field Trial for Schizophrenia and Related Psychotic Disorders. Iowa City, University of Iowa, 1992
4. Ellis PM, Welch G, Purdie GL, Mellsop GW: Australasian field trials of the mental and behavioural disorders section of the draft ICD-10. Aust NZ J Psychiatry 1990; 24:313-321
5. Ellis P, McGorry P, Ungvari G, Chaplin R, Chapman M, Collings S, Hantz P, Little J, Mellsop G, Purdie G, Richards J, Silfverskiold P: Australasian field trials of the multi-axial version of the draft ICD-10 (mental and behavioural disorders section), in Proceedings of the Australian Society for Psychiatric Research 1993 Annual Scientific Meeting. Sydney, Australian Society for Psychiatric Research, 1993
6. Kendler KS: The impact of diagnostic misclassification on the pattern of familial aggregation and coaggregation of psychiatric illness. J Psychiatr Res 1987; 21:55-91
7. McGorry P: Early psychosis prevention and intervention centre. Australasian Psychiatry 1993; 1:32-34
8. McGorry PD, Copolov DL, Singh BS: Royal Park Multidiagnostic Instrument for Psychosis, part I: rationale and review. Schizophr Bull 1990; 16:501-515
9. McGorry PD, Singh BS, Copolov DL, Kaplan I, Dossetor CR, van Riel RJ: Royal Park Multidiagnostic Instrument for Psychosis, part II: development, reliability, and validity. Schizophr Bull 1990; 16:517-536
10. Faraone SV, Tsuang MT: Measuring diagnostic accuracy in the absence of a "gold standard." Am J Psychiatry 1994; 151:650-657
11. Hiller W, von Bose M, Dichtl G, Agerer D: Reliability of checklist-guided diagnoses for DSM-III-R affective and anxiety disorders. J Affect Disord 1990; 20:245-247
12. Hiller W, Zaudig M, Mombour W: The development of diagnostic checklists for use in routine clinical care: a guideline designed to assess DSM-III-R diagnoses. Arch Gen Psychiatry 1990; 47:782-784
13. Cohen J: A coefficient of agreement for nominal scales. Educational and Psychological Measurement 1960; 20:37-46
14. Fleiss JL, Cohen J, Everitt BS: Large sample standard errors of kappa and weighted kappa. Psychol Bull 1969; 72:323-327
15. Siegel S, Castellan NJ: Nonparametric Statistics for the Behavioral Sciences. New York, McGraw-Hill, 1988
16. McKenzie DP, Clarke DM, Low LH: A method of constructing parsimonious diagnostic and screening tests. Int J Methods Psychiatr Res 1992; 2:71-79
17. McGorry PD, Singh BS, Connell S, McKenzie D, van Riel R, Copolov DL: Diagnostic concordance in functional psychosis revisited: a study of inter-relationships between alternative concepts of psychotic disorder. Psychol Med 1992; 22:367-378
18. Blacker D: Reliability, validity and the effects of misclassification in psychiatric research, in Research Designs and Methods in Psychiatry. Edited by Fava M, Rosenbaum JF. Amsterdam, Elsevier, 1992
19. McGorry PD: The influence of illness duration on syndrome clarity and stability in functional psychosis: does the diagnosis emerge and stabilize with time? Aust NZ J Psychiatry 1994; 28:607-619
