
Journal of Consulting and Clinical Psychology 1977, Vol. 45, No. 5, 772-783

Validation of a Classroom Observation Code for Hyperactive Children

Howard Abikoff, Rachel Gittelman-Klein, and Donald F. Klein
Long Island Jewish-Hillside Medical Center, Glen Oaks, New York

The purpose of this study was to devise a classroom observation code that would identify hyperactive children reliably. A 14-category observation code was used to record the classroom behavior of 60 children referred to an outpatient clinic for hyperactivity and 60 same-sex normal children. The overall mean phi coefficient for interval agreement was .76, indicating adequate interobserver reliability. Children referred for hyperactivity had significantly higher scores than comparison children on 12 categories. There was greater within-subject variability in the hyperactive group. Motor activity for both groups was significantly inversely related to age. The behavior frequencies during initial and later observations were not significantly different, indicating a lack of systematic observer effects on the child's behavior. However, there was overlap between the hyperactive and comparison children for all observation categories. Poor discrimination between the groups was therefore obtained with single-category criteria. Two-category criteria, especially the dyad interference and off task, substantially increased the code's discriminability, resulting in relatively few false positive and few false negative classifications. The code is felt to be a reliable and valid instrument for the objective quantification of classroom behavior in hyperactive children.

This study was supported in part by U.S. Public Health Service Grant MH18579 and an internal grant (MH18579-06) from the Long Island Jewish-Hillside Medical Center. The authors wish to acknowledge gratefully the contribution of Lucille Westrich, who participated in the revision of the code and who throughout has supervised the scheduling of appointments for observation purposes. Ronald Kent was consulted regarding code development. Thanks are extended to the observers—Pamela Antell, Rosalie Apolet, and Leslie Epstein. Requests for reprints should be sent to Howard Abikoff, Long Island Jewish-Hillside Medical Center, P.O. Box 38, Glen Oaks, New York 11004.

This article presents an observation code to enable quantification of the classroom behavior of hyperactive children. These children typically evidence excessive motor activity, inattentiveness, and poor impulse control for their age (Douglas, 1972; Safer & Allen, 1976; Wender, 1971). During the past decade, hyperactive children have received increasing attention from educators and mental health professionals, partly because of new psychopharmacological and behavioral treatments. These clinical interventions spurred a need for instruments to identify hyperactive children and to provide estimates of treatment effects. Several rating scales have been devised to meet these objectives (Classroom Behavior Inventory—Greenberg, Deem, & McMahon, 1972; Hyperactivity and Withdrawal Rating scales—Bell, Waldrop, & Weller, 1972; Conners' Teacher Rating Scale—Conners, 1969; Conners' Abbreviated Teacher Rating Scale—Conners, 1972; Behavior Check List—Werry & Sprague, 1970; Children's Psychiatric Rating Scale—Early Clinical Drug Evaluation Unit, 1973). The Teacher Rating Scale is the only one that has demonstrated validity for the identification of hyperactive children (Sprague, Christensen, & Werry, 1974).

Behavior rating scales have advantages and disadvantages. They are obtained easily and have the inestimable virtue of being inexpensive. On the negative side, they may be subject to halo effects (Guilford, 1954), leading to excessively high ratings due to elevation in only one or a few behaviors. Consequently, these scales may have limited value for a precise quantification of discrete deviant behaviors.

Observational codes have serious disadvantages as well—they are cumbersome and expensive. More important is that they may be invalid if the behaviors rated are variable and if adequate behavior sampling is not obtained. Additional problems associated with the collection and analysis of observational data, including observer effects on behavior, observer drift, and response class definition, have been discussed by Johnson and Bolstad (1973), Reid (1970), and Weick (1968).

However, when valid, observational codes offer important advantages. They may be useful for the evaluation of specific behaviors without distortion due to halo effects, rater set, or bias. Kent, O'Leary, Diament, and Dietz (1974) have presented data indicating that whereas global ratings are vulnerable to rater bias, observation code ratings are not. Unbiased observations are essential in many experiments. For instance, in assessing behavioral differences between boys and girls, unbiased observations insure that sex-linked expectancies do not affect the ratings obtained. Another situation faced by us and others (O'Leary, Pelham, Rosenbaum, & Price, 1976) is the need for evaluating experimental therapeutic interventions—typically behavior modification—implemented by teachers. Since teachers are agents of change, blind independent raters are necessary to measure treatment effects to avoid reports of unduly large improvement rates by teacher therapists. Observation scales offer additional advantages, since behavioral definitions can be defined with clear operational criteria, whereas scale items are typically vague. Thus, criterion error variance can be decreased.
Although teachers' global ratings, at separate time intervals, allow for assessment of the temporal stability of behavior as perceived by the teacher, there is no opportunity, except for infrequent instances of team teaching, to evaluate the degree to which another person would agree with the teacher's ratings. Observation scales, on the other hand, allow for the establishment of interrater agreement. Jones, Reid, and Patterson (1975) have provided a detailed discussion of the potential usefulness of observation systems for behavioral assessment.

In addition, observational scales provide relevant clinical diagnostic data. Among hyperactive children, the cross-situational variability of the children's behavior, along with difficulties in developing instruments with predictive utility for treatment response, causes problems in the diagnosis of the hyperkinetic syndrome (Klein & Gittelman-Klein, 1975; Shaffer, McNamara, & Pincus, 1974). What are the implications of labeling two groups of children as hyperactive, one of which is restless and aggressive and the second of which is distractible and impulsive? Are there differences in the drug responsiveness of such groups? Do they respond differentially to different treatments? Reliable observational data should contribute to the clarification of these issues, allowing classification rules for subdiagnosis, which can then be validated against different treatment conditions.

Several classroom observation procedures have been reported. Becker, Madsen, Arnold, and Thomas (1967) developed a code containing behavior categories thought to be incompatible with academic learning to assess the effects of teacher attention and praise in reducing classroom behavior problems. Revised versions have been used in the classroom to observe "conduct problem children" (Werry & Quay, 1969), to assess medication effects on emotionally disturbed underachieving boys (Sprague, Barnes, & Werry, 1970), to quantify hyperkinetic behaviors of institutionalized mental retardates (Christensen, 1975), and to record the behavior of hyperactive children during structured academic periods (Ayllon, Layman, & Kandel, 1975).
Behavioral coding systems have also been developed to observe the classroom behavior of primary-grade "discipline problem" children (Bernal & North, Note 1) and to assess the effect of teacher attention and praise in reducing classroom behavior problems (Cobb & Ray, as described in Patterson, Cobb, & Ray, 1972).

Blunden, Spring, and Greenberg (1974) developed a 10-category observation code to coincide with 10 teacher-rated items of the Classroom Behavior Inventory. Poor concordance between the code and the teacher ratings for normal and "behavior problem" children was obtained: 9 of the 10 correlations between the two measures were nonsignificant. The authors do not report the score variances. It is conceivable that the range of scores within the groups limited the likelihood of detecting significant relationships between the observations and the teacher ratings. The validity of either instrument remains unclear. So far, O'Leary et al. (1976) are unique in reporting the use of a code that significantly differentiates hyperactive from normal children. Given this paucity of data, the diagnostic validity of observational codes is not firmly established.

If a code discriminates between normal and hyperactive children, several clinical hypotheses can be investigated. The behavior of hyperactive children has been reported to be variable and inconsistent from clinical observation (Laufer & Denhoff, 1957) and from psychometric results (Douglas, 1972). Further, it has been reported that with age, both hyperactive and normal children exhibit decreases in activity level (Routh, Schroeder, & O'Tuama, 1974; Shaffer et al., 1974), and normal children show decreases in teacher-reported "distractibility" and "short attention span" (Werry & Quay, 1971).

Hypotheses

It was predicted that (a) hyperactive children would exhibit significantly more inappropriate behavior than normals; (b) hyperactive children would display significantly more variable classroom behavior than normals; and (c) the level of motor activity for hyperactive and normal children and off-task behaviors for normals would be significantly inversely related to age.
In the use of a classroom observational code, care needs to be taken that ratings are independent of the impact of the observer on the children's behavior (Gelfand & Hartmann, 1975). It is possible that the presence of an observer has an initial nonrandom effect on the children's behavior. The immediate effect of the classroom observer might be either inhibiting or disinhibiting of inappropriate behavior. If an initial observer effect existed, significant differences between the frequencies of initial and later observed behaviors could occur. Consequently, the last hypothesis was that the first three classroom observations would yield significant differences from later observations in frequency of behavior (direction not predicted). This analysis would be of significance to establish the technology of classroom observation.

Method

Study Context

During the past 2 years, we (with others) have been conducting an experimental study to assess the relative efficacy of three interventions among hyperactive children: methylphenidate, methylphenidate with behavior modification, and placebo with behavior modification. For purposes of the study, a code developed at the State University of New York at Stony Brook (Kent & O'Leary, 1976) was adapted to record the classroom behavior of children referred for treatment and that of normal children. The observational data collected on untreated hyperactive children and controls are presented in this study.

Subjects

Hyperactive children (targets). To be accepted for observation, a child had to be between the ages of 6 and 12, attend elementary school, obtain a Wechsler Intelligence Scale for Children Verbal IQ of at least 85 and a Performance IQ not lower than 70, and be free of gross neurological disease and psychosis. The child had to be rated as hyperactive by the teacher, as defined by a minimum mean score of 1.8 out of a possible maximum of 3.0 on the Hyperactivity factor of Conners' Teacher Rating Scale (Conners, 1969). This criterion score was based on norms obtained by Sprague et al. (1974), who reported a mean factor score of .40 (SD = .55) for 291 normal children and a mean of 2.17 (SD = .12) for hyperactive children. As a further requirement for treatment consideration, the child had to be rated as hyperactive by the mother or be reported to have other significant behavior difficulties at home. The level of activity at home was quantified by use of a behavior checklist that was devised for this purpose (Werry & Sprague, 1970) and that has been shown to be sensitive to drug effects (Gittelman-Klein, Klein, Katz, Saraf, & Pollack, 1976). To be considered hyperactive at home, the child had to obtain a minimum mean score of 3.6 on 11 items (scored 1-5). Of 205 children referred by schools whose parents applied for treatment, 60 (56 boys and 4 girls) met the above criteria and were observed. Their mean age was 8 years 2 months.

Comparison children. Each observed child was paired with a same-sex child reported to present average behavior by the teacher. Birth dates were not available for the 60 comparison children, but their age range was considered to parallel that of the target children. The teacher was asked to complete a Teacher Rating Scale for each comparison child. (This procedure was initiated during the course of the study. Consequently, these data were obtained for part of the sample only.)
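The inclusion criteria for target children can be summarized as a simple screening rule. The sketch below is illustrative only: the `Referral` record and its field names are hypothetical, the age bounds are assumed to be inclusive and in years, and the thresholds are the ones quoted in the text.

```python
from dataclasses import dataclass


@dataclass
class Referral:
    """Hypothetical record for one referred child (field names are illustrative)."""
    age_years: float
    attends_elementary_school: bool
    verbal_iq: int
    performance_iq: int
    neurological_disease_or_psychosis: bool
    teacher_hyperactivity_mean: float  # Conners' TRS Hyperactivity factor, 0-3
    home_checklist_mean: float         # 11-item home checklist, items scored 1-5
    other_home_difficulties: bool


def is_eligible(r: Referral) -> bool:
    """Apply the inclusion criteria described in the Subjects section."""
    background_ok = (
        6 <= r.age_years <= 12            # assumed inclusive bounds
        and r.attends_elementary_school
        and r.verbal_iq >= 85
        and r.performance_iq >= 70
        and not r.neurological_disease_or_psychosis
    )
    teacher_ok = r.teacher_hyperactivity_mean >= 1.8
    # Home criterion: hyperactive at home OR other significant difficulties.
    home_ok = r.home_checklist_mean >= 3.6 or r.other_home_difficulties
    return background_ok and teacher_ok and home_ok


accepted = Referral(8.2, True, 98, 90, False, 2.3, 3.9, False)
rejected = Referral(8.2, True, 98, 90, False, 1.5, 3.9, False)  # teacher score too low
print(is_eligible(accepted), is_eligible(rejected))
```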

Classroom Observation Code

The Stony Brook code (Tonick, Friehling, & Warhit, Note 2), intended for use with "problem children," was chosen because of its demonstrated resistance to rater bias, its provision for many categories relevant to the clinical group under study, its detailed scoring criteria, and its satisfactory level of reported interobserver reliability. The code, however, did not provide behavioral categories reflecting motor activity. Categories to measure children's classroom motor activity level while in and out of their seats were added. In addition, our pilot classroom experience with the code necessitated some revisions. Initially, the code consisted of 13 observational categories. Absence of behavior was rated when none of the other behaviors occurred. Two of the 13 categories, frustration and noises to self, had extremely low frequencies and were dropped from the code. The original category of gross motor movement was broken down in the course of the study into two categories labeled gross motor—standing, and gross motor—vigorous. A category called out of chair was added.

Observation categories. The following is a brief description of the observation categories. (The full version of the code is available from the authors. The description presented below does not enable use of the code.) The behavioral categories are sampled every 15 sec. Using a modified time-sampling procedure, nontimed behaviors are scored as soon as they occur within an interval, with only the first occurrence noted. Timed categories are scored only if the child engages in the behavior for more than 15 consecutive sec. For example, a child is scored as "off task" in Interval 2 if the behavior began in Interval 1 and continued uninterrupted throughout Interval 2.

1. Interference (nontimed) is a general measure of disruptiveness. Included are calling out, interruption of others during work periods, and clowning.
2. Solicitation (nontimed) reflects how often the child seeks out the teacher's attention (e.g., calling out to the teacher, going up to the teacher's desk).
3. Off task (timed) is a general measure of inattentiveness. It indicates attention to stimuli other than the assigned work after initiation of appropriate task-relevant behavior.
4. Minor motor movement (nontimed) is a general measure of in-chair restlessness. Only buttock movements and body and chair rocking movements are included.
5. Gross motor movement—standing (nontimed) refers to standing up without permission.
6. Gross motor movement—vigorous (nontimed) is scored when the child engages in vigorous motor activity (e.g., running, jumping).
7. Noncompliance (timed) indicates how often the child fails to comply with teacher commands.
8. Out of chair (timed) reflects how often the child remains out of his or her seat without permission.
9. Physical aggression (nontimed) indicates destructive physical behavior (e.g., hitting, pushing, throwing objects) and destruction of materials.
10. Threat or verbal aggression to children (nontimed) indicates abusive or threatening verbalizations and physical gestures directed toward other children.
11. Threat or verbal aggression to teacher (nontimed) indicates abusive or threatening verbalizations and physical gestures directed toward the teacher.
12. Extended verbalization (timed) is scored when the child engages in conversation.
13. Daydreaming (timed) is scored when the child is not attending to a specific stimulus while a task has been assigned.
14. Absence of above behaviors is scored when none of the above behaviors occur.
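The distinction between nontimed and timed categories can be illustrated with a short sketch. This is not the authors' scoring procedure, which the text says cannot be reconstructed from the brief description. The episode format `(category, start_sec, end_sec)` is hypothetical, and the timed rule is an assumed reading: a timed behavior is scored in every interval in which the episode has already lasted more than 15 sec, which reproduces the "off task" example given above.

```python
from collections import defaultdict

INTERVAL_SEC = 15  # interval length stated in the text

# Categories described as "timed" in the list above.
TIMED = {"off task", "noncompliance", "out of chair",
         "extended verbalization", "daydreaming"}


def score_intervals(episodes, n_intervals):
    """Return {category: set of 0-based interval indices in which it is scored}.

    episodes: list of (category, start_sec, end_sec) tuples (hypothetical format).
    """
    scored = defaultdict(set)
    for category, start, end in episodes:
        if category in TIMED:
            # Assumed rule: score any interval overlapping the part of the
            # episode that lies beyond its first 15 consecutive seconds.
            qualifying_from = start + INTERVAL_SEC
            for k in range(n_intervals):
                lo = max(k * INTERVAL_SEC, qualifying_from)
                hi = min((k + 1) * INTERVAL_SEC, end)
                if lo < hi:
                    scored[category].add(k)
        else:
            # Nontimed: scored at onset; a set keeps one score per interval,
            # so only the first occurrence within an interval counts.
            k = int(start // INTERVAL_SEC)
            if 0 <= k < n_intervals:
                scored[category].add(k)
    return dict(scored)


episodes = [
    ("off task", 10, 30),     # begins in Interval 1, continues through Interval 2
    ("off task", 40, 50),     # lasts only 10 sec, so it is never scored
    ("interference", 3, 4),   # two call-outs in the same interval...
    ("interference", 9, 10),  # ...are scored once
    ("interference", 47, 48),
]
result = score_intervals(episodes, n_intervals=16)  # 16 intervals = one 4-min segment
print(result)
```

With these made-up episodes, "off task" is scored only in Interval 2 (index 1), and "interference" is scored once in Interval 1 and once in Interval 4.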

Observation Procedures

Prior to observations, a social worker established that the school, the teacher, and the child's parents consented to classroom observations. A letter summarizing the treatment program was then sent to the teacher and principal. Next, a staff member (the first author) met with the teacher to obtain specific classroom rules from the teacher. The rules were used to determine the appropriateness of the children's behavior. Each teacher was also asked whether the target child evidenced more negative behaviors during morning or afternoon lessons. Whenever possible, the child was observed during the period when negative behaviors were most salient. In addition, an average child, matched for sex with the target, who was free of significant behavior problems was identified by the teacher. This child, seated in close proximity to the target for ease of observation, served as a normal comparison. No attempt was made to obtain exceptionally well-behaved children. Teachers were instructed to inform their class, the day before the first observation, of forthcoming periodic visits by people interested in becoming teachers.

Observers received extensive individual training, typically extending over a 3-week period and averaging 50 hours, including formal discussion of the code, viewing videotapes of classroom lessons, and observing children's classroom behavior in vivo. Training continued until acceptable reliability levels were obtained. These were defined as a minimum of 70% interobserver agreement across observation categories over several days. Only five of the eight observers (seniors or graduates majoring in psychology or education) who received this training program achieved this performance criterion. All five observers were blind to the design of the study, and all but one were blind to the purpose of observation and to the nature of the clinical population studied.

Each observation period lasted 32 min, with alternating 4-min periods between the target and comparison child. Each 4-min segment was divided into 15-sec intervals. Immediately after observations, ratings on a 7-point scale were periodically obtained from the teacher describing how typical of general classroom conduct the behavior of the target and the comparison had been during an observation. Because satisfactory reliability could not be reached for three children in open classrooms, observations were limited to traditional classroom settings. Observations were carried out during structured didactic teaching and during periods of independent academic work under teacher supervision.

Table 1
Mean Factor Scores for Target and Comparison Children on the Teacher Rating Form

                          Target           Comparison
Factor                    M      SD        M        SD       t(a)
1. Conduct Problems      1.19    .68      -.15(b)   .11      9.36*
2. Inattention           1.74    .59       .25      .31     11.44*
3. Anxiety                .44    .48       .43      .45       .09
4. Hyperactivity         2.30    .42       .46      .23     19.76*
5. Sociability           1.15    .70       .08      .20      7.18*

Note. Items were scored on a scale from 0 to 3. ns = 55 and 23 for target and comparison children, respectively.
(a) df = 76; one-tailed.
(b) One item, "submissive," has a negative loading on Factor 1.
* p < .0005.
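The t values reported in Table 1 can be checked from the table's own means, standard deviations, and group sizes. The sketch below assumes an independent-samples t test with pooled variance; the article does not name the test used for Table 1, but this choice is consistent with the reported df of 76 (55 + 23 - 2). The function name `pooled_t` is illustrative.

```python
import math


def pooled_t(m1, sd1, n1, m2, sd2, n2):
    """Independent-samples t with pooled variance (assumed test); returns (t, df)."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (m1 - m2) / se, df


# (target M, target SD, comparison M, comparison SD, reported t), from Table 1.
TABLE_1 = {
    "Conduct Problems": (1.19, 0.68, -0.15, 0.11, 9.36),
    "Inattention": (1.74, 0.59, 0.25, 0.31, 11.44),
    "Anxiety": (0.44, 0.48, 0.43, 0.45, 0.09),
    "Hyperactivity": (2.30, 0.42, 0.46, 0.23, 19.76),
    "Sociability": (1.15, 0.70, 0.08, 0.20, 7.18),
}
N_TARGET, N_COMPARISON = 55, 23

computed = {}
for factor, (m_t, sd_t, m_c, sd_c, reported) in TABLE_1.items():
    t, df = pooled_t(m_t, sd_t, N_TARGET, m_c, sd_c, N_COMPARISON)
    computed[factor] = t
    print(f"{factor:17s} t({df}) = {t:6.2f}   reported: {reported:5.2f}")
```

Under this assumption the recomputed values agree with the reported ones to within rounding of the published means and standard deviations.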

Results

Data were analyzed to determine the characteristics of the samples on the Teacher Rating Scale; the reliability of the code; the diagnostic validity of the code; the behavioral variability of the targets; and the relationship between observation scores, age, and time (initial vs. later observations).

Table 2
Phi Coefficients Between the Interval Scores of the Standard and Other Observers

                                                      Observer
Category                         1               2              3            4
                                 (N = 4,544)a    (N = 1,088)    (N = 704)    (N = 1,536)
Interference                     .81             .85            .80          .80
Solicitation                     .80             .83            .68          .80
Off task                         .79             .54            .71          .74
Minor motor movement             .80             .77            .80          .81
Gross motor—all                  .80             .90            1.00         .81
Gross motor—standing             .68b            NR             1.00         .85
Gross motor—vigorous             .84b            NR             1.00         .71
Noncompliance                    .70             .64            .34          .24
Out of chair                     .90b            NR             1.00         .88
Aggression                       .88             .71            .82          .71
Verbal aggression to children    .68             -.002          —(1)         .50
Verbal aggression to teacher     .58             .41            NR           —(1)
Extended verbalization           .86             —(1)           NR           1.00
Daydreaming                      .77             —(1)           NR           1.00

Note. NR means that the category was not rated by either observer. An em dash indicates that the coefficient could not be computed because one of the observers never reported the behavior. The number in parentheses indicates the frequency of rating by one of the observers.
a N refers to the number of intervals rated by the standard and the other observer.
b N = 1,152 intervals.
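For readers unfamiliar with the phi coefficient used in Table 2, a minimal sketch follows. It uses the standard 2 x 2 formula, phi = (ad - bc) / sqrt((a+b)(c+d)(a+c)(b+d)), applied to two observers' interval-by-interval scores for one category. The function name and the sample data are illustrative and are not taken from the study.

```python
import math


def phi_coefficient(standard, other):
    """Phi for two equal-length sequences of 0/1 interval scores.

    Returns None when the coefficient is undefined, for example when one
    observer never scored the behavior (the dash entries in Table 2).
    """
    if len(standard) != len(other):
        raise ValueError("both observers must rate the same intervals")
    pairs = list(zip(standard, other))
    a = sum(1 for s, o in pairs if s and o)          # both scored the behavior
    b = sum(1 for s, o in pairs if s and not o)      # standard observer only
    c = sum(1 for s, o in pairs if not s and o)      # other observer only
    d = sum(1 for s, o in pairs if not s and not o)  # neither scored it
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return None
    return (a * d - b * c) / math.sqrt(denom)


# Made-up scores for eight 15-sec intervals.
standard_obs = [1, 1, 0, 0, 1, 0, 0, 0]
other_obs = [1, 0, 0, 0, 1, 0, 0, 1]
phi = phi_coefficient(standard_obs, other_obs)
print(round(phi, 3))
```

Here a = 2, b = 1, c = 1, d = 4, so phi = 7/15, or about .47.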


Table 3
Product-Moment Correlations Between the Global Scores of the Standard and Other Observers

                                                   Observer
Category                         1             2             3             4
                                 (N = 60)      (N = 17)      (N = 12)      (N = 24)
Interference                     .96**         .94**         .93**         .97**
Solicitation                     .93**         .93**         .82*          .97**
Off task                         .93**         .14           .95**         .95**
Minor motor movement             .95**         .90**         .91**         .93**
Gross motor—all                  .82**         .96**         1.00**        .99**
Gross motor—standing             .76a**        NR            1.00**        .99**
Gross motor—vigorous             .90a**        NR            1.00**        .95**
Noncompliance                    .86**         .79**         .82*          .03
Out of chair                     .99a**        NR            1.00**        .99**
Aggression                       .90**         .76**         .89**         .69**
Verbal aggression to children    .86**         .89**         —(1)          .28
Verbal aggression to teacher     .96**         1.00**        NR            —(1)
Extended verbalization           .99**         —(1)          NR            1.00**
Daydreaming                      —(1)          —(1)          NR            1.00**

Note. An em dash indicates that the correlations could not be computed because one of the observers never reported the behavior. The number in parentheses indicates the frequency of rating by the other observer. NR indicates that the category was not rated by either observer.
a N = 22.
* p < .01. ** p < .001.
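The session-level reliability in Table 3 is a product-moment correlation between two observers' total counts per session. A minimal sketch with made-up session totals is given below; the function name is illustrative, and returning None for a zero-variance observer mirrors the dash entries in the table.

```python
import math


def session_correlation(standard_totals, other_totals):
    """Pearson product-moment correlation between per-session totals."""
    n = len(standard_totals)
    if n != len(other_totals) or n < 2:
        raise ValueError("need two equal-length sequences of at least 2 sessions")
    mean_x = sum(standard_totals) / n
    mean_y = sum(other_totals) / n
    sxx = sum((x - mean_x) ** 2 for x in standard_totals)
    syy = sum((y - mean_y) ** 2 for y in other_totals)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(standard_totals, other_totals))
    if sxx == 0 or syy == 0:
        return None  # undefined when one observer's totals never vary
    return sxy / math.sqrt(sxx * syy)


# Made-up interference totals for five sessions rated by both observers.
standard_totals = [12, 5, 20, 8, 15]
other_totals = [11, 6, 19, 8, 16]
r = session_correlation(standard_totals, other_totals)
print(round(r, 2))
```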

Teacher Rating Scale

The mean factor scores on the Teacher Rating Scale for the target and normal children are consonant with normative data obtained by Sprague et al. (1974; see Table 1). The targets were significantly more elevated than the comparisons on four factors (p < .0005), with the greatest difference found on the Hyperactivity factor. The mean score on this factor for the hyperactive children was 2.30; for the comparisons, .46. There was no overlap between the two groups on the factors of Conduct Problems, Inattention, Hyperactivity, and Sociability. No significant group difference was obtained on the Anxiety factor.

Classroom Behavior Code

Interobserver reliability measures for interval and total session scores were computed for each observation category.

Reliability of interval scores. One of the authors (H.A.) always served as the "standard" in determining interobserver reliability. For each observation category, phi coefficients were determined as a measure of interval

reliability (Gelfand & Hartmann, 1975). These coefficients ranged from .34 to .93, with a mean φ of .76 for all 14 categories. Table 2 presents the coefficients obtained with each observer for each category.

Reliability of total session agreement. Reliability was also calculated on the total session scores for each observer for each category. This session reliability (Gelfand & Hartmann, 1975) was obtained by calculating the product-moment correlation (Harshbarger, 1971) between the total number of behavioral occurrences reported by the standard observer and the other observer over an entire observation period. (The number of reliability observations differed for each observer, ranging from 12 to 60 sessions.) As indicated in Table 3, 31 of 45 correlations were .90 or above (p < .001, two-tailed).

Differences between hyperactive and comparison children. Correlated t tests (Harshbarger, 1971) were performed to determine whether the two study groups differed in rate of observed problem behaviors. The results are presented in Table 4.

Table 4
Mean Number of Behaviors for Hyperactive and Comparison Children During 16-Minute Observations

                                  Hyperactive         Comparison
Category                          M        SD         M        SD        t
Interference                     14.52     9.35       4.83     4.07      8.93**
Solicitation                      3.09     2.45        .98      .72      6.57**
Off task                          7.88     6.01       1.25     1.56      8.23**
Minor motor movement             19.70     6.55      16.11     6.49      3.75**
Gross motor—all                   2.75     2.71       1.02      .95      5.57**
Gross motor—standing(a)           2.34     2.08       1.06      .84      3.64**
Gross motor—vigorous(a)            .41      .54        .11      .26      4.11**
Noncompliance                     2.21     2.27        .27      .61      6.12**
Out of chair(a)                   2.84     3.54        .43      .84      4.28**
Aggression                         .43      .63        .11      .34      4.51**
Verbal aggression to children      .19      .39        .05      .20      3.09**
Verbal aggression to teacher       .19      .56        .00      .00      2.59*
Extended verbalization             .07      .25        .03      .20      1.25
Daydreaming                        .04      .21        .00      .03      1.50

Note. t tests are one-tailed for correlated means. n = 60 for both groups.
(a) Only 33 subjects rated on this category.
* p < .025. ** p < .005. *** p < .0005.

The children who met the study criteria for treatment consideration obtained significantly higher mean scores than the comparisons on 12 of the 14 code categories. Some overlap was present between the groups' score distributions for each observation category. Frequency distributions revealed that for the majority of the observation categories, the mean scores for the comparison children

are best described as unimodal, positively skewed distributions. On the other hand, the distributions for the hyperactive children, although more varied, are multimodal and more symmetrical than the distributions for the comparison children.

For each code category, Kolmogorov-Smirnov analyses (Siegel, 1956) were conducted on the cumulative frequency distributions of the two groups' scores to determine the score that differentiated best between the two groups and to ascertain the percentage of false positive and false negative cases for each category. The percentages, presented in Table 5, provide relative indicators of the diagnostic discriminability of each category.

Table 5
Cutoff Score and Percentage of False Positives and False Negatives for Each Code Category

                                  Cutoff    Targets        % false      Comparisons     % false
Category                          score     above cutoff   negatives    above cutoff    positives
Interference                       5.99        49            18.3           17            28.3
Solicitation                       1.99        39            34.9            8            13.3
Off task                           2.99        44            26.7            7            11.7
Minor motor movement              17.99        39            34.9           26            43.3
Gross motor—all                    1.99        35            41.7           12            20.0
Gross motor—standing(a)            1.99        21            36.4            6            18.2
Gross motor—vigorous(a)             .49        10            69.7            2             6.1
Noncompliance                      1.99        26            56.7            2             3.2
Out of chair(a)                     .99        20            39.4            5            15.2
Aggression                          .50        15            75.0            3             5.0
Verbal aggression to children       .50         6            90.0            2             3.3
Verbal aggression to teacher        .50         5            91.7            0              .0
Extended verbalization              .50         3            95.0            1             1.7
Daydreaming                         .50         3            95.0            0              .0

(a) Thirty-three targets and 33 comparisons were rated on these categories. For all other categories there were 60 targets and 60 comparisons.

There was considerable overlap in observer-rated classroom behavior between the groups. Among the five most reliable and most frequently occurring behaviors (interference, off task, minor motor movement, gross motor movement, and solicitation), the most discriminating single category would fail to confirm hyperactive group membership in 18% of the cases (using interference) and would erroneously rate almost 12% of normal children as similar to hyperactive children (using off task).

Additional analyses were performed to determine whether using more than single categories would increase the ability of the code to discriminate between hyperactives and normals. Accordingly, calculations were made of the number of hyperactive and normal children with mean scores at or above cutoff levels on pairs of categories. By using category dyads, there was appreciable improvement in the diagnostic discriminating power of the code (see Table 6). Three dyads identified a majority of the hyperactive children. The pairing of interference with solicitation was not included, since scores on these two categories are often nonorthogonal. For example, a child who interrupts a lesson by calling out to the teacher will be coded on both behaviors. The percentage of diagnostic errors using a three-category criterion is overly conservative, since the majority of the hyperactive children would be indistinguishable from controls.

Table 6
Percentage of False Positives and Negatives Using Two-Category Criteria

Pairing                                 % false negatives(a)    % false positives(b)
Interference with
  Off task                                    41.7                      0
  Minor motor                                 43.3                     13.3
  Gross motor—all                             48.3                      5.0
Solicitation with
  Off task                                    58.3                      0
  Minor motor                                 56.7                      5.0
  Gross motor—all                             60.0                      3.3
Off task with
  Minor motor                                 51.7                      3.3
  Gross motor—all                             56.7                      1.7
Minor motor with gross motor—all              55.0                     16.7

Note. n = 60 for both false negatives and false positives.
(a) Percentage of hyperactive children below cutoff scores on one or both behavioral categories.
(b) Percentage of normal children at or above cutoff scores on both behavioral categories.

Within-subject variability. The data were analyzed to determine whether the two groups differed in the within-subject range of observational ratings for the five most frequently occurring behaviors, that is, interference, minor motor, off task, gross motor (all), and solicitation. Correlated t tests comparing the ranges for the targets and their paired comparisons indicated that the hyperactive group had significantly greater within-subject variability than the comparison group for four of the five observation categories. No difference between the groups in minor motor movement was obtained (see Table 7). The teachers' ratings of how representative of general classroom conduct the child's behavior had been during an observation period further document the relative variability of the behavior of hyperactive children. Of 128 ratings for the hyperactive children made by the teachers, 60 (43.7%) indicated that the children's behavior had been atypical of their usual classroom conduct, whereas only 16 (13.8%) of the 116 ratings of the comparisons' behavior were reported to be atypical by the teacher.

Table 7
Within-Subject Variability for Frequently Occurring Behaviors

                               M range
Category                  Targets    Comparisons      t
Interference               15.56        7.04         5.97*
Solicitation                5.95        2.54         5.44*
Off task                   13.51        3.46         8.27*
Minor motor movement       15.26       13.93         1.02
Gross motor—all             4.46        2.42         4.37*

Note. One-tailed t tests for correlated means. Three comparison children were observed only once, therefore df = 56.
* p < .0005.

Time effect. Of the total 120 children, 30 targets and 23 comparison children were observed at least five times. Paired-comparison analyses on the first three versus the remaining observations, using t tests for dependent means, indicated only 1 significant difference (two-tailed) out of 28 comparisons between initial and later observations. This is well within chance expectations and indicates that there were no systematic temporal observer effects.

Age. Since birth dates were not available for the comparison children, the ages of their paired targets were used in calculating the correlations. As predicted, the mean scores of both the hyperactive and comparison groups were significantly negatively correlated with age for the movement categories (targets: minor motor, r = -.36, p < .025; gross motor—all, r = -.28, p < .025; comparisons: minor motor, r = -.39, p < .005; gross motor—all, r = -.31, p < .025; one-tailed tests). For both groups, the correlations between age and mean scores were nonsignificant for the other categories.

Discussion

The revised Stony Brook observation code was found to have adequate reliability. In addition, on 12 of the 14 classroom observation categories, it discriminated significantly between hyperactive and normal children, as the hyperactive children had significantly higher mean scores than normal children. The two timed categories that did not differentiate significantly between the two groups (extended verbalization and daydreaming) occurred very infrequently among all children. As expected, the hyperactive children displayed significantly more variable classroom behavior than their normal comparisons. This finding points to the fact that hyperactive children have high behavioral variability not only cross-situationally but also within a setting.
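The single-category and two-category criteria summarized in Tables 5 and 6 can be sketched as a small classification routine. The cutoffs below are the Table 5 values for interference and off task; the children's scores are invented for illustration, and the function name `error_rates` is hypothetical. A child is treated as "above cutoff" when the mean score exceeds the tabled cutoff (for example, above 5.99).

```python
def error_rates(targets, comparisons, cutoffs):
    """Return (% false negatives, % false positives) for a cutoff criterion.

    targets, comparisons: lists of {category: mean score} dicts.
    cutoffs: {category: cutoff}; a child is classified as hyperactive only
    if the score exceeds the cutoff on every category listed.
    """
    def flagged(child):
        return all(child[cat] > cut for cat, cut in cutoffs.items())

    false_neg = sum(1 for child in targets if not flagged(child))
    false_pos = sum(1 for child in comparisons if flagged(child))
    return (100 * false_neg / len(targets),
            100 * false_pos / len(comparisons))


# Invented mean scores for four target and four comparison children.
targets = [
    {"interference": 14.0, "off task": 8.0},
    {"interference": 7.0, "off task": 1.0},
    {"interference": 3.0, "off task": 5.0},
    {"interference": 20.0, "off task": 3.5},
]
comparisons = [
    {"interference": 4.0, "off task": 1.0},
    {"interference": 8.0, "off task": 0.5},
    {"interference": 2.0, "off task": 3.0},
    {"interference": 5.0, "off task": 2.0},
]

single = error_rates(targets, comparisons, {"interference": 5.99})
dyad = error_rates(targets, comparisons, {"interference": 5.99, "off task": 2.99})
print("single category:", single)
print("dyad:           ", dyad)

# The Table 5 percentages for interference follow from the tabled counts.
table5_false_neg = round(100 * (60 - 49) / 60, 1)  # 49 of 60 targets above cutoff
table5_false_pos = round(100 * 17 / 60, 1)         # 17 of 60 comparisons above cutoff
print(table5_false_neg, table5_false_pos)
```

In this toy data set, requiring both categories removes the false positive at the cost of an additional false negative, which is the same direction of trade-off seen when the interference row of Table 5 is compared with the interference and off task dyad in Table 6.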
It may be that in treating these children, an ameliorative intervention results not only in fewer occurrences of symptomatic behavior but also in more stable, less variable behavior.

Given that both the Conners' Teacher Rating Scale and the observation code differentiated between normal and hyperactive children, it would be interesting to correlate ratings on those observation categories that are semantically similar to scale items (e.g., minor motor movement with fidgeting and off task with inattentive, distractible). However, restrictions in the range of teachers' scale ratings preclude the determination of these concurrent validity coefficients.

The children who produced low instances of target symptomatology, as measured by the observation code, present a validation problem in diagnosis. Which of the instruments used, the Conners' Teacher Rating Scale or the observation code, is the more valid diagnostic marker of hyperactivity? The standard deviations of the mean scores for the comparison and hyperactive children on the teacher-rated Hyperactivity factor indicated no overlap between the children who were referred for treatment and the normal children. This was not the case for the observation categories.

One approach in attempting to resolve the problem of diagnostic validity is to determine which instrument is a better predictor of treatment response. It has been reported that about 20%-30% of children treated for hyperactivity are refractory to stimulant medication (Gittelman-Klein, Klein, Katz, Saraf, & Pollack, 1976; Satterfield, Cantwell, & Satterfield, 1974). Do the nonresponders represent false positive errors in diagnosis? Children rated high by the teachers but low on the observation code might represent nonresponders to stimulant treatment. On the other hand, because observations lasted only 16 minutes, low observation scores may be related to inadequate sampling of classroom behavior.
For these children, teachers' global scale ratings, based on judgments of behavior seen over a long period of time, may be more accurate diagnostic measures and treatment predictors. Data are currently being collected to investigate this issue.

It should be noted that in view of the overlap between normal and hyperactive children on the observation ratings (see Table 5), observation scores should not be used as the only diagnostic measure of hyperactivity. The target children who were observed had been referred on the basis of teachers' and parents' reports of symptomatic behavior. Given such preexisting cross-situational agreement, observation ratings can then offer additional diagnostic confirmation and information.

The five most reliable, clinically relevant, and frequently occurring code behaviors were interference, off task, minor motor movement, gross motor movement, and solicitation. Although all five significantly discriminated between the hyperactive and normal children, the groups overlapped on all categories, yielding high percentages of false positive and false negative errors (see Table 5). However, a diagnosis of hyperkinesis should not be based on the presence or absence of a single trait. Rather, the disorder is viewed as a syndrome, with various symptom clusters describing different syndromal subcategories. It is therefore not surprising that the discriminability of the code, defined by a reduction in the number of false positive errors, is substantially increased when a two-category criterion is used. In fact, as Table 6 indicates, using observation scores on both interference and off task resulted in the successful identification of 58% of the 60 children referred for hyperactivity and not a single instance of the misidentification of the 60 normal children. It remains to be seen whether the discriminant function of this behavior dyad will be replicated. The combined presence of interference and off task is not dissimilar to the typical clinical description of the impulsive, inattentive hyperactive child, and, therefore, there is face validity for this two-category criterion.

The presence of an observer in the classroom was expected to have a nonrandom effect on children's behavior. This expectation was not supported.
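
The statement in the Results that 1 significant difference out of 28 comparisons is within chance expectations can be checked with simple arithmetic. The sketch below is illustrative only: the .05 significance level and the independence of the 28 tests are assumptions made here, not details stated in the text, and the function names are hypothetical.

```python
def expected_chance_findings(n_tests: int, alpha: float) -> float:
    """Expected number of significant results when every null hypothesis is true."""
    return n_tests * alpha


def prob_at_least_one(n_tests: int, alpha: float) -> float:
    """Probability of one or more chance findings, assuming independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


N_COMPARISONS = 28   # early-versus-late comparisons reported in the text
ALPHA = 0.05         # assumed significance level (not stated in the text)

expected = expected_chance_findings(N_COMPARISONS, ALPHA)
p_any = prob_at_least_one(N_COMPARISONS, ALPHA)

print(f"Expected chance findings: {expected:.1f}")
print(f"P(at least one chance finding): {p_any:.2f}")
```

Under these assumptions about 1.4 significant results would be expected by chance alone, so a single significant comparison does not by itself suggest a systematic observer effect.
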
No significant differences between the frequencies of early and later observed behaviors occurred. Two factors may account for this finding. First, since public school classes are frequently observed by school personnel and student teachers, the teachers' descriptions of the observers as prospective teachers may have reduced the impact of the observer's presence. Second, because the comparison and target children sat in fairly close proximity to each other, the observer was able to watch both children without moving around the classroom, resulting in a reduction in observer obtrusiveness. An absence of children's reactivity to classroom observers has also been confirmed by Dubey, Kent, O'Leary, Broderick, and O'Leary (in press).

As expected, for both groups of children, activity level was negatively correlated with age. Contrary to the findings of others regarding teachers' ratings of distractibility and inattentiveness, no significant negative relationship between off task behavior and estimated age for the comparison children was found (r = .11). Differences between the code's operational definition of off task and teachers' concepts of short attention span and distractibility may account for this discrepancy. For example, a teacher might consider a child whose attention wanders for short periods of time (less than 15 sec) to have an attentional problem. This child would not be rated as off task on the code. On the other hand, a teacher may assume that a child working quietly has no attentional difficulties. However, if the child is attending to stimuli other than the assigned work, he or she will be rated as off task.

Two measures of interobserver reliability were computed: a phi coefficient indicating degree of interval agreement and a correlation coefficient indicating total session agreement. Werry and Quay (1969) used a similar session index and reported a mean interrater correlation coefficient of .89 for 15 categories, which is quite similar to the correlations obtained in this study (see Table 3). Werry and Quay (1969) noted that "the reliabilities must be considered to be highly satisfactory" (p. 463). Yet, the validity of the ratings may be questioned. Their observers received little supervision after initial instruction, and observation scores were compared against each other rather than with a standard observer.
Such procedures may lead to "observer drift" (Kent et al., 1974; Patterson, 1969), a situation that results in a change in the definition of observational categories over time. As Kent et al. (1974) pointed out, observer drift can be reduced substantially if observers are periodically paired with the trainer and thus, in effect, are periodically retrained.

A total-session measure indicates the degree to which observers agreed on the number of occurrences of a behavior over an entire observation period. This measure has relevance when an observation code is used in applied behavioral studies. However, when session reliability is not accompanied by an index of interval agreement, its usefulness is substantially reduced. For example, using total occurrences to compute interobserver agreement can result in perfect reliability, even though, in fact, there had been 0% concordance for specific interval ratings. As Gelfand and Hartmann (1975) note: "Without a reasonable degree of interobserver reliability at the trial [i.e., interval] level, a study is uninterpretable because of the ambiguous meaning of the basic data" (pp. 204-205). The mean phi coefficient of .76 obtained across all observation categories indicates an acceptable level of interobserver interval reliability.

A classroom's physical setup can adversely affect interobserver reliability. For example, even if observers use the same operational definitions of behavior, their agreement will be limited by the degree to which they can both observe the same behaviors. In some classrooms, because of space limitations, observers were required to watch the children from different vantage points, often leading to reduced agreement. In addition to the observability of behavior, reliability can also be affected by such factors as imprecise code definitions and inferential observer judgments (Herbert & Attridge, 1975).

Notwithstanding several drawbacks associated with observation codes, the code discriminated significantly between normal and hyperactive children. These clinically relevant discriminations are noteworthy because they were reported by blind, unbiased observers.
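
The earlier point that total-session agreement can be perfect while interval agreement is absent can be made concrete with a small sketch. The interval records below are hypothetical, invented for illustration rather than taken from the study, and the function names and the occurrence-agreement index are assumptions of this sketch; the phi coefficient is computed with the standard formula for a 2 x 2 table.

```python
from math import sqrt


def cell_counts(obs_a, obs_b):
    """Cross-tabulate two equal-length lists of 0/1 interval codes."""
    if len(obs_a) != len(obs_b):
        raise ValueError("both observers must rate the same number of intervals")
    pairs = list(zip(obs_a, obs_b))
    both = sum(1 for a, b in pairs if a == 1 and b == 1)
    only_a = sum(1 for a, b in pairs if a == 1 and b == 0)
    only_b = sum(1 for a, b in pairs if a == 0 and b == 1)
    neither = sum(1 for a, b in pairs if a == 0 and b == 0)
    return both, only_a, only_b, neither


def phi_coefficient(obs_a, obs_b):
    """Phi coefficient of interval-by-interval agreement."""
    both, only_a, only_b, neither = cell_counts(obs_a, obs_b)
    denom = sqrt((both + only_a) * (only_b + neither)
                 * (both + only_b) * (only_a + neither))
    if denom == 0:
        return float("nan")  # undefined when an observer's codes never vary
    return (both * neither - only_a * only_b) / denom


def occurrence_agreement(obs_a, obs_b):
    """Share of intervals scored by either observer that both observers scored."""
    both, only_a, only_b, _ = cell_counts(obs_a, obs_b)
    scored = both + only_a + only_b
    return both / scored if scored else float("nan")


# Hypothetical records for one behavior over eight intervals (1 = behavior scored).
observer_1 = [1, 1, 0, 0, 1, 0, 0, 0]
observer_2 = [0, 0, 1, 1, 0, 1, 0, 0]

session_totals = (sum(observer_1), sum(observer_2))
phi = phi_coefficient(observer_1, observer_2)
agreement = occurrence_agreement(observer_1, observer_2)

print("Session totals:", session_totals)
print("Occurrence agreement:", agreement)
print("Phi:", round(phi, 2))
```

Here both observers record three occurrences, so a comparison of session totals shows perfect agreement, yet they never score the behavior in the same interval and phi is negative.
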
Such observed behavioral differences provide evidence that teachers and parents, rather than helping create a myth regarding the existence of hyperactive children, as has been claimed (Schrag & Divoky, 1975), are responding to real behavioral differences in these children. Furthermore, in addition to its descriptive clinical utility, the code has been found to be sensitive to treatment effects (Gittelman-Klein, Klein, Abikoff, Katz, Gloisten, & Kates, 1976). This sensitivity can provide useful empirical information for practitioners and others concerned with evaluating the clinical effects of medication on the behavior of hyperactive children.

Reference Notes

1. Bernal, M. E., & North, J. A. Scoring system for home and school (Revision X). Unpublished manuscript, University of Denver, 1972.
2. Tonick, I., Friehling, J., & Warhit, J. Classroom observational code. Unpublished manuscript, Point of Woods Laboratory School, State University of New York at Stony Brook, 1973.

References

Ayllon, T., Layman, D., & Kandel, H. J. A behavioral-education alternative to drug control of hyperactive children. Journal of Applied Behavior Analysis, 1975, 8, 137-146.
Becker, W., Madsen, C., Arnold, C., & Thomas, D. The contingent use of teacher attention and praise in reducing classroom behavior problems. Journal of Special Education, 1967, 1, 287-307.
Bell, R. Q., Waldrop, M. D., & Weller, G. M. A rating system for the assessment of hyperactive and withdrawn children in preschool samples. American Journal of Orthopsychiatry, 1972, 42, 23-34.
Blunden, D., Spring, C., & Greenberg, L. M. Validation of the Classroom Behavior Inventory. Journal of Consulting and Clinical Psychology, 1974, 42, 84-88.
Christensen, D. E. Effects of combining methylphenidate and a classroom token system in modifying hyperactive behavior. American Journal of Mental Deficiency, 1975, 80, 266-276.
Conners, C. K. A teacher rating scale for use in drug studies with children. American Journal of Psychiatry, 1969, 126, 884-888.
Conners, C. K. Pharmacotherapy of psychopathology in children. In H. C. Quay & J. S. Werry (Eds.), Psychopathological disorders of childhood. New York: Wiley, 1972.
Douglas, V. I. Stop, look and listen: The problem of sustained attention and impulse control in hyperactive and normal children. Canadian Journal of Behavioral Science, 1972, 4, 259-282.
Dubey, D. R., Kent, R. N., O'Leary, S. G., Broderick, J. E., & O'Leary, K. D. Reactions of children and teachers to classroom observers: A series of controlled investigations. Behavior Therapy, in press.
Early Clinical Drug Evaluation Unit Assessment Battery for Pediatric Psychopharmacology. Children's Psychiatric Rating Scale. Bethesda, Md.: Department of Health, Education and Welfare, Public Health Service, National Institute of Mental Health, Psychopharmacology Research Branch, January 1973.
Gelfand, D. M., & Hartmann, D. P. Child behavior: Analysis and therapy. New York: Pergamon Press, 1975.
Gittelman-Klein, R., Klein, D. F., Abikoff, H., Katz, S., Gloisten, A., & Kates, W. Relative efficacy of methylphenidate and behavior modification in hyperkinetic children: An interim report. Journal of Abnormal Child Psychology, 1976, 4, 361-379.
Gittelman-Klein, R., Klein, D. F., Katz, S., Saraf, K., & Pollack, E. Comparative effects of methylphenidate and thioridazine in hyperkinetic children: I. Clinical results. Archives of General Psychiatry, 1976, 33, 1217-1231.
Greenberg, L. M., Deem, M. A., & McMahon, S. Effects of dextroamphetamine, chlorpromazine, and hydroxyzine on behavior and performance in hyperactive children. American Journal of Psychiatry, 1972, 129, 532-539.
Guilford, J. P. Psychometric methods (2nd ed.). New York: McGraw-Hill, 1954.
Harshbarger, T. R. Introductory statistics: A decision map. New York: Macmillan, 1971.
Herbert, J., & Attridge, C. A guide for developers and users of observation systems and manuals. American Educational Research Journal, 1975, 12, 1-20.
Johnson, S. M., & Bolstad, O. D. Methodological issues in naturalistic observations: Some problems and solutions for field research. In L. A. Hamerlynck, L. C. Handy, & E. J. Mash (Eds.), Behavior change: Methodology, concepts and practice. The fourth Banff international conference on behavior modification. Champaign, Ill.: Research Press, 1973.
Jones, R. R., Reid, J. B., & Patterson, G. R. Naturalistic observations in clinical assessment. In P. McReynolds (Ed.), Advances in psychological assessment (Vol. 3). San Francisco: Jossey-Bass, 1975.
Kent, R. N., & O'Leary, K. D. A controlled evaluation of behavior modification with conduct problem children. Journal of Consulting and Clinical Psychology, 1976, 44, 586-596.
Kent, R. N., O'Leary, K. D., Diament, C., & Dietz, A. Expectation biases in observational evaluation of therapeutic change. Journal of Consulting and Clinical Psychology, 1974, 42, 774-780.
Klein, D. F., & Gittelman-Klein, R. Problems in the diagnosis of minimal brain dysfunction and the hyperkinetic syndrome. International Journal of Mental Health, 1975, 4, 45-60.
Laufer, M. W., & Denhoff, E. Hyperkinetic behavior syndrome in children. Journal of Pediatrics, 1957, 50, 463-474.
O'Leary, K. D., Pelham, W. E., Rosenbaum, A., & Price, G. H. Behavioral treatment of hyperkinetic children: An experimental evaluation of its usefulness. Clinical Pediatrics, 1976, 15, 274-279.
Patterson, G. R. A community mental health program for children. In L. A. Hamerlynck, P. O. Davidson, & L. E. Acker (Eds.), Behavior modification and ideal mental health services. Calgary, Alberta, Canada: University of Calgary Press, 1969.
Patterson, G. R., Cobb, J. A., & Ray, R. S. Direct intervention in the classroom: A set of procedures for the aggressive child. In F. Clark, D. Evans, & L. Hamerlynck (Eds.), Implementing behavioral programs for schools and clinics. Champaign, Ill.: Research Press, 1972.
Reid, J. B. Reliability assessment of observation data: A possible methodological problem. Child Development, 1970, 41, 1143-1150.
Routh, D. K., Schroeder, C. S., & O'Tuama, L. A. Development of activity level in children. Developmental Psychology, 1974, 10, 163-168.
Safer, D. J., & Allen, R. P. Hyperactive children: Diagnosis and management. Baltimore, Md.: University Park Press, 1976.
Satterfield, J. M., Cantwell, D. P., & Satterfield, B. T. Pathophysiology of the hyperactive child syndrome. Archives of General Psychiatry, 1974, 31, 839-844.
Schrag, P., & Divoky, D. The myth of the hyperactive child and other means of child control. New York: Pantheon, 1975.
Shaffer, D., McNamara, N., & Pincus, J. H. Controlled observations on patterns of activity, attention, and impulsivity in brain-damaged and psychiatrically disturbed boys. Psychological Medicine, 1974, 4, 4-18.
Siegel, S. Nonparametric statistics. New York: McGraw-Hill, 1956.
Sprague, R. L., Barnes, K. R., & Werry, J. S. Methylphenidate and thioridazine: Learning, reaction time, activity, and classroom behavior in emotionally disturbed children. American Journal of Orthopsychiatry, 1970, 40, 615-628.
Sprague, R. L., Christensen, D. E., & Werry, J. S. Experimental psychology and stimulant drugs. In C. K. Conners (Ed.), Clinical use of stimulant drugs in children. Boston: Excerpta Medica, 1974.
Weick, K. E. Systematic observational methods. In G. Lindzey & E. Aronson (Eds.), The handbook of social psychology (2nd ed.). Menlo Park, Calif.: Addison-Wesley, 1968.
Wender, P. H. Minimal brain dysfunction in children. New York: Wiley, 1971.
Werry, J. S., & Quay, H. C. Observing the classroom behavior of elementary school children. Exceptional Children, 1969, 35, 461-472.
Werry, J. S., & Quay, H. C. The prevalence of behavior symptoms in younger elementary school children. American Journal of Orthopsychiatry, 1971, 41, 136-143.
Werry, J. S., & Sprague, R. L. Hyperactivity. In C. G. Costello (Ed.), Symptoms of psychopathology. New York: Wiley, 1970.

Received May 19, 1976