Functional Behavior Assessment Rating Scales: Intrarater Reliability with Students with Emotional or Behavioral Disorders

Sally M. Barton-Arwood and Joseph H. Wehby, Vanderbilt University
Philip L. Gunter, Valdosta State University
Kathleen L. Lane, Vanderbilt University

ABSTRACT: This study evaluated the intrarater reliability of two functional behavior assessment rating scales: the Motivation Assessment Scale and the Problem Behavior Questionnaire. Teachers rated 30 students from 10 self-contained classrooms for students with emotional or behavioral disorders on three separate occasions using both rating scales. Pearson correlation coefficients and exact and adjacent agreement percentages indicated variable and inconsistent ratings across administrations and rating scales. The authors discuss possible reasons for inconsistencies, as well as implications for practice and future research.
One challenge faced by educators of students with emotional or behavioral disorders (E/BD) is determining how to address disruptive classroom behaviors, including noncompliance, academic disengagement, and aggression (Mathur, Quinn, & Rutherford, 1996; Walker, Colvin, & Ramsey, 1995; Wehby, Symons, & Shores, 1995). With goals of reducing behavioral disruptions and providing positive behavioral supports in least restrictive educational settings, the 1997 Amendments to the Individuals with Disabilities Education Act (IDEA) explicitly mandate functional behavioral assessment (FBA) and behavior intervention plans (BIPs) for students with disabilities and histories of inappropriate behavior (Yell & Shriner, 1997). Functional behavioral assessment is the process of gathering information about factors or events that reliably predict and maintain problem behavior (O'Neill et al., 1997) in order to develop behavioral interventions that discontinue reinforcement of problem behavior and teach functional alternative behaviors (Carr & Durand, 1985). A primary assumption of FBA is that reliance on behavioral form (i.e., topography) leads to less effective intervention plans; interventions based on behavioral function, rather than form, are expected to be more efficacious. More specifically, identifying the underlying causes of problem behavior provides more useful information for developing idiographic, proactive, instructional interventions (Center for Effective Collaboration and Practice, 1998a). Therefore, BIPs based on information gathered throughout the FBA process have the capacity to be positive in scope, emphasize teaching instead of punishment, and prosocially meet the individual needs of students with disabilities, including E/BD (Lewis, Scott, & Sugai, 1994).
The Functional Behavioral Assessment Process

The main focus of FBA is to identify environmental events (i.e., antecedents and consequences) that occasion and sustain problem behaviors. After gathering sufficient information, a hypothesis is developed regarding the function(s) of the inappropriate behavior (Horner, 1994).
FBA-based BIPs may then be developed that include potentially appropriate reinforcers and replacement behaviors directly matched to the behavioral function. Although the 1997 IDEA Amendments do not provide information regarding what constitutes FBA, 10 sequential steps have been prescribed by the Center for Effective Collaboration and Practice (1998a) as a means for school personnel to complete the FBA and BIP processes. These steps are as follows:

1. Describe and verify the problem behavior.
2. Refine the problem behavior definition.
3. Collect information on the function of the problem behavior.
4. Analyze the information.
5. Generate a hypothesis for the problem behavior function.
6. Test the hypothesis.
7. Create and implement the BIP.
8. Monitor the BIP.
9. Evaluate the BIP's effectiveness.
10. Modify the BIP as needed.

Although each step of the FBA process is important and contributes to the goal of developing an effective BIP, the third step (i.e., collect information on the function of the problem behavior) is critical and challenging. Sufficient expertise, time, and resources are necessary to collect adequate information to develop a plausible hypothesis or hypotheses regarding the behavioral function(s) (Howell & Nelson, 1999; Walker & Sprague, 1999). If this step is not thoroughly completed, subsequent steps may be hindered or unsuccessful (Jolivette, Barton-Arwood, & Scott, 2000). FBA data may be collected through direct methods (e.g., observations, environmental manipulations) and indirect methods (e.g., interviews, rating scales). The efficacy of direct data-collection methods has been questioned in terms of the efficiency, time, and expertise required for completion (Durand, 1990; Lewis et al., 1994; Sasso, Conroy, Stichter, & Fox, 2001; Walker & Sprague, 1999). Many teachers are poorly trained and lack the time to perform complex and lengthy assessment procedures (Sugai, Horner, & Sprague, 1999). In addition, many of the high-intensity problem behaviors exhibited by students with E/BD are difficult to assess using direct observation due to their low frequency of occurrence. Rating scales, therefore, have been recommended as one FBA method (e.g., Center for Effective Collaboration and Practice, 1998a). Rating scales are easier to complete, offer a more palatable alternative to complex methods (Lewis et al.), and may become an important, efficient FBA methodology for assessing those high-intensity, low-frequency behaviors (Sasso et al.). The Motivation Assessment Scale (MAS) and the Problem Behavior Questionnaire (PBQ) are two FBA rating scales developed to be efficient methods for identifying possible functions of problem behavior (Lewis et al.).
The Motivation Assessment Scale

The MAS is an indirect assessment tool described as a simpler alternative to more formal and labor-intensive functional analysis (Spreat & Connelly, 1996). This scale contains 16 questions designed to determine possible behavioral function(s), including sensory reinforcement, escape from aversive situations, social attention, and tangible rewards (Durand & Crimmins, 1988). Given a defined problem behavior, third-party informants rate its occurrence within 16 context-specific situations (4 for each function) using a 7-point Likert-type scale ranging from never (0) to always (6). After the situations are rated, numeric ratings are totaled and mean scores are calculated within the four functional categories. Relative rankings are then assigned to each category according to the mean scores. The rank of 1 (i.e., primary function) is assigned to the behavioral function with the highest mean, and a 4 is assigned to the function with the lowest mean. Although the instrument was designed for self-injurious behavior (SIB), the MAS has been used for other behavioral topographies (e.g., aggression; Sigafoos, Kerr, & Roberts, 1994).
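To make the scoring procedure concrete, the following is a minimal sketch in Python. It assumes the 16 ratings are supplied in published item order, in which the items cycle through the sensory, escape, attention, and tangible categories (the ordering shown in Table 2); the function and variable names are illustrative and are not part of the published instrument.

# Minimal sketch of MAS scoring: 16 ratings (0-6) in published item order,
# cycling through sensory, escape, attention, tangible (see Table 2).
MAS_FUNCTIONS = ["sensory", "escape", "attention", "tangible"]

def score_mas(ratings):
    """Return the mean score and relative rank (1 = primary) per function."""
    if len(ratings) != 16:
        raise ValueError("The MAS has exactly 16 items.")
    means = {}
    for offset, function in enumerate(MAS_FUNCTIONS):
        items = ratings[offset::4]  # items offset+1, offset+5, offset+9, offset+13
        means[function] = sum(items) / len(items)
    ordered = sorted(means, key=means.get, reverse=True)  # highest mean first
    ranks = {function: rank for rank, function in enumerate(ordered, start=1)}
    return means, ranks

# Example: a rater who scores the escape items highest.
means, ranks = score_mas([1, 6, 2, 0, 1, 5, 3, 1, 0, 6, 2, 1, 2, 5, 1, 0])
print(means)  # {'sensory': 1.0, 'escape': 5.5, 'attention': 2.0, 'tangible': 0.5}
print(ranks)  # {'escape': 1, 'attention': 2, 'sensory': 3, 'tangible': 4}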
The reliability of the MAS has been investigated for children, adolescents, and adults with developmental disabilities, and reliability estimates have varied across studies. For item-level responses, Durand and Crimmins (1988) reported significant interrater reliability correlations ranging from .66 to .92 for 50 children with developmental disabilities and SIB. Relative rank agreement was also significant, ranging from .66 to .81. In addition, test-retest reliability coefficients for the primary rater across a 30-day interval were .89 to .98 for items and .82 to .99 for ranks. The authors concluded that there was agreement between raters on the variables that influenced and maintained the SIB. These results have not been replicated. For example, Zarcone, Rodgers, Iwata, Rourke, and Dorsey (1991) found the interrater consistency of the MAS to be less adequate. For SIB, interrater reliability ranged from –.24 to .44 for individuals in an institutional setting and from –.51 to .55 for participants in a school setting. Exact agreement on item-by-item ratings averaged less than 20% and 50% per item for the institution and school settings, respectively. Relative rank correlations ranged from –.80 to 1.00, with a mean of .41. Newton and Sturmey (1991) reported even lower interrater reliability, with a median correlation of .18 for problem behavior. Sigafoos and colleagues (1994) assessed aggressive behavior and also found poor interrater reliability, with nonsignificant correlations for individual items (–.34 to .43) and for the attention, tangible, escape, and sensory subscales (–.09, .13, –.01, and .17, respectively). However, Sigafoos and colleagues reported over 44% agreement on the primary function. Conroy, Fox, Bucklin, and Good (1996) found item interrater reliability correlations ranging from –.80 to .80. For intrarater reliability, Conroy and colleagues found low item exact agreement (i.e., means of .40 and .41 for two raters) but reported higher adjacent agreement (means of .75 and .77). Across studies, discrepant reliability may have resulted from differences in the rate or topography of problem behaviors (Duker & Sigafoos, 1998; Sigafoos et al., 1994). Specifically, Durand and Crimmins (1988) used the MAS to assess high rates of SIB (e.g., 15 times per hour), while Zarcone and colleagues (1991) assessed SIB occurring at lower rates (e.g., several times an hour or less). In addition, Newton and Sturmey (1991) used the MAS to assess unspecified problem behavior, while Sigafoos and colleagues evaluated aggressive behavior. Conroy and colleagues (1996) examined a variety of challenging behaviors (e.g., shirt chewing, stealing, hyperactivity). In light of this questionable technical adequacy, researchers have recommended continued evaluations of the MAS (Conroy et al.; Kearney, 1994; Sigafoos et al.).
The Problem Behavior Questionnaire

The PBQ was developed in response to the lack of FBA methodologies for individuals with milder behavioral challenges in general education settings (Lewis et al., 1994). Similar to the MAS procedure, 15 context-specific situations are rated according to the frequency of an identified problem behavior using a 7-point Likert-type scale. Responses include never (0), 10% of the time (1), 25% of the time (2), 50% of the time (3), 75% of the time (4), 90% of the time (5), and always (6). The 15 situations address problem behavior in terms of the following five functional categories: (a) access to peer attention, (b) access to teacher attention, (c) escape from peer attention, (d) escape from teacher attention, and (e) setting events. After the scale has been completed, scores are plotted on a profile. Within each of the five functional categories, items receiving a score of 3 or above are considered a potential hypothesis for the problem behavior. If there are two or more items with a score of 3 or above, a primary hypothesis is suggested. No published empirical studies were found that evaluated the technical adequacy of this assessment tool.
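As an illustration, the sketch below applies the PBQ decision rule in Python, using the category-to-item mapping shown in Table 3; again, the names are illustrative rather than part of the published instrument.

# Minimal sketch of PBQ hypothesis generation: 15 ratings (0-6) in item order.
# Category-to-item mapping follows Table 3.
PBQ_CATEGORIES = {
    "peer escape": [3, 10, 14],
    "peer attention": [4, 7, 11],
    "adult escape": [1, 9, 13],
    "adult attention": [2, 6, 12],
    "setting events": [5, 8, 15],
}

def pbq_hypotheses(ratings, threshold=3):
    """Items rated at or above `threshold` (i.e., 50% of the time) suggest a
    potential hypothesis; two or more such items suggest a primary hypothesis."""
    if len(ratings) != 15:
        raise ValueError("The PBQ has exactly 15 items.")
    hypotheses = {}
    for category, items in PBQ_CATEGORIES.items():
        high = [i for i in items if ratings[i - 1] >= threshold]
        if len(high) >= 2:
            hypotheses[category] = "primary hypothesis"
        elif high:
            hypotheses[category] = "potential hypothesis"
    return hypotheses

# Example: high ratings on the adult-attention items (2, 6, and 12).
print(pbq_hypotheses([1, 5, 0, 2, 1, 4, 0, 1, 2, 0, 1, 6, 2, 0, 1]))
# {'adult attention': 'primary hypothesis'}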
Widespread Application of FBA in School Settings

In general, there appears to be support regarding the importance and potential usefulness of FBA in school settings (e.g., Fox & Conroy, 2000; Gable, 1999; Scott & Nelson, 1999; Sugai et al., 1999), including the use of rating scales (Center for Effective Collaboration and Practice, 1998a; Knoster, 2000). However, the widespread school application of FBA has been questioned given its limited empirical support. First, the majority of the experimental FBA literature has been conducted in controlled clinical settings with individuals with severe developmental disabilities (Gresham, Quinn, & Restori, 1999; Nelson, Roberts, Mathur, & Rutherford, 1999). Because the MAS was developed specifically for use with students with severe disabilities, there is limited empirical support for the use of such rating scales with students with E/BD (Fox, Conroy, & Heckaman, 1998; Heckaman, Conroy, & Fox, 2000; Lane, Umbreit, & Beebe-Frankenberger, 1999; Sasso et al., 2001). Researchers and educators are concerned about the generalizability of clinical findings to school settings (Gresham et al.; Nelson et al.).
Despite limited empirical support, many "how to" resources have been made available to school personnel for conducting FBA with students with E/BD. For example, the Center for Effective Collaboration and Practice (1998a, 1998b, 1999) developed a series of manuals following the reauthorization of IDEA (1997) that described FBA steps and methodological options. In these and other resources (e.g., Knoster, 2000), rating scales are frequently recommended as acceptable indirect assessment tools to simplify the FBA process for school personnel. Although direct observation and interviews are frequently used FBA methodologies (Fox et al., 1998; Lane et al., 1999), rating scales may also be frequently used to generate behavioral function information for students with E/BD (Sasso et al., 2001). While rating scales seem applicable and feasible for this population of students, they appear to be selected indiscriminately and without sufficient information. An example of indiscriminate selection may be taken from a local school district. The MAS and the PBQ were both included in the system's FBA training and teacher resource packets, with recommendations that the scales could be used interchangeably in the FBA process (C. Redelheim, personal communication, April 1999). The rating scales were included because of their ease of use and scoring; technical adequacy did not appear to be a primary concern and was not explicitly discussed. Even researchers appear to use FBA methodologies without reference to their statistical properties. Literature reviews of FBA with students with E/BD revealed a lack of reliability and validity data investigated or reported for all methodologies, including rating scales (Fox et al.; Sasso et al.). This suggests that school systems and researchers around the country are using FBA techniques that may be socially valid but may lead to unreliable outcomes (Fox et al.; Nelson et al., 1999; Sasso et al.). Researchers suggest that much of the current emphasis on the "how to" of the FBA process may be premature and that more careful investigation of FBA procedures is needed (Stichter, Shellady, Sealander, & Eigenberger, 2000). Although there is a substantial research base demonstrating the conceptual foundation for the clinical application of FBA, sufficient empirical support for using FBA with students with E/BD in applied settings is lacking (Armstrong & Kauffman, 1999; Nelson et al., 1999).
Given the socially valid nature of rating scales and FBA's potential usefulness for students with E/BD, investigation of the technical adequacy of these scales is imperative. The purpose of this study, therefore, was to extend the literature on FBA methodology for students with E/BD with regard to the intrarater reliability of the MAS and the PBQ. The study was designed to answer the following questions:

1. What is the item-by-item intrarater reliability over time for the MAS and the PBQ?
2. What is the stability of primary behavioral function over time as indicated by both rating scales?
3. How do the escape and attention functional categories compare between the MAS and the PBQ?
Method

Participants

Raters in this study included 10 teachers of self-contained classrooms for students with E/BD within a southeastern metropolitan public school district. Nine of the 10 teachers had degrees in general or special education. The one noncertified teacher had a degree in criminal justice and was teaching on a waiver. All of the participants had taught in special education classrooms for at least 1 year, with an average of 7.9 years' special education teaching experience. These teachers had taught in their current classrooms for at least 1 year. The teacher participants nominated a total of 30 students (26 boys and 4 girls) across the 10 special education classrooms for behavioral ratings. The teachers identified students who exhibited the highest rates of problematic behaviors and had been in their classrooms for at least 1 month. Three boys were dropped from the study, 1 due to school transfer and 2 due to incomplete rating scales. Of the 27 students who completed the study, ages ranged from 6 to 10 years old, with an average of 8.4 years. The racial composition included 19 African American students, 6 Caucasian students, and 2 students reported as biracial. The students' individualized education programs (IEPs) included primary disabling conditions of emotional disturbance (n = 13), language impairment (n = 4), mental retardation (n = 4), health impairment (n = 1), and learning disability (n = 4). One student was labeled gifted.
TABLE 1
Student Demographics and Behaviors

Student  Age  Diagnosis    Behavior
1        6    LI           Calls out teacher's name without raising hand
2        6    MR           Turns away from teacher and verbally refuses to transition
3        5    LI; SI       Looks around, drops materials, and falls out of chair
4        8    ED           Yells at teacher
5        11   ED           Off task by playing with objects or drawing
6        11   ED; SI; LI   Goes to the computer and plays games without permission
7        12   ED           Works too quickly, making multiple mistakes
8        9    OHI; ED      Off task and nonattending by sleeping
9        10   ED           Perfectionist with school work
10       10   ED; OHI      Sexually inappropriate behavior
11       10   MR; ED       Refuses to talk when given a task perceived as difficult
12       10   LD           Cries, moves desk and self away from teacher and peers
13       9    LD           Gives excuses, blames others, denies misbehavior
14       9    ED           Hits peers during unstructured times
15       8    ED           Verbally disrespectful to teacher when reprimanded
16       9    ED           Whines and makes excuses when given independent work
17       5    LI           Curses, kicks, and bites following teacher mands
18       6    LI           Screams, hits, destroys materials following teacher mand
19       6    LI           Kicks, hits, curses following teacher redirection
20       9    G            Cries and kicks objects when misbehavior confronted
21       9    ED           Becomes tense, withdrawn when does not achieve 100%
22       10   LD; LI       Repeatedly asks strangers or acquaintances to be friends
23       10   ED           Denies breaking class rules when behavior is evaluated
24       7    ED           Tantrums when teacher marks errors on work
25       7    MR; LI       Verbally disrupts, hits, and kicks when asked to redo work
26       8    MR           Will not begin work that looks different from others
27       7    ED           Does not complete academic work
Note. LI = language impairment; MR = mental retardation; SI = speech impairment; ED = emotional disturbance; OHI = other health impaired; LD = learning disability; G = gifted.
Seven students had secondary certifications of emotional disturbance (n = 2), speech impairment (n = 2), language impairment (n = 2), and health impairment (n = 1). The IEP team made decisions for special education placement in self-contained classrooms for students with E/BD based on histories of problem behavior (e.g., extreme noncompliance, verbal and physical aggression) exhibited in school settings. At the time the study began, the average length of time that students had been receiving services in their current classrooms was 8 months (range: 1–24 months). Descriptions of the students and their challenging behaviors as identified by their teachers are provided in Table 1.
Procedures

After securing signed teacher and student consents, the graduate research assistant (GRA) met with each teacher individually. During the first meeting, the GRA asked the teacher to identify each student's most frequent and problematic classroom behavior. Next, using a rubric containing the components of an operational definition (i.e., When does the behavior occur? What does the behavior look like? How often does it occur?), the GRA guided each teacher in developing an operational definition of the identified behavior. For 14 of the students (52%), a second GRA was present to assess whether the definition was developed in accordance with the rubric and was sufficiently clear. Fidelity to the rubric was 100%. Finally, during this initial meeting, the teacher completed the first administration of the PBQ and the MAS with the GRA present.
After the initial administration of the two rating scales, the same GRA met individually with each teacher for the second administration 1 week later and for the third administration 4 weeks after the second administration. For each administration, the GRA wrote the operational definition of the target behavior on the rating scale protocol prior to the meeting and reviewed it with the teacher immediately before the completion of the rating scale.
Data Analysis

Four separate analyses were conducted to evaluate the technical adequacy of the MAS and the PBQ. First, to address the intrarater reliability of the MAS and the PBQ over time, Pearson correlations were calculated to determine coefficients of agreement for the raw scores of each scale item by student. In addition, percentage-of-agreement measures were calculated by student for each item using exact and adjacent methods (i.e., [total agreements / (total agreements + total disagreements)] × 100; Zarcone et al., 1991). In the exact method, agreement was scored when the scale values were identical for the same item when comparing the first to the second, the second to the third, and the first to the third administrations. For the adjacent method, a plus/minus match procedure was used wherein agreement was scored if a teacher's responses were within 1 point of each other for the particular scale item. For example, a rating of 6 on MAS item 1 during the first administration and a rating of 5 on the same item during the second administration would be scored as a disagreement by the exact method and an agreement by the adjacent method. The exact and adjacent methods are regarded as a more conservative evaluation of agreement because Pearson correlations may fail to reflect true reliability: ratings with no exact agreement across administrations can still produce a perfect correlation, as when every rating shifts by the same amount (Duker & Sigafoos, 1998). Both agreement methods are sketched below.
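The following minimal sketch in Python implements both agreement methods and demonstrates the correlation caveat just noted; the names are illustrative, and statistics.correlation requires Python 3.10 or later.

from statistics import correlation  # Python 3.10+

def percent_agreement(first, second, tolerance=0):
    """tolerance=0 gives the exact method; tolerance=1 gives the adjacent method."""
    agreements = sum(abs(a - b) <= tolerance for a, b in zip(first, second))
    return 100 * agreements / len(first)

admin_1 = [6, 3, 0, 5, 2, 4, 1, 6]   # one student's item ratings, administration 1
admin_2 = [5, 3, 1, 2, 2, 5, 1, 6]   # the same items, administration 2
print(percent_agreement(admin_1, admin_2))               # exact: 50.0
print(percent_agreement(admin_1, admin_2, tolerance=1))  # adjacent: 87.5

# The Pearson caveat: shifting every rating by one point yields zero exact
# agreement yet a perfect correlation.
shifted = [rating + 1 for rating in admin_1]
print(percent_agreement(admin_1, shifted))  # 0.0
print(correlation(admin_1, shifted))        # 1.0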
Second, Pearson correlations were used to determine coefficients of agreement for the mean scores of each functional category for both the MAS and the PBQ. Mean scores rather than total scores were used for statistical analyses because MAS rankings for behavioral function are based on mean scores. Although not traditionally part of PBQ scoring, mean scores were also calculated for each PBQ functional category. Third, functional rank order agreement across rating scale administrations was calculated by hand. This was achieved by first assigning a primary and secondary relative rank to the scores on each rating scale protocol. Although the scoring of the PBQ does not include assignment of ranks, this is the typical procedure for the MAS. Relative rankings were assigned according to the mean scores for each functional condition. The rank of 1 was assigned to the function with the highest mean, and a rank of 2 was given to the second-highest mean. After the ranks were established, primary and secondary ranks (i.e., 1 and 2) were compared across administrations for exact agreement. For example, ratings for a student during the first administration might result in a mean Sensory score of 4.5, an Escape mean of 3.25, an Attention mean of 2.75, and a Tangible mean of 4. Therefore, Sensory would receive a rank of 1 and could be considered a primary function of the target behavior. Likewise, Tangible would be assigned a rank of 2 with subsequent consideration as the secondary function. To continue the example, the second MAS administration might yield mean scores of 4 for Sensory, 3.25 for Escape, 2.5 for Attention, and 3.5 for Tangible. Again, Sensory would be ranked as 1 and Tangible as 2, and exact agreement for primary and secondary functional ranks would be scored for this student from the first to the second administration. If ties produced two primary or two secondary ranks, exact agreement for rank 1 was scored if at least one primary functional condition matched the primary condition of the next administration, and a rank 2 agreement was scored if at least one of the secondary functional conditions was the same. Fourth, Pearson correlations were calculated to compare functional categories across the MAS and the PBQ. Although the MAS and the PBQ categorize behavioral function differently, both rating scales include Escape and Attention categories; therefore, only these two categories were compared across administrations. Sensory and Tangible from the MAS and Setting Events from the PBQ were not included in these analyses.
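A sketch of the rank-agreement check, reproducing the worked example above (names illustrative; ties are handled by the at-least-one-match rule just described):

def top_two(means):
    """Return the sets of categories tied for the highest and second-highest means."""
    values = sorted(set(means.values()), reverse=True)
    primary = {c for c, v in means.items() if v == values[0]}
    secondary = {c for c, v in means.items() if len(values) > 1 and v == values[1]}
    return primary, secondary

def rank_agreement(means_a, means_b):
    """Score agreement when at least one category in a rank matches across administrations."""
    (p_a, s_a), (p_b, s_b) = top_two(means_a), top_two(means_b)
    return bool(p_a & p_b), bool(s_a & s_b)

# The example from the text: Sensory keeps rank 1 and Tangible keeps rank 2.
first = {"Sensory": 4.5, "Escape": 3.25, "Attention": 2.75, "Tangible": 4.0}
second = {"Sensory": 4.0, "Escape": 3.25, "Attention": 2.5, "Tangible": 3.5}
print(rank_agreement(first, second))  # (True, True)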
TABLE 2
Motivation Assessment Scale Item-by-Item Pearson Correlations

                                                                Administrations
MAS Item with Function                                     1st–2nd   2nd–3rd   1st–3rd
1. If left alone for long periods of time (S)                .82**     .54*      .57**
2. Following a request to perform a difficult task (E)       .50**     .75**     .45*
3. In response to your talking to others (A)                 .88**     .41*      .46*
4. To get a toy, food, activity (T)                          .56*      .68**     .35
5. If no one is around (S)                                   .76**     .83**     .53**
6. When any request is made (E)                              .42*      .64**     .61**
7. When you stop attending (A)                               .58**     .47*      .52**
8. When you take away a favorite toy, food, activity (T)     .58**     .40*      .49**
9. Person enjoys performing the behavior (S)                 .71**     .73**     .65**
10. To upset you when trying to get to do what asked (E)     .72**     .68**     .50**
11. To upset you when not paying attention (A)               .73**     .63**     .64**
12. Stops when you give toy, food, activity requested (T)    .39*      .49**     .31
13. Seems calm and unaware of anything going on (S)          .76**     .51**     .39*
14. Stops shortly after you stop demands (E)                 .41*      .51**     .61**
15. To get you to spend time with him/her (A)                .79**     .44*      .40*
16. When told he/she can't do something (T)                  .39*      .38*      .63**
M                                                            .63       .57       .51
SD                                                           .17       .14       .11

Note. (S) = Sensory; (E) = Escape; (A) = Attention; (T) = Tangible. *p < .05. **p < .01.
Results

Item-by-item Pearson correlation coefficients are reported for the MAS in Table 2 and for the PBQ in Table 3. For the MAS, from the first to second administrations (i.e., 1-week interval), 100% of the correlations were significant at p < .05, and 75% were significant at p < .01 (range = .39 to .88; M = .63; SD = .16). From the second to third administrations (i.e., 1-month interval), 100% of the correlations were significant at p < .05, and 67% were significant at p < .01 (range = .38 to .83; M = .57; SD = .14). From the first to the third administrations (i.e., 5-week interval), 88% of the correlations were significant at p < .05; of these, 71% were significant at p < .01 (range = .392 to .640; M = .507; SD = .108). Of the total correlations, 96% (i.e., 46 of 48) were significant, and 88% were significant across all three administrations. However, only 6% were above .80. The PBQ ratings were more variable and yielded fewer significant correlation coefficients. From the first to second administrations, 67% of the coefficients were significant at p < .05; of these, 70% were significant at p < .01 (range = –.047 to .669; M = .414; SD = .242).
From the second to third administrations, 74% of the correlations were significant at p < .05; of these, 64% were significant at p < .01 (range = .008 to .892; M = .473; SD = .247). From the first to the third administrations, 47% of the correlations were significant at p < .05; of these, 43% were significant at p < .01 (range = .009 to .771; M = .405; SD = .219). Overall, 63% of the total correlations were significant, yet only one was above .80. Only 20% of the PBQ questions had statistically significant coefficients across all three administrations.

Tables 4 and 5 show the exact and adjacent percentages of item agreement for each student across rating scale administrations for the MAS and the PBQ. The data indicate that exact agreement was lower and more variable than adjacent agreement. All exact agreement means were less than 50%. Adjacent agreement was higher and reached more acceptable levels.
TABLE 3
Problem Behavior Questionnaire Item-by-Item Pearson Correlations

                                                                Administrations
PBQ Item with Function                                     1st–2nd   2nd–3rd   1st–3rd
Peer Escape
3. Occurs during peer conflict; peers leave him/her alone    .47*      .62**     .31
10. Stops when peers stop interacting with him/her           .14       .01       .01
14. Occurs and peers stop interacting                        .41*      .69**     .46*
Peer Attention
4. Occurs and peers respond/laugh                            .48**     .42**     .78**
7. Occurs in presence of specific peers                      .73**     .64**     .47*
11. Occurs when peers attend to other peers                  .60**     .19       .38*
Adult Escape
1. Occurs when you make request to perform task              .47**     .72**     .62**
9. Occurs during specific academic activities               –.05       .34       .40*
13. Stops if you stop requests or end academic task          .19       .42*      .62**
Adult Attention
2. You redirect student back to task when behavior occurs    .58**     .66**     .25
6. Occurs to get your attention when working with others     .74**     .89**     .70**
12. You provide 1:1 instruction when problem occurs          .07       .60**     .27
Setting Events
5. More likely to occur after conflict outside of classroom  .42*      .09       .47*
8. Occurs throughout day after earlier episode               .67**     .43*      .06
15. Occurs following unscheduled event/disruptions           .29       .37*      .30
M                                                            .41       .47       .41
SD                                                           .24       .25       .22

Note. *p < .05. **p < .01. M = mean; SD = standard deviation.
Table 6 summarizes the findings for primary function exact agreement for each rating scale across administrations. For the MAS, the greatest agreement occurred within the shortest evaluation interval: from the first to the second administration, the primary function remained the same for 70% of the students. Among the other 30% of students (i.e., those whose primary function changed during this 1-week interval), the original primary function became the secondary function (i.e., received the second-highest score and a relative rank of 2) for 63%. The stability of the MAS primary function decreased across administrations. However, more than 50% of the students with changing functions still had the primary function as one of the top two rated functions across the last two MAS administrations. In regard to the PBQ, the primary function was most stable from the second to the third administration. Comparing the second to third and first to third administrations, the primary function became the secondary function for more than 50% of the students.
Table 7 contains the primary function for each student's target behavior across all three administrations as scored on the MAS and the PBQ. The information reveals that the primary function remained constant for 10 students on the MAS, for 13 students on the PBQ, and for 8 students across both rating scales (i.e., students 2, 4, 5, 6, 7, 9, 16, 18). These results should be considered alongside the exact agreement percentages reported in Tables 4 and 5. Note that the 8 students whose function did not change across administrations and scales did not have 100% exact item-by-item agreement. This suggests that even with inconsistent item-by-item ratings, the summated scores may be stable in reflecting behavioral function. In addition, the 8 students with stable primary functions across scales and administrations were not spread across 8 teachers. Three teachers were responsible for the ratings of these students, suggesting that some teachers were more consistent in their ratings than other teachers.
TABLE 4
Motivation Assessment Scale Exact and Adjacent Agreement Percentages

            MAS Exact Administrations      MAS Adjacent Administrations
Student    1st–2nd  2nd–3rd  1st–3rd       1st–2nd  2nd–3rd  1st–3rd
1             31       44       31            81       88       63
2             31       50       44            63       94       50
3             19       44       44            86       86       94
4             69       44       38           100       81       81
5             31       13       44            75       56       81
6             50       56       31            63       94       69
7             56       31       25            69       69       69
8             38       38       25            50       94       88
9             75       75       85            75       88       81
10            25       19       50            75       69       81
11            13       38       25            69       75       69
12            50       63       31            81       88       88
13            56       38       56            94       94       88
14            25       50       25            75       94       75
15            50       50       44            81       81       75
16            31       19       31            88      100       75
17            50       31       38            88       81       88
18            31       25       44            94       81       69
19            31       25       44            94       81       69
20            31       31       50            75       50       88
21            88       75       81            88       75       81
22            88       81       75            94       81       81
23            44       19       13            81       50       38
24            25       13       31            69       50       63
25            19       25       25            88       56       56
26            38       19       13            75       38       44
27            31       31       19            50       44       63
Range       13–88    13–81    13–85        50–100   38–100    38–94
M           41.70    38.78    39.34         78.56    75.48    72.85
SD          20.06    19.07    18.51         12.88    12.88    14.18
The final analysis compared the MAS and PBQ functional categories across administrations (see Table 8). MAS Escape and PBQ Adult Escape revealed the strongest relationship, with significant and positive coefficients across all administrations. In contrast, MAS Escape and PBQ Peer Escape had negative correlations across all administrations, with only the second administration revealing a significant correlation. MAS Attention and PBQ Adult and Peer Attention demonstrated positive correlations, with significance at the first and third administrations. Overall, the second administration revealed the fewest correlated functions; 90% of the significant correlations were below .80.
Discussion

Since the 1997 reauthorization of IDEA, rating scales have received increased attention as FBA assessment procedures (Fox et al., 1998; Sasso et al., 2001).
TABLE 5
Problem Behavior Questionnaire Exact and Adjacent Agreement Percentages

           Exact Agreement Administrations  Adjacent Agreement Administrations
Student    1st–2nd  2nd–3rd  1st–3rd        1st–2nd  2nd–3rd  1st–3rd
1             20       47       34             67       87       67
2             27       54       47             74       74       80
3             20       40        7             54       67       40
4             54       27       40             80       74       87
5             40       40       34             74       54       54
6             40       40       34             80       87       87
7             47       34       27             80       87       74
8             47       34       34             87       80       80
9             60       40       54             67       67       60
10            40       14       47             80       67       74
11            27       40       34             40       87       54
12            47       60       27             80       94       80
13            54       40       40             80       67       74
14            27       27       20             67       87       67
15            34       13       27             73       73       73
16            40       34       34             74       94       60
17            40       54       60             87       80       94
18            54       60       60             94       94       87
19            13       20       13             67       74       40
20            40       20       27             60       67       60
21            20       52       13             60       67       54
22            47       40       27             60       67       60
23            54       40       54             67       67       74
24            27       54       47             54       74       87
25            13       20       34             60       67       80
26            20       47       34             74       80       74
27            17       60       27             54       80       54
Range       13–60    14–60     7–60          40–94    54–94    40–94
M           35.89    38.93    34.67          70.15    76.41    69.45
SD          14.17    13.94    13.62          12.31    10.51    14.48
TABLE 6
Exact Agreement Percentages of Primary Function

                                      MAS Administrations          PBQ Administrations
                                   1st–2nd  2nd–3rd  1st–3rd    1st–2nd  2nd–3rd  1st–3rd
Primary Function Exact Agreement      70       59       56         67       81       59
Primary as Secondary                  63       55       67         34       60       73
TABLE 7
Motivation Assessment Scale and Problem Behavior Questionnaire Primary Functions

                                                              Primary Behavioral Function
                                                     MAS Administrations   PBQ Administrations
Student  Target Behavior                             1      2      3       1        2        3
1   Calls teacher's name without raising hand        S      S      T       AA       AA       AA
2   Verbally refuses to transition                   S      S/A    S/T     AE       AE       AE
3   Drops materials, falls from chair                S      S      T       AA       AA       AE
4   Yells at teacher                                 T      T      T       PA/AA    AA       AA
5   Off task, plays with objects or draws            S      S      S       AE       AE       PA/AE
6   Plays computer games without permission          S      S      S       AA       AA       AA
7   Works quickly, makes multiple mistakes           S      S      S       AE       AE       AE/SE
8   Off task and non-attending by sleeping           A      S      S       AA       AA       AA
9   Perfectionist with school work                   S      S      S       PA/SE    SE       AE/SE
10  Sexually inappropriate behavior                  S      S      A       PA/AA    AA       AA
11  Refuses to talk when given difficult task        E      E      E       AE       AA       AE/AA
12  Cries, moves away from teacher and peers         E      E      E       AE       AE       AA
13  Makes excuses, denies misbehavior                E      E      T       AA       AE       AE
14  Hits peers during unstructured times             T      T      E       SE       AA/SE    PA
15  Verbally disrespectful when reprimanded          T      E      E       PA       AA       PE
16  Whines, will not do independent work             A      A      A       AA       AA       AA/SE
17  Curses, kicks, bites after teacher mands         E      A      E       AA       SE       AA
18  Screams, hits, destroys materials after mand     A      A      A       AA       AA       AA
19  Kicks, hits, curses after teacher redirection    T      E      T       AA       AA/SE    SE
20  Cries, kicks objects when confronted             T      T      E       AE/AA    AA       SE
21  Withdraws when does not achieve 100%             T      E      T       PE       AA       AA
22  Asks strangers to be friends                     S      S/A    A       PA       AA       AA/SE
23  Denies breaking class rules                      T      E      E       SE       SE/AA    SE
24  Tantrums when work errors marked                 E      T      E/T     AA       PE       PE/AA
25  Kicks when asked to redo work                    T      T      S       AA       PA/PE    PE
26  Will not begin work different from others        T      T      E       AE       AA/PE    AA
27  Does not complete academic work                  S      T      S       PE       PE/AA    PE

Note. S = sensory; E = escape; A = attention; T = tangible; PE = peer escape; PA = peer attention; AE = adult escape; AA = adult attention; SE = setting events.
TABLE 8
Motivation Assessment Scale and Problem Behavior Questionnaire Correlations by Functional Category

                                             Administrations
MAS Function       PBQ Function           1st       2nd       3rd
Escape         to  Adult Escape           .44*      .73**     .84**
Escape         to  Peer Escape           –.27      –.45*     –.30
Attention      to  Adult Attention        .72**     .05       .63**
Attention      to  Peer Attention         .40**     .23       .45*

Note. *p < .05. **p < .01.
However, there is a paucity of empirical support for these rating scales, especially with regard to their statistical properties. This study investigated the intrarater reliability of item-by-item responses and behavioral function over time for the MAS and the PBQ rating scales, as well as the relationship between the rating scales' functional categories.

Item-by-item intrarater reliability was variable for both rating scales, with decreasing mean correlations over time for the MAS. Conroy and her colleagues (1996) reported similar MAS findings, with their raters' reliability decreasing considerably across 4 weeks, and hypothesized that inconsistent ratings may be related to time. The length of the test-retest interval is one factor proposed to affect reliability; estimated reliability is assumed to be greater when assessed over shorter periods of time (Salvia & Ysseldyke, 1991). Although the MAS item-by-item outcomes suggested that this may have occurred, the pattern of decreasing reliability was not evident with the PBQ. In fact, the 4-week period between the second and third administrations reflected greater agreement. Another factor possibly influencing scoring inconsistency, therefore, may have been the frequency of the target behaviors. Low-frequency target behaviors may provide too little information, resulting in ratings based on rater intuition or guessing. High-frequency behaviors may also be challenging to assess when consequences and antecedents occur closely in time (Kearney, 1994). For example, a teacher may have difficulty rating a student's frequent aggression occurring after reinforcement for completing work and prior to presentation of the next task. The functions of Attention, Tangible, and Escape may be difficult to differentiate even if there were indeed only one function for the aggression. Although the frequency of target behaviors was not assessed or included in operational definitions in this study, its influence on reliability should be considered (Kearney).

Acceptable standards of reliability should also be considered in regard to stability. Although many of the item-by-item correlation coefficients were statistically significant, very few were at or above .80. According to Salvia and Ysseldyke (1991), .80 is considered an acceptable standard for screening purposes. However, for making educational decisions for individual students, .90 is the recommended minimum standard.
Although the MAS appeared more stable, only 6% of all the MAS reliability coefficients were above the accepted screening standard of .80 (Salvia & Ysseldyke), and no coefficients reached .90. Therefore, MAS item-by-item stability may be considered low for this group of students with E/BD, while PBQ stability is even more questionable, with only one coefficient above .80.

Individual item scores on rating scales are not typically used to develop hypotheses of behavioral function; instead, decisions are based on summated scores. Item-by-item reliability, therefore, may be viewed as a minor issue (Spreat & Connelly, 1996). The MAS and PBQ item-by-item exact agreement percentages exemplify this point for students whose primary function did not change. For example, for Student 18, Attention and Adult Attention remained the highest-ranked functions on the MAS and the PBQ, respectively, across all three administrations (see Table 7). However, MAS exact item-by-item agreement percentages for this student were 31%, 25%, and 44% comparing the first to second, second to third, and first to third administrations; PBQ percentages were 54%, 60%, and 60% (see Tables 4 and 5). Although the teacher did not provide 100% exact or even adjacent agreement on numerical ratings for the Attention category questions, the summated scores for Attention were still the highest across all administrations. Therefore, stability of the identified behavioral function over time may be more important to researchers and practitioners than individual item stability (Fox et al., 1998; Sigafoos et al., 1994; Spreat & Connelly).

For most of the students in this study, results suggested that the stability of the primary function as identified by both the MAS and the PBQ is questionable. However, it is important to note that in many instances the primary function became the secondary function. Spreat and Connelly (1996) suggested that using only the top-rated function may be too stringent a standard; given that many problem behaviors are maintained by more than one function, the two top-rated functions may need to be considered. A second possible explanation for the reported instability of behavioral function is that function changes over time and that such changes may be related to behavioral topography or classification. For students in this study, many behaviors that remained constant in terms of function could be classified as noncompliance (e.g., Student 2: refusal to transition; Student 5: off task; Student 6: plays games on computer without permission).
In contrast, many of the behaviors that reflected functional instability appeared more severe and externalizing and were related to aggression (e.g., Student 19: hits, kicks, curses) or verbal disrespect (e.g., Student 13: makes excuses, blames others, denies misbehavior). Although this pattern did not hold for all the students (i.e., some students with noncompliance had inconsistent functions across administrations), it may be that certain types of behaviors are more stable in terms of function, whereas other behaviors come to serve multiple needs for the student across a variety of contexts.

In the comparison across rating scales, the Escape and Attention functional categories of the MAS and the PBQ were not strongly correlated, especially in regard to peer interactions. This outcome is not surprising given that the two scales were developed for different populations of students. For several years, the E/BD research community has debated the generalizability of FBA procedures, developed for students with severe disabilities, to students with high-incidence disabilities (e.g., Gable, 1999; Gresham et al., 1999; Nelson et al., 1999). Although additional research is needed, these outcomes suggest that the MAS and the PBQ do not provide similar information. Therefore, indiscriminate selection of rating scales may result in inaccurate information, possibly leading to ineffective behavior intervention plans.
A Limitation

One important limitation of this study was that the number of students rated per class was not held constant; teachers nominated and rated 1 to 4 students. Close review of teacher ratings indicated that three teachers demonstrated very consistent ratings, as evidenced by stable primary functions on both rating scales across all administrations. This outcome suggested that some teachers were more consistent in their ratings over time than others. Teacher demographics did not indicate that these teachers had more teaching experience or FBA training than the other teachers in the study. However, these teachers had only 1 or 2 students to rate, while the other teachers rated 3 or 4 students. Completing multiple ratings for multiple students at one sitting may have led to rater fatigue. Consequently, the effect of instrumentation (Campbell & Stanley, 1966) cannot be ruled out: changes in ratings may have been due to changes in the observers and not in student behaviors.
Implications for Practice

Despite the possible influence of instrumentation, several implications for practice are evident given the instability of intrarater scoring across time, the tendency for the primary function not to remain constant, and the lack of correlation of behavioral function across rating scales. First, professionals who are responsible for making FBA recommendations and training school personnel in FBA should be aware of and discuss the limitations of rating scales as well as of the entire FBA assessment process. Practitioners must understand that basing educational decisions on information collected through minimally tested methods may result in erroneous data leading to ineffective BIPs and countertherapeutic outcomes (Sasso et al., 2001). Consequently, rating scales should be used cautiously and in conjunction with other data-collection methods. A rating scale may be an appropriate method to help teachers begin considering the underlying function of problem behavior; direct observation, interviews, and record review are other methods to confirm behavioral function. Moreover, teachers should be trained to view and use FBA as an ongoing process, not as a one-time assessment. Regardless of teacher consistency, behavioral function may indeed change over time for particular behaviors and students. Therefore, as indicated by ongoing measurement of the dependent variable (i.e., problem behavior) during intervention, follow-up FBAs may be needed to evaluate possible functional changes. These data are important for revising BIPs to support continued student progress and success.
Conclusion

There are still many issues regarding the effective use of FBA and rating scales with students with E/BD. As practitioners attempt to comply with the federal FBA mandate to the best of their ability, research must continue to extend the knowledge base of experimentally validated FBA practices. The current study may be extended in several ways. First, it may be replicated controlling for the number of students rated per teacher. In addition to intrarater reliability, interrater reliability should be evaluated for both the MAS and the PBQ for students with E/BD.
Second, studies could be pursued that determine the influence of behavioral frequency and topography on the reliability of rating scales. Finally, given the high social validity awarded to rating scales for their ease of use and scoring, instruments that accurately reflect the contexts of problem behaviors for students with E/BD should be developed and empirically validated. Existing research with this student population suggests that FBA has the potential to influence outcomes positively (e.g., Dunlap et al., 1993). However, until more methods are experimentally validated, researchers and practitioners must be cautious in their use of existing methodology.
References

Armstrong, S. W., & Kauffman, J. M. (1999). Functional behavioral assessment: Introduction to the series. Behavioral Disorders, 24, 167–168.
Campbell, D. T., & Stanley, J. C. (1966). Experimental and quasi-experimental designs for research. Boston: Houghton Mifflin.
Carr, E. G., & Durand, V. M. (1985). Reducing behavior problems through functional communication training. Journal of Applied Behavior Analysis, 18, 111–126.
Center for Effective Collaboration and Practice. (1998a). Addressing student problem behavior: An IEP team's introduction to functional behavioral assessment and behavior intervention plans (2nd ed.). Washington, DC: Author.
Center for Effective Collaboration and Practice. (1998b). Addressing student problem behavior—Part II: Conducting a functional behavioral assessment. Washington, DC: Author.
Center for Effective Collaboration and Practice. (1999). Addressing student problem behavior—Part III: Creating positive behavioral intervention plans and supports. Washington, DC: Author.
Conroy, M. A., Fox, J. J., Bucklin, A., & Good, W. (1996). An analysis of the reliability and stability of the Motivation Assessment Scale in assessing the challenging behaviors of persons with developmental disabilities. Education and Training in Mental Retardation and Developmental Disabilities, 31, 243–250.
Duker, P. C., & Sigafoos, J. (1998). The Motivation Assessment Scale: Reliability and construct validity across three topographies of behavior. Research in Developmental Disabilities, 19, 131–141.
Durand, V. M. (1990). Severe behavior problems: A functional communication training approach. New York: Guilford.
Durand, V. M., & Crimmins, D. B. (1988). Identifying the variables maintaining self-injurious behavior. Journal of Autism and Developmental Disorders, 18, 99–117.
Fox, J. J., & Conroy, M. A. (2000). FBA for children and youth with emotional-behavioral disorders: Where we should go in the twenty-first century. Preventing School Failure, 44, 140–144.
Fox, J., Conroy, M., & Heckaman, K. (1998). Research issues in functional assessment of the challenging behavior of students with emotional and behavioral disorders. Behavioral Disorders, 24, 26–33.
Gable, R. A. (1999). Functional assessment in school settings. Behavioral Disorders, 24, 246–248.
Gresham, F. M., Quinn, M. M., & Restori, A. (1999). Methodological issues in functional analysis: Generalizability to other disability groups. Behavioral Disorders, 24, 180–182.
Heckaman, K., Conroy, M., & Fox, J. (2000). Functional assessment-based intervention research on students with or at risk for emotional and behavioral disorders. Behavioral Disorders, 25, 196–210.
Horner, R. H. (1994). Functional assessment: Contributions and future directions. Journal of Applied Behavior Analysis, 27, 401–404.
Howell, K. W., & Nelson, K. L. (1999). Has public policy exceeded our knowledge base? This is a two-part question. Behavioral Disorders, 24, 331–334.
Individuals with Disabilities Education Act, 20 U.S.C. §§ 1400–1487.
Jolivette, K., Barton-Arwood, S., & Scott, T. (2000). Functional behavioral assessment as a collaborative process among professionals. Education and Treatment of Children, 23, 298–313.
Kearney, A. (1994). Interrater reliability of the Motivation Assessment Scale: Another, closer look. Journal of the Association for Persons with Severe Handicaps, 19, 139–142.
Knoster, T. P. (2000). Practical applications of functional behavioral assessment in schools. Journal of the Association for Persons with Severe Handicaps, 25, 201–211.
Lane, K. L., Umbreit, J., & Beebe-Frankenberger, M. E. (1999). Functional assessment research on students with or at risk for E/BD: 1990 to the present. Journal of Positive Behavior Interventions, 1, 101–111.
Lewis, T. J., Scott, T. M., & Sugai, G. (1994). The Problem Behavior Questionnaire: A teacher-based instrument to develop functional hypotheses of problem behavior in general education classrooms. Diagnostique, 19, 103–115.
Mathur, S. R., Quinn, M. M., & Rutherford, R. B. (1996). Teacher-mediated behavior management strategies for children with emotional/behavioral disorders [Monograph]. Arlington, VA: Council for Children with Behavioral Disorders.
Nelson, J. R., Roberts, M. L., Mathur, S. R., & Rutherford, R. B. (1999). Has public policy exceeded our knowledge base? A review of the functional behavioral assessment literature. Behavioral Disorders, 24, 169–179.
Newton, J. T., & Sturmey, P. (1991). The Motivation Assessment Scale: Inter-rater reliability and internal consistency in a British sample. Journal of Mental Deficiency Research, 35, 472–474.
O'Neill, R. E., Horner, R. H., Albin, R. W., Sprague, J. R., Storey, K., & Newton, J. S. (1997). Functional assessment and program development for problem behavior: A practical handbook. Pacific Grove, CA: Brooks/Cole.
Salvia, J., & Ysseldyke, J. E. (1991). Assessment. Boston: Houghton Mifflin.
Sasso, G. M., Conroy, M. A., Stichter, J. P., & Fox, J. J. (2001). Slowing down the bandwagon: The misapplication of functional assessment for students with emotional or behavioral disorders. Behavioral Disorders, 26, 282–296.
Scott, T. M., & Nelson, C. M. (1999). Functional behavioral assessment: Implications for training and staff development. Behavioral Disorders, 24, 249–252.
Sigafoos, J., Kerr, M., & Roberts, D. (1994). Interrater reliability of the Motivation Assessment Scale: Failure to replicate with aggressive behavior. Research in Developmental Disabilities, 15, 333–342.
Spreat, S., & Connelly, L. (1996). Reliability analysis of the Motivation Assessment Scale. American Journal on Mental Retardation, 100, 528–532.
Stichter, J. P., Shellady, S., Sealander, K. A., & Eigenberger, M. E. (2000). Teaching what we do know: Preservice training and functional behavioral assessment. Preventing School Failure, 44, 142–146.
Sugai, G., Horner, R. H., & Sprague, J. R. (1999). Functional-assessment-based behavior support planning: Research to practice to research. Behavioral Disorders, 24, 253–257.
Walker, H. M., Colvin, G., & Ramsey, E. (1995). Antisocial behavior in school: Strategies and best practices. Pacific Grove, CA: Brooks/Cole.
Walker, H. M., & Sprague, J. R. (1999). Longitudinal research and functional behavioral assessment issues. Behavioral Disorders, 24, 335–337.
Wehby, J. H., Symons, F. J., & Shores, R. E. (1995). A descriptive analysis of aggressive behavior in classrooms for children with emotional and behavioral disorders. Behavioral Disorders, 20, 87–105.
Yell, M., & Shriner, J. (1997). The IDEA Amendments of 1997: Implications for special and general education teachers, administrators, and teacher trainers. Focus on Exceptional Children, 30, 1–19.
Zarcone, J. R., Rodgers, T. A., Iwata, B. A., Rourke, D. A., & Dorsey, M. F. (1991). Reliability analysis of the Motivation Assessment Scale: A failure to replicate. Research in Developmental Disabilities, 12, 349–360.
AUTHORS’ NOTE: Preparation of this manuscript was funded in part by the Office of Special Education Programs, U.S. Department of Education (H325D980049). Address all correspondence and requests for reprints to Sally BartonArwood, Department of Special Education, Box 328 Peabody College, Vanderbilt University, Nashville, TN 37203. E-mail:
[email protected]. AUTHORS: SALLY M. BARTON-ARWOOD, Assistant Professor, and JOSEPH H. WEHBY, Assistant Professor, Department of Special Education, Peabody College of Vanderbilt University, Nashville, TN. PHILIP L. GUNTER, Professor, Department of Special Education and Communication Disorders, Valdosta State University, Valdosta, GA. KATHLEEN L. LANE, Assistant Professor, Department of Special Education, Peabody College of Vanderbilt University. MANUSCRIPT: Initial acceptance: Final acceptance:
6/1/03 6/2/03