RESEARCH ARTICLE
Psychometric Properties of the Pediatric Balance Scale Using Rasch Analysis
Nancy Darr, PT, DSc, NCS; Mary Rose Franjoine, PT, DPT, MS, C/NDT, PCS; Suzann K. Campbell, PT, PhD, FAPTA; Everett Smith, PhD
School of Physical Therapy (Dr Darr), Belmont University, Nashville, Tennessee; Department of Physical Therapy (Dr Franjoine), Daemen College, Amherst, New York; and Department of Physical Therapy (Dr Campbell), and Department of Educational Psychology (Dr Smith), University of Illinois at Chicago.
Purpose: The Pediatric Balance Scale (PBS) is a 14-item measure of functional balance for children. This study examined PBS dimensionality, rating scale function, and hierarchical properties using Rasch analysis.
Methods: The PBS data were analyzed retrospectively for 823 children, aged 2 to 13 years, with uni- and multidimensional Rasch partial credit models.
Results: The PBS best fits a unidimensional model based on the Bayesian information criterion analysis (12 400.73 vs 12 404.48), strong correlations between 3 proposed dimensions (r = 0.946-0.979), and high internal consistency (Cronbach α = 0.94). Analysis of rating scale functioning is limited by small numbers of children achieving low scores on easy items. Item maps indicated a ceiling effect but no substantive gaps between item difficulty estimates.
Conclusion: The PBS items are best targeted to preschool-age children; additional children with known balance dysfunction are required to fully assess functioning of the easiest PBS items. Revisions may improve PBS utility in older children.
(Pediatr Phys Ther 2015;27:337–348)
Key words: activities of daily living, age factors, child, child development, female, human, logistic models, male, motor skills/physiology, neurologic examination/statistics and numerical data, postural balance/physiology, preschool, psychometrics/statistics and numerical data, psychomotor disorders/classification, validity
0898-5669/110/2704-0337 Pediatric Physical Therapy. Copyright © 2015 Wolters Kluwer Health, Inc. and the Section on Pediatrics of the American Physical Therapy Association.
Correspondence: Mary Rose Franjoine, PT, DPT, MS, C/NDT, PCS, Department of Physical Therapy, Daemen College, 4380 Main St, Amherst, NY 14226 ([email protected]).
The authors declare no conflicts of interest.
This study was supported by grants from the APTA Section on Pediatrics, the Daemen College Student-Faculty Interdisciplinary Research Think Tank, and the Tennessee Physical Therapy Association.
DOI: 10.1097/PEP.0000000000000178
INTRODUCTION
Clinical evaluation of balance is a complex process involving the examination of multiple systems and their interactions. Historically, the clinician examined a child’s ability to assume and maintain developmental postures and demonstrate the presence of protective, righting, and equilibrium reactions.1,2 During the 1970s and 1980s, as our understanding of the complexities of balance expanded
and as standardized, norm-referenced measures of gross motor function became more available, therapists began to extrapolate children’s balance abilities from items within these tests. In the past 2 decades, many new standardized assessments have been developed; however, a paucity of validated stand-alone balance tools for children remains.
The Pediatric Balance Scale (PBS) is a 14-item, criterion-referenced measure of functional balance for children.3 The concept of functional balance, as used within the PBS, is defined as the ability of a child to attain and maintain upright control during typical childhood activities of daily living, school, and play. Initially, the PBS was developed to identify balance dysfunction (BD) in school-age children, although it has since been found to be an effective tool for the preschool population.4 Specifically, the PBS was developed to:
1. Identify children who lack age-appropriate functional balance and thus may be at risk for impaired safety and/or developmental delay.
2. Identify change in functional balance with maturation or intervention.
3. Identify regression in functional balance.
Since its first publication in 2003, considerable interest in the PBS has been expressed by physical therapists around the world due to its ability to identify BD in children in almost any environment with minimal equipment and resources. The PBS has been translated into at least 9 languages for clinical and research use.
Development of the PBS
The PBS was initially developed in 1994 as a modified version of the Berg Balance Scale (BBS).5 The initial step in the modification process was to administer the BBS to 13 children developing typically, aged 4 to 12 years.3 The intent was to assess children’s ability to perform BBS activities and to examine test-retest and interrater reliabilities. Many children across all ages were not cooperative with testing. Preschool-age children had difficulty understanding the instructions, and older children were challenged by the lengthiness of the static standing items. Reliability of the BBS with children could not be established. On the basis of analysis of these data in consultation with clinical and test design experts, the following changes were made to the BBS:
1. Items were reordered into functional sequences to enhance children’s attention throughout the test.
2. Time parameters were decreased for selected static sitting and standing balance items.
3. Test administration directions were clarified and standardized for the examiner.
4. Individual item instructions were simplified and made more “child-friendly.”
5. Instructions were added for engaging children during testing.
The modified tool was named the PBS.3 Directions are simple and require minimal multitasking of cognitive and motor abilities. Each PBS item is scored on a 5-point scale with specific qualitative and quantitative scoring criteria; item scores are summed to obtain a PBS total test score (maximum = 56 points).
Over a 12-year period, PBS typical performance data were collected on more than 1000 children who were healthy in conjunction with reliability and validity studies. Typical performance values based on age and sex of 643 children were published in 2010.4
PBS Reliability and Validity
Following the initial modification process, the PBS was administered to 40 children who, per parent report, were developing typically, and to 20 children with mild to moderate motor impairments, for the purpose of determining test-retest and interrater reliabilities.3 Good to excellent test-retest reliability was demonstrated in the group of children developing typically utilizing intraclass correlation coefficients (ICC(3,1) = 0.850). Test-retest (ICC(3,1) = 0.998) and interrater (ICC(3,1) = 0.997) reliabilities in the group of school-age children with known motor impairments were excellent. Test-retest (ICC(2,1) = 0.923), interrater (ICC(2,1) = 0.972), and intrarater (ICC(2,1) = 0.895-0.998) reliabilities were subsequently assessed in a larger sample of children, aged 2 to 12 years.4
The process of evaluating validity evidence for the inferences made from PBS data began by examining PBS concurrent validity with standardized developmental assessment tools. The PBS demonstrated moderate to good concurrent validity with the Stationary, Locomotion, and Object Manipulation subtests of the Peabody Developmental Motor Scales, second edition (PDMS-2) in 54 children developing typically.6 Concurrent validity was stronger between the PDMS-2 Stationary subtest raw scores and the PBS total test scores (rs = 0.633) than between the PBS and the Locomotion (rs = 0.518) and the Object Manipulation (rs = 0.531) subtest scores. The content focus of the PDMS-2 Stationary subtest is evaluation of static standing balance; several items are common to both tests including standing on 1 foot and standing with 1 foot in front (PBS items 8 and 9).
The PBS concurrent validity with the Bruininks-Oseretsky Test of Motor Proficiency (BOT-2) subtest scores was examined in 50 children, aged 4 to 11 years.7 Correlations between PBS total test scores and BOT-2 subtest total point scores were stronger for the Running Speed and Agility (rs = 0.777) and Strength (rs = 0.714) subtests than for the Balance (rs = 0.671) and Bilateral Coordination (rs = 0.625) subtests. Correlations were weakest in 4- to 6-year-olds. Younger children, aged 4 to 6 years, had difficulty following directions on the BOT-2; however, they followed PBS directions with ease.
Initial examination of discriminant validity has also been undertaken.
The PBS can be used to discriminate preschool children developing typically from those with mild and moderate balance impairments.4,8 School-age children with moderate balance impairments can also be identified by using the PBS; however, the PBS in its current form cannot be used to discriminate school-age children with mild motor impairments from those developing typically.8
The PBS has also been shown to identify progression and regression of balance abilities in children with cerebral palsy.9 The PBS was administered to 20 children with cerebral palsy triannually for 3 years. At each evaluation children’s caregivers provided health history information, and perceptions of balance were rated by caregivers, treating physical therapists, and the children. Changes in PBS scores were strongly correlated with orthotic use (rs = 0.830) and moderately correlated (rs = 0.541) with medical events such as surgeries, onset of seizures, and orthopedic injuries. Perceived changes in balance capabilities were consistent with changes in PBS scores of 10 or more points.
Rasch Analysis
Rasch models have been used to solve various measurement problems that plague the rehabilitation, social, and behavioral sciences.10-13 The use of Rasch analysis to support validity has been employed in the development of many assessment tools for children including the Test of Infant Motor Performance,14 the Alberta Infant Motor Scale,15 the Gross Motor Function Measure-66,16 and the Pediatric Evaluation of Disability Inventory.17 Rasch analysis is used to compare response patterns of individuals to hypothesized responses based on Rasch modeling to estimate person ability and item difficulty, enabling test developers to examine dimensionality, hierarchy of item difficulty, and item and rating scale functioning. If the data fit the model, the underlying continuum (represented in log-odds or logit units) is on an interval scale, making the estimates appropriate for parametric statistical analysis. Greater logit values for items indicate increasing item/threshold difficulty; children with higher logit values have more ability than children at lower levels. Standard errors associated with each estimate quantify the precision of the estimate. The specific Rasch model appropriate for the current analyses is the partial credit model as each item on the PBS has a unique scoring model.
Strong psychometrics coupled with feedback from clinicians and investigators indicate that the PBS is a promising tool to evaluate functional balance in children. The next logical step was to further investigate PBS psychometric properties (dimensionality, item hierarchy, and rating scale functioning) using Rasch analysis. The aims of this study were to systematically investigate the validity of inferences made from PBS data using a Rasch model of analysis and to determine whether further PBS modifications are indicated. The specific goals of the Rasch analysis for this study were to:
1. Determine dimensionality of the PBS via model comparisons, item fit, and principal component analysis (PCA) of standardized residuals.
2. Examine functioning of the rating scales within the existing PBS structure.
3. Examine PBS hierarchical properties including:
a. Identify the hierarchy of PBS item difficulty estimates for clinical expectations.
b. Identify gaps in the item hierarchy where measurement precision may be improved.
c. Identify possible statistical redundancies between items.
d. Identify potential ceiling and/or floor effects.
Results of the Rasch analysis will be used to identify items and response formats that need to be omitted, revised, or added.
METHODS
Subjects
This study involved retrospective analyses of data collected from 13 previous studies conducted between 1994 and 2011.3,4,9 Table 1 describes the sample for this portion of the Rasch analysis, which included 823 children, 685 with typical development (TD) and 138 with known BD. Although the specific inclusion and exclusion criteria varied among studies, all children were healthy at the time of data collection, were able to stand unsupported for at least 4 seconds, and could follow 1-step directions.
Rater Training
Thirty-three raters (4 physical therapy faculty and 29 entry-level physical therapist students) participated in data collection over a period of 14 years.4 Each faculty rater had more than 10 years of experience teaching pediatrics in entry-level physical therapy programs, in addition to extensive clinical experience in various settings (clinic, hospital, school, early intervention, and public health). All raters, including faculty mentors, were formally trained in administration and scoring of the PBS by the test author (M.R.F.) before participating in data collection. Each rater participated in a minimum of 20 hours of training that included the following activities: (1) expert
TABLE 1 Profile of Participants

Age, y   Rasch Age Group   Total, n   Boys   Girls   TD    BD Mild   BD Mod   BD Total
2        1                 27         17     10      24    3         0        3
3        1                 98         54     44      93    4         0        4
4        2                 146        79     67      140   5         1        6
5        3                 199        101    98      164   11        24       35
6        4                 134        64     70      116   9         9        18
7        5                 56         29     27      38    12        6        18
8        5                 54         25     29      32    13        9        22
9        6                 40         21     19      26    9         5        14
10       6                 32         12     20      26    2         4        6
11       6                 19         10     9       15    1         3        4
12       6                 9          2      7       7     2         0        2
13       6                 9          5      4       3     2         4        6
Total                      825        420    405     687   73        65       138

Abbreviations: BD, children with known balance dysfunction; BD mild, independent ambulation without assistive device; BD mod, required assistive technology (walking aids or wheelchairs) for community mobility; TD, children with typical development.
demonstration; (2) peer simulated practice; (3) practice scoring from video; and (4) test administration and scoring with children who were not part of the data set. At the conclusion of training each rater achieved a mastery-skill level for administration and scoring of the PBS established by the test author; interrater reliability was established within-rater groups and with the test author (ICC(3,2) = 0.90-0.92). Student raters collected data under the supervision of a faculty investigator.
DATA ANALYSIS
Uni- and multidimensional Rasch partial credit models were used for all Rasch analyses.18-20 The Conquest (V2.0, http://www.acer.edu.au/conquest)21 program was used for multidimensional models, and Winsteps (V3.73, http://www.winsteps.com/index.htm)22 was used for all unidimensional models. The partial credit model was selected because PBS items use a 5-point rating scale with varying scoring criteria across items.
Dimensionality
The PBS items were initially hypothesized to represent 3 dimensions:4 (1) static balance, 6 items; (2) anticipatory balance, 3 items; and (3) functional movement transitions, 5 items. The hypothesized 3-dimensional model was compared with a unidimensional model for overall fit to the data.18,23 Both of these hypothesized models were estimated and compared for relative model fit using Conquest.21 Two parsimony fit indices that take into account model complexity were employed for the dimensionality assessment, the Akaike Information Criterion (AIC)24 and the Bayesian Information Criterion (BIC).25,26 Both the AIC and the BIC are model selection approaches that combine data-to-model fitting with a penalization for model complexity.25-27 Lower values of both indices are indicative of better fit. The model with the lowest AIC and BIC values was chosen for additional analyses. Internal consistency estimates were calculated for the items comprising each retained dimension. Data were subjected to further dimensionality checks using Winsteps.
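The penalized model comparison described above can be illustrated with a short Python sketch. It uses the standard definitions AIC = deviance + 2k and BIC = deviance + k ln(n), where k is the number of estimated parameters and n the sample size. The deviances and parameter counts below are hypothetical placeholders chosen only for illustration; they are not values from this study or output from Conquest. Only the sample size of 823 comes from the text.

```python
import math


def aic(deviance: float, n_params: int) -> float:
    """Akaike Information Criterion: deviance plus 2 per estimated parameter."""
    return deviance + 2 * n_params


def bic(deviance: float, n_params: int, n_obs: int) -> float:
    """Bayesian Information Criterion: the penalty grows with log sample size."""
    return deviance + n_params * math.log(n_obs)


def preferred(index_by_model: dict) -> str:
    """Return the name of the model with the lowest index value."""
    return min(index_by_model, key=index_by_model.get)


# Hypothetical deviances and parameter counts (assumptions, not study values).
models = {
    "unidimensional": {"deviance": 12000.0, "n_params": 50},
    "three-dimensional": {"deviance": 11995.0, "n_params": 55},
}
n_children = 823  # sample size reported in the text

aic_values = {name: aic(m["deviance"], m["n_params"]) for name, m in models.items()}
bic_values = {
    name: bic(m["deviance"], m["n_params"], n_children) for name, m in models.items()
}

print("AIC prefers:", preferred(aic_values))
print("BIC prefers:", preferred(bic_values))
```

In this toy case the 3-dimensional model fits slightly better (lower deviance), but the improvement does not offset the penalty for its 5 extra parameters, so both indices favor the simpler model.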
The Rasch models implemented by Winsteps22 require unidimensionality and local independence. Unidimensionality means that a single dominant construct is being measured. Local independence means that, after controlling for the latent trait, responses to items should be independent of each other. These assumptions were evaluated using fit statistics, correlations among standardized residuals, and PCA of standardized residuals.
Item fit statistics were used to identify items that may not be contributing to a unidimensional construct or items that may be statistically dependent. Outfit mean-square fit statistic (MnSq) values greater than 1.4 or standardized Z values (Zstd scores) greater than 2 were used to identify items with misfit. The MnSq values greater than 1.4 or Zstd greater than 2 indicate erratic scoring or that the item belongs to a different construct. Items identified as misfitting the model were examined for potential reasons for the misfit.
As item fit statistics are not useful in all situations for detecting model violations, a PCA of standardized residuals was also performed. The combination of item fit statistics and PCA of standardized residuals is effective at detecting departures from the unidimensionality requirement of the Rasch model.28-30 If the first principal contrast of the standardized residuals approaches the noise level, data unidimensionality is supported. The noise level is estimated by first simulating a data set on the basis of the item thresholds and children measures estimated from the current observed data. These simulated data are constructed to fit model expectations. Therefore, the PCA of the standardized residuals from these simulated data reveals the noise level one can expect when the data fit the model. The level of noise is quantified by the size of the eigenvalue of the first residual contrast. Comparison of the eigenvalues derived from the actual and simulated data was used to help reveal the extent to which the model requirements were met (ie, if the eigenvalue from the real data is substantially larger than the corresponding eigenvalue from the simulated data it may indicate the presence of a secondary dimension).
Adherence to the local independence assumption is important as violations may lead to inaccurate estimates of item thresholds and child ability and may result in an overestimation of reliability.31,32 Yen’s Q3 statistic,33,34 which is a Pearson correlation coefficient between the residuals for a pair of items after partialling out the measured construct, was used to investigate the degree of local dependence. Values of at least 0.20 were used to flag pairs of items that may be in possible violation of local independence.
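The Q3 screening step can be sketched as follows. The sketch assumes that model residuals (observed minus expected score) are already available for each child on each item; in practice these would come from the fitted Rasch model. The item names and residual values are invented toy data, and the function names are hypothetical. Only the 0.20 cutoff is taken from the text.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def q3_flags(residuals, cutoff=0.20):
    """Flag item pairs whose residual correlation (Q3) is at least `cutoff`.

    `residuals` maps an item name to a list of residuals, one per child,
    with children in the same order for every item.
    """
    items = list(residuals)
    flagged = []
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            q3 = pearson(residuals[items[i]], residuals[items[j]])
            if q3 >= cutoff:
                flagged.append((items[i], items[j], round(q3, 2)))
    return flagged


# Toy residuals for 6 children on 3 hypothetical items (not study data).
toy_residuals = {
    "item_a": [0.5, -0.3, 0.2, -0.4, 0.1, -0.1],
    "item_b": [0.4, -0.2, 0.3, -0.5, 0.0, 0.0],
    "item_c": [-0.1, 0.3, -0.4, 0.1, 0.2, -0.1],
}

for first, second, q3 in q3_flags(toy_residuals):
    print(f"{first} and {second} may be locally dependent (Q3 = {q3})")
```

Here only item_a and item_b are flagged, because their residuals rise and fall together once ability has been accounted for; the other two pairs have negative residual correlations.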
A potential consequence of statistically redundant items is the artificial increase of person reliability (ie, the Rasch analog of the Cronbach α).
Rating Scale Function
Three indicators of rating category usage were used to address 3 important assumptions of an adequately functioning rating scale:35 (1) each consecutive number of the scale should represent more of the latent construct under investigation, that is, each consecutively higher number on the rating scale should correspond to a higher level of ability; (2) as one moves up the ability continuum, each point on the rating scale should in turn become the most probable response; and (3) idiosyncratic category use on the part of the raters is minimized.
Rasch average measures were used to examine appropriateness of rating category order. Average measures are the average proficiencies of the children observed in a particular category, averaged across all occurrences of the category. If average measures increase with each successive rating point, it implies that higher ability is associated with higher category labels.35
For assumption number 2, Andrich thresholds were examined to determine which categories were unlikely to be observed.36 Ordered thresholds imply that as one moves up the ability continuum, each category in turn becomes the most probable response. Disordered thresholds may indicate underutilized rating categories.
Outfit mean-square statistics were used to detect idiosyncratic category use. Values greater than 2 indicate a presence of excessive variability, suggesting that the category has been used unexpectedly and that another response was expected by the model. Using these indicators, undesirable noise (nonsystematic variance) can be identified and suggestions for dealing with such undesirable noise provided (ie, combining categories, reversing the ordering of the categories, or the creation of new categories to increase measurement precision).36
Hierarchical Structure of Items
The Rasch analysis was used to produce variable maps, which visually display the distribution of child ability and item difficulty on a common logit scale. These maps were used to identify how well the PBS item difficulties and thresholds targeted the test sample of children. Distributions of items along the continuum of difficulty were also visually analyzed to identify potential gaps, as well as possible ceiling and floor effects.
Item fit statistics, the PCA of standardized residuals, Yen’s Q3 statistics, and visual inspection of item maps in conjunction with clinical analyses of item content were all evaluated to identify items to be retained, deleted, or revised. Measures of rating scale function were used to identify items with rating scale disorder or redundancy, which may require modification.
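The link between threshold order and the "most probable response" assumption can be made concrete with a small sketch of the partial credit model. Under that model, the probability of scoring in category k is proportional to the exponential of the summed differences between ability and each step threshold up to k. The threshold values below are hypothetical and are expressed directly as step difficulties in logits for a single item, which is a simplifying assumption; they are not PBS estimates, and the function names are illustrative only.

```python
import math


def pcm_probabilities(theta, steps):
    """Category probabilities for one item under the partial credit model.

    `theta` is person ability in logits; `steps` lists the step thresholds
    for moving from category k-1 to k (so 4 steps give categories 0 to 4).
    """
    # Cumulative sums of (theta - step); category 0 has an empty sum of 0.
    cumulative = [0.0]
    for step in steps:
        cumulative.append(cumulative[-1] + (theta - step))
    largest = max(cumulative)  # subtract the maximum for numerical stability
    weights = [math.exp(value - largest) for value in cumulative]
    total = sum(weights)
    return [w / total for w in weights]


def modal_categories(steps, low=-6.0, high=6.0, n_points=241):
    """Categories that are the most probable response somewhere on the grid."""
    seen = set()
    for i in range(n_points):
        theta = low + (high - low) * i / (n_points - 1)
        probs = pcm_probabilities(theta, steps)
        seen.add(probs.index(max(probs)))
    return seen


ordered_steps = [-2.0, -0.5, 0.5, 2.0]      # hypothetical, strictly increasing
disordered_steps = [-2.0, 1.0, -0.5, 2.0]   # hypothetical, steps 2 and 3 reversed

print("Ordered thresholds, modal categories:", sorted(modal_categories(ordered_steps)))
print("Disordered thresholds, modal categories:", sorted(modal_categories(disordered_steps)))
```

With the ordered thresholds every category from 0 to 4 is the most probable response over some range of ability. With the disordered thresholds category 2 never becomes the most probable response, which is the pattern that flags a potentially underutilized category.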
RESULTS
Dimensionality
The first analyses were undertaken to determine whether the PBS is uni- or multidimensional. This multistep process involved model comparisons, PCA, and analyses of item fit and local independence. Both the AIC and BIC suggest the PBS better fits a 1-dimensional model (AIC = 12 141.52; BIC = 12 400.73) compared with the hypothesized 3-dimensional model (AIC = 123 121.70; BIC = 12 404.48). The PCA also supported PBS unidimensionality with 4.4% of residual variance being accounted for by the first eigenvalue compared with 2.9% from the simulated data. On the basis of these results, along with the strong correlations among the 3 proposed dimensions (r = 0.946-0.979) and the relatively small number of items, further Rasch analyses were performed assuming a unidimensional model.
Results of item fit statistics described below also support unidimensionality of the PBS. Outfit statistics are presented in Table 2. The MnSq values greater than 1.4 or Zstd scores greater than 2 were used to identify items with misfit. All items except item 14 (Reach Forward) met the predetermined fit criteria. A Zstd score of 3.1 for item 14 indicates a potential misfit to the unidimensional requirement. Analysis of children’s observed versus the Rasch model expected performances (ie, residuals) on item 14 suggests that the misfit resulted from a small number of children (27 out of 645 or 4.2%), with a number of high-performing children not scoring as well as predicted by the Rasch model.
Yen’s Q3 statistic identified the following pairs of items as potentially statistically redundant: Sit to Stand and Stand to Sit (items 1 and 2, respectively), r = 0.30;
TABLE 2 Dimensionality: Rasch Measures and Overall Item Fit

Item Number   Item Name                         Measureᵃ   Model SEᵇ   Outfit MnSqᶜ   Outfit Zstdᶜ
1             Sit to Stand                      −1.05      0.11        0.37           −2.2
2             Stand to Sit                      −0.46      0.09        1.23           0.8
3             Transfer                          −1.18      0.08        0.83           −1.9
4             Stand Unsupported                 −0.25      0.07        1.10           0.4
5             Sit Unsupported                   −0.45      0.08        0.74           −0.5
6             Stand with Eyes Closed            −0.19      0.07        0.62           −1.3
7             Stand Feet Together               0.28       0.06        0.79           −0.6
8             Stand with One Foot in Front      1.69       0.05        0.86           −2.1
9             Stand on One Foot                 1.38       0.05        0.85           −2.4
10            Turn 360°                         0.62       0.05        1.30           1.9
11            Turn to Look Behind               −0.69      0.08        1.22           0.7
12            Retrieve Object from Floor        −0.84      0.09        0.39           −1.6
13            Placing Alternate Foot on Stool   0.13       0.06        0.63           −1.1
14            Reach Forward                     1.00       0.06        1.19           3.1ᶜ
Mean                                            0.00       0.07        0.87           −0.5
SD                                              0.86       0.02        0.3            1.6

Abbreviations: SD, standard deviation; SE, standard error; MnSq, mean-square fit statistics; Zstd, standardized Z values.
ᵃ Rasch Measure (in logits): greater logit values indicate increasing item difficulty.
ᵇ Model standard error: standard errors associated with each Rasch measure quantify the precision of the estimated logit values.
ᶜ Outfit MnSq >1.4 or Outfit Zstd >2 indicates item misfit to a unidimensional model.
and Transfers and Reach Forward (items 3 and 14, respectively), r = 0.21.
Rating Scale Function
The second task of the Rasch analysis was to examine the rating scale functioning for each PBS item, specifically rating category order and fit, as well as thresholds for category usage. Children’s average measures increased monotonically across all categories within each item except for item 3 between categories 0 and 1 and item 5 between categories 0 and 1 and categories 2 and 3 (Table 3, column 5). The model met the minimum requirement of 10 responses per category needed to obtain stable estimates of the thresholds for each item except for the 0 category in items 1, 2, 3, 11, and 12. Item 1 also had fewer than 10 children in rating category 1. Items 1, 2, 3, 11, and 12 were very easy items for this sample of children with greater than 90% of the children receiving the maximum possible score on each of these items.
Rating scale category fit for each item was acceptable (outfit MnSq < 2) with the following exceptions: rating category 1 for items 1 and 2; and rating category 0 for items 5 and 11. Some of these rating categories, however, had fewer than the 10 responses required by the Rasch model, which may contribute to high outfit36 (Table 3, column 6). Andrich thresholds and visual inspection of category probability curves indicated some disordered thresholds, most commonly in categories 1, 2, and/or 3, which varied somewhat by item (Table 3, column 7).
Hierarchical Structure of Items
The hierarchical structure of PBS items was analyzed, specifically for hierarchical fit to clinical expectations, and identification of gaps and/or redundancies between items and any potential floor or ceiling effects. The distributions of item difficulties and children measures are plotted on the logit metric in the variable map shown in Figure 1. Children’s ability estimates are noted on the left side of the graph.
The mean Rasch measure for the children was 3.14 ± 1.90 logits with a range from approximately −3 to +5 logits. When children with extreme scores (those who achieved the maximum possible score of 56) were removed, the mean Rasch measure was 2.42 ± 1.47 logits. The PBS items are located at their overall difficulty level on the right side of the figure. Item difficulty ranges from approximately −1 to +3 logits. If the item difficulties along with the thresholds of those items (thresholds not shown in the figure) overlap the ability levels of the group of children, then the items are considered to be well targeted (ie, appropriate) for the current sample. The mean child ability is more than 2 logits greater than the mean item difficulty, indicating ability for this sample of children exceeds PBS item difficulty. A ceiling effect is also noted with 178 children (21.4%) achieving the highest Rasch measures.
Visual inspection of the item map indicates that items 9 (Standing on One Foot), 8 (Standing One Foot in Front), and 14 (Reach Forward) are the most difficult items. A cluster of items was also found at the lower end of the continuum, which includes items 1 (Sitting to Standing), 2 (Standing to Sitting), 3 (Transfers), 5 (Sitting Unsupported), 11 (Turning to Look Behind), and 12 (Retrieving Object from Floor). These findings are consistent with commonly accepted developmental skill acquisition in children. No substantive gaps were found between the item difficulty estimates along with the thresholds of those items. The primary weakness is the lack of items at the higher end of the distribution. This limits measurement precision and ability estimates in children with high balance ability, thus limiting utility of the PBS in the school-age population.
The Rasch person reliability was 0.67, whereas a traditional estimation of internal consistency among all 14 PBS items yields a Cronbach α of 0.94. The discrepancy is due to the ceiling effects discussed above, as children with extreme scores are treated differently in the calculations of Rasch and traditional estimates of internal consistency.
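For readers less familiar with the traditional estimate, Cronbach α can be computed directly from raw item scores as α = k/(k − 1) × (1 − Σ item variances / variance of total scores), where k is the number of items. The sketch below applies that standard formula to a small invented score matrix (5 children, 3 items scored 0 to 4). The data and function name are hypothetical and are not drawn from the PBS sample.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach alpha for a list of rows, one row of item scores per child."""
    n_items = len(scores[0])
    # Sample variance of each item across children.
    item_variances = [
        variance([row[i] for row in scores]) for i in range(n_items)
    ]
    # Sample variance of each child's total score.
    total_variance = variance([sum(row) for row in scores])
    return n_items / (n_items - 1) * (1 - sum(item_variances) / total_variance)


# Invented scores for 5 children on 3 items (not study data).
toy_scores = [
    [4, 4, 4],
    [3, 4, 3],
    [2, 3, 2],
    [1, 2, 1],
    [0, 1, 1],
]

print(f"Cronbach alpha = {cronbach_alpha(toy_scores):.3f}")
```

Because every child, including those at the maximum total score, enters this calculation through the raw score variances, the coefficient can differ from a Rasch person reliability that handles extreme scores differently.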
DISCUSSION
The PBS is a unique, stand-alone measure of functional balance for children. Minimal space, equipment, and administration time requirements make the PBS amenable to use in environments where resources are limited. The measure is also easy to administer, score, and interpret; most items are familiar activities for children. The PBS has demonstrated strong psychometric characteristics in previous studies.3,4 The next logical step was to further evaluate PBS validity using Rasch analysis, which has been used increasingly in the process of developing standardized assessment tools for pediatric therapy and rehabilitation. The specific intent of this Rasch analysis was to examine dimensionality, rating scale functioning, and PBS item hierarchy.
Dimensionality
Balance is a complex construct, and the lack of consensus regarding categorization and terminology of dimensions comprising balance increases the challenge of developing a balance assessment tool. Initially, the authors hypothesized that the PBS included 3 distinct dimensions of balance (static, anticipatory, and functional movement transitions).4 Six items were hypothesized to measure a child’s ability to assume and maintain static positions requiring increasing levels of postural control ranging from sitting unsupported on a bench to single limb stance. Three items were intended to assess anticipatory balance and include activities that require the child to maintain balance in a self-induced functional movement such as reaching into forward space. The third dimension was hypothesized to contain items that require the child to maintain balance while performing functional movement transitions such as transfers. Although the PBS was hypothesized to consist of
TABLE 3 Rating Scale Function by Item

Item Number, Name                 Rating Categoryᵃ   Observed Count   Percentage   Average Measuresᵇ   Outfit MnSqᶜ   Andrich Thresholdᵈ
1 Sit to Standᵃ                   1                  8                1            −1.39               2.15           None
                                  2                  14               2            −0.68               0.41           −0.36
                                  3                  34               4            0.9                 0.20           −0.4
                                  4                  767              93           2.68                0.79           −0.05
2 Stand to Sitᵃ                   1                  20               2            −0.42               8.54           None
                                  2                  13               2            −0.40               0.18           0.47
                                  3                  42               5            0.65                0.36           −0.20
                                  4                  748              91           2.72                0.83           −0.27
3 Transfers                       0                  1                0            −1.19               1.23           None
                                  1                  10               1            −1.47               0.43           −2.65
                                  2                  22               2            −0.04               0.67           0.01
                                  3                  193              24           1.56                0.85           −0.11
                                  4                  595              72           3.08                0.93           2.75
4 Stand Unsupported               0                  16               2            −0.48               1.14           None
                                  1                  19               2            −0.35               0.58           −0.47
                                  2                  36               4            0.55                1.85           −0.25
                                  3                  15               2            0.79                0.36           2.02
                                  4                  736              90           2.77                0.97           −1.31
5 Sit Unsupported                 0                  14               2            −0.47               2.06           None
                                  1                  10               1            −0.71               1.08           0.06
                                  2                  30               4            0.38                1.10           −0.69
                                  3                  17               2            0.37                0.18           1.74
                                  4                  750              91           2.71                1.04           −1.11
6 Stand with Eyes Closed          0                  15               2            −0.90               1.45           None
                                  1                  19               2            −0.18               1.37           −0.61
                                  2                  39               5            0.61                0.72           −0.34
                                  3                  31               4            1.11                0.24           1.43
                                  4                  707              87           2.68                0.88           −0.48
7 Stand with Feet Together        0                  31               4            −0.67               0.70           None
                                  1                  19               2            0.12                0.42           0.02
                                  2                  65               8            0.77                0.88           −1.01
                                  3                  23               3            1.79                0.78           2.04
                                  4                  683              83           2.93                0.77           −1.05
8 Stand with One Foot in Front    0                  74               9            0.24                1.09           None
                                  1                  53               6            0.58                0.54           −0.71
                                  2                  179              22           2.13                0.68           −1.33
                                  3                  113              14           2.74                0.95           1.18
                                  4                  398              49           3.62                1.01           0.86
9 Stand on One Foot               0                  35               4            −0.59               0.70           None
                                  1                  117              14           0.99                0.71           −2.17
                                  2                  99               12           2.04                0.91           0.23
                                  3                  172              21           2.66                0.80           0.39
                                  4                  396              48           3.61                1.02           1.55
10 Turn 360°                      0                  21               3            −0.42               0.91           None
                                  1                  91               11           0.77                1.31           −2.01
                                  2                  36               4            1.08                1.93           1.28
                                  3                  71               9            2.49                0.96           0.47
                                  4                  593              73           3.06                1.48           0.26
11 Turn to Look Behind            0                  5                1            −0.73               2.43           None
                                  1                  26               3            −0.29               1.26           −1.84
                                  2                  27               3            0.20                0.80           0.64
                                  3                  22               3            1.79                1.48           1.71
                                  4                  736              90           2.74                1.05           −0.51
12 Retrieve Object from Floor     0                  7                1            −1.72               0.60           None
                                  1                  15               2            −0.88               0.13           −0.93
                                  2                  16               2            0.43                0.85           0.51
                                  3                  14               2            0.66                0.13           1.50
                                  4                  771              94           2.66                0.87           −1.07
(continues)
Rasch Analysis of the PBS
343
Copyright © 2015 Wolters Kluwer Health, Inc. and the Section on Pediatrics of the American Physical Therapy Association. Unauthorized reproduction of this article is prohibited.
TABLE 3 Rating Scale Function by Item (Continued)
Item Number Name 13 Alternate Step on Stool
14 Reach Forward
Rating Categorya
Observed Count
Percentage
Average Measuresb
Outfit MnSqc
Andrich Thresholdd
0 1 2 3 4 0 1 2 3 4
29 21 36 25 708 16 34 100 327 310
4 3 4 3 86 2 4 13 42 39
− 0.70 − 0.11 0.84 1.18 2.86 − 0.64 0.36 1.54 2.9 3.19
0.77 0.27 1.04 0.25 0.89 1.47 0.99 0.92 1.32 1.34
None −0.06 −0.29 1.34 −0.99 None −1.98 −1.11 −0.04 3.12
Abbreviation: MnSq, mean-square fit statistics. a No children scored “0” on items 1 and 2. b Average measures are average proficiencies of the children observed in each category averaged across all occurrences of the category. Increases in average measures with each successive rating point within an item rating scale imply that higher numerical values on the printed rating scale correspond to higher ability levels. c Outfit MNSQ values >2 indicate excessive variability where the rating category was used unexpectedly. (The Rasch model expected another category to be selected.) d Andrich thresholds: disordered thresholds may indicate underutilized rating categories or categories that are never the most probable to be selected anywhere along the ability continuum.
3 dimensions, evidence from Rasch analyses indicated that the PBS better fits a unidimensional model. Several factors may contribute to the discrepancy between the Rasch analysis results and the originally hypothesized multidimensionality. The relatively small number of items (14) may be insufficient to distinguish 3 distinct dimensions of balance. Furthermore, some PBS items may not be precise enough to measure only 1 balance dimension and may be multiconstruct in nature. For example, in item 3 the child is asked to transfer from a child-size bench into an adult-size chair with arms. This item was initially classified as a functional movement transition. The authors expected that most children would choose to perform a stand-pivot transfer; however, analysis of the performance of younger, less able children revealed self-selected movement strategies that were motorically more complex. This group of children tended to stand up from the bench and move to face the chair. They then climbed up into the chair before turning 180° to sit. Their self-selected transfer strategy required a blend of triplanar functional movements and anticipatory postural control. On the basis of the Rasch analysis and further clinical reflection, the PBS appears to be a unidimensional assessment tool that measures 1 blended dimension of functional balance rather than 3 distinct dimensions as previously hypothesized. This indicates that the PBS should always be administered and scored in its entirety.

Fit statistics provide additional support for PBS unidimensionality. Only item 14 (Reach Forward) did not meet the criteria for good fit to the unidimensional model. Item 14 measures the ability to maintain balance while displacing one’s center of mass anteriorly in the context of a forward reach. This activity is hypothesized to require anticipatory control in preparation for reaching forward without moving the feet. To obtain a reliable reach measure, the child is required to sustain the forward reach position briefly, thus necessitating a blend of anticipatory control with end range static stability. Item 14 posed testing challenges that varied with age and the presence of known BD. Preschool-age children had considerable difficulty understanding and following directions for this item; they tended to step forward rather than reach. Children with known BD had difficulty controlling and/or maintaining the forward reach. Neither of these groups was likely a major contributor to item misfit, as both groups also tended to be challenged by PBS items of similar difficulty (Standing on One Foot, Standing with One Foot in Front). The majority of item 14 misfit occurred because high-performing children scored lower on item 14 than the Rasch model predicted. Per examiner feedback, many children in this group tried to reach forward to the maximum of their capabilities but then were unable to sustain the extreme anterior terminal posture for measurement. The misfit of this item may be reduced by clarifying the instructions to children to emphasize sustained stability at end reach.

Item 14 demonstrates a well-functioning rating scale and is among the most difficult PBS items. Removal of this item would further contribute to the PBS ceiling effect. Although the Rasch analysis indicated a potential misfit to the unidimensional model, the decision to retain item 14 was based on the following 4 factors: (1) the functional importance of the reaching forward task, (2) the unique blend of movement strategies required, (3) the level of difficulty, and (4) the rating scale functioning.

Yen’s Q3 statistic indicates potential violations of local independence between items 1 and 2 and between items 3 and 14. These violations indicate potential statistical redundancy within these item pairs and mathematically support removal of 1 item from each pair. The decision to
Fig 1. Hierarchical structure of items. Expected score person-item hierarchy map for all children, aged 2 to 13 years. The distributions of item difficulties (right) and child measures (left) are plotted on a logit metric. The continuum represents difficulty level and hierarchy for items (on the right) and ability level of children (on the left). M = mean; S = 1 standard deviation; T = 2 standard deviations for both child and item distributions. Each “#” = 14 children. Each “.” = 1 to 13 children. Mean child ability is greater than mean item difficulty, indicating that the ability of this sample of children on average exceeds the average PBS item difficulty. One hundred seventy-eight children (21.4%) achieved the highest Rasch measures. Items 9, 8, and 14 are the most difficult items, with items 1, 2, 3, 5, 11, and 12 clustered at the bottom of the difficulty continuum. No substantive gaps between the item difficulty estimates are observed.
remove an item should not be based solely on statistical analysis; the clinical relevance of each item must also be carefully considered. Items 1 (Sit to Stand) and 2 (Stand to Sit) are among the easiest items for children and measure the child’s ability to maintain postural control during transitional movements. Item 1 requires concentric control as the child moves from sit to stand, and item 2 requires eccentric control when moving from stand to sit. Both are clinically relevant, especially for younger children with known BD. Most children developing typically obtain the maximum score of 4 on items 1 and 2 (97.8% and 95.8%, respectively); however, over half the children with moderate BD scored less than 4 on items 1 and 2 (52.3% and 56.9%, respectively). These items provide baseline performance for children with moderate BD and thus may be essential for demonstrating improvements in balance with intervention.

The Reach Forward (14) and Transfers (3) items are both clinically very important and functionally very different tasks. Item 14 requires the child to use anticipatory control to reach into forward space without moving his or her feet. As previously described, item 3 requires an element of anticipatory control; however, the child must also move the entire body through space in multiple planes to transfer from a bench to a chair. Analysis of the item hierarchy map indicates that item 3 is a relatively easy task and item 14 is one of the most difficult tasks for children. After considering both the statistical consequences of potential violations of local independence (ie, artificial increase in internal consistency) and the clinical applications of items 1, 2, 3, and 14, the decision was made to retain both pairs of items.

Rating Scale Function

The Rasch analysis examined rating scale functioning for each PBS item and identified rating scale categories that potentially should be collapsed, expanded, or revised.
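The two rating scale diagnostics used in this section, ordering of average measures and ordering of Andrich thresholds, can be checked mechanically. The following is a minimal sketch in Python, not the authors' Winsteps procedure. It uses values copied from Table 3 for four items, and the function names, the ability grid, and the choice to set item difficulty to 0 are illustrative assumptions (item difficulty only shifts the category probability curves along the ability axis, so it does not change which categories are ever the most probable).

```python
from math import exp

# Values copied from Table 3, listed from the lowest to the highest category.
AVERAGE_MEASURES = {
    3: [-1.19, -1.47, -0.04, 1.56, 3.08],
    5: [-0.47, -0.71, 0.38, 0.37, 2.71],
    9: [-0.59, 0.99, 2.04, 2.66, 3.61],
    14: [-0.64, 0.36, 1.54, 2.90, 3.19],
}
# Andrich thresholds; the "None" entry for the lowest category is omitted.
ANDRICH_THRESHOLDS = {
    3: [-2.65, 0.01, -0.11, 2.75],
    5: [0.06, -0.69, 1.74, -1.11],
    9: [-2.17, 0.23, 0.39, 1.55],
    14: [-1.98, -1.11, -0.04, 3.12],
}


def disordered_steps(values):
    """Return each index i at which values[i + 1] fails to exceed values[i]."""
    return [i for i in range(len(values) - 1) if values[i + 1] <= values[i]]


def category_probabilities(theta, thresholds, difficulty=0.0):
    """Partial credit category probabilities for one item at ability theta.

    difficulty=0.0 is an illustrative assumption, not a value from the article.
    """
    # Cumulative sums of (theta - difficulty - tau_k); category 0 has sum 0.
    logits = [0.0]
    for tau in thresholds:
        logits.append(logits[-1] + theta - difficulty - tau)
    peak = max(logits)  # subtract the maximum for numerical stability
    weights = [exp(v - peak) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]


def modal_categories(thresholds, lo=-6.0, hi=6.0, steps=241):
    """Categories that are the most probable somewhere on a grid of abilities."""
    modal = set()
    for s in range(steps):
        theta = lo + (hi - lo) * s / (steps - 1)
        probs = category_probabilities(theta, thresholds)
        modal.add(probs.index(max(probs)))
    return sorted(modal)


if __name__ == "__main__":
    for item in sorted(AVERAGE_MEASURES):
        print(
            f"item {item}:",
            "average-measure reversals at", disordered_steps(AVERAGE_MEASURES[item]),
            "| threshold reversals at", disordered_steps(ANDRICH_THRESHOLDS[item]),
            "| modal categories", modal_categories(ANDRICH_THRESHOLDS[item]),
        )
```

With these inputs the sketch flags average-measure reversals only for items 3 and 5 (for item 5, the step between categories 2 and 3 is a reversal of 0.01 logit), and it shows that items 9 and 14 have ordered thresholds with every category modal somewhere on the grid. A flagged reversal is only a prompt for review: as discussed below, it may reflect sparse categories or clinically meaningful transitional categories rather than a poorly functioning scale.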
The ordering of the average measures was acceptable for all items and categories with the exception of disordering at the lowest end of the rating scale in items 3 and 5; however, fewer than the optimal 10 responses per rating category for item 3 may limit the accuracy of the average measures for these categories. Clinical examination of the rating scales for items 3 and 5 indicates that only children with moderate BD would likely achieve a score of 2 or lower. Children, aged 2 years or older, with TD or mild BD would likely score at least 3 on each of these items. Fewer than 10 responses were observed among the lowest rating categories for items 1, 2, 3, 11, and 12. The person-item variable map indicated that these were among the least difficult items; more than 90% of children earned the maximum score of 4 on each of these items.

Several items demonstrated disordering of Andrich thresholds and category probability curves. A disordered threshold may represent poor rating scale function, suggesting a need for expansion, collapse, or revision of rating scale categories. Disordering may also occur with transitional categories, which have been recently identified as important constructs in the development of rating scales.37 Transitional categories are hypothesized to represent critical windows of performance, with low frequency and/or duration of occurrence. For example, the scoring scale for item 2 (Standing to Sitting) demonstrates 2 potential clinically critical windows of performance with relatively rare occurrence in this sample of children. A child achieves a score of 0 if he or she requires assistance with this task, 1 if he or she is able to transition between standing and sitting but descent is uncontrolled, and 2 if he or she pushes his or her lower legs against the support surface to control descent. Scores of 1 and 2 may indicate critical performance windows that are each unique and clinically important. One would not expect or want to observe many children with uncontrolled descent during this movement transition; however, each category identifies children with unique movement strategies resulting in differing amounts of compromised safety.

Additional targeted data collection in younger and lower-performing children is recommended to further examine PBS rating scale function. This would enable examination of average measures disordering in the lower rating categories of items 3 and 5, as well as critical statistical and clinical analyses of scoring criteria for each item with disordered Andrich thresholds to distinguish disordering from the presence of critical transitional categories. Items 1, 2, 3, 11, and 12 bear special consideration, as these items had rating scale categories with fewer than 10 respondents.

Item Hierarchy

The item difficulty hierarchy was consistent with clinical and developmental expectations. Visual inspection of item hierarchy maps (Figure 1) demonstrates that items 8 (Standing with One Foot in Front), 9 (Standing on One Foot), and 14 (Reach Forward) are the most difficult items.
These tasks occur later in development than any of the other PBS items. The skills tested by the least difficult items (1, 2, 3, 11, and 12) develop by the age of 2 years in children with TD. The overall range of item difficulty may be targeted below the ability level of this sample population, which contained 685 children developing typically and 138 children with known BD, all aged 2 to 13 years. Of the children with known balance problems, only 65 were classified as having moderate BD (able to stand unsupported >4 seconds but required assistive technology for community mobility). Clinically, the least difficult items are intended to identify children with BD and to detect subsequent changes with intervention. Item targeting to the intended population may be affected by item content, item directions, or rating scale function as discussed in the previous sections. A much larger sample of children with mild and moderate BD is required to fully assess item and rating scale functioning of the easiest items. The small percentage of items targeted to children of higher ability may limit the measurement precision at the higher end of the ability continuum. This is consistent with
previous studies, indicating that the PBS best differentiates balance abilities in children aged 6 years and younger.4 The addition of more difficult items may enable the PBS to be used with children older than 6 years regardless of balance ability. Rating scale revisions to expand the scope of ability level in the more difficult items may also help improve measurement precision and extend the utility of the PBS in the school-age population.

CONCLUSION

Results of the Rasch analysis and previous studies of PBS performance in children with and without disabilities indicate that the most appropriate applications for the current version of the PBS are identification of children with BD and tracking changes in balance over time. The unidimensional nature of the PBS does not support the use of individual PBS items or item groups to identify specific aspects or dimensions of balance in need of remediation. Although the Rasch analysis provides evidence for modifications to selected PBS items and scoring criteria, results of previous studies indicate that the PBS in its present form can be used to identify mild or moderate BD in children younger than 6 years and to track progress with intervention.4,9 The PBS can also be used in school-age children with moderate BD to establish baseline performance and to identify changes in functional balance.3

The Rasch analysis confirms the presence of a ceiling effect and provides guidelines for modifications to the current version of the PBS. To expand the effective measurement range, the ceiling effect must be addressed. Strategic modifications may include the addition of more difficult items and/or modification of scoring criteria within existing items. To effectively implement any change to the PBS, substantial pilot testing across a wide demographic of children with and without known BD will be crucial. Rasch analysis in conjunction with clinical reasoning will continue to guide the refinement of the PBS.
REFERENCES

1. Bobath K, Bobath B. The facilitation of normal postural reactions and movements in the treatment of cerebral palsy. Physiotherapy. 1964;50(8):246-262.
2. Zafeiriou DI. Primitive reflexes and postural reactions in the neurodevelopmental examination. Pediatr Neurol. 2004;31(1):1-8.
3. Franjoine MR, Gunther J, Taylor MJ. Pediatric Balance Scale: a modified version of the Berg Balance Scale for the school-age child with mild to moderate motor impairment. Pediatr Phys Ther. 2003;15:114-128.
4. Franjoine MR, Darr N, Held S, Kott K, Young B. The performance of children developing typically on the pediatric balance scale. Pediatr Phys Ther. 2010;22:350-359.
5. Berg K, Wood-Dauphinee S, Williams JI, et al. Measuring balance in the elderly: preliminary development of an instrument. Phys Canada. 1989;41(6):304-311.
6. Franjoine MR, Darr N, Young B, et al. Comparison of the Pediatric Balance Scale with the Peabody Developmental Motor Scales, Second Edition. APTA Combined Sections Meeting; February 2010; San Diego, CA. Abstract in Pediatr Phys Ther. 2009;21(1):89-90.
7. Darr N, Franjoine MR, Young B, et al. Comparison of the pediatric balance scale with the Bruininks-Oseretsky test of motor proficiency, second edition. Pediatr Phys Ther. 2011;23(1):97.
8. Darr N, Franjoine MR, Young B, et al. Pediatric balance scale performance in children who are developing typically and in children with mild developmental delays [abstract]. Pediatr Phys Ther. 2009;21(1):89-90.
9. Franjoine M, Darr N, Young B. Pediatric balance scale responsiveness to functional balance changes in children with cerebral palsy. Platform presentation at: APTA Combined Sections Meeting; January 2013; San Diego, CA. Abstract in Pediatr Phys Ther. 2013;25(1):81.
10. Smith EV, Smith RM. Introduction to Rasch Measurement. Maple Grove, MN: JAM Press; 2004.
11. Smith EV, Smith RM. Rasch Measurement: Advanced and Specialized Applications. Maple Grove, MN: JAM Press; 2007.
12. Smith EV, Stone GE. Criterion-Reference Testing: Practice Analysis to Score Reporting using Rasch Measurement Models. Maple Grove, MN: JAM Press; 2009.
13. Wright BD. Solving measurement problems with the Rasch model. J Educ Measure. 1977;1:97-116.
14. Campbell SK, Kolobe TH, Osten ET, Lenke M, Girolami GL. Construct validity of the test of infant motor performance. Pediatr Phys Ther. 1995;75:585-596.
15. Liao PJ, Campbell SK. Examination of the item structure of the Alberta infant motor scale. Pediatr Phys Ther. 2004;16:31-38.
16. Russell D, Avery L, Rosenbaum P, Parminder R, Walter S, Palisano R. Improved scaling of the gross motor function measure for children with cerebral palsy: evidence for reliability and validity. Pediatr Phys Ther. 2000;80:373-385.
17. Kothari DH, Haley SM, Gill-Body KM, Dumas HM. Measuring functional change in children with acquired brain injury (ABI): comparison of generic and ABI-Specific Scales using the Pediatric Evaluation of Disability Inventory (PEDI). Phys Ther. 2003;83(9):776-785.
18. Adams RJ, Wilson M, Wang WC. The multidimensional random coefficients multinomial logit model. Appl Psychol Measure. 1997;21:1-23.
19. Andrich D. A rating formulation for ordered response categories. Psychometrika. 1978;43:561-573.
20. Wright BD, Masters GN. Rating Scale Analysis: Rasch Measurement. Chicago, IL: MESA Press; 1982.
21. Wu ML, Adams R, Wilson M. Conquest Acer (version 1.0). Melbourne, Australia: Australian Council for Educational Research; 1998.
22. Linacre JM. WINSTEPS: Rasch Measurement Computer Program. Chicago, IL: Winsteps.com; 2010.
23. Ackermann T. A didactic explanation of item bias, item impact, and item validity from a multidimensional IRT perspective. J Educ Measure. 1992;29:67-91.
24. Akaike H. Likelihood of a model and information criteria. J Econometr. 1981;16:3-14.
25. Kass RE, Raftery AE. Bayes factors. J Am Stat Assoc. 1995;90:773-795.
26. Schwarz G. Estimating the dimension of a model. Ann Stat. 1978;6(2):461-464.
27. Kuha J. AIC and BIC comparisons of assumptions and performance. Sociol Methods Res. 2004;33(2):188-229.
28. Linacre JM. Structure in Rasch residuals: why principal component analysis? Rasch Measur Trans. 1998;12:636.
29. Linacre JM. Detecting multidimensionality: which residual data-types works best? J Outcome Measure. 1998;2:266-283.
30. Smith EV. Detecting and evaluating the impact of multidimensionality using item fit statistics and principal component analysis of residuals. J Appl Measure. 2002;3:205-231.
31. Sireci SG, Thissen D, Wainer H. On the reliability of testlet-based tests. J Educ Measure. 1991;28:237-247.
32. Thissen D, Steinberg L, Mooney J. Trace lines for testlets: a use of multiple-categorical response models. J Educ Measure. 1989;26:247-260.
33. Yen YM. Effects of local item dependence on the fit and equating performance of the three-parameter logistic model. Appl Psychol Measure. 1984;8:125-145.
34. Yen YM. Scaling performance assessments: strategies for managing local item dependence. J Educ Measure. 1993;30:187-213.
35. Smith EV, Wakely M, de Kruif R, Swartz C. Optimizing ratings scales for self-efficacy (and other) research. Educ Psychol Measure. 2003;63:369-391.
36. Linacre JM. Optimizing rating scale category effectiveness. In: Smith EV, Smith RM, eds. Introduction to Rasch Measurement. Maple Grove, MN: JAM Press; 2004:258-280.
37. Linacre JM. Transitional categories and usefully disordered thresholds. Online Educ Res J. www.oerj.org/View?action=viewPDF&paper=2. Published 2010. Accessed January 15, 2014.
CLINICAL BOTTOM LINE Commentary on “Psychometric Properties of the Pediatric Balance Scale Using Rasch Analysis”
“How could I apply this information?” When choosing a tool for examination in practice, clinicians need to understand its validity or usefulness. The authors of this report further validate the Pediatric Balance Scale (PBS) by establishing its concurrent and discriminant validity, establishing a change score, and performing Rasch analysis. Concurrent validity with other standardized developmental assessment tools means that the clinician can trust the PBS to measure aspects of balance, especially stationary activities. Discriminant validity means that the PBS can be used to discriminate preschool-aged children with mild to moderate balance impairments from those with typical development. The Rasch analysis establishes the proper sequencing of the items and where each child “fits” along the same scale (like a ruler) once a total score is given. A change in score of 10 points or more indicates a change in balance capabilities beyond the error of measurement and might indicate a need for change in the child’s management.

“What should I be mindful about when applying this information?” Although the PBS demonstrates concurrent validity with subsets of the Peabody Developmental Motor Scales, 2nd Edition (PDMS-2) and the Bruininks Oseretsky Test of Motor Proficiency, 2nd Edition (BOT-2), the PBS measures different abilities. The clinician needs to determine which tool provides the information needed to make clinical decisions for a particular child. In regard to discrimination, the PBS is best for categorizing preschool-aged children (6 years or younger) with mild and moderate balance impairments, but its use for school-aged children is limited. The Rasch analysis shows that the PBS is limited in its ability to measure the separate dimensions of balance. The PBS is unidimensional; therefore, to test functional balance and to make a clinical decision, the clinician needs to use the total test score reflecting testing of all items. The change score of 10 points was established in children with cerebral palsy; children with other developmental disabilities need to be studied to determine whether this change score is appropriate.
Karen M. Kott, PT, PhD
Old Dominion University, Norfolk, Virginia
Sharon L. Held, PT, DPT, MS, PCS, C/NDT
Daemen College, Amherst, New York; SportsFocus Physical Therapy, Orchard Park, New York
The authors declare no conflicts of interest.
DOI: 10.1097/PEP.0000000000000194