An Applied Example Using Elbow Flexor Strength Data

Use of the Standard Error as a Reliability Index of Interest: An Applied Example Using Elbow Flexor Strength Data

The intraclass correlation coefficient (ICC) and the standard error of measurement (SEM) are two reliability coefficients that are reported frequently. The two measures are related; however, they define distinctly different properties. The magnitude of the ICC defines a measure's ability to discriminate among subjects, and the SEM quantifies error in the same units as the original measurement. Most of the statistical methodology addressing reliability presented in the physical therapy literature (eg, point and interval estimations, sample size calculations) focuses on the ICC. Using actual elbow flexor make and break strength measurements, this article illustrates a method for estimating a confidence interval for the SEM, shows how an a priori specification of confidence interval width can be used to estimate sample size, and provides several approaches for comparing error variances (and the square root of the error variance, the SEM). [Stratford PW, Goldsmith CH. Use of the standard error as a reliability index of interest: an applied example using elbow flexor strength data. Phys Ther. 1997;77:745-750.]

Key Words: Measurement error, Reliability, Variance components.

Paul W Stratford, Charlie H Goldsmith
Physical Therapy . Volume 77 . Number 7 . July 1997

A review of the physical examination literature reveals a plethora of competing tests for most components of the assessment. For example, numerous devices, methods, and protocols exist for assessing range of motion, muscle strength, and joint stability. The goal of many clinical measurement studies has been to aid in selecting the best measure for a specific purpose. One type of measurement study reported frequently is the reliability study, and numerous design choices are available to investigators who are interested in conducting such studies. Several designs reported often in the physical therapy literature include the simple replication design and the interrater design (equivalent to the intertrial or interoccasion designs),* where raters, trials, or occasions have been considered to be either fixed or random.

A simple replication study exists when multiple measurements are taken on a sample of subjects and there is no structure linking the replicate measures to each other. Different raters assess different subjects, and there is no need for the number of replicate measurements to be the same for each subject. Presuming that raters are randomly allocated to subjects, within an analysis-of-variance (ANOVA) context, this represents a one-way random effects model.

A potential limitation of this design is that all components of error are included in a single error term. An equally popular and slightly more complex design is the interrater reliability study, where all raters evaluate all subjects (ie, raters are crossed with subjects). Data from this design can be analyzed using a two-way ANOVA. This approach allows partitioning of the variance into patients, raters, and error. With this design, the investigator must determine whether inferences are to be unique to only those raters taking part in the study or whether the raters are intended to represent a sample from a larger population of raters. In the first case, the rater effect is considered to be fixed, and in the second case, the rater effect is random.

Results from reliability studies, where the measure is quantified on an interval or ratio scale, are usually expressed by reporting the intraclass correlation coefficient (ICC) or the standard error of measurement (SEM).[1-5,8,9] The ICC provides information about a measure's ability to differentiate among patients, and it is obtained by dividing the true variance by the total variance. When patients are the object of measurement, the true variance represents the variance among patients. The SEM expresses measurement error in the same units as the original measurement, and it is not influenced by variability among patients.[10,11]

* The repeated-measures factor could also represent trials or occasions; however, we will use the word "raters" in the text.

Table 1. Analysis-of-Variance Results for Knee Extension Range of Motion(a)

Source           df   SS        MS      Variance Estimate
Between people   21   1264.91   60.23   24.91
Within people    22    229.00   10.41   10.41
Total            43   1493.91           35.32

(a) Adapted from the work of Hayes et al[8] (Tab. 6, Examiner 1).
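As a rough check on Table 1, the variance estimates can be recovered from the mean squares. The sketch below assumes a one-way random effects layout with k = 2 measurements per person; that value is inferred here from the degrees of freedom (22 people, 44 observations) and is not stated in the text. The function name is illustrative only.

```python
def variance_components(ms_between, ms_within, k):
    """Estimate between-person and error variance from one-way ANOVA mean squares.

    Assumes a one-way random effects model with k measurements per person.
    """
    error_var = ms_within
    between_var = (ms_between - ms_within) / k
    return between_var, error_var


# Mean squares from Table 1 (knee extension); k = 2 is an assumption.
between_var, error_var = variance_components(60.23, 10.41, k=2)

# ICC = true (between-person) variance divided by total variance.
icc = between_var / (between_var + error_var)

print(round(between_var, 2), round(error_var, 2), round(icc, 2))
```

The between-person estimate and the ICC agree with the 24.91 and .71 reported for knee extension.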

PW Stratford, PT, is Assistant Professor, School of Rehabilitation Science, and Associate Member, Department of Clinical Epidemiology and Biostatistics, McMaster University, Hamilton, Ontario, Canada. Address all correspondence to Mr Stratford at Faculty of Health Sciences, School of Rehabilitation Science, McMaster University, OT/PT Bldg T-16, 1280 Main St W, Hamilton, Ontario, Canada L8S 4K1 ([email protected]).

CH Goldsmith, PhD, is Professor of Clinical Epidemiology and Biostatistics, McMaster University, and Honorary Professor of Physical Therapy, Department of Physical Therapy, University of Western Ontario, London, Ontario, Canada.

Relationship Between the SEM and the ICC

The relationship between the SEM and ICC is defined by the following expression:

$$\mathrm{SEM} = \sqrt{\sigma_T^2 (1 - \rho)}$$

where $\sigma_T^2$ represents the total variance and $\rho$ is the ICC. Using a bit of algebra, it can be shown that:

$$\mathrm{SEM} = \sqrt{\sigma_E^2}$$

where $\sigma_E^2$ is the error variance and equals the mean square error term from an ANOVA. Although the SEM and ICC are related, they do not convey the same information.[12] This point is highlighted in the results presented by Hayes and colleagues,[8] who studied the test-retest reliability of measurements of knee joint range of motion in patients with osteoarthritis. Tables 1 and 2 present the ANOVA results for extension and flexion measurements, respectively. The ICCs for extension and flexion are .71 and .96, respectively, and the SEMs for extension and flexion are 3.23 and 5.03 degrees, respectively. Notice that the ICC shows the flexion measurements to be more reliable, whereas the SEM indicates that the extension measurements are more reliable. The interpretation of this paradox is that, within the context of Hayes and colleagues' study design, knee flexion measurements are more useful for discriminating among patients, whereas the magnitude of error (expressed in degrees) is less for knee extension measurements.

A second important point is that both the ICC and SEM represent point estimates of the population values for these variables. Although considerable attention has been given to reporting ICCs (eg, stating the type and confidence interval are usual requirements),[13,14] the same thoughtfulness has not been afforded to reporting SEMs. The goals of this article are to illustrate a process for obtaining an interval estimate of the SEM, to show how this process can be used to estimate sample size, and to illustrate how error estimates can be compared statistically. To illustrate these analyses, we use actual strength assessment data.

Table 2. Analysis-of-Variance Results for Knee Flexion Range of Motion(a)

Source           df   SS         MS        Variance Estimate
Between people   21   28219.73   1343.80   659.11
Within people    22     563.00     25.59    25.29
Total            43   28782.73             684.40

(a) Adapted from the work of Hayes et al[8] (Tab. 7, Examiner 1).
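The contrast between the two indexes can be reproduced from the variance estimates in Tables 1 and 2. Because the ICC is the between-person variance divided by the total variance, the total variance times (1 - ICC) is simply the error variance, so both SEM expressions should agree. The sketch below is a minimal illustration; the function and dictionary names are hypothetical.

```python
import math


def icc_and_sem(between_var, error_var):
    """Return the ICC and the SEM computed two equivalent ways."""
    total_var = between_var + error_var
    icc = between_var / total_var
    sem_from_icc = math.sqrt(total_var * (1 - icc))  # SEM via total variance and ICC
    sem_from_error = math.sqrt(error_var)            # SEM via error variance
    return icc, sem_from_icc, sem_from_error


# (between-person variance, error variance) taken from Tables 1 and 2
measures = {"extension": (24.91, 10.41), "flexion": (659.11, 25.29)}
results = {name: icc_and_sem(*v) for name, v in measures.items()}

for name, (icc, sem_a, sem_b) in results.items():
    print(f"{name}: ICC={icc:.2f}, SEM={sem_a:.2f} (via ICC), SEM={sem_b:.2f} (via error variance)")
```

Flexion has the higher ICC and also the larger SEM, which is the paradox described in the text.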

Example Study

The data represent make and break tests performed on the right elbow flexors of 30 female volunteers. Two make tests and two break tests were performed on all subjects at the same test session by the same examiner. The order of testing methods was balanced across subjects. The data are shown in Table 3, and summary ANOVA tables for the tests are presented in Tables 4 and 5. The error variances are equal to the mean square error (MSE) terms, and the SEMs are the square roots of these terms. Accordingly, the error variance is 0.76 (SEM=0.87 kgf) for the make test and 1.51 (SEM=1.23 kgf) for the break test. The difference scores for each method, shown in Table 3 and reproduced in Table 6, were obtained by subtracting the second trial score from the first trial score (ie, trial 1 - trial 2). The correlation between the difference scores for the make and break tests is 0.21, suggesting a modest degree of dependency.
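With only two trials per subject, the error mean square of the subjects-by-trials ANOVA equals half the sample variance of the trial 1 - trial 2 difference scores. The sketch below applies that identity to the make test differences as transcribed from the published difference scores. The function name is illustrative, and the shortcut holds only for the two-trial case.

```python
import math
from statistics import variance

# Make test difference scores (trial 1 - trial 2) for the 30 subjects
make_diff = [2.9, -0.1, 0.3, -0.8, 0.9, -2.6, 0.8, -0.3, -1.1, 0.4,
             0.2, 1.5, 0.7, -1.0, 1.2, -0.9, 1.4, 1.9, 2.3, 0.2,
             0.3, 1.1, -1.0, 1.2, -0.1, -0.6, -0.2, -0.9, -0.1, -2.1]


def sem_from_two_trials(diffs):
    """Error variance (MSE) and SEM when each subject is measured exactly twice."""
    mse = variance(diffs) / 2  # sample variance of the differences, halved
    return mse, math.sqrt(mse)


mse, sem = sem_from_two_trials(make_diff)
print(round(mse, 2), round(sem, 2))
```

These values agree with the error variance of 0.76 and the SEM of 0.87 kgf reported for the make test.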

Table 3. Summary Data for the Make and Break Tests (in Kilograms of Force)

              Make Test                        Break Test
Subject No.   Trial 1   Trial 2   Difference   Trial 1   Trial 2   Difference
1             19.1      16.2       2.9         25.4      22.3       3.1
2             18.2      18.3      -0.1         17.7      20.3      -2.6
3             19.5      19.2       0.3         19.6      18.8       0.8
4             23.7      24.5      -0.8         29.2      23.7       5.5
5             27.1      26.2       0.9         27.2      26.3       0.9
6             20.1      22.7      -2.6         23.0      23.9      -0.9
7             21.1      20.3       0.8         20.9      19.2       1.7
8             25.4      25.7      -0.3         26.2      27.6      -1.4
9             17.8      18.9      -1.1         21.5      21.9      -0.4
10            17.8      17.4       0.4         20.4      20.4       0.0
11            22.8      22.6       0.2         22.1      21.2      -1.1
12            16.7      15.2       1.5         19.5      18.3       1.2
13            20.3      19.6       0.7         20.1      21.2      -1.1
14            15.3      16.3      -1.0         19.6      16.6       3.0
15            23.0      21.8       1.2         23.5      19.3       4.2
16            14.8      15.7      -0.9         18.9      18.2       0.7
17            17.6      16.2       1.4         18.4      18.6      -0.2
18            19.1      17.2       1.9         19.2      20.3      -1.1
19            31.7      29.4       2.3         31.8      31.4       0.4
20            16.5      16.3       0.2         18.9      18.9       0.0
21            19.7      19.4       0.3         21.5      20.5       1.0
22            23.8      22.7       1.1         22.7      21.2       1.5
23            22.8      23.8      -1.0         23.1      22.8       0.3
24            19.3      18.1       1.2         21.4      20.5       0.9
25            15.1      15.2      -0.1         17.7      19.2      -1.5
26            16.2      16.8      -0.6         19.8      18.8       1.0
27            17.7      17.9      -0.2         17.5      16.9       0.6
28            16.7      17.6      -0.9         19.7      19.1       0.6
29            16.2      16.3      -0.1         15.8      15.0       0.8
30            16.4      18.5      -2.1         19.2      20.8      -1.6
Mean          19.72     19.53      0.18        21.38     20.83      0.56
SD             3.91      3.68      1.23         3.61      3.35      1.74

Table 4. Analysis-of-Variance Results for Make Test

Source             df   SS       MS      Variance Estimate
Between subjects   29   814.31   28.08   13.66
Trials              1     0.50    0.50    0.00
Within subjects    29    22.06    0.76    0.76
Total              59   836.87           14.42

Estimating a Confidence Interval for the Standard Error of Measurement

Variances of independent samples drawn randomly from a population with a normal distribution have a chi-square distribution.[15] Specifically, the relationship is defined as

$$\frac{(n-1)s^2}{\sigma^2} \sim \chi^2_{(n-1)}$$

where $\chi^2_{(n-1)}$ is the chi-square distribution on $n-1$ degrees of freedom, $s^2$ is the sample variance, $\sigma^2$ is the population variance, and the symbol $\sim$ represents "is distributed as." Using the approach outlined by Armitage and Berry,[16(pp221-223)] an illustration of a confidence interval calculation based on the sample error variance for the make test, 0.76 from Table 4, is shown. Specifically, a two-sided 95% confidence interval for the error variance associated with the make test measurements is defined as

$$\frac{SSE}{\chi^2_{\alpha/2,\,dfe}} \le \sigma^2 \le \frac{SSE}{\chi^2_{1-\alpha/2,\,dfe}}$$

$$\frac{22.06}{45.72} \le \sigma^2 \le \frac{22.06}{16.05}$$

$$[0.48;\ 1.37]$$

where SSE is the sum of squares error (22.06) in the ANOVA table, $\chi^2_{\alpha,\,dfe}$ represents the chi-square value for the upper-tail probability level $\alpha$, and dfe is the degrees of freedom associated with SSE (ie, 29). Taking the square root of the confidence interval values for $\sigma^2$ yields a 95% confidence interval on the SEM of 0.69 to 1.17 kgf.

Table 5. Analysis-of-Variance Results for Break Test

Source             df   SS       MS      Variance Estimate
Between subjects   29   659.22   22.73   10.61
Trials              1     4.65    4.65    0.10
Error              29    43.80    1.51    1.51
Total              59   707.67           12.12

Table 6. Difference Scores for Make and Break Tests

Subject No.   Make Test Difference   Break Test Difference   Sum    Difference
1              2.9                    3.1                     6.0   -0.2
2             -0.1                   -2.6                    -2.7    2.5
3              0.3                    0.8                     1.1   -0.5
4             -0.8                    5.5                     4.7   -6.3
5              0.9                    0.9                     1.8    0.0
6             -2.6                   -0.9                    -3.5   -1.7
7              0.8                    1.7                     2.5   -0.9
8             -0.3                   -1.4                    -1.7    1.1
9             -1.1                   -0.4                    -1.5   -0.7
10             0.4                    0.0                     0.4    0.4
11             0.2                   -1.1                    -0.9    1.3
12             1.5                    1.2                     2.7    0.3
13             0.7                   -1.1                    -0.4    1.8
14            -1.0                    3.0                     2.0   -4.0
15             1.2                    4.2                     5.4   -3.0
16            -0.9                    0.7                    -0.2   -1.6
17             1.4                   -0.2                     1.2    1.6
18             1.9                   -1.1                     0.8    3.0
19             2.3                    0.4                     2.7    1.9
20             0.2                    0.0                     0.2    0.2
21             0.3                    1.0                     1.3   -0.7
22             1.1                    1.5                     2.6   -0.4
23            -1.0                    0.3                    -0.7   -1.3
24             1.2                    0.9                     2.1    0.3
25            -0.1                   -1.5                    -1.6    1.4
26            -0.6                    1.0                     0.4   -1.6
27            -0.2                    0.6                     0.4   -0.8
28            -0.9                    0.6                    -0.3   -1.5
29            -0.1                    0.8                     0.7   -0.9
30            -2.1                   -1.6                    -3.7   -0.5

Sample Size Estimation

This section illustrates a method for determining sample size when an investigator is interested in designing a study to estimate the magnitude of the SEM. Consider the following scenario. A researcher is interested in conducting a study to determine the reliability of make tests for the elbow flexors in patients with osteoarthritis. Based on a pilot study on subjects with no known history of osteoarthritis, the researcher expects the subsequent investigation using an enhanced protocol on patients with osteoarthritis to yield a SEM of 0.71 kgf; however, the researcher believes that a SEM as high as 0.95 kgf would be acceptable. Rephrasing this last sentence yields the following hypotheses:

Null hypothesis: The measurement protocol of interest in patients with osteoarthritis is reliable (SEM ≤ 0.95 kgf).

Alternate hypothesis: The measurement protocol of interest in patients with osteoarthritis is not reliable (SEM > 0.95 kgf).

Moreover, the researcher wants to evaluate these hypotheses at a one-sided $100(1-\alpha)\%$ confidence level of 95% when the sample SEM is 0.71. The first step is to square the SEM values to produce variances. Thus, the researcher expects to find an error variance of 0.50 (ie, $0.71^2$) with an upper acceptable variance limit of 0.90 (ie, $0.95^2$). Once again, we make use of the relationship

$$\frac{(n-1)s^2}{\sigma^2} \sim \chi^2_{(n-1)}$$

where $s^2$ is the expected study variance and $\sigma^2$ is the maximal acceptable value for the population variance (ie, the upper 95% confidence limit is set to this value). Using the approach outlined by Armitage and Berry,[16] $n-1$ is replaced by the degrees of freedom associated with the error variance (dfe) and this term is isolated:

$$dfe = \frac{\chi^2_{1-\alpha,\,dfe}\,\sigma^2}{s^2}$$
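Both the confidence interval for the SEM and the isolated expression for dfe rest on chi-square quantiles, so the two calculations can be sketched together. The code below is a minimal standard-library illustration, not the authors' procedure. `chi2_quantile(p, df)` returns the value with lower-tail probability p, so a value exceeded with probability 0.95 is requested as `chi2_quantile(0.05, df)`. The stopping rule, which picks the integer dfe at which the two sides of the expression agree most closely, and the conversion from dfe to patients, which assumes dfe = (patients - 1)(measurements - 1), are assumptions made here. All function names are hypothetical.

```python
import math


def chi2_cdf(x, df):
    """Lower-tail chi-square probability via the incomplete gamma series."""
    if x <= 0:
        return 0.0
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 0
    while abs(term) > 1e-15 * abs(total):
        n += 1
        term *= z / (a + n)
        total += term
    return total * math.exp(a * math.log(z) - z - math.lgamma(a))


def chi2_quantile(p, df):
    """Chi-square value with lower-tail probability p, found by bisection."""
    lo, hi = 0.0, df + 20.0 * math.sqrt(2.0 * df) + 20.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if chi2_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def sem_confidence_interval(sse, dfe, confidence=0.95):
    """Two-sided confidence interval for the SEM from SSE and its df."""
    alpha = 1.0 - confidence
    var_lo = sse / chi2_quantile(1.0 - alpha / 2.0, dfe)
    var_hi = sse / chi2_quantile(alpha / 2.0, dfe)
    return math.sqrt(var_lo), math.sqrt(var_hi)


def dfe_for_upper_limit(ratio, confidence=0.95, max_dfe=1000):
    """Integer dfe where chi-square * ratio agrees most closely with dfe."""
    alpha = 1.0 - confidence
    prev_dfe, prev_gap = None, None
    for dfe in range(1, max_dfe + 1):
        gap = chi2_quantile(alpha, dfe) * ratio - dfe
        if gap >= 0:
            # The crossing lies between prev_dfe and dfe; keep the closer one.
            if prev_gap is not None and abs(prev_gap) < gap:
                return prev_dfe
            return dfe
        prev_dfe, prev_gap = dfe, gap
    raise ValueError("no solution found below max_dfe")


def patients_required(dfe, measurements_per_patient):
    """Assumes dfe = (patients - 1) * (measurements_per_patient - 1)."""
    return math.ceil(dfe / (measurements_per_patient - 1)) + 1


lo, hi = sem_confidence_interval(sse=22.06, dfe=29)
print(f"95% CI for the SEM: {lo:.2f} to {hi:.2f} kgf")

dfe = dfe_for_upper_limit(0.90 / 0.50)
print(dfe, patients_required(dfe, 2), patients_required(dfe, 3))
```

A stricter rule would require the upper limit to fall at or below the target, that is, the first dfe for which dfe divided by the chi-square value does not exceed the ratio. That rule can select a dfe one larger than the closest-agreement rule used here.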

$$dfe = \frac{\chi^2_{1-\alpha,\,dfe} \times 0.90}{0.50}$$

Using an iterative process, it is found that a chi-square value of 11.59 on 21 degrees of freedom produces the desired one-sided confidence level. Thus, 22 patients are required, provided two measurements are performed on each patient. Often, an investigator may want to obtain more than two measurements on each patient, and this will influence the number of patients required for the study. Table 7 supplies sample size estimates for various combinations of $\sigma^2/s^2$ ratios and number of measurements per patient. For the example mentioned previously, $\sigma^2/s^2$ is equal to 1.8 (0.90/0.50). By referring to Table 7, for a $\sigma^2/s^2$ ratio of 1.8, it is evident that 22 patients are required when two measurements are performed on each patient, 12 patients are needed when three measurements are performed on each patient, and so on.

Table 7. Approximate Sample Sizes for a One-sided Upper 95% Confidence Limit for Population Variance

Ratio of Population Variance to Sample Variance   No. of Measurements per Subject
1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0            (table entries not recovered)

Comparing Error Estimates

Researchers and clinicians are constantly in search of strategies, techniques, and protocols that minimize measurement error. In pursuing this goal, it is necessary to compare the methods or measures of interest by estimating the error associated with the competing measures. Estimates of measurement error for the competing measures can be obtained on different samples or on the same sample. In the case where one group of subjects is used to estimate the error associated with method A and a second group of subjects is used to estimate the error for method B, the measurements are usually considered to be independent. When both methods are evaluated on the same group of subjects, however, the measurements are likely to be dependent. The intent of the following section is to show several statistical approaches for comparing error variances, and ultimately their square roots, the SEMs. The data for the make and break tests will be used to demonstrate the analyses.

Independent Samples: Variance Ratio Approach

Variance estimates from two independent samples can be compared using the ratio of the variance estimates.[16(pp115-117)] Variance ratios are distributed as an F distribution. Specifically, the larger variance is divided by the smaller variance, and this value is converted to a probability by referring to the summary table of F values for the appropriate degrees of freedom.

$$F = \frac{1.51}{0.76} = 1.99, \quad df = 29, 29, \quad P = .035$$

This analysis suggests that the break test variance is significantly greater than the make test variance.
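The variance ratio calculation can be sketched without an F table by integrating the beta density numerically. The code below is a minimal standard-library illustration using the error variances and degrees of freedom from Tables 4 and 5. The function names are hypothetical, and the numerical integration assumes the denominator degrees of freedom exceed 2.

```python
import math


def f_upper_tail(f, df1, df2, steps=2000):
    """P(F > f) for an F distribution on (df1, df2) degrees of freedom.

    Uses P(F > f) = I_x(df2/2, df1/2) with x = df2 / (df2 + df1 * f),
    integrating the beta density with Simpson's rule. Assumes f > 0 and df2 > 2.
    """
    a, b = df2 / 2.0, df1 / 2.0
    x = df2 / (df2 + df1 * f)
    log_beta = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

    def density(t):
        if t <= 0.0:
            return 0.0  # valid because a > 1 when df2 > 2
        return math.exp((a - 1.0) * math.log(t)
                        + (b - 1.0) * math.log(1.0 - t) - log_beta)

    h = x / steps
    total = density(0.0) + density(x)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * density(i * h)
    return total * h / 3.0


def variance_ratio_test(var_a, df_a, var_b, df_b):
    """Divide the larger variance by the smaller; return F and its upper-tail P."""
    if var_a >= var_b:
        f, df1, df2 = var_a / var_b, df_a, df_b
    else:
        f, df1, df2 = var_b / var_a, df_b, df_a
    return f, f_upper_tail(f, df1, df2)


# Break test MSE = 1.51 and make test MSE = 0.76, each on 29 degrees of freedom
f_ratio, p_value = variance_ratio_test(1.51, 29, 0.76, 29)
print(f"F = {f_ratio:.2f}, one-sided P = {p_value:.3f}")
```

This test treats the two variance estimates as independent, which is the assumption the dependent-samples approaches in the article are meant to relax.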

Dependent Samples: Paired Approach

This approach is discussed by Armitage and Berry[16] and Snedecor and Cochran.[17] It is appropriate when pairs of measurements are being compared. To use this technique to compare within-subject error variances, we have added an intermediate step. This step requires calculation of the difference between trials (ie, trial 1 - trial 2) for each of the assessment methods. Thus, we are actually testing whether the variance of the difference scores for the two assessment methods (ie, make test difference and break test difference) differs, rather than whether the error variance associated with a single measurement differs. Given that the variance of a difference score is equal to the variance of a single measure times two, the statistical test provides valid information about the extent to which the error variances, and ultimately the SEMs of the make and break tests, differ. Table 6 provides the information necessary to formally evaluate the variances. The first step involves calculating the sum (make test + break test) and difference (make test - break test) for the difference scores shown in Table 6. The variances are compared by calculating the correlation between the sum and difference scores and testing whether the magnitude of the correlation coefficient differs from zero. In our example, the correlation between the sum and difference scores is 0.34 (F=3.58; df=1,28; P=.069), suggesting that there is a marginal difference between the variances of the make and break tests.

Dependent Samples: Bootstrap Approach

The paired method mentioned above is restricted to two observations (trials or judges) per subject for each measure being assessed. An alternate approach that avoids this shortcoming is the bootstrap method. Although reviewing this technique in detail is beyond the scope of this article, a brief explanation of its essential features will be offered. Specifically, the bootstrap method uses the sample data to estimate the distribution of the parent population.[17-19] This estimation is accomplished by drawing a relatively large number of random samples with replacement.[17-19] The random draws are used to construct the estimated population distribution, and the confidence interval of interest can be determined. This approach was used to estimate the distribution of the difference in make and break test error variances. Specifically, 1,000 bootstrap samples, each with a sample size of 29 drawn with replacement, were obtained. This was accomplished by writing a macro program using MINITAB statistical software.† The one-tailed probability associated with the observed error variance difference of 0.75 (ie, 1.51 - 0.76 from Tabs. 4 and 5) was P=.053. Again, the magnitude of this value suggests that the error variance for the break test may be marginally greater than for the make test.
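Both dependent-samples approaches can be sketched from the difference scores in Table 6. The code below is an illustration under stated assumptions, not a reconstruction of the authors' MINITAB macro. Subjects are resampled as make/break pairs, the resample size defaults to the number of subjects, the seed is arbitrary, and the one-tailed probability is taken as the share of bootstrap differences at or below zero. Because the values are transcribed from the published table, the statistics may differ slightly from those reported in the text.

```python
import math
import random
from statistics import mean, variance

# Difference scores (trial 1 - trial 2) transcribed from Table 6
make_diff = [2.9, -0.1, 0.3, -0.8, 0.9, -2.6, 0.8, -0.3, -1.1, 0.4,
             0.2, 1.5, 0.7, -1.0, 1.2, -0.9, 1.4, 1.9, 2.3, 0.2,
             0.3, 1.1, -1.0, 1.2, -0.1, -0.6, -0.2, -0.9, -0.1, -2.1]
break_diff = [3.1, -2.6, 0.8, 5.5, 0.9, -0.9, 1.7, -1.4, -0.4, 0.0,
              -1.1, 1.2, -1.1, 3.0, 4.2, 0.7, -0.2, -1.1, 0.4, 0.0,
              1.0, 1.5, 0.3, 0.9, -1.5, 1.0, 0.6, 0.6, 0.8, -1.6]


def pearson_r(x, y):
    """Product-moment correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Paired approach: correlate the sum and difference of the two difference scores.
sums = [m + b for m, b in zip(make_diff, break_diff)]
diffs = [m - b for m, b in zip(make_diff, break_diff)]
r = pearson_r(sums, diffs)
n = len(sums)
f_stat = r * r * (n - 2) / (1 - r * r)  # F on 1 and n - 2 degrees of freedom
print(f"|r| = {abs(r):.2f}, F = {f_stat:.2f} on 1 and {n - 2} df")


# Bootstrap approach: resample subjects and recompute the error variance gap.
def error_variance_difference(pairs):
    """Break minus make error variance; each is half the variance of its differences."""
    make = [p[0] for p in pairs]
    brk = [p[1] for p in pairs]
    return variance(brk) / 2 - variance(make) / 2


pairs = list(zip(make_diff, break_diff))
observed = error_variance_difference(pairs)
rng = random.Random(1997)  # arbitrary seed, for repeatability only
boot = [error_variance_difference(rng.choices(pairs, k=len(pairs)))
        for _ in range(1000)]
p_one_tailed = sum(d <= 0 for d in boot) / len(boot)
print(f"observed difference = {observed:.2f}, one-tailed P = {p_one_tailed:.3f}")
```

Resampling whole subjects keeps each make difference paired with its break difference, so the dependency between the two methods is carried into every bootstrap sample.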

Summary

This article shows that although the ICC and SEM are related, they convey different information about the reliability of a measure. Specifically, the ICC is influenced by multiple sources of variation (eg, subjects, raters, trials) and by error, whereas the SEM is affected by error variation only. A method for estimating a confidence interval for the SEM is illustrated, and how an a priori specification of confidence interval width can be used to estimate sample size is discussed. Finally, several approaches used to compare error variances are discussed.

References

1 Bohannon RW. Make tests and break tests of elbow flexor muscle strength. Phys Ther. 1988;68:193-194.
2 Malouin F, Boiteau M, Bonneau C, et al. Use of a hand-held dynamometer for the evaluation of spasticity in a clinical setting: a reliability study. Physiotherapy Canada. 1989;41:126-134.
3 Helewa A, Goldsmith CH, Smythe A. The modified sphygmomanometer-an instrument to measure muscle strength: a validation study. J Chronic Dis. 1981;34:353-361.
4 Stratford PW, Miseferi D, Ogilvie R, et al. Assessing the responsiveness of five KT1000 knee arthrometer measures used to evaluate anterior laxity at the knee joint. Clin J Sports Med. 1991;1:225-228.
5 LaStayo PC, Wheeler DL. Reliability of passive wrist flexion and extension goniometric measurements: a multicenter study. Phys Ther. 1994;74:162-176.
6 Fleiss JL. The Design and Analysis of Clinical Experiments. New York, NY: John Wiley & Sons Inc; 1986:1-32, 291-300.
7 Shrout PE, Fleiss JL. Intraclass correlations: uses in assessing rater reliability. Psychol Bull. 1979;86:420-428.
8 Hayes KW, Peterson C, Falconer J. An examination of Cyriax's passive motion tests with patients having osteoarthritis of the knee. Phys Ther. 1994;74:697-709.
9 Eliasziw M, Young SL, Woodbury MG, Fryday-Field K. Statistical methodology for the concurrent assessment of interrater and intrarater reliability: using goniometric measurements as an example. Phys Ther. 1994;74:777-788.
10 Streiner DL, Norman GR. Health Measurement Scales: A Practical Guide to Their Development and Use. New York, NY: Oxford University Press; 1989:79-89.
11 Domholdt E. Physical Therapy Research: Principles and Applications. Philadelphia, Pa: WB Saunders Co; 1993:153-157.
12 Stratford PW. Reliability: consistency or differentiating among subjects? Phys Ther. 1989;69:299-300. Letter to the editor.
13 Krebs DE. Declare your ICC type. Phys Ther. 1986;66:1431. Letter to the editor.
14 Stratford PW. Confidence limits for your ICC. Phys Ther. 1989;69:237-238. Letter to the editor.
15 Kleinbaum DG, Kupper LL, Muller KE. Applied Regression Analysis and Other Multivariable Methods. 2nd ed. Boston, Mass: PWS-Kent Publishing Co; 1988:23.
16 Armitage P, Berry G. Statistical Methods in Medical Research. 3rd ed. Boston, Mass: Blackwell Scientific Publications; 1994:115-117, 221-223.
17 Snedecor GW, Cochran WG. Statistical Methods. 6th ed. Ames, Iowa: Iowa State University Press; 1967:195-197.
18 Stine RA. An introduction to bootstrap methods: examples and ideas. Sociological Methods and Research. 1989;18:243-291.
19 Efron B, Gong G. A leisurely look at the bootstrap, the jackknife, and cross-validation. American Statistician. 1983;37:36-48.

† MINITAB Inc, 3081 Enterprise Dr, State College, PA 16801-3008.
