
Journal of Gerontology: MEDICAL SCIENCES 1997, Vol. 52A, No. 6, M363-M368

Copyright 1997 by The Gerontological Society of America

Reproducibility of Performance-Based and Self-Reported Measures of Functional Status

Nancy Hoeymans,1,2 Emmy RCM Wouters,1,3 Edith JM Feskens,1 Geertrudis AM van den Bos,2 and Daan Kromhout1

1Department of Chronic Disease and Environmental Epidemiology, National Institute of Public Health and the Environment, Bilthoven, The Netherlands.
2Institute of Social Medicine, Academic Medical Center, University of Amsterdam, Amsterdam, The Netherlands.
3Department of Epidemiology and Public Health, Wageningen Agricultural University, Wageningen, The Netherlands.

Background. The reproducibility of a performance-based and a self-reported measure of functional status was investigated, as well as the impact of age and cognitive function on the reproducibility.

Methods. Of a random sample of 114 men from the 1995 survey of the Zutphen Elderly Study, 105 men (aged 79.9 ± 4.5 years) participated in a test-retest study. They filled out a questionnaire on disabilities and carried out performance tests twice, 2 weeks apart. Four performance tests were administered (standing balance, walking speed, chair stand, and external shoulder rotation), and a summary performance score was constructed. The number of self-reported disabilities in basic activities of daily living, mobility, and instrumental activities of daily living was assessed. Kappa statistics and Pearson correlation coefficients between test and retest measurements were computed for the total group and stratified by age and cognitive function.

Results. Three performance tests and the summary performance score had fair to good reproducibility (walking speed: Pearson's r = .90; chair stand: r = .82; shoulder rotation: kappa = .49; summary score: kappa = .52). Only the test for standing balance was poorly reproducible (kappa = .29). Self-reported functional status had fair to good reproducibility (kappa = .63, r = .87). Self-reported functional status was significantly less reproducible in very old and cognitively impaired individuals than in younger and nonimpaired individuals.

Conclusions. In these elderly male subjects, performance tests and self-reported disabilities had moderate to good reproducibility, with the exception of the test for standing balance. In very old or cognitively impaired populations, self-reported functional status may have a lower reproducibility.

FUNCTIONAL status is usually assessed by self-reported measures and by physical performance tests. Both measures give valid and important information about physical functioning, but they measure different concepts of functional status (1-5). This can be clarified by the disability process as described by Verbrugge (6), in which pathology leads through impairments and functional limitations to disabilities. Performance tests might identify functional limitations that cannot be detected by self-reported measures. A performance test measures actual performance of standardized tasks required to accomplish many daily routine activities and is evaluated in an objective, uniform manner using predetermined criteria (7,8). Disadvantages might be that these tests require special equipment, are time- and money-consuming, and only simulate an activity without reflecting the adaptations people make in daily life (9,10). An important advantage, however, is that these measures are objective, in the sense that they are free of biases that can be introduced by persons making implicit judgments (9,11). The accuracy of people's judgment about their own functioning could be affected by the presence of cognitive impairment, depression, or affective responses to acute illnesses (9,12). Performance tests might offer the potential for greater reproducibility than self-reported measures, particularly in very old or cognitively impaired individuals, although empirical evidence for this is still limited (7,13).

This study was designed to assess the 2-week reproducibility of two measures of functional status: performance tests and self-reported disabilities in daily routine activities. We also investigated whether the test-retest reliability of these measures depended on age or cognitive functioning.

METHODS

Study Population and Design
The Zutphen Elderly Study (14) is a longitudinal investigation of elderly men born between 1900 and 1920 and living in Zutphen, a small industrial town in the eastern part of The Netherlands. It was begun in 1985, and its purpose is to investigate the prevalence and etiology of chronic diseases, functional status, and well-being of these subjects. All survivors of the Zutphen Elderly Study were recruited 10 years later, in 1995. From the target population of 462 men, 343 (74%) participated. A questionnaire including sociodemographic items and questions on disabilities in daily routine activities was mailed to the partici-

Downloaded from http://biomedgerontology.oxfordjournals.org/ by guest on December 25, 2015


HOEYMANS ET AL.

Measurements

Performance measures of functional status. Performance-based functional status was evaluated by four performance tests: standing balance, walking speed, chair stand, and external shoulder rotation. These tests were adapted from the EPESE studies (Established Populations for Epidemiologic Studies of the Elderly) (5,15,16).

The test for standing balance comprised three stands: tandem, semi-tandem, and side-by-side stand. Each participant started with the semi-tandem stand, in which the heel of one foot was placed to the side of the first toe of the other foot. If a participant was successful at the semi-tandem stand, he went on to the tandem stand, in which the heel of one foot was placed directly in front of the toes of the other foot. If not successful, he tried the side-by-side stand, in which the two feet were placed next to each other. Each stand had to be held for 10 seconds. Scoring on this test was dichotomized into a group that was able to hold the tandem position for 10 seconds and a group that held this position for less than 10 seconds or did not attempt the tandem stand.

In the test for walking speed, the participant was timed for two walks of 2.43 meters (8 feet), to the nearest tenth of a second. The mean time was used in the analyses. The chair stand was scored as the time (measured in tenths of seconds) required to stand up from a chair five times as quickly as possible without making use of arms or hands.

In the test for external shoulder rotation, participants were asked to place their hands behind their head at ear level. This position was scored on four criteria: fingers touching behind head at ear level, head held erect, elbows pointing out to the side at a 180-degree angle at or behind the plane of the ears, and elbows at shoulder level or higher with forearms parallel to the floor. High performance was defined as meeting all criteria, moderate performance as meeting three criteria, and low performance as meeting fewer than three criteria.

A summary performance score was computed for each individual as the number of tests on which performance was low. For the timed tests (walking speed and chair stand), low performance was defined as performing in the lowest quartile or not being able to perform the test. For the second measurement, the cutoff points for low performance on the timed tests were the quartile boundaries of the first measurement.

To ensure uniformity of administration, a videotape produced by the researchers of EPESE was used to instruct the assistants. This videotape provided detailed instructions for administering and scoring the tests, as well as instructions on maintaining the safety of the participants.

Self-reported measures of functional status. Self-reported functional status was measured as disabilities in daily routine activities. The questionnaire consisted of 13 items, adapted from the WHO questionnaire (5,17), each mentioning one activity (see also Table 3). We added up the number of items on which an individual reported disabilities. The items were further grouped into three dimensions: basic activities of daily living (BADL), mobility, and instrumental activities of daily living (IADL). Those who said they needed help with at least one item in a dimension were classified as disabled in that dimension. The dimensions were found to be hierarchical: men who reported disabilities in mobility also reported disabilities in IADL, and men who reported disabilities in BADL also reported disabilities in mobility and IADL. We therefore developed a hierarchical ADL scale, distinguishing four categories: (a) not disabled, (b) disabled in IADL only, (c) disabled in mobility and IADL, and (d) disabled in BADL, mobility, and IADL. Only five individuals did not fit into this hierarchy. Those who reported being disabled in mobility but not in IADL (n = 4) were classified in category three. The one person who was disabled in BADL and IADL but not in mobility was classified in category four.

Other measures. Age, living situation, and marital status were registered. The population was divided into two age groups at the median (79 years), i.e., 74-79 years and 80-92 years.
Socioeconomic status was recorded by life-long occupation in four levels: professionals, managers, and teachers; small business owners; non-manual workers; and manual workers (18). Cognitive function was tested with the Dutch translation of the 30-point Mini-Mental State Examination (MMSE) developed by Folstein et al. (19), a short, structured examination taking 5 to 10 minutes to administer. The MMSE includes questions on orientation in time and place, registration, attention and calculation, recall, language, and visual construction. Points are awarded for correct responses up to a maximum score of 30. Missing items were rated as errors, but if items could not be completed because of physical inability, a weighted score was given. If four or more items were missing, no MMSE score was computed (20). Cognitive impairment was defined as an MMSE score of 25 points or less (21).

Statistical Analyses
Statistical analyses were carried out using SAS, version 6.10. A p-value of .05 or less was considered statistically significant. All p-values were two-sided. Self-reported disabilities and performance-based func-


pants. One week after they received this questionnaire, participants were visited by a research assistant who checked the questionnaire for inconsistencies or missing items. The research assistant then administered a test for cognitive function and the physical performance tests. A random sample of 114 men was invited to participate in the test-retest study; five individuals did not participate in the retest measurements because of sickness or discomfort. The 109 men who participated in the test-retest study were visited again 2 weeks after the first visit (mean ± SD = 15 ± 1.5 days). They were asked to fill out the questionnaire on disabilities and to carry out the performance tests again. The test-retest study was administered by two research assistants. A random half of the men were visited twice by the same assistant, and the other half were visited by two different assistants in random order. For 105 men, complete information about the performance tests and the self-reported functional status was available.

REPRODUCIBILITY OF FUNCTIONAL STATUS

To test whether reproducibility of self-reported disabilities was related to age, we used a regression model with the number of self-reported disabilities at measurement two (ADL2) as the dependent variable, and the number of self-reported disabilities at measurement one (ADL1), age (as a continuous variable), and an interaction term of age and ADL1 as independent variables. The same model was constructed for cognitive function, using the MMSE score as a continuous variable. If the interaction term was not statistically significant, we assumed that the reproducibility did not depend on age or cognition. Similar models were constructed for the summary performance score. Reproducibility was also assessed after stratification by age group and cognitive function.

RESULTS

Sociodemographic characteristics and cognitive function of the participants in the test-retest study are shown in Table 1. The mean age was almost 80 years. Over 90% of the men lived independently. Of the 10 men who did not live independently, 7 lived in a "service flat," which is a relatively independent kind of protected living, 2 in a home for the elderly, and one in a nursing home. Two thirds of the men were still married, and 25% were widowed. About one third were considered to be cognitively impaired. Performance during the second measurement was consistently higher than during the first measurement (Table 2), but this was significant only for the test for external shoulder rotation and for the summary score. The test for standing balance was poorly reproducible (kappa = .29), and the reproducibility of the test for external shoulder rotation was fair (kappa = .49). Pearson correlation coefficients for the test for walking speed and for the chair stand during the first and second measurement were .90 and .82, respectively. The Pearson correlation coefficient of the summary performance score was .71 and kappa was .52. The repro-

Table 1. Characteristics of the Participants of the Test-Retest Study of the Zutphen Elderly Study 1995 (n = 105)

Age (mean ± SD)                              79.9 ± 4.5
Living arrangement
  Independent (%)                            90.5
  Protected (%)                               9.5
Marital status
  Married (%)                                67.6
  Never married (%)                           4.8
  Divorced (%)                                1.0
  Widowed (%)                                26.7
Socioeconomic status (n = 102)
  Professionals, managers, teachers (%)      29.4
  Small business owners (%)                  18.6
  Non-manual workers (%)                     29.4
  Manual workers (%)                         22.5
Cognitive function
  MMSE score (mean ± SD)*                    25.9 ± 3.0
  Cognitively impaired (%)                   36.2

*Mini-Mental State Examination (MMSE); cognitively impaired if MMSE score is ≤ 25.

ducibility of the summary performance score improved when standing balance was not taken into account (r = .73 and kappa = .74). Reproducibility of the summary performance score was not significantly different between the group that was visited twice by the same assistant and the group that was visited by two different assistants (r = .75 and .67, respectively; p = .32).

The reliability and distribution of the self-reported measure of functional status are shown in Table 3. No significant differences were observed between measurements one and two. Kappa of disabilities in BADL was .58. All BADL items had perfect agreement except for the items "use the lavatory" and "wash and bathe oneself." The percentage agreement of these items was 99% and 96%, but the kappa statistics were only .66 and .58. Kappa of disabilities in mobility was .63. The kappas of the individual mobility items ranged from .56 to .74. Kappa of disabilities in IADL was .59, and kappas of the individual IADL items ranged from .57 to .69. The hierarchic ADL scale had fair to good reproducibility (kappa = .63). The correlation between the number of self-reported disabilities during the first and second measurement was .87.

Performance on the EPESE test was significantly related to age (Spearman's r = .25 and .29 for the first and second measurement, respectively), and to cognitive functioning (only during the second measurement: r = -.09 (n.s.) and r = .23, respectively). However, the test-retest reliability of summary performance did not depend on age (p = .59) or on cognitive functioning (p = .52). Self-reported functional status was associated with neither age nor cognitive functioning, but its reproducibility was lower for the oldest men than for the youngest (p = .006), and for the cognitively impaired compared to the not cognitively impaired (p = .001).
In Table 4, the reproducibility of the summary performance score and the hierarchic ADL scale, stratified by age group and cognitive function, is presented.
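The interaction model described in the Methods (retest score regressed on test score, a moderator such as age, and their product) can be illustrated with a minimal sketch. The function names, the centering of the moderator (which leaves the interaction coefficient unchanged but improves numerical stability), and the use of a normal approximation for the p-value are assumptions made here for illustration; the original analysis was run in SAS and its exact procedure is not described in the text.

```python
from statistics import NormalDist, mean


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def interaction_test(score1, score2, moderator):
    """Fit score2 = b0 + b1*score1 + b2*mod + b3*score1*mod by least squares.

    Returns (b3, p), where p is a two-sided p-value for b3 based on a normal
    approximation (an assumption; a t distribution would be the usual choice).
    """
    centre = mean(moderator)
    rows = [[1.0, x, a - centre, x * (a - centre)] for x, a in zip(score1, moderator)]
    k = 4
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, score2)) for i in range(k)]
    beta = solve(xtx, xty)
    resid = [y - sum(b * v for b, v in zip(beta, r)) for r, y in zip(rows, score2)]
    sigma2 = sum(e * e for e in resid) / (len(rows) - k)
    # Var(b3) = sigma2 * [(X'X)^-1] last diagonal element.
    inv_last = solve(xtx, [0.0, 0.0, 0.0, 1.0])[3]
    z = beta[3] / (sigma2 * inv_last) ** 0.5
    return beta[3], 2 * (1 - NormalDist().cdf(abs(z)))
```

A nonsignificant interaction coefficient would be read, as in the text, as no evidence that reproducibility depends on the moderator.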


tional status were described for the test and retest measurement, and differences were tested for significance using paired t-tests. For noncontinuous variables, a correction for continuity was made (22). For the standing balance and shoulder rotation tests and for the summary performance score, (weighted) kappa statistics were calculated. Kappa values greater than .75 indicate excellent agreement beyond chance, values from .40 to .75 may be taken as fair to good agreement beyond chance, and values below .40 may be taken to represent poor agreement beyond chance (23). Because kappa statistics are corrected for the agreement that is expected by chance, a relatively high value of observed agreement can be converted into a relatively low kappa when the table's marginal totals are imbalanced (24). Therefore, we also presented the percentage of observed agreement. For the timed tests (walking speed and chair stand) and for the mean number of tests performed low, Pearson correlation coefficients were assessed. For all ADL items, for disabilities in BADL, mobility, and IADL, and for the hierarchic ADL scale, kappa statistics and percentage of agreement were calculated. Pearson correlation coefficients were calculated for the mean number of self-reported disabilities during the test and retest measurements.
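As an illustration of the agreement statistics described above, the following minimal sketch computes the percentage of observed agreement and a Cohen's kappa (optionally weighted) from a square test-retest contingency table, and labels the result with the cutoffs quoted from Fleiss (23). The function names, the linear weighting scheme, and the example table are assumptions for illustration; the text does not state which weights were used.

```python
def weighted_kappa(table, weighted=True):
    """Cohen's kappa for a square table (rows = test, columns = retest).

    With weighted=True, linear agreement weights 1 - |i - j| / (k - 1) are
    applied (an assumed scheme); for a 2 x 2 table this equals plain kappa.
    Returns (kappa, proportion of exact agreement).
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[i][j] for i in range(k)) for j in range(k)]

    def w(i, j):
        if i == j:
            return 1.0
        if not weighted or k < 2:
            return 0.0
        return 1.0 - abs(i - j) / (k - 1)

    cells = [(i, j) for i in range(k) for j in range(k)]
    p_obs = sum(w(i, j) * table[i][j] for i, j in cells) / n
    p_exp = sum(w(i, j) * row_tot[i] * col_tot[j] for i, j in cells) / n ** 2
    exact = sum(table[i][i] for i in range(k)) / n
    return (p_obs - p_exp) / (1 - p_exp), exact


def fleiss_label(kappa):
    """Verbal label using the cutoffs quoted in the text (.40 and .75)."""
    if kappa > 0.75:
        return "excellent"
    if kappa >= 0.40:
        return "fair to good"
    return "poor"


# Hypothetical 2 x 2 table: 85% exact agreement, balanced margins.
kappa, agreement = weighted_kappa([[40, 10], [5, 45]])
print(round(kappa, 2), round(agreement, 2), fleiss_label(kappa))
```

Reporting the exact agreement next to kappa mirrors the choice made in the text, since kappa alone can look low when the marginal totals are imbalanced.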


Table 2. Performance During Test and Retest Measurements and Reproducibility of the Performance Tests. Zutphen Elderly Study 1995 (n = 105)

                                        Scoring on Performance Test    Test-Retest Reliability
Performance Tests                       Test           Retest          % Agreement    Weighted Kappa
Test for standing balance                                              68             .29
  % high                                61.0           70.5
  % low                                 39.0           29.5
External shoulder rotation                                             72             .49
  % high                                61.0           68.6*
  % moderate                            28.6           26.7
  % low                                 10.5            4.8

                                        Test           Retest          Pearson Correlation Coefficient
Walking speed (n = 104); mean ± SD      3.6 ± 1.7      3.5 ± 1.5       .90
Chair stand (n = 99); mean ± SD         13.6 ± 4.2     13.2 ± 3.8      .82

Summary Performance Score               Test           Retest          % Agreement    Weighted Kappa
                                                                       53             .52
  % 0 tests low                         41.9           49.5*
  % 1 test low                          26.7           24.8
  % 2 tests low                         19.0           18.1
  % 3 or 4 tests low                    12.4            7.6

                                        Test           Retest          Pearson Correlation Coefficient
Number of tests scored low; mean ± SD   1.02 ± 1.06    .84 ± .98*      .71

*Significantly different between test and retest, tested with paired t-test (with continuity correction); p < .05.
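The summary performance score reported in Table 2 counts the tests on which a participant performed low, with the retest judged against the quartile boundaries of the first measurement. A minimal sketch of that scoring rule follows. The function names, the use of None for "unable to perform", the strict "greater than" comparison at the cutoff, and the default quantile method of Python's statistics module are all assumptions; the text does not specify how ties or quantile interpolation were handled.

```python
from statistics import quantiles


def slow_cutoff(first_times):
    """Upper quartile boundary of the first-measurement times.

    Longer times mean poorer performance, so times above this boundary fall
    in the lowest performance quartile. None marks 'unable to perform'.
    """
    valid = [t for t in first_times if t is not None]
    return quantiles(valid, n=4)[2]


def timed_low(time, cutoff):
    """Low performance: unable to perform, or slower than the cutoff (assumed strict)."""
    return time is None or time > cutoff


def summary_score(tandem_10s, walk_time, chair_time, shoulder_criteria,
                  walk_cut, chair_cut):
    """Number of the four tests scored low (0 to 4)."""
    low = [
        not tandem_10s,                    # tandem stand not held for 10 seconds
        timed_low(walk_time, walk_cut),    # walking speed
        timed_low(chair_time, chair_cut),  # chair stand
        shoulder_criteria < 3,             # fewer than three of four criteria met
    ]
    return sum(low)
```

Passing the same walk_cut and chair_cut, derived once from the first measurement, to both the test and the retest scoring reproduces the fixed-cutoff design described in the Methods.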

Table 3. Percentage Disabled in Daily Routine Activities During Test and Retest Measurement, and Reproducibility of Self-Reported Disabilities. Zutphen Elderly Study 1995 (n = 105)

                                              % Disabled             Test-Retest Reliability
                                              Test       Retest      % Agreement    Weighted Kappa
BADLs                                          5.7        3.8         96            0.58
  Walk between rooms                           0.0        0.0        100            1.00
  Feed oneself                                 1.0        1.0        100            1.00
  Get in and out of bed                        1.9        1.9        100            1.00
  Use the lavatory                             1.0        1.9         99            0.66
  Dress and undress                            1.9        1.9        100            1.00
  Wash and bathe oneself                       5.7        3.8         96            0.58
Mobility                                      26.7       21.0         87            0.63
  Move outdoors                                3.8        3.8         98            0.74
  Use stairs (n = 103)                        11.7        7.8         92            0.56
  Walk at least 400 meters                     8.6       10.5         94            0.67
  Carry a heavy object, e.g., a shopping
    bag of 5 kg for a hundred meters          21.9       17.1         88            0.61
IADLs                                         61.9       57.1         80            0.59
  Do light housework                           6.7        5.7         95            0.59
  Do one's own cooking (n = 104)              27.9       27.9         83            0.57
  Do heavy housework (n = 104)                53.8       46.2         85            0.69
Hierarchic disability scale                                           69            0.63
  Not disabled                                37.1       40.0
  Disabled in IADL only                       36.2       38.1
  Disabled in mobility and IADL               21.0       18.1
  Disabled in BADL, mobility, and IADL         5.7        3.8

                                              Test           Retest          Pearson Correlation Coefficient
Number of disabilities (n = 101); mean ± SD   1.47 ± 2.01    1.29 ± 1.99     0.87

Note: BADL = Basic Activities of Daily Living; IADL = Instrumental Activities of Daily Living.
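The hierarchic disability scale in Table 3 follows the classification rule given in the Methods, including the handling of the few men who did not fit the hierarchy. A minimal sketch of that rule is shown below. The function names and the treatment of a BADL disability without any other disability (a pattern the text does not mention) are assumptions made for illustration.

```python
def disabled_in_dimension(item_needs_help):
    """A man is disabled in a dimension if he needs help with at least one item."""
    return any(item_needs_help)


def hierarchic_adl_category(badl, mobility, iadl):
    """Return the category (1 to 4) on the hierarchical ADL scale.

    1 = not disabled, 2 = disabled in IADL only,
    3 = disabled in mobility and IADL, 4 = disabled in BADL, mobility, and IADL.
    Following the text, mobility without IADL is placed in category 3 and
    BADL with IADL but without mobility in category 4. Placing any BADL
    disability in category 4 is an assumption for patterns not described.
    """
    if badl:
        return 4
    if mobility:
        return 3
    if iadl:
        return 2
    return 1


# Example: help needed with one mobility item only, no IADL or BADL disability.
mobility = disabled_in_dimension([False, True, False, False])
print(hierarchic_adl_category(badl=False, mobility=mobility, iadl=False))
```

Checking the most severe dimension first keeps the rule to three comparisons and makes the exceptions described in the text fall out without special cases.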


Walking speed (n = 104); mean ± SD Chair stand (n = 99); mean ± SD


Table 4. Reproducibility of Functional Status (Kappa Statistics) Stratified by Age Group and Cognitive Function. Zutphen Elderly Study 1995

                                      n     Hierarchic ADL Scale    Summary Performance Score
Age group
  Old (74-79 years)                   55    .71                     .53
  Very old (80-92 years)              50    .55                     .50
Cognitive function
  Not impaired (MMSE score > 25)      67    .73                     .55
  Impaired (MMSE score ≤ 25)          38    .45                     .46

Both measures of functional status (self-reported disabilities and performance tests) were moderately to highly reproducible in this population of elderly men, with kappas and Pearson correlation coefficients ranging from .49 to .90. Only the reproducibility of the test for standing balance was poor (kappa = .29).

Reproducibility is a measure of the potential of the instrument to yield the same result for a single respondent on two separate assessments, which are usually closely spaced so that any variation is due to the reliability of the instrument rather than to changes in the respondent's status (8). The interval should also be long enough to exclude recollection effects. The time interval chosen in this study was 2 weeks, and recollection effects might still play a role. Performance during the second measurement was higher than during the first measurement, possibly due to a recollection or learning effect. In this very old population, however, 2 weeks might already be long enough for genuine changes in functional status. During the second survey, respondents were asked if their ability to do the tests had changed since the first survey. Exclusion of the respondents who indicated that their functioning had changed (n = 10) did indeed increase the test-retest reliability, mainly of the self-reported disabilities (kappa of the hierarchic ADL scale increased from .63 to .69; the kappa statistics of BADL, mobility, and IADL increased by similar amounts). Reproducibility of the performance tests also increased, but less.

Reproducibility of the self-reported disabilities might be biased because the manner of administering the questionnaire the second time was different from the first time. In the first measurement, the ADL questionnaire was embedded in a large questionnaire, which the participants filled out at home, with or without the help of a relative or acquaintance.
The second time, they were asked to fill out the questionnaire by themselves, while the research assistant waited. Such varying conditions are known to affect the results (25). Because the questionnaire might have been filled out differently the second time, test-retest reliability of the self-reported functional status might be underestimated.

Test-retest reliability of performance-based functional status was not better than the test-retest reliability of self-reported functional status. The reproducibility of the test for standing balance was poor. Although performance tests are more objective, they do not seem more reproducible in this population of elderly men. Variations in performance-based functional status might be due to a "real" variation in performance. People differ in physical condition from day to day. Variation in self-reported functional status is not due to these day-to-day variations, because people report their normal functioning over a certain period of time. Variations in the self-reported disabilities are probably due to contextual aspects. For example, certain activities might not be performed by the men in our study population (e.g., preparing meals or using stairs when one lives in a house without stairs). Some men might not know whether they are able to do these activities. In our study, both measures of functional status were equally reproducible, indicating that both biases are of the same order of magnitude. Only in very old or cognitively impaired men might the bias in self-reported measures be larger. These results, however, cannot be generalized to women.

Only a few studies have collected data on test-retest reliability of functional status measures. It is difficult to compare the results of these studies, because different instruments are tested, the study populations and settings vary considerably, and the time between the two measures varies from a couple of hours to months. In a study on the development of a physical performance measure, described by Winograd and colleagues (9), test-retest kappa statistics of .99 were observed, which are very high. A possible explanation is the hospital setting, and the fact that the time between the two measures was only 48 hours. In a study by Seeman et al. (26), where the performance tests resembled the tests in our study, the observed 2-week test-retest reliability also resembled our results. They found a test-retest correlation of .80 for the test for walking speed and .73 for the chair stands.
In their study, more persons scored low on the test for standing balance compared to our study, and probably because of this larger variation the reproducibility of the standing balance was higher than in our study (the correlation coefficient between the number of seconds the balance was held during the first and second measurement was .61). Test-retest reliability of a self-reported measure of functional status is reported in a study by Salén (27), who observed a 1-day test-retest reliability of .95 and a 3-day test-retest reliability of .92 (intraclass correlation coefficients). This is slightly higher than the .85 we observed, but the interval between the two measures in our study was longer. Myers compared 2-week test-retest reliability of a self-reported and a performance-based measure of functional status in 29 older adults (10). She also found no systematic differences in reproducibility between the two kinds of measurements.

The tests were administered by two research assistants, who were randomly assigned to the subjects. If we estimate the interrater reliability as the test-retest reliability of the group that was visited by two different interviewers, in which case both interviewers rated the same individual, we find this to be comparable to the overall test-retest reliability (summary performance score: r = .67 versus .71). The two interviewers did not score the tests differently (paired analyses), nor was the reproducibility different between the interviewers (r = .74 and .71, p = .91).

For test-retest reliability of the continuous variables, we calculated Pearson correlation coefficients. Systematic dif-


DISCUSSION


ACKNOWLEDGMENTS

This study was supported by grants from the Praeventie Fonds, the Hague, The Netherlands, and the National Institute on Aging. We thank the fieldwork team in Zutphen, especially Dr. E. B. Bosschieter.

Address correspondence to Nancy Hoeymans, National Institute of Public Health and the Environment, P.O. Box 1, 3720 BA Bilthoven, The Netherlands.

REFERENCES

1. Kelly-Hayes M, Jette AM, Wolf PA, D'Agostino RB, Odell PM. Functional limitations and disability among elders in the Framingham study. Am J Public Health 1992;82:841-5.
2. Rozzini R, Frisoni GB, Bianchetti A, Zanetti O, Trabucchi M. Physical performance test and activities of daily living scales in the assessment of health status in elderly people. J Am Geriatr Soc 1993;41:1109-13.
3. Reuben DB, Valle LA, Hays RD, Siu AL. Measuring physical function in community-dwelling older persons: a comparison of self-administered, interviewer-administered, and performance-based measures. J Am Geriatr Soc 1995;43:17-23.
4. Guralnik JM, Ferrucci L, Simonsick EM, Salive ME, Wallace RB. Lower-extremity function in persons over the age of 70 years as a predictor of subsequent disability. N Engl J Med 1995;332:556-61.
5. Hoeymans N, Feskens EJM, Van den Bos GAM, Kromhout D. Measuring functional status: cross-sectional and longitudinal associations between performance and self-report (Zutphen Elderly Study 1990-1993). J Clin Epidemiol 1996;49:1103-10.
6. Verbrugge LM, Jette AM. The disablement process. Soc Sci Med 1994;38:1-14.
7. Guralnik JM, Branch LG, Cummings SR, Curb JD. Physical performance measures in aging research. J Gerontol Med Sci 1989;44:M141-6.
8. Applegate WB, Blass JP, Williams TF. Instruments for the functional assessment of older patients. N Engl J Med 1990;332:1207-14.
9. Winograd CH, Lemsky CM, Nevitt MC, et al. Development of a physical performance and mobility examination. J Am Geriatr Soc 1994;42:743-9.
10. Myers AM, Holliday PJ, Harvey KA, Hutchinson KS. Functional performance measures: are they superior to self-assessments? J Gerontol Med Sci 1993;48:M196-206.
11. Feinstein AR, Josephy BR, Wells CK. Scientific and clinical problems in indexes of functional disability. Ann Intern Med 1986;105:413-20.
12. Cress ME, Schechtman KB, Mulrow CD, Fiatarone MA, Gerety MB, Buchner DM. Relationship between physical performance and self-perceived physical function. J Am Geriatr Soc 1995;43:93-101.
13. Reuben DB, Siu AL. An objective measure of physical function of elderly outpatients: the Physical Performance Test. J Am Geriatr Soc 1990;38:1105-12.
14. Feskens EJM, Bloemberg BPM, Pijls LTJ, Kromhout D. A longitudinal study on elderly men: the Zutphen Study. In: Schroots JJF, ed. Aging, health and competence. Amsterdam: Elsevier Science, 1993:327-33.
15. Cornoni-Huntley J, Brock DB, Ostfeld AM, Taylor JO, Wallace RB, eds. Established populations for epidemiologic studies of the elderly. Resource data book. (NIH pub. no. 86-2443). Washington, DC: U.S. Government Printing Office, 1986.
16. Guralnik JM, Simonsick EM, Ferrucci L, et al. A short physical performance battery assessing lower extremity function: association with self-reported disability and prediction of mortality and nursing home admission. J Gerontol Med Sci 1994;49:M85-94.
17. World Health Organization. The elderly in eleven countries. Copenhagen: World Health Organization, 1983.
18. Duijkers TJ, Kromhout D, Spruit IP, Doornbos G. Inter-mediating risk factors in the relation between socioeconomic status and 25-year mortality (the Zutphen Study). Int J Epidemiol 1989;18:658-62.
19. Folstein MF, Folstein SE, McHugh PR. "Mini-Mental State": a practical method for grading the cognitive state of patients for the clinician. J Psychiatr Res 1975;12:189-98.
20. Fillenbaum GG, George LK, Blazer DG. Scoring nonresponse on the Mini-Mental State Examination. Psychol Med 1988;18:1021-5.
21. Siu AL. Screening for dementia and investigating its causes. Ann Intern Med 1991;115:122-32.
22. Snedecor GW, Cochran WE. Statistical methods (7th ed.). Ames, IA: The Iowa State University Press, 1982:146-7.
23. Fleiss JL. Statistical methods for rates and proportions (2nd ed.). New York: John Wiley & Sons, 1981:212-36.
24. Feinstein AR, Cicchetti DV. High agreement but low kappa: I. The problems of two paradoxes. J Clin Epidemiol 1990;43:543-9.
25. Cartwright A. Health surveys in practice and in potential: a critical review of their scope and methods. London: Oxford University Press, 1983:163-5.
26. Seeman TE, Charpentier PA, Berkman LF, et al. Predicting changes in physical performance in a high-functioning elderly cohort. MacArthur Studies of Successful Aging. J Gerontol Med Sci 1994;49:M97-108.
27. Salen BA, Spangfort EV, Nygren AL, Nordemar R. The disability rating index: an instrument for the assessment of disability in clinical settings. J Clin Epidemiol 1994;47:1423-34.
28. Cicchetti DV, Feinstein AR. High agreement but low kappa: II. Resolving the paradoxes. J Clin Epidemiol 1990;43:551-8.

Received April 22, 1996
Accepted February 10, 1997


ferences between the two measures might be missed. However, we also calculated the intraclass correlation coefficients, which are corrected for systematic variance, but the difference from the Pearson correlation coefficients was .01 at the maximum.

For ordinal variables, we calculated kappa statistics. Kappa adjusts for expected agreement, but is affected by the prevalence. In reliability studies, no gold standard is available, so no "correct" value for true prevalence exists. The only available data are the observed marginal totals, which become the substitutes for the prevalence. When the marginal totals are imbalanced, low kappa values are obtained while the percentage of agreement is high, as shown by Feinstein and Cicchetti (24). In our study, with low prevalence rates of disability, this phenomenon is important. A possible solution, given by Cicchetti and Feinstein (28), is to report positive and negative agreement. For example, the percentage agreement in disabilities in BADL is 96%, the agreement of negative values (no disabilities) is 98%, but the agreement of positive values is only 60%, which is the reason for the lower kappa (.58). Kappa will have the same size as percentage agreement only if the marginal totals are balanced or when the positive and negative agreements are balanced. Because of the low prevalence of disabilities in BADL in our study, the reproducibility of absence of disabilities is higher than the reproducibility of presence of disabilities. This is to a lesser extent observed for disabilities in mobility.

In our population of elderly men, the test-retest reliability of performance tests and self-reported disabilities is moderate to good. In contrast to previous suggestions, we did not find performance tests to be more reproducible than self-reported measures. Only in very old or cognitively impaired persons was self-reported functional status less reproducible than in younger or unimpaired persons.
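The BADL example above (96% overall agreement, 98% negative agreement, 60% positive agreement, kappa .58) can be worked through with a short sketch. The 2 x 2 cell counts used here are not given in the article; they are a reconstruction chosen to be consistent with the reported figures (n = 105, with 5.7% and 3.8% disabled at test and retest) and should be read as an assumption, as should the function and key names.

```python
def agreement_summary(both_pos, test_only, retest_only, both_neg):
    """Overall, positive, and negative agreement plus Cohen's kappa for a 2 x 2 table."""
    n = both_pos + test_only + retest_only + both_neg
    discordant = test_only + retest_only
    p_obs = (both_pos + both_neg) / n
    # Positive and negative agreement in the sense of Cicchetti and Feinstein (28).
    p_pos = 2 * both_pos / (2 * both_pos + discordant)
    p_neg = 2 * both_neg / (2 * both_neg + discordant)
    # Chance-expected agreement from the marginal totals.
    test_pos = both_pos + test_only
    retest_pos = both_pos + retest_only
    p_exp = (test_pos * retest_pos + (n - test_pos) * (n - retest_pos)) / n ** 2
    kappa = (p_obs - p_exp) / (1 - p_exp)
    return {"overall": p_obs, "positive": p_pos, "negative": p_neg, "kappa": kappa}


# Assumed counts: 3 disabled at both visits, 3 at test only, 1 at retest only,
# 98 not disabled at either visit (6/105 = 5.7% at test, 4/105 = 3.8% at retest).
result = agreement_summary(both_pos=3, test_only=3, retest_only=1, both_neg=98)
print({name: round(value, 2) for name, value in result.items()})
```

With so few disabled men, the four discordant pairs weigh heavily against the three concordant positive pairs, which is why positive agreement and kappa are much lower than the overall percentage of agreement.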
