
METHODOLOGY


Classification Algorithms for Hip Fracture Prediction Based on Recursive Partitioning Methods

Hua Jin, PhD, Ying Lu, PhD, Steven T. Harris, MD, Dennis M. Black, PhD, Katie Stone, PhD, Marc C. Hochberg, MD, MPH, Harry K. Genant, MD

This article presents 2 modifications to the classification and regression tree (CART) algorithm. The authors improved the robustness of splits in the test sample approach and developed a cost-saving classification algorithm by selecting splits that are noninferior to the optimum splits but use variables with lower cost or variables already used in parent splits. The new algorithm is illustrated with 43 predictive variables for 5-year hip fracture risk previously documented in the Study of Osteoporotic Fractures. The authors generated the robust optimum classification rule without consideration of classification variable costs and then generated an alternative cost-saving rule with equivalent diagnostic utility. A 6-fold cross-validation study showed that the cost-saving alternative classification is statistically noninferior to the optimal one. Their modified classification and regression tree algorithm can be useful in clinical applications. A dual X-ray absorptiometry hip scan and information from clinical examinations can identify subjects with elevated 5-year hip fracture risk without loss of efficiency compared with more costly and complicated algorithms. Key words: cost-effective prediction; consistency; classification tree; equivalent diagnosis. (Med Decis Making 2004;24:386–398)

“Osteoporosis is a disease characterized by low bone mass and micro-architectural degradation of bone tissue, leading to enhanced bone fragility and a consequent increase in fracture risk.”1 It is a common disease among postmenopausal women and older men.2 Among all osteoporotic fractures, hip fractures are the most costly.3,4

Received 2 October 2003 from the Department of Radiology (HJ, YL, HKG), Osteoporosis and Arthritis Research Group (YL, STH, HKG), and the Department of Epidemiology and Biostatistics (YL, DMB, KS), University of California, San Francisco; the Department of Medicine, University of Maryland, Baltimore (MCH); and the Department of Mathematics, South China Normal University, Guangzhou, China (HJ). The study is supported by a grant from the National Institutes of Health (R03 AR47104). We thank the reviewers and associate editors for their suggestions that substantially improved the article, Professor Heping Zhang for helpful discussions, and Mr David Breazeale for editorial help. Revision accepted for publication 12 January 2004. Address correspondence and reprint requests to Ying Lu, PhD, Department of Radiology, Box 0946, University of California, San Francisco, San Francisco CA 94143-0946; e-mail: ying.lu@radiology.ucsf.edu. DOI: 10.1177/0272989X04267009


Over the past 10 years, many new quantitative imaging techniques and biochemical markers have been developed and new risk factors identified to improve our understanding of osteoporosis and our ability to predict osteoporotic fracture risk.5–13 With the increasing availability of prevention and treatment options,14–18 it has become increasingly important to accurately stratify individuals into different levels of osteoporotic fracture risk so that subjects with high risk can receive appropriate treatments before fractures occur.19–22

Recursive partitioning (RP) methods, such as the classification and regression tree (CART) algorithm developed by Breiman and others,23 have been useful in medical research to construct algorithms for disease diagnosis and/or prognostic prediction.22,24–28 CART analysis is an alternative to traditional regression methods. For continuous variables, regression methods assume linearity, whereas CART assumes discreteness of the risk by splitting continuous variables at optimal cut-points. More important, regression methods assume additivity of the effects of variables. In contrast, CART assumes nonadditivity29 through independent searches for splits within branches of the tree. The linearity assumption in regression models can be relaxed by including nonlinear terms and the additivity assumption by including statistical interaction terms.

CART constructs classification models in 3 steps: 1) selection of binary splits of a set of classification or predictor variables, also referred to as the splitting step; 2) determination of the appropriate tree size, referred to as the pruning step; and 3) creation of statistical summaries of the terminal nodes of the tree. There are many options for determining tree sizes, including the original rules by Morgan and Sonquist30 or the use of test samples or cross-validation (CV), as in CART. Loh and Vanichsetakul31 developed an alternative fast classification tree (FACT) algorithm, which uses discriminant analysis to split nodes and determines the tree size with another set of stopping rules. They found that FACT seemed to provide better results using linear combination splits but that CART was better using univariate splits. Lim and others32 compared 22 decision tree algorithms and concluded that CART, QUEST,33 and C4.534,35 were among the best decision tree algorithms with univariate splits and suggested that some form of pruning might be necessary for low error rates. More recently, Zhang has improved and extended CART to survival and repeated measurement data in his MASAL algorithm and has also applied it to other medical studies.28,36,37

Although conventional RP algorithms have been well developed and successfully used, they still have some limitations. The testing sample approach of RP algorithms, including CART, QUEST, C4.5, and MASAL, separates the study data into 2 random data sets: learning and testing samples. They use the learning sample to split the nodes and the testing sample to prune the tree in separate steps. The testing sample is not applied to the splitting procedure. Thus, the testing sample does not contribute to the selection of the splitting variables and their corresponding cutoff values, even though the results in the testing sample may be contradictory to the results in the learning sample. This limitation can be avoided with an alternative CV pruning algorithm. In addition, current RP algorithms focus only on optimization of statistical performance and do not take into account the cost of the variables used to construct the classification algorithms. Because of the cost and/or availability of some of the variables, application of the derived algorithm may be limited in practice. In this article, we propose a modified testing sample approach for an RP algorithm, the robust and cost-saving tree (RACT), and apply it to construct cost-saving classification rules for osteoporotic hip fracture risk.

Our algorithm considers the cost/availability of the predictive variables and improves the consistency of classification results by using test samples in splitting the trees. In the next section, we introduce our data example from the Study of Osteoporotic Fractures (SOF) that motivated this research. We then explain RACT based on the well-known entropy criterion. In the Results section, we use RACT to establish the optimum and cost-saving equivalent classification algorithms for identifying subjects at high risk of hip fracture. We then use a 6-fold CV approach to compare RACT and an optimum tree similar to CART. In the last section, we discuss our results and present our conclusions.

SUBJECTS AND METHODS

Subjects

Between September 1986 and October 1988, 9704 ambulatory white women aged 65 years or older without bilateral hip replacements were recruited from population-based listings in 4 United States cities.5 At baseline, bone mineral density (BMD) was measured at the calcaneus, distal radius, and proximal radius using single photon absorptiometry (SPA). At the 2nd visit (1988–9), 7786 women (82% of the survivors at that time) had BMD measurements of the anterior-posterior (AP) spine (L1–L4) and proximal femur (neck, trochanter, Ward’s triangle, total hip regions of interest) using dual X-ray absorptiometry (DXA). Besides radiological BMD measurements, additional information about functional status, nutritional intake, clinical information, vision examination, and so forth was also obtained at the baseline visit by questionnaire, interview, and/or examination. The women were followed for hip fracture by letter or telephone every 4 months for an average of 4.1 years, with 99% completion. Reported hip fractures were confirmed by a radiologist (H. K. G.) from preoperative radiographs. Details of the study design and summary statistics of SOF subjects have been published previously.5,38

We used 5-year hip fracture from visit 2 as the binary endpoint. We excluded 1457 women who were lost to follow-up within 5 years after visit 2 and whose hip fracture status was unknown. Because previous studies suggest that hip fracture is best predicted by BMD,38 we further excluded 1120 women from the study who missed at least 1 BMD measurement of the forearm, calcaneus, hip, or AP spine. The remaining 7127 women, for whom most of the other previously verified predictive variables at baseline were available,5,7–10 were randomly divided into 2 data sets.


Table 1  Characteristics of Women Lost to Follow-up, Women with Missing Bone Mineral Density, and Women Used in the Final Analysis from the Study of Osteoporotic Fractures (SOF)

Values are mean (SD).

Group | Age at SOF Baseline | Height at SOF Baseline (cm) | Height at Age 25 (cm) | Weight at SOF Baseline (kg) | Weight at Age 25 (kg) | Body Mass Index at SOF Baseline (kg/m2)
Women lost to follow-up (n = 1457) | 74.4* (6.3) | 158.2* (6.3) | 162.6 (6.1) | 65.1* (13.0) | 56.5 (7.8) | 26.0* (4.8)
Women with missing bone mineral density (n = 1120) | 72.2* (5.5) | 158.7* (6.0) | 162.3 (5.9) | 67.4 (12.7) | 56.1 (7.0) | 26.8 (4.8)
Women used in the final analysis (n = 7127) | 71.1 (4.9) | 159.3 (6.0) | 162.6 (5.9) | 67.4 (12.3) | 56.3 (7.2) | 26.6 (4.6)

*The mean values were significantly different from the women used in the final analysis (P < 0.05).

The first consisted of 60% of the subjects (n = 4307 with 142 hip fractures) used as the learning sample, and the remaining 40% (n = 2820 with 88 hip fractures) were used as the testing sample.

Table 1 summarizes some pertinent characteristics of women with unknown 5-year fracture status, women with known fracture status but missing BMD data, and those with known fracture status and BMD data. Women without BMD data but with known fracture status differed only slightly from the women used in the final data analysis. The statistically significant differences were the result of the large samples and did not seem to be biologically relevant. Women with unknown fracture status, however, were 3 years older, slightly shorter, and thinner than women used in the final data analysis. Although it is hard to determine the effects of these differences on their respective fracture risks, the data suggest possible bias introduced by including only those women with known fracture status and BMD values and could provide strong caveats about potential clinical use of the derived algorithms.

Predictive Variables

Forty-three predictive variables were used as candidates to construct classification rules. Most of them have been established as statistically significant predictors of hip fractures after adjustment for age and neck BMD,5,7–10 but not all of them were investigated simultaneously. Some variables are available in clinical visits but have not been identified as significant predictors. For example, heights at the baseline visit and at age 25 are not significant variables themselves. We include them because they were used to calculate height lost from age 25 and the body mass index. A preliminary stepwise logistic regression analysis suggested 12 linearly unassociated significant predictors. However, with a total of 230 fractures and 6897 nonfractured women, we actually have limited ability to reliably fit a tree with this relatively large number of potential predictors.29,39

To illustrate the idea of a cost-saving tree algorithm, we grouped the predictive variables into 5 categories. Category 1 consists of demographic and clinical information that can be obtained in regular clinical visits. These include age, weight, height, health status and history, and current use of medications. Category 2 consists of nutritional assessment, functional status assessment, and so forth. These variables are relatively easy to collect but are either not routinely used in clinical practice or require assistance from specialists for interpretation. Category 3 consists of BMD measurements at peripheral sites by SPA (OsteoAnalyzer, Siemens-Osteon, Wahiawa, HI), including the distal forearm, proximal forearm, and calcaneus. Category 4 consists of central BMD measurements at the spine and hip by DXA (Hologic QDR, Hologic Inc, Waltham, MA). BMD values in the spine and hip are presented as T scores according to reference values in Lu and others40 to facilitate comparison of BMD measured by scanners of different manufacturers.


Table 2  List of Predictors in Cost Categories

Cost Level | Variable Type | List of Variables
1 | Clinical/self-report | Race; age; weight; body mass index; weight at age 25; height at age 25; height; height reduction from age 25; weight increase from age 25; resting pulse rate; drink alcohol in past 12 months; self-rated health; self-reported arthritis; history of maternal hip fracture; current calcium users; walking for exercise; on feet ≤ 4 h per day; smoking status; currently a smoker; fall in the past 12 months; any fracture after age 50; current use of seizure medications, oral estrogen, long-acting benzodiazepine, or thyroid medications; hip pain; having osteoarthritis according to a doctor
2 | Need physicians or specialists | Calcium intake per week from food; mean daily caffeine intake; Quetelet index; inability to rise from chair; sum of initial functional status; walking speed; sum degree difficulty for functional status
3 | Peripheral BMD | Distal radius BMD; proximal radius BMD; calcaneus BMD
4 | Central BMD | Total hip, femoral neck, trochanteric, and intertrochanteric BMD; total AP spine BMD
5 | X-ray and vision tests | Prevalent vertebral fractures; low-frequency contrast sensitivity (per standard deviation decrease); distance depth perception

Note: BMD = bone mineral density.

Category 5 consists of X-ray readings of prevalent vertebral fractures and vertebral heights, as well as some special functional tests and vision tests rarely used in clinical practice. Here, SPA (or, more recently, single X-ray absorptiometry [SXA]) for peripheral BMD costs less than other X-ray methods do and can be installed in outpatient clinics. Thus, it is the preferred choice over central DXA. On the other hand, although X-rays are readily available, assessment of vertebral fractures and measurement of vertebral heights require trained radiologists. In contrast, DXA scans can usually be performed by technicians. We did not include quantitative ultrasound in this analysis because it was not administered until a later visit and was not available in all subjects. The predictive variables are listed in Table 2.

In all the analyses, the outcome variable is hip fracture within 5 years after visit 2 (when central BMD was measured). In all classification rules, if the percentage of hip fractures in a group of women is higher than the population fracture rate, the group is classified as a high-risk group for hip fracture. Otherwise, they are viewed as a low-risk group.

Recursive Partitioning

In general, RP methods start with a sequence of binary separations of a group of subjects (splitting step). At each separation (or split), we examine all included predictive variables known for this group of subjects. For an ordinal predictive variable, RP examines all observed values. For each observed value, subjects with values less than or equal to that value fall into one group, whereas subjects with greater values fall into the 2nd group. For a categorical variable, the separation is based on whether subjects are in one category versus another. Varying the cutoff points over all of the predictive variables generates many candidate binary separations of a group of subjects. An impurity improvement function based on the entropy criterion28,41 is used to indicate how homogeneous the resulting subgroups are with regard to hip fracture risk. The split that results in maximum impurity improvement is the optimal split among all possible choices. The subgroups resulting from such a split are called daughter nodes. A classification tree is formed by sequentially splitting the daughter nodes of previous splits until some specified stopping criteria are met. For simplicity, 2 rules are used to limit tree growth: 1) the resulting daughter nodes should consist of at least 30 women in the learning sample, and 2) splitting stops when a node has fewer than 100 women.

Splitting steps can produce a large number of branches of a classification tree. To make the results less dependent on the learning samples, a search for the optimum subtree within the main tree is necessary. This step is commonly called the pruning step. We used Breiman and others’23 testing sample approach to determine the best tree size. For the given study size, the testing sample approach is the “honest and efficient” approach.23 We randomly divided the data into 2 sets: 60% of the women were used as the learning sample that generates all the splits, and the remaining 40% were used as the testing sample to determine tree size. The lowest daughter nodes in the final pruned tree are the terminal nodes. We used squares to separate them from intermediate nodes, which are normally presented in circles.
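As a rough illustration of the splitting step just described, the sketch below (plain Python with hypothetical function names; outcomes are assumed to be coded 0/1 for no fracture/fracture) computes the weighted entropy impurity of a candidate binary split and scans all observed cutoffs of one ordinal predictor for the cutoff with maximum impurity improvement, honoring the 30-woman minimum daughter-node rule stated above.

```python
import math

def node_entropy(outcomes):
    """Entropy (in nats) of a node, given a list of 0/1 fracture indicators."""
    n = len(outcomes)
    if n == 0:
        return 0.0
    p = sum(outcomes) / n
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))

def impurity_improvement(values, outcomes, cutoff):
    """Reduction in weighted entropy when splitting on `value <= cutoff`."""
    n = len(values)
    left = [y for x, y in zip(values, outcomes) if x <= cutoff]
    right = [y for x, y in zip(values, outcomes) if x > cutoff]
    weighted = (len(left) / n) * node_entropy(left) + (len(right) / n) * node_entropy(right)
    return node_entropy(outcomes) - weighted

def best_split(values, outcomes, min_node=30):
    """Scan all observed values of an ordinal predictor for the optimal cutoff,
    honoring the minimum daughter-node size of 30 used in the article."""
    best = (None, -1.0)
    for cutoff in sorted(set(values))[:-1]:      # the largest value would leave an empty right node
        n_left = sum(1 for x in values if x <= cutoff)
        if n_left < min_node or len(values) - n_left < min_node:
            continue
        gain = impurity_improvement(values, outcomes, cutoff)
        if gain > best[1]:
            best = (cutoff, gain)
    return best    # (cutoff, impurity improvement)
```

In a tree-growing step, this scan would be repeated for every candidate predictor in a node, and the split with the largest improvement becomes the optimum split.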


Classification of fracture risk status was based on the 5-year fracture incidence rate in the terminal nodes for the testing samples, in comparison to the population’s overall incidence rate as specified previously.

Various approaches have been proposed to deal with missing data. For example, Breiman and others23 used surrogate splits for subjects with missing data. Zhang and Singer28 replaced missing data with the minimum or maximum values and examined the corresponding entropy results. We adopted the 2nd approach in our application in this article.

Robust Splitting

In this article, we illustrate our RACT algorithm, which uses both learning and testing samples in the splitting step. Like CART, the learning sample is primarily used to generate the tree, and the testing sample is used to determine the tree size. Our first modification is to use the testing sample to assist in the splitting steps. The optimum tree with the robust modification is referred to as the optimum robust tree in the Results section.

In the conventional testing sample approach of CART, the testing sample is used only to select the optimum tree in the pruning step. Sometimes, an optimal split in the learning sample can produce completely opposite results when applied to the testing sample and/or can violate background biology. For example, because of statistical variation, the node with elevated risk in the learning sample could have reduced risk in the corresponding node of the testing sample. Although such inconsistencies will reduce the accuracy of the entire classification for the testing samples and thus may not be selected in the final trees, there is no guarantee of avoiding such inconsistency in individual splits of the final optimum tree. We would expect that a reliable split should not depend on the specific data sets to which it is applied. Thus, in each split, we required selection of the best split, with maximum impurity improvement among all possible choices, that results in the same classification directions for both the learning sample and the testing sample. This is similar to requiring plausible signs for logistic regression coefficients, which was found to slightly improve predictive ability.42

We apply the split derived from the learning sample to the testing sample to see if the assignment of the high-risk group based on a “yes” or “no” answer to the split question is consistent and also agrees with biological theory. For example, in the learning sample (4307 women with 142 hip fractures), the optimum split of the 1st layer is whether the T score of neck BMD is less than or equal to –2.14. For those women with a “yes” answer, the probability of hip fracture within 5 years is 6.9%, whereas this probability drops to 1.2% for women with a “no” answer. Applying this split to the testing sample (2820 women with 88 hip fractures), we get a similar result: a 6.6% fracture probability for those women with lower BMD versus 1.0% for those with higher BMD. Both results suggest that lower BMD leads to a higher risk of fracture. In this case, we say the split is consistent between the learning and testing samples. Otherwise, we need to select, rather than the optimum split, the next best split that results in the same directions of daughter nodes for both data sets. We would stop further partitioning if no consistent split was available.

Cost-Saving Splitting

The 2nd modification in our RACT algorithm also involves the splitting step. Here, we arranged all the predictor variables into different categories, as in Table 2. The variables in a lower category will be preferred over those in a higher category. Within each category, variables with missing data were ranked after those without missing data. And if a variable had been used in splitting a node, all variables available from the same examination were moved into the lowest category in the subsequent splits because they had no further cost. In each splitting step, we first found the optimum variable and its corresponding best split among all predictor variables based on robust splitting and then compared it to the best splits using variables in the categories below the optimum variable. An equivalence test of entropy at a type I error rate of 10% was used to select lower cost candidates with equivalent entropies (see the appendix for details). We chose the split (with maximum entropy) equivalent to the best one at the lowest cost level as our alternative cost-effective split. The same splitting procedure was applied to daughter nodes, and the tree continued to grow until its full development. With the reduction in sample size for daughter nodes, the power to detect an equivalent split will likely be reduced. However, the impact of cost saving will also be reduced because of the smaller size of the nodes. We then used the same pruning algorithm to determine the final tree.

To facilitate clinical applications of the classification tree, we also examined cutoff values at integers for age or half-integers for BMD T scores as alternative splits and compared them to the optimum split. If the half-integer cutoff values resulted in a split equivalent to the optimum one, we used the half-integer cutoff values in tree construction.
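A minimal sketch of the robust splitting idea, assuming candidate splits have already been ranked by impurity improvement in the learning sample (for example, by the scan sketched earlier); the helper names are hypothetical, and `learn` and `test` are (values, outcomes) pairs for the same predictor in the two samples.

```python
def fracture_rate(values, outcomes, cutoff, side):
    """Observed fracture rate on one side of the split `value <= cutoff`."""
    sel = [y for x, y in zip(values, outcomes) if (x <= cutoff) == (side == "le")]
    return sum(sel) / len(sel) if sel else float("nan")

def is_consistent(learn, test, cutoff):
    """True when the 'value <= cutoff' daughter node is the higher-risk node in BOTH
    the learning and the testing sample (or the lower-risk node in both)."""
    l_hi = fracture_rate(*learn, cutoff, "le") > fracture_rate(*learn, cutoff, "gt")
    t_hi = fracture_rate(*test, cutoff, "le") > fracture_rate(*test, cutoff, "gt")
    return l_hi == t_hi

def robust_best_split(candidates, learn, test):
    """`candidates` is a list of (cutoff, impurity_improvement) pairs computed on the
    learning sample. Return the best candidate whose risk direction is consistent
    across the two samples, or None (stop partitioning) if no consistent split exists."""
    for cutoff, _gain in sorted(candidates, key=lambda c: c[1], reverse=True):
        if is_consistent(learn, test, cutoff):
            return cutoff
    return None
```

In the full RACT algorithm, the consistent splits retained here would then be compared across cost categories with the equivalence test described in the appendix.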


Although the equivalent low-cost split was used in each splitting step in the development of the cost-saving alternative tree, it is still possible that the final classification by the cost-saving tree could be inferior to the robust optimum tree. Thus, it is necessary to further evaluate the equivalence between the 2 rules after the cost-saving alternative tree is constructed. Comparisons of classification rules should focus on sensitivity and specificity, as well as accuracy. For each pair of trees, we summarized the results as 2 × 2 tables, 1 for sensitivity and 1 for specificity. We used a statistical test by Nam43 to determine whether the cost-saving alternative classification had noninferior accuracy to the optimum robust classification. We also used the method of Lu and others44 to evaluate the simultaneous noninferiority in sensitivity and specificity of the cost-saving equivalent tree to the optimum robust tree. The tolerance levels of these noninferiority tests approximated the width of the 95% confidence intervals of the corresponding indices. If the cost-saving alternative tree passes the equivalence test, it can be used in application to replace the optimum robust tree. Otherwise, we would propose use of the optimum robust tree results. We refer to this classification tree as the equivalent tree in the Results section.

Comparisons of Classifications

Comparisons of trees were based on a 6-fold CV approach. Specifically, we randomly divided the women (denoting the whole data set as L) into 6 random sub-data sets, L1, . . . , L6, such that each subset Lv, v = 1, . . . , 6, has approximately the same number of women. Let L(v) = L – Lv be the data that exclude the subset Lv. We used L(v) to construct the robust optimum and noninferior alternative trees (with a tolerance level of 10% for sensitivity and 2% for specificity) based on a 60% learning sample and 40% testing sample approach. These 6 trees may differ from each other. We then applied these trees to the women in Lv, who were not used to generate the trees. By rotating the Lv’s, we obtain an honest prediction for all women. The goal of such analyses is to provide honest comparisons between classification rules. We used the noninferiority test of Nam43 for accuracy and the simultaneous noninferiority test of Lu and others44 for sensitivity and specificity.
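The rotation described above can be organized as in the sketch below, where `fit_tree` and `classify` are hypothetical stand-ins for the RACT construction (including its internal 60/40 learning/testing split) and prediction steps, `X` is a NumPy array of predictors, and `y` holds the fracture indicators. Each woman is classified exactly once, by a tree grown without her data.

```python
import numpy as np

def six_fold_predictions(X, y, fit_tree, classify, seed=0):
    """Assign women to 6 roughly equal subsets L1..L6, grow a tree on L - Lv,
    and classify the held-out women in Lv; returns one honest prediction per woman."""
    rng = np.random.default_rng(seed)
    n = len(y)
    folds = rng.permutation(n) % 6              # random subsets of nearly equal size
    predictions = np.empty(n, dtype=int)        # assumes classify returns integer labels
    for v in range(6):
        hold_out = folds == v
        tree = fit_tree(X[~hold_out], y[~hold_out])     # hypothetical RACT constructor
        predictions[hold_out] = classify(tree, X[hold_out])
    return predictions
```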

RESULTS

Figure 1 demonstrates the robust optimum tree generated by the robust splitting. In this figure, we show, for each node, the number of hip fractures over the number of women in the learning sample, with the corresponding testing sample counts in parentheses.

[Figure 1: tree diagram of the robust optimum classification tree for 5-year hip fracture. Root node 142/4307 (88/2820); the first split is on femoral neck BMD T score ≤ –2.14, with subsequent splits on age, sum degree difficulty for functional status, and walking speed.]

Women whose femoral neck BMD was above a –2.14 T score would have a low risk of hip fracture. Women with lower femoral neck BMD (T scores ≤ –2.14) and age 75 or older would have elevated hip fracture risk. For those with lower femoral neck BMD and age 74 or younger, further examination of functional status would be needed. Women with at least 1 functional difficulty would have elevated fracture risk. For those with normal functional status, a further examination of walking speed (≤0.96 m/s or otherwise) separated women with higher or lower fracture risks. The classification required hip DXA scans for 100% of the women, assessment of functional status for 26% of the women, and walking speed tests for 19% of the women. The variables used in the construction of the classification rule, except age, were not routinely available in outpatient clinics. The overall sensitivity and specificity for 5-year hip fracture were 70.4% and 78.7%, respectively, for the learning samples but declined to 61.4% and 77.2%, respectively, for the testing samples. The corresponding accuracy was 78.5% and 76.7%, respectively, for the 2 samples (Table 3).
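Read as a decision rule, the optimum robust tree just described amounts to the nested conditions below (a sketch transcribed from the text; the text does not state which side of the walking-speed split is high risk, so slower walking is assumed to correspond to the higher-risk node).

```python
def optimum_robust_rule(neck_bmd_t, age, n_functional_difficulties, walking_speed_m_s):
    """High/low 5-year hip fracture risk under the robust optimum tree (sketch of Figure 1)."""
    if neck_bmd_t > -2.14:
        return "low risk"            # femoral neck BMD above the T-score cutoff
    if age >= 75:
        return "high risk"           # low neck BMD and age 75 or older
    if n_functional_difficulties >= 1:
        return "high risk"           # low neck BMD, younger, at least 1 functional difficulty
    # low neck BMD, younger, normal functional status: walking speed decides
    # (assumption: walking speed <= 0.96 m/s is the higher-risk branch)
    return "high risk" if walking_speed_m_s <= 0.96 else "low risk"
```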


Table 3  Statistical Utility and Cost for Classification Rules

Rule | Sensitivity (%)a | Specificity (%) | Accuracy (%) | Cost Other Than a Clinical Visit
Best tree | 70.4 (61.4) | 78.7 (77.2) | 78.5 (76.7) | 100% hip BMD test + 26% functional status test + 19% walking speed test
Equivalent tree | 69.0 (64.8) | 77.8 (76.6) | 77.5 (76.2) | 100% hip BMD test

Note: BMD = bone mineral density.
a. Sensitivity, specificity, and accuracy without parentheses resulted from the learning sample; values within parentheses resulted from the testing sample.

To test for noninferiority and to select a cost-saving alternative tree, we need to determine the tolerance levels for noninferiority.44 Here, the thresholds were the widths of 95% confidence intervals for sensitivity and specificity. For the robust optimum tree, the sensitivity is in the range of 65%, and the corresponding width of the 95% confidence interval is 12% based on 230 fractured women. Similarly, the 95% confidence interval width for the specificity (around 77%) is 2% based on the 6897 women without fracture at the end of their 5-year follow-up.

Figure 2 illustrates the alternative robust and cost-saving equivalent classification tree developed according to the specifications in the Methods section. Femoral neck BMD was the optimum first split variable, for which there were no equivalent low-cost splits available. As a result, all BMD measurements by hip DXA were moved into the lowest cost category in the subsequent splits. Rounded-up cutoff values in BMD T scores or age proved to be equivalent splits and therefore were used. For example, the optimal first split was a femoral neck BMD T score less than or equal to –2.14. The rounded-up cutoff value of a T score of –2 at the femoral neck was found to pass the equivalence selection procedure and was used as the alternative split in the cost-saving equivalent tree. Then, age of 75 or younger was replaced with age of 74 or younger in the daughter nodes of those with lower BMD. Rather than functional status, a total hip BMD T score less than or equal to –2 was used to split the younger women (≤75 years of age) who had lower femoral neck BMD (T score ≤ –2). Those younger women with both neck and total hip BMD below a –2 T score were further asked about height loss since age 25. Women who had lost more than 3 cm since age 25 had elevated fracture risk; if the loss in height was 3 cm or less, body mass index (BMI) was checked. Women with higher BMI had reduced fracture risk compared to those with lower BMI.
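As a quick arithmetic check of the tolerance levels quoted above, the approximate width of a 95% binomial confidence interval, 2 × 1.96 × sqrt(p(1 − p)/n), reproduces the 12% and 2% figures (a sketch only; the exact interval formula used by the authors is not specified).

```python
import math

def ci_width_95(p, n):
    """Approximate width of a 95% Wald confidence interval for a proportion."""
    return 2 * 1.96 * math.sqrt(p * (1 - p) / n)

print(round(ci_width_95(0.65, 230), 3))    # sensitivity near 65%, 230 fractures: about 0.12
print(round(ci_width_95(0.77, 6897), 3))   # specificity near 77%, 6897 nonfractures: about 0.02
```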

[Figure 2: tree diagram. Root node 142/4307 (88/2820). Splits, in order: T score of neck BMD < –2?; Age < 75 yrs?; T score of hip BMD < –2?; Height loss < 3 cm?; Body mass index < 20.4? Terminal nodes TN1–TN6; their counts and fracture incidences are summarized in Table 4.]

Figure 2 The robust and cost-saving classification tree for 5-year hip fractures. The split variables are listed under their corresponding nodes. The corresponding cutoff points for splitting are presented in inequalities. The left and right nodes are always those subjects with “yes” and “no” answers, respectively, to the corresponding inequalities. The numbers of fractured and total subjects in each node are expressed as fractions, with the denominators for the total numbers and the numerators as number of fractures. Results from the generating data are presented without parentheses, and those from the validation data are within parentheses. Terminal nodes (TN) are presented in squares. TNs that had a fracture incidence higher than population incidence are in italic and otherwise in regular font. TNs were numbered for Table 4.
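For reference, the cost-saving equivalent tree of Figure 2 reduces to the nested conditions below (a sketch assembled from the figure and the accompanying text; cutoffs are taken as shown in the figure, and the lower-BMI branch is treated as the higher-risk node, consistent with Table 4).

```python
def cost_saving_rule(neck_bmd_t, total_hip_bmd_t, age, height_loss_cm, bmi):
    """High/low 5-year hip fracture risk under the cost-saving equivalent tree (sketch of Figure 2)."""
    if neck_bmd_t >= -2:
        return "low risk"            # TN6: femoral neck T score above -2
    if age >= 75:
        return "high risk"           # TN5: low neck BMD and age 75 or older
    if total_hip_bmd_t >= -2:
        return "low risk"            # TN4: total hip T score above -2
    if height_loss_cm >= 3:
        return "high risk"           # TN3: height loss of 3 cm or more since age 25
    if bmi < 20.4:
        return "high risk"           # TN1: low body mass index
    return "low risk"                # TN2
```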

Overall, the alternative classification rule requires 100% of women to have hip DXA examinations. However, all other variables used in the construction of the tree (age, height loss, and BMI) are easily available in routine clinical examinations. The fracture incidences in the terminal nodes for this cost-saving equivalent RACT are presented in Table 4. The resulting sensitivity and specificity were 69.0% and 77.8%, respectively, for the learning samples and 64.8% and 76.6%, respectively, for the testing samples. Sensitivity and specificity were noninferior to the robust optimum classification, with a tolerable level of 10% for sensitivity and 2% for specificity at a significance level of 5% (P = 0.028), based on the testing samples.


Table 4  Summary of Terminal Nodes for the Robust and Cost-Saving Equivalent Treea

Measure | TN1 (high risk) | TN3 (high risk) | TN5 (high risk) | TN2 (low risk) | TN4 (low risk) | TN6 (low risk) | Total
Number of women, learning sample | 61 | 504 | 456 | 310 | 541 | 2435 | 4307
Number of women, testing sample | 48 | 340 | 309 | 202 | 350 | 1571 | 2820
Percentage of women, learning sample | 1.4 | 11.7 | 10.6 | 7.2 | 12.6 | 56.5 | 100
Percentage of women, testing sample | 1.7 | 12.0 | 11.0 | 7.2 | 12.4 | 55.7 | 100
Number of fractures, learning sample | 5 | 34 | 59 | 5 | 11 | 28 | 142
Number of fractures, testing sample | 3 | 25 | 29 | 7 | 12 | 12 | 88
Incidence (%), learning sample | 8.2 | 6.7 | 12.9 | 1.6 | 2.0 | 1.1 | 3.3
Incidence (%), testing sample | 6.3 | 7.4 | 9.4 | 3.5 | 3.4 | 0.8 | 3.1

a. Terminal node numbers are described in Figure 2.
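The overall sensitivity and specificity reported for the cost-saving tree follow directly from the terminal-node counts in Table 4, as the short sketch below illustrates for the learning sample (high-risk terminal nodes are TN1, TN3, and TN5).

```python
# (fractures, women) in each terminal node of the cost-saving tree, learning sample (Table 4)
high_risk = {"TN1": (5, 61), "TN3": (34, 504), "TN5": (59, 456)}
low_risk = {"TN2": (5, 310), "TN4": (11, 541), "TN6": (28, 2435)}

total_fractures = sum(f for f, _ in (*high_risk.values(), *low_risk.values()))
total_women = sum(n for _, n in (*high_risk.values(), *low_risk.values()))

sensitivity = sum(f for f, _ in high_risk.values()) / total_fractures
specificity = sum(n - f for f, n in low_risk.values()) / (total_women - total_fractures)
print(round(sensitivity, 3), round(specificity, 3))   # about 0.690 and 0.778, as reported
```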

Table 5  Classification Results for Hip Fracture Risk Using 2 Trees

Fractured women | A = H | A = L | Total
B = H | 132 | 8 | 140
B = L | 7 | 83 | 90
Total | 139 | 91 | 230

Nonfractured women | A = H | A = L | Total
B = H | 1483 | 107 | 1590
B = L | 64 | 5243 | 5307
Total | 1547 | 5350 | 6897

Note: A represents the equivalent tree and B represents the optimum robust tree; H and L denote tree-predicted high and low risk of hip fracture, respectively. The data are summarized over the 6-fold cross-validation (7127 women in total).

Similarly, the 76.2% accuracy was noninferior to that of the robust optimum tree, with an allowable difference of 2% at a significance level of 5% (P = 0.018). Six-fold CV results for the comparison of the robust optimum tree to the cost-saving tree are summarized in Table 5. The resulting sensitivity and specificity are 60.9% and 76.9% for the optimum tree and 60.4% and 77.6% for the alternative tree, respectively. Under tolerance levels of 10% for sensitivity and 2% for specificity, the alternative tree proved to be noninferior to the optimum tree (P < 0.0001).
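A simplified numerical check of the paired comparison in Table 5 is sketched below; it uses an asymptotic Wald-type test of a difference in paired proportions against the tolerance margin, not the exact procedures of Nam43 or Lu and others44, so the P values are only illustrative.

```python
import math

def paired_noninferiority_z(n, only_ref_correct, only_new_correct, margin):
    """One-sided Wald-type z statistic for H1: p_new - p_ref > -margin, where both rules
    are applied to the same n subjects and only the discordant counts matter."""
    diff = (only_new_correct - only_ref_correct) / n
    var = (only_ref_correct + only_new_correct
           - (only_ref_correct - only_new_correct) ** 2 / n) / n ** 2
    z = (diff + margin) / math.sqrt(var)
    p = 0.5 * math.erfc(z / math.sqrt(2))     # one-sided P value
    return z, p

# Counts read from Table 5 (A = equivalent tree, B = optimum robust tree).
# Fractured women (sensitivity, 10% tolerance): B correct/A wrong = 8, A correct/B wrong = 7.
print(paired_noninferiority_z(230, 8, 7, 0.10))
# Nonfractured women (specificity, 2% tolerance): B correct/A wrong = 64, A correct/B wrong = 107.
print(paired_noninferiority_z(6897, 64, 107, 0.02))
```

Both z statistics are large and both P values are far below 0.0001, in line with the noninferiority conclusion reported above.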

DISCUSSION

The RACT method proposed in this article can provide near-optimum and yet economical classification rules. In our example, its performance based on a single BMD measurement and other information from a routine clinical visit is noninferior to a more extensive classification that also requires functional status assessment for 26% of women as well as walking speed tests for 19%. Our optimum robust classification tree used 4 of the 43 variables for classification of subjects with elevated risk of hip fracture. Compared to the cost-saving equivalent tree, the optimum classification had only slightly higher sensitivity for the testing samples. This advantage was also maintained in our 6-fold CV comparisons, which reflect “honest” comparisons of the 2 methods.

Current conventional RP methods such as CART and MASAL cannot search for alternative cost-saving splits. The closest idea in CART to our alternative tree is the competing splits, which are the next most efficient splits. These competing splits are not ordered by their cost or availability, and there is no equivalence evaluation built into CART to determine whether the competing splits are indeed noninferior to the optimum splits. CART provides an option for assigning a penalty to each splitting variable so that it is possible to select a relatively cheap splitting variable. However, the penalty is uniformly applied to the whole tree. Although we are looking for cost-saving alternative splits, we still want to maintain the classification efficiency.


With the penalty function approach, it is possible to select a split that is inferior to the optimal one. Furthermore, the penalty cannot be altered during tree construction, even if a test required by a prior split is virtually free for the subsequent splits.

In the appendix, we provide a new statistical test for evaluating differences in improvement of impurity for splitting steps. Whether an alternative split is significantly different from the optimum split depends not only on the significance level used but also on the sample size of the nodes. The higher the significance level is set, the less likely it is that an alternative candidate will be found. Similarly, the smaller the sample size, the less statistical power to detect the differences. As discussed in the Methods section, the significance level of this test remains the same in our algorithm for the comparisons at each split and does not depend on the size of the node. The advantage of this option is that it keeps the algorithm simple. The test is used only to guide selection of splits and is by no means the determination of noninferiority. First, the null hypothesis for this test is equivalence of entropy rather than inferiority. Because the hypothesis-testing paradigm protects the null hypothesis, it is more difficult to keep the optimum splits and easier to use alternative low-cost splits for the lower nodes. However, such impact is limited because fewer women are affected. In addition, we have to test for the overall noninferiority of the tree for sensitivity and specificity simultaneously, with inferiority as the proper null hypothesis and noninferiority as the alternative hypothesis. Only noninferiority in both parameters will warrant the choice of the alternative cost-saving tree. In this article, we used a significance level of 10%, which was arbitrarily selected. If the final alternative tree is inferior to the robust optimum tree, it is possible to arbitrarily select a higher threshold to find a different alternative tree and to test for its noninferiority. If there are no noninferior alternatives, we recommend using the original optimum tree.

Although RACT looks for a cost-saving alternative classification, the cost is the secondary aim, and the optimum classification is still the first priority. The algorithm seeks only economic splits to replace the optimum splits. If the economic splits would result in lower classification efficiency, the more expensive optimum classification will be retained. Furthermore, we avoided the use of the exact dollar cost of a predictive variable in this method. It is difficult to assign a dollar value to every predictive variable and to convert availability into dollars. Ranking the variables into cost categories (or preference categories) based on clinicians’ suggestions avoids the determination of exact dollar amounts and may create a more stable cost structure. Further science-based research is necessary to develop guidelines on ranking the variables, which will depend on clinical issues. In our case, our clinician coauthors have many years of experience in evaluating patients and understanding the availability/cost of these variables. Their consensus led to the ranking of the variables, based on the reasons specified in the Methods section.

RP algorithms generate classification rules based on binary splits. They are more acceptable to clinicians than individual risk prediction using traditional regression models. Tree models have the advantage of identifying complex interactions without any functional form of the predictors. They can deal with missing data flexibly. However, trees have the disadvantages of not using continuous variables effectively and of overfitting in 3 directions: searching for best predictors, searching for best splits, and searching multiple times.29 Furthermore, the splitting rule for continuous variables requires a monotone relationship between fracture risk and the variables. This monotone requirement can be avoided in linear regression analysis with polynomial terms, which are more difficult to handle with CART. Although hip fracture risk is not incremental by certain cutoff values and may be better described as a continuum of risk factors such as BMD,5 RP algorithms provide optimum decision thresholds for those risk factors for practical use. RP algorithms accommodate continuous risk factors by using multiple cutoff values. Although we classified women in this study into high and low risk of hip fracture, their individual risk still varies within each classification, as indicated in Table 4. Thus, receiver operating characteristic (ROC) curves should be a better measure for the equivalence of 2 classification rules.45 The statistical test for noninferiority of ROC curves is still under development. Once we have such a test, we can compare the noninferiority of 2 trees based on the ROC curve of the terminal nodes.

Our proposed method modifies the conventional testing sample approach of recursive partitioning methods. For large sample studies such as the SOF, the testing sample approach is “honest and efficient” according to Breiman and others.23(p12) An alternative tree size determination is the CV approach. In principle, the CV approach of CART may reduce the possibility of reversed directions in the middle splits. However, CV determines only the tree size and will not affect the choices of splitting variables and cutoff values. Our idea about consistent directions in learning and testing samples is similar to requiring plausible signs for the coefficients of predictors in logistic regression models.42


The most original idea in this article is, however, the use of equivalent splits, that is, basing the tree on consideration of the cost of predictors. This will greatly simplify the diagnostic procedure and reduce the expense of diagnosis, if it works. The SOF data give a successful example of the application of RACT. Although the cost-saving step of the algorithm should be applicable to the CV approach, the programming aspect of such an approach has been difficult and needs further development.

Our cost-saving classification rule requires only a hip DXA scan for women aged 65 and older. There was no equivalent alternative to femoral neck BMD for the root node split in our cost-saving classification tree. The only other factors needed for classification are current age, height loss from age 25, and BMI. Therefore, our rule may be used in general clinical practice. We must admit that the tree structure described here is still too complicated for general application. Some form of summarization, such as a scoring system, could be developed for clinician use.46

This article presents for the first time a comprehensive multivariate analysis of hip fracture based on 43 variables, including patient demographic data, clinical BMD measured by DXA and SXA, medical history, X-ray readings for prevalent fracture and vertebral heights, functional status, vision test results, nutrition, and so forth. Most of these variables have been identified as significant predictors of hip fracture risk after adjustment for age and neck BMD in the literature, although they have not been simultaneously tested in one model. Although they may reflect different aspects of osteoporosis and aging and may help in understanding the etiology of osteoporosis and hip fracture, only a few of them are unrelated predictors and are necessary to identify subjects with elevated fracture risk. Our stepwise logistic regression, which is not presented in this article, identified 12 linearly unassociated risk factors. All variables selected in the tree constructions were among these risk factors. Because of the low incidence of hip fracture, fitting 43 variables to the given data may lead to overfitting, and the classification algorithm could be less efficient when applied to new data.39

We used a 6-fold CV design to compare the optimum robust and cost-saving equivalent trees. This comparison is different from our suggestion of a noninferiority test for sensitivity and specificity in the testing sample after tree construction. The testing sample has been used to help select splits in our approach and is used to determine the tree size. In contrast, the 6-fold CV design gives an honest estimate of the application of a tree to new data. Furthermore, the 6-fold CV uses all the women in the study and therefore gives a much more solid impression of the quality of the alternative approaches.

Our study has limitations. First, we focused on 5-year hip fracture risk. Although hip fracture is the most severe outcome of osteoporosis in terms of cost, morbidity, and mortality, there are important osteoporotic fractures at other sites.47 Therefore, prevention of osteoporotic fractures at other sites is also important. We will apply our method to vertebral fracture in a separate data analysis. Second, the SOF studied only white women aged 65 and older. Although our RP method is applicable to all race/ethnic, gender, and age groups, our classification rules are limited to this specific population only. Third, our outcome is 5-year hip fracture. As a result, we excluded women who did not have hip fracture and were lost to follow-up before 5 years because of their unknown 5-year fracture status. As shown in Table 1, women excluded because they were lost to follow-up were older and had lower weight. Although the absolute differences were small and the small P values were mainly the result of large sample sizes, the exclusion could potentially introduce bias into the analysis and reduce its efficiency. The SOF has continued for more than a decade, and the time to hip fracture might be a more appropriate endpoint. We are working on a cost-saving RP algorithm for survival data and will report our results in the future. Fourth, we focused only on sensitivity and specificity when comparing the alternative cost-saving classification rule to the robust optimum one. Comparison would be facilitated if a more general, sophisticated measure of performance, the area under the ROC curve,45 were used. An equivalence or noninferiority test for 2 classification rules based on the area under the ROC curve is needed. Fifth, although the SOF is considered the largest epidemiological study of osteoporosis in the United States, because of the low incidence of hip fracture, we still have a risk of overfitting our model based on 43 predictors, even though they were not all linearly unassociated predictors.

In conclusion, this article presents a modified recursive partitioning method, the RACT analysis. By applying this method to the SOF data, we constructed a classification rule that uses age, femoral neck BMD, height loss, and BMI. Both the sensitivity and specificity of the rule were around 70%, and they were statistically noninferior to those of the robust optimum tree based on 6-fold CV. This article suggests that our modified CART algorithm may be useful in clinical application and, more important, that a DXA hip scan and variables available from outpatient clinical visits (age, height loss, and BMI) can identify subjects at high risk of osteoporotic hip fracture within 5 years without loss of efficiency compared to more costly and/or complicated algorithms.

APPENDIX

Statistical Test to Compare 2 Splits

Let $D$ be the indicator of disease status, with $D = 1$ for diseased and $D = 0$ for normal subjects; let $S_1$ and $S_2$ be the best and alternative splits, respectively, with 1 for a positive and 0 for a negative result. We want to compare the impurity function between $S_1$ and $S_2$. Table A1 summarizes the probability distribution, where $n_{djk}$ represents the number of study subjects with disease status $d$ and split results $(j, k)$, and $P_{djk}$ are the corresponding probabilities. A "+" sign is used to denote summation over the corresponding subscript.

Table A1  Layout for the Joint Split Results of S1 and S2 (counts, with probabilities in parentheses)

D = 1 | S2 = 1 | S2 = 0 | Total
S1 = 1 | n111 (P111) | n110 (P110) | n11+ (P11+)
S1 = 0 | n101 (P101) | n100 (P100) | n10+ (P10+)
Total | n1+1 (P1+1) | n1+0 (P1+0) |

D = 0 | S2 = 1 | S2 = 0 | Total
S1 = 1 | n011 (P011) | n010 (P010) | n01+ (P01+)
S1 = 0 | n001 (P001) | n000 (P000) | n00+ (P00+)
Total | n0+1 (P0+1) | n0+0 (P0+0) |

The overall total is $n$ (probability 1.00).

The weighted node entropy impurity of the best split $S_1$ is defined by
$$
E_1 = P_{+1+}\left(-\frac{P_{11+}}{P_{+1+}}\log\frac{P_{11+}}{P_{+1+}} - \frac{P_{01+}}{P_{+1+}}\log\frac{P_{01+}}{P_{+1+}}\right) + P_{+0+}\left(-\frac{P_{10+}}{P_{+0+}}\log\frac{P_{10+}}{P_{+0+}} - \frac{P_{00+}}{P_{+0+}}\log\frac{P_{00+}}{P_{+0+}}\right)
$$
$$
= P_{+1+}\log P_{+1+} - P_{11+}\log P_{11+} - P_{01+}\log P_{01+} + P_{+0+}\log P_{+0+} - P_{10+}\log P_{10+} - P_{00+}\log P_{00+},
$$
and the weighted entropy of $S_2$ is
$$
E_2 = P_{++1}\log P_{++1} - P_{1+1}\log P_{1+1} - P_{0+1}\log P_{0+1} + P_{++0}\log P_{++0} - P_{1+0}\log P_{1+0} - P_{0+0}\log P_{0+0}.
$$

The difference in entropy between the 2 splits is $T = E_1 - E_2$, which can be estimated by $\hat{T} = \hat{E}_1 - \hat{E}_2$, where $\hat{E}_1$ and $\hat{E}_2$ are obtained from the expressions above with $\hat{P}_{ijk} = n_{ijk}/n$ and the corresponding marginal sums (a "+" again indicates the sum over the corresponding index). Based on the delta method, the asymptotic variance of $\hat{T}$ is
$$
\operatorname{var}(\hat{T}) = \sum_{(i,j,k)\neq(0,0,0)} \left(\frac{\partial T}{\partial P_{ijk}}\right)^{2} \operatorname{var}(\hat{P}_{ijk}) + \sum_{(i_1,j_1,k_1)\neq(i_2,j_2,k_2)} \frac{\partial T}{\partial P_{i_1 j_1 k_1}}\,\frac{\partial T}{\partial P_{i_2 j_2 k_2}}\, \operatorname{cov}(\hat{P}_{i_1 j_1 k_1},\hat{P}_{i_2 j_2 k_2}),
$$
where
$$
\frac{\partial T}{\partial P_{111}} = \log\frac{P_{+1+}P_{00+}}{P_{+0+}P_{11+}} - \log\frac{P_{++1}P_{0+0}}{P_{++0}P_{1+1}}, \qquad
\frac{\partial T}{\partial P_{011}} = \log\frac{P_{+1+}P_{00+}}{P_{+0+}P_{01+}} - \log\frac{P_{++1}P_{0+0}}{P_{++0}P_{0+1}},
$$
$$
\frac{\partial T}{\partial P_{110}} = \log\frac{P_{+1+}P_{00+}}{P_{+0+}P_{11+}} - \log\frac{P_{0+0}}{P_{1+0}}, \qquad
\frac{\partial T}{\partial P_{101}} = \log\frac{P_{00+}}{P_{10+}} - \log\frac{P_{++1}P_{0+0}}{P_{++0}P_{1+1}},
$$
$$
\frac{\partial T}{\partial P_{100}} = \log\frac{P_{00+}}{P_{10+}} - \log\frac{P_{0+0}}{P_{1+0}}, \qquad
\frac{\partial T}{\partial P_{010}} = \log\frac{P_{+1+}P_{00+}}{P_{+0+}P_{01+}}, \qquad
\frac{\partial T}{\partial P_{001}} = -\log\frac{P_{++1}P_{0+0}}{P_{++0}P_{0+1}},
$$
$$
\operatorname{var}(\hat{P}_{ijk}) = \frac{P_{ijk}(1-P_{ijk})}{n}, \quad \text{and} \quad \operatorname{cov}(\hat{P}_{i_1 j_1 k_1},\hat{P}_{i_2 j_2 k_2}) = -\frac{P_{i_1 j_1 k_1}P_{i_2 j_2 k_2}}{n}.
$$

We accept the null hypothesis that $E_1$ equals $E_2$ if the $P$ value of the test statistic $\hat{T}$, standardized by its estimated standard error, is greater than the specified $\alpha$.
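The appendix test can be reproduced numerically, as in the sketch below: it computes the entropy difference from a 2 × 2 × 2 table of counts and obtains the delta-method standard error with a numerical gradient and the multinomial covariance of the cell proportions (equivalent to the analytic partial derivatives above). The example counts at the end are hypothetical and serve only to show the call.

```python
import numpy as np

def xlogx(q):
    """q * log(q) with the convention 0 * log 0 = 0."""
    q = np.asarray(q, dtype=float)
    return np.where(q > 0, q * np.log(np.where(q > 0, q, 1.0)), 0.0)

def entropy_difference(p):
    """T = E1 - E2 for a 2x2x2 probability array p[d, j, k]:
    d = disease (1/0), j = result of the best split S1, k = result of the alternative S2."""
    pj = p.sum(axis=(0, 2))       # P_{+j+}
    pdj = p.sum(axis=2)           # P_{dj+}
    pk = p.sum(axis=(0, 1))       # P_{++k}
    pdk = p.sum(axis=1)           # P_{d+k}
    e1 = xlogx(pj).sum() - xlogx(pdj).sum()
    e2 = xlogx(pk).sum() - xlogx(pdk).sum()
    return e1 - e2

def delta_method_test(counts):
    """Estimate T and its delta-method standard error from a 2x2x2 count array."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    p = counts / n
    t_hat = entropy_difference(p)
    eps = 1e-6
    grad = np.zeros_like(p)
    for idx in np.ndindex(p.shape):      # numerical partial derivatives of T
        bump = np.zeros_like(p)
        bump[idx] = eps
        grad[idx] = (entropy_difference(p + bump) - entropy_difference(p - bump)) / (2 * eps)
    g, q = grad.ravel(), p.ravel()
    cov = (np.diag(q) - np.outer(q, q)) / n      # multinomial covariance of the cell proportions
    se = float(np.sqrt(g @ cov @ g))
    return t_hat, se

# Hypothetical node counts[d, j, k] purely for illustration of the call.
example = np.array([[[200, 150], [120, 370]], [[30, 10], [8, 12]]])
t, se = delta_method_test(example)
print(t, se, t / se)      # Wald statistic for testing E1 = E2
```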


REFERENCES

1. World Health Organization. Assessment of fracture risk and its application to screening for postmenopausal osteoporosis: report of a WHO study group. WHO Technical Report Series, Report No. 843. Geneva (Switzerland): World Health Organization; 1994.
2. Cummings SR, Melton LJ. Epidemiology and outcomes of osteoporotic fractures. Lancet. 2002;359:1761–7.
3. Wiktorowicz M, Goeree R, Papaioannou A, Adachi J, Papadimitropoulos E. Economic implications of hip fracture: health service use, institutional care and cost in Canada. Osteoporos Int. 2001;12(4):271–8.
4. Haentjens P, Autier P, Barette M, Boonen S. The economic cost of hip fractures among elderly women: a one-year, prospective, observational cohort study with matched-pair analysis. Belgian Hip Fracture Study Group. J Bone Joint Surg. 2001;83-A(4):493–500.
5. Cummings S, Nevitt M, Browner W, et al. Risk factors for hip fracture in white women. Study of Osteoporotic Fractures Research Group. N Engl J Med. 1995;332:767–73.
6. Genant H, Engelke K, Fuerst T, et al. Noninvasive assessment of bone mineral and structure: state of the art. J Bone Miner Res. 1996;11(6):707–30.
7. Cummings S, Browner W, Bauer D, et al. Endogenous hormones and the risk of hip and vertebral fractures among older women. Study of Osteoporotic Fractures Research Group. N Engl J Med. 1998;339(1):767–8.
8. Bauer DC, Sklarin PM, Stone KL, et al. Biochemical markers of bone turnover and prediction of hip bone loss in older women: the study of osteoporotic fractures. J Bone Miner Res. 1999;14(8):1404–10.
9. Arden N, Nevitt M, Lane N, et al. Osteoarthritis and risk of falls, rates of bone loss, and osteoporotic fractures. Study of Osteoporotic Fractures Research Group. Arthritis Rheum. 1999;42(7):1378–85.
10. Sellmeyer D, Stone K, Sebastian A, Cummings S. A high ratio of dietary animal to vegetable protein increases the rate of bone loss and the risk of fracture in postmenopausal women. Study of Osteoporotic Fractures Research Group. Am J Clin Nutr. 2001;73(1):118–22.
11. Bergot C, Bousson V, Meunier A, Laval-Jeantet M, Laredo J. Hip fracture risk and proximal femur geometry from DXA scans. Osteoporos Int. 2002;13:542–50.
12. Cummings S, Bates D, Black DM. Clinical use of bone densitometry: scientific review. JAMA. 2002;288(15):1889–97.
13. Bates D, Black D, Cummings S. Clinical use of bone densitometry: clinical applications. JAMA. 2002;288(15):1898–900.
14. Black D, Cummings S, Karpf D, et al. Randomised trial of effect of alendronate on risk of fracture in women with existing vertebral fractures. Fracture Intervention Trial Research Group. Lancet. 1996;348:1535–41.
15. Cummings SR, Black DM, Thompson DE, et al. Effect of alendronate on risk of fracture in women with low bone density but without vertebral fractures. JAMA. 1998;280(24):2077–82.
16. Harris S, Watts N, Genant H, et al. Effects of risedronate treatment on vertebral and nonvertebral fractures in women with postmenopausal osteoporosis: a randomized controlled trial. Vertebral Efficacy with Risedronate Therapy (VERT) Study Group. JAMA. 1999;282(14):1344–52.
17. Ettinger B, Black D, Mitlak B, et al. Reduction of vertebral fracture risk in postmenopausal women with osteoporosis treated with raloxifene: results from a 3-year randomized clinical trial. JAMA. 1999;282:637–45.
18. McClung M, Geusens P, Miller P, et al. Effect of risedronate on the risk of hip fracture in elderly women. Hip Intervention Program Study Group. N Engl J Med. 2001;344(5):333–40.
19. Barrett-Connor E, Gore R, Browner WS, Cummings SR. Prevention of osteoporotic hip fracture: global versus high-risk strategies. Osteoporos Int. 1998;8 Suppl 1:S2–S7.
20. Black D, Steinbuch M, Palermo L, et al. An assessment tool for predicting fracture risk in postmenopausal women. Osteoporos Int. 2001;12(7):519–28.
21. Dargent-Molina P, Douchin M, Cormier C, Meunier P, Breart G. Use of clinical risk factors in elderly women with low bone mineral density to identify women at higher risk of hip fracture: the EPIDOS prospective study. Osteoporos Int. 2002;13:593–9.
22. Miller P, Siris E, Barrett-Connor E, et al. Prediction of fracture risk in postmenopausal white women with peripheral bone densitometry: evidence from the National Osteoporosis Risk Assessment. J Bone Miner Res. 2002;17(12):2222–30.
23. Breiman L, Friedman J, Olshen R, Stone C. Classification and Regression Trees. Monterey (CA): Wadsworth & Brooks/Cole; 1984.
24. Segal M. Regression trees for censored data. Biometrics. 1988;44:35–47.
25. Stitt FW, Lu Y, Dickinson GM, Klimas NG. Automated severity classification of AIDS hospitalizations. Med Decis Making. 1991;11(4 Suppl):S41–5.
26. Sevin B-U, Lu Y, Nadji MN, Bloch D, Koechli OR, Averette HA. Surgically defined prognostic parameters in early cervical carcinoma: a tree structured survival analysis. Cancer. 1996;78(4):1438–46.
27. Lu Y, Black D, Mathur AK, Genant HK. Study of hip fracture risk using tree structured survival analysis. J Mineral Stoffwechsel. 2003;10(1):11–6.
28. Zhang H, Singer B. Recursive Partitioning in the Health Sciences. New York: Springer-Verlag; 1999.
29. Harrell FE Jr. Regression Modeling Strategies. New York: Springer; 2001.
30. Morgan J, Sonquist JA. Problems in the analysis of survey data, and a proposal. J Am Stat Assoc. 1963;58:415–34.
31. Loh W-Y, Vanichsetakul N. Tree-structured classification via generalized discriminant analysis. J Am Stat Assoc. 1988;83:715–28.
32. Lim T-S, Loh W-Y, Shih Y-S. A comparison of prediction accuracy, complexity, and training time of thirty-three old and new classification algorithms. Machine Learning. 2000;40:203–29.
33. Loh W-Y, Shih Y-S. Split selection methods for classification trees. Statistica Sinica. 1997;7:815–40.
34. Quinlan J. Improved use of continuous attributes in C4.5. Journal of Artificial Intelligence Research. 1996;4:77–90.
35. Quinlan JR. C4.5: Programs for Machine Learning. San Mateo (CA): Morgan Kaufmann; 1993.
36. Zhang H. Multivariate adaptive splines for longitudinal data. Journal of Computational and Graphical Statistics. 1997;6:74–91.
37. Zhang H. Analysis of infant growth curves using MASAL. Biometrics. 1999;55:452–9.
38. Cummings S, Black D, Nevitt M, et al. Bone density at various sites for prediction of hip fractures: the study of osteoporotic fractures. Lancet. 1993;341:72–5.
39. Steyerberg E, Bleeker S, Moll H, Grobbee D, Moons K. Internal and external validation of predictive models: a simulation study of bias and precision in small samples. J Clin Epidemiol. 2003;56:441–7.
40. Lu Y, Fuerst T, Hui S, Genant HK. Standardization of bone mineral density at femoral neck, trochanter and Ward’s triangle. Osteoporos Int. 2001;12(6):438–44.
41. Benish WA. Relative entropy as a measure of diagnostic information. Med Decis Making. 1999;19(2):202–6.
42. Steyerberg E, Eijkemans M, Harrell F Jr, Habbema J. Prognostic modeling with logistic regression analysis: in search of a sensible strategy in small data sets. Med Decis Making. 2001;21(1):45–56.
43. Nam JM. Establishing equivalence of two treatments and sample size requirements in matched-pairs design. Biometrics. 1997;53(4):1422–30.
44. Lu Y, Jin H, Genant HK. On the non-inferiority of a diagnostic test based on paired observations. Stat Med. 2003;22(10):3029–44.
45. Hanley J, McNeil B. The meaning and use of the area under the receiver operating characteristic (ROC) curve. Radiology. 1982;143:29–36.
46. Lydick E, Cook K, Turpin J, Melton M, Stine R, Byrnes C. Development and validation of a simple questionnaire to facilitate identification of women likely to have low bone density. Am J Manag Care. 1998;4(1):37–48.
47. Nevitt M, Ettinger B, Black D, Stone K, Jamal S, Ensrud K. The association of radiographically detected vertebral fractures with back pain and function: a prospective study. Ann Intern Med. 1998;128(10):793–800.
