Cardiovascular risk assessment in athletes by means of statistical learning
D. Barbieri & L. Zaccagni, University of Ferrara, Italy
Cardiovascular diseases
In 2012, CVDs were the leading cause of deaths from NCDs (17.5 M)
Davide Barbieri
IUAES 2016 Dubrovnik
Prevalence
Deaths per 100k people: top 5 and bottom 5 countries (year 2012)

Country              Both sexes   Female   Male
Turkmenistan                712      618    820
Kazakhstan                  635      515    808
Mongolia                    586      483    723
Uzbekistan                  577      509    656
Kyrgyzstan                  549      462    660
Republic of Korea            92       76    112
Canada                       88       68    112
Israel                       86       70    105
France                       85       65    111
Japan                        82       58    108

Stat    Both     F     M
Mean     271   242   306
SD       124   115   149
Max      712   618   820
Min       82    58   105

Source: WHO web site, http://apps.who.int/gho/data/node.main.A865CARDIOVASCULAR?lang=en, accessed 15 Apr 2016
Causes and consequences
• Smoking, physical inactivity, unhealthy diet and alcohol
• Medical care of CVDs is expensive: conflict of interest between physicians (defensive medicine) and public administration (spending reviews)
• In developed countries, lower socioeconomic groups have a greater prevalence of risk factors and higher mortality
• In developing countries, as the prevalence of CVDs increases, the burden will shift to the lower socioeconomic groups
Source: WHO, http://www.who.int/cardiovascular_diseases/prevention_control/en/, accessed 15 Apr 2016
Purposes
• To predict the risk of CVDs in an active population, minimizing false alarms and false negatives
• To optimize public spending in medical care (sustainable health care)
• CVDs are quite rare among athletes who train consistently
• Still, these subjects may be at risk because of repeated and intense efforts
• Athletes are routinely monitored by means of ECG during sport medical examinations
• In case of a positive ECG, they are warned against intense sport practice
Sample
• 33,126 Croatian athletes, both sexes, 4-69 years old
• Data collected at the Sport Policlinic in Zagreb:
  • Sex
  • Age
  • Weight
  • Height
  • Pulse rate
  • Blood pressure (systolic and diastolic)
• ECG outcome: P (≈9%) or N (≈91%)
Methods
• Binary classification by means of statistical learning
• Logistic regression (STATA) & data mining (WEKA)
• Cross-validation
• Height and weight (correlated) replaced by BMI = weight/height²
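
The BMI substitution and the cross-validation split can be illustrated with a minimal Python sketch. This is not the STATA/WEKA workflow used in the study: the function names are hypothetical, the units (kilograms and metres) are an assumption since the slides do not state them, and the split is a plain, unstratified k-fold.

```python
def bmi(weight_kg, height_m):
    """Body mass index: weight divided by squared height.

    Units (kg, m) are an assumption; the slides do not state them.
    """
    if height_m <= 0:
        raise ValueError("height must be positive")
    return weight_kg / height_m ** 2


def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists for plain k-fold cross-validation.

    Each sample lands in exactly one test fold; fold sizes differ by at most 1.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


if __name__ == "__main__":
    print(round(bmi(70.0, 1.75), 1))          # one illustrative athlete
    for train, test in k_fold_indices(10, 3):
        print(len(train), test)
```

With a rare positive class (about 9% here), a stratified split that keeps the P/N ratio in every fold would likely be preferable; the plain version is shown only for brevity.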
Classification issues
• Accuracy (correct guesses / total) not appropriate
• ROC and Youden index J = TPR + TNR - 1
• FN (i.e. unpredicted death risk) has higher cost than FP (extra ECG)
• Sensitivity may be more important than specificity: weighted J?

      At risk   Not at risk
P     TP        FN
N     FP        TN
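
A short Python sketch may help show why accuracy is misleading with a 9%/91% class split and how J behaves instead. The counts below are illustrative, not study data, and the function names are hypothetical. The weighted form, J_w = 2(w·TPR + (1-w)·TNR) - 1, is one possible way to weight sensitivity more heavily and is an assumption here; the slide only raises the question.

```python
def classification_metrics(tp, fn, fp, tn):
    """Accuracy, sensitivity (TPR), specificity (TNR) and Youden's J
    from the four cells of a confusion matrix."""
    total = tp + fn + fp + tn
    tpr = tp / (tp + fn)          # share of true positives that are caught
    tnr = tn / (tn + fp)          # share of true negatives that are cleared
    return {
        "accuracy": (tp + tn) / total,
        "tpr": tpr,
        "tnr": tnr,
        "youden_j": tpr + tnr - 1,
    }


def weighted_youden(tpr, tnr, w=0.5):
    """Assumed weighted variant: 2*(w*TPR + (1-w)*TNR) - 1.

    w = 0.5 gives back the ordinary J; w > 0.5 favours sensitivity.
    """
    return 2 * (w * tpr + (1 - w) * tnr) - 1


if __name__ == "__main__":
    # Illustrative: 1,000 athletes, 9% positive, classifier always answers N.
    always_negative = classification_metrics(tp=0, fn=90, fp=0, tn=910)
    print(always_negative)   # high accuracy, yet J is zero
```

A classifier that never flags anyone reaches 91% accuracy on such a sample while missing every at-risk athlete, which is why J (or a sensitivity-weighted variant) is the more informative criterion.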
Logistic regression
First results: ROC
• Highly significant and very good fit
• Still, not predictive: AUC = 0.55
• Collected data not meaningful?
• Logistic regression not suitable?
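
To put AUC = 0.55 in context: the AUC can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one, so 0.5 corresponds to chance. The following Python sketch computes it by direct pairwise comparison; the function name and the toy scores are hypothetical, and the quadratic loop is meant for illustration rather than for a sample of 33,126.

```python
def auc(scores_pos, scores_neg):
    """Area under the ROC curve via pairwise comparison.

    Counts the fraction of (positive, negative) pairs in which the
    positive case has the higher score; ties count as half a win.
    """
    if not scores_pos or not scores_neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


if __name__ == "__main__":
    # Toy scores, not study data.
    print(auc([0.9, 0.8], [0.1, 0.2]))   # perfectly separated classes
    print(auc([0.1, 0.9], [0.5]))        # no better than chance
```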
Exploratory data analysis
• J. Tukey (1977)
• If causes are tobacco, excessive body weight and lack of physical activity, collected variables should be informative (BMI, blood pressure, pulse rate)
• Some biomedical variables, like pulse rate, may be abnormal at both low and high values, unlike blood pressure, for example, which increases risk only as it rises
• Thresholds are available from medical literature
• Still, we adopted a data-driven approach in order to find thresholds inductively in our sample
Risk as a function of pulse rate and blood pressure
[Figure: two panels, "Risk as a function of pulse rate" and "Risk as a function of blood pressure"; x-axis categories in each panel: low, normal, high]
Data mining
• Oversampling using SMOTE (Chawla et al. 2002) and undersampling were applied in order to balance the training data set and improve sensitivity
• A filtered, rule-based classifier (OneR) was trained in order to find optimal cut-off values
• Two thresholds (LT and HT) were found:
  • If pulse rate < LT then P
  • If pulse rate > HT then P
  • Else N
• Results: AUC = 0.73; TPR = 0.72; TNR = 0.73; J = 0.45
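
The two-threshold rule can be written down directly. The Python sketch below applies it and scores it with TPR, TNR and J; it does not reproduce the WEKA OneR training or the SMOTE balancing. The slides do not report the LT and HT values, so the thresholds and pulse rates used here are hypothetical placeholders, as are the function names.

```python
def classify_pulse(pulse_rate, low_threshold, high_threshold):
    """Two-sided rule: P if the pulse rate is below LT or above HT, else N."""
    if pulse_rate < low_threshold or pulse_rate > high_threshold:
        return "P"
    return "N"


def evaluate_rule(pulse_rates, labels, low_threshold, high_threshold):
    """Return (TPR, TNR, J) of the rule against known ECG outcomes."""
    tp = fn = fp = tn = 0
    for rate, label in zip(pulse_rates, labels):
        predicted = classify_pulse(rate, low_threshold, high_threshold)
        if label == "P":
            if predicted == "P":
                tp += 1
            else:
                fn += 1
        else:
            if predicted == "P":
                fp += 1
            else:
                tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present")
    tpr = tp / (tp + fn)
    tnr = tn / (tn + fp)
    return tpr, tnr, tpr + tnr - 1


if __name__ == "__main__":
    # Hypothetical data and thresholds, chosen only to exercise the rule.
    rates = [45, 48, 60, 72, 80, 95, 110]
    outcomes = ["P", "N", "N", "N", "N", "N", "P"]
    print(evaluate_rule(rates, outcomes, low_threshold=50, high_threshold=100))
```

The reported results fit the same formula: TPR = 0.72 and TNR = 0.73 give J = 0.72 + 0.73 - 1 = 0.45.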
Improved logistic
• Dummy categorical variable:
  • =0 if LT