G Model ASOC-2043; No. of Pages 7

ARTICLE IN PRESS Applied Soft Computing xxx (2013) xxx–xxx

Contents lists available at ScienceDirect

Applied Soft Computing journal homepage: www.elsevier.com/locate/asoc

Using meta-regression data mining to improve predictions of performance based on heart rate dynamics for Australian football

Herbert F. Jelinek a,b, Andrei Kelarev a,e, Dean J. Robinson c, Andrew Stranieri d, David J. Cornforth e,∗

a Centre for Research in Complex Systems, School of Community Health, Charles Sturt University, P.O. Box 789, Albury, NSW 2640, Australia
b Biomedical Engineering, Khalifa University of Science, Technology and Research (KUSTAR), Abu Dhabi, United Arab Emirates
c Corio Bay Sports Medicine Centre, Geelong, VIC, Australia
d School of Science, Information Technology and Engineering (SITE), University of Ballarat, P.O. Box 663, Ballarat, VIC 3353, Australia
e Applied Informatics Research Group, School of Design, Communication and Information Technology, University of Newcastle, Callaghan, NSW 2308, Australia

Article info

Article history: Received 14 March 2013; Received in revised form 5 August 2013; Accepted 5 August 2013; Available online xxx

Keywords: Feature selection; Regression; Meta-regression; Data mining; Heart rate dynamics; Australian football

Abstract

This work investigates the effectiveness of using computer-based machine learning regression algorithms and meta-regression methods to predict performance data for Australian football players based on parameters collected during daily physiological tests. Three experiments are described. The first uses all available data with a variety of regression techniques. The second uses a subset of features selected from the available data using the Random Forest method. The third uses meta-regression with the selected feature subset. Our experiments demonstrate that feature selection and meta-regression methods improve the accuracy of predictions of match performance for Australian football players based on daily medical test data, compared to regression methods alone. Meta-regression methods and feature selection were able to obtain performance prediction outcomes with significant correlation coefficients. The best results were obtained by additive regression based on isotonic regression for a set of the most influential features selected by Random Forest. This model was able to predict athlete performance data with a correlation coefficient of 0.86 (p < 0.05).

Crown Copyright © 2013 Published by Elsevier B.V. All rights reserved.

1. Introduction

This paper reports on the prediction of the performance of elite athletes during games of Australian Rules football (AFL). The prediction is based on a variety of measures derived from weekly ECG recordings, playing field dimensions and the temperature on the day of the game. The performance of players is expressed using variables derived from global positioning system (GPS) data recorded during the games. This work demonstrates the success of sophisticated soft computing techniques in predicting player performance during matches. The outcomes are models that may provide team coaches with a new tool for selecting players for a particular match based on the physiological performance of players in the preceding weeks. The coaches may also use such models to plan and implement training regimes that focus not only on the individual player but also on the characteristics of the match being targeted. For example, coaches may be able to optimise capacity

∗ Corresponding author. Tel.: +61 2 4985 4069. E-mail address: [email protected] (D.J. Cornforth).

for running at greater speed, which is associated with better ball control and therefore a better score. The prediction of athlete performance follows a long history of performance analysis of athletes, which has been the focus of research for decades with an emphasis on motor skills and other physical factors [1]. The application of the information sciences to sport started with statistical analysis and calculations based on biomechanical data [2]. Since then the concept of sports informatics has grown to include modelling and simulation, databases, soft computing, pattern recognition and visualisation. Recently the advent of lightweight mobile devices such as heart rate monitors, global positioning system units and other sensors has led to increased opportunities for partnering information technology with medical and sports science. A number of recent studies have explored the application of these technologies to a range of sports, including rugby and soccer, in order to more accurately understand player performance, conditioning and recovery [3,4]. High-performance sport such as football requires athletes to be at their optimal performance for a substantial part of the year, with the rest of the year taken up with player selection and preseason training. Major limiting factors are mental exhaustion, physical injury and over-training. Therefore it is important to have information for

1568-4946/$ – see front matter. Crown Copyright © 2013 Published by Elsevier B.V. All rights reserved. http://dx.doi.org/10.1016/j.asoc.2013.08.010

Please cite this article in press as: H.F. Jelinek, et al., Using meta-regression data mining to improve predictions of performance based on heart rate dynamics for Australian football, Appl. Soft Comput. J. (2013), http://dx.doi.org/10.1016/j.asoc.2013.08.010


Fig. 1. Diagram of playing field for Australian football, showing oval shaped field and main dimensions.

Fig. 2. Normal ECG recording showing RR interval.

the coach that reduces the risk of physical and mental exhaustion and injury, whilst maintaining a high performance level. Currently the football club at the centre of this study collects biometric data that is analysed either by sports exercise experts within the club or by third parties, neither of whom has advanced knowledge of data processing or soft computing. There is therefore a good opportunity for advanced regression techniques to contribute novel advances in this area. AFL is played with an oval ball on an oval field at a free-flowing pace. Fig. 1 shows a diagram of the field along with its most important features, including the long and short dimensions. AFL stands in contrast to more structured styles of football, where a game is divided into well-defined phases marked by a particular formation [5]. AFL is a fast-moving game where speed and good ball handling and kicking skills determine the outcome. AFL adds extra variables to the analysis of fitness data because there is no standard size for the playing field, although the length is generally 135–185 m and the width 110–155 m. In this work the actual dimensions of the field for each game were recorded and used as one of the potential predictors of athlete performance. This is an important point because athletes have to make a different effort depending on the field on which they find themselves playing on any particular occasion. The prediction of athlete performance at any match is potentially confounded by this variable, so it should be included in regression models. Player performance has also been reported to change throughout a playing season in the game of soccer [6]. This variability may be related to training status, perceived wellness and environmental conditions [7]. With respect to the latter, there have been reports of a greater incidence of injuries in football played in warmer and/or drier conditions [8].
Changes in core body temperature may also be a factor in performance, but these have only been studied under warmer

conditions [9]. Our study therefore includes match temperatures as an additional variable. Numerous physiological variables are measured in high-performance sport. Heart rate is one known to be associated with fitness and performance. Heart rate variability (HRV) is based on data describing the time interval between successive heart beats. HRV is an indicator of the regulation of the heart [10]. A standard ECG signal is shown in Fig. 2. This type of signal has been extensively studied and the diagnostic value of its different features is well established. ECG features are referred to using letters. The large spike is referred to as the QRS complex, with R being the peak of the wave or fiducial point, while the smaller peak to the left is the P wave and the peak to the right of the QRS is the T wave. The signal is degraded by the presence of noise, so the most reliable feature that can be obtained from low quality recordings (and therefore the most easily obtained measurement) is the interval between successive R peaks, known as the RR interval, the inverse of the heart rate. This can be expressed in milliseconds and varies considerably between individuals; the corresponding heart rate is typically 60–80 beats per minute. The natural rhythm of the human heart is subject to variation that is believed to indicate the health of the cardiovascular system, in that too much or too little variability between beats increases the risk of arrhythmia. RR intervals are obtained from the recorded ECG and subjected to further analysis through a variety of algorithms in order to yield variables with good discriminant power [10]. However, the RR interval can also be obtained from a simple and cheap chest strap device or a sensor incorporated into the playing jersey (garment), which makes it particularly suitable as a non-invasive test and allows data to be easily collected from players.
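The basic derivation described above, from R-peak times to RR intervals and heart rate, can be sketched as follows. The peak times are illustrative values, not data from the study.

```python
# Sketch: deriving RR intervals and mean heart rate from R-peak times.
# The fiducial-point times below are hypothetical, for illustration only.

def rr_intervals(r_peak_times_ms):
    """Successive differences between R-peak times (ms)."""
    return [t2 - t1 for t1, t2 in zip(r_peak_times_ms, r_peak_times_ms[1:])]

def mean_heart_rate(rr_ms):
    """Heart rate in beats per minute from the mean RR interval."""
    mean_rr = sum(rr_ms) / len(rr_ms)
    return 60000.0 / mean_rr

peaks = [0, 800, 1620, 2400, 3230]        # hypothetical R-peak times (ms)
rr = rr_intervals(peaks)                  # [800, 820, 780, 830]
print(rr, round(mean_heart_rate(rr), 1))  # ~74 bpm, within the typical 60-80 range
```

Note that the mean RR interval of roughly 800 ms corresponds to a resting heart rate of about 74 beats per minute, consistent with the typical range quoted above.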
As the electrical system of the heart is a complex and adaptive system, variation between beats can also be used to assess the health of elite athletes. The response of the cardiac autonomic nervous system can be assessed non-invasively using a number of measures based on HRV [11], all of which can provide useful information regarding the functional adaptations to a given training stimulus. As heart rate increases due to exercise, HRV tends to decrease. This is known as the sympathetic effect and is manifested in lower-frequency changes in heart rate. Conversely, during rest the heart rate slows and higher-frequency variation predominates. This is known as the parasympathetic effect. High and low frequency domain parameters are obtained by fast Fourier transform from the RR intervals of the biosignal. The extent of variability in the frequency domain can therefore be used as an indicator of cardiac modulation by the autonomic nervous system and an indicator of the general health of the athlete or level of stress. HRV is rapidly becoming more accessible with the availability of wireless sensors at low cost. HRV has been conventionally analysed with time- and frequency-domain methods, which measure the overall magnitude of fluctuations around the mean or the magnitude of fluctuations in some predetermined frequencies. More recent nonlinear analysis has shown an increased sensitivity for



characterising beat-to-beat fluctuations. These methods use fractal and entropy based measures to extract further information from HRV. For example, detrended fluctuation analysis (DFA) has been shown to be effective in the detection of diabetic autonomic neuropathy [12] and the Renyi entropy measure has provided good results in the detection of abnormal cardiac rhythms [13]. Out of the many measures that can be derived from HRV, it is not known which ones affect match performance, or by how much. This work involves analysis methods from the domain of data mining, which are able to identify important factors that can predict match performance, and to identify the variables that have the most influence over performance. The authors were able to identify correlations between player measures prior to a game and performance measured during the game. Regression analysis is a common method used for prediction and may take a parametric or nonparametric form. All the methods used here employ advanced regression algorithms based on machine learning. In this paradigm, a model of the form y = f(x) is fitted to the data, where y is a match performance variable and x is a vector of measures derived from data collected prior to the game. Regression is a process whereby the parameters of the model are tuned to minimise some error measure. A number of methods were attempted, and some promising results provide evidence that such predictions are possible. The potential value of regression modelling is the possibility of predicting the likely performance of individual players before each match. This can be of assistance to the coach in choosing players to be fielded for a particular game, or in crafting the pre-match training load. The questions addressed in this work are:

1. What is the best correlation that can be obtained using any method? 2. Does subset selection improve the correlation result? 3. Can meta-regression improve the results of standalone regression algorithms?
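The y = f(x) paradigm described above can be illustrated with a minimal one-feature example: the model parameters (here a slope and intercept) are tuned to minimise squared error, and the fitted model is then used to predict. The data are synthetic, not the study's HRV measures.

```python
# Minimal sketch of the y = f(x) regression paradigm: fit a one-feature linear
# model by minimising squared error, then predict. Synthetic data only.

def fit_linear(xs, ys):
    """Least-squares slope and intercept for y ~ a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs)
    return a, my - a * mx

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.1, 7.9]          # roughly y = 2x with small errors
a, b = fit_linear(xs, ys)
predict = lambda x: a * x + b
print(round(a, 2), round(predict(5.0), 1))
```

The machine learning regressors used in this work (Section 3.1) follow the same pattern with far more flexible forms of f, but the fit-then-predict workflow is identical.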

2. Football dataset

This study uses data collected from elite Australian football players during training and daily data collection at rest to predict in-game performance recorded for 12 games of a football season. The athletes participating in the study were professional AFL players who trained for a minimum of 15 h per week during the study. The study period included 9 weeks of training, each followed by a game, and 40 athletes participated. Not all the athletes participated in all the games. HRV data were collected at the same time each day to avoid any circadian changes in heart rate [14]. Heart rate data were collected from players on waking in the morning, using chest strap sensors. Measures were derived from HRV using time-domain, frequency-domain and non-linear methods. Time-domain measures include the minimum, maximum and mean of the recorded RR intervals. The number of pairs of successive intervals that differ by more than 50 ms, divided by the total number of intervals, is a parasympathetic measure known as pNN50 and is also a time-domain measure. The HRV triangular index is based on estimating the density distribution of RR intervals. The maximum value of this distribution is the mode amplitude (AMo), which indicates the stability of the heart rhythm. The area under the curve (its integral) gives the total number of intervals, and this is divided by the maximum value of the distribution to obtain the index [10]. The triangular interpolation of the interval histogram (TINN) is the estimated width of the density distribution. This is also known as the index of control systems pressure and is believed to be sensitive to physical and emotional load, or to an intensifying of sympathetic nervous system tone.
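The time-domain measures just described are straightforward to compute; pNN50, for example, can be sketched as below. The RR values are illustrative, not data from the study.

```python
# Sketch of time-domain HRV measures described above: pNN50 together with the
# basic minimum/maximum/mean statistics. RR values are hypothetical.

def pnn50(rr_ms):
    """Fraction of successive RR-interval pairs differing by more than 50 ms."""
    diffs = [abs(b - a) for a, b in zip(rr_ms, rr_ms[1:])]
    return sum(d > 50 for d in diffs) / len(diffs)

rr = [800, 860, 855, 790, 798]               # hypothetical RR intervals (ms)
print(min(rr), max(rr), sum(rr) / len(rr))   # time-domain basics
print(pnn50(rr))                             # 0.5: two of four pairs exceed 50 ms
```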


Fig. 3. Poincaré plot for a sequence of RR intervals allows the estimation of SD1 and SD2.

The Poincaré plot is a visual representation of the time series and is constructed by plotting each consecutive RR interval as a point where y = RR(t) and x = RR(t − 1). Fig. 3 illustrates this. An ellipse is fitted to the resulting cloud of points and the minor and major axes of the ellipse are estimated as SD1 (short term correlation) and SD2 (long term correlation) [15,16]. Frequency domain methods divide the spectral distribution into regions labelled as very low, low and high frequency (Fig. 4). Low frequency power (LF) is believed to be indicative of sympathetic activity or sympathovagal balance. High frequency power (HF) is indicative of parasympathetic activity. Very low frequency (VLF) amplitude is closely connected with psycho-emotional state and the functional condition of the brain [17]. Other work [18] has shown the importance of VLF-range analysis, and that the capacity of VLF fluctuations of HRV is a sensitive indicator of the management of metabolic processes and reflects energy-deficit states. VLF is also used as a reliable marker of blood circulation and provides information on the degree to which the autonomous (segmental) and supra-segmental regulation levels are functionally synchronised via neuronal output. Normally VLF lies between 15% and 30% of the total spectral power, and it can be measured from the RR intervals associated with the ECG, which is regulated by supra-segmental and segmental neural modulatory systems including the cardiac autonomic innervation. The peak value of the frequency distribution is expressed in Hz while the power is expressed in ms². The logarithms of HF and LF are also calculated, as they allow the definition of a new index based on HF and LF energy which appears relevant to measure HR regulation

Fig. 4. Power spectrum of RR intervals showing VLF, LF and HF regions.
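The band-power computation behind the VLF/LF/HF decomposition can be sketched with an FFT. A real pipeline would first resample the unevenly spaced RR intervals onto a regular time grid; here a synthetic, evenly sampled signal with one LF and one HF component stands in for that step, and the band edges follow the conventional 0.04–0.15 Hz (LF) and 0.15–0.4 Hz (HF) definitions.

```python
import numpy as np

# Sketch: band powers of an RR series via FFT, as in the frequency-domain HRV
# analysis described above. Synthetic, evenly sampled signal (4 Hz) used in
# place of a resampled RR series.

fs = 4.0                                   # resampling frequency (Hz)
t = np.arange(0, 300, 1 / fs)              # 5 minutes of signal
# Synthetic RR signal with a 0.1 Hz (LF) and a 0.25 Hz (HF) component.
rr = 800 + 20 * np.sin(2 * np.pi * 0.1 * t) + 10 * np.sin(2 * np.pi * 0.25 * t)

f = np.fft.rfftfreq(len(rr), d=1 / fs)
psd = np.abs(np.fft.rfft(rr - rr.mean())) ** 2

def band_power(lo, hi):
    """Total spectral power in the band [lo, hi) Hz."""
    return psd[(f >= lo) & (f < hi)].sum()

lf = band_power(0.04, 0.15)                # sympathetic / mixed influence
hf = band_power(0.15, 0.40)                # parasympathetic influence
print(lf / hf)                             # LF/HF ratio used in the analysis
```

With the chosen amplitudes (20 and 10) the LF/HF power ratio comes out near 4, since power scales with the square of amplitude.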



and corrects for data not normally distributed [19]. The ratio of low to high frequency components is also calculated for our analysis, as well as the total power [10]. Other measures used in this study are proprietary to the club, but are derived from commonly available ECG data. These are the independent variables of the regression models. In addition, the AFL Club collects detailed quantitative data on player performance. Data were collected from each match, using GPS sensors worn by players during the game. From these data, two measures were used to assess match performance: B3 – the time spent running at speeds between 18 and 24 km/h, and B4 – the time spent running faster than 24 km/h. These are the dependent variables of the regression models, which will be predicted. Additional measures were obtained from knowledge of the game location:

• long dimension of the playing field;
• short dimension of the playing field;
• minimum temperature (24 h);
• maximum temperature (24 h);
• average temperature (24 h).

These are added to the other independent variables, to be used as predictors of the performance measures or class variables. Bespoke software was prepared to combine data from these different sources into one file suitable for input to WEKA for analysis of data using regression algorithms and data mining methods. The complete data file for analysis comprised 250 records. The predicted or “class” variables used in our experiments were the B3 and B4 attributes in the database, referring to GPS data of running at low speed and running at high speed, respectively.
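The data-integration step performed by the bespoke software can be sketched with pandas: daily physiological records, per-match GPS measures and weather are joined on player and date. The column names and values here are invented for illustration; the study's actual schema is proprietary to the club.

```python
import pandas as pd

# Sketch of the data-integration step: joining daily physiological measures,
# match GPS performance (B3/B4) and weather by player and date.
# All column names and values are hypothetical.

hrv = pd.DataFrame({"player": [1, 1, 2], "date": ["2013-05-04"] * 3,
                    "mean_rr": [810, 805, 792], "pNN50": [0.31, 0.28, 0.22]})
# collapse duplicate daily rows: one record per player per day
hrv = hrv.groupby(["player", "date"], as_index=False).mean()

gps = pd.DataFrame({"player": [1, 2], "date": ["2013-05-04"] * 2,
                    "B3": [412, 388], "B4": [95, 120]})
weather = pd.DataFrame({"date": ["2013-05-04"], "t_min": [8.1], "t_max": [17.3]})

merged = hrv.merge(gps, on=["player", "date"]).merge(weather, on="date")
print(merged.shape)   # one row per player-match with all predictors attached
```

The merged rows correspond to the records of the unified file that was passed to WEKA.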

3. Methods

Data on player training and daily physiological data collection at rest, as well as in-game performance measures, were obtained as de-identified data from the AFL Club. Temperature data for the locations and dates of matches were sourced from the Australian Bureau of Meteorology and added to create a unified data set. Prior to this research the study was approved by the Australian Institute of Sport Ethics and Research Committee, and informed consent was obtained from all players participating in this research. The data were thoroughly checked for consistency and records with inconsistent values were discarded. Bespoke software was prepared under the supervision of the authors for data cleaning, conversion and integration, in order to achieve a unified data set. The data were converted to the "arff" format used by Weka. Analysis was performed using freely available software from the Weka toolbox (distributed by the University of Waikato). All methods from the Weka toolbox were run without modification of any kind, using the default parameters. We use standard terminology and refer the reader to Refs. [20,21] for more background information on the Weka toolbox. The soft computing methods used in this work are automated regression, feature selection and meta-regression. All results were assessed using the Goodman–Kruskal correlation coefficient (GKCC), also known as Goodman–Kruskal's gamma. This is defined as the difference between the number of concordant pairs, C, and the number of discordant pairs, D, of the two rankings, as a proportion of all pairs, ignoring ties [22]. The coefficient is based on the ranks of the variables and tests for a weak monotonicity between the two rankings. The value of GKCC ranges between +1 and −1, and it is equal to 0 for independent variables.
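The GKCC definition above translates directly into code: count concordant and discordant pairs of the two rankings, ignore ties, and take (C − D)/(C + D).

```python
from itertools import combinations

# Sketch of the Goodman-Kruskal gamma used to score predictions: concordant
# minus discordant pairs over their total, ties ignored.

def gk_gamma(xs, ys):
    c = d = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        s = (x1 - x2) * (y1 - y2)
        if s > 0:        # pair ordered the same way in both rankings
            c += 1
        elif s < 0:      # pair ordered oppositely
            d += 1
    return (c - d) / (c + d)

print(gk_gamma([1, 2, 3, 4], [1, 2, 3, 4]))   # 1.0: perfectly concordant
print(gk_gamma([1, 2, 3, 4], [4, 3, 2, 1]))   # -1.0: perfectly discordant
```

Because only the ordering of values matters, the coefficient rewards a model whose predictions rank the players correctly, even when the relationship is monotone but not linear.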

3.1. Base regression learners We consider several base regression learners that represent various essential regression methods available in WEKA. The following robust base regression learners have been selected for a complete experimental evaluation of their performance. The bracketed acronyms are used in the results section to refer to these methods. GaussianProcesses (GP) performs regression based on Gaussian processes with six available kernels (Normalised Poly Kernel, Poly Kernel, Precomputed Kernel Matrix Kernel, Puk, RBFKernel, String Kernel) and without hyperparameter tuning. It applies normalisation/standardisation to the attributes to facilitate choosing an appropriate noise level for the Gaussian model. Missing values are replaced by the global mean or mode. Nominal attributes are converted to binary ones. IsotonicRegression (IR) learns an isotonic regression model by picking the attribute that results in the lowest squared error. Missing values are not allowed. IsotonicRegression can only deal with numeric attributes. It considers the monotonically increasing case as well as the monotonically decreasing case. LeastMedSq (LMS) implements a least median squared linear regression based on least squared functions generated from random subsamples of the data. LinearRegression (LR) applies the Akaike Information Criterion for model selection in linear regression for prediction. MultilayerPerceptron (MLP) is a feedforward artificial neural network using backpropagation and one hidden layer. The network uses sigmoid functions at each output node for input and hidden layers, and a linear function at the output layer in order to provide a regression function. RBFNetwork (RBF) trains a normalised Gaussian radial basis function network by applying the k-means clustering algorithm, fitting symmetric multivariate Gaussians to the data in each cluster, and learning a linear regression for numeric class problems in each cluster. 
SMOreg (SMO) implements a support vector machine for regression. We use RegSMOImproved, the most popular of its algorithms, chosen in the parameter settings. Here we use all these regression learners with default parameter settings to allow a uniform comparison. It may be possible to achieve further improvements in performance by fine-tuning the choice of parameters. We refer the reader to Refs. [20,21] for more information on these base regression learners and their WEKA implementations.

3.2. Feature selection

We use Random Forest feature selection implemented in R (version 2.15.1) with Rattle [23]. Random Forest is an ensemble meta classifier hardwired to a particular base classifier, Random Tree. It constructs a forest of random trees following [24]. The classifier builds many decision tree predictors with randomly selected variable subsets and utilises a different subset of training and validation data for each individual model. After generating many trees, the resulting class prediction is based on votes from the single trees. Consequently, lower ranked variables are eliminated based on empirical performance heuristics [25].

3.3. Meta-regression methods for improving predictions

We used several meta-regression algorithms, each of which combines multiple models to improve the overall fit. AdditiveRegression (AR) successively enhances the performance of a base regression learner. Each iteration fits a model to the residuals left by the previous regression learner. The final prediction is performed by adding the predictions of all regression learners.
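The residual-fitting loop of AdditiveRegression can be sketched with an isotonic base learner, the combination that turns out to perform best in our experiments. This is a scikit-learn approximation of the idea, not the WEKA implementation, and the shrinkage factor and iteration count are illustrative choices.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Sketch of additive regression with an isotonic base learner: each iteration
# fits the residuals left by the previous fit, and predictions are summed.
# A scikit-learn approximation of the WEKA AdditiveRegression scheme.

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, 80))
y = np.sqrt(x) + rng.normal(0, 0.05, 80)     # monotone trend plus noise

shrinkage = 0.5                              # damping of each stage's output
models, residual = [], y.copy()
for _ in range(10):                          # each stage fits the residuals
    m = IsotonicRegression(out_of_bounds="clip").fit(x, residual)
    models.append(m)
    residual = residual - shrinkage * m.predict(x)

pred = sum(shrinkage * m.predict(x) for m in models)
print(float(np.mean((pred - y) ** 2)))       # small training error remains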



Bagging (Bag) is a regression scheme for bagging a base regression model to reduce the variance. LWL implements locally weighted learning using an instance-based algorithm to assign instance weights, which are then passed on to any other available algorithm that can handle classification or regression. We used LinearNNSearch as the nearest neighbour search algorithm for LWL. MultiScheme (MS) uses cross validation on the training data, or the performance on the training data, to select a base regression model from several available models. Performance is measured by mean squared error. RandomSubSpace (RSS) constructs a decision tree based classifier that maintains the highest accuracy on training data and improves on generalisation accuracy as it grows in complexity. The classifier consists of multiple trees constructed systematically by pseudorandomly selecting subsets of components of the trees constructed in randomly chosen subspaces. RegressionByDiscretisation (RD) is another important meta-regression scheme that employs any regression learner on a copy of the data in which the class attribute has been discretised. The predicted value is the expected value of the mean class value for each discretised interval, based on the predicted probabilities for each interval and on conditional density estimation performed by building a univariate density estimator from the target values in the training data weighted by the class probabilities. The reader is referred to Refs. [20,21] for more information on meta-regression methods and their WEKA implementations.

3.4. Experiments

The first experiment addressed the question of which regression method provides the best correlation between the input data and the outcome variables. By applying base regression methods directly to the full dataset, a comparison between the following base regression learners is possible: Conjunctive Rule, EM Imputation, Kstar, Linear Regression, M5Rules and REPTree.
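The experimental pipeline of ranking features with a Random Forest and then comparing a learner on the full and reduced feature sets under cross-validation can be sketched as follows. This uses scikit-learn and synthetic data as a stand-in for the WEKA/R tools actually used, and scores with R² rather than the GKCC used in the study.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Sketch of the experimental pipeline: rank features with a Random Forest,
# keep the most influential subset, and compare a base learner on the full
# and reduced sets with 10-fold cross-validation. Synthetic data; the study
# used WEKA/R and scored with the Goodman-Kruskal coefficient.

rng = np.random.default_rng(1)
X = rng.normal(size=(250, 30))               # 250 records, 30 candidate features
y = 2 * X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=250)   # only 2 matter

rank = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
top = np.argsort(rank.feature_importances_)[::-1][:5]    # top-ranked subset

full = cross_val_score(LinearRegression(), X, y, cv=10).mean()
subset = cross_val_score(LinearRegression(), X[:, top], y, cv=10).mean()
print(sorted(top[:2].tolist()), round(full, 3), round(subset, 3))
```

In this synthetic setting the forest correctly ranks the two informative features first, mirroring how Random Forest selected the 21 most influential attributes from the football data.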
The second experiment used Random Forest for feature selection, in an attempt to improve the quality of the obtained predictions by subset selection. It ranked all features and selected the 21 most influential attributes. The base regression learners were then compared on the set of features chosen by Random Forest. The third experiment used meta-regression methods to attempt further improvement of the regression outcome. We used the set of the 21 most influential features chosen by Random Forest and applied meta-regression methods to the three best base regression learners identified in experiments 1 and 2: Isotonic Regression, SMOreg and Gaussian Processes. All experiments used 10-fold cross validation as a standard method for preventing over-fitting. The GKCC correlation coefficients were used as the measure of performance of regression learners, since one cannot expect to obtain an exact linear relationship for match performance data, and assessing weak monotonic relationships is likely to be the best choice. The results from the three experiments were compared using two-tailed heteroscedastic t-tests at the 95% confidence level.

4. Results and discussion

Fig. 1 contains the correlation coefficients of the predictions of base regression learners using all features of the database. Most of the methods achieved a correlation coefficient above 0.5, apart from the Radial Basis Function (RBF) Network algorithm. When comparing the results of a variety of machine learning methods it should of course be noted that for any given algorithm that produces very


good outcomes in certain applications, there always exist data sets in other domains where different algorithms are more effective. This is confirmed by the so-called "no-free-lunch" theorems, which imply that no single algorithm is best for all problems [26]. The performance of every category of algorithms depends on the dimension of the data set and the number of instances, the types of attributes, the nature of the functional relations and dependencies among the attributes, and other parameters. In all cases the high-speed variable B4 achieved a better correlation than B3. The best results appear to come from the Isotonic Regression (IR) algorithm, with a correlation of 0.67 for B3 and 0.71 for B4. This is likely because high-speed running requires a higher level of fitness than is indicated by B3, and is also dependent on the field size. While these are surprisingly good results for prediction of athlete performance from the available input variables, they hold promise for even better performance if more sophisticated methods are used. Fig. 2 displays the correlation coefficients of the predictions of the base regression learners for the set of 21 features chosen by Random Forest. In contrast to the previous case (Fig. 1), most of our regression methods achieved a correlation of 0.6 or above, suggesting an improvement by subset selection of variables. Also, the results show a higher correlation for B4 than B3 for all algorithms used. Once again the Isotonic Regression (IR) algorithm appears to give the best performance, with 0.74 for B3 and 0.77 for B4. However, one would expect the use of meta-regression techniques to provide further improvement. Fig. 3 shows the correlation coefficients for the prediction of B3, using meta-regression methods based on Isotonic Regression, SMOreg and Gaussian Processes for the set of 21 features chosen by Random Forest. RegressionByDiscretisation is not included, because for our database discretisation created multi-valued nominal attributes, and neither Gaussian Processes, nor Isotonic Regression, nor SMOreg can handle multi-valued nominal classes. Comparing this result to the previous results of experiments 1 and 2 shows that the correlation has improved. The Isotonic Regression (IR) base algorithm provided the best results. Out of the meta-regression algorithms, Additive Regression (AR) had the highest correlation, implying the best fit of this model to the data. The highest result, for Isotonic Regression used with the Additive Regression meta-regression algorithm, was a correlation coefficient of 0.83. Fig. 4 shows the correlation coefficients for the prediction of the B4 attribute by the meta-regression methods based on the learners Isotonic Regression, SMOreg and Gaussian Processes for the set of features chosen by Random Forest. Results are somewhat better than the predictions for B3, and nearly all methods achieved a correlation coefficient of 0.75 or higher. The Isotonic Regression (IR) algorithm again appears to give the best results as a base classifier. As with Fig. 3, Additive Regression (AR) resulted in the highest correlation, in this case 0.86. The results of significance testing are provided in Table 1. The first column describes the groups being compared and the second column gives the p-value associated with the t-test. The first two rows suggest that the null hypothesis that the second experiment (including subgroup selection) provided no improved correlation over the first cannot be rejected. In spite of the apparent difference between Figs. 1 and 2, the t-test result implies that there is no evidence that the use of the reduced feature subset alone provided an improvement over using all the features. This finding relates to both dependent variables B3 and B4.
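The significance testing described above, a two-tailed t-test that does not assume equal variances (Welch's test), can be sketched as follows. The correlation values are illustrative, not the study's results.

```python
from scipy import stats

# Sketch of the significance testing used above: a two-tailed t-test without
# the equal-variance assumption (Welch's test, equal_var=False).
# The correlation values below are illustrative only.

exp2 = [0.60, 0.62, 0.65, 0.74, 0.58, 0.61, 0.63]   # base learners, feature subset
exp3 = [0.78, 0.80, 0.83, 0.79, 0.81, 0.77, 0.82]   # meta-regression results
t, p = stats.ttest_ind(exp3, exp2, equal_var=False)
print(p < 0.05)   # reject the null of equal mean correlation at the 95% level
```

Comparing whole groups of learner correlations in this way, rather than single best values, is what supports the claim that meta-regression itself, and not one lucky configuration, provides the improvement.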
It may be that the feature subset selection provided some advantage, but it was not found to be significant in this test. However, the results for all other tests are significant. Rows 3-8 provide details of the t-tests comparing the results of experiment 3 with those of experiment 2, so they measure the

Please cite this article in press as: H.F. Jelinek, et al., Using meta-regression data mining to improve predictions of performance based on heart rate dynamics for Australian football, Appl. Soft Comput. J. (2013), http://dx.doi.org/10.1016/j.asoc.2013.08.010


Table 1
Significant combinations of regression algorithms combined with subgroup selection and meta-analysis. GP, IR and SMO indicate Gaussian Processes, Isotonic Regression and SMOreg, respectively.

Test                         p value      Comments
B3: Exp.1 vs. Exp.2          0.105        NS
B4: Exp.1 vs. Exp.2          0.131        NS
B3: Exp.3 (GP) vs. Exp.2     0.00349      Significant
B3: Exp.3 (IR) vs. Exp.2     0.000415     Significant
B3: Exp.3 (SMO) vs. Exp.2    0.00113      Significant
B4: Exp.3 (GP) vs. Exp.2     0.00634      Significant
B4: Exp.3 (IR) vs. Exp.2     0.000990     Significant
B4: Exp.3 (SMO) vs. Exp.2    0.00253      Significant
B3: Exp.3 (GP) vs. Exp.1     0.000266     Significant
B3: Exp.3 (IR) vs. Exp.1     5.23E-05     Significant
B3: Exp.3 (SMO) vs. Exp.1    0.000106     Significant
B4: Exp.3 (GP) vs. Exp.1     0.000531     Significant
B4: Exp.3 (IR) vs. Exp.1     0.000122     Significant
B4: Exp.3 (SMO) vs. Exp.1    0.000259     Significant

contribution of the meta-regression. All of these results were significant at the 95% level, as their p-values are all less than 0.05. This implies that the use of meta-regression did indeed provide an advantage, and that this advantage applies to both dependent variables B3 and B4.

Rows 9-14 provide details of the t-tests comparing the results of experiment 3 with those of experiment 1, revealing the combined contribution of feature selection and meta-regression. Again, all tests were significant at the 95% level. The p-values are also lower than the corresponding values for the comparison between experiments 3 and 2. This suggests that while the contribution of feature selection alone was not found to be significant in this test, a difference may nevertheless exist, as the comparison of Figs. 1 and 2 indicates.

Table 1 indicates that meta-regression with IR as the base learner, following subgroup selection, provides the greatest improvement over regression alone in relating the physiological features, together with data on temperature and field size, to performance. Twenty-one physiological features were initially identified by Random Forest subgroup selection, including the minimum and mean RR interval, pNN50, TINN, the LF, HF and VLF frequency components, and some derived variables that are proprietary to the Club. Of these variables, the frequency-based variables have higher importance than the others.

5. Conclusion

Soft computing methods have the potential to improve predictions in sports science. In this work, we predicted the performance of elite athletes on the field, expressed as the time spent running at two different speeds (B3 and B4) during a game. Game performance was available for 9 games, as were the physiological measures during the weeks prior to the games, in addition to temperature and playing field dimensions.
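The Random Forest subgroup selection used throughout this study can be sketched roughly as follows, with scikit-learn's impurity-based importances standing in for the Rattle/R tooling cited in the paper; the data are synthetic, and the cut-off of 21 features simply mirrors the number reported above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
X = rng.normal(size=(80, 30))                 # 30 candidate features
# Synthetic target driven by columns 0 and 3 only, plus small noise.
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(scale=0.1, size=80)

rf = RandomForestRegressor(n_estimators=500, random_state=0)
rf.fit(X, y)

# Rank features by importance and keep the 21 most influential ones.
ranked = np.argsort(rf.feature_importances_)[::-1]
subset = ranked[:21]
```

On this synthetic example the two informative columns dominate the ranking; on the Club's data the analogous ranking placed the frequency-based heart rate variables highest.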
We used a range of variables extracted from heart rate measurement, which express the physical fitness of athletes [27]. We also used data on the size of the field and the temperature on the day of the match. The methods included regression, feature selection and meta-regression. Our experiments demonstrate that meta-regression methods and feature selection can improve the accuracy of performance predictions for Australian football players based on daily physiological measures. Meta-regression methods and feature selection obtained performance predictions with significant correlation coefficients. Statistical analysis confirmed our correlation analysis results for the use of the meta-learners. Data reduction did not have a

significant effect on its own, despite the improvement in the correlation coefficient. The outcome of this test was similar for both dependent variables B3 and B4. The most likely cause is that the ten weeks of the season analysed do not provide sufficient data for the test to reach significance at this level. However, the third experiment, where meta-regression was used, provided a significantly better result than either experiment 1 or experiment 2. This means that we are more confident of the contribution of the meta-regression than of the contribution of the feature reduction, but it is likely that both played their part.

Earlier in this paper we posed three questions, namely:

1. What is the best correlation that can be obtained using any method?
2. Does subset selection improve the correlation result?
3. Can meta-regression improve the results of standalone regression algorithms?

The answer to the first question is that the Isotonic Regression algorithm provides the best correlation between predicted and actual results. This is common to all three experiments, suggesting that it is a real phenomenon and not simply the result of chance. The answer to the second question is that subset selection by Random Forest does improve the result, as borne out by the second and third experiments. The answer to the third question is that the Additive Regression meta-regression algorithm achieves the best result, with a correlation of 0.86 between predicted and actual athlete performance. This is a remarkable achievement and opens the real possibility of advanced tools to assist team coaches in fitness management.

Acknowledgements

We are grateful to the Essendon Football Club, Australia, for providing data for this study.
We would like to thank the reviewers for comments that have helped to improve our text and for suggestions of possible directions for further research.

References

[1] R.H. Seashore, Individual differences in motor skills, Journal of General Psychology 3 (1) (1930) 38-66.
[2] A. Baca, Computer science in sport: history, research areas and fields of application, in: Proceedings of the IACSS Symposium, 2007, pp. 16-22.
[3] C. McLellan, D. Lovell, G. Gass, Performance analysis of elite Rugby League match play using global positioning systems, Journal of Strength and Conditioning Research 25 (6) (2011) 1703-1710.
[4] D.J. Burgess, G. Naughton, K.I. Norton, Profile of movement demands of national football players in Australia, Journal of Science and Medicine in Sport 9 (4) (2006) 334-341.
[5] D.M. O'Shaughnessy, Possession versus position: strategic evaluation in AFL, Journal of Sports Science and Medicine 5 (2006) 533-540.
[6] E. Rampinini, A.J. Coutts, C. Castagna, A. Sassi, F. Impellizzeri, Variation in top level soccer match performance, International Journal of Sports Medicine 28 (2007) 1018-1024.
[7] P.B. Gastin, B. Fahrner, D. Meyer, D. Robinson, Influence of physical fitness, age, experience, and weekly training load on match performance in elite Australian football, Journal of Strength and Conditioning Research 27 (5) (2013) 1272-1279.
[8] J. Orchard, Is there a relationship between ground and climatic conditions and injuries in football? Sports Medicine 32 (2002) 419-432.
[9] R. Duffield, A.J. Coutts, J. Quinn, Core temperature responses and match running performance during intermittent-sprint exercise competition in warm conditions, Journal of Strength and Conditioning Research 23 (2009) 1238-1244.
[10] Task Force of the European Society of Cardiology and the North American Society of Pacing and Electrophysiology, Heart rate variability: standards of measurement, physiological interpretation, and clinical use, Circulation 93 (1996) 1043-1065.



[11] M. Buchheit, Y. Papelier, P.B. Laursen, S. Ahmaidi, Noninvasive assessment of cardiac parasympathetic function: post-exercise heart rate recovery or heart rate variability? American Journal of Physiology: Heart and Circulatory Physiology 293 (2007) H8-H10.
[12] H.F. Jelinek, D.J. Cornforth, Computational methods for the early detection of diabetes, in: N. Wickramasinghe, E. Geisler (Eds.), Encyclopaedia of Healthcare Information Systems, IDEA Books, 2008.
[13] H.F. Jelinek, M. Tarvainen, D.J. Cornforth, Renyi entropy in identification of cardiac autonomic neuropathy in diabetes, in: Proceedings of the Conference on Computing in Cardiology, Krakow, 2012.
[14] T. Reilly, G. Robinson, O.S. Minors, Some circulatory responses to exercise at different times of day, Medicine and Science in Sports and Exercise 16 (5) (1984) 477-482.
[15] M.P. Tulppo, T.H. Makikallio, T.E. Takala, T. Seppanen, H.V. Huikuri, Quantitative beat-to-beat analysis of heart rate dynamics during exercise, American Journal of Physiology - Heart 271 (1) (1996) H244-H252.
[16] C.K. Karmakar, A.H. Khandoker, A. Voss, M. Palaniswami, Sensitivity of temporal heart rate variability in Poincaré plot to changes in parasympathetic nervous system activity, BioMedical Engineering OnLine 10 (17) (2011), available at http://www.biomedical-engineering-online.com/content/10/1/17 (accessed December 2012).
[17] R.M. Baevsky, A.P. Berseneva, Use of Kardivar System for Determination of the Stress Level and Estimation of the Body Adaptability, 2008, available at www.ehrlich.tv/Kardivar Methodical Eng.pdf (accessed December 2012).
[18] A.N. Fleishman, Slow Hemodynamic Oscillations: The Theory, Practical Application in Clinical Medicine and Prevention, Nauka, Novosibirsk, 1999.
[19] N. Khalfa, P.R. Bertrand, G. Boudet, A. Chamoux, V. Billat, Heart rate regulation processed through wavelet analysis and change detection: some case studies, Acta Biotheoretica 60 (2012) 109-129.
[20] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, I. Witten, The WEKA data mining software: an update, SIGKDD Explorations 11 (2009) 10-18.
[21] I. Witten, E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, Elsevier/Morgan Kaufmann, Amsterdam, 2011.
[22] M. Kendall, J. Gibbons, Rank Correlation Methods, 5th ed., Oxford University Press, London, 1990.
[23] G. Williams, Data Mining with Rattle and R: The Art of Excavating Data for Knowledge Discovery, Springer Verlag, New York, 2011.
[24] L. Breiman, Random forests, Machine Learning 45 (2001) 5-32.
[25] J. Han, M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2006.
[26] D. Wolpert, The lack of a priori distinctions between learning algorithms, Neural Computation 8 (1996) 1341-1390.
[27] R.E. De Meersman, Heart rate variability and aerobic fitness, American Heart Journal 125 (3) (1993) 726-731.

Herbert Jelinek is an Associate Professor and holds the B.Sc. (Hons.) in Human Genetics from the University of New South Wales, Australia (1984), a Graduate Diploma in Neuroscience from the Australian National University (1986) and a Ph.D. in Medicine from the University of Sydney (1996). He is Clinical Associate Professor with the Australian School of Advanced Medicine, Macquarie University, and a member of the Centre for Research in Complex Systems, Charles Sturt University, Australia. Dr. Jelinek is currently visiting Associate Professor at Khalifa University of Science, Technology and Research, Abu Dhabi, UAE. He is a member of the IEEE Biomedical Engineering Society and the Australian Diabetes Association.


Andrei Kelarev is an author of two books and 194 journal articles, and an editor of five international journals. He worked as an Associate Professor at the University of Wisconsin and the University of Nebraska in the USA, and as a Senior Lecturer at the University of Tasmania in Australia. He was a Chief Investigator on a large Discovery grant from the Australian Research Council and has been a member of the programme committees of several conferences. Andrei Kelarev currently works on a research grant at the University of Newcastle, Australia.

Dean Robinson is a high performance fitness coach for Australian Rules Football. He joined the Geelong, Victoria, club in 2007 where he was the conditioning coach. He joined the Gold Coast club in 2010 then joined Essendon before the 2012 pre-season and oversaw the players’ strength and conditioning programme for that year.

Andrew Stranieri is an Associate Professor and the Director of the Centre for Informatics and Applied Optimisation at the University of Ballarat. His research into cognitive models of argumentation and artificial intelligence was instrumental in modelling decision making in refugee law, copyright law, eligibility for legal aid and sentencing. His research in health informatics spans data mining in health, complementary and alternative medicine informatics, telemedicine and intelligent decision support systems. Andrew Stranieri is the author of over 120 peer reviewed journal and conference articles and has published two books.

David Cornforth holds the B.Sc. in Electrical and Electronic Engineering from Nottingham Trent University, UK (1982), and the Ph.D. in Computer Science from the University of Nottingham, UK (1994). He has been an educator and researcher at Charles Sturt University, the University of New South Wales, and currently at the University of Newcastle. He has also been a research scientist at the Commonwealth Scientific and Industrial Research Organisation (CSIRO), Newcastle, Australia. His research interests are in health information systems, pattern recognition, artificial intelligence, multi-agent simulation, and optimisation. He is convenor of the Applied Informatics Research Group, University of Newcastle.
