

A Machine Learning Pipeline for Multiple Sclerosis Course Detection from Clinical Scales and Patient-Reported Outcomes

Samuele Fiorini1, Alessandro Verri1, Andrea Tacchino2, Michela Ponzio2, Giampaolo Brichetto2, and Annalisa Barla1

1 DIBRIS, University of Genoa, I-16146 Genoa, Italy (samuele.fiorini at dibris.unige.it; {annalisa.barla, alessandro.verri} at unige.it)
2 Scientific Research Area, Italian MS Foundation, I-16149 Genoa, Italy ({giampaolo.brichetto, andrea.tacchino, michela.ponzio} at aism.it)

Abstract— In this work we present a machine learning pipeline for the detection of the multiple sclerosis course from a collection of inexpensive and non-invasive measures, namely clinical scales and patient-reported outcomes. The proposed analysis is conducted on a dataset coming from a clinical study comprising 457 patients affected by multiple sclerosis. The 91 collected variables describe the patients' mobility, fatigue, cognitive performance, emotional status, bladder continence and quality of life. A preliminary data exploration phase suggests that the group of patients diagnosed as Relapsing-Remitting can be isolated from the other clinical courses. Supervised learning algorithms are then applied to perform feature selection and course classification. Our results confirm that clinical scales and patient-reported outcomes can be used to classify Relapsing-Remitting patients.

I. INTRODUCTION

Multiple Sclerosis (MS), a chronic disabling disease of the central nervous system, is usually classified into five courses [1]. The most common one is referred to as Relapsing-Remitting (RR) and it presents well-defined attacks of worsening neurologic function, called relapses. These episodes are followed by partial or complete recovery periods (remissions), during which symptoms improve partially or completely and there is no apparent progression of the disease. The Secondary-Progressive (SP) course follows after RR and is characterized by a steadier progression, with or without relapses. Primary-Progressive (PP) patients show steadily worsening neurologic function from the onset, with no distinct relapses or remissions. Progressive-Relapsing (PR), the least common course, is characterized by continuously progressing disease from the onset and occasional relapses along the way. Finally, Benign MS occurs when the patient remains fully functional in all neurologic systems for at least 15 years after the onset.

Achieving an accurate clinical course description is a very hard task even for clinical experts, but it is crucial for communication, prognosis, treatment decision-making, and the design and recruitment of clinical trials. MS course identification is also very time consuming for both clinicians and patients. Indeed, it is usually based on the combination of time-consuming clinical examinations, such as the collection of data
from the natural history of the disease and the frequency of relapses. Patients also need to undergo stressful exams to assess the spatial and temporal dissemination of demyelinating plaques in both brain and spinal cord with T2-weighted and gadolinium-enhanced T1-weighted MRI scans. Conversely, questionnaires, such as Clinical Scales (CS) and Patient-Reported Outcomes (PRO), are inexpensive and non-invasive sources of information that are fundamental to evaluate the effect of pharmacological or rehabilitative treatments. In particular, PRO questionnaires are being used more and more frequently in epidemiological studies, health service research and clinical trials to evaluate therapeutic interventions, because they can capture the self-perception of the disease [2].

This work aims at using PRO, CS and anthropometric measures (from now on referred to as features) to build a statistical model for the detection of MS courses by means of machine learning techniques. In the remainder of the paper, the five MS disease courses will be referred to as classes and identified with labels. The first, explorative, goal was to assess whether the collected features could discriminate any of the five disease courses. Unsupervised learning techniques were used to explore the data at hand, looking for a meaningful structure. The supervised setup inferred in this step was then used to learn a classifier based only on a subset of the available features. In particular, feature selection is motivated by the need to save time and effort in collecting all the answers to the CS and PRO questionnaires.

The paper is organized as follows: Section II-A describes the dataset, Section II-B presents the proposed pipeline and Section II-C describes the machine learning methods used to perform dimensionality reduction, feature selection and classification. Section III presents an exhaustive explanation of the obtained results. Our conclusions are eventually drawn in Section IV.

II. METHODS

A. Materials

The proposed analysis is conducted on a dataset of n = 457 samples and p = 91 features (see Table I). Of the 457 patients considered in our study, 170 are diagnosed as RR; the other 287 can be further divided into 205 SP, 68 PP, 8 PR and 6 Benign. The features can be described distinguishing between anthropometric data (such as age, weight or height), Patient-Reported Outcomes and Clinical Scales.
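As a quick sanity check (anticipating the baseline quoted in Section III-C), the class counts above fix the accuracy of a trivial classifier that always predicts the non-RR group in the RR vs. ALL setting; a minimal sketch of this arithmetic, with hypothetical variable names, is:

```python
# Majority-class baseline implied by the reported class distribution.
n_rr = 170
n_all = 205 + 68 + 8 + 6          # SP + PP + PR + Benign = 287
baseline = n_all / (n_rr + n_all)  # 287 / 457 ≈ 0.628, i.e. ~62.8%
```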


TABLE I: The feature set considered in this study.

Name     #Items  Description
AGEo     1       Age at the outset of the disease
AGEd     1       Age at the disease diagnosis
AGEv     1       Age at the examination
W        1       Weight [kg]
H        1       Height [cm]
MFIS     21      Modified Fatigue Impact Scale [3]
HADS     14      Hospital Anxiety and Depression Scale [4]
LIFE     11      Life Satisfaction Index [5]
OAB      8       Overactive Bladder Questionnaire [6]
FIM      19      Functional Independence Measure [7]
MOCA     11      Montreal Cognitive Assessment [8]
PASAT 3  1       Paced Auditory Serial Addition Task 3 [9]
SDMT     1       Symbol Digit Modalities Test [10]

Fig. 1: The proposed machine learning pipeline: Data → Preprocessing → Data Exploration (unsupervised analysis, problem setting) → ℓ1ℓ2 FS (feature selection, producing a nested list of features) → Linear Classification → Best Model.

The main difference between the last two groups of measures is that, while PRO can be collected directly from the patients, CS involve a number of scores that can only be assigned by trained medical staff. Each questionnaire may be composed of a large number of items. For instance, the Functional Independence Measure (FIM) is obtained by asking 19 different questions about how much each patient needs to be assisted to carry out daily living activities. We populated our data matrix with the score associated with every single item belonging to the questionnaires in Table I.

B. Experimental Design

The proposed pipeline (Fig. 1) is entirely implemented in Python and makes wide use of scikit-learn [11]. We also take advantage of the L1L2Signature and PyXPlanner libraries. The pipeline can be divided into four phases.

Preprocessing: First, the missing data (around 2%) are imputed with feature-wise median values. Afterwards, the dataset is normalized by means of min-max scaling, in order to coerce all features into the [0, 1] interval. A code sketch of this step is given below, before the description of the remaining phases.
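The following is a minimal sketch of the preprocessing step, not the original implementation: it assumes the data matrix X is a NumPy array with missing entries encoded as NaN, and uses the current scikit-learn API (SimpleImputer, MinMaxScaler).

```python
# Minimal sketch of the preprocessing phase: feature-wise median
# imputation followed by min-max scaling to [0, 1].
# Assumes X is an (n_samples, n_features) NumPy array with missing
# values encoded as NaN.
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

def preprocess(X):
    # Impute each missing entry with the median of its feature (column).
    X = SimpleImputer(strategy="median").fit_transform(X)
    # Coerce every feature into the [0, 1] interval.
    return MinMaxScaler().fit_transform(X)
```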

1) Data Exploration: The main goal of this phase is to look for a possible structure within the data at hand by means of unsupervised techniques for dimensionality reduction. The outcome of this step is the identification of a well-defined supervised setting among all possible scenarios.

2) Feature Selection: In the context of supervised learning, this phase aims at identifying a meaningful subset of features. The algorithm used for this purpose [12] requires a fine tuning of two regularization parameters, τ and λ; the learning procedure is therefore performed within two nested cross-validation loops, as in [13]. In our setting, the number of outer loops was B = 8 and the number of inner loops K = 4. The outcome of this phase is a set of M = 8 nested lists of selected features (see Section II-C).

3) Linear Classification: Finally, the last step was to test the performance of a number of linear classifiers on the dataset restricted to the list of features identified at the previous step. To attain bias-free models, the dataset is split into three chunks: training Xtr, validation Xvl and test Xts. The first two chunks are fed to a grid-search optimization of the parameters, then the best model is trained on Xtr ∪ Xvl. The final performance is evaluated on Xts. The splitting procedure is repeated N = 30 times to ensure statistical robustness.

C. Algorithms

1) Data Exploration: We briefly describe here the principles of the chosen algorithms for dimensionality reduction.

Principal Component Analysis (PCA) [14]: this algorithm estimates linearly uncorrelated directions, the so-called Principal Components (PC), that are linear combinations of the original features. The main property of the PC is that they explain a decreasing rate of the data variance.

Linear Discriminant Analysis (LDA) [14]: this algorithm is a link to the following supervised learning stages. In fact, taking the class labels into account, LDA provides the directions that maximize the inter-class separation.

2) Feature Selection: the ℓ1ℓ2 Framework. In the present work we focused on the two-stage feature selection strategy presented in [12]. In its first stage, a model with a minimal set of features and small bias is estimated by a two-step optimization: a first list of features is identified by selecting the $p^*$ non-zero entries of the weight vector $w$ obtained from the minimization of the functional $\|Y - wX\|_2^2 + \mu\|w\|_2^2 + \tau\|w\|_1$ for a small value of $\mu$. This also allows us to identify the value $\tau = \tau^*$ that minimizes the prediction error. The second optimization step consists in a standard regularized least squares solution, obtained by minimizing $\|Y - \tilde{w}\tilde{X}\|_2^2 + \lambda\|\tilde{w}\|_2^2$, where $\tilde{w}$ and $\tilde{X}$ are the weight vector and the data matrix restricted to the $p^*$ features identified in the previous step. This also allows us to estimate the value $\lambda = \lambda^*$ that minimizes the prediction error. To ensure statistical robustness, both $\tau^*$ and $\lambda^*$ are estimated by an inner K-fold cross-validation loop.

In the second stage, the same procedure is iterated M times with increasing values of the ℓ2-norm parameter $\mu_1 < \mu_2 < \cdots < \mu_M$, which controls the amount of correlation among features allowed by the model (in this stage $\lambda_i = \lambda^*$ and $\tau_i = \tau^*$ for all $i = 1, \ldots, M$). In order to gain statistical significance, the whole procedure is also nested in a B-fold cross-validation loop (following the algorithm described in [13]). For each feature, at a fixed $\mu_i$, this allows us to evaluate the selection frequency, i.e. how many times that feature is selected across the B splits of the outer cross-validation loop.
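A simplified, self-contained sketch of this two-stage procedure is given below. The original analysis uses L1L2Signature; here, as an assumption, the first stage is approximated with scikit-learn's ElasticNet, whose penalty corresponds (up to a rescaling of the loss term) to $\tau\|w\|_1 + \mu\|w\|_2^2$, and the second stage with a Ridge refit. The helper names are hypothetical, τ, µ and λ are taken as given, and the inner K-fold tuning of τ* and λ* is omitted.

```python
# Simplified sketch of the two-stage l1l2 feature selection described
# above (not the L1L2Signature implementation). Stage 1: sparse
# selection via an elastic-net penalty. Stage 2: regularized least
# squares refit on the selected features. Selection frequencies are
# computed across the B outer cross-validation folds.
import numpy as np
from sklearn.linear_model import ElasticNet, Ridge
from sklearn.model_selection import KFold

def l1l2_select(X, y, tau, mu):
    # Map (tau, mu) onto ElasticNet's (alpha, l1_ratio) parameterization:
    # alpha * l1_ratio = tau and alpha * (1 - l1_ratio) = 2 * mu.
    alpha = tau + 2 * mu
    w = ElasticNet(alpha=alpha, l1_ratio=tau / alpha, max_iter=10000).fit(X, y).coef_
    return np.flatnonzero(w)  # indices of the p* selected features

def rls_refit(X, y, selected, lam):
    # Stage 2: regularized least squares restricted to the selected features.
    return Ridge(alpha=lam).fit(X[:, selected], y)

def selection_frequency(X, y, tau, mu, B=8):
    # Fraction of the B outer folds in which each feature is selected.
    counts = np.zeros(X.shape[1])
    for train_idx, _ in KFold(n_splits=B, shuffle=True).split(X):
        counts[l1l2_select(X[train_idx], y[train_idx], tau, mu)] += 1
    return counts / B
```

Thresholding the resulting frequencies (e.g. keeping features with frequency ≥ 0.25, as done in Section III-B) then yields the most statistically stable feature list.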



Fig. 2: The data projected onto the directions identified respectively by PCA (left) and LDA (right).

This procedure therefore generates M consensus lists of features, each provided with a selection frequency score. Thresholding the selection frequency can then be used as a criterion to define the most statistically stable list of features.

3) Linear Classification: The main goal of this section is to investigate the performance of a set of linear classification algorithms on the binary classification problem identified for the presented dataset. A linear classifier is a function $f_w(x) = w^T x$, where $w, x \in \mathbb{R}^p$. The weight vector minimizes a functional of the form $L[f_w] = V(Y, f_w) + \lambda P(f_w)$, where the first term is the loss function, which evaluates the approximation error, and the second term is the regularizer, or penalty term. The regularization parameter $\lambda$ controls the trade-off between data fitness and smoothness of the solution. In our setting $P(f_w) = \|w\|_2^2$, while the following choices of loss function are tested [14]:

Ordinary Least Squares (OLS): $V(Y, f_w) = \|Y - wX\|_2^2$ and $\lambda = 0$;
Regularized Least Squares (RLS): $V(Y, f_w) = \|Y - wX\|_2^2$;
Logistic Regression (LR): $V(Y, f_w) = L(Y, f_w)$, where $L$ is the logistic loss function;
Linear Support Vector Machine (LSVM): $V(Y, f_w) = H(Y, f_w)$, where $H$ is the hinge loss function.

We also tested the performance of a K-Nearest Neighbors (KNN) classifier.

III. RESULTS

A. Data Exploration

The left side of Fig. 2 shows the projection of the data onto x1 and x2, the first two directions identified by PCA. From this first result we sensed that a combination of anthropometric data, CS and PRO measures allows us to isolate the patients diagnosed as RR (purple dots). The results of LDA reinforced our hypothesis (see the right side of Fig. 2). We then shifted our focus to a different problem: the RR vs. ALL supervised classification.
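In outline, the projections of Fig. 2 can be reproduced with the sketch below; it assumes X (the preprocessed 457 × 91 data matrix) and y (the five course labels) are available as NumPy arrays.

```python
# Sketch of the data exploration step behind Fig. 2: project the
# preprocessed data onto the first two PCA directions (unsupervised)
# and onto the first two LDA directions (label-aware).
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X_pca = PCA(n_components=2).fit_transform(X)
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)

fig, (ax1, ax2) = plt.subplots(1, 2)
for ax, Z, title in [(ax1, X_pca, "PCA"), (ax2, X_lda, "LDA")]:
    for course in set(y):           # one scatter per MS course label
        mask = y == course
        ax.scatter(Z[mask, 0], Z[mask, 1], label=course, s=10)
    ax.set_title(title)
ax1.legend()
plt.show()
```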

B. Feature Selection

As mentioned in Section I, we aim at identifying the most concise list of predictive features. This was obtained by thresholding the consensus list associated with the lowest value of µi (see the left side of Fig. 4). The threshold was set to 25%, which is where a kink occurs when plotting the number of selected features vs. the selection frequency, as shown in Fig. 3.

Fig. 3: Number of features selected by ℓ1ℓ2 FS vs. selection frequency.

The heatmap plot in the right panel of Fig. 4 depicts the feature correlation matrix. The matrix is sorted so as to highlight five correlation clusters, obtained as in [15]. This algorithm uses the five features that present a selection frequency of 100% as prototype variables and then assigns every other feature to a cluster based on the absolute value of its Pearson correlation with each prototype. The block-diagonal appearance of the matrix suggests that the prototypes effectively capture most of the correlation among the features.

C. Classification

The performance of the considered linear classifiers is summarized in Table II; a sketch of how such a comparison can be run is given below. We recall that, in this scenario, the accuracy score of a random classifier for RR vs. ALL is approximately 62.8% (i.e., 287/457). The average accuracy score reached by our selection of linear classifiers is significantly higher and, in particular, the least squares classifiers (OLS and RLS) reach the highest scores.
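The following hedged sketch illustrates the comparison protocol of Section II-B, phase 3; the estimator choices mirror the losses listed in Section II-C, but the parameter grids are illustrative assumptions, OLS is emulated by a ridge classifier with a vanishingly small penalty, and GridSearchCV's internal cross-validation stands in for the explicit train/validation split.

```python
# Sketch of the classifier comparison: split into train+validation and
# test chunks, grid-search the regularization parameter, refit the best
# model on train+validation (GridSearchCV's refit=True default), score
# on the held-out test chunk, and repeat N = 30 times.
import numpy as np
from sklearn.linear_model import RidgeClassifier, LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

models = {
    "OLS": (RidgeClassifier(alpha=1e-8), {}),  # lambda ~ 0
    "RLS": (RidgeClassifier(), {"alpha": np.logspace(-3, 3, 7)}),
    "LR": (LogisticRegression(), {"C": np.logspace(-3, 3, 7)}),
    "LSVM": (LinearSVC(), {"C": np.logspace(-3, 3, 7)}),
    "KNN": (KNeighborsClassifier(), {"n_neighbors": [3, 5, 7, 11]}),
}

def compare(X, y, N=30):
    scores = {name: [] for name in models}
    for _ in range(N):
        X_trvl, X_ts, y_trvl, y_ts = train_test_split(X, y, test_size=0.2)
        for name, (est, grid) in models.items():
            best = GridSearchCV(est, grid).fit(X_trvl, y_trvl)
            scores[name].append(best.best_estimator_.score(X_ts, y_ts))
    # Average accuracy per classifier across the N repetitions.
    return {name: np.mean(s) for name, s in scores.items()}
```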



Fig. 4: Left panel: the novel questionnaire made of the top 10 questions selected by ℓ1ℓ2 FS. The questions marked with a ∗ occur with 100% selection frequency (%SF) at the maximum value of µi. Right panel: heatmap plot of the feature correlation matrix.

 #   Name       %SF   Description
 1   LIFE 004   100   These are the best years of my life
 2   AGEv       100   Age at the examination
 3   FIM 011    100   Type of transfer: tub or shower
 4   FIM 012    100   Locomotion: walking
 5   FIM 014    100   Locomotion: stairs
 6   LIFE 009∗  88    I would not change my past life even if I could
 7   MFIS 020∗  62    I have limited my physical activities
 8   LIFE 005   25    Most of the things I do are boring or monotonous
 9   MFIS 014   25    I have been physically uncomfortable
10   HADS 007   25    I can sit at ease and feel relaxed

TABLE II: Comparison between the average percentage accuracy scores reached by the considered linear classifiers on the entire feature set (NO FS) and on the subset selected by ℓ1ℓ2 FS.

 %          OLS     RLS     KNN     LR      LSVM
 NO FS      71.79   72.42   72.44   77.28   75.69
 ℓ1ℓ2 FS    78.32   78.24   74.99   77.30   75.82

The robustness of these results (obtained from the N = 30 experiments) is supported by the small variances of the accuracy scores, which are always of the order of $10^{-3}$. Another non-trivial observation that can be drawn from Table II is that the ℓ1ℓ2 FS step clearly improves the performance of almost every considered linear classifier. It is interesting to notice that the OLS and RLS classifiers are the algorithms that benefit the most from the FS step: their accuracy scores increase by 6.53% and 5.82%, respectively.

IV. CONCLUSIONS

In this paper we propose a machine learning pipeline for the analysis of clinical questionnaires which aims at detecting the MS disease course. In particular, we restrict the problem to RR vs. the progressive and benign forms (ALL). To the best of our knowledge, this is the first study to predict the MS course taking into account only a small subset of anthropometric and questionnaire variables, which we propose as a novel questionnaire tailored for RR detection.

ACKNOWLEDGMENT

This study has been supported by the Italian MS Foundation - FISM (FISM/2013/S3). The authors thank Salvatore Masecchia and Matteo Barbieri for the implementation of L1L2Signature and PyXPlanner, respectively.

REFERENCES

[1] F. D. Lublin and S. C. Reingold, "Defining the clinical course of multiple sclerosis: results of an international survey," Neurology, vol. 46, no. 4, pp. 907–911, 1996.




[2] J. M. Sonder, L. J. Balk, F. A. van der Linden, L. V. Bosma, C. H. Polman, and B. M. Uitdehaag, "Toward the use of proxy reports for estimating long-term patient-reported outcomes in multiple sclerosis," Multiple Sclerosis Journal, p. 1352458514544078, 2014.
[3] P. Flachenecker, T. Kümpfel, B. Kallmann, M. Gottschalk, O. Grauer, P. Rieckmann, C. Trenkwalder, and K. Toyka, "Fatigue in multiple sclerosis: a comparison of different rating scales and correlation to clinical parameters," Multiple Sclerosis, vol. 8, no. 6, pp. 523–526, 2002.
[4] K. Honarmand and A. Feinstein, "Validation of the Hospital Anxiety and Depression Scale for use with multiple sclerosis patients," Multiple Sclerosis, 2009.
[5] F. Franchignoni, L. Tesio, M. Ottonello, and E. Benevolo, "Life Satisfaction Index: Italian version and validation of a short form," American Journal of Physical Medicine & Rehabilitation, vol. 78, no. 6, pp. 509–515, 1999.
[6] L. Cardozo, D. Staskin, B. Currie, I. Wiklund, D. Globe, M. Signori, R. Dmochowski, S. MacDiarmid, V. W. Nitti, and K. Noblett, "Validation of a bladder symptom screening tool in women with incontinence due to overactive bladder," International Urogynecology Journal, vol. 25, no. 12, pp. 1655–1663, 2014.
[7] C. Granger, A. Cotter, B. Hamilton, R. Fiedler, and M. Hens, "Functional assessment scales: a study of persons with multiple sclerosis," Archives of Physical Medicine and Rehabilitation, vol. 71, no. 11, pp. 870–875, 1990.
[8] E. Dagenais, I. Rouleau, M. Demers, C. Jobin, É. Roger, L. Chamelian, and P. Duquette, "Value of the MoCA test as a screening instrument in multiple sclerosis," The Canadian Journal of Neurological Sciences, vol. 40, no. 3, pp. 410–415, 2013.
[9] R. Aupperle, W. Beatty, F. Shelton, and S. Gontkovsky, "Three screening batteries to detect cognitive impairment in multiple sclerosis," Multiple Sclerosis, vol. 8, no. 5, pp. 382–389, 2002.
[10] B. Parmenter, B. Weinstock-Guttman, N. Garg, F. Munschauer, and R. H. Benedict, "Screening for cognitive impairment in multiple sclerosis using the Symbol Digit Modalities Test," Multiple Sclerosis, vol. 13, no. 1, pp. 52–57, 2007.
[11] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, "Scikit-learn: machine learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
[12] C. De Mol, S. Mosci, M. Traskine, and A. Verri, "A regularized method for selecting nested groups of relevant genes from microarray data," Journal of Computational Biology, vol. 16, no. 5, pp. 677–690, 2009.
[13] A. Barla, S. Mosci, L. Rosasco, and A. Verri, "A method for robust variable selection with significance assessment," in ESANN, 2008, pp. 83–88.
[14] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning. Springer, 2009.
[15] S. Mosci, A. Barla, A. Verri, and L. Rosasco, "Finding structured gene signatures," in Proc. IEEE International Conference on Bioinformatics and Biomedicine Workshops (BIBMW), 2008, pp. 158–165.
