A computer-learning model using hand-drawn images for the assessment of visuo-spatial neglect

Y. Liang (a), M. C. Fairhurst (a), R. M. Guest (a)*, J. M. Potter (b)
(a) Department of Electronics, University of Kent, UK, CT2 7NT
(b) Kent and Canterbury Hospital, Canterbury, UK
* Corresponding author email: [email protected]

Abstract. Visuo-spatial neglect is a complex post-stroke medical syndrome which may be assessed by means of a series of drawing-based tests. The aim of this study is to validate, against a battery of pencil-and-paper tests, an automated computer-based assessment system built on feature-level performance modelling, which encapsulates the temporal sequence and other "dynamic" information inherent in the drawing process. Four continuous scoring approaches using algorithmically extracted features are considered and compared. The optimal model is shown to produce significant agreement with the drawing-related components of an existing test battery, while the automated system benefits from repeatability in assessment and a reduction in the number of tasks which must be completed by the patient.
1. Introduction

Visuo-spatial neglect (designated "neglect" in the remainder of this paper) is a complex medical condition following stroke, characterised by a patient's failure to respond to stimuli on one side of the visual field (Cherney et al. 2001). Conventionally, clinicians use a series of "pencil-and-paper" drawing tests, alongside other forms of assessment, for the diagnosis of neglect. The Behavioural Inattention Test (BIT) (Wilson et al. 1987; Halligan et al. 1992) uses six pencil-and-paper diagnostic tests (referred to as the BIT Conventional subset) in which a test subject is evaluated on the execution of 13 individual drawings, and is currently accepted in the UK as the clinical standard for neglect assessment.

Assessment of responses is performed manually by clinical experts according to a set of marking criteria, which introduces the possibility of subjectivity and can make assessment time-consuming (Bailey 2000). The marking criteria also do not account for dynamic/temporal information, which has been shown to carry important diagnostic information (Bartolomeo 1998).

The work described in this paper therefore aims to improve the assessment process with respect to these issues through the application of a series of feature modelling methods. Utilising a device for capturing the temporal information inherent in the test process, a computer-based testing system has been developed, consisting of a set of hand-drawing tests and software to extract performance-related features from the captured data (Fairhurst et al. 1998). These features have been analysed individually in previous studies (Guest 1999; Liang et al. 2007), identifying a positive contribution to the accuracy and resolution of a diagnosis. In our previous work (Liang et al. 2007), a binary ('pass'/'fail') classification model was developed for predicting the diagnosis of neglect using a combination of extracted features. For the purpose of assessing rehabilitation progress, however, it is necessary to provide a continuous score analogous to the BIT score. The work described in this paper therefore investigates the possibility of developing an assessment model providing a continuous output equivalent to the corresponding BIT Conventional score.
2. Methodology

In our experimentation, a dataset containing drawing responses from 33 neglect patients and 110 stroke control subjects was used as a training dataset. Each of the subjects in the dataset also completed the BIT Conventional subtests to provide baseline data. For the purposes of testing, a second, disjoint dataset was collected (using the same protocol as the training dataset) comprising 18 neglect patients and 26 stroke control subjects. Features describing performance, as defined in (Liang et al. 2007), were extracted from 17 pencil-and-paper tasks for each subject. For the purpose of this study, these tasks can be categorised as follows:

• Cancellation tasks – a total of 57 features were extracted from each task:
  • An Albert's cancellation test (ALB)
  • A target differentiation cancellation task (OX)
• Drawing tasks – a total of 35 features were extracted from each task:
  • Figure copying tasks (4 tasks) – a cross (FCCR), cube (FCCU), star (FCST) and square (FCSQ).
  • Drawing from memory tasks (2 tasks) – a square (DMSQ) and cube (DMCU).
  • Figure completion tasks (6 tasks) – individually completing the left and right side of a diamond (FMDL/FMDR), an illustrative human shape (FMML/FMMR) and an illustrative house shape (FMHL/FMHR).
• Derivative tasks (3 tasks) – derived by forming a ratio between equivalent features from the two figure completion tasks with the same shape (FMDD, FMHH and FMMM).

In forming diagnostic models based on extracted features, automated variable selection procedures are often chosen as a tool for exploratory studies, where the nature of the relationship between the predictors and the predicted variable is not established in theory. A Linear Regression with Forward Stepwise variable selection (Weisberg 1985), using performance features from across all 17 tasks, was therefore initially chosen to form a continuous assessment model. Entirely automated procedures are not, however, usually recommended for optimal feature set selection, as overfitting of the training data can occur (Austin et al. 2004). In this study, a second solution strand is therefore developed to form the predictor of neglect symptoms, in which all visually assessable aspects of the test response of a single task are considered. Instead of using an automated variable selection procedure, various aspects of the test responses from a group of tasks were assessed and a feature set/predictor formed to represent these observations.

As described in our previous work (Liang et al. 2007), features are transformed and normalised to measure the difference between an individual feature value and the average feature value of the control subject group. The predictor defined in this second approach therefore measures the sum of the distances from a specific subject's test response to the average response of the control subjects. The general predictor of neglect symptoms across all drawing tasks takes the form of the unweighted sum of a set of features, as specified in Equation 1 (assessing the test response in each task in terms of total execution time, centre of drawing on the x and y axes, mean pen pressure, standard deviation of pen tilt, pen travel distance, height and width of the drawing, mean pen acceleration, pen-on to pen-off ratio, and number of pen lifts). Because a different range of features is extracted from the cancellation tasks, a separate predictor is defined (Equation 2, using features describing the number of cancellation omissions, time per cancellation, the total time and movement time ratios between cancellations made on the left-hand side and the right-hand side, and the pen-on to pen-off ratio).

DRAWPRED = TOTAL-TIME + X-CENTRE + Y-CENTRE + MEAN-PRES + STD-XTILT + PEN-DIS + HEIGHT + WIDTH + MEAN-ACC + ON-OFF-R + PEN-LIFT      (Equation 1)

CANCPRED = OMISSIONS + TIME-PER-CAN + TOTAL-LR + MOVE-LR + ON-OFF-R      (Equation 2)
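By way of illustration, the sketch below computes the two task-level predictors. It is a minimal sketch only: it assumes each feature value has already been transformed and normalised to a distance from the control-group mean as described above, and the dictionary-based interface and feature keys are hypothetical rather than part of the original system.

```python
# Illustrative sketch of the task-level predictors in Equations 1 and 2.
# Assumes each feature value has already been normalised to a distance from
# the control-group mean; the dict interface and key names are hypothetical.

DRAW_FEATURES = [
    "TOTAL-TIME", "X-CENTRE", "Y-CENTRE", "MEAN-PRES", "STD-XTILT",
    "PEN-DIS", "HEIGHT", "WIDTH", "MEAN-ACC", "ON-OFF-R", "PEN-LIFT",
]

CANC_FEATURES = [
    "OMISSIONS", "TIME-PER-CAN", "TOTAL-LR", "MOVE-LR", "ON-OFF-R",
]

def draw_pred(features: dict) -> float:
    """Equation 1: unweighted sum of normalised drawing-task features."""
    return sum(features[name] for name in DRAW_FEATURES)

def canc_pred(features: dict) -> float:
    """Equation 2: unweighted sum of normalised cancellation-task features."""
    return sum(features[name] for name in CANC_FEATURES)
```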
Based on these predictors, three further modelling approaches to continuous scoring for the computer-based tasks were considered:

• Unweighted Voting Model: A voting system is developed using the predictors defined in Equations 1 and 2, calculated separately from the features extracted from individual tasks. The tasks chosen to contribute to the voting were selected following the selection of an optimum binary prediction configuration. The final score in this approach is calculated as the unweighted sum of the chosen task predictors forming the voting system.

• Weighted Voting Model: The BIT Conventional score comprises the sum of scores from six subtests: three cancellation subtests contributing a maximum of 130 marks and three representational drawing subtests contributing the remaining 16 marks out of the maximum 146. This observation leads to a method of forming the computer-based score weighted with respect to the two different sets of tasks. The cancellation tasks from the computer-based test battery are used to approximate the BIT cancellation subtests, and the drawing tasks to approximate the BIT drawing subtests. Again using the predictors shown in Equations 1 and 2, different weights are assigned to tasks according to the proportion of the marks assigned to the equivalent tasks within the BIT. As with the Unweighted Voting Model, tasks were selected according to optimum prediction performance.

• Weighted Linear Model: Also inspired by the uneven weighting of the different task categories within the BIT, the computer-based score in this approach is composed of two parts, referred to as the cancellation score and the drawing score respectively. The computer-based cancellation score is calculated as 130 times the average of the predictors for the ALB and OX tasks, both scaled to a value between 0 and 1. Linear regression modelling is applied to form the model for calculating the drawing score. In training the regression model, the summed score of the BIT drawing subtests is taken as the dependent variable and the predictor for each individual drawing task as an independent variable.

In each case, the performance of the modelling system is assessed by its correlation with the BIT score, which ranges from 0 to 146. Spearman's rank correlation is used to investigate this correlation for two reasons: first, it is suitable for ordinal variables and, second, it measures any arbitrary monotonic relationship between two variables. Although the BIT Conventional subset produces a score ranging from 0 to 146, the clinical interpretation of the score typically uses a diagnosis with lower scoring resolution. A number of reported studies have interpreted the continuous BIT score on a four-point-scale classification (Lafosse et al. 2004; Buxbaum et al. 2007) with the following groupings: 130-146 = non-neglect, 90-129 = mild neglect, 70-89 = moderate neglect,
0-69 = severe neglect. These categories will also be used in this study. The agreement between the two scores on the four-point scale defined above can be assessed by the Kappa test (Cohen 1960), which measures the agreement between two nominal or categorical variables and has been suggested as a suitable method specifically for medical applications (Chan 2003). In addition, at the lowest resolution of diagnosis, it is important to assess the agreement between the binary diagnostic outcomes produced by the BIT score and the continuous computer-based scores, in terms of the False Acceptance Rate (FAR) and False Rejection Rate (FRR).
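For concreteness, the sketch below shows how these agreement measures could be computed with standard Python tooling (scipy and scikit-learn). The cut-off values follow the four-point groupings given above; the helper names, the binary cut-off at 130, and the exact orientation of FAR/FRR are assumptions for illustration rather than the paper's own implementation.

```python
# Sketch of the agreement measures described above (assumed implementation).
from scipy.stats import spearmanr
from sklearn.metrics import cohen_kappa_score

def four_point_scale(score: float) -> str:
    """Map a 0-146 score onto the four-point clinical groupings."""
    if score >= 130:
        return "non-neglect"
    if score >= 90:
        return "mild neglect"
    if score >= 70:
        return "moderate neglect"
    return "severe neglect"

def agreement(bit_scores, model_scores):
    # Continuous agreement: Spearman's rank correlation.
    rho, p = spearmanr(bit_scores, model_scores)
    # Four-point-scale agreement: Cohen's kappa on the categorised scores.
    kappa = cohen_kappa_score(
        [four_point_scale(s) for s in bit_scores],
        [four_point_scale(s) for s in model_scores],
    )
    # Binary outcome: scores below 130 treated as a neglect diagnosis (assumption).
    bit_neglect = [s < 130 for s in bit_scores]
    model_neglect = [s < 130 for s in model_scores]
    # FAR taken here as BIT non-neglect subjects flagged by the model,
    # FRR as BIT neglect patients missed by the model (assumed orientation).
    far = sum(m and not b for b, m in zip(bit_neglect, model_neglect)) / max(1, bit_neglect.count(False))
    frr = sum(b and not m for b, m in zip(bit_neglect, model_neglect)) / max(1, bit_neglect.count(True))
    return rho, p, kappa, far, frr
```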
3. Results and Discussion

Each of the modelling methods is considered separately.

3.1 Linear Regression Model

As a result of the linear regression training with automated feature selection criteria, five features extracted from five different tasks were used to form this model. The selected features, alongside their coefficients, are shown in Table 1.

Table 1. Linear Regression Model variables and coefficients

Feature description                                                      Coefficient
constant                                                                 102.91
number of cancellations in the bottom left quadrant extracted from OX    18.22
mean acceleration of the pen movement extracted from FMMR                -4.41
movement time in the bottom left quadrant extracted from ALB             -3.43
height of drawing extracted from FMHR                                    -7.15
ratio of pen on to pen off time extracted from FMMM                      -7.08
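As an illustration, the Table 1 coefficients could be applied as in the sketch below; the argument names are hypothetical placeholders for the corresponding extracted feature values, which are assumed to be in the same units as used during training.

```python
# Illustrative application of the Table 1 coefficients; the argument names
# are hypothetical placeholders for the corresponding extracted features.
def linear_regression_score(ox_bottom_left_cancellations: float,
                            fmmr_mean_acceleration: float,
                            alb_bottom_left_movement_time: float,
                            fmhr_drawing_height: float,
                            fmmm_on_off_ratio: float) -> float:
    return (102.91
            + 18.22 * ox_bottom_left_cancellations
            - 4.41 * fmmr_mean_acceleration
            - 3.43 * alb_bottom_left_movement_time
            - 7.15 * fmhr_drawing_height
            - 7.08 * fmmm_on_off_ratio)
```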
3.2 Unweighted/Weighted Voting Model

Since the Unweighted Voting Model and the Weighted Voting Model are both built on a voting system, the results of these two modelling approaches are presented together in this subsection. Seven tasks/predictors were chosen to form the voting model on the basis that they produced the optimum binary prediction performance under a majority vote scheme: ALB, OX, DMSQ, FCST, FMDR, FMDD and FMMM. The Unweighted Voting Model therefore takes the form illustrated in Equation 3, while the Weighted Voting Model is specified in Equation 4.
y_2 = ((x_alb + x_ox + x_dmsq + x_fcst + x_fmdr + x_fmdd + x_fmmm) / 7) × 146      (Equation 3)

y_3 = ((x_alb + x_ox) / 2) × 130 + ((x_dmsq + x_fcst + x_fmdr + x_fmdd + x_fmmm) / 5) × 16      (Equation 4)
where x_task represents the predictor of a designated task as calculated by Equation 1 for drawing tasks or Equation 2 for cancellation tasks, and the subscript denotes the corresponding task name.

3.3 Weighted Linear Model

Three computer-based tasks (DMSQ, FMMR and FMHH), producing the optimum approximation of the BIT drawing score, were selected to form the model. Together with the predictors for the two cancellation tasks, the equation for the Weighted Linear Model is specified in Equation 5.
y_4 = ((x_alb + x_ox) / 2) × 130 + (23.664 − 0.516 x_fmmr − 0.429 x_fmhh − 0.342 x_dmsq)      (Equation 5)
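The three hand-crafted scoring models can be written compactly as below. This is a sketch under the assumption that each x_* input is the task predictor from Equation 1 (drawing tasks) or Equation 2 (cancellation tasks), already scaled to the 0-1 range described in Section 2; the function names are illustrative.

```python
# Sketch of Equations 3-5; each x_* is the task predictor from Equation 1
# (drawing tasks) or Equation 2 (cancellation tasks), assumed to be scaled
# to the 0-1 range described in Section 2.

def unweighted_voting_score(x_alb, x_ox, x_dmsq, x_fcst, x_fmdr, x_fmdd, x_fmmm):
    """Equation 3: unweighted mean of the seven task predictors, scaled to 146."""
    return (x_alb + x_ox + x_dmsq + x_fcst + x_fmdr + x_fmdd + x_fmmm) / 7 * 146

def weighted_voting_score(x_alb, x_ox, x_dmsq, x_fcst, x_fmdr, x_fmdd, x_fmmm):
    """Equation 4: cancellation and drawing predictors weighted 130/16 as in the BIT."""
    return ((x_alb + x_ox) / 2 * 130
            + (x_dmsq + x_fcst + x_fmdr + x_fmdd + x_fmmm) / 5 * 16)

def weighted_linear_score(x_alb, x_ox, x_fmmr, x_fmhh, x_dmsq):
    """Equation 5: BIT-style cancellation weighting plus regression-based drawing score."""
    return ((x_alb + x_ox) / 2 * 130
            + (23.664 - 0.516 * x_fmmr - 0.429 * x_fmhh - 0.342 * x_dmsq))
```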
3.4 Agreement Assessment and Discussion

Employing the disjoint testing dataset, Table 2 presents the agreement between the BIT score and the model scores produced by the different modelling methods, with respect to the three score forms described above. During the training process, the Linear Regression Model showed the most promising performance (rho = 0.927). However, it produces the poorest result for each of the assessment methods when tested with the disjoint testing dataset. This dramatic change in model performance indicates overfitting caused by the use of the automated variable selection procedure, as discussed in Section 2. Among the other three models, the Weighted Linear Model, as shown in Table 2, produces the best classification performance both on the binary outcome and on the four-point scale. Given that, in clinical practice, the four-point scale is the highest resolution at which the BIT score is interpreted, the Weighted Linear Model is rated as the optimal model. To give an intuitive view of the agreement for individual subjects, Table 3 provides the cross-tabulation between the four-point-scale outcomes of the BIT and the Weighted Linear Model.
4. Conclusions

The results and discussion presented here indicate that it is possible to use a computer-based testing system to assess neglect symptoms by means of both traditional and novel features, with an agreement of up to 92.5% with the diagnosis obtained from the BIT score. While a significant agreement with the BIT score is thus established, the computer-based testing system has additional advantages over the BIT test battery, including: the capability of assessing dynamic features; a reduction in the number of tasks (five, compared with the 13 individual drawings of the BIT Conventional subset), with the possibility of reducing patient fatigue and the feasibility of repeating the test battery on a regular basis; and the assessment of features using automated and objective measurement.

Table 2. The agreement between the computer-based score and the BIT Conventional score (Kappa agreement of four-point-scale outcomes)

Computer-based score model      Kappa     p
Linear Regression Model         0.187     0.042
Unweighted Voting Model         0.245     0.004
Weighted Voting Model           0.410
Weighted Linear Model