One-Dimensional Statistical Parametric Mapping in Python

Todd C. Pataky
Department of Bioengineering, Shinshu University, Japan

Abstract

Statistical Parametric Mapping (SPM) is a topological methodology for detecting field changes in smooth n-dimensional continua. Many classes of biomechanical data are smooth and contained within discrete bounds and as such are well-suited to SPM analyses. The current paper accompanies release of ‘SPM1D’, a free and open-source Python package for conducting SPM analyses on a set of registered one-dimensional curves. Three example applications are presented: (i) kinematics, (ii) ground reaction forces, and (iii) contact pressure distribution in probabilistic finite element modelling. In addition to offering a high-level interface to a variety of common statistical tests like t tests, regression and ANOVA, SPM1D also emphasises fundamental concepts of SPM theory through stand-alone example scripts. Source code and documentation are available at: www.tpataky.net/spm1d.

1 Introduction

One-dimensional biomechanical curves like ground reaction forces and kinematic trajectories are typically sampled at frequencies that can yield hundreds or thousands of values per recording. (‘Curve’ is used throughout to refer to continuous or step-wise continuous one-dimensional trajectories.) When these curves are registrable (Sadeghi et al., 2003), e.g. by normalising by stance phase duration (Cavanagh and Lafortune, 1980), mean and variance curves are typically computed, but hypothesis testing is usually conducted only on a relatively small number of summary metrics (e.g. values at local optima) (e.g. Nilsson and Thorstensson, 1989). This reduction to summary-metric space is computationally convenient but it is not strictly necessary; statistical hypothesis testing may alternatively be conducted in a continuous manner, directly on the original (registered) curves using statistical parametric mapping (SPM) (Friston et al., 2007).

SPM is an n-dimensional methodology for the topological analysis of smooth continuum changes associated with experimental intervention. Most commonly used in 3D functional neuroimaging (e.g. Friston et al., 1995), it can also be effectively applied to a variety of applications in other fields including biomechanics (Pataky, 2010). SPM’s two main advantages over the summary metric approach are that: (i) statistical results are presented directly in the original sampling space, so their spatiotemporal biomechanical context is immediately apparent, and (ii) there is no need for (potentially biasing) assumptions regarding the spatiotemporal foci of signals (Pataky, 2010, Appendix D).

This paper introduces ‘SPM1D’, a free and open-source software package for one-dimensional SPM. As compared with existing 3D or nD SPM releases such as SPM8 (www.fil.ion.ucl.ac.uk/spm), fMRIstat (www.math.mcgill.ca/keith/fmristat), and NiPy (neuroimaging.scipy.org), SPM1D is more compact, with only very basic pre-processing functionality and implementing almost exclusively SPM computations, and thus may be useful for studying SPM concepts in their simplest (1D) form. The main implementation goals were: (i) to create a compact and easy-to-use interface for 1D SPM analysis, and (ii) to emphasise key SPM concepts with scripts and online documentation. The current paper briefly describes SPM procedures and SPM1D implementation details followed by three SPM1D demonstrations using experimental datasets.

2 Statistical computation

Consider a set of curves representing independent experimental observations (Figure 1a). Following registration these data may be assembled into an $I \times K$ matrix $Y$ where $I$ and $K$ are the numbers of curves and nodes per curve, respectively, and an experimental design can be represented by an $(I \times J)$ matrix $X$, where $J$ is the number of experimental factors; in the Figure 1 example: $I=12$, $J=2$ and $K=100$. Given data $Y$ and design $X$ the entire family of linear parametric statistical tests can be efficiently implemented using a mass-univariate general linear model (GLM) (Friston et al., 1995) (Figure 1b):

$$Y = X\beta + \varepsilon \qquad (1)$$

where $\beta$ is a $(J \times K)$ matrix of unknown regression parameters and $\varepsilon$ is an $(I \times K)$ matrix of residuals. Least-squares parameter estimates are given by:

$$\hat{\beta} = (X^T X)^{-1} X^T Y \qquad (2)$$

where $(X^T X)^{-1} X^T$ is the Moore-Penrose pseudo-inverse of $X$. In parameter matrix $\hat{\beta}$ there are $J$ curves that reflect the mean trends for each of the $J$ experimental factors. For example, if $X_j$ (the $j$th column of $X$) is a binary categorical variable (as in Figure 1), then $\hat{\beta}_j$ (the $j$th row of $\hat{\beta}$) is the mean curve for that condition. If $X_j$ is a continuous independent variable (e.g. walking speed), then $\hat{\beta}_j$ is a curve of regression slopes. The variance about these mean trends is given by first computing model residuals as:

$$\varepsilon = Y - X\hat{\beta} \qquad (3)$$

and then assembling the variance curve $\hat{\sigma}^2$ as:

$$\hat{\sigma}^2 = \frac{\mathrm{diag}(\varepsilon^T \varepsilon)}{I - \mathrm{rank}(X)} \qquad (4)$$
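Equations 2 to 4 can be sketched in a few lines of NumPy. The following is a minimal illustration, not SPM1D's own code: the function name `glm_fit` and the simulated data are assumptions introduced here, and a current Python 3 / NumPy installation is assumed rather than the versions used in the paper.

```python
import numpy as np

def glm_fit(Y, X):
    """Mass-univariate GLM fit (hypothetical helper, not part of SPM1D).

    Returns the parameter curves (J x K), residuals (I x K) and
    variance curve (K,) of Eqns. 2-4.
    """
    # Eqn. 2: pinv(X) equals (X^T X)^{-1} X^T when X has full column rank
    beta_hat = np.linalg.pinv(X) @ Y
    # Eqn. 3: model residuals
    resid = Y - X @ beta_hat
    # Eqn. 4: diag(resid^T resid) is the per-node sum of squared residuals
    dof = Y.shape[0] - np.linalg.matrix_rank(X)
    sigma2 = (resid ** 2).sum(axis=0) / dof
    return beta_hat, resid, sigma2

# Simulated data with the Figure 1 dimensions: I=12 curves, J=2, K=100 nodes
rng = np.random.default_rng(0)
I, K = 12, 100
X = np.zeros((I, 2))
X[0::2, 0] = 1          # alternating 'Cond A' trials
X[1::2, 1] = 1          # alternating 'Cond B' trials
Y = rng.standard_normal((I, K))
beta_hat, resid, sigma2 = glm_fit(Y, X)
```

With this binary design each row of `beta_hat` is simply the mean curve of one condition, and `sigma2` is the pooled within-condition variance at each node.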

Since the above equations yield mean and variance curves it is apparent that arbitrary linear statistical testing (i.e. t tests, regression, ANOVA, MANCOVA, etc.) can be implemented with various choices of $X$ and various combinations of the $\hat{\beta}_j$ curves. Appropriate design specification and equations for particular statistics are described extensively elsewhere (e.g. Friston et al., 2007). Currently only the t and F statistics are considered (see Appendix for computational details) as these two statistics encompass the aforementioned family of linear statistical tests. The t and F statistics are respectively termed ‘SPM{t}’ and ‘SPM{F}’ in the literature (Friston et al., 2007); they are known as ‘parametric maps’ because they are computed using a parametric approach (i.e. the mean and variance parameterise continuous distributions), and because these statistical values map node-by-node to the spatial or temporal process of interest. An example SPM{t} is depicted in Figure 1c.

The statistical significance of SPM{t} and SPM{F} curves can be assessed in a variety of ways (Friston et al., 1994, 1995), including non-parametric approaches (Holmes et al., 1996), and inference procedures continue to develop (Zhang et al., 2009). Most are topological in nature because such approaches take advantage of the regional correlation present in smooth curves, thereby yielding more realistic significance thresholds than are achieved with typical multivariate procedures (e.g. Bonferroni correction). All parametric procedures are grounded in the mathematics of random field theory (RFT) (Adler, 1981). For simplicity SPM1D currently employs a single inference procedure: a classical height-threshold cluster-breadth procedure (Friston et al., 1994). Briefly, as the smoothness of a random field increases, so does the breadth of clusters that survive a specified threshold (e.g. t > 3.0) (see Figure 1c). However, very broad and/or very high clusters are expected to occur with low probability. The fundamental RFT-relevant units are thus the topological features of thresholded SPM{t} and SPM{F} curves (e.g. height, cluster breadth). Ultimately the RFT procedures yield a single p value for each observed suprathreshold cluster (Figure 1c). In general, given the field smoothness (estimated from $\varepsilon$) and the search space (a function of $K$), these p values may be interpreted as the probability with which the observed suprathreshold cluster could have resulted from a smooth random process.
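The topological features on which this inference operates (cluster location, breadth and height) can be extracted from a thresholded curve as sketched below. This is only an illustration under stated assumptions: the function name `suprathreshold_clusters` and the toy curve are hypothetical, and the RFT calculation that converts these features into p values is not shown.

```python
import numpy as np

def suprathreshold_clusters(z, height):
    """Return (start, end, extent, peak) for each run of nodes with z > height.

    'end' is exclusive, so extent = end - start (in nodes).
    Hypothetical helper for illustration; not part of SPM1D.
    """
    # Pad with False so that clusters touching either boundary are closed
    above = np.concatenate(([False], z > height, [False]))
    # Indices where the thresholded curve switches between below and above
    edges = np.flatnonzero(np.diff(above.astype(int)))
    starts, ends = edges[0::2], edges[1::2]
    return [(int(s), int(e), int(e - s), float(z[s:e].max()))
            for s, e in zip(starts, ends)]

# Toy SPM{t}: a single smooth bump centred at node 50
nodes = np.arange(100)
t = 5.0 * np.exp(-0.5 * ((nodes - 50) / 5.0) ** 2)
clusters = suprathreshold_clusters(t, height=3.0)
# clusters -> [(45, 56, 11, 5.0)]
```

Raising the `height` argument narrows or removes clusters, which is why cluster breadth must always be interpreted relative to the chosen threshold.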

3 Implementation overview

SPM1D was implemented in Python 2.5 (www.python.org) using the standard scientific and visualisation packages: NumPy 1.3, SciPy 0.8, Matplotlib 0.99 and pyTables 2.1.1 as packaged with the Enthought Python Distribution 5.0 (Enthought Inc., Austin, USA). Python was chosen predominantly because of its high-level and easily-readable syntax and because of its object-oriented (OO) capabilities; an OO framework was considered desirable because many SPM procedures are intrinsically encapsulated. For example, field smoothness estimation should only be conducted on residual curves (Eqn. 3) (Kiebel et al., 1999), so it is natural to constrain smoothness estimation to a ‘residuals’ data class. A second example is that SPM{t} and SPM{F} curves typically involve different inference procedures (Friston et al., 2007), so these procedures can naturally be tied to separate SPM{t} and SPM{F} data classes. To demonstrate SPM1D’s high-level OO syntax consider the following complete script:

    from matplotlib import pyplot
    from spm1d import io,stats

    YA = io.load('mydataA.h5');  YA.smooth(2.0)
    YB = io.load('mydataB.h5');  YB.smooth(2.0)

    SPMt  = stats.ttest2(YA,YB)
    SPMti = SPMt.inference(height=3.0)

    pyplot.figure();  YA.mean().plot();  YB.mean().plot(color='r')
    pyplot.figure();  SPMti.plot()

The first two lines load the appropriate libraries, the next two load data from two different experimental conditions (stored in hierarchical data format by default [REF]) and smooth the curves, the next two compute the SPM{t} and conduct statistical inference, and the final two lines plot the results. From: SPMti = SPMt.inference(height=3.0) the conceptual benefits of an OO approach are evident: the test statistic values (the SPM{t}) exist as a raw field before conducting inference, and inference can only be conducted on an SPM curve, a data class that contains the smoothness and geometry information necessary for conducting RFT-based inference. Thus the OO architecture itself communicates non-trivial information regarding SPM procedures.

A similar high-level OO syntax was adopted throughout, supported by a data class hierarchy (Figure 2) which, in turn, is based heavily on the high-level wrappers to efficient algebraic, statistics, and optimisation libraries in NumPy and SciPy. Consequently, scripts like those above can execute rapidly; a 20-curve, 100-node dataset was processed on a laptop computer (2.13 GHz dual-core, 2 GB) with an average duration of approximately 8 ms: loading (1.8 ms), statistical computations (3.5 ms), and inference (2.3 ms).

This cascading OO approach (SPM1D → SciPy → NumPy → numerical libraries) additionally offers a convenience to users who are interested in computational specifics. Since computational methods are tied directly to objects the user will know precisely where to look for these details upon a simple query of the object in question. For example, querying an ‘SPM_t’ object (see Figure 2) will inform the user of both its source code location (in the ‘_spm.py’ module) and its intended functionality (with method names: ‘plot’ and ‘inference’). Full API documentation and example scripts are available at www.tpataky.net/spm1d. The code was tested successfully on Windows XP, Windows 7, Mac OS X 10.5, Mac OS X 10.6, and Ubuntu 9.10.
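The kind of object query described above can be performed with Python's standard `inspect` module. The sketch below uses a standard-library object (`json.JSONDecoder`) purely as a stand-in, since the exact SPM1D class names are not assumed here; the same calls should apply to any Python object.

```python
import inspect
import json

# Stand-in object; in practice this would be an object returned by SPM1D
obj = json.JSONDecoder()
cls = type(obj)

# Where the class is defined: dotted module name and source file path
module_name = cls.__module__
source_file = inspect.getsourcefile(cls)

# What the object can do: names of its public bound methods
public_methods = [name
                  for name, member in inspect.getmembers(obj, inspect.ismethod)
                  if not name.startswith('_')]

print(module_name)
print(source_file)
print(public_methods)
```

In an interactive session, `help(obj)` gives a similar summary together with any docstrings.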

4 Examples

4.1 Probabilistic FEA (two-sample t test)

A comparison of contact pressures between diabetic and non-diabetic subjects during axisymmetric heel-pad indentation (Figure 3a) (Erdemir et al., 2006) was conducted using SPM-based probabilistic finite element analysis (FEA) (Pataky, 2010). Mean diabetic and non-diabetic high-strain Ogden parameters were set at: $\alpha$=7.30 and 6.82, respectively, to replicate the conditions under which planar stress field significance was observed (Pataky, 2010). Indenter depth was held constant at 8 mm and material parameters for the diabetic and non-diabetic groups were varied using the reported variances (Erdemir et al., 2006) in a 100-repetition Monte Carlo simulation. Since a large number of simulations was conducted the sample variance (Eqn. 4) was assumed to be equivalent to the true population variance. FE solutions were obtained using ABAQUS 6.7 (Simulia, Providence, USA). As expected the diabetic parameters yielded higher mean contact pressures than the non-diabetic parameters (Figure 3b). Like the cited planar stress field results, these contact pressure differences were found to be topologically significant (p=0.000) (Figure 3c).


4.2 Foot kinematics (single-subject ANOVA)

A single subject from a separate study (Caravaggi et al., in review) performed 10 trials of each of ‘slow’, ‘normal’, and ‘fast’ walking. Foot kinematics were recorded using a ProReflex system (Qualisys, Gothenburg, Sweden) following Leardini et al. (2007). All experimental subjects (here and below) gave informed consent according to the policies of the Research Ethics Committee of the University of Liverpool. Sagittal plane ankle rotation was computed relative to rest ($\theta$) and these angular trajectories were registered by normalising by stance phase duration. Mean curves (Figure 4a) revealed that fast walking tended to be associated with reduced dorsiflexion in mid-stance and increased plantar flexion in late stance, but only the latter could be confirmed as significant using one-way ANOVA (at $\alpha$=0.05) (Figure 4b).

4.3 Ground reaction force (multi-subject population-level inference)

Ten subjects from a published study (Pataky et al., 2008) participated in a similar experiment to that described above. The main differences were that 20 repetitions of each walking speed were performed (instead of 10) and that walking speed was measured using a torso-fixed marker; thus speed was treated as a continuous variable for SPM analysis. Ground reaction force (GRF) data were measured using a Kistler force plate (model 8281B, Winterthur, Switzerland) and were normalised by body weight (BW) and stance phase duration. Raw data revealed systematic GRF changes with walking speed (Figure 5a). Population-level inference was conducted using hierarchical random-effects analysis following Beckmann et al. (2003), first by computing the regression curves associated with walking speed for each subject (Figure 5b), and then by submitting these summary curves to a second-level single-sample t test. Results confirmed significant positive speed-GRF correlations in early and late stance and a significant negative correlation in mid-stance (Figure 5c), in agreement with previous studies (Nilsson and Thorstensson, 1989; Keller et al., 1996).
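The two-level procedure described above can be sketched as follows. This is a simplified illustration on simulated data, not the analysis reported here: the first-level design contains only speed and an intercept (the time-drift regressors of Figure 5a are omitted), the function names are hypothetical, and RFT inference on the resulting curve is not shown.

```python
import numpy as np

def first_level_slopes(Y, speed):
    """Per-node regression slope of Y (trials x nodes) on speed, with intercept."""
    X = np.column_stack([speed, np.ones_like(speed)])
    beta_hat = np.linalg.pinv(X) @ Y
    return beta_hat[0]                      # slope curve, shape (K,)

def one_sample_t(B):
    """Per-node one-sample t statistic across rows of B (subjects x nodes)."""
    n = B.shape[0]
    return B.mean(axis=0) / (B.std(axis=0, ddof=1) / np.sqrt(n))

# Simulated data: 10 subjects, 60 trials each, 100 nodes, true slope 0.5
rng = np.random.default_rng(1)
n_subj, n_trials, K = 10, 60, 100
slopes = []
for _ in range(n_subj):
    speed = rng.uniform(1.0, 2.0, n_trials)
    Y = 0.5 * speed[:, None] + 0.1 * rng.standard_normal((n_trials, K))
    slopes.append(first_level_slopes(Y, speed))

B = np.vstack(slopes)        # second-level data: one slope curve per subject
t_curve = one_sample_t(B)    # second-level single-sample t curve
```

Only the per-subject slope curves are carried to the second level, so between-subject variability, rather than trial-to-trial variability, determines the population-level t values.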


5 Discussion

SPM appears to be a useful tool for analysing both experimental and probabilistically simulated bounded one-dimensional biomechanical curves. Since SPM examines the entire sampling space topologically it avoids potential problems associated with discrete point analyses (e.g. conflation of regional optima) (Pataky et al., 2008), and it also highlights the spatial or temporal breadth over which signals are observed. Additionally, since statistical curves lie in the original sampling space their biomechanical meaning is immediately apparent; no ad hoc categorical labelling is necessary.

SPM1D is a Python package that offers a high-level object-oriented interface to one-dimensional SPM. It is not a comprehensive package, currently containing explicit interfaces only to basic statistical tests (t tests, regression, and one-way ANOVA), but the machinery is in place to implement arbitrary linear testing (ANCOVA, MANCOVA, etc.) via the powerful and flexible GLM (Eqn. 1). Indeed all aforementioned tests are special cases of the GLM; in the SPM1D code these high-level routines simply organise the data for submission to ‘spm1d.stats.glm’ (Figure 2), SPM1D’s computational heart.

A second limitation of the current implementation is that it supports only simple linear registration (linear interpolation) via the ‘Curve.interp’ and ‘CurveList.interp’ methods (Figure 2). Linear registration may be inappropriate in certain cases (Thies et al., 2009). Local misregistration, however, may not be critical; in many cases (e.g. Figure 5) suprathreshold SPM{t} or SPM{F} clusters can be an order of magnitude larger than potential registration errors and one may thus appropriately ignore small misregistrations. Nonetheless, homologous data registrability is essential to all SPM analyses so continued scrutiny of biomechanical curve registration (Sadeghi et al., 2003; Page and Epifanio, 2007; Thies et al., 2009) is necessary to help to elucidate SPM’s validity for particular applications.
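Linear registration of the kind referred to above amounts to resampling each trial onto a common number of nodes. The sketch below uses `numpy.interp` and a hypothetical function name, `register_linear`; it is not the ‘Curve.interp’ implementation, whose details are not reproduced here.

```python
import numpy as np

def register_linear(curve, n_nodes=100):
    """Resample a 1D curve to n_nodes equally spaced points spanning its duration.

    Hypothetical helper illustrating linear time normalisation.
    """
    curve = np.asarray(curve, dtype=float)
    x_old = np.linspace(0.0, 1.0, curve.size)   # original samples on 0-100%
    x_new = np.linspace(0.0, 1.0, n_nodes)      # common registered nodes
    return np.interp(x_new, x_old, curve)

# Three simulated trials of different lengths (87, 103 and 121 samples)
trials = [np.sin(np.linspace(0, np.pi, n)) for n in (87, 103, 121)]
Y = np.vstack([register_linear(tr) for tr in trials])   # (3 x 100) data matrix
```

After registration the trials share a common node index and can be stacked into the $I \times K$ matrix required by the GLM; any temporal misalignment of features within the trials is left uncorrected by this procedure.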

Although SPM1D is not comprehensive, one of the main development goals was to create an easy-to-use SPM package for exploring fundamental SPM and RFT concepts. Since 2D and 3D data involve progressively more sophisticated registration and topology computations, and in the case of brain data considerably more complicated experimental modelling (Friston et al., 2007), it is hoped that this one-dimensional implementation will help to make SPM concepts accessible in their simplest possible manifestation. That is, it is hoped that SPM1D can be a base from which to consider higher dimensional and/or more complicated SPM analyses. Interested readers are encouraged to peruse the original SPM literature and the highly developed SPM8, fMRIstat and NiPy software packages.

Acknowledgments

The author is grateful to Paolo Caravaggi for his support and contributions. Funding was provided by Special Coordination Funds for Promoting Science and Technology from the Japanese Ministry of Education, Culture, Sports, Science and Technology.

Appendix

Given the GLM (1) and the parameter (2) and variance (4) estimates, the nodal t statistic ($t_k$) may be computed following Friston et al. (1995):

$$t_k = \frac{c^T \hat{\beta}_k}{\hat{\sigma}_k \sqrt{c^T (X^T X)^{-1} c}}$$

where $c$ is a $(J \times 1)$ contrast vector and where the $t_k$ values form a $K$-vector, the SPM{t}. The contrast vector $c$ assigns weights to the $J$ experimental factors and thus represents the experimental hypothesis. For example, the two-sample t test in Figure 1 uses $c = [1 \;\; -1]^T$, and the regression in Figure 5 uses $c = [1 \;\; 0 \;\; 0 \;\; 0]^T$.
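As a check on the expression above, the nodal t statistic can be sketched in NumPy as follows. The function name `spm_t` and the simulated data are assumptions for illustration, and this is not SPM1D's implementation. For a two-sample design with $c = [1 \;\; -1]^T$ the result should coincide with the classical pooled-variance t statistic at each node.

```python
import numpy as np

def spm_t(Y, X, c):
    """Nodal t statistic for contrast c under the GLM Y = X beta + eps."""
    beta_hat = np.linalg.pinv(X) @ Y                     # Eqn. 2
    resid = Y - X @ beta_hat                             # Eqn. 3
    dof = Y.shape[0] - np.linalg.matrix_rank(X)
    sigma = np.sqrt((resid ** 2).sum(axis=0) / dof)      # sqrt of Eqn. 4
    scale = np.sqrt(c @ np.linalg.inv(X.T @ X) @ c)      # design-dependent term
    return (c @ beta_hat) / (sigma * scale)

# Simulated two-sample data: 6 curves per group, 100 nodes
rng = np.random.default_rng(2)
YA = rng.standard_normal((6, 100)) + 1.0
YB = rng.standard_normal((6, 100))
Y = np.vstack([YA, YB])
X = np.zeros((12, 2))
X[:6, 0] = 1     # group A indicator
X[6:, 1] = 1     # group B indicator
t = spm_t(Y, X, c=np.array([1.0, -1.0]))
```

Changing only `X` and `c` yields other tests from the same function, for example a regression slope test with a speed column and $c = [1 \;\; 0]^T$.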

The F statistic represents the ratio between the explained variance and unexplained variance and it may also be readily computed using the GLM (following Friston et al., 1995) by first specifying a reduced form of the model: $X_0$ which contains factors that are not of empirical interest (e.g. the intercept and time drift factors in Figure 5). After estimating the errors $\varepsilon_0$ associated with this reduced model using (3), the nodal F statistic is given by:

$$F_k = \frac{\left( (\varepsilon_0^T \varepsilon_0)_k - (\varepsilon^T \varepsilon)_k \right) / \nu_1}{(\varepsilon^T \varepsilon)_k / \nu_2}$$

where $(\varepsilon^T \varepsilon)_k$ is the $k$th diagonal element of $\varepsilon^T \varepsilon$, and where the numerator and denominator degrees of freedom ($\nu_1$ and $\nu_2$) are $(J-1)$ and $(I-J)$, respectively. Like the $t_k$ values above the $F_k$ values combine to form an SPM{F}.
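The full-versus-reduced model comparison can be sketched as below. The function names and simulated data are assumptions for illustration only. The degrees of freedom are taken from the design matrix ranks, which reduces to $(J-1)$ and $(I-J)$ when $X$ has full column rank and $X_0$ is a single constant column, as in the one-way ANOVA example used here.

```python
import numpy as np

def rss(Y, X):
    """Per-node residual sum of squares, i.e. diag(eps^T eps)."""
    resid = Y - X @ (np.linalg.pinv(X) @ Y)
    return (resid ** 2).sum(axis=0)

def spm_f(Y, X, X0):
    """Nodal F statistic comparing full design X against reduced design X0."""
    rank_full = np.linalg.matrix_rank(X)
    rank_reduced = np.linalg.matrix_rank(X0)
    nu1 = rank_full - rank_reduced          # numerator degrees of freedom
    nu2 = Y.shape[0] - rank_full            # denominator degrees of freedom
    rss_full, rss_reduced = rss(Y, X), rss(Y, X0)
    return ((rss_reduced - rss_full) / nu1) / (rss_full / nu2)

# Simulated one-way ANOVA: 3 groups of 10 curves, 100 nodes
rng = np.random.default_rng(3)
groups = [rng.standard_normal((10, 100)) + m for m in (0.0, 0.5, 1.0)]
Y = np.vstack(groups)
X = np.kron(np.eye(3), np.ones((10, 1)))    # (30 x 3) group indicators
X0 = np.ones((30, 1))                       # reduced model: grand mean only
F = spm_f(Y, X, X0)
```

For this design the result should equal the textbook between-group over within-group mean-square ratio at every node.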

References

Adler RJ. 1981. The Geometry of Random Fields. New York: Wiley.

Beckmann CF, Jenkinson M, Smith SM. 2003. General multilevel linear modeling for group analysis in FMRI. NeuroImage 20(2): 1052–1063.

Caravaggi P, Pataky TC, Günther M, Savage R, Crompton RH. 2010. Dynamics of Longitudinal Arch Support in Relation to Walking Speed: the Contribution of the Plantar Aponeurosis. J Anat 217(3): 254–261.

Cavanagh PR, Lafortune MA. 1980. Ground reaction forces in distance running. J Biomech 13(5): 397–406.

Erdemir A, Viveiros ML, Ulbrecht JS, Cavanagh PR. 2006. An inverse finite-element model of heel-pad indentation. J Biomech 39(7): 1279–1286.

Friston KJ, Ashburner JT, Kiebel SJ, Nichols TE, Penny WD, editors. 2007. Statistical parametric mapping: the analysis of functional brain images. Amsterdam: Elsevier/Academic Press.

Friston KJ, Holmes AP, Worsley KJ, Poline JP, Frith CD, Frackowiak RSJ. 1995. Statistical parametric maps in functional imaging: a general linear approach. Hum Brain Mapp 2: 189–210.

Friston KJ, Worsley KJ, Frackowiak RSJ, Mazziotta JC, Evans AC. 1994. Assessing the significance of focal activations using their spatial extent. Hum Brain Mapp 1: 210–220.

Holmes AP, Blair RC, Watson JD, Ford I. 1996. Nonparametric analysis of statistic images from functional mapping experiments. J Cereb Blood Flow Metab 16: 7–22.

Keller TS, Weisberger AM, Ray JL, Hasan SS, Shiavi RG, Spengler DM. 1996. Relationship between vertical ground reaction force and speed during walking, slow jogging, and running. Clin Biomech 11(5): 253–259.

Kiebel SJ, Poline JB, Friston KJ, Holmes AP, Worsley KJ. 1999. Robust smoothness estimation in statistical parametric maps using standardized residuals from the general linear model. NeuroImage 10: 756–766.

Leardini A, Benedetti MG, Berti L, Bettinelli D, Nativo R, Giannini S. 2007. Rear-foot, mid-foot and fore-foot motion during the stance phase of gait. Gait Posture 25(3): 453–62.

Nilsson J, Thorstensson A. 1989. Ground reaction forces at different speeds of human walking and running. Acta Physiol Scand 136(2): 217–227.

Page A, Epifanio I. 2007. A simple model to analyze the effectiveness of linear time normalization to reduce variability in human movement analysis. Gait Posture 25(1): 153–156.

Pataky TC, Caravaggi P, Savage R, Parker D, Goulermas JY, Sellers WI, Crompton RH. 2008. New insights into the plantar pressure correlates of walking speed using pedobarographic statistical parametric mapping (pSPM). J Biomech 41(9): 1987–1994.

Pataky TC. 2010. Generalized n-dimensional biomechanical field analysis using statistical parametric mapping. J Biomech 43(10): 1976–1982.

Sadeghi H, Mathieu PA, Sadeghi S, Labelle H. 2003. Continuous curve registration as an intertrial gait variability reduction technique. IEEE Trans Neural Syst Rehabil Eng 11(1): 24–30.

Thies SB, Tresadern PA, Kenney LP, Smith J, Howard D, Goulermas JY, Smith C, Rigby J. 2009. Movement variability in stroke patients and controls performing two upper limb functional tasks: a new assessment methodology. J Neuroeng Rehabil 6(1): 2.

Zhang H, Nichols TE, Johnson TD. 2009. Cluster mass inference via random field theory. NeuroImage 44: 51–61.


Figure 1: SPM procedure overview. (a) Simulated unregistered data (top) and design matrix (bottom). The simulated trajectories have different lengths (indicated by dots). The design matrix (white=1, black=0) indicates an alternating pattern of ‘Cond A’ and ‘Cond B’ trials. (b) General linear model; curves (top) and matrix renderings (bottom). Following registration it is apparent that the registered curves ($Y$) all have a spike-like signal at normalised excursion = 50%. The data ($Y$) are equivalent to mean trends ($X\beta$) plus residuals ($\varepsilon$). The standard deviation curve is indicated by a thick black line in the upper right panel. (c) SPM results; raw SPM{t} computed from a two-sample t test (left) and an inference curve (right) thresholded at t > 3.0. The p value indicates the probability with which a suprathreshold cluster of the same extent and same minimum height as the observed cluster could have occurred by chance.

Figure 2: SPM1D code architecture overview.

Figure 3: Probabilistic finite element simulation of heel pad indentation (following Erdemir et al. 2006). (a) Deformed model; undeformed geometry is indicated by the dotted line. (b) Mean contact pressures and standard deviation clouds; mean (±st.dev.) Ogden hyperelastic parameters were $\mu$=16.45 (±8.27), $\alpha$=6.82 (±1.57) and $\mu$=16.88 (±6.70), $\alpha$=7.30 (±1.43) for the non-diabetic and diabetic groups, respectively. (c) Two-sample t test on the data in (b).

Figure 4: One-way ANOVA analysis of the foot kinematics dataset. (a) Mean (solid line) ankle flexion curves and their standard deviations (shaded cloud) for each of the three walking speeds. Ankle angle is expressed with respect to neutral posture ($\theta$=0) and positive and negative $\theta$ values indicate dorsiflexion and plantarflexion, respectively. (b) ANOVA results including the raw SPM{F} (lightly shaded cloud) and the RFT-computed critical F threshold of 6.746 (at $\alpha$ = 0.05).

Figure 5: Population-level random effects analysis of the GRF dataset. (a) All sixty trials for a representative subject (left) and the first-level design matrix (right) indicating (in consecutive columns): walking speed, intercept (constant), linear time drift, and sinusoidal time drift. (b) The $\beta$ curves for all ten subjects (left) representing the relation between GRF (BW) and walking speed (m/s) and the second-level design matrix (right) indicating a single column of ones (i.e. a one-sample t test). (c) Population-level inference based on (b).
