Journal of Statistical Computation and Simulation
ISSN: 0094-9655 (Print) 1563-5163 (Online) Journal homepage: http://www.tandfonline.com/loi/gscs20
Multivariate-multiple circular regression Sungsu Kim & Ashis SenGupta To cite this article: Sungsu Kim & Ashis SenGupta (2016): Multivariate-multiple circular regression, Journal of Statistical Computation and Simulation, DOI: 10.1080/00949655.2016.1261292 To link to this article: http://dx.doi.org/10.1080/00949655.2016.1261292
Published online: 29 Nov 2016.
Multivariate-multiple circular regression
Sungsu Kim (a) and Ashis SenGupta (b, c)
(a) Department of Mathematics, University of Louisiana, Lafayette, LA, USA; (b) Applied Statistics Unit, Indian Statistical Institute, Kolkata, India; (c) Department of Biostatistics and Epidemiology, Augusta University, Augusta, GA, USA
ABSTRACT
We introduce a fully model-based approach to studying functional relationships between a multivariate circular-dependent variable and several circular covariates, enabling inference regarding all model parameters and related prediction. Two multiple circular regression models are presented for this approach. First, for a univariate circular-dependent variable, we propose the least circular mean-square error (LCMSE) estimation method, and asymptotic properties of the LCMSE estimators and inferential methods are developed and illustrated. Second, using a simulation study, we provide some practical suggestions for model selection between the two models. An illustrative example is given using a real data set from the protein structure prediction problem. Finally, a straightforward extension to the case with a multivariate-dependent circular variable is provided.
ARTICLE HISTORY
Received 20 May 2015; Accepted 12 November 2016
KEYWORDS
Circular regression; circular variable; mean-square error; multiple regression; multivariate regression
1. Introduction
Applications of circular regression models appear in many fields, such as evolutionary psychology, motor behaviour, biology, and gene expression in oscillatory systems. A practical need for multiple circular regression arises in real-life problems where researchers are interested in correlated multivariate angular data from two or more sources; examples are found in [1–3]. In the gene expression problem, for instance, a researcher may be interested in modelling the relationship among the phases of cell-cycle genes in more than two species with differing periods. In this paper, we have selected only one example from Bioinformatics, but circular variables in protein structure are ubiquitous: conformational angles appear not only in gamma turns but also in other structural patterns. Also, each amino acid is associated with more than one conformational angle, such as the χ angles. Hughes [4] contains another application to trivariate data in proteomics concerning χ angles in Serine and Valine residues. In this paper, we propose two multiple circular regression models to introduce a fully model-based approach to studying functional relationships among the conformational angles in the protein structure prediction problem, which is considered to be the holy grail of Bioinformatics.
By circular regression, we mean here any regression in which both the dependent variable and the one or more independent variables are circular. Such regressions are sometimes also referred to as circular–circular regression. More generally, they may be called toroidal (and, in the multivariate case, hypertoroidal) regression. Circular regression has been used in many diverse applications; see, e.g. [5–9]. Regression models using a circular-dependent or circular-independent variable have been studied in [8,10–15]. In Hughes [4], a form of the multiple circular regression model was suggested as an extension to Downs
and Mardia [10], where an idea based on the conditional distribution in a time series setting was enhanced. However, this idea was not pursued further in his thesis. Our contributions in this paper are the introduction of two multiple circular regression models and the associated statistical inference. The need for a multiple circular regression model arises in various areas of study. For example, we illustrate the utility of our proposed multiple circular regressions in the protein structure prediction problem from Bioinformatics. One of the two models is a new addition to the circular regression literature and the other is a generalization based on the arc-tangent link model presented in [14].
The structure of our paper is as follows. In Section 2, two multiple circular regression models are introduced. In Section 3, we propose the least circular mean-square error (LCMSE) estimation method and give some asymptotic properties of the LCMSE estimators. Then, we develop some asymptotic inferential methods based on the proposed models. We provide an illustrative simulation study in Section 4, and a practical example in Section 5, using a real data set comprising 334 observed protein γ turns. In Section 6, we introduce an extension to the case with more than one dependent circular variable and more than two independent circular variables, i.e. multivariate-multiple circular regression. Our concluding remarks follow in Section 7.
Unless otherwise specified, all angles and their sums or differences are expressed as their principal values in the half-open interval [−π, π) radians, and all half-angles of these principal values are in the half-open interval [−π/2, π/2). Further, we suppose that $\Theta_1, \ldots, \Theta_n$ are n dependent circular variables observed for n fixed values $\phi_{11}, \ldots, \phi_{1n}$ and $\phi_{21}, \ldots, \phi_{2n}$ of the independent circular variables $\Phi_1$ and $\Phi_2$, respectively.
2. Multiple circular regression
In this section, we present two regression models which involve one circular-dependent variable and two circular-independent variables. We show that our models can be easily extended to include more than two circular-independent variables. All the circular variables will be represented equivalently in terms of angles.
2.1. Arc-tangent link model 1
In this section, we propose the half-sine of the centred angle, given by $\sin((\phi - \mu_\phi)/2)$, as a natural way of introducing independent angular variables in circular regression. It can be shown that the half-sine of the centred angle uniquely identifies the angle on the unit circle. Our first arc-tangent link model, having one circular-dependent variable and two circular-independent variables, is given by
$$\theta_i = \mu_\theta + 2\arctan\left\{\alpha + \beta_1 \sin\frac{\phi_{1i} - \mu_{\phi_1}}{2} + \beta_2 \sin\frac{\phi_{2i} - \mu_{\phi_2}}{2}\right\} + \epsilon_i, \tag{1}$$
where $\epsilon_i$ follows a circular distribution with zero mean direction, $\alpha \in \mathbb{R}$ and $\beta_1, \beta_2 \in \mathbb{R}$ denote the intercept and slope parameters, respectively, $\mu_{\phi_j}$ is the centering constant of $\Phi_j$ for j = 1, 2, and $\mu_\theta$ is the centering constant of $\Theta$. The intercept α in Equation (1) allows the conditional mean direction of $\Theta$ to differ from $\mu_\theta$ when the independent circular variables take the values $\mu_{\phi_1}$ and $\mu_{\phi_2}$. In order to include k independent circular variables, say $\Phi_1, \ldots, \Phi_k$, (1) can be extended as follows:
$$\theta_i = \mu_\theta + 2\arctan\left\{\alpha + \sum_{j=1}^{k} \beta_j \sin\frac{\phi_{ji} - \mu_{\phi_j}}{2}\right\} + \epsilon_i.$$
In what follows, we call the multiple circular regression model proposed in this section the arc-tangent-sine link model.
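As a minimal sketch of how the conditional mean direction of the arc-tangent-sine link model could be evaluated for k covariates, consider the following Python function. The names `wrap` and `sine_link_mean` and the numerical inputs are illustrative assumptions and do not come from the paper.

```python
import math

def wrap(angle):
    """Wrap an angle in radians to the principal interval [-pi, pi)."""
    return (angle + math.pi) % (2.0 * math.pi) - math.pi

def sine_link_mean(phis, mu_phis, alpha, betas, mu_theta):
    """Conditional mean direction under the arc-tangent-sine link.

    phis, mu_phis and betas are sequences of length k (hypothetical inputs).
    """
    # Linear predictor built from half-sines of the centred covariate angles.
    eta = alpha + sum(
        b * math.sin(wrap(phi - mu) / 2.0)
        for b, phi, mu in zip(betas, phis, mu_phis)
    )
    return wrap(mu_theta + 2.0 * math.atan(eta))

# When every covariate sits at its centering constant, only alpha shifts the mean.
at_centre = sine_link_mean([0.4, -1.0], [0.4, -1.0], 1.0, [2.0, -3.0], 0.5)
print(at_centre)
```

In this sketch the centred angles are wrapped before halving, so that the half-angles stay in [−π/2, π/2) as assumed in the text.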
2.2. Arc-tangent link model 2
Our second model is an extension of the model introduced in [13], which is given by
$$\theta_i = \mu_\theta + 2\arctan\left\{\alpha + \beta \tan\frac{\phi_i - \mu_\phi}{2}\right\} + \epsilon_i,$$
where $\epsilon_i$ follows a circular distribution with zero mean direction, $\alpha \in \mathbb{R}$ is the intercept parameter, $\beta \in \mathbb{R}$ is the slope parameter, and $\mu_\phi$ and $\mu_\theta$ are the centering constants of $\Phi$ and $\Theta$, respectively. The other multiple circular regression model proposed in this paper, having one dependent circular variable and two circular-independent variables, is as follows:
$$\theta_i = \mu_\theta + 2\arctan\left\{\alpha + \beta_1 \tan\frac{\phi_{1i} - \mu_{\phi_1}}{2} + \beta_2 \tan\frac{\phi_{2i} - \mu_{\phi_2}}{2}\right\} + \epsilon_i, \tag{2}$$
where $\epsilon_i$ follows a circular distribution with zero mean direction, and α, $\beta_j$, $\mu_{\phi_j}$ for j = 1, 2, and $\mu_\theta$ have the same interpretation as in (1). To include k independent circular variables, say $\Phi_1, \ldots, \Phi_k$, Equation (2) can be extended as follows:
$$\theta_i = \mu_\theta + 2\arctan\left\{\alpha + \sum_{j=1}^{k} \beta_j \tan\frac{\phi_{ji} - \mu_{\phi_j}}{2}\right\} + \epsilon_i.$$
In what follows, we call the multiple circular regression model proposed in this section the arc-tangent-tangent link model. In Section 4, a practical suggestion on how to choose between the two proposed models, i.e. a model selection guideline, is provided using simulation studies.
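A short expansion, added here as a worked note rather than taken from the paper, indicates why the two links are expected to agree when the centred covariate angles are small. Writing $u = (\phi - \mu_\phi)/2$ for the half-angle (our notation), the standard Taylor series of the tangent and sine give:

```latex
\tan u - \sin u
  = \left(u + \frac{u^{3}}{3} + O(u^{5})\right)
  - \left(u - \frac{u^{3}}{6} + O(u^{5})\right)
  = \frac{u^{3}}{2} + O(u^{5}).
```

The two linear predictors therefore differ only at third order in the half-angle, whereas $\tan u$ diverges as $u \to \pm\pi/2$ while $\sin u$ stays bounded by 1.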
3. Estimation and inference
We now introduce a circular mean-square error (CMSE)-based estimation method, and discuss inferential problems based on the asymptotic properties of the CMSE estimators.
3.1. Least circular mean-square estimation
A CMSE-type measure is given in the following definition.
Definition 3.1: The CMSE of an estimator $\hat\mu_0$ of a circular parameter $\mu_0$ is defined by
$$E(1 - \cos(\hat\mu_0 - \mu_0)). \tag{3}$$
It can be shown that the CMSE has some analogy to the mean-square error (MSE) used in the linear variable case. That is, it has the interpretation that CMSE = (Circular Variance) + (Circular Bias)$^2$, where we have used the following definitions of the circular variance and circular bias of $\hat\mu_0$.
Definition 3.2: The circular variance and circular bias of an estimator $\hat\mu_0$ of a circular parameter $\mu_0$ are, respectively, defined as (Jammalamadaka and SenGupta [16])
$$E(1 - \cos(\hat\mu_0 - \mu)) \quad\text{and}\quad E(\sin(\hat\mu_0 - \mu_0)),$$
where μ is the mean direction of $\hat\mu_0$.
In fact, for small values of $\hat\mu_0 - \mu$ and $\mu - \mu_0$, we can use the Taylor series expansion to obtain
$$E(1 - \cos(\hat\mu_0 - \mu)) \approx 1 - E\left(1 - \frac{(\hat\mu_0 - \mu)^2}{2}\right) = E\left(\frac{(\hat\mu_0 - \mu)^2}{2}\right)$$
and
$$E(\sin(\hat\mu_0 - \mu_0)) \approx E(\hat\mu_0 - \mu_0) = \mu - \mu_0.$$
Thus, an analogy for the CMSE of circular data to the mean-square error of data on the line is exhibited by the following decomposition:
$$2E(1 - \cos(\hat\mu_0 - \mu)) + [E(\sin(\hat\mu_0 - \mu_0))]^2 \approx E(\hat\mu_0 - \mu_0)^2 \approx 2E(1 - \cos(\hat\mu_0 - \mu_0)). \tag{4}$$
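The small-angle approximation behind the decomposition can be checked numerically. The sketch below, with the hypothetical helper `small_angle_rel_error`, compares 2(1 − cos d) with d² for a few deviations d chosen purely for illustration.

```python
import math

def small_angle_rel_error(d):
    """Relative error of approximating 2 * (1 - cos d) by d**2."""
    exact = 2.0 * (1.0 - math.cos(d))
    return abs(exact - d * d) / (d * d)

# Deviations in radians chosen for illustration only.
for d in (0.05, 0.2, 0.5, 1.0):
    print(f"d = {d:4.2f}  2(1 - cos d) = {2.0 * (1.0 - math.cos(d)):.5f}  "
          f"d^2 = {d * d:.5f}  rel. error = {small_angle_rel_error(d):.4f}")
```

The relative error grows roughly like d²/12, so the approximation is close for small deviations and loosens as the deviation approaches one radian.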
The resulting approximation in Equation (4) motivates us to call Equation (3) the CMSE. The CMSE is zero when its argument is zero, that is, $\hat\mu_0 = \mu_0$, and reaches its maximum of 2 when $\hat\mu_0 - \mu_0$ approaches ±π. Figure 1 shows the plots of the CMSE and the circular root-mean-square error (CRMSE). The CRMSE is shown to yield a more linear measure than the CMSE. The CMSE becomes the circular variance when μ = μ0. Theoretically, it is possible that the argument takes values outside ±π/2. However, since the argument is a prediction error, it is unlikely to take a value outside ±π/2. In Section 4, we illustrate this point using a real data set, where the predicted values are shown to lie within π/2 of the original angles. We estimate the parameters in the models (1) and (2) by minimizing the sample version of the CMSE, given by
$$Q(\alpha, \beta_1, \beta_2, \mu_\theta, \mu_{\phi_1}, \mu_{\phi_2}) = \frac{1}{n}\sum_{i=1}^{n}\left[1 - \cos\left(\theta_i - \mu_\theta - 2\arctan\left\{\alpha + \beta_1 \sin\frac{\phi_{1i} - \mu_{\phi_1}}{2} + \beta_2 \sin\frac{\phi_{2i} - \mu_{\phi_2}}{2}\right\}\right)\right]$$
and
$$Q(\alpha, \beta_1, \beta_2, \mu_\theta, \mu_{\phi_1}, \mu_{\phi_2}) = \frac{1}{n}\sum_{i=1}^{n}\left[1 - \cos\left(\theta_i - \mu_\theta - 2\arctan\left\{\alpha + \beta_1 \tan\frac{\phi_{1i} - \mu_{\phi_1}}{2} + \beta_2 \tan\frac{\phi_{2i} - \mu_{\phi_2}}{2}\right\}\right)\right],$$
when using (1) and (2), respectively. The resulting estimators are called the least circular mean-square error (LCMSE) estimators. In practice, for large n, the circular parameters in (1) and (2), i.e. $\mu_\theta$, $\mu_{\phi_1}$ and $\mu_{\phi_2}$, can be replaced by the corresponding sample mean directions. The asymptotic distribution of the LCMSEs of the linear parameters α, β1, β2 in (1) and (2) is given in the following theorem.
Theorem 3.1: Let ζ denote the vector of linear parameters, i.e. $\zeta = (\alpha, \beta_1, \beta_2)'$, and let the circular regression model be given by $\theta_i = m_i(\zeta, \phi_1, \phi_2) + \epsilon_i$, where $m_i(\cdot)$ is a continuous function of $\phi_1$, $\phi_2$ and ζ. Under regularity conditions such as those in [14], the LCMSE $\hat\zeta$ of ζ, defined as a root of the first-order conditions
$$\frac{\partial}{\partial\zeta}\sum_{i=1}^{n}\left[1 - \cos(\theta_i - m_i(\zeta))\right] = 0,$$
is consistent for ζ and
$$\hat\zeta \overset{a}{\sim} N(\zeta_0, n^{-1} A_0^{-1} B_0 A_0^{-1}),$$
where the matrices $A_0$ and $B_0$ are defined in the appendix.
Proof: A proof of the theorem can be found in the appendix.
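The sample CMSE minimized in Section 3.1 can be sketched in Python as follows. This is only an illustration of the objective function under stated assumptions: the function names, the noise-free toy data and the parameter values are hypothetical, and no particular optimizer is implied by the paper.

```python
import math

def wrap(angle):
    """Wrap an angle in radians to the principal interval [-pi, pi)."""
    return (angle + math.pi) % (2.0 * math.pi) - math.pi

def link_mean(phi1, phi2, params, centres, link=math.sin):
    """Conditional mean for model (1) (link=math.sin) or model (2) (link=math.tan)."""
    alpha, beta1, beta2 = params
    mu_theta, mu_phi1, mu_phi2 = centres
    eta = (alpha
           + beta1 * link(wrap(phi1 - mu_phi1) / 2.0)
           + beta2 * link(wrap(phi2 - mu_phi2) / 2.0))
    return mu_theta + 2.0 * math.atan(eta)

def sample_cmse(params, centres, thetas, phi1s, phi2s, link=math.sin):
    """Sample CMSE Q: the average of 1 - cos(residual) over the n observations."""
    terms = [
        1.0 - math.cos(t - link_mean(p1, p2, params, centres, link))
        for t, p1, p2 in zip(thetas, phi1s, phi2s)
    ]
    return sum(terms) / len(terms)

# Hypothetical noise-free data generated from model (1), for illustration only.
centres = (0.3, 0.0, 0.0)
true_params = (0.2, 1.0, -0.5)
phi1s = [-1.0, -0.3, 0.4, 1.2]
phi2s = [0.8, -0.6, 0.1, -1.1]
thetas = [link_mean(p1, p2, true_params, centres) for p1, p2 in zip(phi1s, phi2s)]

q_true = sample_cmse(true_params, centres, thetas, phi1s, phi2s)
q_other = sample_cmse((0.0, 0.0, 0.0), centres, thetas, phi1s, phi2s)
print(q_true, q_other)
```

An LCMSE fit would pass `sample_cmse` to a numerical minimizer over (α, β1, β2), with the centering constants fixed at the sample mean directions as suggested in the text.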
Figure 1. Plots of CMSE (Top) and CRMSE (Bottom).
In the following, we provide some finite sample properties of our estimators using a simulation study. We generate 10, 20, 40 and 80 dependent angles based on α taking the values −5, 5 and 15, and β1 and β2 each taking the values −15, −5, 5 and 15, where we simulate the corresponding angles of $\Phi_1$ and $\Phi_2$ from von Mises vM(0,1) and vM(0,1) distributions, respectively, and the circular error from a vM(0,1) distribution. We provide the CMSE for the arc-tangent-sine link model in Table 1. The following two points emerged from the simulation study. First, increasing the sample size generally decreases the CMSE. Second, for n = 80 we obtained similar CMSEs (about 0.6) for most of the combinations of (α, β1, β2), whereas for a small sample size, such as n = 10, the CMSE is highly variable, and sometimes the least CMSE was obtained. We have omitted a simulation study for the arc-tangent-tangent link model, since we expect the results to be similar to those presented in Table 1.
Table 1. CMSE from a simulation study for the arc-tangent-sine link model.
α     β1    β2    n = 10   n = 20   n = 40   n = 80
−5    −15   −15   0.625    0.566    0.328    0.541
−5    −15   −5    0.449    0.638    0.540    0.542
−5    −15   5     0.455    0.613    0.579    0.523
−5    −15   15    0.479    0.506    0.529    0.532
−5    −5    −15   0.579    0.406    0.409    0.554
−5    −5    −5    0.608    0.495    0.604    0.515
−5    −5    5     0.833    0.581    0.709    0.556
−5    −5    15    0.747    0.655    0.657    0.581
−5    5     −15   0.684    0.638    0.592    0.583
−5    5     −5    0.655    0.725    0.636    0.564
−5    5     5     1.190    1.044    0.912    0.860
−5    5     15    1.087    0.894    0.729    0.633
−5    15    −15   0.705    0.772    0.835    0.678
−5    15    −5    0.983    0.782    0.764    0.761
−5    15    5     0.777    0.782    0.522    0.660
−5    15    15    0.595    0.664    0.627    0.661
5     −15   −15   0.851    0.607    0.575    0.663
5     −15   −5    0.413    0.698    0.486    0.569
5     −15   5     0.468    0.641    0.645    0.628
5     −15   15    0.699    0.596    0.512    0.540
5     −5    −15   0.692    0.722    0.782    0.686
5     −5    −5    0.714    0.721    0.744    0.654
5     −5    5     0.398    0.647    0.518    0.507
5     −5    15    0.772    0.668    0.714    0.636
5     5     −15   0.730    0.706    0.748    0.698
5     5     −5    0.617    0.606    0.524    0.563
5     5     5     0.677    0.599    0.581    0.460
5     5     15    0.560    0.449    0.552    0.569
5     15    −15   0.819    0.568    0.791    0.515
5     15    −5    0.544    0.983    0.646    0.472
5     15    5     0.598    0.649    0.632    0.556
5     15    15    0.508    0.543    0.629    0.563
15    −15   −15   0.654    0.589    0.571    0.606
15    −15   −5    0.486    0.656    0.638    0.540
15    −15   5     0.318    0.404    0.604    0.534
15    −15   15    0.404    0.457    0.390    0.451
15    −5    −15   0.399    0.577    0.758    0.664
15    −5    −5    0.527    0.715    0.544    0.502
15    −5    5     0.698    0.497    0.507    0.582
15    −5    15    0.483    0.573    0.740    0.548
15    5     −15   0.724    0.642    0.463    0.542
15    5     −5    0.598    0.582    0.520    0.499
15    5     5     0.582    0.790    0.606    0.448
15    5     15    0.413    0.477    0.525    0.464
15    15    −15   0.511    0.526    0.687    0.432
15    15    −5    0.666    0.436    0.556    0.488
15    15    5     0.387    0.448    0.508    0.541
15    15    15    0.626    0.660    0.592    0.501
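As a rough benchmark for reading Table 1, one can compute the CMSE that would result if the regression parameters were known exactly, so that the residual is just the vM(0,1) error. The sketch below is an illustration added here, not the authors' simulation code: it uses the standard identity E(cos ε) = I_1(κ)/I_0(κ) for a von Mises error and Python's standard-library von Mises sampler, and all function names are hypothetical.

```python
import math
import random

def bessel_i(order, x, terms=30):
    """Modified Bessel function of the first kind, I_order(x), via its power series."""
    return sum(
        (x / 2.0) ** (2 * m + order) / (math.factorial(m) * math.factorial(m + order))
        for m in range(terms)
    )

def population_cmse_vm(kappa):
    """E(1 - cos eps) for eps ~ vM(0, kappa), i.e. 1 - I_1(kappa) / I_0(kappa)."""
    return 1.0 - bessel_i(1, kappa) / bessel_i(0, kappa)

def simulated_cmse_vm(kappa, n, seed=0):
    """Monte Carlo average of 1 - cos(eps) over n simulated von Mises errors."""
    rng = random.Random(seed)
    return sum(1.0 - math.cos(rng.vonmisesvariate(0.0, kappa)) for _ in range(n)) / n

benchmark = population_cmse_vm(1.0)
print(round(benchmark, 3), round(simulated_cmse_vm(1.0, 20000, seed=1), 3))
```

Under these assumptions the benchmark is about 0.554, which is in line with the n = 80 column of Table 1, where most entries lie between roughly 0.45 and 0.70.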
3.2. Inference
Inference based on the proposed models and the asymptotic distributions of the LCMSEs is illustrated with the following scenarios. First, with the null hypothesis α = 0, we can test whether the conditional mean direction of $\Theta$, denoted by $\mu_{\theta|\phi}$, is equal to $\mu_\theta$ when the independent circular variables take their mean direction values $\mu_{\phi_1}$ and $\mu_{\phi_2}$. An approximate test of the null hypothesis can be obtained by using the Wald test. The test statistic is given by
$$\frac{\hat\alpha}{\mathrm{se}(\hat\alpha)} \sim N(0, 1),$$
where $\hat\alpha$ denotes the LCMSE of α and $\mathrm{se}(\hat\alpha)$ denotes the standard error of $\hat\alpha$. Next, for the null hypothesis that $\beta_j = 0$, j = 1, 2, the test statistic is given by
$$\frac{\hat\beta_j}{\mathrm{se}(\hat\beta_j)} \sim N(0, 1),$$
where $\hat\beta_j$ denotes the LCMSE of $\beta_j$ and $\mathrm{se}(\hat\beta_j)$ denotes the standard error of $\hat\beta_j$, which tests whether the independent angle $\Phi_j$ is important in the prediction of $\Theta$ when the other is present in the given model. When there are many potential independent angles, these individual tests can be employed in a step-wise variable selection method. For the null hypothesis $H_0: \beta_1 = \beta_2 = 0$, the test statistic is given by
$$(\hat\beta_1, \hat\beta_2)\,[\mathrm{Cov}(\hat\beta)]^{-1}\,(\hat\beta_1, \hat\beta_2)' \sim \chi^2_2,$$
where $\mathrm{Cov}(\hat\beta)$ denotes the variance–covariance matrix of $\hat\beta_1$ and $\hat\beta_2$, which tests whether a meaningful regression model, (1) or (2), exists by using $\Phi_1$ and $\Phi_2$ as the independent angles in the prediction of $\Theta$.
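The Wald statistics described above can be sketched as follows. The function names and all numerical inputs are hypothetical and chosen only to make the arithmetic easy to follow; the joint statistic uses the closed-form inverse of a 2 × 2 covariance matrix.

```python
def wald_z(estimate, std_error):
    """Wald z statistic for the null hypothesis that a single parameter is zero."""
    return estimate / std_error

def wald_quadratic_form(b1, b2, cov):
    """Joint Wald statistic b' Cov^{-1} b for two slopes (2 x 2 covariance matrix)."""
    (v11, v12), (v21, v22) = cov
    det = v11 * v22 - v12 * v21
    return (b1 * b1 * v22 - b1 * b2 * (v12 + v21) + b2 * b2 * v11) / det

# Hypothetical estimates and covariance matrices, for illustration only.
z_alpha = wald_z(0.5, 0.25)
joint_diag = wald_quadratic_form(1.0, 2.0, ((1.0, 0.0), (0.0, 4.0)))
joint_corr = wald_quadratic_form(1.0, 1.0, ((2.0, 1.0), (1.0, 2.0)))
print(z_alpha, joint_diag, joint_corr)
```

The z statistic is referred to the standard normal distribution and the joint statistic to the chi-square distribution with 2 degrees of freedom, as stated in the text.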
4. Illustrative simulation study for comparison of the two arc-tangent link models
4.1. Case of having largely spread independent angles
First, notice that sine and tangent values are nearly the same near 0 degrees. In fact, a simulation shows that the positive differences between the two values are within about 0.1 and 0.15 when the absolute values of their arguments are less than 32 degrees and 37 degrees, respectively. Because the argument is half of the centred angle, this corresponds to ranges of observed angles of within 128 degrees and 147 degrees, respectively. Therefore, both of the proposed models will yield very similar results when the independent angles in the sample have ranges of about 5π/6 or less. When the argument of sine and tangent is larger than, say, 50 degrees, the difference becomes substantial as the tangent function becomes much more nonlinear. In this section, we study the behaviour of the two models in a region far removed from the sample mean directions of largely spread independent angles. For k = 1, planar graphs of the arc-tangent link functions of Equations (1) and (2) with $\mu_\theta = 0 = \mu_\phi$ for selected values of (α, β) are shown in Figures 2 and 3, respectively.
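The size of the gap between tangent and sine quoted above can be checked directly. The following sketch uses a hypothetical helper `tan_minus_sin` and evaluates the difference at a few arguments given in degrees.

```python
import math

def tan_minus_sin(degrees):
    """Difference tan(x) - sin(x) for an argument x given in degrees."""
    x = math.radians(degrees)
    return math.tan(x) - math.sin(x)

for deg in (10, 32, 37, 50, 70):
    print(f"{deg:3d} degrees: tan - sin = {tan_minus_sin(deg):.4f}")
```

The difference stays below 0.1 up to 32 degrees and is about 0.15 at 37 degrees, after which it grows quickly.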
Figure 2. Planar plots (arc-tangent-sine link model) of $\mu_{\theta|\phi}$ versus φ. (Left) centre thick solid line (α = 0, β = 3), from upper left to bottom right (α = 4, β = 3), (α = 2, β = 3), (α = 0, β = 3), (α = −2, β = 3), (α = −4, β = 3); (Right) thick solid line (α = 1, β = 1), from upper right to bottom right (α = 1, β = 5), (α = 1, β = 3), (α = 1, β = 1), (α = 1, β = −1), (α = 1, β = −3), (α = 1, β = −5).
Figure 3. Planar plots (arc-tangent-tangent link model) of $\mu_{\theta|\phi}$ versus φ. (Left) centre thick solid line (α = 0, β = 3), from upper left to bottom right (α = 4, β = 3), (α = 2, β = 3), (α = 0, β = 3), (α = −2, β = 3), (α = −4, β = 3); (Right) thick solid line (α = 1, β = 1), from upper right to bottom right (α = 1, β = 5), (α = 1, β = 3), (α = 1, β = 1), (α = 1, β = −1), (α = 1, β = −3), (α = 1, β = −5).
Figure 4. Circular dot plots for observed independent angles used in scenario 1 : (Top left) φ1 ’s from vM(1, 2.2) (Top right) φ2 ’s from vM(−2, 3.5), and scenario 2 : (Bottom left) φ1 ’s from vM(1, 2.2) (Bottom right) φ2 ’s from circular uniform distribution.
Figure 5. Rose diagram of 40 dependent angles with sample mean direction = −2.43.
First, notice that the arc-tangent-tangent link forces the range of $\mu_{\theta|\phi}$ to be from −π to π whenever φ ranges from −π to π. This will yield poor predicted values of $\Theta$ when the dependent angle has a narrow sample range and the independent angle has a largely spread sample range. This also implies that the regression model given in (2) will behave poorly when outliers are present. On the other hand, the arc-tangent-sine link yields a flexible prediction range of the dependent angle even when the observed independent angles are largely dispersed.
4.2. Comparison based on the CMSE
In this section, we present two scenarios for two independent angles, one with the sample ranges lying within about 5π/6 and the other with largely dispersed samples. For each scenario, we compare the two arc-tangent link models using the CMSE criterion. In the first scenario, we generate 40 data points for each of $\Phi_1$ and $\Phi_2$ from vM(1, 2.8) and vM(−2, 3.5) distributions, respectively. In the second scenario, we generate 40 data points for each of $\Phi_1$ and $\Phi_2$ from a vM(1.5, 0.8) distribution and the circular uniform distribution on [−π, π), respectively. Circular dot plots of $\phi_{1,1}, \ldots, \phi_{1,40}$ and $\phi_{2,1}, \ldots, \phi_{2,40}$ are shown in Figure 4. In both scenarios, we generate 40 dependent angles using a wrapped skew normal(−π, 1, 20) distribution, where the location parameter is −π, the scale parameter is 1 and the skewness parameter is 10 [17]. The rose diagram of $\theta_1, \ldots, \theta_{40}$ in Figure 5 shows an asymmetric bimodal shape. The results of fitting the models (1) and (2) to the two scenarios are displayed in Figure 6. It is shown that the prediction using (2) for largely dispersed independent angles is poor in scenario 2.
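The difference in prediction range between the two links can be illustrated numerically for k = 1 with zero centering constants, as in Figures 2 and 3. The sketch below uses (α, β) = (0, 3), one of the plotted settings; the grid, the helper name `link_mean` and the span summaries are illustrative assumptions.

```python
import math

def link_mean(phi, alpha, beta, link):
    """Conditional mean 2 * arctan(alpha + beta * link(phi / 2)) with zero centres."""
    return 2.0 * math.atan(alpha + beta * link(phi / 2.0))

# Midpoint grid over (-pi, pi), so that tan(phi / 2) is finite at every grid point.
grid = [-math.pi + 2.0 * math.pi * (i + 0.5) / 2000 for i in range(2000)]

sine_vals = [link_mean(p, 0.0, 3.0, math.sin) for p in grid]
tan_vals = [link_mean(p, 0.0, 3.0, math.tan) for p in grid]

sine_span = max(sine_vals) - min(sine_vals)
tan_span = max(tan_vals) - min(tan_vals)
print(f"sine link span: {sine_span:.3f}, tangent link span: {tan_span:.3f}")
```

With these settings the sine link is confined to (−2 arctan 3, 2 arctan 3), whereas the tangent link sweeps essentially the whole circle as φ runs from −π to π.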
Figure 6. Plots of predicted versus observed dependent angles for scenario 1 (Top): (Left) arc-tangent-sine link model (CMSE = 0.1840) (Right) arc-tangent-tangent link model (CMSE = 0.1839), and scenario 2 (Bottom): (Left) arc-tangent-sine link model (CMSE = 0.1634) (Right) arc-tangent-tangent link model (CMSE = 0.1831).
5. Real data example in the problem of protein structure prediction
The protein structure prediction problem is considered to be the holy grail of Bioinformatics, and circular variables in the protein structure problem are ubiquitous, e.g. conformational angles appear in γ turns, α helices and β sheets. The conformation of the polypeptide backbone is defined by the torsion angles φ, ψ and ω of each residue, while those of the side chain are denoted by χ. It is well known that the dihedral angles (φ and ψ), together with ω (the torsion angle of the peptide bond) and χ (the torsion angle of the side chain), are considered to be important for protein structure prediction [18]. In the following example, protein γ turns consist of three residues, which are given by Glycine, Phenylalanine and Threonine. For the data set consisting of 334 observed values of the φ, ψ and χ conformational angles in the protein γ turns, it is important to understand whether the side chain angles χ are relevant in the relationship between the dihedral angles (φ, ψ) [18]. Figure 7 shows the rose diagram of the χ angles.
Figure 7. Rose diagram of 334 observed χ in protein γ turns example.
Figure 8. Planar plots of χ versus φ and ψ : (Top) arc-tangent-sine link model and (Bottom) arc-tangent-tangent link model.
The planar graphs of χ versus φ and χ versus ψ for (1) and (2) are shown in Figure 8. As expected, they look almost identical for each of φ and ψ, since the observed ranges of the φ's and ψ's are 57.4° and 68.4°, respectively. We fit the two arc-tangent link models, with the χ angle as the dependent variable and the dihedral angles as the independent variables. The resulting CMSEs for (1) and (2) are 0.4453 and 0.4454, respectively. The fitted models are given by
$$\hat\chi = -1.327 + 2\arctan\left\{-0.001 - 0.292 \sin\frac{\phi - 1.446}{2} - 0.087 \sin\frac{\psi + 1.07}{2}\right\},$$
$$\hat\chi = -1.327 + 2\arctan\left\{-0.001 - 0.279 \tan\frac{\phi - 1.446}{2} - 0.088 \tan\frac{\psi - 1.07}{2}\right\},$$
respectively. The plots showing predicted versus observed χ's are shown in Figure 8. To test whether the χ angles are relevant to the relationship between the dihedral angles, we test the null hypothesis of β1 = β2 = 0 in both models. The asymptotic variance–covariance matrices are
$$\begin{bmatrix} 0.269 & 0.003 \\ 0.003 & 0.068 \end{bmatrix} \quad\text{and}\quad \begin{bmatrix} 0.259 & 0.003 \\ 0.003 & 0.064 \end{bmatrix},$$
respectively. Consequently, the Wald tests have $\chi^2_2 = 0.055$ and $\chi^2_2 = 0.059$ with p-values of 0.973 and 0.971, respectively. So the data show little association between the χ angles and (φ, ψ) in the protein γ turns. Next, we turn to testing the null hypothesis that the conditional mean direction of χ is equal to its centering constant when φ and ψ take their centering constants as values. This will be true when α = 0. We obtain z = −0.029 and z = −0.026 for (1) and (2), giving p-values of 0.977 and 0.979, respectively, indicating that the data do not discredit the null hypothesis. Lastly, we test the null hypotheses of β1 = 0 and β2 = 0. We obtained p-values of 0.66 and 0.75, respectively, indicating that introducing either β1 or β2 in (1) or (2), when the other is already present, does not significantly improve the given model. The results for the two proposed models are almost the same since the observed independent angles have narrow sample ranges.
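The p-values quoted above follow from the reported test statistics through standard reference distributions. As a small check, assuming a chi-square distribution with 2 degrees of freedom (whose upper tail probability is exp(−x/2)) and a two-sided standard normal test, the following sketch converts the reported statistics to p-values; the function names are hypothetical.

```python
import math

def p_chi2_2df(stat):
    """Upper tail probability of a chi-square variable with 2 degrees of freedom."""
    return math.exp(-stat / 2.0)

def p_two_sided_z(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))

# Statistics reported in the text for models (1) and (2).
for stat in (0.055, 0.059):
    print(f"chi-square = {stat}: p = {p_chi2_2df(stat):.3f}")
for z in (-0.029, -0.026):
    print(f"z = {z}: p = {p_two_sided_z(z):.3f}")
```

Rounded to three decimals, these conversions agree with the p-values of 0.973, 0.971, 0.977 and 0.979 given in the text.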
6. Multivariate-multiple circular regression
We now extend our model to have more than one circular-dependent variable with more than two independent circular variables. For the arc-tangent-sine link and the arc-tangent-tangent link models, we have the following extended models for multivariate-multiple circular regression, with the arc-tangent applied componentwise:
$$\begin{bmatrix} \Theta_1 \\ \vdots \\ \Theta_p \end{bmatrix} = \begin{bmatrix} \mu_{\theta_1} \\ \vdots \\ \mu_{\theta_p} \end{bmatrix} + 2\arctan\left\{ \begin{bmatrix} \alpha_1 & \beta_{11} & \cdots & \beta_{1k} \\ \vdots & \vdots & \ddots & \vdots \\ \alpha_p & \beta_{p1} & \cdots & \beta_{pk} \end{bmatrix} \begin{bmatrix} 1 \\ \sin\frac{\phi_1 - \mu_{\phi_1}}{2} \\ \vdots \\ \sin\frac{\phi_k - \mu_{\phi_k}}{2} \end{bmatrix} \right\} + E$$
and
$$\begin{bmatrix} \Theta_1 \\ \vdots \\ \Theta_p \end{bmatrix} = \begin{bmatrix} \mu_{\theta_1} \\ \vdots \\ \mu_{\theta_p} \end{bmatrix} + 2\arctan\left\{ \begin{bmatrix} \alpha_1 & \beta_{11} & \cdots & \beta_{1k} \\ \vdots & \vdots & \ddots & \vdots \\ \alpha_p & \beta_{p1} & \cdots & \beta_{pk} \end{bmatrix} \begin{bmatrix} 1 \\ \tan\frac{\phi_1 - \mu_{\phi_1}}{2} \\ \vdots \\ \tan\frac{\phi_k - \mu_{\phi_k}}{2} \end{bmatrix} \right\} + E,$$
respectively, where E has a zero mean direction vector. Then, to obtain the LCMSEs of the linear parameter matrix, we minimize the sum of circular distances given by
$$\frac{1}{np}\sum_{i=1}^{p}\sum_{m=1}^{n}\left[1 - \cos(\theta_{im} - \mu_{im})\right],$$
where $\mu_{im}$ is given by
$$\mu_{\theta_i} + 2\arctan\left\{\alpha_i + \sum_{j=1}^{k}\beta_{ij}\sin\frac{\phi_{jm} - \mu_{\phi_j}}{2}\right\} \quad\text{and}\quad \mu_{\theta_i} + 2\arctan\left\{\alpha_i + \sum_{j=1}^{k}\beta_{ij}\tan\frac{\phi_{jm} - \mu_{\phi_j}}{2}\right\},$$
respectively, for the arc-tangent-sine link and the arc-tangent-tangent link models. For our example given in Section 5, we have used φ and ψ as the two dependent variables and χ as the independent variable. The CMSEs are provided in Table 2.

Table 2. CMSEs of multivariate circular regression example.
Arc-tangent-sine link model    Arc-tangent-tangent link model
0.128                          0.123

The fitted models for the arc-tangent-sine link model are
$$\hat\phi = -1.447 + 2\arctan\left\{-0.888 - 0.022 \sin\frac{\chi - 1.327}{2}\right\},$$
$$\hat\psi = 1.07 + 2\arctan\left\{0.611 + 0.113 \sin\frac{\chi - 1.327}{2}\right\},$$
and those for the arc-tangent-tangent link model are
$$\hat\phi = -1.447 + 2\arctan\left\{-0.863 + 0.042 \tan\frac{\chi - 1.327}{2}\right\},$$
$$\hat\psi = 1.07 + 2\arctan\left\{0.693 + 0.144 \tan\frac{\chi - 1.327}{2}\right\}.$$
The results of Theorem 3.1 regarding asymptotic properties of the linear parameters extend to the multivariate-multiple circular regression setting.
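The multivariate objective of this section, the average of 1 − cos(θ_im − μ_im) over p responses and n observations, can be sketched as below. The data layout (lists indexed by response or covariate first, observation second), the function names and the toy values are assumptions made for illustration.

```python
import math

def wrap(angle):
    """Wrap an angle in radians to the principal interval [-pi, pi)."""
    return (angle + math.pi) % (2.0 * math.pi) - math.pi

def mean_direction(i, m, phis, mu_thetas, mu_phis, alphas, betas, link):
    """mu_im: conditional mean of response i at observation m."""
    eta = alphas[i] + sum(
        betas[i][j] * link(wrap(phis[j][m] - mu_phis[j]) / 2.0)
        for j in range(len(phis))
    )
    return mu_thetas[i] + 2.0 * math.atan(eta)

def multivariate_cmse(thetas, phis, mu_thetas, mu_phis, alphas, betas, link=math.sin):
    """Average of 1 - cos(theta_im - mu_im) over p responses and n observations."""
    p, n = len(thetas), len(thetas[0])
    total = sum(
        1.0 - math.cos(thetas[i][m]
                       - mean_direction(i, m, phis, mu_thetas, mu_phis, alphas, betas, link))
        for i in range(p) for m in range(n)
    )
    return total / (n * p)

# Hypothetical toy set-up: p = 2 responses, k = 1 covariate, n = 3 noise-free observations.
phis = [[-0.9, 0.2, 1.1]]
mu_thetas, mu_phis = [0.1, -0.4], [0.0]
alphas, betas = [0.5, -0.2], [[1.0], [-2.0]]
thetas = [[mean_direction(i, m, phis, mu_thetas, mu_phis, alphas, betas, math.sin)
           for m in range(3)] for i in range(2)]

q_fit = multivariate_cmse(thetas, phis, mu_thetas, mu_phis, alphas, betas)
q_flipped = multivariate_cmse([[t + math.pi for t in row] for row in thetas],
                              phis, mu_thetas, mu_phis, alphas, betas)
print(q_fit, q_flipped)
```

Passing `link=math.tan` gives the corresponding objective for the arc-tangent-tangent link model.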
7. Concluding remarks
We proposed two multiple circular regression models in this paper, along with a straightforward extension to multivariate-multiple circular regression. It was shown that the two models produce similar results when the independent angles have sample ranges of less than about 5π/6. For circular data sets having largely dispersed sample ranges of independent angles, we recommend using the arc-tangent-sine link model. We also introduced the circular analog of the mean-square error (MSE), called the CMSE in this paper, together with a straightforward extension of the method for a univariate circular-dependent variable to the multivariate case. One can, of course, study the least CMSE estimation in full generality, parallel to the development for the multivariate linear regression case. However, this entails more complicated computations and is currently under study. It is well known that circular variables are rarely symmetrically distributed. Usually the von Mises distribution, which is symmetric and unimodal, is used in circular regression; see, e.g. [10,11]. However, in the case of asymmetric circular data, we should use a possibly asymmetric distribution, e.g. the Asymmetric Generalized von Mises (AGvM) distribution proposed by Kim and SenGupta [19], whose pdf is
$$f(\theta) = \frac{1}{2\pi G_0(\pi/4, \kappa_1, \kappa_2)} \exp\left(\kappa_1 \cos(\theta - \mu) + \kappa_2 \sin 2(\theta - \mu)\right), \tag{5}$$
where $\mu \in [-\pi, \pi)$ is the location parameter, $\kappa_1 > 0$ and $\kappa_2 \in [-1, 1]$ are the concentration and skewness parameters, respectively, and $G_0(\pi/4, \kappa_1, \kappa_2)$ denotes the normalizing constant. The maximum likelihood method may be conveniently employed to estimate the parameters involved, and the resulting regression may convincingly fit the data. Alternatively, as adopted in this paper for our proposed models, we do not assume that the circular measurement errors have any specified distribution and thereby avoid its incorrect specification.
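The AGvM density in Equation (5) can be evaluated without a closed form for its normalizing constant by integrating the kernel numerically over one period. The sketch below is an illustration under that assumption: the function names, the number of grid points and the parameter values are hypothetical, and the quantity returned by `agvm_normalizer` plays the role of 2πG_0(π/4, κ1, κ2).

```python
import math

def agvm_kernel(theta, mu, kappa1, kappa2):
    """Unnormalized AGvM density exp(k1 cos(theta - mu) + k2 sin 2(theta - mu))."""
    return math.exp(kappa1 * math.cos(theta - mu)
                    + kappa2 * math.sin(2.0 * (theta - mu)))

def agvm_normalizer(mu, kappa1, kappa2, steps=720):
    """Integral of the kernel over [-pi, pi) by the rectangle rule on a periodic grid."""
    h = 2.0 * math.pi / steps
    return h * sum(agvm_kernel(-math.pi + i * h, mu, kappa1, kappa2)
                   for i in range(steps))

def agvm_pdf(theta, mu, kappa1, kappa2, steps=720):
    """AGvM density with a numerically computed normalizing constant."""
    return agvm_kernel(theta, mu, kappa1, kappa2) / agvm_normalizer(mu, kappa1, kappa2, steps)

# With kappa2 = 0 the kernel is that of a von Mises density, whose integral is 2 pi I_0(kappa1).
vm_constant = agvm_normalizer(0.3, 1.0, 0.0)
print(vm_constant, 2.0 * math.pi * 1.2660658777520082)

# Hypothetical skewed case: the numerically normalized density integrates to one.
h = 2.0 * math.pi / 360
total_mass = h * sum(agvm_pdf(-math.pi + i * h, 0.3, 1.5, 0.6) for i in range(360))
print(total_mass)
```

The rectangle rule is used because it is very accurate for smooth periodic integrands; any other quadrature rule could be substituted.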
Disclosure statement No potential conflict of interest was reported by the authors.
References
[1] Fernández-Durán JJ, Gregorio-Dominguez MM. Modeling angles in proteins and circular genomes using multivariate angular distributions based on multiple nonnegative trigonometric sums. Statist Appl Genet Mol Biol. 2014;13:1–18.
[2] Mardia KV, Hughes G, Taylor CC, et al. A multivariate von Mises distribution with applications to bioinformatics. Can J Stat. 2008;36:99–109.
[3] Rueda C, Fernandez MA, Barragan S, et al. Circular piecewise regression with applications to cell-cycle data. Biometrics. 2016. doi:10.1111/biom.12512.
[4] Hughes G. Multivariate and time series models for circular data with applications to protein conformational angles [PhD thesis]. Leeds, UK: University of Leeds; 2007.
[5] Hrushesky WJ. Circadian timing of cancer chemotherapy. Science. 1985;228:73–75.
[6] Jones MC, Silverman BW. An orthogonal series density estimation approach to reconstructing positron emission tomography images. J Appl Stat. 1989;16:177–191.
[7] Lowrey PL, Shimomura K, Antoch HMP, et al. Positional syntenic cloning and functional characterization of the mammalian circadian mutation tau. Science. 2000;288:483–491.
[8] Rivest L-P. A decentred predictor for circular–circular regression. Biometrika. 1997;84:717–726.
[9] Shearman LP, Sriram S, Weaver DR, et al. Interacting molecular loops in the mammalian circadian clock. Science. 2000;288:1013–1019.
[10] Downs TD, Mardia KV. Circular regression. Biometrika. 2002;89:683–698.
[11] Fisher NI, Lee AJ. Regression models for an angular response. Biometrics. 1992;48:665–677.
[12] Johnson RA, Wehrly TE. Some angular-linear distributions and related regression models. J Am Statist Assoc. 1978;73:602–606.
[13] Kim S. Inverse circular regression with possibly asymmetric circular error distribution [PhD thesis]. Riverside (CA): University of California; 2009.
[14] SenGupta A, Kim S, Arnold BC. Inverse circular–circular regression. J Multivariate Anal. 2013;119:200–208.
[15] SenGupta A, Ugwuowo FI. Asymmetric circular-linear multivariate regression models with applications to environmental data. Environ Ecol Stat. 2006;13:299–309.
[16] Jammalamadaka RS, SenGupta A. Topics in Circular Statistics. New York: World Scientific Inc; 2001.
[17] Oliveira M, Crujeiras RM, Rodriguez-Casal A. A plug-in rule for bandwidth selection in circular density estimation. Comput Stat Data Anal. 2012;56:3898–3908.
[18] Chakrabarti P, Pal D. The interrelationships of side-chain and main-chain conformations in proteins. Progress Biophys Mol Biol. 2001;76:1–102.
[19] Kim S, SenGupta A. A three parameter generalized von Mises distribution. Commun Stat: Theory Methods. 2013;54:685–693.
[20] Cameron CA, Trivedi PK. Microeconometrics: methods and applications. New York: Cambridge University Press; 2005.
[21] Mittelhammer R. Mathematical statistics for economics and business. London: Springer-Verlag; 1999.
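Appendix
Proof of Theorem 3.1
The limiting distribution of $\hat\zeta$, the LCMSE of $\zeta = (\alpha, \beta_1, \beta_2)'$, is obtained using the exact first-order Taylor series expansion of the first-order condition, for some $\zeta^+$ between $\hat\zeta$ and $\zeta_0$ [20]:
$$\frac{\partial Q_n(\zeta)}{\partial\zeta} = \frac{1}{n}\sum_{j=1}^{n}\frac{\partial m_j(\zeta)}{\partial\zeta}\sin\{\theta_j - m_j(\zeta)\} = \frac{1}{n}\sum_{j=1}^{n}\left.\frac{\partial m_j(\zeta)}{\partial\zeta}\sin\{\theta_j - m_j(\zeta)\}\right|_{\zeta_0} - \frac{1}{n}\sum_{j=1}^{n}\left.\left[\frac{\partial m_j(\zeta)}{\partial\zeta}\frac{\partial m_j(\zeta)}{\partial\zeta'}\cos\{\theta_j - m_j(\zeta)\} - \frac{\partial^2 m_j(\zeta)}{\partial\zeta\,\partial\zeta'}\sin\{\theta_j - m_j(\zeta)\}\right]\right|_{\zeta^+}(\hat\zeta - \zeta_0) = 0,$$
so that
$$\sqrt{n}(\hat\zeta - \zeta_0) = \left\{\frac{1}{n}\sum_{j=1}^{n}\left.\left[\frac{\partial m_j(\zeta)}{\partial\zeta}\frac{\partial m_j(\zeta)}{\partial\zeta'}\cos\{\theta_j - m_j(\zeta)\} - \frac{\partial^2 m_j(\zeta)}{\partial\zeta\,\partial\zeta'}\sin\{\theta_j - m_j(\zeta)\}\right]\right|_{\zeta^+}\right\}^{-1} \cdot \frac{1}{\sqrt{n}}\sum_{j=1}^{n}\left.\frac{\partial m_j}{\partial\zeta}\sin(\theta_j - m_j)\right|_{\zeta_0}.$$
We apply the multivariate CLT for independent random vectors [21] to obtain the asymptotic multivariate normality of $(1/\sqrt{n})\sum_{j=1}^{n}(\partial m_j/\partial\zeta)\sin(\theta_j - m_j)|_{\zeta_0}$. Then,
$$\frac{1}{\sqrt{n}}\sum_{j=1}^{n}\left.\frac{\partial m_j}{\partial\zeta}\sin(\theta_j - m_j)\right|_{\zeta_0} \overset{d}{\longrightarrow} N(0, B_0),$$
where
$$B_0 = \mathrm{var}\left[\left.\frac{\partial m_j}{\partial\zeta}\sin(\theta_j - m_j)\right|_{\zeta_0}\right] = E\left[\left.\frac{\partial m_j}{\partial\zeta}\frac{\partial m_j}{\partial\zeta'}\sin^2(\theta_j - m_j)\right|_{\zeta_0}\right],$$
and
$$\mathrm{plim}\,\frac{1}{n}\sum_{j=1}^{n}\left.\left[\frac{\partial m_j(\zeta)}{\partial\zeta}\frac{\partial m_j(\zeta)}{\partial\zeta'}\cos(\theta_j - m_j) - \frac{\partial^2 m_j(\zeta)}{\partial\zeta\,\partial\zeta'}\sin(\theta_j - m_j)\right]\right|_{\zeta^+}$$
$$= \lim\frac{1}{n}\sum_{j=1}^{n}\left.\left\{E\left[\frac{\partial m_j(\zeta)}{\partial\zeta}\frac{\partial m_j(\zeta)}{\partial\zeta'}\cos(\theta_j - m_j)\right] - E\left[\frac{\partial^2 m_j(\zeta)}{\partial\zeta\,\partial\zeta'}\sin(\theta_j - m_j)\right]\right\}\right|_{\zeta^+}$$
$$= \left.\left\{E\left[\frac{\partial m_j(\zeta)}{\partial\zeta}\frac{\partial m_j(\zeta)}{\partial\zeta'}\cos(\theta_j - m_j)\right] - E\left[\frac{\partial^2 m_j(\zeta)}{\partial\zeta\,\partial\zeta'}\sin(\theta_j - m_j)\right]\right\}\right|_{\zeta^+} = A_0.$$
Now, using Slutsky's theorem (or the Product Limit Normal Rule), we get
$$\sqrt{n}(\hat\zeta - \zeta_0) \overset{d}{\longrightarrow} N(0, A_0^{-1} B_0 A_0^{-1}).$$
Then, the asymptotic distribution is given by
$$\hat\zeta \overset{a}{\sim} N(\zeta_0, n^{-1} A_0^{-1} B_0 A_0^{-1}).$$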