Predicting ligand binding residues using multi-positional correlations and kernel canonical correlation analysis

Alvaro J. González, Li Liao and Cathy H. Wu
Department of Computer and Information Sciences, University of Delaware
101 Smith Hall, Newark, DE 19716
alvaro, lliao, [email protected]

Abstract: We present a new computational method for predicting ligand binding sites in protein sequences. The method uses kernel-based canonical correlation analysis and linear regression to identify binding sites in protein sequences as the residues that exhibit strong correlation between the residues' evolutionary characterization at the sites and the structure-based functional classification of the proteins in the context of a functional family. We explore the effect of correlations among multiple positions in the sequences and show that their inclusion enhances the prediction accuracy significantly.

Keywords: functional residues; specificity determining positions; multiple sequence alignments; kernel canonical correlation analysis.

I. INTRODUCTION

Proteins are responsible for most cellular functions. They often fulfill these functions by interacting with other biomolecules, for instance, ligands. When a ligand binds a protein, the protein's three-dimensional shape is often altered, causing a signal to be modulated, which can trigger a cascade of other reactions. The study of ligand binding sites in proteins is key to protein function characterization and to pharmaceutical applications in drug discovery and design, because many drugs target ligand binding sites in order to intervene in the cellular processes of interest.

Evolution plays a central role in almost every aspect of biology, and consequently the use of evolutionary information lays the foundation for many bioinformatics solutions. Ligand binding site prediction is no exception. Methods that use sequence homology and phylogenetic inference predict the sites by aligning target proteins to homologous proteins whose functional sites have already been studied [1]. The conservation of a residue, calculated from the amino acid frequency distribution of columns in multiple sequence alignments of proteins with a certain degree of homology, also serves as a measure of the functional or structural constraints that have acted on that position, and has therefore been widely used to predict ligand binding residues [2].

Computational prediction of ligand binding sites has also been extensively studied from the structural bioinformatics perspective. When the three-dimensional structure of the protein of interest is known, its ligand binding sites can be predicted by means of structural alignments to better-studied proteins (with known ligand binding sites). This type of alignment can find structural similarities even at vast evolutionary distances and between proteins with very low sequence homology [3].
The need to incorporate both evolutionary and structural information to improve the accuracy of binding site predictors has been recognized. In fact, these two sources of information have been shown to be largely complementary. Typically, existing methods encode individual amino acids in the target proteins into feature vectors whose components are measurements of attributes directly or indirectly related to sequence conservation and structure, and use supervised and unsupervised machine learning techniques to predict functional sites in the encoded proteins [4].

In this work, we propose a new approach to binding site prediction. The method uses evolutionary and structural information in a reconciliatory manner, by calculating the correlation between positions in a multiple sequence alignment and a functional description, based on structural information, of the proteins in the alignment. The positions that exhibit stronger correlation are predicted as binding sites. The use of kernel canonical correlation analysis (kCCA) not only allows for rotations in the original coordinate system of the multivariates to maximize the correlation, but also enables the capture of non-linear dependencies through the kernel trick. Moreover, by using multiple-view kCCA, the method explores the effect of multiple-position correlations, in recognition of the evolutionary pressure on the binding residues as a whole to retain a conformation that is thermodynamically favorable for ligand binding.

II. METHODS AND DATA

A. Binding sites from the evolutionary and structural perspectives

The binding sites, being essential to the protein's function, tend to be more conserved during evolution. Proteins that are believed to be evolutionarily related to one another are aligned into a multiple sequence alignment (MSA) that reveals the regions of conservation. These conserved regions are candidates for the domains that are shared by all the members in the family and are responsible for the protein functions. For instance, these might be regions where the proteins bind with other proteins or with small molecules. They could also be the active sites of enzyme families.
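As an illustration of how the conservation of an MSA column can be quantified, the sketch below scores each column by one minus its normalized Shannon entropy. The entropy measure, the gap symbol `-`, the function name `column_conservation` and the toy alignment are all assumptions made for this example; the text does not commit to a particular conservation measure.

```python
import math
from collections import Counter

def column_conservation(column, gap="-"):
    """Return 1 - normalized Shannon entropy of one MSA column.

    1.0 means a fully conserved column; gaps are ignored (an assumption).
    """
    residues = [c for c in column if c != gap]
    if not residues:
        return 0.0
    counts = Counter(residues)
    total = len(residues)
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    # Normalize by the maximum entropy over the 20 standard amino acids.
    return 1.0 - entropy / math.log2(20)

# Hypothetical toy alignment: 4 sequences, 4 columns.
msa = ["GAGK", "GSGK", "GTGR", "GA-K"]
columns = ["".join(seq[i] for seq in msa) for i in range(len(msa[0]))]
scores = [column_conservation(col) for col in columns]
```

Columns made of a single residue type receive the maximum score, while mixed columns score lower. As the following paragraph argues, a high conservation score alone does not establish that a column is a binding site.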
Figure 1. Xdet method.

Once the biological relevance of these conserved regions in an MSA is confirmed, the MSA can serve as a basis to predict functional residues in new sequences: when a new sequence is aligned to the existing MSA, the residues aligned to the conserved regions are taken to be the functional ones. Unfortunately, this approach can lead to false positive predictions, because the conserved regions, although functionally relevant, might not be playing a key role in the actual function of interest: they might be regions that provide physicochemical stability to the folded protein, and not functional sites per se. On the other hand, this homology approach can also lead to false negative predictions. Because non-identical amino acid sequences can fold into highly similar three-dimensional protein structures, regions that do not appear highly conserved in the MSA can possess identical (or considerably similar) conformations in their folded state, and hence host the binding sites. Such observations motivate
structural alignments to capture common patterns in the tertiary structures and eventually predict binding sites. While conceptually appealing, this methodology is costly and time consuming.

Pazos et al. [5] recently proposed a method that uses structural information (not as detailed as the tertiary structure) to rectify the conservation in an MSA in order to achieve "phylogeny-independent" prediction of functional residues. The algorithm, called Xdet, takes as input the MSA of the sequences in a family and a functional classification of the member proteins that can be based on structural information. Xdet then predicts as functional sites only those columns (or positions) whose pairwise amino acid similarities correlate strongly with the proteins' functional classification. Specifically, Xdet measures the correlation by the Pearson correlation coefficient between the amino acid and the functional similarity matrices.

Fig. 1 shows Xdet's schema. For each position in the alignment, a matrix quantifying the amino acid changes for all pairs of proteins is constructed based on, for instance, a substitution scoring scheme such as BLOSUM. In this matrix, a given entry represents the similarity between the residues of two proteins at that position. Another matrix is constructed from an external explicit functional classification, likely determined from structural information such as ligand binding. Each entry of this matrix represents the "functional similarity" between the corresponding proteins. The two matrices are compared element-wise using the Pearson correlation coefficient $r_k$:

$$r_k = \frac{\sum_{i,j} \left( A(k)_{ij} - \bar{A}(k) \right) \cdot \left( F_{ij} - \bar{F} \right)}{\sqrt{\sum_{i,j} \left( A(k)_{ij} - \bar{A}(k) \right)^2} \cdot \sqrt{\sum_{i,j} \left( F_{ij} - \bar{F} \right)^2}} \tag{1}$$

where $A(k)_{ij}$ is the similarity between the amino acids of proteins $i$ and $j$ at position $k$ in an MSA of $N$ proteins of length $L$, and $F_{ij}$ is the functional similarity between proteins $i$ and $j$; bars denote averages. Coefficient $r_k$ gives the score for position $k$. Positions with high $r_k$ values are predicted as functional sites.

Because the Pearson correlation is calculated based on fixed pairings of variables, some subtle and potentially revealing correlations between residue composition and functional specificity can be missed. In fact, notice that Eq. 1 implicitly flattens
out the two matrices and treats them as vectors. It is worthwhile to note that the distance matrix has an embedded structure (the phylogenetic tree can be constructed from the distance matrix, based of course on the whole MSA rather than on one column of it), and flattening it into a vector loses that structure. Geometrically, the Pearson correlation coefficient can also be viewed as the cosine of the angle between the two mean-centered vectors of samples drawn from the two random multivariates. While the use of the Pearson correlation reflects the rigid view of using the functional classification to sanction the homology (or phylogeny) based prediction of the binding sites, it is reasonable to believe that a better solution may exist somewhere in between. This motivated the idea of using a different measure of correlation, the canonical correlation, which does not treat the direction of the two vectors as fixed, but allows rotations in their principal components to find stronger correlations.

In [6], we investigated using canonical correlation analysis (CCA), instead of Pearson correlation, to detect functional residues. CCA has the advantage of not flattening the matrices into vectors; instead, it treats each matrix as a set of $N$ realizations of an $N$-dimensional random multivariate ($N$ is the number of proteins in the MSA). The improved prediction accuracy of CCA in a side-by-side comparison with Xdet suggests that, by allowing the correlation measure to reach beyond the fixed direction of the sampled realizations, more subtle and complex sources of mutual information may be discovered, leading to the detection of positive signals in highly noisy environments; in this case, signals that might have been buried in the phylogenetic relationship among the sequences, which is intrinsic to the MSA.

B. Kernel canonical correlation analysis

In this work, we first extend the use of CCA for functional site prediction to kernel-based CCA (kCCA), primarily to capture non-linear relationships in the data. In many natural processes, a linear relation is only an approximation of the underlying non-linear relation. It is therefore reasonable to investigate if and how this might affect binding site prediction. Like other kernel-based methods, kCCA is capable of capturing non-linear relationships by using non-linear kernels. The theoretical development of the kCCA technique is thoroughly described in [7], [8] and [9]. For completeness and for convenience in the later discussion, we present here a brief mathematical introduction to kCCA.

Suppose there is a pair of multivariates, $x_1 \in R^{n_1}$, $x_2 \in R^{n_2}$. Let $\{x_1^1, x_1^2, \ldots, x_1^N\}$ and $\{x_2^1, x_2^2, \ldots, x_2^N\}$ denote sets of $N$ empirical observations (hereinafter called datapoints) of $x_1$ and $x_2$, respectively, and let $\{\phi_1(x_1^1), \phi_1(x_1^2), \ldots, \phi_1(x_1^N)\}$ and $\{\phi_2(x_2^1), \phi_2(x_2^2), \ldots, \phi_2(x_2^N)\}$ denote their corresponding images in a (potentially) higher-dimensional space. Formally, we will say that $\phi_1$ maps $x_1$ to the Hilbert space $H_1$, and $\phi_2$ maps $x_2$ to the Hilbert space $H_2$. $H_1$ and $H_2$ are the image spaces. Suppose that the datapoints are centered in their corresponding image space. Let $K_1$ and $K_2$ be positive semidefinite matrices of size $N \times N$ that denote the kernel matrices of the datapoints of $x_1$ and $x_2$, that is:

$$K_t(i, j) = \left\langle \phi_t(x_t^i), \phi_t(x_t^j) \right\rangle, \quad t \in \{1, 2\} \tag{2}$$

with $1 \le i, j \le N$. It can be seen that the kernel is the evaluation of an inner product between two datapoints in the image space.
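To make Eq. 2 concrete, the following minimal sketch builds a kernel matrix by treating the rows of a data matrix as datapoints, and then centers the datapoints in the image space. The Gaussian form $\exp(-\lVert x - y \rVert^2 / 2\sigma^2)$ of the RBF kernel, the helper names `rbf_kernel_matrix` and `center_kernel`, and the random toy data are assumptions of this example, not details taken from the references.

```python
import numpy as np

def rbf_kernel_matrix(X, sigma=2.0):
    """K(i, j) = exp(-||x_i - x_j||^2 / (2 sigma^2)) for the rows x_i of X."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against round-off
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def center_kernel(K):
    """Center the datapoints in the image space: K_c = H K H, H = I - 11^T/N."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

# Hypothetical toy data: 5 datapoints in R^3.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
K = rbf_kernel_matrix(X)
K_centered = center_kernel(K)
```

Centering through the matrix $H$ avoids ever computing the images $\phi_t(x_t^i)$ explicitly, which is the point of working with kernels.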

The goal of kCCA is to find directions $f_1 \in H_1$ and $f_2 \in H_2$ such that the features

$$u_t = \langle f_t, \phi_t(x_t) \rangle, \quad t \in \{1, 2\} \tag{3}$$

are maximally correlated. Directions orthogonal to linear combinations of the datapoints in the image space do not contribute to any correlations, so we can restrict $f_1$ and $f_2$ to:

$$f_t = \sum_{i=1}^{N} \alpha_t^i \, \phi_t(x_t^i), \quad t \in \{1, 2\} \tag{4}$$

and now the problem becomes finding the coefficients $\alpha_1^i$, $\alpha_2^i$, $1 \le i \le N$. [7], [8] and [9] show how this optimization problem can be reduced to a regularized Lagrangian that can be maximized by solving the generalized eigenvalue problem:

$$\begin{pmatrix} 0 & K_1 K_2 \\ K_2 K_1 & 0 \end{pmatrix} \begin{pmatrix} \vec{\alpha}_1 \\ \vec{\alpha}_2 \end{pmatrix} = \rho \begin{pmatrix} (K_1 + \lambda_1 I)^2 & 0 \\ 0 & (K_2 + \lambda_2 I)^2 \end{pmatrix} \begin{pmatrix} \vec{\alpha}_1 \\ \vec{\alpha}_2 \end{pmatrix} \tag{5}$$

where $\vec{\alpha}_1 = (\alpha_1^1, \alpha_1^2, \ldots, \alpha_1^N)^T$, $\vec{\alpha}_2 = (\alpha_2^1, \alpha_2^2, \ldots, \alpha_2^N)^T$, $I$ is the identity matrix of size $N \times N$, and $\lambda_1$, $\lambda_2$ are small regularization constants.

The problem can be solved without using regularization, but it has been shown [8] that the non-regularized solution does not generalize well, especially when the image space is of high dimensionality. The consequence is that the features $u_1$ and $u_2$ are 100% correlated, a clear sign of data overfitting. To avoid this phenomenon, regularization is strongly recommended.

Eq. 5 is a symmetric generalized eigenvalue problem whose right-hand matrix is positive definite for $\lambda_1, \lambda_2 > 0$, so it can be solved efficiently using a Cholesky decomposition. It can produce multiple solutions, which are ranked in descending eigenvalue ($\rho$) order:

$$\left\{ \vec{\alpha}_1^{(1)}, \vec{\alpha}_2^{(1)}, \rho^{(1)} \right\}, \left\{ \vec{\alpha}_1^{(2)}, \vec{\alpha}_2^{(2)}, \rho^{(2)} \right\}, \ldots, \left\{ \vec{\alpha}_1^{(R)}, \vec{\alpha}_2^{(R)}, \rho^{(R)} \right\}$$

where $\rho^{(1)} \ge \rho^{(2)} \ge \ldots \ge \rho^{(R)}$ and $R$ is the number of non-equal solutions of Eq. 5. Now, by evaluating

$$\vec{u}_t^{(1)} = K_t \, \vec{\alpha}_t^{(1)}, \quad t \in \{1, 2\} \tag{6}$$

we can obtain the projections (features) of the datapoints in the direction of maximum correlation in the image space. The Pearson correlation between $\vec{u}_1^{(1)}$ and $\vec{u}_2^{(1)}$ produces the first component of the canonical correlation. One can also calculate the second component by using $\vec{u}_1^{(2)}$ and $\vec{u}_2^{(2)}$, and so on.

In an attempt to filter the noise out of the data, we defined the total power in each dataset as

$$\rho_{TOTAL}^2 = (\rho^{(1)})^2 + (\rho^{(2)})^2 + \ldots + (\rho^{(R)})^2 \tag{7}$$

and used as many components as needed to account for 90% of the total power. If that number of components is $C$, the final measure of canonical correlation between amino acid similarity and functional similarity that we use is a weighted sum of the Pearson correlations between $\vec{u}_1^{(1)}$ and $\vec{u}_2^{(1)}$, $\vec{u}_1^{(2)}$ and $\vec{u}_2^{(2)}$, $\ldots$, $\vec{u}_1^{(C)}$ and $\vec{u}_2^{(C)}$, where the weights $w_1, w_2, \ldots, w_C$ are proportional to $\rho^{(1)}, \rho^{(2)}, \ldots, \rho^{(C)}$ and add up to 1. Formally,

$$kCCA(K_1, K_2, \lambda) = \sum_{i=1}^{C} w_i \cdot Pearson\left( \vec{u}_1^{(i)}, \vec{u}_2^{(i)} \right) \tag{8}$$

where $\lambda = [\lambda_1, \lambda_2]$ in Eq. 5, and the $Pearson$ correlation was formulated in Eq. 1.

Applying kCCA to the task of predicting functional sites is straightforward: simply replace $K_1$ and $K_2$ in Eq. 5 with $K(A_k)$ and $K(F)$, the kernel versions of the amino acid similarity matrix at position $k$ and the functional similarity matrix, respectively. These kernels can be obtained in one of two ways:

1) by treating the matrices' rows as datapoints, and filling in entry $K_{i,j}$ with the kernel relationship between rows $i$ and $j$. This kernel relationship can be any kernel function: linear (in which case we would be doing non-kernel CCA), polynomial, RBF, etc.;

2) by using $A_k$ and $F$ as the kernels $K_1$ and $K_2$ themselves, in which case $A_k$ and $F$ faithfully represent what they measure: not point representations of amino acids and proteins, but pairwise similarities among them.

In this latter case, the positive semidefinite requirement for being a kernel can be easily met by a widely adopted technique: adding to all the diagonal elements of the matrix $A_k$ (or $F$) a positive number that is larger than the absolute value of the most negative eigenvalue of $A_k$ (or $F$) [10]. The second option, we believe, allows for a much more transparent and natural data representation, and provides a solution to the rigidity imposed by Xdet when it flattens $A_k$ and $F$ into vectors.

C. Multi-view kernel canonical correlation analysis

Thus far, the columns in the MSA have been assumed to be independent; however, plenty of evidence indicates that they are not. Firstly, the functional site is often not a single amino acid, but a region in the three-dimensional structure of the protein that comprises a number of amino acids. Secondly, the amino acids in this set do not need to be near one another in the MSA (they are, however, close to one another in three-dimensional space).
Therefore, we hypothesize that this set of amino acids will manifest a higher correlation with the functional specificity if studied as a whole. In order to capture such correlations among multiple non-neighboring positions, we adopt a generalization of kCCA, multi-view kCCA (mvkCCA). We briefly introduce mvkCCA here for our purpose. A more formal and detailed description can be found in [8] and in [9].

Suppose there are $m$ multivariates, $x_1 \in R^{n_1}$, $x_2 \in R^{n_2}$, $\ldots$, $x_m \in R^{n_m}$. Let $\{x_1^i\}$, $\{x_2^i\}$, $\ldots$, $\{x_m^i\}$, $1 \le i \le N$, denote sets of $N$ datapoints of $x_1, x_2, \ldots, x_m$, respectively, and let $\{\phi_1(x_1^i)\}$, $\{\phi_2(x_2^i)\}$, $\ldots$, $\{\phi_m(x_m^i)\}$, $1 \le i \le N$, denote their corresponding images in Hilbert spaces $H_1, H_2, \ldots, H_m$. The goal of mvkCCA is to find directions $f_1 \in H_1$, $f_2 \in H_2$, $\ldots$, $f_m \in H_m$ such that the sum of all pairwise correlations between the features

$$u_t = \langle f_t, \phi_t(x_t) \rangle, \quad t \in \{1, 2, \ldots, m\} \tag{9}$$

is the largest possible. As before, we can restrict $f_1, f_2, \ldots, f_m$ to:

$$f_t = \sum_{i=1}^{N} \alpha_t^i \, \phi_t(x_t^i), \quad t \in \{1, 2, \ldots, m\} \tag{10}$$

and the problem becomes finding the coefficients $\alpha_1^i, \alpha_2^i, \ldots, \alpha_m^i$, $1 \le i \le N$. The problem can be reduced to the following generalized eigenvalue problem [8], [9]:

$$\begin{pmatrix} 0 & K_1 K_2 & \cdots & K_1 K_m \\ K_2 K_1 & 0 & \cdots & K_2 K_m \\ \vdots & \vdots & \ddots & \vdots \\ K_m K_1 & K_m K_2 & \cdots & 0 \end{pmatrix} \begin{pmatrix} \vec{\alpha}_1 \\ \vec{\alpha}_2 \\ \vdots \\ \vec{\alpha}_m \end{pmatrix} = \rho \begin{pmatrix} (K_1 + \lambda_1 I)^2 & 0 & \cdots & 0 \\ 0 & (K_2 + \lambda_2 I)^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & (K_m + \lambda_m I)^2 \end{pmatrix} \begin{pmatrix} \vec{\alpha}_1 \\ \vec{\alpha}_2 \\ \vdots \\ \vec{\alpha}_m \end{pmatrix} \tag{11}$$

Features of the datapoints in the direction of maximum correlation in the image space can be found using Eq. 6. In feature space, the canonical correlation between any pair of multivariates $x_q$, $x_r$, $1 \le q \ne r \le m$, is calculated using the Pearson correlation measure, similar to Eq. 8:

$$mvkCCA\left( K_q, K_r, K_s \big|_{1 \le s \le m,\; s \ne q,\; s \ne r}, \lambda \right) = \sum_{i=1}^{C} w_i \cdot Pearson\left( \vec{u}_q^{(i)}, \vec{u}_r^{(i)} \right) \tag{12}$$

where $\lambda = [\lambda_1, \lambda_2, \ldots, \lambda_m]$ in Eq. 11.

D. Linear Regression kCCA

We now present our algorithms, which exploit kCCA and mvkCCA with linear regression to tap into the multi-positional correlations in an MSA for more accurate prediction of binding sites. As reasoned above, while scoring each column in an MSA based on its correlation with the functional classification has proved to be useful, it provides only a partial measurement, perhaps a first-order approximation, of how likely a position is to be a binding site. More information can be added by correlating one position with other positions. Specifically, by incrementing the number of columns that are together correlated with the functional similarity matrix, we can gain accuracy in the measurement of how likely a site is to be a binding site.

Algorithm 1 Enumerating all possible combinations of columns.
  % Iteration 1 %
  for each column $k$, $1 \le k \le L$ do
    $S_1(k) = kCCA(K(A_k), K(F), \lambda)$
  end for
  % Iteration 2 %
  for each column $k$, $1 \le k \le L$ do
    $S_2(k) = \max_{1 \le l \le L,\; l \ne k} mvkCCA(K(A_k), K(F), K(A_l), \lambda)$
  end for
  % Iteration 3 %
  for each column $k$, $1 \le k \le L$ do
    $S_3(k) = \max_{(l, m) \in \{\text{all pairs from } [1 \ldots L]\},\; l \ne m \ne k} mvkCCA(K(A_k), K(F), K(A_l), K(A_m), \lambda)$
  end for

Algorithm 2 Using seeds.
  % Iteration 1 %
  for each column $k$, $1 \le k \le L$ do
    $S_1(k) = kCCA(K(A_k), K(F), \lambda)$
  end for
  $Seed_1 = \arg\max_k S_1(k)$
  % Iteration 2 %
  for each column $k$, $1 \le k \le L$ do
    $S_2(k) = mvkCCA(K(A_k), K(F), K(A_{Seed_1}), \lambda)$
  end for
  $Seed_2 = \arg\max_k S_2(k)$
  % Iteration 3 %
  for each column $k$, $1 \le k \le L$ do
    $S_3(k) = mvkCCA(K(A_k), K(F), K(A_{Seed_1}), K(A_{Seed_2}), \lambda)$
  end for
  $Seed_3 = \arg\max_k S_3(k)$

In this work, we only correlate up to three columns. So, the score for a given site $k$ is

$$S(k) = \hat{w}_1 \cdot S_1(k) + \hat{w}_2 \cdot S_2(k) + \hat{w}_3 \cdot S_3(k) \tag{13}$$

where $S_1(k)$, $S_2(k)$ and $S_3(k)$ are, respectively, the scores calculated for $k$ alone, when pairing it with one other site, and when pairing it with two other sites, and $\hat{w}_1$, $\hat{w}_2$ and $\hat{w}_3$ are estimated weights, fixed by linear regression (more details in Sect. III).

The initially proposed procedure for computing the scores is listed in Alg. 1. It starts by iterating through all the columns in the MSA and calculating the kernel canonical correlation between $K(A_k)$ and $K(F)$. The correlation measure at position $k$ gives $S_1(k)$. $S_2(k)$ is obtained from a 3-view kernel canonical correlation between the column's $K(A_k)$, the functional $K(F)$, and another column $l$, which is chosen to maximize the 3-view mvkCCA. In a similar way, $S_3(k)$ is obtained from a 4-view kernel canonical correlation that is maximized by enumerating all possible triplets of columns containing position $k$ in the MSA. Counting (mv)kCCA evaluations over all columns, the time complexity is $O(L)$ for $S_1$, $O(L^2)$ for $S_2$, and $O(L^3)$ for $S_3$. In general, the time complexity to maximize a mvkCCA for an $n$-tuple in an MSA of length $L$ is $O(L^n)$, which becomes very costly as $n$ increases, while at the same time the extra information gained may become tenuous. This is why only up to triplets are considered in this work.

Figure 2. Crystal structure of the human protein RhoA (pdb 1ftn), with its bound GDP ligand (dotted surface). This is a small GTPase protein known to regulate the actin cytoskeleton in the formation of stress fibers. Dark colored residues are predicted by Algorithm 1 to be ligand binding sites.

To alleviate the time complexity, we also propose a heuristic algorithm using seeds (Alg. 2). After $S_1(k)$ is obtained, the position with maximum correlation is saved as the first seed, $Seed_1$. When calculating $S_2(k)$, a 3-view kernel canonical correlation is calculated between the column's $K(A_k)$, the functional $K(F)$, and a third view of the data, namely the kernel based on the amino acid similarity matrix corresponding to the first seed, $K(A_{Seed_1})$. The second seed, $Seed_2$, is the position
with maximum $S_2(k)$. In a similar way, a third iteration is run in which a 4-view kernel canonical correlation is used to correlate $K(A_k)$ and $K(F)$, including the views of the first and the second seeds, $K(A_{Seed_1})$ and $K(A_{Seed_2})$. Note that, when the seeds are used, iteration $n$ is highly dependent on iterations $(n-1), \ldots, 1$, in the sense that the columns in iteration $n$ are grouped with the seeds from iterations $(n-1), \ldots, 1$ in order to evaluate their multi-column correlation with the functional similarity. This makes the algorithm highly sensitive to the choice of the seeds: an error in seed selection will inevitably compound.

E. Data

To test our methods, we use a dataset collected from the RAS oncogene protein family. Proteins in this family are small GTPases that are involved in cellular signal transduction and in the execution of mitogenic signals. The mutation of a single amino acid in a member of this family can cause the protein to become permanently overactive, leading to constant stimulation of RAS-dependent signaling pathways, even in the absence of mitogenic stimulation. For this reason, mutations in this family may contribute to the development of cancer, which makes the family an important target for pharmaceutical development.

The dataset was originally used in [5]. It contains 24 distinct proteins that were aligned according to structural similarity using the Dali method [11]. The alignment contains proteins binding different ligands, including nucleotides (GTP, FMN, FAD), nucleosides and sugars. For instance, Fig. 2 shows the three-dimensional structure of one of the members of this family, the human RhoA protein (pdb 1ftn), with its bound GDP ligand. The functional similarity between two proteins is measured by the chemical similarity between their bound ligands, according to the Tanimoto coefficient [12]. The similarity between two amino acids in a column is measured in binary format: 1 if the two amino acids are the same, 0 otherwise.
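With the similarity matrices defined above, the single-column score $S_1(k)$ of Sect. II-B can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names, the regularization value `lam=(0.1, 0.1)` and the toy functional matrix `F` are hypothetical, and Eq. 5 is solved through an equivalent singular value decomposition (obtained by substituting $\vec{\beta}_t = (K_t + \lambda_t I)\vec{\alpha}_t$) rather than through the Cholesky route mentioned in the text.

```python
import numpy as np

def binary_similarity(column):
    """A_k: 1 where two proteins share the residue at this column, else 0."""
    col = np.array(list(column))
    return (col[:, None] == col[None, :]).astype(float)

def make_psd(S, eps=1e-8):
    """Shift the diagonal just past the most negative eigenvalue, if any."""
    lowest = np.linalg.eigvalsh(S).min()
    if lowest < 0:
        S = S + (abs(lowest) + eps) * np.eye(len(S))
    return S

def center(K):
    """Center the datapoints in the image space."""
    n = len(K)
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def kcca_score(K1, K2, lam=(0.1, 0.1), power=0.9):
    """Weighted sum of per-component Pearson correlations (Eqs. 5 to 8)."""
    n = len(K1)
    K1, K2 = center(make_psd(K1)), center(make_psd(K2))
    R1 = K1 + lam[0] * np.eye(n)
    R2 = K2 + lam[1] * np.eye(n)
    # With beta_t = R_t alpha_t, Eq. 5 reduces to an SVD of
    # M = R1^{-1} K1 K2 R2^{-1}; singular values play the role of rho.
    M = np.linalg.solve(R1, K1) @ np.linalg.solve(R2, K2).T
    U, rho, Vt = np.linalg.svd(M)
    total = np.sum(rho ** 2)
    if total < 1e-12:  # e.g. a fully conserved column carries no signal
        return 0.0
    # Smallest C whose components hold `power` of the total power (Eq. 7).
    C = int(np.searchsorted(np.cumsum(rho ** 2), power * total)) + 1
    u1 = K1 @ np.linalg.solve(R1, U[:, :C])   # Eq. 6, first view
    u2 = K2 @ np.linalg.solve(R2, Vt[:C].T)   # Eq. 6, second view
    w = rho[:C] / rho[:C].sum()               # weights proportional to rho
    r = [np.corrcoef(u1[:, i], u2[:, i])[0, 1] for i in range(C)]
    return float(np.dot(w, r))

# Hypothetical toy family: 6 proteins in two ligand classes.
groups = np.array([0, 0, 0, 1, 1, 1])
F = np.where(groups[:, None] == groups[None, :], 1.0, 0.2)
score_aligned = kcca_score(binary_similarity("AAAGGG"), F)
score_mixed = kcca_score(binary_similarity("AGAGAG"), F)
```

In this toy setting the column whose residues follow the ligand classes scores higher than the column whose residues alternate independently of them. Here the matrices $A_k$ and $F$ are used as kernels themselves (the second option of Sect. II-B); `make_psd` implements the diagonal shift described there and leaves already positive semidefinite matrices untouched.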

Figure 3. Average distance profile of the RAS oncogene family (horizontal axis: position in the MSA; vertical axis: average distance from residue to ligand, in Å). The MSA has 166 positions and 24 different proteins, whose individual distance profiles were averaged to produce this landscape. Each individual profile was generated using the protein's PDB file, by calculating the distance from each of its residues to the bound ligand. If a protein has a "delete" (gap) at some position, this position is simply disregarded for the average.

III. RESULTS AND DISCUSSION

Given a protein in the MSA, a perfect functional site predictor would assign high scores to positions near the ligand, because these are the most likely to correspond to ligand binding residues. For each protein, based on its structural description (from the PDB file), we can calculate the distance from every residue to the location of the ligand, and build what we call the protein's distance profile. The distance between a residue and the ligand can be defined in many different ways; we define it as the closest distance between any atom in the residue and any atom in the ligand. All the proteins in the MSA have a similar distance profile, even though each protein binds to a different ligand. The average of all the individual distance profiles is what we use as the ground truth for the MSA. Fig. 3 shows the average distance profile for our family of interest, the RAS oncogene. The perfect predictor would produce a profile of scores that inverts the average distance profile at each position: high scores for low distances (positions near the ligand), and low scores for high distances (positions far from the ligand).

The score vectors in Algorithms 1 and 2, $S_1$, $S_2$ and $S_3$, can be used independently as score profiles, and their performance as functional site predictors can be evaluated. However, in the spirit of the Taylor expansion, it is also desirable to produce a weighted mix, or linear combination, of the three score profiles. The weights are free parameters that can be optimized using any available extra information.
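The distance profile and its family average, as defined at the start of this section, can be sketched as follows. The function names, the use of NaN to mark gap positions, and the toy coordinates are assumptions of this example; parsing of the PDB files is omitted.

```python
import numpy as np

def residue_ligand_distance(residue_atoms, ligand_atoms):
    """Closest distance between any residue atom and any ligand atom.

    Both arguments are arrays of shape (n_atoms, 3) holding coordinates.
    """
    diff = residue_atoms[:, None, :] - ligand_atoms[None, :, :]
    return float(np.sqrt((diff ** 2).sum(axis=-1)).min())

def average_distance_profile(profiles):
    """Average per-protein profiles over the MSA columns.

    Gap positions are encoded as NaN and are disregarded in the average.
    """
    return np.nanmean(np.asarray(profiles, dtype=float), axis=0)

# Hypothetical toy coordinates (in angstroms).
residue = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
ligand = np.array([[4.0, 0.0, 0.0], [0.0, 5.0, 0.0]])
d = residue_ligand_distance(residue, ligand)

# Two proteins, three MSA columns; the first protein has a gap in column 2.
profiles = [[3.0, np.nan, 10.0],
            [5.0, 8.0, 12.0]]
avg_profile = average_distance_profile(profiles)
```

Note that `np.nanmean` emits a warning and returns NaN for a column in which every protein has a gap; such columns would need separate handling.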
Assuming an estimate of the average distance profile for the family exists (this would be the case when the ligand-bound structures of the proteins in the MSA are known, and the functional site prediction is being attempted on a new sequence that is aligned to the family), the inverted average distance profile can be used as the ground truth, and a system of linear equations is set up in which the weighted linear combination of the score profiles is made equal to the inverted average distance profile. We propose to solve this system using least squares. The least squares setup is the following: if $A = [S_1^T, S_2^T, S_3^T]$, where the $S_i^T$ are column vectors, $b$ is the inverted average distance profile (also a column vector), and $w = [w_1, w_2, w_3]^T$ are the iteration weights, we try to satisfy $A \cdot w = b$, whose least squares solution is given by:

$$\hat{w} = [\hat{w}_1, \hat{w}_2, \hat{w}_3]^T = \left( A^T A \right)^{-1} A^T b, \tag{14}$$

and using these estimated weights the Taylor expansion score profile can be calculated by Eq. 13.

We assess the performance of the algorithms by the area under the curve (AUC) of the Receiver Operating Characteristic (ROC). The true positives of the MSA are defined as those columns with an average distance profile value smaller than 5 Å. A perfect predictor would assign scores to each position such that, in the ranked list of scores, the true positives are found at the top (higher scores). The ROC curve plots the number of true positives as a function of the number of false positives when moving the threshold down the ranked predicted scores. An AUC of 100% indicates a perfect classification, whereas an AUC of 50% corresponds to a random predictor.

As a baseline, we tested the score profiles given by Xdet and CCA, obtaining AUC = 20.3% with Xdet, and AUC = 55.4% with CCA. CCA outperforms Xdet, as noticed in [6].

Table I. AUC results when using Algorithms 1 and 2.

               RBF kernel          AA and ligand sim. kernels
               Alg. 1    Alg. 2    Alg. 1    Alg. 2
Iteration 1    73.4%     73.4%     14.4%     14.4%
Iteration 2    75.7%     74.3%     67.1%     48.6%
Iteration 3    74.8%     82.0%     82.4%     63.1%
Taylor         72.1%     66.2%     87.8%     79.3%

Table I shows the AUC results for the two proposed algorithms. Results are shown for the three iterations, and also for their linear combination using least squares weights (Taylor, see Eq. 13). The two left columns (RBF kernel) were obtained using a radial basis function kernel with $\sigma = 2$ for $K(A_k)$ and $K(F)$ (see Alg. 1 and 2), that is, the kernels are obtained by treating the rows of $A_k$ and $F$ as vectors, or datapoints. The two right columns (AA and ligand sim. kernels) were obtained by treating $A_k$ and $F$ as kernels themselves. With both algorithms, when using the RBF kernel, the performance improves from iteration 1 through 3, as expected, but the mixed scores cannot enhance the individual iterations' performances.
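The weight estimation of Eq. 14 and the AUC evaluation can be sketched as below. The inversion rule `b = max(d) - d`, the function names and all numeric values are assumptions made for illustration (the text does not specify how the distance profile is inverted), and the AUC is computed with the rank-based pairwise formulation, which equals the area under the ROC curve.

```python
import numpy as np

def taylor_weights(S1, S2, S3, b):
    """Least squares solution of A w = b with A = [S1, S2, S3] (Eq. 14)."""
    A = np.column_stack([S1, S2, S3])
    w_hat, *_ = np.linalg.lstsq(A, b, rcond=None)
    return A, w_hat

def auc(scores, is_positive):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    scores = np.asarray(scores, dtype=float)
    mask = np.asarray(is_positive, dtype=bool)
    diff = scores[mask][:, None] - scores[~mask][None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())

# Hypothetical average distance profile (angstroms) for a 6-column MSA.
dist = np.array([3.0, 4.0, 12.0, 15.0, 20.0, 6.0])
b = dist.max() - dist          # assumed inversion: near the ligand -> high

# Hypothetical score profiles from the three iterations.
S1 = np.array([0.9, 0.8, 0.3, 0.2, 0.1, 0.6])
S2 = np.array([0.7, 0.9, 0.4, 0.1, 0.2, 0.5])
S3 = np.array([0.8, 0.6, 0.2, 0.3, 0.0, 0.9])

A, w_hat = taylor_weights(S1, S2, S3, b)
combined = A @ w_hat           # Taylor score profile (Eq. 13)
positives = dist < 5.0         # true positives: columns closer than 5 angstroms
combined_auc = auc(combined, positives)
```

`np.linalg.lstsq` is used instead of forming $(A^T A)^{-1}$ explicitly because it is numerically more stable; for a full-rank $A$ both give the same weights.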
The results using the amino acid and ligand similarity kernels are much better: not only does the performance improve incrementally with the iterations, but the mixed score (Taylor) is also able to improve on each individual score. The best performance reported in this paper is obtained using the scores of the Taylor combination in Alg. 1, with an AUC of 87.8%. In Fig. 2 we show in dark the ten highest-scoring positions according to this method in the three-dimensional structure of one of the proteins of the RAS family, the RhoA protein.

IV. CONCLUSION

We have shown with a real biological example that the joint use of evolution and structure in a reconciliatory manner significantly improves predictions of ligand binding sites, which can help
facilitate mutagenesis experiments in pharmaceutical applications. The results support our rationale that, by use of kernel-based CCA and multi-positional correlation, the method is capable of detecting some subtle and non-linear signals conducive to binding site prediction that are missed by previous methods using Pearson correlation (Xdet) or even non-kernel CCA. Interesting future research directions include studying how this approach behaves on other datasets; incorporating more sophisticated measures of structural similarity between amino acids, ligands and proteins; and designing efficient algorithms to process the uncovered signals and features.

REFERENCES

[1] N. Hulo, A. Bairoch, V. Bulliard, L. Cerutti, E. D. Castro, P. S. Langendijk-Genevaux, M. Pagni, and C. J. Sigrist, "The prosite database," Nucleic Acids Res., vol. 34, pp. D227–D230, 2006.
[2] J. A. Capra and M. Singh, "Predicting functionally important residues from sequence conservation," Bioinformatics, vol. 23, no. 15, pp. 1875–1882, 2007.
[3] S. Jones and J. M. Thornton, "Searching for functional sites in protein structures," Curr Opin Chem Biol., vol. 8, no. 1, pp. 3–7, 2004.
[4] N. V. Petrova and C. H. Wu, "Prediction of catalytic residues using support vector machine with selected protein sequence and structural properties," BMC Bioinformatics, vol. 7, p. 312, 2006.
[5] F. Pazos, A. Rausell, and A. Valencia, "Phylogeny-independent detection of functional residues," Bioinformatics, vol. 22, no. 12, pp. 1440–1448, 2006.
[6] A. J. González, L. Liao, and C. H. Wu, "Predicting functional sites in biological sequences using canonical correlation analysis," in Proceedings of the 2010 International Conference on Bioinformatics and Computational Biology (BIOCOMP'10), Las Vegas, Nevada, USA, July 2010.
[7] S. Akaho, "A kernel method for canonical correlation analysis," in Proceedings of the International Meeting of the Psychometric Society (IMPS2001), Osaka, Japan, July 2001.
[8] F. R. Bach and M. I. Jordan, "Kernel independent component analysis," Journal of Machine Learning Research, vol. 3, pp. 1–48, 2002.
[9] Y. Yamanishi, J. P. Vert, A. Nakaya, and M. Kanehisa, "Extraction of correlated gene clusters from multiple genomic data by generalized kernel canonical correlation analysis," Bioinformatics, vol. 19, pp. i323–i330, 2003.
[10] H. Saigo, J. P. Vert, N. Ueda, and T. Akutsu, "Protein homology detection using string alignment kernels," Bioinformatics, vol. 20, no. 11, pp. 1682–1689, 2004.
[11] L. Holm and C. Sander, "Protein structure comparison by alignment of distance matrices," J. Mol. Biol., vol. 233, pp. 123–138, 1993.
[12] J. D. Holliday, C. Y. Hu, and P. Willett, "Grouping of coefficients for the calculation of inter-molecular similarity and dissimilarity using 2d fragment bit-strings," Comb. Chem. High Throughput Screen, vol. 5, no. 2, pp. 155–166, 2002.