Mining Conditional Phosphorylation Motifs - IEEE Computer Society

IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 11, NO. 5, SEPTEMBER/OCTOBER 2014, p. 915

Mining Conditional Phosphorylation Motifs

Xiaoqing Liu, Jun Wu, Haipeng Gong, Shengchun Deng, and Zengyou He

Abstract—Phosphorylation motifs represent position-specific amino acid patterns around the phosphorylation sites in a set of phosphopeptides. Several algorithms have been proposed to uncover phosphorylation motifs, but the problem of efficiently discovering a set of significant motifs with sufficiently high coverage and non-redundancy remains unsolved. Here we present a novel notion called conditional phosphorylation motifs. Through this new concept, motifs whose over-expressiveness mainly benefits from their constituent parts can be filtered out effectively. To discover conditional phosphorylation motifs, we propose an algorithm called C-Motif for a non-redundant identification of significant phosphorylation motifs. C-Motif is implemented under the Apriori framework, and it tests the statistical significance together with the frequency of candidate motifs in a single stage. Experiments demonstrate that C-Motif outperforms current algorithms such as MMFPh and Motif-All in terms of the coverage and non-redundancy of the results and the efficiency of execution. The source code of C-Motif is available at: https://sourceforge.net/projects/cmotif/.

Index Terms—Phosphorylation motif, protein phosphorylation, frequent pattern, data mining

X. Liu, J. Wu, H. Gong, and Z. He are with the School of Software, Dalian University of Technology, Dalian, China. E-mail: {eileenwelldone, wujun.myway, haipengxf}@gmail.com, [email protected].
S. Deng is with the School of Computer Science and Engineering, Harbin Institute of Technology, Harbin, China. E-mail: [email protected].

Manuscript received 9 Nov. 2013; revised 4 Apr. 2014; accepted 21 Apr. 2014. Date of publication 30 Apr. 2014; date of current version 2 Oct. 2014. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference the Digital Object Identifier below. Digital Object Identifier no. 10.1109/TCBB.2014.2321400

1545-5963 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

1 INTRODUCTION

Protein phosphorylation is one of the most frequent post-translational modification events and is central to the regulation and maintenance of most biological processes. It plays vital roles in numerous key cellular processes, including metabolism, signal transduction, transcription, translation, and membrane transport, as well as in the regulation of cellular activities such as proliferation, migration, differentiation, and death [1], [2], [3], [4]. The advent of high-throughput methods has greatly advanced the study of phosphorylation, in particular tandem mass spectrometry, which enables rapid and direct discovery of large numbers of phosphorylation sites in a single experiment [5], [6], [7], [8].

Phosphorylation motifs represent common amino acids aligned upstream and downstream of the phosphorylation sites. Phosphorylation motif discovery aims at finding a set of motifs that occur with disproportionate frequency in two coordinate sequence data sets: the phosphorylated peptide set P and the unphosphorylated peptide set N. In this context, P is regarded as the foreground and N as the background. The identified motifs are all "over-expressed" in the foreground, that is, they appear more frequently in the foreground than in the background [9], [10]. Such sets of reported phosphorylation motifs can also provide information about the specificities of the kinases involved, reveal the underlying regulation mechanisms, and facilitate the prediction of unknown phosphorylation events. Hence, a fast phosphorylation motif search is essential for gaining a comprehensive understanding of the mechanism of protein phosphorylation.

The problem of discovering phosphorylation motifs has been widely studied and several algorithms have already been proposed. Motif-X [9] employs a greedy algorithm to identify over-expressed motifs in an iterative manner. This method has been shown to uncover both known and novel informative motifs. MoDL [10] attempts to optimize the expressiveness of a set of motifs instead of quantifying the significance of a single motif. F-Motif [11] borrows an idea from clustering and combines it with an iterative greedy search derived from Motif-X. However, greedy approaches such as Motif-X and F-Motif may miss some important motifs because of their greedy choices and foreground reduction, and MoDL leaves out a number of significant motifs as well. MMFPh [12] extends Motif-X by employing a more complete search and can report many more statistically significant and sufficiently frequent motifs than Motif-X. Motif-All [13] is a two-step procedure: it first extracts a set of frequent motifs from the phosphorylated peptide data set, and then evaluates the statistical significance of these frequent motifs using both the phosphorylated and the unphosphorylated peptide data sets. Overall, both MMFPh and Motif-All claim to guarantee the completeness of significant phosphorylation motifs under their corresponding definitions [14], [15]. However, the lack of consensus on the definition of significant motifs leads to a gap between the results of these two methods. In general, Motif-All can ensure the maximum level of coverage of the potentially interesting results, whereas both MMFPh and Motif-All report some redundant motifs.

Despite the significant progress achieved in the past years, some important issues remain unsolved. One challenging problem is how to filter out those biased motifs whose over-expressiveness mainly comes from their subsets, so as to improve the accuracy of the identified motif set. For example, experimental results demonstrate that Motif-All performs relatively better than other algorithms by reporting the largest number of significant motifs. However, no further statistical correction is carried out in Motif-All, so its result set contains an abundance of motifs whose over-expressiveness mainly originates from some portion of them. To remove these redundant motifs, one method is to apply the permutation test in statistics [16] to post-process the results. However, as pointed out in [17], the standard permutation test procedure may not fully address this issue. As a result, the authors of [17] use the "sequential permutation test" method [18] to reduce the effect of sub-motifs when calculating the statistical significance of phosphorylation motifs. However, the permutation test is very time-consuming in practice. An alternative strategy is to choose a rigorous measure that can remove the influence of the subsets of motifs. Existing methods such as Motif-X [9] and MMFPh [12] are examples of this idea. They partially succeed in reducing the redundancy by imposing more stringent restrictions in their pruning strategies. Nevertheless, this does not fully solve the problem, and some redundant motifs still remain in the results.

Another challenging problem is how to achieve maximum coverage, such that all frequent and statistically significant motifs are included in the result. Existing methods such as Motif-X, F-Motif, MoDL and MMFPh miss some motifs that are both sufficiently frequent and significant because of their incomplete search or over-strict requirements on target motifs.

Overall, the redundancy and coverage issues in phosphorylation motif discovery are only partially solved. This paper takes a further step in this direction by proposing a new problem formulation for phosphorylation motif discovery: conditional phosphorylation motif discovery. Owing to this formulation, the motifs reported are "actual" motifs with more accurate statistical significance scores.
This new problem formulation is a non-trivial extension and generalization of the motif assessment strategy used in Motif-X and MMFPh. Unlike previous methods, the key feature of our method is that the evaluation of the significance of a motif depends only on its own inherent properties, not on whether it has significant constituent parts. Hence, we achieve a favorable trade-off among high coverage, non-redundancy of the significant motif set, and running efficiency.

In the following, we first illustrate the basic idea of the conditional phosphorylation motif and then show how this concept can remove motifs whose over-expressiveness mainly comes from their subsets. If a position in a motif is specified by a fixed amino acid, it is called a conserved position. Otherwise, it is a wild position that can match any amino acid. A motif A is said to be a $k$-motif if it has $k$ conserved positions. If another motif B contains only a subset of these $k$ amino acids at the corresponding positions in A, then B is a sub-motif of A. Every peptide that contains a motif must also contain each of its sub-motifs, but the reverse is not true. So the set of peptides that contain a motif is a subset of the set of peptides that contain any one of its sub-motifs. Notably, a $k$-motif has exactly $k$ sub-motifs of size $k-1$.

For each sub-motif of size $k-1$, we can generate a set of peptides in which every peptide contains this sub-motif. On this new data set, we can re-calculate the statistical significance of the $k$-motif, which is called the conditional (or local) significance. Moreover, we define the statistical significance of a motif on the original data sets, according to the traditional definition, as the global significance. In this setting, a $k$-motif is claimed to be a conditional phosphorylation motif if it is not only locally significant on all $k$ sub-motif-induced data sets but also globally significant on the whole data sets. Hence, two parameters measure the significance of motifs: the global significance threshold and the local significance threshold. As a result, the effect of the sub-motifs on the evaluation of over-expressiveness is reduced.

To address the problem of conditional phosphorylation motif mining, we present a new algorithm called C-Motif. C-Motif applies a support constraint together with statistical measures in the same mining process. Here the support of a motif is defined as the percentage of sequences that contain it, and a motif is said to be frequent if its support is no less than a given threshold. Experiments on real data sets show that our algorithm is efficient and effective in conditional phosphorylation motif discovery.

The remainder of this paper is organized as follows: Section 2 presents the details of the C-Motif algorithm. Section 3 shows the experimental results on real data. Section 4 concludes the paper.
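The notions of conserved position, wild position and sub-motif introduced above can be made concrete with a short sketch. It assumes, purely for illustration, that a motif is a string of the same length as the aligned peptides, with 'x' marking wild positions and the phosphorylated residue at a known central index. The helper names `conserved_positions`, `matches` and `sub_motifs` are hypothetical and are not taken from the C-Motif source code.

```python
WILD = "x"  # assumed wildcard symbol, following the paper's 'x' notation


def conserved_positions(motif: str, center: int) -> list[int]:
    """Indices of fixed amino acids, excluding the phosphorylated center."""
    return [i for i, aa in enumerate(motif) if aa != WILD and i != center]


def matches(motif: str, peptide: str) -> bool:
    """True if every non-wild position of the motif agrees with the peptide."""
    return len(motif) == len(peptide) and all(
        m == WILD or m == p for m, p in zip(motif, peptide)
    )


def sub_motifs(motif: str, center: int) -> list[str]:
    """The k sub-motifs of size k-1: relax one conserved position at a time."""
    return [
        motif[:i] + WILD + motif[i + 1:]
        for i in conserved_positions(motif, center)
    ]


# A 2-motif on 5-residue windows whose center (index 2) is the phosphorylated S.
motif, center = "xPSxD", 2
subs = sub_motifs(motif, center)  # ['xxSxD', 'xPSxx']
```

Any peptide accepted by `matches(motif, ...)` is also accepted by each string in `subs`, which is the subset relation between peptide sets stated above.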

2 METHODS

2.1 Basic Terminology

We describe a motif as a string with a single phosphorylated residue (S, T or Y), which is denoted with an underlined character. We write the conserved positions of a motif as the corresponding amino acids and represent its wild positions by 'x'. For example, a 2-motif with a fixed 'P' one position upstream of the central phosphorylation site, a wild position immediately downstream of the site, and a fixed 'D' two positions downstream is represented as 'PSxD'. For a $k$-motif $m$, we use $\mathrm{Sup}(m, P)$ to denote its support (frequency) in the foreground data set $P$. Since the goal of phosphorylation motif discovery is to find those motifs that occur more frequently in the foreground data set than in the background data set, we usually use only the foreground to assess the frequency of a motif.

Definition 1. $m$ is a frequent motif if $\mathrm{Sup}(m, P)$ is no less than the user-specified threshold $\theta_{sup}$. The set of candidate frequent motifs of size $k$ is denoted as $S_k$.

To discover all frequent motifs, the level-wise search strategy derived from the Apriori algorithm [19] is widely used in practice. It first accumulates the count for each 1-motif and collects those motifs whose support is no less than the given threshold $\theta_{sup}$ to form the set of frequent 1-motifs $F_1$. Subsequently, since a $k$-motif cannot be frequent if one of its sub-motifs of size $k-1$ is infrequent, $F_{k-1}$ is used to generate $S_k$, and the infrequent candidates are pruned to obtain $F_k$.

Since $m$ consists of $k$ fixed amino acids, it has $k$ sub-motifs of size $k-1$. These sub-motifs are denoted as $m_1, m_2, \ldots, m_k$. The only difference between $m$ and $m_i$ ($1 \le i \le k$) is that the $i$-th fixed position in $m$ is wild in $m_i$. We denote the sets of peptides in the foreground data $P$ in which these sub-motifs occur as $P(m_1), P(m_2), \ldots, P(m_k)$. Similarly, we use $N(m_i)$ to denote the set of peptides in the background data $N$ that contain $m_i$.

We use $\mathrm{Sig}(m, P, N)$ to denote the statistical significance function of a motif $m$; it measures the over-expressiveness of $m$ in $P$ against $N$. Both nonparametric measures such as odds ratio and relative risk and the binomial probability model have been used in the assessment of phosphorylation motifs [13], [9], [12]. Note that the use of different evaluation methods does not change the nature of the problem, and these significance assessment functions are generally consistent with each other in practice. For ease of illustration and a fair comparison of different methods, we use relative risk and odds ratio as representative significance measures for phosphorylation motifs.

Both relative risk and odds ratio describe the change in the likelihood that a motif occurs between the foreground and the background. Relative risk is defined as the ratio of the supports of $m$ in the two data sets. When relative risk is used to measure the statistical significance of $m$, it reads:

$$\mathrm{Sig}(m, P, N) = \frac{\mathrm{Sup}(m, P)}{\mathrm{Sup}(m, N)}. \tag{1}$$

A relative risk of 1 indicates that the motif under study is equally likely to occur in both data sets. A relative risk greater than 1 means that the motif is more likely to occur in the foreground. The odds of an event is the ratio of the probability that the event happens to the probability that it does not happen. The odds ratio is defined as the ratio of the odds of an event occurring in one group to the odds of it occurring in another group. In the context of phosphorylation motif finding, if we adopt odds ratio to measure the over-expressiveness, then the significance of $m$ is:

$$\mathrm{Sig}(m, P, N) = \frac{\mathrm{Sup}(m, P) / (1 - \mathrm{Sup}(m, P))}{\mathrm{Sup}(m, N) / (1 - \mathrm{Sup}(m, N))}. \tag{2}$$
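As a small illustration of Equations (1) and (2), the following sketch computes both measures from supports. The function names and the counts are hypothetical, and the sketch assumes that the background support is non-zero and that neither support equals 1; a real implementation would need to handle those cases.

```python
def support(count: int, total: int) -> float:
    """Fraction of peptides in a data set that contain the motif."""
    return count / total


def relative_risk(sup_p: float, sup_n: float) -> float:
    """Equation (1): foreground support divided by background support."""
    return sup_p / sup_n


def odds_ratio(sup_p: float, sup_n: float) -> float:
    """Equation (2): odds in the foreground divided by odds in the background."""
    return (sup_p / (1.0 - sup_p)) / (sup_n / (1.0 - sup_n))


# Hypothetical counts: the motif occurs in 30 of 100 foreground peptides
# and in 20 of 100 background peptides.
sup_p = support(30, 100)
sup_n = support(20, 100)
rr = relative_risk(sup_p, sup_n)  # 0.3 / 0.2
orr = odds_ratio(sup_p, sup_n)    # (0.3 / 0.7) / (0.2 / 0.8)
```

With these counts both measures exceed 1, so the motif would be a candidate for over-expression under either measure.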

Odds ratio has the same characteristic as relative risk: only those motifs whose odds ratio is greater than 1 have the potential to be statistically significant.

In particular, if we use $P(m_i)$ as the new foreground data set instead of $P$ and $N(m_i)$ as the new background data set instead of $N$ when estimating the over-expressiveness, there are exactly $k$ different significance values for $m$, namely $\mathrm{Sig}(m, P(m_i), N(m_i))$ for $1 \le i \le k$. Without loss of generality, we assume that $\mathrm{Sig}(m, P, N)$ is positively correlated with over-expressiveness: the larger $\mathrm{Sig}(m, P, N)$ is, the more significant the motif $m$ is. Under this setting, we adopt the minimum value of $\mathrm{Sig}(m, P(m_i), N(m_i))$ ($1 \le i \le k$) as the local (or conditional) statistical significance of each motif. Then, the problem of conditional phosphorylation motif discovery is to discover all frequent motifs from $P$ whose sub-motif-derived statistical significance values pass the given threshold, in addition to fulfilling the traditional definition.

Definition 2. Local statistical significance:

$$\mathrm{Sig}_l(m, P, N) = \min_{1 \le i \le k} \mathrm{Sig}(m, P(m_i), N(m_i)).$$

Definition 3. Global statistical significance:

$$\mathrm{Sig}_g(m, P, N) = \mathrm{Sig}(m, P, N).$$

Overall, to identify all statistically significant, sufficiently frequent conditional phosphorylation motifs, we assess two aspects of each motif: its frequency and its statistical significance, the latter including both the local and the global significance:

• Frequency. We impose the support constraint to reduce the search space and prevent the generation of random artifacts.
• Statistical significance. The statistical evaluation of over-expressiveness for a motif can be done in various ways. Significance measures such as relative risk and odds ratio can be used interchangeably. The choice of significance measure will not change the performance of the underlying algorithms.
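Definitions 2 and 3 can be sketched in a few lines. The sketch below assumes relative risk as the measure and the string representation with 'x' wildcards. All function names are hypothetical, and returning infinity when a motif is absent from the background is an assumption of this sketch, not something the paper specifies.

```python
WILD = "x"  # assumed wildcard symbol


def matches(motif, peptide):
    return all(m == WILD or m == p for m, p in zip(motif, peptide))


def select(motif, peptides):
    """Peptides that contain the motif, e.g. P(m_i) or N(m_i)."""
    return [p for p in peptides if matches(motif, p)]


def relative_risk(motif, fg, bg):
    hits_fg, hits_bg = len(select(motif, fg)), len(select(motif, bg))
    if hits_fg == 0:
        return 0.0
    if hits_bg == 0:
        return float("inf")  # assumption: motif absent from the background
    return (hits_fg / len(fg)) / (hits_bg / len(bg))


def sub_motifs(motif, center):
    return [motif[:i] + WILD + motif[i + 1:]
            for i, aa in enumerate(motif) if aa != WILD and i != center]


def global_sig(motif, P, N):
    """Definition 3: significance on the original data sets."""
    return relative_risk(motif, P, N)


def local_sig(motif, P, N, center):
    """Definition 2: minimum significance over the sub-motif-induced data sets."""
    return min(relative_risk(motif, select(sub, P), select(sub, N))
               for sub in sub_motifs(motif, center))


# Toy data (hypothetical): 'KMS' is globally over-expressed, but within the
# peptides that already contain 'xMS' or 'KxS' it is not enriched at all.
P = ["KMS"] * 3 + ["AAS"] * 3
N = ["KMS"] * 1 + ["AAS"] * 5
g = global_sig("KMS", P, N)           # about 3: support 0.5 versus 1/6
l = local_sig("KMS", P, N, center=2)  # 1.0 on both induced data sets
```

In this toy case the motif passes a global threshold such as 1.2 but fails the same local threshold, which is the kind of motif the local significance is meant to filter out.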

2.2 Problem Formulation

As shown above, we strengthen the definition of phosphorylation motif finding and aim for an extensive and non-redundant (NR) discovery. Conditional phosphorylation motifs are deemed to be true positives whose over-expressiveness does not stem from the interplay of their subsets. We impose a significance constraint, called the local or conditional significance, on each candidate motif; it evaluates the statistical significance of a phosphorylation motif on the sets of sequences induced by its sub-motifs. Furthermore, we also perform the global significance evaluation using the original data sets in the traditional way. Thus, two parameters measure the significance of motifs: the global significance threshold $\theta_{g\_sig}$ and the local significance threshold $\theta_{l\_sig}$. That is, to ensure the significance of a motif $m$ over $P$ against $N$, two criteria must be satisfied simultaneously: $\mathrm{Sig}_g(m, P, N) \ge \theta_{g\_sig}$ and $\mathrm{Sig}_l(m, P, N) \ge \theta_{l\_sig}$. Hence, we can guarantee that conditional phosphorylation motifs are also significant under the traditional definition.

However, one critical issue remains: has the effect of the sub-motifs really been removed through the use of local statistical significance? To address this issue, we employ a measure called improvement, proposed in [20], for justification. More precisely, the improvement is defined as the difference between the statistical significance of a motif and that of its sub-motifs. In general, a positive improvement indicates that the over-expressiveness of a target motif comes from the combination of all its constituent amino acids rather than from just one of its subsets. We should prune those redundant motifs that have no positive improvement, that is, those whose over-expressiveness is equal to or less than that of their sub-motifs.¹ Since we use the local significance to get rid of sub-motif interplays, we calculate the difference between the global significance of a conditional phosphorylation $k$-motif and that of each of its sub-motifs of size $k-1$. For simplicity of illustration and mathematical derivation, we use relative risk as the significance measure in the remainder of this section.

1. For justification, we also define some motifs as redundant ones if their improvement is very small.

Lemma 1. Each conditional phosphorylation motif possesses positive improvement if relative risk is used to measure the statistical significance.

Proof. For a $k$-motif $m$ in data sets $P$ and $N$, suppose its sub-motif $m_j$ of size $k-1$ yields the minimal significance value, i.e., $\mathrm{Sig}_l(m, P, N) = \mathrm{Sig}(m, P(m_j), N(m_j))$. We set $\theta_{l\_sig} = t$, $t > 1$. Then Lemma 1 can be formulated as: if $\mathrm{Sig}_l(m, P, N) \ge t$, then $\mathrm{Sig}_g(m, P, N) > \mathrm{Sig}_g(m_i, P, N)$ for any $1 \le i \le k$:

$$\mathrm{Sig}(m, P(m_i), N(m_i)) = \frac{\mathrm{Sup}(m, P(m_i))}{\mathrm{Sup}(m, N(m_i))} = \frac{|P(m)|\,|N(m_i)|}{|N(m)|\,|P(m_i)|} \ge \mathrm{Sig}_l(m, P, N) = \mathrm{Sig}(m, P(m_j), N(m_j)) \ge t, \tag{3}$$

$$\mathrm{Sig}_g(m, P, N) = \frac{\mathrm{Sup}(m, P)}{\mathrm{Sup}(m, N)} = \frac{|N|\,|P(m)|}{|P|\,|N(m)|}, \tag{4}$$

$$\mathrm{Sig}_g(m_i, P, N) = \frac{\mathrm{Sup}(m_i, P)}{\mathrm{Sup}(m_i, N)} = \frac{|N|\,|P(m_i)|}{|P|\,|N(m_i)|}. \tag{5}$$

Since $t$ is greater than 1, Equation (3) gives $\frac{|P(m)|}{|N(m)|} \ge t \cdot \frac{|P(m_i)|}{|N(m_i)|} > \frac{|P(m_i)|}{|N(m_i)|}$. As a result, we get $\mathrm{Sig}_g(m, P, N) > \mathrm{Sig}_g(m_i, P, N)$: subtracting Equation (5) from Equation (4) returns a result that is greater than 0. □

To make the following description easier to follow, we provide a precise problem definition of conditional phosphorylation motif discovery with clearly stated input and output:

• Input. The set of phosphorylated peptides (the foreground data set $P$) and the set of unphosphorylated peptides (the background data set $N$), the support threshold $\theta_{sup}$, the local significance threshold $\theta_{l\_sig}$ and the global significance threshold $\theta_{g\_sig}$.
• Output. A set of conditional phosphorylation motifs $R$, where each motif $m \in R$ satisfies: (1) $\mathrm{Sup}(m, P) \ge \theta_{sup}$; (2) $\mathrm{Sig}_l(m, P, N) \ge \theta_{l\_sig}$; (3) $\mathrm{Sig}_g(m, P, N) \ge \theta_{g\_sig}$.

2.3 Categorization of Existing Methods under the New Formulation

Notice that Motif-X [9] and MMFPh [12] also measure the over-expressiveness of a motif in a way that is similar to our definition of local significance. More precisely, although only $\theta_{l\_sig}$ is used in both Motif-X and MMFPh, they implicitly require that one sub-motif of the target motif should be locally significant as well. In this section, we first show that all the motifs reported by Motif-X and MMFPh are also globally significant, as stated in the lemma below.

Lemma 2. For a $k$-motif $m$ in data sets $P$ and $N$, suppose we use relative risk as the measure and set both $\theta_{l\_sig}$ and $\theta_{g\_sig}$ to $t$, $t \ge 1$. Let $m^{(r)}$ represent one sub-motif of $m$ of size $r$. If $\mathrm{Sig}(m^{(r)}, P(m^{(r-1)}), N(m^{(r-1)})) \ge t$ for all $1 \le r < k$, then $\mathrm{Sig}_g(m, P, N) \ge t$.

Proof.

$$\mathrm{Sig}(m^{(r)}, P(m^{(r-1)}), N(m^{(r-1)})) = \frac{\mathrm{Sup}(m^{(r)}, P(m^{(r-1)}))}{\mathrm{Sup}(m^{(r)}, N(m^{(r-1)}))} = \frac{|P(m^{(r)})|\,|N(m^{(r-1)})|}{|N(m^{(r)})|\,|P(m^{(r-1)})|}, \tag{6}$$

$$\mathrm{Sig}(m^{(r-1)}, P(m^{(r-2)}), N(m^{(r-2)})) = \frac{\mathrm{Sup}(m^{(r-1)}, P(m^{(r-2)}))}{\mathrm{Sup}(m^{(r-1)}, N(m^{(r-2)}))} = \frac{|P(m^{(r-1)})|\,|N(m^{(r-2)})|}{|N(m^{(r-1)})|\,|P(m^{(r-2)})|}, \tag{7}$$

$$\mathrm{Sig}_g(m, P, N) = \frac{\mathrm{Sup}(m, P)}{\mathrm{Sup}(m, N)} = \frac{|P(m)|\,|N|}{|P|\,|N(m)|}. \tag{8}$$

It is easy to see that $\mathrm{Sig}(m^{(r)}, P(m^{(r-2)}), N(m^{(r-2)})) \ge t^2$ by multiplying Equations (6) and (7). That is, $m^{(r)}$ is statistically significant in the data sets derived from $m^{(r-2)}$. This rule also applies to motif $m^{(r-2)}$, so that $m^{(r-2)}$ is significant in the peptide sequences induced by one of its sub-motifs as well. Therefore, we can infer that $\mathrm{Sig}_g(m, P, N) \ge t^k$ by iterating the multiplication process. Since $t$ is no less than 1, $\mathrm{Sig}_g(m, P, N)$ must be equal to or greater than $t$, too. Thus, $m$ is globally significant as well. □

Lemma 2 shows that Motif-X and MMFPh can find motifs that are globally significant. However, two implicit issues exist with regard to coverage and redundancy. The first is that they may miss some potentially meaningful motifs under the conditional phosphorylation motif definition. This is because these methods investigate a candidate $k$-motif $m$ only on condition that it contains at least one sub-motif of size $k-1$ that is both frequent and significant. If none of the constituent sub-motifs $m_i$ is significant, $m$ is never generated, so Motif-X and MMFPh will not check it. Fig. 1 provides such an example. For illustration purposes, we adopt relative risk as the significance measure here. Additionally, we set the significance thresholds $\theta_{sig} = \theta_{l\_sig} = \theta_{g\_sig} = 1.2$ and the support threshold $\theta_{sup} = 0.2$. In this sample data set, we plant one frequent and significant phosphorylation motif 'KMS' with a relative risk larger than 1.2. There may be other significant motifs in this data set, but we focus our discussion on 'KMS'. 'KMS' consists of two sub-motifs, 'KxS' and 'MS', which are insignificant, with significance scores equal to 1.0. Thus, methods like Motif-X and MMFPh have to filter out 'KMS' since it lacks frequent and significant constituent motifs. However, motifs of this type may also have great potential to play valuable roles in biological research, so expanding the scope of the search to include them is worthwhile for phosphorylation motif discovery.

Fig. 1. A sample data set. Both the foreground data set and background data set consist of 10 sequences.

Fig. 2. The relationship of the sets of significant motifs discovered by different algorithms.

An additional issue is that Motif-X and MMFPh may report some motifs that are not reported by C-Motif. According to their definitions, a motif is deemed to be significant as long as it passes the significance test on the reconstructed data sets induced by at least one sub-motif. Such a loose restriction makes it possible to report some excess motifs whose over-expressiveness mainly comes from their subsets. Motifs of this kind are pruned according to our definition, whereas Motif-X and MMFPh report them. We use one potential motif, 'GLxxxxSW' in Fig. 1, as an example. With the significance thresholds $\theta_{sig} = \theta_{l\_sig} = \theta_{g\_sig} = 1.2$ and the support threshold $\theta_{sup} = 0.2$, its three sub-motifs 'LxxxxSW', 'GxxxxxSW' and 'GLxxxxS' are all frequent and (both locally and globally) significant when relative risk is used for significance assessment. So 'GLxxxxSW' can be generated in three ways in MMFPh and Motif-X. Moreover, 'GLxxxxSW' has three significance values on the sub-motif-induced data sets: 1.0 on the data set induced by 'GLxxxxS', 1.0 on that induced by 'GxxxxxSW', and 1.5 on that induced by 'LxxxxSW'. According to MMFPh and Motif-X, 'GLxxxxSW' is a significant motif with significance value 1.5, which is larger than $\theta_{sig}$. However, 'GLxxxxSW' is discarded by C-Motif since its local significance value, 1.0, is less than the local significance threshold.
Further investigation of the relationship between the target motif and its sub-motifs shows that 'GLxxxxSW' has the same global significance as 'GLxxxxS' and 'GxxxxxSW', which yields little improvement in over-expressiveness. As a result, 'GLxxxxSW' is a redundant motif whose over-expressiveness possibly stems mainly from 'GLxxxxS' and 'GxxxxxSW' rather than from the motif as a whole.

In summary, algorithms that adopt only the global significance, such as Motif-All, often return a set of phosphorylation motifs that contains many motifs whose over-expressiveness mainly benefits from their subsets. Furthermore, MMFPh is able to find many more significant motifs that are missed by Motif-X. This method decreases the number of redundant motifs, but it excludes some motifs with no "over-expressed" sub-motifs and still includes some redundant ones caused by subset interplay. In contrast, our formulation and proposed algorithm use the global significance together with the local significance and can discover as many potentially significant motifs as possible without redundancy. The relationship of the motif sets reported by the different algorithms is shown in Fig. 2.
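The difference between the two acceptance rules in the 'GLxxxxSW' example can be written down directly. The sketch below uses only the three conditional significance values and the threshold 1.2 quoted in the text; the variable names are hypothetical, and the "at least one sub-motif" rule is the paper's characterization of Motif-X and MMFPh, not code from those tools.

```python
# Conditional significance values of 'GLxxxxSW' quoted in the text,
# keyed by the sub-motif that induces the data set.
conditional_sig = {"GLxxxxS": 1.0, "GxxxxxSW": 1.0, "LxxxxSW": 1.5}
theta_sig = 1.2

# Rule attributed to Motif-X / MMFPh: one passing sub-motif is enough.
passes_any = any(v >= theta_sig for v in conditional_sig.values())

# C-Motif rule (Definition 2): the minimum over all sub-motifs must pass.
local_sig = min(conditional_sig.values())
passes_all = local_sig >= theta_sig
```

Here `passes_any` is true because of the value 1.5, while `passes_all` is false because the minimum is 1.0, which matches the outcome described above.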

2.4 The C-Motif Algorithm

C-Motif (Algorithm 1) is implemented in a single stage in which the frequency and the statistical significance are tested at the same time. In detail, when discovering the set $F_1$ from $S_1$ in the first iteration, we calculate the local significance as well as the global significance for each member of $F_1$. In fact, the global significance of a 1-motif is identical to its local significance. Since a motif is statistically significant only if its global significance and local significance are no less than the given thresholds $\theta_{g\_sig}$ and $\theta_{l\_sig}$, respectively, we filter out the insignificant motifs and add the rest to the result set. So after investigating every possible 1-motif, all the motifs in the result set are significant conditional phosphorylation motifs of size 1. In the $k$-th iteration, we perform the following operations:

1) Generate the set of potential frequent motifs of size $k$, i.e., $S_k$, by joining $F_{k-1}$ with itself (Steps 5-7). Let $l_1$ and $l_2$ be members of $F_{k-1}$. They are joinable if their first $k-2$ conserved amino acids are identical and lie at the same positions. To avoid duplicates, we rank the two motifs according to the positions of their last conserved amino acids: if $l_1$ and $l_2$ are a pair of joinable motifs, then $l_1 < l_2$ if the last conserved amino acid of $l_1$ lies to the left of that of $l_2$. The shared amino acids and the remaining conserved amino acids of $l_1$ and $l_2$ are combined, in rank order, to compose one $k$-motif $m$.
2) Evaluate the frequency of each candidate in $S_k$ by its support (Steps 8-10). Prune all the infrequent candidates and add the remaining ones to $F_k$.
3) Test the local significance and the global significance of each potential motif in $F_k$ (Steps 11-18). For a $k$-motif $m$, obtain all of its $(k-1)$-sub-motifs. For each sub-motif $m_i$ of $m$, construct its matching foreground $P(m_i)$ and background $N(m_i)$. Calculate the significance value on the new data sets and choose the minimum one as the final local significance value. In addition, obtain the global significance value using the original data sets $P$ and $N$, consistent with the traditional definition. Filter out the insignificant motifs and save those that are both globally and locally significant in $R$.
4) Repeat the above steps until no more frequent candidates can be generated in $S_k$. Return $R$ as the final result.
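The four steps above can be sketched as a compact level-wise search. This is a minimal illustration under stated assumptions, not the authors' Algorithm 1 or their released code: motifs are represented as sorted tuples of (position, amino acid) pairs, relative risk is the significance measure, a motif absent from the background is given infinite significance, and all names (`c_motif`, `local_sig`, and so on) are hypothetical.

```python
from itertools import combinations


def contains(peptide, motif):
    """motif is a tuple of (position, amino_acid) pairs sorted by position."""
    return all(peptide[pos] == aa for pos, aa in motif)


def count(motif, peptides):
    return sum(contains(p, motif) for p in peptides)


def relative_risk(motif, fg, bg):
    hits_fg, hits_bg = count(motif, fg), count(motif, bg)
    if hits_fg == 0:
        return 0.0
    if hits_bg == 0:
        return float("inf")  # assumption: motif absent from the background
    return (hits_fg / len(fg)) / (hits_bg / len(bg))


def local_sig(motif, P, N):
    """Minimum significance over the data sets induced by each (k-1)-sub-motif."""
    values = []
    for i in range(len(motif)):
        sub = motif[:i] + motif[i + 1:]
        P_sub = [p for p in P if contains(p, sub)]
        N_sub = [n for n in N if contains(n, sub)]
        values.append(relative_risk(motif, P_sub, N_sub))
    return min(values)


def c_motif(P, N, center, theta_sup, theta_l, theta_g):
    length = len(P[0])
    alphabet = sorted({aa for p in P for aa in p})
    # Candidate 1-motifs: one fixed amino acid at one non-central position.
    candidates = [((pos, aa),) for pos in range(length) if pos != center
                  for aa in alphabet]
    results = []
    while candidates:
        # Step 2: keep the frequent candidates (F_k).
        frequent = sorted(m for m in candidates
                          if count(m, P) / len(P) >= theta_sup)
        # Step 3: report motifs that pass both significance tests.
        for m in frequent:
            if (relative_risk(m, P, N) >= theta_g
                    and local_sig(m, P, N) >= theta_l):
                results.append(m)
        # Step 1 for the next level: join motifs sharing their first k-2 items.
        # Insignificant but frequent motifs still take part in the join.
        candidates = [a + (b[-1],) for a, b in combinations(frequent, 2)
                      if a[:-1] == b[:-1] and a[-1][0] < b[-1][0]]
    return results


# Toy data (hypothetical), center at index 2: 'KxS' and 'xMS' have relative
# risk 1.0, yet 'KMS' is over-expressed both globally and locally.
P = ["KMS"] * 4 + ["KAS"] + ["AMS"] + ["AAS"] * 4
N = ["KMS"] * 2 + ["KAS"] * 3 + ["AMS"] * 3 + ["AAS"] * 2
found = c_motif(P, N, center=2, theta_sup=0.2, theta_l=1.2, theta_g=1.2)
```

Because frequent motifs are joined regardless of their significance, the sketch can reach 'KMS' even though neither of its sub-motifs is significant, which is the coverage property discussed in Section 2.3. For a 1-motif the only sub-motif is the empty motif, so its local significance equals its global significance, as noted above.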

2.5 Proofs of Completeness and Correctness

Theorem 1. The C-Motif algorithm is complete.

Proof. With respect to the definition of conditional phosphorylation motif, the completeness of C-Motif follows from two facts. The first is that the Apriori algorithm is complete, that is, it explicitly enumerates and checks all frequent motifs in the mining process. The second is that our implementation prunes only those motifs that are insignificant under the global or the local significance criterion. □

Theorem 2. The C-Motif algorithm is correct.

Proof. The correctness of the C-Motif algorithm is guaranteed by two facts. First, only frequent motifs are generated. Second, both the local and the global significance values are calculated exactly, and every motif whose significance values are lower than the user-specified thresholds is pruned. □

3 EXPERIMENTAL RESULTS

To demonstrate the efficacy and utility of our algorithm, we conduct a series of tests on real data. In our experiments, we compare our algorithm with the Motif-All and MMFPh algorithms with respect to efficiency, coverage and non-redundancy. Although several motif discovery methods have been proposed, we choose Motif-All and MMFPh for comparison because they are representative of algorithms that use only a global significance threshold and only a local significance threshold, respectively. Furthermore, we use the same significance measure in all algorithms so as to make their outputs comparable. More precisely, we choose relative risk to measure the over-expressiveness so as to ensure that the motifs reported by MMFPh² are also globally significant, which facilitates a fair comparison.

In the experiments, we apply C-Motif, MMFPh and Motif-All to both non-kinase-specific and kinase-specific phosphorylation data sets with a fixed length of amino acids upstream and downstream of the phosphorylated residues. The details of these data sets are provided in the following sections. In each experiment, we first present a brief description of the data and tune the thresholds so as to clarify the comparison. Subsequently, we perform a general analysis of the motifs discovered and then illustrate the advantages of C-Motif over MMFPh and Motif-All.

3.1 Non-Kinase-Specific Phosphorylation Data

3.1.1 Data Description

We use Phospho.ELM (version 9.0) [21] and Swiss-Prot (release 2011_11) [22] as the data sources for this experiment. To generate the set of phosphorylated peptides P, we directly extract the annotated phosphorylated peptides from Phospho.ELM without considering kinase information. To generate the set of unphosphorylated peptides N, we first extract all the proteins annotated as ‘Homo sapiens’, and then follow Musite [23] to partition the proteins into groups using BLASTClust from the BLAST [24] package (version 2.2.19) with a sequence identity threshold of 50 percent. We select the protein with the largest number of known phosphorylation sites in each group to form a non-redundant (NR) data set; if a group contains no phosphorylated proteins, we select its longest protein as the member of the NR data set. There are at least five kinds of methods for constructing the set of unphosphorylated peptides [25]. Here we employ the method in [26] to construct the background data. More precisely, we generate the data by sampling 5,000 phosphorylated peptides and 5,000 unphosphorylated peptides from the NR data set. All the peptides have a fixed length of 13 and are aligned on the residue at the center position.

In the context of phosphorylation motif discovery, an algorithm can report different sets of motifs on different test data sets. Since our main objective in this paper is to compare the performance of different algorithms on the same data sets, we do not discuss the effects of the data construction methods on motif discovery and simply use existing methods for constructing both the foreground data and the background data.

Footnote 2. The original MMFPh reports the maximal phosphorylation motifs that are not subsumed by motifs with more fixed amino acids. In our implementation, we report all motifs that pass the significance threshold so as to generate more motifs for a fair comparison with respect to the size of the motif set.

Fig. 3. The number of reported motifs of different sizes on non-kinase-specific phosphorylation data sets. Here the support threshold $\theta_{sup}$ is 0.005 and the significance threshold $\theta_{sig}$ is 1.7.
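The fixed-length, center-aligned peptides described in this data description can be illustrated with a short sketch. It cuts a 13-residue window around every serine, threonine or tyrosine in a protein sequence. The function name, the padding character for sites near a terminus and the example sequence are hypothetical; the paper does not state how terminal sites are handled.

```python
def extract_windows(sequence, residues="STY", flank=6, pad="-"):
    """Return one (2 * flank + 1)-mer per candidate site in `sequence`.

    Each window is aligned so that the candidate residue sits at the
    center position (index `flank`). Sites near a terminus are padded
    with `pad`, which is an assumption made for this sketch.
    """
    padded = pad * flank + sequence + pad * flank
    windows = []
    for i, aa in enumerate(sequence):
        if aa in residues:
            # Residue i of the sequence is at index i + flank of `padded`,
            # so the window starts at padded index i.
            windows.append(padded[i:i + 2 * flank + 1])
    return windows


# Hypothetical protein fragment with a single serine.
windows = extract_windows("MKRRASPLLE")
print(windows)
```

Each returned window has length 13 with the candidate site at index 6, which matches the alignment used for both the foreground and the background peptides.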

3.1.2 Results

In this experiment, we choose low values for both the support threshold and the significance threshold so as to uncover a relatively large number of significant motifs and make the comparison more informative. For consistency, we set the support threshold $\theta_{sup} = 0.005$ and the significance thresholds (both $\theta_{sig}^{g}$ and $\theta_{sig}^{l}$; see footnote 3) to 1.7 for all the algorithms under discussion. Note that Motif-All uses only the global significance threshold $\theta_{sig}^{g}$ and MMFPh uses only the local significance threshold $\theta_{sig}^{l}$. The questions we want to answer in this experiment are: How many phosphorylation motifs of each size can be discovered by each algorithm? Which methods detect more meaningful phosphorylation motifs? Which methods filter out more motifs that are biased by subset interplay?

Footnote 3. For simplicity of notation, we use $\theta_{sig}$ to denote both the global significance threshold and the local significance threshold when they are the same.


Fig. 4. The relationship of reported motifs of size 2 on non-kinase-specific phosphorylation data sets. Here the rounded rectangle denotes the result set of Motif-All, the circle denotes that of C-Motif and the rectangle corresponds to that of MMFPh. The support threshold $\theta_{sup}$ is 0.005 and the significance threshold $\theta_{sig}$ is 1.7.

Fig. 3 summarizes the number of phosphorylation motifs discovered by C-Motif, MMFPh and Motif-All. Several observations can be made from Fig. 3. First, C-Motif finds more motifs than MMFPh in general. Moreover, Motif-All reports many more motifs, and all the motifs reported by C-Motif and MMFPh are included in the result set of Motif-All. Second, all the algorithms report the same set of interesting 1-motifs, whereas there are visible differences in the discovery of 2-motifs and 3-motifs. In Fig. 4, we describe the relationship among the 2-motif sets reported by the different algorithms. Motif-All reports a superset of the results of the other two methods, which can be divided into four parts. These four parts can be analyzed as follows:

1) The motifs in the intersection of the results (the first part with label ‘1’) reported by MMFPh, Motif-All and C-Motif are all conditional phosphorylation motifs.

2) The second part with label ‘2’ consists of motifs discovered only by Motif-All and C-Motif. The common characteristic of these motifs is that all of their sub-motifs are insignificant on the subset induced data sets. Accordingly, MMFPh fails to generate and check these motifs even though they are both globally significant and locally significant.

3) The third part contains only one 2-motif, ‘RxxSP’, detected by MMFPh and Motif-All. ‘RxxSP’ is composed of two significant sub-motifs (‘RxxS’ and ‘SP’), so MMFPh can generate this motif by extending either ‘RxxS’ or ‘SP’. Thus, ‘RxxSP’ obtains two significance values, 3.43 and 1.54, on the sub-motif induced data sets. MMFPh chooses the first value and reports this motif as a significant one, while C-Motif prunes it since the minimum value is less than the given local significance threshold. In fact, ‘RxxSP’ is quite dependent on ‘SP’ in the sense that its global significance value of 6.6 is close to the value of 5.5 for ‘SP’. This indicates that the discriminative power of ‘RxxSP’ mainly comes from its subset ‘SP’. It is therefore reasonable to regard motifs of this kind as redundant.


Fig. 5. The running time comparison of C-Motif, MMFPh and Motif-All on non-kinase-specific phosphorylation data.

4) The fourth part consists of motifs that are reported only by Motif-All. Motifs in this category generally have at least one very significant sub-motif that actually leads to the statistical significance of its superset. For instance, ‘GxxRxxS’ is composed of two sub-motifs: ‘GxxxxxS’ and ‘RxxS’. ‘GxxxxxS’ is not a significant phosphorylation motif, with both local and global significance values of 0.99, while ‘RxxS’ is a significant one with significance values of 2.48. ‘GxxRxxS’ has a global significance value of 3.22, which is little improvement in over-expressiveness over that of ‘RxxS’. In this regard, ‘GxxRxxS’ is an insignificant motif according to our definition although it passes the significance threshold $\theta_{sig}$. A similar analysis can be made for the other motifs in this part. Hence, these motifs are redundant in the sense that their discriminative power mainly comes from sub-motifs that are already claimed to be statistically significant. Filtering out motifs of this kind reduces the redundancy in phosphorylation motif discovery.

We have similar observations on the motifs of size 3 reported by the different algorithms. In general, Motif-All covers all the potentially significant motifs detected by MMFPh and C-Motif but contains many redundant ones with no significant improvement in statistical significance. The difference between the result sets of MMFPh and C-Motif includes several potentially redundant motifs from MMFPh and some interesting ones without any significant sub-motifs discovered by C-Motif. Note that the largest size of reported phosphorylation motifs is 3 under the setting of $\theta_{sup} = 0.005$ and $\theta_{sig} = 1.7$. The difference between the performance of C-Motif and that of MMFPh and Motif-All is more evident when the size of the target motifs is larger than 2. That is, the larger the size is, the more the results of the algorithms differ.
Thus, we can infer that increasing the size of the target motifs will highlight the advantage of C-Motif over MMFPh and Motif-All. To check whether this is true, we also perform phosphorylation motif mining at a support threshold of 0.1 with a significance threshold of 4.0. Under this setting, C-Motif obtains a set of motifs of size 1 that is identical to that of MMFPh and slightly different from that of Motif-All. If we change the support threshold to 0.001 and the significance threshold to 1.5, C-Motif reports 1,530 motifs, whereas MMFPh reports 515 motifs and Motif-All reports 4,236 motifs. Accordingly, there is a bigger gap between their result sets. On the one hand, MMFPh misses a number of interesting motifs and includes several potentially redundant motifs whose over-expressiveness mainly derives from their sub-motifs. On the other hand, Motif-All reports many motifs whose sub-motifs are included in the result as well. This demonstrates that C-Motif not only finds more significant motifs than MMFPh but also achieves non-redundancy more flexibly than Motif-All and MMFPh.

In conclusion, MMFPh is more effective than Motif-All at presenting meaningful motifs and reducing redundant motifs, although it misses some statistically significant motifs. Motif-All achieves higher coverage at the cost of including many motifs whose over-expressiveness is rooted in their sub-motifs. In contrast, C-Motif makes a reasonable trade-off between redundancy and coverage. Hence, the empirical comparison shows that C-Motif outperforms MMFPh and Motif-All by discovering more meaningful and non-redundant phosphorylation motifs.

Fig. 5 presents the running time of the different algorithms. For this data set, C-Motif and Motif-All spend similar execution time in the mining process under the same parameter setting. Owing to its specific candidate generation procedure, MMFPh needs more time when the significance threshold is very low and is more efficient when the significance threshold is increased. Overall, C-Motif is relatively efficient and is comparable with the other algorithms in terms of running time.
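The pruning decision discussed for ‘RxxSP’ can be sketched in a few lines. The sketch assumes that a motif is kept when its global significance and the minimum of its local significance values both reach their thresholds, as suggested by the statement that C-Motif prunes ‘RxxSP’ because the minimum value is below the local threshold. The function names are hypothetical, and the second function is only a simplified reading of the MMFPh behaviour described in the text, not its actual implementation.

```python
def is_conditional_motif(global_sig, local_sigs, theta_g, theta_l):
    """Keep a motif only if it is globally and locally significant.

    `local_sigs` holds one significance value per sub-motif induced data
    set. For a motif without proper sub-motifs the global value is used
    as the local value (an assumption of this sketch).
    """
    worst_local = min(local_sigs, default=global_sig)
    return global_sig >= theta_g and worst_local >= theta_l


def passes_any_extension(local_sigs, theta_l):
    """Simplified reading of MMFPh: one significant extension suffices."""
    return any(value >= theta_l for value in local_sigs)


# Values quoted in the text for 'RxxSP' with both thresholds set to 1.7.
kept_by_c_motif = is_conditional_motif(6.6, [3.43, 1.54], 1.7, 1.7)
kept_by_any_rule = passes_any_extension([3.43, 1.54], 1.7)
print(kept_by_c_motif, kept_by_any_rule)
```

Taking the minimum rather than any single value is what removes motifs whose over-expressiveness depends on one dominant sub-motif.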

3.2 CDK-Specific Phosphorylation Data

3.2.1 Data Description

A protein kinase phosphorylates its substrates by transferring phosphate from adenosine triphosphate or guanosine triphosphate to specific amino acids (serine, threonine and tyrosine). Cyclin-dependent kinases (CDKs) are a major class of enzymes involved in the regulation of the cell cycle. These kinases are activated alternately along the cell cycle, and they phosphorylate the corresponding substrates so that the cell cycle proceeds in an orderly manner. As with the non-kinase-specific phosphorylation data, we first extract the phosphorylated peptides from proteins phosphorylated by CDK to compose the foreground data set. We use the method in [27] to construct the non-phosphorylation data. There are approximately 200 sequences (i.e., 13-mers) in both the foreground data and the background data.

Fig. 6. The number of reported motifs of different sizes on CDK-specific phosphorylation data sets. Here the support threshold $\theta_{sup}$ is 0.015 and the significance threshold $\theta_{sig}$ is 2.0.

3.2.2 Results

For this data set, we first set the support threshold $\theta_{sup} = 0.015$ for all the algorithms and the significance threshold $\theta_{sig} = 2.0$ for Motif-All and MMFPh. For C-Motif, we use the same threshold for global significance and local significance, that is, $\theta_{sig}^{g} = \theta_{sig}^{l} = 2.0$. Fig. 6 presents the number of significant motifs found by C-Motif, Motif-All and MMFPh under this setting. C-Motif reports a set of phosphorylation motifs that is smaller than that of Motif-All but larger than that of MMFPh. Motif-All finds all the phosphorylation motifs reported by C-Motif, but it also includes some questionable ones that need further discussion. For this reason, we re-conduct significance tests for the motifs found only by Motif-All. We first extract them to form a new set, and then check the statistical significance of each motif together with that of its component parts. Most of these motifs consist of significant parts and insignificant parts, and their significant parts account for the over-expressiveness of the whole motifs. According to our definition, these motifs should be considered redundant. The result set of Motif-All may also contain motifs whose over-expressiveness is similar to that of their subsets. Therefore, because pruning by global significance alone is weaker, Motif-All often fails to eliminate subset interplay and returns a larger number of motifs with a lot of undesired noise.
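The extraction step mentioned above, in which the motifs found only by Motif-All are collected into a new set for re-testing, amounts to a set difference. The following sketch uses toy result sets; the motif strings are borrowed from the discussion in Section 3.1 purely for illustration and are not the actual CDK results, and the function name is hypothetical.

```python
def only_in_motif_all(motif_all, c_motif, mmfph):
    """Return motifs reported by Motif-All but by neither other algorithm."""
    return set(motif_all) - (set(c_motif) | set(mmfph))


# Toy result sets (hypothetical; not the outputs reported in the paper).
motif_all = {"SP", "RxxS", "RxxSP", "GxxRxxS"}
c_motif = {"SP", "RxxS"}
mmfph = {"SP", "RxxS", "RxxSP"}

# Motif-All is described as reporting a superset of the other results.
assert c_motif <= motif_all and mmfph <= motif_all

candidates = only_in_motif_all(motif_all, c_motif, mmfph)
print(sorted(candidates))
```

Each motif in `candidates` would then be re-tested together with its component parts, as described in the text.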


In particular, almost all the motifs found by MMFPh also occur in the result sets reported by the other algorithms, apart from a few special cases. The motifs that are reported by C-Motif but missed by MMFPh are both globally and locally significant according to our definition. For instance, ‘PxVxxxSxxK’ is a 3-motif reported by C-Motif and Motif-All but missed by MMFPh. With respect to global significance, this motif is indeed significant, with prominent over-representation in the foreground data set. In contrast, its sub-motifs ‘VxxxSxxK’, ‘PxxxxxSxxK’ and ‘PxVxxxS’ are all insignificant on their sub-motif induced data sets, with lower over-expressiveness. As discussed in the previous section, since ‘PxVxxxSxxK’ has no frequent and significant sub-motifs, MMFPh has no chance to generate and evaluate the target motif and therefore never reports it. In this sense, MMFPh fails to discover some interesting motifs because of its restrictive definition. In addition, we find that most of the motifs missed by MMFPh are of larger sizes.

The motifs discovered by MMFPh but filtered out by C-Motif are exceptions whose over-expressiveness is mainly rooted in their significant parts. ‘STPxxxxR’ is one such special case with three significant sub-motifs: ‘STP’, ‘STxxxxxR’ and ‘TPxxxxR’. ‘STPxxxxR’ can be extended from any of these sub-motifs, so it has three possible significance values to measure its over-expressiveness, which are 1.5, 2.17 and 1.0, respectively. The target motif is therefore significant only if we evaluate it on the data sets induced by its second sub-motif. In addition, ‘STPxxxxR’ has the same global significance value as its sub-motif ‘TPxxxxR’. Whether ‘TPxxxxR’ is significant or not thus greatly affects the significance of ‘STPxxxxR’. As a result, motifs like this are biased ones that should be removed according to our definition.
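The sub-motifs listed for ‘PxVxxxSxxK’ and ‘STPxxxxR’ can be reproduced by dropping one fixed residue at a time. The sketch below assumes that a sub-motif is obtained by replacing exactly one fixed residue (other than the phosphorylation site) with the wildcard ‘x’ and trimming wildcards at both ends. The function name is hypothetical, and the position of the phosphorylation site in each string is inferred from the sub-motifs quoted in the text, not stated by the paper.

```python
def sub_motifs(motif, site_index, wildcard="x"):
    """Return the sub-motifs obtained by dropping one fixed residue.

    `site_index` is the position of the phosphorylated residue, which is
    never dropped. Leading and trailing wildcards are trimmed so that the
    result uses the same compact notation as the text.
    """
    result = set()
    for i, aa in enumerate(motif):
        if aa == wildcard or i == site_index:
            continue
        chars = list(motif)
        chars[i] = wildcard
        result.add("".join(chars).strip(wildcard))
    return result


# Site positions below are inferred from the sub-motifs listed in the text.
print(sorted(sub_motifs("PxVxxxSxxK", site_index=6)))
print(sorted(sub_motifs("STPxxxxR", site_index=1)))
```

A motif with k fixed residues besides the site has k such sub-motifs, which is why the number of local significance values grows with the motif size.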
In conclusion, MMFPh still includes some redundant motifs even though it contributes to redundancy reduction.

For illustration purposes, we also study the performance of the different algorithms under different statistical significance standards and support levels. We gradually change the significance threshold and the support threshold to obtain various cases. The results are summarized in Table 1. Under every setting, Motif-All reports the most motifs and MMFPh returns the fewest. If we lower the support threshold with the significance threshold unchanged, or lower the significance threshold with the support threshold unchanged, many more motifs of different sizes are reported. In this situation, more redundant motifs are considered significant by Motif-All while more potentially interesting ones are missed by MMFPh. This indicates that the smaller the support threshold and the significance threshold are, the more motifs of larger sizes are found and the more obvious the difference between the results of the algorithms becomes. The reason is that a motif of larger size contains more sub-motifs, so subset interplay plays a more important role and is more likely to affect the over-expressiveness of the whole combination than it is for a motif of smaller size. Hence, it increases the probability for both Motif-All and MMFPh to regard the target motif as an


TABLE 1 The Number of Reported Motifs of Different Sizes on the CDK Phosphorylation Data Sets by Tuning the Thresholds $\theta_{sup}$ and $\theta_{sig}$ Gradually

$\theta_{sig}$ indicates an identical threshold for $\theta_{sig}^{g}$ and $\theta_{sig}^{l}$ in C-Motif.

interesting one even if its over-expressiveness is derived from some subsets, or for MMFPh to filter a motif out if it has no frequent and significant sub-motifs although the motif is globally significant. Conversely, with higher thresholds more motifs are pruned and fewer are discovered, which reduces the performance gap among these algorithms. In this sense, Motif-All and MMFPh may be more appropriate for discovering motifs of smaller size. To conclude, C-Motif is a better algorithm with respect to both coverage and non-redundancy in phosphorylation motif discovery.

Fig. 7 depicts the running time of the algorithms under different parameters. Despite the fluctuation of MMFPh, it is apparent that C-Motif needs a running time similar to that of Motif-All and slightly longer than that of MMFPh to finish the mining process. Meanwhile, the performance gap between our algorithm and the other methods is not significant. Therefore, C-Motif is a relatively efficient algorithm for the task of finding phosphorylation motifs.

3.3 Protein Kinase A (PKA)-Specific Phosphorylation Data

3.3.1 Data Description

Protein kinase A is a class of cAMP-dependent enzymes. It plays an important role in the phosphorylation of gene regulatory proteins and in activating the transcription of specific genes. We again use the Phospho.ELM and Swiss-Prot databases to generate phosphorylated peptides and unphosphorylated peptides with PKA-specific phosphorylation sites, in the same way as for the CDK-specific phosphorylation data. In the generated data, the number of phosphorylated peptides is roughly equal to that of unphosphorylated peptides.

3.3.2 Results

In this experiment, we further validate our approach on the PKA-specific phosphorylation peptides with the parameters $\theta_{sup} = 0.01$ and $\theta_{sig} = 2.0$. For a fair comparison, the significance thresholds for C-Motif are specified as $\theta_{sig} = \theta_{sig}^{l} = \theta_{sig}^{g} = 2.0$. Under this setting, the results of all the motif extraction algorithms are shown in Fig. 8. In addition to the gap in the total number of motifs, MMFPh fails to find any motifs of size larger than 2, while Motif-All reports motifs of size up to 9. C-Motif returns a result set of motifs whose sizes are not larger than 3. To further describe the difference, we use several identified motifs as examples. ‘SxxSVT’ is a statistically significant motif with both global significance and local significance values larger than the given threshold $\theta_{sig}$. However, MMFPh ignores this motif since it lacks frequent and significant sub-motifs. The significance values of its three 2-sub-motifs (‘SxxSV’, ‘SxxSxT’ and ‘SVT’) are all less than $\theta_{sig}$. On the other hand, ‘GEKxxxS’ reported by Motif-All is composed of a significant part ‘GxKxxxS’ with a local significance value of 3.6 and an insignificant part ‘ExxxxS’ with a local significance value of 0.5, as assessed by C-Motif. In addition, ‘GEKxxxS’ has a global significance value of 5.14 that shows no significant improvement

Fig. 7. The running time comparison of C-Motif, MMFPh and Motif-All on CDK-specific phosphorylation data.


Fig. 8. The number of reported motifs of different sizes on PKA-specific phosphorylation data sets. Here the support threshold $\theta_{sup}$ is 0.01 and the significance threshold $\theta_{sig}$ is 2.0.

over that of ‘GxKxxxS’ (the global significance value of ‘GxKxxxS’ is 4.11). As a result, the significant sub-motif ‘GxKxxxS’ directly enhances the statistical significance of ‘GEKxxxS’ to some extent. Hence, motifs like this are not meaningful, since their over-expressiveness derives from their component parts rather than from the whole combination. Owing to its less stringent pruning criterion, Motif-All improperly reports ‘GEKxxxS’ as an interesting motif. Similarly, ‘KRxxxxS’, reported by MMFPh and Motif-All, has a global significance value of 5.14, which is close to the value of 4.52 for its sub-motif ‘RxxxxS’. Consequently, ‘KRxxxxS’ is a redundant motif with little improvement in over-expressiveness.

The detailed investigation of all the motifs reported by C-Motif, MMFPh and Motif-All demonstrates two crucial points. The first point is that the motifs identified by C-Motif generally have no globally significant sub-motifs. These motifs survive the pruning stage because of the inherent properties of the combination of all parts rather than of some subsets. The second point is that a proportion of the motifs reported by Motif-All and several of those reported by MMFPh are motifs whose over-expressiveness mainly benefits from their constituent parts. Consequently, though Motif-All has much success in returning the highest coverage of the significant motifs and MMFPh achieves the


ability to identify conditional phosphorylation motifs at the cost of excluding some underlying ones, both fail to filter out the potential false positives whose over-expressiveness comes from their sub-motifs. Overall, C-Motif outperforms Motif-All and MMFPh with a trade-off between coverage and non-redundancy of the final significant motif set.

Next, we concentrate on the efficiency of the motif discovery algorithms. Fig. 9 presents the running time of C-Motif, MMFPh and Motif-All on PKA-specific phosphorylation data sets when the parameters are changed gradually. We first fix the support threshold $\theta_{sup} = 0.015$ and increase the significance threshold $\theta_{sig}$ from 1.0 to 10.0. Subsequently, we keep the significance threshold $\theta_{sig}$ unchanged at 2.0 and increase the support threshold $\theta_{sup}$ from 0.01 to 0.1. As shown in Fig. 9, our method has running efficiency comparable with that of the other two algorithms. In addition, we have the following remarks.

First, there is an obvious gap between MMFPh and the other two methods in most cases. MMFPh appears to be the most ‘inefficient’ approach before a critical point and the most ‘efficient’ approach after that point. This is because both Motif-All and C-Motif conduct the discovery under an Apriori framework, which is very efficient in frequent pattern mining, whereas MMFPh generates candidates by enumeration and produces many duplicates, which is relatively time-consuming, especially when there are plenty of significant motifs to extend. Moreover, MMFPh ignores some globally significant motifs if they lack significant and frequent sub-motifs, while C-Motif and Motif-All do not, so C-Motif and Motif-All have to spend more time testing those candidate motifs. In addition, for fairness C-Motif also performs the global significance assessment as Motif-All does. As a result, MMFPh runs more efficiently when the significance threshold is relatively large.
Second, the running times of Motif-All and C-Motif are almost the same. Although there is a small difference between these two methods in some special situations, they are of the same order of magnitude and the gap is negligible. This is because when testing one potentially significant motif, C-Motif has to investigate every peptide in the original data and check whether this

Fig. 9. The running time comparison of C-Motif, MMFPh and Motif-All on PKA-specific phosphorylation data.


peptide contains one of its sub-motifs. It then generates a new data set for calculating the local significance value. In contrast, Motif-All uses only global significance to evaluate a candidate, so it avoids the computation and enumeration needed to generate the new data. On the other hand, Motif-All has to spend additional running time on checking more candidates than C-Motif.

Third, considering that C-Motif achieves a better overall balance between the coverage and non-redundancy of significant phosphorylation motifs than Motif-All and MMFPh, it is reasonable to regard C-Motif as an efficient algorithm for mining phosphorylation motifs.
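The extra work described in the second remark, scanning every peptide for a sub-motif and building a new data set for the local test, can be sketched as follows. The sketch assumes relative risk (foreground frequency over background frequency) as the significance measure and takes the minimum over all sub-motif induced data sets; the function names, the motif representation and the toy peptides are hypothetical and are not taken from the C-Motif source code.

```python
def matches(peptide, motif):
    """True if the peptide has every fixed residue in `motif` (pos -> aa)."""
    return all(peptide[pos] == aa for pos, aa in motif.items())


def frequency(motif, peptides):
    """Fraction of peptides containing the motif (0.0 for an empty set)."""
    if not peptides:
        return 0.0
    return sum(matches(p, motif) for p in peptides) / len(peptides)


def relative_risk(motif, foreground, background):
    """Assumed measure: foreground frequency over background frequency."""
    fg, bg = frequency(motif, foreground), frequency(motif, background)
    if bg == 0.0:
        return float("inf") if fg > 0.0 else 0.0
    return fg / bg


def local_significance(motif, foreground, background):
    """Minimum relative risk over all sub-motif induced data sets."""
    values = []
    for dropped in motif:
        sub = {pos: aa for pos, aa in motif.items() if pos != dropped}
        # Induced data sets: only the peptides that contain the sub-motif.
        fg_sub = [p for p in foreground if matches(p, sub)]
        bg_sub = [p for p in background if matches(p, sub)]
        values.append(relative_risk(motif, fg_sub, bg_sub))
    return min(values)


# Toy 5-mers centered on S at index 2 (hypothetical data).
fg = ["RASPA", "RGSPL", "RKSPA", "ALSPG", "GLSPA", "KASPL", "RASAA", "GASGL"]
bg = ["RASPA", "ALSPG", "GKSPL", "KASPA", "RGSAL", "RLSGA", "AASGL", "GGSAA"]
motif = {0: "R", 3: "P"}  # 'RxSP' in the compact notation
global_value = relative_risk(motif, fg, bg)
local_value = local_significance(motif, fg, bg)
print(global_value, local_value)
```

In this toy case the global value is larger than the local value, because part of the enrichment is already explained by the sub-motif with proline after the site. The per-candidate filtering of both data sets is the additional cost that Motif-All avoids.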

4 CONCLUSION

This paper formally proposes the notion of a conditional phosphorylation motif and presents the problem of conditional phosphorylation motif discovery. We also propose a new algorithm called C-Motif for this task, which aims at a non-redundant discovery. Experiments on both non-kinase-specific and kinase-specific phosphorylation data demonstrate that C-Motif uncovers more interesting phosphorylation motifs than MMFPh and retains fewer false positives than Motif-All under the same parameter setting. Moreover, it is fast enough to find hundreds of significant motifs in large data sets. As a result, C-Motif outperforms the existing phosphorylation motif discovery algorithms with respect to coverage and non-redundancy, with comparable efficiency.

There are several possible directions for future work. Firstly, although methods such as C-Motif and MMFPh can reduce the number of phosphorylation motifs returned to biologists, many statistically significant motifs still remain. As a result, it is still not easy to find truly biologically meaningful motifs in this pool. From the computational and statistical perspective, we still need to develop effective algorithms and rigorous statistical testing procedures to further reduce the number of reported motifs. The concept of “maximal motif” in MMFPh and the application of the permutation test in [17] are research efforts in this direction. However, they are not yet sufficient and further investigations should be conducted. Secondly, existing motif discovery algorithms perform the mining task with only the sequence data around the phosphorylation site as input. This may prevent us from finding truly biologically interesting motifs that yield useful scientific insights. One possible remedy for this limitation is to conduct the motif search on expanded data sets that include supplementary information such as 3D protein structures. Finally, it is highly necessary to collect biologically validated motifs and build a public repository as the reference database for performance assessment and comparison.

ACKNOWLEDGMENTS

This work was partially supported by the Natural Science Foundation of China under Grant No. 61003176 and No. 61073051, and the Fundamental Research Funds for the Central Universities of China (DUT14QY07).


REFERENCES

[1] P. Cohen, “The regulation of protein function by multisite phosphorylation-a 25 year update,” Trends Biochem. Sci., vol. 25, no. 12, pp. 596–601, 2000.
[2] G. Manning, G. D. Plowman, T. Hunter, and S. Sudarsanam, “Evolution of protein kinase signaling from yeast to man,” Trends Biochem. Sci., vol. 27, no. 10, pp. 514–520, 2002.
[3] S. B. Ficarro, M. L. McCleland, P. T. Stukenberg, D. J. Burke, M. M. Ross, J. Shabanowitz, D. F. Hunt, and F. M. White, “Phosphoproteome analysis by mass spectrometry and its application to saccharomyces cerevisiae,” Nature Biotechnol., vol. 20, no. 3, pp. 301–305, 2002.
[4] B. E. Turk, “Understanding and exploiting substrate recognition by protein kinases,” Current Opinion Chem. Biol., vol. 12, no. 1, pp. 4–10, 2008.
[5] R. Amanchy, B. Periaswamy, S. Mathivanan, R. Reddy, S. G. Tattikota, and A. Pandey, “A curated compendium of phosphorylation motifs,” Nature Biotechnol., vol. 25, no. 3, pp. 285–286, 2007.
[6] A. N. Kettenbach and S. A. Gerber, “Rapid and reproducible single-stage phosphopeptide enrichment of complex peptide mixtures: Application to general and phosphotyrosine-specific phosphoproteomics experiments,” Anal. Chem., vol. 83, no. 20, pp. 7635–7644, 2011.
[7] Y. Yu, S.-O. Yoon, G. Poulogiannis, Q. Yang, X. M. Ma, J. Villen, N. Kubica, G. R. Hoffman, L. C. Cantley, S. P. Gygi, and J. Blenis, “Phosphoproteomic analysis identifies Grb10 as an mTORC1 substrate that negatively regulates insulin signaling,” Science, vol. 332, no. 6035, pp. 1322–1326, 2011.
[8] S. Matsuoka, B. A. Ballif, A. Smogorzewska, E. R. McDonald, K. E. Hurov, J. Luo, C. E. Bakalarski, Z. Zhao, N. Solimini, Y. Lerenthal, Y. Shiloh, S. P. Gygi, and S. J. Elledge, “ATM and ATR substrate analysis reveals extensive protein networks responsive to DNA damage,” Science, vol. 316, no. 5828, pp. 1160–1166, 2007.
[9] D. Schwartz and S. P. Gygi, “An iterative statistical approach to the identification of protein phosphorylation motifs from large-scale data sets,” Nature Biotechnol., vol. 23, no. 11, pp. 1391–1398, 2005.
[10] A. Ritz, G. Shakhnarovich, A. R. Salomon, and B. J. Raphael, “Discovery of phosphorylation motif mixtures in phosphoproteomics data,” Bioinformatics, vol. 25, no. 1, pp. 14–21, 2009.
[11] Y.-C. Chen, K. Aguan, C.-W. Yang, Y.-T. Wang, N. R. Pal, and I.-F. Chung, “Discovery of protein phosphorylation motifs through exploratory data analysis,” PloS One, vol. 6, no. 5, p. e20025, 2011.
[12] T. Wang, A. N. Kettenbach, S. A. Gerber, and C. Bailey-Kellogg, “MMFPh: A maximal motif finder for phosphoproteomics datasets,” Bioinformatics, vol. 28, no. 12, pp. 1562–1570, 2012.
[13] Z. He, C. Yang, G. Guo, N. Li, and W. Yu, “Motif-all: Discovering all phosphorylation motifs,” BMC Bioinformat., vol. 12, p. S22, 2011.
[14] Z. He and H. Gong, “Comments on ‘MMFPh: A maximal motif finder for phosphoproteomics datasets’,” Bioinformatics, vol. 28, no. 16, pp. 2211–2212, 2012.
[15] T. Wang, A. N. Kettenbach, S. A. Gerber, and C. Bailey-Kellogg, “Response to ‘comments on ‘MMFPh: A maximal motif finder for phosphoproteomics datasets’’,” Bioinformatics, vol. 28, no. 16, pp. 2211–2212, 2012.
[16] P. I. Good, Permutation, Parametric and Bootstrap Tests of Hypotheses, New York, NY, USA: Springer-Verlag, 2005, ch. 9.
[17] H. Gong and Z. He, “Permutation methods for testing the significance of phosphorylation motifs,” Statist. Interface, vol. 5, pp. 61–73, 2012.
[18] L. Ma, T. L. Assimes, N. B. Asadi, C. Iribarren, T. Quertermous, and W. H. Wong, “An almost exhaustive search-based sequential permutation method for detecting epistasis in disease association studies,” Genetic Epidemiol., vol. 34, pp. 434–443, 2010.
[19] R. Agrawal and R. Srikant, “Fast algorithms for mining association rules in large databases,” in Proc. 20th Int. Conf. Very Large Data Bases, 1994, pp. 487–499.
[20] R. J. Bayardo Jr., R. Agrawal, and D. Gunopulos, “Constraint-based rule mining in large, dense databases,” Data Mining Knowl. Discovery, vol. 4, no. 2-3, pp. 217–240, 2000.
[21] H. Dinkel, C. Chica, A. Via, C. M. Gould, L. J. Jensen, T. J. Gibson, and F. Diella, “Phospho.ELM: A database of phosphorylation sites-update 2011,” Nucleic Acids Res., vol. 39, no. 1, pp. D261–D267, 2011.


[22] N. Farriol-Mathis, J. S. Garavelli, B. Boeckmann, S. Duvaud, E. Gasteiger, A. Gateau, A.-L. Veuthey, and A. Bairoch, “Annotation of post-translational modifications in the Swiss-Prot knowledge base,” Proteomics, vol. 4, no. 6, pp. 1537–1550, 2004.
[23] J. Gao, J. J. Thelen, A. K. Dunker, and D. Xu, “Musite, a tool for global prediction of general and kinase-specific phosphorylation sites,” Molecular Cellular Proteomics, vol. 9, no. 12, pp. 2586–2600, 2010.
[24] S. F. Altschul, T. L. Madden, A. A. Schäffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman, “Gapped BLAST and PSI-BLAST: A new generation of protein database search programs,” Nucleic Acids Res., vol. 25, no. 17, pp. 3389–3402, 1997.
[25] H. Gong, X. Liu, J. Wu, and Z. He, “Data construction for phosphorylation site prediction,” Briefings Bioinformat., 2013, doi:10.1093/bib/bbt012.
[26] N. Blom, S. Gammeltoft, and S. Brunak, “Sequence and structure-based prediction of eukaryotic protein phosphorylation sites,” J. Molecular Biol., vol. 294, no. 5, pp. 1351–1362, 1999.
[27] T. H. Dang, K. Van Leemput, A. Verschoren, and K. Laukens, “Prediction of kinase-specific phosphorylation sites using conditional random fields,” Bioinformatics, vol. 24, no. 24, pp. 2857–2864, 2008.

Haipeng Gong received the BS degree in software engineering from Dalian University of Technology, China, in 2011. He is currently working toward the MS degree in the School of Software at Dalian University of Technology. His research interests include data mining and computational proteomics.

Shengchun Deng received the PhD degree in computer science from Harbin Institute of Technology in 2002. He is currently a professor at Harbin Institute of Technology. His research interests include database, software engineering, and software interoperability.

Xiaoqing Liu received the BS degree in software engineering from Dalian University of Technology, China, in 2013. She is currently working toward the MS degree in the School of Software at Dalian University of Technology. Her research interests include bioinformatics and data mining.

Jun Wu received the BS degree in software engineering from Dalian University of Technology, China, in 2013. He is currently working toward the MS degree in the School of Software at Dalian University of Technology. His research interests include bioinformatics and data mining.

Zengyou He received the BS, MS, and PhD degrees in computer science from Harbin Institute of Technology, China, in 2000, 2002, and 2006, respectively. He was a research associate in the Department of Electronic and Computer Engineering at the Hong Kong University of Science and Technology from February 2007 to February 2010. Since March 2010, he has been an associate professor in the School of Software at Dalian University of Technology. His research interests include computational proteomics and biological data mining.