On the Power of Profiles for Transcription Factor Binding Site Detection

Sven Rahmann^{1,2,*}, Tobias Müller^{1,3}, and Martin Vingron^{1}

^{1} Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Ihnestraße 73, D-14195 Berlin, Germany
^{2} Department of Mathematics and Computer Science, Freie Universität Berlin
^{3} Present address: Department of Bioinformatics, Biozentrum, Universität Würzburg, Am Hubland, D-97074 Würzburg, Germany
^{*} Corresponding author: [email protected]
Running head: On the Power of Profiles

Key words and phrases: Transcription factor binding site (TFBS); Profile; Position specific score matrix (PSSM); Position-weight matrix (PWM); Log-odds score; Exact test; Significance; Power; TRANSFAC.

Abstract. Transcription factor binding site (TFBS) detection plays an important role in computational biology, with applications in gene finding and gene regulation. The sites are often modeled by gapless profiles, also known as position-weight matrices. Past research has focused on the significance of profile scores (the ability to avoid false positives), but this alone is not enough: The profile must also possess the power to detect the true positive signals. Several completed genomes are now available, and the search for TFBSs is moving to a large scale; so discriminating signal from noise becomes even more challenging. Since TFBS profiles are usually estimated from only a few experimentally confirmed instances, careful regularization is an important issue. We present a novel method that is well suited for this situation. We further develop measures that help in judging profile quality, based on both sensitivity and selectivity of a profile. It is shown that these quality measures can be efficiently computed, and we propose statistically well-founded methods to choose score thresholds. Our findings are applied to the TRANSFAC database of transcription factor binding sites. The results are disturbing: If we insist on a significance level of 5% in sequences of length 500, only 19% of the profiles detect a true signal instance with 95% success probability under varying background sequence compositions.
1 Introduction
To understand the process of gene regulation, we need to detect accurately the location and relative strength of all transcription factor binding sites (TFBSs) in a genome. Knowing the location of these sites also helps to determine the promoter region of a gene, and to predict tissue-specific gene-expression patterns (Roulet et al., 1998). Several stochastic ways of describing TFBSs exist. For example, the TRANSFAC database (Wingender et al., 2000) represents TFBSs by count matrices, which can be interpreted as sequence profiles. Profiles are gapless position-specific probability distributions on the nucleotides. A profile, interpreted as a generative signal model, combined with a background model, defines a position specific score matrix (PSSM), also known as a position-weight matrix (PWM). Such a matrix assigns a position-specific score to each nucleotide. To detect the presence of the signal in one or more sequences, the score matrix is slid over the sequence, and each sequence window is given a score. When the score at some position exceeds a pre-determined threshold, we decide that an instance of the signal occurs at that position. As TFBS profiles are typically short and degenerate, their signal-to-noise ratio is low. Therefore TFBSs are hard to predict accurately. To achieve the best results under these circumstances, a sound statistical analysis is required to optimally determine the score threshold t. So far consideration has been restricted to the significance level of a given threshold (Staden, 1989; Claverie and Audic, 1996; Stormo, 2000). When a score of at least t frequently occurs in a sequence generated according to the background model, then this score does not provide evidence for the occurrence of the signal; it is insignificant. In practice a score of t is called significant for a sequence when the probability that such a score is reached at least once in a background sequence of the same length is at most 0.05 or 0.01. 
Choosing t in this way limits false positive classifications. Significance provides only half of the picture, however. We also need to know the probability of successfully identifying a true signal when it is present in the sequence; this probability is called the power of the threshold t. When a highly significant threshold is enforced, it easily happens that t becomes so large that even true signal scores can rarely reach t; see Figure 1. We will show that both the significance level and the power of any threshold t can be determined accurately. Although it seems strange that the power of profiles has not yet been systematically analyzed, we found no such considerations in the literature.

Because the profile score of a single sequence window is a sum of independent random variables under both the background model and the signal model, its distribution can be efficiently computed in both cases without making parametric distributional assumptions, such as assuming a Gaussian distribution, simply by computing the convolution of the individual score distributions (Staden, 1989). For the maximum score over a long background sequence, we do not make the common assumption of an extreme value distribution (EVD), but approximate the right tail probabilities only. We obtain a good bound on the error with the Chen-Stein method (Barbour et al., 1992). While it has been proved that, under mild regularity conditions, the score distribution asymptotically converges to a Gaussian distribution (and the best score's distribution to an EVD; Goldstein and Waterman, 1994), typical profiles are too short to rely on this approximation.

In Section 2, we formalize the required concepts and definitions. We review the estimation of profiles from sequence examples and propose a new regularization method especially suited to TFBS profile estimation. We also comment on the statistical classification and testing framework. Section 3 is devoted to the computation of the score distribution under both the background model, corresponding to the null hypothesis, and the signal model, corresponding to the alternative. In Section 4, we define a general set of quality measures for profiles that allows us to evaluate and rank all TRANSFAC profiles. The results are disturbing: Over 80% of the profiles are too weak to be used without further consideration in large-scale (genome-wide) TFBS detection pipelines. A quality ranking allows us to sort the high-quality profiles. The concluding Discussion (Section 5) recapitulates our main points and outlines some possible extensions.
2 Profiles and Their Applications

2.1 Basics
Let Σ be a finite alphabet and |Σ| its cardinality. For DNA, |Σ| = 4, and the letters of the alphabet are the four nucleotides A, C, G, and T. For proteins, |Σ| = 20, and the letters are the twenty amino acids. We write Σ^L for the set of all L-letter words over Σ. A probability distribution π over the letters is given by a row-vector π = (π_j)_{j∈Σ} with |Σ| nonnegative components such that Σ_{j∈Σ} π_j = 1.

A profile is a probabilistic description of a sequence. It specifies a probability distribution over the alphabet's letters for each position. More formally, a profile P of length L over Σ is an L × |Σ| matrix (P_ij) (i = 1, ..., L; j ∈ Σ), such that P_ij ≥ 0 for all i, j and Σ_{j∈Σ} P_ij = 1 for all i. We write P^{L×Σ} for the set of all length-L profiles over Σ.

Profiles as generative signal models. We think of a profile as a sequence generator. Each letter is generated independently according to its position-specific distribution. The probability that profile P ∈ P^{L×Σ} generates a fixed sequence S = (S_1, ..., S_L) ∈ Σ^L is P_P(S) = ∏_{i=1}^{L} P_{i,S_i}. The probability that P generates an N-tuple S = (S^1, ..., S^N) of N independent sequences S^k ∈ Σ^L is given by P_P(S) = ∏_{k=1}^{N} P_P(S^k). Note that the generation probabilities P_P(S^1), P_P(S^2) of two sequences S^1 and S^2 are only comparable when both sequences have the same length.
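As a concrete illustration of the generative view, the following minimal Python sketch (not part of the paper; the profile values are invented for the example) evaluates P_P(S) for a toy length-3 DNA profile:

```python
# A toy length-3 DNA profile: rows are positions, columns follow the
# alphabet order A, C, G, T. Each row sums to 1. Values are invented.
ALPHABET = "ACGT"
P = [
    [0.70, 0.10, 0.10, 0.10],
    [0.10, 0.40, 0.40, 0.10],
    [0.25, 0.25, 0.25, 0.25],
]

def profile_probability(profile, seq):
    """P_P(S) = prod_i P[i, S_i] for a sequence of the profile's length."""
    assert len(seq) == len(profile)
    prob = 1.0
    for row, letter in zip(profile, seq):
        prob *= row[ALPHABET.index(letter)]
    return prob

# "ACA" is generated with probability 0.7 * 0.4 * 0.25 = 0.07.
print(profile_probability(P, "ACA"))
```

Since each row is a probability distribution, the probabilities of all 4^3 words sum to one, which is a quick sanity check on any profile matrix.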
2.2 Estimating Profiles from Observed Sequences
Assume that we know N experimentally confirmed positive examples of a signal (in this case a TFBS) of length L, where S^k = (S^k_1, ..., S^k_L) ∈ Σ^L denotes the k-th instance (k = 1, ..., N). We are interested in the profile P that best fits this N-tuple S of examples in the sense that the likelihood of P under S, L_S(P) := P_P(S), is maximal among all profiles in P^{L×Σ}. It is well known that the maximum likelihood (ML) profile in such a case is given by P_ij = C_ij/N, where C_ij is the total number of occurrences of
letter j ∈ Σ at position i in all N sequences. The matrix C = (Cij ) is called the count matrix of the examples. The elements of C are generally integers, and each row of C sums to N . This need not always be the case, however. Sometimes, the available data may not consist of exact sequences, but also contain ambiguous letters. In the DNA alphabet, for example, we frequently find the IUPAC codes R (meaning A or G) and Y (meaning C or T). These codes can easily be accommodated in the count matrix by using fractional counts, where for example R is counted as half an A and half a G. We can also easily accommodate position-dependent numbers of observations Ni .
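A minimal sketch of the count-based ML estimate, including fractional counts for ambiguous letters as described above. The three example sequences and the small IUPAC table are hypothetical (the full IUPAC code has more symbols):

```python
ALPHABET = "ACGT"
# Fractional counts for a few IUPAC ambiguity codes (subset for illustration).
IUPAC = {"R": "AG", "Y": "CT", "N": "ACGT"}

def count_matrix(examples):
    """Count matrix C (one row per position); an ambiguous letter
    contributes 1/|subset| to each letter it represents. The ML profile
    is then C[i][j] / N_i."""
    L = len(examples[0])
    C = [[0.0] * len(ALPHABET) for _ in range(L)]
    for seq in examples:
        for i, letter in enumerate(seq):
            targets = IUPAC.get(letter, letter)
            for t in targets:
                C[i][ALPHABET.index(t)] += 1.0 / len(targets)
    return C

examples = ["ACG", "RCG", "ACT"]      # hypothetical confirmed instances
C = count_matrix(examples)
# Position 1: two full A's plus half an A and half a G from the "R".
print(C[0])                           # [2.5, 0.0, 0.5, 0.0]
ml_profile = [[c / len(examples) for c in row] for row in C]
```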
2.3 Regularization
Instead of using the count matrix C directly to estimate the ML profile, we prefer to use a regularized matrix C' := C + R. The elements of R are called pseudo-counts and need not be integers. We assume that the i-th row of R sums to the position-specific regularization weight W_i > 0. The profile matrix P is then computed by P_ij = (C_ij + R_ij)/(N_i + W_i), where W_i = Σ_j R_ij. The fraction W_i/(N_i + W_i) indicates the relative importance of the pseudo-counts in comparison to the observed counts at position i. If W_i remains constant as N_i grows, the influence of the regularization diminishes, and the profile matrix is mainly determined by the observed counts. So when a large dataset of positive examples is available, regularization barely affects the resulting profile.

When only little data is available, however, regularization has the following advantages. First, the regularized profile has a higher generalization ability. When only a few examples are known, there may be additional variations of the signal that have not been observed because of the lack of data. We interpret W pseudo-counts as W virtual observations that might have occurred if more data had been available. In this sense, regularization avoids over-fitting. Second, regularization provides a safeguard against zero counts in the profile matrix. When P_ij = 0, it is impossible that the profile generates any sequence S with letter j in the i-th position: We have P_P(S) = 0 because of a single letter (which may be a sequencing error). In most applications, it is safer to assume that nothing is impossible, but merely improbable. This also avoids singularities in the log-odds score described in Section 2.4.

While at first regularization seems to dilute the signal that may be present at a particular position, in fact good regularization improves biological signal detection, because a regularized profile has a higher generalization ability, i.e., the ability to recognize the signal in previously unseen data.
Another reason is that we must distinguish between the biological problem and the statistical problem (see Section 2.5). Sjølander et al. (1996) suggested a regularization method for protein family modeling based on Dirichlet mixtures. Finding a good regularizer is non-trivial and often problem-specific. We must find reasonable values for the pseudo-count weights W_i and the pseudo-counts R_ij on a case-by-case basis. Of course we could set every W_i to the same small constant, but we do not find this appropriate for transcription factor binding site profiles. Many profiles are based on only a small number of observations, and we must be careful not to destroy
weak but significant signals, while on the other hand we should be willing to ignore insignificant "signals". We will now present a simple regularization method that we find suitable for TFBS profiles.

A new idea for careful profile regularization. The basic ideas are (a) not to change the overall nucleotide composition of the profile, and (b) to regularize each position dependent on its signal strength, leaving conserved "core" regions of the profile relatively untouched. Mirny and Gelfand (2002) present biological evidence supporting position-dependent regularization.

Consider a fixed position i of the profile, and let τ be the symbol distribution of matrix row C_i, i.e., τ_j := C_ij/N_i. We need to determine a regularizing distribution ρ and a weight W_i. The i-th row C'_i of the regularized matrix C' is then given by C'_i = N_i·τ + W_i·ρ, and the corresponding row of the profile by P_i = N_i/(N_i + W_i) · τ + W_i/(N_i + W_i) · ρ. Defining the regularization fraction w for a fixed position i by w := W_i/(N_i + W_i), we can write P_i as the convex combination P_i = (1 − w)·τ + w·ρ.

Distribution ρ. In order not to change the overall symbol distribution of the profile, we set ρ to that distribution, i.e., ρ_j := Σ_i C_ij / Σ_i N_i, assuming that this leads to ρ_j > 0 for all j. Otherwise (this case rarely occurs in practice), we add 1/|Σ| to the numerator and 1 to the denominator: ρ_j := (1/|Σ| + Σ_i C_ij) / (1 + Σ_i N_i). We emphasize that ρ is not a "background" distribution, but the overall nucleotide composition of the profile. In some applications, it may be reasonable to symmetrize ρ such that ρ_G = ρ_C and ρ_A = ρ_T.

Weight w. It remains to find a suitable weight w for regularizing position i. Consider the potential candidates for profile row P_i, i.e., the family of regularized distributions δ(w) := (1 − w)·τ + w·ρ for 0 ≤ w ≤ 1. We define a difference measure ∆(w) (a scaled relative entropy) between δ(w) and δ(1) = ρ by

∆(w) := 2N_i · Σ_{j∈Σ} δ(w)_j · ln(δ(w)_j / ρ_j).
Note that ∆(0) ≥ 0 and ∆(w) decreases towards 0 = ∆(1) as w increases from 0 to 1. The unregularized distribution τ = δ(0) has a certain distance from the regularizing distribution ρ = δ(1), and our goal is to "move" τ towards ρ without destroying a significant signal in τ. When in fact the observed counts C_i are derived by sampling N_i observations from a Multinomial distribution with parameter ρ, then ∆(0), being a generalized log-likelihood ratio statistic, has a chi-square distribution with |Σ| − 1 degrees of freedom (Ewens and Grant, 2001). For small values of N_i, however, the chi-square approximation can be inaccurate.

The idea behind our regularization method is that E := E_ρ[∆(0)], the expected value of ∆(0) when sampling N_i times from ρ, is explained by sampling error alone. Therefore it is reasonable to pick the value of w ∈ [0, 1] for which ∆(w) = ∆(0) − E if this is possible; otherwise (when already ∆(0) < E), we take w = 1 (i.e., we take
P_i := ρ). In other words, we compensate for the expected distance between τ and ρ that can be explained by sampling error. For |Σ| = 4 and large sample sizes, we have E ≈ 1.75, the expectation of the chi-square distribution with 3 degrees of freedom. For small sample sizes, numerical simulations show that E < 1.75, approaching this value from below as N_i increases. While in principle we could compute E numerically in each case, we settle for a heuristic value of E = 1.5 for computational efficiency.

Example. Consider a count matrix row i with C_i = (2, 1, 1, 1); so N_i = 5, and τ = (0.4, 0.2, 0.2, 0.2). Assume that the regularizing distribution is ρ = (0.25, 0.25, 0.25, 0.25). We have ∆(0) = 0.54 < 1.5, and so the difference between τ and ρ can be attributed to random fluctuations alone. Indeed, an observation of the form C_i belongs to the set of most probable observations when drawing five times from the uniform distribution. Therefore we take w = 1 and P_i = ρ. Now assume that C_i = (5, 0, 0, 0), so τ = (1, 0, 0, 0) and ∆(0) = 16.09. A clear signal is present here. For w = 0.031 we have ∆(w) = 14.59 = ∆(0) − 1.5; so τ is only very carefully modified in order not to dilute the signal.
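The procedure above can be sketched in Python as follows. This is not the authors' code; it implements the formulas exactly as stated (∆(w) with natural logarithms, the heuristic E = 1.5, and a simple bisection for w), and exact numbers may differ slightly from the worked example depending on conventions. Qualitatively, a weak row like (2, 1, 1, 1) is fully replaced by ρ, while a conserved row like (5, 0, 0, 0) receives only a tiny admixture of ρ:

```python
import math

def delta(w, tau, rho, n):
    """Scaled relative entropy Delta(w) = 2*N_i * sum_j d_j * ln(d_j/rho_j),
    where d = (1-w)*tau + w*rho; the term 0*ln(0) is taken as 0."""
    total = 0.0
    for t, r in zip(tau, rho):
        d = (1.0 - w) * t + w * r
        if d > 0.0:
            total += d * math.log(d / r)
    return 2.0 * n * total

def regularization_fraction(tau, rho, n, E=1.5):
    """Find w in [0,1] with Delta(w) = Delta(0) - E by bisection
    (Delta is decreasing in w); if Delta(0) < E, replace the whole
    row by rho, i.e. w = 1."""
    target = delta(0.0, tau, rho, n) - E
    if target <= 0.0:
        return 1.0
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if delta(mid, tau, rho, n) > target:
            lo = mid          # Delta(mid) still too large: move right
        else:
            hi = mid
    return 0.5 * (lo + hi)

rho = [0.25] * 4
# Weak "signal" C_i = (2,1,1,1): Delta(0) ~ 0.54 < 1.5, hence w = 1.
print(regularization_fraction([0.4, 0.2, 0.2, 0.2], rho, 5))
# Conserved row C_i = (5,0,0,0): only a small fraction of rho is mixed in.
print(regularization_fraction([1.0, 0.0, 0.0, 0.0], rho, 5))
```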
2.4 Position Specific Score Matrices (PSSMs)
For a sequence window ("word") W of length L, we want to decide whether it is an occurrence of the signal described by profile P ∈ P^{L×Σ}, or whether it is "background" sequence, meaning everything except the signal. To make a meaningful decision, we need a probabilistic model for the background. It should capture compositional properties of the sequences under consideration without modeling any signal-like properties. For homogeneity reasons, the background is usually a simple i.i.d. model, that is, a profile matrix Π ∈ P^{L×Σ}, where each row consists of the same probability vector π. The decision is based on the ratio of the probabilities of the observed word in both models, also called the likelihood ratio of the models. Thus we compute the log-odds score

Score(W) := log(P_P(W)/P_Π(W)) = Σ_{i=1}^{L} log(P_{i,W_i}/π_{W_i}).    (1)
Once the background distribution π is fixed, a profile P can be translated into a profile score matrix S, more commonly known as a position weight matrix (PWM) or position specific score matrix (PSSM), by setting S_ij := log(P_ij/π_j) (i = 1, ..., L; j ∈ Σ). When we use the natural logarithm, the score unit is called a "nat"; for log2, it is a "bit", and for log10, it is called a "dit". Thus Score(W) = Σ_{i=1}^{L} S_{i,W_i}. A value of Score(W) ≫ 0 is interpreted as strong evidence for W being an instance of the signal modeled by P.

Treating ambiguity characters. In practice, DNA and protein sequences may contain ambiguous characters, representing a subset of the alphabet, as specified by the IUPAC code. We may extend the score matrix S as follows. Suppose that k ∉ Σ is an ambiguity symbol representing a subset Σ' ⊂ Σ. Then by definition S_ik = log(P_{P_i}(Σ')/P_{Π_i}(Σ')) = log((Σ_{j∈Σ'} P_ij)/(Σ_{j∈Σ'} π_j)). It immediately follows that an ambiguity character k representing Σ itself, such as N in the DNA alphabet, always scores zero because P(Σ) = 1 under any model, and log(1/1) = 0. This score is intuitive; such a character does not contain any information.

Occasionally, the average score is used instead: S_ik = (Σ_{j∈Σ'} S_ij)/|Σ'|. While this is the wrong score from a statistical point of view, it is easier to compute when only the score matrix (S_ij) is available and P or π have been "forgotten".
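A hedged sketch of the score matrix construction and window scoring, with an invented two-position profile and a uniform background; the ambiguity score follows the definition above:

```python
import math

ALPHABET = "ACGT"
pi = [0.25, 0.25, 0.25, 0.25]          # i.i.d. background distribution
P = [                                  # invented two-position profile
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.4, 0.4, 0.1],
]

# S_ij = log(P_ij / pi_j), here in nats (natural logarithm).
S = [[math.log(p / q) for p, q in zip(row, pi)] for row in P]

def score(word):
    """Log-odds score of a window: sum of position-specific scores."""
    return sum(S[i][ALPHABET.index(c)] for i, c in enumerate(word))

def ambiguity_score(i, subset):
    """Score of an ambiguity symbol at position i: log of the subset's
    probability under P over its probability under pi."""
    p = sum(P[i][ALPHABET.index(c)] for c in subset)
    q = sum(pi[ALPHABET.index(c)] for c in subset)
    return math.log(p / q)

print(score("AC"))                     # log(0.7/0.25) + log(0.4/0.25)
print(ambiguity_score(0, "ACGT"))      # N represents all of Sigma: scores 0
```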
Rounding the score matrix. For computational purposes, we round all scores to a certain granularity ε ≪ 1 (for example ε = 0.01). Thus we compute rounded scores S'_ij := ε · rd(S_ij/ε), where rd(x) is the closest integer to x. When ε is small enough, this leaves the scores unchanged for all practical purposes. In what follows, we make no distinction between S and S', assuming that the rounding errors are negligible, and write S for the rounded score matrix. After rounding, every score we encounter during the computation is an integer multiple of ε. Let

S_min := min_{W∈Σ^L} Score(W) = Σ_{i=1}^{L} min_{j∈Σ} S_ij,
S_max := max_{W∈Σ^L} Score(W) = Σ_{i=1}^{L} max_{j∈Σ} S_ij

denote the minimum and maximum score over all words of length L. Both values can be computed in O(L|Σ|) time; they are finite and not far away from zero, assuming proper regularization of P. Now there are at most 1 + (S_max − S_min)/ε different values of Score(W); these are Γ = {S_min + γε | γ = 0, ..., (S_max − S_min)/ε}. When L is small, |Σ|^L may be a tighter bound on the number of different score values. Then there is no use in choosing ε smaller than (S_max − S_min)/(|Σ|^L − 1).
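The rounding and the row-wise computation of the score bounds can be sketched as follows; the matrix entries are invented:

```python
# Toy rounded score matrix (invented values) with granularity eps = 0.01.
eps = 0.01
S = [
    [1.03, -1.39, -1.39, -0.92],
    [-0.92, 0.47, 0.47, -1.39],
]

def round_scores(S, eps):
    """S'_ij = eps * rd(S_ij / eps), rd = nearest integer."""
    return [[eps * round(s / eps) for s in row] for row in S]

S = round_scores(S, eps)
s_min = sum(min(row) for row in S)   # minimum score over all length-L words
s_max = sum(max(row) for row in S)   # maximum score over all length-L words
# Every reachable score lies in {s_min + gamma * eps}, so there are at most:
num_values = 1 + round((s_max - s_min) / eps)
print(s_min, s_max, num_values)
```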
2.5 Classification and Statistical Testing
To find a signal of length L in a sequence of length n + L − 1, we slide a length-L window over the sequence. For each of the n windows, we must decide whether it represents an occurrence of the signal, i.e., we classify the window as either “positive” (signal) or “negative” (background). The classification can be either correct or wrong. Most windows do not represent an instance of the signal, and if we classify such a window correctly, we have a true negative (TN). A wrong decision in this situation is called a false positive (FP) or type-I error. If the window does represent an instance of the biological signal and we classify it correctly, we have a true positive (TP). If we make the wrong decision in this case, we have a false negative (FN), also called type-II error. These terms are hard to apply in practice, of course, since the biological truth is unknown to us.
Therefore we use generative probabilistic models, the profiles P and Π, to describe signal and background. This allows us to cast the classification problem into a statistical testing framework and base the classification on an appropriate test statistic (the log-odds score (1) is optimal according to the Neyman-Pearson lemma). For each window W, we test the null hypothesis "W is generated by the background profile Π" against the alternative "W is generated by the signal profile P" using the test statistic Score(W). The null hypothesis is rejected if Score(W) ≥ t for an appropriate score threshold t. Words W with Score(W) ≥ t can be generated by both P and Π, albeit with different probabilities. This allows us to compute type-I and type-II error probabilities for the statistical classification problem.

Evidently, these numbers may be less meaningful for the biological classification problem. This does not render the statistical results useless, though. They allow us to assign a quality value to the profile as a search tool, and this is our goal. The error probabilities do not apply directly to the biological results that are derived from the application of the classification procedure; the biological truth can only be determined by wet-lab experiments. It is important to be aware of this distinction, because we use the above terms TP, FP, TN, and FN, and type-I and type-II error probabilities for the statistical testing problem from now on.
3 Exact Profile Statistics
Consider a random sequence W of length L with values in Σ^L. There are two random models for W, one specified by the signal profile P, and the other one specified by the background profile Π. We write X for the random variable Score(W). There are two probability measures associated with X: P_P when W ∼ P ("W is generated by P") and P_Π when W ∼ Π. We write X as a sum of independent variables X = Σ_{i=1}^{L} X_i, where X_i is the random score obtained from the i-th random letter W_i, which is distributed according to P_i (the i-th row of P) under P_P, or according to Π_i = π under P_Π.
3.1 Expected Scores
We compare the expected scores E_P[X] and E_Π[X]. Their difference can be taken as a first coarse measure of profile quality (see Section 3.4).

We have E_P[X_i] = Σ_{j∈Σ} P_ij S_ij = Σ_{j∈Σ} P_ij log(P_ij/π_j) = H(P_i ∥ π), where H(P_i ∥ π) denotes the relative entropy between the distributions P_i and π. A well known property is that a relative entropy is always nonnegative (Cover and Thomas, 1991). We define the relative entropy H(P ∥ Π) between two profiles P and Π of length L by H(P ∥ Π) := Σ_{i=1}^{L} H(P_i ∥ Π_i). Now E_P[X] = Σ_{i=1}^{L} E_P[X_i] = H(P ∥ Π) ≥ 0. Hence for W ∼ P we expect a positive score.

Likewise, E_Π[X_i] = Σ_{j∈Σ} π_j S_ij = Σ_{j∈Σ} π_j log(P_ij/π_j) = −Σ_{j∈Σ} π_j log(π_j/P_ij) = −H(π ∥ P_i), and hence E_Π[X] = −H(Π ∥ P) ≤ 0 (in general H(Π ∥ P) ≠ H(P ∥ Π), but both are nonnegative). Thus for W ∼ Π we expect a negative score.
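The two expected scores can be computed directly from the profile rows; a minimal sketch with an invented two-position profile and uniform background:

```python
import math

def relative_entropy(p, q):
    """H(p || q) = sum_j p_j * log(p_j / q_j) in nats; 0*log(0) = 0."""
    return sum(pj * math.log(pj / qj) for pj, qj in zip(p, q) if pj > 0)

pi = [0.25, 0.25, 0.25, 0.25]          # background distribution
P = [                                  # invented two-position profile
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.4, 0.4, 0.1],
]

E_P = sum(relative_entropy(row, pi) for row in P)     # E_P[X] = H(P || Pi) >= 0
E_Pi = -sum(relative_entropy(pi, row) for row in P)   # E_Pi[X] = -H(Pi || P) <= 0
Q_H = E_P - E_Pi                                      # first coarse quality measure
print(E_P, E_Pi, Q_H)
```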
3.2 The Exact Distribution of the Score
The purpose of this section is to show how to conveniently compute the probability mass function of X under any probability measure given by a profile in O(|Σ|L|Γ|) = O(|Σ|L²R/ε) time, where L is the profile length, ε is the rounding granularity, and R is the maximal score range within all rows of the score matrix, R = max_i {max_j S_ij − min_j S_ij}. We emphasize that no distributional assumptions, and especially no asymptotic theory, are involved at this point.

We describe the computation for probability measure P_P; the computation for P_Π or any other profile-induced probability measure is similar. Our exposition follows mainly that of Staden (1989); however, in that reference the score is scaled by a factor f > 1 and then rounded to granularity ε = 1, whereas we do not re-scale but allow arbitrary granularity.

From the discussion about rounding in Section 2.4, recall that X takes values in Γ = {S_min + γε | γ = 0, ..., (S_max − S_min)/ε}. We define f(x) := P_P(X = x) for x ∈ Γ and f(x) := 0 for x ∉ Γ. As X = Σ_{i=1}^{L} X_i, we define f_i(x) := P_P(X_i = x) = Σ_{j: S_ij = x} P_ij. Note that f_i(x) is only nonzero for x ∈ {S_ij | j ∈ Σ}, that is, for at most |Σ| different values.

We also define the partial sums X^k := Σ_{i=1}^{k} X_i and their probability mass functions f^k(x) := P(X^k = x) for x ∈ Γ. For x ∉ Γ, we set f^k(x) := 0. It follows that X^1 = X_1, X^{k+1} = X^k + X_{k+1}, X = X^L, f^1 = f_1 and f = f^L. Since the letters W_i are independent, so are X^k and X_{k+1} for all k = 1, ..., L − 1. Thus X^{k+1} is the sum of two independent random variables with distributions f^k and f_{k+1}. It follows that f^{k+1} is given by the convolution f^k ∗ f_{k+1},

f^{k+1}(x) = (f^k ∗ f_{k+1})(x) = Σ_{z∈Γ} f^k(x − z) · f_{k+1}(z).
For the practical computation, we store the current f^k as a vector v = (v_γ) with at most |Γ| elements (γ = 0, ..., (S_max − S_min)/ε), such that v_γ := f^k(S_min + γε). Note that no partial sum X^k can take values outside Γ, as we have min_j S_ij ≤ 0 and max_j S_ij ≥ 0 for all i (since E_Π[X_i] ≤ 0, there must be a non-positive score, and since E_P[X_i] ≥ 0, there must be a nonnegative score). Thus a vector with |Γ| elements is sufficient for all steps of the computation.

Running Time. Since f_{k+1}(z) is nonzero for at most |Σ| values of z, each density f^{k+1}(x) (x ∈ Γ) can be computed in O(|Σ||Γ|) time. The convolution is associative and commutative. Thus f = f_1 ∗ f_2 ∗ ... ∗ f_L; the computation consists of L − 1 convolutions, so the total time needed is O(L|Σ||Γ|). Note that |Γ| also increases linearly with L. Let R_i := max_j S_ij − min_j S_ij be the score range at position i. When the largest score range max_i R_i stays bounded by a constant R, we have |Γ| ≤ LR/ε. Therefore the total time is bounded by O(L²R|Σ|/ε).

While this method was mentioned previously by Staden (1989) and Claverie and Audic (1996), it is apparently not used extensively. In particular, to our knowledge, nobody has yet directly computed the power of a profile, i.e., the distribution under P_P. For the computation of significance values, it has become common practice to make unwarranted assumptions such as that of a Gaussian distribution of X.
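The convolution method can be sketched in a few lines of Python; this is a simple illustration in the spirit of Staden's method, not the authors' implementation, and the two-position profile is invented. Scores are kept as integer multiples of ε so that dictionary keys are exact:

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"
pi = [0.25] * 4                        # i.i.d. background distribution
P = [[0.7, 0.1, 0.1, 0.1],             # invented signal profile, L = 2
     [0.1, 0.4, 0.4, 0.1]]
S = [[math.log(p / q) for p, q in zip(row, pi)] for row in P]

def score_distribution(S, profile, eps=0.01):
    """Exact probability mass function of the window score under `profile`
    by convolving the per-position score distributions, one position at
    a time (f^{k+1} = f^k * f_{k+1})."""
    dist = defaultdict(float)
    dist[0] = 1.0                      # empty partial sum X^0 = 0
    for srow, prow in zip(S, profile):
        nxt = defaultdict(float)
        for x, px in dist.items():
            for s, p in zip(srow, prow):
                nxt[x + round(s / eps)] += px * p
        dist = nxt
    return {eps * k: v for k, v in dist.items()}

f = score_distribution(S, P)              # distribution under the signal P
g = score_distribution(S, [pi] * len(P))  # ... and under the background Pi
mean_f = sum(x * p for x, p in f.items())
mean_g = sum(x * p for x, p in g.items())
print(mean_f, mean_g)                  # positive under P, negative under Pi
```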
3.3 Error Probabilities
The type-I window error probability α(t) is the probability of observing a score of at least t, given that the window is generated by the background model. The type-I sequence error probability α_n(t) is the probability that at least one out of n consecutive overlapping windows scores at least t, under the assumption that the whole sequence is generated by the background model. Since the type-I window error can be made many times in a sequence, the type-I sequence error probability is much higher. A threshold t for which α_n(t) ≤ 0.05 or ≤ 0.01 is usually called significant for the sequence. For a given observed score s, α(s) is also called the window p-value, and α_n(s) the sequence p-value of s.

The type-II error probability β(t) is the probability of observing a score below t when the window was generated by the signal profile. Note that it does not make sense to speak of a type-II sequence error probability, because the whole sequence cannot be generated by one and the same profile. We define the m-instance type-II error probability β_m(t) as the probability that at least one of m independent instances scores less than t, under the assumption that all of them were generated by the profile. The power of the m-instance test is defined as 1 − β_m(t).

While high significance (a small p-value) is important to distinguish signal from noise, it does not contain any information about the power. Indeed, this is a major problem with the current practice of determining profile score thresholds: With a high threshold, few positive decisions are made in general, and therefore there are also few false positives, and the type-I error probability is low. At the same time, there are many negative classifications overall, implying a higher type-II error probability. In Section 4, we shall see that many of the profiles used for transcription factor binding site detection have little power when a significant score threshold is chosen.
We now describe how to compute the above-mentioned window and sequence error probabilities.

Window error probabilities. Recall that f(x) = P_P(X = x) for x ∈ Γ and f(x) = 0 for x ∉ Γ. Similarly, we define g(x) := P_Π(X = x) for x ∈ Γ and g(x) := 0 for x ∉ Γ. Then we have

α(t) := P_Π(X ≥ t) = Σ_{s∈Γ∩[t,S_max]} g(s),    β(t) := P_P(X < t) = Σ_{s∈Γ∩[S_min,t−ε]} f(s).
So the window error probabilities are right- and left-cumulative distribution functions of the densities g and f , respectively. The densities are computed with the convolution method described in Section 3.2. Type-I sequence error probability. Let w = (w1 , . . . , wn+L−1 ) be a sequence of length n + L − 1, containing n windows of length L. Let W (i) := (wi , . . . , wi+L−1 ) denote the
sequence of the i-th window (i = 1, ..., n), and let X(i) be a random variable describing the score of W(i), assuming that the whole sequence w is generated according to the background distribution π. Then

α_n(t) := P( max_{i=1,...,n} X(i) ≥ t ).

Exact computation of α_n(t) is difficult because consecutive scores X(i) are not generally independent, as successive windows overlap by L − 1 characters. The expected number of windows for which the score is at least t is easily computed as λ(t) := n · α(t), however. We approximate α_n(t) by treating the X(i) as if they were independent; this is standard practice and corresponds to treating the number of windows for which the score is at least t as a Poisson random variable with mean λ(t):

α_n(t) = P(X(i) ≥ t for at least one i ∈ {1, ..., n}) = 1 − P(X(i) < t for all i ∈ {1, ..., n}) ≈ 1 − (1 − α(t))^n ≈ 1 − exp(−nα(t)).

The following lemma states that this approximation is accurate when n is large and α(t) is small such that nα(t) ≪ 1. It can be proved using a variant of the Chen-Stein Theorem (e.g., see Barbour et al. 1992).

Lemma. Assume that n ≥ 2L and λ(t) := nα(t) < 1. Then

|α_n(t) − [1 − exp(−λ(t))]| ≤ 2 L n α(t) · (α(t) + c),    (2)

where c := max_{j=2,...,L} P_Π(X(j) ≥ t | X(1) ≥ t).
Proof. Let I_i be the indicator variable of the event that Score(W(i)) ≥ t. Let Z := Σ_{i=1}^{n} I_i be the number of windows whose score reaches at least t. Then α_n(t) = P_Π(Z ≥ 1), λ(t) = E_Π[Z], and α(t) = E_Π[I_i] for any i. A variant of the Chen-Stein Theorem (see Barbour et al. 1992 for a detailed exposition) states that

|α_n(t) − [1 − exp(−λ(t))]| ≤ min{1, 1/λ(t)} · (b_1 + b_2),

where b_1 and b_2 are error terms defined in terms of "dependence neighborhoods" N_i: Let N_i be an index set such that I_i is independent of I_j whenever j ∉ N_i. The index i itself is always excluded from N_i. Then

b_1 := Σ_{i=1}^{n} E[I_i] Σ_{j∈N_i} E[I_j],  and  b_2 := Σ_{i=1}^{n} E[I_i] Σ_{j∈N_i} E[I_j | I_i = 1].
To use this device, first note that by assumption λ(t) < 1, hence 1/λ(t) > 1 and min{1, 1/λ(t)} = 1.
Next, for each i, the neighborhood N_i consists of at most those 2(L − 1) indices j where W(j) has at least one common letter with W(i). Since E[I_i] = α(t) for all i, we have b_1 ≤ 2(L − 1) n · α(t)² ≤ 2 L n α(t)².

Finally, b_2 = Σ_{i=1}^{n} α(t) Σ_{j∈N_i} E[I_j | I_i = 1]. Let c := max_{i,j} E[I_j | I_i = 1]. Then b_2 ≤ 2 L n c · α(t). Because of symmetry, we can also write c = max_{j=2,...,L} P_Π(X(j) ≥ t | X(1) ≥ t). Putting these observations together gives the stated result.

In practice, profile score matrices tend to be "aperiodic", meaning that the occurrence of a signal at some position i makes it less likely that the signal appears again within the L-neighborhood of position i. In that case we have c < α(t) and the practical error bound |α_n(t) − [1 − exp(−λ(t))]| ≤ 4 L n · α(t)², unless the profile is "periodic" or λ(t) approaches 1. Profile periodicity is not yet extensively studied; see Sinha and Tompa (2000) for significance computations with periodic patterns allowing wildcards, and Rivals and Rahmann (2001) for a combinatorial view of periods in strings.

The m-instance type-II error probability. When in reality m instances of a signal are present, we are interested in the m-instance type-II error β_m(t), the probability that at least one of them is missed (has a score below t). We may assume that the true instances are scarce and non-overlapping, and hence independent. Let X(i) denote the score achieved by the i-th instance. Then β_m(t) = P_P(X(i) < t for at least one i) = 1 − (1 − β(t))^m. If mβ(t) ≪ 1, we have approximately β_m(t) ≈ mβ(t).
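The window and sequence error probabilities described in this section can be sketched together; the two densities below are hypothetical toy numbers, not derived from a real profile:

```python
import math

# Toy window-score densities on a common score grid (hypothetical numbers):
# f under the signal model P, g under the background model Pi.
f = {-1.0: 0.05, 0.5: 0.15, 2.0: 0.80}
g = {-1.0: 0.90, 0.5: 0.099, 2.0: 0.001}

def alpha(t):
    """Type-I window error: P_Pi(X >= t), right tail of g."""
    return sum(p for x, p in g.items() if x >= t)

def beta(t):
    """Type-II window error: P_P(X < t), left tail of f."""
    return sum(p for x, p in f.items() if x < t)

def alpha_seq(t, n):
    """Poisson approximation of the type-I sequence error:
    alpha_n(t) ~ 1 - exp(-n * alpha(t))."""
    return 1.0 - math.exp(-n * alpha(t))

def beta_minst(t, m):
    """m-instance type-II error: 1 - (1 - beta(t))^m."""
    return 1.0 - (1.0 - beta(t)) ** m

t = 2.0
print(alpha(t), beta(t))               # 0.001 and 0.2
print(alpha_seq(t, n=500))             # 1 - exp(-0.5), about 0.393
print(beta_minst(t, m=1))              # 0.2
```

Note how a window p-value of 0.001, which looks highly significant in isolation, becomes a sequence p-value of about 0.39 over n = 500 windows.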
3.4 Quality Measures of Profiles
We introduce several quality measures Q of a profile. While the measures are based on different ideas, they serve a common purpose: to quantify how well a given profile P is separated from a background Π. All measures Q are defined such that high values indicate good separation. As before, we assume that the optimal log-odds score matrix S is used. Recall that X denotes the score of a window W of length L.

Difference of Expected Scores. This measure is easy to compute, even when the distribution of X under P or Π is unavailable. We have (Section 3.1) QH := EP[X] − EΠ[X] = H(P ‖ Π) + H(Π ‖ P). Note that QH does not make a statement about any error probabilities.

Sensitivity. Pick a sequence length n ≥ 1 (more precisely, the number of length-L windows in the sequence) and an instance number m ≥ 1. For a given type-I sequence error probability α = αn(t), we find the associated score threshold t and the induced m-instance type-II error βm(t). Then the power Qsens(α; n, m) := 1 − βm(t) measures the sensitivity of the profile, i.e., its ability to detect a true signal instance at the given level of significance.

Selectivity. Symmetrically to the sensitivity measure, we define the selectivity measure Qsel(β; n, m) as 1 − αn(t), where αn(t) is the type-I sequence error at the threshold t where the m-instance type-II error βm(t) equals the given value β.

Error Balance. We plot type-II against type-I error probability, that is, we plot the point set {(αn(t), βm(t)) | t ∈ Γ}, to obtain the receiver operating characteristic (ROC) curve. We are interested in the point on this curve where the error probabilities are balanced (αn(t) = βm(t)). More generally, if we consider αn to be c times more important than βm, we find the point on the ROC curve where βm = cαn. Thus we set Qbal(c; n, m) := 1 − βm(t), where t is such that βm(t) = cαn(t). Since t takes values in the discrete set Γ, this condition is usually not met exactly. However, we can find the largest value t− where cαn(t−) ≥ βm(t−) and the smallest value t+ where cαn(t+) ≤ βm(t+). Then either t− = t+ or t− + ε = t+, and we can evaluate the power at either point.

The four quality measures QH, Qsens, Qsel, and Qbal highlight different aspects of profile quality. The latter three are visualized in Figure 2 for a very low-quality profile. None of these measures is better than the others, and their values are not directly comparable. Usually, though, if one measure is high, so are the others; see Figure 6 for examples. Note that the following symmetry relation holds: For every quality level q ∈ [0, 1], we have the equivalence Qsens(1 − q; n, m) ≥ q ⟺ Qsel(1 − q; n, m) ≥ q ⟺ Qbal(1 − q; n, m) ≥ q. A profile with Qbal(0.05; n, m) ≥ 0.95 is called a high-quality profile; both errors are at most 5%. In the quality evaluation that follows, we use n = 500 and m = 1 (see Section 4).
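Given discretized score distributions under background and signal, the error curves and the sensitivity measure can be computed as sketched below. This is our illustration (all names are ours); the actual computation in the paper uses the exact convolution distributions of Section 3.2:

```python
from math import exp

def error_curves(scores, p_bg, p_sig, n=500, m=1):
    """For each threshold t on the ascending score grid `scores`,
    return (t, alpha_n(t), beta_m(t)), where
    alpha(t) = P_bg(X >= t), beta(t) = P_sig(X < t),
    alpha_n = 1 - exp(-n * alpha)  (Poisson approximation), and
    beta_m = 1 - (1 - beta)^m."""
    out = []
    below = 0.0          # P_sig(X < t); grows as t increases
    tail = sum(p_bg)     # P_bg(X >= t); shrinks as t increases
    for t, pb, ps in zip(scores, p_bg, p_sig):
        alpha_n = 1.0 - exp(-n * tail)
        beta_m = 1.0 - (1.0 - below) ** m
        out.append((t, alpha_n, beta_m))
        tail -= pb
        below += ps
    return out

def q_sens(curves, alpha=0.05):
    """Power 1 - beta_m at the lowest threshold whose sequence
    type-I error does not exceed the given level alpha."""
    ok = [(a, b) for _, a, b in curves if a <= alpha]
    return 1.0 - min(ok, key=lambda ab: ab[1])[1] if ok else 0.0
```

Qsel and Qbal follow the same pattern: scan the curve for the threshold satisfying βm ≤ β, or for the pair of neighboring thresholds bracketing βm = cαn.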
3.5 Choosing a Score Threshold
Current practice (Claverie and Audic, 1996) chooses t such that the sequence p-value αn (t) is set to a fixed level α (e.g., 0.05 for a sequence of length n = 500). The power of the resulting classification (this is the quality measure Qsens (α; n, 1)) is not taken into account with this method. Therefore, by enforcing a certain level of significance, we cannot control how many of the true instances we miss. On the other hand, controlling the type-II error β = 0.05, say, will lead to many false positives in general. A reasonable alternative is to balance the errors: First choose reasonable values of n and m, such as n = 500 and m = 1, and a relative importance weight c > 0, meaning that we consider αn to be c times more important than βm . Then find t such that cαn (t) = βm (t), or the neighboring values t− and t+ as described in the previous section. It is natural to let the whole procedure be driven by the data as follows. Assume that the sequence to be searched with a PSSM contains nucleotide x exactly Nx times
[Figure 1 here. Top panel: score distribution for profile AhR (V$AHR-Q5), aryl hydrocarbon / dioxin receptor; min score −32.15, max score 9.85, granularity 0.05, QH = 26.5. Bottom panel: score distribution for profile cap (V$CAP-01), cap signal for transcription initiation; min score −25.45, max score 5.35, granularity 0.05, QH = 14.75.]
Figure 1: Top: Signal and background score distribution of a high-quality TFBS profile; the overlap between the distributions is small. Bottom: Score distribution of a low-quality profile, as indicated by the large overlap. In both cases, the background distribution does not follow a Gaussian distribution. Results based on such an assumption may therefore produce wrong significance or power estimates. The direct convolution approach we use does not rely on distributional assumptions and gives exact p-values and power values.
[Figure 2 here: ROC curve for profile cap (V$CAP-01), cap signal for transcription initiation; Qsens(0.05; 500, 1) = 0.015, Qbal(1; 500, 1) = 0.228, Qsel(0.05; 500, 1) = 0.]
Figure 2: The ROC curve shows how the type-II error βm varies with the type-I error αn. The dotted lines indicate the value of βm where αn = 0.05, the value of αn where βm = 0.05, and the point where αn = βm.

(x ∈ Σ = {A, C, G, T}), and let N = Σ_{x∈Σ} Nx be the length of the sequence. Then its GC content is χ := (NC + NG)/N. The most reasonable choice for the background distribution preserving Watson-Crick symmetry is then π := ((1 − χ)/2, χ/2, χ/2, (1 − χ)/2). This choice defines the background matrix Π and the score matrix S. From Section 3.2 we obtain the densities of the score distribution. As described in Section 3.3, we compute the window error probabilities and, using the Poisson approximation, also the sequence error probability.
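The data-driven choice of the background distribution described above is a one-liner; the following sketch (function name is ours) derives the Watson-Crick-symmetric π from the GC content of the sequence to be searched:

```python
def background_from_sequence(seq: str):
    """Return the Watson-Crick-symmetric background distribution
    (pi_A, pi_C, pi_G, pi_T) derived from the GC content chi of the
    query sequence: pi = ((1-chi)/2, chi/2, chi/2, (1-chi)/2)."""
    seq = seq.upper()
    chi = (seq.count("C") + seq.count("G")) / len(seq)
    return ((1 - chi) / 2, chi / 2, chi / 2, (1 - chi) / 2)
```

For a sequence with GC content 2/3, this yields π = (1/6, 1/3, 1/3, 1/6), the GC-rich background used in Section 4.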
4 Profile Quality in the TRANSFAC Database
We apply the theory of Section 3 to the 623 count matrices provided by the TRANSFAC database (Wingender et al., 2000), release 7.1. We are thinking of a user who scans the upstream regions of many (predicted) genes of a particular organism for potential transcription factor binding sites. The TRANSFAC database comes with its own search tool MatInspector (Quandt et al., 1995). For each sequence window, this program computes a matrix similarity measure s = (S − Smin)/(Smax − Smin) ∈ [0, 1], where S is a modified log-odds score, and Smin and Smax denote its minimal and maximal possible values. A part
Figure 3: Distribution of profile sensitivity Qsens(0.05; 500, 1) among TRANSFAC profiles. Top: Uniform background model. Bottom left: AT-rich (2/3) background. Bottom right: GC-rich (2/3) background.

of the profile is called the "core region"; it is expected to be more conserved than the rest of the profile. Because of the different scoring techniques, our findings in this section are not directly applicable to MatInspector performance. However, since our regularization technique (Section 2.3) also respects well-conserved "core" regions more strongly than less conserved positions, we may assume that there is approximately a monotone mapping between the natural log-odds window scores and the MatInspector similarity values; thus one could attempt to convert similarity values to log-odds scores for each profile individually. We find that the normalization to the interval [0, 1] is artificial and rather complicates things. The Neyman-Pearson lemma guarantees that the log-odds score optimally separates the profile P from the background Π, so we will continue using log-odds scores.
Figure 4: Distribution of profile selectivity Qsel (0.05; 500, 1) among TRANSFAC profiles. Top: Uniform background model. Bottom left: AT-rich (2/3) background. Bottom right: GC-rich (2/3) background.
4.1 Quality Distribution
To find the quality distribution of the TRANSFAC profiles, we proceed as follows. Each of the 623 count matrices is regularized with the procedure described in Section 2.3. To compute score matrices, we use three different background distributions: the uniform distribution πuni = (1/4, 1/4, 1/4, 1/4), a GC-rich distribution πGC = (1/6, 2/6, 2/6, 1/6), and an AT-rich distribution πAT = (2/6, 1/6, 1/6, 2/6). Naturally, GC-rich profiles are harder to detect in GC-rich backgrounds, and vice versa. We obtain 1869 score matrices, three for each profile. As a first result, we observe that it is a good idea to compute the score distributions by the convolution method; the frequent assumption that the background score follows a Gaussian-like distribution is often violated, as shown for two profiles in Figure 1. In
Figure 5: Distribution of Qbal(0.05; 500, 1) among TRANSFAC profiles. Top: Uniform background model. Bottom left: AT-rich (2/3) background. Bottom right: GC-rich (2/3) background.

fact, this is not surprising: if the profile consists of a single position, we have a discrete distribution on at most four different score values; for two positions, there are at most 16 different score values with nonzero probability. An irregular shape and multi-modality of the distribution should therefore be expected.

From now on, we work with the parameters n = 500 and m = 1. While these are somewhat arbitrary, they can be taken as reasonable for the upstream region of a single gene. Note that the quality assessment in this section does not take into account the multiple-testing issues that arise because usually many TFBS profiles are searched against many upstream sequences; we come back to this issue in the Discussion. Using a classically significant score threshold (α500 ≤ 0.05), most of the profiles have little power to detect a true TFBS instance, as can be seen from the histograms in
Figure 3. In a uniform background, only 139 of the 623 profiles (22%) are high-quality profiles satisfying Qsens(0.05; 500, 1) ≥ 0.95. The result is slightly better for AT-rich and GC-rich backgrounds: for GC content 1/3, there are 176 high-quality profiles (28%), and for GC content 2/3, there are 218 (35%). For 120 profiles (19%), the high-quality condition holds in all three background models, and 265 profiles (43%) are high-quality in at least one of the three. If we instead determine the threshold in such a way that β ≤ 0.05 (detection power 0.95), most of the profiles are no longer very selective, as the histograms of Qsel(0.05; 500, 1) in Figure 4 show. Note that the Poisson approximation from Eq. (2) eventually breaks down once Qsel < 0.9 (αn > 0.1), say; we may then not obtain the exact value of Qsel, but can still be confident that the quality is not high. Finally, Figure 5 shows a histogram of the detection power Qbal(1; 500, 1) when both errors are equal. The quality measures QH, Qsens, Qsel, and Qbal correlate well in general, as Figure 6 shows. Even though the correspondence is not linear, there is an approximately monotone relation between the different quality values; especially the high quality values ≥ 0.95 correlate very well.
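All of these quality values rest on the exact score distributions obtained by the convolution method of Section 3.2: the window score is a sum of independent per-position scores, so its distribution follows by repeated discrete convolution on a rounded score grid. A minimal sketch (our illustration, not the authors' PERL implementation):

```python
def score_distribution(S, pi, eps=0.05):
    """Exact distribution of the window score X = sum_i S[i][x_i]
    under an i.i.d. letter model with probabilities `pi`, by direct
    convolution on a grid of granularity eps (each position's scores
    are rounded to integer multiples of eps). Returns a dict mapping
    rounded scores to probabilities; overall cost O(|Sigma| L^2 R/eps),
    matching the complexity quoted in the Discussion."""
    dist = {0: 1.0}                       # keys: integer multiples of eps
    for row in S:                         # one profile position at a time
        nxt = {}
        for s, p in dist.items():
            for score, q in zip(row, pi):
                k = s + round(score / eps)
                nxt[k] = nxt.get(k, 0.0) + p * q
        dist = nxt
    return {k * eps: p for k, p in dist.items()}
```

The signal distribution is obtained analogously by replacing π at position i with the profile's own column distribution P(i).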
4.2 Quality Ranking
To provide an overview of the quality of the profiles under several background models, we ranked the high-quality profiles (Qbal ≥ 0.95) by their quality value Qbal, and the remaining profiles by their sensitivity Qsens at α500 = 0.05. Table 1 shows the results. Of course, other criteria would also be reasonable, and different rankings can easily be obtained from the list of profile quality values.
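The ranking rule can be stated compactly; the record layout below is an assumption for illustration only:

```python
def rank_profiles(profiles):
    """Rank profiles as described above: high-quality profiles
    (Qbal >= 0.95) first, ordered by decreasing Qbal; the remaining
    profiles follow, ordered by decreasing Qsens at alpha_500 = 0.05.
    `profiles` is a list of (name, qbal, qsens) tuples."""
    hq = sorted((p for p in profiles if p[1] >= 0.95),
                key=lambda p: -p[1])
    rest = sorted((p for p in profiles if p[1] < 0.95),
                  key=lambda p: -p[2])
    return hq + rest
```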
5 Discussion and Conclusion
We have presented several ideas on how to use sequence profiles optimally for transcription factor binding site detection. Our first contribution is a new regularization method that removes deviations from the regularizing distribution that can be explained by sampling error alone, but maintains signals that are genuinely present in the data. When 5 observations are made, for example, observing 2 As, 1 C, 1 G, and 1 T is not a significant deviation from the uniform distribution, but observing 5 As is. Next, we have shown that for profiles, both the signal and the background score distribution are efficiently computable in O(|Σ| L² R/ε) time, where |Σ| is the alphabet size, L the profile length, R the maximal score range over all positions, and ε the rounding granularity of the scores. We point out that we make no a-priori assumptions about the distribution. For many of the profiles we consider, the background score distribution turns out to be very different from a Gaussian distribution; yet it is common practice to make this assumption, because the score is a sum of independent random variables.
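For illustration, the significance of such a deviation can be assessed with a generic exact multinomial test; this sketch is ours and is not necessarily the exact regularization procedure of Section 2.3, which is not reproduced here:

```python
from math import comb
from itertools import product

def multinomial_pvalue(counts, p):
    """Exact multinomial test: probability, under the regularizing
    distribution p, of an outcome at most as likely as the observed
    counts. Small p-values indicate a significant deviation from p."""
    n = sum(counts)

    def pmf(c):
        # multinomial pmf via products of binomial coefficients
        prob, rem = 1.0, n
        for ci, pi in zip(c, p):
            prob *= comb(rem, ci) * pi ** ci
            rem -= ci
        return prob

    p_obs = pmf(counts)
    total = 0.0
    for c in product(range(n + 1), repeat=len(p)):
        if sum(c) != n:
            continue
        q = pmf(c)
        if q <= p_obs + 1e-12:   # tolerance for float ties
            total += q
    return total

uniform = (0.25, 0.25, 0.25, 0.25)
p1 = multinomial_pvalue((2, 1, 1, 1), uniform)   # not significant
p2 = multinomial_pvalue((5, 0, 0, 0), uniform)   # significant at 5%
```

Under the uniform distribution, (2, 1, 1, 1) is the most likely composition of 5 draws, so its p-value is large, whereas the four all-one-letter outcomes together carry probability 4/1024 ≈ 0.004.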
Figure 6: Pairwise correlations of the four profile quality measures from Section 3.4 for the 623 TRANSFAC profiles under a uniform background model. Each dot represents one profile. While there are in principle 4 × 4 comparisons to be made, we show only the upper right triangle of the correlation graph matrix. A graph in the lower triangle (e.g., QH vs. Qsens) contains the same information as its shown counterpart (Qsens vs. QH); the graphs on the diagonal (e.g., QH vs. QH) are uninformative. Top: All profiles. Bottom: Zoom into high-quality profiles.
33% GC background            50% GC background            67% GC background
Name            Qbal         Name            Qbal         Name            Qbal
V$HOGNESS_B     1.0000       V$HOGNESS_B     1.0000       V$HOGNESS_B     1.0000
V$P53_01        1.0000       V$P53_01        0.9998       V$MEF2_04       0.9998
V$NRSF_01       0.9998       V$LUN1_01       0.9997       V$LUN1_01       0.9998
V$LUN1_01       0.9998       V$NRSF_01       0.9994       V$P53_01        0.9998
P$ABF1_02       0.9998       V$BRACH_01      0.9992       V$BRACH_01      0.9997
P$ABF1_03       0.9994       V$PAX1_B        0.9991       V$PAX1_B        0.9995
V$PAX1_B        0.9994       V$PPARG_01      0.9989       V$PPARG_01      0.9994
V$BRACH_01      0.9994       V$MEF2_04       0.9985       V$PPARG_02      0.9993
V$PPARG_01      0.9993       V$PPARG_02      0.9985       P$O2_01         0.9992
V$ROAZ_01       0.9991       P$ABF1_02       0.9984       V$NRSF_01       0.9992
V$PPARG_02      0.9988       P$O2_01         0.9977       V$EVI1_01       0.9991
P$ABF1_01       0.9988       V$STAT3_01      0.9976       I$dTCF_1        0.9991
V$STAT3_01      0.9986       V$SRF_01        0.9970       V$SRF_01        0.9990
V$TANTIGEN_B    0.9986       V$ROAZ_01       0.9970       V$STAT3_01      0.9985
V$NRSE_B        0.9984       I$dTCF_1        0.9964       V$HMEF2_Q6      0.9984
V$AHRARNT_02    0.9983       P$ABF1_03       0.9964       V$POU3F2_01     0.9983
P$BZIP911_01    0.9982       V$EVI1_01       0.9962       V$MEF2_01       0.9983
V$SEF1_C        0.9980       V$SEF1_C        0.9958       F$STE11_01      0.9981
P$BZIP910_01    0.9980       P$BZIP910_01    0.9955       V$OCT1_07       0.9977
P$O2_01         0.9977       P$ABF1_01       0.9955       V$MEF2_03       0.9975

Table 1: Quality ranking of the top 20 high-quality TRANSFAC profiles in three different background models (GC content 33%, 50%, and 67%). More detailed information is available electronically, as described in the Colophon.

The problem is that the majority of the profiles are too short for the asymptotics to apply. The problem becomes worse if we compute the p-value αn for the maximum random score over n windows. It is often assumed that the maximum has an extreme value distribution; our findings show that this is not always true. Equation (2) states that we can approximate the right tail of the maximum score distribution with small error, and this is all we need to estimate p-values.

Our main point is that the current practice of determining a score threshold t through the type-I error bound αn(t) ≤ 0.05 is generally not a good idea, because in many cases it leads to low detection power. Our findings allow us to balance the type-I error against the type-II error. In practice, we suggest computing an appropriate background model from the given sequence data and using the corresponding score matrix to determine an optimal error-balancing score threshold on a case-by-case basis.

The errors αn and βm grow as n and m increase, because the conditions are to reject all n false positives and to find all m true instances. For genome-wide studies (say, n > 10^7 and m > 10), we will usually find that αn(t) ≈ βm(t) ≈ 1 for all score thresholds t; this eventually happens for every profile. Therefore, extremely large values of n and m are unsuitable for evaluating the quality of a profile. We settled for n = 500 and m = 1 to ensure that one signal occurrence within a sequence of length 500 can be reliably detected. In large-scale studies, eventually some false positive and false
negative errors have to be accommodated. Two approaches have been proposed to circumvent the problem mentioned above: First, if the genomes of two related organisms, such as man and mouse, are available, the search space can be reduced by considering only the conserved parts of the upstream regions of genes. Corresponding tools, such as the CORG database (Dieterich et al., 2003), have recently become available. Second, researchers have used combinations of TFBSs (pairs or larger clusters) instead of single TFBS profiles to detect cis-regulatory modules; see, for example, Berman et al. (2002).

Our quality evaluation of the TRANSFAC profiles has several consequences: Some profiles are clearly too weak to be used as a meaningful TFBS detection method. More powerful methods than profiles are required, but given the limited amount of experimentally confirmed data, it is difficult to develop such models without risking over-fitting. When we have enough training data, the convolution approach can be extended: an order-d Markov dependency in the signal or background model can be accommodated with an additional O(|Σ|^d) time factor in the computation by conditioning all distributions on the most recent d-letter context. It may also be possible to extend the approach to extended profiles of the form P1 N_{l:u} P2, where P1 and P2 are profiles, and N_{l:u} is a bounded gap of between l and u arbitrary nucleotides. However, computing the sequence type-I error αn from the window type-I error then becomes more difficult.
Colophon

Detailed quality values for all TRANSFAC profiles, as well as software written in PERL to compute them, may be obtained by contacting the authors. The information may also be downloaded from the Gene Regulation website of the Max Planck Institute for Molecular Genetics, Berlin. Please address inquiries to
[email protected], or http://genereg.molgen.mpg.de. We would like to thank Steffen Grossmann, Stefan R¨opcke, and two anonymous referees for many helpful comments.
References

A. Barbour, L. Holst, and S. Janson. Poisson Approximation. Clarendon Press, Oxford, 1992.

B. P. Berman, Y. Nibu, B. D. Pfeiffer, P. Tomancak, S. E. Celniker, M. Levine, G. M. Rubin, and M. B. Eisen. Exploiting transcription factor binding site clustering to identify cis-regulatory modules involved in pattern formation in the Drosophila genome. Proc Natl Acad Sci USA, 99(2):757–62, Jan 2002.

J.-M. Claverie and S. Audic. The statistical significance of nucleotide position-weight matrix matches. CABIOS, 12(5):431–439, 1996.

T. M. Cover and J. Thomas. Elements of Information Theory. John Wiley & Sons, New York, 1991.

C. Dieterich, H. Wang, K. Rateitschak, H. Luz, and M. Vingron. CORG: a database for comparative regulatory genomics. Nucleic Acids Research, 31(1):55–57, 2003.

W. J. Ewens and G. G. Grant. Statistical Methods in Bioinformatics. Springer, 2001.

L. Goldstein and M. Waterman. Approximations to profile score distributions. Journal of Computational Biology, 1(1):93–104, 1994.

L. A. Mirny and M. S. Gelfand. Structural analysis of conserved base pairs in protein-DNA complexes. Nucleic Acids Research, 30(7):1704–1711, 2002. URL http://nar.oupjournals.org/cgi/content/abstract/30/7/1704.

K. Quandt, K. Frech, H. Karas, E. Wingender, and T. Werner. MatInd and MatInspector: New fast and versatile tools for detection of consensus matches in nucleotide sequence data. Nucleic Acids Research, 23:4878–4884, 1995.

E. Rivals and S. Rahmann. Combinatorics of periods in strings. In F. Orejas, P. Spirakis, and J. van Leeuwen, editors, Proceedings of the 28th International Colloquium on Automata, Languages, and Programming (ICALP 2001), volume 2076 of Lecture Notes in Computer Science, pages 615–626, Berlin, 2001. Springer-Verlag.

E. Roulet, I. Fisch, T. Junier, P. Bucher, and N. Mermod. Evaluation of computer tools for the prediction of transcription factor binding sites on genomic DNA. In Silico Biology, 1(0004), 1998.

S. Sinha and M. Tompa. A statistical method for finding transcription factor binding sites. In P. Bourne et al., editors, Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology (ISMB 2000), pages 344–354, La Jolla, CA, USA, 2000.

K. Sjølander, K. Karplus, M. Brown, R. Hughey, A. Krogh, I. Mian, and D. Haussler. Dirichlet mixtures: A method for improving detection of weak but significant protein sequence homology. CABIOS, 12:327–345, 1996.

R. Staden. Methods for calculating the probabilities of finding patterns in sequences. CABIOS, 5:89–96, 1989.

G. D. Stormo. DNA binding sites: representation and discovery. Bioinformatics, 16:16–23, 2000.

E. Wingender, X. Chen, R. Hehl, H. Karas, I. Liebich, V. Matys, T. Meinhardt, M. Prüss, I. Reuter, and F. Schacherer. TRANSFAC: an integrated system for gene expression regulation. Nucleic Acids Research, 28:316–319, 2000.