

A NOVEL STRATEGY TO SEARCH CONSERVED TRANSCRIPTION FACTOR BINDING SITES AMONG COEXPRESSING GENES IN HUMAN

YOSUKE HATANAKA, MASAO NAGASAKI, TAKESHI OBAYASHI, KAZUYUKI NUMATA, RUI YAMAGUCHI, ANDRÉ FUJITA, TEPPEI SHIMAMURA, YOSHINORI TAMADA, SEIYA IMOTO, KENGO KINOSHITA, KENTA NAKAI, SATORU MIYANO

Human Genome Center, Institute of Medical Science, University of Tokyo, 4-6-1, Shirokanedai, Minato-ku, Tokyo 108-8639, Japan

We report transcription factor binding sites (TFBSs) that are conserved among the promoter regions of co-expressed human genes, identified using expression and genomic data. Assuming that similar promoter structure induces similar transcriptional regulation, and hence similar expression profiles, we compared promoter structure among co-expressed genes. Comprehensive TFBS predictions were conducted for 19,777 promoter regions around the transcription start sites (TSSs) provided by DBTSS, and promoter similarity was searched among the coexpressing gene sets provided by the newly developed COXPRESdb. By combining Position Weight Matrix (PWM) motif prediction with a bootstrap method, we found that 7,313 genes have at least one statistically significant conserved TFBS. We also applied basket analysis to seek combinatorial activities of those conserved TFBSs.

Keywords: co-expressed genes; position weight matrix; promoter structure similarity; conserved TFBS

1. Introduction

In the last decade, a massive amount of gene expression data from DNA microarray experiments and the complete genomic sequences of various organisms have become publicly available. Although the spatiotemporal regulatory mechanisms are still unclear, it is widely accepted that gene expression in higher eukaryotes depends heavily on the recognition of specific promoter sequences by transcriptional regulatory proteins. Transcription regulatory sites, and thus cis-regulatory regions, can be identified using high-throughput methods such as ChIP-chip experiments [1-3]. However, an estimated 2,000 transcription factors are encoded in the human genome [4-5], and many of them are likely to be expressed and to regulate target genes combinatorially under various conditions, which makes experimental identification of cis-regulatory regions difficult. Therefore, computational identification of TFBSs based on signatures of their presence in the genomic sequence [6-9] remains an attractive alternative.

In this paper, we combined genomic data and expression data to gain insight into gene regulatory mechanisms, using a combination of conventional methods and a newly developed database. In a typical computational approach, all genes of an organism are first clustered based on their expression patterns. The promoter regions of genes in the same expression group are then examined for common sequence motifs, namely transcriptional regulatory sites (transcription factor binding sites) that make these genes active or inactive. For the prediction of TFBSs, we first applied 505 vertebrate Position Weight Matrices (PWMs), which are matrices of score values that give a weighted match to a sequence, to promoter sequences spanning 1,000 bp upstream and 200 bp downstream of the transcription start sites (TSSs) of 19,777 human genes. We then gathered genes that are co-expressed across various conditions and cell cultures and sought TFBSs common to the coexpressing genes. The novelty of our approach is that, taking into account the limited structural flexibility of the transcription regulating machinery, we focused on common motifs located at similar distances from the transcription start sites.

2. Methods and Results

2.1. Coexpressing gene sets

The coexpressing gene sets for each human gene were downloaded from the coexpressing gene database COXPRESdb ver.7 (http://coxpresdb.hgc.jp). In COXPRESdb, the coexpression data are calculated from 4,401 Affymetrix GeneChip data sets (123 experiments) from NCBI GEO. After RMA normalization was applied to each experiment, genes were normalized by expression level within each microarray experiment. All experiments were then combined into one gene expression table, and weighted Pearson Correlation Coefficients (PCCs) were calculated between genes to give the correlation rank.

COXPRESdb differs from other coexpressing gene databases in that it introduces a parameter, the “Mutual Rank (MR)”, derived from the correlation rank values. The correlation rank calculated from PCCs is asymmetric: the rank of gene B in the list of gene A is not the same as the rank of gene A in the list of gene B. Thus, rather than using either directional rank alone, the Mutual Rank is defined as the geometric mean of the two directional ranks, which gives a symmetric measure for building coexpressing gene sets:

$$\mathrm{MR}(AB) = \sqrt{\mathrm{rank}(A \to B) \times \mathrm{rank}(B \to A)} \qquad (1)$$

We retrieved the coexpressing gene list for each gene, arranged in ascending order of MR so that the most strongly coexpressed genes come first.
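The Mutual Rank calculation can be illustrated with a small sketch. This is not COXPRESdb's code: the function names `mutual_rank` and `top_partners` and the 4-gene toy correlation matrix are assumptions made for illustration, and plain (unweighted) correlations stand in for the weighted PCCs.

```python
import numpy as np

def mutual_rank(corr):
    """Geometric mean of the two directional correlation ranks (Eq. 1).

    corr: symmetric gene-by-gene correlation matrix (toy stand-in for PCCs).
    Returns a symmetric matrix; smaller values mean stronger coexpression.
    """
    c = np.array(corr, dtype=float)
    n = c.shape[0]
    np.fill_diagonal(c, -np.inf)            # a gene is not its own partner
    # order[a] lists the genes from most to least correlated with gene a
    order = np.argsort(-c, axis=1, kind="stable")
    rank = np.empty((n, n), dtype=float)
    # rank[a, b] = position of gene b in gene a's list (1 = best)
    rank[np.arange(n)[:, None], order] = np.arange(1, n + 1)
    mr = np.sqrt(rank * rank.T)
    np.fill_diagonal(mr, np.inf)            # keep self-pairs out of any top list
    return mr

def top_partners(mr, gene, k):
    """Indices of the k genes with the smallest Mutual Rank to `gene`."""
    return np.argsort(mr[gene], kind="stable")[:k].tolist()

# Hypothetical correlations among four genes A, B, C, D (indices 0..3).
corr = [[1.0, 0.9, 0.8, 0.1],
        [0.9, 1.0, 0.95, 0.2],
        [0.8, 0.95, 1.0, 0.3],
        [0.1, 0.2, 0.3, 1.0]]
mr = mutual_rank(corr)
print(mr[0, 1])                 # rank(A->B) = 1, rank(B->A) = 2, so sqrt(2)
print(top_partners(mr, 1, 2))   # best partners of gene B
```

In this toy example gene B is the top partner of gene A, but A is only the second partner of B, so the pair receives MR = sqrt(1 × 2) rather than either directional rank.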

2.2. Promoter sequence

For the promoter sequences, we retrieved 1,000 bp upstream and 200 bp downstream of the TSSs from the UCSC hg18 assembly for 19,777 human genes. The TSS locations were retrieved from DBTSS v.6.0 (http://dbtss.hgc.jp), which is based on a unique collection of experimentally determined 5'-end sequences of full-length cDNAs.

2.3. Transcription factor binding site (TFBS) prediction

We collected 505 vertebrate position weight matrices (PWMs), corresponding to 313 transcription factors, registered in TRANSFAC v.8.3. To predict the TFBSs, we used MATCH [11], which uses a Mann-Whitney U-test with a random gene list as a reference, to map all locations of predicted TFBSs on the human promoter sequences assembled above. The basic concept of the prediction is shown in Fig. 1. This mapping algorithm is highly error-prone, mainly producing false positive hits, because the known binding sites are short and sometimes degenerate. We therefore adopted a strict cutoff (≥ 0.98) on the integrated value of the “matrix similarity score” and the “core similarity score” to minimize false positives.

We then divided the sequences into 50 bp regions and recorded the presence or absence of each PWM in each region. We chose this region size because PWMs tend to produce large numbers of possible TFBSs in the genome; 50 bp regions are small enough to prevent most regions from containing most motifs. In addition, experimental data from TRANSCompel (v.10.3) [12] show that over 99% of the distances between experimentally determined transcription factor binding sites are less than 100 bp. Therefore, a range of 50 bp is compatible with the size of known cis-regulatory regions and small enough to avoid including too many predicted TFBSs. After mapping the predicted TFBSs, we named the resulting matrix of genomic regions and the frequencies of the TFBSs they contain the “TFBS location matrix”, as shown in Fig. 2.

Fig. 1. An image of score matrix mapping algorithm of MATCH. Those regions with matched score ≥ 0.98 were selected.
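The idea of scanning a sequence with a score matrix and keeping windows above a cutoff can be sketched as follows. This is a deliberately simplified illustration and not the MATCH algorithm: it uses a plain min-max normalised sum of matrix entries, omits the position weighting and the separate core similarity score that MATCH uses, and assumes sequences containing only A, C, G and T. The function names and the toy 4-column matrix are hypothetical.

```python
import numpy as np

BASES = "ACGT"

def similarity_scores(seq, pwm):
    """Min-max normalised matrix score for every window of `seq`.

    pwm: array of shape (motif_length, 4), columns ordered A, C, G, T.
    A score of 1.0 means the window is the best possible match and 0.0 the
    worst. Assumes the matrix is not flat (max score differs from min score).
    """
    pwm = np.asarray(pwm, dtype=float)
    length = pwm.shape[0]
    lowest = pwm.min(axis=1).sum()      # score of the worst possible window
    highest = pwm.max(axis=1).sum()     # score of the best possible window
    idx = [BASES.index(b) for b in seq.upper()]
    scores = []
    for start in range(len(seq) - length + 1):
        raw = sum(pwm[k, idx[start + k]] for k in range(length))
        scores.append((raw - lowest) / (highest - lowest))
    return np.array(scores)

def predicted_sites(seq, pwm, cutoff=0.98):
    """Start positions (0-based) of windows scoring at least `cutoff`."""
    scores = similarity_scores(seq, pwm)
    return [int(i) for i in np.flatnonzero(scores >= cutoff)]

# Hypothetical count matrix that prefers the motif TATA.
toy_pwm = [[1, 1, 1, 7],
           [7, 1, 1, 1],
           [1, 1, 1, 7],
           [7, 1, 1, 1]]
print(predicted_sites("GGTATAGG", toy_pwm))
```

With a cutoff as strict as 0.98, only windows that match the matrix almost perfectly are retained, which is why many genes end up with few or no predicted sites.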

Fig. 2. TFBS location matrix. The element in the matrix is the frequency of mapped TFBSs.
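A minimal sketch of how such a matrix could be assembled from predicted sites is given below. The orientation (rows for PWMs, columns for 50 bp regions), the use of the hit start coordinate to choose the region, and the function name `location_matrix` are assumptions for illustration; the paper does not specify these details.

```python
import numpy as np

PROMOTER_LEN = 1200   # 1,000 bp upstream + 200 bp downstream of the TSS
BIN_SIZE = 50         # region size used in the text

def location_matrix(hits, n_pwms, promoter_len=PROMOTER_LEN, bin_size=BIN_SIZE):
    """Count predicted TFBSs per PWM and per 50 bp region.

    hits: iterable of (pwm_index, position) pairs, where position is the
          0-based offset of the hit from the start of the promoter sequence.
    Returns an integer array of shape (n_pwms, promoter_len // bin_size).
    """
    n_bins = promoter_len // bin_size
    matrix = np.zeros((n_pwms, n_bins), dtype=int)
    for pwm_index, position in hits:
        matrix[pwm_index, position // bin_size] += 1
    return matrix

# Hypothetical hits for three PWMs on one promoter.
toy_hits = [(0, 10), (0, 40), (0, 975), (2, 1199)]
m = location_matrix(toy_hits, n_pwms=3)
print(m.shape)             # (3, 24): 24 regions of 50 bp
print(m[0, 0], m[0, 19])   # two hits in the first region, one in region 19
```

Because every gene is binned relative to its own TSS, the same column index in two matrices refers to the same distance from the TSS, which is what allows element-wise comparison between genes.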


2.4. Bootstrap method

Using the “TFBS location matrix” of each gene, we calculated the intensity of conservation of the predicted TFBSs in combination with the coexpression data. Because of the comparatively strict condition in the TFBS prediction, some TFBS location matrices were zero matrices; we excluded such genes from the following process, leaving 9,330 genes. For each gene, we selected the top 20 genes from COXPRESdb in ascending order of Mutual Rank as its group of highly co-expressed genes, or N (≤ 20) genes if some were missing for the above reason, because the expression similarity decreases rapidly after the top 20 genes [13].

Let $TF_{g,i,j}$ denote the element in the ith row and jth column of the “TFBS location matrix” for the gth gene, and let $TF^{cox}_{g,i,j,c}$ denote the corresponding element of the “TFBS location matrix” for the cth of its N coexpressed genes. Then $TF_{g,i,j}$ is transformed to the arithmetic average $X^{seed}_{g,i,j}$ according to

$$X^{seed}_{g,i,j} = \frac{\left(\sum_{c=1}^{N} TF^{cox}_{g,i,j,c}\right) + TF_{g,i,j}}{N+1} \qquad (2)$$

where $X^{seed}_{g,i,j}$ is interpreted as the “intensity of conservation”. To evaluate the significance of the conservation, we applied a testing procedure based on the bootstrap method. The testing procedure is as follows:

1. For N randomly selected genes, repeatedly calculate the arithmetic average $X^{random}_{g,i,j,k}$ according to

$$X^{random}_{g,i,j,k} = \frac{\left(\sum_{c=1}^{N} TF^{random}_{i,j,c}\right) + TF_{g,i,j}}{N+1}, \qquad k = 1, \ldots, 10{,}000 \qquad (3)$$

2. Arrange $X^{random}_{g,i,j,k}$ in ascending order, and place $X^{seed}_{g,i,j}$ as

$$X^{random}_{g,i,j,1} \le X^{random}_{g,i,j,2} \le \cdots \le X^{random}_{g,i,j,Z} \le X^{seed}_{g,i,j} \le \cdots \le X^{random}_{g,i,j,10000} \qquad (4)$$

where $Z = Z_{g,i,j}$ is the number of random values ranked below the seed value.

3. Compute an integrated p-value $P_{g,i,j}$ by

$$P_{g,i,j} = 1 - \frac{Z_{g,i,j} + 1}{10{,}000 + 1} \qquad (5)$$

where the following conditions were applied to $P_{g,i,j}$:

♦ If $TF_{g,i,j} = 0$, then $P_{g,i,j} \to 1$.
♦ If $P_{g,i,j} > 0.05$, then $P_{g,i,j} \to 1$.
♦ If $P_{g,j} = 1$, then the jth row was excluded from the matrix.

The p-value matrix for the gth gene is denoted by $P_g$, as shown in Fig. 3.

Fig. 3. P-value matrix for each transcription factor (TF). Elements of the matrix > 0.05 are converted to 1.
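The conservation intensity of Eq. (2) and the bootstrap test of Eqs. (3) to (5) can be sketched together as below. This is an illustrative reading of the procedure, not the authors' code. Several points are assumptions: the function names, sampling the N random genes without replacement from a caller-supplied pool that excludes the seed gene, and counting only random values strictly smaller than the seed value as $Z$ (so that ties do not count as evidence of conservation).

```python
import numpy as np

def conservation_intensity(seed, partners):
    """Eq. (2): element-wise mean over the seed matrix and N partner matrices."""
    partners = np.asarray(partners, dtype=float)
    return (partners.sum(axis=0) + seed) / (len(partners) + 1)

def bootstrap_pvalues(seed, cox, pool, n_boot=10_000, alpha=0.05, rng=None):
    """Bootstrap p-value matrix for one seed gene.

    seed: TFBS location matrix of the seed gene.
    cox:  location matrices of its N coexpressed genes.
    pool: location matrices of the genes available for random sampling
          (assumed not to include the seed gene, and to hold at least N).
    """
    rng = np.random.default_rng(rng)
    seed = np.asarray(seed, dtype=float)
    pool = np.asarray(pool, dtype=float)
    n = len(cox)
    x_seed = conservation_intensity(seed, cox)
    z = np.zeros(seed.shape, dtype=int)
    for _ in range(n_boot):
        pick = rng.choice(len(pool), size=n, replace=False)
        x_random = conservation_intensity(seed, pool[pick])   # Eq. (3)
        z += x_random < x_seed          # assumption: ties are not counted
    p = 1.0 - (z + 1) / (n_boot + 1)    # Eq. (5)
    p[seed == 0] = 1.0                  # no site in the seed gene itself
    p[p > alpha] = 1.0                  # non-significant elements set to 1
    return p

# Toy data: one element is shared by the seed and all 5 partners,
# while the 50 background genes have no predicted sites at all.
toy_seed = np.array([[1, 0, 0], [0, 0, 0]])
toy_cox = [toy_seed.copy() for _ in range(5)]
toy_pool = np.zeros((50, 2, 3))
print(bootstrap_pvalues(toy_seed, toy_cox, toy_pool, n_boot=200, rng=0))
```

In the toy case every random average at the shared element is 1/6 while the seed average is 1, so that element receives the smallest possible p-value and all other elements are set to 1.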

Of the 9,330 human genes, 7,313 had at least one conserved TFBS. The frequencies of the significantly conserved TFBSs for 200 transcription factors among these 7,313 genes are shown as a histogram in Fig. 4.

Fig. 4. The frequency histogram of conserved TFBSs for 200 transcription factors. The frequency ranges from 1 to over 2,500.

2.5. Heatmap

Following the bootstrap testing, the p-value matrices were depicted as two-colored heatmaps (Fig. 5).

Fig. 5. Heatmaps of the conserved transcription factor binding sites for (a) AIP and (b) HBA2. Red represents conservation with higher statistical significance.

2.6. Association rule data mining

Determining significantly conserved TFBSs may help to identify transcription factor partners with co-acting biological roles, particularly for less well-studied combinations of transcription factors. Therefore, we applied the “association rule” mining used in market basket analysis. This method determines which items are frequently purchased together, using a database of transactions in which each tuple is the list of items purchased together in one customer’s transaction. The mining seeks to discover rules such as “beer ⇒ snacks,” meaning “People who buy beer also often buy snacks.” Association rules can be formally described as follows:

♦ $I = \{i_1, i_2, \ldots, i_n\}$ is a set of literals called items.
♦ D is a set of transactions. Each transaction T is a set of items such that $T \subseteq I$.
♦ A transaction T contains X, a set of items in I, if $X \subseteq T$.
♦ An association rule is an implication of the form $X \Rightarrow Y$, where $X \subset I$, $Y \subset I$ and $X \cap Y = \emptyset$.
♦ A rule $X \Rightarrow Y$ has confidence c in transaction set D if c% of the transactions in D that contain X also contain Y. This is also known as the conditional probability of Y given X, or $P(Y \mid X)$.
♦ A rule $X \Rightarrow Y$ has support s in transaction set D if s% of the transactions in D contain both X and Y. This is also known as the joint probability of X and Y, or $P(X \cap Y)$.
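The definitions above translate directly into a few lines of code. The sketch below treats each gene as a transaction holding the set of transcription factors with a conserved TFBS; the item labels and toy transactions are hypothetical. It also includes lift, which appears in Table 1 but is not defined in the text; the standard definition, confidence divided by $P(Y)$, is assumed here.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, x, y):
    """P(Y | X): support of X and Y together divided by support of X."""
    return support(transactions, set(x) | set(y)) / support(transactions, x)

def lift(transactions, x, y):
    """Confidence divided by P(Y); values above 1 mean X and Y co-occur
    more often than expected if they were independent (assumed definition)."""
    return confidence(transactions, x, y) / support(transactions, y)

# Hypothetical transactions: one set of TF labels per gene.
genes = [{"A", "B"}, {"A", "B", "C"}, {"C"}, {"C", "D"},
         {"D"}, {"A", "B"}, {"D"}, {"C"}]
print(support(genes, {"A", "B"}))       # 3 of 8 genes contain both A and B
print(confidence(genes, {"A"}, {"B"}))  # every gene with A also has B
print(lift(genes, {"A"}, {"B"}))
```

A rule with confidence 1 can still be uninformative if Y occurs in nearly every transaction; lift corrects for this by comparing the confidence with the overall frequency of Y.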

Because a transcription factor such as STAT1 has more than 2,000 conserved TFBSs among the transactions (genes), such frequent items may obscure meaningful rules for genes with fewer conserved TFBSs. Therefore, we limited the rule search to genes with fewer than 10 conserved TFBSs (Fig. 6).

Fig. 6. Histogram of conserved TFBS frequency for each transcription factor.

After restricting the analysis to genes with at most 9 conserved TFBSs, 39 transcription factors remained, with 102 genes corresponding to the found TFBSs. For those TFBSs, frequent TFBS sets and association rules were mined using the Apriori algorithm, which employs a level-wise search for frequent itemsets (here, TFBS pairs). The result is shown in Table 1.

Table 1. The rules of basket analysis: (maxlen, support) = (9, 0.01).

Rule No.  X                ⇒  Y                support  confidence  lift
1         E2F-4:DP-1       ⇒  pRb:E2F-1.DP-1   0.0373   1.000       26.75
2         pRb:E2F-1.DP-1   ⇒  E2F-4:DP-1       0.0373   1.000       26.75
3         HFH8             ⇒  Freac.7          0.0233   0.8333      25.47
4         NF.kappaB        ⇒  c.Rel            0.0187   0.8333      21.40
5         Zic2             ⇒  Zic1             0.0187   1.000       53.50
6         Zic1             ⇒  Zic2             0.0187   1.000       53.50
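The level-wise search of the Apriori algorithm can be sketched in a few lines. This is a generic textbook-style illustration, not the implementation used in the paper; the function name `apriori` and the toy transactions are assumptions, and the defaults simply mirror the (maxlen, support) = (9, 0.01) setting quoted in the caption of Table 1.

```python
from itertools import combinations

def apriori(transactions, min_support=0.01, max_len=9):
    """Return {itemset: support} for all frequent itemsets up to max_len items."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def supp(itemset):
        return sum(itemset <= t for t in transactions) / n

    # Level 1 candidates: every single item that occurs anywhere.
    level = {frozenset([i]) for t in transactions for i in t}
    frequent = {}
    k = 1
    while level and k <= max_len:
        # Keep only the candidates of size k that reach the support threshold.
        kept = {}
        for candidate in level:
            s = supp(candidate)
            if s >= min_support:
                kept[candidate] = s
        frequent.update(kept)
        # Join step: unions of two frequent k-itemsets that have k + 1 items.
        joined = {a | b for a in kept for b in kept if len(a | b) == k + 1}
        # Prune step: every k-item subset of a candidate must be frequent.
        level = {c for c in joined
                 if all(frozenset(sub) in kept for sub in combinations(c, k))}
        k += 1
    return frequent

# Hypothetical transactions (one set of TF labels per gene).
toy = [{"A", "B"}, {"A", "B", "C"}, {"A", "C"}, {"B"}, {"D"}]
result = apriori(toy, min_support=0.4)
for itemset, s in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

The prune step is what makes the search level-wise: the set {A, B, C} is never counted in the toy example because its subset {B, C} already failed the support threshold. Rules such as those in Table 1 are then read off the frequent pairs by computing confidence and lift for each direction.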

3. Discussion

In this paper, we conducted TFBS prediction using PWMs and a bootstrap method to search for conserved TFBSs. Our assumption was that, because the location and combination of TFBSs are restricted by the limited structural flexibility of the transcription regulating machinery, co-expressed genes may share a common structure in the promoter region. The bootstrap testing found a number of statistically significant conserved TFBSs, and for some of them functional relevance is supported by experimental evidence. For example, in Fig. 5(a), AIP (aryl hydrocarbon receptor interacting protein) shows a significantly conserved TFBS of AhR (aryl hydrocarbon receptor) close to the TSS. The heatmap also captures HNF4-alpha as a binding transcription factor, which has been validated by a ChIP-chip experiment [14]. As for HBA2 (alpha2-globin), in Fig. 5(b), the heatmap shows an SP1 binding site that has been experimentally validated by a ChIP-chip experiment [15]. These results suggest that the conserved TFBSs are indeed functional, consistent with the restricted structural flexibility of the transcription regulating machinery.

We further sought significantly co-occurring conserved TFBSs using basket analysis, with the expectation that such combinatorial occurrence indicates potential cis-regulatory regions. According to rules No. 1 and No. 2, pRb:E2F-1:DP-1 and E2F-4:DP-1 tend to co-occur in the conserved regions. In the transcription factor complex pRb:E2F-1:DP-1, the association of pRB is known to alter the binding site specificity of E2F-1/DP-1 complexes [16]. Therefore, pRB may act as a switching device for gene regulation. The co-occurrence of E2F-4:DP-1, a complex similar to E2F-1:DP-1, has yet to be validated experimentally, but it suggests that pRB may have a similar function as a regulatory switching device. As for rule No. 4, NF.kappaB with c.Rel, it is known that they form a complex and bind DNA for the transcriptional activation of various genes [17-18].
As for Zic1 and Zic2, it is known that they bind and trans-activate the apolipoprotein E gene promoter [19]. As for rule No. 3, although there is no previous report of an interaction between HFH8 and Freac.7, they may be good candidates for potential interacting partners in further experiments.

As for future work, in this study we restricted “conserved TFBSs” to those located in exactly the same region among co-expressing genes. Allowing more flexibility for the transcription regulating machinery, by including slightly shifted TFBSs, may reveal better candidates for common functional TFBSs among coexpressing genes. Searching for co-occurring TFBSs among the coexpressing genes using the basket method may also reveal novel candidates for cis-regulatory elements.

In conclusion, our strategy of searching for conserved TFBSs among coexpressing genes revealed that there is indeed a significant number of conserved TFBSs. The analysis of co-occurrence further suggests that such co-occurring conserved TFBSs may act as cis-regulatory elements in the human genome.

References

[1] Kim, J., Bhinge, AA., Morgan, XC., Iyer, VR., Mapping DNA-protein interactions in large genomes by sequence tag analysis of genomic enrichment, Nat. Methods, 2(1):47-53, 2005.

[2] Lee, TI., Rinaldi, NJ., Robert, F., Odom, DT., Bar-Joseph, Z., Gerber, GK., Hannett, NM., Harbison, CT., Thompson, CM., Simon, I., Zeitlinger, J., Jennings, EG., Murray, HL., Gordon, DB., Ren, B., Wyrick, JJ., Tagne, JB., Volkert, TL., Fraenkel, E., Gifford, DK., Young, RA., Transcriptional regulatory networks in Saccharomyces cerevisiae, Science, 298(5594):799-804, 2002.
[3] Carroll, JS., Meyer, CA., Song, J., Li, W., Geistlinger, TR., Eeckhoute, J., Brodsky, AS., Keeton, EK., Fertuck, KC., Hall, GF., Wang, Q., Bekiranov, S., Sementchenko, V., Fox, EA., Silver, PA., Gingeras, TR., Liu, XS., Brown, M., Genome-wide analysis of estrogen receptor binding sites, Nat. Genet., 38(11):1289-1297, 2006.
[4] Tupler, R., Perini, G., Green, MR., Expressing the human genome, Nature, 409(6822):832-833, 2001.
[5] Messina, DN., Glasscock, J., Gish, W., Lovett, M., An ORFeome-based analysis of human transcription factor genes and the construction of a microarray to interrogate their expression, Genome Res., 14(10B):2041-2047, 2004.
[6] Wingender, E., Chen, X., Hehl, R., Karas, H., Liebich, I., Matys, V., Meinhardt, T., Pruss, M., Reuter, I., Schacherer, F., TRANSFAC: an integrated system for gene expression regulation, Nucleic Acids Res., 28:316-319, 2000.
[7] Siggia, ED., Computational methods for transcriptional regulation, Curr. Opin. Genet. Dev., 15(2):214-221, 2005.
[8] Tavazoie, S., Hughes, JD., Campbell, MJ., Cho, RJ., Church, GM., Systematic determination of genetic network architecture, Nat. Genet., 22(3):281-285, 1999.
[9] Birnbaum, K., Benfey, PN., Shasha, DE., cis element/transcription factor analysis (cis/TF): a method for discovering transcription factor/cis element relationships, Genome Res., 11(9):1567-1573, 2001.
[10] Bussemaker, HJ., Li, H., Siggia, ED., Regulatory element detection using correlation with expression, Nat. Genet., 27(2):167-171, 2001.
[11] Kel, A., Gossling, E., Reuter, I., Cheremushkin, E., Kel-Margoulis, O., Wingender, E., MATCH: A tool for searching transcription factor binding sites in DNA sequences, Nucleic Acids Res., 31:3576-3579, 2003.
[12] Matys, V., et al., TRANSFAC and its module TRANSCompel: transcriptional gene regulation in eukaryotes, Nucleic Acids Res., 34(Database issue):D108-110, 2006.
[13] Obayashi, T., Hayashi, S., Shibaoka, M., Saeki, M., Ohta, H., Kinoshita, K., COXPRESdb: a database of co-expressed gene networks in mammals, Nucleic Acids Res., 36(Database issue):D77-82, 2008.
[14] Odom, D.T., Zizlsperger, N., Gordon, D.B., Bell, G.W., Rinaldi, N.J., Murray, H.L., Volkert, T.L., Schreiber, J., Rolfe, P.A., Gifford, D.K., Fraenkel, E., Bell, G.I., Young, R.A., Control of pancreas and liver gene expression by HNF transcription factors, Science, 303:1378-1381, 2004.
[15] TRANSFAC_Team, New ChIP-on-chip data, TRANSFAC Reports, Rel121:0002, 2008.
[16] Tao, Y., Kassatly, R.F., Cress, W.D., Horowitz, J.M., Subunit composition determines E2F DNA-binding site specificity, Mol. Cell. Biol., 17:6994-7007, 1997.
[17] Sun, S.C., Elwood, J., Beraud, C., Greene, W.C., Human T-cell leukemia virus type I Tax activation of NF-kappaB/Rel involves phosphorylation and degradation of IkappaBalpha and RelA (p65)-mediated induction of the c-rel gene, Mol. Cell. Biol., 14:7377-7384, 1994.

[18] Hansen, S.K., Nerlov, C., Zabel, U., Verde, P., Johnsen, M., Baeuerle, P., Blasi, F., A novel complex between the p65 subunit of NF-kappaB and c-Rel binds to a DNA element involved in the phorbol ester induction of the human urokinase gene, EMBO J., 11:205-213, 1992.
[19] Salero, E., Perez-Sen, R., Aruga, J., Gimenez, C., Zafra, F., Transcription factors Zic1 and Zic2 bind and transactivate the apolipoprotein E gene promoter, J. Biol. Chem., 276:1881-1888, 2001.