The Plant Journal (2011) 67, 130–144
doi: 10.1111/j.1365-313X.2011.04581.x
Vascular expression in Arabidopsis is predicted by the frequency of CT/GA-rich repeats in gene promoters
Roberto Ruiz-Medrano1, Beatriz Xoconostle-Cázares1, Byung-Kook Ham2, Gang Li2 and William J. Lucas2,*
1 Department of Biotechnology and Bioengineering, Centro de Investigación y de Estudios Avanzados del Instituto Politécnico Nacional, Avenida IPN 2508, Zacatenco, 07360 Mexico DF, Mexico, and
2 Department of Plant Biology, College of Biological Sciences, University of California, Davis, CA 95616, USA
Received 6 January 2011; revised 23 February 2011; accepted 16 March 2011; published online 9 May 2011. * For correspondence (fax +1 530 752 5410; e-mail
[email protected]).
SUMMARY
Phloem-transported signals play an important role in regulating plant development and in orchestrating responses to environmental stimuli. Among such signals, phloem-mobile RNAs have been shown to play an important role as long-distance signaling agents. At maturity, angiosperm sieve elements are enucleate, and thus transcripts in the phloem translocation stream probably originate from the nucleate companion cells. In the present study, a pumpkin (Cucurbita maxima) phloem transcriptome was used to test for the presence of common motifs within the promoters of this unique set of genes, which may function to coordinate expression in cells of the vascular system. A bioinformatics analysis of the upstream sequences from 150 Arabidopsis genes homologous to members of the pumpkin phloem transcriptome identified degenerate sequences containing CT/GA- and GT/CA-rich motifs that were common to many of these promoters. Parallel studies performed on genes shown previously to be expressed in phloem tissues identified similar motifs. An expanded analysis, based on homologs of the pumpkin phloem transcriptome from cucumber (Cucumis sativus), identified similar sets of common motifs within the promoters of these genes. Promoter analysis offered support for the hypothesis that these motifs regulate expression within the vascular system. Our findings are discussed in terms of a role for these motifs in coordinating gene expression within the companion cell/sieve element system. These motifs could provide a useful bioinformatics tool for genome-wide screens on plants for which phloem tissues cannot readily be obtained.
Keywords: companion cells, long-distance signals, phloem, promoter analysis, sieve elements, vascular system.
INTRODUCTION
During the transition to a terrestrial habitat, plants evolved developmental programs essential for formation of a vascular system, which allowed expansion of form and complexity. This vascular system comprises xylem and phloem tissues; the xylem conducts water and mineral nutrients from the roots to aerial regions of the plant, while the phloem delivers fixed carbon and other nutrients from mature, photosynthetic source leaves to heterotrophic tissues and organs. In angiosperms, the phloem translocation stream moves by bulk flow through a specialized sieve tube system driven by a pressure gradient. This system comprises two cell types: companion cells (CCs) and sieve elements (SEs), which are inter-connected by plasmodesmata. In the physiologically competent state, CCs are metabolically active and provide support for the enucleate SEs, which
have become highly reduced in cellular complexity (van Bel et al., 2002). The simplified form of individual SEs, together with the development of specialized sieve plate pores between neighboring SEs, provides a low-resistance pathway for fluid flow (Ruiz-Medrano et al., 2004; Lough and Lucas, 2006). The developmental programs that arose to control vascular differentiation are currently under intense investigation. Important insights have been obtained regarding the signaling molecules and transcription factors involved in cambial, xylem and phloem specification (see, for example, Bonke et al., 2003; Hawker and Bowman, 2004; Motose et al., 2004; Scarpella et al., 2004; Zhou et al., 2004; Carlsbecker and Helariutta, 2005; Ito et al., 2006; Sieburth and Deyholos, 2006; Demura and Fukuda, 2007; Yokoyama et al., 2007;
Etchells and Turner, 2010). However, information on the mechanisms involved in coordinating patterns of gene expression in the developing vascular system, and within the specific tissues of the cambium, phloem and xylem, is still limited.
A similar situation exists with respect to the genetic regulatory networks that operate to control the functioning of the phloem within mature tissues, particularly operation of the CC–SE complex. As with phloem development, sequences that drive expression of genes involved in specific phloem functions are being characterized (Schneider et al., 2006; Schneidereit et al., 2008; Ma et al., 2009). Of note, sequences in the promoter of galactinol synthase that drive expression in the minor veins of cucurbits have been identified (Ayre et al., 2003) and mapped to homologous genes in other species, including Arabidopsis. However, although similar sequences are present in other vascular-expressed genes (Keller and Baumgartner, 1991; Medberry et al., 1992; Kosugi et al., 1995), their significance and/or functionality remain to be elucidated in many cases.
Recent studies have revealed that the functional sieve tube system of angiosperms contains a subset of mRNA molecules and a diverse population of proteins (Giavalisco et al., 2006; Roney et al., 2007; Gaupels et al., 2008; Hannapel, 2010). As the SEs are enucleate at maturity, the 1000 or more mRNA molecules present in the cucurbit phloem translocation stream (Ruiz-Medrano et al., 1999, 2007; Xoconostle-Cázares et al., 1999; Haywood et al., 2005; Lough and Lucas, 2006; Ham et al., 2009) probably originate from the neighboring nucleate CCs (Ruiz-Medrano et al., 2001, 2004; van Bel, 2003; Lough and Lucas, 2006; Huang and Yu, 2009; Turgeon and Wolf, 2009). A similar situation could exist for the more than 1000 proteins detected in the phloem sap of the cucurbits (Walz et al., 2004; Lin et al., 2007, 2009).
In this case, the phloem transcriptome and/or proteome of the cucurbits could serve as a valuable resource to test for the presence of common motifs within the promoters of these genes that may function to coordinate their expression in this phloem cell type. In the present study, we analyzed Arabidopsis homologs of the pumpkin phloem transcriptome database. For this purpose, we first analyzed the upstream sequences from 150 homologous Arabidopsis genes in search of common motifs in their promoters. We identified degenerate sequences containing CT/GA-rich and, to a lesser extent, GT/CA-rich motifs that were common to many of these promoters. Parallel studies performed on genes previously shown to be expressed in phloem tissues identified similar motifs. An expanded analysis, based on homologs of the pumpkin phloem transcriptome from cucumber (C. sativus), rice (Oryza sativa) and poplar (Populus trichocarpa), provided additional support for the hypothesis that a similar set of common motifs is located within the promoters of these genes. Experimental support for this hypothesis was
obtained using representative upstream sequences, as well as minimal promoter cassettes, to drive a uidA–GFP reporter gene system. Our findings are discussed in terms of a role for these motifs in coordinating gene expression within the CC–SE system. These motifs should provide a useful bioinformatics tool for genome-wide screens on plants for which phloem tissues cannot readily be obtained.
RESULTS
Gene promoters of SE-derived transcripts share common degenerate motifs
Transcripts detected in pumpkin phloem sap are probably derived from transcription occurring in neighboring CCs. To obtain insight into the production of this unique set of phloem transcripts, we first used a previously established EST database that was generated using poly(A)+ mRNA extracted from pumpkin phloem sap; this database comprises more than 1200 transcripts (Lough and Lucas, 2006). Given that the pumpkin genome has yet to be sequenced, together with the recalcitrant nature of pumpkin to routine transformation, a bioinformatics analysis was performed to identify the Arabidopsis homologs for these genes. For this purpose, we selected, out of the most abundant pumpkin transcripts (based on the number of ESTs), genes that were annotated as being involved in various forms of signaling. From these, we identified 150 Arabidopsis genes encoding putative phloem transcription factors, protein kinases, protein phosphatases, cell-cycle regulators and hormone response factors (Table 1). To identify potentially conserved sequences between the promoters of these genes, enumerative methods, based on the total count of given motifs in a dataset (PROMOMER and YMF), and probabilistic methods, based on a position weight matrix [Gibbs motif sampler, ALIGNACE (based on the Gibbs sampler algorithm) and MEME] were used.
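The enumerative approach described above rests on counting motif occurrences across a promoter set. The following minimal Python sketch illustrates that idea only; it is not the PROMOMER or YMF algorithm. The function names, the three-unit repeat threshold and the toy sequences are assumptions introduced for illustration, and only one strand is scanned, which is a simplification.

```python
import re
from typing import Dict


def count_ct_ga_repeats(seq: str, min_units: int = 3) -> int:
    """Count non-overlapping runs of at least `min_units` consecutive
    CT or GA dinucleotides on the given strand (hypothetical criterion)."""
    pattern = re.compile(r"(?:CT){%d,}|(?:GA){%d,}" % (min_units, min_units))
    return len(pattern.findall(seq.upper()))


def repeats_per_kb(promoters: Dict[str, str], min_units: int = 3) -> float:
    """Total number of repeat runs in a promoter set, normalised per 1000 bp."""
    total_runs = sum(count_ct_ga_repeats(s, min_units) for s in promoters.values())
    total_bp = sum(len(s) for s in promoters.values())
    return 1000.0 * total_runs / total_bp


# Toy sequences (not real promoters): one contains a (CT)3 run, one does not.
toy_set = {
    "gene_a": "CTCTCT" + "A" * 494,
    "gene_b": "A" * 500,
}
print(repeats_per_kb(toy_set))
```

Comparing this per-kb figure between a test set and a background set gives a crude, count-based view of enrichment; a real analysis would also need a statistical test and both strands.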
As background sequences (to serve as controls), we incorporated into our analysis sets of pollen-, guard cell-, ribosomal protein- and cyclin-specific genes, because it seemed reasonable to assume that their signature motifs would be different from those of vascular-expressed genes. We assumed that genes expressed at the highest levels in a given tissue or organ would display more common motifs or a more conserved motif than those expressed at lower levels. Thus, for the pollen set (Table S1) (based on data reported by Honys and Twell, 2004) and the guard cell set (Table S2) (based on data reported by Leonhardt et al., 2004), genes were selected according to their expression levels. Genes for ribosomal proteins were selected at random (Table S3) (based on data reported by Barakat et al., 2001). A search was performed to identify the entire complement of cyclin genes within the Arabidopsis genome (Table S4). The ALIGNACE program (Hughes et al., 2000) was used to analyze the sieve element transcript promoter (SETP) and
© 2011 The Authors. The Plant Journal © 2011 Blackwell Publishing Ltd, The Plant Journal (2011), 67, 130–144
Table 1 Gene IDs for pumpkin phloem sap transcripts and their closest Arabidopsis gene homologs used for promoter motif analysis
Gene ID           Arabidopsis homolog  Function
Cucurbita_000445  At3g46290            Receptor protein kinase-like/receptor-like
Cucurbita_006206  At3g47570            Leucine-rich repeat transmembrane protein kinase
Cucurbita_011090  At4g05420            UV-damaged DNA binding factor-like protein/XPE
Cucurbita_007894  At5g66080            Protein phosphatase 2C-like protein
Cucurbita_000239  At5g65210            bZIP transcription factor, TGA1
Cucurbita_009893  At2g17290            Putative calmodulin-domain protein kinase CPK6
Cucurbita_008102  At3g15220            Putative MAP kinase/similar to BnMAP4Kα2
Cucurbita_008186  At4g26930            Myb family transcription factor (MYB97)
Cucurbita_008169  At1g49620            Kip-related protein 7
Cucurbita_001629  At5g47840            AMK2
Cucurbita_003901  At1g63700            Putative protein kinase/similar to MAP3Kα1
Cucurbita_000087  At4g26690            Glycerophosphodiester phosphodiesterase/kinase
Cucurbita_008005  At5g03300            ADK2
Cucurbita_010531  At1g80070            Splicing factor Prp8, putative
Cucurbita_009689  At1g53300            Tetratricopeptide repeat domain thioredoxin
Cucurbita_009998  At1g19210            AP2 domain transcription factor, putative
Cucurbita_004036  At3g24240            Leucine-rich repeat transmembrane protein kinase
Cucurbita_010871  At2g16750            Putative protein kinase
Cucurbita_004647  At5g03790            Homeodomain protein
Cucurbita_007842  At5g54380            Receptor-protein kinase-like protein
Cucurbita_010144  At5g67380            Casein kinase II α subunit
Cucurbita_007907  At3g25840            Protein kinase, putative
Cucurbita_004074  At1g51800            Receptor protein kinase, putative
Cucurbita_006340  At1g43700            VirE2-interacting protein VIP1
Cucurbita_011143  At1g57700            CRK1 protein, putative
Cucurbita_010330  At1g34260            Phosphatidylinositol-4-phosphate 5-kinase family protein
Cucurbita_011219  At1g19220            Auxin response factor, putative
Cucurbita_010929  At4g23900            Nucleoside diphosphate kinase 4 (NDK4)
Cucurbita_011249  At4g17880            bHLH protein
Cucurbita_009895  At3g14205            Phosphoinositide phosphatase family protein
Cucurbita_010872  At1g16330            Cyclin B3
Cucurbita_011155  At3g17730            Hypothetical protein/similar to GRAB1 protein
Cucurbita_011181  At1g77460            Armadillo/β-catenin repeat family protein
Cucurbita_001979  At1g18160            MAP kinase, putative/similar to MAP3Kδ1
Cucurbita_010134  At1g11950            Putative DNA-binding protein
Cucurbita_003734  At4g14550            IAA7-like protein
Cucurbita_002400  At1g80070            Splicing factor Prp8, putative
Cucurbita_031271  At1g79580            NAM (no apical meristem)-like protein OsNAC4, putative
Cucurbita_010471  At5g07370            Phosphatidylinositol kinase (IPK2a)
Cucurbita_009855  At4g11800            Protein serine/threonine phosphatase
Cucurbita_010468  At2g40270            Protein kinase family protein
Cucurbita_007857  At3g24550            Proline extensin-like receptor kinase 1
Cucurbita_010505  At3g07610            IBM1 protein
Cucurbita_008667  At3g03770            Leucine-rich repeat transmembrane protein kinase
Cucurbita_009955  At1g77450            GRAB1-like protein
Cucurbita_009029  At1g61370            Receptor protein kinase (IRK1), putative
Cucurbita_001979  At1g18160            MAP kinase, putative/similar to MAP3Kδ1
Cucurbita_011373  At4g00460            Rho guanyl-nucleotide exchange factor
Cucurbita_010764  At3g15730            Phospholipase D, putative
Cucurbita_010798  At3g14980            PHD finger protein, putative
Cucurbita_000123  At3g07650            CONSTANS B-box zinc finger family protein
Cucurbita_007834  At3g04830            Auxin-regulated protein
Cucurbita_010139  At1g79640            Kinase, putative/similar to Ste-20 related kinase
Cucurbita_007967  At1g62310            Transcription factor, JUMONJI (jmjC) domain-containing
Cucurbita_004499  At1g58100            Auxin-induced basic helix-loop-helix transcription
Cucurbita_009773  At1g52150            HD-Zip transcription factor
Cucurbita_008262  At1g51070            bHLH protein/similar to bHLH transcription factor
Cucurbita_010344  At1g30330            Auxin response transcription factor (ARF6)
Cucurbita_010176  At1g20696            HMGB1
Cucurbita_007868  At1g02230            NAC domain-containing protein
Cucurbita_007884  At5g65530            RBK1
Cucurbita_010282  At4g38520            Putative protein phosphatase 2C
Cucurbita_011532  At4g29230            NAC domain-containing protein
Cucurbita_006504  At4g18020            Pseudo-response regulator 2 (APRR2)
Cucurbita_008052  At3g20770            Ethylene-insensitive 3 (EIN3)
Cucurbita_000582  At3g15030            TCP3-like protein
Cucurbita_006544  At3g05050            Cyclin-dependent protein kinase
Cucurbita_008235  At3g04730            Auxin-induced transcription factor
Cucurbita_007941  At3g03300            DEAD/DEAH box helicase, carpel factory-related
Cucurbita_010586  At1g80840            WRKY family transcription factor
Cucurbita_008066  At1g66340            Ethylene-response protein, ETR1
Cucurbita_011480  At1g61550            Receptor kinase, putative
Cucurbita_000153  At1g27730            Salt-tolerance zinc finger protein
Cucurbita_009862  At1g20080            C2 domain-containing protein
Cucurbita_005788  At1g17720            Type 2A protein serine/threonine phosphatase
Cucurbita_011082  At1g13800            Pentatricopeptide (PPR) repeat-containing protein
Cucurbita_011051  At5g62310            IRE (root hair elongation)
Cucurbita_001705  At5g07100            WRKY family transcription factor
Cucurbita_010957  At5g03730            CTR1 serine/threonine protein kinase
Cucurbita_003380  At4g36010            Thaumatin-like protein
Cucurbita_010773  At4g28540            Protein kinase ADK1-like protein
Cucurbita_010209  At4g27410            Desiccation-induced NAC transcription factor
Cucurbita_000634  At4g11460            Serine/threonine kinase-like protein/receptor-like
Cucurbita_008117  At3g51550            Receptor-protein kinase-like protein
Cucurbita_011203  At3g43220            Phosphoinositide phosphatase family protein
Cucurbita_007812  At3g22790            Kinase-interacting family protein
Cucurbita_010585  At3g19510            Putative homeobox protein, HAT3.1
Cucurbita_008533  At3g14270            Phosphatidylinositol-4-phosphate 5-kinase family
Cucurbita_011202  At3g02130            Leucine-rich repeat transmembrane protein kinase
Cucurbita_010568  At2g13370            Putative chromodomain-helicase-DNA-binding protein
Cucurbita_010507  At2g12900            bZIP transcription factor family protein
Cucurbita_007385  At1g79620            Leucine-rich repeat transmembrane protein kinase
Cucurbita_007847  At1g76630            Tetratricopeptide repeat (TPR)-containing protein
Cucurbita_003663  At1g56090            Tetratricopeptide repeat (TPR)-containing protein
Cucurbita_011366  At1g51220            Zinc finger (C2H2 type) protein (WIP5)
Cucurbita_008924  At1g20640            RWP-RK domain-containing protein
Cucurbita_011175  At1g01060            DNA-binding protein, putative
Cucurbita_004985  At5g67190            AP2 domain-containing transcription factor, putative
Cucurbita_010521  At5g66210            Calcium-dependent protein kinase
Cucurbita_010421  At5g65710            Leucine-rich repeat transmembrane protein kinase
Cucurbita_002537  At5g63940            Putative protein kinase
Cucurbita_011303  At5g62940            Dof zinc finger protein
Cucurbita_000261  At5g61430            NAM-like protein
Cucurbita_000260  At5g61420            Myb-related transcription factor
Cucurbita_010855  At5g54680            bHLH protein
Cucurbita_010187  At5g46510            Disease resistance protein (TIR-NBS-LRR class), putative
Cucurbita_010656  At5g39810            MADS box protein
Cucurbita_007753  At5g12430            DnaK heat shock N-terminal domain-containing protein
Cucurbita_011069  At5g11100            CLB1-like protein
Cucurbita_009810  At5g10200            Putative protein/tetratricopeptide repeat protein
Cucurbita_006560  At5g05140            Transcription elongation factor-related protein
Cucurbita_006458  At5g03140            Receptor-like protein kinase
Cucurbita_002475  At4g37250            Leucine-rich repeat transmembrane protein kinase
Cucurbita_006111  At4g34000            Abscisic acid-responsive element binding factor (ABF3)
Cucurbita_011182  At4g33920            Putative protein/protein phosphatase Wip1
Cucurbita_010159  At4g31800            WRKY family transcription factor
Cucurbita_009611  At4g31770            RNA lariat debranching enzyme-like protein
Cucurbita_011350  At4g30480            Tetratricopeptide repeat protein
Cucurbita_010918  At4g29090            Reverse transcriptase, putative
Cucurbita_010649  At4g28980            Cyclin-dependent kinase F
Cucurbita_010963  At4g28600            Calmodulin-binding protein
Cucurbita_011513  At4g16360            Kinase-like protein
Cucurbita_011508  At4g12020            Putative disease resistance protein/DNA-binding
Cucurbita_011388  At3g53110            DEAD/DEAH box RNA helicase protein, putative
Cucurbita_011107  At3g45790            Putative protein/MAP3Kα1 protein kinase
Cucurbita_011033  At3g23890            Topoisomerase II
Cucurbita_010145  At3g19290            Abscisic acid-responsive element binding factor
Cucurbita_010373  At3g15540            Early auxin-induced protein, IAA19
Cucurbita_008401  At3g15260            Protein phosphatase 2C (PP2C)
Cucurbita_008101  At3g15210            Ethylene-responsive element binding factor 4 (AtERF4)
Cucurbita_010689  At3g14350            Leucine-rich repeat transmembrane protein kinase
Cucurbita_010491  At3g13840            Scarecrow transcription factor family protein
Cucurbita_010663  At3g11540            Spindly (gibberellin signal transduction protein)
Cucurbita_002882  At3g10070            TAF12/TAFII58
Cucurbita_000307  At3g09780            Putative protein kinase/similar to Pto kinase
Cucurbita_011169  At3g09400            Protein serine/threonine phosphatase
Cucurbita_011269  At3g09010            Putative receptor serine/threonine protein kinase
Cucurbita_000076  At3g06220            Transcription factor B3 family protein
Cucurbita_010934  At3g06030            NPK1-related protein kinase 3
Cucurbita_007977  At3g05330            Peptidyl-prolyl cis-trans isomerase cyclophilin-type family protein
Cucurbita_000759  At3g04450            Myb family transcription factor, putative
Cucurbita_010435  At3g02750            Protein phosphatase 2C (PP2C)
Cucurbita_011229  At2g47430            Cytokinin-responsive histidine kinase (CKI1)
Cucurbita_001889  At2g41900            Putative CCCH-type zinc finger protein
Cucurbita_010181  At2g17290            Putative calmodulin-domain protein kinase CPK6
Cucurbita_000219  At2g01460            Phosphoribulokinase/uridine kinase family protein
control promoter sets to test for the presence of common degenerate motifs. The resulting alignments were visualized using the WEBLOGO program (http://weblogo.berkeley.edu). The most common sequences in all gene promoter sets were A/T-rich motifs (Figure S1) (Hampson et al., 2002); however, it is well known that such motifs are abundant throughout eukaryotic genomes. Additionally, probabilistic methods may tend to show homopolymeric sequences as over-represented; as the A/T-rich motifs are more abundant in eukaryotic genomes, false positives could result. Thus, more robust results are obtained when combined with enumerative methods. Importantly, the second most abundant motif, a CT/GA-rich repeat, was present in the SETP set. Degenerate motifs containing CT/GA repeats were also present in the control promoter sets, but their lower maximum a posteriori probability (MAP) scores, which reflect the relative frequency of the motif, indicated that they were far less abundant in the promoters for these genes. The A/T-rich sequences are probably non-specific, while other motifs appear to be unique to each promoter set. This was the case for the TCP motif, in the ribosomal protein gene set, which was also predicted by the Gibbs motif sampler. In general, the SETP CT/GA-rich signature motif was under-represented in the control promoter sets. Analysis of guard-cell genes offered additional support for the hypothesis that this motif was enriched mostly in vascular-specific promoters.
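Sequence logos of the kind produced by WEBLOGO display, for each position of a set of aligned sites, how strongly the position is conserved. As a rough illustration of that quantity, the sketch below computes per-position information content in bits for equal-length DNA sites. It assumes a uniform background and applies no small-sample correction, so it is a simplified stand-in rather than the procedure used by WEBLOGO or ALIGNACE; the function name and the example sites are hypothetical.

```python
import math
from collections import Counter
from typing import List


def information_content(sites: List[str]) -> List[float]:
    """Per-position information content (bits) for aligned DNA sites.

    Uses 2 - H(position), where H is the Shannon entropy of the observed
    base frequencies; no small-sample correction is applied (assumption).
    """
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    bits = []
    for i in range(length):
        counts = Counter(s[i].upper() for s in sites)
        total = sum(counts.values())
        entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
        bits.append(2.0 - entropy)
    return bits


# Hypothetical aligned sites resembling a degenerate CT-rich motif.
example_sites = ["CTCT", "CTCT", "CTCA", "CTCG"]
print(information_content(example_sites))
```

Fully conserved positions reach the 2-bit maximum, whereas degenerate positions score lower, which is why an A/T-rich or CT/GA-rich logo can look prominent even when the underlying motif is common throughout the genome.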
Similar CT/GA-rich motifs were also detected within the promoters of Arabidopsis homologs for (i) transcripts of tomato (Solanum lycopersicum) that were previously shown to be transported through a graft union into dodder scions (Roney et al., 2007), and (ii) pumpkin phloem transcripts bound by CmRBP50 (Ham et al., 2009), but the small sample numbers (12 sequences for tomato phloem-mobile transcripts, and 10 for pumpkin CmRBP50-bound phloem RNAs) precluded a more thorough analysis of these datasets. Additional analysis for reported vascular-specific promoters (Table S5) using the ALIGNACE program identified similar A/T- and CT/GA-rich motifs (Figure S2). To further test the hypothesis that the SETP set is enriched in CT/GA-rich motifs, we used the Multiple EM for Motif Elicitation (MEME) program (Bailey et al., 2006). The MEME method allows prediction of repetitive sub-sequences within a set of larger sequences. However, as with the ALIGNACE program, this method limits the examination to 46 sequences, each 1000 bp in length. Figure 1 presents the expectation (E) values obtained using the SETP and control (background sequence) promoter sets. The control sequences, driving expression of genes in other cell types, are expected to have different signature motifs. Figure 1 shows the two motifs with the lowest E values. Evidently, the CT/GA-rich motif is present in the phloem and background (control) promoter sets. However, this motif was enriched
[Figure 1: sequence logos not reproduced. Panel columns: Motif 1, Motif 2. E values shown in the panels, in the order extracted for each promoter set: Phloem transcriptome, E = 3.6 × 10^-19 and E = 1.2 × 10^-11; Pollen, E = 8.1 × 10^-7 and E = 8.0 × 10^-6; Guard cell, E = 1.3 × 10^-6 and E = 6.1 × 10^-5; Ribosomal proteins, E = 2.9 × 10^-20 and E = 3.5 × 10^-9; Cyclins, E = 3.0 × 10^-9 and E = 1.1 × 10^-4.]
Figure 1. MEME analysis of Arabidopsis homologs for the pumpkin sieve element transcript promoters (phloem transcriptome) and background (control) gene promoter sets. Over-represented motifs in each promoter set were identified by MEME analysis. The two motifs with the lowest expectation (E) values are shown.
several orders of magnitude in the SETP set relative to the control promoter sets: E values in the SETP, pollen and guard cell sets were 3.6 × 10^-19, 7.0 × 10^-10 and 1.3 × 10^-6, respectively. Low E values for a related CT/GA motif were also found for ribosomal protein and cyclin gene promoters (3.5 × 10^-9 and 3.0 × 10^-9, respectively). In order to determine the enrichment of the aforementioned motifs in the SETP set in a stringent manner, these sequences were filtered and then shuffled using the Sequence Manipulation Suite (http://www.bioinformatics.org/sms2/about.html). Randomized sequences were then
separated manually into 1 kb blocks for MEME analysis, which subsequently indicated that there were no enriched motifs in this dataset. Indeed, the only sequence appearing more than once was a GC-rich motif (detected twice) with an E value of 105, which is 15–20 orders of magnitude higher than the enriched CT/GA motifs identified in the SETP set. An inherent limitation of both the ALIGNACE and MEME methods was the extent of sequence information that could be simultaneously analyzed. To overcome this problem, we used the PROMOMER program (http://bar.utoronto.ca/ntools/cgi-bin/BAR_Promomer.cgi; Toufighi et al., 2005), as it has the capacity to compare the frequency of a motif within a specific promoter set in a larger sample size of Arabidopsis upstream regions. Here, we used 400 gene promoter sequences from an expanded gene promoter set (Table S6), with the exception of the ribosomal and cyclin gene complement promoter sets, for which there are only 253 and 39 genes in the Arabidopsis genome, respectively. This approach revealed that the CT/GA-rich motif was over-represented only in the SETP set (Table S7); only A/T-rich motifs were found to be over-represented in control sets. Another motif that was over-represented exclusively in the SETP set was a CA/GT-rich repeat, which scored even higher than the aforementioned CT/GA-rich motif. However, this motif was not identified using the ALIGNACE or MEME methods, so we are uncertain as to its significance. A phloem/cambium transcriptome dataset, established from the root–hypocotyl boundary of 8-week-old Arabidopsis plants (Zhao et al., 2005), was next analyzed using the PROMOMER program. Interestingly, the resulting analysis revealed that the CT/GA-rich motif was over-represented in the promoters of the phloem/cambium and xylem/cambium gene sets, but not in the phloem cambium/non-vascular gene set (Table S8).
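The randomization control described above (shuffling the sequences and cutting them into 1 kb blocks before motif searching) can be sketched as follows. This is an illustrative stand-in only: the study used the Sequence Manipulation Suite and manual splitting, whereas the sketch performs a simple mononucleotide shuffle in Python. The function names, the fixed seed and the toy input are assumptions.

```python
import random
from typing import List


def shuffle_sequence(seq: str, rng: random.Random) -> str:
    """Return a random permutation of `seq`; base composition is preserved
    but any ordered motif such as a CT/GA repeat is scrambled."""
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)


def split_into_blocks(seq: str, block_size: int = 1000) -> List[str]:
    """Cut `seq` into consecutive full-length blocks; a trailing fragment
    shorter than `block_size` is dropped (assumption)."""
    return [
        seq[i:i + block_size]
        for i in range(0, len(seq) - block_size + 1, block_size)
    ]


# Toy input: 2500 bp of perfect CT repeats, shuffled with a fixed seed.
rng = random.Random(0)
original = "CT" * 1250
shuffled = shuffle_sequence(original, rng)
blocks = split_into_blocks(shuffled)
print(len(blocks), [len(b) for b in blocks])
```

Because shuffling keeps the base composition constant, any motif that remains enriched in the shuffled blocks would reflect composition bias rather than a genuine ordered repeat.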
Alternative methods reveal similar motifs within the SETP set
Sequence motifs within the SETP and control gene sets were searched for in 1 kb upstream regions of each gene using the Gibbs motif sampler (Lawrence et al., 1993). Each gene promoter set harbored specific motifs that were either more abundant or not present in the other promoter sets. This analysis revealed the presence of the well-characterized TCP motif (Welchen and González, 2006) in ribosomal protein gene promoters, which had not previously been reported in these sequences. As shown in Figure S3, each of the four control promoter sets analyzed harbors specific over-represented motifs, as reflected by their low log MAP values. In the case of the SETP set, a CT/GA motif was the most abundant. Similar motifs were found in other promoter sets but at lower frequencies. Two additional enumerative approaches, based on the Weeder algorithm (Pavesi et al., 2007) and the YMF method (Sinha and Tompa, 2002), were also used in this study. The
Weeder algorithm predicted A/T-rich motifs with a P value