Tmod: toolbox of motif discovery

BIOINFORMATICS APPLICATIONS NOTE

Vol. 26 no. 3 2010, pages 405–407 doi:10.1093/bioinformatics/btp681

Sequence analysis

Tmod: toolbox of motif discovery Hanchang Sun1 , Yuan Yuan2 , Yibo Wu1 , Hui Liu1 , Jun S. Liu2,∗ and Hongwei Xie1,∗ 1 Department

of Automatic Control, College of Mechatronics and Automation, National University of Defense Technology, Changsha, Hunan 410073, China and 2 Department of Statistics, Harvard University, Cambridge, MA 02138-2931, USA

Received on September 8, 2009; revised on November 15, 2009; accepted on December 5, 2009 Advance Access publication December 10, 2009 Associate Editor: John Quackenbush

ABSTRACT Summary: Motif discovery is an important topic in computational transcriptional regulation studies. In the past decade, many researchers have contributed to the field and many de novo motiffinding tools have been developed, each may have a different strength. However, most of these tools do not have a user-friendly interface and their results are not easily comparable. We present a software called Toolbox of Motif Discovery (Tmod) for Windows operating systems. The current version of Tmod integrates 12 widely used motif discovery programs: MDscan, BioProspector, AlignACE, Gibbs Motif Sampler, MEME, CONSENSUS, MotifRegressor, GLAM, MotifSampler, SeSiMCMC, Weeder and YMF. Tmod provides a unified interface to ease the use of these programs and help users to understand the tuning parameters. It allows plug-in motiffinding programs to run either separately or in a batch mode with predetermined parameters, and provides a summary comprising of outputs from multiple programs. Tmod is developed in C++ with the support of Microsoft Foundation Classes and Cygwin. Tmod can also be easily expanded to include future algorithms. Availability: Tmod is available for download at http://www.fas .harvard.edu/∼junliu/Tmod/ Contact: [email protected]; [email protected]

1

INTRODUCTION

Conserved sequence patterns shared by multiple protein or nucleic acid sequences often provide important clues about the function of these linear biopolymers (Lawrence et al., 1993; Stormo and Hartzell, 1989). For DNA sequences, conserved sequence patterns, which are often located upstream of co-expressed genes, are likely to be transcription factor binding sites (Conlon et al., 2003). For proteins, short sequence patterns conserved across evolutionarily remotely related proteins may correspond to important functional domains (Liu et al., 1995). Identification of these overrepresented sequence patterns, often called motifs, can be an important first step toward our understanding of gene regulatory networks and protein functions. Because direct experimental determination of these motifs is neither cost-effective nor practical in many biological systems, computational motif discovery tools have become indispensable for many biological studies involving gene regulations. Many algorithms and software have been proposed to tackle the motif-finding problem. AlignACE (Roth et al., 1998), BioProspector ∗ To

whom correspondence should be addressed.

(Liu et al., 2001), MotifSampler (Thijs et al., 2001), SeSiMCMC (Favorov et al., 2005) and Gibbs motif sampler (Liu et al., 1995) use Gibbs sampling strategies for motif discovery based on Bayesian statistical models. GLAM (Frith et al., 2004) uses a simulated annealing approach to optimize the alignment of multiple sequences, and performs more efficiently for mode finding than the standard Gibbs sampler. MEME (Bailey and Elkan, 1994) applies the EM algorithm, instead of Gibbs sampling, to find the maximum likelihood motif estimation based on a model similar to that used by the Gibbs motif sampler. CONSENSUS (Hertz and Stormo, 1999) employs a greedy algorithm for optimizing the motif information content, which is asymptotically equivalent to finding the maximum a posteriori motif alignment. Although the aforementioned model-based motif-finding methods have met great successes, none of these algorithms can guarantee to find the optimal solution (such as the posterior mode). Thus, several alternative algorithms based on word enumeration and other heuristics have been developed. YMF (Sinha and Tompa, 2003) uses a simpler motif model and a heuristic enumeration algorithm guaranteed to find the motifs with the highest Z-scores. Weeder (Pavesi et al., 2001) uses an extended exhaustive enumeration for both short sequence signals and longer patterns. MDscan (Liu et al., 2002) finds motif candidates by combining a word enumeration technique applied to high-confidence sequences with a Gibbs sampling refinement. MotifRegressor (Conlon et al., 2003) starts from motif candidates predicted by MDscan, and uses gene expression and/or ChIP-on-chip enrichment information to help select functionally important motifs. When applied to a particular sequence dataset, the algorithms will most likely turn up different motif results due to different underlying mechanisms or tuning parameters of the algorithms, and each of these programs can perform better than others in some particular cases. Furthermore, motifs found by different algorithms may have different but unique characteristics. It is very helpful for further comparison, assessment and selection to present all of these different results in a single software platform. Jensen and Liu (2004) proposed BioOptimizer to combine and compare the motif-finding results using a score function based on a Bayesian model. BioOptimizer can handle situations in which binding site abundance and motif width are unknown. It also deals with two-block motifs with variablelength gaps. We thus include BioOptimier in Tmod to aid the user in analyzing the motif-finding results. A majority of the motif-finding programs were made to run on Linux operating systems. Considering the fact that many researchers

© The Author 2009. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]

[15:00 8/1/2010 Bioinformatics-btp681.tex]

405

Page: 405

405–407

H.Sun et al.

in related fields use the Windows operating systems, we developed Tmod, a Windows-based integrated software platform, to make these motif-finding programs easier to use for biologists.

2

THE Tmod SYSTEM

Our software tool Tmod employs a two-level software system architecture (Fig. 1). The first level is a human–machine interface, which contains the software main interface and several parameter input interfaces for the next level motif discovery programs. The second level comprises of the 12 motif-finding programs. These command-line based programs are supported by Cygwin (http://cygwin.com/), which is a Linux-like environment for Windows. MDScan, BioProspector, SeSiMCMC, MotifSampler, Weeder and Gibbs Motif Sampler provide Cygwin version programs. We downloaded the source code of CONSENSUS, AlignACE, MEME, GLAM and YMF, and compiled them in Cygwin environment. All these Cygwin version programs can run well on the Windows platform with the Cygwin dynamic link library cygwin1.dll. For MotifRegressor, we changed its main Perl script into an executable program, and rewrote the stepwise regression part in Matlab and compiled the M file to an executable file. After these modifications, MotifRegressor can run well on the Windows platform without S-PLUS. After the users input parameters for any of the 12 programs and click the ‘Run’ menu, Tmod will record these parameters in a batch file and then generate a new thread to run the saved batch file. In practice, users may want to run a certain program with different parameter combinations. It is tedious and time consuming to enter the parameters one by one. Tmod provides a batch interface so that the users can easily write the commands with several sets of parameters in a batch file. It will run the users’ batch file and report multiple results automatically. As all the motif discovery programs have their own format of output files, it is very hard for users to interpret and compare the results from multiple programs. Tmod provides a unified summary file to list the consensus sequence of all the motifs predicted by different programs. This summary can be further used for motif

comparison and optimization. BioOptimizer is also included in Tmod for further motif comparison and optimization. All the motif discovery programs included in Tmod handle their own exceptions. We transplant these programs to Cygwin platform retaining their exception handling abilities. In our Tmod software, we do not provide any additional exception handling. All the parameters passed to the motif-finding programs are in the form of strings. In this way, Tmod will not generate new exceptions. Our software Tmod adopts the open software architecture, so that it can be easily maintained and expanded. Any plug-in motiffinding program will not be interfered by other programs. Each motif-finding program uses its own independent thread. Besides the 12 motif-finding programs integrated in Tmod, there are still many other motif-finding programs. If they are written in C/C++ on Linux platform, they can be transplanted to the Windows platform and added to Tmod easily.

3

IMPLEMENTATION

Tmod was implemented by Microsoft Foundation Classes. The Visual C++ 6.0 development environment was used to build the Windows application, and the Cygwin Release 1.5.21-1 was used to compile the source code of the motif discovery programs. The motif discovery programs integrated in the current version of Tmod are as follows: MDscan, BioProspector, MEME 3.5.4, Gibbs Motif Sampler, AlignACE 4.0, CONSENSUS V6C, GLAM2, MotifSampler 3.1.1, Weeder 1.3, YMF 3.0, MotifRegressor (rewrite) and SeSiMCMC 4.35. Tmod also provides a platform to compare and assess the results of the motif discovery programs. The main interface and typical summary output of Tmod are shown in Fig. 2. It will help users get some knowledge on Tmod. As a demonstration, we applied Tmod to the cyclic AMP receptor protein binding sequence dataset (Stormo and Hartzell). We set the motif width in a range of 6–10 bp, and all other parameters of the programs as the default. The results are available at the Tmod homepage.

ACKNOWLEDGEMENTS We thank the developers of each program for allowing us to integrate their current versions in Tmod. We also thank Roee Gutman for helpful suggestions and his test of Tmod. Funding: NIH (GM-078990 to J.S.L.). Conflict of Interest: none declared.

REFERENCES

Fig. 1. Software architecture of the Tmod system. Names displayed at the second level, such as AlignACE, BioProspector, etc., correspond to different motif discovery programs. Each program receives input sequences and parameters through the human–machine interface of Tmod. The motifs found by the programs can be compared using BioOptimizer.

Bailey,T.L. and Elkan,C. (1994) Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc. Int. Conf. Intell. Syst. Mol. Biol., 2, 28–36. Conlon,E.M. et al. (2003) Integrating regulatory motif discovery and genome-wide expression analysis. Proc. Natl Acad. Sci., 100, 3339–3344. Favorov,A.V. et al. (2005) A Gibbs sampler for identification of symmetrically structured, spaced DNA motifs with improved estimation of the signal length. Bioinformatics, 21, 2240–2245. Frith,M.C. et al. (2004) Finding functional sequence elements by multiple local alignment. Nucleic Acids Res., 32, 189–200. Hertz,G.Z. and Stormo,G.D. (1999) Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics, 15, 563–577.

406


Page: 406

405–407

Tmod

Fig. 2. (A) The main interface of Tmod. (B) The summary file of consensus sequences parsed from the output files of each program. (C) The summary file of motif information parsed from BioOptimizer sum files.

Jensen,S.T. and Liu,J.S. (2004) BioOptimizer: a Bayesian scoring function approach to motif discovery. Bioinformatics, 20, 1557–1564. Lawrence,C.E. et al. (1993) Detecting subtle sequence signals: a gibbs sampling strategy for multiple alignment. Science, 262, 208–214. Liu,J.S. et al. (1995) Bayesian models for multiple local sequence alignment and gibbs sampling strategies. J. Am. Stat. Assoc., 90, 1156–1170. Liu,X.S. et al. (2001) Bioprospector: discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes. Proc. Pac. Symp. Bioinfor., 6, 127–138. Liu,X.S. et al. (2002) An algorithm for finding protein-DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments. Nat. Biotechnol., 20, 835–839. Pavesi,G. et al. (2001) An algorithm for finding signals of unknown length in DNA sequences. Bioinformatics, 17, S207–S214.

Roth,F.P. et al. (1998) Finding DNA regulatory motifs within unaligned noncoding sequences clustered by whole-genome mRNA quantitation. Nat. Biotechnol., 16, 939–945. Sinha,S. and Tompa,M. (2003) YMF: a program for discovery of novel transcription factor binding sites by statistical overrepresentation. Nucleic Acids Res., 31, 3586–3588. Stormo,G.D. and Hartzell,G.W. (1989) Identifying protein-binding sites from unaligned DNA fragments. Proc. Natl Assoc. Sci., 86, 1183–1187. Thijs,G. et al. (2001) A higher-order background model improves the detection of promoter regulatory elements by Gibbs sampling. Bioinformatics, 17, 1113–1122.

407


Page: 407

405–407