Systematic Entomology (2005) 30, 179–182. © 2005 The Royal Entomological Society

Software Review The newest kid on the parsimony block: TNT (Tree analysis using new technology)

TNT: TREE ANALYSIS USING NEW TECHNOLOGY by P. Goloboff, J. S. Farris & K. Nixon. 2003. Program and documentation, available from the authors, and at http://www.zmuc.dk/public/phylogeny. US$80.

Cladogram reconstruction based on parsimony has a long and illustrious history in systematics. Hennig (1950) suggested a precursor in the form of his ‘Sparsamkeitsprinzip’, which was transformed into the more rigorous concept of ‘parsimony’ sensu Farris and Kluge after Hennig’s ‘manual’ tree reconstruction techniques were fused with numerical methods. In several seminal papers, Farris (1983) and Kluge (1984) established the rationale and justification for using parsimony as an optimality criterion in tree searches, and for many years parsimony dominated phylogenetics. Only fairly recently have maximum likelihood and its faster cousin, an MCMC implementation of Bayesian likelihood (Ronquist & Huelsenbeck, 2003), threatened parsimony’s dominance in tree reconstruction. However, parsimony remains arguably the most important and best justified tree reconstruction technique.

Over the years many parsimony software packages have been developed. What started with Farris and Mickevich’s PHYSYS was followed by PAUP 3.11 (Swofford, 1985), HENNIG86 (Farris, 1989), and PHYLIP (Felsenstein, 1993) before being complemented by NONA (Goloboff, 1993) and PAUP* (Swofford, 2002). Given the many parsimony packages and the current popularity of likelihood-based techniques, one may question the need for an additional parsimony program. However, this would be like questioning the replacement of a computer bought 10 years ago. The old one is still working, but it is no longer up to today’s challenges. The same is true for parsimony software. Ten years ago data sets had relatively few taxa and characters and the main goal of analyses was finding trees that were more or less guaranteed to be maximally parsimonious.
The main worries were issues such as soft polytomies (Coddington & Scharff, 1997) and missing values (Platnick et al., 1991). Today, data sets routinely have thousands of characters for up to several hundred taxa. Furthermore, larger data sets are on the horizon. Just consider NSF’s ‘Assembling-the-tree-of-life’ initiative, for which the vast majority of the approved projects will require the analysis of data sets with many genes and morphology for several thousand taxa. Furthermore, modern systematics is toying with ways to mine existing phylogenetic databases. Supermatrices and MRP matrices for supertree reconstruction, with their high proportion of missing entries, need to be analysed, thus creating new computational challenges. In addition, none of these matrices will be analysed only once. Tree support needs to be assessed using character resampling (bootstrap, jackknife) or Bremer support values, thus requiring many passes over large matrices. Obviously, modern phylogenetics needs new software that is able to search through huge tree spaces. Finding most parsimonious trees is still a noble goal, but somewhat unrealistic, and today the name of the game is finding trees that are at least close to being optimal. The solution comes in the form of new smart algorithms and search strategies (e.g. Goloboff, 1994, 1996, 1999; Nixon, 1999a). All of these techniques are implemented in TNT and this makes the software package a must-have in any systematics laboratory working with reasonably large data sets (>100 taxa).

Getting started

The first hours with a new computer or a new piece of software are usually supremely annoying and one is tempted to go back to the old equipment/software despite knowing that the new tools will ultimately be a better choice. What about TNT? After all, some anxiety is warranted given that some software by the same authors has in the past been somewhat less than user-friendly. However, TNT turns out to be a pleasant surprise. The installation is a breeze and only involves a quick download and a double-click on a self-executable file. The program reads a variety of different file formats including basic Nexus files. Error messages alert the user to potential problems with a particular data set and make it relatively straightforward to troubleshoot recalcitrant matrices. Data matrices can be built in TNT or imported. If the latter option is chosen, we advise building only the core matrix elsewhere and avoiding Nexus’s interleaved format.
Advanced features such as taxon and character sets are better added from within TNT. Matrices can also be built in TNT, but we suspect that most users will be reluctant to learn new software as long as the old programs are satisfactory. However, the data entry format of TNT is user-friendly and worth trying. For example, one can assign a character state to whole blocks of taxa instead of having to enter the same state over and over again for individual species. Getting started with TNT is further facilitated by an informative tutorial consisting of a PowerPoint presentation with numerous TNT screenshots (which also allows for a unique view of Pablo Goloboff’s computer desktop in Spanish) and a help file that is for the most part intelligible as long as one skips the section on ‘scripting’.

Tree searches

Once TNT has understood a matrix, a variety of menus make it very simple to carry out tree searches. A ‘run’ button opens a menu that allows for the precise definition of analysis conditions. If the data set is relatively small, one can rely on ‘traditional searches’. The usual tree search techniques are available here (various branch-swapping routines, multiple addition replicates). However, the real novelty and most important reason to own the program are the ‘New Technology Search’ options. The good news is that there is a menu with lots of items to choose from and parameters to play with. This gives the expert a wide variety of choices and allows for adjusting the parameters to the needs of a particular matrix. The bad news is that you will initially be lost unless you are an expert on the inner workings of ‘sectorial searches, ratchets, tree drifting, and tree fusing’. However, TNT has default settings for its ‘New Technology Search’ that are quite powerful and useful for the beginner. The settings can easily be modified by varying the search-intensity level between 1 and 99. All New Technology Searches should be repeated in order to confirm that a particular tree length is the most parsimonious and that at least a large proportion of all equal-length trees has been found. To avoid wasting computing time, TNT allows the user to pre-set the number of times the same tree length has to be found before the search is abandoned; i.e. there is no need to wait for 100 random addition replicates to complete if many of the initial searches already find the same tree length. PAUP* (Swofford, 2002), NONA (Goloboff, 1993), and TNT perform equally well for small data sets, but TNT obliterates its competition in analysing large matrices. PAUP* in particular tends to choke occasionally on large numbers of trees during branch swapping, and it can take hours to complete a single pass over the data.
This makes PAUP* a poor choice for analysing super- and MRP matrices that are usually rich in missing entries. TNT and NONA use a stricter definition of branch support than PAUP* under its default settings, which cuts down on equally parsimonious trees. Both programs are less likely to get stuck, but only TNT breezes (Goloboff, 1993) through very large data sets. We have recently analysed several super- and MRP matrices. Compared to its competitors, TNT consistently finds shorter trees in a fraction of the time required by other software packages. Even a data set with almost 100 000 characters and 208 taxa could be analysed in 24 h using rigorous parameter settings in ‘New Technology Search’ (initial level = 99; finding same tree length 10 times).
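The stopping rule described above (abandon the search once the same best tree length has been found a pre-set number of times) can be illustrated with a minimal Python sketch. This is not TNT code and makes no claim about how TNT implements the rule internally; the function name `search_until_stable`, its parameters, and the callback `run_replicate` (a stand-in for one random addition replicate that returns a tree length) are all hypothetical.

```python
from typing import Callable, Optional, Tuple


def search_until_stable(
    run_replicate: Callable[[int], int],
    hits_required: int = 10,
    max_replicates: int = 100,
) -> Tuple[Optional[int], int, int]:
    """Run replicates until the best length has been hit `hits_required` times.

    `run_replicate` is a hypothetical callback: given a replicate index it
    performs one search and returns the length of the tree it found.
    Returns (best_length, times_best_was_found, replicates_run).
    """
    best: Optional[int] = None
    hits = 0
    replicates_run = 0
    for rep in range(max_replicates):
        length = run_replicate(rep)
        replicates_run += 1
        if best is None or length < best:
            # A shorter tree resets the counter: earlier hits no longer count.
            best, hits = length, 1
        elif length == best:
            hits += 1
        if hits >= hits_required:
            break  # Same best length found often enough; stop early.
    return best, hits, replicates_run
```

Note that a rule of this kind is a heuristic: a low threshold can stop the search on a length that later replicates would have improved, which is one reason to repeat searches as recommended above.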

TNT allows the user to save trees in parenthetical notation or as an object-based graphics file in emf format. The latter is our favourite because it allows for an easy manipulation of the trees in standard software such as Word and PowerPoint. The parenthetical notation is useful for opening trees in programs such as TreeView (Page, 1996). A variety of consensus trees can be computed, but saving them in parenthetical notation is somewhat cumbersome, because it requires selecting the consensus tree from a tree buffer which also contains parsimonious cladograms. We also regret that TNT does not compute Adams consensus trees, because we find them useful for quickly identifying ‘wildcard’ or ‘floating’ taxa that cause strict consensus trees to be unresolved.

Branch support

TNT offers the usual suspects; i.e. bootstrap and jackknife are supported in a variety of different implementations and with the option of using either traditional or New Technology Searches for the replicates. The latter is very useful because we found that for at least one data set, bootstrap support values were heavily dependent on carrying out high-quality tree searches for all replicates. Upon completing the analysis, TNT will display a tree with its support values, but unfortunately this tree is only saved as an emf file if one opens a metafile in the ‘output’ menu prior to resampling. Otherwise the tree is lost when the next computer user accidentally exits TNT. Here it would be useful to include the ‘metafile’ option in the resampling menu.

TNT’s implementation of Bremer supports is also less than satisfactory. There are two traditions for calculating Bremer values. One is based on producing constraint files from a consensus tree that can then be used to coax PAUP* into finding parsimonious trees under the assumption that a particular node is not supported (Autodecay: Eriksson, 2001; TreeRot: Sorenson, 1999). The advantage of this approach is accurate Bremer support values for each clade, but it also requires separate analyses for each node. The other approach to computing Bremer supports is based on using suboptimal trees found during normal tree searches (e.g. POY, NONA). For example, if a suboptimal tree is one step longer than the MPTs and it lacks a certain clade found on the latter, then the Bremer support for this node is obviously 1. Computing Bremer supports based on suboptimal trees is fast because these trees are a by-product of normal tree searches. However, given the large number of suboptimal trees, one cannot save all trees that are wildly unparsimonious (e.g. 10% longer than MPT), so that Bremer supports for very well-supported clades remain elusive.
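The suboptimal-tree approach just described can be made concrete with a small Python sketch. It is an illustration under stated assumptions, not the procedure used by TNT, NONA or POY: each tree is represented simply as a pair of its length and the set of clades it contains (each clade a frozenset of taxon names), and the function name `bremer_from_suboptimal` is hypothetical.

```python
from typing import Dict, FrozenSet, List, Optional, Tuple

Clade = FrozenSet[str]
Tree = Tuple[int, FrozenSet[Clade]]  # (tree length, clades present on the tree)


def bremer_from_suboptimal(trees: List[Tree]) -> Dict[Clade, Optional[int]]:
    """Estimate Bremer support for clades shared by all shortest trees.

    Support for a clade = (length of the shortest saved tree lacking the
    clade) - (length of the shortest trees). If no saved tree lacks the
    clade, the value is None: the sample of trees cannot resolve it.
    """
    best = min(length for length, _ in trees)
    mpts = [clades for length, clades in trees if length == best]
    # Only clades found on every shortest tree (strict consensus) are scored.
    shared = frozenset.intersection(*mpts)
    support: Dict[Clade, Optional[int]] = {}
    for clade in shared:
        lacking = [length for length, clades in trees if clade not in clades]
        support[clade] = min(lacking) - best if lacking else None
    return support
```

Because the result depends entirely on which suboptimal trees happen to have been saved, the values are upper bounds on the true Bremer support, and clades that appear on every saved tree receive no value at all.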
Furthermore, not all suboptimal trees are saved during a regular tree search so that the Bremer support values can be inaccurate. TNT follows the second tradition in promoting Bremer support calculation based on suboptimal trees. We would prefer if TNT also included an automated function that allows for calculating Bremer supports based on constrained searches. This automated option should also cover the increasingly popular Partitioned Bremer Support (PBS) (Baker & DeSalle, 1997), since PBS values have proved to be useful for studying character conflict between data partitions (Baker et al., 1998). TNT does have a user-friendly option for carrying out constrained searches on particular nodes. The clades can be selected on the graphical representation of a tree instead of using a cumbersome parenthetical notation; i.e. one could compute Bremer supports in TNT by selecting one node at a time, but this procedure is impractical for trees with many nodes. One could presumably also create one’s own automated way to compute Bremer supports by using TNT’s scripting language, but we believe that TNT should automatically provide such an option.

Tree graphics

TNT includes many of the features known from Winclada (Nixon, 1999b); i.e. it contains a data and a tree editor. Trees can be modified, for example, by colouring particular branches. Branches on topologies can be moved, and obtaining a list of character changes at a particular branch is easy. More interestingly, the tree-graphics component of the software is well integrated with the data-analysis component. For example, one can select a group of taxa on a tree for subsequent exclusion in a reanalysis of the data set.

Useful luxury

TNT contains an intuitive way of building batch files. The user just records a series of mouse-clicks on menu items, which can then be executed via the batch file at any point in time. Similar batch modes are available for other programs but require the user to learn a strict syntax and sometimes arcane command and option names. Complex tasks can also be prescribed using a TNT-specific scripting language. TNT furthermore allows for the coding and analysis of continuous characters. Continuous characters have in the past been neglected or horribly distorted through more or less arbitrary recoding into discrete states. TNT allows the user to explore the information content of continuous characters in their original form. However, combining them with discontinuous characters will not be straightforward and will require much research. TNT also includes the option to recode cladograms as MRPs (Goloboff & Pol, 2002). The MRP matrices can be saved in Nexus format, which could be useful for character mapping in MacClade, but unfortunately the taxon names are truncated and need to be manually edited.
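For readers unfamiliar with MRP recoding, the following Python sketch shows the standard matrix-representation idea: every clade of every source tree becomes one binary character, scored 1 for taxa inside the clade, 0 for taxa on that tree but outside the clade, and ‘?’ for taxa absent from that tree. It is a generic illustration rather than TNT’s own routine; the function name `mrp_matrix` and the input representation (each source tree as a pair of its taxon set and a list of clades) are assumptions made for the example.

```python
from typing import Dict, FrozenSet, List, Tuple

# A source tree: (all taxa on the tree, list of its non-trivial clades).
SourceTree = Tuple[FrozenSet[str], List[FrozenSet[str]]]


def mrp_matrix(source_trees: List[SourceTree]) -> Dict[str, str]:
    """Recode source trees as one binary character per clade.

    Returns a mapping from taxon name to its row of character states,
    using '1' (in clade), '0' (on tree, outside clade) and '?' (taxon
    missing from that source tree).
    """
    all_taxa = sorted(set().union(*(taxa for taxa, _ in source_trees)))
    rows: Dict[str, List[str]] = {taxon: [] for taxon in all_taxa}
    for taxa, clades in source_trees:
        for clade in clades:
            for taxon in all_taxa:
                if taxon not in taxa:
                    rows[taxon].append("?")  # not sampled in this tree
                elif taxon in clade:
                    rows[taxon].append("1")
                else:
                    rows[taxon].append("0")
    return {taxon: "".join(states) for taxon, states in rows.items()}
```

The ‘?’ entries accumulate quickly when source trees overlap only partially in their taxa, which is why MRP matrices are so rich in missing entries. An all-zero outgroup row is commonly added to such matrices for rooting; it is omitted here to keep the sketch short.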

Wishlist

Some items in this category were already mentioned above. Firstly, finding wildcard taxa that destroy resolution on strict consensus trees is increasingly important for the analysis of super- and MRP matrices. We find Adams consensus trees to be useful for this purpose, and wish they were included in TNT. TNT already has other useful tools such as sophisticated tree comparison metrics (e.g. agreement subtrees), but in future versions they should all be bundled in an ‘identify wildcard taxa’ menu. Secondly, we obviously bemoan the lack of a node-specific Bremer and PBS support calculation option. Thirdly, it would be useful to have a facility for displaying trees as ‘phylograms’ and to obtain ‘pairwise distance’ matrices. Both are admittedly very crude tools for data exploration, but they are nevertheless useful for a quick assessment of gene partitions. Fourthly, it is difficult in TNT to evaluate the influence of coding gaps as fifth character states or missing values on the tree topology, because dashes are automatically interpreted as fifth character states. If the user wants to explore alternative codings, a new matrix must be created by replacing all dashes with question marks. Especially for large supermatrices such a search-and-replace operation can be nontrivial. It would be more user-friendly if there were an option to interpret both question marks and dashes as missing values. Fifthly, we find some menus to be too complex and others lacking obvious options. For example, selecting and defining taxon and character groups is more complex and less intuitive than in the corresponding menus in PAUP*, while the consensus tree and resampling menus lack options for saving trees in desirable formats. All the options are available, but not always easy to find. We have also repeatedly encountered problems when dealing with matrices that contained multiple taxon groups. Lastly, we agree with Hovenkamp (2004) that starting character, taxon, and group counts at zero remains unintuitive and is a potential source of error when a character list is entered into a matrix and some characters need to be coded as additive.

Outlook

Tree reconstruction in general and parsimony analysis in particular have come a long way in the past 10 years. Currently, we see a multitude of different techniques and programs vying for the attention of phylogeneticists. The overall trend is toward specialized software packages designed to be leaders in a particular analysis philosophy. Good examples are POY (Wheeler et al., 2004) for optimization alignment and fixed-state alignment and MrBayes for Bayesian likelihood analysis (Ronquist & Huelsenbeck, 2003), and there is no doubt in our mind that TNT will be the ‘industry leader’ for parsimony. TNT represents a milestone in parsimony analysis and with a few adjustments and additions, we are sure that it will be generally adopted. In our laboratory, TNT is already the default software and the program is well worth the distribution fee of US$80.


RUDOLF MEIER Department of Biological Sciences and FARHAN B. ALI Department of Social Work and Psychology National University of Singapore Singapore

References

Baker, R.H. & DeSalle, R. (1997) Multiple sources of character information and the phylogeny of Hawaiian drosophilids. Systematic Biology, 46, 654–673.
Baker, R.H., Yu, X. & DeSalle, R. (1998) Assessing the relative contribution of molecular and morphological characters in simultaneous analysis trees. Molecular Phylogenetics and Evolution, 9, 427–436.
Coddington, J.A. & Scharff, N. (1997) Problems with ‘soft’ polytomies. Cladistics, 12, 139–145.
Eriksson, T. (2001) AUTODECAY V. 5.0. Bergius Foundation, Royal Swedish Academy of Science, Stockholm (program distributed by the author).
Farris, J.S. (1983) The Logical Basis of Phylogenetic Analysis. Advances in Cladistics (ed. by N. I. Platnick and V. A. Funk), pp. 1–36. Columbia University Press, New York.
Farris, J.S. (1989) HENNIG86: a PC-DOS program for phylogenetic analysis. Cladistics, 5, 163.
Felsenstein, J. (1993) PHYLIP (Phylogeny Inference Package), Version 3.5c. Department of Genetics, University of Washington, Washington, DC.
Goloboff, P. (1993) NONA and PeeWee. Freeware available from: http://www.zmuc.dk/public/phylogeny.
Goloboff, P. (1994) Character optimization and calculation of tree lengths. Cladistics, 9, 433–436.
Goloboff, P. (1996) Methods for faster parsimony analysis. Cladistics, 12, 199–220.
Goloboff, P. (1999) Analyzing large data sets in reasonable times: solutions for composite optima. Cladistics, 15, 415–428.
Goloboff, P. & Pol, D. (2002) Semi-strict supertrees. Cladistics, 18, 514–525.
Hennig, W. (1950) Grundzüge einer Theorie der Phylogenetischen Systematik. Deutscher Zentralverlag, Berlin.
Hovenkamp, P. (2004) Review of TNT – Tree Analysis Using New Technology, Version 1.0. Cladistics, 20, 378–383.
Kluge, A.G. (1984) The relevance of parsimony to phylogenetic inference. Cladistics: Perspectives on the Reconstruction of Evolutionary History (ed. by T. Duncan and T. Steussey), pp. 24–38. Columbia University Press, New York.
Nixon, K. (1999a) The parsimony ratchet, a new method for rapid parsimony analysis. Cladistics, 15, 407–414.
Nixon, K.C. (1999b) WINCLADA (BETA), Version 0.9.9. Published by the author.
Page, R.D.M. (1996) TREEVIEW: An application to display phylogenetic trees on personal computers. Computer Applications in the Biosciences, 12, 357–358.
Platnick, N.I., Griswold, C.E. & Coddington, J.A. (1991) On missing entries in cladistic analysis. Cladistics, 7, 337–343.
Ronquist, F. & Huelsenbeck, J.P. (2003) MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics, 19, 1572–1574.
Sorenson, M.D. (1999) TREEROT, Version 2. Boston University, Boston.
Swofford, D.L. (1985) PAUP: Phylogenetic Analysis Using Parsimony. Natural History Survey, Illinois.
Swofford, D.L. (2002) PAUP*: Phylogenetic Analysis Using Parsimony (and Other Methods), Version 4.0b10. Sinauer Associates, Inc, Sunderland, Massachusetts.
Wheeler, W.C., Gladstein, D.S. & De Laet, J. (2004) POY, Version 3.0. Ftp.Amnh.Org /Pub/Molecular/Poy. American Museum of Natural History, Washington, DC.