HICLAS: a taxonomic database system for displaying and comparing ...

Oct 9, 1998

BIOINFORMATICS

HICLAS: a taxonomic database system for displaying and comparing biological classification and phylogenetic trees


Abstract

Motivation: Numerous database management systems have been developed for processing various taxonomic databases on biological classification or phylogenetic information. In this paper, we present an integrated system to deal with interacting classifications and phylogenies concerning particular taxonomic groups.

Results: An information-theoretic view (taxon view) has been applied to capture taxonomic concepts as taxonomic data entities. A data model which is suitable for supporting semantically interacting dynamic views of hierarchic classifications and a query method for interacting classifications have been developed. The concept of taxonomic view and the data model can also be expanded to carry phylogenetic information in phylogenetic trees. We have designed a prototype taxonomic database system called HICLAS (HIerarchical CLAssification System) based on the concept of taxon view, and the data models and query methods have been designed and implemented. This system can be effectively used in the taxonomic revisionary process, especially when databases are being constructed by specialists in particular groups, and the system can be used to compare classifications and phylogenetic trees.

Availability: Freely available at the WWW URL: http://aims.cps.msu.edu/hiclas/

Contact: [email protected]; [email protected]

3 Permanent address: Wuhan Institute of Botany, Academia Sinica, Wuhan 430074, Hubei, People’s Republic of China

© Oxford University Press

Introduction

The use of computers and quantitative analyses are two of the most important components of modern systematic biology (Abbott et al., 1985; Pankhurst, 1991). Various computerized taxonomic information systems including data capture, data storage, data retrieval and other methods of data processing have been developed. These systems employ many approaches to managing curatorial, biogeographic and bibliographic data, as well as related nomenclatural data and taxonomic descriptions at all hierarchical levels. Many systems have become effective tools for taxonomic research, such as revisionary work in particular taxa or floristic databases based on specimen collections in herbaria.

The methodologies of quantitative analyses in systematic biology, involving mainly numerical phenetics and phylogenetics or cladistics, have been used to analyze the variations within characters and divergences among organisms, and to reconstruct the evolutionary history among taxa. It is, therefore, a natural idea to combine taxonomic information systems with computer-assisted quantitative taxonomic analyses such as cladistic analysis.

There are at least two practical needs in today’s systematics. First, the growth of phylogenetic information involving various taxonomic groups and taxon data, especially morphological and molecular data, requires taxonomic databases for the transmission, storage and retrieval of phylogenetic information. Both of these problems have begun to attract the attention of biologists and computer scientists (Sanderson et al., 1993; Zhong et al., 1994). Integration of traditional classification schemes and quantitative taxonomic analyses will raise the working efficiency of systematists and other users. Second, since the amount of electronically available biological information is increasing rapidly, it is necessary to have a distributed database and to develop software tools for database interoperability. The development of computer networking technology provides great possibilities for the systematics community in these respects (Blake et al., 1994). However, there are still many practical problems in developing an integrated database system to deal with interacting biological classifications and phylogenies.
For example, a basic problem is the lack of a common data structure for processing the data about taxonomic hierarchy and the results of taxonomic analysis simultaneously, although some representations, such as the tree structure, have long been widely used to demonstrate the results of biological classification or taxonomic analysis. On the other hand, it is difficult to model dynamically evolving and semantically interacting classifications using traditional data models. Even newly developed models such as object-oriented data (OOD) models are not appropriate for this purpose. This is because OOD models were primarily developed to model complex objects where the relationships between objects are static rather than dynamic (Hughes, 1991; Moerkotte and Zachmann, 1993; Jung et al., 1995).

In previous papers (Beach et al., 1993; Jung et al., 1995; Zhong et al., 1996), we proposed an information-theoretic view (taxon view) that includes the taxon name, author or authority, date and reference to the publication. The taxon view can be applied to biological classification to capture taxonomic concepts as taxonomic data entities, and to develop a system for managing these concepts and the lineage relationships among them. We also developed a data model which is suitable for supporting semantically interacting dynamic views of hierarchic biological classifications. On the basis of the data model and the comparison and query methods, a prototype taxonomic database system called HICLAS (HIerarchical CLAssification System) was built to query classification data and to compare interacting classifications. For the goals of the present work, the system is expanded to display and compare not only classification trees, but also phylogenetic trees resulting from phylogenetic analyses.

Y.Zhong et al.

System and methods

Taxon view and data model

The taxon view is a basic and central concept in our data model, and can serve as a sound basis for building a common framework for handling nomenclatural data in HICLAS (Beach et al., 1993; Jung et al., 1995; Zhong et al., 1996). A taxon view is a quadruple of the following elements [a question mark (?) can be inserted after any element]: taxon name, author or authority, year, and publication number. For example, [...] and <Nymphaeaceae, Caspary, 1888, Pub 2> are two of the taxon views concerning the water lily family Nymphaeaceae.

A data model which is suitable for storing and comparing interacting classification data supports a set of graph-oriented data-structuring tools. The set includes classification trees that capture particular views of hierarchical classifications. In such a classification tree, each node can be represented by a taxon view based on a distinguishable set of data characteristics. Each edge in a classification tree represents a hierarchical relationship among taxon views within a classification.
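As an illustration only, the taxon view quadruple and a classification tree built from taxon views might be sketched in Python as follows. The class names, the `uncertain` field used to record question marks, and the placeholder order and genus views are assumptions made for this sketch and are not taken from the HICLAS implementation; only the family view is the example given in the text.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class TaxonView:
    """One taxon view: <taxon name, author or authority, year, publication>."""
    name: str
    author: str
    year: Optional[int]
    publication: str
    # Names of the elements marked with "?" (an assumed encoding).
    uncertain: frozenset = frozenset()


class ClassificationTree:
    """A hierarchy whose nodes are taxon views and whose edges link parent to child."""

    def __init__(self, root: TaxonView) -> None:
        self.root = root
        self._children: Dict[TaxonView, List[TaxonView]] = {root: []}

    def add_child(self, parent: TaxonView, child: TaxonView) -> None:
        # Each edge records one hierarchical relationship in this classification.
        self._children[parent].append(child)
        self._children.setdefault(child, [])

    def children(self, node: TaxonView) -> List[TaxonView]:
        return list(self._children[node])

    def terminal_views(self, node: TaxonView) -> List[TaxonView]:
        """Return the lowest taxon views found at or below the given node."""
        kids = self._children[node]
        if not kids:
            return [node]
        found: List[TaxonView] = []
        for kid in kids:
            found.extend(self.terminal_views(kid))
        return found


# The family view is the example from the text; all other values are placeholders.
order = TaxonView("Nymphaeales", "Author A", None, "Pub X", frozenset({"author", "year"}))
family = TaxonView("Nymphaeaceae", "Caspary", 1888, "Pub 2")
genus_a = TaxonView("Genus A", "Author B", None, "Pub 2")
genus_b = TaxonView("Genus B", "Author B", None, "Pub 2")

tree = ClassificationTree(order)
tree.add_child(order, family)
tree.add_child(family, genus_a)
tree.add_child(family, genus_b)
```

Because the views are immutable and hashable, two classifications that agree on all four elements of a view can share the same node key, whereas a change in any one element (for example the year) yields a distinct view.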


Although phylogenetic trees have some similarities to classification trees and therefore are relevant to hierarchical classification-database considerations, two major differences between classification trees and phylogenetic trees prevent the latter from being directly represented in the original HICLAS database. (i) In a classification tree, each node is a taxon view, either at a mandatory or at an optional taxonomic rank. In a phylogenetic tree, however, the internal nodes (hypothetical taxonomic units, HTUs) may or may not be given taxon names. (ii) In a classification tree, the taxonomic levels are determined by rules of biological nomenclature, such as the International Code of Botanical Nomenclature (ICBN) for the entire botanical classification system. The number of ranks or levels for higher plants, as an example, is therefore at most 26 between Kingdom and Subform (see Table 1 in Zhong et al., 1996). In a phylogenetic tree, in contrast, the number of levels between the root and terminal nodes (operational taxonomic units, OTUs) depends on the topology of the tree, and can be as many as one less than the total number of OTUs. An example showing similarities and differences between classification trees and phylogenetic trees is given in Figure 1.
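The bound in point (ii) can be checked with a short Python sketch. It assumes a tree is written as nested tuples with OTU names as strings, and it counts levels as the number of edges from the root to the deepest terminal node; the function names and OTU labels are hypothetical.

```python
def depth(tree):
    """Number of levels between the root and the deepest terminal node."""
    if isinstance(tree, str):  # a terminal node (OTU)
        return 0
    return 1 + max(depth(child) for child in tree)


def pectinate(otus):
    """Build a fully unbalanced (comb-shaped) binary tree over the given OTUs."""
    tree = otus[-1]
    for name in reversed(otus[:-1]):
        tree = (name, tree)
    return tree


otus = [f"OTU{i}" for i in range(1, 11)]  # ten placeholder terminal nodes
comb = pectinate(otus)
balanced = ((("A", "B"), ("C", "D")), (("E", "F"), ("G", "H")))

print(depth(comb))      # 9, one less than the number of OTUs
print(depth(balanced))  # 3
```

A comb-shaped tree is the worst case: each internal node splits off a single OTU, so ten OTUs give nine levels, while a balanced tree over eight OTUs has only three.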

Taxonomic database system

Fig. 1. An example of a classification tree (a) and a phylogenetic tree (b) of the order Nymphaeales. In both trees, the lowest taxon views (the terminal nodes or OTUs) are 10 genera.

Database and data tables

Modification of the original HICLAS database to include phylogenetic trees is based on the following three assumptions.

1. In the HICLAS database, the taxon views at each particular rank are stored in one table, and a separate table stores all the hierarchical relationships between all the taxa. A phylogenetic tree, however, typically has a number of internal nodes that are neither distinct taxon views nor fit into any existing rank. Therefore, the phylogenetic relationships between all the taxa need to be stored in a separate table.

2. In a classification tree, each node is significant in its own right and some of the most meaningful queries for a given taxon view are about the taxon names of its children. In contrast, because the internal nodes are not given names by the author, the immediate lower level internal nodes in a phylogenetic tree may not be significant. Each comparison is made not by nodes of the immediate lower level, but by terminal nodes as a set of OTUs, regardless of the level at which a query is carried out.

3. Because the internal nodes of phylogenetic trees are not given ranks, and only the root and terminal nodes have names and ranks, a query for a phylogenetic tree always starts at the root and the results are usually expressed in terms of the terminal nodes.

Thus, the modified HICLAS database including phylogenetic trees has the following features: (i) a phylogenetic tree is a distinct taxon view and is distinguishable from classification taxon views; (ii) both root and terminal taxon views are in tables of their ranks; (iii) all the internal nodes inherit the taxon view of the root node; (iv) the phylogenetic relationships of all phylogenetic trees can be described by using Newick/NH format (Swofford, 1991), and then stored in one separate table. Figures 2, 3 & 4 provide examples for the storage of data for a classification tree and a phylogenetic tree shown in Figure 1.
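To make assumption 2 and feature (iv) concrete, the following Python sketch reads a tree stored as a Newick string and compares a phylogenetic subtree with a classification group by their sets of terminal nodes. It is a minimal illustration, not the HICLAS code: the parser handles only names, nesting and optional branch lengths, and the tree, the OTU labels and the `classification_group` set are invented placeholders.

```python
def parse_newick(text):
    """Parse a simple Newick string into nested (label, children) tuples."""
    text = "".join(text.split())  # drop whitespace for this simplified reader
    pos = 0

    def read_label():
        nonlocal pos
        start = pos
        while pos < len(text) and text[pos] not in "(),:;":
            pos += 1
        label = text[start:pos]
        if pos < len(text) and text[pos] == ":":  # skip an optional branch length
            while pos < len(text) and text[pos] not in "(),;":
                pos += 1
        return label

    def read_node():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1  # consume "("
            children.append(read_node())
            while text[pos] == ",":
                pos += 1
                children.append(read_node())
            if text[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1  # consume ")"
        return (read_label(), children)

    return read_node()


def otus(node):
    """Return the set of terminal-node names below (or at) a node."""
    label, children = node
    if not children:
        return {label}
    return set().union(*(otus(child) for child in children))


# A placeholder phylogenetic tree as it might be stored in one table column.
stored = "((A,B),(C,(D,E)))Root;"
phylogeny = parse_newick(stored)

# Unnamed internal node (C,(D,E)) compared with a hypothetical classification group.
subtree = phylogeny[1][1]
classification_group = {"C", "D", "F"}

common = otus(subtree) & classification_group
only_in_phylogeny = otus(subtree) - classification_group
print(sorted(common), sorted(only_in_phylogeny))
```

Since the internal nodes carry no names or ranks, the comparison never refers to them directly; each subtree is identified purely by the OTU set it spans.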

Comparison between classification and phylogenetic trees

With the two general data types for classification and phylogenetic trees, four pairwise comparisons may be made: (i) between classification trees; (ii) between phylogenetic trees; (iii) classification trees querying phylogenetic trees; (iv) phylogenetic trees querying classification trees. Mathematically, most classification trees and phylogenetic trees can be referred to as leaf-labeled N-trees. In such an N-tree, a subtree is defined as a part of the tree in which the root node of the subtree may be an internal node or the root of the tree. The subtree’s terminal nodes must be all the terminal nodes that descend from the node assigned as the root of the subtree. A number of ad hoc queries and a corresponding algorithmic framework for comparing interacting classifications have been developed (Zhong et al., 1996). Several query pairs have also been added to the query system in HICLAS to allow querying on the common information between phylogenetic trees and classification trees. In order to implement these query pairs, a generalized method and an algorithm for measuring overall similarity between trees based on subtree similarity have been proposed (Zhong et al., 1997).

Implementation

We have implemented the HICLAS system, which is available on the Internet and provides an Open Window interface, on a UNIX workstation platform. It can also be accessed from PC and Macintosh computers with X-windows. The system consists of intuitive mouse-based graphical user interfaces connected to a Sybase database server at Michigan State University. These interfaces allow a user to display and browse hierarchical taxonomic information and lineage relationships as well as phylogenetic relationships among particular taxonomic groups. The software architecture of the original HICLAS was described in previous papers (see Figures 11–14 in Jung et al., 1995).

Sample views of a classification tree and a phylogenetic tree, as presented for browsing in the expanded HICLAS, are shown in Figures 5 and 6, respectively. In Figure 5, a macro view displays a full tree for a selected classification, and in the detailed view any part of the tree can be displayed by scrolling vertically or horizontally, allowing detailed access to a selected node or rank. The windows can also display lineage information about a particular taxon view or all taxon views at a particular rank. In Figure 6, a full phylogenetic tree can likewise be displayed and browsed by scrolling vertically or horizontally.

Fig. 2. Three of the 26 taxon-view tables in the HICLAS database. (a) is for order rank, and table # is 8; (b) is for family rank, and table # is 11; (c) is for genus rank, and table # is 15. The taxon names in the tables correspond to the trees shown in Figure 1.
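The table layout summarized in the captions of Figures 2–4 (one taxon-view table per rank, one hierarchical relationship table, and one table for phylogenetic trees) could be sketched as below. This is only a rough illustration under several assumptions: SQLite from the Python standard library stands in for the Sybase server, all table and column names are invented, and the rows are placeholders except for the Nymphaeaceae view quoted earlier. Only the table numbers 8, 11, 15 and 99 come from the figure captions.

```python
import sqlite3

# Table numbers from the figure captions; the table names are hypothetical.
RANK_TABLES = {8: "order_views", 11: "family_views", 15: "genus_views"}
PHYLOGENY_TABLE_NUMBER = 99

conn = sqlite3.connect(":memory:")

# One taxon-view table per rank, each row holding one quadruple.
for table in RANK_TABLES.values():
    conn.execute(
        f"CREATE TABLE {table} (view_id INTEGER PRIMARY KEY, "
        "name TEXT, author TEXT, year INTEGER, publication TEXT)"
    )

# One table for all hierarchical relationships between taxon views.
conn.execute(
    "CREATE TABLE hierarchy (parent_table INTEGER, parent_id INTEGER, "
    "child_table INTEGER, child_id INTEGER)"
)

# One table for phylogenetic trees, with the topology kept as a Newick string.
conn.execute(
    "CREATE TABLE phylogeny (tree_id INTEGER PRIMARY KEY, "
    "root_table INTEGER, root_id INTEGER, newick TEXT)"
)

# Placeholder rows; only the family view is taken from the text.
conn.execute("INSERT INTO order_views VALUES (1, 'Nymphaeales', 'Author A', NULL, 'Pub X')")
conn.execute("INSERT INTO family_views VALUES (1, 'Nymphaeaceae', 'Caspary', 1888, 'Pub 2')")
conn.execute("INSERT INTO hierarchy VALUES (8, 1, 11, 1)")
conn.execute("INSERT INTO phylogeny VALUES (1, 8, 1, '((A,B),(C,(D,E)));')")


def children_of(connection, parent_table, parent_id):
    """Return the names of the taxon views directly below a given view."""
    rows = connection.execute(
        "SELECT child_table, child_id FROM hierarchy "
        "WHERE parent_table = ? AND parent_id = ?",
        (parent_table, parent_id),
    ).fetchall()
    names = []
    for child_table, child_id in rows:
        table = RANK_TABLES[child_table]
        (name,) = connection.execute(
            f"SELECT name FROM {table} WHERE view_id = ?", (child_id,)
        ).fetchone()
        names.append(name)
    return names


print(children_of(conn, 8, 1))
```

Keeping the relationships in their own table means that the same taxon view can take part in several classifications without being duplicated in its rank table.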


The comparison between interacting classification trees can be performed in the Classification Comparison Window shown in Figure 7, which consists of an upper and a lower pane. The upper pane shows a number of taxon views which are the root nodes of different classification trees and symbolize the trees to be compared. The lower pane shows a query comparison expression formed by using union (U), intersection (I) or subtraction (–) on a set of classification trees. Built-in comparison operations and custom queries are implemented based on the query algorithms. A user is able to build, edit and execute comparison query expressions using Boolean or system-defined query expressions.

In the Phylogeny Window, which displays a phylogenetic tree, one can go back and forth between a phylogenetic tree and its corresponding classification tree(s). To do this, click on the Find Classification button. One or more taxon views will appear at the bottom of the window; if there is more than one, they can be seen by using the vertical scroll bar. Highlighting a taxon view and then clicking on the Display Classification Tree button opens a new Classification Tree Window with a classification tree. One can select a node to represent a subtree of the phylogenetic tree to be compared with the corresponding nodes in the classification tree and then click on the Map on Classification Tree button; the common terminal nodes between the phylogenetic tree and the corresponding classification tree will be highlighted by the system. Similarly, one can select a node in the classification tree and then click on the Graph Info button in the Classification Tree Window; the common terminal nodes appearing in the two trees will be highlighted.

A data-input system for HICLAS has been designed and implemented based on a powerful encoding scheme. This system captures hierarchical and lineage relationships as well as phylogenetic relationships in table form. Currently, the HICLAS database contains two kinds of prototype data sets, described as follows.

(i) The angiosperm order Nymphaeales and all subordinate taxa (including data up to the level of class, assembled from the classifications of many authors). This data set has been used interactively in the HICLAS system. It includes not only the hierarchical information, but also lineage relationships between classifications. Several cladograms from phylogenetic studies on the order Nymphaeales have been input into HICLAS.

(ii) A compound data set of the vascular flora of the USA, Canada and Greenland (the Biota of North America). The data based on the Kartesz (1994) synonymized checklist have been input into SMASCH (Specimen MAnagement System for California Herbaria) (Bartholomew and Duncan, 1992). Most of these data (currently about 290 families and 3168 genera, especially the family Asteraceae and the genus Aster) have been loaded into HICLAS from SMASCH and formatted as taxon views.

The WWW version of HICLAS is also available through the Internet (URL: http://aims.cps.msu.edu/hiclas/). In addition to the Web site, information on the HICLAS system is also available through anonymous ftp to genesys.cps.msu.edu/pub/hiclas/ or by sending an e-mail to hiclas-manager@cps.msu.edu.

Fig. 3. A table for storing phylogenetic trees (table # is 99). The data in the table correspond to the phylogenetic tree shown in Figure 1b.

Fig. 4. A hierarchical relationship table in the HICLAS database. The data in the table correspond to Figures 1–3.

Discussion

The existing HICLAS can be effectively used in the taxonomic revisionary process, especially when databases are being constructed by specialists in particular taxonomic groups. It also allows efficient comparison and evaluation of the results of quantitative taxonomic analyses such as cladistic analysis.

Currently, the focus of the HICLAS project is to extend the system to provide data networking and transferring capabilities with other taxonomic database systems. New interfaces will be developed to existing taxonomic databases so that they can track nomenclatural history and interoperate with other systems. A set of interactive database interface tools will be built for data transfer, search and display. In order to establish interoperability among various taxonomic database systems, the databases that are currently available are classified into two categories. The first category of databases manages nomenclatural data with taxonomic hierarchy and associated taxon data including taxonomic history. The second category of databases manages phylogenetic trees and related character-taxon data sets for cladistic analyses. We will design and implement a general data transfer protocol for nomenclatural data between the two types of databases. The generalized tree-comparison method and a user query system will be applied to compare classification trees with phylogenetic trees gathered from different taxonomic database systems.

In order to implement and improve the WWW version of HICLAS, especially tree comparison, sophisticated technologies are needed. For example, Java applets can be used to display and compare taxonomic data over two or more simultaneous Web pages.

The main data sources for this system are taxonomic monographs, revisions and indexes. Completeness and accuracy of the data are very important for the model and methods to be most effective. It would be possible to associate other data, such as those for type specimens, geographical distributions, descriptions and illustrations, with the skeleton classification data and taxon views that we have currently implemented. These related data also include original character states used for constructing phylogenetic trees.

Fig. 5. The Classification Tree Window in the X-Window version of HICLAS.

Fig. 6. The Phylogenetic Tree Window in the X-Window version of HICLAS.

Acknowledgements

This research was supported by National Science Foundation grants DIR-9021656 and BIR 94-08384 to S.P. It was also partially supported by National Natural Science Foundation of China grant 39470049 to Y.Z. We would like to thank Dr Sungwon Jung for technical assistance and Dr Sibnath Das for preparing input data sets. We appreciate the computer programming by Michigan State University Computer Science students Stephen Perkins, David English, Terrence Dexter and Jiahe Zhuang.

Fig. 7. The Classification Comparison Window in the X-Window version of HICLAS.

References

Abbott,L.A., Bisby,F.A. and Rogers,D.J. (1985) Taxonomic Analysis in Biology. Columbia University Press, New York.

Bartholomew,B. and Duncan,T. (1992) The specimen management system of California herbaria as a model for an inter-institutional distributed database system. In Peng,C.I. (ed.), Phytogeography and Botanical Inventory of Taiwan. Institute of Botany, Academia Sinica Monograph Series No. 12. Taipei, pp. 85–91.

Beach,J.H., Pramanik,S. and Beaman,J.H. (1993) Hierarchical taxonomic databases. In Fortuner,R. (ed.), Advances in Computer Methods for Systematic Biology: Artificial Intelligence, Databases, Computer Vision. Johns Hopkins University Press, Baltimore, pp. 241–252.

Blake,J.A., Bult,C.J., Donoghue,M.J., Humphries,J. and Fields,C. (1994) Interoperability of biological data bases: A meeting report. Syst. Biol., 43, 585–589.

Hughes,J. (1991) Object-oriented Databases. International Series in Computer Science. Prentice Hall.

Jung,S., Perkins,S., Zhong,Y., Pramanik,S. and Beaman,J. (1995) A new data model for biological classification. Comput. Applic. Biosci., 11, 237–246.

Kartesz,J.T. (1994) A Synonymized Checklist of the Vascular Flora of the United States, Canada and Greenland, 2nd edn. Timber Press, Portland, OR.

Moerkotte,G. and Zachmann,A. (1993) Towards more flexible schema management in object bases. In Proceedings of the IEEE 9th International Conference on Data Engineering, pp. 174–181.

Pankhurst,R.J. (1991) Practical Taxonomic Computing. Cambridge University Press, Cambridge, UK.

Sanderson,M.J., Baldwin,B.G., Bharathan,G., Campbell,C.S., von Dohlen,C., Ferguson,D., Porter,J.M., Wojciechowski,M.F. and Donoghue,M.J. (1993) The growth of phylogenetic information and the need for a phylogenetic data base. Syst. Biol., 42, 562–568.

Swofford,D.L. (1991) PAUP: Phylogenetic Analysis Using Parsimony, Version 3.0. Illinois Natural History Survey, Champaign, IL.

Zhong,Y., Li,W. and Huang,D.S. (1994) The Theories and Methods of Cladistics. Science Press, Beijing (in Chinese).

Zhong,Y., Jung,S., Pramanik,S. and Beaman,J.H. (1996) Data model and comparison and query methods for interacting classifications in a taxonomic database. Taxon, 45, 223–241.

Zhong,Y., Meacham,C.A. and Pramanik,S. (1997) A general method for tree-comparison based on subtree similarity and its use in a taxonomic database. BioSystems, 42, 1–8.
