Antonie van Leeuwenhoek 64: 205-229, 1993. © 1993 Kluwer Academic Publishers. Printed in the Netherlands.

Software tools and databases for bacterial systematics and their dissemination via global networks

Vanderlei Perez Canhos 1, Gilson Paulo Manfio 2 & Lois D. Blaine 3

1 Tropical Data Base (BDT); 2 Tropical Culture Collection (CCT), Rua Latino Coelho 1301, Campinas, SP, 13087-010, Brazil; 3 American Type Culture Collection, 12301 Parklawn Drive, Rockville, MD 20852, USA

Received 25 August 1993; accepted 6 September 1993

Key words: bacterial systematics, databases, electronic networks, information resources, software tools

Abstract

The dynamic expansion of the taxonomic knowledge base is fundamental to further developments in biotechnology and sustainable conservation strategies. The vast array of software tools for numerical taxonomy and probabilistic identification, in conjunction with automated systems for data generation, is allowing the construction of large computerised strain databases. New techniques for generating chemical and molecular data, together with new software tools for data analysis, are leading to a quantum leap in bacterial systematics. The easy exchange of data through an interactive and highly distributed global computer network, such as the Internet, is facilitating the dissemination of taxonomic data. Relevant information for comparative sequence analysis, ribotyping, and protein and DNA electrophoretic pattern analysis is available on-line through computerised networks. Several software packages are available for the analysis of molecular data. Nomenclatural and taxonomic 'Authority Files' are available from different sources together with strain-specific information. The increasing availability of public domain software is leading to the establishment and integration of public domain databases all over the world, and is promoting co-operative research projects on a scale never seen before.

Introduction

The development of new approaches and techniques for the acquisition and utilisation of taxonomic data has sparked fundamental changes in the questions that may now be pursued in the study of interrelationships among microorganisms. Molecular data are leading to many reinterpretations of evolutionary relationships at all taxonomic levels. The study of the connections between the evolution of the genome and the phenotype is an important area of scientific investigation, rich in its implications for all biology. Technological developments in information/computer technology and molecular biology are leading to new strategies for molecular microbial ecological studies and target screening programmes for microorganisms of industrial and environmental importance (Bull et al. 1992; Hall 1989). The molecular information not only provides new ways of classifying living organisms in terms of directly measured genomic similarity, but also brings systematics into the front line of the quest to explain the relationships between molecular and phenotypic diversity (Liesack et al. 1991). The computer technology revolution is having a great impact on systematics and should have an even greater impact in the future. Now, for the first time, the technology is there to handle, sort, analyse and disseminate complex taxonomic data.

Exploitation of taxonomic knowledge in microbiology is leading to significant benefits to public and commercial activities in the fields of agriculture, medicine and environmental monitoring (Hawksworth 1992). Applications in the biological control of pests, bioremediation and environmental restoration offer particular promise (Bull et al. 1992). Systematics plays a central role in understanding biodiversity, documenting it and helping in the elucidation of the underlying nature of environmental processes and the structure of natural communities (Krebs 1992). The design of sustainable conservation strategies will depend heavily on taxonomic knowledge and the availability of accurate, up to date and relevant information from computerised databases and other sources (Hawksworth & Colwell 1992a,b).

Bacteriologists now have an enormous array of tools and methods from which to choose for classifying and identifying organisms (Goodfellow & O'Donnell 1993). The study of molecular evolution has come to the forefront and often seems to dwarf the efforts of traditional systematists. However, the 'polyphasic' approach (Colwell 1970) is rapidly becoming the method of choice for bacterial systematics. One of the primary problems facing today's microbiologist is how to keep up with the rapidly proliferating technologies that can assist in the study of natural relationships among organisms. The development of comprehensive taxonomic databases associated with image databanks and geographic information systems will play an important role in ecology and biodiversity programs. Efforts are required to stimulate the integration of sequence databanks with more traditional information resources, such as bibliographic, strain data, compendia of species descriptions, and metabolic products databases.
This paper presents an overview of the state-of-the-art with examples of software tools and databases for bacterial systematics and the communication mechanisms available to microbiologists for accessing these resources.

Dissemination of taxonomic information

Central to progress and development in systematics, and indeed any other scientific discipline, is the provision and ready availability of accurate, up to date, relevant and timely information. Moreover, this information must be communicated in a form which is appropriate for the intended audience. Until recently the dissemination of information for bacterial systematics was done exclusively through hard copy publications, such as printed books, manuals and periodicals. The most extensive effort is that of Bergey's Manual of Systematic Bacteriology (Krieg 1984; Sneath 1986; Staley et al. 1989; Williams et al. 1989), a four volume compendium of physiological and molecular information on most of the recognised species of bacteria. An abbreviated version of this multi-volume work is the Bergey's Manual of Determinative Bacteriology (Buchanan & Gibbons 1974), the 9th edition of which is to be published in 1993. Another comprehensive treatise covering known genera of bacteria is the second edition of 'The Prokaryotes - A Handbook on the Biology of Bacteria: Ecophysiology, Isolation, Identification, and Applications.' This four volume work, edited by Balows et al. (1993), includes a broader spectrum of information than is covered by the Bergey's Manuals, but most of the characterisation data are in textual rather than tabular format. Nevertheless, it is an excellent tool for those seeking data on habitats, characteristics, metabolic pathways, and products of bacterial species.

The International Journal of Systematic Bacteriology (IJSB) is, at present, the only official publication medium for validly described bacterial names. All new names of bacteria must be either published or announced in this journal. The Approved List of Bacterial Names published by Skerman et al. (1980) appears in this journal and is the official document containing the names of all recognisable bacterial taxa as of January 1, 1980. The IJSB is also the vehicle for publicising recommendations of the Judicial Commission, which considers amendments to the Code of Bacterial Nomenclature and exceptions to Rules of Nomenclature. The IJSB is an indispensable tool for the bacterial systematist.

The CD-ROM (from compact disc, read only memory) is a computer data storage peripheral device, which has evolved from Audio Compact Discs or CDs. CD-ROM technology allows the storage of very large databases on a 5.25 inch compact disc. The disc takes up little space and is inexpensive to produce. Good retrieval software is being generated, so that it is quite economical for the database producer wishing to make data available in this medium. CD-ROM disc drives are also dropping in cost and are increasingly becoming components of standard PC systems in libraries and laboratories. Following these developments in information technology, important publications such as the IJSB are now being offered on CD-ROM, thereby facilitating the search and retrieval of taxonomic information. The catalogues of the American Type Culture Collection, The Culture Collection of the Institute for Fermentation, Osaka, and the Japan Collection of Microorganisms are also available on CD-ROM from Hitachi Software on its 'CD-STRAINS' disk.

Cambridge Scientific Abstracts produces a number of bibliographic databases on CD-ROM that are of interest to bacteriologists, e.g., Compact Cambridge MEDLINE, Pollution Abstracts, Aquatic Sciences and Fisheries Abstracts, and the Life Sciences Collection. Other bibliographic resources available on CD-ROM include Current Contents, produced by the Institute for Scientific Information. A recent addition to the repertoire of CD-ROM products is NCBI's Entrez Discs. This dual disc set contains literature references to molecular biology-related subjects, i.e., a subset of MEDLINE called Entrez References and a second disc called Entrez Sequences, containing sequence data from GenBank, PIR, and SWISS-PROT, plus MEDLINE references and abstracts associated with those sequences.
Two CD-ROM databases are of special interest to molecular biology. The first comprises the EcoSeq, EcoMap, and EcoGene files, which contain over 1.5 million base pairs of non-overlapping Escherichia coli DNA sequences in FASTA format, restriction map data, genetic map positions, and alignment information for over 1,000 E. coli genes. This database is also distributed by NCBI. The second, the E. coli Database (ECD) from the Institut für Mikrobiologie und Molekularbiologie in Frankfurt, Germany, contains descriptive information to supplement E. coli nucleotide sequence data in EMBL and GenBank. Gene names, number of non-redundant base pairs, overlapping sequences, citations, and EMBL and GenBank cross references are components of ECD. The EMBL Data Library in Heidelberg, Germany and the Protein Databank (PDB) in the USA also distribute sequence and structure data on CD-ROM.

With the proliferation of computer mediated communications systems, new horizons are opening for the dissemination of taxonomic information. The development of academic and research networks, in association with commercial networks, is increasing the availability of electronic mail, on-line databases and computer conferences (Cerf 1991). The networks relevant to the dissemination of taxonomic information are listed in Table 1. The Information Resources and Computerised Databases relevant to bacterial systematics are listed in Tables 2 and 3, respectively.

The Internet is the most rapidly growing network in the world. It is a network of networks, linking over 1 million computers together through common open protocols. The primary applications include electronic mail (e-mail), file transfer and remote login (Krol 1992). The Internet is an excellent place to look for information in many fields of biology. Important and frequently asked questions (FAQ) about the Internet are addressed in the revision 'A Biologist's Guide to the Internet Resources' (Smith 1993). This electronic publication describes Usenet newsgroups, Internet and Bitnet mailing lists, and information archives of special interest to biology. The number of users subscribing to the mailing lists for topics in biology and reading the newsgroups is increasing steadily (Reid 1993).

Other important tools widely used to access the Internet information resources are telnet, anonymous ftp, gopher and WAIS. Telnet is a protocol that allows a user on one computer on the Internet to log in to another computer. Anonymous ftp allows a user to connect to another computer on the Internet using the account 'anonymous' in order to retrieve files archived for public domain access. These archives include databases, mailing list and newsgroup logs, and public domain software (such as communication tools, sequence alignment and image analysis). WAIS is a tool for indexing files stored in the 'anonymous ftp sites' for easy searching and location. Gopher is one of the most powerful tools, combining features of all the others. It was developed at the University of Minnesota and, through a standard user interface, allows a user to connect to other gophers, retrieve files, programs, sound and image archives, and search databases and library catalogues, without having to learn commands and the location of such archives.

The rapid evolution of Internet connectivity is changing the way the linking of separate data resources is perceived and may solve some of the problems data providers have in reaching their audiences. Much of this evolution is occurring in software that is in the public domain and is accessible to anyone with an Internet connection. Of the Internet servers, Gophers are showing an outstanding growth rate. There are currently 1,200 Gopher servers around the world, including some 60 BioGophers containing information relevant to bacterial systematics. Gophers to some degree make it possible for anyone to become a data resource provider and allow any interested person to systematically browse that resource through the Internet. In fact, one can search GenBank, EMBL databases and several hundred other resources on servers throughout the world. Through the Internet Gophers it is possible to share information without imposing rigidity on the structure of the data or predefining the group with whom the information can be shared. The existing technology is allowing the establishment of distributed Public Domain Databases (PDD) that can be co-ordinated and accessed from everywhere in the world. The main requirement is for participating sites to coordinate their activities

Table 1. Electronic networks relevant to bacterial systematics.

Each entry gives the network, its secretariat and e-mail address, followed by its scope and services.

BIN21 - Biodiversity Information Network
Secretariat: Interim Secretariat at the Tropical Database, Campinas, SP, Brasil
E-mail: [email protected]
Scope and services: Aims at facilitating efficient access to information relating to all aspects of biodiversity. Services: Gopher server and discussion list (biodiv-L@bdt.ftpt.br)

BIOSCI - Biological Communication Network
Secretariat: Two distribution sites: SERC Daresbury Laboratory, UK; IntelliGenetics, California, USA
E-mail: [email protected] (Americas, Pacific Rim); [email protected] (Europe, Africa, Asia)
Scope and services: Organized as a set of newsgroups covering general and special interests; available free over Internet

EMBNet - European Molecular Biology Network
Secretariat: Contact EMBL Data Library, Heidelberg, Germany
E-mail: [email protected]; [email protected]
Scope and services: A molecular biology network linking a series of centers in Western European countries; distributes sequence data and software for molecular biology

GRIN - Genetic Resources Information Network
Secretariat: USDA-ARS, Beltsville, MD, USA
E-mail: [email protected]
Scope and services: Database of plant related materials including origin of samples and their useful characteristics

ICGEBNet - International Center of Genetic Engineering and Molecular Biology Network
Secretariat: Trieste, Italy
E-mail: [email protected]
Scope and services: Molecular biology related information, Unido's computer resources for molecular biology

MGD - Microbial Germplasm Database and Network
Secretariat: Oregon State University, Corvallis, OR, USA
E-mail: [email protected]; [email protected]
Scope and services: Data on plant pathogens, symbionts and biological control organisms

MINE - Microbial Information Network Europe
Secretariat: CAB International Mycological Institute, Egham, UK; and DSM, Braunschweig, Germany
E-mail: [email protected]
Scope and services: Integrated catalog project, incorporating a European network of culture collection databanks

MSDN - Microbial Strain Data Network
Secretariat: Institute of Biotechnology, Cambridge, UK
E-mail: [email protected]
Scope and services: Information on specific properties and cultured cells. Central Directory includes information on updated RKC codes

and agree on the protocols they use. Public Domain Databases are proliferating in all areas of science. Their great advantage is that the set of potential contributors includes everyone working on the database theme, even people unknown to the co-ordinating agent. Existing Internet network utilities and public domain software provide all the resources necessary for running a PDD project. The greatest issues confronting the co-ordinator of a PDD are the costs of running the project, developing data standards and setting up quality control procedures. The establishment of public domain taxonomic databases makes economic and scientific sense, as the effort will allow the maximum use of every piece of available data and promote the coordination of microbial systematics research activities. The operational mechanisms, as well as the technological and biological issues related to biodiversity PDDs, were discussed by Green (1992).

Tools for data analysis and databases for bacterial systematics

There is a clear need for the construction of integrated, comprehensive, accurate and reliable taxonomic databases to answer the real questions related to microbial systematics. These databases should be widely available to maximise their impact. All taxonomic activity is part of an international network of communication and information. Although research can be done, and many times is carried out, individually, the activities related to systematics depend on a series of agreed upon rules and codes, publications and up-to-date scattered pieces of information and molecular data. Ideally the results of every taxonomic study should not only answer an immediate question, but also contribute data to a fundamental global taxonomic knowledge database.

The implementation of international programmes to document the importance and roles of systematic biology in sustainable development will contribute to the enhancement of the taxonomic knowledge base. The 'Sustainable Biosphere Initiative' (Anonymous 1991a), the 'Systematics Agenda 2000' (Anonymous 1991b) and the 'Microbial Diversity 21' (Hawksworth & Colwell 1992a) are already stimulating the debate on the importance of systematics in human affairs. Within the context of Diversitas, the International Union of Biological Sciences (IUBS) Biodiversity Programme, an international Biodiversity Information Network (BIN-21) is being established in order to facilitate the dissemination of biodiversity information worldwide (Canhos et al. 1992).

Table 2. Information resources relevant to bacterial systematics.

Each entry gives the information resource, its secretariat and e-mail address, followed by its scope and services.

ATCC - American Type Culture Collection
Secretariat: Rockville, MD, USA
E-mail: [email protected]
Scope and services: Preservation and distribution of microbial strains and related data

Bacillus Genetic Stock Center
Secretariat: Ohio State University, Columbus, OH, USA
E-mail: [email protected]
Scope and services: Preservation and distribution of Bacillus strains and related genetic data

E. coli Stock Center
Secretariat: Department of Biology, Yale University, New Haven, CT, USA
E-mail: [email protected]
Scope and services: Preservation and distribution of E. coli strains and related data

Staphylococcus Stock Center
Secretariat: Iowa State University
Scope and services: Preservation and distribution of Staphylococcus strains and related data

EMBL - European Molecular Biology Laboratory
Secretariat: Heidelberg, Germany
E-mail: [email protected]
Scope and services: Collection, organization, and distribution of nucleotide sequence data

ICECC - Information Centre for European Culture Collections
Secretariat: DSM, Braunschweig, Germany
E-mail: [email protected]; [email protected]
Scope and services: Collection, organization and dissemination of microbial strain data

NCBI - National Center for Biotechnology Information
Secretariat: NIH, Washington DC, USA
E-mail: info@NCBI.nlm.nih.gov
Scope and services: Services: production and distribution of nucleotide and amino acid sequence data and software tools for sequence analysis

RDP - Ribosomal Database Project
Secretariat: University of Illinois, Urbana, IL, USA
E-mail: [email protected]
Scope and services: Ribosomal sequence data and software for sequence analysis. Services: phylogenetic analysis of sequences and probe checking

WDC - World Data Center on Microorganisms
Secretariat: RIKEN, Tokyo, Japan
E-mail: [email protected]
Scope and services: Information on resource centers of microorganisms and cultured cells, including the following databases:
- World Directory of Culture Collections of Microbial Strains (CCINFO)
- Hybridoma Data Bank (HDB)
- World Catalogue of Algae (ALGAE)
The information is searchable through a Gopher/WAIS INTERNET server.

Table 3. List of computerized databases with relevant information to bacterial systematics.

Each entry gives the database, its producer and e-mail address, followed by its scope and comments.

BBRN - Biosis Register of Bacterial Nomenclature
Phone: + 1 (800) 523-4806, (215) 507-4917
E-mail: [email protected]
Scope and comments: Complete list of valid bacterial names, synonyms and history

BKS - Biotechnology Knowledge Sources
Phone: + 44 (753) 7-4201
E-mail: [email protected]
Scope and comments: Publications and events available on-line through the MSDN

DBIR - Directory of Biotechnology Information Resources
E-mail: [email protected]; [email protected]
Scope and comments: Online directory of organizations, databases, networks, publications and nomenclature resources relevant to biotechnology

DDBJ - DNA Databank of Japan
Producer: National Institute of Genetics, Yata, Mishima, Japan
E-mail: tgoj [email protected]
Scope and comments: Collection and dissemination of nucleotide sequence data generated in Japan

ECOLI - E. coli K-12 Database
E-mail: [email protected]
Scope and comments: Genome and protein nucleotide and amino acid sequence, database gene name, and gene location data

GenBank
E-mail: [email protected]
Scope and comments: Database of reported nucleotide sequences

CODATA/IUIS Hybridoma Data Bank (HDB)
E-mail: [email protected]
Scope and comments: Database describing construction and reactivities of hybridomas and monoclonal antibodies

EMBL, EMBL-New, EMBL-Daily UEMBL
Producer: European Molecular Biology Laboratory, Heidelberg, Germany
E-mail: [email protected]
Scope and comments: Nucleotide sequences databanks and related information

LiMB - List of Molecular Biology Databases
Producer: Los Alamos National Laboratory, Los Alamos, NM, USA
E-mail: [email protected]
Scope and comments: A comprehensive listing of molecular biology databases

LYSIS
E-mail: [email protected]
Scope and comments: Endopeptidase cleavage sites in PPR known protein sequences

NAPRALERT
E-mail: [email protected]
Scope and comments: Secondary metabolites and chemosystematics of plants and bacteria

OLIGONUC
E-mail: [email protected]
Scope and comments: Chemically synthesized oligonucleotide database

PDB - Protein Data Bank
Producer: Brookhaven National Laboratory, Upton, NY, USA
E-mail: [email protected]
Scope and comments: Three dimensional structure of proteins and other biological macromolecules determined by X-ray crystallography or nuclear magnetic resonance

PIR - Protein Identification Resource
Producer: National Biomedical Research Foundation, Washington DC, USA
E-mail: [email protected]
Scope and comments: Information on amino acid sequences

PPR - Plasmid Prefix Registry and PRCTR - Plasmid Reference Center Transposon Registry
Producer: Stanford University, California, USA
Phone: + 1 (415) 723-1772
Scope and comments: List of existing plasmid names and transposon allocations

PROSITE
E-mail: [email protected]
Scope and comments: Biologically significant protein sequence patterns

QTDGPD - Quest 2D Gel Protein Database
E-mail: [email protected]
Scope and comments: Analysis of 2D gel samples

RED - Restriction Enzymes Database
E-mail: rober [email protected]
Scope and comments: Database on restriction endonucleases, recognition sequences and cleavage sites

RKC codes
E-mail: [email protected]; [email protected]
Scope and comments: Standardized list of bacterial phenotypic characteristics

SEQ ANL REF Databank
E-mail: [email protected]
Scope and comments: Listing of literature references relevant to sequence analysis

SIG PEP - Signal Peptide Sequences
E-mail: [email protected]
Scope and comments: A database of secretory signal peptide sequences

SRRSD - Small Ribosomal RNA Sequences Database
E-mail: dewachter@ccv.uia.ac.be
Scope and comments: Data on small ribosomal subunit RNA sequences

SWISS-PROT
E-mail: [email protected]
Scope and comments: Amino acid sequence data

TFD - Transcription Factor Database
E-mail: [email protected]
Scope and comments: Database on transcription factors

TRNAC - Transfer RNA Compilation
Phone: + 49 (921) 552-668
Scope and comments: Nucleotide sequences of tRNA

The development of integrated taxonomic databases is now feasible mainly because of the advances in instrumentation and information technologies. Laboratory instruments such as gas and liquid chromatographs, mass spectrometers, spectrophotometers, scintillation counters and microdilution analysers now include integrated computers or offer them as options (Kellogg 1989). Research operation is becoming even more computer intensive, and data logging and management of the resulting streams of microbial strain data are creating the conditions for the development of comprehensive databases. New possibilities of linking together institutions and people throughout the world using global computer networks are leading to a new era of co-operative research on a scale never seen before. Significant outputs of this international co-operation in the biological sciences are the public domain databases fostered by GenBank and EMBL. With this increased co-ordination and improved access to molecular biology databases, thousands of researchers are using and contributing to the compilations of DNA and protein data. These developments in molecular biology provide a good model for the design of integrated networks for microbial systematics and construction of public domain taxonomic databases.

Inconsistencies in database organisation and data coding greatly complicate, and may even prevent, electronic exchange of information between scientists. The Committee on Data for Science and Technology, CODATA, which is concerned with the quality and accessibility of data (as well as the methods by which data are acquired, managed, analysed, and disseminated), has established a Commission on the Terminology and Nomenclature of Biology. This commission aims to provide resources for the development of standards for terminology and nomenclature in the fields of biological sciences and bioinformatics.

Microbiologists have traditionally adopted the artificial and natural approaches to the classification of bacteria. Artificial classifications are still heavily used in clinical microbiology (Cowan & Steel 1974). In such classifications the groups are monothetic, i.e., defined on the basis of a few selected characters. Natural or phenetic classifications use a large number of phenotypic and genotypic characters for defining groups. These have a high information content and can accommodate some degree of phenotypic variability in the strains being identified. Groups defined in this way are polythetic, i.e., defined on the basis of a large number of common properties. A third approach, phylogenetic classification, consists of the interpretation of molecular sequence information (mainly rRNA and conserved protein sequences) for expressing the evolutionary relationships between microorganisms. Several constraints in data analysis and interpretation are still unresolved (Sneath 1989; Williams 1992), but phylogenetic analysis has been used recently for proposing the division of microorganisms into the Domains Archaea, Bacteria and Eucarya (Woese et al. 1990; Winker & Woese 1991) and has had a great impact on the taxonomic structure of several microbial taxa (Rossler et al. 1991; Stackebrandt et al. 1981, 1991). Phenetic and phylogenetic classifications have benefited from the developments in computer technology.
The biggest impacts of the use of computers in bacterial systematics may be seen in the areas of numerical taxonomy, computer-assisted (probabilistic) identification, automated identification systems, chemosystematics, molecular data (nucleic acid and protein sequences), nomenclature and strain data (culture collections).

Tools for numerical taxonomy

Numerical taxonomy studies involve the analysis of a large number of strains representing a taxonomic group with regard to a large number of characters (Sneath & Sokal 1973). Characters vary to a great extent, depend on the microbial group under investigation, and range from morphological, cultural and biochemical properties, degradation activity, utilisation of organic and inorganic compounds (carbon and nitrogen sources), resistance to chemicals and production of metabolites to chemical data, such as fatty acid and polar lipid patterns, DNA guanine/cytosine ratio, isoprenoid quinones and sugars. These are discussed in detail by O'Brien and Colwell (1987), Goodfellow et al. (1985) and Goodfellow and O'Donnell (1993). The resulting data are usually coded as binary variables (positive/negative) representing the individual test results, or sometimes as multistate characters, using a few variables to cover the whole range of possible responses (e.g., levels of antibiotic resistance). Continuous data (numeric values, e.g., spore diameter) are usually converted to multistate binary variables to facilitate the analysis, but a few analysis algorithms can cope with binary and continuous data in the same dataset. Duplicate strains are introduced for the assessment of the test error in the system, enabling the detection of variable and poorly reproducible tests (Sneath 1974; Sneath & Johnson 1972).

The management of numerical taxonomic data may be accomplished on simple IBM-PC compatible microcomputers, using spreadsheets (e.g., Lotus 123, Quattro Pro), database software (e.g., DBase, FoxPro, Paradox) or shareware software (PC-File). Other platforms (e.g., Macintosh, Unix, Vax) are equally suited for data management and analysis. Regardless of the software or system used, it is always advisable to test a small simulated dataset to ensure that the software has all the desired features before embarking upon a time consuming large scale taxonomic project. Faster computers are desirable for large datasets, and some software may require very specific computer configurations to run (e.g., numerical co-processors or large amounts of RAM).

The usual data analysis starts with the evaluation of the 'quality' of the data set by the estimation of the test error. This is done by comparing all test results for the pairs of duplicate strains (Sneath 1974; Sneath & Johnson 1972). Poorly reproducible tests may be removed from the dataset. Subsequent taxonomic analysis involves calculating similarity coefficients and performing hierarchical and non-hierarchical statistics for grouping the strains according to their overall similarity (Bryant 1987; Sackin & Jones 1993). The coefficients most commonly used for similarity calculations are the Jaccard and simple matching coefficients, but several similarity/dissimilarity coefficients are available for the analysis of binary and continuous data (Wishart 1987). Usually, similarity data are expressed by a dendrogram, calculated using the unweighted pair group method with arithmetic means algorithm (UPGMA) (Sneath & Sokal 1973).

The evaluation of numerical taxonomic classifications (groups) can be achieved using several practical and theoretical approaches. The use of different similarity and clustering algorithms provides some insight into the stability of the clusters (Sackin & Jones 1993). The cophenetic correlation index (Sneath & Sokal 1973) provides an estimate of the distortion introduced during the construction of dendrograms by comparing the final similarity values with the original data from the similarity matrices. Minimum spanning trees and shaded diagrams provide additional information on the composition, stability and relationships between strains and clusters. Calculation of cluster overlap statistics (Sneath 1977, 1979b,c) and outlying strains (Sneath & Langham 1989) provides effective ways of assessing the overall classification.

Statistical procedures and calculations can be done using commercial packages (SAS; SPSS; Genstat) or specialised software (CLUSTAN, MVSP, TAXAN, Taxon).
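Independently of any of these packages, the duplicate-strain check and the two similarity coefficients named above can be illustrated in a few lines of Python. This is only a minimal sketch under stated assumptions: the function names and the toy 0/1 strain vectors are hypothetical, and the reproducibility check is a plain per-test discordance rate between duplicates, not the test-error estimator of Sneath & Johnson (1972).

```python
def simple_matching(a, b):
    """Fraction of characters on which two strains agree (shared + or shared -)."""
    return sum(1 for x, y in zip(a, b) if x == y) / len(a)


def jaccard(a, b):
    """Shared positives divided by characters positive in at least one strain.

    Shared negatives are ignored. Returning 0.0 when neither strain has any
    positive character is a convention chosen for this sketch.
    """
    both = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    either = sum(1 for x, y in zip(a, b) if x == 1 or y == 1)
    return both / either if either else 0.0


def discordance_per_test(duplicate_pairs):
    """For each test, the fraction of duplicate pairs giving different results.

    duplicate_pairs is a list of (first_run, second_run) result vectors for
    the same strain. Tests with a high value are candidates for removal.
    """
    n_tests = len(duplicate_pairs[0][0])
    n_pairs = len(duplicate_pairs)
    return [
        sum(1 for first, second in duplicate_pairs if first[t] != second[t]) / n_pairs
        for t in range(n_tests)
    ]


if __name__ == "__main__":
    # Hypothetical binary test results (1 = positive, 0 = negative).
    strain_a = [1, 1, 0, 1, 0, 0]
    strain_b = [1, 1, 0, 0, 0, 0]
    print(simple_matching(strain_a, strain_b), jaccard(strain_a, strain_b))
```

The two coefficients differ only in how they treat shared negative results, which is why they can rank the same pair of strains differently when many tests are negative for both.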
CLUSTAN (Wishart 1987), MVSP and TAXAN (ASM, USA, unpublished) have an extensive range of similarity coefficients and clustering algorithms, suitable for binary and continuous data, and are more user-friendly than statistical software packages. The use of integrated software for numerical taxonomic analysis may present several advantages over the use of separate database and analysis packages. MICRO-IS (Bello 1989) has integrated modules for data management and analysis in the same software with minimal manipulation of data and files. TAXON (A. C. Ward, unpublished) has basic database facilities for strain data, and data analysis can be carried out in separate software using ASCII exported files. This software includes procedures for the calculation of test error, cluster centrotypes, construction of percentage positive tables and probabilistic identification. NTSYS-pc (Sackin 1987) and SYN-TAX IV (J. Podani, unpublished) have limited database management facilities for ASCII files but provide the most comprehensive suite of procedures for data analysis, including clustering, ordination, comparison and evaluation of classifications and several options for graphical display.

Peter Sneath and colleagues have published several BASIC programs for numerical taxonomic analysis and probabilistic identification which can be implemented as user subroutines for taxonomic data analysis (Sneath 1974, 1977, 1979a-e, 1980a-c; Sneath & Langham 1989; Sneath & Sackin 1979). Additional information on computer programs can be found in Sackin (1987). The reader is referred to Sackin and Jones (1993) for an excellent up-to-date text on computer-assisted classification.

Analysis of reactivity patterns of monoclonal antibodies may be used as a tool for bacterial systematics. The CODATA/IUIS Hybridoma Data Bank can be used to study taxonomic relationships among strains of bacteria and other organisms (Walczak et al. 1988). Programs allowing for analysis of reactivity patterns of monoclonal antibodies are an integral part of the software used to manage the data bank. Biochemical substances will 'cluster' when subjected to taxonomic analysis based on their reactivity patterns with a defined set of monoclonals. Monoclonal antibodies exhibit complex reactivity patterns with various antigens, although they are specific to unique epitopes.
Because similar epitopes can be found on diverse proteins and/or polysaccharides, the monoclonal antibodies may react with dissimilar cell types or biochemical substances. Cluster analysis can reveal epitopes common to related, and possibly non-related, organisms. By inverting the data, relationships among epitopes can be determined. Cluster analysis seeks to group individual objects based on measured properties. These groupings can serve many purposes, e.g., to display relationships among similar objects and to predict unmeasured properties of objects. A cluster analysis of Leishmania strains based on reactivity with monoclonal antibodies demonstrated that certain strains were more closely related by geographic origin than by species, i.e., cross-species clustering of organisms from specific locations was observed (Bussard et al. 1985). Identification of a conserved epitope provides some information on genetic diversity among organisms. The data analyses performed, while not definitive measures of structural (topological) similarities, can help to generate hypotheses. When combined with data from other molecular biology databases, such as PIR and GenBank (Table 3), the hypotheses can be further developed.
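The idea of 'inverting the data' can be shown with a small Python sketch: the same reactivity table is compared row by row (strains) and, after transposition, column by column (antibodies, and hence epitopes). The strain and antibody names and the reactivity values are hypothetical, and the simple matching coefficient is used only as one possible choice; the coefficients actually used by the Hybridoma Data Bank software are not stated in the text.

```python
from itertools import combinations

def simple_matching(a, b):
    """Proportion of positions at which two binary profiles agree."""
    return sum(1 for x, y in zip(a, b) if x == y) / len(a)

def transpose(table):
    """Swap rows and columns of a rectangular table."""
    return [list(column) for column in zip(*table)]

def pairwise(labels, rows):
    """Similarity for every pair of rows, keyed by the pair of labels."""
    return {(labels[i], labels[j]): simple_matching(rows[i], rows[j])
            for i, j in combinations(range(len(labels)), 2)}

# Hypothetical data: rows are strains, columns are monoclonal antibodies
# (1 = the antibody reacts with the strain).
strains = ["strain_1", "strain_2", "strain_3"]
antibodies = ["mAb_a", "mAb_b", "mAb_c"]
reactivity = [[1, 1, 0],
              [1, 1, 0],
              [0, 1, 1]]

# Relationships among strains: compare rows.
print(pairwise(strains, reactivity))

# Relationships among antibodies (epitopes): compare columns.
print(pairwise(antibodies, transpose(reactivity)))
```

Either similarity table can then be passed to any clustering procedure; only the orientation of the data matrix changes.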

Probabilistic identification matrices as strain databases

Probabilistic identification matrices are one of the end products of numerical taxonomic studies (Willcox et al. 1980). Once a stable classification is achieved (i.e., the strains are arranged into groups according to their overall similarity), the individual clusters are assessed to produce a 'percentage positive table', i.e., the percentage of strains showing a positive state for each character studied. The percentage positive table is then reduced to an identification matrix, using statistical procedures to select a minimal combination of characters which are discriminatory for the groups defined previously by the numerical taxonomy (Sneath 1979e, 1980a). The matrix should contain several strongly diagnostic characters for each group. The final identification matrix is evaluated by a series of theoretical and practical trials (Langham et al. 1989a; Sneath 1974, 1979d, 1980b,c; Sneath & Sackin 1979), after which it can be used for the identification of unknown isolates.

Identification of unknown strains can be performed using computer software. MATIDEN (Sneath 1979d) is a BASIC routine which provides three identification scores: Willcox probability, taxonomic distance and standard error of taxonomic distance, all implemented in the TAXON software (A. C. Ward, unpublished). The Bacterial Identifier (Bryant 1991) provides matrices for Gram-positive and Gram-negative bacterial species. Facilities for adding organisms and creating additional identification matrices are provided but the output is restricted to one identification score. Matrix (S. Souza, unpublished) also provides one identification score and accepts user-defined identification matrices, being most suited for routine identification tasks. A review of probabilistic identification matrices is available (Bryant 1993). The several identification matrices for microbial taxa available in the literature, including Bacillus (Priest & Alexander 1988) and Streptomyces (Kämpfer & Kroppenstedt 1991; Langham et al. 1989b) amongst others, can be loaded into suitable software (Bacterial Identifier, MICRO-IS, MATIDEN, Matrix) and be used in data analysis.

Several laboratories and governmental organisations provide facilities for the use of specialised identification matrices, such as the National Institute of Dental Research of the U.S. National Institutes of Health, which holds large datasets on mycobacteria, clinical anaerobes, and Arctic marine bacteria. Probabilistic identification matrices were developed using data from large numbers of phenotypic clusters of Alaskan marine bacteria (Davis et al. 1983), Capnocytophaga (Walczak & Krichevsky 1982) and slow-growing mycobacteria (Wayne et al. 1980).

A consortium of U.S. government agencies has developed a suite of programs for management and analysis of microbial strain data. These programs, called MICRO-IS, have an Identification Module (McManus & Krichevsky 1992) that is used for probabilistic identification of unknown strains. Twenty-five identification matrices, listed below, are provided with the programs. Users can also develop their own matrices and import them to the system.
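The route from a classification to an identification score can be sketched in Python as follows. The sketch builds a percentage positive row for a cluster and computes a Willcox-type probability, taken here as the likelihood of the unknown's results under each taxon normalised by the sum of likelihoods over all taxa in the matrix. The function names, the toy matrix and the clipping value of 0.01 are assumptions made for illustration; this is not the MATIDEN, TAXON or MICRO-IS code.

```python
def percentage_positive(cluster_profiles):
    """Percentage of strains in one cluster that are positive for each test."""
    n = len(cluster_profiles)
    return [100.0 * sum(column) / n for column in zip(*cluster_profiles)]

def willcox_probabilities(matrix, unknown, floor=0.01):
    """Normalised likelihood of an unknown against every taxon in the matrix.

    matrix  : {taxon: [percentage positive for each test]}
    unknown : list of 1 (positive), 0 (negative) or None (test not done)
    floor   : proportions are clipped to [floor, 1 - floor] so that a single
              unexpected result cannot reduce a likelihood to zero (assumed value)
    """
    likelihoods = {}
    for taxon, percentages in matrix.items():
        likelihood = 1.0
        for pct, result in zip(percentages, unknown):
            if result is None:
                continue  # missing tests do not contribute
            p = min(max(pct / 100.0, floor), 1.0 - floor)
            likelihood *= p if result else 1.0 - p
        likelihoods[taxon] = likelihood
    total = sum(likelihoods.values())
    return {taxon: value / total for taxon, value in likelihoods.items()}

# Hypothetical identification matrix (percentage positive, four tests).
matrix = {
    "Taxon A": [99, 1, 90, 10],
    "Taxon B": [1, 99, 50, 50],
    "Taxon C": [99, 99, 10, 90],
}
unknown = [1, 0, 1, 0]

scores = willcox_probabilities(matrix, unknown)
best = max(scores, key=scores.get)
print(best, round(scores[best], 4))
```

In practice an identification is usually accepted only when the best score exceeds a preset threshold; the threshold is a property of the matrix and must be chosen during its evaluation.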
It is important to note that automated identification of unknown strains must be tempered with common sense and background knowledge of the organisms being tested. Successful identification of unknowns is dependent upon using standardised methods to test the strains, i.e., methods used to develop the matrix must be known and used to test the characteristics of the unknowns. Documentation for the MICRO-IS Identification Module contains complete information on media and methods used to develop the matrices. Matrices without accompanying methods data have been removed from the module. Identification matrices for the following groups of organisms are currently included with the MICRO-IS programs:

Aerobic Gram negative fermentative bacilli
Aerobic Gram positive cocci
Aerobic Gram negative bacilli
Aerobic Gram negative nonfermentative bacilli
Aerobic Gram positive cocci
Bacillus species (two separate matrices)
Enterobacteriaceae and Vibrionaceae
Facultatively anaerobic Gram negative bacilli
Fastidious bacteria
Gram positive nonsporeforming bacilli
Lactobacillus species
Mycobacterium species
Oral Peptostreptococcus species
Oral Staphylococcus species
Oral Streptococcus species
Oral Gram negative bacilli
Pseudomonas species
Slow-growing Mycobacterium species - clinical matrix
Slow-growing Mycobacterium species - taxonomic matrix
Streptococcus species
Streptomyces species - minor clusters
Streptomyces species - major clusters
Streptoverticillium species
Vibrio and related genera

The MICRO-IS programs are public domain software and are available through the Secretariat of the Microbial Strain Data Network, described in Table 1. Despite the wide availability of matrices and computer software, identification matrices should be evaluated for the different laboratory conditions using the same standardised identification tests and reference strains (Sneath 1974). The user should be aware that in most cases, particularly when dealing with environmental isolates, only a certain percentage of the organisms will be successfully identified by the matrix (Williams et al. 1985).

Automated systems as tools for database construction

Commercially available automated bacterial identification systems represent a recent tool for the routine identification of isolates in the laboratory and construction of databases (Mauchline & Keevil 1991). Identification tests are assembled in microtitre plates or disposable kits and the results may be read and collected automatically by a plate reader connected to a microcomputer or visually by the operator (Bochner 1989). Although rapid automated identification systems are used for the quick and reliable identification of clinical isolates, few systems can identify environmental isolates with the same success (Klingler et al. 1992). The assay system is usually based on colour-developing reactions, where a change of colour in the reaction well indicates the utilisation of the substrate (e.g., sugar or nitrogen source) or enzymatic activity (e.g., phosphatase) (BIOLOG, API ZYM). Some systems use substrates chemically bonded to indicator molecules, such as methylcoumarins or methylumbelliferones, which become coloured or fluorescent when the link is cleaved (Manafi et al. 1991; Sensititre 1992). The reaction can be analysed either quantitatively or qualitatively. Advantages of this type of system are the high sensitivity, speed (the majority of the reactions can be read after a few hours of incubation) and high throughput of samples.

The identification of unknown isolates in automated systems may be performed manually by comparing results with identification tables or using computer identification matrices. A wide range of identification matrices for Gram-positive and Gram-negative organisms is currently available for automated and semi-automated systems (BIOLOG 1992; Kämpfer & Kroppenstedt 1991; Kämpfer et al. 1991; Sensititre 1992). Automated systems have similar restrictions to probabilistic identification matrices derived from numerical taxonomic studies in that they may not successfully identify all the isolates and that they need to be evaluated for the specific laboratory conditions and samples under investigation (Klingler et al. 1992).
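As an illustration of how plate reader output can feed a strain database, the following Python sketch converts absorbance readings into both a quantitative and a qualitative (binary) profile. The well labels, the use of a substrate-free control well, and the threshold of 0.15 absorbance units are assumptions made here; commercial systems apply their own, system-specific scoring rules.

```python
def quantitative_profile(readings, control_well="A1"):
    """Absorbance of each test well after subtracting the control well."""
    baseline = readings[control_well]
    return {well: value - baseline
            for well, value in sorted(readings.items())
            if well != control_well}

def qualitative_profile(readings, control_well="A1", threshold=0.15):
    """Binary profile: 1 when a well exceeds the control by more than threshold."""
    net = quantitative_profile(readings, control_well)
    return {well: int(value > threshold) for well, value in net.items()}

# Hypothetical readings from four wells; A1 is taken as the control well.
readings = {"A1": 0.08, "A2": 0.55, "A3": 0.10, "A4": 0.31}

print(qualitative_profile(readings))
```

The binary profile can be stored alongside the raw readings so that a different threshold can be applied later without re-reading the plate.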

Chemosystematic data analysis

Chemical data are a valuable source of taxonomic information. Cell wall composition, lipid patterns, fatty acids and other cell components provide complementary information to phenetic data for the classification and identification of microorganisms (Goodfellow & Minnikin 1985). The analysis of chemical data usually demands a refined statistical background from the taxonomist (James & McCulloch 1990). Complex qualitative and quantitative patterns of chemical compounds, such as fatty acid and sugar profiles, amongst other types of chemical profiles, are not as easy to analyse and interpret as numerical phenetic data (O'Donnell et al. 1985). The raw data may require normalisation, scale transformation and data reduction steps prior to statistical analysis. These procedures are not easily conducted without a computer and adequate software. Statistical analysis is sometimes complex, usually consisting of multivariate statistics, such as principal component analysis, canonical variates analysis and SIMCA (Saddler et al. 1987; Brondz et al. 1990). Several statistical software packages may be used, including Genstat (Payne et al. 1989), SAS (SAS Corporation, USA) and SPSS (Nurisis 1982), but users invariably have to program their own routines for data analysis in these systems. IDAMS is another data management and analysis software package, distributed free of charge by UNESCO (Paris, France), with versions for IBM mainframes and PC platforms. The software has several options for elaborate multivariate statistics, such as factor analysis and multidimensional scaling, which are not found in other packages.

The MIDI-Hewlett Packard gas-chromatographic identification system is one example of an automated system for generating chemotaxonomic data. The chromatographic analysis of bacterial fatty acids can be fully automated and the system provides options for the identification of unknown strains against a library of fatty acid data as well as for the classification of unknown strains (MIDI 1993). Curie point pyrolysis-mass spectrometry analysis of biological samples has been used for the classification and identification of microorganisms of clinical and industrial importance (Gutteridge et al. 1985; Sanglier et al. 1992). An up-to-date review of the use of PYMS can be found in Magee (1993). Horizon (Horizon Instruments, East Sussex, UK) provides analytical equipment with dedicated statistical software (PYMENU) for multivariate statistical analysis, including facilities for library building and identification of unknown strains (Horizon 1992). Other rapid whole-organism fingerprinting techniques, such as infra-red spectrometry (Helm et al. 1991), may also develop into useful tools for bacterial systematics.
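The pre-processing steps mentioned above for chemical profiles (normalisation, scale transformation and data reduction) can be sketched in Python as follows. The particular choices made here, namely percentage of total peak area, a log10(x + 1) transformation, and removal of components whose mean is below 1%, are assumptions for illustration and not the procedures of any package named in the text; the peak areas are invented.

```python
import math

def normalise(peak_areas):
    """Express each peak as a percentage of the total peak area."""
    total = sum(peak_areas)
    return [100.0 * area / total for area in peak_areas]

def log_transform(values):
    """log10(x + 1) transformation to damp the influence of dominant peaks."""
    return [math.log10(v + 1.0) for v in values]

def select_components(profiles, min_mean=1.0):
    """Data reduction: indices of components whose mean reaches min_mean."""
    n = len(profiles)
    means = [sum(column) / n for column in zip(*profiles)]
    return [i for i, mean in enumerate(means) if mean >= min_mean]

def prepare(raw_profiles, min_mean=1.0):
    """Normalise, drop minor components, then transform every profile."""
    normalised = [normalise(p) for p in raw_profiles]
    keep = select_components(normalised, min_mean)
    return [log_transform([p[i] for i in keep]) for p in normalised]

def euclidean(a, b):
    """Euclidean distance between two prepared profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical peak areas for four fatty acids in three strains.
raw = [[200, 50, 250, 0],
       [180, 60, 258, 2],
       [50, 300, 148, 2]]

prepared = prepare(raw)
print(euclidean(prepared[0], prepared[1]), euclidean(prepared[0], prepared[2]))
```

The prepared profiles are the natural input for multivariate methods such as principal component analysis; the distance shown here is only the simplest way of comparing them.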

Techniques, software tools and molecular information

Molecular techniques are adding a new dimension to bacterial systematics. Introduced in the 1960s, the determination of DNA base composition and nucleic acid hybridisation marked the beginning of a new era in bacterial systematics (Stackebrandt & Goodfellow 1991). Further developments, including the introduction of techniques based on the comparison of total DNA and sequences from homologous genes, are allowing the measurement of genealogical distances and the order in which the organisms evolved. New tools, such as oligonucleotide probing, the generation of taxon-specific profiles of DNA fragments and analysis of PCR gene products are now available for the identification of taxa at all levels. A vast array of software is available for the analysis of molecular data and for the dissemination of molecular information via electronic networks.

Comparative sequence analysis, especially of ribosomal RNA sequences, has gained renewed interest from microbiologists as a useful tool for the classification and identification of bacteria. Hypervariable and conserved regions in the 23S, 16S and 5S ribosomal ribonucleic acid can be used for determining taxonomic relationships at suprageneric, generic, species and strain level. Comparative 16S rRNA sequence analysis has been used to study phylogenetic diversity in Bacillus. Rössler et al. (1991) have determined that there are four major clusters of Bacillus species, based on rRNA sequence similarities. The designated clusters are: 'Bacillus subtilis', which includes B. stearothermophilus; 'Bacillus brevis', including B. laterosporus; 'Bacillus alvei', including B. polymyxa, B. macquariensis, and B. macerans; and a 'Bacillus cycloheptanicus' group that includes only B. cycloheptanicus. As further evidence of the distinct divisions among these Bacillus clusters, the authors point out that some genera of non-spore-forming Gram-positive bacteria, such as enterococci, lactobacilli, and leuconostocs, share more sequence similarity with B. subtilis and B. stearothermophilus than any of the Bacillus species from either the 'B. alvei' or 'B. brevis' clusters. Based on these rRNA sequence comparisons, the authors conclude that there is enough genotypic diversity among the Bacillus species to justify the creation of new genera and to alter the current classification of the genus Bacillus.

It has also been demonstrated (Jurtshuk et al. 1992) that in situ hybridisation techniques using 16S rRNA segments can differentiate between closely related bacteria. The use of DNA probes to 16S rRNA for rapid identification of bacteria has proved to be advantageous, primarily because of the large database of 16S rRNA sequences available for bacteria. Probes specific to an organism's 16S rRNA can be used to rapidly identify bacteria at the family or genus level in tissue, soil and water samples (Sayler & Layton 1990). However, the technology can also be used for taxonomic purposes, i.e., to differentiate between closely related bacterial species. Jurtshuk et al. (1992) have shown that two Bacillus species, B. polymyxa and B.
macerans, having similar phenotypic characteristics, can be clearly differentiated by use of a rapid in situ hybridisation method. This in situ technique, which draws on information from the sequence databanks and on the use of software for sequence alignment and editing, can be applied to taxonomic studies of other bacterial genera. The growing body of bacterial sequence

data and the development of software tools for the use of specialised subsets will be of great assistance to bacterial systematists as they realise the power of these tools for classification and differentiation of organisms.

Ribotyping is another molecular approach to characterisation of microorganisms. The technique is based on the fact that, although sequences of rRNA genes are highly conserved, some regions of the genes are variable and indicative of differences among the species. Ribotyping involves the analysis of RFLP patterns of rRNA operons. Genomic DNA is submitted to restriction endonuclease digestion using a suitable enzyme and the fragments separated by gel electrophoresis are transferred and immobilised on a membrane. The location of the rRNA operons is mapped using labelled 16S and 23S RNA or a cloned rRNA operon from a related bacterial species or genus. The resulting patterns are indicative of the number and distribution of the rRNA operons on the bacterial genome, therefore reflecting genome organisation and genomic relatedness between different strains. Modern non-radioactive labelling and detection systems, such as biotin, digoxigenin and chemiluminescence, provide several advantages for this methodology, such as increased sensitivity, safe handling of probes and the possibility of reprobing the same membrane with other nucleic acid probes. Riboprinting involves the extraction and PCR-assisted amplification of DNA from the organism, purification of the gene, restriction enzyme digestion, and data analysis.

The combined power of molecular and computer techniques to solve taxonomic problems is illustrated in numerous publications describing the use of ribotyping for characterisation and classification of bacteria. rRNA restriction patterns have been used as taxonomic tools for classification of Aeromonas (Lucchini & Altwegg 1992; Moyer et al. 1992); Neisseria (Woods et al. 1992); Clostridium (Gurtler et al. 1991); Shigella (Hinojosa-Ahumada et al. 1991) and other bacterial genera. Ribotyping is an excellent tool for elucidating genetic relationships among bacteria, primarily because ribosomal genes are highly conserved and stable and multiple operons are usually present on the bacterial chromosome. The results of restriction enzyme digests can be compared by cluster analysis and other statistical techniques. Using maximum parsimony, Gurtler et al. (1991) constructed a dendrogram that identified two distinct clusters of Clostridium based on restriction site differences. Hinojosa-Ahumada et al. (1991) have used the technique of riboprinting to subtype isolates of Shigella sonnei from different geographic regions and outbreaks of shigellosis, thus demonstrating the utility of ribotyping for epidemiological studies and for studying the genetic relationships among the causative organisms. The technique is particularly useful for closely related organisms having limited phenotypic differentiating characteristics.

Protein and DNA electrophoretic patterns, particularly whole-cell proteins and restriction fragments of DNA, generate good quality data for taxonomic studies. Protein electrophoretic fingerprints are usually analysed using a densitometer, and DNA fragment patterns derived from simple restriction endonuclease digestion or from restriction fragment length polymorphism (RFLP) analysis can be subjected to image analysis. One of the consequences of the evolution of two-dimensional (2D) gel techniques has been the development of databases that contain master gels from a variety of bacterial sources. These databases are going to play an increasingly important role in the analysis of bacterial genomes. 2D gel databases generally contain one or more master images of the gels that correspond to the organism studied; spots on these images are attributed an identification code and a variable percentage of these spots are linked to known proteins. The identification of a protein on a 2D gel is generally carried out using antibodies or by microsequencing, leading to the elucidation of partial sequences and physico-chemical data for a number of yet uncharacterised proteins. SWISS-PROT has committed itself to work in close collaboration with a number of groups developing 2D gel databases.
Since last year, cross-references have been available to the gene-protein database of Escherichia coli K12, now called ECO2DBASE (van Bogelen et al. 1992), and symmetrically that database now contains cross-references to SWISS-PROT. A file server will be set up that will allow anyone with a network connection to obtain annotated graphic files containing the region of the gel that corresponds to a selected SWISS-PROT entry linked to SWISS-2DPAGE (Bairoch 1993a).

Quantitative or qualitative data may be analysed by comparing similarities between individual profiles using band matching or overall similarity algorithms. A classification can be achieved using hierarchical cluster analysis or multivariate statistical analysis (Ståhl et al. 1990). Generally speaking, software for the analysis of electrophoretic fingerprints can handle both DNA and protein electrophoresis patterns, with a few exceptions. GelManager from BioSystematica has features for data acquisition from densitometers and document scanners, and the data management provides facilities for classification (similarity matrix and cluster analysis) and identification of unknown organisms against a library of representative fingerprints (Manfio 1993). GelCompar (Kersters 1985) provides extensive facilities for data management, including hierarchical clustering and principal component analysis (PCA), a wide choice of outputs (dendrograms, ordination plots, shaded similarity matrices, band profile) and facilities for identification of unknown strains against a database of fingerprints. State-of-the-art data acquisition, database management and analysis of DNA and protein fingerprints is provided by the BioImage package (Millipore Corporation, USA), a dedicated image analysis system with software modules for the analysis of DNA sequencing, one- and two-dimensional protein electrophoresis, RFLP and whole band DNA electrophoresis gels. The system combines powerful hardware (SPARCstation, Sun Microsystems Inc.), extensive database capabilities (databases up to 10,000 profiles) and statistical facilities. Further software developments may enable other taxonomic applications of this system.

Molecular information is being generated from the implementation of international programs that envision the formidable task of deciphering the sequences of several genomes.
At present, what has emerged from these programs is extremely interesting and will lead to new molecular biology approaches to the study of biodiversity (Monolou 1992). The immediate benefit is the establishment of numerous molecular biology e-mail servers and

Fig. 1. Secondary structure of bacterial 16S ribosomal RNA (diagram labelled 'Escherichia coli 16S rRNA').

several bulletin boards for molecular biologists. Lists of e-mail servers and bulletin board services can be obtained at serv-ema.txt and serv-bbo.txt. A detailed list of molecular biology ftp servers for databases and software is available at [email protected] (serv-ftp.txt). The document describes the various sources of databases and software for molecular biologists that are publicly available on the Internet computer network (Bairoch 1993b,c).

LiMB, a List of Molecular Biology Databases available on the Internet, provides the scientific community with a comprehensive overview of databases relevant to molecular biology and related data sets. Information on how to access databases, how to contribute information as well as on general characteristics and availability of databases is provided by LiMB (Burks et al. 1988). Established and maintained at the Los Alamos National Laboratory in the USA, LiMB is periodically updated (Lawton et al. 1989, 1992).

Nucleic acid sequences are usually deposited in sequence databanks, such as GenBank (USA), EMBL (European Molecular Biology Laboratory) and DDBJ (DNA Databank of Japan) (Tables 2 and 3). A very large number of nucleic acid sequences from viruses, prokaryotic and eukaryotic organisms are deposited under unique accession numbers and may be searched by genus and species of origin and keywords in the commentary fields, as well as by homology with a nucleotide sequence provided by the user.

The Ribosomal RNA Database (RDP) is a project headed by Carl Woese (Olsen et al. 1991) and funded by the U.S. National Science Foundation. It provides rRNA sequence data along with software packages for managing, analysing, and viewing the data. Although the RDP aims to include sequence data from all types of organisms, the currently available database contains aligned and phylogenetically organised small subunit rRNA sequences from close to 500 prokaryotes. Among the best represented genera (by number of species) are Bifidobacterium, Clostridium, Cytophaga, Flavobacterium, Flexibacter, Fusobacterium, Lactobacillus, Mycobacterium, Mycoplasma, Spirochaeta and Spiroplasma. Sequences in the RDP are drawn from the public sequence databanks, e.g. GenBank and

EMBL, but also include sequences from investigators who have not deposited with the other databanks. The RDP is available from the University of Illinois via the Internet. Software offered includes several sequence editors, phylogenetic analysis tools, a tree drawing program, and an editor for managing bibliographic data associated with rRNA sequences. The RDP plans to accept sequence submissions which it will format and release for distribution to other databanks. It also provides services such as sequence alignment and secondary structural representation. An example of a secondary structure diagram provided by RDP and received over the Internet (courtesy of Carl Woese) is illustrated in Figure 1. The RDP is unique in that it offers a 'sequence assessment' system that can detect possible errors and anomalies in submitted sequences. It also offers several formats, including GenBank format, and various software packages and services. It is anticipated that the scope and coverage of the RDP will expand significantly in the near future. Another important resource for bacterial systematists is the E. coli Genetic Stock Center and Database developed by Barbara Bachmann at Yale University (Berlyn & Letovsky 1992). Characteristics of over 7,000 mutant derivatives of E. coli K12 are comprehensively described in the database including alleles, structural mutations, mating types and plasmids. Gene names, properties, products and mutations are also listed as well as derivation, strain names, and references. The E. coli Stock Center has been a strain and data repository for the past 20 years and its staff have provided leadership in the development of standards for genetic nomenclature and data organisation. Recently, the Stock Center has announced on-line availability of the database (see Table 2). 
Although the database does not include sequence information, it is anticipated that it will be complementary to the sequence databanks in providing users with a mechanism to check on gene names before depositing sequence data. Systematists studying Bacillus strains have access to a similar resource produced by the Bacillus Genetic Stock Center at the Ohio State University. The database, containing information on Bacillus mutant strains, gene names, literature references and availability, is distributed on floppy disc and in hard copy catalogue format. Similar information is available for Staphylococcus strains from the Staphylococcus Stock Center at Iowa State University.

Sequence alignment. There are a number of algorithms available for a wide scope of DNA and protein sequence alignments and comparisons. Rapid searching programs, such as FASTA (Pearson 1990) or BLAST (Altschul et al. 1990), have facilitated efficient comparisons of unidentified sequences with those existing in databanks. An excellent review (Doolittle 1990) of computerised sequence analysis for the study of molecular evolution provides in-depth examinations of the various algorithms and their attributes. It is clear that molecular biologists recognise the significance of their results, which are highly dependent on the algorithm used for analysis. An overview of a few of the more popular packages for rapid sequence alignment and comparison follows.

FASTP, developed by Lipman and Pearson (1985), can be used for both nucleic acid and protein sequence alignment but was designed primarily to identify protein sequences that have descended from a common ancestor (Pearson 1990), making the algorithms more suitable for protein sequence comparisons. Modifications of this program have been developed and a package called FASTA is suitable for aligning nucleic acid sequences and also for comparing protein sequences to nucleic acid sequences. The programs are written in C and run on DOS, Macintosh, Unix, and VMS operating systems, and are highly portable. The number of searchable residues depends on the hardware, but there is virtually no limit if the library is scanned in overlapping pieces. FASTA limits the number of gaps that can be inserted into an alignment, while rigorous optimal alignments, though slower, may extend the alignment. The programs are freely available on the Internet and source code is available from the developer, William Pearson.
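As background to the speed of such programs: word-based searches first locate short words (k-tuples) shared by the query and each library sequence, and note the diagonal (offset) on which the shared words fall. The Python sketch below illustrates only this first screening idea. It is a simplified illustration written for this text, not the FASTP or FASTA source code, and the sequences, function names and word length are hypothetical.

```python
from collections import defaultdict

def index_words(sequence, ktup):
    """Map every word of length ktup to the positions where it starts."""
    table = defaultdict(list)
    for i in range(len(sequence) - ktup + 1):
        table[sequence[i:i + ktup]].append(i)
    return table

def best_diagonal(query, subject, ktup=4):
    """Return (offset, hits) for the diagonal sharing the most words."""
    table = index_words(query, ktup)
    diagonals = defaultdict(int)
    for j in range(len(subject) - ktup + 1):
        # Every query position holding the same word marks diagonal j - i.
        for i in table.get(subject[j:j + ktup], ()):
            diagonals[j - i] += 1
    if not diagonals:
        return None, 0
    offset = max(diagonals, key=diagonals.get)
    return offset, diagonals[offset]

def rank_library(query, library, ktup=4):
    """Sort library entries by the number of shared words on their best diagonal."""
    scored = [(name, best_diagonal(query, seq, ktup)[1])
              for name, seq in library.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Hypothetical query and library sequences.
query = "ACGTGCATTAGC"
library = {
    "seq1": "TTTTACGTGCATTAGCTTTT",
    "seq2": "GGGGGGGGGGGGGGGG",
    "seq3": "ACGTGCAAAAAAAA",
}

print(rank_library(query, library))
```

Only the best-ranked library sequences would then be passed to a slower, more rigorous alignment step; a larger word length makes the screen faster but less sensitive.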
Another tool for rapid sequence comparison is BLAST, the Basic Local Alignment Search Tool. BLAST utilises the maximal segment pair (MSP) score, which can be used for both protein and nucleic acid sequence analyses. It is a rapid sequence similarity search program and may also be used for motif searches, gene identification searches, and in the analysis of multiple regions of similarity in long DNA sequences (Altschul et al. 1990). BLAST does not allow gaps in the local regions that it reports.

MACAW, the Multiple Alignment Construction and Analysis Workbench, is useful for studying molecular evolution and also for analysing structure/function/sequence relationships (Schuler et al. 1991). The user of MACAW constructs multiple alignments by locating and combining blocks of aligned sequence segments. Blocks may be edited or linked to form a composite multiple alignment, i.e., the user may 'custom design' the set of sequence segments that are submitted for analysis.

The U.S. National Center for Biotechnology Information is using the Internet to provide access to public domain software such as BLAST and MACAW. The programs and the source code may be obtained via anonymous ftp to ncbi.nlm.nih.gov. The pub directory contains these and other computational tools that may be transferred and used without restriction.

Physical/genetic mapping. Colibri (Medigue et al. 1990) is a software system for Macintosh computers that provides the ability to localise unknown DNA fragments from the Escherichia coli chromosome on the restriction map established by Kohara et al. (1987). The software, a runtime version of 4th Dimension, is available on the Internet via anonymous FTP to radium.jussier.fr. Correlation of physical and genetic maps allows for analysis of polymorphism data. Mapping positions can be assigned to most genes for which sequences are available. Detection of inaccuracies in the map is facilitated through the retrieval of appropriate data from the sequence databanks in combination with the use of Colibri. The software also permits placement on the map of any cloned gene for which restriction fragments are provided, allowing for rapid mapping of new genes.
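
The idea of localising a fragment on a restriction map can be sketched as a search for a run of consecutive site-to-site distances that agrees with the observed fragment sizes. The example below is only an illustration under simplifying assumptions and is not Colibri's method: it uses a single enzyme, a hypothetical function name `locate_fragment`, invented site positions, and a relative size tolerance to allow for measurement error.

```python
def locate_fragment(map_sites, fragment_sizes, tolerance=0.1):
    """Find map intervals whose consecutive restriction fragments match.

    map_sites      -- sorted positions (bp) of one enzyme's sites on the map
    fragment_sizes -- ordered sizes (bp) of the internal restriction
                      fragments observed in the unknown clone
    tolerance      -- accepted relative deviation from the map distance
                      (an assumed value, standing in for gel sizing error)
    Returns a list of (start, end) map positions for each matching window.
    """
    if not fragment_sizes:
        return []
    # Distances between neighbouring sites on the reference map.
    gaps = [right - left for left, right in zip(map_sites, map_sites[1:])]
    n = len(fragment_sizes)
    hits = []
    for start in range(len(gaps) - n + 1):
        window = gaps[start:start + n]
        if all(abs(g - f) <= tolerance * g
               for g, f in zip(window, fragment_sizes)):
            hits.append((map_sites[start], map_sites[start + n]))
    return hits


if __name__ == "__main__":
    sites = [0, 1200, 3000, 3500, 7000, 8200]   # invented positions
    print(locate_fragment(sites, [1750, 520]))
```

A real system would combine the patterns of several enzymes, which sharply reduces the number of chance matches returned for short fragment lists.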
Colibri, written in PASCAL, provides a good model for software to correlate physical and genetic maps of other bacteria.

Electronic submission of sequence data. Sequence data submission policies have been designed by agreement among the nucleotide and protein sequence databanks. Researchers should submit nucleotide sequence data directly to GenBank or EMBL for assignment of an accession number prior to publication. Authors are strongly urged to use the sequence submission software package AUTHORIN to submit their sequence data to the databanks; a free copy (for either the IBM PC or Macintosh) can be obtained by sending a request to [email protected] (Garavelli 1993). The U.S. National Center for Biotechnology Information (NCBI) has also assumed responsibility for distribution of the AUTHORIN software, which can be obtained via e-mail to [email protected]. Data submissions can be sent directly to Los Alamos National Laboratory at [email protected].

Multifunctional software for PCs. The array of multifunctional software packages for sequence alignment and analysis is overwhelming for the molecular biologist who has to decide which one provides the best tools for solving a particular problem. Among software packages available for the Macintosh, GeneWorks, LaserGene, Mac DNA/Pronasis, and MacMolly are used for accessing PIR, EMBL, GenBank, and SWISS-PROT, for performing sequence alignment, for predicting PCR primers, and for other analytic functions (Ahern 1993). Several also provide the ability to draw phylogenetic trees based on sequence relatedness from multiple sequence alignments. It is practically impossible to identify and describe every set of programs used for this purpose, but the examples below provide an overview of the types of software available to the molecular taxonomist. For more detailed information on these and other programs, Biotechnology Software: The Journal of Computational Biology, edited by Kevin Ahern and published bimonthly by Mary Ann Liebert Publishers, is recommended.

LaserGene for the Macintosh from DNA Star contains modules that carry out sequence editing, restriction mapping, sequence alignment, database analysis, and protein analysis. The editor, EditSeq, is used for entering sequences and importing/exporting sequences in various formats.
Open reading frames may be found by defining start and stop codons. MapDraw provides restriction mapping functions, recognises both linear and circular sequences, and can draw circular maps. Enzyme sites with which to scan the sequence are selected in MapDraw. The module called Align allows for alignment of multiple DNA or protein sequences. Several algorithms may be used for alignment. Sub-alignments can also be done. The program also provides for database analysis through the GeneMan module, from which one is able to query the sequence databases, e.g. GenBank, EMBL, SWISS-PROT and PIR. Protean, the protein analysis module, produces two-dimensional plots based on common sequence algorithms. The versatility of this suite of programs makes it attractive to molecular systematists.

GeneWorks by Intelligenetics is another popular set of sequence analysis programs for the Macintosh. GeneWorks provides a number of options for analysis, including algorithms for restriction and protease analysis, PCR primer prediction, pair and multiple homology determination, phylogenetic relationships, and motif identification. For restriction analysis, the package provides a comprehensive list of enzymes and allows the user to add new enzymes. Reports are in text or graphic form. For finding open reading frames (ORFs), GeneWorks provides options such as genetic code, minimum number of bases, sequence of the initiator codon, and location and sequence of upstream sequences. Information about the location of the ORF, its length, and the number of occurrences of each amino acid and each codon may be viewed on the output screen. Sequence alignment may be done in pairs or in groups. Dot matrix analyses can also be used to examine the relatedness of sequences.

Intelligenetics has also developed PC/GENE for sequence analysis on DOS-based systems. This software has capabilities similar to those of GeneWorks. A new module is available for PCR primer design that gives the user more flexibility than previously available.

Gene Runner for Windows is a relative newcomer to the field of sequence analysis software.
Gene Runner can be used with the usual sequence bank formats as well as with the National Center for Biotechnology Information Entrez retrieval software for GenBank access. Capabilities for ORF analysis, restriction analysis, fragment analysis, oligonucleotide analysis, PCR analysis, sequencing primer analysis, and reverse translation probe analysis are provided.
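
Several of the packages above locate open reading frames from user-defined start and stop codons and a minimum length. A minimal sketch of that kind of search is given here. It is not taken from any of the named products: the function name, the default codon sets (the standard stop codons and ATG as initiator) and the restriction to the three forward reading frames are assumptions made to keep the example short.

```python
STOP_CODONS = frozenset({"TAA", "TAG", "TGA"})  # standard genetic code


def find_orfs(seq, start_codons=("ATG",), stop_codons=STOP_CODONS,
              min_codons=3):
    """Return (start, end, frame) for ORFs on the forward strand.

    start and end are 0-based, end-exclusive indices into seq; the ORF
    runs from the first start codon in a frame to the next in-frame stop
    codon, inclusive. min_codons counts the stop codon as well.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None:
                if codon in start_codons:
                    start = i               # open a candidate ORF
            elif codon in stop_codons:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                # look for the next ORF
    return orfs


if __name__ == "__main__":
    for start, end, frame in find_orfs("CCATGAAATTTTAGGG"):
        print(frame, start, end)
```

Passing a different `start_codons` tuple or `stop_codons` set corresponds to the choice of initiator codon and genetic code that such packages offer; a fuller version would also scan the reverse complement.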

Nomenclatural and taxonomic 'authority files'

Inherently, systematics will always be subject to interpretation. In many cases there are 'competing taxonomies.' Molecular techniques are now augmenting classical taxonomic methodology. Debates among systematists are commonplace and must continue with the expansion of knowledge about the world's organisms and their relationships. For this reason, 'official' taxonomic treatments are rare. There is, however, a recognised need for publicly available databases containing the most comprehensive information possible on the nomenclature, history, authorities, hierarchical placement, and competing opinions on the taxonomic status of organisms.

In the case of bacteria, names at the genus and species levels are controlled by the International Code of Nomenclature of Bacteria (Sneath 1992). The correct name of a bacterial taxon is determined by valid publication, legitimacy, effective publication and priority of publication. Since 1 January 1980, priority of bacterial names is based upon the Approved Lists of Bacterial Names (Skerman et al. 1980). Names that were not included in the Approved Lists lost standing in bacterial nomenclature.

The Bacterial Nomenclature Up-to-Date Database is compiled by the Information Centre for European Culture Collections (ICECC) together with the Deutsche Sammlung von Mikroorganismen und Zellkulturen GmbH (DSM). This database is available in a printed version or on floppy disk. It includes all valid bacterial names and all valid nomenclature changes published since 1 January 1980. It is updated with the publication of each new issue of the IJSB and is available on-line through the Internet (Tropical Database Gopher, Brazil) or through the Microbial Strain Data Network (MSDN).

There are several computerised resources that provide authority references, synonyms, and taxonomic rank of bacterial names. While none of these resources is officially sanctioned by the International Committee on Systematic Bacteriology, several are extremely valuable for researching the origin and current status of genus and species names.

Bacteriologists are fortunate to have a widely accepted de facto authoritative resource for bacterial systematics. The Bergey's Manual Trust, headquartered at Michigan State University in East Lansing, Michigan, has published taxonomic treatments of bacterial species since 1923. In addition to the classifications, the Manuals contain descriptions of all genera and higher taxa, criteria for species differentiation, taxonomic comments, and references to authors of the names. Future editions of Bergey's Manual of Determinative Bacteriology and Bergey's Manual of Systematic Bacteriology will be published both in hard copy and on computer-accessible media. A prototype data model for a computerised version of the Bergey's Manual of Determinative Bacteriology, 9th edition, is being developed. This model is based on the use of the RKC Codes, a set of standardised descriptors for bacterial characteristics created by Rogosa et al. (1986). Examples of these codes with their numeric equivalents are listed below.

SECTION 24 FATTY ACIDS PRODUCED FROM D-GLUCOSE
024016: Acetic acid is produced from D-glucose.
024017: Acetic acid without carbon dioxide is produced from D-glucose.
024018: Butyric acid is produced from D-glucose.
024019: Isobutyric acid is produced from D-glucose.
024020: Caproic acid is produced from D-glucose.
024021: Isocaproic acid is produced from D-glucose.
024022: Caprylic acid is produced from D-glucose.
024023: Formic acid is produced from D-glucose.
024211: Fumaric acid is produced from D-glucose.
024024: Oenanthic (heptanoic) acid is produced from D-glucose.
024025: Propionic acid is produced from D-glucose.
024026: Valeric acid is produced from D-glucose.
024027: Isovaleric acid is produced from D-glucose.

RKC Codes applicable to each genus of bacteria are being supplied to the international editorial staff of Bergey's Manuals. Editors will select the codes that apply to each species within the genus and indicate a positive or negative value for each of the phenotypic characteristics to differentiate among the species. Tables will be constructed within the database that will allow for adding, editing, and retrieving data in response to specific queries. Standardised descriptors will facilitate interspecies comparisons and provide the mechanism for matching user-generated data with the Bergey's data.

Another computerised taxonomic resource is the on-line BIOSIS Register of Bacterial Nomenclature (BRBN). This resource contains nomenclatural and taxonomic information on over 15,500 bacteria, including the following data fields: bacterial name, authority name, synonyms of the bacterial name, International Committee on Systematic Bacteriology (ICSB) status, and taxonomic rank (including Bergey's Manual 7th and 8th editions). An effort is also made to cover all commonly occurring infra-specific ranks, the major portion of which are pathovars and serovars. BRBN is a component of the BIOSIS Taxonomic Reference File (TRF). This database contains taxonomic data on other classes of organisms and provides electronic mail, computer conferencing, and bulletin board services. The complete list of BRBN data elements is given below.

BRBN DATA ELEMENTS
TRFNUM: Unique identifying number of each record.
NAME: Bacterial name described in record.
LEVEL: Taxonomic level of bacterial name.
ICSBSTAT: Official ICSB status.
AUTHORITY: Author and date of first published description.
TYPE: Type culture collection identifier.
SOURCE: Publication first noted by TRF staff.
PREFNAME: Preferred name if synonyms exist.
REFERENCE: Bibliographic reference where name first published.
BERGEY7: Status in Bergey's Manual 7th edition.
BERGEY8: Status in Bergey's Manual 8th edition.
BSB: Coverage of name in Bergey's Manual of Systematic Bacteriology, Volumes 1-4.
NAMENOTE: Nomenclatural notes which help explain current status of name.
BIOSIS1: BIOSIS Biosystematic Code(s) pre 1979 (order level).
BIOSIS2: BIOSIS Biosystematic Code(s) post 1979 (family level).
CODE: Four-letter abbreviations for genera on Approved Lists.
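
To show how data elements of this kind might be used, the following sketch models a few of the BRBN fields as a record type and resolves a queried name to its preferred name. The field names follow the list above, but the class, the lookup function, the treatment of an empty preferred-name field and the sample records (deliberately invented placeholder names) are all assumptions; the actual BRBN storage format is not described here.

```python
from dataclasses import dataclass


@dataclass
class BRBNRecord:
    """A subset of the BRBN data elements (illustrative only)."""
    trfnum: int          # TRFNUM: unique identifying number
    name: str            # NAME: bacterial name described in the record
    level: str           # LEVEL: taxonomic level of the name
    icsbstat: str        # ICSBSTAT: official ICSB status
    authority: str       # AUTHORITY: author and date of first description
    prefname: str = ""   # PREFNAME: assumed empty when no synonym is preferred


def preferred_name(records, query):
    """Return the preferred name for query, or None if it is not on file."""
    by_name = {r.name.lower(): r for r in records}
    record = by_name.get(query.strip().lower())
    if record is None:
        return None
    return record.prefname or record.name


if __name__ == "__main__":
    # Placeholder records, not real taxa or real BRBN content.
    register = [
        BRBNRecord(1, "Examplea prima", "species", "approved",
                   "Author A 1980"),
        BRBNRecord(2, "Examplea antiqua", "species", "synonym",
                   "Author B 1950", prefname="Examplea prima"),
    ]
    print(preferred_name(register, "examplea antiqua"))
```

Keeping the synonym link as a field of each record, rather than in a separate table, mirrors the flat record layout suggested by the element list, though a production database could equally normalise it.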

Culture collections and strain databases

Large amounts of information on collection holdings are acquired as a result of routine activities and the research applied to the cultures. Information may relate to the phenotypic or genotypic features of the cultures, to their industrial applications, their taxonomic relatedness, their response to different preservation methods, their history and their patent status. This information is increasingly held in computers for ready access and updating (Canhos 1990). It is anticipated that in the near future, strain information will include reference to taxon-specific sequences, probes and primers. This information, together with the availability of strains and genetic material, will lead to advances in the development and application of modern diagnostic tools in microbial systematics.

Established collections are collaborating nationally, regionally and internationally to make data more readily available to the scientific community through databases and networks. The functions and differences of the major computerised strain databases have been reviewed by Kirsop (1988a). Descriptions of culture collections and a species-oriented directory are held at the World Data Center for Microorganisms (WDC), housed at RIKEN (Institute of Physical and Chemical Research) in Japan, and are available on-line through the Internet Gopher.

A number of major European culture collections, supported by the 'Biotechnology Action Programme' (BAP) of the European Community (EC), are involved in the establishment of the multinational 'Microbial Information Network Europe' (MINE). The project aims at the centralisation of data on strains in one database, in order to ensure easy and quick access to the data, which will be available on-line at the German Institute for Medical Documentation and Information (DIMDI). In order to allow the electronic combination of data from different collections, a uniform system for computer storage and retrieval of strain data is being adopted. Formats for bacteria (Stalpers et al. 1990) and for yeasts and filamentous fungi (Gains et al. 1988) have been developed.

The Microbial Germplasm Data Network (MGD) is the USA national network of scientists and collections holding microbial germplasm utilised primarily in plant-related research. The MGD, which is available from Oregon State University, USA, has on the Internet almost a gigabyte of information provided by scientists describing their research culture collections. Accession data on plant pathogens, symbionts and biological control organisms are included and uploaded on a daily basis. Future developments at the MGD will include a spelling checker for taxonomic names, e-mail searches and responses for users not on the Internet, and image data relevant to microbial germplasm. Further information can be obtained through [email protected].

The Microbial Strain Data Network (MSDN) is an internationally sponsored information network that provides mechanisms for locating microorganisms and cultured cells with specific properties through an electronic communication system (Kirsop 1988b). The MSDN is also a facilitating mechanism that builds links between databases. Many of the major collections are now making their catalogues available on-line through the MSDN network.
The MSDN services make it possible both to search catalogues and to order cultures on-line, and collections are increasingly taking advantage of these facilities and enriching their databases with taxonomic information.
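
The two ideas in this section, a uniform format that lets records from different collections be combined and a search for strains with specific properties, can be sketched as follows. Everything in the example is hypothetical: the collection names, the field names on each side, the common schema and the sample records are invented for illustration and do not reproduce the MINE formats or any MSDN interface.

```python
# Hypothetical per-collection field names mapped to one common schema.
FIELD_MAPS = {
    "collection_a": {"strain_no": "accession", "species": "name",
                     "features": "properties"},
    "collection_b": {"id": "accession", "taxon": "name",
                     "traits": "properties"},
}


def to_common(source, record):
    """Convert one collection-specific record to the common schema."""
    mapping = FIELD_MAPS[source]
    common = {target: record[field] for field, target in mapping.items()}
    common["source"] = source
    # Normalise property terms so that queries are case-insensitive.
    common["properties"] = {p.strip().lower() for p in common["properties"]}
    return common


def find_strains(catalogue, required):
    """Return accessions of strains that have every required property."""
    wanted = {p.strip().lower() for p in required}
    return [r["accession"] for r in catalogue if wanted <= r["properties"]]


if __name__ == "__main__":
    catalogue = [
        to_common("collection_a", {"strain_no": "A-001",
                                   "species": "Examplea prima",
                                   "features": ["Thermophilic",
                                                "amylase producer"]}),
        to_common("collection_b", {"id": "B-17",
                                   "taxon": "Examplea secunda",
                                   "traits": ["thermophilic"]}),
    ]
    print(find_strains(catalogue, ["thermophilic"]))
```

The design choice worth noting is that each collection keeps its own field names and only a small mapping table has to be agreed, which is one way a shared format can be adopted without rewriting local databases.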

Conclusions and outlook

The polyphasic approach to microbial systematics requires a broad spectrum of information, including access to primary molecular data, such as sequences, and to software tools for data analysis. The expansion of the taxonomic base is becoming very dynamic and demands close monitoring of new developments in research methodology and data analysis. To make it easier to follow frontier developments, new mechanisms for the dissemination of information relevant to microbial systematics must be created. The development of new taxonomic databases, together with the creation of specialised services such as newsgroups, listservers, mailing lists and information archives, is required for the advancement of microbial systematics. Long-term planning and funding are fundamental for the success of the enterprise.

Considering its rapidly growing connectivity and relatively low cost to the end users, the Internet is the best route for the dissemination of taxonomic information at the present time. Initially, access to the Internet was a privilege only available to those at universities and governmental institutions able to finance organisational subscriptions to this worldwide network. However, new links have been established that provide access to individual users at very reasonable costs. The Internet is an attractive tool for research and public information programs and reaches regions of perceived remoteness such as Africa and Latin American countries, important stockholders of biological diversity. In industrialised countries, high-speed data transmission backbones are being established that will allow the transfer of massive amounts of data and images in real time. Developments in public domain software shared via the Internet will lead to a massive growth of public domain databases as end products of newly established National and International Biodiversity Programmes.

While it is true that there is much software and other information to be gathered on the Internet, it is also the case that software tools and databases are being developed that will be commercialised. Although these will not be free to Internet users, a microbial systematics list server or user group would serve as a locator for these software tools as well as linking people to databases and other resources of importance to taxonomic interests. Daily, new ftp sites are being created, discussion lists are being established and biological databases are being set up on the Internet.

There is a clear need for the development of co-ordinating and linking mechanisms in order to solve the problems related to the collection and analysis of primary data and their dissemination through electronic networks. Related international organisations should promote the development of common protocols, guidelines and standards to ensure data quality. Linking the existing computerised databases and information resources is important for making the data available worldwide. Existing models such as the MSDN and the Internet Gophers clearly indicate that the linkage of diverse distributed databases is feasible at low cost. Following the developments in molecular biology and information technology, it is anticipated that in the near future high-quality and easily accessible taxonomic databases will be available and linked through common open protocols. These developments, associated with the creation of other information dissemination mechanisms such as listservers, newsgroups, ftp sites and information archives, will enlarge the knowledge base necessary for the elucidation of bacterial diversity and taxonomic complexity, allowing new adventures in the biotechnological exploitation of the microbial gene pool.

Acknowledgement

We would like to acknowledge the helpful reviews of the paper by Kevin Painting and Trevor Bryant.

References

Ahern K (1993) DNA Star/LaserGene. Biotechnology Software 10:6-12
Altschul SF, Gish W, Miller W, Myers EW & Lipman DJ (1990) Basic local alignment search tool. J. Mol. Biol. 215:403-410
Anonymous (1991a) Sustainable biosphere initiative. Ecology 72:371-412
Anonymous (1991b) Systematics Agenda 2000: integrating biological diversity and societal needs. System. Zool. 40:520-523
Bairoch A (1993a) SWISS-PROT and 2D gel databases. Posted (May 1, 1993) at the [email protected]
Bairoch A (1993b) List of molecular biology e-mail servers. Version 1.50/July 30, 1993. Posted at [email protected]
Bairoch A (1993c) List of molecular biology FTP servers for databases and software. Version 1.50/July 30, 1993. Posted at [email protected]
Balows A, Trüper HG, Dworkin M, Harder W & Schleifer K-H (Eds) (1993) The Prokaryotes - A Handbook on the Biology of Bacteria: Ecophysiology, Isolation, Identification, Applications. 2nd ed. Springer-Verlag, New York
Bello (1989) Computer applications in microbiology: making sense out of data. ASM News 55:71-73
Berlyn MB & Letovsky S (1992) Genome-related datasets within the E. coli Genetic Stock Center Database. Nucl. Acids Res. 20:6143-6151
BIOLOG (1992) BIOLOG Microstation: Automated Bacteria Identification System. BIOLOG Inc., Hayward, USA
Bochner BR (1989) Sleuthing out bacterial identities. Nature (London) 339:157-158
van Bogelen RA, Sankar P, Clark RL, Bogan JA & Neidhardt FC (1992) The gene-protein database of Escherichia coli: Edition 5. Electrophoresis 13:1014-1054
Bryant TN (1991) Bacterial Identifier: a Utility for Probabilistic Identification of Bacteria. Blackwell Scientific Publ., Oxford
Brondz I, Olsen I & Sjostrom M (1990) Multivariate analysis of quantitative chemical and enzymic characterization data in classification of Actinobacillus, Haemophilus and Pasteurella spp. J. Gen. Microbiol. 136:507-513
Bryant TN (1987) Programs for evaluating and characterising bacterial taxonomic data. CABIOS 3:45-48
Bryant TN (1993) A review of probabilistic identification matrices. A compilation of probabilistic bacterial identification matrices. Binary 5:207-210
Buchanan RE & Gibbons NE (Eds) (1974) Bergey's Manual of Determinative Bacteriology. 8th ed. Williams & Wilkins, Baltimore
Bull AT, Goodfellow M & Slater JH (1992) Biodiversity as a source of innovation in biotechnology. Ann. Rev. Microbiol. 46:219-252
Burks C, Lawton J & Bell G (1988) The LiMB database. Science 241:888
Bussard A, Krichevsky MI & Blaine LD (1985) An international hybridoma data bank: aims, structure, function. In: Macario AJL & Macario EC (Eds) Monoclonal Antibodies Against Bacteria (p. 287-311). Academic Press, Orlando

Canhos VP (1990) The impact of computers on culture collections. In: Sly I, Iijima I & Kirsop BE (Eds) 100 years of culture collections (p. 20-27). Institute for Fermentation, Osaka
Canhos VP, Lange DA, Kirsop BE, Ross E & Nandi S (Eds) (1992) Needs and Specifications for a Biodiversity Information Network (265 pp.). United Nations Environment Programme, Nairobi
Cerf V (1991) Networks. Sci. Amer. 265:72-84
Cinkosky M, Fickett JW, Gilna P & Burks C (1991) Electronic publishing and GenBank. Science 252:1273-1277
Colwell RR (1970) Polyphasic taxonomy of the genus Vibrio: Numerical taxonomy of Vibrio cholerae, Vibrio parahaemolyticus and related Vibrio species. J. Bacteriol. 104:410-433
Cowan ST & Steel KJ (1974) Manual for the Identification of Medical Bacteria. Cambridge University Press, Cambridge
Davis AW, Atlas RM & Krichevsky MI (1983) Development of probability matrices for identification of Alaskan marine bacteria. Int. J. System. Bacteriol. 33:803-810
Doolittle RF (Ed) (1990) Molecular evolution: computer analysis of protein and nucleic acid sequences. Meth. Enzymol. 183:1-736
Gains W, Hennebert GL, Stalpers JA, Janssens D, Schipper MA, Smith J, Yarrow D & Hawksworth DL (1988) Structuring strain data for storage and retrieval of information on fungi and yeast in MINE, the Microbial Information Network Europe. J. Gen. Microbiol. 134:1667-1689
Garavelli JS (1993) Announcements of the Protein Information Resource. Network Request Service. [email protected] (29 April 1993)
Goodfellow M & Minnikin DE (Eds) (1985) Chemical Methods in Bacterial Systematics. Academic Press, London
Goodfellow M & O'Donnell AG (Eds) (1993) Handbook of New Bacterial Systematics. Academic Press, London
Goodfellow M, Jones D & Priest FG (Eds) (1985) Computer-assisted Bacterial Systematics. Academic Press, London
Green D (1992) Public Domain Databases for Networking Biodiversity. On-line contribution to [email protected], available through BDT gopher
Gurtler V, Wilson VA & Mayall BC (1991) Classification of medically important clostridia using restriction endonuclease site differences of PCR-amplified 16S rDNA. J. Gen. Microbiol. 137:2673-2679
Gutteridge CS, Vallis L & Macfie HJH (1985) Numerical methods in the classification of micro-organisms by pyrolysis mass spectrometry. In: Goodfellow M, Jones D & Priest F (Eds) Computer-assisted Bacterial Systematics (p. 369-401). Academic Press, London
Hall HJ (1989) Microbial product discovery in the biotech age. Bio/Technology 7:427-430
Hawksworth DL (1992) Biodiversity in microorganisms and its role in ecosystem function. In: Solbrig OT, van Emdem HM & Van Oordt PGWJ (Eds) Biodiversity and Global Change (p. 83-94). Monograph no. 8. International Union of Biological Sciences, Paris
Hawksworth DL & Colwell RR (1992a) Microbial Diversity 21: Biodiversity amongst microorganisms and its relevance. Biodiv. Conserv. 1:221-226
Hawksworth DL & Colwell RR (1992b) Biodiversity amongst microorganisms and its relevance. Biology International 24:11-15
Helm D, Labischinski H, Schallehn G & Naumann D (1991) Classification and identification of bacteria by Fourier-transform infrared spectroscopy. J. Gen. Microbiol. 137:69-79
Hinojosa-Ahumada M, Swaminathan B, Hunter SB, Cameron DN, Kiehlbauch JA, Wachsmuth IK & Strockbine NA (1991) Restriction fragment length polymorphisms in rRNA operons for subtyping Shigella sonnei. J. Clin. Microbiol. 29:2380-2384
James FC & McCulloch CE (1990) Multivariate analysis in ecology and systematics: panacea or Pandora's Box? Ann. Rev. Ecol. System. 21:129-166
Jurtshuk RJ, Blick M, Bresser J, Fox GE & Jurtshuk Jr P (1992) Rapid in situ hybridization technique using 16S rRNA segments for detecting and differentiating the closely related Gram-positive organisms Bacillus polymyxa and Bacillus macerans. Appl. Environ. Microbiol. 58:2571-2578
Kämpfer P & Kroppenstedt RM (1991) Probabilistic identification of streptomycetes using miniaturized physiological tests. J. Gen. Microbiol. 137:1893-1902
Kämpfer P, Kroppenstedt RM & Dott W (1991) A numerical classification of the genera Streptomyces and Streptoverticillium using miniaturized physiological tests. J. Gen. Microbiol. 137:1831-1891
Kellogg ST (1989) The state of computers. ASM News 55:22-25
Kersters K (1985) Numerical methods in the classification of bacteria by protein electrophoresis. In: Goodfellow M, Jones D & Priest F (Eds) Computer-assisted Bacterial Systematics (p. 337-368). Academic Press, London
Kirsop BE (1988a) Computerized databases for locating microorganisms: Functions and differences of major information resources. MIRCEN Journal 4:419-424
Kirsop BE (1988b) Microbial Strain Data Network: A service to biotechnology. Internat. Indust. Biotechnol. 8:24-27
Klinger JM, Stowe RP, Obenhuber DC, Grover TO, Mishra SK & Pierson DL (1992) Evaluation of the Biolog automated microbial identification system. Appl. Environ. Microbiol. 58:2089-2092
Kohara Y, Akiyama K & Isono K (1987) The physical map of the whole E. coli chromosome: Application of a new strategy for rapid analysis and sorting of a large genomic library. Cell 50:495-508
Krebs JR (1992) Evolution and Biodiversity - The New Taxonomy. Published by the Natural Environment Research Council
Krieg NR (Ed) (1984) Bergey's Manual of Systematic Bacteriology. Vol. 1. Williams & Wilkins, Baltimore
Krol E (1992) The whole Internet user's guide and catalog. 1st ed. O'Reilly & Associates, Inc., Sebastopol (California)
Langham CD, Sneath PHA, Williams ST & Mortimer AM (1989a) Detecting aberrant strains in bacterial groups as an aid to constructing databases for computer identification. J. Appl. Bacteriol. 66:339-352
Langham CD, Williams ST, Sneath PHA & Mortimer AM (1989b) New probability matrix for identification of Streptomyces. J. Gen. Microbiol. 135:121-133
Lawton J, Burks C & Martinez F (1989) Overview of LiMB database. Nucl. Acids Res. 17:5885-5899
Lawton J, Cinkosky M, Mishra S, Fickett J & Burks C (1992) Access to molecular biology databases. Math. Comput. Model. 16:93-101
Lennette EH, Balows A, Hausler Jr WJ & Shadomy HJ (Eds) (1985) Manual of Clinical Microbiology. American Society for Microbiology, USA
Liesack W, Ward N & Stackebrandt E (1991) Strategies for molecular microbial ecological studies. Actinomycetes 2:63-67
Lipman DJ & Pearson WR (1985) Rapid and sensitive protein similarity searches. Science 227:1435-1441
Lucchini GM & Altwegg M (1992) rRNA gene restriction patterns as taxonomic tools for the genus Aeromonas. Int. J. System. Bacteriol. 42:384-389
Magee J (1993) Whole organisms fingerprinting. In: Goodfellow M & O'Donnell AG (Eds) Handbook of New Bacterial Systematics (p. 383-419). Academic Press, London
Manfio GP (1993) GelManager for DOS, GelManager for Windows: gel fingerprinting analysis. Binary 5:114-116
Manafi M, Kneifel W & Bascomb S (1991) Fluorogenic and chromogenic substrates used in bacterial diagnosis. Microbiol. Rev. 55:335-348
Mauchline WS & Keevil CW (1991) Development of the BIOLOG substrate utilization system for identification of Legionella spp. Appl. Environ. Microbiol. 57:3345-3349
McManus C & Krichevsky MI (1992) Self-Instruction Manual for MICRO-IS
Medigue C, Bouche JP, Henaut A & Danchin A (1990) Mapping of sequenced genes (700 kbp) in the restriction map of the Escherichia coli chromosome. Mol. Microbiol. 4:169-187
MIDI (1993) Microbial Identification System. MIDI, Newark, DE, USA
Monolou JC (1992) Biodiversity at the molecular level. In: Solbrig OT, Van Emdem HM & Van Oordt PGWJ (Eds) Biodiversity and global change (p. 33-39). Monograph No. 8, International Union of Biological Sciences, Paris
Moyer NP, Martinetti G, Luthy-Hottenstein J & Altwegg M (1992) Value of rRNA gene restriction patterns of Aeromonas spp. for epidemiological investigations. Curr. Microbiol. 24:15-21
Norusis MJ (1982) SPSS introductory guide: basic statistics and operations. McGraw-Hill, Chicago
O'Brien M & Colwell RR (1987) Characterisation tests for numerical taxonomy studies. In: Colwell RR & Grigorova R (Eds) Methods in Microbiology: Current Methods for Classification and Identification of Microorganisms. Vol. 19 (p. 69-104). Academic Press, London
O'Donnell AG, Minnikin DE & Goodfellow M (1985) Integrated lipid and wall analysis of actinomycetes. In: Goodfellow M & Minnikin DE (Eds) Chemical Methods in Bacterial Systematics (p. 131-143). Academic Press, London
Olsen GJ, Larsen N & Woese CR (1991) The ribosomal RNA database project. Nucl. Acids Res. 19 (Suppl.):2017-2021
Payne RW, Lane PW, Ainsley AE, Bicknell KE, Digby PGN, Harding SA, Leech PK, Simpson HR, Todd AD, Verrier PJ, White RP, Gower JC, Wilson GT & Paterson LJ (1989) Genstat 5 Reference Manual. Oxford University Press, Oxford
Pearson WR (1990) Rapid and sensitive sequence comparison with FASTP and FASTA. Meth. Enzymol. 183:63-98
Priest FG & Alexander B (1988) A frequency matrix for probabilistic identification of some bacilli. J. Gen. Microbiol. 134:3011-3018
Horizon (1992) RAPyD-400 User Manual. Horizon Instruments Ltd., Ghyll Industrial Estate, UK
Reid B (1993) Usenet readership report for January 1993. Usenet news lists
Rogosa M, Colwell RR & Krichevsky M (1986) Coding Microbiological Data for Computers. Springer-Verlag, New York
Rossler D, Ludwig W, Schleifer KH, Lin C, McGill TJ, Wisotzkey JD, Jurtshuk Jr P & Fox GE (1991) Phylogenetic diversity in the genus Bacillus as seen by 16S rRNA sequencing studies. System. Appl. Microbiol. 14:266-269
Sackin MJ (1987) Computer programs for classification and identification. In: Colwell RR & Grigorova R (Eds) Methods in Microbiology: Current Methods for Classification and Identification of Microorganisms. Vol. 19 (p. 459-494). Academic Press, London
Sackin MJ & Jones (1993) Computer-assisted classification. In: Goodfellow M & O'Donnell AG (Eds) Handbook of New Bacterial Systematics (p. 281-313). Academic Press, London
Saddler GS, O'Donnell AG, Goodfellow M & Minnikin DE (1987) SIMCA pattern recognition in the analysis of streptomycete fatty acids. J. Gen. Microbiol. 133:1137-1147
Sanglier J-J, Whitehead D, Saddler GS, Ferguson EV & Goodfellow M (1992) Pyrolysis mass spectrometry as a method for the classification, identification and selection of actinomycetes. Gene 115:235-242
SAS (1992) SAS Introductory Guide for Personal Computers: release 6.03. SAS Corporation, USA
Sayler GS & Layton AC (1990) Environmental application of nucleic acid hybridization. Ann. Rev. Microbiol. 44:625-648
Schuler GD, Altschul SF & Lipman DJ (1991) A workbench for multiple alignment construction and analysis. Proteins 9:180-190
Sensititre (1992) Sensititre Operation Manual. Sensititre Corporation, UK
Skerman VDB, McGowan V & Sneath PHA (1980) Approved lists of bacterial names. Int. J. System. Bacteriol. 30:225-420
Smith UR (1993) A Biologist's Guide to Internet. Published monthly in the Usenet Newsgroups sci.bio, bionet.general and news.answers, and archived as file 'biology/guide' in the anonymous ftp archive on pit-manager.mit.edu. 20 pp.
Sneath PHA (1974) Test reproducibility in relation to identification. Int. J. System. Bacteriol. 24:508-523
Sneath PHA (1977) A method for testing the distinctness of clusters: a test of the disjunction of two clusters in Euclidean space as measured by their overlap. J. Math. Geol. 9:123-143
Sneath PHA (1979a) BASIC program for significance test for clusters in UPGMA dendrograms obtained from squared Euclidean distances. Comp. Geosci. 5:127-137

229 Sneath PHA (1979b) BASIC program for a significance test for two clusters in Euclidean space as measured by their overlap. Comp. Geosci. 5:143-155 Sneath PHA (1979c) BASIC program for nonparametric significance of overlap between a pair of clusters using the Kolmogorov-Smirnov test. Comp. Geosci. 5:173-188 Sneath PHA (1979d) BASIC program for identification of an unknown with presence-absence data against an identification matrix of percent positive characters. Comp. Geosci. 5:195-213 Sneath PHA (1979e) BASIC program for character separation indices from an identification matrix of percent positive characters. Comp. Geosci. 5:349-357 Sneath PHA (1980a) BASIC program for the most diagnostic properties of groups from an identification matrix of percent positive characters. Comp. Geosci. 6:21-26 Sneath PHA (1980b) BASIC program for determining the best identification scores possible from the most typical examples when compared with an identification matrix of percent positive characters. Comp. Geosci. 6:27-34 Sneath PHA (1980c) BASIC program for determining overlap between groups in an identification matrix of percent positive characters. Comp. Geosci. 6:267-278 Sneath PHA (Ed) (1986) Bergey's Manual of Systematic Bacteriology. Vol. 2. Williams & Wilkins, Baltimore Sneath PHA (1989) Analysis and interpretation of sequence data for bacterial systematics: the view of a numerical taxonomist. System. Appl. Microbiol. 12:15-31 Sneath PHA (1992) International Code of Nomenclature of Bacteria (1990 revision). American Society For Microbiology, Washington. Sneath PHA & Johnson R (1972) The influence on numerical taxonomic similarities of error in microbiological tests. J. Gen. Microbiol. 72:377-392 Sneath PHA & Langham CD (1989) OUTLIER: a BASIC program for detecting outlying members of multivariate clusters based on presence-absence data. Comp. Geosci. 
15:939-964 Sneath PHA & Sackin MJ (1979) BASIC program for printing a coding sheet for unknowns that are to be identified against an identification matrix of percent positive characters. Comp. Geosci. 5:359-367 Sneath PHA & Sokal RR (1973) Numerical Taxonomy. Freeman, San Francisco Stackebrandt E & Goodfellow M (Eds) (1991) Nucleic Acid Techniques in Bacterial Systematics. John Wiley & Sons, Chichester Stackebrandt E, Wunner-Ffissl B, Fowler V J, Schleifer, K-H (1981) Deoxyribonucleic acid homologies and ribosomal ribonucleic acid similarities among sporeforming members of the order Actinomycetales. Int. J. System. Bacteriol. 31:420-431 Stackebrandt E, Witt D, Kemmerling C, Kroppenstedt R & Lie-

sack W (1991) Designation of streptomycete 16S and 23S rRNA-based target regions for oligonucleotide probes. Appl. Environ. Microbiol. 57:1468-1477 StShl M, Molin A, Ahrn6 S & St~hl S (1990) Restriction endonuclease patterns and multivariate analysis as a classification tools for Lactobacillus spp. Int. J. System. Bacteriol. 40:189-193 Staley JT, Bryant MR Pfleming N & Holt JG (Eds) (1989) Bergey's Manual of Systematic Bacteriology. Vol. 3. Williams & Wilkins, Baltimore Stalpers JA, Kracht M, Janssens D, DeLey J, Van der Toorn J, Smith J, Claus D & Hippe D (1990) Structuring strain data for storage and retrieval of information on bacteria in MINE, the Microbial Information Network Europe. System. Appl. Microbiol. 13:92-103 Walczak CA & Krichevsky MI (1982) Computer-aided selection of efficient identification features and calculation of group descriptors as exemplified by data on Capnocytophaga species. Curr. Microbiol. 7:199-204 Walczak CA, Blaine L & Krichevsky MI, (1988) The CODATA/ IUIS Hybridoma Data Bank: development of a hybrid system to handle complex data relationships. Comp. Meth. Progr. Biomed. 88:275-285 Wayne LG, Krichevsky E J, Love LL, Johnson R & Krichevsky MI (1980) Taxonomic probability matrix for use with slowly growing mycobacteria. Int. J. System. Bacteriol. 30:528-538 Willcox WR, Lapage SP & Holmes B (1980) A review of numerical methods in bacterial identification. Antonie van Leeuwenhoek 46:233-299 Williams DM (1992) DNA analysis: Theory. In: Forey PL, Humphries CH, Kitching IJ, Scotland, RW, Siebert DJ & Williams DM (Eds) Cladistics: A Practical Course in Systematics p. 89101. Oxford University Press, Oxford Williams ST, Sharpe ME & Holt JG (Eds) (1989) Bergey's Manual of Systematic Bacteriology. Vol. 4. Williams & Wilkins, Baltimore Williams ST, Locci R, Vickers, Schofield GM, Sneath PHA & Mortimer AM (1985) Probabilistic identification of Streptoverticillium species J. Gen. Microbiol. 
131:1681-1689 Winker S & Woese CR (1991) A definition of the domains Archaea, Bacteria and Eucarya in terms of small ribosomal RNA characteristics. System. Appl. Microbiol. 14:305-310 Wishart D (1987) Clustan User Manual 4th edition. Computing Laboratory University of St. Andrews, Scotland, U K Woese CR, Kandler O & Wheelis ML (1990) Towards a natural system of organisms: proposal for the domains Archaea, Bacteria and Eucarya. Proc. Nat. Acad. Sci. USA 87:4576-4579 Woods TC, Helsel LO & Swaminathan B (1992) Characterization of Neisseria meningitidis serogroup C by multilocus enzyme electrophoresis and ribosomal DNA restriction profiles (ribotyping). J. Clin. Microbiol. 30:132-137
