
Research and Applications in Global Supercomputing - CiteSeerX

Research and Applications in Global Supercomputing

Richard S. Segall, Arkansas State University, USA
Jeffrey S. Cook, Independent Researcher, USA
Qingyu Zhang, Shenzhen University, China

A volume in the Advances in Systems Analysis, Software Engineering, and High Performance Computing (ASASEHPC) Book Series

Managing Director: Lindsay Johnston
Managing Editor: Austin DeMarco
Director of Intellectual Property & Contracts: Jan Travers
Acquisitions Editor: Kayla Wolfe
Production Editor: Christina Henning
Typesetter: Mike Brehm
Cover Design: Jason Mull

Published in the United States of America by
Information Science Reference (an imprint of IGI Global)
701 E. Chocolate Avenue
Hershey PA, USA 17033
Tel: 717-533-8845
Fax: 717-533-8661
E-mail: [email protected]
Web site: http://www.igi-global.com

Copyright © 2015 by IGI Global. All rights reserved. No part of this publication may be reproduced, stored or distributed in any form or by any means, electronic or mechanical, including photocopying, without written permission from the publisher. Product or company names used in this set are for identification purposes only. Inclusion of the names of the products or companies does not indicate a claim of ownership by IGI Global of the trademark or registered trademark.

Library of Congress Cataloging-in-Publication Data

Research and applications in global supercomputing / Richard S. Segall, Jeffrey S. Cook, and Qingyu Zhang, editors.
pages cm
Includes bibliographical references and index.
Summary: “This book investigates current and emerging research in the field, as well as the application of this technology to a variety of areas by highlighting a broad range of concepts”-- Provided by publisher.
ISBN 978-1-4666-7461-5 (hardcover) -- ISBN 978-1-4666-7462-2 (ebook)
1. High performance computing 2. Supercomputers. I. Segall, Richard, 1949- II. Cook, Jeffrey S., 1966- III. Zhang, Qingyu, 1970-
QA76.88.R48 2015
004.1’1--dc23
2014045462

This book is published in the IGI Global book series Advances in Systems Analysis, Software Engineering, and High Performance Computing (ASASEHPC) (ISSN: 2327-3453; eISSN: 2327-3461)

British Cataloguing in Publication Data A Cataloguing in Publication record for this book is available from the British Library. All work contributed to this book is new, previously-unpublished material. The views expressed in this book are those of the authors, but not necessarily of the publisher. For electronic access to this publication, please contact: [email protected].


Chapter 6

Applications of Supercomputers in Sequence Analysis and Genome Annotation

Gerard G. Dumancas
Oklahoma Medical Research Foundation, USA

ABSTRACT
In the modern era of science, bioinformatics plays a critical role in unraveling the potential genetic causes of various diseases. Two of the most important areas of bioinformatics today, sequence analysis and genome annotation, are essential to identifying the genes responsible for different diseases. These two emerging areas rely on highly intensive mathematical calculations. Supercomputers carry out such calculations efficiently and quickly, generating high-throughput images. This chapter therefore discusses the applications of supercomputers in sequence analysis and genome annotation, and showcases sophisticated software and algorithms used in these two areas of bioinformatics.

INTRODUCTION
Bioinformatics is often regarded as a discipline in its infancy. However, this interdisciplinary field had its historical start in the 1960s, when computers emerged as a vital tool in molecular biology. Through the notable efforts of Margaret O. Dayhoff, Walter M. Fitch, and Russell F. Doolittle, among others, the field emerged as an approach to managing and interpreting the massive data generated by genomic research. Bioinformatics today represents a convergence of various fields, involving the modeling of biological phenomena, genomics, biotechnology and information technology, the analysis and interpretation of data, and the development of novel algorithms for analyzing biological datasets. With the emergence of these large biological datasets, scientists are often confronted with the problem of analyzing and interpreting massive amounts of information in less time, with high accuracy, and at lower cost. In the last few decades, this has been made possible by the emergence of supercomputers. The wide array of available supercomputers has made it possible to analyze and interpret biological datasets and systems in a more convenient manner. Today, supercomputers make groundbreaking bioinformatics research possible. A good example is the discovery of novel genes associated with different diseases. With the discovery of these genes, scientists have arrived at a deeper understanding of the etiology of various unexplained diseases with genetic causes. Consequently, various drugs and treatments have been discovered to counteract such diseases.

Within bioinformatics, sequence analysis and genome annotation are two of the most important emerging branches. In recent years, supercomputers have played very important roles in the successes of these branches. The objective of this chapter is to give readers a clear understanding of the specific applications of supercomputers in these two areas. Although supercomputers play critical roles in such areas, readers are often unaware of the potential applications that may arise from them. A shared understanding that covers both fundamental and experimental methodologies will enhance the development and progress of such areas. Thus, the major motivation of this chapter is to provide that understanding by discussing and analyzing the fundamentals of several examples centered on the various applications of supercomputers in sequence analysis and genome annotation. Because the content of this chapter may be technical for some readers, we encourage them to review some basic concepts of genetics and biochemistry and to consult the definitions of terms to better understand this chapter.

DOI: 10.4018/978-1-4666-7461-5.ch006

Copyright © 2015, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited.

BACKGROUND
Genotype analysis involves studying genotype frequencies and the association between genotype and phenotype. Genetic association studies aim primarily to identify genetic variants that explain differences in phenotypes among individuals in a study population. Once an association is found between one or more genes and the phenotype, scientists can investigate the mechanism of action and disease etiology in individuals and consequently characterize their relevance and importance in the general population. The long-term goal of these studies is to identify better treatment and prevention strategies. Association analyses, like most genetic analyses, usually require highly intensive mathematical calculations, and supercomputers play a critical role in the success of such calculations. Prior to genetic association analysis, genotype information needs to undergo two critical steps: sequence analysis and genome annotation. Supercomputers have a wide array of applications in genotype analysis, which are discussed for these two areas below.
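As a rough illustration of the kind of calculation a genetic association study involves, the sketch below runs a Pearson chi-square test on a 2x2 table of allele counts in cases and controls. The function name `allelic_chi_square`, the example counts, and the choice of this particular test are assumptions made for illustration; the chapter does not prescribe a specific statistic.

```python
import math


def allelic_chi_square(case_counts, control_counts):
    """Pearson chi-square test (1 degree of freedom) on a 2x2 allele-count table.

    case_counts and control_counts are (minor_allele_count, major_allele_count)
    pairs. Returns (chi_square_statistic, p_value).
    Hypothetical helper written for this example only.
    """
    observed = [list(case_counts), list(control_counts)]
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Count expected in each cell if allele and disease status were independent.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected

    # For 1 degree of freedom, the chi-square survival function is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Made-up counts: 60 minor / 40 major alleles in cases, 40 / 60 in controls.
chi2, p = allelic_chi_square((60, 40), (40, 60))
print(f"chi-square = {chi2:.2f}, p = {p:.4f}")  # chi-square = 8.00, p = 0.0047
```

A single test like this is cheap. The computational burden usually comes from repeating such tests across very large numbers of variants and samples, which is one plausible reason the calculations are described as highly intensive.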

SUPERCOMPUTERS IN SEQUENCE ANALYSIS
Sequence analysis is the most commonly performed task in bioinformatics. It was one of the first bioinformatics techniques, established around 1970 (Webb-Roberts, 2004). DNA sequencing is any process used to map out the sequence of the nucleotides that make up a strand of DNA. Since the discovery of the double-helix shape of DNA in 1953, and the recognition that it is composed of a series of ladder-like units known as nucleotides, the primary goal has been to find out how the sequence of those nucleotides leads to the physical characteristics of an organism, such as hair color, skin color, and every other detail from bone marrow to the tip of a hair. Thus, DNA sequencing is a way for scientists to unravel genetics, the study of how we are put together and how we transfer our traits to our offspring. DNA sequencing first became possible in 1970 with the discovery of restriction enzymes and DNA polymerases. A breakthrough in the rate of sequencing came in 1977, when the dideoxy chain termination (Sanger, Nicklen, & Coulson, 1977) and chemical degradation (Maxam & Gilbert, 1977) techniques were introduced. The former method was used to sequence the 16.5 kb human mitochondrial genome (Anderson, 1981), and the latter was used for the analysis of the 40 kb bacteriophage T7 (Dunn & Studier, 1983). These methods provide the theoretical and practical background for our modern sequencing technologies (Chen, 1994).

GenBank, an NIH genetic sequence database, is an annotated collection of all publicly available DNA sequences. It contained only 15 million nucleotides in 1987 and nearly doubled in size in each of the subsequent five years. GenBank had reached over 120 million nucleotides in 1992, with progressively more data obtained using automated DNA sequencers (Chen, 1994). As of April 2011, GenBank held approximately 126,551,501,141 bases in 135,440,924 sequence records in the traditional GenBank divisions and 191,401,393,188 bases in 62,715,288 sequence records in the whole genome shotgun (WGS) division (Information, 2011).

With the large array of DNA sequences produced by various DNA sequencing technologies, it is necessary to perform sequence alignment or multiple sequence alignment, which is a way of arranging DNA sequences to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences (Mount, 2004). A wide variety of sequence alignment software is available to assist scientists in this process. Over the years, numerous computational tools have contributed to the success of genetic research, specifically in comparing sequences (Table 1).

In a typical sequence analysis procedure, the genomic finishing process is carried out after auto-assembly and before genomic annotation. Finishing is the process of turning a rough draft assembly composed of shotgun sequencing reads into a highly accurate finished DNA sequence with a defined maximum allowed error rate. The international publicly funded sequencing community established a standard for considering a sequence finished: it should be completely contiguous, with no gaps in the sequence, and it should have a final estimated error rate of
