Handbook of Data Intensive Computing
Borko Furht • Armando Escalante Editors
Editors
Borko Furht
Department of Computer and Electrical Engineering and Computer Science
Florida Atlantic University
Boca Raton, Florida, USA
Armando Escalante
LexisNexis
Boca Raton, Florida, USA
ISBN 978-1-4614-1414-8
e-ISBN 978-1-4614-1415-5
DOI 10.1007/978-1-4614-1415-5
Springer New York Dordrecht Heidelberg London
Library of Congress Control Number: 2011941878
© Springer Science+Business Media, LLC 2011
All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden.
The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)
Preface
This handbook is a carefully edited book; its contributors are worldwide experts in the field of data intensive computing and its applications. The scope of the book includes leading-edge data intensive computing architectures and systems; innovative storage, virtualization, and parallel processing technologies applied in data intensive computing; and a variety of data intensive applications.
Data intensive computing refers to capturing, managing, analyzing, and understanding data at volumes and rates that push the frontiers of current technologies. The challenge of data intensive computing is to provide the hardware architectures and related software systems and techniques that are capable of transforming ultra-large data into valuable knowledge. Data intensive computing demands a fundamentally different set of principles than mainstream computing. Data-intensive applications are typically well suited for large-scale parallelism over the data and also require an extremely high degree of fault tolerance, reliability, and availability. In addition, most data intensive applications require real-time or near real-time response.
The objective of the project is to introduce the basic concepts of data intensive computing, the technologies and the hardware and software techniques applied in data intensive computing, and current and future applications.
The handbook comprises four parts, which consist of 30 chapters. The first part, on Architectures and Systems, includes chapters dealing with network architectures for data intensive computing, data intensive software systems, and high-level programming languages and storage systems for data-intensive computing. The second part, on Technologies and Techniques, covers load balancing techniques, linking technologies, virtualization techniques, feature ranking methods, and other techniques applied in data intensive computing.
The third part, on Security, covers various aspects of privacy and security requirements and related techniques applied in data intensive computing. The fourth part, on Applications, describes various data intensive applications, from earthquake simulations and geosciences to biological systems, social information systems, and bioinformatics.
With the dramatic growth of data intensive computing and systems and their applications, this handbook can be the definitive resource for persons working in this field as researchers, scientists, programmers, engineers, and users. The book is intended for a wide variety of people, including academicians, designers, developers, educators, engineers, practitioners, researchers, and graduate students. This book can also be beneficial for business managers, entrepreneurs, and investors. The book has great potential to be adopted as a textbook in current and new courses on Data Intensive Computing.
The main features of this handbook can be summarized as follows:
1. The handbook describes and evaluates the current state of the art in the new field of data intensive computing.
2. It also presents current systems, services, and main players in this explosive field.
3. Contributors to the handbook are the leading researchers from academia and practitioners from industry.
We would like to thank the authors for their contributions. Without their expertise and effort, this handbook would never have come to fruition. Springer editors and staff also deserve our sincere recognition for their support throughout the project.
Borko Furht and Armando Escalante
Editors-in-Chief
Boca Raton, Florida
About the Editors-in-Chief
Borko Furht is a professor and chairman of the Department of Electrical & Computer Engineering and Computer Science at Florida Atlantic University (FAU) in Boca Raton, Florida. He is also director of the recently formed NSF-sponsored Industry/University Cooperative Research Center on Advanced Knowledge Enablement. Before joining FAU, he was a vice president of research and a senior director of development at Modcomp (Ft. Lauderdale), a computer company of Daimler Benz, Germany; a professor at the University of Miami in Coral Gables, Florida; and a senior researcher in the Institute Boris Kidric-Vinca, Yugoslavia. Professor Furht received a Ph.D. degree in electrical and computer engineering from the University of Belgrade. His current research is in multimedia systems, video coding and compression, 3D video and image systems, wireless multimedia, and Internet and cloud computing. He is presently Principal Investigator and Co-PI of several multiyear, multimillion-dollar projects, including the NSF PIRE project and the NSF High-Performance Computing Center. He is the author of numerous books and articles in the areas of multimedia, computer architecture, real-time computing, and operating systems. He is a founder and editor-in-chief of the Journal of Multimedia Tools and Applications (Springer). He has received several technical and publishing awards and has consulted for many high-tech companies, including IBM, Hewlett-Packard, Xerox, General Electric, JPL, NASA, Honeywell, and RCA. He has also served as a consultant to various colleges and universities. He has given many invited talks, keynote lectures, seminars, and tutorials. He has served on the Board of Directors of several high-tech companies.
Armando J. Escalante is Senior Vice President and Chief Technology Officer of Risk Solutions for the LexisNexis Group, a division of Reed Elsevier. In this position, Escalante is responsible for technology development, information systems, and operations. Previously, Escalante was Chief Operating Officer for Seisint, a privately owned company that LexisNexis purchased in 2004. In that position, he was responsible for technology, development, and operations. Prior to 2001, Escalante served as Vice President of Engineering and Operations for Diveo Broadband Networks, where he led world-class data centers located in the U.S. and Latin America. Before Diveo Broadband Networks, Escalante was VP for one of the fastest growing divisions of Vignette Corporation, an eBusiness software leader. Escalante earned his bachelor’s degree in electronic engineering at the USB in Caracas, Venezuela, a master’s degree in computer science from Stevens Institute of Technology, and a master’s in business administration from West Coast University.
Contents
Part I Architectures and Systems
1 High Performance Network Architectures for Data Intensive Computing
  Geng Lin and Eileen Liu
2 Architecting Data-Intensive Software Systems
  Chris A. Mattmann, Daniel J. Crichton, Andrew F. Hart, Cameron Goodale, J. Steven Hughes, Sean Kelly, Luca Cinquini, Thomas H. Painter, Joseph Lazio, Duane Waliser, Nenad Medvidovic, Jinwon Kim, and Peter Lean
3 ECL/HPCC: A Unified Approach to Big Data
  Anthony M. Middleton, David Alan Bayliss, and Gavin Halliday
4 Scalable Storage for Data-Intensive Computing
  Abhishek Verma, Shivaram Venkataraman, Matthew Caesar, and Roy H. Campbell
5 Computation and Storage Trade-Off for Cost-Effectively Storing Scientific Datasets in the Cloud
  Dong Yuan, Yun Yang, Xiao Liu, and Jinjun Chen
Part II Technologies and Techniques
6 A Survey of Load Balancing Techniques for Data Intensive Computing
  Zhiquan Sui and Shrideep Pallickara
7 Resource Management for Data Intensive Clouds Through Dynamic Federation: A Game Theoretic Approach
  Mohammad Mehedi Hassan and Eui-Nam Huh
8 Salt: Scalable Automated Linking Technology for Data-Intensive Computing
  Anthony M. Middleton and David Alan Bayliss
9 Parallel Processing, Multiprocessors and Virtualization in Data-Intensive Computing
  Jonathan Burger, Richard Chapman, and Flavio Villanustre
10 Challenges in Data Intensive Analysis at Scientific Experimental User Facilities
  Kerstin Kleese van Dam, Dongsheng Li, Stephen D. Miller, John W. Cobb, Mark L. Green, and Catherine L. Ruby
11 Large-Scale Data Analytics Using Ensemble Clustering
  Martin Hahmann, Dirk Habich, and Wolfgang Lehner
12 Specification of Data Intensive Applications with Data Dependency and Abstract Clocks
  Abdoulaye Gamatié
13 Ensemble Feature Ranking Methods for Data Intensive Computing Applications
  Wilker Altidor, Taghi M. Khoshgoftaar, Jason Van Hulse, and Amri Napolitano
14 Record Linkage Methodology and Applications
  Ling Qin Zhang
15 Semantic Wrapper: Concise Semantic Querying of Legacy Relational Databases
  Naphtali Rishe, Borko Furht, Malek Adjouadi, Armando Barreto, Debra Davis, Ouri Wolfson, Yelena Yesha, and Yaacov Yesha
Part III Security
16 Security in Data Intensive Computing Systems
  Eduardo B. Fernandez
17 Data Security and Privacy in Data-Intensive Computing Clusters
  Flavio Villanustre and Jarvis Robinson
18 Information Security in Large Scale Distributed Systems
  Salvatore Distefano and Antonio Puliafito
19 Privacy and Security Requirements of Data Intensive Computing in Clouds
  Arash Nourian and Muthucumaru Maheswaran
Part IV Applications
20 On the Processing of Extreme Scale Datasets in the Geosciences
  Sangmi Lee Pallickara, Matthew Malensek, and Shrideep Pallickara
21 Parallel Earthquake Simulations on Large-Scale Multicore Supercomputers
  Xingfu Wu, Benchun Duan, and Valerie Taylor
22 Data Intensive Computing: A Biomedical Case Study in Gene Selection and Filtering
  Michael Slavik, Xingquan Zhu, Imad Mahgoub, Taghi Khoshgoftaar, and Ramaswamy Narayanan
23 Design Space Exploration for Efficient Data Intensive Computing on SoCs
  Rosilde Corvino, Abdoulaye Gamatié, and Pierre Boulet
24 Information Quality and Relevance in Large-Scale Social Information Systems
  Munmun De Choudhury
25 Geospatial Data Management with Terrafly
  Naphtali Rishe, Borko Furht, Malek Adjouadi, Armando Barreto, Evgenia Cheremisina, Debra Davis, Ouri Wolfson, Nabil Adam, Yelena Yesha, and Yaacov Yesha
26 An Application for Processing Large and Non-Uniform Media Objects on MapReduce-Based Clusters
  Rainer Schmidt and Matthias Rella
27 Feature Selection Algorithms for Mining High Dimensional DNA Microarray Data
  David J. Dittman, Taghi M. Khoshgoftaar, Randall Wald, and Jason Van Hulse
28 Application of Random Matrix Theory to Analyze Biological Data
  Feng Luo, Pradip K. Srimani, and Jizhong Zhou
29 Keyword Search on Large-Scale Structured, Semi-Structured, and Unstructured Data
  Bin Zhou
30 A Distributed Publish/Subscribe System for Large Scale Sensor Networks
  Masato Yamanouchi, Ryota Miyagi, Satoshi Matsuura, Satoru Noguchi, Kazutoshi Fujikawa, and Hideki Sunahara
Index
Contributors
Nabil Adam, U.S. Department of Homeland Security (DHS.gov), Washington DC, USA
Malek Adjouadi, NSF Industry-University Cooperative Research Center for Advanced Knowledge Enablement (CAKE.fiu.edu) at Florida International University, Miami, Florida, USA
Wilker Altidor, FAU, Boca Raton, FL, USA
Armando Barreto, NSF Industry-University Cooperative Research Center for Advanced Knowledge Enablement (CAKE.fiu.edu) at Florida International University, Miami, Florida, USA
David Alan Bayliss, LexisNexis, Boca Raton, FL, USA
Pierre Boulet, LIFL/CNRS and Inria, Parc Scientifique de la Haute Borne, Villeneuve d’Ascq, France
Jonathan Burger, LexisNexis Risk Solutions, LexisNexis, Alpharetta, Georgia, USA
Matthew Caesar, Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, USA
Roy H. Campbell, Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, USA
Richard Chapman, LexisNexis Risk Solutions, LexisNexis, Alpharetta, Georgia, USA
Jinjun Chen, Faculty of Engineering and Information Technology, University of Technology, Sydney, NSW, Australia
Evgenia Cheremisina, NSF Industry-University Cooperative Research Center for Advanced Knowledge Enablement (CAKE.fiu.edu) at Florida International, Florida Atlantic and Dubna University, Moscow, Russia
Munmun De Choudhury, Rutgers University, New Brunswick, NJ, USA
Luca Cinquini, Instrument and Science Data Systems, NASA Jet Propulsion Laboratory, California Institute of Technology, Pasadena, CA, USA
John W. Cobb, Data Systems Group, Neutron Scattering Science Division, Oak Ridge National Laboratory, Oak Ridge, TN, USA
Rosilde Corvino, University of Technology Eindhoven, Eindhoven, AZ, The Netherlands
Daniel J. Crichton, Instrument and Science Data Systems, NASA Jet Propulsion Laboratory, California Institute of Technology, Pasadena, CA, USA
Kerstin Kleese van Dam, Fundamental and Computational Science Department, Pacific Northwest National Laboratory, Richland, WA, USA
Debra Davis, NSF Industry-University Cooperative Research Center for Advanced Knowledge Enablement (CAKE.fiu.edu) at Florida International University, Miami, Florida, USA
Salvatore Distefano, Dipartimento di Matematica, Università di Messina, Contrada Papardo, S. Sperone, Messina, Italy
David J. Dittman, FAU, Boca Raton, FL, USA
Benchun Duan, Department of Geology & Geophysics, Texas A&M University, College Station, TX, USA
Eduardo B. Fernandez, Department of Computer & Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL, USA
Kazutoshi Fujikawa, Graduate School of Information Science, Nara Institute of Science and Technology, 8916-5, Takayama-cho, Ikoma-shi, Nara, Japan
Borko Furht, NSF Industry-University Cooperative Research Center for Advanced Knowledge Enablement (CAKE.fiu.edu) at Florida International, Florida Atlantic and Dubna Universities, Boca Raton, Florida, USA
Abdoulaye Gamatié, LIFL/CNRS and Inria, Parc Scientifique de la Haute Borne, Villeneuve d’Ascq, France
Cameron Goodale, Instrument and Science Data Systems, NASA Jet Propulsion Laboratory, California Institute of Technology, Pasadena, CA, USA
Mark L. Green, Systems Integration Group, Tech-X Corporation, Williamsville, NY, USA
Dirk Habich, Dresden University of Technology, Database Technology Group, Dresden, Germany
Martin Hahmann, Dresden University of Technology, Database Technology Group, Dresden, Germany
Gavin Halliday, LexisNexis, Boca Raton, FL, USA
Andrew F. Hart, Instrument and Science Data Systems, NASA Jet Propulsion Laboratory, California Institute of Technology, Pasadena, CA, USA
Mohammad Mehedi Hassan, Department of Computer Engineering, Kyung Hee University, South Korea
J. Steven Hughes, Instrument and Science Data Systems, NASA Jet Propulsion Laboratory, California Institute of Technology, Pasadena, CA, USA
Eui-Nam Huh, Department of Computer Engineering, Kyung Hee University, South Korea
Jason Van Hulse, FAU, Boca Raton, FL, USA
Sean Kelly, Instrument and Science Data Systems, NASA Jet Propulsion Laboratory, California Institute of Technology, Pasadena, CA, USA
Taghi M. Khoshgoftaar, Department of Computer & Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL, USA
Jinwon Kim, Joint Institute for Regional Earth System Science and Engineering (JIFRESSE), University of California, Los Angeles, Los Angeles, CA, USA
Joseph Lazio, Instrument and Science Data Systems, NASA Jet Propulsion Laboratory, California Institute of Technology, Pasadena, CA, USA
Peter Lean, Department of Meteorology, University of Reading, Reading, UK
Wolfgang Lehner, Dresden University of Technology, Database Technology Group
Dongsheng Li, Fundamental and Computational Science Department, Pacific Northwest National Laboratory, Richland, WA, USA
Geng Lin, Dell, IBM Alliance Cisco Systems
Eileen Liu, Nominum, Inc., Wyse Technology, San Jose, California, USA
Xiao Liu, Faculty of Information and Communication Technologies, Swinburne University of Technology, Melbourne, Australia
Feng Luo, School of Computing, Clemson University, Clemson, SC, USA
Muthucumaru Maheswaran, McGill University, Montreal, Canada
Imad Mahgoub, Department of Computer & Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL, USA
Matthew Malensek, Department of Computer Science, Colorado State University, Fort Collins, CO, USA
Satoshi Matsuura, Graduate School of Information Science, Nara Institute of Science and Technology, 8916-5, Takayama-cho, Ikoma-shi, Nara, Japan
Chris A. Mattmann, Instrument and Science Data Systems, NASA Jet Propulsion Laboratory, California Institute of Technology, Pasadena, CA, USA
Nenad Medvidovic, Computer Science Department, Viterbi School of Engineering, University of Southern California, Los Angeles, CA, USA
Anthony M. Middleton, LexisNexis, Boca Raton, FL, USA
Stephen D. Miller, Data Systems Group, Neutron Scattering Science Division, Oak Ridge National Laboratory, Oak Ridge, TN, USA
Ryota Miyagi, Graduate School of Information Science, Nara Institute of Science and Technology, 8916-5, Takayama-cho, Ikoma-shi, Nara, Japan
Amri Napolitano, FAU, Boca Raton, FL, USA
Ramaswamy Narayanan, Charles E. Schmidt College of Science, Florida Atlantic University, Boca Raton, FL, USA
Satoru Noguchi, Graduate School of Information Science, Nara Institute of Science and Technology, 8916-5, Takayama-cho, Ikoma-shi, Nara, Japan
Arash Nourian, McGill University, Montreal, Canada
Thomas H. Painter, Instrument and Science Data Systems, NASA Jet Propulsion Laboratory, California Institute of Technology, Pasadena, CA, USA
Sangmi Lee Pallickara, Department of Computer Science, Colorado State University, Fort Collins, CO, USA
Shrideep Pallickara, Department of Computer Science, Colorado State University, Fort Collins, CO, USA
Makan Pourzandi, Ericsson, Mississauga, Canada
Antonio Puliafito, Dipartimento di Matematica, Università di Messina, Contrada Papardo, S. Sperone, Messina, Italy
Matthias Rella, Austrian Institute of Technology, Donau-City-Strasse 1, Vienna, Austria
Naphtali Rishe, NSF Industry-University Cooperative Research Center for Advanced Knowledge Enablement (CAKE.fiu.edu) at Florida International University, Miami, Florida, USA
Jarvis Robinson, LexisNexis, Alpharetta, GA, USA
Catherine L. Ruby, Systems Integration Group, Tech-X Corporation, Williamsville, NY, USA
Rainer Schmidt, Austrian Institute of Technology, Donau-City-Strasse 1, Vienna, Austria
Michael Slavik, Department of Computer and Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL, USA
Pradip K. Srimani, School of Computing, Clemson University, Clemson, SC, USA
Zhiquan Sui, Department of Computer Science, Colorado State University, Fort Collins, CO, USA
Hideki Sunahara, Graduate School of Media Design, Keio University, Kouhoku-ku, Yokohama, Kanagawa, Japan
Valerie Taylor, Department of Computer Science and Engineering, Texas A&M University, College Station, TX, USA
Shivaram Venkataraman, Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, USA
Abhishek Verma, Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, USA
Flavio Villanustre, LexisNexis Risk Solutions, LexisNexis, Alpharetta, Georgia, USA
Randall Wald, FAU, Boca Raton, FL, USA
Duane Waliser, Instrument and Science Data Systems, NASA Jet Propulsion Laboratory, California Institute of Technology, Pasadena, CA, USA
Ouri Wolfson, Computational Transportation Science Program (CTS.cs.uic.edu), University of Illinois at Chicago, USA
Xingfu Wu, Department of Computer Science & Engineering, Institute for Applied Mathematics and Computational Science, Texas A&M University, College Station, TX, USA
Masato Yamanouchi, Graduate School of Media Design, Keio University, Kouhoku-ku, Yokohama, Kanagawa, Japan
Yun Yang, Faculty of Information and Communication Technologies, Swinburne University of Technology, Melbourne, Australia
Yaacov Yesha, NSF Industry-University Cooperative Research Center for Multicore Productivity Research (CHMPR.umbc.edu) at the University of Maryland Baltimore County, Baltimore, Maryland, USA
Yelena Yesha, NSF Industry-University Cooperative Research Center for Multicore Productivity Research (CHMPR.umbc.edu) at the University of Maryland Baltimore County, Baltimore, Maryland, USA
Dong Yuan, Faculty of Information and Communication Technologies, Swinburne University of Technology, Melbourne, Australia
Ling Qin Zhang, LexisNexis Risk Solutions, Boca Raton, FL, USA
Bin Zhou, Department of Information Systems, University of Maryland, Baltimore County (UMBC), Baltimore, USA
Jizhong Zhou, Institute for Environmental Genomics, University of Oklahoma, Norman, OK, USA
Xingquan Zhu, Department of Computer and Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL, USA; Centre for Quantum Computation and Intelligent Systems, University of Technology, Sydney, NSW, Australia