27 Processing Remote-Sensing Data in Cloud Computing Environments

Ramanathan Sugumaran, The University of Iowa and John Deere
James W. Hegeman, The University of Iowa
Vivek B. Sardeshmukh, The University of Iowa
Marc P. Armstrong, The University of Iowa

Acronym and Definition
27.1 Introduction
	Remote Sensing and Big Data • Big-Data Processing Challenges
27.2 Introduction to Cloud Computing
	Definitions • Cloud Paradigms • Cloud Service Models • Advantages and Limitations of Cloud Computing
27.3 Cloud-Computing-Based Remote-Sensing-Related Applications
	A Case Study: Cloud-Based LiDAR Processing System
27.4 Conclusions
Acknowledgments
References
Acronym and Definition

AWS	Amazon Web Services
CLiPS	Cloud-based LiDAR processing system
CPU	Central processing unit
EC2	Elastic Compute Cloud
EOSDIS	Earth Observing System Data and Information System
GIS	Geographic information system
GPGPU	General-purpose computing on graphics processing units
GPU	Graphics processing unit
HPC	High-performance computing
HTC	High-throughput computing
IaaS	Infrastructure as a service
MODIS	Moderate-Resolution Imaging Spectroradiometer
OCC	Open Cloud Consortium
OGC	Open Geospatial Consortium
PaaS	Platform as a service
SaaS	Software as a service
TIN	Triangulated irregular network
UAV	Unpiloted aerial vehicle
USGS	U.S. Geological Survey
VPN	Virtual private network
WMS	Web Map Service

27.1 Introduction

27.1.1 Remote Sensing and Big Data
During the past four decades, scientific communities around the world have regularly accumulated massive collections of remotely sensed data from ground, aerial, and satellite platforms. In the United States, these collections include the U.S. Geological Survey’s (USGS) 37-year record of Landsat satellite images (comprising petabytes of data) (USGS, 2011); the NASA Earth Observing System Data and Information System (EOSDIS), with multiple data centers and more than 7.5 petabytes of archived imagery (Hyspeed Computing, 2013); and the current NASA systems that record approximately 5 TB of remote-sensing-related data per day (Vatsavai et al., 2012). In addition, new data-capture technologies such as LiDAR are used routinely to produce multiple petabytes of 3D remotely sensed data representing topographic information (Sugumaran et al., 2011). These technologies have galvanized changes in the way remotely sensed data are collected, managed, and analyzed. On the sensor side, great progress has been made in optical, microwave, and hyperspectral remote sensing with (1) spatial resolutions extending from kilometers to submeters, (2) temporal resolutions ranging from weeks to 30 min, (3) spectral resolutions ranging from single bands to
Figure 27.1 Remote sensing: big-data sources and challenges.
hundreds of bands, and (4) radiometric resolutions ranging from 8 to 16 bits. The platform side has also seen rapid development during the past three decades. Satellite and aerial platforms have continued to mature and are producing large quantities of remote-sensing data. Moreover, sensors deployed on unpiloted aerial vehicles (UAVs) have recently begun to produce massive quantities of very-high-resolution data. The technological nexus of continuously increasing spatial, temporal, spectral, and radiometric resolutions of inexpensive sensors, on a range of platforms, along with internet data accessibility is creating a flood of remote-sensing data that can easily be included in what is commonly referred to as “big data.” This term refers to datasets that have grown sufficiently large that they have become difficult to store, manage, share, and analyze using conventional software tools (White, 2012). “Big data” are often thought to span four dimensions: volume (data quantity), velocity (real-time processing), variety (source multiplicity), and veracity (data accuracy) (IBM, 2012). Operating hand in glove with Moore’s law, the growth of big data is largely a consequence of advances in acquisition technology and increases in storage capacity. Figure 27.1 summarizes the overall sources and challenges presented by big remote-sensing data.
27.1.2 Big-Data Processing Challenges

As the pace of imaging technology has continued to advance, the provision of affordable technology for storing, processing, managing, archiving, disseminating, and analyzing large volumes of remote-sensing information has lagged. One major challenge is the computational power required to process these massive data sources. Traditionally, desktop computers with single or multiple cores have been used to process remote-sensing data for small areas. In contrast, large- or macro-scale remote-sensing applications may require high-performance computing (HPC) technologies; general-purpose computing on graphics processing units (GPGPU) and parallel, cluster, and distributed-computing approaches are gaining broad acceptance (Plaza et al., 2006; González et al., 2009; Simmhan and Ramakrishnan, 2010; Shekhar et al., 2012). Given these
architectural advances, the analysis of big data presents new challenges to both cluster-infrastructure software and parallel-application design, and it requires the development of new computational methods. These methods, and several articles about the importance of HPC in remote sensing, are featured in special journal issues, books, and conferences devoted to this topic (Plaza and Chang, 2007, 2008; Lee et al., 2011; Prasad, 2013). Graphics processing units (GPUs) have been widely used (in GPGPU applications) to address remote-sensing problems (Chang et al., 2011; Christophe et al., 2011; Song et al., 2011; Yang et al., 2011). Oryspayev et al. (2012) developed an approach to LiDAR processing that coupled data-mining algorithms with parallel computing technology. A specific comparison was made between multicore CPUs (Intel Xeon Nehalem chipsets) and GPUs (NVIDIA Tesla S1070 cards hosted by Intel Core i7 CPUs). The experimental results demonstrated that the GPU option was up to 35 times faster than the CPU option. In a similar vein, distributed parallel approaches have also been developed. Haifang (2003) implemented various algorithms in a heterogeneous grid-computing environment, and Liu et al. (2010) analyzed the efficiency gains that grid computing brings to maximum-likelihood classification. Yue et al. (2010) used cluster computing for remote-sensing image fusion, filtering, and segmentation. Commodity-cluster-based parallel processing of multispectral and hyperspectral imagery has also been used by various authors (e.g., Plaza et al., 2006). While HPC environments such as clusters, grid computing, and supercomputers (Simmhan and Ramakrishnan, 2010) can be used, these platforms require significant investments in equipment and maintenance (Ostermann et al., 2010), and individual researchers and many government agencies do not have routine access to these resources. In addition to data volume, the variety and update rate of datasets often exceed the capacity of commonly used computing and database technologies (Wang et al., 2009; Yang et al., 2011; Shekhar et al., 2012). As a result of these limitations, users have begun to search for less-expensive solutions for the development of large-scale, data-intensive remote-sensing applications. Cloud computing provides a potential solution to this challenge due to its scalability advantages in data
storage and processing and its relatively low cost compared to user-owned, high-power compute clusters (Kumar et al., 2013). The goal of this chapter is to provide a short review of remote-sensing applications that use cloud-computing environments, together with a case study that provides greater implementation detail. An introduction to cloud computing is provided in Section 27.2; various applications, including a detailed case study, are then described in Section 27.3 to illustrate the advantages of cloud-computing environments.
27.2 Introduction to Cloud Computing

27.2.1 Definitions

Cloud computing is a vague term, as nebulous as its eponym. The NIST Definition of Cloud Computing (Mell and Grance, 2011) defines cloud computing as “a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources.” In research, “cloud computing” is a popular idiom for distributed computing, encompassing the same fundamental concepts—multiplicity, parallelism, and fault tolerance. Distributed computing has come to the forefront as an area of research over the past two decades as dataset sizes have outstripped traditional sequential processing power, even that of modern high-performance processors, and as the bottlenecks of large-scale computation have moved outside the CPU (e.g., to storage I/O). To employ distributed or cloud computing means to leverage the additional hardware and computing throughput available from large networks of machines. Because of the challenges inherent in matching resource scale and utilization to a problem, two central focuses of cloud computing in practice have been (1) the elasticity of resource provisioning and (2) abstraction layers capable of simplifying these challenges for users. Indeed, as exemplified in an eScience Institute document (2012), it is these two qualities of cloud computing—elasticity and abstraction—that are often most important from the end user’s perspective. Thus, while research with cloud computing generally focuses on the distributed scalability of the cloud, the characteristic feature of cloud computing in practice is flexibility. Several related terms are associated with cloud computing, as explained below.

Cluster computing is a similar, but more limited, model of parallel/distributed computing, in which the machines
that make up the cluster are usually assumed to be tightly connected by a low-latency, private network. This generally implies physical locality of the system itself. The concept of cluster computing was a precursor to today’s notion of HPC (Lee et al., 2011). Cluster computing differs from cloud computing in its emphasis—a compute cluster is often a localized system dedicated to one particular problem (or class of problems) at a time.

Grid computing can also be thought of as a subset of cloud computing having a slightly different emphasis (Buyya et al., 2009). A compute grid is a distributed system, often encompassing machines physically separated by large distances, together with a job scheduler that allows a user to easily and simultaneously run a small set of programs on many different data inputs—the action of a user’s program on one such input constitutes a task, or job. Grid computing is thus the use of a parallel/distributed system for solving a large problem, or accomplishing a large collection of tasks, that would take too long to complete on a single machine. In grid computing, the problem to be solved can generally be broken down into many nearly identical tasks that run concurrently and independently on the nodes of the system (see, e.g., Wang and Armstrong, 2003). Once complete, the solutions to, or results of, these tasks are aggregated. The type of computational problem-solving approach exemplified by grid computing is also known as high-throughput computing (HTC), in which similar computations are done independently by a (large) number of processors, and the network interconnect is used primarily for data distribution and results aggregation.

Another, distinct, distributed computational paradigm is HPC, which emphasizes slightly different aspects of the system. Whereas HTC emphasizes the problem division, HPC refers to the use of many high-power servers connected by a fast (usually >10 Gbps) network, under any algorithmic paradigm. An HPC algorithm may also qualify as HTC, or it may require much more intermediate communication between processes to accomplish its goal. In other words, in HPC, individual processes/processors may need to communicate very rarely, or very often, during the intermediate stages of a computation. The concepts of HTC and HPC thus focus on slightly different qualities of a whole system and are neither identical nor mutually exclusive. In Figure 27.2, an example HTC workflow is depicted. The problem input is divided into many segments, each of which is sent to a node of the system.
Figure 27.2 A typical high-throughput computing workflow model.
Depending on processing capacities, a single node may receive multiple segments, and several tasks (in Figure 27.2, three) may be required to complete the processing of each segment. Thus, in Figure 27.2, any two tasks Ti,j with the same first index (i) are distinct but operate (in series) on the same segment of the input, whereas any two tasks with the same second index (j) are identical but operate on different portions of the input. After each segment is processed, the individual results are merged in some manner to form the output. Unlike HTC, there is no canonical diagram for HPC, since HPC is defined more by the power of the computational resources and admits many algorithmic paradigms.

Presently, there are several large-scale commercial options for cloud computing, and many larger institutions have their own distributed systems that users may treat as a “cloud.” Amazon’s Elastic Compute Cloud (EC2) was the first public cloud available to large numbers of users over the Internet. Subsequently, Google and Microsoft, as well as several other vendors, have begun to offer large-scale public cloud services. Because the distributed nature of a cloud is the opposite of the traditional mainframe model, most cloud-computing systems in practice run some version of the GNU/Linux operating system. However, Microsoft Windows Server is also an option for cloud systems, and the availability of geographic information system (GIS)-specific software packages makes Windows Server relevant in this domain.
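To make the HTC pattern of Figure 27.2 concrete, the following is a minimal, hypothetical sketch (not part of the original chapter) in Python: the input is split into segments, each segment passes through a short series of tasks (the Ti,1–Ti,3 of the figure), and the per-segment results are merged at the end. The task bodies are placeholders.

```python
from concurrent.futures import ProcessPoolExecutor

# Placeholder tasks applied in series to each segment (T_{i,1}, T_{i,2}, T_{i,3}).
def task_1(segment):                 # e.g., decode raw records
    return [float(v) for v in segment]

def task_2(values):                  # e.g., apply a correction factor
    return [v * 0.98 + 1.0 for v in values]

def task_3(values):                  # e.g., reduce the segment to a summary
    return sum(values) / len(values)

def process_segment(segment):
    return task_3(task_2(task_1(segment)))

def main():
    data = [str(i) for i in range(1_000_000)]          # stand-in problem input
    n_segments = 50
    size = -(-len(data) // n_segments)                 # ceiling division
    segments = [data[i:i + size] for i in range(0, len(data), size)]

    # The worker pool plays the role of the nodes; segments are independent.
    with ProcessPoolExecutor() as pool:
        partials = list(pool.map(process_segment, segments))

    print(sum(partials) / len(partials))               # merge step

if __name__ == "__main__":
    main()
```

Because the segments never communicate, throughput scales simply with the number of workers—exactly the property that makes HTC workloads a natural fit for elastic cloud resources.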
27.2.2 Cloud Paradigms

While a variety of common terms have arisen—public cloud, private cloud, and hybrid cloud—these models of computing are fundamentally the same, differing only in ancillary issues such as security and usability. A public cloud is a cloud system available to the public (or some subset of the public) over the World Wide Web. A public cloud may even rely on Internet infrastructure for “internal” network connectivity. In practice, compute access to most public clouds is available for rent to the general public. Amazon’s EC2 is the quintessential public cloud—Amazon Web Services (AWS) was an early leader in providing computing power as a commercial service. Others in the commercial sector have followed suit, such as Google with its Google Cloud Platform. The emergence of public clouds is a good example of an economy of scale—powerful servers and high-performance networks are expensive and require specialized expertise to administer and support. Many institutions have problems for which cloud computing is at least part of an ideal solution, but only the largest have a sufficient quantity of such problems to make owning and managing their own servers financially sensible. Because of the expense, cloud-computing systems of scale are often only financially practical under high load—a smaller organization without a sufficient volume of computational problems may find itself utilizing a private system only, say, 50% of the time. In this scenario, the costly overhead looms even larger in the cost–benefit analysis of problems solved versus power consumed.
A private cloud is just that—a private network of computers owned by the using entity, usually segregated from the public Internet. Examples of private clouds include any compute clusters to which access is limited, such as the research clusters commonly operated by and within large universities. Private cloud computing makes sense under several scenarios: (1) For research and development, it may be necessary to have more control over the system than is afforded by the typical public cloud; oftentimes, complete control of all aspects of a system can be achieved only when the system is private. (2) When it is financially feasible to own a system whose response time would be sufficient for the users, and when the users have a sufficient quantity of computational tasks that the system would be highly utilized, a private cloud can make sense—thus cutting out the middleman. Finally, the term hybrid cloud is sometimes used to refer to an aggregate system of which some portion is owned, or fully controlled, by the user and some portion is available over a public network. A hybrid cloud can make sense when different parts of the cloud are used for distinctly different tasks and a different cloud-computing model would be ideal for each. For example, an institution with critical data to process may opt to store the data permanently on a smaller, private cloud, where members have full control over data security, but send some portion of the data to a public cloud, on demand, for processing. While a private cloud offers more control to its owners/users, a public cloud can be easier to use and can usually provide greater computational power, with little or no overhead cost. As such, a hybrid cloud model may make sense in practice for many businesses and for smaller, short-to-medium-term research endeavors.
27.2.3 Cloud Service Models

Infrastructure as a service: The fundamental concept of outsourcing and commercializing compute time is referred to as infrastructure as a service (IaaS). In this approach, a cloud service provider (for instance, Amazon) owns and operates a collection of networked servers available for rent. At a base level, the cloud service provider provisions rental machines (often virtual machines) with a client’s desired operating system, as well as related facilities and tools. For instance, a client might rent compute time from a cloud service provider and request that the rental machines run Fedora Linux. The physical machines, together with the network, Linux kernel, and Fedora distribution, constitute the infrastructure, and this system in its entirety is the product provided to the client.

Platform as a service: The concept of a platform as a service (PaaS) lies on top of IaaS. For many clients, a compute infrastructure alone is not sufficient for their goals—there is a large gulf between the presence of computing infrastructure and the desired end result. For end users who may have a higher-level abstraction of their computational task(s), an intermediate platform—a computational service that provides more than just an operating system and network and is managed at that
higher level—may be appropriate. In PaaS, a software platform is provisioned by the cloud service provider and can be composed of one or more different software layers. One common example is Hadoop, by Apache. Hadoop is an open-source implementation of Google’s MapReduce framework that provides the MapReduce computational model along with an underlying distributed file system. In addition, there are a variety of additional layers compatible with Hadoop that can exist on top of the Hadoop–MapReduce platform—a common such layer is the data warehouse Hive, also developed by Apache. For more information on the MapReduce computing framework, see Dean and Ghemawat (2004). Software as a service: At a higher level than PaaS is software as a service (SaaS), in which the cloud service provider furnishes a complete software package designed for a particular domain. The end user need only perform configuration-level tasks on the system before it is ready for use. This model of computational services is well suited for organizations that must perform ubiquitous tasks, such as vehicle location tracking or legal document preparation, and that have no desire for, or commitment to, computational research and development. SaaS is becoming increasingly popular as an operational business model.
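As an illustration of the PaaS layer described above, the following is a minimal, hypothetical Hadoop Streaming job written in Python (Hadoop Streaming allows the mapper and reducer to be ordinary scripts that read stdin and write tab-separated key–value pairs to stdout). The input format—CSV records of classified pixels—and all file names are assumptions made for the example, not anything prescribed by Hadoop or by this chapter.

```python
#!/usr/bin/env python
# mapper.py -- emit "<class_label>\t1" for each classified-pixel record.
import sys

for line in sys.stdin:
    fields = line.strip().split(",")        # e.g., "tile_0412,x,y,water"
    if len(fields) == 4:
        print(f"{fields[3]}\t1")
```

```python
#!/usr/bin/env python
# reducer.py -- Hadoop delivers lines sorted by key, so counts can be
# accumulated one label at a time.
import sys

current, count = None, 0
for line in sys.stdin:
    label, value = line.rstrip("\n").split("\t")
    if label != current:
        if current is not None:
            print(f"{current}\t{count}")
        current, count = label, 0
    count += int(value)
if current is not None:
    print(f"{current}\t{count}")
```

A job like this would be submitted with the hadoop-streaming JAR (e.g., hadoop jar hadoop-streaming.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input /data/tiles -output /data/counts, with paths as placeholders); the platform, not the user, handles scheduling, data locality, and fault tolerance—the essence of PaaS.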
27.2.4 Advantages and Limitations of Cloud Computing

On a practical level, the cloud-computing paradigm has many advantages: access to HPC systems; “pay-as-you-go” payment schemes, with few overhead costs; on-demand provisioning of resources; highly scalable/elastic compute and storage resources; and automated data reliability (McEvoy and Schulze, 2008; Watson et al., 2008; Cui et al., 2010; Huang et al., 2010; Rezgui et al., 2013; Yang and Huang, 2013). Note that this last advantage, high data reliability, is different from data security—one of the major challenges in cloud computing. By its very nature of high accessibility/availability, the most scalable type of cloud computing—the use of a public cloud—is inherently less secure than a computing model based on private control of the entire system. The security challenges of using a public cloud are manifold: depending on privacy requirements, it may not be acceptable that data reside on the cloud service provider’s machines; users are reliant on the internal security controls of the service provider to prevent unauthorized data access; and the public network infrastructure that must be traversed for data and compute access may be compromised. More elaborate (and, from a performance perspective, costly) security and encryption measures must commonly be taken when using a public cloud, such as the use of a virtual private network.

On a fundamental level, the advantages of cloud computing lie in the distributed paradigm of the hardware systems and the prospect of lowered barriers to access through the commoditization of computing power itself. The cloud-computing paradigm provides flexibility, scalability, robustness, and, when managed by a dedicated provider, ease of use. In order to make the best use of these new computational opportunities, however, software
design must take the distributed/parallel nature of cloud computing into account—the power of cloud computing lies not in any single machine, but in the capacities of the system as a whole. Problems of an HTC nature are natural candidates for solving in the cloud because of the intrinsically scalable aspects of HTC solutions. More generally, for all problem paradigms, the communication requirements of the problem play a role in selecting the appropriate service model when seeking a cloud-computing solution—if substantial communication is required between compute nodes in a particular algorithm, a lower-level service model (IaaS) may be necessary. For example, it may be difficult to find an appropriate, off-the-shelf SaaS-level software implementation for an algorithmic approach that requires compute nodes to communicate heavily. In contrast, in HTC, the focus of the software developer can be narrowed to the proper provisioning of resources; network bandwidth is often a bottleneck only during the dissemination of problem input and the gathering of output. Thus, an analysis akin to Amdahl’s law plays an important role in determining the practicality of a particular scale of cloud computing for running a specific algorithm or solving a particular problem. Finally, on issues of cloud security and liability, it should be emphasized that, in part because cloud computing is not an established commodity, many contractual aspects (e.g., various liabilities) are left to the user and provider to agree upon.
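To make the Amdahl’s-law analysis concrete, here is a small, hypothetical calculation (an illustration, not from the chapter): if a fraction p of a job parallelizes across n cloud nodes, the ideal speedup is 1/((1 − p) + p/n), so a serial residue such as output generation quickly caps the benefit of renting more nodes.

```python
def amdahl_speedup(p: float, n: int) -> float:
    """Ideal speedup when a fraction p of the work runs in parallel on n nodes."""
    return 1.0 / ((1.0 - p) + p / n)

# Hypothetical job: 90% parallel (per-tile processing),
# 10% serial (input dissemination and output merging).
for n in (1, 8, 64, 512):
    print(f"{n:4d} nodes -> {amdahl_speedup(0.90, n):5.2f}x speedup")
# The speedup approaches, but never exceeds, 1/0.1 = 10x no matter how many
# instances are provisioned -- a useful sanity check before scaling out.
```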
27.3 Cloud-Computing-Based Remote-Sensing-Related Applications

This section provides a short summary of remote-sensing applications that use cloud-computing environments, as well as a more detailed case study. HPC frameworks (e.g., supercomputers at various research organizations) can potentially be widely adopted for remote-sensing applications (Parulekar et al., 1994; Simmhan and Ramakrishnan, 2010; Lee et al., 2011; Plaza et al., 2011a,b). The main limitations associated with the use of HPC are that (1) HPC resources are not readily available to large user communities and (2) HPC resources are expensive to acquire and maintain (Ostermann et al., 2010).

Krishnan et al. (2010) evaluated a MapReduce approach for LiDAR gridding on a small private cluster consisting of around 8–10 commodity computers. They investigated the effects of several parameters, including grid resolution and dataset size, on performance. For their software implementation using Hadoop, the authors experimented with Hadoop-specific factors, such as the number of reducers allocated for a problem and the inherent concurrency therein. In their particular study, using quad-core machines with 8 GB of main memory connected by gigabit Ethernet, they found that doubling the size of their Hadoop cluster from four to eight nodes had little effect on their experimental runtimes; their solution to the task was thus not strictly compute bound. On the other hand, the authors noted a substantial degradation in performance for their single-node, nondistributed control algorithm (implemented in
C++) when the problem size grew larger than could fit in main memory. Krishnan et al. (2010) concluded that Hadoop could be a useful framework for algorithms that process large-scale spatial data (roughly 150 million points), but that certain serial elements, such as output generation, could still be rate limiting, especially on commodity hardware. Their work also (1) motivates the study of systems with larger memories (this stems from their experience with HPC resources) and (2) demonstrates that the task of designing optimal systems for the processing of massive spatial data is complex.

Project Matsu is an open-source project for processing satellite imagery using a community cloud (Bennett and Grossman, 2012). This project, a collaboration between NASA and the Open Cloud Consortium (OCC), has been developed to process data from NASA’s EO-1 satellite and to develop open-source technology for public cloud-based processing of satellite imagery. Most computations were completed using a Hadoop framework running on 9 compute nodes with 54 compute cores and 352 GB of RAM. The stated goals of this project (http://matsu.opensciencedatacloud.org/) are (1) to use an open-source cloud-based infrastructure to make high-quality satellite image data accessible through an Open Geospatial Consortium-compliant (OGC-compliant) Web Map Service (WMS), (2) to develop an open-source cloud-based analytic framework for analyzing individual images and collections of images, and (3) to generalize this framework to manage and analyze other types of spatial–temporal data. The project also features an on-demand, cloud-based disaster-assessment capability through satellite image comparisons, which are done via a MapReduce job using a Hadoop Streaming interface. As an example, this project hosts a website that provides real-time information about flood prediction and assessment in Namibia. The final data are served to end users using standard OGC WMS and Web Coverage Processing Service tools.

Oryspayev et al. (2012) studied LiDAR data reduction algorithms implemented on the GPGPU and multicore CPU architectures available on the AWS EC2. The paper evaluates a vertex-decimation algorithm for reducing LiDAR data size/density and analyzes the performance of this approach on multicore CPU and GPU technologies, to better understand processing time and efficiency. It documents the performance of various GPGPU and multicore CPU machines, including Tesla-family GPUs and Intel’s multicore i-series CPUs, on the data-reduction problem using large-scale LiDAR data. The study raises several questions about the implementation of spatial-data processing algorithms on GPGPU machines, such as how to reduce overhead during the initialization of devices and how to optimize algorithms to minimize data transfer between CPUs and GPUs.

Eldawy and Mokbel (2013) developed an open-source framework, SpatialHadoop, that extends Hadoop by providing native support for spatial data. As an extension of Hadoop, the framework operates similarly—programs are written in terms of map and reduce functions—though the system is optimized to exploit underlying properties and characteristics of
spatial data. As case studies, SpatialHadoop provides three spatial operations: range queries, k-nearest-neighbor queries, and spatial join.

Cary et al. (2009) studied the performance of the MapReduce framework for bulk construction of R-trees and for aerial image quality computation, on both vector and raster data. They deployed their MapReduce implementations using the Hadoop framework on the Google and IBM clouds. The authors presented results that demonstrate the scalability of MapReduce and the effect of parallelism on the quality of the results. The paper also studied various metrics for comparing the performance of the implemented algorithms, including execution time, correctness, and tile quality. Their results indicate that the appropriate application of MapReduce can dramatically improve task completion times and provide close-to-linear scalability; this motivates further investigation of the MapReduce framework for other spatial-data-handling problems.

Li et al. (2010) studied the integration of data from ground-based sensors with Moderate-Resolution Imaging Spectroradiometer (MODIS) satellite data using the Windows Azure cloud platform. Specifically, the authors provide a novel approach to reproject input data into timeframe- and resolution-aligned, geographically formatted data, and they develop a novel reduction technique to derive important new environmental data through the integration of satellite and ground-based data. Slightly modified Windows Azure abstractions and APIs were used to accomplish the reprojection and reduction steps. They suggest that cloud computing has great potential for efficiently processing satellite data. It should be noted that their framework does not fit into the MapReduce model, since it uses Azure’s general queue-based task model.

Berriman et al. (2010) compared various toolkits for creating image mosaics and managing their provenance using both the Amazon EC2 cloud and the Abe high-performance cluster at NCSA, UIUC. They conducted a series of experiments to study the performance and costs associated with different types of tasks (I/O bound, CPU bound, and memory bound) in these two environments. Their experiments show that for I/O-bound applications the most expensive resources are not necessarily the most cost-effective, that data-transfer costs can exceed processing costs for I/O-bound applications on Amazon EC2, and that the resources offered by Amazon EC2 are generally less powerful than those available in the Abe high-performance cluster and consequently do not offer the same levels of performance. They concluded that cloud computing offers a powerful and cost-effective new resource for compute- and memory-intensive remote-sensing applications.
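The studies above do not publish their code here; as a purely illustrative sketch of the gridding kernel that a MapReduce job such as Krishnan et al.’s distributes across tiles, the following NumPy snippet bins scattered LiDAR-style (x, y, z) points into a raster of mean elevations (all values are synthetic).

```python
import numpy as np

def grid_mean_elevation(x, y, z, cell, nx, ny, x0=0.0, y0=0.0):
    """Bin scattered (x, y, z) points into an ny-by-nx raster of mean z.

    Single-node sketch: in a MapReduce setting, each mapper would grid one
    tile this way, and the reducer would merge per-cell sums and counts.
    """
    col = ((x - x0) / cell).astype(int).clip(0, nx - 1)
    row = ((y - y0) / cell).astype(int).clip(0, ny - 1)
    sums = np.zeros((ny, nx))
    counts = np.zeros((ny, nx))
    np.add.at(sums, (row, col), z)      # accumulate elevation per cell
    np.add.at(counts, (row, col), 1)    # accumulate point count per cell
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.where(counts > 0, sums / counts, np.nan)

# Synthetic stand-in for one LiDAR tile: 10^6 points over a 2 km square.
rng = np.random.default_rng(0)
x, y = rng.uniform(0.0, 2000.0, (2, 1_000_000))
z = 300.0 + 0.01 * x + rng.normal(0.0, 0.2, x.size)
dem = grid_mean_elevation(x, y, z, cell=10.0, nx=200, ny=200)
print(dem.shape, float(np.nanmean(dem)))
```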
27.3.1 A Case Study: Cloud-Based LiDAR Processing System

To illustrate an approach to cloud-based processing of remote-sensing information more completely, in this section we describe a LiDAR application. In this cloud-based LiDAR processing system (CLiPS) project, we use a statewide (Iowa) LiDAR data
Figure 27.3 Overall architecture developed for cloud-based LiDAR processing system.
repository (Iowa DNR, 2009) in which data are distributed to the public as a collection of 34,000 tiles, each covering 4 km2; this comprises roughly 7 TB of data. To process these massive data, CLiPS was designed (Figure 27.3) as a web portal implemented using Adobe’s Flex framework along with ESRI’s ArcGIS API for Flex (ESRI, 2012; Sugumaran et al., 2014), OpenLayers, and the Amazon EC2 cloud environment. CLiPS uses a three-tier client–server model (Figure 27.3): the top tier supports user interaction with the system, the second tier provides process
management services, such as monitoring and analysis, and the third tier is dedicated to data and file services. The client-side interface was developed using Flex, and the server side uses custom-built tools constructed from open-source products. Figure 27.4 shows the user interface developed for this study. The interactive interface requires a user first to select a region of interest on a map and then to supply AWS credentials, select computing resources, and specify a location to which results will be sent for downloading (typically an e-mail address).
Figure 27.4 Cloud-based LiDAR processing system user interface for the state of Iowa, United States.
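The CLiPS source is not reproduced in this chapter; the snippet below is a hypothetical sketch of the on-demand provisioning step in such an architecture, using the boto3 AWS SDK to start an EC2 instance of a user-selected type and block until it is running. The AMI ID, instance type, and key name are placeholders.

```python
import boto3

def launch_worker(ami_id: str, instance_type: str, key_name: str) -> str:
    """Start one on-demand EC2 worker and wait until it is running."""
    ec2 = boto3.client("ec2")
    resp = ec2.run_instances(
        ImageId=ami_id,               # placeholder AMI with the processing stack
        InstanceType=instance_type,   # user-selected size, as in CLiPS
        KeyName=key_name,
        MinCount=1,
        MaxCount=1,
    )
    instance_id = resp["Instances"][0]["InstanceId"]
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
    return instance_id

# Hypothetical usage; all identifiers are placeholders.
# worker_id = launch_worker("ami-12345678", "m3.xlarge", "my-keypair")
```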
The system was tested by creating a triangulated irregular network (TIN) from a LiDAR point cloud using 18 dataset and processor configurations: three terrain types (flat, undulating, urban), two tile-group sizes (3 × 3 = 9 and 5 × 5 = 25 tiles), and three Amazon EC2 processing configurations (Large, XLarge, and double XLarge). The undulating terrain dataset took more time than the other terrain types for the 5 × 5 tile groups, while the urban terrain was the most computationally intensive for the 3 × 3 tile groups used in this study (Sugumaran et al., 2014). The results clearly show that as computing power increases, processing times decrease for all three types of LiDAR terrain data. The various combinations in our evaluations showed that even with up to 25 tiles and varying processing configurations, each request required less than an hour and cost less than a dollar for data processing (e.g., TIN creation). Moreover, the cost of uploading data from our server and of data storage on the cloud was less than 50 dollars. Thus, the overall cost of our test using the Amazon cloud was less than $100, an amount affordable by most users.
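As a single-machine, purely illustrative sketch of the TIN-creation step that CLiPS runs in the cloud, SciPy’s Delaunay triangulation can build a TIN over the (x, y) positions of LiDAR points, after which the surface can be interpolated at a query point with barycentric weights (the point cloud below is synthetic).

```python
import numpy as np
from scipy.spatial import Delaunay

# Synthetic stand-in for a small tile of LiDAR ground points.
rng = np.random.default_rng(42)
pts = rng.uniform(0.0, 1000.0, (10_000, 2))      # x, y in meters
z = 250.0 + 0.02 * pts[:, 0] + rng.normal(0.0, 0.1, len(pts))

tin = Delaunay(pts)            # the 2D Delaunay triangulation is the TIN
print(tin.simplices.shape)     # (n_triangles, 3) vertex indices

# Interpolate the TIN surface at a query point: locate its triangle, then
# weight the three vertex elevations by barycentric coordinates.
q = np.array([[500.0, 500.0]])
tri = int(tin.find_simplex(q)[0])
b = tin.transform[tri, :2] @ (q[0] - tin.transform[tri, 2])
weights = np.append(b, 1.0 - b.sum())
print(float(z[tin.simplices[tri]] @ weights))    # elevation estimate at q
```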
27.4 Conclusions

It is abundantly clear that, as a consequence of technological and sensor-licensing improvements, the spatial, spectral, temporal, and radiometric resolution of remote-sensing imagery will continue to increase. This translates into massive quantities of data that must be processed to glean meaningful information for use in a variety of decision-support and visualization applications. Such data quantities quickly overwhelm the capabilities of even the most powerful desktop systems available now and in the foreseeable future. As a consequence, researchers continue to explore cost-effective, yet powerful, computing environments that can be harnessed for remote-sensing applications. Dedicated HPC systems are expensive to acquire and maintain and have relatively short half-lives, which significantly diminishes their practicality. Instead, the availability of high-speed communication technologies now makes the use of distributed, pay-as-you-go resources an attractive choice for researchers and government agencies, as well as for users in the private sector. One term used to describe these distributed resources is cloud computing. The cloud provides new patterns for deploying remote-sensing data processing and offers easy, inexpensive access to servers, elastic scalability, managed infrastructure, and low deployment complexity. Despite these significant advantages, cloud computing does have limitations. First, because data are distributed across networks to dispersed, and often unknown, locations, security can become problematic; public cloud resources therefore cannot be the sole platform for applications that require the processing of many types of individual-level information. This limitation, however, is not normally significant for remote-sensing applications. Second, different types of spatial algorithms may require considerable amounts of interprocessor communication, particularly in big-data applications, and certain elements of cloud infrastructure may therefore introduce processing latencies that are unacceptable to users.
Acknowledgments

This research was conducted with support from an Amazon Research Grant and USGS–AmericaView projects.
References

Bennett, C. and Grossman, R. 2012. OCC Project Matsu: An open-source project for cloud-based processing of satellite imagery to support the earth sciences. http://matsu.opensciencedatacloud.org (accessed on March 4, 2014).
Berriman, G. B., Deelman, E., Groth, P., and Juve, G. 2010. The application of cloud computing to the creation of image mosaics and management of their provenance. In SPIE Astronomical Telescopes + Instrumentation, International Society for Optics and Photonics, Bellingham, WA, vol. 7740F.
Buyya, R., Yeo, C. S., Venugopal, S., Broberg, J., and Brandic, I. 2009. Cloud computing and emerging IT platforms: Vision, hype, and reality for delivering computing as the 5th utility. Future Generation Computer Systems, 25, 599–616.
Cary, A., Sun, Z., Hristidis, V., and Rishe, N. 2009. Experiences on processing spatial data with MapReduce. In Scientific and Statistical Database Management, Springer, Berlin, Germany, pp. 302–319.
Chang, C. C., Chang, Y. L., Huang, M. Y., and Huang, B. 2011. Accelerating regular LDPC code decoders on GPUs. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (JSTARS), 4(3), 653–659.
Christophe, E., Michel, J., and Inglada, J. 2011. Remote sensing processing: From multicore to GPU. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (JSTARS), 4(3), 643–652.
Cui, D., Wu, Y., and Zhang, Q. 2010. Massive spatial data processing model based on cloud computing model. In 2010 Third International Joint Conference on Computational Science and Optimization (CSO), May 28–31, Anhui, China, vol. 2, pp. 347–350.
Dean, J. and Ghemawat, S. 2004. MapReduce: Simplified data processing on large clusters. In Proceedings of the Sixth Conference on Symposium on Operating Systems Design & Implementation, December 6–8, San Francisco, CA, vol. 6, 13pp.
Eldawy, A. and Mokbel, M. F. 2013. A demonstration of SpatialHadoop: An efficient MapReduce framework for spatial data. Proceedings of the VLDB Endowment, 6(12), 1230–1233.
eScience Institute, University of Washington. 2012. Understanding cloud computing for research and teaching. http://escience.washington.edu/get-help-now/understanding-cloud-computing-research-and-teaching (accessed on March 4, 2014).
ESRI. 2012. ArcGIS server. http://www.esri.com/software/arcgis/arcgisserver (accessed on March 23, 2012).
González, J. F., Rodríguez, M. C., and Nistal, M. L. 2009. Enhancing reusability in learning management systems through the integration of third-party tools. In 39th IEEE Frontiers in Education Conference, FIE’09, October 18–21, San Antonio, TX, pp. 1–6.
Haifang, Z. 2003. Study and implementation of parallel algorithms for remote sensing image processing. PhD thesis, National University of Defence Technology, Changsha, China.
Huang, Q., Yang, C., Nebert, D., Liu, K., and Wu, H. 2010. Cloud computing for geosciences: Deployment of GEOSS clearinghouse on Amazon’s EC2. In HPDGIS ’10: Proceedings of the ACM SIGSPATIAL International Workshop on High Performance and Distributed Geographic Information Systems, November 3–5, San Jose, CA, pp. 35–38.
Hyspeed Computing. 2013. Big data and remote sensing—Where does all this imagery fit into the picture? http://hyspeedblog.wordpress.com/2013/03/22/big-data-and-remote-sensing-where-does-all-this-imagery-fit-into-the-picture (accessed on March 2014).
IBM. 2012. Bringing big data to the enterprise. http://www-01.ibm.com/software/data/bigdata (accessed on April 4, 2013).
Iowa DNR. 2009. State of Iowa. http://www.iowadnr.gov/mapping/lidar/index.html (accessed on April 4, 2013).
Krishnan, S., Bary, C., and Crosby, C. 2010. Evaluation of MapReduce for gridding LIDAR data. In 2010 IEEE Second International Conference on Cloud Computing Technology and Science, CloudCom, November 30–December 3, Indianapolis, IN, pp. 33–40.
Kumar, N., Lester, D., Marchetti, A., Hammann, G., and Longmont, A. 2013. Demystifying cloud computing for remote sensing application. http://eijournal.com/newsite/wp-content/uploads/2013/06/cloudcomputing.pdf.
Lee, C. A., Gasster, S. D., Plaza, A., Chang, C. I., and Huang, B. 2011. Recent developments in high performance computing for remote sensing—A review. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (JSTARS), 4(3), 508–527.
Li, J., Humphrey, M., Agarwal, D., Jackson, K., Ingen, C., and Ryu, Y. 2010. eScience in the cloud: A MODIS satellite data reprojection and reduction pipeline in the Windows Azure platform. In IEEE International Symposium on Parallel & Distributed Processing (IPDPS), April 19–23, Atlanta, GA, pp. 1–10.
Liu, T. et al. 2010. Remote sensing image classification techniques based on the maximum likelihood method. FuJian Computer, (001), 7–8.
McEvoy, G. V. and Schulze, B. 2008. Using clouds to address grid limitations. In Proceedings of the 6th International Workshop on Middleware for Grid Computing, December 1–5, Leuven, Belgium.
Mell, P. and Grance, T. 2011. The NIST definition of cloud computing. National Institute of Standards and Technology, Special Publication 800-145.
Oryspayev, D., Sugumaran, R., DeGroote, J., and Gray, P. 2012. LiDAR data reduction using vertex decimation and processing with GPGPU and multicore CPU technology. Computers & Geosciences, 43, 118–125.
Ostermann, S., Iosup, A., Yigitbasi, N., Prodan, R., Fahringer, T., and Epema, D. 2010. A performance analysis of EC2 cloud computing services for scientific computing. In Cloud Computing, Springer, Berlin, Germany, pp. 115–131.
Parulekar, R. et al. 1994. High performance computing for land cover dynamics. In Proceedings of the 12th IAPR International Conference on Pattern Recognition, Vol. 3—Conference C: Signal Processing, October 9–13, Jerusalem, Israel, IEEE, New York.
Plaza, A. and Chang, C.-I. 2007. High Performance Computing in Remote Sensing, CRC Press, Boca Raton, FL.
Plaza, A. and Chang, C.-I. 2008. Special issue on high performance computing for hyperspectral imaging. International Journal of High Performance Computing Applications, 22(4), 363–365.
Plaza, A., Du, Q., Chang, Y.-L., and King, R. L. 2011a. High performance computing for hyperspectral remote sensing. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (JSTARS), 4(3), 528–544.
Plaza, A., Plaza, J., Paz, A., and Sanchez, S. 2011b. Parallel hyperspectral image and signal processing. IEEE Signal Processing Magazine, 28(3), 119–126.
Plaza, A., Valencia, D., Plaza, J., and Martinez, P. 2006. Commodity cluster-based parallel processing of hyperspectral imagery. Journal of Parallel and Distributed Computing, 66(3), 345–358.
Prasad. 2013. Special issue on high performance computing in remote sensing. Remote Sensing.
Rezgui, A., Malik, Z., and Yang, C. 2013. High-resolution spatial interpolation on cloud platforms. In Proceedings of the 28th Annual ACM Symposium on Applied Computing, March 18–22, Coimbra, Portugal, pp. 377–382.
Shekhar, S., Gunturi, V., Evans, M. R., and Yang, K. 2012. Spatial big-data challenges intersecting mobility and cloud computing. In Proceedings of the 11th ACM International Workshop on Data Engineering for Wireless and Mobile Access, May 20–24, Scottsdale, AZ, pp. 1–6.
Simmhan, Y. and Ramakrishnan, L. 2010. Comparison of resource platform selection approaches for scientific workflows. In Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, June 21–25, Chicago, IL, pp. 445–450.
Song, C., Li, Y., and Huang, B. 2011. A GPU-accelerated wavelet decompression system with SPIHT and Reed-Solomon decoding for satellite images. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (JSTARS), 4(3), 683–690.
Sugumaran, R., Burnett, J., and Armstrong, M. P. 2014. Using a cloud computing environment to process large 3D spatial datasets. In H. Karimi, ed., Big Data: Techniques and Technologies in Geoinformatics, CRC Press, Boca Raton, FL, pp. 53–65.
Sugumaran, R., Oryspayev, D., and Gray, P. 2011. GPU-based cloud performance for LiDAR data processing. In COM.Geo 2011: Second International Conference and Exhibition on Computing for Geospatial Research and Applications, May 23–25, Washington, DC.
USGS. 2011. Landsat archive. http://landsat.usgs.gov (accessed on April 7, 2013).
Vatsavai, R. R., Ganguly, A., Chandola, V., Stefanidis, A., Klasky, S., and Shekhar, S. 2012. Spatiotemporal data mining in the era of big spatial data: Algorithms and applications. In Proceedings of the First ACM SIGSPATIAL International Workshop on Analytics for Big Geospatial Data, November 7–9, Redondo Beach, CA, pp. 1–10.
Wang, S. and Armstrong, M. P. 2003. A quadtree approach to domain decomposition for spatial interpolation in grid computing environments. Parallel Computing, 29(10), 1481–1504.
Wang, Y., Wang, S., and Zhou, D. 2009. Retrieving and indexing spatial data in the cloud computing environment. In Proceedings of the First International Conference on Cloud Computing, December 1–4, Beijing, China, Lecture Notes in Computer Science, vol. 5931, pp. 322–331.
Watson, P., Lord, P., Gibson, F., Periorellis, P., and Pitsilis, G. 2008. Cloud computing for e-Science with CARMEN. In Second Iberian Grid Infrastructure Conference Proceedings, May 12–14, Porto, Portugal, pp. 3–14.
White, T. 2012. Hadoop: The Definitive Guide, O’Reilly Media, Inc., Sebastopol, CA.
Yang, C. and Huang, Q. 2013. Spatial Cloud Computing: A Practical Approach, CRC Press, Boca Raton, FL.
Yang, H., Du, Q., and Chen, G. 2011. Unsupervised hyperspectral band selection using graphics processing units. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (JSTARS), 4(3), 660–668.
Yue, P., Gong, J., Di, L., Yuan, J., Sun, L., Sun, Z., and Wang, Q. 2010. GeoPW: Laying blocks for the geospatial processing web. Transactions in GIS, 14(6), 755–772.