SPATIAL DATA SHARING ON GRID

Yufei Wang, Linlin Ge, Chris Rizos, Ravindra Babu
School of Surveying and Spatial Information Systems
The University of New South Wales, Sydney, NSW 2052, Australia
Tel: +61-2-9385-4208; Fax: +61-2-9313-7493; Email: [email protected]

ABSTRACT
Internet technology has already changed the Information Society in profound ways, and will continue to do so. Many people now foresee a similar trajectory for the next generation of the Internet: Grid technology. As an emerging computational and networking infrastructure, Grid Computing is designed to provide pervasive, uniform and reliable access to data, computational and human resources distributed in a dynamic, heterogeneous environment. The development of GIS, for its part, has been strongly influenced by the evolution of information technology such as the Internet, telecommunications, software and various types of computing technology. Distributed GIS in particular has advanced significantly in the past decade. However, owing to the closed and centralised legacy of its architecture and its lack of interoperability, modularity and flexibility, current distributed GIS still cannot fully accommodate distributed, dynamic, heterogeneous and rapidly developing network and computing environments. Developing a high-performance distributed GIS therefore remains a challenging task. Grid Computing technology provides a unique opportunity for distributed GIS, and a GIS paradigm based on Grid Computing becomes inevitable. This paper proposes a distributed GIS framework built on this new computing platform: the Grid Geographic Information System (G2IS).

KEYWORDS: Distributed GIS, Grid computing, Data sharing, Spatial Information System, Distributed computing.

1. INTRODUCTION
Geographic Information Systems (GIS) have evolved in parallel with mainstream information technology. With the rise of the Internet, the evolution of telecommunications and the development of various computing technologies, the GIS paradigm has shifted repeatedly. Traditional GIS provides capabilities for handling geo-referenced data, including data capture, input, storage, retrieval, management, manipulation, analysis and output (Tsou & Buttenfield, 1998; Zaslavsky et al, 2003). However, because of its closed and centralised architectural legacy, current GIS lacks interoperability, modularity and flexibility, and so cannot fully accommodate distributed, dynamic and heterogeneous network environments.

More and more countries are determined to establish a National Information Infrastructure (NII, popularly known as the “Information Highway”) or a National Spatial Data Infrastructure (NSDI), or, at the international level, to launch initiatives such as the Global Information Infrastructure (GII) or simply “Digital Earth”. There is therefore an increasing demand for a technical architecture on which a well-planned distributed GIS can be built.

A Distributed Geographic Information System (DGIS) is a collection of GIS organisations, usually at some distance from each other, connected via a data communication network (Wang, 2000). Each GIS organisation is autonomous, in that it contains both processing power and geo-spatial data. A DGIS may give a user access to the data stored at both the local and the remote GIS organisations. It presents a single database image to the user and provides transparent data access, hiding the data distribution and connection paths. To the user, the system behaves as if all the data and functions were provided by the local GIS organisation.

Compared with a centralised or stand-alone GIS, a DGIS can better meet a user’s organisational objectives and can support projects that require the cooperation of different organisations. Technically, the major advantages of a DGIS are:
(1) In contrast with a centralised, remotely accessed GIS, faster response may be achieved by storing data at the organisations where they are most often used. Response time can also be improved by distributing computationally intensive jobs to multiple organisations for parallel processing.
(2) Greater system reliability can be achieved by storing critical data and processing functions at multiple organisations.
(3) Data sharing becomes more convenient, and development costs can be reduced through data sharing.
(4) In a well-planned DGIS, incremental system growth can be achieved by adding more GIS organisation sites.
In a nutshell, a DGIS can strengthen system reliability, efficiency and resource sharing, and provides flexibility for incremental system growth.

Motivated by these advantages, Distributed Geographic Information Systems have evolved quickly in the past decade. As far as the distributed architecture is concerned, the focus of DGIS has moved from data sharing, through heterogeneous interpretation, to intelligent coordinated working. The system architecture has moved from classic client/server (C/S), through object-oriented multi-layer C/S, to autonomous architectures.
On the software side, DGIS has also seen a quick succession of technologies: JAVABEANS / RMI / JINI, CORBA / IIOP, the AC (CORBA-Agent) framework, COM / DCOM / ActiveX / WIN DNA, Agent / Mobile-Agent / Multi-Agent / Distributed Intelligence Systems, and Microsoft .NET & Web Services. DGIS is now embracing the Spatial Information Service Architecture, GIServices. On the computing side, the relevant techniques include Web computing, Internet computing, Cluster computing, Mobile computing, Distributed computing and Pervasive computing.

Although many new computing, software, database and networking technologies have been introduced into the DGIS research domain, developing a distributed GIS is still a challenging task. The major challenges concern generating efficient system-wide query strategies, synchronising operations executed at different sites, handling heterogeneity, managing transactions, etc. In addition, from the composition point of view, all DGIS are based on the tree or master/slave model, while data exchange is based on the layer model. It is therefore difficult to capture the dynamic relationships between different spatial entities, and spatial analysis is incomplete for lack of synchronous and dynamic processing abilities. From the spatial data management point of view, DGIS can process 2D spatial data effectively and can display 3D data, but they are deficient in the analysis of 3D and time-series data. They are tolerable at attribute data mining and knowledge discovery, but fall short in handling spatial vector data. From the data and functionality sharing point of view, although metadata has been adopted, the problem of (lack of) interpretability still exists.

These evolutionary trends generate new requirements for distributed application development and deployment. To meet these daunting challenges, Grid Computing, an enabling technology for distributed computing that is new in both abstraction and concept, is under development.

2. CHARACTERISTICS OF GRID COMPUTING
The Grid is envisaged as the next generation of the Internet, in which the World Wide Web (WWW) will upgrade to the Great Global Grid (GGG). As an emerging computational and networking infrastructure, Grid technologies have attracted a great deal of attention recently. But what is a Grid? And what is the essence of Grid Computing?

2.1 The essence of Grid Computing
Grids were originally proposed in the context of large-scale scientific applications that require or exploit computational and data resources distributed geographically among multiple sites. The paradigm was subsequently found to be applicable in other settings, e.g. inter-departmental and inter-company resource sharing, and in other application domains. The common theme in these scenarios is the sharing of distributed resources in a well-controlled, secure and mutually fair manner. A platform that fits such a model is referred to as a “Grid”.

The word “Grid” itself is chosen by analogy with the well-known electric power grid, which provides ubiquitous access to electricity. It is believed that by providing pervasive, dependable, consistent and inexpensive access to advanced computational capabilities, databases, sensors and people, computational grids will have a similar transforming effect, allowing new classes of applications to emerge (escience-grid, 2003). Accordingly, Grid Computing is a type of platform designed to provide pervasive, uniform and reliable access to data, computational and human resources distributed in a dynamic and heterogeneous environment (Foster et al, 2002).
Grids are often categorised by the type of solution they best address, e.g. the computational grid, the scavenging grid and the data grid. They differ from conventional distributed systems, which comprise a number of cooperating processes that exploit the resources of a loosely-coupled computer system, and whose applications are distributed in an ad hoc manner in order to gain performance or to use geographically dispersed resources. Grid systems are based on large-scale sharing and on a virtual pool of resources rather than of computational nodes. They are expected to operate on a wider range of resources, such as storage, network, data, software, and atypical resources like graphical and audio I/O devices, manipulators, sensors, etc. Integration and collaboration on network functionality are particularly emphasised in Grid systems.

The fundamental goal of Grid Computing is clear: to seamlessly multiplex the distributed computational resources of providers across wide area networks (Renato, 2003). In traditional computing environments, resources are multiplexed using the mechanisms found in a typical operating system. The Grid Computing solution, however, tries to establish a true network-based operating system (NET OS). This differs from traditional operating systems such as Unix, Linux, Apple Mac OS, Windows, Solaris, FreeBSD, IBM AIX and Java OS, which focus on strengthening the computational ability of a single computer; although they can connect to the Internet and to other computers, all of these connections are located at the application layer. In contrast, Grid Computing is building a true Network-Oriented (NO) OS, or Network-Oriented infrastructure. The main goal is therefore to strengthen network-computing ability at the OS level, rather than pure computing at the application level. As a consequence, the current Personal Computer (PC) is transforming into a type of Network Computer (NC). This is the essence of Grid Computing.

Even though Grid technology is developing at a very rapid pace, some technical challenges remain, e.g. security mechanisms across administrative and national domains, access policies, reimbursement without central control, resource stability, runtime resource management, guaranteed “end-to-end” performance, heterogeneity, fault tolerance and hidden complexity (Xue et al, 2003).

2.2 Conventional Distributed Computing Versus Grid Computing
Although the goal of Grids is clear, there is no clear definition of Grid systems, and in particular of the essential distinction between traditional distributed computing and the new Grid Computing. A comparative study is therefore required to help make this distinction.

In general, distributed applications comprise a number of cooperating processes that exploit the resources of loosely-coupled computer systems. An application may be distributed simply due to its nature, in order to gain performance, or to utilise resources that are not locally present (Nemeth, 2003). On the other hand, distributed computing, e.g.
in the high performance computing domain, may be accomplished via traditional environments such as PVM (Parallel Virtual Machine) and MPI (Message Passing Interface). The emerging software framework termed Computational Grids now provides a new solution for distributed computing.

The essential semantic difference between conventional distributed systems and Grids is the manner in which they establish a virtual, hypothetical concurrent machine from the available resources. An application in a conventional distributed environment assumes a pool of computational nodes (Figure 1, left side) from which a virtual concurrent machine is formed. The pool consists of PCs, workstations and possibly supercomputers, provided that the user has access privileges, e.g. a valid login name and password, to all of them. Login to the virtual machine is realised by login authentication to each node, although it is technically possible to avoid per-node authentication if at least one node accepts the user as authentic. In general it can be assumed that once a user is logged onto a node, permission is given to use essentially all the resources belonging, or attached, to the node without further authorisation. On the other hand, the user is restricted to the local resources at a given node, and only in rare cases is there support for using remote resources. The user, having personal accounts on these nodes, is aware of their features: architecture type, computational power and capacities, operating system, security concerns, usual load, etc. Furthermore, the virtual pool of nodes can be considered static, since the set of nodes to which the user has login access changes very rarely. Systems of this kind that are deployed and currently in use are typically of the order of 10–100 nodes.

In contrast, Computational Grids are based on large-scale resource sharing. Grids assume a virtual pool of resources rather than of computational nodes (Figure 1, right side). Although current systems mostly focus on computational abilities, e.g. CPU and memory, which basically coincide with the notion of nodes, Grid systems are expected to operate on a wider range of resources. These resources typically exist within nodes that are geographically distributed and span multiple administrative domains. The virtual machine is constituted by a set of resources taken from the pool. In Grids, the virtual pool of resources is dynamic and diverse: resources can be added and withdrawn at any time at their owner’s discretion, and their performance or load can change frequently over time. The typical number of resources in the pool is of the order of several thousand or more. Hence the user has very little or no a priori knowledge of the actual type, state and features of the resources constituting the pool.

A user who has access to the pool, i.e. who holds some sort of credential that is accepted by the owners of the resources in the pool, may have the right to use a given resource. This does not mean, however, that login access to the node hosting the resource is provided. Access cannot be controlled through per-node logins, because of the large number of resources in the pool and the diversity of local security policies, and it is unrealistic for a user to have login access to thousands of nodes simultaneously. The comparison is summarised in Table 1.

[Figure 1 placeholder. Left panel: GIS Applications 1 and 2 (Application Level) over Intra-Net GIS nodes (Virtual Machine Level) over GIS Nodes (Physical Level). Right panel: GIS Applications 1 to 3 (Application Level) over the Intra-Grid, Inter-Grid and Internet GIS resources pools (Virtual Pool Level) over GIS Resources (Physical Level).]

Figure 1. GIS Cases: Conventional distributed computing (left) and Grids (right).

Table 1. Comparison of conventional Distributed Computing and Grid Computing

Conventional Distributed Computing | Grid Computing
Virtual pool of computational nodes | Virtual pool of resources
User has credential to all the nodes in the pool | Access to the pool but not individual nodes
Access to a node means all resources on the node | Access to a resource may be restricted
User is aware of the capabilities & features of nodes | User has little or no knowledge about each node
Nodes belong to a single trust domain | Resources span multiple trust domains
Elements in the pool 10–100, more or less static | Elements in the pool more than 100, dynamic
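The two access models compared in Table 1 can be made concrete with a small sketch. It is only an illustration under stated assumptions: the class names `Node` and `ResourcePool`, the resource labels and the credential strings are all hypothetical, and no real Grid middleware is involved.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """Conventional model (hypothetical): a host with per-user login accounts."""
    name: str
    resources: set[str]
    logins: set[str] = field(default_factory=set)

    def login(self, user: str) -> set[str]:
        # A successful login exposes every resource attached to the node.
        if user not in self.logins:
            raise PermissionError(f"{user} has no account on {self.name}")
        return set(self.resources)


@dataclass
class ResourcePool:
    """Grid model (hypothetical): a dynamic pool with per-resource access."""
    # resource name -> credentials accepted by that resource's owner
    acl: dict[str, set[str]] = field(default_factory=dict)

    def add(self, resource: str, accepted: set[str]) -> None:
        # Owners may add a resource to the pool at any time.
        self.acl[resource] = set(accepted)

    def withdraw(self, resource: str) -> None:
        # Owners may also withdraw it at any time.
        self.acl.pop(resource, None)

    def usable_by(self, credential: str) -> set[str]:
        # Access is decided per resource; no node login is involved.
        return {r for r, ok in self.acl.items() if credential in ok}


if __name__ == "__main__":
    node = Node("intranet-gis", {"cpu", "disk", "plotter"}, logins={"alice"})
    print(sorted(node.login("alice")))

    pool = ResourcePool()
    pool.add("gridgis1:cpu", {"alice-cert"})
    pool.add("gridgis2:storage", {"alice-cert", "bob-cert"})
    pool.add("gridgis3:cpu", {"bob-cert"})
    print(sorted(pool.usable_by("alice-cert")))
```

In the first model the unit of access is the node; in the second it is the individual resource, and the set a credential can reach changes as owners add or withdraw resources.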

3. CASE STUDY: MULTI-NODE & MULTI-SOURCE GRID GIS (MMG2IS)
Spatial information exists in huge volumes and is by nature a geographically distributed resource, which makes it ideal for Data Grids. Furthermore, traditional spatial computations such as spatial visualisation, spatial modelling, spatial analysis and spatial statistics, and certain advanced spatial computations such as data mining and discovery, virtual reality, spatial fuzzy logic, cellular automata, multi-dimensional processing and human-computer interaction, are ideal high-performance applications for Computational and Scavenging Grids. The integration of Geographic Spatial Information and Grid Computing is therefore considered to be inevitable.

The Multi-Node & Multi-Data Source Grid Geographic Information System (MMG2IS) is an example of a Grid GIS developed by the authors. It integrates Grid technology with network GIS systems in order to advance the development of distributed GIS. In this example there are three Grid Computing domain nodes, [gridgis1, gridgis2, gridgis3].GMAT.UNSW.EDU.AU, and the data sources include vector and image data. The data are stored across these three nodes.

3.1 Computing Platform and Development Tools
Developing MMG2IS requires a basic platform and some tools: the Grid Computing platform, i.e. the GLOBUS Toolkit (IBM Grid Toolbox), and Grid development tools such as the Java CoG (Commodity Grid) Kit and the IBM Grid Application Framework For Java (GAF4J). To construct a multi-node environment, the authors advocate a novel approach to Grid Computing that combines “classic” OS-level Virtual Machines (VMs) with middleware mechanisms that manage VMs in a distributed environment. The abstraction is that of dynamic and mobile VMs, each a combination of traditional OS processes and files (the VM monitor and state) (Renato, 2003). In MMG2IS, the authors use VMware Workstation as the multi-node VM platform. The OS and programming language currently supported by MM Grid GIS are Red Hat Linux and Java, respectively. A short introduction to these tools and platforms is given below.

Globus Toolkit, the fruit of the GLOBUS project, is a public-domain, low-level toolkit: a community-based, open-architecture, open-source set of services and libraries that support Grids and Grid applications (Foster & Kesselman, 2000). GLOBUS aims at a vertically integrated treatment of applications, middleware and network, offering basic mechanisms such as communication, authentication, network information, security, resource directory and data access. These mechanisms are used to construct various high-level networked virtual supercomputer or meta-computer services, such as parallel programming tools and schedulers. In other words, it is the core technology for implementing Grid Computing.

IBM Grid Toolbox is an integrated set of tools and software that facilitates the creation of applications that can exploit the advanced capabilities of the Grid, using a combination of the Grid Toolbox and other technologies. It can be used: (1) to allow a site to participate in a Computational Grid by contributing resources to the Grid's pool of resources, (2) to provide access to other Grid resources without contributing any of the site's own resources, and (3) to provide other services, e.g. single sign-on authentication, without the need to contribute computational resources. Some important components are:

- Globus Resource Allocation Manager (GRAM) provides resource allocation and process creation, monitoring, and management services.
- Grid Security Infrastructure (GSI) provides a single sign-on authentication service, with support for local control over access rights and mapping from global to local user identities.
- Monitoring and Discovery Service (MDS) is an integrated information service across Grid Toolbox-enabled resources that provides information about the state of the Grid infrastructure. This service is based on the Lightweight Directory Access Protocol (LDAP).
- Global Access to Secondary Storage (GASS) implements a variety of automatic and programmer-managed data movement and data access strategies, enabling programs running at remote locations to read and write local data.
- Replica Services provides cataloguing and data management, and handles copying and placement of files in a distributed computing environment.
- Grid Toolbox I/O provides an interface to TCP/UDP and file I/O, and supports synchronous and asynchronous interfaces, multi-threading, and GSI security.
- Simple CA provides a personal certificate authority (CA) for testing and developing Grids and Grid applications.
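The "mapping from global to local user identities" that GSI provides can be pictured as a lookup from a certificate subject name to a local account. The sketch below assumes a simplified text format modelled on the grid-mapfile convention used with the Globus Toolkit (a quoted subject name followed by one local user name per line). The subject names, account names and function names are invented for illustration, and the sketch performs no certificate validation at all.

```python
from __future__ import annotations

import shlex


def parse_gridmap(text: str) -> dict[str, str]:
    """Parse lines of the form: "<certificate subject>" <local account>."""
    mapping: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = shlex.split(line)  # honours the quotes around the subject
        if len(parts) != 2:
            raise ValueError(f"malformed entry: {raw!r}")
        subject, local_user = parts
        mapping[subject] = local_user
    return mapping


def local_identity(mapping: dict[str, str], subject: str) -> str:
    """Return the local account for a global identity, or refuse access."""
    try:
        return mapping[subject]
    except KeyError:
        raise PermissionError(f"no local mapping for {subject}") from None


# Hypothetical entries; a real site would maintain its own file.
EXAMPLE = '''
# global subject name                      local account
"/O=Grid/OU=Example/CN=Example User"       gisuser
"/O=Grid/OU=Example/CN=Second User"        guest
'''

if __name__ == "__main__":
    table = parse_gridmap(EXAMPLE)
    print(local_identity(table, "/O=Grid/OU=Example/CN=Example User"))
```

The point of the design is that each site keeps control of its own table, so a single global credential can map to different local accounts, or to none, on different nodes.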

VMware Workstation is Virtual Machine desktop software, used here to construct a multi-node, cross-domain virtual network system and so to form a Grid entity.

Java CoG (Commodity Grid) Kit is a set of APIs that expose GLOBUS functionality through the Java programming language (Laszewski, 2001).

IBM Grid Application Framework For Java (GAF4J) is an application framework for multi-threaded Java applications that take advantage of Grid resources. It abstracts all Grid semantics from the application logic and provides a simpler programming model that lines up smoothly with common Java programming models. GAF4J abstracts the essentials of interfacing with a Grid infrastructure, which is assumed to be based on the GLOBUS Toolkit (Jhoney et al, 2003). The framework helps Java applications with multi-threaded logic to distribute their threaded tasks for execution over a Grid, instead of starting multiple threads on the same node. In addition, GAF4J can be adopted with minimum impact on the application's existing programming model: the parts of the code where a Java thread object is created and started are replaced by the creation of a task object, which is then submitted for execution over a Grid.

3.2 System Architecture

Based on this Grid computing platform and these development tools, MMG2IS is developed in the object-oriented language Java. Figure 2 illustrates the software architecture. There are six main layers: Grid hardware, VM OS, Distributed resources, GLOBUS platform, Java CoG toolkit, and Grid Aware Java Applications.

[Figure 2 placeholder. Layers from top to bottom: MM Grid GIS Client/Server (Grid Aware Java Applications); Grid Application Framework for Java; Java CoG Tool Kit (GRAM Listener, Proxy, GridFTP Client, UrlCopy); GLOBUS Tool Kit, Grid Infrastructure Services (GRAM, GASS, GSI, GridFTP); Distributed Resources on the Grid (computer cycles, storage, JVM, visualization software); Virtual Machine (VM) and OS Level Grid Entity (gridgis1 node, gridgis2 node, gridgis3 node).]

Figure 2. MMG2IS Software Architecture

3.3 MMG2IS Workflows and Programming Model


As a data-intensive application, the MMG2IS system should strictly follow the workflow defined for the Grid environment. Planning an application that is to be executed in a Grid environment involves two important steps: the Abstract Workflow and the Concrete Workflow (see left of Figure 3) (Deelman, 2003). The Abstract Workflow is formed by selecting and configuring application components, and should comply with the specification in order to generate the desired data products. The Concrete Workflow is formed by selecting specific resources, files and additional jobs; it can be executed in a Grid environment once the locations of the physical files and of the resources are specified. The programming model that follows from this workflow is shown on the right of Figure 3.

[Figure 3 placeholder. Left panel, "Application Development and Execution Process": Application Domain, Application Components Selection, Abstract Workflow, Resource and Transformation Instance Selection, Concrete Workflow, Task Submitting, Execution Environment, with feedback paths labelled "Specify different workflow", "Pick different resource" and "Retry". Right panel: MMG2IS Client, MMG2IS Server, Grid GIS Task01, Grid GIS Task02, Grid GIS Task03, G2IS Executive.]

Figure 3. MMG2IS Workflow (left) & Programming Model (right).
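The step from Abstract Workflow to Concrete Workflow, together with the "Pick different resource" and "Retry" paths in Figure 3, can be sketched as follows. This is a minimal illustration rather than the MMG2IS implementation: the names `AbstractTask`, `ConcreteTask`, `concretise` and `run_with_retry`, the replica catalogue layout and the file paths are all assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class AbstractTask:
    name: str     # application component, e.g. "show_roads"
    dataset: str  # logical data name, with no location attached


@dataclass(frozen=True)
class ConcreteTask:
    name: str
    node: str     # the specific resource chosen
    path: str     # the physical file location on that node


# logical dataset -> list of (node, physical path) replicas
Replicas = dict[str, list[tuple[str, str]]]


def concretise(workflow: list[AbstractTask], replicas: Replicas,
               exclude: frozenset[str] = frozenset()) -> list[ConcreteTask]:
    """Bind every abstract task to a node that holds its dataset."""
    plan = []
    for task in workflow:
        options = [(n, p) for n, p in replicas.get(task.dataset, [])
                   if n not in exclude]
        if not options:
            raise LookupError(f"no usable replica of {task.dataset}")
        node, path = options[0]  # simplest policy: first usable replica
        plan.append(ConcreteTask(task.name, node, path))
    return plan


def run_with_retry(workflow: list[AbstractTask], replicas: Replicas,
                   execute: Callable[[ConcreteTask], bool],
                   max_attempts: int = 3) -> list[ConcreteTask]:
    """Re-plan around failed nodes ("pick different resource", then retry)."""
    failed: set[str] = set()
    for _ in range(max_attempts):
        plan = concretise(workflow, replicas, frozenset(failed))
        bad = {t.node for t in plan if not execute(t)}
        if not bad:
            return plan
        failed |= bad
    raise RuntimeError("no working concrete workflow found")
```

For brevity the sketch re-plans and re-runs the whole workflow after a failure; a production planner would normally keep the results of tasks that already succeeded.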

3.4 System Demonstrations
Before MMG2IS is run, the basic running environment must be configured properly. First, three Grid nodes, named gridgis1, gridgis2 and gridgis3, were set up under the sub-domain gmat.unsw.edu.au on pre-installed VMware Linux guests. At these nodes some spatial data (vector or raster) were placed in the public directory for sharing. Second, the Grid computing platform, GLOBUS (or IBM Grid Toolbox), was installed to form the Spatial Grid Infrastructure (SGI). Third, a user account on the Grid was created using a certificate issued through Simple CA, acting as the trusted Certificate Authority, which grants access to the local or remote Grid (security is at the heart of Grid Computing technology). Fourth, Java CoG and GAF4J, as well as the Java JDK, were installed at the client side. Finally, a list of all the Grid nodes that can be used was created.

Once all these steps have been performed, MMG2IS can be run directly from the command line (see Figure 4). During execution the G2IS server searches for all the nodes that can be used, whether they are in the same domain or in different domains. Here the three nodes gridgis1, gridgis2 and gridgis3 are used to construct the test Spatial Grid Infrastructure (SGI). When all the nodes have been found, the G2IS server assigns different tasks to be executed on them. Three tasks are defined in MMG2IS, and each is executed on one of the nodes. At the same time, a task-monitoring console tracks the progress (see Figure 5). After all three tasks have finished, the data dispersed over the three nodes (vector or raster) are displayed by the G2IS client-side program (see Figure 6).
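The run described above (find the usable nodes, assign one task to each, execute them at the same time and report progress) can be mimicked locally as a sketch. A thread pool stands in for Grid submission here, which is exactly the thread-based pattern that a framework such as GAF4J would replace with task objects submitted to the Grid. The functions `is_available`, `run_task` and `dispatch` are hypothetical placeholders and make no network calls.

```python
from __future__ import annotations

from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable

NODES = ["gridgis1", "gridgis2", "gridgis3"]  # node names from the paper


def is_available(node: str) -> bool:
    # Placeholder probe: a real server would query the node or a directory.
    return True


def run_task(task: str, node: str) -> str:
    # Placeholder for remote execution of one Grid GIS task.
    return f"{task} finished on {node}"


def dispatch(tasks: list[str], nodes: list[str],
             probe: Callable[[str], bool] = is_available,
             runner: Callable[[str, str], str] = run_task,
             ) -> tuple[dict[str, str], dict[str, str]]:
    """Assign tasks round-robin to usable nodes and run them concurrently."""
    usable = [n for n in nodes if probe(n)]
    if not usable:
        raise RuntimeError("no usable Grid node found")
    assignment = {t: usable[i % len(usable)] for i, t in enumerate(tasks)}
    results: dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=len(usable)) as pool:
        futures = {pool.submit(runner, t, n): t for t, n in assignment.items()}
        for done in as_completed(futures):
            task = futures[done]
            results[task] = done.result()
            # Stand-in for the task-monitoring console.
            print(f"[monitor] {task}: {len(results)}/{len(tasks)} complete")
    return assignment, results


if __name__ == "__main__":
    dispatch(["Task01", "Task02", "Task03"], NODES)
```

Round-robin assignment is chosen only because it is the simplest policy that gives one task per node when the counts match; the paper does not state which policy the G2IS server uses.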


Figure 4. MM Grid GIS starts running

Figure 5. G2IS tasks running at the different nodes simultaneously

Figure 6. Spatial image & vector data from different nodes (gridgis1.gmat.unsw.edu.au, gridgis2.gmat.unsw.edu.au, gridgis3.gmat.unsw.edu.au)

4. CONCLUDING REMARKS
Based on the emerging Grid Computing technology, an open, extensible and scalable Multi-Node and Multi-Source Grid GIS (MMG2IS) for data sharing becomes feasible.

In MMG2IS it is very convenient to share all the resources on Grids, e.g. the resources in the Spatial Grid Infrastructure (SGI), including CPU, storage, memory, data management, resource allocation, I/O, etc. Compared to a conventional distributed GIS, a Grid GIS (G2IS) has some distinct advantages:
1) Sharing and seamless integration of spatial data dispersed over multiple nodes and domains become much easier and more effective, because some complex problems of distributed computing, such as system heterogeneity, security mechanisms and dynamic resource management, are hidden and wrapped at the Grid Computing OS level or at the GLOBUS platform level.
2) The programming model for G2IS is more concise than classic distributed programming models such as CORBA, DCOM or the Distributed Agent model, especially if the abstract complex workflow can be mapped to a standardised workflow in the Grid Computing environment using the Abstract and Concrete Workflow Generator (ACWG).
3) Grid GIS is more extensible and scalable as a GIS grows incrementally, because there is no need to pre-install any special software. A computer becomes part of the SGI simply by being connected to the Grid.
4) Grid GIS can provide abundant computational resources, and so has great potential for large-scale spatial data processing, spatial analysis, spatial statistics applications, and so on.
In summary, Grid Computing is the next generation of Internet computing technology, re-defining the concept and connotation of distributed computing. G2IS will make a big impact on distributed GIS.

Acknowledgments
The authors wish to give special thanks to Dr. Gregor von Laszewski, a scientist at Argonne National Laboratory, USA, and one of the primary developers of the Java CoG Toolkit.

REFERENCES
Deelman, E., Blythe, J., et al, 2003. Mapping Abstract Complex Workflows onto Grid Environments, Journal of Grid Computing, 25-39.
Escience-grid, 2003. http://www.escience-grid.org.uk/docs/gridtech/define.htm
Foster, I., & Kesselman, C., 1998. Computational Grids, The Grid: Blueprint for a New Computing Infrastructure, Morgan Kaufmann Publishers.
Foster, I., & Kesselman, C., 2000. Globus: A Metacomputing Infrastructure Toolkit. http://www.globus.org.
Foster, I., Kesselman, C., Nick, J., & Tuecke, S., 2002. The Physiology of the Grid: An Open Grid Services Architecture for Distributed Systems Integration, IEEE Computer, 0018-9162 / 02, 37-46.
Jhoney, A., Kuchhal, M., & Venkatakrishnan, 2003. A Technical White Paper Grid Application Framework for Java (GAF4J), http://www.ibm.com
Laszewski, G., Foster, I., Gawor, J., & Lane, P., 2001. A Java Commodity Grid Kit, Concurrency and Computation: Practice and Experience, 13(8-9), 643-662.
Nemeth, Z., & Sunderam, V., 2003. Characterizing Grids: Attributes, Definitions, and Formalisms, Journal of Grid Computing, 9-23.
Renato, J.A., & Dinda, P., 2003. A Case for Grid Computing on Virtual Machines.
Tsou, M., & Buttenfield, B.P., 1998. Client/Server Components and Metadata Objects for Distributed Geographic Information Services, Proceedings of GIS/LIS '98, Fort Worth, Texas, November 1998, 590-599.
Wang, F., 2000. A Distributed Geographic Information System on the Common Object Request Broker Architecture (CORBA), GeoInformatica, 89-115.
Xue, Y., Wang, J., Sheng, X., & Guo, H., 2003. Building Digital Earth with GRID Computing: The Preliminary Results, Proceedings of Digital Earth 2003, 804-812.
Zaslavsky, I., Memon, A., Petropoulos, M., & Baru, C., 2003. Online Querying of Heterogeneous Distributed Spatial Data on a Grid, Proceedings of Digital Earth, 813-823.