EMPOWERING SCIENTIFIC DISCOVERY BY DISTRIBUTED DATA MINING ON A GRID INFRASTRUCTURE

A PROPOSAL FOR DOCTORAL RESEARCH

by Haimonti Dutta

SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY AT UNIVERSITY OF MARYLAND BALTIMORE COUNTY
1000 HILLTOP CIRCLE, BALTIMORE, MD 21250
JULY 2006

Table of Contents

Abstract

1 Introduction
  1.1 Motivation
  1.2 Proposed Research
  1.3 Objectives

2 Background
  2.1 The Grid
    2.1.1 Introduction
    2.1.2 The Grid Architecture
    2.1.3 Classification of Grids
  2.2 The Data Grid
    2.2.1 Introduction
    2.2.2 Data Distribution Scenarios
    2.2.3 Middleware, Protocols and Services
    2.2.4 Data Mining on the Grid
  2.3 Distributed Data Mining
    2.3.1 Introduction
    2.3.2 Classification
    2.3.3 Clustering
    2.3.4 Distributed Data Stream Mining
  2.4 The Challenges

3 Preliminary Work
  3.1 Introduction
  3.2 Orthogonal Decision Trees
    3.2.1 Decision Trees and the Fourier Representation
    3.2.2 Computing the Fourier Transform of a Decision Tree
    3.2.3 Construction of a Decision Tree from Fourier Spectrum
    3.2.4 Removing Redundancies from Ensembles
    3.2.5 Experimental Results
  3.3 DDM on Data Streams
    3.3.1 Introduction
    3.3.2 Experimental Results
    3.3.3 Monitoring in Resource Constrained Environments
    3.3.4 Grid Based Physiological Data Stream Monitoring - A Dream or Reality?
  3.4 DDM on Federated Databases
    3.4.1 The National Virtual Observatory
    3.4.2 Data Analysis Problem: Analyzing Distributed Virtual Catalogs
    3.4.3 The DEMAC system
    3.4.4 WS-DDM – DDM for Heterogeneously Distributed Sky-Surveys
    3.4.5 WS-CM – Cross-Matching for Heterogeneously Distributed Sky-Surveys
    3.4.6 DDM Algorithms: Definitions and Notation
    3.4.7 Virtual Catalog Principal Component Analysis
    3.4.8 Case Study: Finding Galactic Fundamental Planes
    3.4.9 Summary

4 Future Work
  4.1 The DEMAC system - further explorations
    4.1.1 Grid-enabling DEMAC
    4.1.2 PCA based Outlier Detection on DEMAC
  4.2 Proposed Plan of Research

Bibliography

Abstract

The grid-based computing paradigm has attracted much attention in recent years. The sharing of distributed computing resources (such as software, hardware, data and sensors) is an important aspect of grid computing. Computational Grids focus on methods for handling compute-intensive tasks, while Data Grids are geared towards data-intensive computing. Grid-based computing has been put to use in several application areas including astronomy, chemistry, engineering, climate studies, geology, oceanography, ecology, physics, biology, health sciences and computer science. For example, in the field of biomedical informatics, researchers are building an infrastructure of networked high-performance computers, data integration standards, and other emerging technologies to pave the way for medical researchers to transform the way diseases are treated. In oceanography, efforts are being made to federate ocean observatories into an integrated knowledge grid. Breakthroughs in telescope, detector, and computer technology allow astronomical surveys to produce terabytes of images and catalogs, thereby producing a data avalanche. However, extracting meaningful knowledge from these gigantic, geographically distributed, heterogeneous data repositories requires the development of suitable architectures, sophisticated data mining algorithms and efficient schemes for communication.

This proposal considers research in grid-based distributed data mining. It aims to bring together the relatively new research areas of distributed data mining and grid-based data mining. While architectures for data mining on the grid have already been proposed, we argue that the inherently distributed, heterogeneous nature of the grid calls for distributed data mining. Consequently, research should be geared towards the development of distributed schema integration, query processing, algorithm development and workflow management. As a proof of concept, we first explore the feasibility of executing distributed data mining algorithms on astronomy catalogs obtained from two different sky surveys: the Sloan Digital Sky Survey (SDSS) and the Two Micron All Sky Survey (2MASS). In particular, we examine a technique for cross-matching the indices of different catalogs, thereby aligning them, use a randomized distributed algorithm for principal component analysis, and propose to develop an outlier detection algorithm based on a similar technique. While this serves as a proof of concept, efforts are under way to grid-enable the application. This requires research on service-oriented architectural paradigms to support distributed data mining on the grid.

The data repositories ported to the grid are not all static. Streaming data from web click streams, network intrusion detection applications, sensor networks, wearable devices and multimedia applications are also finding their way onto the grid. This is particularly useful because researchers then do not need to set up or own mobile devices or expensive equipment such as telescopes and satellites, but can access interesting data streams published on the grid. However, in order to discover meaningful knowledge from these distributed, heterogeneous streams, efforts have to be made to build new architectures and algorithms to support distributed data streams. We propose to address these issues to enable data stream mining on the grid.


Chapter 1

Introduction

1.1 Motivation

Advances in science have been guided by the analysis of data. For example, huge genome sequences [132] available online motivate collaborative research in biology; catalogs of sky surveys [242, 3] enable astronomers to answer queries that might otherwise have taken years of observation; high-resolution, long-duration simulation data from experiments and models enables research in climatology, physics, geosciences and chemistry [197, 115, 225]; and advanced imaging capabilities such as Magnetic Resonance Imaging (MRI) and Computed Tomography (CT) scans produce large volumes of data for medical professionals [29]. However, as pointed out by Ian Foster, converting data into scientific discoveries requires "connecting data with people and computers". It involves:

1. Finding the data of interest.
2. Moving the data to desired locations.
3. Managing large scale computations.
4. Scheduling resources on data.
5. Managing who can access the data and when.

For example, the main goal of CERN, the European Organization for Nuclear Research in Geneva, Switzerland, is to study the fundamental structure of matter and the interaction of forces. In particular, subatomic particles are accelerated to nearly the speed of light and then collided. Such collisions are called events and are measured at time intervals of only 25 nanoseconds in four different particle detectors of the Large Hadron Collider (LHC), CERN's "next generation" accelerator, which has started data collection in 2006. According to the MONARC Project (MONARC: Models of Networked Analysis at Regional Centers for LHC experiments, http://monarc.web.cern.ch/MONARC/), each of the four main experiments will produce around 1 Petabyte of data a year over a life span of about two decades.


This data needs to be analysed by about 5,000 physicists around the world. Since CERN experiments are collaborations of over a thousand physicists from many different universities and institutes, the experiments' data is not only stored locally at CERN but is also distributed worldwide in so-called Regional Centres (RCs), in national institutes and universities. Thus, complex distributed computing infrastructures motivate the need for Grid environments.

To extract meaningful information from distributed, heterogeneous data repositories on the grid, sophisticated knowledge discovery architectures have to be designed. The process of data mining on the grid is still a relatively new area of research ([40, 43, 46, 37, 220, 264, 233]). While several architectures have been developed for this purpose, the framework of distributed data mining on the grid infrastructure still has a long way to go. The aim of this proposal is to motivate research in this direction.

1.2 Proposed Research

There has been growing interest in grid computing in recent years (see Section 2.1). Grids can be classified into two main categories: (1) Computational Grids, designed to meet the increasing demand of compute-intensive science, and (2) Data Grids, designed to meet the needs of data-intensive applications. In this proposal, the primary focus is on Data Grids. The objective of setting up a Data Grid is to encapsulate the underlying mechanisms of storage, querying and transfer of data. Thus a user of the grid need not be concerned with the underlying mechanisms of data storage, authentication, authorization, resource management and security, but can still enjoy the benefits of large-scale distributed computing. Several protocols, services and middleware architectures have been proposed for the storage, integration and querying of data on the grid. Of particular interest is the Open Grid Services Architecture - Data Access and Integration (OGSA-DAI) [205] project, which was conceived by the UK Database Task Force and works closely with the Database Access and Integration Services - Working Group (DAIS-WG) of the Global Grid Forum (GGF) and the Globus team. Their aim is to develop a service-based architecture for data access and integration. Several other projects such as the Knowledge Grid [40], GridMiner [220], Discovery Net [264], TeraGrid [257], ADaM (Algorithm Development and Mining) [233] on NASA's Information Power Grid, and the DataCutter project [191] have focused on the creation of middleware and systems for data mining and knowledge discovery on top of the Data Grid. Motivated by this research, we propose to develop service-based architectures for distributed data mining on the grid infrastructure.

We have developed a system for distributed data mining on astronomy catalogs (see Section 3.4) using resources from the National Virtual Observatory. The system demonstrates how distributed data mining algorithms can be designed on top of heterogeneous astronomy catalogs without downloading them to a centralized server. In particular, we examine a randomized algorithm for distributed principal component analysis and provide experimental results to show that this algorithm replicates results obtained in the centralized setting at a lower communication cost (a generic sketch of this idea appears at the end of this section). Encouraged by these results, we also plan to develop a distributed outlier detection algorithm for astronomy catalogs.

We also propose to develop a service-based architecture for distributed stream mining on the grid.

Distributed data streams obtained from network intrusion detection applications, sensor networks, vehicle monitoring systems and web click streams are being ported onto the grid. Mining these inherently distributed, heterogeneous streams on the grid requires the development of new architectures and algorithms, since existing architectures such as those described in Section 2.2.4 may not be suited for streaming data.

Thus the overall focus of our attention is on developing a synergy between distributed data mining and grid-based data mining. We propose to develop service-based architectures for grid-based distributed data mining, relying on application scenarios from astronomy.
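To make the communication-efficiency argument concrete, the following is a minimal, generic sketch of sampling-based distributed PCA over vertically partitioned data, written in Python with NumPy. It is not the exact algorithm evaluated in Section 3.4; the site data, sample size and variable names are illustrative assumptions only.

    import numpy as np

    def sample_based_pca(sites, sample_ids, n_components=2):
        # Each site holds a different block of attributes (vertical partitioning)
        # indexed by a shared object identifier. Sites ship only the rows listed
        # in sample_ids; the coordinator estimates the global covariance from
        # that sample, so communication grows with the sample size rather than
        # with the full catalog size.
        local_blocks = [site[sample_ids, :] for site in sites]   # per-site messages
        sample = np.hstack(local_blocks)                         # coordinator-side join
        cov = np.cov(sample, rowvar=False)
        eigvals, eigvecs = np.linalg.eigh(cov)
        order = np.argsort(eigvals)[::-1]                        # largest variance first
        return eigvals[order][:n_components], eigvecs[:, order[:n_components]]

    rng = np.random.default_rng(0)
    n_objects = 100_000
    site_a = rng.normal(size=(n_objects, 3))   # e.g., optical attributes of each object
    site_b = rng.normal(size=(n_objects, 2))   # e.g., infrared attributes of the same objects
    sample_ids = rng.choice(n_objects, size=2000, replace=False)
    variances, components = sample_based_pca([site_a, site_b], sample_ids)
    print(variances)

Under these assumptions, only 2000 sampled rows travel over the network instead of the full 100,000-row catalogs, which is the kind of communication saving the proposed system aims for.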

1.3 Objectives

The main objectives of the proposed research are as follows:

1. Develop a service-oriented architecture for enabling distributed data mining on the Grid. This includes:

(a) Development of services for distributed schema integration, integration of indices and query processing.

(b) Development of distributed workflows for service composition.

2. Develop a prototype system implementing the above architecture. The application area is astrophysics, and the objectives of building the system are as follows:

(a) Access and integrate federated astronomy databases using the Open Grid Services Architecture - Data Access and Integration (OGSA-DAI) [205] middleware as a starting point. Extending OGSA-DAI to incorporate schema integration and workflow composition are future objectives.

(b) Perform distributed data mining on these repositories, including dimension reduction by Principal Component Analysis (PCA), classification and outlier detection.

(c) Provide a client-side browsing facility that enables astrophysicists to perform distributed data mining on the federated databases without having to manage the details of resource allocation, authorization, authentication and communication among grid resources.

3. Develop a service-oriented architecture for distributed data stream mining on the Grid.

The remainder of this proposal is organized as follows. Chapter 2 offers an overview of the grid infrastructure, with emphasis on the Data Grid, existing architectures for data mining, the integration of streams on the grid, and the challenges for distributed data mining on the grid. Chapter 3 presents our preliminary work on distributed classification using Orthogonal Decision Trees (ODTs) and shows the applicability of ODTs on streaming, resource-constrained devices.

It also presents a feasibility study for distributed scientific data mining on astronomy catalogs. Chapter 4 outlines the directions for future research.


Chapter 2

Background

2.1 The Grid

2.1.1 Introduction

The science of the 21st century requires large amounts of computation power, storage capacity and high-speed communication [124, 99]. These requirements are increasing at an exponential rate, and scientists are demanding much more than is available today. Several astronomy and physical science projects, such as CERN's (Conseil Européen pour la Recherche Nucléaire - the European Organization for Nuclear Research) Large Hadron Collider (LHC) [170], the Sloan Digital Sky Survey (SDSS) [242] and the Two Micron All Sky Survey (2MASS) [3], bioinformatics projects including the Human Genome Project [132] and gene and protein archives [216, 251], and meteorological and environmental surveys [197, 239] are already producing petabytes and terabytes of data that need to be stored, analyzed, queried and transferred to other sites. To work with collaborators at different geographical locations on peta-scale data sets, researchers require communication bandwidth on the order of gigabits per second. Thus computing resources are failing to keep up with the challenges they face. The concept of the "Grid" has been envisioned to provide a solution to these increasing demands and to offer a shared, distributed computing infrastructure. In an early article [99] that motivates the need for Grid computing, Ian Foster describes the Grid "vision": "...to put in place a new international scientific infrastructure with tools that, together, can meet the challenging demands of 21st-century science." Today, much of this dream has become reality, with numerous research projects working on different aspects of grid computing, including the development of the core technologies and the deployment and application of grid technology to different scientific domains (a list of applications in different scientific domains using grid technology can be found at http://www.globus.org/alliance/publications/papers.php). In the following sections, we briefly review the grid architecture and provide a classification of different types of grids. It must be noted that the objective of this proposal is not to provide a detailed overview of grid computing and related issues, but to introduce the concept of mining and knowledge discovery on a data grid (introduced later in Section 2.2.1). Consequently, a reader interested in grid computing should refer to [101] for a detailed overview.

2.1.2 The Grid Architecture

Figure 2.1: The hourglass model (obtained from [99])

The sharing of distributed computing resources, including software, hardware, data and sensors, is an important aspect of grid computing. Sharing can be dynamic depending on the current need, may not be limited to client-server architectures, and the same resources can be used in different ways depending on the objective of sharing. These characteristics and requirements for resource sharing necessitate the formation of Virtual Organizations (VOs) [135]. Thus "VOs enable disparate groups of organizations and / or individuals to share resources in a controlled fashion, so that members may collaborate to achieve a shared goal." [135] An example of a virtual organization is the International Virtual Data Grid Laboratory (iVDGL) [140], an NSF-funded project that aims to share computing resources for experiments in high energy physics [170], gravitational wave searches (LIGO) [173] and astronomy [242].

The architecture for grid computing, henceforth referred to as the grid architecture, is a protocol architecture that outlines how the users of a virtual organization interact with one another for resource sharing. Proposed by Ian Foster and Carl Kesselman [135], the grid architecture follows the principles of an "hourglass model"; Figure 2.1 illustrates this architecture. The "narrow neck" of the hourglass defines the core set of protocols and abstractions, the top contains the high-level behaviors, and the base contains the underlying infrastructure.

Figure 2.2: The Grid Protocol Architecture

The Grid protocol architecture comprises several layers (illustrated in Figure 2.2): the Fabric layer, responsible for local, resource-specific operations; the Connectivity layer, which handles network connections; the Resource layer, containing protocols for sharing single resources; the Collective layer, for co-ordination among underlying resources; and the Applications layer, containing user applications. These layers provide the basic protocols and services that are necessary for the sharing of resources among different groups in a virtual organization. Thus, if the user application is a data mining scenario, the fabric layer would contain the participating computers with their data repositories; the connectivity layer comprises the service discovery, authorization and authentication services and communication; the resource layer provides access to computation and data; and the collective layer provides resource discovery, system monitoring and other application-specific requirements (a small illustrative mapping is sketched at the end of this subsection).

A further enhancement to the protocol-based grid architecture was the Open Grid Services Architecture (OGSA), proposed in [134]. OGSA introduces the concept of Grid Services, which can be regarded as specialized web services that contain interfaces for discovery, dynamic service creation, lifetime management, etc., and conform to the Web Services Description Language (WSDL) specifications. Various VO structures can be configured using the grid service interfaces for creation, registration and discovery. Thus, the use of grid services provides a way to virtualize components in a grid environment and ensures abstraction across different layers of the architecture. An implementation of the OGSA architecture can be found in the current release of the Globus Toolkit 4.0 (http://www.globus.org/toolkit/) [133]. In this section we briefly reviewed the grid protocol architecture and the web-services-based Open Grid Services Architecture (OGSA). The following subsection discusses methods to classify grids.
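As a purely illustrative aid, the mapping of the protocol layers onto the data mining scenario above can be written down as a simple table in code. The layer names follow the hourglass architecture of [135]; the component names are hypothetical placeholders, not part of any real toolkit.

    # Hypothetical mapping of grid protocol layers to the components of a
    # data mining virtual organization (component names are illustrative only).
    GRID_LAYERS = {
        "fabric":       ["participating computers", "data repositories"],
        "connectivity": ["service discovery", "authentication",
                         "authorization", "communication"],
        "resource":     ["access to computation", "access to data"],
        "collective":   ["resource discovery", "system monitoring",
                         "application-specific coordination"],
        "application":  ["distributed data mining client"],
    }

    def components_for(layer):
        # Return the components a given layer contributes in this scenario.
        return GRID_LAYERS[layer]

    print(components_for("collective"))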

2.1.3 Classification of Grids

Classification of grids may be done based on different criteria, such as the kind of services provided, the class of problems they address, or the community of users [127]. However, a common method of discrimination depends on whether they offer computational power (Computational Grids) or data storage (Data Grids).

1. Computational Grid: The computational grid has been defined as "...a hardware and software infrastructure that provides dependable, consistent, pervasive, and inexpensive access to high-end computational capabilities." [100] Computational grids are designed to meet the increased need for computational power by large-scale pooling of resources such as compute cycles, idle CPU time between machines, software services, etc. An example where a computational grid could be put to use is a health maintenance organization of a metropolitan area, requiring collaboration between medical personnel, patients, health insurance representatives, financial experts and administrative personnel. The resources to be shared include high-end compute servers, hundreds of workstations, patient databases, medical imaging archives and medical instrumentation (such as Computed Tomography (CT) scanners, Magnetic Resonance Imaging (MRI), Electrocardiogram (ECG) and ultrasonography equipment). The formation of a computational grid enables computer-aided diagnosis by utilizing information from different medical disciplines, facilitates cross-domain medical research, searching of imaging archives, enhanced recommendation schemes for health insurance facilities, and detection of fraud in financial data (such as hospital bills, insurance claims, etc.).

2. Data Grid: The data grid is primarily geared towards the management of data-intensive applications [64, 17] and focuses on the synthesis of knowledge discovered from geographically distributed data repositories, digital libraries and archives. An example of a data grid would be a collaboration of astronomy sky surveys, such as the Sloan Digital Sky Survey [242] and the Two Micron All Sky Survey [3], which are producing large volumes of astronomical data. The purpose of forming such a data grid would be to enhance astronomy and astrophysics research by making use of distributed data mining and knowledge discovery techniques.

In this proposal we are mainly concerned with data grids and focus on how efficient distributed algorithms can be designed on top of them. The next section offers an overview of the architecture of the data grid, some existing data grids, and efforts made towards the implementation of data mining and knowledge discovery services on the data grid.


2.2 The Data Grid

2.2.1 Introduction

In many scientific domains such as astronomy, high energy physics, climatology, computational genomics, medicine and engineering, large repositories of geographically distributed data are being generated ([265], [13], [14], [267], [192], [136], [228]). Researchers needing to access the data may come from different communities and are often geographically distributed themselves. The use of these repositories as community resources has motivated the need for developing an infrastructure capable of providing storage and replica management facilities, efficient query execution techniques, data transfer schemes, caching and networking. The Data Grid [64] has emerged to provide an architecture for the distributed management and storage of large scientific data sets. The objectives of such an architecture are:

1. To provide a framework in which the low-level mechanisms of storage, data transfer, etc. are well encapsulated.

2. To allow the user to manipulate design issues that can have significant performance implications.

3. To be compatible with a Grid infrastructure and benefit from the grid's facilities for authentication, resource management, security and a uniform information infrastructure.

Figure 2.3: The Data Grid Architecture (adapted from [64] with slight modifications)

Figure 2.3 illustrates the basic components of the Data Grid as envisioned by Chervenak et al. [64]. The core grid services are utilized to provide the basic mechanisms of security, resource management, etc. The high-level components (such as replica management and replica selection) are built on top of these basic grid components. Their work assumes data access and metadata access to be the fundamental services necessary for a data grid architecture. The data access service handles issues related to accessing, managing and transferring data to third parties. The metadata access services are explicitly concerned with the handling of information regarding the data, such as information about how the data was created, how to use it, and how file instances can be mapped to storage locations. Other basic grid services that can be incorporated into a data grid framework include an authorization and authentication infrastructure (such as the Grid Security Infrastructure), resource allocation schemes and performance management. It must be noted that any number of high-level components can be designed by the user using the aforementioned basic grid services; a minimal sketch of a replica catalog, one such high-level component, is given at the end of this subsection.

Several projects, such as GriPhyN (the Grid Physics Network [115]) and the European Data Grid Project ([93]), have already implemented the data grid architecture. In order to harness the petascale data resources obtained from four data-intensive physics experiments (ATLAS and CMS [170], LIGO [173], SDSS [242]), the GriPhyN project conceptualized the ideas of Petascale Virtual Data Grids (PVDGs) and Virtual Data. Petascale Virtual Data Grids [18, 17] are aimed at serving a diverse community of scientists and researchers, enabling them to retrieve, modify and perform experiments and analyses on the data. The idea of virtual data revolves around the creation of a virtual space of data products derived from experimental data. The European Data Grid Project is also motivated by the needs of the High Energy Physics, Earth Observation and Bioinformatics research communities, who need to store, access and process large volumes of data. The architecture [275, 130, 75, 4] of the data grid proposed there is modular in nature and has similar characteristics to the GriPhyN project. The subsystems of the architecture include (in order from bottom to top) Fabric Services, Grid Services, Collective Services, Grid Applications and Local Computing modules. Management of workload distribution, resource sharing and management, monitoring, fault tolerance, and providing an interface between grid services and the underlying storage are some of the functionalities handled by the Fabric Services. The Grid Services typically comprise the SQL database services, authentication and authorization, replica management and service indices (services that allow a large number of decentralized grid components to collaboratively work in virtual data environments). The grid service schedulers and replica managers form the bulk of the Collective Services, while application-specific services are handled in the Grid Applications layer. The Local Computing layer resides outside the Grid infrastructure and typically consists of the desktop machines from which end users access the data grid.

In this section we discussed the motivation, features and components of the data grid architecture and two projects, GriPhyN and the European Data Grid Project, that have implemented this architecture. At this point it is interesting to discuss some of the data distribution scenarios that are commonly seen in the data grid. The next subsection introduces this topic.
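To illustrate what such a high-level component might look like, here is a minimal, hypothetical sketch of a replica catalog in Python. It is not the Globus replica catalog API; the logical names, storage URLs and selection rule are invented for illustration only.

    from collections import defaultdict

    class ReplicaCatalog:
        # Toy replica catalog: maps a logical file name to the physical storage
        # locations that hold a copy of it (a metadata-access concern).

        def __init__(self):
            self._replicas = defaultdict(list)   # logical name -> [storage URLs]

        def register(self, logical_name, storage_url):
            self._replicas[logical_name].append(storage_url)

        def locate(self, logical_name):
            return list(self._replicas[logical_name])

        def select(self, logical_name, cost):
            # Replica selection: pick the copy with the lowest transfer cost
            # according to a caller-supplied cost function (e.g., network distance).
            candidates = self.locate(logical_name)
            if not candidates:
                raise KeyError("no replica registered for " + logical_name)
            return min(candidates, key=cost)

    # Hypothetical usage: two replicas of the same logical catalog file.
    catalog = ReplicaCatalog()
    catalog.register("sky_survey/run42.fits", "gsiftp://site-a.example.org/data/run42.fits")
    catalog.register("sky_survey/run42.fits", "gsiftp://site-b.example.org/data/run42.fits")
    best = catalog.select("sky_survey/run42.fits", cost=len)   # stand-in cost function
    print(best)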


2.2.2 Data Distribution Scenarios

In a data grid, the repositories may contain data in different formats. We discuss several of the data distribution schemes here. (It is assumed that the reader is familiar with the basic steps involved in the development of a web service, and we refrain from a detailed discussion of them.)

1. Centralized Data Source: This is one of the simplest scenarios, since the data can be thought of as residing in a single relational database, a flat file, or unstructured data (XML). Grid / web services needing to access this data source can do so by using metadata to obtain the physical data locations and then making use of the relevant query languages.

2. Distributed Data Source: When the data is assumed to be distributed among different sites, two different scenarios can arise.

(a) Horizontally Partitioned Data: Horizontal partitioning ensures that each site contains exactly the same set of attributes. Note that we refer to the data as horizontally partitioned with respect to a virtual global table. An example of horizontally partitioned data could be a departmental store chain, such as Walmart, which has shops at different geographical locations. Each shop maintains information about its customers such as name, address, telephone number and products purchased. Although the shops are geographically distributed, each database keeps track of exactly the same information about its customers.

(b) Vertically Partitioned Data: Vertical partitioning requires that different attributes are observed at different sites. The matching between the tuples can be determined using a unique identifier or key that is shared among all the sites. An example of vertically partitioned data is given by different sky surveys, such as SDSS [242], 2MASS [3], Deep [77] and CfA [51], all observing different attributes of the same objects seen in the sky. (A minimal sketch contrasting the two partitioning schemes appears at the end of this subsection.)

In either case, horizontal or vertical partitioning of data, grid services can provide a level of abstraction and encapsulation so that the user is not burdened with writing custom code to access the data. Due to the different distribution schemes available, porting databases to the Grid requires the development of new protocols, services and middleware architectures. We discuss some of the relevant work in this area in the following section.
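The following minimal sketch (with made-up column names and values) contrasts the two partitioning schemes with respect to a virtual global table:

    # A virtual global table: one row per object, keyed by a shared object id.
    global_table = [
        {"obj_id": 1, "ra": 10.1, "dec": -5.2, "optical_mag": 17.3, "infrared_mag": 15.9},
        {"obj_id": 2, "ra": 11.4, "dec": -4.8, "optical_mag": 18.0, "infrared_mag": 16.4},
        {"obj_id": 3, "ra": 12.0, "dec": -5.0, "optical_mag": 16.7, "infrared_mag": 15.1},
    ]

    # Horizontal partitioning: every site stores the same attributes,
    # but a different subset of the rows.
    site_1_rows = [row for row in global_table if row["obj_id"] <= 2]
    site_2_rows = [row for row in global_table if row["obj_id"] > 2]

    # Vertical partitioning: every site stores all rows, but only some of the
    # attributes, plus the shared key needed to re-align tuples across sites.
    optical_site  = [{"obj_id": r["obj_id"], "ra": r["ra"], "dec": r["dec"],
                      "optical_mag": r["optical_mag"]} for r in global_table]
    infrared_site = [{"obj_id": r["obj_id"],
                      "infrared_mag": r["infrared_mag"]} for r in global_table]

    # Reconstructing a global row requires a join on the shared key.
    joined = {r["obj_id"]: dict(r) for r in optical_site}
    for r in infrared_site:
        joined[r["obj_id"]].update(r)
    assert joined[1] == global_table[0]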

2.2.3 Middleware, Protocols and Services

As noted in Section 2.2.1, the objective of developing a data grid is to encapsulate the low-level mechanisms of storage, integration and querying of data stored at different geographical locations in tables, archives, libraries and other repositories. The data grid community has to develop techniques to handle heterogeneous data repositories in an integrated way. Thus, the infrastructure should support standard protocols and services and build re-usable middleware components for the storage, access and integration of data.


These are areas of active research, and we briefly discuss them in this section.

GridFTP: The data transfer protocol GridFTP [9, 269] was designed to provide secure and efficient data movement in the Grid environment. In many applications on the data grid [257, 205, 274, 86], GridFTP is used as the protocol for data transfer between different components of the system. It is an extension of the standard FTP protocol. In addition to the standard features of FTP, GridFTP also provides Grid Security Infrastructure (GSI) and Kerberos support, third-party control of data transfer, and parallel, striped and partial data transfer. While a detailed discussion of the protocol specification is outside the scope of this proposal, the interested reader should refer to [268] for further details.

Several other projects have resorted to the development of middleware for data access and integration. A relational database middleware approach ([270]) and service-oriented approaches ([205, 69, 68]) have already been proposed. In the Spitfire project [246] within the European Data Grid project [93], grid-enabled middleware services have been used to access relational data tables. The client and the middleware service communicate using XML over GSI-enabled secure HTTP. The middleware service and the relational databases communicate using JDBC / ODBC calls. While this approach is interesting, it is limited to the model of queries and transactions [271] and requires a lot of application-dependent code to be written by the programmers themselves. This creates a lack of portability among different databases and does not provide a metadata-driven approach as advocated by the data grid architecture.

In contrast to the relational database middleware approach, the service-oriented approach focuses on providing services for generic database functionality such as querying, transactions, etc. This introduces a level of abstraction or encapsulation, since the service descriptions contain definitions of what functionality is available to a user without specifying how it is implemented in the underlying system. Thus a virtual database service may provide the illusion that a single database is being accessed, whereas in fact the underlying service can access several different types of data repositories (a toy illustration of this kind of encapsulation is sketched later in this subsection).

The Open Grid Services Architecture - Data Access and Integration (OGSA-DAI) project [205, 185, 161], conceived by the UK Database Task Force (OGSA-DAI works closely with the Database Access and Integration Services - Working Group (DAIS-WG) of the Global Grid Forum (GGF), the Open Middleware Infrastructure Institute (OMII) and the Globus team), is developing a common middleware solution to be used by different organizations, allowing uniform access to data resources using a service-based architecture. The project aims to expose different types of data resources (including relational and unstructured data) to grids, allow data integration, provide a way of querying, updating, transforming and delivering data via web services, and provide metadata about the data resources to be accessed.

Figure 2.4: The Architecture of the OGSA-DAI Services (adapted from [161])

The architecture of the OGSA-DAI infrastructure, illustrated in Figure 2.4, depends on three main types of services:

1. Data Access and Integration Service Group Registry (DAISGR): The purpose of this service is to publish and locate metadata regarding the data resources and other services available. Thus, clients can use the DAISGR to query the metadata of registered services and select the service that best suits their requirements.

2. Grid Data Service Factory (GDSF): It acts as an access point to data resources and allows the creation of Grid Data Services (GDS).

3. Grid Data Service (GDS): It acts as a transient access point for the data source. Clients can access data resources using the GDS.

When the service container starts up, the DAISGR is invoked and instantiated. On creation, a GDSF may register as a service with the DAISGR, enabling the discovery of other services and data resources using relevant metadata. The Grid Data Services are invoked at the request of clients wanting to access a particular resource. It is interesting to note that several different types of data resources are supported by OGSA-DAI, including Oracle, MySQL, DB2, SQLServer, PostgreSQL, Cloudscape, IBM Content Manager and even data streams. The infrastructure developed by the OGSA-DAI project is a popular data access and integration service for developing data grids (Grid toolkits such as Globus GT3.0 and Unicore, http://europar.upb.de/tutorials/tutorial03.html, do not offer uniform data access using web services) and has been used in several astronomy, bioinformatics, medical research, meteorology and geo-science applications (a complete listing of projects using the OGSA-DAI software is available at http://www.ogsadai.org.uk/about/projects.php). More recently, a service-based Distributed Query Processor (DQP) has been developed to work with OGSA-DAI. OGSA-DQP extends OGSA-DAI by incorporating two new services: (1) the Grid Distributed Query Service (GDQS), which compiles, optimizes, partitions and schedules distributed query execution plans over multiple execution nodes in the Grid, and (2) the Grid Query Evaluation Service (GQES), which evaluates the query execution plan partitions assigned to it by the GDQS.
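The level of encapsulation that such service-oriented middleware provides can be illustrated with a small, hypothetical sketch. This is not the OGSA-DAI interface; the class and method names are invented. A single "virtual" service front-end answers queries while hiding whether the data lives in a relational database or a flat file.

    import csv
    import io
    import sqlite3

    class RelationalResource:
        # Backend 1: a relational database (here an in-memory SQLite table).
        def __init__(self):
            self._db = sqlite3.connect(":memory:")
            self._db.execute("CREATE TABLE stars (obj_id INTEGER, mag REAL)")
            self._db.executemany("INSERT INTO stars VALUES (?, ?)",
                                 [(1, 17.3), (2, 18.0)])

        def rows(self):
            return self._db.execute("SELECT obj_id, mag FROM stars").fetchall()

    class FlatFileResource:
        # Backend 2: the same kind of records stored as CSV text.
        def __init__(self, text):
            self._text = text

        def rows(self):
            reader = csv.reader(io.StringIO(self._text))
            return [(int(obj_id), float(mag)) for obj_id, mag in reader]

    class VirtualDataService:
        # Front-end: clients see one 'perform' operation and never learn
        # which physical resource actually answered the request.
        def __init__(self, resources):
            self._resources = resources

        def perform(self, min_mag):
            result = []
            for resource in self._resources:
                result.extend(r for r in resource.rows() if r[1] >= min_mag)
            return result

    service = VirtualDataService([RelationalResource(),
                                  FlatFileResource("3,16.7\n4,19.2\n")])
    print(service.perform(min_mag=17.0))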


It is interesting to note that none of the above-mentioned projects yet meets the need for schema integration. Traditionally, in the database community, data integration [168] has been defined as "...the problem of combining data residing at different sources, and providing the user with a unified view of this data." This means that the process of data integration involves modelling the relation between the individual database schemas and the global schema obtained by integrating them. Given that the grid could contain horizontally or vertically partitioned data as mentioned in Section 2.2.2, as well as unstructured data and data streams, the problem of data integration becomes a non-trivial one (an example would be cross-matching heterogeneously distributed astronomy catalogs from sky surveys, described in Section 3.4.5). A decentralized, service-based data integration architecture for Grid databases has been proposed in the Grid Data Integration System (GDIS) [70]. It uses the middleware architecture provided by OGSA-DQP, OGSA-DAI and the Globus Toolkit. It is based on the Peer Database Management System (PDMS) [10], a P2P-based decentralized data management architecture for supporting data integration in relational databases. The basic idea is that any peer in the PDMS can contribute data, schema information, or mappings between schemas, forming an arbitrary graph of interconnected schemas. The GDIS system offers a wrapper / mediator based approach to integrate the data sources.

The process of data storage, access and integration on the grid is still an area of active research. The existing protocols, services and middleware architectures described in this section aim to solve related problems, but the area remains open for research. As the architectures evolve, researchers have also focused on data mining and knowledge discovery on the data grid infrastructure. The next section reviews this topic in some detail.

2.2.4 Data Mining on the Grid

Several research projects, including the Knowledge Grid [40, 43, 46, 37], GridMiner [220], Discovery Net [264, 1], TeraGrid [257], ADaM (Algorithm Development and Mining) [233] on NASA's Information Power Grid, and the DataCutter project [191], have focused on the creation of middleware and systems for data mining and knowledge discovery on top of the data grid. We briefly review related work in this area.

1. The Knowledge Grid: Built on top of a grid environment, it uses basic grid services such as authentication, resource management, communication and information sharing to extract useful patterns, models and trends from large data repositories. It is organized into two layers, the Core K-grid layer and the High-level K-grid layer, the latter implemented on top of the core layer. The Core K-grid layer is responsible for the management of metadata describing data sources, third-party data mining tools, algorithms and visualization. It comprises two main services: the Knowledge Discovery Services (KDS) and the Resource Allocation and Execution Management (RAEM) service. The Knowledge Discovery Services extend the basic Globus monitoring services and are responsible for managing metadata regarding which data repositories are to be mined, data manipulation and pre-processing, and certain specific execution plans. The information thus managed is stored in three repositories: the Knowledge Metadata Repository (KMR), which contains metadata regarding the data, software and tools; the Knowledge Base Repository (KBR), which stores learned knowledge; and the Knowledge Execution Plan Repository (KEPR), which keeps track of the execution plans for a knowledge discovery process. The interested reader is referred to [71, 252, 45, 41, 42, 39] for further details regarding each of these repositories. The RAEM finds mappings between execution plans and the resources available on the grid. The High-level K-grid layer, built on top of the core layer, mainly includes services used to compose, execute and validate specific distributed data mining operations. The main services provided by it include data access services, tools and algorithms access services, execution plan management services and a results representation service.

Figure 2.5: The Knowledge Grid Architecture (adapted from [43])

Figure 2.5 illustrates the basic components. The architecture has been implemented in a toolset named VEGA (Visual Environment for Grid Applications) [73], which is responsible for task composition, consistency checking and generation of the execution plan. A visual interface for workflow management is an attractive feature. However, it appears that a more fundamental problem is to design an algorithm, or the steps required, to perform distributed workflow management. As of now a distributed workflow management scheme does not exist, and this appears to be a very important need for the Grid community. It would be interesting to see whether the workflow manager designed by the authors can be extended to a distributed workflow management scheme. While the authors take particular care in describing an elaborate architecture, the distributed data mining algorithms implemented on this architecture appear rather limited and inadequate for the current context.

They perform experiments on network intrusion data, the size of which was reported to be about 712 MB. This is a fairly small dataset considering that they are trying to make a case for a grid mining scenario. The data mining tasks described also appear to be fairly straightforward and do not include a real distributed algorithm requiring extensive communication or synchronization (a recent work [72] provides a meta-learning example, but a completely implemented system is still under development). It would be really useful to see how the current system scales with real distributed algorithms (such as clustering algorithms requiring multiple rounds of communication, or distributed association rule mining algorithms) and larger datasets stored at different geographical locations; a minimal sketch of such a multi-round clustering algorithm is given below.
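For concreteness, the following is a minimal sketch (in Python/NumPy, with invented data) of the kind of multi-round distributed clustering algorithm referred to above: each site summarizes its horizontally partitioned data against the current centroids, only the small summaries are communicated, and the coordinator merges them before the next round. This is a generic distributed k-means outline, not an algorithm implemented by any of the surveyed systems.

    import numpy as np

    def local_summary(data, centroids):
        # Run at each site: assign local points to the nearest centroid and
        # return only (sum of points, count) per cluster -- the message that
        # would travel over the network.
        distances = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        k, dim = centroids.shape
        sums = np.zeros((k, dim))
        counts = np.zeros(k)
        for j in range(k):
            members = data[labels == j]
            sums[j] = members.sum(axis=0)
            counts[j] = len(members)
        return sums, counts

    def distributed_kmeans(sites, k=3, rounds=10, seed=0):
        rng = np.random.default_rng(seed)
        centroids = rng.normal(size=(k, sites[0].shape[1]))
        for _ in range(rounds):                       # one communication round per iteration
            summaries = [local_summary(site, centroids) for site in sites]
            total_sums = sum(s for s, _ in summaries)
            total_counts = sum(c for _, c in summaries)
            nonempty = total_counts > 0
            centroids[nonempty] = total_sums[nonempty] / total_counts[nonempty][:, None]
        return centroids

    rng = np.random.default_rng(1)
    # Three sites, each holding a horizontal partition of the same feature space.
    sites = [rng.normal(loc=c, size=(500, 2)) for c in (0.0, 5.0, 10.0)]
    print(distributed_kmeans(sites, k=3))

The point of the sketch is that each round moves only k centroid summaries per site, yet the algorithm still requires repeated synchronization across geographically distributed sites, which is exactly the workload the surveyed architectures are not yet shown to handle.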

Figure 2.6: The GridMiner Components

2. GridMiner: This project aims to integrate grid services, data mining and On-Line Analytical Processing (OLAP) technologies. The GridMiner-Core framework [127, 220] is built on top of the Open Grid Services Infrastructure (OGSI) and uses the services provided by it. On top of this infrastructure the following components are built: (1) the GridMiner Information Service (GMIS), which is responsible for collecting and aggregating service data from all available grid services and has query interfaces for resource discovery and monitoring; (2) the GridMiner Logging and Bookkeeping Service (GMLB), which collects scheduling information, resource reservations and allocations, logging and error handling; (3) the GridMiner Resource Broker (GMRB), which is responsible for workload management and grid resource management; and (4) the Grid Data Mediation System (GDMS) [219], which is responsible for accessing and manipulating the data repositories by providing an API capable of abstracting the process of data access from the repositories to the higher-level knowledge discovery services. The GDMS comprises several components:


(a) the GridMiner Service Factory (GMSF), which provides a service creation facility; (b) the GridMiner Service Registry (GMSR), which provides a directory facility for the OGSA-DAI services; (c) the GridMiner Data Mining Service (GMDMS), which provides the data mining algorithms and tools; (d) the GridMiner PreProcessing Service (GMPPS), which encapsulates the functionality needed for pre-processing the data; and (e) the GridMiner Presentation Service (GMPRS), which provides facilities for the visualization of models. Two further components are (5) Replica Management, comprising a GridMiner Replica Manager (GMRM) and a GridMiner Replica Catalog (GMRC), and (6) the GridMiner Orchestration Service (GMOrchS), an optional component that is capable of aggregating a sequence of data mining operations into a job. It acts as a workflow engine that executes the steps involved in the complete data mining task (either sequentially or in parallel) and provides an easy mechanism for handling long-running jobs. Figure 2.6, adapted from [127], illustrates the components of the GridMiner architecture. A prototype application, Distributed Grid-enabled Induction of Decision Trees (DIGIDT) [128, 127], has been developed to run on GridMiner. It is based on concepts introduced in SPRINT [244] but has a modified data partitioning scheme and workload management strategy. The interested reader is referred to [127, 220] for details of the implementation of the algorithm and the performance results presented therein. It must be noted that this algorithm closely resembles a truly distributed scenario as described in Section 2.3, but it appears to be capable of handling only homogeneously partitioned data.

3. Discovery Net: The Discovery Net (DNET) [1, 264] project aims to build a platform for scientific discovery from data collected by high-throughput devices. The infrastructure is being used by scientists from three different application domains: Life Sciences, Environmental Monitoring and Geo-hazard Modelling. The DNET architecture develops High Throughput Sensing (HTS) applications by using the Kensington Discovery Platform on top of the Globus services. The knowledge discovery process is based on a workflow model. The services provided are treated as black boxes with known inputs and outputs and are then strung together into a sequence of operations. The architecture allows users to construct their own workflows and integrate data and analysis services. A unique feature is the Discovery Process Markup Language (DPML) [143], an XML-based representation of the workflows. Processes created with DPML are re-usable and can be shared as new services on the Grid by other scientists. Abstraction of workflows and their encapsulation as new services can be achieved with the Discovery Net Deployment Tool. A workflow warehouse [143] acts as a repository of user workflows and allows querying and meta-analysis of workflows. Another interesting feature is the InfoGrid [200] infrastructure, which allows dynamic access to and integration of various data sets in the workflows. Interfaces to SQL databases, OGSA-DAI sources, Oracle databases and custom-designed wrappers are built to enable data integration. The interested reader is referred to [1, 264, 200, 143] for detailed descriptions of the architecture, components and applications of DNET.

4. TeraGrid: The TeraGrid project aims to provide a "CyberInfrastructure" [258, 122] by making use of resources available at four main sites: the San Diego Supercomputer Center (SDSC), Argonne National Laboratory (ANL), Caltech and the National Center for Supercomputing Applications (NCSA).

The architecture of the TeraGrid project makes use of existing Grid software technologies and builds a "virtual" system comprising independent resources at different sites. It consists of two different layers: the basic software components (Grid services) and the application services (TeraGrid Application Services) implemented using these components. The objective is to build a knowledge grid [26] throughout the science and engineering community. For example, the Biomedical Informatics Research Group has been formed to allow researchers at geographically different locations to share and access brain image data and extract useful patterns and models from them. This enables TeraGrid to act as a knowledge grid in the biomedical informatics domain. A knowledge grid, thus conceived, is "the convergence of a comprehensive computational infrastructure along with scientific data collections and applications for routinely supporting the synthesis of knowledge from that data" [26]. In September 2004, the deployment of TeraGrid was completed, enabling access to 40 teraflops of computing power, 2 petabytes of storage, specialized data analysis and visualization schemes, and high-speed network access.

5. Algorithm Development and Mining (ADaM): The ADaM toolkit [233, 238], conceptualized on NASA's Information Power Grid, has been developed by the Information Technology and Systems Center (ITSC) at the University of Alabama in Huntsville. It consists of over 100 data mining and image processing components and is primarily designed for scientific and remote sensing data. The ADaM toolkit has been grid-enabled by making use of the Globus and Condor-G frameworks. Several projects, including the Modeling Environment for Atmospheric Discovery (MEAD) [229] and Linked Environments for Atmospheric Discovery (LEAD) [239], have made use of the grid-enabled toolkit for data mining operations including classification, clustering, association rule mining, optimization, image processing and segmentation, shape detection and filtering. In MEAD the goal is to develop a cyberinfrastructure for storm and hurricane research, allowing users to configure, model and mine simulated data, perform retrospective analysis of meteorological phenomena, and visualize large models.

6. DataCutter: The DataCutter project enables the processing of scientific datasets stored in archival storage systems across a wide-area network. It has a core set of services on top of which application developers can build services on a need basis. The main design objective is to enable range queries and custom-defined aggregations and transformations on distributed subsets of data. The system is modular in nature and contains client components that interact with clients and obtain multi-dimensional range queries from them. The data access services enable low-level I/O support and provide access to archival storage systems. The indexing module allows hierarchical multidimensional indexing on datasets, including R-trees and their variants and other sophisticated spatial indexing schemes. The purpose of the filtering module is to provide an effective way of subsetting and data aggregation. The DataCutter project, however, does not provide support for extensive distributed data mining facilities. It also does not support distributed stream-based applications.

Grid-Enabled WEKA

Research has also been done to grid-enable WEKA [273], a popular Java-based machine learning toolkit. Some of the projects working towards this objective include Weka4WS [274, 86], GridWeka [231, 114], the Federated Analysis Environment for Heterogeneous Intelligent Mining (FAEHIM) [94] and WekaG [184]. We briefly summarize the contribution of each of these projects.

1. Weka4WS: The goal of this project is to support the execution of data mining algorithms on remote Grid nodes by exposing the Weka library as web services using the Web Services Resource Framework (WSRF, http://ws.apache.org/wsrf/wsrf.html). The architecture of Weka4WS comprises three kinds of nodes: storage nodes, which contain the datasets to be mined; compute nodes, on which the data mining algorithms are run; and user nodes, which are the local machines of users. Local data mining tasks at a grid node are computed using the Weka library resident at that particular node, while remote computations are routed through the user nodes. The compute nodes contain web services compliant with WSRF and are therefore capable of exposing the data mining algorithms implemented in the Weka library as a service. GridFTP servers are executed on each storage node to allow data transfer. While the architecture for grid-enabling Weka used here is interesting, it appears to have some disadvantages. First, the authors have limited themselves to the use of Weka data mining algorithms only. Thus they are unable to use a truly distributed data mining algorithm such as those described in Section 2.3, and the framework appears to be running centralized data mining algorithms at different grid nodes. It is also unclear how these algorithms adapt to a heterogeneous data partitioning scheme as described in Section 2.2.2. Second, the use of a GridFTP server on the storage nodes is also restrictive, since it does not allow complete flexibility in the type of data resources used (for example, it is unclear how unstructured data is handled by the GridFTP scheme).

2. GridWeka: This is ongoing work at the University of Dublin, which aims to distribute Weka data mining algorithms (in particular Weka classifiers) over computers in an ad-hoc Grid. A client-server architecture is proposed such that all machines that are part of the Weka Grid have to implement the Weka Server. The client is responsible for accepting a learning task and input data, distributing the learning task, load balancing, monitoring fault tolerance of the system, and crash recovery. Tasks that can be done using GridWeka include building a classifier on a remote machine, labelling a dataset using a previously built classifier, cross-validation and testing. Needless to say, the client-server architecture is not ideally suited to the Grid, and there are no service-oriented schemes in place yet for making use of the basic grid features such as security, resource management, etc.

3. FAEHIM: The Federated Analysis Environment for Heterogeneous Intelligent Mining (FAEHIM) project [8] aims to provide a data mining toolkit using web services and the Triana problem-solving environment [255, 260].


The primary data mining activities supported include classification, clustering, association rule mining and visualization modules. The basic functionality is derived from the Weka library and converted into a set of web services. Thus the toolkit consists of a set of data mining services, tools to interact with the services, and a workflow management system to assemble the services and tools. This project appears to be very similar to the Weka4WS project mentioned above.

4. WekaG: The WekaG toolkit aims at adapting the Weka toolkit to the Grid using a client-server architecture. The server side implements the data mining algorithms, while the WekaG client is responsible for creating instances of grid services and acts as an interface to users. The authors describe WekaG as an implementation of a more general architecture called the Data Mining Grid Architecture (DMGA), which is geared towards coupling data sources and provides facilities for authorization, data discovery based on metadata, and the planning and scheduling of resources. A prototype of this toolkit has been developed using the Apriori algorithm integrated into the Globus Toolkit 3. Future work for the project is aimed at the development of other data mining algorithms for the Grid and compatibility with the WSRF technology.

Next-generation grid-based systems are moving towards P2P and Semantic Grids [218, 254, 253, 138, 44]. However, we will leave this area virtually untouched in this proposal, considering that work in the area is still in its nascent stages. Another interesting direction of research is privacy-preserving data mining on grids. While there is a need to solve problems in this arena, little [21] work has been done. The next section explores architectures and applications of data streams on the grid.

Grid Data Streams

Application areas like e-science (AstroGrid [15], GridPP [113]), e-health (telemedicine [256], mobile medical monitoring [194]) and e-business (INWA [139]) produce distributed streaming data. This data needs to be analyzed, and one way to ensure that researchers have easy access to streams is to port them to the grid [230]. In this way, individuals do not need to set up mobile devices or expensive equipment such as telescopes and satellites, but can access interesting data streams published on the Grid. In the Equator Medical Devices Project [50], for example, the authors have adapted the Globus Grid Toolkit (GT3) to support remote medical monitoring applications on the Grid. Two different medical devices - the monitoring jacket and the blood glucose monitor - are made available on the grid as services. Data miners can access the data for the purpose of knowledge discovery and pattern recognition without having to go through the trouble of setting up an environment for collecting the data or even owning the equipment themselves. Several other advantages of porting data streams to the grid include sharing of data on-the-fly, easy storage of large streams, and reduction in network traffic. In recent years, different architectures have been proposed for porting data streams to the grid [259, 221, 223, 222, 25, 61, 176].


and applications, noting that none of these architectures incorporates distributed data mining facilities on grid data streams.

Figure 2.7: The Virtual Stream Store

Plale et al. [222, 221] propose a model for bringing data streams to the grid, based on the ability of stream systems to act as a data resource. They argue that it is possible to treat each stream source as a grid service, but the approach may not scale to the entire range of data stream generation devices, from large hadron colliders in physics experiments to tiny motes in sensor networks. Thus an architecture for porting streams to the grid must cater to the needs of different types of data stream generation devices. The model proposed is based on three main assumptions: (1) data streams can be aggregated; (2) they can be accessed through database operations and query languages; (3) it is possible to access streams as grid services. The main motivation for proposing such a model comes from the fact that data streams can be viewed as indefinite sequences of time-sequenced events and can be treated as a data resource like a database. This enables querying of global snapshots of streams and the development of a virtual stream store as the architectural basis of stream resources. A virtual stream store is defined as follows: "...collection of distributed, domain-related data streams and set of computational resources capable of executing queries on the streams. The virtual stream store is accessed through a grid service. The grid service provides query access to the data streams for clients." [221] The concept of the virtual stream store is illustrated in Figure 2.7, adapted from [221]. It comprises nine data streams and a set of computational resources. The computational resources are located very close to the streams (indicated by S in the figure), but they could also act as stream generators. In general, it is not necessary for the generators to be a part of the virtual stream store. The model can act like a database

system for data stream stores, having access to modified SQL-type query languages. This architecture has been integrated into the dQUOB [223] and Calder systems [201, 277, 202]. While it is a first step towards integrating data streams with the grid, it has some drawbacks. It appears that this architecture does not consider the heterogeneity of stream data, and it is unclear how continuous queries can deal with heterogeneous streaming data15. The Grid-based AdapTive Execution on Streams (GATES) project [61] aims to design and develop middleware for processing distributed data streams. The system is built using the Open Grid Services Architecture (OGSA) and GT3. It offers a high-level interface that allows users to specify the algorithm(s) and the steps involved in processing data streams without being concerned with resource discovery mechanisms, scheduling or allocating grid-based services. Hence the system is "self-resource-discovering". The authors also refer to the system as "self-adapting", since a high degree of accuracy is obtained in analyzing data streams by tweaking certain parameters such as sampling rates, summary structures or algorithms. The goal of the self adaptation algorithm is to provide real time stream processing while keeping the analysis as precise as possible. This is obtained by maintaining a queuing network model of the system. It appears to be very close to the dQUOB system [223], although their scheme has capabilities for resource discovery and adaptation in distributed environments. In StreamGlobe [230], the authors propose the processing and sharing of data streams on Grid-based P2P infrastructures. The motivation is derived from an astrophysical e-science application. The key features of the system include: (1) publishing data and retrieving information by interactively registering peers, data streams, and subscriptions; (2) sharing of existing data streams in the network, thereby providing optimization and routing facilities; (3) network traffic management capabilities by preventing overloading. This is a relatively new project and future work in the area is geared towards providing support for subscriptions with multiple input data streams and joins. Benford et al. [25] describe their experiences in monitoring life processes in frozen lakes in the Antarctic. They deploy remote monitoring devices on lakes of interest which send data to base stations over satellite phone networks. Integration of the sensing devices into a grid infrastructure as services enables archiving of sensor measurements. The complete system consists of several components, including (1) the Antarctic sensing device deployed on the icy surface; (2) a satellite telephony network to a base computer where the raw data is pre-processed; (3) an OGSA compliant web service that makes the sensing device and its data available on the Grid; (4) the data archived in a Grid accessible data repository; (5) the data analysis and visualization components of interest to the Antarctic scientist. While hurdles like erroneous remote sensor readings and software and hardware failures due to extreme weather conditions are still being sorted out, this system provides a proof of concept of data analysis on streams ported to the Grid. This section emphasizes the idea that much interest has gone into developing architectures for supporting streams on the Grid. While the architectures themselves are in a nascent stage, even less research has been done to develop data mining algorithms on grid data streams.
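To indicate what even a very small mining component on such streams might look like, the sketch below consumes readings from a stand-in stream source (a plain generator here; in a grid setting this role would be played by whatever service publishes the stream) and maintains an incremental mean and variance to flag unusually large readings on the fly. It is purely illustrative and not part of any of the systems cited above.

```python
import math
import random

def grid_stream(n=500):
    """Stand-in for a stream exposed by a grid service (e.g. sensor readings)."""
    random.seed(42)
    for _ in range(n):
        # Occasional large spikes simulate anomalous readings.
        yield random.gauss(20.0, 2.0) + (30.0 if random.random() < 0.01 else 0.0)

# Welford's incremental mean/variance: one pass, O(1) state per stream.
count, mean, m2 = 0, 0.0, 0.0
for reading in grid_stream():
    count += 1
    delta = reading - mean
    mean += delta / count
    m2 += delta * (reading - mean)
    if count > 30:                       # wait for a stable baseline
        std = math.sqrt(m2 / (count - 1))
        if abs(reading - mean) > 4 * std:
            print(f"outlier at record {count}: {reading:.2f}")
```

The point of the sketch is only that such analysis can run with constant memory as records arrive, which is the regime grid stream architectures would need to support.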
However, the need for knowledge discovery mechanisms is inevitable. Keeping this in mind, we explore the relatively new area of distributed data mining in the next section.

15 More recent work of the group includes studying the feasibility of continuous query grid services [202].

2.3 Distributed Data Mining

2.3.1 Introduction

A primary motivation for Distributed Data Mining (DDM), discussed in the literature and in this proposal, is that a lot of data is inherently distributed. Merging remote data at a central site in order to perform data mining results in unnecessary communication overhead and algorithmic complexity. As pointed out in [226], "Building a monolithic database, in order to perform non-distributed data mining, may be infeasible or simply impossible" (pg 4). For example, consider the NASA Earth Observing System Data and Information System (EOSDIS), which manages data from earth science research satellites and field measurement programs. It provides data archiving, distribution, and information management services and holds more than 1450 datasets that are stored and managed at many sites throughout the United States. It manages extraordinary rates and volumes of scientific data. For example, the Terra spacecraft produces 194 gigabytes (GB) per day; the data downlink runs at 150 Megabits/sec and the average amount of data collected per orbit is 18.36 Megabits/sec16. A centralized data mining system may not be adequate in such a dynamic, distributed environment. Indeed, the resources required to transfer and merge the data at a centralized site may become impractical at such a rapid rate of data arrival. Data mining techniques that minimize communication between sites are therefore quite valuable.

16 This information has been obtained from http://spsosun.gsfc.nasa.gov/eosinfo/EOSDIS_Site/index.html

Simply put, DDM is data mining where the data and computation are spread over many independent sites. For some applications, the distributed setting is more natural than the centralized one because the data is inherently distributed. Typically, in a DDM environment, each site has its own data source on which data mining algorithms operate, producing local models. Each local model represents knowledge learned from the local data source, but could lack globally meaningful knowledge. Thus the sites need to communicate by message passing over a network in order to keep track of the global information. For example, a DDM environment could have sites representing independent organizations whose operation and data collection have nothing to do with each other and who communicate over the Internet. Typically, communication is a bottleneck. Since communication is assumed to be carried out exclusively by message passing, a primary goal of many DDM methods in the literature is to minimize the number of messages sent. Some methods also attempt to load-balance across sites to prevent performance from being dominated by the time and space usage of any individual site.

In the following sections we briefly review DDM algorithms for classification and clustering. In a subsequent section (2.3.4) we also give an overview of stream data mining, which has been receiving increasing attention in the last ten years. Since the focus of this proposal is DDM on the grid infrastructure with emphasis on clustering, classification and data streams, we will leave many areas of DDM virtually untouched. For different perspectives on this field, the reader is referred to [154], [156], [214], [278], [279].
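As a small illustration of the kind of communication savings DDM aims for, the following sketch (a toy example, not an algorithm from the literature cited here) has each site ship only sufficient statistics (count, sum, and sum of outer products), from which a coordinator recovers the exact global mean and covariance; the message size is independent of how many tuples each site holds.

```python
import numpy as np

rng = np.random.default_rng(2)

def local_statistics(X):
    """Sufficient statistics a site sends instead of its raw data."""
    return len(X), X.sum(axis=0), X.T @ X

def merge_statistics(stats):
    """Coordinator: exact global mean/covariance from per-site summaries."""
    n = sum(s[0] for s in stats)
    total = sum(s[1] for s in stats)
    scatter = sum(s[2] for s in stats)
    mean = total / n
    cov = scatter / n - np.outer(mean, mean)   # population covariance
    return mean, cov

# Three sites holding horizontally partitioned (homogeneous) data.
sites = [rng.normal(size=(n, 4)) for n in (120, 80, 200)]
mean, cov = merge_statistics([local_statistics(X) for X in sites])

# Same result as if all the data had been shipped to one place.
X_all = np.vstack(sites)
assert np.allclose(mean, X_all.mean(axis=0))
assert np.allclose(cov, np.cov(X_all, rowvar=False, bias=True))
print(mean.shape, cov.shape)
```

Each site sends O(d^2) numbers regardless of how many local tuples it holds, which is the flavor of saving that most DDM algorithms below try to achieve for richer models than means and covariances.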

2.3.2 Classification

Distributed classification is closely related to ensemble based classifier learning [213]. Ensemble-based classifiers work by generating a collection of base models and combining their outputs using some pre-specified scheme. Typically, voting (weighted or unweighted) schemes are employed to combine the outputs of the base classifiers. A large volume of research reports that ensemble classifier models often perform better than any of the base classifiers used [79, 207, 23, 190, 166]. Two popular ensemble models of this kind are Bagging [34] and Boosting [103, 104, 240]. Both Bagging and Boosting build multiple classifiers from different subsets of the original training data set. However, they take substantially different approaches to sampling the subsets and combining the classifications. In Bagging, each training set is constructed by taking a bootstrap replicate of the original training set. This means that given a training set T that contains m tuples, a new training set is constructed by uniformly sampling (with replacement) from T. Classifiers are built on the new training sets and the results are aggregated by majority voting. In Boosting, a set of weights is maintained over the original training set T and adjusted after classifiers are learned using a learning algorithm. The adjustment procedure increases the weight of tuples that are mis-classified by the base learning algorithm and decreases the weight of those that are correctly classified. There are two different ways by which the weights can be used to form the new training set - boosting with sampling (tuples are drawn with replacement from the training set with probability proportional to their weight) and boosting by weighting (the learning algorithm can take a weighted training set directly). Other ensemble based approaches include Stacking [276, 95], Random Forest [35] and a recent work, Rotation Forest [232]. Stacking [276, 95] learns from pre-partitioned training sets $T_i$ and validation sets $V_i$. Given a set of learning algorithms $L = \{L_1, L_2, \ldots, L_k\}$, it builds a classification model in two stages. During the first stage, a set of classifiers $C_1, C_2, \ldots, C_k$ is learned from $T_i$. In the second stage, each instance $(\mathbf{x}, y) \in V_i$ is classified with the set of classifiers learned in the previous stage, which results in a meta-level tuple $(C_1(\mathbf{x}), C_2(\mathbf{x}), \ldots, C_k(\mathbf{x}), y)$. By applying the process repeatedly for all $V_i$'s, a new training set is formed, which is then used to create the so-called meta-classifier. Classification of an unseen data instance $\mathbf{x}$ is also made in two stages. A meta-level testing instance $(C_1(\mathbf{x}), C_2(\mathbf{x}), \ldots, C_k(\mathbf{x}))$ is first formed, which is then passed to the meta-classifier for the final classification. In Random Forests, a forest of classification trees is grown as follows: (1) If the training set has T tuples, sample T cases at random (with replacement) from the original data. (2) If there are M attributes, a number $m < M$ is specified such that at each node, m variables are selected at random out of the M and the best split on these m is used to split the node. The value of m is held constant during the forest growing. (3) Each tree is grown to the largest extent possible. There is no pruning. The error rate of the Random Forest depends on the correlation between any two trees in the forest

and the strength of each individual tree in the forest. While Random Forests have been shown to have good classification accuracy, the major disadvantage is that the size (total number of nodes in the trees of the forest) of the model built can be very large. A more recent work, Rotation Forest [232], builds classifier ensembles based on feature extraction. In order to create the training data set for a base classifier, the attribute space is randomly split into K (a parameter of the algorithm) subsets and Principal Component Analysis (PCA) is applied to each subset, retaining all the principal components. There are K axis rotations creating a new feature space for the base classifiers. The objective of this method is to increase the individual accuracy of the classifiers and the diversity within the ensemble. Experimental results reported in the work claim that Rotation Forest can provide better accuracy than Random Forest, Bagging and Boosting. All of the above ensemble learning schemes can be directly adapted for distributed classification. The individual sites can produce the base models from local data, and ensemble based aggregation schemes [164, 171, 172, 217] can be used for producing the final result.
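A minimal sketch of this idea follows. Each "site" learns a base model from a bootstrap replicate of its local data, and the predictions are combined by unweighted majority voting; decision stumps stand in for whatever local learner a site actually uses, and the data and its partitioning are synthetic.

```python
import numpy as np

rng = np.random.default_rng(1)

def fit_stump(X, y):
    """Learn a one-level decision tree (stump): best single-feature threshold."""
    best = None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            pred = (X[:, j] > t).astype(int)
            for flip in (0, 1):
                err = np.mean((pred ^ flip) != y)
                if best is None or err < best[0]:
                    best = (err, j, t, flip)
    return best[1:]                      # (feature, threshold, flip)

def stump_predict(model, X):
    j, t, flip = model
    return ((X[:, j] > t).astype(int)) ^ flip

# Synthetic data split across three "sites" (homogeneous partitioning).
X = rng.normal(size=(300, 5))
y = (X[:, 0] + 0.5 * X[:, 2] > 0).astype(int)
sites = np.array_split(np.arange(300), 3)

# Each site builds a base model on a bootstrap replicate of its local data.
models = []
for idx in sites:
    boot = rng.choice(idx, size=len(idx), replace=True)
    models.append(fit_stump(X[boot], y[boot]))

# Combine by unweighted majority voting, as in bagging.
votes = np.stack([stump_predict(m, X) for m in models])
ensemble_pred = (votes.mean(axis=0) >= 0.5).astype(int)
print("ensemble accuracy:", np.mean(ensemble_pred == y))
```

Only the fitted models (here, three small tuples) need to leave the sites; weighted voting or a learned combiner, as discussed next, would replace the simple majority rule.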

Figure 2.8: The Meta-Learning Framework

Homogeneously Distributed Classifiers: A slightly different ensemble learning technique for homogeneously distributed data is the meta-learning framework [55, 53, 54, 58, 59, 56, 57]. It follows three steps: (1) Build the base classifiers using learning algorithms on the local data at each site. (2) Base classifiers are collected and combined at

a central site. Produce the meta-level data by using a validation set and the predictions of the base classifiers on this set. (3) Build the meta-level classifier from the meta-level data. Figure 2.8 illustrates the process. Different strategies for combining multiple predictions from classifiers in the meta-learning phase can be adopted. These include:

1. Voting: Each classifier is assigned a single vote and the majority wins. A variation of this process is Weighted Voting, where some classifiers are given preferential treatment depending on their performance on some common validation set.

2. Arbitration: The arbiter acts like a "judge" and its predictions are selected if the participating classifiers cannot themselves reach a consensus. Thus the arbiter itself is a classifier which chooses a final outcome based on the predictions of the other classifiers.

3. Combiner: The combiner makes use of knowledge about how the classifiers behave with respect to one another and thereby enables meta-learning. There are several ways in which the combiner can be learned. One way is to use the base classifiers and their outputs. Yet another option is to learn the combiner from data comprised of training examples, correct classifications and base classifier outputs.

The meta-learning framework is implemented in a system called Java Agents for Meta-learning (JAM) [247, 248]. In general, meta-learning helps improve performance by executing in parallel and ensures better predictive capabilities by combining learners with different inductive biases, such as search space, representation schemes and search heuristics. Other meta-learning based distributed classification schemes include [118] (Distributed Learning and Knowledge Probing) and [111].

Heterogeneously Distributed Classifiers: The problem of learning over heterogeneously partitioned data is inherently difficult since different sites observe different attributes from the original data set. Traditional ensemble based approaches generate high variance local models and are not adept at identifying correlated features distributed over different sites. Hence the problem of learning over heterogeneously partitioned data is challenging. Park and his colleagues [210] have addressed the problem of learning from heterogeneous sites using an evolutionary approach. Their work first identifies a portion of the data that none of the local classifiers can learn with a great deal of confidence. This subset of the data is merged at the central site and another new classifier is built from it. When a new instance cannot be classified with high confidence by a combination of the local classifiers, the central classifier is used. The approach produces better results than simple aggregation of local models. However, the algorithm is sensitive to the confidence threshold selected. An algorithm to construct decision trees over heterogeneously partitioned data has been proposed by Giannella et al. [107]. The algorithm is designed using random projection based dot product estimation and a message sharing strategy. Their work assumes that each site has the same number of tuples, which are ordered to facilitate the matching, i.e., the $i$-th tuple on each site matches and also has the same class label. The

aim is to construct a decision tree using attributes from all the sites. The problem boils down to estimating the information gain offered by attributes when making splitting decisions. To reduce communication, the information gain estimation is approximated using a random projection based approach. It must be noted that the decision tree obtained from the distributed framework may not be identical to the one obtained if all the data were centralized. However, increasing the number of messages exchanged can make the distributed tree arbitrarily close to the centralized tree. Also, the distributed algorithm requires more local computation than the centralized algorithm. Thus the overall benefit of the algorithm is based on a trade-off: increased local computation in exchange for reduced communication. The work does not take into account actual communication delays in the network in the distributed setting and thus could benefit from a more detailed timing study of the centralized and distributed settings. The above problem of constructing decision trees over heterogeneously partitioned data has also been addressed in work by Caragea et al. [48, 47, 49]. However, their work focuses on producing an exact distributed learning algorithm, compared with decision trees constructed on centralized data. Thus it is fundamentally different from the inexact random projection based approach described in [107]. An order statistics based approach to combining classifiers generated from heterogeneous sites has been presented in [261]. Their technique works by ordering the predictions of the classifiers and provides mechanisms for selecting an appropriate order statistic and making a linear combination of them. The work provides an analytical framework for quantifying the reduction in error obtained when an order statistics based ensemble is used. Their experimental results suggest that when there is significant variability among the classifiers, order statistics based approaches perform better than ordinary combiners.

Collective Data Mining: The framework for Collective Data Mining (CDM) has been proposed by Kargupta and his colleagues [150]. It has its roots in the theory of communications, machine learning, statistics and distributed databases. The main objective of CDM is to ensure that the partial models produced from local data at the sites are correct and can be used as building blocks for forming the global model. The steps involved are: (1) Choose an appropriate orthonormal representation for the type of data model to be built. (2) Construct approximate orthonormal basis coefficients at each local site. (3) If required, transfer a sample of the datasets from each local site to a coordinator site and generate approximate basis coefficients corresponding to the non-linear cross terms. (4) Combine the local models and output the final model using a user specified representation format (such as decision trees). A major contribution is the use of the CDM approach to construct decision trees from data through Fourier analysis. It has been pointed out that there are several different techniques to do this. One possibility is to use the Non-uniform Fourier Transformation (NFT) [36]. Computation of the Fourier coefficients requires all the members of the domain to be available. However, in most learning frameworks, a training set is used to learn the model and it is tested on a validation (test) set. In such a framework, estimation of the Fourier coefficients themselves is not easy.
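To make the difficulty concrete, the sketch below computes the Fourier (Walsh) spectrum of a toy Boolean function exactly, by evaluating it over the complete domain; the function, features and domain size are hypothetical stand-ins. The exhaustive sum over the entire domain is precisely what is unavailable when only a training sample is given.

```python
from itertools import product

def tree(x):
    """A toy depth-2 decision tree over 4 Boolean features, output in {-1, +1}."""
    if x[0] == 1:
        return 1 if x[1] == 1 else -1
    return 1 if x[2] == 1 else -1

n = 4
domain = list(product([0, 1], repeat=n))

def fourier_coefficient(z):
    """w_z = (1/2^n) * sum_x f(x) * (-1)^(x . z), summed over the full domain."""
    total = 0.0
    for x in domain:
        parity = sum(xi * zi for xi, zi in zip(x, z)) % 2
        total += tree(x) * (1 if parity == 0 else -1)
    return total / len(domain)

spectrum = {z: fourier_coefficient(z) for z in domain}

# Only low-order partitions over the features the depth-2 tree actually uses
# carry non-zero coefficients; higher-order coefficients vanish.
nonzero = {z: w for z, w in spectrum.items() if abs(w) > 1e-12}
print(len(nonzero), "non-zero coefficients out of", len(spectrum))
```

The sparsity visible in the output is the property the estimation-based approaches below exploit; what they avoid is the exhaustive loop over the domain.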
The Non-uniform Fourier Transformation provides a potential solution. If it is assumed that the class label is zero for all members of the domain that are not in the learning set, then the Fourier spectrum exactly represents the data. This is called the NFT, and it can be shown that one can exactly construct the decision tree from the NFT of the data. However, this approach has potential drawbacks. It does not guarantee a polynomial description or exponentially decaying magnitude of the coefficients. Also, communication of the NFT of the data may require a substantial overhead. Another approach is to estimate the Fourier spectrum of the tree directly from the data, instead of the NFT of the data. This method has several advantages: (1) Decision trees with bounded depth are generally useful for data mining. (2) The Fourier representation of bounded-depth (say $d$) decision trees has a polynomial number of non-zero coefficients; coefficients corresponding to partitions involving more than $d$ features are zero. (3) If the number of defining features determines the order of a partition, then the magnitude of the Fourier coefficients decays exponentially with the order of the corresponding partition. The existence of these properties guarantees that if there is a straightforward way to estimate the coefficients themselves, then decision trees can be built over distributed data using very little communication. An iterative approach to modelling the error obtained in approximating the Fourier coefficients obtained from non-local sites has been proposed. A more detailed analysis of this work is presented in [211]. The Fourier representation of decision trees and the procedure to reconstruct trees from the Fourier spectrum have been studied in much detail [209, 211, 212, 149, 152, 151].

Removing Redundancies from Ensembles: Existing ensemble-learning techniques work by combining (usually linearly) the outputs of the base classifiers. They do not structurally combine the classifiers themselves. As a result they often share a lot of redundancies. The Fourier representation of decision trees, referred to in the discussion above, offers a unique way to fundamentally aggregate the trees and perform further analysis to construct an efficient representation. The work on Orthogonal Decision Trees [153, 89, 119] focuses on this issue. Consider a matrix $D$ where $D_{i,j} = f_j(\mathbf{x}_i)$, the output of the $j$-th tree in the ensemble on the $i$-th element of the input domain. $D$ is an $|\Omega| \times k$ matrix, where $|\Omega|$ is the size of the input domain and $k$ is the total number of trees in the ensemble. An ensemble classifier that combines the outputs of the base classifiers can be viewed as a function defined over the set of all rows in $D$. If $D_{\ast,j}$ denotes the $j$-th column of $D$, then the ensemble classifier can be viewed as a function of $D_{\ast,1}, D_{\ast,2}, \ldots, D_{\ast,k}$. When the ensemble classifier is a linear combination of the outputs of the base classifiers we have $\mathbf{F} = a_1 D_{\ast,1} + a_2 D_{\ast,2} + \cdots + a_k D_{\ast,k}$, where $\mathbf{F}$ is the column matrix of the overall ensemble output. Since the base classifiers may have redundancy, it is possible to construct a compact low-dimensional representation of the matrix $D$. However, explicit construction and manipulation of the matrix $D$ is difficult, since most practical applications deal with a very large domain. We can try to construct an approximation of $D$ using only the available training data. One such approximation of $D$ and its Principal Component Analysis-based projection is reported elsewhere [190]. Their technique performs PCA of the matrix $D$, projects the data in the representation defined by the eigenvectors of the covariance matrix of $D$, and then performs linear regression for computing the coefficients $a_1, a_2, \ldots,$ and $a_k$.
While the approach is interesting, it has a serious limitation. First of all, the construction of an approximation of $D$ even for the training data is computationally prohibitive for most large scale data mining applications. Moreover, this is an approximation since the matrix is computed only over the observed data set, not the entire domain. In recent work [153, 89, 119], a novel way to perform a PCA of the matrix containing the Fourier spectra of trees has been reported. The approach works without explicitly generating the matrix $D$. It is important to note that the PCA-based regression scheme [190] offers a way to find the weights for the members of the ensemble. It does not offer any way to aggregate the tree structures and construct a new representation of the ensemble, which the current approach does. Now consider a matrix $W$ where $W_{i,j} = w_{f_j}(\mathbf{z}_i)$, i.e. the coefficient corresponding to the $i$-th member of the partition set $\mathcal{Z}$ from the spectrum of the tree $f_j$. It can be shown that the covariance matrices of $D$ and $W$ are identical [119]. Note that $W$ is a $|\mathcal{Z}| \times k$ dimensional matrix. For most practical applications $|\mathcal{Z}| \ll |\Omega|$. Therefore analyzing $W$ using techniques like PCA is significantly easier. PCA of the covariance matrix of $W$ produces a set of eigenvectors $\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_k$. The eigenvalue decomposition constructs a new representation of the underlying domain. Since the eigenvectors are nothing but linear combinations of the original column vectors of $W$, each of them also forms a Fourier spectrum and we can reconstruct a decision tree from this spectrum. Moreover, since they are orthogonal to each other, the trees constructed from them also maintain the orthogonality condition. The analysis presented above offers a way to construct the Fourier spectra of a set of functions that are orthogonal to each other and therefore redundancy-free. These functions also define a basis and can be used to represent any given decision tree in the ensemble in the form of a linear combination. Orthogonal decision trees can be defined as an immediate extension of this framework. We present the theoretical definitions and experimental results on the performance of Orthogonal Decision Trees (ODTs) in section 3.2. In this section we discussed several algorithms and techniques for distributed classification. The following section introduces distributed clustering.
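Before turning to clustering, note that the random projection based dot product estimation used in [107] is easy to illustrate. The sketch below is a generic example with synthetic vectors (it is not the distributed decision tree algorithm itself): two sites that share a common random matrix can exchange short projected vectors and still recover a good estimate of the dot product of their full-length vectors.

```python
import numpy as np

rng = np.random.default_rng(3)

def estimate_dot(x, y, num_projections=200):
    """Estimate x . y from low-dimensional random projections of x and y.
    Only the projected vectors (length num_projections) would need to be
    exchanged between the two sites holding x and y respectively."""
    # Shared random matrix: both sites can generate it from a common seed.
    R = rng.normal(size=(num_projections, x.size))
    return (R @ x) @ (R @ y) / num_projections

x = rng.normal(size=5000)
y = rng.normal(size=5000)
print("exact dot product:        ", float(x @ y))
print("estimate from 200 numbers:", float(estimate_dot(x, y)))
```

Because the entries of the random matrix have zero mean and unit variance, the estimate is unbiased, and its accuracy improves as the number of projections grows; this is the communication/accuracy trade-off described above.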

2.3.3 Clustering

Distributed clustering algorithms can be broadly divided into two categories: (1) methods requiring multiple rounds of message passing and (2) centralized ensemble methods [38, 141]. Algorithms that fall into the first category require a significant amount of synchronization. The second category consists of methods that build local clustering models and transmit them to a central site (asynchronously). The central site forms a combined global model. These methods require only a single round of message passing and hence have modest synchronization requirements. In the next two subsections we discuss these issues in some detail.

Multiple Communication Round Algorithms

Kargupta et al. [157] develop a principal components analysis (PCA) based clustering technique in the CDM framework for heterogeneously distributed data. Each local site performs PCA, projects the local data along the principal components, and applies a known clustering algorithm. Having obtained these local clusters, each site

sends a small set of representative data points to a central site. This site carries out PCA on the collected data and computes global principal components. The global principal components are sent back to the local sites. Each site projects its data along the global principal components and applies its clustering algorithm. A description of the locally constructed clusters is sent to the central site, which combines the cluster descriptions using different techniques such as nearest neighbor methods. Klusch et al. [160] consider kernel-density based clustering over homogeneously distributed data. They adopt the definition of a density based cluster from [126]: data points which can be connected by an uphill path to a local maximum, with respect to the kernel density function over the whole dataset, are deemed to be in the same cluster. Their algorithm does not find a clustering of the entire dataset. Instead each local site finds a clustering of its local data based on the kernel density function computed over all the data. In principle, their approach could be extended to produce a global clustering by transmitting the local clusterings to a central site and combining them. However, carrying out this extension in a communication efficient manner is a non-trivial task and is not discussed by Klusch et al. Eisenhardt et al. [90] develop a distributed method for document clustering. They extend k-means with a "probe and echo" mechanism for updating cluster centroids. Each synchronization round corresponds to a k-means iteration. Each site carries out the following algorithm at each iteration. One site initiates the process by marking itself as engaged and sending a probe message to all its neighbors. The message also contains the cluster centroids currently maintained at the initiator site. The first time a node receives a probe (from a neighbor site $p$ with centroids $C$), it marks itself as engaged, sends a probe message (along with $C$) to all its neighbors (except the origin of the probe), and updates the centroids in $C$ using its local data, as well as computing a weight for each centroid based on the number of data points associated with it. If a site receives an echo from a neighbor $p$ (with centroids $C$ and weights $W$), it merges $C$ and $W$ with its current centroids and weights. Once a site has received either a probe or an echo from all neighbors, it sends an echo, along with its local centroids and weights, to the neighbor from which it received its first probe. When the initiator has received echoes from all its neighbors, it has centroids and weights which take into account the datasets at all sites. The iteration then terminates. Dhillon and Modha [78] develop a parallel implementation of the k-means clustering algorithm on homogeneously distributed data. A similar approach is taken by Forman and Zhang [105] to extend it to the problem of k-harmonic means. The problem of clustering on P2P networks has been addressed in recent work [2, 236]. Their algorithm works as follows: each node in the P2P network is provided with a random number generator (the same for all the sites) that produces the same set of initial centroid seeds when the algorithm begins. The points in the local data are first assigned to the nearest centroid. Then the centroids are updated to the dimension-wise mean of the points. If there is a drastic change in the centroids (measured by a user defined parameter), then a flag is raised indicating a change in centroids.
A particular node N will poll neighboring nodes for their centroids. The choice of neighborhood is determined in two different ways - uniform sampling and immediate neighborhood. The node N computes the weighted mean of the centroids it receives with its local centroids to produce the final set of centroids for a particular iteration. While this is

the first known P2P clustering algorithm, it appears to be an asynchronous algorithm. Moreover, it does not deal with dynamic network topology, as is common in peer-to-peer networks. A further extension of this algorithm, taking into consideration large dynamic networks, has been studied by Datta et al. [236]. It must be noted that while all the algorithms mentioned in this category require multiple rounds of message passing, [157] and [160] require only two rounds. The others require as many rounds as the algorithm iterates.

Centralized Ensemble-Based Methods

These algorithms typically have low synchronization requirements and potentially offer two other nice properties: (1) If the local models are much smaller than the local data, their transmission will result in excellent message load requirements. (2) Sharing only the local models may be a reasonable solution to privacy constraints in some situations [188]. A brief survey of the literature is presented below. Johnson and Kargupta [145] develop a distributed hierarchical clustering algorithm on heterogeneously distributed data. It first generates local cluster models and then combines these into a global model. At each local site, the chosen hierarchical clustering algorithm is applied to generate local dendrograms, which are then transmitted to a central site. Using statistical bounds, a global dendrogram is generated. Samatova et al. [237] develop a method for merging hierarchical clusterings from homogeneously distributed, real-valued data. Lazarevic et al. [167] consider the problem of combining spatial clusterings to produce a global regression-based classifier. They assume homogeneously distributed data and that the clustering produced at each site has the same number of clusters. Each local site computes the convex hull of each cluster and transmits the hulls to a central site along with a regression model for each cluster. The central site averages the regression models in overlapping regions of the hulls. Strehl and Ghosh [250] develop methods for combining cluster ensembles in a centralized setting (they did not explicitly consider distributed data). They argue that the best overall clustering maximizes the average normalized mutual information over all clusterings in the ensemble. However, they report that finding a good approximation directly is very time-consuming. Instead they develop three more efficient algorithms which are not theoretically shown to maximize mutual information, but are empirically shown to do a decent job. Fred and Jain [102] develop a method for combining clusterings in a centralized setting. Given $r$ clusterings of $n$ data points, their method first constructs an $n \times n$ co-association matrix (the same matrix as described in [250]). Next a merge algorithm is applied to the matrix using a single-link, threshold-based hierarchical clustering technique: for each pair $(i, j)$ whose co-association entry is greater than a predefined threshold, the clusters containing these points are merged. In principle, both Strehl and Ghosh's ideas and Fred and Jain's approach can be readily adapted to heterogeneously distributed data. However, for Strehl and Ghosh's ideas to be adapted to a distributed setting, the problem of constructing an accurate centralized representation of the ensemble using few messages needs to be addressed. In order for Fred and Jain's approach to be adapted to a distributed setting, the problem of building


an accurate co-association matrix in a message-efficient manner must be addressed. Merugu and Ghosh [188] develop a method for combining generative models17 produced from homogeneously distributed data. Each site produces a generative model from its own local data. Their goal is for a central site to find a global model from a predefined family (e.g. multivariate, 10-component Gaussian mixtures) which minimizes the average Kullback-Leibler distance over all local models. They prove this to be equivalent to finding a model from the family which minimizes the KL distance from the mean model over all local models (the point-wise average of all local models). They assume that this mean model is computed at some central site. Finally the central site computes an approximation to the optimal model using an EM-style algorithm along with Markov-chain Monte-Carlo sampling. They did not discuss how the centralized mean model is computed. However, since the local models are likely to be considerably smaller than the actual data, transmitting the models to a central site seems to be a reasonable approach.

17 A generative model is a weighted sum of multi-dimensional probability density functions, i.e., components.

Januzaj et al. [144] extend a density-based centralized clustering algorithm, DBSCAN, by one of the authors, to a homogeneously distributed setting. Each site carries out the DBSCAN algorithm, a compact representation of each local clustering is transmitted to a central site, a global clustering representation is produced from the local representations, and finally this global representation is sent back to each site. A clustering is represented by first choosing a sample of data points from each cluster. The points are chosen such that: (i) each point has enough neighbors in its neighborhood (determined by fixed thresholds) and (ii) no two points lie in the same neighborhood. Then k-means clustering is applied to all points in the cluster, using each of the sample points as an initial centroid. The final centroids, along with the distance to the furthest point in their k-means cluster, form the representation (a collection of point, radius pairs). The DBSCAN algorithm is applied at the central site on the union of the local representative points to form the global clustering. This algorithm requires an $\epsilon$ parameter defining a neighborhood. The authors set this parameter to the maximum of all the representation radii. Methods [144], [188], and [237] are representatives of the up-and-coming class of distributed clustering algorithms, centralized ensemble-based methods. These algorithms focus on transmitting compact representations of a local clustering to a central site, which combines them to form a global clustering representation. The key to this class of methods is the local model (clustering) representation. A good one faithfully captures the local clusterings, requires few messages to transmit, and is easy to combine. Two other techniques solve a closely related but different problem (which their authors call "distributed clustering"): they address the problem of forming clusters of distributed datasets. Each one of their clusters is a collection of datasets, not a collection of tuples from datasets. McClean et al. [186] consider clustering a collection of data cubes. Parthasarathy and Ogihara [234] consider clustering homogeneously distributed tables. Having discussed distributed classification and clustering algorithms, we now focus our attention on distributed data stream mining. The following section introduces the topic.
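To make the ensemble-combination step concrete, here is a toy, fully centralized sketch of the co-association idea discussed above (building it in a message-efficient, distributed way is exactly the open problem noted earlier): given several label vectors, count how often each pair of points is co-clustered, threshold the matrix, and merge points through connected components.

```python
import numpy as np

def co_association(labelings):
    """n x n matrix whose (i, j) entry is the fraction of clusterings
    in which points i and j fall in the same cluster."""
    n = len(labelings[0])
    M = np.zeros((n, n))
    for labels in labelings:
        labels = np.asarray(labels)
        M += (labels[:, None] == labels[None, :]).astype(float)
    return M / len(labelings)

def merge(M, threshold):
    """Single-link style merge: connected components of the thresholded graph."""
    n = M.shape[0]
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i in range(n):
        for j in range(i + 1, n):
            if M[i, j] > threshold:
                parent[find(i)] = find(j)
    return [find(i) for i in range(n)]

# Three clusterings (e.g. produced at three sites) of the same six points.
labelings = [[0, 0, 0, 1, 1, 1],
             [0, 0, 1, 1, 2, 2],
             [1, 1, 1, 0, 0, 0]]
M = co_association(labelings)
print(merge(M, threshold=0.5))   # points co-clustered in a majority of runs merge
```

The merge rule used here operates on points rather than on whole clusters and is therefore a simplification of the scheme in [102]; it is meant only to show why the co-association matrix, and not the raw data, is the object that would have to travel over the network.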


2.3.4 Distributed Data Stream Mining

Distributed Data Stream Mining (DDSM) is becoming an area of active research ([20, 181, 163, 27, 148]) due to the emergence of geographically distributed stream-oriented systems such as online network intrusion detection applications, sensor networks, vehicle monitoring systems, web click streams, and systems analyzing multimedia data. In these applications, data streams originating from multiple remote sources need to be monitored. A central stream processing system will not provide a good solution since streaming data rates may exceed the capacity of the storage, communication, and processing infrastructure [20]. Thus there arises a need for distributed data stream mining18. Several projects deal with data mining on streams, including [117, 131, 87, 97, 106]. While these are closely related, it is not clear whether all of them can be directly applied in a distributed setting. In this section we provide a brief review of current work in the field of distributed data stream mining. Babcock and Olston [20] describe a distributed top-k monitoring algorithm, designed to continuously report the k largest values from distributed data streams (top-k monitoring queries). Such queries are particularly useful in tracking atypical behavior, such as distributed denial of service attacks, exceptionally large or small values in telephone call records, auction bidding patterns and web usage statistics. The approach to solving the problem is as follows: the coordinator maintains the top-k set initially. It installs arithmetic constraints at each monitor node over partial data values. As updates occur in the distributed streams, the arithmetic constraints should always be satisfied. If there is a conflict between the coordinator and the monitor nodes, a conflict resolution scheme is resorted to. Thus distributed communication is only needed when the constraints imposed in the system are violated. The main drawback of this scheme seems to be the fact that the procedure of updating and reallocation is not instantaneous, and thus the overall conflict resolution scheme may not happen in real time. The problem of mining frequent item sets from multiple distributed data streams has been studied by Manjhi et al. [181]. A naive solution to the problem is to combine frequency counts from the distributed nodes. However, as the number of nodes increases, a large number of data structures needs to be stored. The authors suggest a solution which is based on the precision of the frequency count maintained at each node. They introduce a hierarchical communication structure that maintains an error tolerance for frequency counts at each level. This is referred to as the precision gradient. The setting of the precision gradient is posed as an optimization problem with the objective of (1) minimizing load on the central node to which answers are delivered or (2) minimizing the worst case load on any communication link in the hierarchical structure. Kotecha et al. [163] address the problem of distributed classification of multiple targets in a wireless sensor network. They cast it as a hypothesis testing problem. The major concern in multi-target classification is that as the number of targets increases, the hypotheses increase exponentially.
The authors propose to re-partition the hypothesis space to reduce the exponential complexity. Ghoting and Parthasarathy [11] present algorithms for mining distributed streams with interactive response times. Their work performs a Directed Acyclic Graph (DAG) based decomposition of queries over distributed streams and makes use of this scheme to perform k-median clustering. They introduce a way to effectively update clustering parameters (such as k) by distributed interactive operator scheduling. A ticket based scheduling algorithm is presented along with an optimal distributed operator allocation for interactive data stream processing in a distributed setting. The authors adapt a graph partitioning scheme for stream query decomposition. While this is an interesting approach, it remains to be seen how well it can scale in large real time systems. The VEhicle DAta Stream Mining (VEDAS) project [148] is an experimental system for monitoring vehicle data streams in real time. It is one of the very early distributed data mining systems that perform most of the data analysis and knowledge discovery operations on onboard computing devices. The data collected from onboard monitoring devices such as PDAs are subjected to principal component analysis for dimensionality reduction. Since performing PCA in a resource-constrained environment may be expensive, the authors present ways to monitor changes in the covariance matrix, which is useful for incremental PCA and avoids recomputation of the entire PCA process. The fault detection module of the application handles vehicle health data. It makes use of incremental clusters to represent safe regimes of operation and can automatically monitor outliers from new vehicle data. The paper also provides mechanisms for drunk driver detection, which can be viewed as locating deviations from normal or characteristic behavior. In this section, we discussed several algorithms and applications for distributed stream mining. We argue that since data repositories on the grid are heterogeneous in nature and can be static or streaming, distributed data mining can play an important role in extracting patterns from repositories on the grid. In the next section we analyze the challenges for distributed data mining on the grid.

18 There exists a significant amount of work in stream data architectures ([22, 177]), query processing ([206, 60, 178, 19]), stream-based programming languages and algebras ([208, 243, 74]), and applications ([235, 204]). The current proposal will not focus on these problems, and a reader interested in data streams is referred to [22] for a detailed overview.
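As a rough illustration of why constraint-based monitoring saves communication, the sketch below implements a much-simplified variant (it is not Babcock and Olston's actual algorithm): each monitor caches the last value it reported and contacts the coordinator only when its current value drifts outside a fixed slack window, so messages are triggered by constraint violations rather than by every stream update.

```python
import random

class Monitor:
    """A stream source that reports to the coordinator only on violations."""
    def __init__(self, name, slack):
        self.name, self.slack = name, slack
        self.value = 0.0
        self.reported = 0.0            # last value the coordinator has seen

    def observe(self, delta, coordinator):
        self.value += delta
        # Local arithmetic constraint: |current - reported| <= slack.
        if abs(self.value - self.reported) > self.slack:
            coordinator.update(self.name, self.value)
            self.reported = self.value

class Coordinator:
    def __init__(self):
        self.values = {}
        self.messages = 0

    def update(self, name, value):
        self.values[name] = value
        self.messages += 1

    def top_k(self, k):
        return sorted(self.values.items(), key=lambda kv: kv[1], reverse=True)[:k]

random.seed(0)
coord = Coordinator()
monitors = [Monitor(f"node{i}", slack=5.0) for i in range(10)]
for _ in range(1000):                      # 10,000 stream updates in total
    for m in monitors:
        m.observe(random.gauss(0, 1), coord)

print("messages sent:", coord.messages, "out of", 10 * 1000)
print("approximate top-3:", coord.top_k(3))
```

The coordinator's view is only approximate (each cached value may be off by up to the slack), which is exactly the accuracy-versus-communication trade-off that the algorithms surveyed above manage in a principled way.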

2.4 The Challenges

A Data Grid can be thought of as a distributed system having the following characteristics:

1. It comprises several resources (computers, sensors, etc.) storing data repositories (relational and XML databases, flat files, etc.).

2. The resources do not share a common memory or a clock.

3. They can communicate with one another by exchanging messages over a communication network.

4. Each resource has its own memory and can perform limited or extensive data-intensive tasks.


Join Attribute (X)   A    B
x1                   a1   b3
x2                   a2   b2
x2                   a2   b4
x3                   a3   b1

Table 2.1: Matched Catalog P and Q.

5. The data repositories owned and controlled by a resource are said to be local to it, while repositories owned by other machines are considered remote.

6. Accessing remote resources in the network is more expensive than accessing local resources, since it involves communication delays and CPU overhead to process communication protocols.

7. The resources are capable of forming virtual organizations amongst themselves. Members of a virtual organization are allowed to share data under local policies which specify what is shared, who is allowed to share and the conditions for sharing. Sharing amongst disparate virtual organizations is allowed, although policies for sharing could be guided by different rules.

Thus, a Data Grid can be conceived such that there is either (1) a hierarchy amongst virtual organizations or (2) complete de-centralization amongst virtual organizations ([253, 254]). Given the characteristics of the Data Grid, let us examine what a service-oriented architecture for distributed data mining on the grid requires.

1. Distributed Data Integration: The purpose of this is to integrate heterogeneous data repositories. Schema integration is a difficult problem, given that the data repositories contain different types of data (relational, unstructured, data streams) and different attributes, the indices are not all aligned, and the criteria required to integrate them may be complex. We illustrate with examples from astronomy.

Example 1: (1) Consider the catalogs P and Q shown in Table 2.2. For the sake of discussion, we assume Catalog P has X and A attributes and Catalog Q has X and B attributes. We further assume that X is the join attribute. (2) Table 2.1 illustrates one possibility of aligning the catalogs. Notice that the attribute value x2 matches with a2 and with both b2 and b4. Thus x2 appears twice in the matched catalog. Also, if either Catalog P or Q has a join attribute value that the other does not have, that tuple will not show up in the matched catalog.

Example 2: Consider that Catalog A has Cartesian co-ordinates (x1, y1, z1) and Catalog B stores co-ordinates (x2, y2, z2), both representing the spatial positions of astronomical objects (stars, galaxies etc). It is required to perform a


Catalog P:
Tuple ID   Join Attribute (X)   A
p1         x1                   a1
p2         x2                   a2
p3         x3                   a3

Catalog Q:
Tuple ID   Join Attribute (X)   B
q1         x3                   b1
q2         x2                   b2
q3         x2                   b4
q4         x1                   b3

Table 2.2: Catalog P and Catalog Q.

join between the two catalogs based on the probabilistic calculation that minimizes the parameter $\gamma$ in the following equation:

$$\gamma = 2 \arcsin\left(\frac{1}{2}\sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2 + (z_1 - z_2)^2}\right)$$

$D$ is an $|\Omega| \times k$ matrix, where $|\Omega|$ is the size of the input domain and $k$ is the total number of trees in the ensemble. An ensemble classifier that combines the outputs of the base classifiers can be viewed as a function defined over the set of all rows in $D$. If $D_{\ast,j}$ denotes the $j$-th column of $D$, then the ensemble classifier can be viewed as a function of $D_{\ast,1}, D_{\ast,2}, \ldots, D_{\ast,k}$. When the ensemble classifier is a linear combination of the outputs of the base classifiers we have $\mathbf{F} = a_1 D_{\ast,1} + a_2 D_{\ast,2} + \cdots + a_k D_{\ast,k}$, where $\mathbf{F}$ is the column matrix of the overall ensemble output. Since the base classifiers may have redundancy, we would like to construct a compact low-dimensional representation of the matrix $D$. However, explicit construction and manipulation of the matrix $D$ is difficult, since most practical applications deal with a very large domain. We can try to construct an approximation of $D$ using only the available training data. One such approximation of $D$ and its Principal Component Analysis-based projection is reported elsewhere [190]. Their technique performs PCA of the matrix $D$, projects the data in the representation defined by the eigenvectors of the covariance matrix of $D$, and then performs linear regression for computing the coefficients $a_1, a_2, \ldots,$ and $a_k$. While the approach is interesting, it has a serious limitation. First of all, the construction of an approximation of $D$ even for the training data is computationally prohibitive for most large scale data mining applications. Moreover, this is an approximation since the matrix is computed only over the observed data set, not the entire domain. In the following we demonstrate a novel way to perform a PCA of the matrix containing the Fourier spectra of trees. The approach works without explicitly generating the matrix $D$. It is important to note that the PCA-based regression scheme [190] offers a way to find the weights for the members of the ensemble. It does not offer any way to aggregate the tree structures and construct a new representation of the ensemble, which the current approach does. The following analysis will assume that the columns of the matrix $D$ are mean-zero. This restriction can be easily removed with a simple extension of the analysis. Note that the covariance of the matrix $D$ is $D^{T} D$. Let us denote this covariance matrix by $C$. The $(i,j)$-th entry of the matrix,

$$C_{i,j} = \langle D_{\ast,i}, D_{\ast,j} \rangle = \langle f_i(\mathbf{x}), f_j(\mathbf{x}) \rangle = \sum_{\mathbf{z}} w_{f_i}(\mathbf{z}) \, w_{f_j}(\mathbf{z}) = \langle \mathbf{w}_{f_i}, \mathbf{w}_{f_j} \rangle \qquad (3.4)$$

The fourth step is true by Lemma 2. Now let us consider the matrix $W$ where $W_{i,j} = w_{f_j}(\mathbf{z}_i)$, i.e. the coefficient corresponding to the $i$-th member of the partition set $\mathcal{Z}$ from the spectrum of the tree $f_j$. Equation 3.4 implies that the covariance matrices of $D$ and $W$ are identical. Note that $W$ is a $|\mathcal{Z}| \times k$ dimensional matrix. For most practical applications $|\mathcal{Z}| \ll |\Omega|$. Therefore analyzing $W$ using techniques like PCA

is significantly easier. The following discourse outlines a PCA-based approach. PCA of the covariance matrix of $W$ produces a set of eigenvectors $\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_k$. The eigenvalue decomposition constructs a new representation of the underlying domain. Note that since the eigenvectors are nothing but linear combinations of the original column vectors of $W$, each of them also forms a Fourier spectrum and we can reconstruct a decision tree from this spectrum. Moreover, since they are orthogonal to

each other, the trees constructed from them also maintain the orthogonality condition. The following section defines orthogonal decision trees, which make use of these eigenvectors.

Orthogonal Decision Trees

The analysis presented in the previous sections offers a way to construct the Fourier spectra of a set of functions that are orthogonal to each other and therefore redundancy-free. These functions also define a basis and can be used to represent any given decision tree in the ensemble in the form of a linear combination. Orthogonal decision trees can be defined as an immediate extension of this framework. A pair of decision trees $f_a$ and $f_b$ are orthogonal to each other if and only if $\langle f_a(\mathbf{x}), f_b(\mathbf{x}) \rangle = 0$ when $a \neq b$, and $\langle f_a(\mathbf{x}), f_a(\mathbf{x}) \rangle = 1$ otherwise. The second condition is actually a slightly special case of orthogonal functions, namely the orthonormality condition. A set of trees is pairwise orthogonal if every possible pair of members of this set satisfies the orthogonality condition. The orthogonality condition guarantees that the representation is not redundant. These orthogonal trees form a basis set that spans the entire function space of the ensemble. The overall output of the ensemble is computed from the output of these orthogonal trees. The specific details of the ensemble output computation depend on the technique adopted to compute the overall output of the original ensemble. However, for the most popular cases considered here it boils down to computing the average output. If we choose to go for weighted averages, we may also compute the coefficient corresponding to each tree by simply performing linear regression.
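The linear algebra above can be prototyped in a few lines. The sketch below is an illustration under simplifying assumptions (a random stand-in for the coefficient matrix $W$ and synthetic outputs for the regression step), not the implementation used in this work: it mean-centres the columns of $W$, eigen-decomposes the small $k \times k$ covariance, treats the resulting combinations of columns as the spectra of orthogonal ensemble members, and fits the combination weights by least squares.

```python
import numpy as np

# W: |Z| x k matrix; column j holds the Fourier coefficients of tree j.
# (Random here for illustration -- in practice it comes from the spectra
# of the base decision trees in the ensemble.)
rng = np.random.default_rng(0)
num_coeffs, num_trees = 32, 10
W = rng.normal(size=(num_coeffs, num_trees))

# Mean-centre the columns (the analysis above assumes mean-zero columns).
Wc = W - W.mean(axis=0)

# Covariance of the columns (k x k); per the analysis above this matches
# the covariance of the much larger matrix D.
C = Wc.T @ Wc

# Eigen-decomposition; each eigenvector gives an orthogonal linear
# combination of the base spectra, i.e. the spectrum of an orthogonal tree.
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]          # largest variance first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
ortho_spectra = Wc @ eigvecs               # columns are mutually orthogonal

# Keep enough components to capture, say, 90% of the variance.
explained = np.cumsum(eigvals) / eigvals.sum()
k_keep = int(np.searchsorted(explained, 0.90)) + 1

# Combination weights by least-squares regression of the desired ensemble
# output y on the retained members' outputs X (both placeholders here).
X = rng.normal(size=(100, k_keep))         # outputs of the orthogonal trees
y = rng.normal(size=100)                   # target ensemble output
weights, *_ = np.linalg.lstsq(X, y, rcond=None)
print(k_keep, weights.shape)
```

Reconstructing an actual decision tree from each retained spectrum (the TCFS step referred to in section 3.2.3) is the part deliberately left out of this sketch.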

3.2.5 Experimental Results

This section reports the experimental performance of orthogonal decision trees on the following data sets: SPECT, NASDAQ, DNA, House of Votes and Contraceptive Method Usage Data. For each data set, the following three experiments are performed using known classification techniques:

1. C4.5: The C4.5 classifier is built on the training data and validated over the test data.

2. Bagging: A popular ensemble classification technique, bagging, is used to test the classification accuracy on the data set.

3. Random Forest: Random forests are built on the training data, using approximately half the number of features in the original data set. The number of trees in the forest is identical to that used in the bagging experiment5.

5 We used the WEKA implementation (http://www.cs.waikato.ac.nz/ml/weka/) of Bagging and Random Forests.

We then perform another set of experiments for comparing the techniques described in the previous sections in terms of error in classification and tree complexity.


1. Reconstructed Fourier Tree (RFT): The training set is uniformly sampled, with replacement, and C4.5 trees are built on each sample. The Fourier representation of each individual tree is obtained, preserving a certain percentage (e.g. 90%) of the energy. This representation of a tree is used to reconstruct a decision tree using the TCFS algorithm described in Section 3.2.3. The performance of a reconstructed Fourier tree is compared with the original C4.5 tree. The error in classification and tree complexity of each of the reconstructed trees is reported. The purpose of this experiment is to study the effect of representing a tree by its Fourier spectrum, and how much accuracy is lost in the entire cycle of summarizing a decision tree by its Fourier spectrum and then re-learning the tree from the spectrum.

2. Aggregated Fourier Tree (AFT): The training set is uniformly sampled, with replacement, and C4.5 decision trees are built on each sample (this is identical to bagging). A Fourier representation of each tree is obtained (preserving a certain percentage of the total energy), and these are aggregated with uniform weighting to obtain the spectrum of an Aggregated Fourier Tree (AFT). The AFT is reconstructed using the TCFS algorithm described before, and the classification accuracy and the tree complexity of this aggregated Fourier tree are reported.

3. Orthogonal Decision Trees: The matrix containing the Fourier coefficients of the decision trees is subjected to principal component analysis. Orthogonal trees are built corresponding to the principal components. In most cases it is found that the first principal component captures most of the variance, and thus the orthogonal decision tree constructed from this principal component is of particular interest. So we report the error in classification and tree complexity of the orthogonal decision tree obtained from the first principal component. We also perform experiments where we keep k significant components6. The trees are combined by weighting them according to the coefficients obtained from a Least Square Regression7. Each orthogonal decision tree is weighted using coefficients calculated from Least Square Regression. For this, we allow all the orthogonal decision trees to individually produce their classification on the test set. Thus each ODT produces a column vector of its classification estimates. Since the class labels in the test set are already known, we use least square regression to obtain the weights to assign to each ODT. The accuracy of the orthogonal decision trees is reported as ODT-LR (ODTs combined using Least Square Regression).

6 We select the value of k in such a manner that the total variance captured is more than 90%. One could potentially do cross-validation to obtain a suitable value of k, as pointed out in [189], but this is beyond the current scope of the work and will be explored in future.

7 Several other regression techniques, such as ridge regression and principal component regression, can also be tried. This is left as future work.

In addition to reporting the error in classification, we also report the tree complexity, i.e., the total number of nodes in the tree. Similarly, the term ensemble complexity reflects the total number of nodes in all the trees in the ensemble. A smaller ensemble tree complexity implies a compact representation of an ensemble and is therefore desirable. Our experiments show that ODTs usually offer significantly reduced ensemble


The following section presents the results for the SPECT data set.

SPECT Data Set

This section illustrates the idea of orthogonal decision trees using a well known binary data set. The data set, available from the University of California Irvine Machine Learning Repository, describes the diagnosis of cardiac Single Proton Emission Computed Tomography (SPECT) images into two categories, normal or abnormal. The database of 267 SPECT image sets (patients) is processed to extract features that summarize the original SPECT images. As a result, 44 continuous feature patterns are obtained for each patient, which are further processed to obtain 22 binary feature patterns. The training data set consists of 80 instances and 22 attributes. All the features are binary, and the class label is also binary (depending on whether a patient is deemed normal or abnormal). The test data set consists of 187 instances and 22 attributes.

  Method of classification          Error Percentage
  C4.5                              24.5989 %
  Bagging                           20.85 %
  Random Forest                     22.99466 %
  Aggregated Fourier Tree (AFT)     19.78 %
  ODT from 1st PC                    8.02 %
  ODT-LR                             8.02 %

Table 3.1: Classification error for SPECT data.

  Method of classification                          Tree Complexity
  C4.5                                              13
  Bagging (average of 40 trees)                     5.06
  Random Forest (average of 40 trees)               49.67
  Aggregated Fourier Tree (AFT) (40 trees)          3
  Orthogonal Decision Tree from 1st PC              17
  Orthogonal Decision Trees (average of 15 trees)   4.3

Table 3.2: Tree complexity for SPECT data.

Table 3.1 shows the error percentage obtained with each of the different classification schemes. The root mean squared error for the 10-fold cross validation in the C4.5 experiment is found to be 0.4803 and the standard deviation is 2.3862. For Bagging, the number of trees in the ensemble is chosen to be forty; our experiments reveal that further increasing the number of trees in the ensemble decreases the classification accuracy, possibly due to over-fitting of the data. For the experiments with Random Forests, a forest of 40 trees, each constructed while considering 12 random features, is built. The average out-of-bag error is reported to be 0.3245.

Figure 3.7: The accuracy and tree complexity of C4.5 and RFT for SPECT data.

Figure 3.7 (Left) compares the accuracy of the original C4.5 ensemble with that of the Reconstructed Fourier Tree (RFT) ensemble preserving 90% of the energy of the spectrum. The results reveal that if all of the spectrum is preserved, the accuracy of the original C4.5 tree and the RFT are identical. Removing the higher order Fourier coefficients is equivalent to pruning a decision tree, which explains the higher accuracy of the reconstructed Fourier tree preserving 90% of the energy of the spectrum. Figure 3.7 (Right) compares the tree complexity of the original C4.5 ensemble with that of the RFT ensemble.

In order to construct the orthogonal decision trees, the coefficient matrix is projected onto the fifteen most significant principal components. The most significant principal component captures 85.1048% of the variance, and the tree complexity of the ODT constructed from this component is 17 with an accuracy of 91.97%. Figure 3.8 shows the variance captured by all fifteen principal components. Table 3.2 illustrates the tree complexity for this data set. The orthogonal trees are found to be smaller, thus reducing the complexity of the ensemble.

Figure 3.8: Percentage of variance captured by principal components for SPECT data.

NASDAQ Data Set

The NASDAQ data set is a semi-synthetic data set with 1000 instances and 100 discrete attributes. The original data set has three years of NASDAQ stock quote data. It is preprocessed and transformed to discrete data by encoding the percentage changes in stock quotes between consecutive days. For these experiments we assign 4 discrete values that denote levels of change. The class label predicts whether the Yahoo stock is likely to increase or decrease based on the attribute values of the 99 stocks. We randomly select 200 instances for training and the remaining 800 instances form the test data set. Table 3.3 illustrates the classification accuracies of the different experiments performed on this data set.

  Method of classification          Error Percentage
  C4.5                              24.63 %
  Bagging                           32.75 %
  Random Forest                     25.75 %
  Aggregated Fourier Tree (AFT)     34.51 %
  ODT from 1st PC                   31.12 %
  ODT-LR                            31.12 %

Table 3.3: Classification error for NASDAQ data.

  Method of classification                          Tree Complexity
  C4.5                                              29
  Bagging (average of 60 trees)                     17
  Random Forest (average of 60 trees)               45.71
  Aggregated Fourier Tree (AFT) (60 trees)          15.2
  Orthogonal Decision Tree from 1st PC              3
  Orthogonal Decision Trees (average of 10 trees)   6.2

Table 3.4: Tree Complexity for NASDAQ data.

The root mean squared error for the 10-fold cross validation in the C4.5 experiment is found to be 0.4818 and the standard deviation is 2.2247. C4.5 has the best classification accuracy, though the tree built also has the highest tree complexity. For the bagging experiment, C4.5 trees are built on the data set such that the size of each bag (used to build a tree) is 40% of the data set. A Random Forest of 60 trees, each constructed while considering 50 random features, is built on the training data and tested with the test data set. The average out-of-bag error is reported to be 0.3165.

Figure 3.9: The accuracy and tree complexity of C4.5 and RFT for NASDAQ data.

Figure 3.9 (Left) compares the accuracy of the original C4.5 ensemble with that of the Reconstructed Fourier Tree (RFT) ensemble preserving 90% of the energy of the spectrum. Figure 3.9 (Right) compares the tree complexity of the original C4.5 ensemble with that of the RFT ensemble. For the orthogonal trees, we project the data along the 10 most significant principal components. Figure 3.10 illustrates the percentage of variance captured by the ten most significant principal components.

Figure 3.10: Percentage of variance captured by principal components for NASDAQ data.

Table 3.4 presents the tree-complexity information for this set of experiments. Both the aggregated Fourier tree and the orthogonal trees performed better than the single C4.5 tree or bagging. The tree-complexity result is quite interesting: while the single C4.5 tree has twenty nine nodes, the orthogonal tree from the first principal component requires just three nodes, which is clearly a much more compact representation.

DNA Data Set

The DNA data set (obtained from http://www.liacc.up.pt/ML/statlog/datasets/dna) is a processed version of the corresponding data set available from the UC Irvine repository. The processed StatLog version replaces the symbolic attribute values representing the nucleotides (only A, C, T, G) by 3 binary indicator variables each (e.g. A = 100, C = 010, G = 001, T = 000), so the original 60 symbolic attributes become 180 binary attributes. The data set has three class values 1, 2, and 3, corresponding to exon-intron boundaries (sometimes called acceptors), intron-exon boundaries (sometimes called donors), and the case when neither is true. We further process the data so that there are only two class labels: class 1 represents either donors or acceptors, while class 0 represents neither. The training set consists of 2000 instances and 180 attributes, of which 47.45% belong to class 1 and the remaining 52.55% to class 0. The test data set consists of 1186 instances and 180 attributes, of which 49.16% belong to class 0 and the remaining 50.84% to class 1.

Table 3.5 reports the classification error. The root mean squared error for the 10-fold cross validation in the C4.5 experiment is found to be 0.2263 and the standard deviation is 0.6086. A Random Forest of 10 trees, each constructed while considering 8 random features, is built on the training data and tested with the test data set. The average out-of-bag error is reported to be 0.2196.

  Method of classification          Error Percentage
  C4.5                              6.4924 %
  Bagging                           8.9376 %
  Random Forest                     4.595275 %
  Aggregated Fourier Tree (AFT)     8.347 %
  ODT from 1st PC                   10.70 %
  ODT-LR                            10.70 %

Table 3.5: Classification error for DNA data.

It may be interesting to note that the first five eigenvectors are used in this experiment. Figure 3.11 shows the variance captured by these components. As before, the redundancy-free trees are combined with the weights obtained from least square regression. Table 3.6 reports the tree complexity for this data set.

  Method of classification                         Tree Complexity
  C4.5                                             131
  Bagging (average of 10 trees)                    34
  Random Forest (average of 10 trees)              701.22
  Aggregated Fourier Tree (AFT) (10 trees)         3
  Orthogonal Decision Tree from 1st PC             25
  Orthogonal Decision Trees (average of 5 trees)   7.4

Table 3.6: Tree Complexity for DNA data.

Figure 3.11: Percentage of variance captured by principal components for DNA data.

Figure 3.12 (Left) compares the accuracy of the original C4.5 ensemble with that of the Reconstructed Fourier Tree (RFT) ensemble preserving 90% of the energy of the spectrum. Figure 3.12 (Right) compares the tree complexity of the original C4.5 ensemble with that of the RFT ensemble.

House of Votes Data Set

The 1984 United States Congressional Voting Records Database is obtained from the University of California Irvine Machine Learning Repository. This data set includes the votes of each U.S. House of Representatives Congressman on the 16 key votes identified by the CQA, including water project cost sharing, adoption of the budget resolution, MX missile, immigration, etc. It has 435 instances, 16 boolean valued attributes, and a binary class label (democrat or republican). Our experiments use the first 335 instances for training and the remaining 100 instances for testing. Missing values in the data are replaced by one.

The results of classification are shown in Table 3.7, while the tree complexity is shown in Table 3.8. The root mean squared error for the 10-fold cross validation in the C4.5 experiment is found to be 0.2634 and the standard deviation is 0.3862.

Figure 3.12: The accuracy and tree complexity of C4.5 and RFT for DNA data.

For Bagging, fifteen trees are constructed, since this produced the best classification results; the size of each bag is 20% of the training data set. A Random Forest of fifteen trees, each constructed by considering 8 random features, produces an average out-of-bag error of 0.05502. The accuracy of classification and the tree complexity of the original C4.5 and RFT ensembles are illustrated in the left and right hand sides of Figure 3.13, respectively. For the orthogonal trees, the coefficient matrix is projected onto the five most significant principal components; Figure 3.14 (Left) illustrates the amount of variance captured by each of these components.

  Method of classification          Error Percentage
  C4.5                              8.0 %
  Bagging                           11.0 %
  Random Forest                     5.6 %
  Aggregated Fourier Tree (AFT)     11 %
  ODT from 1st PC                   11 %
  ODT-LR                            11 %

Table 3.7: Classification error for House of Votes data.

Contraceptive Method Usage Data Set

This data set is obtained from the University of California Irvine Machine Learning Repository and is a subset of the 1987 National Indonesia Contraceptive Prevalence Survey.

Figure 3.13: The accuracy and tree complexity of C4.5 and RFT for House of Votes data.

The samples are married women who were either not pregnant or did not know if they were at the time of the interview. The problem is to predict the current contraceptive method choice of a woman based on her demographic and socio-economic characteristics. There are 1473 instances and 10 attributes, including a binary class label. All attributes are processed so that they are binary. Our experiments use 1320 instances for the training set, while the rest form the test data set.

The results of classification are tabulated in Table 3.9, while Table 3.10 shows the tree complexity. The root mean squared error for the 10-fold cross validation in the C4.5 experiment is found to be 0.5111 and the standard deviation is 1.8943. A Random Forest built with 10 trees, considering 5 random features, produces an average classification error of about 45.88% and an average out-of-bag error of 0.42556. Figure 3.15 (Left) compares the accuracy of the original C4.5 ensemble with that of the Reconstructed Fourier Tree (RFT) ensemble preserving 90% of the energy of the spectrum.

  Method of classification                         Tree Complexity
  C4.5                                             9
  Bagging (average of 15 trees)                    5.266
  Random Forest (average of 15 trees)              37.42
  Aggregated Fourier Tree (AFT) (15 trees)         5
  Orthogonal Decision Tree from 1st PC             5
  Orthogonal Decision Trees (average of 5 trees)   3

Table 3.8: Tree Complexity for House of Votes data.

Figure 3.14: Percentage of variance captured by principal components for (Left) House of Votes data and (Right) Contraceptive Method Usage data.

Figure 3.15 (Right) compares the tree complexity of the original C4.5 ensemble with that of the RFT ensemble. For the ODTs, the data is projected along the first ten principal components. Figure 3.14 (Right) shows the amount of variance captured by each principal component. It is interesting to note that the first principal component captures only about 61.85% of the variance, and thus the corresponding ODT generated from it has a relatively high tree complexity.

  Method of classification          Error Percentage
  C4.5                              49.6732 %
  Bagging                           52.2876 %
  Random Forest                     45.88234 %
  Aggregated Fourier Tree (AFT)     33.98 %
  ODT from 1st PC                   46.40 %
  ODT-LR                            46.40 %

Table 3.9: Classification error for Contraceptive Method Usage data.

  Method of classification                          Tree Complexity
  C4.5                                              27
  Bagging (average of 10 trees)                     24.8
  Random Forest (average of 10 trees)               298.11
  Aggregated Fourier Tree (AFT) (10 trees)          55
  Orthogonal Decision Tree from 1st PC              15
  Orthogonal Decision Trees (average of 10 trees)   6.6

Table 3.10: Tree Complexity for Contraceptive Method Usage data.

Figure 3.15: The accuracy and tree complexity of C4.5 and RFT for Contraceptive Method Usage data.

3.3 DDM on Data Streams

3.3.1 Introduction

Several challenging new applications demand the ability to do data mining on resource constrained devices. One such application is the monitoring of physiological data streams obtained from wearable sensing devices. Such monitoring has applications in pervasive healthcare management, be it for seniors, emergency response personnel, soldiers in the battlefield, or athletes. A key requirement is that the monitoring system be able to run on resource constrained handheld or wearable devices. Orthogonal decision trees (ODTs), introduced in Section 2.3.2, offer an effective way to construct a redundancy-free, accurate, and meaningful representation of the large decision-tree ensembles often created by popular techniques such as Bagging, Boosting, Random Forests, and many distributed and data stream mining algorithms. This section discusses various properties of ODTs and their suitability for monitoring physiological data streams in a resource-constrained environment. It offers experimental results documenting the performance of orthogonal trees in terms of accuracy, model complexity, and other characteristics in a resource-constrained mobile environment. In closing, we argue that this application would have significant benefits if integrated with a grid infrastructure.

Physiological Data Stream Monitoring

We draw two scenarios to illustrate the potential uses of physiological data stream monitoring. Both cases involve a situation where a potentially complex decision space has to be examined, and yet the resources available on the devices that will run the decision process are not sufficient to maintain and use ensembles.

Consider a real time environment to monitor the health effects of environmental toxins or disease pathogens on humans. Significant advances are being made today in biochemical engineering to create extremely low cost sensors for various toxins [162] that could constantly monitor the environment and generate data streams over wireless networks. It is not unreasonable to assume that similar sensors could be developed to detect disease causing pathogens. In addition, most state health/environmental agencies and federal government entities such as the CDC and EPA have mobile labs and response units that can test for the presence of pathogens or dangerous chemicals. The mobile units will have handheld devices with wireless connections on which to send the data and/or their analysis. In addition, each hospital today generates reports on admissions and discharges, and often reports them to various monitoring agencies. Given these disparate data streams, one could analyze them to see if correlates can be found, alerting experts to potential cause-effect relations (Pfiesteria found in Chesapeake Bay and hospitals report many people with upset stomachs who had seafood recently), potential epidemiological events (field units report dead infected birds and elderly patients check in with viral fever symptoms, indicating tests needed for West Nile virus and preventive spraying), and, more pertinent in present times, low grade chemical and biological attacks (sensors detect particular toxins, mobile units find contaminated sites, hospitals show people who work at or near the sites being admitted with unexplained symptoms). At present, much of this analysis is done "post facto": experts hypothesize on possible causes of ailments, then gather the data from disparate sources to confirm their hypotheses. Clearly, a more proactive environment which could mine these diverse data streams to detect emergent patterns would be extremely useful.

This scenario, of course, has some futuristic elements. On a more present day note, there are now several wearable sensors on the market, such as the SenseWear armband from BodyMedia [30], the Wearable West [272], and the LifeShirt Garment from Vivometrics [266], that can be used to monitor vital signs such as temperature, heart rate, heat flux, etc.


Figure 3.16: The BodyMedia SenseWear armband and the Vivometrics LifeShirt Garment.

Figure 3.16 (left) shows the SenseWear armband that was used to collect the data (the figures are obtained from http://www.cs.utexas.edu/users/sherstov/pdmc/ and http://www.vivometrics.com). The sensors in this band are capable of measuring the following:

1. Heat flux: the amount of heat dissipated by the body.
2. Accelerometer: motion of the body.
3. Galvanic skin response: electrical conductivity between two points on the wearer's arm.
4. Skin temperature: temperature of the skin, generally reflective of the body's core temperature.
5. Near-body temperature: air temperature immediately around the wearer's armband.

The subjects were expected to wear the armband as they went about their daily routine, and were required to timestamp the beginning and end of an activity. For example, before starting a jog they could press the timestamp button, and when finished they could press the button again to record the end of the activity. This body monitoring device can be worn continuously and can store up to 5 days of physiological data before it has to be retrieved. The LifeShirt Garment is another example of an easy to wear shirt that allows measurement of pulmonary functions via sensors woven into the shirt; Figure 3.16 (right) shows the heart monitor. Subjects are capable of recording symptoms, moods, activities, and several other physiological characteristics.

Analysing these vital signs in real time using small form factor wearable computers has several valuable near term applications. For instance, one could monitor senior citizens living in assisted or independent housing, to alert physicians and support personnel if the signs point to distress. Similarly, one could monitor athletes during games or practice; given the recent high profile deaths of athletes at both the professional and high school levels during practice, the importance of such an application is fairly apparent. Other potential applications include battlefield monitoring of soldiers, or monitoring first responders such as firefighters.

3.3.2 Experimental Results

In order to perform online monitoring of physiological data using wearable or handheld devices (PDAs, cell phones), data streams are sent to them from sensors over short range wireless networks such as PANs. Precomputed orthogonal decision trees and bagging ensembles (built from training data obtained previously) are kept on these devices. The data streams are classified using these precomputed models, which are updated on a periodic basis. It must be noted that while the monitoring is in real time, the model computation is done off-line using stored data.

This section documents the performance of orthogonal decision trees on a physiological data set. It makes use of a publicly available data set in order to offer benchmarked results. The data set was obtained from the Physiological Data Modeling Contest (http://www.cs.utexas.edu/users/sherstov/pdmc/) held as part of the International Conference on Machine Learning, 2004. It comprises several months of data from more than a dozen subjects and was collected using BodyMedia (http://www.bodymedia.com/index.jsp) wearable body monitors. In our experiments, the training set consisted of 50,000 instances and 11 continuous and discrete valued attributes (the attributes used were gender, galvanic skin temperature, heat flux, near body temperature, pedometer, skin temperature, readings from the longitudinal and transverse accelerometers, and the time for recording an activity, called session time). The test set had 32,673 instances. The continuous valued attributes were discretized using the WEKA software (http://www.cs.waikato.ac.nz/ml/weka/), so the final training and test data sets had only discrete valued attributes. A binary classification problem was formulated: monitor whether an individual was engaged in a particular activity (class label = 1) or not (class label = 0) based on the physiological sensor readings.

C4.5 decision trees were built on data blocks of 150 instances and their classification accuracy and tree complexity were noted. These trees were then used to compute their Fourier spectra, and the matrix of Fourier coefficients was subjected to principal component analysis. Orthogonal trees were built corresponding to the significant components and combined using a uniform aggregation scheme. The accuracy and size of the orthogonal trees were noted and compared with the corresponding results generated by Bagging using the same number of decision trees in the ensemble.

Figure 3.17 illustrates four decision trees built on the uniformly sampled training data set (each sample of size 150). The first decision tree has a complexity of 7 and considers the transverse accelerometer reading, session time, and near body temperature attributes as ideal for splits.
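The monitoring loop just described (precomputed models applied on-device to fixed-size blocks of discretized readings) could look roughly like the sketch below. This is a hypothetical illustration rather than the deployed code; only the block size of 150 and the transverse accelerometer attribute come from the text, everything else is made up.

```python
# Hypothetical sketch of on-device stream monitoring: a precomputed model
# classifies each incoming block of discretized sensor readings; the model
# itself is rebuilt off-line and pushed to the device periodically.
from collections import deque

BLOCK_SIZE = 150   # instances per data block, as in the experiments

def classify_block(model, block):
    """Apply the precomputed classifier to every instance in a block."""
    return [model(instance) for instance in block]

def monitor(stream, model, block_size=BLOCK_SIZE):
    """Yield per-block predictions as readings arrive from the sensors."""
    buffer = deque()
    for reading in stream:
        buffer.append(reading)
        if len(buffer) == block_size:
            yield classify_block(model, list(buffer))
            buffer.clear()

if __name__ == "__main__":
    # Toy model: flag activity when the (discretized) accelerometer value is high.
    toy_model = lambda instance: 1 if instance["transverse_accel"] >= 2 else 0
    toy_stream = ({"transverse_accel": i % 4} for i in range(450))
    for predictions in monitor(toy_stream, toy_model):
        print(sum(predictions), "positive instances in this block")
```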


Figure 3.17: Decision Trees built from four different samples of the physiological data set

Figure 3.18: An Orthogonal Decision Tree

Before pruning, only two instances are misclassified, giving an error of 1.3%. After pruning, there is no change in the structure of the tree; the estimated error percentage is 4.9%. The second, third, and fourth decision trees have complexities 5, 7, and 3, respectively. An illustration of an orthogonal decision tree obtained from the first principal component is shown in Figure 3.18. Figure 3.19 illustrates the distribution of tree complexity and classification error for the original C4.5 trees used to construct an ODT ensemble. The total number of nodes in the original C4.5 trees varied between three and thirteen, and the trees had an error of less than 25%.

Figure 3.19: Histogram of tree complexity (left) and error (right) in classification for the original C4.5 trees.

Figure 3.20: Histogram of error in classification in the ODT ensemble.

In comparison, the average complexity of the orthogonal decision trees was found to be 3 for all the different ensemble sizes. In fact, for this particular data set, the sensor reading corresponding to the transverse accelerometer attribute was found to be the most interesting: all the orthogonal decision trees used this attribute as the root node. Figure 3.20 illustrates the distribution of classification error for an ODT ensemble of 75 trees.


Figure 3.21: Comparison of error in classification for trees in the ensemble for aggregated ODT versus Bagging.

Figure 3.22: Plot of Tree Complexity Ratio versus number of trees in the ensemble.

We compared the accuracy obtained from an aggregated orthogonal decision tree to that obtained from a bagging ensemble (using the same number of trees in each case). Figure 3.21 plots the classification error of the aggregated ODT and bagging versus the number of decision trees in the ensemble. We found that the classification from an aggregated orthogonal decision tree was better than bagging when the number of trees in the ensemble was small. As the number of trees in the ensemble increased, bagging provided slightly better accuracy. It must be noted, however, that in constrained environments such as pocket PCs, personal digital assistants, and sensor network settings, increasing the number of trees in the ensemble arbitrarily may not be feasible due to memory constraints.


Figure 3.23: Variance captured by the first principal component versus the number of trees in the ensemble.

In resource constrained environments it is often necessary to keep track of the amount of memory used to store the ensemble. In the current implementation, storing a node data structure in a tree requires approximately 1 KB of memory. Consider an ensemble of 20 trees: if the average number of nodes per tree is 7, then we are required to store about 140 KB of data. Orthogonal decision trees, on the other hand, are smaller in size, with less redundancy; in our experiments they typically have a complexity of 3 nodes, which means we need to store only about 3 KB of data. We define the Tree Complexity Ratio (TCR) as the total number of nodes in the ODT representation divided by the total number of nodes in the bagging ensemble. Figure 3.22 plots the variation of the TCR as the number of trees in the ensemble increases. It may be noted that in resource constrained environments one can opt for meaningful trees of smaller size and comparable accuracy as opposed to larger ensembles with slightly better accuracy. An orthogonal decision tree also helps in the feature selection process and indicates which attributes are more important than others in the data set. Figure 3.23 indicates the variance captured by the first principal component as the number of trees in the ensemble is varied from 5 to 75. As expected, as the number of trees in the ensemble increases, the first principal component captures most of the variance, while the shares of the second and third components gradually decrease. The following section illustrates the response time for classification on a pocket PC using a bagging ensemble and an equivalent orthogonal decision tree ensemble.
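To make the memory figures above concrete, the toy calculation below reproduces the arithmetic of the preceding paragraph under its stated assumption of roughly 1 KB per stored tree node; the node counts are the illustrative values used in the text, not measurements.

```python
# Back-of-the-envelope memory estimate and Tree Complexity Ratio (TCR),
# assuming ~1 KB of storage per decision-tree node as stated above.
KB_PER_NODE = 1

def ensemble_memory_kb(num_trees, avg_nodes_per_tree):
    return num_trees * avg_nodes_per_tree * KB_PER_NODE

def tree_complexity_ratio(odt_nodes, bagging_nodes):
    """TCR = total nodes in the ODT representation / total nodes in bagging."""
    return odt_nodes / bagging_nodes

bagging_nodes = 20 * 7            # 20 trees averaging 7 nodes each
odt_nodes = 3                     # a single aggregated ODT with 3 nodes
print(ensemble_memory_kb(20, 7))          # 140 KB for the bagging ensemble
print(ensemble_memory_kb(1, odt_nodes))   # 3 KB for the aggregated ODT
print(round(tree_complexity_ratio(odt_nodes, bagging_nodes), 3))  # ~0.021
```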


Figure 3.24: Plot of Response time for Bagging and equivalent ODT ensemble versus the number of trees in the ensemble.

3.3.3 Monitoring in Resource Constrained Environments

Resource constrained environments such as personal digital assistants, pocket PCs, and cell phones are often used to monitor the physiological condition of subjects. These devices present additional challenges owing to their limited battery power, memory restrictions, and small displays. The previous section showed that an aggregated orthogonal decision tree is small in size and achieves accuracy better than or comparable to that of bagging when the ensemble size is small. Although bagging was found to perform better with larger ensembles, the number of trees that would need to be stored is considerably larger and clearly not an option in resource constrained environments; a trade-off therefore exists between memory usage and accuracy.

In order to test the response time for monitoring, we performed classification experiments on an HP iPAQ Pocket PC. We assumed that physiological data blocks of 40 instances were sent to the handheld device. Using training data obtained previously, we precomputed C4.5 decision trees. The Fourier spectra of the trees were evaluated (preserving approximately 99% of the total energy) and the coefficient matrix was projected onto the most significant principal components. Since the time required for computation is of considerable importance in resource constrained environments, we estimated the response time of the bagging ensemble versus the equivalent ODT ensemble. We define response time as the time required to produce an accuracy estimate from all the available instances under the specified classification scheme. Figure 3.24 illustrates the response time for a bagging ensemble and an equivalent ODT ensemble. Clearly the equivalent orthogonal decision tree produces classification results faster than a bagging ensemble, which may be attributed to the fact that much of the redundancy in the bagging ensemble has been removed in the ODT ensemble. Our method thus offers a computationally efficient approach to classification on resource constrained devices.
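Response time, as defined above, is simply the wall-clock time needed to label every available instance under a given scheme. The snippet below shows one way such a measurement could be set up; the classifiers here are stand-ins, not the Pocket PC implementation.

```python
# Sketch of the response-time measurement: time how long each classification
# scheme needs to label every available instance. Classifiers are stand-ins.
import time

def response_time(classifiers, instances):
    """Seconds needed to obtain a prediction from every classifier on
    every instance (majority voting would follow for an ensemble)."""
    start = time.perf_counter()
    for x in instances:
        for clf in classifiers:
            clf(x)
    return time.perf_counter() - start

instances = list(range(40))                                # a block of 40 readings
bagging_ensemble = [lambda x: x % 2 for _ in range(20)]    # 20 stand-in trees
odt_equivalent = [lambda x: x % 2]                         # one compact stand-in tree
print("bagging:", response_time(bagging_ensemble, instances))
print("ODT    :", response_time(odt_equivalent, instances))
```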


3.3.4 Grid Based Physiological Data Stream Monitoring - A Dream or Reality?

In the previous sections, we described an application of a distributed data mining technique (orthogonal decision tree ensembles) for monitoring physiological data streams in time-critical, resource-constrained environments. It was shown that ODTs offer an effective way to construct redundancy-free ensembles that are easier to understand and apply. They are particularly useful in monitoring data streams on resource constrained platforms where storage and CPU computing power are limited but fast response is important. ODTs are constructed from the Fourier spectra of the decision trees in the ensemble. Redundancy is removed from the ensemble by performing a PCA of these Fourier spectra. This offers an efficient representation of the ensemble, often needed for fast response in many real-time data mining applications, and also allows a meaningful way to visualize the trees in a low dimensional space.

There is a rising trend toward analysis of physiological data obtained from wearable devices and sensors, such as the work of the Biomedical Informatics Group at Nottingham University (http://www.eee.nott.ac.uk/medical/) and the Computer Assisted Reporting (http://www.gla.ac.uk/care/) of Electro Cardio Grams (ECG). These applications need to facilitate the management of experimental scenarios for health monitoring and provide easy access to historical or real-time streaming data for analysis. Furthermore, the data is inherently distributed, since users of the medical devices and data miners are typically not located at the same place. In some cases it becomes necessary to make the data available through grid facilities for remote usage by doctors and other medical professionals. It is also important to allow users of the grid to directly interact with remote devices (e.g. vests or jackets carrying physiological data monitoring sensors, or armbands recording galvanic skin temperature). We have shown that it is possible to do sophisticated distributed data analysis on resource-constrained devices. However, porting such devices to the grid is an active area of research. The only related work known to us at this time has been done as part of the Equator project at the University of Nottingham (http://www.equator.ac.uk/index.php/articles/c70/). Even this work makes assumptions such as the monitoring devices being available only through grid services, and thus provides a simulated environment. While it serves as an interesting proof of concept and motivates further research, it is apparent that physiological data stream monitoring on the grid still has a long road ahead. In the following sections we examine the feasibility of doing distributed data mining on federated astronomy catalogs.

3.4 DDM on Federated Databases

3.4.1 The National Virtual Observatory

There are several instances in the astronomy and space sciences research communities where data mining is being applied to large data collections [76, 196]. Some dedicated data mining projects include F-MASS [98], Class-X [67], the Auton Astrostatistics Project [16], and additional VO-related data mining activities (such as SDMIV [241]).

In essentially none of these cases does the project involve truly distributed data mining [187]. Through a past NASA-funded project, K. Borne applied some very basic DDM concepts to astronomical data mining [33]; however, the primary accomplishments focused only on centralized co-location of the data sources [32, 31]. One of the first large-scale attempts at grid data mining for astronomy is the U.S. National Science Foundation (NSF) funded GRIST [116] project. The GRIST goals include the application of grid computing and web services (service-oriented architectures) to mining large distributed data collections. GRIST is focused on one particular data modality: images. Hence, GRIST aims to deliver mining on the pixel planes within multiple distributed astronomical image collections. The project that we are proposing here is aimed at another data modality: catalogs (tables) of astronomical source attributes. GRIST and other projects also strive for exact results, which usually requires data centralization and co-location, which in turn requires significant computational and communication resources. DEMAC (our system) will produce approximate results without requiring data centralization (low communication overhead). Users can quickly get (generally quite accurate) results for their distributed queries at low communication cost. Armed with these results, users can focus on a specific query or portion of the datasets and download it for more intricate analysis.

The U.S. National Virtual Observatory (NVO) [203] is a large scale effort funded by the NSF to develop an information technology infrastructure enabling easy and robust access to distributed astronomical archives. It will provide services for users to search and gather data across multiple archives, along with some basic statistical analysis and visualization functions. It will also provide a framework for new services to be made available by outside parties. These services can provide, among other things, specialized data analysis capabilities. As such, we envision DEMAC fitting into the NVO as a new service.

The Virtual Observatory can be seen as part of an ongoing trend toward the integration of information sources. The main paradigm used today for the integration of these data systems is that of a data grid [91, 115, 140, 225, 92, 263, 195, 28]. Among the desired functionalities of a data grid, data analysis takes a central place. As such, there are several projects [84, 159, 112, 116, 125, 129] which in the last few years have attempted to create a data mining grid. In addition, grid data mining has been the focus of several recent workshops [158, 85].

DDM is a relatively new technology that has been enjoying considerable interest in the recent past [214, 156]. DDM algorithms strive to analyze the data in a distributed manner without downloading all of it to a single site (which is usually necessary for a regular centralized data mining system). DDM algorithms naturally fall into two categories according to whether the data is distributed horizontally (with each site having some of the tuples) or vertically (with each site having some of the attributes for all tuples). In the latter case, it is assumed that the sites have an associated unique id used for matching. In other words, consider a tuple t and assume site S1 has a part of this tuple, t1, while S2 has the remaining part t2; then the id associated with t1 equals the id associated with t2. (Each id is unique to the site at which it resides; no two tuples at a site have the same id, but ids can match across sites: a tuple at one site can have the same id as a tuple at another.) The NVO can be seen as a case of vertically distributed data, assuming ids have been generated by a cross-matching service. With this assumption, DDM algorithms for vertically partitioned data can be applied. These include algorithms for principal component analysis (PCA) [157, 155], clustering [145, 157], Bayesian network learning [62, 63], and supervised classification [49, 108, 123, 215, 262].
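As a small illustration of the vertically partitioned view assumed here, the hypothetical snippet below keeps a different slice of each tuple's attributes at each of two sites, keyed by a shared object id; the ids, attribute names, and values are invented for the example.

```python
# Toy illustration of vertical partitioning: two sites hold different
# attribute sets for the same objects, joined by a shared id.
site_A = {  # e.g. a few SDSS-like attributes, keyed by object id
    101: {"rs": 0.05, "vd": 210.0},
    102: {"rs": 0.08, "vd": 175.0},
}
site_B = {  # e.g. a 2MASS-like attribute for the same ids
    101: {"Kmsb": 17.2},
    102: {"Kmsb": 16.8},
}

def virtual_tuple(obj_id):
    """Reassemble the full (virtual) tuple from both sites."""
    return {**site_A[obj_id], **site_B[obj_id]}

print(virtual_tuple(101))   # {'rs': 0.05, 'vd': 210.0, 'Kmsb': 17.2}
```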

3.4.2 Data Analysis Problem: Analyzing Distributed Virtual Catalogs

We illustrate the problem with two archives: the Sloan Digital Sky Survey (SDSS) [242] and the 2-Micron All-Sky Survey (2MASS) [3]. Each of these has a simplified catalog containing records for a large number of astronomical point sources, upward of 100 million for SDSS and 470 million for 2MASS. Each record contains sky coordinates (ra, dec) identifying the source's position on the celestial sphere as well as many other attributes (460+ for SDSS; 420+ for 2MASS). While each of these catalogs individually provides valuable data for scientific exploration, together their value increases significantly. In particular, efficient analysis of the virtual catalog formed by joining these catalogs would enhance their scientific value significantly. Henceforth, we use "virtual catalog" and "virtual table" interchangeably.

To form the virtual catalog, records in each catalog must first be matched based on their position on the celestial sphere. Consider record r from SDSS and record s from 2MASS, with sky coordinates (ra_r, dec_r) and (ra_s, dec_s). Each record represents a set of observations about an astronomical object, e.g. a galaxy. The sky coordinates are used to determine whether r and s match, i.e. whether they are close enough that r and s represent the same astronomical object. The issue of how matching is done will be discussed later. For each match (r, s), the result is a record combining r and s in the virtual catalog, with all of the attributes of r and s. As described earlier, the virtual catalog provides valuable data that neither SDSS nor 2MASS alone can provide.

DEMAC addresses the data analysis problem of developing communication-efficient algorithms for analyzing user-defined subsets of virtual catalogs. The algorithms allow the user to specify a region Q of the sky and a virtual catalog, then efficiently analyze the subset of tuples from that catalog with sky coordinates in Q. Importantly, the algorithms we propose do not require that the base catalogs first be centralized and the virtual catalog explicitly realized. Moreover, the algorithms are not intended to be a substitute for the exact, centralization-based methods currently being developed as part of the NVO. Rather, they are intended to complement these methods by providing quick, communication-efficient, approximate results to allow browsing. Such browsing will allow the user to better focus their exact, communication-expensive queries.

Example 3. The all-sky data release of 2MASS contains the attribute "K band mean surface brightness" (Kmsb). Data release four of SDSS contains the galaxy attributes "redshift" (rs), "Petrosian I band angular effective radius" (Iaer), and "velocity dispersion" (vd). To produce a physical variable, consider the composite attribute "Petrosian I band effective radius" (Ier), formed as the product of Iaer and rs. Note that, since Iaer and rs are both at the same repository (SDSS), from the standpoint of distributed computation we may assume Ier is contained in SDSS. A principal component analysis over a region of sky Q on the virtual table with columns log(Ier), log(vd), and Kmsb is interesting in that it can allow the identification of a "fundamental plane" (the logarithms are used to place all variables on the same scale). Indeed, if the first two principal components capture most of the variance, then these two components define a fundamental plane. The existence of such structure points to interesting astrophysical behaviors. We develop a communication-efficient distributed algorithm for approximating the principal components of a virtual table.
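The exploratory analysis of Example 3 can be sketched as follows: form the composite attribute, take logarithms, standardize the columns, and check how much variance the first two principal components capture. The data below are synthetic stand-ins, not SDSS/2MASS values; only the attribute names follow the example.

```python
# Sketch of the Example 3 analysis on synthetic stand-in data:
# build log(Ier) = log(Iaer * rs), log(vd), keep Kmsb, standardize,
# and test whether the first two PCs capture most of the variance
# (the "fundamental plane" signature).
import numpy as np

rng = np.random.default_rng(1)
n = 1000
iaer = rng.lognormal(mean=0.0, sigma=0.3, size=n)   # angular effective radius
rs = rng.uniform(0.02, 0.2, size=n)                 # redshift
vd = rng.lognormal(mean=5.0, sigma=0.2, size=n)     # velocity dispersion
kmsb = 17.0 + rng.normal(scale=0.5, size=n)         # K-band mean surface brightness

X = np.column_stack([np.log(iaer * rs), np.log(vd), kmsb])
X = (X - X.mean(axis=0)) / X.std(axis=0)            # per-column standardization

eigvals = np.sort(np.linalg.eigvalsh(np.cov(X, rowvar=False)))[::-1]
captured_by_plane = eigvals[:2].sum() / eigvals.sum()
print(f"variance captured by first two PCs: {captured_by_plane:.1%}")
```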

3.4.3 The DEMAC system

This section describes the high level design of the proposed DEMAC system. DEMAC is designed as an additional web-service which seamlessly integrates into the NVO. It consists of two basic services. The main one is a web-service providing DDM capabilities for vertically distributed sky surveys (WS-DDM). The second one, which is used intensively by WS-DDM, is a web-service providing cross-matching capabilities for vertically distributed sky surveys (WS-CM). Cross-matching of sky surveys is a complex topic which is dealt with, in itself, under other NASA funded projects; our implementation of this web-service would therefore supply the bare minimum capabilities required to support distributed data mining.

To provide a distributed data mining service, DEMAC would rely on other services of the NVO, such as the ability to select and download from a sky survey in an SQL-like fashion. Key to our approach is that these services be used not over the web, through the NVO, but rather by local agents which are co-located with the respective sky survey. In this way, the DDM service avoids bandwidth and storage bottlenecks and overcomes restrictions due to data ownership concerns. The agents, in turn, take part in executing distributed data mining algorithms which are highly communication-efficient. It is the outcome of the data mining algorithm, rather than the selected data table, that is provided to the end user. With the removal of the network bandwidth bottleneck, the main factor limiting the scalability of the distributed data mining service would be database access; for database access we intend to rely on the SQL-like interface the different sky-surveys provide to the NVO. We outline here the architecture we propose for the two web-services we will develop.

3.4.4 WS-DDM – DDM for Heterogeneously Distributed Sky-Surveys

This web-service will allow running a DDM algorithm (three will be discussed later) on a selection of sky-surveys. The user would use existing NVO services to locate sky-surveys and define the portion of the sky to be data mined. The user would then use WS-CM to select a cross-matching scheme for those sky-surveys; this specifies how the tuples are matched across surveys to define the virtual table to be analyzed. Following these two preliminary phases the user would submit the data mining task.

Execution of the data mining task would be scheduled according to resource availability. Specifically, the size of the virtual table selected by the user would dictate scheduling. Having allocated the required resources, the data mining algorithm would be carried out by agents which are co-located with the selected sky-surveys. Those agents will access each sky-survey through the SQL-like interface it exposes to the NVO and will communicate with each other directly over the Internet. When the algorithm has terminated, results would be provided to the user through a web interface.

3.4.5 WS-CM – Cross-Matching for Heterogeneously Distributed Sky-Surveys

Central to the DDM algorithms we develop is that the virtual table can be treated as vertically partitioned (see Section 3.4.1 for the definition). To achieve this, match indices are created and co-located with each sky survey. Specifically, for each pair of surveys (tables) T1 and T2, a distinct pair of match indices must be kept, one at each survey. Each index is a list of pointers, and both indices have the same number of entries: the i-th entry of the index at T1 and the i-th entry of the index at T2 point to a pair of cross-matched records, so that the i-th row of the virtual table can be assembled from the two surveys.

3.4.6 DDM Algorithms: Definitions and Notation

Let X denote the n x m data matrix under analysis (n tuples, m attributes). The i-th column of X is denoted X^{(i)}, and X^{(i)}_j denotes the j-th entry of this column. Let E[X^{(i)}] denote the sample mean of this column, i.e.

E[X^{(i)}] = \frac{1}{n} \sum_{j=1}^{n} X^{(i)}_j .    (3.5)

Let Var[X^{(i)}] denote the sample variance of the i-th column, i.e.

Var[X^{(i)}] = \frac{1}{n} \sum_{j=1}^{n} \left( X^{(i)}_j - E[X^{(i)}] \right)^2 .    (3.6)

Let Cov[X^{(i)}, X^{(k)}] denote the sample covariance of the i-th and k-th columns, i.e.

Cov[X^{(i)}, X^{(k)}] = \frac{1}{n} \sum_{j=1}^{n} \left( X^{(i)}_j - E[X^{(i)}] \right) \left( X^{(k)}_j - E[X^{(k)}] \right) .    (3.7)

Note that Var[X^{(i)}] = Cov[X^{(i)}, X^{(i)}]. Finally, let Cov(X) denote the covariance matrix of X, i.e. the m x m matrix whose (i, k) entry is Cov[X^{(i)}, X^{(k)}].

Assume this dataset has been vertically distributed over two sites S1 and S2. Since we are assuming that the data at the sites is perfectly aligned, S1 has the first p attributes and S2 has the last q attributes (p + q = m). Let A denote the n x p matrix representing the dataset held by S1, and B denote the n x q matrix representing the dataset held by S2. Let [A B] denote the concatenation of the datasets, i.e. X = [A B]. The i-th column of [A B] is denoted [A B]^{(i)}.

In the next two sections, we describe communication-efficient algorithms for PCA and outlier detection on X vertically distributed over two sites. Both algorithms easily extend to more than two sites but, for simplicity, we only discuss the two-site scenario. We have also developed a distributed algorithm for decision tree induction (supervised classification); we will not discuss it further and refer the reader to [108] for details. Later we examine the effectiveness of the distributed PCA algorithm through a case study on real astronomical data. We leave for future work the job of testing the decision tree and outlier detection algorithms with case studies.

Following standard practice in applied statistics, we pre-process X by normalizing so that E[X^{(i)}] = 0 and Var[X^{(i)}] = 1. This is achieved by replacing each entry X^{(i)}_j with (X^{(i)}_j - E[X^{(i)}]) / \sqrt{Var[X^{(i)}]}. Since both E[X^{(i)}] and Var[X^{(i)}] can be computed without any communication, normalization can be performed without any communication. Henceforth, we assume E[X^{(i)}] = 0 and Var[X^{(i)}] = 1.

Let \lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_m \geq 0 denote the eigenvalues of Cov(X) and v_1, v_2, \ldots, v_m the associated eigenvectors (we assume the eigenvectors are column vectors, i.e. m x 1 matrices, and pairwise orthonormal). The i-th principal direction of X is v_i. The i-th principal component is denoted Y_i and equals X v_i (the projection of X along the i-th direction).
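The per-column statistics just defined translate directly into code. The snippet below is a sketch of equations (3.5)-(3.7) and of the normalization step: each site can standardize its own columns without communication, and the full covariance matrix is formed here only for reference.

```python
# Sketch of equations (3.5)-(3.7): per-column mean, variance and covariance,
# plus the local normalization step (zero mean, unit variance per column).
import numpy as np

def normalize_columns(X):
    """Replace each entry X[j, i] with (X[j, i] - mean_i) / sqrt(var_i).
    Each site can do this on its own columns without any communication."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def covariance_matrix(X):
    """m x m matrix whose (i, k) entry is the sample covariance of
    columns i and k; after normalization this is also the correlation."""
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    return (Xc.T @ Xc) / n

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 4))
Xn = normalize_columns(X)
C = covariance_matrix(Xn)
eigvals, eigvecs = np.linalg.eigh(C)     # principal variances / directions
print(np.round(C, 3))
```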

3.4.7 Virtual Catalog Principal Component Analysis

PCA is a well-established data analysis technique used in a large number of disciplines: astronomy, computer science, biology, chemistry, climatology, geology, etc. Quoting [146], page 1: "The central idea of PCA is to reduce the dimensionality of a data set consisting of a large number of interrelated variables, while retaining as much as possible of the variation present in the data set." Next we provide a very brief overview of PCA; for a more detailed treatment the reader is referred to [146].

The i-th principal component, Y_i, is, by definition, a linear combination of the columns of X in which the j-th column has coefficient v_i(j). The sample variance of Y_i equals \lambda_i. The principal components are all uncorrelated, i.e. they have zero pairwise sample covariances. Let Y_k (1 \leq k \leq m) denote the n x k matrix with columns Y_1, \ldots, Y_k; this is the dataset projected onto the subspace defined by the first k principal directions. If k = m, then Y_m is simply a different way of representing exactly the same dataset, because X can be recovered completely as X = Y_m V_m^T, where V_k denotes the m x k matrix with columns v_1, \ldots, v_k and ^T denotes matrix transpose (V_m is a square matrix with orthonormal columns, hence V_m V_m^T equals the m x m identity matrix). However, if k < m, then Y_k is a lossy lower dimensional representation of X. The amount of loss is typically quantified as

\frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{m} \lambda_i} ,    (3.8)

the "proportion of variance" captured by the lower dimensional representation. The larger the proportion captured, the better Y_k represents the "information" contained in the original dataset X. If k is chosen so that a large amount of the variance is captured then, intuitively, Y_k captures many of the important features of X, and subsequent analysis on Y_k can be quite fruitful at revealing structure not easily found by examining X directly. Our case study will employ this idea.

To our knowledge, the problem of vertically distributed PCA computation was first addressed by Kargupta et al. [157], based on sampling and communication of dominant eigenvectors. Later, Kargupta and Puttagunta [155] developed a technique based on random projections. Our method is a slightly revised version of this work. We describe a distributed algorithm for approximating Cov([A B]). Clearly, PCA can be performed from Cov([A B]) without any further communication.

Recall that [A B] is normalized to have zero column sample mean and unit column sample variance. As a result,

Cov([A B]^{(i)}, [A B]^{(j)}) = \frac{1}{n} \left( [A B]^{(i)} \right)^T [A B]^{(j)} ,

which is (up to the factor 1/n) the inner product between [A B]^{(i)} and [A B]^{(j)}. Clearly this inner product can be computed without communication when [A B]^{(i)} and [A B]^{(j)} are at the same site (i.e. 1 \leq i, j \leq p or p + 1 \leq i, j \leq p + q). It suffices to show how the inner product can be approximated across different sites, in effect, how A^T B can be approximated. The key idea is based on the following fact, echoing the observation made in [199] that high-dimensional random vectors are nearly orthogonal; a similar result was proved elsewhere [12].

Fact 1. Let R be a t x n matrix each of whose entries is drawn independently from a distribution with mean zero and variance one. Then E[R^T R] = t I_n, where I_n is the n x n identity matrix.

We use Algorithm 3.4.7.1 for computing A^T B. The result is obtained at both sites (in the communication cost calculations, we assume a message requires 4 bytes of transmission).

Algorithm 3.4.7.1 (Distributed Covariance Matrix Algorithm)

1. S1 sends S2 a random number generator seed. [1 message]
2. S1 and S2 each generate the same t x n random matrix R, with t much smaller than n. Each entry is generated independently and identically from a distribution with mean zero and variance one.
3. S1 sends RA to S2; S2 sends RB to S1. [tm messages]
4. S1 and S2 both compute \widehat{A^T B} = \frac{1}{t} (RA)^T (RB).

Note that

E\left[ \widehat{A^T B} \right] = E\left[ A^T \frac{R^T R}{t} B \right] = A^T \frac{E[R^T R]}{t} B ,    (3.9)

which, by Fact 1, equals

A^T \frac{t I_n}{t} B = A^T B .    (3.10)

Hence, on expectation, the algorithm is correct. However, its communication cost (in bytes) divided by the cost of the centralization-based algorithm,

\frac{4tm + 4}{4nm} = \frac{t}{n} + \frac{1}{nm} ,    (3.11)

is small if t is much smaller than n. Indeed, t provides a "knob" for tuning the trade-off between communication efficiency and accuracy. Later, in our case study, we present experiments measuring this trade-off.
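A compact, single-machine simulation of Algorithm 3.4.7.1 is sketched below under the stated assumptions: both sites derive the same t x n random matrix R from a shared seed, exchange only the t-row projections RA and RB, and estimate A^T B as (RA)^T (RB) / t. The synthetic data and the particular t values are illustrative; in DEMAC, the agents co-located with each catalog would play the two sites' roles.

```python
# Minimal simulation of Algorithm 3.4.7.1 (random-projection estimate of the
# cross-site block A^T B of the covariance matrix). Synthetic data stands in
# for the two catalogs.
import numpy as np

def distributed_cross_covariance(A, B, t, seed):
    """Estimate A^T B exchanging only the t-row projections R·A and R·B.
    Both sites generate the same R from the shared seed (steps 1-2); steps
    3-4 exchange RA, RB and form (RA)^T (RB) / t."""
    n = A.shape[0]
    R = np.random.default_rng(seed).normal(size=(t, n))   # mean 0, variance 1
    RA, RB = R @ A, R @ B        # the only matrices that cross the network
    return (RA.T @ RB) / t

rng = np.random.default_rng(0)
n, p, q = 5000, 2, 1             # e.g. log(Ier), log(vd) at S1; Kmsb at S2
X = rng.normal(size=(n, p + q))
X = (X - X.mean(axis=0)) / X.std(axis=0)        # normalized virtual table
A, B = X[:, :p], X[:, p:]

exact = (A.T @ B) / n            # centralized answer, for comparison
for t in (50, 250, 1000):        # the "knob": larger t, more communication
    approx = distributed_cross_covariance(A, B, t, seed=42) / n
    comm_ratio = (t * (p + q) + 1) / (n * (p + q))          # cf. eq. (3.11)
    print(f"t={t:5d}  comm={comm_ratio:6.1%}  "
          f"max abs error={np.abs(approx - exact).max():.4f}")
```

As the printed errors suggest, larger t buys accuracy at the price of more communication, which is exactly the trade-off explored in the case study below.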

3.4.8 Case Study: Finding Galactic Fundamental Planes

The identification of certain correlations among parameters has led to important discoveries in astronomy. For example, the class of elliptical and spiral galaxies (including dwarfs) has been found to occupy a two dimensional space inside a three dimensional space of observed parameters: radius, mean surface brightness, and velocity dispersion. This two dimensional plane has been referred to as the Fundamental Plane [83, 147]. This section presents a case study involving the detection of a fundamental plane among galaxy parameters distributed across two catalogs: 2MASS and SDSS. We consider several different combinations of parameters and explore for a fundamental plane in each (one such combination was mentioned above and described earlier in Example 3). Our goal is to demonstrate that, using our distributed covariance matrix algorithm to approximate the principal components, we can find a very similar fundamental plane to that obtained by applying a centralized PCA. Note that our ultimate goal is to enable new discoveries in astronomy through our DDM algorithms and the DEMAC system. However, for now, our primary goal is to demonstrate that our distributed covariance algorithm could have found very similar results to the centralized approach at a fraction of the communication cost. Therefore, we argue that DEMAC could provide a valuable tool for astronomers wishing to explore many parameter spaces across different catalogs for fundamental planes.

In our study we measure the accuracy of our distributed algorithm in terms of the similarity between its results and those of a centralized approach. We examine accuracy at various amounts of communication allowed the distributed algorithm in order to assess the trade-off described at the end of Section 3.4.7. For each amount of communication allowed, we ran the distributed algorithm 100 times with a different random matrix and report the average result (except where otherwise noted). For the purposes of our study, a real distributed environment is not necessary. Thus, for simplicity, we used a single machine and simulated a distributed environment. We carried out two sets of experiments. The first involves three parameters already examined in the Astronomy literature [83, 147] (Example 3) and the fundamental plane observed. The second involves several previously unexplored combinations of parameters. Experiment Set One We prepared our test data as follows. Using the web interfaces of 2MASS and SDSS, http://irsa.ipac.caltech.edu/applications/Gator/ and http://cas.sdss.org/astro/en/tools/crossid/upload.asp, and the SDSS object cross id tool, we obtained an aggregate dataset involving attributes from 2MASS and SDSS lying in the sky region between right ascension (ra) 150 and 200, declination (dec) 0 and 15. The aggregated dataset had the following attributes from SDSS: Petrosian I band angular effective radius (Iaer), redshift (rs), and velocity dispersion (vd);19 and had the following attribute from 2MASS: K band mean surface brightness (Kmsb).20 After removing tuples with missing attributes, we had a 1307 tuple dataset with four attributes. We produced a new attribute, logarithm Petrosian I band effective radius (log(Ier)), as log(Iaer*rs) and a new attribute, logarithm velocity dispersion (log(vd)), by applying the logarithm to vd. We dropped all attributes except those to obtain the three attribute dataset, log(Ier), log(vd), Kmsb. Finally, we normalized each column by subtracting its mean from each entry and dividing by its sample standard deviation (as described in Section 3.4.6). We applied PCA directly to this dataset to obtain the centralization based results. Then we treated this dataset as if it were distributed (assuming cross match indeces have been created as described earlier). This data can be thought of as a virtual table with attributes log(Ier) and log(vd) located at one site and attribute Kmsb at another. Finally, we applied our distributed covariance matrix algorithm and computed the principal components from the resulting matrix. Note, our dataset is somewhat small and not necessarily indicative of a scenario where DEMAC would be used in practice. However, for the purposes of our study (accuracy with respect to communication) it suffices. Figure 3.25 shows the percentage of variance captured as a function of communiu3` vabcd # c #‡5 `*da cation percentage (i.e. at 15%, the distributed algorithm uses —Ï  19 e*f.gihkjl&m]n

19 Petrosian I band radius (Galaxy view), z (SpecObj view), and velDisp (SpecObj view) in SDSS DR4.
20 K band mean surface brightness in the 2MASS extended source catalog, All Sky Data Release, http://www.ipac.caltech.edu/2mass/releases/allsky/index.html

Figure 3.25: Communication percentage vs. percent of variance captured, (log(Ier), log(vd), Kmsb) dataset.

Figure 3.25 shows the percentage of variance captured as a function of communication percentage (i.e. at 15% communication, the distributed algorithm uses 15% of the bytes that centralizing the data would require). Error bars indicate standard deviation; recall that the percentage-of-variance-captured numbers are averages over 100 trials. First observe that the percentage captured by the centralized approach, 90.5%, replicates the known result that a fundamental plane exists among these parameters. Indeed, the dataset fits fairly nicely on the plane formed by the first two PCs. Also observe that the percentage of variance captured by the distributed algorithm (including one standard deviation) using as little as 10% communication never strays more than 5 percent from 90.5%. This is a reasonably accurate result, indicating that the distributed algorithm identifies the existence of a plane using 90% less communication. As such, this provides evidence that the distributed algorithm would serve as a good browser, allowing the user to get decent approximate results at a sharply reduced communication cost. If this piques the user's interest, she can go through the trouble of centralizing the data and carrying out an exact analysis.

Interestingly, the average percentage captured by the distributed algorithm appears to approach the true percentage captured, 90.5%, very slowly (as the communication percentage approaches infinity, the average percentage captured must approach 90.5%). At present we do not have an explanation for the slow approach. However, as the communication increases, the standard deviation decreases substantially (as expected).

To analyze the accuracy of the actual principal components computed by the distributed algorithm, we consider the data projected onto each pair of PCs. The projection onto the true first and second PCs ought to appear with much scatter in both directions, as it represents the view of the data perpendicular to the plane. The projections onto the first and third, and second and third, PCs ought to appear more "flattened", as they represent

the view of the data perpendicular to the edge of the plane. Figures 3.26 and 3.27 display the results. The left column depicts the projections onto the PCs computed by the centralized analysis (true PCs). Here we see the fundamental plane. The right column depicts the projections onto the PCs computed by our distributed algorithm at 15% communication (for one random matrix, not the average over 100 trials). We see a similar pattern, indicating that the PCs computed by the distributed algorithm are quite accurate in the sense that they produce very similar projections to those produced by the true PCs. In closing, it is important to stress that we are not claiming that the actual projections can be computed in a communication-efficient fashion (they cannot). Rather, the PCs computed in a distributed fashion are accurate as measured by the projection similarity with the true PCs.

Experiment Set Two

We prepared our test data as follows. Using the web interfaces of 2MASS and SDSS and the SDSS object cross-id tool, we obtained an aggregate dataset involving attributes from 2MASS and SDSS lying in the sky region between right ascension (ra) 150 and 200 and declination (dec) 0 and 15. The aggregated dataset had the following attributes from SDSS: Petrosian Flux in the U band (petroU), Petrosian Flux in the G band (petroG), Petrosian Flux in the R band (petroR), Petrosian Flux in the I band (petroI), and Petrosian Flux in the Z band (petroZ);21 and the following attributes from 2MASS: K band mean surface brightness (Kmsb) and K concentration index (KconInd).22 We had a 29638 tuple dataset with seven attributes. We produced the new attributes petroU-I = petroU - petroI, petroU-R = petroU - petroR, petroU-G = petroU - petroG, petroG-R = petroG - petroR, petroG-I = petroG - petroI, and petroR-I = petroR - petroI. We also produced a new attribute, logarithm K concentration index (log(KconInd)), by applying the logarithm to KconInd. In each experiment, we dropped all attributes except the following to obtain our test dataset: petroU-I, petroU-R, petroU-G, petroG-R, petroG-I, petroR-I, log(KconInd), and Kmsb. Finally, we normalized each column by subtracting its mean from each entry and dividing by its sample standard deviation (as described in Section 3.4.6).

We considered the following six combinations of three attributes: (petroU-I, log(KconInd), Kmsb), (petroU-R, log(KconInd), Kmsb), (petroU-G, log(KconInd), Kmsb), (petroG-R, log(KconInd), Kmsb), (petroG-I, log(KconInd), Kmsb), and (petroR-I, log(KconInd), Kmsb). In each case we carried out a distributed PCA experiment just as in experiment set one. Table 3.11 depicts the results at 15% communication. The "Band" column indicates the combination of attributes used, e.g. U-I indicates (petroU-I, log(KconInd), Kmsb). The "Centralized Variance" column contains the sum of the variances (times 100) of the first two principal components found by the centralized algorithm; in effect, it contains the percent of variance captured by the first two PCs. The "Distributed Variance" column contains the sum of the variances of the first two PCs found by the distributed algorithm (averaged over 100 trials). The "STD Distributed Variance" column contains the standard deviation of the distributed variance over the 100 trials.

21 The five Petrosian flux attributes come from the PhotoObjAll view in SDSS.
22 Kmsb and KconInd come from the 2MASS extended source catalog in the All Sky Data Release.

Figure 3.26: Projections, PC1 vs. PC2 and PC1 vs. PC3; communication percentage 15%, (log(Ier), log(vd), Kmsb) dataset.

We see that in all cases the centralized experiments yield a relatively weak fundamental plane (72% of the variance captured by the first two PCs) relative to the fundamental plane from experiment set one (90.5% captured). The distributed algorithm in all cases does a decent job at replicating this result: 67.8% to 69.0% of the variance captured (all with standard deviation less than 0.99). These results come at an 85% communication savings over centralizing the data. To further illuminate the trade-off between communication savings and accuracy,


Figure 3.27: Projections, PC2 vs. PC3; communication percentage 15%, (log(Ier), log(vd), Kmsb) dataset.

Band   Centralized Variance   Distributed Variance   STD Distributed Variance
U-I    72.6186                67.8766                0.717
U-R    72.5375                68.0817                0.8191
U-G    72.6392                69.0102                0.9781
G-R    72.3501                68.5664                0.9879
G-I    72.4842                68.6199                0.9842
R-I    72.7451                68.0521                0.8034

Table 3.11: The centralized and distributed variances captured by the first and second PCs (15% communication).

Figure 3.28 shows the percentage of variance captured on the petroR-I dataset as a function of communication percentage (error bars indicate standard deviation; recall that the percentage-of-variance-captured numbers are averages over 100 trials). Interestingly, the average percentage captured by the distributed algorithm appears to move away from the true percentage (72%) for communication up to 40%, then move very slightly toward 72%. This is surprising since the distributed algorithm error must approach zero as its communication percentage approaches infinity. The results appear to indicate that this approach is quite slow. At present we do not have an explanation for this phenomenon. However, as the communication increases, the standard deviation decreases (as expected). Moreover, despite the slow approach, the percentage of variance captured by the distributed algorithm (including one standard deviation) never strays more than 6 percent from the true percent captured, a reasonably accurate figure.


Figure 3.28: Communication percentage vs. percent of variance captured, (petroR-I, log(KconInd), Kmsb) dataset.

As in experiment set one, we analyze the accuracy of the actual principal components computed by the distributed algorithm by considering their data projections. We do so for the petroR-I dataset. Figures 3.29 and 3.30 display the results. The left column depicts the projections onto the PCs computed by the centralized analysis (true PCs). Note that 10 projected data points (out of 29638) are omitted from the centralized PC1 vs. PC2 and PC2 vs. PC3 plots due to scaling. The y-coordinate for these points (the x-coordinate in the case of PC2 vs. PC3) does not lie in the range [-20, 20] (in both cases a few points have a coordinate of nearly 60 or -60). Unlike experiment set one, we do not see a very pronounced plane (as expected). The right column depicts the projections onto the PCs computed by our distributed algorithm at 15% communication (for one random matrix, not the average over 100 trials). The distributed projections appear to indicate a fundamental plane somewhat more strongly than the centralized ones. This indicates some inaccuracy in the distributed PCs. However, since the distributed variances correctly did not indicate a strong fundamental plane, the inaccuracies in the PCs would likely not play an important role for users browsing for fundamental planes.
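For reference, the projection plots discussed above can be produced with a few lines once a covariance estimate is available. The sketch below is ours (the function name is an assumption); as stressed earlier, it requires access to the full table X, so it serves only to evaluate the accuracy of the distributed PCs, not as a communication-efficient operation.

import numpy as np

def pc_projections(X, cov, n_components=3):
    """Sketch: project the (already normalized) virtual table X onto the leading
    principal components of a covariance estimate 'cov' (exact or distributed).
    Pairs of the returned columns correspond to the scatter plots in
    Figures 3.26 through 3.30."""
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]   # sort principal directions by decreasing variance
    return X @ eigvecs[:, order[:n_components]]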

3.4.9 Summary

We proposed a system, DEMAC, for the distributed exploration of massive astronomical catalogs. DEMAC is to be built on top of the existing U.S. National Virtual Observatory environment and will provide tools for data mining (as web services) without requiring datasets to be downloaded to a centralized server.

Figure 3.29: Projections, PC1 vs. PC2 and PC1 vs. PC3; communication percentage 15%, (petroR-I, log(KconInd), Kmsb) dataset.

Instead, the users will only download the output of the data mining process (a data mining model); the actual data mining from multiple data servers will be performed using communication-efficient DDM algorithms. The distributed algorithms we have developed sacrifice perfect accuracy for communication savings. They offer approximate results at a considerably lower communication cost than that of exact results through centralization. As such, we see DEMAC as serving the role of an exploratory "browser". Users can quickly get (generally quite accurate) results for their distributed queries at low communication cost.


Figure 3.30: Projections, PC2 vs. PC3; communication percentage 15%, (petroR-I, log(KconInd), Kmsb) dataset.

Armed with these results, users can focus in on a specific query or portion of the datasets, and download it for more intricate analysis.

To illustrate the potential effectiveness of our system, we developed communication-efficient distributed algorithms for principal component analysis (PCA). Then, we carried out a case study using distributed PCA for detecting fundamental planes of astronomical parameters. We observed that our distributed algorithm replicates fairly closely the fundamental plane results observed through centralized analysis, but at significantly reduced communication cost. In closing, we envision that our system will increase the ease with which large, geographically distributed astronomy catalogs can be explored, by providing quick, low-communication solutions. Such a benefit will allow astronomers to better tap the riches of distributed virtual tables formed from joined and integrated sky survey catalogs.


Chapter 4

Future Work

4.1 The DEMAC system - further explorations

The DEMAC system proposed in Section 3.4.3 is an ongoing research project. There are several directions for future work, including the development of a simulated grid environment for the DEMAC system using the OGSA-DAI infrastructure, analysis of the DDM algorithms taking into account the overhead due to web services, and development of other distributed data mining algorithms, such as outlier detection, that can be incorporated into the system. The following sections delve deeper into each of these directions for future work.

4.1.1 Grid-enabling DEMAC

A distributed system on the grid may contain several data repositories. These repositories may have different schemas, access mechanisms, models of storage, and replication facilities, and can be stored locally or remotely. The Open Grid Service Architecture (OGSA) data services provide mechanisms for virtualization and transparency over the disparate data repositories on the grid [137]. The specifications for enabling unified data access and integration [182, 175, 183] using the service-based architecture advocated in OGSA have been laid down by the Data Access and Integration Services Working Group of the Global Grid Forum (GGF). This family of specifications defines web service interfaces to data resources, such as relational or XML databases, includes properties that can be used to describe a data service or the resource to which access is being provided, and defines message patterns that support access to (query and update of) data resources. Section 2.2.3 provides a brief description of the architecture of the OGSA-DAI infrastructure. In this section we propose a grid-enabled version of the DEMAC system using the OGSA-DAI services as a starting point. We intend to research the following aspects of DEMAC:

1. Distributed Schema Integration: The current simulation of DEMAC uses two different astronomy catalogs, 2MASS [3] and SDSS [242]. Upon selection of the portion of the sky to be mined from these catalogs, the web interfaces of the two sky surveys are used to download the data.

The pre-processing of the data (including estimation of derived attributes), cross-matching, and index maintenance at distributed sites are all performed off-line. This procedure can be time consuming and has to be repeated every time a different portion of the sky is chosen for analysis. Hence it appears unrealistic, and motivates the need for the development of better data access and integration schemes. Since the federated astronomy databases are envisioned to be grid-enabled, an interesting possibility is to build a service-based schema integration module. This module should allow integration of heterogeneous data sources (including flat files, relational, and XML databases) and also develop novel methods for indexing the integrated virtual databases.

2. Distributed Query Processing: The Open Grid Service Architecture - Distributed Query Processing (OGSA-DQP) [193] provides a service-based distributed query processor and is built on top of the OGSA-DAI infrastructure introduced in Section 2.2.3. It supports queries over OGSA-DAI data services and uses grid data services to provide consistent access to metadata and to interact with databases on the grid. The service-based DQP framework consists of two services: (1) the Grid Distributed Query Service (Coordinator) and (2) the Query Evaluation Service (Evaluator). The role of the Coordinator is to obtain metadata to compile, partition, and schedule distributed query execution plans over nodes in the grid. The Evaluator is used by the Coordinator to execute query plans generated by the query compiler, optimiser, and scheduler. While the distributed query processor is an interesting contribution, work still needs to be done on the coordinated use of query processing services on the grid for scientific applications.

3. Distributed Workflow Management: In order to enable composition of web services, there is a need to develop workflow management schemes for distributed, scientific data mining. In Section 2.4 we discussed some of the related work and the challenges that need to be overcome.

4.1.2 PCA-based Outlier Detection on DEMAC

In Section 3.4.7 we used Principal Component Analysis for dimension reduction of astronomy catalogs. However, PCA can also be used for outlier detection. While the leading principal components carry most of the variance in the data, the last components also carry valuable information. Some techniques for outlier detection have been developed based on the last components [109, 121, 120, 146, 245]. These techniques aim to identify data points which deviate sharply from the "correlation structure" of the data.

Recall from Chapter 3, Section 3.4.6, that the $k$-th principal component $y_k$ is a linear combination of the columns of the data matrix $X$ (the $j$-th column has coefficient $a_k(j)$) with sample variance $\lambda_k$, i.e. the variance over the entries of $y_k$ is $\lambda_k$. Thus, if $\lambda_k$ is very small and there were no outlier data points, one would expect the entries of $y_k$ to be nearly constant. In this case, $a_k$ expresses a nearly linear relationship between the columns of $X$. A data point which deviates sharply from the correlation structure of the data will likely have its $y_k$ entry deviate sharply from the rest of the entries (assuming no other outliers). Since the last components have the smallest $\lambda_k$, an outlier's entries in these components will likely stand out. This motivates examination of the following statistic for the $i$-th data point (the $i$-th row of $X$), where $X$ has $n$ columns and $q$ is a user-defined parameter with $1 \le q \le n$:

$$d_{1i}(q) = \sum_{k=n-q+1}^{n} y_k(i)^2 \qquad (4.1)$$

where $y_k(i)$ denotes the $i$-th entry of $y_k$. A possible criticism of this approach is pointed out in [146], page 237: "it [$d_{1i}$] still gives insufficient weight to the last few PCs, [...] Because the PCs have decreasing variance with increasing index, the values of $y_k(i)$ will typically become smaller as $k$ increases, and $d_{1i}$ therefore implicitly gives the PCs decreasing weights as $k$ increases. This effect can be severe if some of the PCs have very small variances, and this is unsatisfactory as it is precisely the low-variance PCs which may be most effective ..." To address this criticism, the components are normalized to give equal weight. Let $\tilde{a}_k$ denote the normalized $k$-th principal direction: the $n \times 1$ vector whose $j$-th entry is $a_k(j)/\sqrt{\lambda_k}$. The normalized $k$-th principal component is $\tilde{y}_k = X \tilde{a}_k$. The sample variance of $\tilde{y}_k$ equals one, so the weights of the normalized components are equal. The statistic we use for the $i$-th data point is (following the notation in [146])

$$d_{2i}(q) = \sum_{k=n-q+1}^{n} \tilde{y}_k(i)^2 \qquad (4.2)$$

Using the above technique, we plan to develop a distributed top-k outlier detection algorithm for astronomy catalogs and provide experimental results to compare the performance of the distributed algorithm to existing centralized outlier detection algorithms. It would also be interesting to track the performance of the distributed algorithm in the simulated grid environment described above.
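To make the planned statistic concrete, the following is a minimal centralized sketch of the normalized-component statistic of Equation (4.2) together with a top-k selection; the function name and the use of an eigendecomposition of the sample covariance are illustrative assumptions, and the distributed variant remains future work.

import numpy as np

def top_k_pca_outliers(X, q=2, k=10):
    """Sketch: score each row of the column-normalized data matrix X by
    d_2i(q), the sum of squared entries of the q last *normalized* principal
    components (Equation 4.2), and return the indices of the k highest-scoring
    rows as candidate outliers."""
    m = X.shape[0]
    cov = (X.T @ X) / (m - 1)                      # sample covariance (columns already centered)
    lam, A = np.linalg.eigh(cov)                   # eigenvalues in ascending order
    last_dirs = A[:, :q]                           # principal directions with the q smallest variances
    Y_tilde = (X @ last_dirs) / np.sqrt(lam[:q])   # normalized components: unit sample variance
    scores = np.sum(Y_tilde ** 2, axis=1)          # d_2i(q) for each data point i
    return np.argsort(scores)[::-1][:k]            # indices of the top-k outlier candidates

In the distributed setting we envision, the exact covariance in this sketch would be replaced by the communication-efficient estimate of Section 3.4.7; how to combine the per-site contributions to the scores cheaply is part of the planned work.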

4.2 Proposed Plan of Research

Table 4.1 illustrates the plan of research.


Tasks planned between August 2006 and April 2007:
- Develop an architecture for DDM using the OGSA-DAI framework
- Grid Enabling DEMAC
- Develop an architecture for Distributed Stream Mining on the grid
- Writing thesis

Table 4.1: Research Plan

Bibliography [1] S. Al Sairafi, F. S. Emmanouil, M. Ghanem, N. Giannadakis, Y. Guo, D. Kalaitzopolous, M. Osmond, A. Rowe, iJ. Syed and P. Wendel. The Design of Discovery Net: Towards Open Grid Services for Knowledge Discovery. International Journal of High Performance Computing Applications, 17(3):297–315, 2003. [2] S. Bandyopadhyay, C. Giannella, U. Maulik, H. Kargupta, K. Liu, and S. Datta. Clustering Distributed Data Streams in Peer-to-Peer Environments. Information Science, 2005. In Press. [3] 2-Micron All Sky Survey. http://pegasus.phast.umass.edu. [4] Alexander Reinefeld and Florian Schintke. Concepts and Technologies for a Worldwide Grid Infrastructure. In Lecture Notes in Computer Science, volume 2400, pages 62–71. Springer 2002. (c) Springer-Verlag, 2002. [5] Peter Brezany Alexander Wöhrer and A. Min Tjoa. Novel mediator architectures for Grid information systems. Future Generation Computer Systems, 21(1):107– 114, January,2005 2005. [6] Peter Brezany Alexander Woehrer and Ivan Janciak. Virtualization of Heterogeneous Data Sources for Grid Information Systems. In Accepted for the MIPRO 2004, Opatija, Croatia, May 24-28, 2004 2004. [7] A. S. Ali, O. F. Rana, and I. J. Taylor. Web Services Composition for Distributed Data Mining. In ICPP 2005 Workshops. International Conference Workshops on Parallel Processing, pages 11 – 18. IEEE, June 2005. [8] Ali Shaikh Ali, Omer F. Rana and Ian J. Taylor. Web Services Composition for Distributed Data Mining. In Workshop on Web and Grid Services for Scientific Data Analysis (WAGSSDA), Oslo, Norway, 2005. [9] W. Allcock, I. Foster, S. Tuecke, A. Chervenak, and C. Kesselman. Protocols and services for distributed data-intensive science. In In Proceedings of Advanced Computing and Analysis Techniques in Physics Research(ACAT2000), 2000. [10] Alon Y. Halevy, Zachary G. Ives, Dan Suciu, Igor Tatarinov. Schema Mediation in Peer Data Management Systems. In 19th International Conference on Data Engineering (ICDE 2003), pages 505–516, 2003. 95

[11] Amol Ghoting and Srinivasan Parthasarathy. Facilitating Interactive Distributed Data Stream Processing and Mining. In In Proceedings of the IEEE International Symposium on Parallel and Distributed Processing Systems (IPDPS), April 2004. [12] Arriaga R. and Vempala S. An Algorithmic Theory of Learning: Robust Concepts and Random Projection. In Proceedings of the 40th Foundations of Computer Science, 1999. [13] Szalay A.S., Gray J., Kunszt P., Thakar A., and Slutz D. Large Databases in Astronomy. In Mining the Sky, Proceedings of MPA/ESO/MPE workshop, pages 99–118. Springer, 2001. [14] J. Gray A.S. Szalay, P.Z. Kunszt. The Sloan Digital Sky Survey Science Archive: Migrating a Multi-Terabyte Astronomical Archive from Object to Relational DBMS. Computing in Science and Engineering, IEEE Press, V5.5:20–27, June 2003. [15] Astro grid. www.astrogrid.org. [16] The AUTON Project. http://www.autonlab.org/autonweb/showProject/3/. [17] P. Avery. Data Grids: A New Computational Infrastructure for Data Intensive Science. Technical Report GriPhyN Report 2002-24, GriPhyN, 2002. [18] P. Avery and Ian Foster. The GriPhyN Project: Towards Petascale Virtual-Data Grids. Technical Report GriPhyN Report 2000-1, GriPhyN, December 2001. Submitted to the 2000 NSF Information and Technology Research Program, NSF award ITR=0086044. [19] Ron Avnur and Joseph M. Hellerstein. Eddies: continuously adaptive query processing. In SIGMOD, pages 261–272, 2000. [20] B. Babcock and C. Olston. Distributed top-k monitoring. In In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 28–39, San Diego,California, June 2003. [21] B. Gilburd, A. Schuster, and R. Wolff. Privacy Preserving Data Mining on Data Grids in the presence of Malicious Participants. In Proceedings of HPDC, 2004, Honolulu,Hawaii, June 2004. [22] Babcock B., Babu S., Datar M., Motwani R., and Widom J. Models and Issues in Data Stream Systems. In Proceedings of the 21th ACM SIGMOD-SIGACTSIGART Symposium on Principals of Database Systems (PODS), pages 1–16, 2002. [23] Eric Bauer and Ron Kohavi. An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Machine Learning, 36(1–2):105– 139, 1999.

96

[24] Boualem Benatallah, Quan Z. Sheng, and Marlon Dumas. The Self-Serv Environment for Web Services Composition. IEEE Internet Computing, 7(1):40–48, 2003. [25] Steve Benford, Neil Crout, John Crowe, Stefan Egglestone, Malcom Foster, Alastair Hampshire, Barrie Hayes-gill, Alex Irune, Ben Palethorpe, Timothy Reid, and Mark Sumner. e-science from the antarctic to the GRID, August 2003. [26] Fran Berman. Viewpoint: From teragrid to knowledge grid. Commun. ACM, 44(11):27–28, 2001. [27] R. Bhargava, H. Kargupta, and M. Powers. Energy Consumption in Data Analysis for On-board and Distributed Applications. In Proceedings of the 2003 International Conference on Machine Learning workshop on Machine Learning Technologies for Autonomous Space Applications, 2003. [28] bioGrid – Biotechnology Information and Knowledge Grid. http://www.biogrid.net/. [29] BIRN: Biomedical Informatics http://www.nbirn.net/AU/index.htm.

Research

Network.

[30] Body media sensewear armband. http://www.bodymedia.com/index.jsp. [31] Borne K. Distributed Data Mining in the National Virtual Observatory. In Proceedings of the SPIE Conference on Data Mining and Knowledge Discovery: Theory, Tools, and Technology V, Vol. 5098, page 211, 2003. [32] Borne K., Arribas S., Bushouse H., Colina L., and Lucas R. A National Virtual Observatory (NVO) Science Case. In Proceedings of the Emergence of Cosmic Structure, New York: AIP, page 307, 2003. [33] Distributed Data Mining Techniques for Object Discovery in the National Virtual Observatory. http://is.arc.nasa.gov/IDU/tasks/NVODDM.html. [34] Leo Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996. [35] Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001. [36] C. L. Bridges and D. E. Goldberg. The nonuniform Walsh-schema transform. In G. J. E. Rawlins, editor, Foundations of Genetic Algorithms, pages 13–22. Morgan Kaufmann, San Mateo, CA, 1991. [37] Giuseppe Bueti, Antonio Congiusta, and Domenico Talia. Developing distributed data mining applications in the knowledge grid framework. In Proc. of the 6th Int. Conf. on High Performance Computing for Computational Science (VECPAR 2004), volume 3402 of LNCS, pages 156–169, Valencia, Spain, June 2004. ISBN 3-540-25424-2.

97

[38] C. Giannella, R. Bhargava, and H. Kargupta. Multi-Agent Systems and Distributed Data Mining. In Proceedings of 8th International Workshop on Cooperative Information Agents (CIA 2004), Lecture Notes in Artificial Intelligence, volume 3191, Erfurt, Germany, September 27-29 2004. [39] Mario Cannataro, Antonio Congiusta, Carlo Mastroianni, Andrea Pugliese, Domenico Talia, and Paolo Trunfio. Grid-based data mining and knowledge discovery. In N. Zhong and J. Liu, editors, Intelligent Technologies for Information Analysis, pages 19–45. Springer-Verlag, 2004. ISBN 3-540-40677-8. [40] Mario Cannataro, Antonio Congiusta, Andrea Pugliese, Domenico Talia, and Paolo Trunfio. Distributed data mining on grids: services, tools, and applications. IEEE Transactions on Systems, Man, and Cybernetics: Part B (TSMC-B), 34(6):2451–2465, 2004. [41] Mario Cannataro, Antonio Congiusta, Domenico Talia, and Paolo Trunfio. A data mining toolset for distributed high-performance platforms. In Proc. of the 3rd International Conference on Data Mining Methods and Databases for Engineering, Finance and Others Fields (Data Mining 2002), pages 41–50, Southampton, UK, September 2002. WIT Press. ISBN 1-85312-925-9. [42] Mario Cannataro and Domenico Talia. Parallel and distributed knowledge discovery on the grid: A reference architecture. In Proc. of the 4th International Conference on Algorithms and Architectures for Parallel Computing (ICA3PP), pages 662–673, Hong Kong, December 2000. World Scientific Publ. [43] Mario Cannataro and Domenico Talia. The knowledge grid. Communications of the ACM, 46(1):89–93, 2003. [44] Mario Cannataro and Domenico Talia. Semantics and knowledge grids: Building the next-generation grid. IEEE Intelligent Systems, 19(1):56–63, 2004. [45] Mario Cannataro, Domenico Talia, and Paolo Trunfio. Design of distributed data mining applications on the knowledge grid. In Proc. of the National Science Foundation Workshop on Next Generation Data Mining (NGDM’02), pages 191–195, Baltimore, Maryland, November 2002. [46] Mario Cannataro, Domenico Talia, and Paolo Trunfio. Design of distributed data mining applications on the knowledge grid. In H. Kargupta, A. Joshi, K. Sivakumar, and Y. Yesha, editors, Data Mining: Next Generation Challenges and Future Directions, pages 67–88. AAAI/MIT Press, 2004. ISBN 0-262-61203-8. [47] D. Caragea, A. Silvescu, and V. Honavar. A Framework for Learning from Distributed Data Using Sufficient Statistics and its Application to Learning Decision Trees. International Journal of Hybrid Intelligent Systems., 2003. [48] D. Caragea, A. Silvescu, and V. Honavar. Decision Tree Induction from Distributed, Heterogeneous, Autonomous Data Sources. In Proceedings of the Conference on Intelligent Systems Design and Applications (ISDA 03), Tulsa, Oklahoma, 2003. 98

[49] Caragea D., Silvescu A., and Honavar V. Decision Tree Induction from Distributed Data Sources. In Proceedings of the Conference on Intelligent Systems Design and Applications, 2003. [50] Carl Barratt, Andrea Brogni, Matthew Chalmers, William R. Cobern, John Crowe, Don Cruickshank, Nigel Davies, Dave De Roure, Adrian Friday, Alastair Hampshire, Oliver J. Gibson, Chris Greenhalgh, Barrie Hayes-Gill, Jan Humble, Henk Muller, Ben Palethorpe, Tom Rodden, Chris Setchell, Mark Sumner, Oliver Storz, Lionel Tarassenko. Extending the Grid to Support Remote Medical Monitoring. In UK e-Science All Hands Meeting, 2003. [51] The CfA Redshift Catalogue, Version http://vizier.cfa.harvard.edu/viz-bin/Cat?VII/193.

June

1995.

[52] Dipanjan Chakraborty, Anupam Joshi, Yelena Yesha, and Tim Finin. Toward distributed service discovery in pervasive computing environments. IEEE Transactions on Mobile Computing, 5(2):97–112, 2006. [53] P. Chan and S. Stolfo. Experiments on multistrategy learning by meta-learning. In Proceeding of the Second International Conference on Information Knowledge Management, pages 314–323, 1993. [54] P. Chan and S. Stolfo. Meta-learning for multistrategy and parallel learning. In Proceeding of the Second International Work on Multistrategy Learning, pages 150–165, 1993. [55] P. Chan and S. Stolfo. Toward parallel and distributed learning by meta-learning. In Working Notes AAAI Work. Knowledge Discovery in Databases, pages 227– 240. AAAI, 1993. [56] P. Chan and S. Stolfo. On the accuracy of meta-learning for scalable data mining. Journal of Intelligent Information System, 8:5–28, 1996. [57] P. Chan and S. Stolfo. Toward scalable learning with non-uniform class and cost distribution: A case study in credit card fraud detection. In Proceeding of the Fourth International Conference on Knowledge Discovery and Data Mining, page o. AAAI Press, September 1998. [58] P. K. Chan and S. J. Stolfo. A comparative evaluation of voting and metalearning on partitioned data. In Proceedings of Thwelfth International Conference on Machine Learning, pages 90–98, 1995. [59] P. K. Chan and S. J. Stolfo. Sharing learned models among remote database partitions by local meta-learning. In E. Simoudis, J. Han, and U. Fayyad, editors, The Second International Conference on Knowledge Discovery and Data Mining, pages 2–7. AAAI Press, 1996. [60] J. Chen, D. DeWitt, F. Tian, and Y. Wang. NiagaraCQ: a scalable continuous query system for Internet databases. In ACM SIGMOD’00 International Conference on Management of Data, 2000. 99

[61] Liang Chen, Reddy K., and G. Agrawal. GATES: a grid-based middleware for processing distributed data streams. In Proceedings. 13th IEEE International Symposium on High Performance Distributed Computing, 2004, pages 192– 201, Honolulu, Hawaii USA, 4-6 June 2004 2004. [62] Chen R. and Sivakumar K. A New Algorithm for Learning Parameters of a Bayesian Network from Distributed Data. Proceedings of the Second IEEE International Conference on Data Mining (ICDM), pages 585–588, 2002. [63] Chen R., Sivakumar K., and Kargupta H. Learning Bayesian Network Structure from Distributed Data. In Proceedings of the Third SIAM International Conference on Data Mining, pages 284–288, 2003. [64] Ann Chervenak, Ian Foster, Carl Kesselman, Charles Salisbury, and Steven Tuecke. The Data Grid: Towards an Architecture for the Distributed Management and Analysis of Large Scientific Datasets. Journal of Network and Computer Applications, 23:187–200, 2001. (based on conference publication from Proceedings of NetStore Conference 1999). [65] F. Chung. Spectral Graph Theory. American Mathematical Society, Providence, Rhode Island, USA, 1994. [66] D. Churches, G. Gombas, A. Harrison, J. Maassen, C. Robinson, M. Shields, I. Taylor, and I. Wang. Programming Scientific and Distributed Workflow with Triana Services. Grid Workflow 2004 Special Issue of Concurrency and Computation: Practice and Experience, To be published, 2005. [67] The ClassX Project: Classifying http://heasarc.gsfc.nasa.gov/classx/.

the

High-Energy

Universe.

[68] Carmela Comito, Anastasios Gounaris, Rizos Sakellariou, and Domenico Talia. Data integration and query reformulation in service-based grids. In Proc. of the 1st CoreGRID Integration Workshop, Pisa, Italy, November 2005. [69] Carmela Comito and Domenico Talia. Gdis: A service-based architecture for data integration on grids. In Proc. of OTM 2004 Workshops, volume 3292 of LNCS, pages 88–98, Agia Napa, Cyprus, October 2004. Springer-Verlag. ISBN 3-540-23664-3. [70] Carmela Comito and Domenico Talia. Gdis: A service-based architecture for data integration on grids. In Proc. of OTM 2004 Workshops, volume 3292 of LNCS, pages 88–98, Agia Napa, Cyprus, October 2004. Springer-Verlag. ISBN 3-540-23664-3. [71] Antonio Congiusta, Carlo Mastroianni, Andrea Pugliese, Domenico Talia, and Paolo Trunfio. Enabling knowledge discovery services on grids. In Proc. of the 2nd European AcrossGrids Conference (AxGrids 2004), volume 3165 of LNCS, pages 250–259, Nicosia, Cyprus, January 2004. Springer-Verlag. ISBN 3-54022888-8. 100

[72] Antonio Congiusta, Andrea Pugliese, Domenico Talia, and Paolo Trunfio. Designing grid services for distributed knowledge discovery. Web Intelligence and Agent Systems (WIAS), 1(2):91–104, 2003. [73] Antonio Congiusta, Domenico Talia, and Paolo Trunfio. Vega: A visual environment for developing complex grid applications. In Proc. of the First International Workshop on Knowledge Grid and Grid Intelligence (KGGI 2003), pages 56–66, Halifax, Canada, October 2003. Department of Mathematics and Computing Science, Saint Mary’s University. ISBN 0-9734039-0-X. [74] C. Cortes, K. Fisher, D. Pregibon, A. Rogers, and F. Smith. Hancock: A language for extracting signatures from data stream. In Proceedings The ACM SIGKDD-2000, 2000. [75] D.Bosio,J.Casey,A.Frohner,L.Guy et al. Next Generation EU DataGrid Data Management Services. In Computing in High Energy Physics (CHEP 2003), March 2003. [76] Digital Dig - Data Mining in Astronomy. http://www.astrosociety.org/pubs/ezine/datamining.html. [77] Deep Wide-Field Survey. http://www.noao.edu/noao/noaodeep/. [78] Dhillon I. and Modha D. A Data-clustering Algorithm on Distributed Memory Multiprocessors. In Proceedings of the KDD’99 Workshop on High Performance Knowledge Discovery, pages 245–260, 1999. [79] Thomas G. Dietterich. An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting and randomization. Machine Learning, 40(2):139–158, 2000. [80] Dipanjan Chakraborty and Anupam Joshi. Anamika: Distributed Service Composition Architecture for Pervasive Environments. SIGMOBILE Mob. Comput. Commun. Rev., 7(1):38–40, 2003. [81] Dipanjan Chakraborty and Suraj Kumar Jaiswal and Archan Misray and Amit A. Nanavati. Middleware architecture for evaluation and selection of 3rd-party web services for service providers. icws, 0:647–654, 2005. [82] Dipanjan Chakraborty, Anupam Joshi, Tim Finin, Yelena Yesha. Service Composition for Mobile Environments. Mobile Networks and Applications, 10:435 – 451, 2005. [83] Elliptical Galaxies: Merger Simulations and the Fundamental Plane. http://irs.ub.rug.nl/ppn/244277443. [84] Data Mining Grid. http://www.datamininggrid.org/. [85] Workshop on Data Mining and the Grid (DM-Grid http://www.cs.technion.ac.il/ € ranw/dmgrid/program.html. 101

2004).

[86] Domenico Talia, Paolo Trunfio, Oreste Verta. WEKA4WS: a WSRF-enabled Weka Toolkit for Distributed Data Mining on Grids. In Proc. of the 9th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD 2005), volume LNAI vol. 3721, pages 309–320, Porto, Portugal, October 2005. Springer-Verlag. [87] Pedro Domings and Geoff Hulten. Mining high-speed data streams. In Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Boston, MA, August, 2000. [88] Harris Drucker and Corrina Cortes. Boosting decision trees. Advances in Neural Information Processing Systems, 8:479–485, 1996. [89] H. Dutta, H. Kargupta, and A. Joshi. Orthogonal Decision Trees for ResourceConstrained Physiological Data Stream Monitoring using Mobile Devices. In Varadarajan Sridhar David A. Bader, Manish Parashar and Viktor K. Prasanna, editors, Lecture Notes in Computer Science (3769) High Performance Computing - HiPC 2005 , Goa,India, December 18-21 2005. Springer Science and Business Media. [90] M. Eisenhardt, W. Muller, and A. Henrich. Classifying Documents by Distributed P2P Clustering. In Proceedings of Informatik 2003, GI Lecture Notes in Informatics, Frankfurt, Germany, September 2003. [91] EU Data Grid. http://web.datagrid.cnr.it/. [92] Astrophysical Virtual Observatory. http://www.euro-vo.org/. [93] European Data Grid Project. http://edg-wp2.web.cern.ch/edg-wp2/index.html. [94] The FAEHIM project. http://users.cs.cf.ac.uk/Ali.Shaikhali/faehim/. [95] Wei Fan, Sal Stolfo, and Phillip Chan. Using conflicts among multiple base classifiers to measure the performance of stacking. In ICML-99 Workshop on Recent Advances in Meta-learning and Future Work, pages 10 – 17, 1999. [96] Wei Fan, Sal Stolfo, and Junxin Zhang. The application of adaboost for distributed, scalable and on-line learning. In Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, California, 1999. [97] J. Feigenbaum, S. Kannan, M. Strauss, and M Viswanathan. Testing and spot checking of data streams. In Proceedings of the 11th Symposium on Discrete Algorithm, ACM/SIAM, pages 165–174, New York/Philadelphia, 2000. [98] Framework for Mining and http://www.itsc.uah.edu/f-mass/.

Analysis

of

Space

Science

Data.

[99] Ian Foster. The Grid: A new Infrastructure for 21st Century Science. Physics Today, Feb 2002. 102

[100] Ian Foster and C. Kesselman. Computational Grids. Chapter 2 of The Grid: Blueprint for a New Computing Infrastructure, 1999. [101] Ian Foster and Carl Kesselman. The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufman, 2004. [102] Fred A. and Jain A. Data Clustering Using Evidence Accumulation. In Proceedings of the International Conference on Pattern Recognition 2002, pages 276–280, 2002. [103] Yoav Freund. Boosting a weak learning algorithm by majority. Information and Computation, 121(2):256–285, 1995. [104] Yoav Freund and Robert E. Schapire. Experiments with a new boosting algorithm. In Proc. 13th International Conference on Machine Learning, pages 148–146, Murray Hill, NJ, 1996. Morgan Kaufmann. [105] Forman G. and Zhang B. Distributed Data Clustering Can Be Efficient and Exact. SIGKDD Explorations, 2(2):34–38, 2000. [106] J. Gehrke, F. Korn, and D. Srivastava. On computing correlated aggregates over continual data streams. In SIGMOD 2001, Santa Babara, CA, 2001. [107] Chris Giannella, Kun Liu, Todd Olsen, and Hillol Kargupta. Communication Efficient Construction of Decision Trees Over Heterogeneously Distributed Data. In Proceedings of The Fourth IEEE International Conference on Data Mining (ICDM’04), Brighton, UK, November 2004. [108] Giannella C., Liu K., Olsen T., and Kargupta H. Communication Efficient Construction of Decision Trees Over Heterogeneously Distributed Data. In Proceedings of the The Fourth IEEE International Conference on Data Mining (ICDM), 2004. [109] Gnanadesikan R. and Kettenring J. Robust Estimates, Residuals, and Outlier Detection with Multiresponse Data. Biometrics, 28:81–124, 1972. [110] D. Goldberg. Genetic algorithms and Walsh functions: Part I, a gentle introduction. Complex Systems, 3(2):129–152, 1989. [111] V. Gorodetski, V. Skormin, L. Popyack, and O. Karsaev. Distributed learning in a data fusion systems. In Proceedings The Conference of the World Computer Congress (WCC-2000) Intelligent Information Processing (IIP2000), Beijing, China, 2000. [112] GridMiner. www.gridminer.org. [113] Gridpp. www.gridpp.ac.uk. [114] Grid Weka. http://smi.ucd.ie/ rinat/weka/. [115] Grid Physics Network. http://www.griphyn.org. 103

[116] GRIST: Grid Data Mining for Astronomy. http://grist.caltech.edu. [117] S. Guha, N. Mishra, R. Motwani, and L. O’Callaghan. Clustering data streams. In Proceedings of the Annual Symposium on Foundations of Computer Science, pages 359–366, November,2000. [118] Y. Guo and J. Sutiwaraphun. Distributed learning with knowledge probing: A new framework for distributed data mining. In Advances in Distributed and Parallel Knowledge Discovery,Eds: Hillol Kargupa and Phillip Chan. MIT Press, 2000. [119] H. Dutta H. Kargupta, B. Park. Orthogonal Decision Trees. In Accepted for publication in Transactions on Knowledge and Data Engineering (In press), 2005. [120] Hawkins D. The Detection of Errors in Multivariate Data Using Principal Components. Journal of the American Statistical Association, 69(346):340–344, 1974. [121] Hawkins D. and Fatti P. Exploring Multivariate Data Using the Minor Principal Components. The Statistician, 33:325–338, 1984. [122] Helen Conover, Sara J. Graves, Rahul Ramachandran, Sandi Redman, John Rushing, Steve Tanner, Robert Wilhelmson. Data Mining on the TeraGrid. In Supercomputing Conference, Phoenix, AZ, Nov. 15 2003. [123] Hershberger, D. and Kargupta, H. Distributed Multivariate Regression Using Wavelet-based Collective Data Mining. Journal of Parallel and Distributed Computing, 61:372–400, 1999. [124] Tony Hey and Anne Trefethen. The Data Deluge. Grid Computing - Making the Global Infrastructure a Reality, January 2003. [125] Hinke T. and Novotny J. Data Mining on NASA’s Information Power Grid. In Proceedings of the 9th IEEE International Symposium on High Performance Distributed Computing (HPDC’00), page 292, 2000. [126] Hinneburg A. and Keim D. An Efficient Approach to Clustering in Large Multimedia Databases with Noise. In Proceedings of the 1998 International Confernece on Knowledge Discovery and Data Mining (KDD), pages 58–65, 1998. [127] J. Hofer and P. Brezany. Distributed Decision Tree Induction within the grid data mining framework gridminer-core. Technical Report TR 2004-04, University of Vienna, Vienna, Austria, March 2004. [128] Juergen Hofer and Peter Brezany. DIGIDT: Distributed Classifier Construction in the Grid Data Mining Framework GridMiner-Core. In Proceedings of the Workshop on Data Mining and the Grid (GM-Grid 2004) held in conjunction with the 4th IEEE International Conference on Data Mining (ICDM’04), November 1-4 2004. 104

[129] Hofer J. and Brezany P. Distributed Decision Tree Induction Within the Grid Data Mining Framework. In Technical Report TR2004-04, Institute for Software Science, University of Vienna, 2004. [130] H.Stockinger, F.Dono,E.Laure, S.Muzzafar et al. Grid Data Management in Action. In Computing in High Energy Physics (CHEP 2003), March 2003. [131] G. Hulten, Spencer L., and P. Domingos. Mining time-changing data streams. In Proceedings of the Seventh International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, 2001. ACM Press. [132] The human genome project. http://www.ornl.gov/sci/techresources/Human_Genome/home.shtml. [133] I. Foster. Globus Toolkit Version 4: Software for Service-Oriented Systems. IFIP International Conference on Network and Parallel Computing, SpringerVerlag, LNCS 3779:pp 2–13, 2005. [134] I. Foster, C. Kesselman, J. Nick, S. Tuecke. The Physiology of the Grid: An Open Grid Services Architecture for Distributed Systems Integration. Open Grid Service Infrastructure WG, Global Grid Forum., June 2002. [135] I. Foster, C. Kesselman, S. Tuecke. The Anatomy of the Grid: Enabling Scalable Virtual Organizations. International J. Supercomputer Applications, 15(3), 2001. [136] I. Foster, E. Alpert, A. Chervenak, B. Drach, C. Kesselman, V. Nefedova, D. Middleton, A. Shoshani, A. Sim, D. Williams. The Earth System Grid II: Turning Climate Datasets Into Community Resources. In Proceedings of the American Meterologcal Society Conference, 2001. [137] I. Foster, H. Kishimoto, A. Savva, D. Berry, A. Djaoui, A. Grimshaw, B. Horn, F. Maciel, F. Siebenlist, R. Subramaniam, J. Treadwell, J. Von Reich. The Open Grid Services Architecture, Version 1.0. In Informational Document, Global Grid Forum (GGF), January 2005. [138] Adriana Iamnitchi and Domenico Talia. P2p computing and interaction with grids. Future Generation Computer Systems (FGCS), 21(3):331–332, 2005. [139] Informing business and regional policy. http://www.epcc.ed.ac.uk/projects/inwa/. [140] International Virtual Data Grid Laboratory. http://www.ivdgl.org/. [141] J. da Silva, C. Giannella, R. Bhargava, H. Kargupta, M. Klusch. Distributed Data Mining and Agents. In Engineering Applications of Artificial Intelligence, volume 18, pages 791–807, 2005. [142] J. Kim, Y. Gil, M. Spraragen. A Knowledge-Based Approach to Interactive Workflow Composition. In In Workshop: Planning and Scheduling for Web and Grid Services, at the 14th International Conference on Automatic Planning and Scheduling (ICAPS 04), Whistler, Canada, 2004. 105

[143] J. Syed, M. Ghanem, Y.Guo. Discovery Processes: Representation and Re-use. In UK e-Science All-hands Conference, Sheffield, UK, Sept 2002. [144] Januzaj E., Kriegel H.-P., and Pfeifle M. DBDC: Density Based Distributed Clustering. In Proceedings of EDBT in Lecture Notes in Computer Science 2992, pages 88–105, 2004. [145] Johnson E. and Kargupta H. Collective, Hierarchical Clustering From Distributed, Heterogeneous Data. In M. Zaki and C. Ho, editors, Lecture Notes in Computer Science, volume 1759, pages 221–244. Springer-Verlag, 1999. [146] Jolliffe I. Principal Component Analysis. Springer-Verlag, 2002. [147] Lewis A. Jones and Warrick J. Couch. A statistical comparison of line strength variations in coma and cluster galaxies at z 0.3. Astronomical Society, Australia, 15:309–317, 1998. [148] H. Kargupta, R. Bhargava, K. Liu, M. Powers, P. Blair, S. Bushra, J. Dull, K. Sarkar, M. Klein, M. Vasa, and D. Handy. VEDAS: A Mobile and Distributed Data Stream Mining System for Real-time Vehicle Monitoring. In Proceedings of 2004 SIAM International Conference on Data Mining (SDM’04), Lake Buena Vista, FL, April 2004. [149] H. Kargupta and B. Park. Mining time-critical data stream using the Fourier spectrum of decision trees. In Proceedings of the IEEE International Conference on Data Mining, pages 281–288. IEEE Press, 2001. [150] H. Kargupta, B. Park, D. Hershberger, and E. Johnson. Collective data mining: A new perspective towards distributed data mining. In Advances in Distributed and Parallel Knowledge Discovery, Eds: Kargupta, Hillol and Chan, Philip. AAAI/MIT Press, 2000. [151] H. Kargupta and B.H. Park. A fourier spectrum-based approach to represent decision trees for mining data streams in mobile environments. IEEE Transactions on Knowledge and Data Engineering, 16(2):216–229, 2002. [152] H. Kargupta, B.H. Park, S. Pittie, L. Liu, D. Kushraj, and K. Sarkar. Mobimine: Monitoring the stock market from a PDA. ACM SIGKDD Explorations, 3(2):37– 46, January 2002. [153] Hillol Kargupta and Haimonti Dutta. Orthogonal Decision Trees. In Proceedings of The Fourth IEEE International Conference on Data Mining (ICDM’04), Brighton, UK, November 2004. [154] Kargupta H. and Chan P. (editors). Advances in Distributed and Parallel Knowledge Discovery. AAAI press, Menlo Park, CA, 2000. [155] Kargupta H. and Puttagunta V. An Efficient Randomized Algorithm for Distributed Principal Component Analysis from Heterogeneous Data. In Proceedings of the Workshop on High Performance Data Mining in conjunction with the Fourth SIAM International Conference on Data Mining, 2004. 106

[156] Kargupta H. and Sivakumar K. Existential Pleasures of Distributed Data Mining. In Data Mining: Next Generation Challenges and Future Directions, edited by H. Kargupta, A. Joshi, K. Sivakumar, and Y. Yesha, MIT/AAAI Press, pages 3–26, 2004. [157] Kargupta H., Huang W., Sivakumar K., and Johnson E. Distributed clustering using collective principal component analysis. Knowledge and Information Systems Journal, 3:422–448, 2001. [158] First International Workshop on Knowldge and Data Mining Grid (KDMG’05). http://laurel.datsi.fi.upm.es/KDMG05/. [159] Knowledge Grid Lab. http://dns2.icar.cnr.it/kgrid/. [160] M. Klusch, S. Lodi, and G. L. Moro. Distributed Clustering Based on Sampling Local Density Estimates. In Proceedings of International Joint Conference on Artificial Intelligence (IJCAI 2003), pages 485–490, Mexico, August 2003. [161] Konstantinos Karasavvas, Mario Antonioletti, Malcolm Atkinson, Neil Chue Hong, Tom Sugden, Alastair Hume, Mike Jackson, Amrey Krause, Charaka Palansuriya. Introduction to OGSA-DAI Services. Lecture Notes in Computer Science , 3458:1 – 12, Jun 2005. [162] Yordan Kostov and Govid Rao. Low-cost optical instrumentation for biomedical measurements. Review of Scientific Instruments, 71(12):4361–4373, December 2000. [163] J. Kotecha, V. Ramachandran, and A. Sayeed. Distributed multi-target classification in wireless sensor networks. IEEE Journal of Selected Areas in Communications (Special Issue on Self-Organizing Distributed Collaborative Sensor Networks), 2003. [164] Ludmila I. Kuncheva. Combining Pattern Classifiers: Methods and Algorithms. Wiley-Interscience, Hoboken, N.J., July, 2004. [165] E. Kushilevitz and Y. Mansour. Learning decision trees using the Fourier spectrum. SIAM Journal oo Computing, 22(6):1331–1348, 1993. [166] S. W. Kwok and C. Carter. Multiple decision trees. Uncertainty in Artificial Intelligence 4, pages 327–335, 1990. [167] Lazarevic A., Pokrajac D., and Obradovic Z. Distributed Clustering and Local Regression for Knowledge Discovery in Multiple Spatial Databases. In Proceedings of the 8th European Symposium on Artificial Neural Networks, pages 129–134, 2000. [168] M. Lenzerini. Data integration: a theoretical perspective. In ACM Press, editor, In Proceedings of the Twenty-First ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems: PODS 2002, pages 233–246, Madison, Wisconsin, June 3–5 2002. ACM Press, New York, NY 10036, USA. 107

[169] F. Leymann and D. Roller. Workflow-Based Applications. IBM Systems Journal, 36(1):102–123, 1997. [170] Large Hadron Collider (LHC). http://lcg.web.cern.ch/LCG/. [171] Kuncheva L.I. A Theoretical Study on Six Classifier Fusion Strategies. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(2):281–286, 2002. [172] Kuncheva L.I. ’Fuzzy’ vs ’Non-fuzzy’ in combining classifiers designed by boosting. IEEE Transactions on Fuzzy Systems, 11(6):729–741, 2003. [173] Laser Interferometer http://www.ligo.caltech.edu/.

Gravitational

Wave

Observatory.

[174] N. Linial, Y. Mansour, and N. Nisan. Constant depth circuits, fourier transform, and learnability. Journal of the ACM, 40:607–620, 1993. [175] M. Antonioletti, B. Collins, A. Krause, S. Laws, S. Malaika, J. Magowan, N.W. Paton. Web services data access and integration - the relational realization (WSDAIR), Version 1.0. In Global Grid Forum (GGF), 2005. [176] M. Cherniack, H. Balakrishnan, M. Balazinska, D. Carney, U. Cetintemel, Y. Xing, S. Zdonik. Scalable Distributed Stream Processing. In First Biennial Conference on Innovative Database Systems (CIDR’03), Asilomar, CA, January 2003 2003. [177] Sam Madden and Michael J. Franklin. Fjording the stream: An architecture for queries over streaming sensor data. In ICDE, 2002. [178] Sam Madden, Mehul A. Shah, Joseph M. Hellerstein, and Vijayshankar Raman. Continuously adaptive continuous queries over streams. In SIGMOD Conference, 2002. [179] S. Majithia, M. S. Shields, I. J. Taylor, and I. Wang. Triana: A Graphical Web Service Composition and Execution Toolkit. In Proceedings of the IEEE International Conference on Web Services (ICWS’04), pages 514–524. IEEE Computer Society, 2004. [180] S. Majithia, I. Taylor, M. S., and I. Wang. Triana as a Graphical Web Services Composition Toolkit. In Simon J. Cox, editor, Proceedings of UK e-Science All Hands Meeting, pages 494–500. EPSRC, September 2003. [181] A. Manjhi, V. Shkapenyuk, K. Dhamdhere, and C. Olston. Finding (recently) frequent items in distributed data streams. In International Conference on Data Engineering (ICDE’05), 2005. [182] M.Antonioletti, A. Krause, S. Hastings, S. Langella, S. Laws, S.Malaika, N.W. Paton. Web services data access and integration - the XML realization (WSDAIX), Version 1.0,. In Global Grid Forum (GGF), 2005. 108

[183] M.Antonioletti, M. Atkinson, A. Krause, S. Laws, S. Malaika, N.W. Paton, D. Pearson, G. Riccardi. Web services data access and integration - the core (WSDAI) Specification, Version 1.0. In Global Grid Forum (GGF), 2005. [184] María S. Pérez, Alberto Sánchez, Pilar Herrero, Víctor Robles, José M. Peña. Adapting the Weka Data Mining Toolkit to a Grid Based Environment. In Lecture Notes in Computer Science, volume 3528, pages 492 – 497, January 2005. [185] Mario Antonioletti, Malcolm Atkinson, Rob Baxter, Andrew Borley, Neil P. Chue Hong, Brian Collins, Neil Hardman, Alastair C. Hume, Alan Knox, Mike Jackson, Amy Krause, Simon Laws, James Magowan, Norman W. Paton, Dave Pearson, Tom Sugden, Paul Watson, Martin Westhead. The design and implementation of Grid database services in OGSA-DAI. Concurrency and Computation: Practice and Experience , 17(2-4):357–376, November 1-4 2005. [186] McClean S., Scotney B., and Greer K. Conceptual Clustering Heterogeneous Distributed Databases. In Workshop on Distributed and Parallel Knowledge Discovery, Boston, MA, 2000. [187] Distributed Data Mining in Astrophysics and http://www.cs.queensu.ca/home/mcconell/DDMAstro.html.

Astronomy.

[188] Merugu S. and Ghosh J. Privacy-Preserving Distributed Clustering Using Generative Models. In Proceedings of the IEEE Conference on Data Mining (ICDM), pages 211–218, November 2003. [189] C. Merz and M. Pazzani. A principal components approach to combining regression estimates. Machine Learning, 36:9–32, 1999. [190] Christoper J. Merz and Michael J. Pazzani. A principal components approach to combining regression estimates. Machine Learning, 36(1–2):9–32, 1999. [191] Michael Beynon and Renato Ferreira and Tahsin M. Kurc and Alan Sussman and Joel H. Saltz. Datacutter: Middleware for filtering very large scientific datasets on archival storage systems. In IEEE Symposium on Mass Storage Systems, pages 119–134, 2000. [192] Michael Russell, Gabrielle Allen, Ian Foster, Ed Seidel, Jason Novotny, John Shalf, Gregor von Laszewski and Greg Daues. The Astrophysics Simulation Collaboratory: A Science Portal Enabling Community Software Development. In Proceedings of the 10th IEEE International Symposium on High Performance Distributed Computing, pages 207–215, San Francisco, CA, 7-9 August 2001. [193] M.Nedim Alpdemir, Arijit Mukherjee, Norman W. Paton, Paul Watson, Alvaro A.A. Fernandes, Anastasios Gounaris, and Jim Smith. OGSA-DQP: A servicebased distributed query processor for the Grid. In Simon J Cox, editor, Proceedings of UK e-Science All Hands Meeting Nottingham. EPSRC, 24th September 2003. [194] Mobile medical monitoring. http://www.equator.ac.uk/index.php/articles/634. 109

[195] NASA's Information Power Grid. http://www.ipg.nasa.gov/.
[196] NASA's Data Mining Resources for Space Science. http://rings.gsfc.nasa.gov/~borne/nvo_datamining.html.
[197] World Data Center for Meteorology. http://www.ncdc.noaa.gov/oa/wmo/wdcamet.html.
[198] F. Neubauer, A. Hoheisel, and J. Geiler. Workflow-based Grid applications. Future Generation Computer Systems, 22(1–2):6–15, 2006.
[199] Nielsen R. Context Vectors: General Purpose Approximate Meaning Representations Self-organized From Raw Data. In Computational Intelligence: Imitating Life, pages 43–56. IEEE Press, 1994.
[200] Nikolaos Giannadakis, Anthony Rowe, Moustafa Ghanem, and Yike Guo. InfoGrid: Providing Information Integration for Knowledge Discovery. Information Sciences, 155:199–226, 2003.
[201] Nithya Vijayakumar, Ying Liu, and Beth Plale. Calder: Enabling Grid Access to Data Streams. In IEEE High Performance Distributed Computing (HPDC), Raleigh, North Carolina, July 2005.
[202] Nithya Vijayakumar, Ying Liu, and Beth Plale. Calder Query Grid Service: Insights and Experimental Evaluation. In IEEE Cluster Computing and Grid (CCGrid), May 2006.
[203] US National Virtual Observatory. http://www.us-vo.org/.
[204] O. Wolfson, S. Chamberlain, P. Sistla, B. Xu, and J. Zhou. DOMINO: Databases fOr MovINg Objects tracking. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Philadelphia, PA, June 1999.
[205] OGSA-DAI Project. http://www.ogsadai.org.uk/.
[206] C. Olston, J. Jiang, and J. Widom. Adaptive filters for continuous queries over distributed data streams. In ACM SIGMOD'03 International Conference on Management of Data, 2003.
[207] David Opitz and Richard Maclin. Popular ensemble methods: An empirical study. Journal of Artificial Intelligence Research, 11:169–198, 1999.
[208] P. Seshadri, M. Livny, and R. Ramakrishnan. SEQ: A Model for Sequence Databases. In Proceedings of the IEEE International Conference on Data Engineering (ICDE), pages 232–239, 1995.
[209] B. Park. Knowledge Discovery from Heterogeneous Data Streams Using Fourier Spectrum of Decision Trees. PhD thesis, Washington State University, 2001.
[210] B. Park, H. Kargupta, E. Johnson, E. Sanseverino, D. Hershberger, and L. Silvestre. Distributed, collaborative data analysis from heterogeneous sites using a scalable evolutionary technique. Applied Intelligence, 16(1):19–42, 2002.

[211] B. Park, R. Ayyagari, and H. Kargupta. A Fourier analysis-based approach to learning decision trees in a distributed environment. In Proceedings of the First SIAM International Conference on Data Mining, Chicago, USA, 2001.
[212] B. H. Park and H. Kargupta. Constructing simpler decision trees from ensemble models using Fourier analysis. In Proceedings of the 7th Workshop on Research Issues in Data Mining and Knowledge Discovery, ACM SIGMOD, pages 18–23, 2002.
[213] B. H. Park and H. Kargupta. Distributed data mining: Algorithms, systems, and applications. In Data Mining Handbook. To be published, 2002.
[214] Park B. and Kargupta H. Distributed Data Mining: Algorithms, Systems, and Applications. In The Handbook of Data Mining, edited by N. Ye, Lawrence Erlbaum Associates, pages 341–358, 2003.
[215] Park B., Kargupta H., Johnson E., Sanseverino E., Hershberger D., and Silvestre L. Distributed, Collaborative Data Analysis From Heterogeneous Sites Using a Scalable Evolutionary Technique. Applied Intelligence, 16(1), 2002.
[216] The Protein Data Bank (PDB). http://www.rcsb.org/pdb/Welcome.do.
[217] M. P. Perrone and L. N. Cooper. When networks disagree: Ensemble method for neural networks. In R. J. Mammone, editor, Neural Networks for Speech and Image Processing. Chapman-Hall, 1993.
[218] Ivan Janciak, Peter Brezany, Andrzej Goscinski, and A Min Tjoa. The Development of a Wisdom Autonomic Grid. In Knowledge Grid and Grid Intelligence, Beijing, September 26, 2004.
[219] Peter Brezany, A Min Tjoa, Helmut Wanek, and Alexander Woehrer. Mediators in the Architecture of Grid Information Systems. In Conference on Parallel Processing and Applied Mathematics, Czestochowa, Poland, September 7–10, 2003.
[220] Peter Brezany, Juergen Hofer, A Min Tjoa, and Alexander Woehrer. GridMiner: An Infrastructure for Data Mining on Computational Grids. In APAC Conference and Exhibition on Advanced Computing, Grid Applications and eResearch, Queensland, Australia, 29 September – 2 October 2003.
[221] Beth Plale. Framework for Bringing Data Streams to the Grid. Scientific Programming, 12:213–223, IOS Press, Amsterdam, 2004.
[222] Beth Plale. Using Global Snapshots to Access Data Streams on the Grid. In 2nd European Across Grids Conference (AxGrids 2004), Lecture Notes in Computer Science, volume 3165. Springer-Verlag, 2004.
[223] Beth Plale and Karsten Schwan. Dynamic Querying of Streaming Data with the dQUOB System. IEEE Transactions on Parallel and Distributed Systems, 14:422–432, April 2003.

[224] Palomar Observatory Sky Survey. http://www.astro.caltech.edu/observatories/palomar/public/index.html.
[225] Particle Physics Data Grid. http://www.ppdg.net/.
[226] Provost F. Distributed Data Mining: Scaling Up and Beyond. In Advances in Distributed and Parallel Knowledge Discovery, edited by H. Kargupta, A. Joshi, K. Sivakumar, and Y. Yesha, MIT/AAAI Press, pages 3–27, 2000.
[227] J. Ross Quinlan. Induction of decision trees. Machine Learning, 1(1):81–106, 1986.
[228] R. Jacob, C. Schafer, I. Foster, M. Tobis, and J. Anderson. Computational Design and Performance of the Fast Ocean Atmosphere Model, Version One. In International Conference on Computational Science, 2001.
[229] Rahul Ramachandran, John Rushing, Helen Conover, Sara J. Graves, and Ken Keiser. Flexible Framework for Mining Meteorological Data. In American Meteorological Society's (AMS) 19th International Conference on Interactive Information Processing Systems (IIPS) for Meteorology, Oceanography, and Hydrology, Long Beach, CA, February 9–13, 2003.
[230] Richard Kuntschke, Bernhard Stegmaier, Alfons Kemper, and Angelika Reiser. StreamGlobe: Processing and Sharing Data Streams in Grid-based P2P Infrastructures. In Proceedings of the 31st International Conference on Very Large Data Bases (VLDB 2005), pages 1259–1262, Trondheim, Norway, August 30 – September 2, 2005.
[231] Rinat Khoussainov, Xin Zuo, and Nicholas Kushmerick. Grid-enabled Weka: A Toolkit for Machine Learning on the Grid. ERCIM News, 59, October 2004.
[232] J. J. Rodríguez, L. I. Kuncheva, and C. J. Alonso. Rotation Forest: A new classifier ensemble method. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2006.
[233] J. Rushing, R. Ramachandran, U. Nair, S. Graves, R. Welch, and H. Lin. ADaM: a data mining toolkit for scientists and engineers. Computers and Geosciences, 31:607–618, June 2005.
[234] Parthasarathy S. and Ogihara M. Clustering Distributed Homogeneous Datasets. In Proceedings of the Fourth European Conference on Principles of Data Mining and Knowledge Discovery, volume 1910 of Springer-Verlag Lecture Notes in Computer Science, pages 566–574, 2000.
[235] S. Babu, L. Subramanian, and J. Widom. A Data Stream Management System for Network Traffic Management. In Proceedings of the Workshop on Network-Related Data Management (NRDM 2001), May 2001.
[236] S. Datta, C. Giannella, and H. Kargupta. K-Means Clustering over a Large, Dynamic Network. In Proceedings of the SIAM International Data Mining Conference, 2006. Accepted for publication.

[237] Samatova N., Ostrouchov G., Geist A., and Melechko A. RACHET: An Efficient Cover-Based Merging of Clustering Hierarchies from Distributed Datasets. Distributed and Parallel Databases, 11(2):157–180, 2002.
[238] Sandi Redman. Mining on the Grid with ADaM. In Southeastern Universities Research Association (SURA) Targeted Communities Workshop Focus Study, Atlanta, GA, January 2005.
[239] Sara J. Graves, Helen Conover, Ken Keiser, Rahul Ramachandran, Sandi Redman, John Rushing, and Steve Tanner. Mining and Modeling in the Linked Environments for Atmospheric Discovery (LEAD). In Huntsville Simulation Conference, Huntsville, AL, October 19, 2004.
[240] Robert E. Schapire and Yoram Singer. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3):297–336, 1999.
[241] Scientific Data Mining, Integration and Visualization Workshop. http://www.anc.ed.ac.uk/sdmiv/.

[242] Sloan Digital Sky Survey. http://www.sdss.org.
[243] Praveen Seshadri, Miron Livny, and Raghu Ramakrishnan. The design and implementation of a sequence database system. In The VLDB Journal, pages 99–110, 1996.
[244] Shafer J., Agrawal R., and Mehta M. SPRINT: A Scalable Parallel Classifier for Data Mining. In Proceedings of the 22nd International Conference on Very Large Databases (VLDB), pages 544–555, 1996.
[245] Shyu M.-L., Chen S.-C., Sarinnapakorn K., and Chang L. A Novel Anomaly Detection Scheme Based on a Principal Component Classifier. In Proceedings of the Foundations and New Directions of Data Mining Workshop, in Conjunction with the Third IEEE International Conference on Data Mining (ICDM), pages 172–179, 2003.
[246] The Spitfire Project. http://edg-wp2.web.cern.ch/edg-wp2/spitfire/index.html.
[247] S. Stolfo et al. JAM: Java agents for meta-learning over distributed databases. In Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, pages 74–81, Menlo Park, CA, 1997. AAAI Press.
[248] S. J. Stolfo, A. L. Prodromidis, S. Tselepis, W. Lee, D. W. Fan, and P. K. Chan. JAM: Java agents for meta-learning over distributed databases. In Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, pages 74–81, Newport Beach, CA, August 1997. AAAI Press.
[249] W. Nick Street and YongSeog Kim. A streaming ensemble algorithm (SEA) for large-scale classification. In Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, 2001.


[250] Strehl A. and Ghosh J. Cluster Ensembles – A Knowledge Reuse Framework for Combining Multiple Partitions. Journal of Machine Learning Research, 3:583–617, 2002.
[251] The Swiss-Prot protein knowledge base. http://www.expasy.org/sprot/.
[252] Domenico Talia. Knowledge discovery services and tools on grids. In Foundations of Intelligent Systems – ISMIS 2003, volume 2871 of LNCS, pages 14–23. Springer-Verlag, October 2003.
[253] Domenico Talia and Paolo Trunfio. Toward a synergy between P2P and grids. IEEE Internet Computing, 7(4):94–96, 2003.
[254] Domenico Talia and Paolo Trunfio. Adapting a pure decentralized peer-to-peer protocol for grid services invocation. Parallel Processing Letters (PPL), 15(1–2):67–84, 2005.
[255] Ian Taylor, Ian Wang, Matthew Shields, and Shalil Majithia. Distributed computing with Triana on the Grid. Concurrency and Computation: Practice and Experience, 17(1–18), 2005.
[256] Telemedicine. www.escience.cam.ac.uk/projects/telemed.
[257] TeraGrid. http://www.teragrid.org/about/.
[258] The TeraGrid primer, September 2002. http://www.teragrid.org/about/.
[259] Tho Manh Nguyen, A Min Tjoa, Guenter Kickinger, and Peter Brezany. Towards Service Collaboration Model in Grid-based Zero Latency Data Stream Warehouse (GZLDSWH). In 2004 IEEE International Conference on Services Computing (SCC'04), pages 357–365, 2004.
[260] The Triana project. http://www.trianacode.org/.
[261] K. Tumer and J. Ghosh. Robust order statistics based ensembles for distributed data mining. In H. Kargupta and P. K. Chan, editors, Advances in Distributed and Parallel Knowledge Discovery, pages 185–210. MIT/AAAI Press, 2000.
[262] Tumer K. and Ghosh J. Robust Order Statistics Based Ensembles for Distributed Data Mining. In Kargupta H. and Chan P., editors, Advances in Distributed and Parallel Knowledge Discovery, pages 185–210. MIT/AAAI Press, 2000.
[263] US National Virtual Observatory. http://us-vo.org/.
[264] V. Curcin, M. Ghanem, Y. Guo, M. Kohler, A. Rowe, J. Syed, and P. Wendel. Discovery Net: Towards a Grid of Knowledge Discovery. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2002), Edmonton, Alberta, Canada, July 23–26, 2002.
[265] Alexander S. Szalay, Jim Gray, and Jan Vandenberg. Petabyte Scale Data Mining: Dream or Reality? SPIE Astronomy Telescopes and Instruments, 22–28 August 2002.

[266] VivoMetrics LifeShirt garment. http://www.vivometrics.com.
[267] Gregor von Laszewski and Ian Foster. Grid Infrastructure to Support Science Portals for Large Scale Instruments. In Proceedings of the Workshop Distributed Computing on the Web (DCW), pages 1–16. University of Rostock, Germany, 21–23 June 1999.
[268] W. Allcock. GridFTP Protocol Specification. Global Grid Forum Recommendation GFD.20, March 2003.
[269] W. Allcock, J. Bresnahan, R. Kettimuthu, M. Link, C. Dumitrescu, I. Raicu, and I. Foster. The Globus Striped GridFTP Framework and Server. In Proceedings of Super Computing 2005 (SC05), November 2005.
[270] W. Hoschek and G. McCance. Grid Enabled Relational Database Middleware. In Global Grid Forum (GGF), 2001.
[271] P. Watson. Databases and the Grid. Technical Report UKeS-2002-01, UK e-Science Programme Technical Report Series, December 2001.
[272] Wearable vest. http://www.smartextiles.info.
[273] Weka Toolkit. http://www.cs.waikato.ac.nz/~ml/.
[274] Weka4WS. http://grid.deis.unical.it/weka4ws.
[275] Wolfgang Hoschek, Javier Jaen-Martinez, Asad Samar, Heinz Stockinger, and Kurt Stockinger. Data Management in an International Data Grid Project. In IEEE/ACM International Workshop on Grid Computing (Grid 2000), December 2000.
[276] D. Wolpert. Stacked generalization. Neural Networks, 5:241–259, 1992.
[277] Ying Liu, Beth Plale, and Nithya Vijayakumar. Distributed Streaming Query Planner in Calder System. In IEEE High Performance Distributed Computing (HPDC), Raleigh, North Carolina, July 2005.
[278] Zaki M. Parallel and Distributed Association Mining: A Survey. IEEE Concurrency, 7(4):14–25, 1999.
[279] Zaki M. Parallel and Distributed Data Mining: An Introduction. In Large-Scale Parallel Data Mining (Lecture Notes in Artificial Intelligence 1759), edited by Zaki M. and Ho C.-T., Springer-Verlag, Berlin, pages 1–23, 2000.

