Discovery Net: Towards a Grid of Knowledge Discovery

V. Ćurčin, M. Ghanem, Y. Guo, M. Köhler, A. Rowe, J. Syed, P. Wendel
Imperial College of Science, Technology and Medicine
180 Queen's Gate, London
[email protected]

ABSTRACT

This paper provides a blueprint for constructing collaborative and distributed knowledge discovery systems within Grid-based computing environments. The need for such systems is driven by the quest for sharing knowledge, information and computing resources within the boundaries of single large distributed organisations or within complex Virtual Organisations (VO) created to tackle specific projects. The proposed architecture is built on top of a resource federation management layer and is composed of a set of different resources. We show how this architecture will behave during a typical KDD process design and deployment, how it enables the execution of complex and distributed data mining tasks with high performance, and how it provides a community of e-scientists with means to collaborate, retrieve and reuse KDD algorithms, discovery processes and knowledge in a visual analytical environment.

1. INTRODUCTION

This paper proposes an architecture to support the KDD process in a Grid-enabled distributed computing environment. The approach is generic but originates from the needs of the knowledge discovery processes in the bioinformatics industry, where complicated data analysis processes are constructed using a data-pipelined approach. At different stages of the discovery pipeline researchers need to access, integrate and analyse data from disparate sources, in order to use that data to find patterns and models, and feed these models to further stages in the pipeline. At each stage, new analysis is conducted by dynamically combining new data with previously developed models. As a motivating example, consider an automated laboratory experiment, where a range of sensors produces large volumes of data about the activity of genes in cancerous cells. A short time series is produced that records how each gene responds to the introduction of a possible drug. The initial requirement of the analysis is to filter interesting time series from uninteresting ones; one approach is to use clustering [5]. If a group of interesting genes is found, then a crucial step in the scientific discovery process is to verify if the clusters can be explained by referring to existing biological knowledge. Bioinformatics researchers have made available a significant amount of information on the Internet about various biological items and processes (Genes, Proteins, Metabolism and Regulation). These semi-structured resources can be accessed, from remote online databases over the Internet, through a range of search mechanisms, ranging from key-based lookups to biosequence similarity searches. The need to integrate this information within the discovery process is inevitable, since it dictates how the discovery may proceed. Furthermore, recording an audit trail of how this information was acquired and used is essential, since it allows researchers to document and manage their discovery procedures, re-use the same procedure in similar scenarios, and in many cases it may help them in managing intellectual property activities such as patent applications, peer reviews and publications.

Another feature of such discovery pipelines is that the analysis components used can be tied to remote computing resources, e.g. similarity searches over DNA sequences executing on a shared high-performance machine. New services and tools for executing similar operations are continually being made available over the Internet for access by various researchers. Also, the discovery process itself is almost always conducted by teams of collaborating researchers who need to share the data sets, the results derived from these data sets and, more importantly, details about how these results were derived.

This data-pipelined approach is gaining ground beyond the life sciences, where similar needs arise for cross-referencing patterns discovered in a data set with patterns and data stored in remote databases, and for using shared high-performance resources. Examples abound in the analysis of heterogeneous data in fields such as geological analysis, environmental sciences, astronomy and particle physics.

Supporting the data-pipelined knowledge discovery process requires KDD tools that flexibly operate in an open system allowing:

• Dynamic retrieval and construction of data sets,

• Execution of data mining components on distributed computing servers,


• Dynamic integration of new servers, new databases and new algorithms within the KDD process.

The above requirements can be contrasted with the services offered by existing tools, which mainly focus on extracting knowledge within closed systems such as a centralised database or a data warehouse, where all the data required for an analysis task can be materialised locally at any time before being fed to data mining algorithms and tools that were predefined at the configuration stage of the tool. Recently [12], the Grid, a novel IT infrastructure, has been proposed to provide a well-defined resource management infrastructure for virtual organisations (VO) and to allow end users to share both information and computing resources in secure environments. Grid concepts and tools offer a flexible but secure computing infrastructure that meets some of the requirements of the data-pipelined knowledge discovery process. However, a gap still exists between the services offered by Grid-based methodologies and the requirements of the KDD process. This paper proposes a service layer and architecture that aims to make best use of existing KDD practice [1] and tools [2, 8], as well as of existing Grid infrastructure and concepts, so as to bridge the gap between the traditional KDD process (from its definition to its deployment as an application for knowledge discovery) and its mapping onto the VO's resources.

2. BACKGROUND

Increased research demands in fields such as high energy particle physics, astronomy and environmental modelling led to the exploitation of fast networks and large-scale distributed computing techniques. This at first took place exclusively within the scientific community (e.g. the SETI project [19]), where individual computers were given processing tasks in their 'spare time' with the results being assembled in one centralised location. Very soon this idea of a global computing platform was dubbed 'The Grid', and numerous research groups (many of which are now in the Global Grid Forum [15]) started devoting time and effort to developing architectures to support this new infrastructure. Initial Grid work was directed purely towards 'processor-stealing' algorithms, concentrating on hardware resources that could be shared, like CPU, memory and bandwidth, but this soon changed to encompass a far broader, softer, class of resources, like programs, data sources, knowledge repositories etc. In their seminal paper [12], Foster, Kesselman & Tuecke state that the actual problem the Grid is trying to solve is 'coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organisations', where virtual organisations are considered to be any formal or informal communities that share a certain set of resources under some well-defined rules. Furthermore, they define the layers of a Grid architecture, with lower levels providing middleware support for higher-level application-specific services, thereby opening the door to the development and porting of more ambitious systems, like knowledge discovery or bioinformatics platforms [16], onto the Grid.

At the present stage, most of the current Grid middleware products suffer from a lack of interoperability and an insufficient focus on the software dimension of the problem. Even the Globus Toolkit 2 [10], widely acknowledged to be the most sophisticated middleware available, does not run on Windows, so we are still far from being able to completely abstract over any middleware platform in a commercially viable system. Still, there are some running grids that can be useful sources of further information, such as the European Data Grid [6], EuroGrid [7], the NASA Information Power Grid [17] and a number of university projects. Fortunately, the next version of Globus (GT3) will include an Open Grid Services Architecture (OGSA) [11] implementation that combines the Grid work performed so far with the broad experience of Web service technologies [21] to provide an industry-usable platform. Using WSDL [22] and UDDI [20], it effectively provides the service-based IT infrastructure level that can be used in constructing and providing the application layer of Knowledge Discovery and Data Mining services. Regardless of the implementation details, it is our belief that the next generation of data mining tools will be running on the Grid; therefore the KDD community should participate in the project and influence the work so as to ensure that the resulting environment will be well suited to its needs as well. The aim of this paper is to propose a way forward and provide the first step towards the next generation of data mining architectures, Grid-enabled and capable of meeting the challenges posed by the evolved distributed computing landscape.

3. KNOWLEDGE DISCOVERY SERVICES

In the context of this paper, we use the term Knowledge Discovery Service to describe the building blocks used in a data-pipelined KDD process. As with the traditional KDD process [9], the Grid-based version spans all activities from data collection to modelling and deployment. This definition thus encapsulates KDD algorithms as well as components that extract data from a database. Regarding these components as services is essential since it allows us to separate their definition from their implementation. In this paper, we also use the term Knowledge to refer to any structure that a Knowledge Discovery Service needs as input or generates as output.
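To make these definitions concrete, the following minimal Java sketch (the interface and method names are our own illustration, not part of any published Discovery Net API) models Knowledge as anything a service consumes or produces, and a discovery service as a typed mapping from input Knowledge to output Knowledge:

import java.util.List;

/** Any structure that a discovery service needs as input or generates as
    output: a relational table, a clustering model, a set of annotations, ... */
interface Knowledge {
    /** Metadata describing the structure, e.g. feature names and types. */
    String describe();
}

/** A building block of a data-pipelined KDD process, covering both
    computation services and data services. */
interface KnowledgeDiscoveryService {
    /** Declared input and output types, used when composing services. */
    List<Class<? extends Knowledge>> inputTypes();
    List<Class<? extends Knowledge>> outputTypes();

    /** Apply the service to its inputs, producing new Knowledge objects. */
    List<Knowledge> execute(List<Knowledge> inputs);
}

Separating this abstract definition from concrete implementations is what later allows the same service to be realised by different resource-bound or resource-free implementations.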

3.1 Service Types

We classify knowledge discovery services into two categories: Computation services and Data services. A computation service allows users to define and compose their analysis processes by assembling data preparation and data mining algorithms. In contrast, a data service allows users to define their analysis data set as a composition of relational tables queried from databases and other data sources available from the VO, which can augment the training set with additional information.

3.1.1 Computation Services

A computation service can be seen as a classical KDD algorithm. Its input and output can be any number of Knowledge objects. Each of its inputs can be linked to the output of one or more other services, allowing the definition of distributed and parallel data mining processes. Since a computation service has an implementation, it can also have resource constraints such as:

• Location: A service might be bound to a particular resource, either because its implementation is platform-specific (e.g. a high-performance parallel implementation) or because the service is licensed only for a particular machine.

• Platform/OS constraints: Even if a service is not bound to a particular location and its implementation can be downloaded, it might execute only on a specific platform or operating system.

3.1.2 Data Services

In contrast to computation services, which enable the composition of functions as discovery processes, data services do not provide implementations; instead they provide meta-information for composing data sets from heterogeneous and distributed data sources. Data services are used to model data for analysis. In a large VO the required data is produced by different devices and is retrieved through different protocols (RDBMS, XML on Web servers, etc.) in different locations. The information required for a specific task is a composition of these sources, in effect an inner-join operation over heterogeneous and distributed data. A data service describes its functionality using metadata descriptors for the sources of information it will attempt to use in order to materialise the data set. The metadata also describes the form of the data in terms of the different features and their data types. Features from different data services can be extracted and composed, in a similar way to computation services, to create a Knowledge Schema, which is effectively a composition of data services. This schema can then be incorporated at any point of the discovery process.
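As an illustration of the data service idea, the sketch below (hypothetical types and names, assuming a single shared join key) shows how metadata descriptors for heterogeneous sources can be combined into a Knowledge Schema, i.e. a selection of features drawn from several data services that is only materialised when the discovery process runs:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Metadata for one data source: where it lives, how it is reached,
    and the features (name -> declared type) it can provide. */
class DataServiceDescriptor {
    final String location;              // e.g. a JDBC URL or a Web server endpoint
    final String protocol;              // e.g. "RDBMS" or "XML over HTTP"
    final Map<String, String> features; // feature name -> data type

    DataServiceDescriptor(String location, String protocol, Map<String, String> features) {
        this.location = location;
        this.protocol = protocol;
        this.features = features;
    }
}

/** A Knowledge Schema: features selected from several data services,
    joined on a shared key when the data set is materialised. */
class KnowledgeSchema {
    final String joinKey;
    final List<DataServiceDescriptor> sources = new ArrayList<>();
    final List<String> selectedFeatures = new ArrayList<>();

    KnowledgeSchema(String joinKey) {
        this.joinKey = joinKey;
    }

    /** Add one feature from a source; the source must actually provide it. */
    void addFeature(DataServiceDescriptor source, String feature) {
        if (!source.features.containsKey(feature)) {
            throw new IllegalArgumentException("Unknown feature: " + feature);
        }
        sources.add(source);
        selectedFeatures.add(feature);
    }
}

In the bioinformatics example of Section 1, one source might be a local gene expression table and another a remote annotation database; the schema records the composition without copying either source.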

3.1.3 Service type

In order to de-couple resources from service definitions and service implementations, we define four types of services that are applicable to both computation and data services:

• A Resource-bound service (e.g. a parallel implementation, optimised native code, a special data source or data sink)

• A Resource-free service (e.g. pure Java code for algorithms, simple data processing and visualisation, or standard input)

• A Template service, which will be matched, at execution time, to a service (either machine-dependent or free)

• A Composed service (e.g. a composition of services, which can also be seen as a discovery process)

3.2 Description and registration

Each of the services provided to a user can have several implementations, e.g. specialised implementations for specific platforms. Each service must thus be catalogued and categorised, allowing the user to browse the available services, or even locate and retrieve them through queries. This requires each discovery process to be well-defined through a descriptor registered with a specific server. This registration mechanism allows new services to be added to the system through service descriptors including: Input and Output types; Parameters; Constraints (range of validity, discrete set of possible values, optional or required, ...); Registration information; and Service type. The registration information can include the following:

• Factory: An object that allows a client to retrieve a reference to the service, or to download the service if it can be instantiated on the client machine.

• Category: Defines the location of the service in the hierarchy of services available.

• Keywords: A set of keywords associated with the service, used to index and retrieve it.

• Description: A human-readable description of the functionality provided by the service.
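A registry entry covering the descriptor contents listed above might look roughly as follows; the class and field names are illustrative only and do not reflect any published Discovery Net descriptor schema:

import java.util.List;
import java.util.Map;

/** Descriptor registered with the service registry so that users can browse,
    query, locate and retrieve services. */
class ServiceDescriptor {
    String name;                    // e.g. "Decision Tree"
    List<String> inputTypes;        // Knowledge types consumed
    List<String> outputTypes;       // Knowledge types produced
    Map<String, String> parameters; // parameter -> constraint (range, discrete values, optional/required)
    String serviceType;             // resource-bound, resource-free, template or composed
    String factory;                 // how to obtain a reference or a downloadable implementation
    String category;                // location in the hierarchy of available services
    List<String> keywords;          // indexing terms for retrieval
    String description;             // human-readable summary of the functionality
}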

3.3 Service composition

Any non-trivial discovery process requires the composition of existing services: discovery processes are created by combining different services into larger, more complicated ones. In a Grid-based environment, this may involve the augmentation of a training data set using a data service, accessing external data sources driven by the knowledge extracted by a computation service, etc. In this case, data services can be composed to create a view of the data that is used for analysis, and computational services can be composed to develop an analysis pipeline. Composition may be specified by the end-user through a GUI for component composition that provides access to a library of services including data importation, data cleaning, data normalisation, mining and statistical algorithms, data selection and data exportation. Internally, however, allowing such composition requires the use of a service definition language.
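Internally, such a composition can be represented as a directed graph of service nodes. The sketch below is a simplified, invented illustration (the graph API is ours and the service interface is only stubbed) of how one service's output can be wired to another's input, with the resulting graph itself registrable as a composed service:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Stub for the service abstraction sketched earlier in Section 3. */
interface KnowledgeDiscoveryService { }

/** A discovery process: services wired together so that the output of one
    feeds an input of another. */
class DiscoveryProcess {
    private final Map<String, KnowledgeDiscoveryService> nodes = new HashMap<>();
    // Each link records: producer node, output index, consumer node, input index.
    private final List<String[]> links = new ArrayList<>();

    void addNode(String name, KnowledgeDiscoveryService service) {
        nodes.put(name, service);
    }

    /** Connect output `outIndex` of `from` to input `inIndex` of `to`. */
    void connect(String from, int outIndex, String to, int inIndex) {
        if (!nodes.containsKey(from) || !nodes.containsKey(to)) {
            throw new IllegalArgumentException("Unknown node in link: " + from + " -> " + to);
        }
        links.add(new String[] { from, String.valueOf(outIndex), to, String.valueOf(inIndex) });
    }
}

A GUI for component composition then only needs to drive the addNode and connect operations, while a service definition language would serialise the same graph.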

3.4 Service location resolution

The location where each service of a discovery process is executed may have a strong impact on the overall performance of the process execution. When dealing with very large data sets, it is more efficient to keep as much of the computation as near to the data as possible. Therefore a location resolution method must be provided to take that particular constraint into account. This service-location resolution can be achieved as follows:

1. Recursively match any template service with a well-defined service, favouring, if a choice needs to be made, resource-bound services over other types, as they would be either native or high-performance implementations.

2. For any remaining resource-free service, the location is decided based on the location of the preceding services. If a conflict arises because more than one service precedes, then the best resource (as defined and chosen by the resource management layer for components) is selected. For this, we plan to use the ICENI infrastructure developed at the London e-Science Centre [13].
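A rough sketch of this two-step heuristic is shown below. The types and the fallback placement string are invented for illustration; in particular this is not the ICENI-based resolution referred to above, and it assumes services are visited in pipeline order, so that a node's predecessors have already been placed:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

enum ServiceKind { RESOURCE_BOUND, RESOURCE_FREE, TEMPLATE }

class ServiceNode {
    String name;
    ServiceKind kind;
    String location;                                    // fixed for resource-bound services
    List<ServiceNode> predecessors = new ArrayList<>(); // services feeding this one
}

class LocationResolver {
    /** Assign an execution location to every node of a discovery process. */
    Map<String, String> resolve(List<ServiceNode> processOrder) {
        Map<String, String> placement = new HashMap<>();
        for (ServiceNode node : processOrder) {
            if (node.kind == ServiceKind.TEMPLATE) {
                // Step 1: a concrete service would be looked up in the registry here,
                // preferring resource-bound implementations when there is a choice.
                continue;
            }
            if (node.kind == ServiceKind.RESOURCE_BOUND) {
                placement.put(node.name, node.location);
            } else {
                // Step 2: run resource-free services near the data they consume,
                // i.e. where a preceding service was placed; otherwise defer to
                // the resource management layer.
                String fallback = "chosen-by-resource-manager";
                String near = node.predecessors.isEmpty()
                        ? fallback
                        : placement.getOrDefault(node.predecessors.get(0).name, fallback);
                placement.put(node.name, near);
            }
        }
        return placement;
    }
}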

4. DISCOVERY NET IMPLEMENTATION

Our architecture is built on top of the basic Grid services, and therefore we will assume throughout that this lower layer of services, dealing with security and communication issues, is robust and stable. However, this does not mean that data available on one of the Grid's computational nodes can be moved to another node, since the data set itself can have restricted permissions. Neither does it mean that the connection is necessarily fast enough to favour moving data to another machine instead of mining it locally.


[Figure: the Discovery Net architecture, organised into Client, Discovery Services and Resources layers, including a fragment of an XML process descriptor with a node named "Decision Tree".]
