THE DESIGN OF DISCOVERY NET: TOWARDS OPEN GRID SERVICES FOR KNOWLEDGE DISCOVERY

Salman AlSairafi, Filippia-Sofia Emmanouil, Moustafa Ghanem, Nikolaos Giannadakis, Yike Guo, Dimitrios Kalaitzopoulos, Michelle Osmond, Anthony Rowe, Jameel Syed and Patrick Wendel

DEPARTMENT OF COMPUTING, IMPERIAL COLLEGE, 180 QUEEN’S GATE, LONDON SW7 2BZ, UK

Abstract With the emergence of distributed resources and grid technologies, there is a need to provide higher-level informatics infrastructures that allow scientists to easily create and execute meaningful data integration and analysis processes taking advantage of the distributed nature of the available resources. These resources typically include heterogeneous data sources, computational resources for task execution and various application-specific services. The effort of the high performance community has so far mainly focused on the delivery of low-level informatics infrastructures covering the basic needs of grid applications. Such infrastructures are essential but do not directly help end-users in creating generic and re-usable applications. In this paper, we present the Discovery Net architecture for building grid-based knowledge discovery applications. Our architecture enables the creation of high-level, re-usable and distributed application workflows that use a variety of common types of distributed resources. It is built on top of standard protocols and standard infrastructures such as Globus, but also defines its own protocols, such as the Discovery Process Mark-up Language for data flow management. We discuss an implementation of our architecture and evaluate it by building a real-time genome annotation environment on top of it.

The International Journal of High Performance Computing Applications, Volume 17, No. 3, Fall 2003, pp. 297–315 © 2003 Sage Publications

1 Introduction

1.1 MOTIVATION

The design and features of the Discovery Net architecture were originally developed from the needs of the knowledge discovery process as applied to the field of bioinformatics, where complex data analysis workflows are typically constructed in a data-pipelined approach. At different stages of these workflows, also called discovery pipelines, there are requirements to acquire, integrate and analyze data from disparate sources, to use that data in finding patterns and models, and to feed these models to further analysis stages. In each stage new analysis is conducted by dynamically combining new data with previously developed models.

As a motivating example, consider an automated laboratory experiment where a range of sensors produces large volumes of data about the activity of genes in cancerous cells. A short time series is produced that records how each gene responds to the introduction of a possible drug. The initial requirement of the analysis is to filter interesting time series from uninteresting ones; one approach is to use data clustering techniques (Eisen et al., 1998). If a group of interesting genes is found, then a crucial step in the scientific discovery process is to verify whether the clusters can be explained by referring to existing biological knowledge.

This simple discovery pipeline has four main features that are common to the knowledge discovery process as applied within many scientific communities. We first describe these four features, and then describe in more detail the requirements that allow an informatics infrastructure to support a wide range of complex discovery pipelines.

1.1.1 Features of Discovery Pipelines. Dynamic Information Integration: The first feature of discovery pipelines is that they may include dynamic queries to decentralized and semi-structured data sources. Bioinformatics researchers have made available a significant amount of information on the Internet about various biological items and processes (genes, proteins, metabolism and regulation). These semi-structured resources can be accessed, from remote online databases over the Internet, through a range of search mechanisms, from key-based lookups to biosequence similarity searches. The need to integrate this information within the discovery process is inevitable since it dictates how the discovery may proceed.

Workflow Management and Auditing: The second feature is that recording how the results of the analysis were reached and used may be as important as the results themselves, since this record provides an audit trail of the discovery procedure.


This recorded audit trail allows researchers to document and manage their discovery procedures, to re-use the same procedure in similar scenarios, and, in many cases, it is an essential component in managing intellectual property activities such as patent applications, peer reviews and publications.

Remote Execution: The third feature is that the analysis components used within discovery pipelines can themselves be tied to remote computing resources, e.g. similarity searches over DNA sequences executing on a shared high performance machine. New services and tools for executing similar or related operations are continually being made accessible over the Internet by various researchers, and there is a need to make them available for use in newly created discovery pipelines.

Collaborative Knowledge Discovery: The fourth feature is that the discovery process itself is almost always conducted by teams of collaborating researchers who need to share the datasets, the results derived from these datasets and, more importantly, the details of how these results were derived.

This data-pipelined approach is gaining ground beyond the life sciences, where similar needs arise for cross-referencing patterns discovered in a dataset with patterns and data stored in remote databases, and for using shared high performance resources. Examples abound in the analysis of heterogeneous data in fields such as geological analysis, environmental sciences, astronomy and particle physics. Irrespective of the application area, supporting the data-pipelined knowledge discovery process requires the provision of knowledge discovery tools that can flexibly operate in an open system by allowing:

• the dynamic retrieval and construction of required datasets;

• the execution of data mining algorithms on distributed computing servers;

• the dynamic integration of new servers, new databases and new algorithms within the knowledge discovery process.

These requirements can be contrasted with the services offered by existing knowledge discovery tools, which mainly focus on extracting knowledge within closed systems such as a centralized database or a data warehouse, where all the data required for an analysis task can be materialized locally at any time and fed to data mining algorithms and tools that were pre-defined at the configuration stage of the tool.

1.2 REQUIREMENTS

Having described some of the features of discovery pipelines, we now formalize the requirements for a knowledge discovery infrastructure that can effectively and efficiently support them.


We describe these requirements along three axes, bearing in mind that the main goal of such an infrastructure is to support collaborative and grid-based data integration and analysis.

1.2.1 Data Requirements. The first axis of our analysis covers how data is accessed, managed and integrated from within the desired infrastructure. Firstly, such an infrastructure must naturally provide well-defined and optimized data management and must be able to handle large datasets of any type. More precisely, collaborative data analysis requires a higher level of data access than that provided by sequential files or input streams. The use of relational databases, although common, can be an obstacle to achieving high performance, since typical relational databases perform well only if the user application takes great care in defining the structure of the data and its access patterns. The dynamic definition, derivation and refinement of datasets is an important part of typical data analysis workflows, along with statistical, data mining and data integration operations. To efficiently support all such operations, the required infrastructure must provide efficient and lightweight table management services.

Secondly, since the data resources can be located on and used from any location or resource accessible to the user, the required infrastructure must be able to support dynamic access, integration and structuring of data from multiple heterogeneous data sources. Finally, in order to preserve the overall quality of its supported services, the infrastructure must provide optimized data transmission between the available resources.

1.2.2 Execution Requirements. The second axis of our analysis covers the features of the execution environment required for such an infrastructure. Due to the ever-increasing amount of data being analyzed, as well as the increasing complexity of the algorithms used, the ability to access and utilize distributed high performance computing resources to execute these analyses is a clear requirement for the desired infrastructure. This infrastructure must be able to utilize all resources made available to an application in order to maximize the application’s performance. However, the infrastructure should also, as much as possible, separate the application definition level from the planning of its execution on available resources. It is important for an application to be able to preserve its analytical definition separately from the details of its execution, so that it can easily be published and re-used in different contexts and on different resources.

Table 1 Features of distributed knowledge discovery pipelines.

Distribution
  Data Resources: Data analysis workflows can require datasets located on different machines, not only because of the virtual organization’s structure but also for possible security or copyrighting reasons.
  Execution Resources: At any time in the lifetime of a typical grid application, new computational resources that are made available are candidates for inclusion within the set of available resources.
  Component Resources: A component can be published on a given server but its execution might be bound to another execution resource.

Heterogeneity
  Data Resources: The types of data involved in a typical analysis can vary from relational databases, to data streams from http/ftp, or just file sources containing plain text, binary or semi-structured data.
  Execution Resources: A typical feature of a grid is that these computational resources are heterogeneous, running different operating systems. As far as practical, there should not be constraints on the type of resource that can be used.
  Component Resources: Components can be bound to a resource or be resource-independent, in which case an execution engine must make the decision. Component implementations can use different languages or different mechanisms such as Web services.

Security
  Data Resources: With data being an important asset for most organizations, security is highly important and must be handled at the infrastructure level.
  Execution Resources: An execution resource must be able to define which users (or which other resources) are allowed to access it.
  Component Resources: User restrictions can be applied to access and use of components.

Performance
  Data Resources: The use of optimized I/O for data processing and data transmission must also be provided.
  Execution Resources: The end-user and the execution engine must be able to map actions of the analysis and integration workflow to the best resources to improve performance.
  Component Resources: The execution location for components not bound to any particular resource must be chosen by the execution engine in order to optimize performance.

1.2.3 Component Requirements. The third axis of our analysis covers the nature of the software components to be supported and executed over the desired infrastructure. Here we use the term component to denote a piece of software code that can be used as an operation in a workflow. Components are basically encapsulated code; they can either be bound to a particular execution resource (e.g. a specialized similarity search that is optimized for a particular supercomputer), or alternatively be resource free (e.g. Java code). An important aspect of the required infrastructure is that it needs to support the publication of such components, thus making them available to a collaborative user community who can browse and access them in the same way they browse and access published data.
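
To make this requirement concrete, the sketch below shows one possible way of modelling components that are either bound to a specific execution resource or left resource-free for an execution engine to place. The interfaces and names are illustrative assumptions, not the Discovery Net API.

    // Illustrative sketch only (not the Discovery Net API): components either
    // declare a binding to a specific execution resource or remain resource-free,
    // in which case the execution engine is free to choose where they run.
    import java.util.Optional;

    interface ExecutionResource {
        String url();   // e.g. the endpoint of a computational server
    }

    interface Component {
        String name();                                    // name under which the component is published
        Optional<ExecutionResource> boundResource();      // empty => resource-free
        Object execute(Object input) throws Exception;    // encapsulated operation logic
    }

    /** A resource-free component: plain Java code the engine may run anywhere. */
    class FilterRowsComponent implements Component {
        public String name() { return "FilterRows"; }
        public Optional<ExecutionResource> boundResource() { return Optional.empty(); }
        public Object execute(Object inputTable) { /* filter and return the table */ return inputTable; }
    }

Publication of such a component would then amount to registering its name, implementing class and optional resource binding with a component lookup service.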

1.3 SUMMARY OF REQUIREMENTS

Table 1 summarizes our final list of requirements, obtained by analyzing the data, execution and component requirements against the following criteria: Distribution, Heterogeneity, Security and Performance. We believe that these three axes, namely data, execution and components, must all be taken into account when designing any grid-based infrastructure. However, current grid-based approaches suffer from a bias towards one of these aspects, neglecting the need for the others, especially at the architecture level. A key contribution of the Discovery Net architecture is that it orchestrates these three aspects in a single environment. In addition, it allows workflows composed of these services to be deployed and published as new services made available to the user community.

2 Related Works

Before presenting our architecture, we review some of the related work, especially grid-based infrastructures and middleware, supercomputing and cluster management tools, distributed data analysis and data integration
architectures, and grid-based programming models. This brief survey provides both insights into the traditional approaches typically used to address our requirements and a description of some of the lower level services that our architecture relies on.

2.1 GRID INFRASTRUCTURE

The widely acknowledged Globus infrastructure (Foster and Kesselman, 1999) defines a set of low-level services for building grid applications. These services include a certificate-based security mechanism (GSI), GSI-enabled data transfer services (GridFTP), resource allocation (GRAM) and resource meta-information services (MDS). These components are all relevant to our infrastructure. In particular, the security mechanism, now widely adopted in the community, must be re-used by current and upcoming grid-based infrastructures in order to make sure the various efforts converge towards the use of standard mechanisms and protocols. The Open Grid Services Architecture (OGSA; Foster et al., 2002) adds the advantages of Web service technology in order to define similar but extended services, called Grid Services, and is rapidly becoming the new standard for providing such services. The main disadvantage of the Globus effort, however, lies in the low level of abstraction it provides, which makes it unsuitable for end-users. The Discovery Net architecture presented in this paper can be regarded as an application layer specific to knowledge discovery activities, built on top of the Globus services.

Other grid research focuses on the use of a set of computers in order to achieve better performance, mainly through task farming and cycle-stealing techniques (Henderson and Tweten, 1995; Zhou, 1992; Abramson et al., 1995; Litzkow et al., 1988). These approaches are very scalable and can ensure better throughput for their applications. However, they usually either lack support for using heterogeneous sets of nodes or are based on very restrictive administrative requirements. Newer efforts such as Nimrod/G (Buyya et al., 2000) support the Globus security infrastructure and therefore enable the creation of cross-organization task farms. The task farming approaches typically used in such environments strongly assume that the amount of data needed by the application is low enough that the cost of moving it is less than the cost of computation on the best machine of the farm. Furthermore, we cannot assume that any application will run on the task farm, for various reasons such as limited license availability, the limited set of architectures supported by a particular implementation needed in the analysis workflow, and so on. It therefore becomes the task of the application builder to decompose the application into parts that can take advantage of a task farm.


Such computational grids are typically resource-oriented approaches that account neither for data intensive applications nor for the type of dynamicity required by the components.

One final area of grid infrastructure research focuses on distributed data storage and access. The SRB/MCAT project (Baru et al., 1998) from SDSC concentrates on providing uniform access to heterogeneous data sources. These data sources are normally file systems, and domain-specific metadata can be stored. The infrastructure also takes care of data replication. Although SRB improves and makes uniform the access to distributed datasets, it was not designed for the processing and transformation of these datasets, nor for accessing the functionalities of distributed computing servers.

2.2 GRID APPLICATIONS

Numerous grid application projects have been developed in recent years. An example is the GriPhyN project (Avery et al., 2001), which tackles, among other goals, the problem of so-called “Data-Grids”, or data-oriented grid applications and frameworks. The GriPhyN project defines its data-flows as G-DAGs, or Grid Directed Acyclic Graphs, and defines the process from the DAG definition to its execution on the grid. It relies heavily on, and is tightly integrated with, the Globus Toolkit 2, re-using most of the services it provides. One of the main differences between our work and GriPhyN is Discovery Net's ability to prepare, verify and deploy analysis workflows. Moreover, Discovery Net defines the services for storing and retrieving the workflows it uses. The possibility to store these processes opens the way to greater re-usability of a process, which can then be defined by an analyst rather than a grid computing specialist.

The SETI@Home project (Anderson et al., 2002) is an extremely distributed but specialized application handling very large datasets. The application is fixed and highly parallelizable, and the implementation challenge is reduced to moving small amounts of data to a very large number of clients. The Discovery Net project is a grid application building environment that can potentially deal with this class of application and also with computation intensive problems on large datasets. The definition of the problem, and of how to solve it, is allowed to differ from one application to the next.

2.3 DISTRIBUTED DATA ANALYSIS AND INTEGRATION FRAMEWORKS

2.3.1 Distributed Data Integration. Distributed data integration platforms focus on providing transparent access to a set of data sources. The basic distributed data integration approach adopted by many systems relies on the provision of a database integration layer with a common query language for a collection of databases.

These federated databases are usually integrated using wrapper (or driver) based technologies that transform user queries into the native database query language, submit the query, and then translate the results back into the common language of the integration system. An example of this approach is the DiscoveryLink system (Roth and Schwarz, 1997) from IBM, which provides the user with an SQL query interface to distributed databases. The Kleisli Query System (Wong, 2000) provides a two-tier query language. In Kleisli, wrappers are still used to provide an interface to the databases; however, the Collection Programming Language (CPL) is used to define higher-order functions that embody functions defined in native query languages. This approach allows the Kleisli system to offer greater flexibility than the federated database approach used by DiscoveryLink. However, the framework has limited appeal due to the lack of dynamic discovery of new database services. These can only be added with a new release of the API of function calls, in which case the application programs have to be recompiled to take advantage of the new services. Of course, the queries must also be submitted in the universal language, which in this case is CPL.

2.3.2 Distributed Knowledge Discovery. Several architectures have been proposed to enable distributed knowledge discovery. The JAM project (Stolfo et al., 1997) is an example of a distributed data mining system using meta-learning. It is a Java-based, agent-based system dedicated to meta-learning and has been applied to fraud detection problems. It implements a meta-learning approach and provides several classic classification algorithms. The system is extensible in the sense that it defines an interface for plugging in a new classification algorithm; however, the emphasis is on supporting a particular framework for meta-learning, not on building a fully reusable and extensible framework for distributed data mining.

Papyrus (Grossman et al., 1999) proposes an architecture for data mining using meta-clusters or super-clusters (clusters of clusters, the usual architecture of the computational and data grid). Each node of the cluster decides, based on a metric, whether it should process its data fragment locally, send the data to another node for processing, or process the data until an intermediate result is obtained that can be sent to another node for further processing. IntelliMiner (Guo and Sutiwaraphun, 2000) optimizes the execution of distributed data mining tasks using task farming and the distributed DOALL parallel primitive. Its architecture separates the data server from the set of computing servers.

A set of requirements for distributed/parallel KDD systems is proposed in Maniatty and Zaki (2000). As part of these requirements, the authors note that “An even more important research area is to provide a rapid development framework to implement and conduct the performance evaluation of a number of competing parallel methods for a given mining task”. Other important requirements include location transparency (the KDD process is not bound to a particular data location and can be applied to different configurations), algorithm evaluation, process support, data type transparency, system transparency, availability, fault tolerance and quality of service. Finally, the Knowledge Grid project (Cannataro et al., 2002) also proposes a set of services for distributed data mining applications, together with a visual front-end to compose workflows. Its workflow definition requires explicit settings that should be hidden from end-users, such as data transfers between nodes, which have to be defined explicitly; the workflow definition therefore cannot be fully separated from its execution.

2.4 PROGRAMMING MODELS

Little has been done so far to reconcile end-users with the complexity of grid application programming. The Unicore system (Romberg, 1999) allows for the definition of workflows composed of process executions. The Business Process Execution Language for Web Services (BPEL4WS; Curbera et al., 2002) specification augments the Web service standards with a way to compose Web services together. But it is still mostly up to the grid application developer to take care of much of the implementation detail and to be aware of the resources to use at the moment the application is defined.

Agent technologies (Caromel et al., 1998) are good candidates for grid-based infrastructures, but they focus on what we define as components, i.e. the actual agent implementation, which can easily be migrated from machine to machine. This volatile, resource-free approach is unfortunately at too high a level for a typical application, where components may need to be bound to a resource, as some algorithms are only available on a particular machine or from a particular service provider. For example, the Web service model does not allow the client program to use the service on a different resource.

3 The Discovery Net Infrastructure

A high-level infrastructure needs to take into account the user’s view of the services provided and the programming model associated with them. The resources should be easily composable in order to create the final grid application, and the composition model should allow analysis workflows to be easily defined by end-users, easily constrained to execute on pre-defined resources or resource specifications, and easily made part of a wider distributed application. The Discovery Net infrastructure aims to fulfill these requirements. It is based on the Kensington (2003) system, a client/server system for data mining and integration that provides a basic server implementation, a large library of ready-made functionalities and user-interface elements that can be re-used by a Discovery Net client application.

3.1 ARCHITECTURE OVERVIEW

Fig. 1 Discovery Net Services.

Figure 1 presents the services provided by Discovery Net. Following our requirements, the architecture is divided into three main types of services (data, execution and components), together with a common security infrastructure. Two additional services for composition and deployment, built on top of the basic services, are also provided. Each of the core services provides basic registration/lookup methods.

Discovery Net has built on the work of the Kensington system (Chattratichat et al., 1999), extending its architecture in many ways to enable its capabilities to be used in a grid environment. The figure shows how this has been achieved and how the infrastructure interacts with data transfer protocols (HTTP, FTP, JDBC, DSTP), security protocols (GSI for authentication and SSL for communication) and discovery process workflow protocols (the Discovery Process Mark-up Language, DPML), and how it is integrated with other infrastructures such as OGSA.

Figure 2 shows how the Discovery Net infrastructure can be deployed. Arrows in the figure show invocations, with clients invoking servers and services via the Internet.

Fig. 2 Discovery Net Infrastructure.

A Discovery Net server is both a consumer and a provider of services on the Internet: where the InfoGrid service consumes and aggregates data services, the deployment service provides new services based on processes developed by users of the system.

3.2 DATA SERVICE

3.2.1 Data Storage and Access Services. The data service of Discovery Net offers both local and distributed data management that supports tables (datasets), workflows, reports and also Java objects. The two main parts of this service are its storage/retrieval functionality and its table management. The storage and retrieval functions use a typical directory access mechanism with user- and group-level storage spaces. The service is designed to deal mainly with tables, workflows and HTML reports, but supports any kind of document or Java serializable object.
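
The description above can be summarized as a small directory-style storage interface. The sketch below is a hypothetical rendering of such a service, not the actual Discovery Net interface, and the path layout is invented for illustration.

    // Hypothetical sketch of a directory-style storage/retrieval service with
    // user- and group-level spaces; it accepts tables, workflows, reports or any
    // serializable Java object. These names are not the actual Discovery Net API.
    import java.io.Serializable;
    import java.util.List;

    interface StorageService {
        void store(String path, Serializable item);   // e.g. "/users/alice/experiments/clustering-run"
        Serializable retrieve(String path);
        List<String> list(String directory);           // browse a user or group storage space
        void share(String path, String group);         // make an item visible in a group space
    }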

The table management was designed around two requirements: temporary or persistent tables with a potentially very high number of attributes can be created in main memory or on a sequential access storage device, and manipulations such as the creation of attribute or object subsets can be performed to derive new virtual tables without compromising performance. The solution consists of storing columns in separate files and allowing the creation of row-selection columns, hence providing the ability to define virtual tables with low overhead and to add new attributes to a table without having to materialize it again. In addition, workflow components were developed to convert data from and to other formats and protocols such as FTP, GridFTP, DSTP and JDBC.
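
The following simplified, in-memory sketch illustrates the idea behind this table design: each attribute is held as a separate column, and a virtual table is just a selection of columns plus an optional row-selection index, so subsetting or adding attributes never forces the whole table to be rewritten. It illustrates the principle only, not the actual implementation (which stores columns in separate files).

    // Simplified, in-memory sketch of column-per-attribute storage with
    // row-selection columns; not the actual Discovery Net implementation.
    import java.util.LinkedHashMap;
    import java.util.Map;

    class ColumnTable {
        private final Map<String, Object[]> columns = new LinkedHashMap<>();
        private final int[] rowSelection;   // null means "all rows"

        ColumnTable() { this.rowSelection = null; }

        private ColumnTable(Map<String, Object[]> cols, int[] selection) {
            this.columns.putAll(cols);
            this.rowSelection = selection;
        }

        void addColumn(String name, Object[] values) {
            columns.put(name, values);       // adding an attribute never rewrites existing columns
        }

        /** Virtual projection: the new table shares the underlying column arrays. */
        ColumnTable selectColumns(String... names) {
            Map<String, Object[]> cols = new LinkedHashMap<>();
            for (String n : names) cols.put(n, columns.get(n));
            return new ColumnTable(cols, rowSelection);
        }

        /** Virtual row subset: only the row-selection index is materialized. */
        ColumnTable selectRows(int[] rowIndices) {
            return new ColumnTable(columns, rowIndices);
        }

        Object valueAt(String column, int row) {
            int physical = (rowSelection == null) ? row : rowSelection[row];
            return columns.get(column)[physical];
        }
    }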

3.2.2 InfoGrid Integration Service. As well as table manipulation, the platform needs to support data integration. The InfoGrid service deals with the integration of data from various formats and providers in order to augment tables.

Fig. 3 InfoGrid Wrappers.

In the InfoGrid approach, as an alternative to the single query language strategy, the middleware allows its clients to use the native query mechanisms of the remote resources. The rationale is that many scientists are already familiar with the specialized query mechanisms of the remote services they use. In this case, the role of the middleware is to connect the users transparently to the remote resources, ensuring that they have full knowledge of the resources available and providing them with the tools required to construct heterogeneous, distributed queries and combine their results.

Each InfoGrid component is a subclass of the Discovery Net components (see Section 3.3 for a definition of a Discovery Net component) called a wrapper. Wrappers may wrap anything from executables to HTTP servers. The main difference between a wrapper and a generic component is that a wrapper deals with the integration of data related to one particular key value at a time, whereas components can deal with any type of input, including an entire dataset or a data stream.


The anatomy of an InfoGrid wrapper is shown in Figure 3(a). A wrapper provides a set of query construction user interfaces that can be used to construct queries in the native language of the resource, and a set of administration user interfaces, which may be used to configure its properties. These interfaces build on top of wrapper documents that define the access control properties (the wrapper access metadata), how a wrapper is linked to other resources (the wrapper linkage metadata) and all other execution-related aspects of the wrapper (the wrapper configuration files). The wrapper also provides visualization services that can be used with the specific data model of the resource it wraps. Finally, the execution logic of a wrapper abstracts everything the wrapper has to do to communicate with, and execute against, the resource it wraps. A wrapper may provide multiple interfaces for the queries it accepts, and the client identity can be used to retrieve the interfaces that are customized for it. For example, a command line application would be returned a command line user interface, while a graphical Java application would be returned a GUI.
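
One way to read Figure 3(a) is as an interface contract. The sketch below is an assumption-laden rendering of that contract, with method names invented here, intended only to make the roles of the metadata documents, the user interfaces and the execution logic explicit; it is not the InfoGrid API.

    // Illustrative rendering of the wrapper anatomy of Figure 3(a); names are assumptions.
    interface QueryUI { }          // marker for a query-construction front-end (GUI, command line, ...)

    interface InfoGridWrapper {
        QueryUI queryConstructionUI(String clientIdentity); // native-language query builder,
                                                            // customized for the calling client
        QueryUI administrationUI();                         // configure wrapper properties
        String accessMetadata();                            // who may access the wrapper
        String linkageMetadata();                           // how the wrapper links to other resources
        String configuration();                             // other execution-related settings
        Object visualize(Object result);                    // viewer matched to the resource's data model
        Object execute(String nativeQuery, String keyValue) // execution logic: integrate the data
                throws Exception;                           // related to one key value
    }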

For composite wrappers, as shown in Figure 3(b), a composition layer is provided to manage the composite wrapper execution logic. A composite wrapper is a new service that builds on top of the functionality of other published InfoGrid wrappers, providing a customized and/or combined view of them. The new wrapper is a service built on top of existing ones and may correspond to the scientists’ need for a combined or customized view of a number of resources.

We have created wrappers for several bioinformatics resources including LocusLink, Medline, SwissProt, EMBL and KEGG, as well as for BLAST standalone servers (bio-sequence similarity search services) and BLAST executables. For our initial implementation we have used XML and an XPath-based query mechanism as our common denominator. Each of the implemented wrappers offers a customized view of the wrapped resource as well as a description of the documents it returns, expressed as an XML Schema document. We define XPath expressions by selecting features from the tree representation of the XML Schema views of the resources. Wrapper composition is currently implemented from an object model perspective; new XML wrappers can be composed programmatically by using the basic interfaces provided.
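
As a standalone illustration of the kind of XPath selection used over these XML views, the example below uses only the standard javax.xml.xpath API; the document structure (a toy gene-record view) is invented here and does not correspond to any particular wrapper's schema.

    // Standalone example of XPath selection over an XML view, using the standard
    // javax.xml.xpath API. The XML structure below is a toy example, not the
    // schema of any actual InfoGrid wrapper.
    import java.io.ByteArrayInputStream;
    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.xpath.XPath;
    import javax.xml.xpath.XPathConstants;
    import javax.xml.xpath.XPathFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.NodeList;

    public class XPathViewExample {
        public static void main(String[] args) throws Exception {
            String xmlView =
                "<genes>"
                + "<gene id='G1'><organism>human</organism><length>1200</length></gene>"
                + "<gene id='G2'><organism>mouse</organism><length>950</length></gene>"
                + "</genes>";
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xmlView.getBytes("UTF-8")));

            XPath xpath = XPathFactory.newInstance().newXPath();
            // Select the identifiers of all human genes in the wrapped view.
            NodeList ids = (NodeList) xpath.evaluate(
                    "/genes/gene[organism='human']/@id", doc, XPathConstants.NODESET);
            for (int i = 0; i < ids.getLength(); i++) {
                System.out.println(ids.item(i).getNodeValue());
            }
        }
    }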

3.3 COMPONENT SERVICE

A component is a piece of logic that defines a possible action of the data analysis graph. Discovery Net provides components to access its data storage, data access and data integration services. Each component can specify whether it is resource-free, and can therefore act as an agent that will be executed on a machine picked by the run-time execution engine, or whether it can only be executed on one or a set of execution resources. In both cases, a class downloading mechanism is used to transfer the code from its original location to its execution location. As described earlier, InfoGrid wrappers are a special class of Discovery Net components dealing with data integration.

3.4 EXECUTION SERVICE

As components can be bound to resources, Discovery Net provides an execution service that allows the component composition front-end to find out which servers are accessible from a particular server by a particular user, and also to look up new servers given their URLs. From the Discovery Net client, it is then possible to map actions of the workflow to particular resources. This information is also stored in the DPML document as extra mapping information. The composition language, DPML, is presented in Section 3.5.

For dataset processing components (i.e. components that take a table as input and output another table), it is possible to bind several resources, hence defining a specific task farm for the execution of this component. Fragments of parametrizable size are sent to the next available resource.

One important functionality of the execution service is the mapping from DPML to an execution plan using the grid. The current pragmatic approach to allocating components to resources is based on the principle, common in computations over very large datasets, that as much of the processing as possible should occur near the data source. Therefore, if a task in the workflow is not currently bound to a particular resource, it will be executed at the execution location of its input component. If it has more than one input, the location is decided based on the size of each input. With this heuristic, data are transferred only when needed and the amount of data transferred is reduced. We are also working on more complex allocation schemes taking into account more parameters, such as network weather information, resource meta-information and component complexity models. However, it appears that requiring components to define their own complexity model reduces the simplicity of integrating new components and services, which is one of our main requirements in order to allow the platform to evolve as quickly as scientists' methods and techniques change.
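
A minimal sketch of this placement heuristic is given below. The types and names are illustrative assumptions; only the decision rule mirrors the text: a bound task runs on its declared resource, and an unbound task runs where its largest input already resides.

    // Minimal sketch of the placement heuristic described above; requires Java 16+
    // for records. The types are illustrative, not the Discovery Net engine.
    import java.util.Comparator;
    import java.util.List;
    import java.util.Optional;

    class PlacementHeuristic {
        record InputData(String location, long sizeInBytes) { }

        /** Chooses the execution location for a single workflow task. */
        static String chooseLocation(Optional<String> boundResource, List<InputData> inputs) {
            if (boundResource.isPresent()) {
                return boundResource.get();              // task is bound: no choice to make
            }
            // Unbound: execute where the largest input already resides,
            // so only the smaller inputs need to be transferred.
            return inputs.stream()
                    .max(Comparator.comparingLong(InputData::sizeInBytes))
                    .map(InputData::location)
                    .orElseThrow(() -> new IllegalArgumentException("task has no inputs"));
        }

        public static void main(String[] args) {
            List<InputData> inputs = List.of(
                    new InputData("serverA.example.org", 5_000_000_000L),
                    new InputData("serverB.example.org", 20_000L));
            System.out.println(chooseLocation(Optional.empty(), inputs)); // serverA.example.org
        }
    }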

3.5 COMPOSITION OF DISCOVERY PROCESSES

In Discovery Net each analysis task is described as a graph whose nodes represent operations and whose edges represent data flows. Users are presented with a uniform view of the operations available, irrespective of where or how each operation is implemented. In practice, nodes may be local executables, Web or grid services, data resources or results of execution.

3.5.1 Discovery Net Clients. The Discovery Net client allows users to compose these operations into discovery processes by dragging and dropping icons, connecting operations and then configuring each operation's parameters. The data types passed between operations are typically high-level abstractions such as tables and machine learning models (e.g. decision trees, dendrograms and association rules models represented in PMML (Data Mining Group, 2003)), although the environment can handle objects of any type.

The use of a visual programming language (Chattratichat et al., 1999) is a key feature of Discovery Net, since it enables users who are not from a computing background, from domain scientists to statisticians, to perform analyses using distributed resources without the difficulty of having to manually knit them together.
Where previously a computer scientist would have to encode the logic of discovery processes on behalf of a domain expert, typically using a scripting language such as Perl, this graphical approach allows the dynamic construction of processes in a way that is directly accessible to all users.

Implementing such a visual programming system is not as straightforward as it might first seem, and it should be noted that the graphical part of the tool is quite a minor part of the mechanism. There are three levels of composition and checking for the visual language, which from a user perspective work seamlessly together: basic type checking, metadata-level propagation and execution. At the simplest level, each operation must describe its input and output types, which allows the environment to enforce basic checks that the graph edges are valid. However, an aim of our client is to allow a complete process graph to be configured for execution prior to any portion of it being executed. Since each operation is potentially very computationally expensive, it is not feasible to insist that the user wait until each step has executed before allowing the next step to be appended to the process description. Instead, each operation describes, in terms of the metadata of its input, how the data are transformed to produce a result. By executing the process in terms of metadata instead of the data itself, we can know the metadata output of every operation in the process at little computational cost. This allows, for example, successive transformations over a table to be defined in terms of the columns of each operation’s input table, without transforming the actual input table data. The metadata of the source table are processed by the first operation and the output metadata fed as input into the next operation in the sequence. In this way, the user is able to construct a process in terms of metadata, independently of the raw data used.

Take, for example, a derive new column operation that appends new columns to a table using an expression over values in existing columns. It can be defined at three levels. First, we can specify simple typing information: it has a single table input and a single table output. Secondly, we can define at the metadata level how the operation will change its input to produce its output, in this case appending a new column of user-specified name and type; we can verify at metadata propagation time that the columns used in the operation’s derive expression will be available at execution time. Thirdly, the operation defines its execution behavior. If the first two levels of checking have been successful, it is rare for an execution over actual data to fail. Execution may be performed over entire or sampled datasets.
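
The sketch below makes the three levels concrete for the derive-new-column example; the class and method names are invented for illustration and are not the client's actual interfaces.

    // Illustrative sketch (not the actual client API) of the three levels for a
    // "derive new column" operation: typing, metadata propagation and execution.
    import java.util.ArrayList;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.function.Function;

    class TableMetadata {
        final List<String> columnNames = new ArrayList<>();
        TableMetadata(List<String> names) { columnNames.addAll(names); }
    }

    class DeriveColumnOperation {
        private final String newColumn;
        private final List<String> usedColumns;   // columns referenced by the derive expression
        private final Function<Map<String, Object>, Object> expression;

        DeriveColumnOperation(String newColumn, List<String> usedColumns,
                              Function<Map<String, Object>, Object> expression) {
            this.newColumn = newColumn;
            this.usedColumns = usedColumns;
            this.expression = expression;
        }

        // Level 1 is the type signature itself: one table in, one table out.

        /** Level 2: propagate metadata without touching the data. */
        TableMetadata propagate(TableMetadata input) {
            if (!input.columnNames.containsAll(usedColumns)) {
                throw new IllegalStateException("derive expression uses columns that will not exist");
            }
            List<String> output = new ArrayList<>(input.columnNames);
            output.add(newColumn);                 // the future output table gains one column
            return new TableMetadata(output);
        }

        /** Level 3: execution behaviour, applied to each actual row. */
        Map<String, Object> executeRow(Map<String, Object> row) {
            Map<String, Object> out = new LinkedHashMap<>(row);
            out.put(newColumn, expression.apply(row));
            return out;
        }
    }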


If we compare our approach with other workflow systems, a major difference is the amount of information known about each operation by the composition environment. Existing approaches define only the most basic typing information, which is not sufficient to fully configure a discovery process independently of execution over actual data.

3.5.2 Discovery Process Markup Language. Tasks composed in Discovery Net are defined in DPML. DPML is an XML-based language that describes processes as dataflow graphs, mirroring the visual language used to compose the task. In addition to the information needed for execution, users can annotate each operation with structured notes to describe the rationale behind the process. The authoring environment also records a history of how the process was constructed. As such, DPML represents a complete description of a discovery process, including information for execution, reporting and audit.

The example given in Section 6.1 shows a simple DPML task where a microarray-generated gene expression dataset has been manipulated to derive a new attribute and then passed to a K-means clustering node. In comparison to a traditional workflow language, DPML does not include any implementation-specific details. How nodes are mapped to actual component implementations and execution services is left to the execution environment, which also performs verification of the process. This is important since we wish to warehouse processes at a high level of abstraction; we can transform this description into lower level implementation languages and code in the execution environment. Each node’s operation is uniquely identified by an element that acts as a parametrization message. The node’s inputs are determined by connection elements that define the graph’s structure. The K-means clustering node relies on having a five-column input table whose attributes are Gene, t1, t2, t3 and avg. By flowing table metadata through the process, the server is able to perform extensive validation, checking that each node receives the input it expects.

A DPML graph can be sent for execution to any Discovery Net server. The server may choose to execute nodes on other Discovery Net servers, and each node may also use external distributed resources. This distribution is transparent to the user and to the DPML sent from the client. The result, or a proxy to the result in the case of datasets, is returned to the caller.
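
Since the DPML schema itself is not reproduced in this paper, the fragment below is a hypothetical, DPML-style rendering of the task just described (a gene expression table, a derive-column node producing avg, and a K-means node); the element and attribute names are invented for illustration.

    <!-- Hypothetical DPML-style fragment; element and attribute names are
         invented for illustration, as the actual DPML schema is not shown here. -->
    <dpml>
      <node id="expression_data" operation="TableImport">
        <parameters source="gene_expression.tab"/>
      </node>
      <node id="derive_avg" operation="DeriveColumn">
        <parameters name="avg" type="double" expression="(t1 + t2 + t3) / 3"/>
        <annotation>Average expression over the three time points.</annotation>
      </node>
      <node id="cluster" operation="KMeans">
        <parameters columns="t1,t2,t3,avg"/>
      </node>
      <connection from="expression_data" to="derive_avg"/>
      <connection from="derive_avg" to="cluster"/>
    </dpml>

In such a fragment, connection elements define the graph structure, while per-node parameters and annotations carry the information needed for execution, reporting and audit.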

3.6 DEPLOYMENT

Deployment is the final step of an analysis process (Chapman et al., 1999), where results are put into practice. There are two main methods of deployment: knowledge publication and service publication.

Our integrated approach to representing information about discovery processes using DPML allows a single description to be re-used for both these roles.

Knowledge publication is usually performed by users manually summarizing how they performed their analysis. In our approach, a DPML description of the process is automatically transformed into different report formats for different purposes. For example, a technical report would include every detail of every operation, whereas a set of slides might only include the images and models produced in an analysis. Since these publication formats are based on an executable description, we can also publish active reports, augmenting static Web content with applets and the ability to trigger execution to compute intermediate results on demand. Additionally, we can publish processes directly in their native XML description. At an organizational level, this process warehouse can be analyzed visually and algorithmically as a data resource in its own right (meta-mining), in order to assist users by finding common patterns of activity and identifying useful processes or relevant experts to deal with a given situation.

Service deployment is where an existing process description is generalized so that it can be re-executed with different data, parameter settings or other operations that are specified at execution time. For example, a successful process for finding genes of interest can be published as a service so that others can re-run the process over their own gene dataset. Similarly, an analyst may wish to publish a process with a few user-definable parameters to allow end-users to explore the data in a what-if type of analysis. The act of deploying a process involves defining which aspects of the process should be modifiable at runtime and is performed using a simple point-and-click interface. This deployment descriptor is appended to the existing DPML process description and is consumed by our deployment framework. The framework allows deployed processes to be executed from a variety of environments using the same DPML description to represent their logic. For example, the same process may be used as a service via a Web browser client, as a Java object or as an OGSA grid service, with no repetition of effort. In this way, non-technical users are able to create re-usable software components.
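
A sketch of how such a deployment descriptor might be consumed is given below; the class and method names are hypothetical, and only the idea, re-executing the same stored DPML description with caller-supplied overrides for the exposed parameters, follows the text.

    // Hypothetical sketch (not the Discovery Net deployment framework): a deployed
    // process couples a stored DPML description with the set of parameters that
    // remain modifiable at runtime, and rejects overrides of anything else.
    import java.util.Map;
    import java.util.Set;

    class DeployedProcess {
        interface WorkflowEngine {
            Object execute(String dpml, Map<String, String> parameterOverrides);
        }

        private final String dpml;                    // the stored process description
        private final Set<String> exposedParameters;  // e.g. {"input_dataset", "number_of_clusters"}

        DeployedProcess(String dpml, Set<String> exposedParameters) {
            this.dpml = dpml;
            this.exposedParameters = exposedParameters;
        }

        /** Invoked in the same way from a Web client, a Java object or a grid-service front-end. */
        Object run(Map<String, String> overrides, WorkflowEngine engine) {
            for (String parameter : overrides.keySet()) {
                if (!exposedParameters.contains(parameter)) {
                    throw new IllegalArgumentException("parameter not exposed at deployment: " + parameter);
                }
            }
            return engine.execute(dpml, overrides);   // same DPML description, different bindings
        }
    }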

3.7 SECURITY

There are two aspects of security that must be tackled by a grid infrastructure: secure authentication and secure transmission. For both of these, relevant standards already exist and must be used. For secure authentication, the Globus toolkit provides a Grid Security Infrastructure (GSI) based on certificates. For data transmission, secure network socket technologies (e.g. SSL) are required. For authentication of client users and between servers, Discovery Net supports the GSI. The integration uses the Java CogKit and the Globus GRAM service, as described in Figure 4.

Fig. 4 GSI Authentication in DNet.

3.8 KNOWLEDGE DISCOVERY SERVICES

The flexibility of Discovery Net allows different types of Discovery Net client and server to be part of the Grid of Knowledge Discovery. Any participating Discovery Net server can provide any of the core services, allowing for the specialization of servers. In Curcin et al. (2002), we presented a set of Discovery Services required to build the Grid of Knowledge Discovery. These services correspond, in this design, to particular instances of Discovery Net servers. For example, the Meta-information server is a Discovery Net server providing only the data integration registration/lookup service. Similarly, the Knowledge server provides process storage, and the Lookup server gathers the component and execution server management services.

4 Case Study: Building a Real-Time Distributed Genome Annotation Application

4.1 INTRODUCTION

In this section we present an application built on top of the Discovery Net infrastructure for real-time genome annotation. Each step in the workflow required for this application was integrated as a Discovery Net component. The complete process can be executed on various resources and the data can be visualized at various stages of the process.

4.2 MOTIVATION

The genome sequence is the map of an organism’s genetic information. The sequencing of a genome is only the first step in understanding it. What follows is typically a complex and lengthy genome annotation process that is so far performed in a semi-automatic way using various software tools and Web resources, executed by a large number of collaborating scientists who attempt to make sense of each fragment. There are three levels of annotation, and there are many tools that can be used at each level: nucleotide-level, protein-level and process-level annotation (Stein, 2002).

With the increasing speed of sequencing technologies, annotation is becoming the new bottleneck of this process.
It is therefore important for scientists to be able to build, understand and share the processes they use to annotate sequences, as much as it is important for the annotations themselves to be shared. However, there are many tools and resources for annotation scattered over the Internet. A biologist often faces the problem of finding and understanding how these tools are used, and the greater problem of integrating and visualizing the results of each annotation. In the following sections we describe some of these tools and, in Table 2, we list the Web sites and Web services that were used by our genome annotation platform.

4.2.1 Nucleotide-Level Analysis Tools. Nucleotide-level annotation is the first step in annotating a genome. The aim is to identify the key features and landmarks of the genome. The most important landmarks of a genome are the genes, and predicting genes is a non-trivial task. There are two methods of predicting genes: ab initio gene prediction and similarity-based gene prediction. Genscan (Burge and Apweiler, 1997) is one of the most accurate and sophisticated ab initio algorithms. However, similarity of a region of the genome to a sequence that has been deposited in a public database and is known to code for a protein is a more powerful way of predicting genes. The problem is that not all genes have homologues deposited in databases. Thus, a combination of ab initio and similarity-based gene prediction methods provides a refined and highly accurate prediction.


ity-based gene prediction methods provide a refined and highly accurate prediction. The most powerful tool for similarity searches is the ubiquitous BLAST (Altschul et al., 1990). Other tools that help in gene prediction are the EMBOSS tools cpgreport and getorf (Rice and Longden, 2000) that predict CpG islands, which are found in gene-rich regions, and open reading frames (ORF), regions of the genome that code for a protein. In the case study, GenScan was used as a dedicated Web service in our annotation workflow, while BLAST, cpgreport and getorf were executed on DNET servers. Genetic markers, such as long restriction-fragment length polymorphisms, can be identified by BLAST. Other important features of genomes are the non-coding RNAs. DNA is transcribed to RNA, most of which is then translated into protein. These protein-coding RNAs are called mRNAs. rRNAs and other non-coding RNAs can be identified by BLAST, while tRNAs can be predicted by tRNAScanSE (Lowe and Eddy, 1997). Regulatory regions, which are regions that regulate transcription of DNA to RNA, such as transcription-factor binding sites, can be detected by similarity searches that are specialized for short motifs. Promoter regions can be partly identified by Promoter 2.0 (Knudsen, 1999). A large eukaryotic genome contains repeated segments of DNA. Known repetitive elements are masked using RepeatMasker (Smit and Green, 2003). Once a family of repetitive elements is identified, other members of

Table 2 Web sites and service locations of the components used in the case study.

Service              URL                                                          Interface
Genscan              http://genes.mit.edu/GENSCAN.html                            Web Service
BLAST, PSI-BLAST     http://www.ncbi.nlm.nih.gov/BLAST/                           DNet Server
EMBOSS               http://www.hgmp.mrc.ac.uk/Software/EMBOSS/                   DNet Server
RepeatMasker         http://woody.embl-heidelberg.de/repeatmask/                  HTML wrapper
tRNAscan-SE          http://bioweb.pasteur.fr/seqanal/interfaces/trnascan.html    HTML wrapper
Artemis              http://www.sanger.ac.uk/Software/Artemis/                    Local
Promoter 2.0         http://www.cbs.dtu.dk/services/                              HTML wrapper
TargetP              http://www.cbs.dtu.dk/services/                              HTML wrapper
ProtFun              http://www.cbs.dtu.dk/services/                              HTML wrapper
HMMTOP               http://www.enzim.hu/hmmtop/                                  HTML wrapper
Gene Ontology        http://www.geneontology.org/                                 InfoGrid
KEGG                 http://www.genome.ad.jp/kegg/kegg2.html                      InfoGrid
DAS servers          http://biodas.org/                                           Web Service

In the case study, tRNAscan-SE, Promoter 2.0 and RepeatMasker were used from remote Web servers via HTML wrappers.

4.2.2 Visualization Tools. After the basic features have been identified and tabulated, biologists typically need a visual appreciation of the genome data to review and make sense of the annotation. However, due to the complexity and vast amount of annotation data in eukaryotic organisms, it is impossible to visualize all of the results at once. Artemis (Rutherford et al., 2000) is a visualization tool, written in Java, which is designed for viewing sequence features of entire bacterial genomes and small eukaryotic ones. Large eukaryotic genomes can also be visualized in smaller chromosomal fragments. All the annotation results produced by the nucleotide annotation tools described previously are translated into a format supported by Artemis and visualized. It is then possible to inspect whether the results are biologically sensible, to manually update the results and to add annotations to them. From within the visualizer, the analyst can also control how further automatic annotation steps proceed by setting their parameters and by interactively choosing which parts of the genome data to use. In the case study, Artemis was directly integrated into the Discovery Net client.

4.2.3 Protein-Level Tools. After the identification of genes, their encoded products have to be classified into protein families and functionally characterized. A common protein annotation workflow proceeds by performing BLAST and PSI-BLAST (Altschul et al., 1997) searches against the protein databases SWISS-PROT (Bairoch and Apweiler, 2000) and SWISS-PROT TrEMBL (O’Donovan et al., 2002) to identify homologues that are well characterized and of known function. Identification of functional domains is carried out by InterProScan (http://www.ebi.ac.uk/interpro/scan.html). Prediction of other domains, such as transmembrane helices and signal peptides, which are responsible for the subcellular localization of proteins, can be performed by HMMTOP (Tusnady and Green, 2001) and TargetP (Emanuelsson et al., 2000), respectively. Ab initio prediction of protein function from sequence can be performed by ProtFun 2.0 (Jensen et al., 2002). Again, visualizing the protein annotation is extremely helpful. All the protein annotation tools used in the case study were accessed through specialized Web server HTML wrappers, and specialized viewers were integrated in the Discovery Net client.

4.2.4 Process-Level Tools. Process-level annotation relates the genome to biological processes such as the cell cycle, metabolism and embryogenesis.


The Gene Ontology database (http://www.geneontology.org) describes the molecular function (functional category) of eukaryotic genes, the biological process (pathways) in which their encoded products are involved, and the cellular component, such as the nucleus, in which they localize. In addition, querying the KEGG database (Ogata et al., 1999), which contains Escherichia coli metabolic pathways, can give insights into the biological processes of the proteins of interest. Access to these data sources was provided from within Discovery Net via the InfoGrid data service.

4.2.5 Sharing Annotations. Genome annotation is not a completely automated task. The predictions produced by different software tools need to be evaluated by many different members of the community in order to curate them and reach a consensus on the annotations that are being stored. Typically, this is a collaborative effort, with several groups around the world annotating a genome. Without proper support, it may be difficult for one group to access the annotation data of another group. To address this issue in our application, the real-time generated annotation results are tabulated, stored in distributed annotation system (DAS) formatted tables and shared across the globe using DAS servers. DAS (http://biodas.org) is a client–server system in which a single client integrates information from multiple servers. The DAS protocol allows a single machine to gather genome annotation data from multiple distant Web sites, collate the information, and display it to the user in a single view. Access to the LDAS server was provided as a Web service.
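
To illustrate the kind of request a DAS client issues, the sketch below performs a feature query over HTTP using only the Java standard library. The URL follows the common DAS convention of /das/<data source>/features?segment=<reference>:<start>,<stop>; the host and data-source names are placeholders rather than a real deployment.

    // Sketch of a minimal DAS-style feature request using only the standard
    // library. The host and data-source names are placeholders; the response is
    // the XML feature document that a DAS server returns for the given segment.
    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;

    public class DasFeatureRequest {
        public static void main(String[] args) throws Exception {
            URL url = new URL(
                "http://das.example.org/das/genome_annotations/features?segment=chr1:100000,200000");
            try (BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line);   // print the annotation XML as it arrives
                }
            }
        }
    }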

The Discovery Net client was used to compose annotation workflows that combined these distributed resources in a live presentation for the High Performance Computing Challenge at the SuperComputing 2002 conference in Baltimore, USA. DNA sequencing equipment operating at deltaDot Ltd (http://biodas.org) in London was used as a real-time data source. These distributed real-time genome annotation workflows thus spanned from initial data capture to the final population of the annotation warehouse (see Figures 6 and 5). The annotation pipe-line was executed using a mixture of high performance resources hosted by the London E-Science Centre, locally executing servers in Baltimore and data from various geographically distributed bioinformatics databases. The application was validated by domain experts and won an award for the most innovative data intensive application. 5

Conclusion

In this paper, we have presented the Discovery Net architecture, a high-level grid-based knowledge discovery architecture. We have described its base services and how they interact in order to create a complete infrastructure for grid-based data integration and analysis. We have also presented an application for real-time genome sequencing, not only proving the need for such infrastructure due to the inherent complexity and distribution of current annotation processes, but also demonstrating that it is possible to integrate those tools in order to create easily configurable and re-usable distributed workflows. ACKNOWLEDGMENTS

4.3 IMPLEMENTATION AND EVALUATION This genome annotation environment was built on top of Discovery Net by integrating a heterogeneous set of tools described in Table 2 as new components. Some components were connected to a remote Web service to perform their functionality, others act as links to visualization tools and some integrate data-intensive algorithms. The components were implemented without awareness of the location of their execution and without assumption about the workflow structure in which they will be used, hence achieving the goal for a high-level infrastructure. Integrating the components and building the workflows took less than three man-weeks. As opposed to a predefined genome annotation environment, the users of Discovery Net can customize the pipeline used to include or remove specific components. This is a major advantage of the approach since different parameters and methods are used in annotating different genomes. 310

COMPUTING APPLICATIONS

The authors would like to thank their colleagues at the Data Mining Group, Imperial College and at InforSense Ltd for providing access to their resources as well as for their support and help. They would also like to thank Dr John Hassard, Dr Matthew Howard, Dr Dimitris Sideris and Dr Stuart Hassard from deltaDot Ltd for their help in the case study and for providing the interface to the genome sequencing device. They also thank Dr Catherine Rice and Team 65 – HumGen Informatics, and Nigel Carter and Team 70 – Molecular Cytogenetics at the Wellcome Trust Sanger Institute for their support and for access to their teams' expertise in conducting the case study. Finally, they thank the London e-Science Centre for its help and for providing access to its high performance computing resources. Discovery Net is an EPSRC Pilot Project funded under the UK e-Science program.

Fig. 5 Nucleotide Annotation Workflow showing the composition and interaction between the different components.

Fig. 6 Discovery Net client interface showing the available data analysis groups. The workflow shown is for protein annotation. Each of the nodes executes on a remote resource.

AUTHOR BIOGRAPHIES

Salman AlSairafi is currently a researcher at the Department of Computing, Imperial College London. His main research area concerns visualization and data processing, specializing in visual data mining, front-end and workflow design. His work in this field forms part of the visualization aspects of the Discovery Net e-Science Pilot Project. He has been working with the group as a visualization and front-end architect and has an MSc in Advanced Computing from Imperial College London.

Filippia-Sofia Emmanouil is a researcher at the Department of Computing, Imperial College London. Her main research interests include the integration of visualization with data mining techniques for the construction and representation of knowledge discovery processes. She received a BSc in Computer Science from Aristotle University in Greece and an MSc in Software Engineering from Imperial College London, where she is currently studying towards her PhD.

Moustafa Ghanem is a research fellow at the Department of Computing, Imperial College London. He is also project manager of the EPSRC-funded Discovery Net e-Science Pilot Project. His research interests include parallel and distributed systems, including novel applications of data mining and text mining. He has published more than 15 papers in these domains. He has a PhD and MSc in high performance computing from Imperial College London and a BSc in Electronics and Telecommunication Engineering.

Nikolaos Giannadakis is currently a PhD student at the Department of Computing, Imperial College. He received his first degree (Ptychion) from the University of Crete, Greece and has an MSc in Advanced Computing from Imperial College. His interests include distributed query languages, grid databases, database integration and model transformation.

Yike Guo is Professor of Computing Science at the Department of Computing, Imperial College. His main research areas are large-scale data analysis, especially in the field of life science, as well as functional and logic programming. He has published more than 70 papers in these areas over the last ten years. He is the head of the Data Mining Group, comprising 15 researchers, where he is currently leading one of the UK's largest e-Science projects. He is the technical director of the Imperial College Parallel Computing Centre, the chief scientist of the Shanghai Centre for Bioinformatics Technology, and the founder and CEO of InforSense Ltd, a leading provider of discovery informatics software and services for the life science industry.

Dimitrios Kalaitzopoulos is currently a PhD student at the Wellcome Trust Sanger Institute under the supervision of Dr Catherine Rice and Dr Nigel Carter. His research area is the mapping and computational analysis of DNA sequence regions involved in chromosomal translocation events. He has a BSc in Molecular Biology from King's College London and an MSc in Computational Genetics and Bioinformatics from Imperial College London. His MSc research project investigated the large-scale annotation of protein sequences.

Michelle Osmond is a PhD student at the Department of Computing, Imperial College London, working on Web services and Grid computing in the Discovery Net e-Science Pilot Project. She has an MSci in Physics and an MSc in Computing Science from Imperial College London. Her MSc research project examined the potential of Web services in chemoinformatics.

Anthony Rowe is a researcher and PhD student in the Department of Computing, Imperial College London. His research interests include distributed data mining, intelligent agents, bioinformatics and cheminformatics. He has an MSc in Advanced Computing from Imperial College's Department of Computing and a BEng in Computer Systems Engineering from Warwick University.

Jameel Syed is a researcher at the Department of Computing, Imperial College London. His main research area concerns knowledge discovery processes and how they are constructed, represented and re-used. His work in this field forms the knowledge management part of the Discovery Net e-Science Pilot Project. He is one of the original architects of the Kensington informatics platform, has an MEng in Software Engineering from Imperial College London and is currently finishing his PhD thesis there.

Patrick Wendel is currently a PhD student at the Department of Computing, Imperial College London, working in the Data Mining Group under the supervision of Professor Yike Guo. His research interests are mainly in Grid computing infrastructures applied to large-scale and distributed data analysis, and he is working on the architecture part of the Discovery Net project. He received a degree from the Ecole Nationale Superieure d'Informatique et de Mathematiques Appliquees de Grenoble, France.

NOTE

1 Now at Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK.

REFERENCES

Abramson, D., Sosic, R., Giddy, J., and Hall, B. 1995. Nimrod: a tool for performing parameterised simulations using distributed workstations. In HPDC, pp. 112–121.
Altschul, S., Gish, W., Miller, W., Myers, E., and Lipman, D.J. 1990. Basic local alignment search tool. Journal of Molecular Biology, 215:403–410.
Altschul, S., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research, 25:3389–3402.
Anderson, D.P., Cobb, J., Korpela, E., Lebofsky, M., and Werthimer, D. 2002. SETI@home: an experiment in public-resource computing. Communications of the ACM, 45(11):56–61.

Avery, P., Foster, I., Gardner, R., Newman, H., and Szalay, A. 2001. An international virtual-data grid laboratory for data intensive science. http://www.griphyn.org.
Bairoch, A., and Apweiler, R. 2000. The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Research, 28:45–48.
Baru, C., Moore, R., Rajasekar, A., and Wan, M. 1998. The SDSC Storage Resource Broker. In CASCON'98.
Burge, C., and Karlin, S. 1997. Prediction of complete gene structures in human genomic DNA. Journal of Molecular Biology, 268:78–94.
Buyya, R., Abramson, D., and Giddy, J. 2000. Nimrod/G: an architecture for a resource management and scheduling system in a global computational grid. In HPC ASIA 2000.
Cannataro, M., Talia, D., and Trunfio, P. 2002. Design of distributed data mining applications on the knowledge grid. In Proceedings of the National Science Foundation Workshop on Next Generation Data Mining.
Caromel, D., Klauser, W., and Vayssière, J. 1998. Towards seamless computing and metacomputing in Java. Concurrency: Practice and Experience, 10(11–13):1043–1061.
Chapman, P., Clinton, J., Khabaza, T., Reinartz, T., and Wirth, R. March 1999. The CRISP-DM process model. http://www.crisp-dm.org/.
Chattratichat, J., Darlington, J., Guo, Y., Hedvall, S., Kohler, M., and Syed, J. 1999. An architecture for distributed enterprise data mining. In Proceedings of the 7th Conference on High Performance Computing and Networking Europe.
Chattratichat, J., Guo, Y., and Syed, J. 1999. A visual language for internet-based data mining and data visualisation. In IEEE Symposium on Visual Languages, Tokyo, Japan.
Curbera, F., Goland, Y., Klein, J., Leymann, F., Roller, D., Thatte, S., and Weerawarana, S. 2002. Business Process Execution Language for Web Services, version 1.0.
Curcin, V., Ghanem, M., Guo, Y., Kohler, M., Rowe, A., Syed, J., and Wendel, P. 2002. Discovery Net: towards a grid of knowledge discovery. In Proceedings of the Eighth International Conference on Knowledge Discovery and Data Mining (KDD-2002).
Eisen, M., Spellman, P., Brown, P., and Botstein, D. 1998. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci., 95:14863–14868.
Emanuelsson, O., Nielsen, H., Brunak, S., and von Heijne, G. 2000. Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. Journal of Molecular Biology, 300:1005–1016.
Jensen, L.J., et al. 2002. Ab initio prediction of human orphan protein function from post-translational modifications and localization features. Journal of Molecular Biology, 319:1257–1265.
Foster, I., and Kesselman, C., editors. 1999. The Globus toolkit. In The Grid: Blueprint for a New Computing Infrastructure, Chap. 11, pp. 259–278. Morgan Kaufmann, San Francisco, CA.
Foster, I., Kesselman, C., Nick, J.M., and Tuecke, S. 2002. The physiology of the grid: an open grid services architecture for distributed systems integration. Technical report, http://www.globus.org/research/papers/ogsa.pdf.


Grossman, R.L., Bailey, S.M., Sivakumar, H., and Turinsky, A.L. 1999. Papyrus: a system for data mining over local and wide-area clusters and super-clusters. In SC'99. ACM Press and IEEE Computer Society Press.
Data Mining Group. 2003. PMML specification. http://www.dmg.org/.
Guo, Y., and Sutiwaraphun, J. 2000. Distributed classification with knowledge probing. In H. Kargupta and P. Chan, editors, Advances in Distributed and Parallel Knowledge Discovery, Chapter 4. AAAI Press.
Henderson, R., and Tweten, D. 1995. Portable Batch System: requirement specification. Technical report, NASA Ames Research Center.
Kensington Discovery Edition. 2003. http://www.inforsense.com.
Knudsen, S. 1999. Promoter 2.0: for the recognition of PolII promoter sequences. Bioinformatics, 15:356–361.
Litzkow, M.J., Livny, M., and Mutka, M.W. 1988. Condor: a hunter of idle workstations. In 8th International Conference on Distributed Computing Systems, pp. 104–111, Washington, DC, USA, June. IEEE Computer Society Press.
Lowe, T.M., and Eddy, S.R. 1997. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Research, 25:955–964.
Maniatty, W., and Zaki, M.J. 2000. A requirements analysis for parallel KDD systems. In IPDPS Workshops, pp. 358–365.
O'Donovan, C., Martin, M.J., Gattiker, A., Gasteiger, E., Bairoch, A., and Apweiler, R. 2002. High-quality protein knowledge resource: SWISS-PROT and TrEMBL. Briefings in Bioinformatics, 3:275–284.
Ogata, H., Goto, S., Sato, K., Fujibuchi, W., Bono, H., and Kanehisa, M. 1999. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Research, 27:29–34.
Rice, P., Longden, I., and Bleasby, A. 2000. EMBOSS: the European Molecular Biology Open Software Suite. Trends in Genetics, 16:276–277.
Romberg, M. 1999. The UNICORE architecture. In Proceedings of the 8th IEEE International Symposium on High Performance Distributed Computing.
Roth, M., and Schwarz, P. 1997. Don't scrap it, wrap it! A wrapper architecture for legacy data sources. In VLDB 1997.
Rutherford, K., Parkhill, J., Crook, J., Horsnell, T., Rice, P., Rajandream, M.-A., and Barrell, B. 2000. Artemis: sequence visualisation and annotation. Bioinformatics, 16:944–945.
Smit, A., and Green, P. 2003. RepeatMasker. http://ftp.genome.washington.edu/RM/RepeatMasker.html.
Stein, L. 2002. Genome annotation: from sequence to biology. Nature Reviews Genetics, 2:493–503.
Stolfo, S., Prodromidis, A.L., Tselepis, S., Lee, W., Fan, D.W., and Chan, P.K. 1997. JAM: Java agents for meta-learning over distributed databases. In D. Heckerman, H. Mannila, D. Pregibon, and R. Uthurusamy, editors, Proceedings of the Third International Conference on Knowledge Discovery and Data Mining (KDD-97), p. 74. AAAI Press.
Tusnady, G.E., and Simon, I. 2001. The HMMTOP transmembrane topology prediction server. Bioinformatics, 17:849–850.
Wong, L. 2000. Kleisli, a functional query system. Journal of Functional Programming, 10(1):19–56.


Zhou, S. 1992. LSF: load sharing in large-scale heterogeneous distributed systems. In Proceedings of the Workshop on Cluster Computing, Tallahassee, FL, December. Supercomputing Computations Research Institute, Florida State University.

6 Appendix

6.1 DPML EXAMPLE

[The XML markup of the DPML listing was lost in conversion to text; only the annotations of its two workflow nodes remain: "avg(t1,t2,t3) Find baseline expression for control" and "Find three clusters of similar genes".]
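Because only the node annotations of the DPML listing survive, the following sketch reproduces in plain Python, rather than in DPML, the computation those two nodes describe: averaging the control time points t1, t2 and t3 into a baseline, and grouping the gene expression profiles into three clusters. The data shapes and the use of scikit-learn's KMeans are assumptions made purely for the sake of the example.

# Illustrative reconstruction of the two workflow steps named in the
# (lost) DPML example: a baseline node computing avg(t1, t2, t3) for the
# control measurements, and a clustering node finding three groups of
# genes with similar profiles. Shapes and the choice of k-means are
# assumptions for this sketch.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical expression matrix: one row per gene, columns t1, t2, t3
# for the control time points plus some later, treated time points.
control = rng.normal(size=(100, 3))   # t1, t2, t3
treated = rng.normal(size=(100, 5))   # later time points

# Node 1: "avg(t1,t2,t3)" -- baseline expression for the control.
baseline = control.mean(axis=1, keepdims=True)

# Node 2: "Find three clusters of similar genes" on baseline-adjusted profiles.
profiles = treated - baseline
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(profiles)

for cluster in range(3):
    print(f"cluster {cluster}: {np.sum(labels == cluster)} genes")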
