Meandre: Semantic-Driven Data-Intensive Flows in the Clouds
Xavier Llorà, Bernie Acs, Loretta S. Auvil, Boris Capitanu, Michael E. Welge, David E. Goldberg
IlliGAL Report No. 2008013
October, 2008
Illinois Genetic Algorithms Laboratory University of Illinois at Urbana-Champaign 117 Transportation Building 104 S. Mathews Avenue Urbana, IL 61801 Office: (217) 333-2346 Fax: (217) 244-5705
Xavier Llorà†,*, Bernie Acs‡, Loretta S. Auvil‡, Boris Capitanu‡, Michael E. Welge†,‡, David E. Goldberg*
† Data-Intensive Technologies and Applications, National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, Urbana, IL 61801
‡ Automated Learning Group, National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, Urbana, IL 61801
* Illinois Genetic Algorithms Laboratory, Dept. of Industrial and Enterprise Systems Engineering, University of Illinois at Urbana-Champaign, Urbana, IL 61801
{xllora, acs1, lauvil, capitanu, mwelge, deg}@illinois.edu
October 22, 2008
Abstract
Data-intensive flow computing allows efficient processing of large volumes of data that would otherwise be unapproachable. This paper introduces a new semantic-driven data-intensive flow infrastructure which: (1) provides a robust and transparent scalable solution from a laptop to large-scale clusters, (2) creates a unified solution for batch and interactive tasks in high-performance computing environments, and (3) encourages reusing and sharing components. Banking on virtualization and cloud computing techniques, the Meandre infrastructure is able to create and dispose of Meandre clusters on demand, transparently to the final user. This paper also presents a prototype of such a clustered infrastructure and some results obtained using it.
1 Introduction
There is a wide variety of data-intensive flow frameworks. Well-known representatives in this arena can be traced back to the mid-1990s with the appearance of frameworks such as D2K (Welge, Auvil, Shirk, Bushell, Bajcsy, Cai, Redman, Clutter, Aydt, & Tcheng, 2003), later simplified and popularized by Google’s MapReduce model (Dean & Ghemawat, 2004) and Yahoo!’s Hadoop project¹. Designed to efficiently process large amounts of stored data, these frameworks provide the means to express complex tasks and to map these tasks against large data volumes. However, using such frameworks usually involves a steep learning curve. Also, MapReduce requires processes that can be expressed as a directed acyclic graph, sometimes forcing the reengineering of the application. Moreover, the growth of the internet is pushing researchers from all disciplines to deal with volumes of information that can only be processed with data-intensive frameworks (Uysal, Kurc, Sussman, & Saltz, 1998; Beynon, Kurc, Sussman, & Saltz, 2000; Foster, 2003; Mattmann, Crichton, Medvidovic, & Hughes, 2006). However, the current frameworks have a large entry toll for non-tech-savvy users.

¹ http://hadoop.apache.org/

Meandre² intends to ease some of these issues. It is designed to: (1) provide a robust and transparent scalable solution from a laptop to large-scale clusters, (2) create a unified solution for batch and interactive tasks in high-performance computing environments, and (3) encourage reusing and sharing components. As a result, Meandre proposes a semantic-web-driven data-intensive flow execution infrastructure to construct, assemble, and execute components and flows. Flows are aggregations of basic computational tasks in a directed graph, whether cyclic or acyclic, which is a key difference when compared to MapReduce models. Meandre provides (1) tools for creating components and flows, (2) a high-level language to describe flows, and (3) a multicore and distributed execution environment based on a service-oriented paradigm. Meandre makes the execution environment transparent to the user thanks to the Meandre server, which can switch from standalone to clustered mode without any extra effort. In addition, the Meandre infrastructure allows rapid deployment in the cloud via extensive use of virtualization techniques, revealing itself as a dynamic on-demand clustered solution for interactive and grid environments.

The rest of this paper presents the details of the Meandre infrastructure. Section 2 presents a brief introduction to the key elements used by Meandre (data-intensive flows, semantic-web technology, and publishing schemes). Once we have introduced these concepts, we describe the basic data-intensive flow framework exposed by Meandre in section 3. This section describes the metadata organization and the programming paradigm used. After clarifying such concepts, an overall description of the Meandre architecture is introduced in section 4.

Then, section 5 presents how cloud computing and virtualization help deploy on-demand Meandre clusters. Relying on prepared virtual appliances, new Meandre servers can be dynamically instantiated, added to a Meandre cluster, used, and finally returned to the cloud. We build a Meandre cluster in the cloud, as detailed in section 5, and the results of a few pilot experiments are reported in section 6. This section also comments on the differences between interactive high-performance computing and batch-oriented efforts, where the Meandre cluster can offload some of the effort to grid engines. Finally, we conclude with a brief description of the current usage and partners of the Meandre infrastructure (section 7) and a short list of conclusions and further work (section 8).
2 Meandre
Meandre is a semantic-enabled, web-driven dataflow execution environment. It provides the machinery for assembling and executing data flows. Flows are software applications consisting of components that process data. Each flow is represented as a directed graph of executable components (nodes) linked through their input and output ports. Based on its inputs, properties, and internal state, an executable component may produce output data. Meandre also provides publishing capabilities per flow and executable component, enabling users to assemble a repository of components based on reuse and sharing. Users can readily leverage other research and development efforts by querying and integrating component descriptions that have been published previously at other shareable repository locations. It is important to mention here that these are self-contained elements; other approaches like Chimera still rely on external information (Foster, 2003). Meandre builds on three main concepts: (1) dataflow-driven execution, (2) semantic-web metadata manipulation, and (3) metadata publishing. The rest of this section provides a brief introduction to each of these concepts.

² Catalan spelling of the word meander.

2.1 Dataflow execution engines
Conventional programs perform their computational tasks by executing a sequence of instructions. One after another, each code instruction is fetched and executed. Any data manipulation is performed by these basic units of execution. In a broad sense, this approach can be termed “code-driven execution.” Any computation task is regarded as a sequence of code instructions that ultimately manipulates data. In contrast, data-driven execution (or dataflow execution) revolves around the idea of applying transformational operations to a flow or stream of data. In a data-driven model, data availability determines the sequence of code instructions to execute.

An analogy for the dataflow execution model is the black box operator approach. Any operator may have zero or more data inputs. It may also produce zero or more pieces of data through its data outputs. The operator’s behavior may be controlled by properties (behavior controls). Each operator performs its operations based on the availability of its inputs. For instance, an operator may require that data be available on all its inputs to perform its operations. Others may need only some, or none. A simple example of a black box operator is the arithmetic ‘+’ operator. This operator can be modeled as follows:
1. It requires two inputs.
2. When two inputs are available, it performs the addition.
3. It then pushes the result as an output.

Such a simple operator may have two possible implementations. The first one defines an executable component (Meandre terminology for a black box operator) with two inputs. When data is present on both inputs, the operator is executed (fired). The operator produces one piece of data as output, which may become the input of another operator. Another possible implementation is to create a component with a single input that adds together two consecutive data pieces received. The component requires an internal variable which stores the first data piece of a pair. When the second data piece arrives, it is added to the first and an output is produced. The internal variable is then cleared so that the component knows that the next data piece received is the first of a new pair.

Meandre uses the following terminology:
1. Executable component: A basic unit of processing.
2. Input port: Input data required by a component.
3. Firing policy: The policy of component execution (e.g. when all/any input ports contain data).
4. Output port: Output data produced by component execution.
5. Properties: Component variables used to modify component behavior.
6. Internal state: The collection of data structures designed to manage data between component firings.

Figure 1 presents a schema of the component and flow anatomy. Components with input and output ports can be interconnected to describe a complex task, commonly referred to as a flow. Dataflow execution engines provide a scheduler that determines the firing (execution) sequence of components. Meandre uses a decentralized scheduling policy designed to maximize the use of multicore architectures. Meandre also works with processes that require directed cyclic graphs, extending beyond the traditional MapReduce directed acyclic graphs.
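The two implementations of the ‘+’ operator described above can be illustrated with a minimal Python sketch. The class names, the port names `left` and `right`, and the `push` method are hypothetical and chosen only for illustration; they are not Meandre's API.

```python
from typing import Optional


class TwoInputAdder:
    """'+' as a component with two input ports that fires when both hold data."""

    def __init__(self) -> None:
        self.ports = {"left": [], "right": []}  # one queue per input port

    def push(self, port: str, value: float) -> Optional[float]:
        """Deliver a value to a port; return the sum if the component fires."""
        self.ports[port].append(value)
        # Firing policy "all": fire only when every input port has data.
        if all(self.ports.values()):
            return self.ports["left"].pop(0) + self.ports["right"].pop(0)
        return None


class PairwiseAdder:
    """'+' as a single-input component that adds consecutive pairs of values."""

    def __init__(self) -> None:
        self.first: Optional[float] = None  # internal state between firings

    def push(self, value: float) -> Optional[float]:
        if self.first is None:
            self.first = value   # first piece of a new pair: remember it
            return None
        result = self.first + value
        self.first = None        # clear so the next value starts a new pair
        return result
```

In both cases the arrival of data, not the position of a call in a program, decides when the addition runs. The second variant trades an input port for internal state.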
Figure 1: A data-intensive flow is characterized by the components it uses (basic process units) and their interconnection (a directed graph). Grouping several components together describes a complex task. It also emphasizes rapid development by component reutilization.
(a) A component is described by several input and output ports through which data flows. Also, each component has a set of properties which govern its behavior in the presence of data.
(b) A flow is a directed graph where multiple components are connected together via input/output ports. A flow represents a complex task to solve.
2.2 Semantic web concepts
The semantic web (W3C, 2008) provides a common framework to share and reuse data across application, enterprise, and community boundaries. The semantic web focuses on common formats for integration and combination of data drawn from diverse sources. It also pays special attention to the language used for recording how the data relates to real-world objects. That allows a person, or a machine, to start off in one database, and then move through an unending set of databases which are connected not merely by the physical network but also semantically. The semantic web effort relies on the usage of the resource description framework (RDF) (Beckett, 2004; Manola & Miller, 2004). RDF is a simple notation to express graph relations. It mainly relies on XML³ to provide a set of conventions and exchange information. Introductions to RDF can be found elsewhere (Beckett, 2004; Brickley & Guha, 2004; Manola & Miller, 2004).

³ For efficiency and human readability purposes other non-XML formats have also been developed, such as N-Triples (Grant & Beckett, 2004) and Turtle (Beckett, 2007).

The basic RDF expression is the triple. A triple is a statement that describes some property of an object. RDF objects are uniquely characterized by URIs (uniform resource identifiers) (Berners-Lee, Fielding, & Masinter, 2005). For instance, http://seasr.org/meandre or file:///tmp/potato.png are examples of objects identified by URIs. Properties of an object also take the form of a URI; for instance, http://purl.org/dc/elements/1.1/creator (Weibel, Kunze, Lagoze, & Wolf, 2008) is a property identifying the creator of a given object. Ontologies standardize object organization and properties. For instance, the previous property belongs to the Dublin Core initiative (Weibel, Kunze, Lagoze, & Wolf, 2008). Property values can take two possible forms: a literal (a string which may or may not be typed) or another URI. URIs (or objects) are usually referred to as resources in the RDF terminology. An example of a triple could be

    file:///tmp/potato.png http://purl.org/dc/elements/1.1/creator "Joe"^^http://www.w3.org/2001/XMLSchema#string

indicating that Joe is the creator of potato.png. RDF can be expressed in several formats (Beckett, 2004; Beckett, 2007; Brickley & Guha, 2004; Grant & Beckett, 2004; Manola & Miller, 2004). Meandre relies on RDF to provide a standardized exchange format for its metadata descriptions.
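To make the triple structure concrete, the following Python sketch models a triple as a small data type and writes it out as one line of N-Triples, one of the non-XML formats mentioned above. The `Triple` and `Literal` types and the `to_ntriples` function are illustrative names, not part of Meandre, and the literal escaping is deliberately minimal.

```python
from typing import NamedTuple, Optional


class Literal(NamedTuple):
    """A literal property value, optionally typed by a datatype URI."""
    text: str
    datatype: Optional[str] = None


class Triple(NamedTuple):
    subject: str      # URI of the resource being described
    predicate: str    # URI of the property
    value: object     # either a Literal or the URI (str) of another resource


def to_ntriples(triple: Triple) -> str:
    """Serialize one triple as a line of N-Triples."""
    if isinstance(triple.value, Literal):
        # Escape backslashes and double quotes inside the literal text.
        escaped = triple.value.text.replace("\\", "\\\\").replace('"', '\\"')
        obj = f'"{escaped}"'
        if triple.value.datatype:
            obj += f"^^<{triple.value.datatype}>"
    else:
        obj = f"<{triple.value}>"
    return f"<{triple.subject}> <{triple.predicate}> {obj} ."


potato = Triple(
    "file:///tmp/potato.png",
    "http://purl.org/dc/elements/1.1/creator",
    Literal("Joe", "http://www.w3.org/2001/XMLSchema#string"),
)
print(to_ntriples(potato))
```

For real metadata an established RDF library would be the safer choice; the sketch only shows that a triple is three parts, with the value being either a literal or another resource.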
2.3 Publishing schemes
Meandre uses publishing schemes to create a distributed repository of shareable components. Each piece of the repository is published on some reachable web location. RDF also standardizes the publishing process using the SPARQL protocol (Prud’hommeaux & Seaborne, 2008). In Meandre, each component description is self-contained, in terms of having all the required information for its retrieval, regeneration, and execution. Meandre’s publishing scheme allows dynamic inspection of published repositories. That is, Meandre can inspect locations (local files, remote web objects, or metadata stores using the SPARQL protocol (Prud’hommeaux & Seaborne, 2008)) to discover a new location where components are published. This way, the discovered components can be retrieved to form a custom-made repository, which can also be published for others to use. Hence, different component views and flavors are easy to create, maintain, and upgrade.
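The discovery process described above can be pictured as a walk over published locations. The sketch below is a simplified model under stated assumptions: published locations are represented by an in-memory dictionary rather than files, web objects, or SPARQL endpoints, and the `discover` function and `Published` type are hypothetical names introduced for illustration.

```python
from collections import deque
from typing import Dict, List, Set, Tuple

# Hypothetical stand-in for published locations: each location lists the
# components it publishes and any further locations it points to.
Published = Dict[str, Tuple[List[str], List[str]]]


def discover(start: str, web: Published) -> Dict[str, str]:
    """Walk published locations breadth-first and build a custom repository.

    Returns a mapping from component name to the location it was found at.
    """
    repository: Dict[str, str] = {}
    seen: Set[str] = {start}
    queue = deque([start])
    while queue:
        location = queue.popleft()
        components, links = web.get(location, ([], []))
        for name in components:
            repository.setdefault(name, location)  # keep the first copy found
        for link in links:
            if link not in seen:   # avoid revisiting locations that link back
                seen.add(link)
                queue.append(link)
    return repository
```

The resulting mapping plays the role of the custom-made repository, which could itself be published as a new location for others to inspect.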
3 Data-Intensive Flow Framework
Meandre components serve as the basic building blocks of any computational task. There are two kinds of Meandre components: (1) executable components and (2) flow components. Regardless of type, all Meandre components are described using metadata expressed in RDF. Executable components also require an executable implementation in a form that can be understood by the Meandre execution engine⁴. The rest of this section presents a quick overview of the basic semantics for executable and flow components.

3.1 Basic metadata
Meandre’s metadata relies on three ontologies: (1) the RDF ontology (Beckett, 2004; Brickley & Guha, 2004) serves as a base for defining Meandre components; (2) the Dublin Core elements ontology (Weibel, Kunze, Lagoze, & Wolf, 2008) provides basic publishing and descriptive capabilities in the description of Meandre components; and (3) the Meandre ontology describes a set of relationships that model valid components, as understood by the Meandre execution engine architecture. A complete description of the metadata is beyond the scope of this paper. A detailed explanation can be found elsewhere⁵. Executable components and flows share properties such as component name, creator, creation date, description, tags, and rights. In addition to this set of commonly identified metadata, executable components also provide specific metadata to describe the component’s behavior and the location and type of its implementation (firing policy, runnable, format, resource location, and execution context), whereas the flow metadata focuses on describing the directed graph of components (component instances, connectors, connector instance data port source, connector instance data port target, connector instance source, connector instance target, instance name).

⁴ Java, Python, and Lisp are the current languages supported by Meandre to implement a component.
⁵ http://seasr.org/meandre/documentation/architecture/
Figure 2: The Meandre Workbench provides a visual programming environment via web browsers. Executable components can be dragged and dropped onto a flow canvas and connected. The workbench also allows modifying the component instance properties.
3.2 Programming Paradigm
The programming paradigm creates complex tasks by linking together a set of specialized components. Meandre’s publishing mechanism allows components developed by third parties to be assembled into a new flow. There are two ways to develop flows on Meandre: (1) using visual programming tools, or (2) using Meandre’s ZigZag scripting language.

Figure 2 shows the Meandre Workbench. The Meandre Workbench provides a visual programming environment. Executable components can be dragged and dropped onto the flow canvas to create component instances. Instantiated components are then connected by clicking on the desired ports. The workbench also provides easy access to change the component instance properties. Once the desired flow is ready, it can be saved and executed. Also, flows can be shared with colleagues by publishing them. Such a visual programming paradigm is ideal for new users who want to quickly prototype a solution to a certain task.

ZigZag, Meandre’s scripting language, makes it easy to describe data-intensive flows, which can then be interactively built and compiled into a self-contained flow task for later sharing and execution. ZigZag is loosely modeled after Python’s simplicity. ZigZag is a declarative language for expressing the directed graphs that describe flows. An interpreter and compiler are available to transform a ZigZag program (.zz) into a Meandre self-contained task, or Meandre archive unit (.mau). A MAU file contains all the metadata describing executable components and flows. It also contains any implementation required by the executable components. MAUs can then be executed by a Meandre engine or they can be executed on their own in grid environments via Torque⁶ or the Sun Grid Engine⁷. MAU constructs also serve to offload batch jobs that do not require any interaction to the grid and, thus, boost the infrastructure throughput.

    #
    # Import all the available demo components (CDA)
    #
    import
    #
    # Alias the components we want to use to create a flow (CDA)
    #
    alias as PushString
    alias as ToUpperCase
    alias as PassThrough
    alias as PrintObject
    #
    # Instantiate the components (CI)
    #
    ps,tuc,pt,po=PushString(),ToUpperCase(),PassThrough(),PrintObject()
    #
    # Set properties on the component instances (CM)
    #
    po.count="true"
    ps.times="250000"
    #
    # Define the directed graph that describes the flow (II)
    #
    @pso = ps()
    @tuco = tuc(string:pso.string)
    @pto = pt(string:tuco.string)
    po(object:pto.string)

Figure 3: ZigZag description of a data-intensive flow. The figure describes the flow shown in Figure 2. Four components developed using different programming languages are orchestrated together in a single flow.

ZigZag provides four basic constructs. Component discovering and aliasing (CDA) instructions retrieve components from a repository location and create an alias for them. Component instantiation (CI) instructions instantiate a component that will be part of the data-intensive flow. Component modification (CM) instructions change the behavior of a component based on its properties. Instance invocation (II) instructions describe the data-intensive relations between components in the same flow. The instance invocation instructions also provide directives for parallelizing instances (manually or automatically, based on the underlying architecture).

Figure 3 presents an example of a ZigZag script. The script describes the same example flow shown in Figure 2. The script starts by importing components provided by a Meandre server. From all the available components, four components are aliased. PushString and PrintObject are Java-based executable components, whereas ToUpperCase and PassThrough are Python and Lisp executable components. It is worth noting that there is no difference between different executable component implementations during the flow creation process. Then, the aliased executable components get instantiated, and some of the instance properties are modified. Finally, the last set of scripting instructions creates the directed graph. A detailed explanation of the ZigZag scripting language and associated tools can be found elsewhere⁸.
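As a rough analogy for the flow that the ZigZag script wires together, the four components can be imitated with Python generators chained into a pipeline. This is a sketch of the graph shape only: the function names mirror the aliases in Figure 3, but their behavior, the meaning of the `times` and `count` properties, and the message text are assumptions made for illustration, since the paper does not specify them.

```python
from typing import Iterable, Iterator


def push_string(message: str, times: int) -> Iterator[str]:
    """Source component: emit the same string `times` times."""
    for _ in range(times):
        yield message


def to_upper_case(strings: Iterable[str]) -> Iterator[str]:
    """Transform component: upper-case every string that flows through."""
    for s in strings:
        yield s.upper()


def pass_through(items: Iterable[str]) -> Iterator[str]:
    """Identity component: forward every item unchanged."""
    yield from items


def print_object(items: Iterable[str], count: bool = True) -> int:
    """Sink component: print each item, optionally prefixed by a running count."""
    seen = 0
    for item in items:
        seen += 1
        print(f"{seen}: {item}" if count else item)
    return seen


# Wire the graph PushString -> ToUpperCase -> PassThrough -> PrintObject.
# The message text and the small `times` value are placeholders.
pso = push_string("hello meandre", times=3)
tuco = to_upper_case(pso)
pto = pass_through(tuco)
total = print_object(pto, count=True)
```

Each assignment corresponds to one instance invocation line in the script: the output of one component becomes the named input of the next. Unlike this linear chain, Meandre flows may also contain cycles, which a plain generator pipeline cannot express.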
3.3 Developing components
Components can currently be developed using three programming languages: Java, Python, and Lisp. In all three cases, RDF provides the abstraction needed. The component developer just has to implement the basic interface used by Meandre executable components. Also, when developing a component, developers can specify the required metadata, for instance using Java annotations, or extended comments when developing in Python or Lisp.
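The idea of a small component interface plus declarative metadata can be sketched in Python as follows. Everything here is hypothetical: the `ExecutableComponent` base class, its `initialize`/`execute`/`dispose` methods, and the `component` decorator are stand-ins chosen for illustration and should not be read as Meandre's actual interface or metadata mechanism.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict


def component(**metadata: Any):
    """Attach descriptive metadata to a component class (hypothetical helper)."""
    def decorate(cls):
        cls.metadata = dict(metadata)
        return cls
    return decorate


class ExecutableComponent(ABC):
    """Hypothetical minimal interface: set up, fire, and tear down."""

    def initialize(self, properties: Dict[str, str]) -> None:
        self.properties = properties

    @abstractmethod
    def execute(self, inputs: Dict[str, Any]) -> Dict[str, Any]:
        """Consume data from input ports and return data for output ports."""

    def dispose(self) -> None:
        pass


@component(name="ToUpperCase", creator="example", firing_policy="all",
           inputs=["string"], outputs=["string"])
class ToUpperCase(ExecutableComponent):
    def execute(self, inputs: Dict[str, Any]) -> Dict[str, Any]:
        return {"string": inputs["string"].upper()}
```

Keeping the metadata next to the implementation means a tool could generate the component description from the source file, which is the role the annotations and extended comments play in the text above.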
⁶ http://www.clusterresources.com/torque
⁷ http://gridengine.sunsource.net/
⁸ http://seasr.org/meandre/documentation/for-developers/zigzag/

4 The Meandre Architecture
The design of the Meandre architecture follows three directives: (1) provide a robust and transparent scalable solution from a laptop to large-scale clusters, (2) create a unified solution for batch and interactive tasks, and (3) encourage reusing and sharing components. To meet these goals, the designed architecture relies on four stacked layers and builds on top of service-oriented architectures (SOA). From bottom to top, the stack is composed of:
1. Virtualization infrastructure: Provides uniform access to the underlying execution environment. It relies on virtualization of machines and the usage of Java for hardware abstraction.
2. IO standardization: A unified layer provides access to shared data stores, distributed file systems, specialized metadata stores, and other service-oriented architecture gateways.
3. Data-intensive flow infrastructure: Provides the basic Meandre execution engine for data-intensive flows, component repositories and discovery mechanisms, extensible plugins, and web user interfaces (webUIs).
4. Interaction layer: Allows access to the four basic modes of expression of data-intensive flows when executed. Flows can provide self-contained applications via webUIs, create plugins for third-party services, interact with the embedding application that relies on the Meandre engine, or provide services to the cloud.

Figure 4: Meandre architecture stack.

Figure 4 presents how these layers are organized. In order to guarantee the scalability of the system, the Meandre architecture banks on providing a unified view via the usage of scalable storage facilities. Such back end storage elements allow fault-tolerant clusters that can be expanded on the fly. This design allows a Meandre engine to run in standalone mode on a laptop, or to be part of a large orchestrated cluster facility (see next section), transparently.
5 Cloud Computing
Virtualization has changed the way most data centers operate. It is quite common nowadays to see large computing facilities where farms of virtual servers are running. A well-known exponent of this approach is Amazon’s Elastic Compute Cloud (EC2)⁹. Instead of dedicated servers, virtual servers provide a flexible way to manage services. Virtual servers can be moved between different hardware, easing resource management. Virtual appliances (prefab virtual servers) provide quick deployment times for new services. New virtual servers can be allocated and booted when needed, and then disposed of when their services are no longer required.

Meandre was engineered with this scenario in mind. New Meandre servers can be instantiated out of the cloud when needed and then disposed of when their task as part of the Meandre cluster is accomplished (see Figure 5). When, for instance, more execution power is needed, a new instance of a Meandre server can be instantiated and added to the cluster. When no need for such a resource exists, the instantiated Meandre server can be disposed of, and thus removed from the pool of Meandre servers that forms the effective cluster. Meandre guarantees that no matter what instance of the Meandre cluster you interact with, a single system image is provided thanks to the Meandre Distributed Exchange (MDX).

In order to achieve this goal, we created a virtual appliance which contained an instance of the Meandre server ready to run. Thus, we could start Meandre clusters of arbitrary size by just running the required number of virtual appliances. In each Meandre cluster we also ran two other kinds of virtual appliances: a fault-tolerant load balancer and a redundant back end storage server. The fault-tolerant load balancer was built using ldirectord¹⁰ and DRBD¹¹. Its purpose is to reliably distribute the requests to the Meandre cluster across the Meandre servers instantiated out of the cloud. Users usually need to interact with their flows, and the load balancer is the crucial piece that makes it possible. Also, each Meandre cluster relies on a shared storage back end which provides the shared state required to coordinate the Meandre servers of the cluster via MDX. For our initial test we relied on two highly available MySQL servers in master-master replication¹². The only requirement for the back end storage is that it must be highly available and arbitrate distributed transactions with the proper isolation levels. Thus the minimal configuration for the Meandre cluster explored required at least five virtual instances (two for the highly available load balancer, two for the MySQL servers, and at least one Meandre server).

Figure 5: Meandre cluster running on a virtualized environment. Meandre server virtual appliances can be booted in the cloud and added to the cluster. Virtual instances can be returned to the cloud when they are no longer needed.
(a) Meandre server virtual appliance
(b) Meandre cluster with multiple servers

⁹ http://aws.amazon.com/ec2
¹⁰ http://www.vergenet.net/linux/ldirectord/
¹¹ http://www.drbd.org/
¹² http://www.mysql.com
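The elastic behavior and the minimal configuration described above can be captured in a toy Python model. The `MeandreCluster` class and its methods are hypothetical bookkeeping for illustration only; they do not boot or manage real virtual appliances.

```python
from dataclasses import dataclass, field
from typing import List

LOAD_BALANCERS = 2   # highly available load balancer pair
STORAGE_SERVERS = 2  # master-master replicated MySQL pair


@dataclass
class MeandreCluster:
    """Toy model of a Meandre cluster assembled from virtual appliances."""
    servers: List[str] = field(default_factory=list)

    def add_server(self, name: str) -> None:
        """Boot a Meandre server appliance and add it to the pool."""
        self.servers.append(name)

    def dispose_server(self, name: str) -> None:
        """Return an appliance to the cloud, keeping at least one server."""
        if len(self.servers) <= 1:
            raise ValueError("a cluster needs at least one Meandre server")
        self.servers.remove(name)

    def virtual_instances(self) -> int:
        """Total virtual instances: servers plus the fixed support appliances."""
        return len(self.servers) + LOAD_BALANCERS + STORAGE_SERVERS


cluster = MeandreCluster(["meandre-1"])
print(cluster.virtual_instances())  # 5, the minimal configuration

for i in range(2, 17):
    cluster.add_server(f"meandre-{i}")
print(cluster.virtual_instances())  # 20 with 16 Meandre servers
```

Only the number of Meandre servers varies; the four support appliances are a fixed overhead. With 16 servers the total is 20 instances, which would be consistent with the "20 node cluster" label in the Figure 6 panel captions if the support appliances are counted as nodes, although the paper does not state this explicitly.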
6 Experimental Prototype
Following the cloud mechanics described in the previous section, we prepared a set of experiments to validate its viability. We installed VMware server software¹³ on two identical machines. Each machine, running Windows Server 2003, was equipped with two quad-core 2.8 GHz Xeon processors, a 1600 MHz front side bus, 32 GB of RAM, and 4 TB of RAID 5 disk. These two boxes provided the virtualized cloud where Meandre server instances were deployed. Each virtual instance was built around a 32-bit Ubuntu 8.04 Linux and ran the Meandre server using Sun’s Java 1.5 JVM. We packed up to 16 virtual appliances (Meandre servers) across both boxes, aligning the number of instances with the number of cores available. We also used a third box (set up identically to the previous ones) to run the highly available fault-tolerant load balancer and MySQL servers. All the virtual appliances used 3 GB of RAM each.

We conducted three different experiments (see Figure 6). All three were based on the flow presented in Figure 3, where we changed the amount of data to be processed (from a single text line to a batch of 250,000 lines). The first test was designed to test the scalability of a single Meandre server. The average time per flow increased linearly with the number of concurrent flows, only breaking down when the JVM ran close to its memory limit (we allowed our instances to use 2 GB) and garbage collection started to dominate the performance. The next two experiments were run against a virtual Meandre cluster formed by 16 Meandre servers. Results clearly show how the cluster throughput grows linearly with the number of Meandre servers available.

Interaction with the flow may not always be required. In that case, a Meandre cluster can also boost its throughput by submitting flows to execute over a grid facility thanks to Meandre’s self-contained MAU execution format. Adding, for instance, Torque or Sun Grid Engine capabilities to the Meandre server virtual appliance allows a Meandre cluster to efficiently execute a large number of batch jobs by borrowing resources from grid computing facilities.
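The sizing of the prototype follows from simple arithmetic on the hardware figures given above, which the short Python check below spells out. It assumes the appliances were spread evenly over the two boxes (one per core), which the text implies but does not state outright.

```python
BOXES = 2                 # machines hosting Meandre server appliances
PROCESSORS_PER_BOX = 2    # two quad-core Xeon processors per machine
CORES_PER_PROCESSOR = 4
RAM_PER_BOX_GB = 32
RAM_PER_APPLIANCE_GB = 3
JVM_HEAP_GB = 2

cores_per_box = PROCESSORS_PER_BOX * CORES_PER_PROCESSOR
max_appliances = BOXES * cores_per_box          # one appliance per core
ram_used_per_box = cores_per_box * RAM_PER_APPLIANCE_GB

print(max_appliances)        # 16
print(ram_used_per_box)      # 24 (GB) of the 32 GB available per box
assert ram_used_per_box <= RAM_PER_BOX_GB
assert JVM_HEAP_GB <= RAM_PER_APPLIANCE_GB
```

Under that assumption each box hosts 8 appliances using 24 GB of its 32 GB, and the 2 GB JVM limit leaves about 1 GB per appliance for the guest operating system.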
7 Current Usage and Partners
The Meandre infrastructure is currently being developed to support the SEASR project at NCSA. The Andrew W. Mellon Foundation sponsors the Software Environment for the Advancement of Scholarly Research (SEASR) to create a flexible and scalable architecture that can be quickly deployed and reused for the humanities. Several projects already have Meandre components for areas including data manipulation, document collection retrieval and management, data and text mining, music retrieval, online conversations and social network analysis, and evolutionary computation. More information can be found at http://seasr.org/ and http://seasr.org/meandre.
8 Conclusions and Further Work
This paper has introduced the Meandre infrastructure, a semantic-web-driven data-intensive flow execution infrastructure which allows transparent usage from a single laptop to large-scale clusters. Meandre achieves such a goal by using the flow abstraction. Moreover, as the results presented in section 6 show, the Meandre infrastructure is also tailored to allow rapid deployment in cloud environments thanks to the extensive use of virtualization techniques, revealing itself as a dynamic on-demand clustered solution for interactive and batch tasks that is also able to reach into grid environments for extra computational resources.

Our current efforts and further work are focused on extending the infrastructure's distribution capabilities and improving the final user tool set. With advances in the distributed execution capabilities, distributed execution of flows (beyond the concurrent flow execution already available) may leave the research and development stage and enter the stable release distribution. Efforts to create larger scalability tests are targeting NCSA clusters and grid services. Finally, the standardization of Meandre cluster virtual appliances for cloud computing has also started.

¹³ http://www.vmware.com/

Figure 6: Scalability test of the flow presented in Figures 2 and 3 on three different setups. All panels plot execution time (secs) against the number of concurrent flows.
(a) Concurrent flows running on a standalone engine on a log/log scale (250,000 lines of text being pushed through the flow)
(b) Concurrent flows running on a 20 node cluster on a log/log scale (1 line of text being pushed through the flow)
(c) Concurrent flows running on a 20 node cluster on a log/log scale (250,000 lines of text being pushed through the flow)
Acknowledgments
This work is mainly sponsored by The Andrew W. Mellon Foundation. Parts of this work have also been sponsored by the Air Force Office of Scientific Research, Air Force Materiel Command, USAF, under grant F49620-03-1-0129. The US Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation thereon. We would also like to thank all the members of the Automated Learning Group and Data-Intensive Technologies and Applications for their continual support, help, and encouragement. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the Air Force Office of Scientific Research, the Technology Research, Education, and Commercialization Center, the Office of Naval Research, the National Science Foundation, or the U.S. Government.
References
Beckett, D. (2004). RDF/XML Syntax Specification (Revised) (W3C Recommendation 10 February 2004). The World Wide Web Consortium.
Beckett, D. (2007). Turtle - Terse RDF Triple Language (Technical Report 20 November 2007). Institute for Learning and Research Technology.
Berners-Lee, T., Fielding, R., & Masinter, L. (2005). Uniform Resource Identifier (URI): Generic Syntax (Technical Report RFC3986). The Internet Society.
Beynon, M. D., Kurc, T., Sussman, A., & Saltz, J. (2000). Design of a framework for data-intensive wide-area applications. In HCW ’00: Proceedings of the 9th Heterogeneous Computing Workshop pp. 116. Washington, DC, USA: IEEE Computer Society.
Brickley, D., & Guha, R. (2004). RDF Vocabulary Description Language 1.0: RDF Schema (W3C Recommendation 10 February 2004). The World Wide Web Consortium.
Dean, J., & Ghemawat, S. (2004). MapReduce: Simplified Data Processing on Large Clusters. In OSDI’04: Sixth Symposium on Operating System Design and Implementation.
Foster, I. (2003). The virtual data grid: A new model and architecture for data-intensive collaboration. In the 15th International Conference on Scientific and Statistical Database Management pp. 11–.
Grant, J., & Beckett, D. (2004). RDF Test Cases (W3C Recommendation 10 February 2004). The World Wide Web Consortium.
Manola, F., & Miller, E. (2004). RDF Primer (W3C Recommendation 10 February 2004). The World Wide Web Consortium.
Mattmann, C. A., Crichton, D. J., Medvidovic, N., & Hughes, S. (2006). A software architecture-based framework for highly distributed and data intensive scientific applications. In ICSE ’06: Proceedings of the 28th international conference on Software engineering pp. 721–730. New York, NY, USA: ACM.
Prud’hommeaux, E., & Seaborne, A. (2008). SPARQL Query Language for RDF (W3C Recommendation 15 February 2008). The World Wide Web Consortium.
Uysal, M., Kurc, T. M., Sussman, A., & Saltz, J. (1998). A performance prediction framework for data intensive applications on large scale parallel machines. In Proceedings of the Fourth Workshop on Languages, Compilers and Run-time Systems for Scalable Computers, number 1511 in Lecture Notes in Computer Science pp. 243–258. Springer-Verlag.
W3C (2008). W3C Semantic Web Activity (Technical Report). The World Wide Web Consortium.
Weibel, S., Kunze, J., Lagoze, C., & Wolf, M. (2008). Dublin Core Metadata for Resource Discovery (Technical Report RFC2413). The Dublin Core Metadata Initiative.
Welge, M., Auvil, L., Shirk, A., Bushell, C., Bajcsy, P., Cai, D., Redman, T., Clutter, D., Aydt, R., & Tcheng, D. (2003). Data to Knowledge (D2K) (Technical Report). Automated Learning Group, National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign.