A QoS management system and an efficient scheduling heuristic for network computing applications

Feras Al-Hawari and Elias Manolakos
Electrical and Computer Engineering Department, 442 Dana Research Building, Northeastern University, Boston, MA 02115

Abstract

The dynamic state conditions of the Network of Workstations (NOW) resources should be used to find the best acceptable tasks-to-machines allocation strategy before a network computing (NC) application is launched. We developed an interactive QoS management system that automatically finds the most efficient tasks-to-machines mapping at startup time and, in addition, facilitates application adaptation at runtime. The system is self-contained and relies on its own middleware layer to do the resource monitoring and servicing behind the scenes, at the application level. It groups the available machines into clusters to scale down the number of required QoS measurements and to facilitate the mapping process. A scheduler considers user requirements and constraints and uses an efficient mapping heuristic to find an acceptable assignment. The performance estimator predicts the overall running time of a given mapping based on application and network models as well as resource state information. The estimated time is used to decide whether the mapping satisfies the QoS targets set by the user. Furthermore, we conducted many simulations that demonstrate the accuracy of the application and network models as well as the quality of the obtained mappings in different kinds of scenarios and using different types of applications.
1 Introduction
Networks of Workstations (NOWs) are considered an attractive parallel processing platform for tackling compute-intensive problems. The wide availability of inexpensive PCs and fast communication networks enables NOWs to offer a much better cost/performance ratio than supercomputers. Despite this reality, building distributed component-based applications for NOWs is still a painstaking experience. Frameworks and tools that help designers model, simulate, build, and efficiently run coarse grain parallel applications will certainly contribute to the growth of NOWs' popularity and user base [1-6]. The resources of a NOW are usually heterogeneous and shared, making the system's state quite dynamic. Therefore, the performance of a network computing (NC) application is highly dependent on the state of the resources on which it will run. For example, a NC application that exhibits performance gains under light load conditions (relative to a sequential program realization) may not enjoy any speedup in the same
environment under heavy load conditions. So, if the objective is to deliver a targeted Quality of Service (QoS) level to an application (e.g. in terms of expected completion time or speedup), the system's state should be used in finding the best acceptable mapping of the parallel application's tasks to the available NOW resources before the application is launched (i.e. at startup time). Furthermore, the application should remain resource state aware and possibly adapt itself accordingly in order to keep meeting its QoS demands at runtime. The task of monitoring the system's state and providing information that can be exploited at startup as well as during the application's runtime can be delegated to an application-level QoS management system. The main requirements for such a system are to be able to automatically find the most efficient tasks-to-machines mapping at startup time and, in addition, to facilitate application adaptation at runtime. Our group has developed such a flexible QoS management system for coarse grain NC applications in the context of the ongoing JavaPorts framework project [1, 5]. In this paper we only present the design and main features of the startup-phase QoS management components. The details of the runtime-phase middleware and QoS API are discussed in [4].
Figure 1: Startup-phase QoS system architecture. Oval nodes represent software entities. Rectangular nodes represent data entities. The front- and back-end subsystems are delineated.
The developed startup-phase QoS management system (see Figure 1) consists of a front- and a back-end subsystem. The front-end subsystem modules provide an interface between the user and back-end subsystem modules. They are user-friendly graphical tools that help a developer construct intuitive network computing application models, build an available resources database, launch the QoS management system in a NOW, specify QoS requirements for the NC application, and send requests to (or get results from) the back-end modules. At the back-end, a QoS Manager module interacts with local and remote Resource Monitoring modules and a Scheduler in order to automatically find if there exists a tasks-to-machines
allocation strategy that satisfies the QoS targets set by the user. The lightweight Resource Monitoring modules measure and communicate to the QoS Manager static/dynamic resource state information (such as the current workload of each available machine, throughput of the network links, etc). The Scheduler uses an efficient mapping heuristic in conjunction with a performance estimation algorithm in order to find an application schedule (i.e. a tasks-to-machines mapping) that meets the user’s desired QoS demands based on: (1) a structural top-level model (Application Task Graph ATG) [6] describing how the tasks interact in the NC application, (2) a behavioral model for each task involved [4], and (3) NOW resource state related data. In this paper, we present the design of the scalable and non-intrusive resource monitoring subsystem. We explain how the networked machines are partitioned into different clusters in order to scale down the communication links’ measurements and to facilitate the application scheduling process. Moreover, we present an efficient mapping heuristic used to find the best acceptable mapping of the distributed application's tasks to the available NOW resources. Furthermore, we present the QoS GUI and discuss its capabilities. The rest of the paper is organized as follows: the related QoS management frameworks are surveyed and compared to our system in section 2, the graphical models to represent component based NC applications are discussed in section 3, the network abstraction and representation are presented in section 4, the efficient mapping heuristic for assigning application tasks to networked machines is introduced in section 5, the QoS GUI as well as the monitoring system implementation are shown in section 6, the experimental results to validate our mapping heuristic as well as the network and application abstractions are demonstrated in section 7, and finally a summary of our findings and future work is given in section 8.
2 Existing application-level QoS management frameworks
The existing QoS management systems can be categorized into two types: (1) resource monitoring and management systems and (2) scheduling frameworks. The resource management systems provide services to the scheduling frameworks. They monitor and record the characteristics of the underlying resources (workstations, network links, etc). In addition, they make the resource information (e.g. CPU speed, link throughput) available to an application-level scheduler, via an API, to influence the mapping decisions. Moreover, they may provide a scheduler with services (e.g. resource discovery, job submission, check pointing, job migration, security) to execute the application on the best-found resources. On the other hand, the objective of a scheduling framework is to automatically find a tasks-to-machines mapping that meets the desired QoS preferences at startup time. The basic components of scheduling frameworks are: (1) mapping heuristic, (2) performance estimation method, (3) application model, and (4)
resource information. A scheduler uses the mapping heuristic to find an acceptable tasks-onto-machines assignment. The performance estimator predicts the overall running time of a given mapping based on the application model as well as the resource information. The scheduler uses the performance estimate to decide whether the mapping satisfies the desired QoS demands. These frameworks may rely on their own, or on third party, resource monitoring or management systems to obtain the resource information or to execute the application tasks on distributed machines.
2.1 Resource monitoring and management systems
The Network Weather Service (NWS) [7] is a resource monitoring system that periodically measures the dynamic attributes of network and computational resources. It includes software sensors to measure the attributes of machines (e.g. CPU speed, workload, free memory size) as well as end-to-end TCP/IP network links (e.g. throughput and latency). It also uses numerical methods to forecast what the resource conditions will be in the near future. Moreover, it provides a network-level API that static or dynamic schedulers may use to access the gathered resource information. Similarly to the NWS, the REsource MOnitoring System (Remos) [8-10] is designed to provide a scheduler with dynamic machine and link data. It supports flow and topology queries to obtain the attributes of the resources along an end-to-end communication path and to get a dynamic view of a set of networked machines respectively. It performs the link measurements at the TCP/IP level and can discover the whole network topology. Furthermore, it provides client applications with a network-level API to obtain the dynamic resource attributes. The Globus Toolkit [11-13] is a set of software components and tools that provide a variety of core services (e.g. resource discovery, file and data management, information infrastructure, fault detection, security, portability) for grid-enabled applications. It is considered as an enabling technology for the Grid, since it allows users to share various computing and data storage resources securely across geographic and other boundaries without sacrificing local autonomy. Moreover, it provides the Grid Resource Allocation and Management (GRAM) [13] service to facilitate remote job submission and control. GRAM is not a scheduler, but it is often used as a front-end to schedulers. It provides a uniform interface to heterogeneous compute resources that span multiple administrative domains (i.e. Grid-wide and local-area resources). Furthermore, it supports basic Grid security mechanisms, reliable job execution, job status monitoring and job signaling (e.g. stop, restart, kill). In addition, the Globus Toolkit includes the Monitoring and Discovery System (MDS) [11], which is mainly used by static as well as dynamic schedulers. It consists of a set of services that implement standard interfaces (e.g. WS-ResourceProperties [14]) to publish and access XML-based resource properties. An Index service collects resource data from registered information sources (e.g. a third party monitoring
system such as the NWS) and publishes that information as resource properties. Client applications use the WS-ResourceProperties query interface to retrieve information from an Index. Our scalable and non-intrusive resource monitoring system is developed to collect the static/dynamic attribute values of the available machines (e.g. CPU speed, workload, free memory, swap sizes, etc) as well as of the network links interconnecting the machines (e.g. throughput and latency). The JavaPorts framework is used to deploy the monitoring modules on machines within the same administrative domain (e.g. local-area NOW). Our system performs the link measurements at the same level as the JavaPorts message passing operations i.e. at the application-level, which leads to more accurate performance predictions because these operations are implemented in a middleware layer on top of the Java Remote Method Invocation (RMI) [15] layer. On the other hand, systems such as the NWS [7] and Remos [8] perform the link measurements at the network level (i.e. TCP/IP level), which results in more optimistic performance estimates because they do not account for the RMI overheads. However, unlike our monitoring system, the NWS and Remos can monitor the state of various heterogeneous resources across administrative domains.
2.2 Application-level scheduling frameworks
Our research has been inspired by the Application Level Scheduling (AppLeS) [16-19] project. AppLeS is an agent-based system in which agents try to find an application mapping that satisfies the user specifications based on equation based performance models (describing the behavior of an application that consists of a set of interacting tasks) as well as static and dynamic resource information. It consists of an active agent called the Coordinator and four subsystems, which are: the Resource Selector, the Planner, the Performance Estimator, and the Actuator. The four subsystems share a common information pool that consists of: QoS requirements and preferences provided by the user, model templates to be used by the performance estimator, and dynamic information and forecasts of the system state supplied by the NWS [7]. The Resource Selector selects a set of possible resource configurations based on: user, resource and application information. The Planner, in conjunction with the Performance Estimator and the NWS, computes a potential mapping for each possible resource configuration using predictive models from a model pool. The Coordinator considers the performance of each candidate schedule and selects a mapping that meets the user’s requirements for implementation. Finally, the Actuator interacts with a resource management system (e.g. Globus [11]) in order to schedule the selected application mapping. The Grid Application Development Software (GrADS) [20, 21] runs on top of Globus [11] to facilitate application scheduling, launching and runtime adaptation. A Resource Selector queries the Globus MDS [11] to get a list of machines in the GrADS testbed and then contacts the NWS [7] to get the dynamic attributes of the machines. The Performance Modeler uses resource information as well as a skeleton based execution model (built specifically for an application that consists of a set of interacting tasks) to
map the application to machines. Upon approving a mapping by a Contract Developer component, the Launcher launches the jobs on the given machines using the Globus job management mechanism GRAM [13]. It also spawns a Contract Monitor component to monitor the application’s progress. In addition, the Rescheduler component is launched to decide when to migrate a job to a better machine. Moreover, the application can make calls to a Stop Restart Software (SRS) package that is built on top of MPI to checkpoint data, to be stopped at a particular point, to be restarted later on a different configuration of machines, and to be continued from a previous point of execution. Other scheduling frameworks such as Condor [22-23], Condor-G [24], Legion [25], MSHN [26], SmartNet [27], and MAP [28] can map a set of independent jobs from different users onto a heterogeneous suite of machines. They also can map a set of inter-dependent tasks represented by Directed Acyclic Graphs (DAGs). A DAG is considered as a structural application representation in which nodes represent tasks and edges represent the tasks inter-dependencies. However, unlike our system, these systems cannot evaluate the performance of an application that consists of a set of interacting and communicating tasks. The QoS management system we have developed is suitable for distributed and multitasked, coarse grain, network computing applications. An application task may contain asynchronous or synchronous read and write message-passing operations and it may spawn new threads. The AppLeS [16] and GrADS [21] frameworks as well as methods such as those in [29-33] are similar to our system in that they support the mapping of an application that consists of a network of communicating tasks to NOWs. AppLeS and GrADS, however, do not seem to support the very realistic situation arising in large scale computing in which more than one of the tasks are allocated to the same machine (i.e. multitasking). Our system supports any type of interactions between application tasks via anonymous message passing operations (synchronous and asynchronous). The approach in [29] supports only three application classes namely: concurrent, concurrent-overlapped, and pipeline. Moreover, the method introduced in [30] does not allow asynchronous message passing operations in the tasks. Furthermore, the work discussed in [31] is only suitable for Manager-Worker type of applications. The AppLeS [16], Globus [11], and GrADS [21] frameworks rely on third party monitoring systems such as the NWS [7] and Remos [8] to obtain the resource information they need. Similarly to our system, Condor [22] and Legion [25] have their own resource monitoring modules. However, Legion only monitors the state of the machines and does not estimate any link attributes. Moreover, unlike our system that performs all link measurements at the application-level, systems such as Condor, NWS and Remos measure the link attributes at the network-level. Hence, our performance estimates can be more accurate than the AppLeS and GrADS estimates.
Although this is not part of this research, it is important to emphasize that our system uses the JavaPorts framework to deploy the application tasks on the desired machines. JavaPorts provides scripts to deploy an application onto heterogeneous machines that fall within the same administrative domain (i.e. local area clusters of workstations). Similarly to our system, the Condor [23] and Legion [25] systems have their own job submission mechanisms. Conversely, systems such as AppLeS and GrADS rely on third party frameworks, such as Globus [12], for job submission and resource discovery. So, our system is basically a middleware layer that does all the monitoring and servicing behind the scenes, at the application level, without the support of an underlying resource monitoring infrastructure. This makes it applicable in any NOW that just runs Java/RMI and JavaPorts.
3 Abstractions to represent component-based parallel and distributed applications
JavaPorts [1, 5] is a component framework and a suite of tools for the rapid prototyping of distributed Java and Matlab applications executing on NOWs. JP facilitates the modeling, development, configuration, and deployment of coarse grain parallel and distributed applications. The JP framework provides the user with abstractions and APIs that enable anonymous message passing between tasks while hiding the inter-task communication and coordination details. In addition, a unique feature of the latest JP version 3.0 is that it allows in the same application the co-existence and interaction of reusable Java and Matlab components.
Figure 2: (a) ATG for a manager-worker application in which the dashed rectangles represent logical machines, the solid rectangles represent tasks, and the solid lines represent the peer-to-peer logical links between the tasks, (b) the behavioral graph for the manager task, and (c) the behavioral graph for a worker task.
A JP application is a set of distributed tasks and its structure can be described using an Application Task Graph (ATG) abstraction. An ATG for a manager-worker application is shown in Figure 2(a). The ATG nodes represent tasks and the edges represent task-to-peer-task bi-directional connections. The ATG can be
captured either textually, using the JP Configuration Language (JPCL), or graphically using the JP Visual Application Composer (JPVAC) tool [3, 6]. Tasks are assigned to logical machines that are eventually allocated to machines, and several tasks may share the same machine (multi-tasking). Each task has its own predefined input-output communication ports. Two tasks may exchange messages via an edge (point-to-point connection) using two peer ports (edge terminals). Each task is associated with either a Java or a Matlab software component and several tasks may share the same component implementation [2]. The ATG can be considered as the top (structural) level in a hierarchical application representation. A JP task may use anonymous message passing to communicate with another peer task. In anonymous communications the name (and port) of the destination task does not need to be mentioned explicitly in the message passing method [1]. JP maintains a port list data structure for each port, used to buffer incoming messages. Each port list has different elements, which are uniquely identified by message keys. Hence, the message key is used to identify the port list element when writing/reading a message. There are four allowed communication operations in JP, summarized below:
public Object AsyncRead(int MsgKey): Does not block the calling task. Returns a handle to a message if the message has already arrived at the port list element with the specified key, otherwise it returns null.
public void AsyncWrite(Object msg, int MsgKey): Does not block the calling task. Spawns a new thread to transfer and store the message in the receiving task's port list element with the specified key.
public Object SyncRead(int MsgKey): Blocks the calling task until a message arrives at the port list element with the specified key.
public void SyncWrite(Object msg, int MsgKey): Blocks the calling task until the sent message is read from the receiving task's port list element with the specified key.
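The following is a small, self-contained sketch of how a manager task might use these four operations, mirroring the behavior in Figure 2(b). The Port interface below repeats only the four documented signatures (hence the non-standard Java capitalization); the LoopbackPort stub and the surrounding setup are our own simplifications, since the real ports are created and wired by the JavaPorts runtime.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal stand-in for a JP port, exposing only the four documented operations.
interface Port {
    Object AsyncRead(int msgKey);
    void AsyncWrite(Object msg, int msgKey);
    Object SyncRead(int msgKey);
    void SyncWrite(Object msg, int msgKey);
}

// Toy in-memory port list so that the manager logic below runs on its own.
class LoopbackPort implements Port {
    private final Map<Integer, Object> portList = new ConcurrentHashMap<>();
    public Object AsyncRead(int msgKey) { return portList.get(msgKey); }           // non-blocking read
    public void AsyncWrite(Object msg, int msgKey) { portList.put(msgKey, msg); }  // fire-and-forget write
    public Object SyncRead(int msgKey) {                                           // block until a message arrives
        Object m;
        while ((m = portList.remove(msgKey)) == null) Thread.onSpinWait();
        return m;
    }
    public void SyncWrite(Object msg, int msgKey) { portList.put(msgKey, msg); }   // toy version: no read-ack wait
}

public class ManagerSketch {
    public static void main(String[] args) {
        Port toWorker1 = new LoopbackPort();
        Port toWorker2 = new LoopbackPort();
        int WORK_KEY = 0, RESULT_KEY = 1;

        // Manager behavior from Figure 2(b): asynchronously send work to each
        // worker, then synchronously wait for a result from each worker.
        toWorker1.AsyncWrite(new double[1024], WORK_KEY);
        toWorker2.AsyncWrite(new double[1024], WORK_KEY);

        // In a real run the workers reply over the same ports; here we fake the replies.
        toWorker1.SyncWrite("result-1", RESULT_KEY);
        toWorker2.SyncWrite("result-2", RESULT_KEY);

        System.out.println(toWorker1.SyncRead(RESULT_KEY));
        System.out.println(toWorker2.SyncRead(RESULT_KEY));
    }
}
```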
Furthermore, JP allows developers to construct graphically behavioral models for the tasks of distributed applications using the JP Visual Task Composer (JPVTC) tool [4]. A behavioral graph consists of nodes representing basic code constructs and of edges representing dependencies between them. JPVTC supports several elements for modeling iteration constructs (for loops), conditionals (if statements), sequential code blocks, anonymous message passing operations, and thread spawning. Most of the elements can be annotated with attributes and benchmark performance data, as needed to estimate the overall application's performance. A behavioral model is associated with each application task. Therefore behavioral graphs of tasks can be considered as the second (lower) level, in a hierarchical two-level representation of a distributed and multi-tasked application. The Performance Estimator [4] (see Figure 1) uses the constructed application models and static/dynamic resource state data to predict accurately the expected total running time of different application task configurations even before any coding is attempted.
The behavioral graphs, constructed using the JPVTC tool, for the manager and worker tasks in Figure 2(a) are shown in Figure 2(b) and Figure 2(c) respectively. The manager asynchronously sends a message to each worker and then synchronously waits to receive a message back from each worker. Each worker synchronously waits to receive a message from the manager and after some computation it synchronously sends back a results message to the Manager to store. In addition to the hierarchical two-level application representation that is used when we need to get a very accurate performance estimate of a given application configuration, our system also supports a simple one-level application model that the scheduler can use to quickly and roughly predict the overall running time of a given mapping. The structural and behavioral application models can be combined to get a simplified high-level application graph that captures: (1) the connectivity between the logical machines in the ATG, (2) the amount of computation on each machine, and (3) the sizes of the messages that are exchanged between the machines. The computation amounts and the message sizes are extracted from the tasks' behavioral graphs. This high-level graph is called the Application Logical Machines Graph (ALMG). The number of nodes |ML| in this graph is equal to the number of logical machines in the ATG, where ML is a set of |ML| logical machines, {ml1, ml2, ..., ml|ML|}. The links in this graph represent the connectivity between the logical machines. LA is the set of laij links, where laij is the aggregate link between logical machines mli and mlj in the ALMG. Each node in the ALMG is annotated with the aggregate computation time (CompAmount) of the task(s) that are allocated to its corresponding logical machine. An edge is annotated with the aggregate size of all the messages (CommSize) that are exchanged between the tasks on the logical machines that it connects.
Figure 3: (a) The ALMG that corresponds to the ATG in Figure 2(a), and (b) the ALMG annotated based on the behavioral graphs in Figure 2(b) and Figure 2(c).
The ALMG for the ATG in Figure 2(a) is shown in Figure 3(a). The three nodes in this graph correspond to the three logical machines in the ATG. The edges between logical machines ml1 and ml2 as well as ml1 and ml3 correspond to the peer-to-peer connections between the manager task and worker tasks W1 and W2, as well as the manager and worker W3 respectively. There is no edge between logical machines ml2 and ml3 because there are no peer-to-peer connections between the tasks on these machines. Based on the behavioral graphs of the manager and worker tasks in Figure 2(b) and Figure 2(c), the CompAmount of the
tasks on logical machines ml1, ml2, and ml3 is 30 (i.e. execution time of the two codeSegments in the behavioral graph of the manager task), 20 (i.e. the sum of the execution times of the codeSegments of workers W1 and W2), and 10 (i.e. execution time of the codeSegment of worker W3) seconds respectively. Furthermore, the CommSize of link la12 is 32Gbits i.e. the aggregate size of the messages exchanged between the manager and workers W1 and W2. Moreover, link la13 is annotated with a 16Gbits CommSize i.e. the aggregate size of the messages exchanged between the manager and worker W3. The annotated ALMG is shown in Figure 3(b).
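The ALMG annotations of Figure 3(b) can be derived mechanically from the per-task data. The sketch below illustrates one way to do this aggregation; the task decomposition (10 seconds of computation per worker) and the assumption that every exchanged message is 8 Gbits are our own choices, picked so that the resulting aggregates match the figure, and the class and record names are ours rather than JavaPorts'.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class AlmgBuilder {
    record Task(String name, String logicalMachine, double compSeconds) {}
    record Message(String fromTask, String toTask, double sizeGbits) {}

    public static void main(String[] args) {
        // Per-task data for the manager-worker example (assumed breakdown).
        List<Task> tasks = List.of(
            new Task("Manager", "ml1", 30),   // two codeSegments, 30 s total
            new Task("W1", "ml2", 10), new Task("W2", "ml2", 10),
            new Task("W3", "ml3", 10));
        List<Message> msgs = List.of(
            new Message("Manager", "W1", 8), new Message("W1", "Manager", 8),
            new Message("Manager", "W2", 8), new Message("W2", "Manager", 8),
            new Message("Manager", "W3", 8), new Message("W3", "Manager", 8));

        Map<String, String> taskToLm = new HashMap<>();
        Map<String, Double> compAmount = new TreeMap<>();    // ALMG node annotation
        for (Task t : tasks) {
            taskToLm.put(t.name(), t.logicalMachine());
            compAmount.merge(t.logicalMachine(), t.compSeconds(), Double::sum);
        }

        Map<String, Double> commSize = new TreeMap<>();      // ALMG edge annotation
        for (Message m : msgs) {
            String a = taskToLm.get(m.fromTask()), b = taskToLm.get(m.toTask());
            if (a.equals(b)) continue;                        // intra-machine traffic is not an ALMG edge
            String edge = a.compareTo(b) < 0 ? a + "-" + b : b + "-" + a;
            commSize.merge(edge, m.sizeGbits(), Double::sum);
        }
        System.out.println("CompAmount: " + compAmount);      // ml1=30.0, ml2=20.0, ml3=10.0
        System.out.println("CommSize:   " + commSize);        // ml1-ml2=32.0, ml1-ml3=16.0
    }
}
```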
4 Network abstractions and representation
A typical network (see Figure 4(a)) consists of interconnected machines, hubs, bridges, and routers. A hub is a layer-one device (as defined in the Open Systems Interconnection (OSI) 7-layer network model) that receives an incoming packet, possibly amplifies the electrical signal, and broadcasts the packet out of its links (including the link on which the packet was received). A bridge is a layer-two device that connects two or more links and forwards packets between them based on the source and destination Media Access Control (MAC) addresses. A router is a layer-three device that connects multiple subnets together and operates at the network layer of the OSI model.
Figure 4: (a) Typical network topology. (b) The FCCG for the machines in Figure 4(a). In Figure 4(b), the dashed circles represent clusters, the solid circles represent machines, and the solid lines represent links.
Assuming that any machine can communicate with any other machine in the pool, we can represent the underlying network of machines as a Fully Connected Machines Graph (FCMG). The number of nodes in this graph is equal to the number of machines |MP| in the pool, where MP is the set of machines in the pool, {m1, m2, …, m|MP|}. The number of logical links |LN| in a FCMG is (|MP|*(|MP|-1))/2. LN is the set of lnij links, where lnij is the link between machines mi and mj in the FCMG, given that 1 ≤ i ≤ (|MP|-1), 2 ≤ j ≤ |MP|, i ≠ j, and i < j. The nodes and links in the FCMG are annotated with the corresponding machine and link attribute values (e.g. CPU speed, workload, free memory size, throughput, latency), which are provided by the resource monitoring modules. The FCMG and ALMG are used by the scheduler to find an acceptable tasks-to-machines allocation strategy.
Frequently measuring the throughput of all (|MP|*(|MP|-1))/2 logical links becomes inefficient as |MP| increases (e.g. there are 435 logical links between 30 networked machines). Therefore, there is a need for a mechanism to reduce the required number of frequent throughput measurements (scalability) and to reduce the frequency of these measurements (intrusiveness). In a typical network (Figure 4(a)), there are groups (clusters) of machines that exhibit similar communication characteristics when they communicate with each other or with machines connected to other switches. For example, machines m1, m2, m5, and m6 in Figure 4(b) can be grouped in one cluster because they experience 1Gbps and 100Mbps throughputs when they communicate with each other and with any other machine (e.g. m3 and m4), respectively. Therefore, measuring the throughput between machines m1 and m2, for example, is sufficient to roughly estimate the throughput between any of the four machines in the cluster. This reduces the number of needed intra-cluster throughput measurements from 6 to 1. Our QoS management system automatically groups the machines into different clusters according to their communication characteristics, i.e. without any knowledge about the physical network topology. The clusters are represented as a Fully Connected Clusters Graph (FCCG) (see Figure 4(b)), which consists of |C| fully connected clusters (nodes), where C is the set of the fully connected clusters in the FCCG, {c1, c2, …, c|C|}. Thrii is the intra-throughput of cluster ci, where 1 ≤ i ≤ |C|. Thrij is the inter-throughput between clusters ci and cj, where 1 ≤ i ≤ (|C|-1), 2 ≤ j ≤ |C|, i ≠ j, and i < j. The machines in each cluster are represented as a clique to indicate that they have the same communication attributes. The links between clusters are used to represent the communication characteristics between machines in different clusters. The FCCG is considered a higher-level view of the FCMG and a logical network representation as seen by the application. The FCCG is logical because it may not directly correspond to the underlying physical network topology. For example, machines m1 and m2 as well as machines m5 and m6 are grouped into one cluster even though they are physically connected to two different hubs, i.e. hub1 and hub2 respectively. The re-clustering of the machines is conducted occasionally (e.g. once per hour) because it requires performing time-consuming all-to-all delay measurements. The re-clustering measurements are performed at the network level (using Java sockets) to be as close as possible to the actual network delays and to avoid any undesired application-level overheads. After determining the clusters, the system measures the throughput between the first two machines in a cluster to determine the intra-cluster throughput. Also it measures the throughput between the first machines in any two clusters (i.e. one candidate machine from each cluster) to obtain the inter-cluster throughputs. The inter-cluster measurements are applied to the corresponding FCMG links between the machines in both clusters. These few intra- and inter-cluster measurements are conducted frequently (e.g. once per minute) to obtain the dynamic throughput values that are needed to make the mapping decisions. The frequent measurements are conducted at the application level (using Java RMI [15]) to be as close as possible to what the application might experience, which will result in more accurate application performance estimates.
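As a quick illustration of the savings, the snippet below contrasts the number of frequently measured links in the full FCMG with the number needed once the machines are grouped into clusters (one intra-cluster measurement per cluster plus one inter-cluster measurement per cluster pair). The class and method names are ours, used only for this illustration.

```java
public class MeasurementCount {
    // Number of logical links in a fully connected graph of the given size.
    static int fcmgLinks(int machines) { return machines * (machines - 1) / 2; }

    // Frequent measurements with clustering: one per cluster plus one per cluster pair.
    static int fccgMeasurements(int clusters) { return clusters + clusters * (clusters - 1) / 2; }

    public static void main(String[] args) {
        System.out.println(fcmgLinks(30));        // 435 links for 30 machines
        System.out.println(fcmgLinks(6));         // 15 links for the 6 machines of Figure 4(a)
        System.out.println(fccgMeasurements(2));  // 3 frequent measurements for its 2 clusters
    }
}
```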
4.1 Clustering algorithm
In order to cluster the machines we first need to measure the delays of all the FCMG links and then group the links that have roughly equal delays into the same class (the links in a class are considered equivalent). The number of classes corresponds to the number of different types of links that interconnect the available machines (e.g. in Figure 4(a) we have two types of links, corresponding to the 100Mbps and 1Gbps links). A link belongs to a certain class if its delay deviates by no more than X% (the error delta) from the mean delay of all the links in the class (see Figure 5). Thus, using a small delta may result in many classes, with link delay values tightly grouped around their mean delays (i.e. higher delay accuracy), while using a large delta may result in fewer classes, with link delay values widely spread around their corresponding mean delays (i.e. lower delay accuracy). Typically the throughputs of the different types of links are widely separated (e.g. the throughput of a 1Gbps link is 10 times larger than that of a 100Mbps link), thus setting delta to 25% turned out to be good enough to identify the different classes in existing LANs while maintaining the delay accuracy of the links in each class.
Figure 5: A value in a class must lie within ±X% of the class mean.
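The paper does not spell out the exact procedure used to form the delay classes, so the following greedy pass is only one plausible realization under that assumption: delays are sorted, and a new class is opened whenever the next delay falls outside ±delta of the running class mean.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DelayClasses {
    // Groups sorted link delays into classes whose members stay within +/- delta
    // percent of the (running) class mean; a simple assumed grouping strategy.
    static List<List<Double>> group(double[] delays, double deltaPercent) {
        double[] sorted = delays.clone();
        Arrays.sort(sorted);
        List<List<Double>> classes = new ArrayList<>();
        List<Double> current = new ArrayList<>();
        double sum = 0;
        for (double d : sorted) {
            double mean = current.isEmpty() ? d : sum / current.size();
            // Open a new class if this delay deviates by more than delta% from the mean.
            if (!current.isEmpty() && Math.abs(d - mean) > deltaPercent / 100.0 * mean) {
                classes.add(current);
                current = new ArrayList<>();
                sum = 0;
            }
            current.add(d);
            sum += d;
        }
        if (!current.isEmpty()) classes.add(current);
        return classes;
    }

    public static void main(String[] args) {
        // Sample delays (ms) that roughly correspond to 1 Gbps and 100 Mbps links.
        double[] delays = {0.9, 1.0, 1.1, 1.05, 9.5, 10.2, 10.0};
        System.out.println(group(delays, 25));   // two classes, as in Figure 4(a)
    }
}
```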
1.  Annotate the links in the FCMG with the mean delays of the classes that they fall within
2.  k = 0                                        // cluster index
3.  ClusteredMachines = 0
4.  for ( each unclustered mi ) {
5.      k++                                      // initialize a new cluster
6.      ClusteredMachines++
7.      Add mi to cluster ck
8.      for ( each unclustered mj ) {
9.          if ( delays from mi and mj to every other machine are the same ) {
10.             ClusteredMachines++
11.             Add mj to cluster ck
12.         }
13.         if ( ClusteredMachines == |MP| ) break   // clustering is complete
14.     }                                            // for each unclustered mj
15.     if ( ClusteredMachines == |MP| ) break       // clustering is complete
    }                                                // for each unclustered mi
Figure 6: The pseudo code for the clustering algorithm.
The pseudo code for the clustering algorithm that basically converts the FCMG to a FCCG is shown in Figure 6. At line 1, each link in the FCMG is annotated with the previously computed mean delay value of its corresponding class. Next, the algorithm groups the machines into various clusters inside the for loop that begins at line 4. At line 9 within that loop, the algorithm determines whether an un-clustered machine belongs or not to the current cluster. The un-clustered machine is added to the current cluster (at lines 10-11) if the delays from it and from a machine in the current cluster to every other machine are equal (i.e. they have similar communication characteristics, such as machines m1 and m5 in Figure 4(a)). The complexity of this algorithm is O(|C||MP|²). The iterations over the for loop at line 4 are equal to the number of clusters. In addition, the iterations over the for loop at line 8 are equal to (|MP|-1) the first time. Furthermore, the complexity of the comparison at line 9 is O(|MP|). In a typical LAN the number of clusters is very small (1-4). In such cases, the complexity is reduced to O(|MP|²). Moreover, the number of machines under consideration is typically not very large, which makes this algorithm computationally inexpensive. Furthermore, we reduced the time to perform the comparison at line 9 by encoding the delay classes of the links, which are connected to each machine, into one or a few 64-bit integer(s) and then performing one or a few binary exclusive-OR (XOR) operation(s) to compare the delays of the links. For example, if there are 4 classes of links in a FCMG, then we need only 2 bits to encode each class (typically the range of delay classes is limited, which requires few bits to encode each class). Thus, a 64-bit integer can represent the delays of 32 links, which allows us to compare the delays of two groups of 32 links using one XOR operation. Moreover, we only need to perform four comparisons at line 9, if we have 100 machines in the pool. We ran the clustering algorithm on a 1.8GHz workstation and it took about 2 seconds to cluster 1000 machines interconnected by 4 types of links. In general, the number of machines under consideration is much less than 1000 and the algorithm will complete in fractions of a second. Furthermore, the re-clustering of machines is conducted less frequently (e.g. once every half hour or hour).
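The sketch below is a direct Java rendering of the algorithm in Figure 6, assuming the all-to-all delays have already been mapped to class indices (classOf[i][j]); the class and method names are ours. For brevity it compares the class vectors with a plain loop in sameDelays(); the XOR trick described above would instead pack each machine's row into 64-bit words and compare them with a few XOR operations.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MachineClustering {
    // Groups machine indices into clusters of machines that see the same delay
    // class towards every other machine (the criterion at line 9 of Figure 6).
    static List<List<Integer>> cluster(int[][] classOf) {
        int n = classOf.length;
        int[] clusterOf = new int[n];
        Arrays.fill(clusterOf, -1);
        List<List<Integer>> clusters = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            if (clusterOf[i] != -1) continue;        // already clustered
            List<Integer> c = new ArrayList<>();
            c.add(i);
            clusterOf[i] = clusters.size();
            for (int j = i + 1; j < n; j++) {
                if (clusterOf[j] == -1 && sameDelays(classOf, i, j)) {
                    c.add(j);
                    clusterOf[j] = clusters.size();
                }
            }
            clusters.add(c);
        }
        return clusters;
    }

    // True if machines i and j see the same delay class towards every other machine.
    static boolean sameDelays(int[][] classOf, int i, int j) {
        for (int k = 0; k < classOf.length; k++) {
            if (k == i || k == j) continue;
            if (classOf[i][k] != classOf[j][k]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // Class 0 = fast (1 Gbps), class 1 = slow (100 Mbps); machine order here is
        // m1, m2, m5, m6, m3, m4 from Figure 4(a). Diagonal (self) entries are unused.
        int[][] classOf = {
            {0, 0, 0, 0, 1, 1},
            {0, 0, 0, 0, 1, 1},
            {0, 0, 0, 0, 1, 1},
            {0, 0, 0, 0, 1, 1},
            {1, 1, 1, 1, 0, 1},
            {1, 1, 1, 1, 1, 0}};
        System.out.println(cluster(classOf));   // [[0, 1, 2, 3], [4, 5]] -> two clusters
    }
}
```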
4.2 Clustering example
Let us take the topology in Figure 4(a) as an example to show how the clustering of machines is done. As can be seen, machines m3 and m4 experience the same delays with each other and with machines m1, m2, m5, and m6 (i.e. they form one cluster). In addition, machines m1, m2, m5, and m6 belong to another cluster because they experience the same delays with each other and with the machines in the first cluster. Hence, we have two clusters in the FCCG as shown in Figure 4(b). Based on this FCCG, the resource monitors frequently measure the state of three logical links rather than 15 (i.e. 80% reduction in the number of required link measurements; the reduction can be much more for graphs with many more nodes). The measured throughput thr12 is applied to any logical link that connects a machine in cluster c1 to a machine in cluster c2 (e.g. the links between machines m3 and m1, as well as m4 and m6). Moreover, the measured
throughputs thr11 and thr22 are applied to the logical links connecting the machines in clusters c1 (e.g. the link between machines m3 and m4) and c2 (e.g. the link between machines m2 and m5) respectively.
4.3 Related work
An Effective Network View (ENV) [34] is derived from user-level observations and measurements. This view is a representation of the network topology as it relates to the application performance. An application scheduler can use the ENV to assign communication intensive tasks to machines linked by fast network links. Each ENV reflects the network topology as observed by the process that ran the performance tests and collected the results. However, our FCCG represents the physical network as observed by all the machines in the pool. The NWS [7, 35] system organizes the network sensors as a hierarchy of sensor sets called cliques in order to provide a scalable way to generate all-to-all network bandwidth measurements. The NWS administrator is responsible for configuring the cliques based on his knowledge about the network topology. Similarly, Condor [23] partitions the machines into different pools. The pools are merged to form a virtual pool called a Condor flock. The Condor system administrator manually stores the information about the various machines pools in a special configuration file. Similarly to the NWS and Condor, the logical network topology model described in [36] is manually configured by the network administrator. Our method differs from the previous systems by discovering the logical clusters automatically and regardless of the network topology. Remos [8] automatically discovers the physical network topology as described in [37]. The network is represented as a graph in which nodes correspond to machines, routers and bridges, while edges represent network links. Unfortunately, the network-level topological information is more difficult to interpret and to use by an application-level scheduler. On the other hand, we represent the network logically as observed by the application. Thus, an application-level scheduler can directly use our logical network representation.
5 Mapping NC applications to internetworked machines
One of our objectives is to find an acceptable mapping of the parallel application's tasks (represented as an ALMG) to a set of internetworked machines (represented as a FCMG). The mapping problem is intractable, since mapping N logical machines to N machines can be done in N! (N factorial) ways. Thus, we need an algorithm that is based on certain heuristics to find a tasks-to-machines allocation strategy that meets the user's desired QoS targets. The details of our efficient mapping algorithm are discussed in this section.
5.1 Mapping heuristic
The objective of the mapping heuristic is to assign the ALMG to a subset of machines in the FCMG, such that the application's total completion time (TCT) is minimized. Our mapping heuristic is based on a scheduling heuristic proposed by Jon Weissman [29], which tries to find an assignment of a multicomponent application to grid resources that minimizes the TCT. We refer to the Weissman heuristic as WH
and to our heuristic as AMH (Al-Hawari and Manolakos Heuristic) in the rest of the paper. The scheduling decisions in the WH are based on cost models that are constructed using application and resource information. It supports cost functions for three application classes: concurrent, concurrent-overlapped, and pipeline (see section 7 for more details on these application classes). Unlike the WH models, our application models can be used to represent any distributed application class. Moreover, the WH is more suitable for computation intensive applications because it always assigns the component with the largest computation amount to the fastest machine, regardless of the communication characteristics of the links that are connected to that machine. This can increase the message delays from this machine to the other machines (e.g. when the fastest machine is connected to a very slow network link), and thus can have a large impact on the application's completion time. On the other hand, the AMH approach obtains the communication characteristics of the machines from the FCCG and uses them to make the mapping decisions. That makes the AMH suitable for both computation and communication intensive applications, which leads to better mappings than the WH.
1.  Sort the clusters in descending order by their Throughputs
2.  Sort the machines in each cluster in descending order by their EffectiveSpeeds
3.  Sort the logical machines in descending order by their CompAmounts
4.  Set MinTimeall to the MAX double
5.  for ( each cluster ck in C ) {
6.      Assign the first mli machine in ML to the first mi machine in ck
7.      for ( each unassigned mli in ML ) {
8.          Set MinTimeiter to the MAX double
9.          for ( each cj in C ) {
10.             if ( there is an unassigned machine mpj in cluster cj ) {
11.                 MappingTime = TCT of current assignment
12.                 if ( MappingTime < MinTimeiter ) {
13.                     BestMach = mpj
14.                     MinTimeiter = MappingTime
                    }
                }                                        // there is unassigned machine mpj
            }                                            // end for each cj
15.         Assign machine mli to BestMach
        }                                                // end for each unassigned mli
16.     MappingTime = time of mapping when ck is the initial cluster
17.     if ( MappingTime < MinTimeall ) {
18.         MinTimeall = MappingTime
19.         BestMapping = mapping when ck is the initial cluster
        }
    }                                                    // end for each cluster ck
Figure 7: Pseudo code for the AMH algorithm.
The pseudo code for the AMH algorithm is shown in Figure 7. At lines 1-3, the method starts by sorting the clusters and the logical machines in descending order by their Throughputs and CompAmounts respectively (note that the machines in each cluster are also sorted in descending order by their effective speeds). Then it evaluates one or more mappings, if any, by iteratively (for loop at line 5) assigning the logical machine with the highest CompAmount to the fastest machine in each cluster (at line 6), i.e. the number of
evaluated mappings is equal to the number of discovered clusters |C|. Eventually, the algorithm reports the mapping with the minimum TCT (based on the evaluations at lines 16-19). After assigning the first logical machine of each mapping, the algorithm evaluates the communication and computation costs of mapping the second logical machine onto the fastest available machines in all clusters. Hence, the clustering information makes it sufficient to only consider the best available machine in each cluster, rather than considering all the available machines in the pool as in the WH. This makes the AMH less greedy and faster than the WH. Based on the estimated times of the possible machine allocations, the algorithm assigns the logical machine under consideration to the machine that results in the minimum TCT. The same procedure continues till all the logical machines are assigned to machines. Then, the current mapping time is compared to the time of any previously evaluated mappings to determine the best mapping. The complexity of the AMH is O(|C|²|ML|²), since for each of the |C| starting clusters there are |C||ML| candidate assignments to consider and O(|ML|) steps to evaluate each mapping time. Unlike the WH, we also evaluate the effect of assigning the first logical machine to the best machine in each of the |C| clusters. However, |C| is typically small (e.g. 1-4) and |ML| is also small in practice, which makes this heuristic scalable. Note that the complexity of the WH is O(|MP||ML|²), so the complexities of both heuristics are almost equal because |C|² and |MP| are small in practice. In addition, when |MP| is much larger than |C|², the AMH outperforms the WH.
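To make the inner step of the heuristic concrete, the sketch below evaluates the candidate machines for one logical machine, using the same concurrent cost terms as the worked example in section 5.2 (scaled computation time plus communication time to already-mapped peers). The data layout, class, and method names are ours and simplify the real scheduler.

```java
import java.util.List;
import java.util.Map;

public class AmhStep {
    // compSec: computation of the logical machine on the reference machine;
    // cpuMHz: machine speeds; thrBps[a][b]: throughput between clusters a and b;
    // commGbitsToMappedCluster: aggregate message size (Gbits) exchanged with
    // already-mapped logical machines, keyed by the cluster of their host machine.
    static int pickMachine(double compSec, double refMHz, double[] cpuMHz,
                           List<List<Integer>> clusters, int[] machineCluster,
                           boolean[] machineUsed, double[][] thrBps,
                           Map<Integer, Double> commGbitsToMappedCluster) {
        int best = -1;
        double bestTime = Double.MAX_VALUE;
        for (List<Integer> cluster : clusters) {
            // Fastest unassigned machine of this cluster (clusters are pre-sorted by speed).
            int m = cluster.stream().filter(i -> !machineUsed[i]).findFirst().orElse(-1);
            if (m < 0) continue;
            double compTime = (refMHz / cpuMHz[m]) * compSec;
            double commTime = 0;
            for (Map.Entry<Integer, Double> e : commGbitsToMappedCluster.entrySet()) {
                commTime += e.getValue() * 1e9 / thrBps[machineCluster[m]][e.getKey()];
            }
            if (compTime + commTime < bestTime) { bestTime = compTime + commTime; best = m; }
        }
        return best;
    }

    public static void main(String[] args) {
        // Machines m1..m6 (indices 0..5); clusters c1 = {m3, m4}, c2 = {m1, m2, m5, m6}.
        double[] cpuMHz = {500, 333, 1000, 250, 333, 250};
        List<List<Integer>> clusters = List.of(List.of(2, 3), List.of(0, 1, 4, 5));
        int[] machineCluster = {1, 1, 0, 0, 1, 1};
        double[][] thrBps = {{100e6, 100e6}, {100e6, 1e9}};
        boolean[] used = new boolean[6];
        used[0] = true;   // ml1 already mapped to m1 (cluster index 1)

        // Map ml2 (20 s, 32 Gbits exchanged with ml1, whose host is in cluster index 1).
        int chosen = pickMachine(20, 1000, cpuMHz, clusters, machineCluster, used,
                                 thrBps, Map.of(1, 32.0));
        System.out.println("ml2 -> m" + (chosen + 1));   // prints "ml2 -> m2", as in section 5.2
    }
}
```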
5.2 Mapping example
Let us take an example to demonstrate how the AMH works. The example is based on the network topology and its corresponding FCCG shown in Figure 4(a) and Figure 4(b) respectively. Moreover, we use the ATG and its corresponding ALMG, shown in Figure 2(a) and Figure 3(b) respectively. We assume that the CompAmounts are measured on a 1000 MHz reference machine (mref). The CPU speeds of machines m1, m2, m3, m4, m5, and m6 are 500MHz, 333MHz, 1000MHz, 250MHz, 333MHz, and 250MHz respectively. Moreover, we assume that the machines are lightly loaded. Based on Figure 4(a), the values of thr11, thr12, thr22 in Figure 4(b) are 100Mbps, 100Mbps, and 1Gbps respectively. The algorithm steps are shown below (in this case we assume that cluster c2 is the initial cluster, and we use the concurrent cost function defined in [29]):
(2) The machines in cluster c1 are sorted as follows: m3 (1000MHz) and m4 (250MHz). The machines in cluster c2 are sorted as follows: m1 (500MHz), m2 (333MHz), m5 (333MHz), and m6 (250MHz).
(3) The logical machines are sorted as follows: ml1 (30sec), ml2 (20sec), and ml3 (10sec).
(6) Map ml1 to the first machine in the initial cluster c2 (i.e. m1). Mark machines ml1 and m1 as assigned.
(7) Consider logical machine ml2 in ML.
(8) MinTime = MAX double.
(9) Consider cluster c1 in C.
(10) The best unassigned machine in c1 is m3.
(11) CompTime = (1000/1000)*20 = 20 seconds.
(11) CommTime = 0.
(11) ml1 is assigned and it is connected to ml2 (see Figure 3(b)).
(11) ml1 is already mapped to m1 in c2, and we are considering mapping ml2 to m3 in c1. Thus, the Throughput is equal thr12 = 100Mbps.
(11) CommTime += (32e9/100e6) = 320 seconds.
(11) MappingTime = 20 + 320 = 340 seconds.
(12) MappingTime is less than MinTime. So, MinTime = 340 and BestMach = m3.
(9) Consider cluster c2 in C.
(10) The best unassigned machine in c2 is m2.
(11) CompTime = (1000/333)*20 = 60 seconds.
(11) CommTime = 0.
(11) ml1 is assigned and it is connected to ml2.
(11) ml1 is already mapped to m1 in c2, and we are considering mapping ml2 to m2 in c2. Thus, the Throughput is equal thr22 = 1Gbps.
(11) CommTime += (32e9/1e9) = 32 seconds.
(11) MappingTime = 60 + 32 = 92 seconds.
(12) MappingTime is less than MinTime. So, MinTime = 92 and BestMach = m2.
(9) No more clusters to consider
(15) Map ml2 to the BestMach (i.e. m2). Mark machines ml2 and m2 as assigned.
The algorithm continues in a similar manner and results in mapping ml3 to m5. Thus, the logical machines ml1, ml2, and ml3 are mapped to machines m1, m2, and m5, respectively. The MappingTime for this configuration is equal to 198 seconds (i.e. (60+60+30) + (32+16)). Moreover, we also need to consider the case when c1 is the initial cluster. In this case, the algorithm results in mapping the logical machines ml1, ml2, and ml3 to machines m3, m1, and m2, respectively. The MappingTime for this case is equal to 580 seconds (i.e. (30+40+30) + (320+160)). Based on that, the mapping when cluster c2 is the initial cluster is considered the best mapping (evaluated at lines 17-19). Furthermore, this example shows that taking the clustering information into consideration and evaluating several mappings with different starting points can result in a better mapping than the WH. In this example, the WH starts by assigning ml1 to m3, which limits its bandwidth with ml2 and ml3 from the beginning to 100Mbps. Then it maps ml2 and ml3 to machines m1 and m2, respectively. So, the WH MappingTime is equal to 580 seconds, which is 382 seconds more than the AMH's best mapping time (198 seconds).
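A quick arithmetic check of the two totals quoted above, under the same concurrent cost model (sum of scaled computation times plus aggregate communication times); the class name is ours.

```java
public class MappingTimes {
    public static void main(String[] args) {
        // AMH mapping: ml1->m1 (500 MHz), ml2->m2 (333 MHz), ml3->m5 (333 MHz), 1 Gbps links.
        double amh = (1000.0 / 500) * 30 + (1000.0 / 333) * 20 + (1000.0 / 333) * 10
                   + 32e9 / 1e9 + 16e9 / 1e9;
        // WH mapping: ml1->m3 (1000 MHz), ml2->m1 (500 MHz), ml3->m2 (333 MHz), 100 Mbps links.
        double wh  = (1000.0 / 1000) * 30 + (1000.0 / 500) * 20 + (1000.0 / 333) * 10
                   + 32e9 / 100e6 + 16e9 / 100e6;
        System.out.printf("AMH: %.0f s, WH: %.0f s%n", amh, wh);   // ~198 s vs ~580 s
    }
}
```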
5.3 Related work
An enhanced application model that is based on the idea of task refinement is introduced in [30]. This approach is based on the WH [29], but it uses more detailed information about the tasks' behavior, which resulted in more efficient mappings than the WH. However, this method is only applicable to the concurrent application class, since it does not allow asynchronous communication operations in the tasks. Moreover, as in [29], the task with the largest computation amount is always assigned to the fastest machine, which can have a significant impact on the application's TCT. Our AMH supports asynchronous and synchronous communication, which makes it applicable to any distributed application class. In addition, it uses the clustering information in the mapping decisions, which makes it suitable for communication and computation intensive applications. A work-rate based model to determine a performance efficient mapping of Master-Slave processes to a set of machines in a distributed and heterogeneous environment is introduced in [31]. This model is applicable to computation and communication intensive Master-Slave applications. On the other hand, the AMH is applicable to any network computing application class. In addition, AMH does not perform an exhaustive search to find an acceptable mapping as in [31]. Node selection algorithms for maximizing the available computation and communication capacities are introduced in [10]. These algorithms are based on selecting the least loaded machines and repeatedly removing the edges with the minimum bandwidth from the network graph. However, pivotal factors to find an acceptable mapping, such as the application configuration, the tasks' execution times, and the message sizes, are not taken into consideration in these algorithms. Condor [38] adopts a matchmaking paradigm [39] to specify and implement a resource allocation scheme that takes into account resource and user (i.e. application) requirements. In this framework, users are called principals and they are represented in the system by software components called agents. Agents and resources advertise their attributes and requirements described by Classified Advertisements (ClassAds) [39] to a Matchmaker. The Matchmaker processes the published ClassAds and generates agent-resource pairs that satisfy each other's constraints and preferences. Then it informs both parties (i.e. the agent and the resource) of the match. Finally, the agent and the resource, without the Matchmaker's intervention, establish contact and cooperate to execute the job through a claiming process. This approach is suitable for allocating one job to a single machine, but it is not suitable for assigning a distributed application onto multiple resources. Moreover, the resource selection is totally independent of the job characteristics. Another resource selection framework that comprises three modules is introduced in [40]. The resource monitor is responsible for querying Globus-MDS and NWS to obtain resource information and for storing this information in a local database. The set matcher takes application requests described by set-extended
ClassAds and uses a set-matching algorithm to find a resource set that satisfies the requested individual and set constraints. The set-extended ClassAds syntax extends the Condor ClassAds language [39] to support both single resource and multiple resource selection. Finally, the mapper is responsible for allocating the workload of the application to the selected resources. The mapper uses AppLeS-like application performance models (i.e. equation based) to evaluate the various mappings. A Virtual Grid Execution System (vgES) [42] that is based on the GrADS project [21] is described in [41-46]. The objective of this framework is to improve the scalability of a scheduling algorithm by constraining its operation to a subset of resources. In this system, the application resource requirements are specified using the resource description language (vgDL) [41], which supports three resource aggregates: LooseBag (a collection of heterogeneous nodes with poor connectivity), TightBag (a collection of heterogeneous nodes with good connectivity), and Cluster (a well-connected set of homogeneous nodes). A resource selection and binding component (vgFAB) [44] takes a vgDL specification from the application and returns to it a Virtual Grid (VG), i.e. a hierarchical abstraction of a resource set that matches the vgDL specification. Moreover, a vgLAUNCH component is used to launch the application on the bound resources. In addition, the vgMON component ensures that the resource requirements are met throughout the application execution. The efficiency of this approach is based on the fact that the resource data is already organized and cached in a local database. However, they do not discuss the overhead of constructing, organizing and updating the database.
6 Interactive QoS management system for scheduling NC applications
We integrated and validated the network and application models as well as the mapping and performance estimation algorithms in an interactive QoS management system that monitors the state of the resources (e.g. machines and network links) and automatically finds an application-to-machines mapping that may satisfy the desired QoS levels in terms of a selected metric (e.g. application completion time, speedup ratio). The developer interacts with the QoS system via a QoS GUI (see Figure 8) in order to accomplish two main tasks: (1) setup and launch the QoS monitoring modules on a pool of networked machines to measure the resources state data, and (2) run suitable QoS sessions to efficiently map the tasks to machines.
6.1 Managing the QoS system
The Setup QoS System tab (Figure 8(a)) can be used to: (1) define the Machines Pool (a list of available machine names) on which the logical machines in the ATG may be allocated, (2) specify the reference machine, which was employed to collect the benchmark data used in annotating the task behavioral models, (3) setup the QoS system, (4) launch the QoS monitoring modules on the machines pool, (5) view the measured resource data, and (6) terminate the QoS modules.
Figure 8: QoS GUI: (a) The Setup QoS System tab, and (b) The Open Application tab.
The JP framework is used to configure, deploy, and terminate the backend QoS monitoring modules. A QoSManager is launched on the MASTER machine (i.e. the machine on which the QoS GUI is launched). In addition, a QoSMonitor is launched on every machine in the pool (including the MASTER machine). The QoS modules are configured as a ring of JP tasks (see Figure 9(a)). The logical links are used to exchange information between the QoSMonitors themselves and with the QoSManager. The JP message passing API [1] is used to coordinate the monitoring events and exchange the measured data between the various modules. The QoSManager occasionally (once an hour) triggers the QoSMonitors to perform all-to-all delay measurements, which are used to partition the machines pool into different clusters. The re-clustering phase measurements are performed at the network-level using Java sockets. In between re-clustering events, each QoSMonitor frequently (once per minute) measures the attributes of its machine (e.g. Workload, FreeRAMSize). Moreover, the first QoSMonitors in each cluster are required to frequently measure the intra- and inter-clusters Throughput. The frequent measurements are called QoS measurements, since they are used to make the scheduling decisions. Hence, the QoS-phase Throughput measurements are performed at the application-level (i.e. using Java RMI) to be as close to what the application may experience, which leads to more accurate performance estimates.
Figure 9: (a) The QoS modules configuration. In this figure, the solid boxes represent JP tasks, the dashed boxes represent machines, and the solid lines represent the logical links between the corresponding peer-to-peer ports, (b) the QoS System Setup dialog (see text for details).
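The skeleton below illustrates the two measurement rates described above (hourly re-clustering, per-minute QoS measurements) using a standard Java scheduled executor. The runnables are placeholders only; the real QoSManager and QoSMonitors coordinate these phases over the JP ring shown in Figure 9(a).

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class MonitorScheduler {
    public static void main(String[] args) {
        ScheduledExecutorService exec = Executors.newScheduledThreadPool(2);

        // Placeholder for the socket-level all-to-all delay measurement and re-clustering.
        Runnable recluster = () ->
            System.out.println("re-clustering phase: all-to-all delay measurements");
        // Placeholder for the per-machine attributes and intra/inter-cluster throughput probes.
        Runnable qosMeasure = () ->
            System.out.println("QoS phase: workload, memory, throughput measurements");

        exec.scheduleAtFixedRate(recluster, 0, 60, TimeUnit.MINUTES);   // occasional re-clustering
        exec.scheduleAtFixedRate(qosMeasure, 1, 1, TimeUnit.MINUTES);   // frequent QoS measurements
    }
}
```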
Since the link Throughput depends on the message size, we perform each QoS-phase Throughput measurement using either two or four different message sizes. This allows our scheduler to accurately predict the Throughput for any message size, which results in more accurate mapping decisions. The Throughput of the largest specified message size is applied to any message with a larger size, and the Throughput of the smallest specified message size is applied to any message with a smaller size. Moreover, we use linear interpolation to predict the Throughput of messages with sizes between the specified message sizes. The developer can use the QoS System Setup dialog (Figure 9(b)) to select the number of points to use in the Throughput measurements, as well as their corresponding sizes. Moreover, the developer can use the Setup dialog to override the pre-defined system parameters (e.g. the re-clustering and QoS measurement frequencies).

The QoS GUI exchanges information with the QoSManager via shared objects, which the QoS GUI registers in an RMI registry on its local machine. There are two shared objects: one that makes the setup data (SetupDataObject) available to the QoSManager, and another that makes the measured data (MeasDataObject) available to the QoSManager and the Scheduler (see Figure 1). The QoS GUI updates the SetupDataObject based on the user selections. The QoSManager updates the MeasDataObject based on the resource information that the QoSMonitors measure and provide. The QoSMonitors send their measured data to the QoSManager via the logical links of the ring, i.e. each QoSMonitor appends its own measurements to the data gathered so far and passes them to its neighbor until they reach the QoSManager.
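The clamp-and-interpolate rule above can be summarized by a short sketch. The following Java snippet is only an illustration, under the assumption that the measured QoS-phase points are kept as (message size, throughput) pairs; the class and method names are not part of the actual system.

import java.util.TreeMap;

// Illustrative sketch of the throughput-prediction rule described above:
// clamp below the smallest and above the largest measured message size,
// and linearly interpolate between measured points.
public class ThroughputModelSketch {

    // Measured QoS-phase points: message size (KB) -> throughput (Kb/sec).
    private final TreeMap<Double, Double> points = new TreeMap<>();

    public void addMeasurement(double sizeKB, double throughputKbps) {
        points.put(sizeKB, throughputKbps);
    }

    public double predict(double sizeKB) {
        // Exact match with a measured point.
        if (points.containsKey(sizeKB)) return points.get(sizeKB);
        // Clamp to the smallest / largest measured size.
        if (sizeKB <= points.firstKey()) return points.firstEntry().getValue();
        if (sizeKB >= points.lastKey())  return points.lastEntry().getValue();

        // Linear interpolation between the two surrounding measurements.
        double loSize = points.floorKey(sizeKB),   loThr = points.get(loSize);
        double hiSize = points.ceilingKey(sizeKB), hiThr = points.get(hiSize);
        double t = (sizeKB - loSize) / (hiSize - loSize);
        return loThr + t * (hiThr - loThr);
    }

    public static void main(String[] args) {
        ThroughputModelSketch model = new ThroughputModelSketch();
        model.addMeasurement(1, 800);      // e.g. a 1 KB message
        model.addMeasurement(1000, 9000);  // e.g. a 1 MB message
        System.out.println(model.predict(500));   // interpolated
        System.out.println(model.predict(5000));  // clamped to the largest point
    }
}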
6.2
Running QoS sessions
After setting up and launching the QoS monitoring modules, the developer can use the Open Application tab (Figure 8(b)) to open (load) multiple application models (i.e. ATGs and task behavioral graphs) and possibly launch multiple QoS Session dialogs (Figure 10(a)) concurrently, in order to manage several QoS sessions for different applications. The developer can launch the JPVAC tool [5] to view or edit the ATG of a selected application. Moreover, the JPVTC tool (Figure 2(b)) can be launched from the JPVAC tool to view or edit the task behavioral graphs.
Figure 10: (a) QoS Session dialog, (b) QoS session results report.
The objective of a QoS session is to automatically find a tasks-to-machines mapping that may satisfy the desired QoS levels in terms of a selected metric (e.g. application execution time, speedup ratio). The speedup is defined as the application's sequential time on a reference machine divided by the estimated execution time of the best found mapping. The sequential time can be either: (a) an actual time (in time units) provided by the user, or (b) the time estimate obtained by the performance estimation algorithm [4] when all tasks are assigned to the reference machine. The user has the flexibility to assign some or all of the logical machines to specific machines. If some of the logical machines are left unassigned, then a flexible mapping session is in effect. In this case, the user can also specify a set of constraints that the mapping algorithm needs to meet while optimizing the selected metric. The supported constraints are: BestCPUSpeed, BestWorkload, BestEffectiveSpeed, BestRAMSize, BestSwapSize, BestMemorySize, and BestCommAndComp. If any of the first six constraints is selected, the mapping algorithm allocates the unassigned logical machines with the largest CompAmounts to the free machines with the best attributes (i.e. best CPU speeds, best workloads, etc.). However, if the BestCommAndComp constraint is selected, the AMH algorithm (section 4.1) uses the ALMG and the
FCMG to try to find a mapping that minimizes the application's TCT. The performance estimator is then used to predict the total running time of the best mapping. The application's estimated running time, or speedup, is then compared to the user-specified QoS target in order to determine whether that mapping is acceptable or not. Moreover, when a session completes, its results are displayed in a results dialog (Figure 10(b)) that contains: (1) a summary of the session settings and result, (2) the best found machines, (3) a breakdown of the estimated overall time of each application task (the overall time is divided into computation, communication, and idle times), and (4) the tasks with the (min, max) computation, communication, idle, as well as total times. Furthermore, the developer can generate a JavaPorts configuration file for the best-found mapping using the Save Configuration File button in the report dialog shown in Figure 10(b).
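As an illustration of the single-attribute constraint rule described above (e.g. BestCPUSpeed), the following Java sketch greedily pairs the unassigned logical machines, in decreasing order of CompAmount, with the free machines sorted by the chosen attribute. The class and field names are assumptions made for this example and are not part of the JavaPorts API.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative sketch: allocate the unassigned logical machines with the
// largest CompAmounts to the free machines with the best attribute values
// (here the attribute is CPU speed, as for the BestCPUSpeed constraint).
public class ConstraintMappingSketch {

    record LogicalMachine(String name, double compAmount) {}
    record Machine(String name, double cpuSpeedMips) {}

    static List<String[]> mapByBestAttribute(List<LogicalMachine> unassigned,
                                             List<Machine> freeMachines) {
        List<LogicalMachine> lms = new ArrayList<>(unassigned);
        List<Machine> ms = new ArrayList<>(freeMachines);
        // Largest computational demand first ...
        lms.sort(Comparator.comparingDouble(LogicalMachine::compAmount).reversed());
        // ... paired with the best attribute value (fastest CPU) first.
        ms.sort(Comparator.comparingDouble(Machine::cpuSpeedMips).reversed());

        List<String[]> mapping = new ArrayList<>();
        for (int i = 0; i < Math.min(lms.size(), ms.size()); i++) {
            mapping.add(new String[] { lms.get(i).name(), ms.get(i).name() });
        }
        return mapping;
    }

    public static void main(String[] args) {
        List<LogicalMachine> lms = List.of(
                new LogicalMachine("M1", 5e5), new LogicalMachine("M2", 2e5));
        List<Machine> pool = List.of(
                new Machine("hostA", 3000), new Machine("hostB", 9000));
        mapByBestAttribute(lms, pool)
                .forEach(p -> System.out.println(p[0] + " -> " + p[1]));
        // Expected: M1 -> hostB (largest CompAmount to fastest free machine).
    }
}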
7
Validation and results
Similarly to [29], we conducted two types of experiments to validate our mapping heuristic. In the first experiment we vary the number of tasks from 3 to 8 and fix the number of machines to 10; this experiment shows the sensitivity of the heuristic to the number of tasks. In the second experiment we vary the number of machines from 5 to 10 and fix the number of tasks to 5; this experiment shows the sensitivity of the heuristic to the number of machines. Moreover, we analyzed the results statistically using different types of plots.

As in [29], we used three types of applications in each experiment: concurrent, concurrent-overlapped, and pipeline. The ATGs (constructed using JPVAC [5]) and the task behavioral graphs (constructed using JPVTC [4]) for these applications are shown in Figure 11, Figure 12, and Figure 13, respectively. Each task in these applications is assigned to a different logical machine. In the concurrent application the task computations and inter-task communication are sequential [29], i.e. the computation and communication operations within the same task cannot run concurrently, but operations in different tasks may run concurrently, as shown in Figure 11(b). In the concurrent-overlapped application the task computations and inter-task communication are overlapped [29], i.e. computation and communication operations can run concurrently regardless of whether they belong to the same task or not, as shown in Figure 12(b). Finally, in the pipeline application a computation stage cannot start until the previous stage has finished [9], i.e. a computation operation in a task cannot start until the computation operation in the previous task has finished and a ready/result message has been received from the previous task, as shown in Figure 13(b). Note that the ATGs of the concurrent and concurrent-overlapped applications are the same, but they differ from the ATG of the pipeline application. Moreover, we show only the 3-task instances of the various application types; similar, but expanded, instances are used when a larger number of tasks is used in an experiment. For example, in the 8-task instance of the concurrent application, the behavioral graphs of
tasks T2 to T8 would be the same as those of tasks T2 and T3 in the 3-task instance. Moreover, the behavioral graph of task T1 in the 8-task instance would have seven SyncWrite elements instead of two, in order to send messages to tasks T2-T8, respectively.
Figure 11: Concurrent application: (a) its ATG, and (b) from left to right the behavioral graphs for tasks T2, T1, and T3, respectively.
Figure 12: Concurrent-overlapped application: (a) its ATG, and (b) from left to right the behavioral graphs for tasks T2, T1, and T3, respectively.
Figure 13: Pipeline application: (a) its ATG, and (b) from left to right the behavioral graphs for tasks T1, T2, and T3, respectively.
We generate 10,000 environments for each evaluation point (i.e. a given fixed number of tasks and machines) in an experiment. An environment corresponds to a set of parameters used to generate and annotate the task behavioral graphs and the FCMG. For each environment, we transform the FCMG into a FCCG and generate an ALMG using the ATG and the behavioral graphs. Moreover, we compare the AMH, WH, and optimal mapping TCT times. The ranges of the application and resource parameters used in the different environments are shown in Table 1. For each environment, the nodes, links, and elements in the corresponding graphs are annotated with values drawn from the corresponding ranges using a uniform pdf. Note that in a comma-separated range we uniformly draw values only from the set of specified values, while in a dot-separated range we uniformly draw any integer value within the range.
Parameter                               Range
CompAmount (MInstructions)              [1e4 … 1e6]
CommSize (KBytes)                       [1 … 1e4]
CPUSpeed (MIPS)                         [1 … 10000]
Network Link Throughput (Kb/sec)        [50, 1e3, 1e4, 1e5]

Table 1: Task and resource parameters.
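The sampling rule above (draw uniformly within a dot-separated interval, or uniformly from the listed values of a comma-separated set) can be sketched as follows; this is only an illustration, and the method names and the use of java.util.Random are assumptions rather than the simulator's actual code.

import java.util.Random;

// Illustrative sketch of drawing one environment's parameters from Table 1:
// dot-separated ranges draw any integer uniformly within the bounds,
// comma-separated ranges draw uniformly from the listed values only.
public class EnvironmentSamplingSketch {

    private static final Random RNG = new Random();

    // Dot-separated range, e.g. CompAmount in [1e4 … 1e6] MInstructions.
    static long drawFromInterval(long low, long high) {
        return low + (long) (RNG.nextDouble() * (high - low + 1));
    }

    // Comma-separated range, e.g. Throughput in [50, 1e3, 1e4, 1e5] Kb/sec.
    static double drawFromSet(double... values) {
        return values[RNG.nextInt(values.length)];
    }

    public static void main(String[] args) {
        long compAmount = drawFromInterval(10_000, 1_000_000);   // MInstructions
        long commSize   = drawFromInterval(1, 10_000);           // KBytes
        long cpuSpeed   = drawFromInterval(1, 10_000);           // MIPS
        double linkThroughput = drawFromSet(50, 1e3, 1e4, 1e5);  // Kb/sec
        System.out.printf("%d %d %d %.0f%n",
                compAmount, commSize, cpuSpeed, linkThroughput);
    }
}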
Let us take an example to clarify how we draw the values from Table 1 in each environment. Consider evaluation point (3 tasks, 10 machines) in experiment one for a pipeline type of application. To generate an environment for this evaluation point, we draw values for the following parameters: the CompAmounts of the codeSegments of tasks T1, T2, and T3, respectively; the CommSizes of the two SyncWrite elements in tasks T1 and T2; the CPUSpeeds of the 10 machines; and the Throughput of each of the 45 logical links in the FCMG of the 10 machines.

We generated (1-CDF) plots (see Figure 14) for the ratio of the WH completion time over the AMH completion time for all the evaluation points in the two experiments. Given a heuristic, the (1-CDF) plot shows the probability Y of getting a ratio that is greater than a target ratio value X. For example, in Figure 14(a), for a ratio value X equal to 5 (i.e. the WH completion time is 5 times larger than the AMH completion time), the values of Y for the (3 tasks, 10 machines) and (7 tasks, 10 machines) evaluation points are 0.21 and 0.13, respectively, which means that in 21% and 13% of the 10,000 environments (i.e. 2,100 and 1,300 environments, respectively) for these two points the WH completion time is more than 5 times larger than the AMH completion time. Moreover, in experiments 1 and 2, for the concurrent application, the plots show that X increases as flexibility increases in the (0.1