Cluster Comput DOI 10.1007/s10586-011-0166-7
Aspect-oriented development of cluster computing software
Hyuck Han · Hyungsoo Jung · Heon Y. Yeom
Received: 2 September 2010 / Accepted: 17 May 2011 © Springer Science+Business Media, LLC 2011
Abstract  In complex software systems, modularity and readability tend to be degraded owing to inseparable interactions between concerns, the distinct features of a program. Such interactions result in tangled code that is hard to develop and maintain. Aspect-Oriented Programming (AOP) is a powerful method for modularizing source code and for decoupling cross-cutting concerns. A decade of growing research on AOP has brought the paradigm into many exciting areas. However, pioneering work on AOP has not yet matured enough to enrich the design of distributed systems with the refined AOP paradigm. This article investigates three case studies, using the AOP paradigm, that cover time-honored issues in the cluster computing community: fault-tolerant computing, network heterogeneity, and object replication. The aspects that we define here are simple, intuitive, and reusable. Our extensive experience shows that (i) AOP can improve the modularity of cluster computing software by separating the source code into base and instrumented parts, and (ii) AOP helps developers to deploy additional features to legacy cluster computing software without harming code modularity or system performance.

Keywords  Aspect-Oriented Programming · Fault tolerance · Heterogeneity · Object replication · Message-passing interface · Key-value storage

A preliminary version [1] of this paper was presented at IEEE Cluster 2007, Austin, Texas, USA.

H. Han · H.Y. Yeom
School of Computer Science and Engineering, Seoul National University, Seoul, 151-742, Korea
e-mail: [email protected] (H. Han); [email protected] (H.Y. Yeom)

H. Jung
School of Information Technologies, University of Sydney, Sydney, NSW 2006, Australia
e-mail: [email protected]
1 Introduction

All programming paradigms include an abstraction level for grouping or encapsulating similar distinct functionalities (so-called concerns). For example, Object-Oriented Programming (OOP) provides class inheritance. Nevertheless, some concerns are inevitably scattered over many parts of an entire program irrespective of roles, and this degrades the modularity and readability of the program source code. As a result, separation of concerns has become a major principle in designing, implementing, and maintaining any kind of system. Clean separation of concerns, however, is often impossible due to cross-cutting concerns, which are concerns that affect other concerns. Examples of cross-cutting concerns are logging and locking, which are typically inserted into a base code.

Aspect-Oriented Programming (AOP) was proposed to address cross-cutting concerns. In AOP, a cross-cutting concern is encapsulated in one place called an aspect. AOP is also useful for incremental development; instead of modifying various places within a base source code, an additional software feature may be specified as an aspect. Since the initial concept of AOP was introduced in 1997, many tools, such as AspectJ [2] and AspectC/C++ [3], have been developed, and many research projects [4–8] have broadened its target areas, including operating systems and middleware. The outcomes of such research projects and AOP-related symposia have attracted many researchers and developers,
and AOP has been established as a promising software development methodology, especially for modularizing source code. However, the AOP paradigm has been applied to few cases in cluster computing thus far [9], in spite of active research on applying AOP to many different areas. In this paper, we present three challenging case studies that apply AOP to cluster computing, and show that an AOP-based approach, namely, Aspect-Oriented Software Development (AOSD), can be useful for cluster computing software.

For the last decade, clusters of computers have been replacing traditional mainframe systems for supercomputing applications. As the hardware resources have changed, the parallel software platform, on which parallel programs run, has also reflected this shift. Currently, the Message Passing Interface (MPI) is regarded as the de facto standard, and many research projects are being conducted to improve MPI. The growing size of cluster systems underlines the importance of fault tolerance in parallel software, since a failure could render all parallel computations useless. Therefore, developing a version of MPI that provides fault tolerance has been an important issue in cluster computing [10–12]. There has also been much effort to integrate heterogeneous clusters [13, 14], which requires a message gateway that connects heterogeneous clusters, as well as extensions to MPI software for message routing among clusters.

Another popular usage of cluster computing is object replication, which is deployed by many Internet services. Internet service providers treat their contents as objects and store the objects using key-value storage systems such as BerkeleyDB [15] and the popular Memcached [16]. Key-value storage supports replication over many different physical machines for reliability, load balancing, and scalability. Key-value storage systems such as Apache CouchDB [17], Amazon's Dynamo [18], and Facebook's Cassandra [19] are now deployed in the clusters (datacenters) of enterprises, and this trend shows that key-value storage is a key technique of Internet online service providers. Accordingly, endowing key-value storage systems designed for a single machine (i.e., BerkeleyDB and Memcached) with replication capability to increase service availability and throughput is clearly worthwhile work.

The improvements described above are achieved by blending the source code of the core software with secondary code for fault tolerance, message routing, or replication. Unfortunately, however, the resulting source code may be very hard to read, since the code for additional features is scattered in various places across many source files, and the blended source code is even more complex. As a result, maintenance of such software is very costly, since it is a substantial burden for developers to understand the previous code for an upgrade or change. These observations have motivated our study. By using AOP, the original code of the core software
remains intact, and the code for each additional feature is written in a separate file. Therefore, AOP greatly improves the overall modularity and readability of the code.

In this paper, we present three interesting case studies: fault-tolerant MPI, MPI communication on heterogeneous clusters, and object replication for key-value storage systems. For each case study, we first introduce the required software components, namely, the prerequisites. For example, fault-tolerant MPI requires (plain) MPI software and a checkpoint/restart library, among others. Then we explain how to integrate the components using AOP. Note that the source code of the components is not modified in our implementation. Only the interaction among the components is implemented using AOP. This enhances the modularity and readability of the program logic immensely.

Implementing fault-tolerant MPI and MPI between heterogeneous clusters requires the Job Management System, the Checkpoint/Restart library, and the Message Gateway. The Job Management System manages execution contexts, detects failures, and recovers from them. The Checkpoint/Restart library stores the entire image of each process to a file and restores the process from the saved file. The Message Gateway routes messages to their appropriate destinations. Object replication for key-value storage requires multicast middleware that supports a wide variety of consistency models. Our work does not focus on how to build those prerequisites but on how to merge those features into the core software using AOP. Therefore, we reuse existing libraries (the Checkpoint/Restart library and the multicast middleware) or implement prototype versions of systems (the Job Management System and the Message Gateway) to obtain the necessary software programs.

We have also evaluated the effectiveness and performance of our proposed methodology. For fault-tolerant MPI, without using AOP, the additional code is scattered over dozens of points in nine files. However, in our AOP-based approach, the original code is not modified, and the additional code is stored in only three separate files. Thus, the modularity and readability of the code are greatly improved. Furthermore, compared to non-AOP-based versions, our AOP-based approach incurs no performance overhead. We believe that AOP helps developers in the cluster computing world avoid tangled or scattered source code when they integrate software components. The main contributions of this article are as follows:

– We integrate existing software by using three, two, and two aspects to implement "fault-tolerant MPI," "MPI between heterogeneous clusters," and "object replication for key-value storage," respectively. The aspects that we define are simple, intuitive, and reusable.
– We compare AOP-based software to software developed without using AOP through extensive evaluations. From our qualitative and performance evaluations, we see that AOP significantly
improves the code readability as well as the modularity, and that AOP-based software has the same performance and scalability as similar software that is developed without using AOP.
– We introduce important lessons we learned while developing AOP-based cluster computing software. We also classify our aspects into three groups according to their roles.

The remainder of this paper is organized as follows. Section 2 discusses background and related work. Section 3 introduces the components for each case study, namely the prerequisites. Section 4 explains how to integrate the components using AOP. Qualitative and quantitative evaluations of our implementations are presented in Sect. 5. In Sect. 6, we present lessons learned from this study, and finally, in Sect. 7, we conclude the paper.
2 Background and related work

2.1 AOP

AOP is a programming methodology for separating cross-cutting concerns, which is usually difficult in traditional programming models such as OOP and procedural programming. AOP complements these models by allowing the developer to dynamically modify the static model to create a system that can grow to meet new requirements. Just as objects or functions in the real world can change their states during their lifecycles, an application can adopt new characteristics as it develops. There are five important AOP terms.

– Cross-cutting concerns: Even though most classes or procedures in a programming model perform a single, specific function, they often share common or secondary requirements with others. Those requirements are the cross-cutting concerns.
– Advice: the additional code that developers want to apply to the existing model.
– Joinpoint: the point of execution in the application at which a cross-cutting concern needs to be applied.
– Pointcut: the program construct that designates a specific joinpoint and collects specific context at that point.
– Aspect: the combination of a pointcut and advice.

Aspects basically provide three declarations of advice, namely, before, after, and around advice, which run before, after, and instead of the specified joinpoints, respectively. Deployment of advice in the existing model is known as weaving, which can be divided into two main approaches: source-level weaving and runtime-level weaving.
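To make these terms concrete, the hand-written C fragment below sketches what a source-level weaver conceptually produces for a single joinpoint. The function and advice names are hypothetical, and the wrapper is shown only to illustrate before, after, and around advice; it is not the syntax of any particular AOP tool.

```c
#include <stdio.h>

/* Base concern: an ordinary function that a pointcut could designate as a
 * joinpoint (the name is hypothetical). */
static int send_message(const char *buf, int len)
{
    (void)buf;            /* ... actual message-transfer logic would go here ... */
    return len;
}

/* A cross-cutting concern (here, logging) expressed as advice. */
static void before_advice(int len) { printf("about to send %d bytes\n", len); }
static void after_advice(int rc)   { printf("send returned %d\n", rc); }

/* What source-level weaving conceptually generates: callers are redirected to
 * this wrapper, so the advice runs at the joinpoint without the base function
 * being edited by hand.  Around advice would replace the call altogether. */
int send_message_woven(const char *buf, int len)
{
    before_advice(len);               /* before advice           */
    int rc = send_message(buf, len);  /* original joinpoint body */
    after_advice(rc);                 /* after advice            */
    return rc;
}
```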
2.2 Related work

There are many AOP tools, such as AspectJ, AspectC/C++, Aquarium [20], and AOJS [21]. Many research projects have applied AOP to different areas using or modifying those tools. Kienzle et al. [6] implemented the ACID properties for transactional objects. Cunha et al. [4] presented a collection of well-known, high-level concurrency patterns and mechanisms, such as synchronization and barrier. In [5, 22, 23], a loop join point model, a region aspect, and a synchronized block join point were proposed for parallelization in AspectJ. Rashid et al. [8] proposed a persistence model of Java objects using an AOP approach. In [7, 24–27], operating systems such as eCos, AUTOSAR OS, and Linux were refactored and extended using aspect-oriented programming. Cannon et al. [28] defined authority aspects to enforce the security of desktop clients. These results are based on the OOP model using AspectJ or AspectC/C++. Although the target of AOP is not only the OOP model, most research has focused on it. However, the majority of cluster computing software programs are not based on the OOP model. The implementation of MPI is based on the procedural programming model, and in the case of MPICH, it is written in ANSI C. Accordingly, our case studies are related to the integration of the primary code and other concerns based on the procedural programming model. Our technique can also be easily adopted by other programming systems such as OpenMP [29] and Cilk [30].

In [31–33], AOP was used to replicate objects for fault tolerance. In [34], AOP was used to introduce fault tolerance against transient faults, permanent hardware faults, and residual software faults in embedded systems applications. In this work, in contrast, AOP is used to deploy a consistent distributed checkpointing protocol to a well-known MPI implementation for fault tolerance of parallel applications. Many fault-tolerant MPI systems such as MPI with BLCR [35], MPICH-PCL [10], and MVAPICH2-ivc [11, 12] have been proposed. MPICH-PCL is a new implementation of a blocking-checkpointing mechanism for fault tolerance inside MPICH2. For this implementation, a new channel (ft-sock), based on the TCP sock channel, is introduced to support blocking checkpointing. Recently, Berkeley Lab Checkpoint/Restart (BLCR) [35] has been integrated into LAM/MPI, MVAPICH2, and Open MPI. In LAM/MPI with BLCR, out-of-band communication channels in LAM are used to clear the MPI communication channels, which is necessary for a blocking-checkpointing mechanism. W. Huang et al. introduced virtual machine (VM) migration over InfiniBand with MVAPICH for fault tolerance and proposed a new VM-aware MPI library, called MVAPICH2-ivc. MVAPICH2-ivc supports efficient VM-aware communication for shared memory communication between different VMs on the same physical host. MPICH-Madeleine [13]
and MPICH-SCore [14] are well-known MPI implementations supporting communication between clusters with heterogeneous networks. These implementations have forwarding nodes that relay messages to the target process. All these systems produced meaningful results by adding fault tolerance or message routing capabilities to the original MPI source code, and AOP can deploy such new functionality without tangling or scattering the existing code.

Many key-value storage systems such as BerkeleyDB, Memcached, Cassandra, and Dynamo support two primitive operations: the get (or read) operation returns the value stored under the given key, and the set (or write) operation writes a new value for the key. BerkeleyDB and Memcached do not support replication, while Cassandra and Dynamo do. From the viewpoints of scalability, fault tolerance, and high performance of legacy systems, enabling BerkeleyDB or Memcached to support replication is valuable work. The consistency model is also an important design consideration for replication in key-value storage systems. In many cases, such as Google, Amazon, eBay, and Facebook, strong consistency [36, 37] is sacrificed in order to guarantee performance, while weaker consistency such as eventual consistency [18, 38] is widely adopted. In this study, we build a key-value storage system that supports replication with eventual consistency via the integration of Memcached and JGroups [39] using AOP.
3 Prerequisites

The two base versions of MPI for fault-tolerant MPI and MPI between heterogeneous clusters are MPICH-GM [40] and MPICH-VMI [41]. In previous work, we used MPICH-GM to implement fault tolerance [42], and modified MPICH-VMI to enable MPI processes on heterogeneous clusters to communicate with each other [43]. For object replication of key-value storage systems, Memcached [16] and JGroups [39] are used as the key-value storage and multicast middleware, respectively. As we will explain in detail, we can apply AOP to the two different MPI variants and Memcached successfully, independent of their implementation details. In this article, we abbreviate "fault-tolerant MPI," "MPI between heterogeneous clusters," and "object replication for key-value storage" as FT MPI, HETERO MPI, and OR-Memcached, respectively. As mentioned in Sect. 1, we now describe the important programs that are indispensable in building the above systems.

Job Management System. The Job Management System is needed to support FT MPI and HETERO MPI; it manages multiple MPI processes running on multiple nodes and monitors their execution environment. The Job Management System also has to coordinate global
and consistent checkpoints, as well as help the system recover from multiple failures gracefully, because FT MPI is built on a coordinated checkpointing and rollback-recovery scheme [44]. In HETERO MPI, the Job Management System should maintain topology information about multiple heterogeneous clusters as well as the location of the Message Gateway, since HETERO MPI requires the Message Gateway to relay messages between clusters. All output and error messages from the MPI processes are redirected by the Job Management System. Hence, the management system itself must be lightweight to avoid affecting the performance of the MPI processes. The Job Management System has two main components, defined by role and layer location: the Central Manager and the Job Managers, shown in Fig. 1(a). The role of the Central Manager is to manage all system functions and to maintain the system state. Meanwhile, Job Managers manage all MPI processes and redirect output and error messages from the MPI processes to the Central Manager.

Checkpoint/Restart library. The ckpt library [45] provides user-level process checkpointing functionality to an ordinary program and supports asynchronous checkpoints triggered by signals sent by other processes. Ckpt writes its checkpoint of a program to a checkpoint file. A checkpoint file usually consists of the in-memory state of the running program, including the CPU register state, stack pointer, program counter, and all content residing in volatile memory, i.e., content that can easily be dumped using /proc/self/maps. A dumped checkpoint file is carefully crafted to be an ordinary executable (in ELF format) that, when executed, continues the program from the point at which it was checkpointed.

Message Gateway. The most important and unique role of the Message Gateway is to relay messages between processes on heterogeneous clusters. It forwards messages over different networks such as Myrinet and Infiniband, as shown in Fig. 1. The Message Gateway must relay all types of messages that MPICH-VMI supports because HETERO MPI is based on MPICH-VMI. Basically, MPICH-VMI uses eager and rendezvous protocols for small and large messages, respectively. Additionally, it has a special buffer for barrier operations (barrier protocol). For that reason, the Message Gateway is designed to handle these messages efficiently. Details of the Message Gateway are found in [43].

Memcached. Memcached is a high-performance key-value storage system that maintains every object in memory. The system's main objective is to accelerate dynamic web applications by alleviating database load through memory-based caching.
Fig. 1 Prerequisite systems
For this study, we defined a generic remote service for key-value storage that communicates with the local Memcached daemon using Java RMI and an open-source Java client library for Memcached, since Memcached itself supports neither a remote object service nor replication.

JGroups. JGroups is an open-source toolkit that provides reliable multicast communication. The main features of JGroups are (1) group management, including group creation, view management, and group deletion, (2) membership management, such as member join/leave, and (3) sending and receiving messages in point-to-multipoint or point-to-point fashion. JGroups supports FIFO consistency between any two end points, and the eventual consistency of OR-Memcached is based on this FIFO consistency.

Note that the Job Management System and the Message Gateway are newly designed components, while the other programs exploit open-source software. None of these systems by itself guarantees the FT, HETERO, or OR functionality. In other words, developing these extended features requires mechanisms that merge the aforementioned programs into the original versions of the core software, MPICH-GM, MPICH-VMI, and Memcached, to achieve the operational goals. These mechanisms, which will be explained in the next section, are based on an AOP concept.
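As a rough illustration of the Message Gateway's store-and-forward role described above, the sketch below shows a minimal relay step over plain POSIX descriptors. The real gateway speaks VMI over Myrinet and Infiniband and handles the eager, rendezvous, and barrier protocols separately, so this fragment only conveys the basic forwarding structure; all identifiers are ours and not part of the actual implementation.

```c
#include <sys/types.h>
#include <unistd.h>

/* Toy store-and-forward step.  In the real gateway the two sides would be
 * VMI endpoints on Myrinet and Infiniband rather than plain descriptors. */
static ssize_t relay_once(int from_fd, int to_fd)
{
    char buf[64 * 1024];
    ssize_t n = read(from_fd, buf, sizeof(buf));   /* receive from one cluster */
    if (n <= 0)
        return n;                                  /* peer closed or error     */

    ssize_t off = 0;
    while (off < n) {                              /* forward everything read  */
        ssize_t w = write(to_fd, buf + off, (size_t)(n - off));
        if (w < 0)
            return -1;
        off += w;
    }
    return n;
}
```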
4 Blending concerns with aspects

This section describes our AOP approach to developing FT MPI, HETERO MPI, and OR-Memcached. Three aspects were designed for FT MPI, two aspects for HETERO MPI, and two aspects for OR-Memcached. We use the Aspicere [46] tool for FT MPI and HETERO MPI, and the AspectJ tool for OR-Memcached. The Aspicere tool takes our aspects and the source code as input, pre-compiles them, and then generates new source code. We then compile the new code using a standard C compiler and install the system. The AspectJ tool automatically compiles our aspects and the source code with the help of the Java compiler, and we install the newly compiled programs.

4.1 Building FT MPI

Initialization. The Central Manager receives the job specification from the user and launches Job Managers on appropriate nodes. Then, each Job Manager spawns an MPI process. As shown in Fig. 1(a), the Central Manager, Job Managers, and MPI processes run independently.
Fig. 2 Initialization
Thus, all MPI processes need special APIs to communicate with the Job Management System (i.e., recvTypedMessage and sendTypedMessage). Each MPI process can exchange its own communication information, such as board id and port number, with the other processes using these APIs. Then, the MPI process initializes a checkpoint signal handler that processes checkpoint commands from the Job Management System. We could incorporate the code related to initialization into MPICH-GM without the help of AOP, as shown in Fig. 2(a). By using AOP, we can instead separate the above concerns from the modified code using the Initialization aspect. Figure 2(b) shows that our concerns, such as registering the checkpoint signal handler and exchanging communication information, are performed instead of the original MPID_getConfigInfo function (around advice).
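The following minimal C sketch conveys what the body of the Initialization around advice does in place of MPID_getConfigInfo. MPID_getConfigInfo, recvTypedMessage, and sendTypedMessage are named in the text; the helper routines below, their signatures, and the choice of SIGUSR1 are assumptions made for illustration.

```c
#include <signal.h>

/* Hypothetical helpers standing in for the real Job Management System APIs. */
extern void exchange_comm_info_with_peers(void);  /* board id, port number, etc. */
extern void handle_checkpoint_command(void);      /* defers or performs a checkpoint */

static void ckpt_signal_handler(int sig)
{
    (void)sig;
    handle_checkpoint_command();
}

/* Body of the Initialization around advice: it runs in place of the original
 * MPID_getConfigInfo joinpoint, so the base source file is left untouched. */
void initialization_around_advice(void)
{
    signal(SIGUSR1, ckpt_signal_handler);   /* register the checkpoint signal handler */
    exchange_comm_info_with_peers();        /* exchange communication information     */
    /* the original MPID_getConfigInfo body is intentionally not executed */
}
```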
Consistent Checkpoint. FT MPI adopts the coordinated checkpointing scheme because it is relatively easy to implement and does not incur large overhead [47, 48]. Thus, the Central Manager has the role of issuing checkpoint commands periodically to all MPI processes through the Job Managers. Upon receiving a checkpoint command, each process saves its entire state to a stable storage device. Then, the process informs the Central Manager of its completion.

To perform coordinated checkpointing safely, the entire system must preserve two important conditions: when processes take a global checkpoint, there should be (1) no in-transit messages and (2) no orphan messages between any two processes. For these conditions, we first define critical sections inside every send/recv function of MPICH-GM.

Figure 3(a) shows the rendezvous communication used for large messages. When the receiver receives a "Request-to-send" message from the sender, the receiver sends "Ok-to-send" messages to the sender. The sender, after receiving "Ok-to-send" messages, sends the data directly from the source to the destination. Because of limitations of the receiver's DMA memory, the sender splits the source data into chunks of a pre-defined size. After the receiver delivers the data, the receiver reallocates this memory region, and the address of this memory region is written in a new "Ok-to-send" message, which is sent to the sender. The number of "Ok-to-send" messages that can be sent never exceeds the pre-defined size.

Figure 3(b) explains the eager communication used for small messages. To send a message, it is first copied into a send buffer and then forwarded to the destination. When MPI_Recv is called by the receiver, it checks the unexpected queue first. If the message exists in this queue, the receiver collects it.
Fig. 3 Critical sections
If the message is not in this queue, the receiver waits for the message by periodically checking the event queue. When the message arrives at the destination, the Myrinet interface puts the message in a fixed receive buffer, and the data portion of the message is copied to the allocated memory of the receiver.

We define the critical sections of both procedures as the red lines in Figs. 3(a) and 3(b). A critical section indicates that the process is currently executing communication operations (i.e., sending or receiving a message through the Myrinet interface). In this region, the process cannot snapshot its current state without violating the conditions above. If the process receives a checkpoint command in a critical region, the process delays the command and marks it as pending.
In addition to the functions in Figs. 3(a) and 3(b), more functions should be defined as critical sections (i.e., MPI_Isend/Irecv). Moreover, the number of critical sections increases because MPICH-GM exploits shared memory devices for processes running on the same Symmetric MultiProcessor (SMP) machine, so that send/recv functions based on shared memory devices also need to be defined as critical sections. To encapsulate the existing send/recv functions of MPICH-GM containing critical sections, we can modify the send/recv functions of MPICH-GM using the globally defined variable critical_section. Figure 4(a) shows an example of the modified eager communication functions.
Fig. 4 Consistent checkpointing aspect
Before the critical section, the value of critical_section is incremented. After the critical section, the value is decremented, and the process issues a checkpoint command to itself again if it has a pending command and the value of critical_section has reached zero. This procedure is similar to the lock/unlock mechanisms in operating systems. From Fig. 4(a), we can see that the modified code contains cross-cutting concerns: one concern is the message transfer, and the other is the critical section. Therefore, we can instrument the cross-cutting concerns in Fig. 4(a) (using an aspect) for weaving, as shown in Fig. 4(b). We set the entry and exit points of each critical section as joinpoints to which the new advice about consistent checkpointing is applied.

The non-critical state, in which the value of critical_section equals zero, cannot by itself guarantee the consistency conditions. Assume that processes P1 and P2 receive a checkpoint command before P1 enters the MPI_Recv function and after P2 exits MPI_Send, and that both functions are executed through eager communication. Then, neither P1 nor P2 is in a critical region, but this situation does not guarantee that the message which P2 sends to P1 has been delivered to P1's address space. Therefore, we need the following mechanism that guarantees message delivery.
Fig. 5 Rollback recovery aspect
Upon entering the non-critical state, a process invokes a broadcast function before the do_checkpoint function of the checkpoint/restart library (before advice), as shown in Fig. 5(a). This technique is similar to CoCheck's [49] Ready Message. Performing the broadcast guarantees both properties at the same time (no in-transit and no orphan messages between any two processes), because the broadcast messages push any messages that were previously sent to the receiver. Pushed messages are delivered to the receiver's unexpected queue in user-level memory, so that they can be included in the checkpoint file. These in-transit messages are valid upon recovery in the receiver's user-level memory. This technique is similar to Chandy and Lamport's [50] distributed snapshot algorithm, and it is valid only under the assumption that all channels built on Myrinet have a FIFO delivery property.
Of course, channels in MPICH-GM guarantee FIFO-order delivery. Once the broadcasting is complete, each MPI process is checkpointed to a stable storage device. Subsequently, the Central Manager determines whether all checkpoint files have been generated. If so, it confirms the checkpoint completion and increases the version of the global checkpoint.

Rollback Recovery. When the Job Management System detects a failure, the Central Manager starts to coordinate the recovery procedure and enters a new epoch. All Job Managers receive a failure notification from the Central Manager, and they reincarnate the MPI processes from the last (most recent) checkpoint images (fork/exec). In the case of node failures that kill the Job Managers as well, the Central Manager launches the Job Managers again on another node. Each reincarnated MPI process starts not at the main function but at the end of the do_checkpoint function [45]. Then, since the return value of the function at reincarnation is not OK, the process executes the else block, which contains the reinitialization procedure. Figure 5(a) shows the reinitialization procedure of each MPI process. In summary, the Job Management System detects a failure and coordinates the rollback recovery, and all reincarnated MPI processes perform the reinitialization procedure. We can define the reinitialization procedure as shown in Fig. 5(b).

To distinguish reincarnation from normal checkpointing, we adopt the following method. Each MPI process obtains the epoch value from its environment (epoch_value) before the process executes the issue_ckpt_command function, and it saves the epoch value in the checkpoint image. When a Job Manager reincarnates an MPI process, the Job Manager receives a new epoch value from the Central Manager and sets the environment variable epoch_value to the new value. Then, after the MPI process executes the issue_ckpt_command function, the epoch value from the checkpoint image (epoch_value) differs from that of the new environment (t_epoch_value); hence, the process performs the reinitialization procedure.

Special care must be taken when the entire system starts recovery from failures. Assume that two processes run on the same node (SMP machine) and communicate with each other using shared memory before a failure. After reinitialization, the processes might be allocated to different nodes. This means that the device information saved in their checkpoint images is no longer valid. Therefore, the affected processes should update their device configuration (device reconfiguration). Moreover, the processes should reconfigure topology information if the new topology is incompatible with the one in the checkpoint image (topology reconfiguration). Figure 6 represents a situation that requires device and topology reconfiguration. We separate these steps (device reconfiguration and topology reconfiguration) from reinitialization since they are specialized to MPICH-GM.
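As a concrete illustration of the critical-section handling and the reincarnation check described above, the following C sketch mirrors the behavior of the Consistent Checkpoint and Rollback Recovery advice. The names critical_section, do_checkpoint, and the epoch_value environment variable come from the text; the helper names, the use of SIGUSR1, and the exact signatures are assumptions made for this sketch.

```c
#include <signal.h>
#include <stdlib.h>
#include <string.h>

extern void do_checkpoint(void);                /* ckpt library entry point            */
extern void broadcast_ready_message(void);      /* pushes in-transit messages (before
                                                   advice on do_checkpoint)            */

static volatile sig_atomic_t critical_section = 0;  /* > 0 while in a send/recv        */
static volatile sig_atomic_t ckpt_pending     = 0;  /* command arrived while critical  */

/* Invoked when a checkpoint command from the Job Management System arrives. */
void handle_checkpoint_command(void)
{
    if (critical_section > 0) {     /* inside a critical section:             */
        ckpt_pending = 1;           /* delay the command and mark it pending  */
        return;
    }
    broadcast_ready_message();      /* guarantees no in-transit/orphan messages */
    do_checkpoint();
}

/* Before advice woven at the entry joinpoint of each send/recv function. */
void enter_critical_section(void)
{
    critical_section++;
}

/* After advice woven at the exit joinpoint of each send/recv function. */
void leave_critical_section(void)
{
    critical_section--;
    if (critical_section == 0 && ckpt_pending) {
        ckpt_pending = 0;
        raise(SIGUSR1);             /* re-issue the checkpoint command to itself */
    }
}

/* After do_checkpoint returns (which also happens at reincarnation), comparing
 * the epoch saved in the checkpoint image with the current environment tells
 * the process whether it must run the reinitialization procedure. */
int needs_reinitialization(const char *saved_epoch_value)
{
    const char *t_epoch_value = getenv("epoch_value");  /* set by the Job Manager */
    return t_epoch_value != NULL && strcmp(saved_epoch_value, t_epoch_value) != 0;
}
```

Keeping critical_section as a stand-alone global rather than a field of an internal MPICH-GM structure is the design choice revisited in Sect. 6.2.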
Fig. 6 Consistent and transparent recovery
Fig. 7 Initialization aspect
4.2 Building HETERO MPI

Initialization. Figure 7 shows the initialization procedure for HETERO MPI. The initialization of HETERO MPI is very similar to that of FT MPI.
Fig. 8 Connection aspect
Fig. 9 OR-Memcached architecture
The main difference is that each MPI process exchanges communication information not only with the other processes but also with the Message Gateway. Then, if a process learns that other processes are executed on different clusters, it puts a special mark on the processes running on remote clusters. Another difference is that the checkpoint signal handler is not registered.

Connection. In the initialization phase, an MPI process receives communication and topology information. Thus, if the target process runs on the same cluster, the source connects to the target directly. Otherwise, the source process connects to the Message Gateway instead of the target, using the information above. Figure 8 shows the code with and without AOP. This technique guarantees that HETERO MPI can support communication across heterogeneous clusters without modifying the send/recv functions or the topology-based collective functions.
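A minimal sketch of the decision made by the Connection around advice is shown below. All identifiers are hypothetical stand-ins, since the actual advice operates on MPICH-VMI's internal connection routines.

```c
/* Hypothetical stand-ins for MPICH-VMI's internal connection machinery. */
struct endpoint {
    const char *host;
    int         port;
    int         cluster_id;
};

extern int             vmi_connect(const char *host, int port);  /* original connect   */
extern int             my_cluster_id(void);
extern struct endpoint gateway_endpoint(void);                    /* from topology info */

/* Around advice on the connection joinpoint. */
int connection_around_advice(const struct endpoint *target)
{
    if (target->cluster_id == my_cluster_id()) {
        /* same cluster: keep the original behaviour and connect directly */
        return vmi_connect(target->host, target->port);
    }
    /* different cluster: connect to the Message Gateway instead; the gateway
     * relays eager, rendezvous, and barrier messages to the real target */
    struct endpoint gw = gateway_endpoint();
    return vmi_connect(gw.host, gw.port);
}
```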
4.3 Building OR-Memcached

Replication. Figure 9 shows the architecture of the replicated Memcached servers. To build OR-Memcached, we used JGroups as the multicast toolkit and introduced a new class (UpdateAgent) for update operations. As shown in Fig. 9, each UpdateAgent receives update operations from JGroups and forwards them to its corresponding Memcached server. On the client side, a popular Java Memcached client library [51] is modified for replication; client applications using the Memcached client library send read requests (get operations) directly to a Memcached server and broadcast update requests (set, add, replace, and delete operations) to all Memcached servers through JGroups. Thus, client programs that use the Memcached client library do not require source modification, since we add the replication feature to the Memcached client library (the MemcachedClient class) with and without AOP. In the initialization phase of the MemcachedClient class, a connection is set up with a JGroups channel for further communication. In the delete and set methods of the MemcachedClient class, the key and value arguments are transferred to all UpdateAgents via the JGroups channel that was established in the initialization phase. Then, each UpdateAgent invokes the actual delete and set on the local Memcached daemon. The set method carries out the set, add, and replace operations of Memcached. Figure 10 shows the code with and without AOP; we build OR-Memcached using two aspects, as with HETERO MPI. Since JGroups supports reliable multicast communication with FIFO consistency, OR-Memcached can support object replication with eventual consistency. The get (or read) method is not modified since it does not affect the state of the given key.

4.4 Summary

In building FT MPI, we define three aspects: Initialization, Consistent Checkpoint, and Rollback Recovery. The most important aspect among these three is the Consistent Checkpoint aspect. This aspect is valid only under the FIFO property condition. The FIFO property allows us to conclude that we can easily apply this aspect to MVAPICH [52] and MPICH-G2 [53], since the Infiniband architecture and TCP/IP preserve the FIFO condition as well. Moreover, we can also reuse the reinitialization and topology reconfiguration advice in these two MPI variants. To build HETERO MPI, we use two aspects, Initialization and Connection. Fortunately, the initialization of HETERO MPI is similar to that of FT MPI, which implies that the initialization aspect is generic enough to be used in many MPI implementations.
Fig. 10 Initialization and replication aspects
Fig. 11 Comparison of Code Modularity in FT MPI: The scattered code about orthogonal concerns in the above figure is separated as aspect code in three different files. We have a few lines of code for rollback recovery since the concern is implemented by reusing the initialization code
In the case of MPICH-G2, the initialization and connection aspects are very helpful in building a NAT-enabled MPI implementation, since the proxy [54] for MPICH-G2 is similar to the Message Gateway.
5 Evaluation

First, we qualitatively evaluate our methodology in terms of ease of integration, readability, and modularization. Due to the nature of a qualitative evaluation, this evaluation is rather subjective and based on our experience.

FT MPI consists of three components: the Job Management System, the Checkpoint/Restart library, and MPICH-GM. To integrate these components, three aspects and several pointcuts were defined. HETERO MPI also has three components: the Job Management System, the Message Gateway, and MPICH-VMI, which were integrated using two aspects and several pointcuts. In both cases, the integration process was very simple due to automatic weaving of the source code. In fact, we spent most of our time analyzing the communication mechanism of MPICH-GM. For comparison purposes, we also implemented C_FT MPI and C_HETERO MPI. C_FT MPI is a fault-tolerant MPI implementation, based on MPICH-GM, built without using AOP. To build C_FT MPI, we directly modified the related functions of MPICH-GM and the ckpt library to enforce consistent checkpointing and rollback recovery. C_HETERO MPI, which is based on MPICH-VMI, is implemented without using AOP to support multiple clusters with heterogeneous networks. We modified the internal functions of MPICH-VMI to support computation over heterogeneous clusters.
Figure 11 shows the distribution of code for the initialization, consistent checkpoint, and rollback recovery concerns in C_FT MPI (Fig. 11(a)) and FT MPI (Fig. 11(b)). In Fig. 11(a), each colored line indicates embedded code related to the corresponding concern. The code for the concerns is scattered over nine source files, which makes the overall program complex to read and understand. In Fig. 11(b), the original source files are not modified, and the code for each concern is stored in a separate file. As a result, the code readability as well as the modularity is greatly improved.

We also investigate the lines of code (LOC) for the initialization, consistent checkpoint, and rollback recovery concerns that are scattered over the nine source files, and compare them to the LOC of the corresponding aspects. As shown in Table 1, when AOP is used, the LOC is slightly larger because the definition of each advice is added. Some might think that this decreases the benefits of AOP. However, the definition of each advice helps users or developers read the source code and understand the structure of the program. We can verify this using the MPID_CH_Eagerb_send_short function as an example (Fig. 4). Figure 4(a) shows a critical section of an eager communication function without AOP, and the definition of the critical section spans over 70 lines of code. In contrast, owing to AOP, the definition of the critical section spans fewer than 10 lines of code in Fig. 4(b). We see similar results in other critical sections. This improves the readability of the code for critical sections and helps developers grasp the overall structure of each critical section easily.
Table 1  LOC comparison of C_FT_MPI and FT_MPI: only fault tolerance-related code is counted

Component     Source file                  Function name                    Code length (# of lines)
                                                                            C_FT_MPI    FT_MPI
MPICH-GM      gmpi_chshort.c,              MPID_CH_Eagerb_send_short              14        20
              gmpi_chnrndv.c,              MPID_CH_Eagerb_isend_short             14        20
              gmpi_smpshort.c,             MPID_CH_Rndvn_send                     15        21
              gmpi_smprndv.c               MPID_CH_Rndvn_isend                     8        10
                                           MPID_SMP_Eagerb_isend_short            14        20
                                           MPID_SMP_Eager_isend_short             14        20
                                           MPID_SMP_Rndvn_isend                   11        17
                                           MPID_SMP_Rndvn_unxrecv_posted           4         6
              gmpi_priv.c,                 gmpi_put_data_callback                  1         3
              gmpi_smppriv.c               gmpi_packet_recv_event                  1         3
                                           smpi_post_send_ok_to_send               1         3
                                           smpi_recv_done_get                      1         3
                                           smpi_recv_ok_to_send                    1         3
                                           smpi_recv_get                           1         3
              adi2init.c                   MPID_Init                              27        33
                                           MPID_DeviceCheck                       18        30
ckpt library  signal.c                     asynchandler                          371       395
              ckpt.c                       ckpt_save                               2         8
Total lines                                                                      518       618
In addition, the advice about critical sections in the Consistent Checkpoint aspect can help us deploy the fault-tolerant feature to other MPI systems such as MPICH-P4 and MVAPICH, since they have communication patterns (e.g., eager and rendezvous communication) and a program structure similar to those of MPICH-GM.¹

¹ We have developed fault-tolerant MPICH-P4 and MVAPICH without AOP. At that time, the aspects defined in this study greatly helped us investigate the points for the fault-tolerant feature.

It may be unclear why the number of colored lines in Fig. 11(a) differs from the number of functions in Table 1. Each line in Fig. 11(a) indicates the corresponding advice, and several types of advice can be applied to one function. For example, two types of advice are applied to the MPID_CH_Eagerb_send_short function in gmpi_chshort.c (see Fig. 4).

In the rest of this section, we evaluate the performance of FT MPI and HETERO MPI. We will show that they have very small overhead, and that they have the same performance and scalability as similar software developed without using AOP, such as in [42, 43, 47, 48].

5.1 FT MPI evaluation

The experiment was performed on the Hamel cluster, serviced by the Korea Institute of Science and Technology Information (KISTI) Supercomputing Center.
The Hamel cluster consists of 256 nodes, each with dual Intel Xeon 2.8 GHz CPUs, 3 GB RAM, and a 36 GB SCSI disk drive running Linux 2.4.20. All machines are connected through a switched 1 Gbps Ethernet LAN and Myrinet 2000.

We first measure the overhead of FT MPI and compare its performance with that of a conventional implementation. To that end, we compare the running times of MPICH-GM, which is the base code without fault tolerance, FT MPI with and without checkpointing, and C_FT MPI, which is a fault-tolerant MPI implemented without using AOP. We evaluate FT MPI by running the LU and BT programs of the Numerical Aerodynamic Simulation (NAS) Parallel Benchmark 2 suite, a set of programs designed to measure the performance of parallel computers. The problem size of LU and BT is class C. Each application is abbreviated in the form "application name.problem size.number of processes," e.g., bt.C.144 and lu.C.16.

Figure 12 shows the LU and BT runtimes. In the "No Ckpt" cases, each MPI application runs over FT MPI with no checkpoints, so the difference between FT MPI with "No Ckpt" and MPICH-GM is the initialization of each MPI process. The runtime of FT MPI with "No Ckpt" is slightly longer than that of MPICH-GM in most cases, but the differences are small even if the number of processes is large. However, in some cases, the differences are a little bigger.
Fig. 12 Runtime (FT MPI)
This can happen when MPI processes are initialized on too many nodes, which requires a relatively long start-up time. Figure 12 also shows that the running time of FT MPI is around 10% longer than that of MPICH-GM due to the checkpoint overhead. For long-running applications, checkpoint overhead should be considered more important than initialization and rollback recovery overhead, because the Central Manager forces each MPI process to be checkpointed periodically during the entire execution. Thus, more frequent checkpoints and larger checkpoint sizes lead to longer execution times. Initialization occurs only once during the execution, and rollback recovery depends on the mean time between failures (MTBF). In the LU and BT cases with "Ckpt," each MPI process is forced to be checkpointed every minute and only once during the execution.

The checkpoint overhead consists of two activities: taking consistent checkpoints and storing checkpoint images to disk. Even though I/O overhead is indispensable in fault-tolerant systems adopting the checkpoint/restart scheme, we skip the detailed analysis of I/O overhead because it is irrelevant to the aspects we used. We use three pieces of advice to take a globally consistent checkpoint: one is for the broadcast function, and the others are related to critical sections. The value of critical_section, which indicates the number of on-the-fly messages, does not affect performance, since it increases or decreases only by one. Thus, the core overhead that we should measure is the time required to complete the broadcast function. In fact, the broadcasting overhead is unavoidable in the coordinated checkpointing scheme; thus, the actual overhead incurred by our aspects is negligible.

Since C_FT MPI is a fault-tolerant MPI built without AOP, it has the same functionality as FT MPI. However, its code related to initialization, consistent checkpoint, and rollback recovery is scattered (Fig. 11(a)). There is little difference between FT MPI and C_FT MPI in terms of performance, and Fig. 12 demonstrates this fact.

Table 2 shows the number of checkpoints, the size of a single checkpoint image created by each process, and the time required to perform the broadcast function and to write to disk. The time for the broadcast function is the only meaningful overhead that our advice incurs.
Fig. 13 Recovery cost
The overhead tends to be larger as the number of processes (BT cases) and the number of in-transit messages (LU cases) increase. However, the overhead is negligibly small. From Fig. 12 and Table 2, we can see that FT MPI runs with low overhead even when the number of participants increases.

To evaluate the rollback recovery, we simulated failures by killing an MPI process with the SIGKILL signal. Figure 13 presents the recovery cost of FT MPI and C_FT MPI; the recovery procedure is as follows: (1) fetching the checkpoint image, (2) reincarnating the MPI processes by calling fork and exec, (3) exchanging communication information between MPI processes, and (4) reconfiguring the device and topology information if necessary. Step (4) does not occur in this experiment because the failure is local. Each bar in Fig. 13 indicates the sum of the time to perform steps (1), (2), and (3). The time required for executing steps (1) and (2) is the overhead of the checkpoint/restart library (1–2 seconds), while the time required for executing step (3) is related to our reinitialization advice. From the figure, we can see that FT MPI can detect failures and recover from them within a short period (up to 4 seconds), and that the recovery cost of FT MPI is analogous to that of C_FT MPI. The result for BT is skipped because it is similar to that of LU.

From the figures and tables, we can confirm that FT MPI shows the same level of performance and scalability as shown in [42, 47, 48]. Thus, we can safely conclude that applying AOP to the development of fault-tolerant MPI systems is a valuable technique, since FT MPI has low overhead and is highly scalable.
Table 2  Breakdown of experiment

            # of           Checkpoint     Broadcast        Disk
            checkpoints    size (MB)      function (sec)   overhead (sec)
lu.C.8      15             143            0.480            2.0607
lu.C.16      7              97            0.340            1.4731
lu.C.32      3              74            0.065            1.2759
lu.C.64      1              62            0.007            1.0001
lu.C.128     1              42            0.003            1.008
bt.C.64      1             119            0.0011           1.3818
bt.C.81      1             106            0.0017           1.6125
bt.C.100     1              97            0.0012           1.8112
bt.C.121     1              85            0.0019           1.7821
bt.C.144     1              72            0.055            1.6
Fig. 14 Runtime (HETERO MPI)
5.2 HETERO MPI evaluation

The experiment was performed on two clusters. One cluster consists of 20 nodes with 2.2 GHz dual PowerPC processors, interconnected by Myrinet, while the other consists of 10 nodes with 3.0 GHz dual Pentium 4 processors, interconnected by Infiniband. The Message Gateway runs on an Intel Xeon server equipped with a Myrinet LANai X NIC and an Infiniband NIC. We ran two NPB benchmarks, BT and SP, while changing the number of nodes from 4 to 36, with the problem size set to class C. Half of the processes run on each cluster in every experiment.

We evaluate HETERO MPI by comparing its performance with that of a conventional implementation. To that end, we compare the running times of MPICH-VMI, which is the base code, HETERO MPI, and C_HETERO MPI, which is a heterogeneous MPI implemented without using AOP. Figure 14 shows the running times for BT and SP. The initialization and connection phases occur only once during the execution, and the initialization overhead is small, as explained in the previous section (the initialization of HETERO MPI is very similar to that of FT MPI). We see that the overhead of the connection aspect is very small because a connection to the Message Gateway is
identical to a connection to an MPI process. "MPICH-VMI (Myrinet)" and "MPICH-VMI (Infiniband)" represent the times required for each MPI application to execute on the Myrinet and the Infiniband cluster, respectively. Obviously, the running times of HETERO MPI are longer (10%–20%) than those of "MPICH-VMI (Myrinet)" and "MPICH-VMI (Infiniband)." However, the overhead is incurred not by our aspects but by the longer latency, because the HETERO MPI cases require a message relay between the heterogeneous clusters. The running time of HETERO MPI is similar to that of C_HETERO MPI, since C_HETERO MPI is a naive implementation whose functionality is identical to that of HETERO MPI. From the figure, we can also conclude that applying AOP to parallel computation over heterogeneous clusters is beneficial, since HETERO MPI shows acceptable overhead. Similar studies were performed in [13, 14, 43].

5.3 OR-Memcached evaluation

The experiment was performed on our 8-node cluster, in which each node is equipped with an Intel(R) Core(TM)2 Quad CPU 2.83 GHz and 8 GB RAM running Ubuntu with a 2.6.30.5 kernel. The nodes are connected through a switched 1 Gbps Ethernet LAN.
We evaluate OR-Memcached by comparing its performance with that of a conventional implementation, C_OR-Memcached, which is implemented without using AOP. We also compare OR-Memcached with a popular distributed key-value storage service, Cassandra. Figure 15 shows the performance of replicated Memcached and Cassandra as the number of replicas increases. As expected, more replicas lead to worse performance due to multicast communication overhead. Since the initialization phase occurs only once during the execution, the initialization overhead is small; most of the overhead of object replication results from multicast communication. The performance of OR-Memcached is similar to that of C_OR-Memcached, since C_OR-Memcached is the same software as OR-Memcached in terms of functionality. From the figure, we can also conclude that applying AOP to object replication in a cluster environment is beneficial, since OR-Memcached shows little overhead. The performance of OR-Memcached is better than that of Cassandra due to Cassandra's RPC and disk I/O overhead. However, Cassandra's performance decreases more slowly than that of OR-Memcached as the replication factor increases. When the replication factor increases, the overhead of multicast communication in OR-Memcached is large, since OR-Memcached maintains a one-copy value for each key. In contrast, the overhead of multicast communication for Cassandra is relatively small, since Cassandra maintains multiple values for each key, which can be distinguished by timestamps.

Fig. 15 OR-Memcached performance

6 Lessons and discussion

As stated in Sect. 1, few studies apply the AOP paradigm to cluster computing. A few related works discuss turning a specific application program into a parallel version using AOP [5, 9]. To the best of our knowledge, our work is the first case study reported on applying AOP to core software for cluster computing. In this section, we provide a few of the lessons learned during our work.

6.1 Design principles and roles of aspects
From our extensive experience with various cluster software, we learned a valuable set of design principles for developing aspect-oriented cluster computing software.

The principle of Loose Coupling. System developers should confirm that the aspects touch all facets of the integration of all the software. In our experience, we made sure that the concerns for our target applications are cross-cuttable (orthogonal) and that the aspects we define in this article are sufficient to implement the concerns without modifying the existing software.

The principle of Reusability. Developers should consider the reusability of the aspects used in software integration. In our experience, we found that our aspects can be reused in other system integrations (e.g., our aspects defined for developing fault-tolerant MPI can be reused in implementations of stop-and-sync [55] style protocols).

The principle of Consistency of Application Messages. Developers should make sure that the aspects do not disturb, or only minimally interfere with, the order of application messages. A change in the application message order might produce unexpected results. In our experience, the aspects for the checkpointing protocol insert a special barrier message to ensure safe checkpoints. However, we found that our aspects do not affect application results, since the barrier message does not change the order of application messages.

We believe that aspects for cluster computing software should explicitly support these principles. Throughout this article, we classify our aspects into the following groups according to their roles.

Protocol aspects. These aspects are used to implement a specific protocol that the cluster computing software should guarantee. In our experience, the aspects for consistent checkpointing are used to mark critical sections rather than to integrate software.

Extension aspects. These aspects glue unrelated functions together to implement some functionality. For example, the aspects for rollback recovery are used to define some steps in the checkpoint software after reincarnation. In the replication aspect for OR-Memcached, operations for multicast communication are added after the set method and the constructor method.

Replacement aspects. These aspects do not glue unrelated functions but replace existing functionality with new functionality defined in the aspects. In our experience, the aspects for the initialization of the MPI-related cases replace the original initialization procedure with initialization via the newly introduced Job Management System.
6.2 Discussion

Before implementing the AOP-based software, we were certain that applying the AOP paradigm to the base versions of the core software would be feasible for the following reasons:

1. Most of the add-on functionalities, including fault tolerance, security, and transaction support, for message passing systems (or other parallel runtimes) and key-value storage services are separable concerns.
2. Those add-on functionalities are orthogonal to the base functions of the message passing systems from a functional point of view.
3. Hence, AOP can be an easy and convenient paradigm that best fits the development of those add-on functionalities for message passing and key-value storage systems.

Providing fault tolerance, network heterogeneity, or replication support to the base software does not require modifying the internal structure (data structures and core program flow) of the base system. It is enough to add new code for the necessary functionality. In addition, according to our experience in developing MPI variants [42, 47], most of the new code is inserted at the beginning or end of a function, similar to inserting code for locking and unlocking around a function body. This additional code is naturally a separable concern and can be modularized by AOP. We believe that our work is an example of the successful application of AOP to core software.

However, the separation into aspects causes the fragility problem. Even a simple change to a specific aspect (e.g., a change of a function name) might cause an incorrect program if the developer forgets to change the aspect (e.g., the pointcut). In related research work, a logic-based crosscut language [56], a model-based pointcut [57], and a pointcut delta analysis [58] have been proposed to mitigate the fragility problem. However, if the base code is very stable and unlikely to be changed in the future, fragility is unlikely to be a problem. In our study, since the send/receive functions, the initialization function, and the put/get methods in the baseline software are very stable,² we believe that the fragility problem in applying AOP to implement fault tolerance, network heterogeneity support, or object replication functionality is unlikely to arise.

Even using very stable base software, however, we needed to confirm that the aspects touch all facets of the integration of all the software.
If not, even a small change in the internal data structures might lead to disordered processes and messages. In fact, we first tried to implement the consistent checkpointing aspects by adding a critical_section variable to an internal data structure (gmpi_var). In that case, a rigorous verification of the safety of every function that used the data structure was inevitable: for example, we had to carefully inspect every function (including the send/recv functions) that used an instance of the gmpi_var structure as an operand of the sizeof operator, which was very time-consuming. It also increased the time needed to define the critical sections. Thus, we kept the critical section flag as a global variable in order to make the gmpi_var structure and the consistent checkpointing aspect independent of each other. Owing to this loose coupling, a change in the gmpi_var structure does not affect the consistent checkpointing aspect.

Some might question the granularity of our aspects, since some of them could be broken into finer aspects. For example, the consistent checkpointing aspect could be separated into a synchronization aspect and a checkpoint aspect. In that case, however, we would have to manage the order among aspects that affect the same join point (i.e., the checkpoint aspect should be executed after the synchronization aspect). To this end, an additional mechanism, such as an order advice that specifies the precedence of aspects, would have to be introduced. Thus, rather than introducing an order advice, we grouped the advice for the same join point into a single aspect.
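As a concrete illustration of this design choice, the following AspectC++ sketch keeps the critical-section flag in a global variable declared outside gmpi_var. The wrapper names (gm_low_send, gm_low_recv, checkpoint_barrier) and the helper take_local_checkpoint are hypothetical placeholders, not actual MPICH-GM identifiers; the commented order advice shows the mechanism that grouping the advice lets us avoid.

// checkpoint.ah -- a minimal sketch of the loose-coupling decision discussed
// above; the function names are hypothetical placeholders, not MPICH-GM symbols.

extern int in_critical_section;       // global flag, defined once in the base
                                      // build instead of inside gmpi_var
extern void take_local_checkpoint();  // hypothetical helper

aspect ConsistentCheckpointing {
  // Every low-level send/receive is treated as a critical section so that
  // no checkpoint is taken while a message is in flight.
  pointcut comm() = execution("% gm_low_send(...)") ||
                    execution("% gm_low_recv(...)");

  advice comm() : before() { in_critical_section = 1; }
  advice comm() : after()  { in_critical_section = 0; }

  // The checkpoint step is kept inside the same aspect. Splitting it into a
  // separate Checkpoint aspect would require an order advice, e.g.
  //   advice comm() : order("Synchronization", "Checkpoint");
  // to fix the precedence of two aspects sharing a join point.
  advice execution("% checkpoint_barrier(...)") : after() {
    if (!in_critical_section)
      take_local_checkpoint();
  }
};

Because the flag is not a member of gmpi_var, adding or removing fields in that structure cannot invalidate the aspect, which is exactly the loose coupling argued for above.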
7 Conclusion

AOP is helpful for modularizing software and for implementing additional requirements without tangling or scattering the existing source code; these advantages greatly simplify software maintenance. This article shows that AOP is well suited to cluster computing software through simple, intuitive, and reusable aspects. We define three, two, and two aspects to implement FT MPI, HETERO MPI, and OR-Memcached, respectively. Our qualitative and performance evaluations show that AOP significantly improves code readability as well as modularity, and that AOP-based software achieves the same performance and scalability as comparable software developed without AOP. Moreover, we summarize the lessons learned and the roles of our aspects. We hope that our study will motivate developers to use AOP. In the future, we will apply our aspects not only to MVAPICH, MPICH-P4, and MPICH-G2 but also to OpenMP and Cilk. We will also define a series of aspects to facilitate the parallelization of high-performance computing software.
Acknowledgements This work was supported by the National Research Foundation (NRF) grant funded by the Korea government (MEST) (No. 2010-0014387). The ICT at Seoul National University provided research facilities for this study.
References
1. Han, H., Jung, H., Yeom, H.Y., Lee, D.Y.: Taste of AOP: blending concerns in cluster computing software. In: Proceedings of 2007 IEEE International Conference on Cluster Computing (2007)
2. Xerox, P.A.R.C.: AspectJ Homepage (2011). http://aspectj.org
3. Spinczyk, O., Lohmann, D., Urban, M.: AspectC++: an AOP Extension for C++. Softw. Dev. J. (05) (2005)
4. Cunha, C.A., Sobral, J.L., Monteiro, M.P.: Reusable aspect-oriented implementations of concurrency patterns and mechanisms. In: Proceedings of the 5th International Conference on Aspect-oriented Software Development (2006)
5. Harbulot, B., Gurd, J.R.: A join point for loops in AspectJ. In: Proceedings of the 5th International Conference on Aspect-oriented Software Development (2006)
6. Kienzle, J., Gelineau, S.: AO challenge—implementing the ACID properties for transactional objects. In: Proceedings of the 5th International Conference on Aspect-oriented Software Development (2006)
7. Lohmann, D., Scheler, F., Tartler, R., Spinczyk, O., Schröder-Preikschat, W.: A quantitative analysis of aspects in the eCos kernel. In: Proceedings of the 1st European Systems Conference (EuroSys) (2006)
8. Rashid, A., Chitchyan, R.: Persistence as an aspect. In: Proceedings of the 2nd International Conference on Aspect-oriented Software Development (2003)
9. Boner, J., Kuleshov, E.: Clustering the Java virtual machine using aspect-oriented programming. In: Proceedings of the 6th International Conference on Aspect-oriented Software Development (2007)
10. Buntinas, D., Coti, C., Herault, T., Lemarinier, P., Pilard, L., Rezmerita, A., Rodriguez, E., Cappello, F.: MPICH-PCL: Non-blocking coordinated checkpointing for large-scale fault tolerant MPI. In: Proc. of the IEEE/ACM Supercomputing (2006)
11. Huang, W., Gao, Q., Liu, J., Panda, D.: High performance virtual machine migration with RDMA over modern interconnects. In: Proc. of the IEEE Cluster (2007)
12. Huang, W., Koop, M., Gao, Q., Panda, D.: Virtual machine aware communication libraries for high performance computing. In: Proc. of the IEEE/ACM Supercomputing (2007)
13. Aumage, O., Mercier, G.: MPICH/MadIII: a cluster of clusters enabled MPI implementation. In: Proceedings of the 3rd IEEE/ACM International Symposium on Cluster Computing and the Grid (2003)
14. Takahashi, T., Sumimoto, S., Hori, A., Harada, H., Ishikawa, Y.: PM2: High performance communication middleware for heterogeneous network environments. In: Proceedings of SC'00 (2000)
15. Oracle: Oracle BerkeleyDB. http://www.oracle.com/technetwork/database/berkeleydb (2011)
16. Fitzpatrick, B.: Memcached: a distributed memory object caching system. http://memcached.org (2011)
17. Apache: CouchDB Homepage. http://couchdb.apache.org (2011)
18. DeCandia, G., Hastorun, D., Jampani, M., Kakulapati, G., Lakshman, A., Pilchin, A., Sivasubramanian, S., Vosshall, P., Vogels, W.: Dynamo: Amazon's highly available key-value store. In: ACM SOSP (2007)
19. Facebook: Cassandra: A structured storage system on a P2P network. http://cassandra.apache.org (2011)
20. Team, A.D.: Aquarium Homepage. http://aquarium.rubyforge.org (2011)
21. Washizaki, H., Kubo, A., Mizumachi, T., Eguchi, K., Fukazawa, Y., Yoshioka, N., Kanuka, H., Kodaka, T., Sugimoto, N., Nagai, Y., et al.: AOJS: aspect-oriented JavaScript programming framework for web development. In: ACP4IS'09: Proceedings of the 8th Workshop on Aspects, Components, and Patterns for Infrastructure Software (2009)
22. Akai, S., Chiba, S., Nishizawa, M.: Region pointcut for AspectJ. In: ACP4IS'09: Proceedings of the 8th Workshop on Aspects, Components, and Patterns for Infrastructure Software (2009)
23. Xi, C., Harbulot, B., Gurd, J.R.: Aspect-oriented support for synchronization in parallel computing. In: PLATE'09: Proceedings of the 1st Workshop on Linking Aspect Technology and Evolution (2009)
24. Lohmann, D., Streicher, J., Spinczyk, O., Schröder-Preikschat, W.: Interrupt synchronization in the CiAO operating system: experiences from implementing low-level system policies by AOP. In: ACP4IS'07: Proceedings of the 6th Workshop on Aspects, Components, and Patterns for Infrastructure Software (2007)
25. Lohmann, D., Hofer, W., Schröder-Preikschat, W.: CiAO: an aspect-oriented operating-system family for resource-constrained embedded systems. In: Proceedings of the 2009 USENIX Annual Technical Conference (2009)
26. Hofer, W., Lohmann, D., Schröder-Preikschat, W.: Concern impact analysis in configurable system software: the AUTOSAR OS case. In: ACP4IS'08: Proceedings of the 2008 AOSD Workshop on Aspects, Components, and Patterns for Infrastructure Software (2008)
27. Reynolds, A., Fiuczynski, M.E., Grimm, R.: On the feasibility of an AOSD approach to Linux kernel extensions. In: ACP4IS'08: Proceedings of the 2008 AOSD Workshop on Aspects, Components, and Patterns for Infrastructure Software (2008)
28. Cannon, B., Wohlstadter, E.: Enforcing security for desktop clients using authority aspects. In: AOSD'09: Proceedings of the 8th ACM International Conference on Aspect-oriented Software Development (2009)
29. OpenMP Architecture Review Board: OpenMP Homepage. http://www.openmp.org (2011)
30. Blumofe, R.D., Joerg, C.F., Kuszmaul, B., Leiserson, C.E., Randall, K.H., Zhou, Y.: Cilk: An efficient multithreaded runtime system. In: Proceedings of 5th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (1995)
31. Alexandersson, R., Ohman, P.: Implementing fault tolerance using aspect oriented programming. In: Proceedings of the Third Latin-American Symposium (2007)
32. Fabry, J.: A framework for replication of objects using aspect-oriented programming. Ph.D. Thesis, University of Brussel (1998)
33. Sevilla, D., Garcia, J., Gomez, A.: Aspect-oriented programming techniques to support distribution, fault tolerance, and load balancing in the CORBA-LC component model. In: IEEE International Symposium on Network Computing and Applications (2007)
34. Afonso, F., Silva, C., Brito, N., Montenegro, S., Tavares, A.: Aspect-oriented fault tolerance for real-time embedded systems. In: ACP4IS'08: Proceedings of the 2008 AOSD Workshop on Aspects, Components, and Patterns for Infrastructure Software (2008)
35. Hargrove, P.H., Duell, J.C.: Berkeley lab checkpoint/restart (BLCR) for Linux clusters. In: Proc. of the SciDAC (2006)
36. van Renesse, R., Schneider, F.B.: Chain replication for supporting high throughput and availability. In: OSDI (2004)
37. Terrace, J., Freedman, M.J.: Object storage on CRAQ: high-throughput chain replication for read-mostly workloads. In: USENIX (2009)
38. Chang, F., Dean, J., Ghemawat, S., Hsieh, W.C., Wallach, D.A., Burrows, M., Chandra, T., Fikes, A., Gruber, R.E.: Bigtable: a distributed storage system for structured data. In: OSDI (2006)
39. JGroups: The JGroups Project. http://www.jgroups.org (2011)
40. Myricom: Myricom Homepage. http://www.myri.com (2011)
41. Pant, A., Jafri, H.: Communicating efficiently on cluster based grids with MPICH-VMI. In: Proceedings of IEEE International Conference on Cluster Computing (2004)
42. Jung, H., Han, H., Yeom, H.Y., Kang, S.A.: A user-transparent and fault-tolerant system for parallel applications. IEEE Trans. Parallel Distrib. Syst. 99 (2011). doi:10.1109/TPDS.2011.63
43. Kim, S.G., Han, H., Jung, H.S., Yeom, H.Y.: Design and implementation of RDMA gateway for heterogeneous clusters. In: Proceedings of International Conference on Convergence Information Technology (2007)
44. Elnozahy, E.N., Johnson, D.B., Wang, Y.M.: A survey of rollback-recovery protocols in message-passing systems. ACM Comput. Surv. 34(3), 378–408 (2002)
45. Zandy, V.C.: ckpt. http://www.cs.wisc.edu/~zandy/ckpt (2011)
46. Adams, B.: Aspicere Homepage. http://sailhome.cs.queensu.ca/~bram/aspicere/ (2011)
47. Woo, N., Jung, H., Yeom, H.Y., Park, T., Park, H.: MPICH-GF: transparent checkpointing and rollback-recovery for grid-enabled MPI processes. IEICE Trans. Inf. Syst. E87-D, 1820–1828 (2004)
48. Woo, N., Jung, H., Shin, D., Han, H., Yeom, H.Y., Park, T.: Evaluation of consistent recovery protocols using MPICH-GF. In: Proceedings of the 5th European Dependable Computing Conference (2005)
49. Stellner, G.C.: Checkpointing and process migration for MPI. In: International Parallel Processing Symposium (1996)
50. Chandy, K.M., Lamport, L.: Distributed snapshots: determining global states of distributed systems. ACM Trans. Comput. Syst. 3(1), 63–75 (1985)
51. Whalin, G.: Java memcached client. http://www.whalin.com/memcached (2011)
52. Liu, J., Wu, J., Kini, S.P., Wyckoff, P., Panda, D.K.: High performance RDMA-based MPI implementation over InfiniBand. In: Proceedings of 17th Annual ACM International Conference on Supercomputing (2003)
53. Karonis, N.T., Toonen, B., Foster, I.: MPICH-G2: a grid-enabled implementation of the message passing interface. J. Parallel Distrib. Comput. 63(5), 551–563 (2003). Special Issue on Computational Grids
54. Park, K., Park, S., Kwon, O., Park, H.: MPICH-GP: a private-IP-enabled MPI over grid environments. In: Proceedings of the Second International Symposium on Parallel and Distributed Processing and Applications (2004)
55. Plank, J.S.: Efficient checkpointing on MIMD architectures. Ph.D. Thesis, Princeton University (1993)
56. Gybels, K., Brichau, J.: Arranging language features for more robust pattern-based crosscuts. In: Proceedings of the 2nd International Conference on Aspect-oriented Software Development (2003)
57. Kellens, A.: A model-driven pointcut language for more robust pointcuts. In: Proceedings of Software Engineering Properties of Languages for Aspect Technologies (2006)
58. Störzer, M., Koppen, C.P.: Attacking the fragile pointcut problem, abstract. In: European Interactive Workshop on Aspects in Software (2004)
Hyuck Han received his B.S., M.S., and Ph.D. degrees in Computer Science and Engineering from Seoul National University, Seoul, Korea, in 2003, 2006, and 2011, respectively. Currently, he is a postdoctoral researcher at Seoul National University. His research interests are distributed computing systems and algorithms.
Hyungsoo Jung received the B.S. degree in mechanical engineering from Korea University, Seoul, Korea, in 2002, and the M.S. and Ph.D. degrees in computer science from Seoul National University, Seoul, Korea, in 2004 and 2009, respectively. He is currently a postdoctoral research associate at the University of Sydney, Sydney, Australia. His research interests are in the areas of distributed systems, database systems, and transaction processing.

Heon Y. Yeom is a Professor with the School of Computer Science and Engineering, Seoul National University. He received his B.S. degree in Computer Science from Seoul National University in 1984 and his M.S. and Ph.D. degrees in Computer Science from Texas A&M University in 1986 and 1992, respectively. From 1986 to 1990, he worked with Texas Transportation Institute as a Systems Analyst, and from 1992 to 1993, he was with Samsung Data Systems as a Research Scientist. He joined the Department of Computer Science, Seoul National University, in 1993, where he currently teaches and researches on distributed systems, multimedia systems, and transaction processing.