Cluster Comput DOI 10.1007/s10586-011-0164-9

Performance evaluation of a remote memory system with commodity hardware for large-memory data processing Hyuck Han · Hyungsoo Jung · Sooyong Kang · Heon Y. Yeom

Received: 6 September 2010 / Accepted: 21 March 2011 © Springer Science+Business Media, LLC 2011

Abstract  The explosion of data and transactions demands a creative approach for data processing in a variety of applications. Research on remote memory systems (RMSs), which exploit the superior characteristics of dynamic random access memory (DRAM), has been performed for many decades, and today's information explosion galvanizes researchers into shedding new light on the technology. Prior studies have mainly focused on architectural suggestions for such systems, highlighting different design rationales. These studies have shown that choosing the appropriate applications to run on an RMS is important in fully utilizing the advantages of remote memory. This article provides an extensive performance evaluation for various types of data processing applications so as to address the efficacy of an RMS by means of a prototype RMS with reliability functionality. The prototype RMS used is a practical kernel-level RMS that renders large memory data processing feasible. The abstract concept of remote memory was materialized by borrowing unused local memory in commodity PCs via a high-speed network capable of Remote Direct Memory Access (RDMA) operations. The prototype RMS uses remote memory without consuming any of the remote computers' computation power. Our experimental results suggest that an RMS can be practical in supporting the rigorous demands of commercial in-memory database systems that have high data access locality. Our evaluation also convinces us of the possibility that a reliable RMS can satisfy both the high degree of reliability and the efficiency required by large memory data processing applications whose data access pattern has high locality.

Keywords  Remote memory system · Large memory data processing · Main memory databases

H. Han · H.Y. Yeom
School of Computer Science and Engineering, Seoul National University, Seoul, 151-742, Republic of Korea
e-mail: [email protected] (H. Han); [email protected] (H.Y. Yeom)

H. Jung
School of Information Technologies, University of Sydney, Sydney, NSW 2006, Australia
e-mail: [email protected]

S. Kang
Division of Computer Science and Engineering, Hanyang University, Seoul 133-791, Republic of Korea
e-mail: [email protected]

1 Introduction

The recent phenomenon of information explosion has driven IT researchers to develop an innovative methodology for large data processing. The two most important requirements of large data processing are efficiency and failure-resilience. The simplest solution to this problem is to use a primary-backup approach with large DRAM (≥1 TB) to accommodate both raw data and applications. This simple architecture, however, costs a great deal. Conversely, as a cost-effective way of mitigating the budget barrier, attempts at building systems with a remote memory model have been undertaken for many years. The design rationale for these studies has been that reliable remote memory connected with high speed interconnects is better than a single big machine in terms of cost-effectiveness. (Indeed, the widespread architecture adopted by vendors showing top ten TPC-C results is a clustered architecture, not a big mainframe.)


The main advantage of DRAM is that its access time is tens of thousands times faster than that of disk storage. Owing to this feature, numerous researchers have worked on large memory data processing systems, trying to fit all of the data processing components, as well as large data, in the volatile memory space. Indeed, extensive research efforts have been made to develop high performance, large memory data processing systems, namely in-memory database systems that are orders of magnitude faster than disk-based database systems, and information retrieval systems that require all index lookup processing to be performed on memory. For example, the renowned search engine companies (e.g., Google and Yahoo) adopted this approach for quality assurance in data processing. These efforts were driven by the decreasing price and increasing capacity of volatile memory, which makes building computers with large amounts of main memory not only possible, but affordable. Despite the performance advantages, large memory data processing systems have not fully exploited remote memory features commercially. For large memory data processing systems to fully utilize the fast random access memory feature, the underlying architecture should be designed in a different way than that of disk-based data processing systems, so as to maximize data processing throughput. This problem is currently being addressed by numerous researchers. The most important issue, aside from redesigning large memory data processing systems, is to find a cost effective way to build large memory computer systems. This is a tradeoff between a single, expensive, large memory mainframe and clustered, inexpensive new memory hierarchy systems. This tradeoff can be ascribed to the innate limitations of large memory data processing systems. That is, the system itself must always reside in the volatile memory, lest its performance decline dreadfully in overcommitted situations. Large memory systems that can support up to a terabyte of memory at the cost of millions of dollars are the only option remaining for clients who want to run large memory data processing applications. This research presents an extensive performance evaluation for a kernel-level reliable remote memory system (R2 MS) to validate whether an RMS is efficient for large memory data processing. For the performance study, we designed and implemented a prototype R2 MS, which is based on Linux. The basic concept of remote memory was realized not by renting both memory and computing power of remote computers, but by borrowing a fraction of the local memory in remote computers; remote computers can use their full computing power in processing their own tasks. The R2 MS only accesses the remote memory space via an RDMA interface. Owing to this leasing principle, the performance of the R2 MS is independent of the computing power of the remote computers, even though the remote computers

loan their memory space. In addition, the leasing principle is very useful in cluster environments in that it does not require extra computing infrastructure to build the R2 MS. The prototype R2 MS was built using this design concept and has two important features in order to support a reliable and scalable remote memory service. For the efficient and scalable handling of remote data, (1) all page transmission operations (i.e., send/receive operations) were replaced with Remote Direct Memory Access (RDMA) operations, and (2) exclusive ownership was enforced for all of the data pages to resolve data inconsistency problems. We invented a simple whiteboard marking mechanism to achieve exclusive ownership management for each memory page by letting each machine hold the ownership of a particular remote memory region after obtaining the authority for that region. For reliable memory, a fast crash/recovery mechanism was devised to protect the entire OS and application programs against a remote machine crash. This led to the design of the Redundant Group of Independent Memory (ReGIM) architecture, which is analogous to a RAID system in that the ReGIM system is constructed on several remote memory blocks exported from other machines by assembling each block according to the interleaving rule of RAID5.

We evaluated the performance of the R2 MS using various large memory data processing applications. What we learned from the extensive experiments is that the R2 MS shows good performance for large memory data processing applications whose data access patterns have locality, as seen with in-memory database systems. For example, the R2 MS achieves up to 92% of the performance of a system with adequate local memory for in-memory database applications. The breakdown of the evaluation results shows that the R2 MS (1) is very efficient in handling remote pages in various types of applications ranging from CPU-intensive jobs to I/O-intensive jobs; (2) imposes minimal performance overhead; (3) allows all types of applications to utilize large memory implicitly without changes to the existing applications; and (4) masks a single remote machine crash from the entire OS, allowing both the OS and applications to run normally during and after a machine crash (OS transparency). From a performance aspect, the evaluation with various types of applications suggests that the performance of applications with high locality on the R2 MS is much better than what we have previously seen.

The main contributions of this article are as follows:
• We design and implement a reliable remote memory system (R2 MS) based on RDMA semantics. Our system fully exploits RDMA semantics to maximize system performance as well as to guarantee the exclusive use of remote memory regions.
• We show that our system has good performance through an extensive performance evaluation. Our experimental


results show that remote memory systems such as the R2 MS can achieve good performance for applications with high locality (e.g., up to 92% of the performance of a full memory system for in-memory database applications).

The rest of the article is organized as follows. Section 2 describes the R2 MS architecture in further detail. Section 3 evaluates the R2 MS using various applications. Section 4 discusses our experiences. Section 5 reviews the related work. Finally, Sect. 6 concludes the article.

2 R2 MS design

In this section, we explain the details of the R2 MS, our prototype reliable remote memory system. The algorithm and design adopted in building the R2 MS are based on valuable theoretical foundations that have been proposed and validated by previous research. The R2 MS has two major components: a Dynamic Memory (DyMem) system and a Redundant Group of Independent Memory (ReGIM). DyMem is a unified system that manages both local and remote memory space to provide a best-effort memory management service to application programs. If the system undergoes overcommitment of the local memory (i.e., the total size configured for all running programs exceeds the total amount of actual machine memory), DyMem facilitates the use of remote memory space to mitigate the drastic performance drop that would have occurred without remote memory sharing. The use of remote memory guarantees a performance compromise between local memory and disk storage. In the meantime, the ReGIM system is designed to provide strong fault-resilience to the R2 MS using a checksumming technique. OS-transparent crash/recovery is the pivotal feature of the ReGIM system.

2.1 DyMem

When the memory is overcommitted, DyMem must take over the role of the swapper, i.e., it should maintain a minimal number of free page frames so that the kernel can safely handle out-of-memory situations. To maintain free pages, DyMem has to relocate candidate pages to remote memory space, which is tens of thousands of times faster than disk storage. As a global dynamic memory allocator, DyMem manages remote memory space in a structured way in order to manipulate the relocation of the candidate page frames efficiently. In the R2 MS, DyMem logically partitions the entire remote memory space into a set of small memory blocks, each of which is a self-assembly unit of a memory block for achieving structured remote memory management. The memory block, which we call a lego block, is a logical unit of remote memory space. It can be viewed as a large-sized page frame in memory space.

As depicted in Fig. 1, a lego block represents a single chunk of remote memory space. Each lego contains meta-level information, such as the lego shape, physical start address, and order. A unique host identifier is used as a lego shape value, because a group of lego blocks exported from the same host should have the same shape value, which differs from any other machine's shape. Clearly, a lego block works as a meta-structure for a remote page frame. DyMem sets 256 MB as the default chunk size, which is configurable at boot time. When a machine exports its physical memory space to other machines, the machine should fragment its memory into the default chunk size and broadcast the meta-description table (whiteboard) to the other machines.

Fig. 1 Architecture of lego and tetris

Once it gathers enough lego blocks, DyMem needs to assemble the acquired lego blocks so as to create reliable memory space. The basic building block for the reliable memory is a tetris. A tetris, as in the popular game, has a grid-like structure in which each tetris row is built by plugging in five differently-shaped lego blocks. (The total number of lego blocks needed to form a single tetris row can be adjusted. The only invariant is that, due to the checksum block, the number of machines should be larger than the number of legos in a single tetris row.) The plugging rule, by which each tetris row is arranged using differently shaped lego blocks, prepares the ground for the ReGIM architecture. In the R2 MS, each tetris row consists of five lego blocks to make tetris work with RAID5. The structure of the entire tetris, shown in Fig. 1, is constructed as follows. When building each tetris row, five lego blocks are used, none of which have the same shape as any of the already plugged lego blocks. The enforcement of this shape discrepancy rule facilitates the use of RAID5 as the underlying data distribution method to reinforce remote memory with fault resilience. With 256 MB as the default size for each assembled lego, a single tetris row can cover a memory space of 1 GB used as remote memory.

Under memory pressure conditions, DyMem relocates candidate pages to remote memory space by writing the corresponding page to the micro-row area in a tetris row. A single micro-row has a 5 KB region, and each column in a single tetris row is a 1 KB segment. The content of a local page frame is divided into four segments, and the segmented regions are mapped to the first four columns of the micro-row. Then, DyMem starts to write an entire page to remote memory using an RDMA write. Concurrently, while waiting to receive all acknowledgements from the remote machines, DyMem generates a parity block from the page content and then writes the parity block to the last column of the micro-row. When reading a data page from the remote memory, DyMem reads from the first four columns only, because the

last column contains a parity block and is not required until crash/recovery is performed.

From a different perspective, structuring tetris might be regarded as the "bin packing problem," which is intractable and impractical, especially in an operating system kernel. If we were only concerned about packing memory contents into the smallest number of remote nodes (or bins), each of which is assumed to export the same amount of memory, this would indeed be the "bin packing problem." Unlike the original "bin packing" problem, organizing the tetris using exported "lego" memory fragments and storing data contents into this structured tetris space can be simplified, based on the following design rationale: reduce the recovery cost and increase the number of available nodes. In particular, when building a tetris whose rows should consist of legos from 5 different physical nodes, the goal is to maximize the number of available nodes; in other words, to build each tetris row, we used one fragment from each node in turn. We do this to reduce the recovery cost upon a machine failure. By relaxing the goal and seeking a different design rationale, we could escape from having to solve an NP-hard problem.

2.2 Basic communication architecture

The communication subsystem of the R2 MS has two main components: an R2 MS_Comm_Driver and an R2 MS Group Manager. They are shown in Fig. 2. To make remote data handling efficient, the R2 MS exploits the Infiniband architecture as a backbone network to share memory in remote nodes.

Sharing remote memory involves transmitting local pages to remote nodes, and the Infiniband architecture guarantees extremely low latency and large bandwidth for this purpose. R2 MS_Comm_Driver is a Linux device driver which exports two basic and simple functions (Remote_Read() and Remote_Write()) to both the DyMem and the ReGIM systems. These functions are written using the Verbs API of the Infiniband software stack. R2 MS Group Manager is a user-level application which manages the membership of the nodes that participate in the R2 MS. It enables a node to join and leave the R2 MS. It also exchanges communication information, such as the unique node ID, information about the message transfer engine of the Infiniband architecture called a queue pair (qp), and the rkey needed to resolve memory mapping, all of which are transferred to or from the Comm_Driver through the ioctl system call.

2.3 Whiteboard: dynamic lego allocation & fault detection

With a naive design of the communication subsystem, the R2 MS may encounter a data race. A data race occurs when two or more nodes try to write their data pages to the same remote memory space. Since data races in remote memory can result in data corruption, which may lead to a machine crash, it is necessary to reinforce the R2 MS communication system with a robust solution. To resolve the access conflict, we developed a whiteboard mechanism, which is a dynamic ownership management architecture. The whiteboard is a data structure which exposes the ownership of the exported memory to all participant nodes. As shown in Fig. 3, the whiteboard records which lego is allocated to which node.
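To make the communication path of Sect. 2.2 concrete, a minimal sketch of the interface that R2 MS_Comm_Driver could export is shown below. The article does not give the actual prototypes, ioctl commands, or structure layouts, so every name here is an illustrative assumption, not the driver's real API.

/* Illustrative kernel-facing interface of R2MS_Comm_Driver; all names,
 * prototypes and ioctl numbers are assumptions, not the actual R2MS code. */
#include <stdint.h>
#include <stddef.h>
#include <linux/ioctl.h>

struct r2ms_conn_info {              /* filled in by the user-level Group Manager */
    uint32_t node_id;                /* unique node ID of the peer                */
    uint32_t qp_num;                 /* Infiniband queue pair number              */
    uint32_t rkey;                   /* remote key of the exported memory region  */
    uint64_t base_addr;              /* base address of the exported region       */
};

/* Membership updates are pushed down from the Group Manager via ioctl. */
#define R2MS_IOC_ADD_NODE  _IOW('r', 1, struct r2ms_conn_info)
#define R2MS_IOC_DEL_NODE  _IOW('r', 2, uint32_t)

/* The two functions exported to DyMem and ReGIM: synchronous RDMA transfers
 * of page-sized data to or from a lego block on the given remote node. */
int Remote_Read(uint32_t node_id, uint64_t remote_offset, void *local_page, size_t len);
int Remote_Write(uint32_t node_id, uint64_t remote_offset, const void *local_page, size_t len);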

Fig. 2 Communication subsystem of R2 MS
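Before walking through Algorithm 1, it helps to fix a concrete picture of the whiteboard that Fig. 3 depicts. The layout below is a minimal sketch built from the fields named in Algorithm 1 (the lock, Last_Host_Id, Timestamp, Used_Flag, and Allocated_Host); the field widths and the entry count are our own assumptions.

/* Hypothetical whiteboard layout, one per memory-exporting node (cf. Fig. 3). */
#include <stdint.h>

#define MAX_LEGOS_PER_NODE 64        /* assumption: 64 x 256 MB = 16 GB exportable */

#define LOCK_FREE  0ULL
#define LOCK_STATE 1ULL              /* value observed while some node holds the lock */

struct lego_entry {
    uint64_t phys_start_addr;        /* start of the 256 MB chunk on the exporter  */
    uint32_t used_flag;              /* 1 once a consumer has claimed this lego    */
    uint32_t allocated_host;         /* node ID of the consuming node              */
};

struct whiteboard {
    uint64_t lock;                   /* target of the RDMA CompareAndSwap          */
    uint32_t last_host_id;           /* node that acquired the lock most recently  */
    uint64_t timestamp;              /* when the lock was acquired                 */
    struct lego_entry lego[MAX_LEGOS_PER_NODE];
};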

Fig. 3 An instance of the whiteboard: the first and second legos are allocated to nodes 1 and 3, respectively

2.3.1 Safe and dynamic lego allocation

To resolve access conflicts on the whiteboard when multiple nodes try to mark their ownership in the same whiteboard entry simultaneously, a safe lego allocation method (Algorithm 1) was devised and implemented in the R2 MS. This method uses the CompareAndSwap operation provided by the Infiniband architecture. The CompareAndSwap operation of Infiniband is only executed over a reliable connection that guarantees the FIFO property. In addition, the CompareAndSwap operation of Infiniband is as atomic as the CompareAndSwap operations of modern CPUs. Thus, the R2 MS incurs no additional overhead for mutually exclusive accesses other than the channel delay of Infiniband. The acquireLock() function in the first line of Algorithm 1 was implemented using the CompareAndSwap operation, which tries to acquire the whiteboard lock shown in Fig. 3. The acquireLock() function guarantees that only one node can access the whiteboard and mark its ownership (lines 5, 13, 14 and 15 in Algorithm 1).

Algorithm 1 Safe and dynamic lego allocation algorithm
1:  ret = acquireLock( remote_node_id );
2:  if ret == -1 then
3:      return LOCK_FAILURE;
4:  end if
5:  ret = retrieveWhiteboard( remote_node_id );
6:  newLegoIdx = checkWhiteboard();
7:  if newLegoIdx == -1 then
8:      releaseLock( remote_node_id );
9:      return NO_LEGO;
10: end if
11: Last_Host_Id = my_node_id;
12: Timestamp = current_time;
13: Lego[newLegoIdx].Used_Flag = 1;
14: Lego[newLegoIdx].Allocated_Host = my_node_id;
15: ret = saveWhiteboard( remote_node_id );
16: releaseLock( remote_node_id );
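For readers unfamiliar with Infiniband atomics, the following user-space sketch shows how the acquireLock() step of Algorithm 1 can be mapped onto an RDMA CompareAndSwap with the Verbs API (the in-kernel driver would presumably use the equivalent kernel verbs path). Queue-pair setup, memory registration, and retry logic are omitted, and everything outside the standard Verbs calls is an assumption rather than the actual R2 MS code.

#include <infiniband/verbs.h>
#include <string.h>

/* Try to atomically change the remote whiteboard lock from LOCK_FREE (0) to
 * LOCK_STATE (1).  Returns 0 on success, -1 if the lock was already held or
 * the request failed.  The 8-byte buffer *old_val must be covered by mr and
 * receives the previous value of the remote lock word. */
int acquire_whiteboard_lock(struct ibv_qp *qp, struct ibv_cq *cq,
                            uint64_t remote_lock_addr, uint32_t rkey,
                            uint64_t *old_val, struct ibv_mr *mr)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)old_val,       /* where the fetched old value lands */
        .length = sizeof(uint64_t),
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr, *bad_wr = NULL;

    memset(&wr, 0, sizeof(wr));
    wr.opcode                = IBV_WR_ATOMIC_CMP_AND_SWP;
    wr.sg_list               = &sge;
    wr.num_sge               = 1;
    wr.send_flags            = IBV_SEND_SIGNALED;
    wr.wr.atomic.remote_addr = remote_lock_addr;   /* address of whiteboard->lock */
    wr.wr.atomic.rkey        = rkey;
    wr.wr.atomic.compare_add = 0;                  /* expect LOCK_FREE            */
    wr.wr.atomic.swap        = 1;                  /* install LOCK_STATE          */

    if (ibv_post_send(qp, &wr, &bad_wr))
        return -1;

    /* Busy-poll the completion queue for the single outstanding request. */
    struct ibv_wc wc;
    while (ibv_poll_cq(cq, 1, &wc) == 0)
        ;
    if (wc.status != IBV_WC_SUCCESS)
        return -1;

    return (*old_val == 0) ? 0 : -1;   /* CAS won only if the old value was LOCK_FREE */
}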

2.3.2 Failure or crash detection

Failure detection plays a pivotal role in recovery. In the R2 MS, three types of failures may occur: (1) A failure of the remote machine during page transmission or lock acquisition. This is the most common failure and can be detected directly, through a timeout mechanism, by the node which attempts to read from (or write to) the crashed node. (2) If a node crashes while holding the whiteboard lock, the other nodes can never acquire the lock. To resolve this problem, we use the Last_Host_Id and Timestamp fields, as shown in Fig. 3. The lego allocation function is forced to record the node ID that acquired the whiteboard lock and the access time (lines 11 and 12 in Algorithm 1). Accordingly, if the value of the whiteboard lock is LOCK_STATE and Last_Host_Id and Timestamp remain unchanged for a certain period of time, any node trying to obtain the whiteboard lock can request the release of the whiteboard lock. (3) Failures which cannot be detected in an active way can be caught by the R2 MS Group Manager. Since the R2 MS Group Manager probes the status of other nodes periodically, it can eventually detect a node failure.


After a failure is detected, the incident is reported to the R2 MS_Comm_Driver. Once the R2 MS_Comm_Driver is notified of a failure event, it deletes the communication context of the node and clears the corresponding entries of the whiteboard that are owned by the node. It subsequently informs the ReGIM of the incident in order to recover from the fault, which is explained in further detail in the next section.
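A minimal sketch of the cleanup step just described — dropping the failed node's context and releasing the whiteboard entries it owned — is shown below, reusing the hypothetical whiteboard layout sketched in Sect. 2.3. The two helper functions are assumed placeholders, not part of the actual driver.

/* Illustrative failure handling: release every lego owned by the failed node.
 * destroy_comm_context() and regim_start_recovery() are assumed helpers. */
void destroy_comm_context(uint32_t node_id);      /* assumed: tear down QP/connection */
void regim_start_recovery(uint32_t node_id);      /* assumed: hand incident to ReGIM  */

void r2ms_handle_node_failure(struct whiteboard *wb, uint32_t failed_node)
{
    destroy_comm_context(failed_node);

    for (int i = 0; i < MAX_LEGOS_PER_NODE; i++) {
        if (wb->lego[i].used_flag &&
            wb->lego[i].allocated_host == failed_node) {
            wb->lego[i].used_flag = 0;             /* lego becomes allocatable again */
            wb->lego[i].allocated_host = 0;
        }
    }

    regim_start_recovery(failed_node);
}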

2.3.3 Join/leave procedure

When a node needs to join or leave the R2 MS group due to administrative reasons, such as a reboot, all of the other nodes in the R2 MS group must support the join or departure procedure without any crashes or failures. If a node that exports its local memory experiences memory pressure due to its own applications, the node can perform the safe departure procedure to regain more local memory. Note that when a node executes the join/leave procedure frequently, it can lead to severe performance degradation of its corresponding memory-consumer node. In particular, in the case of a node leave, unless there is a safe leave procedure, a node may lose some data because it can no longer access some remote nodes. Therefore, when a node wants to leave the R2 MS group, it sends a LEAVE message to all of the other nodes. They then move their data which are resident in the memory of the leaving node to another node, and acknowledge the leave event. Conversely, the join procedure is simple. When an administrator makes a node join the R2 MS group, the node sends JOIN messages and exchanges communication information with the other participants, as explained in the previous section. After the R2 MS completes the join procedure, it starts to restructure the DyMem for load balancing under the RAID5 rule, as explained in Sect. 2.3.

2.4 ReGIM

The fault resilience implemented in the R2 MS is based on the aforementioned lego-tetris frame. The ReGIM system, a fault-resilient memory architecture, does not suggest a new principle for crash/recovery. Rather, its underlying architecture is based on the RAID system. The difference is that, unlike the RAID system, the ReGIM uses virtual memory blocks as a redundant group of independent memory. A single tetris column, which may consist of differently shaped lego blocks, can be viewed as a single independent memory unit, similar to a single disk in RAID. The unpredictable join/leave behavior of a node was the motivation behind designing the ReGIM system with virtual lego blocks instead of mapping each independent memory directly to the entire memory space of a single machine. As shown in Fig. 4, the ReGIM writes a page content and a checksum block to each micro-row. Within a single tetris row, the ReGIM distributes all 1 KB segments to the designated columns of the micro-row, resulting in the same interleaving rule as for RAID4. Based on the distribution rule among tetris rows, however, the page frames and parity data are interleaved on different lego blocks, as is the case with RAID5. Upon a single machine crash, the ReGIM then attempts to minimize the recovery cost of lost lego blocks using RAID5's interleaving rule.

2.4.1 Crash recovery

The R2 MS performs its crash/recovery mechanism in an OS-transparent way. It never kills any running program in the local machine, but a process might stop because the recovery routine is protected by a global spin lock. (Read/write refers to RDMA read/write in this section.)
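As a concrete illustration of the micro-row layout and of the recovery principle used below, the standalone sketch that follows generates the RAID5-style parity over the four 1 KB segments of a 4 KB page and rebuilds one lost segment from the survivors. It is our own minimal example, not the ReGIM code.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SEG_SIZE 1024            /* one tetris column of a micro-row          */
#define NSEG     4               /* a 4 KB page is split into four segments   */

/* XOR the four data segments of a page into the parity column. */
void make_parity(const uint8_t page[NSEG][SEG_SIZE], uint8_t parity[SEG_SIZE])
{
    memset(parity, 0, SEG_SIZE);
    for (int s = 0; s < NSEG; s++)
        for (size_t i = 0; i < SEG_SIZE; i++)
            parity[i] ^= page[s][i];
}

/* Rebuild the segment stored on a crashed node from the three surviving
 * segments and the parity column (the same XOR property used by RAID5). */
void rebuild_segment(const uint8_t page[NSEG][SEG_SIZE],
                     const uint8_t parity[SEG_SIZE],
                     int lost, uint8_t out[SEG_SIZE])
{
    memcpy(out, parity, SEG_SIZE);
    for (int s = 0; s < NSEG; s++) {
        if (s == lost)
            continue;
        for (size_t i = 0; i < SEG_SIZE; i++)
            out[i] ^= page[s][i];
    }
}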

Passive notification of node leave  When a node wants to leave its current group, the node notifies the R2 MS of its intention to leave. The ReGIM then starts to move the data that pertain to the lego blocks of the node to new lego blocks selected by the shape discrepancy rule. The data relocation to the newly assembled lego blocks does not require checksum blocks, because no data blocks are missing. Therefore, the ReGIM simply moves all of the leaving node's lego block data to the new lego blocks.

Active detection of a machine crash or network failure  When a failure occurs either in the hardware or the network (in this study we assume a fail-stop model; in other words, we do not consider failures such as Byzantine failures), the ReGIM starts to perform recovery work, which was not required in the previous case. The ReGIM architecture is able to protect the entire system from a single machine crash perfectly; upon multiple failures, however, data recoverability is determined by whether more than one lego block belongs to the faulty machines. If this is the case, the ReGIM is not able to recover from the multiple crashes. Otherwise, the ReGIM can successfully handle the situation. For active detection of a machine crash or network failure, we exploited the completion queue of Infiniband. In the reliable communication model that we used, each node posts an RDMA read/write descriptor to the send/receive queue to initiate page transfer requests. The completion queue entry can be checked to see whether the corresponding request was successful or not. A network or machine failure can thus be detected by checking the completion queue entries. In [1, 2], a similar method was used to detect network failures.

Fig. 4 Data write in the ReGIM: various latency fractions are indicated on the timeline. Checksumming can be performed during the completion of all RDMA requests, and this causes no overhead for parity block generation in the ReGIM

The ReGIM system carries out data recovery as follows. First, the ReGIM sweeps the entire tetris from the bottom row to detect a defective lego block. After discovering such a lego block and replacing it with a new one, the ReGIM

recovers a 256 MB lego by rebuilding the original data, checksumming the four corresponding blocks from the live lego blocks. Once the data is rebuilt, the ReGIM writes it to a new lego block. This continues until the top row of the tetris is reached. Based on the data recovery experiment, the ReGIM can rebuild 1 GB of data within 10 seconds. However, the R2 MS is not available during the recovery period. During the recovery, a process can run unaffected only if its image resides in the local memory. Otherwise, the process may stop running until the recovery completes and cannot resume until afterwards. Figure 5 shows the crash recovery mechanism in detail.

During the recovery, if a new node joins the R2 MS group by exporting its physical memory, the ReGIM restructures the current tetris arrangement so that data blocks can be well dispersed across lego blocks of various shapes. The real benefits of restructuring are that it: (1) extends the inter-lego distance between lego blocks having the same shape, which determines how much data a single node should hold, and (2) enables the ReGIM to maintain differently shaped lego blocks as much as possible. When restructuring the tetris, the ReGIM sweeps every row of the tetris to find the best candidate lego block to be replaced with a new one. The rule is very simple: (1) at each row, the ReGIM selects two lego blocks, one with the lowest order value and the other with the highest; then (2) if the difference between the order values is less than 1, the ReGIM jumps to the next row; otherwise, the ReGIM checks whether the order of the new lego is lower than the highest one, in which case it replaces the highest-order lego with the new lego block. The newly plugged lego block needs to read all of the data from the old lego. After completing the

tetris restructuring, the ReGIM has a well-distributed tetris, as shown in Fig. 5. However, our greedy choice has a drawback. When a remote node joins, the ReGIM redistributes the data to new legos from appropriate remote nodes according to our greedy choice, so that the usage of legos is even across all remote nodes. We admit that if a node joins when the RMS already has a large tetris, the restructuring cost (large data copying) is non-trivial. This is the trade-off of our greedy choice. Even though we could not prove that the greedy choice always guarantees optimality for this goal, in terms of practicality the greedy method of assigning legos works very efficiently inside a real system (the Linux kernel).

2.5 R2 MS implementation

The R2 MS implementation is based on all of the architectural details explained in the previous sections. We implemented the R2 MS for the Linux operating system by replacing the current swap system with our module. Most of the components were implemented in a straightforward way. The total working code is less than 5,000 lines, and less than half of it is for the core part of both the DyMem and the ReGIM. The remaining code is devoted to the implementation of the device driver of the R2 MS communication system. We rewrote Linux's swap index generation routine for space and time efficiency, using a hierarchical bit-vector compression algorithm. We will discuss the swap index generation routine in Sect. 4.1.
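The article only names the hierarchical bit-vector technique used for swap-index generation; the two-level sketch below is our own simplified illustration of the idea, not the actual kernel code. A summary word records which 64-slot groups still contain a free index, so allocation needs two word scans instead of the linear sweep discussed in Sect. 4.1.

#include <stdint.h>

#define GROUPS 64                            /* 64 groups x 64 slots = 4096 swap indices */
#define SLOTS  (GROUPS * 64)

static uint64_t summary;                     /* bit g set => group g has a free slot */
static uint64_t group_map[GROUPS];           /* bit b set => slot g*64+b is free     */

void swapmap_init(void)
{
    summary = ~0ULL;
    for (int g = 0; g < GROUPS; g++)
        group_map[g] = ~0ULL;
}

/* Allocate a free swap index, or return -1 if the map is full. */
int swapmap_alloc(void)
{
    if (summary == 0)
        return -1;
    int g = __builtin_ctzll(summary);        /* first group with a free slot  */
    int b = __builtin_ctzll(group_map[g]);   /* first free slot in that group */
    group_map[g] &= ~(1ULL << b);
    if (group_map[g] == 0)
        summary &= ~(1ULL << g);             /* group exhausted */
    return g * 64 + b;
}

void swapmap_free(int idx)
{
    int g = idx / 64, b = idx % 64;
    group_map[g] |= 1ULL << b;
    summary      |= 1ULL << g;
}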

Fig. 5 Crash Recovery & Restructuring. When node 5 crashes, the ReGIM sweeps the entire tetris to find the defective lego block, which is replaced with a new lego whose shape is not matched with any of the live lego blocks in the same row. Data rebuilding can be done using the mechanism shown in Fig. 4. Restructuring occurs when node 9 joins

3 Performance evaluation We evaluated the performance of our reliable remote memory system (R2 MS) using three types of applications: a Main Memory DBMS (MMDB), a file system benchmark program, and a set of parallel benchmark programs. These application sets represent the most important types of large memory data processing applications. Intensive experiments with these applications show a broad performance spectrum suitable to adequately address the pros and cons of using the reliable remote memory system. The Solid-State Drive (SSD) is an emerging storage device which is known to provide much faster access speeds and lower latency than HDD. Due to the attractive performance characteristics of the SSD, we can expect that an SSD-based low cost system can make remote memory systems useless. To investigate the validity of this expectation, we conducted several experiments to verify whether an SSD-based low cost system could beat a low cost remote memory system, such as the R2 MS, in performance. 3.1 Evaluation of main memory databases 3.1.1 Experimental environment We first evaluated the OLTP (Online Transaction Processing) performance on MMDB. We used MySQL Cluster,

which enables the clustering of in-memory databases in a shared-nothing system [3]. The core components of the MySQL Cluster are mysqld, ndbd, and ndb_mgmd. mysqld is the process that allows external clients to access the data in the cluster. ndbd stores data in the memory and supports both replication and fragmentation. ndb_mgmd manages other processes in the cluster. Although the MySQL Cluster is known to scale up to large clusters, we performed all of our experiments with a single ndbd process in order to observe the behavior of general main memory databases, which are sometimes not scalable. To measure the OLTP performance on MMDB, we used BenchmarkSQL [4], which is a JDBC benchmark that closely resembles the TPC-C standard for OLTP. To compare the performance of the R2 MS with a system that has sufficient memory space, we set up two experimental environments, Remote Memory (RM) and Local Memory (LM). Figure 6 shows the configuration of the RM environment, which is built on the R2 MS. Two front-end nodes are equipped with two 1.8 GHz Dual Core AMD Opteron(tm) Processor 265 CPUs, 2 GB of RAM, and one 250 GB SATA disk. We installed the MySQL Cluster software on these two nodes. Node 1 (front-end1) executes mysqld and ndb_mgmd, and node 2 (front-end2) executes ndbd. By separating MySQL daemons into different machines, we can observe the exact behavior of the databases without the un-

expected interferences of temporal memory that would otherwise occur. Each of the six back-end machines has 3 GB of memory and one hyperthreading-enabled Xeon 3.2 GHz CPU, and exports 2.5 GB of local memory to the front-end2 node. The machine running the TPC client programs has the same specifications as the back-end machines. The front-end1 node runs the original Linux kernel-2.6.13, and the front-end2 node and six back-end nodes run the modified Linux kernel to support the remote memory. Notably, while the front-end1 node has enough memory to process queries and manage cluster nodes, the front-end2 node has insufficient memory to store data, and, thus, the remote memory needs to be exported from the back-end nodes.

In the LM environment, unlike the RM environment, the front-end2 node is configured so as not to use the physical memory of the back-end machines. Instead, we provide the front-end2 node with a memory space of 8 GB RAM running Linux with an unmodified 2.6.13 kernel. Throughout our evaluation of the LM environment, we only performed experiments in which a swap operation did not occur. To configure the memory-related parameters, we performed experiments changing only the parameters DataMemory and IndexMemory, while the default values were used for the remaining parameters. The sizes of DataMemory used were 8 GB, 10 GB, and 12 GB, and the sizes of IndexMemory used were 1 GB, 1.5 GB, and 2 GB (as the number of warehouses increases from 50 to 100), respectively. For the remainder of this article, "warehouse" or "warehouses" will be abbreviated as WH.

To measure the scalability of the OLTP performance, we changed the storage size. Under the 50 WH condition, the front-end2 node running the ndbd process demands almost 8 GB, which is the full amount of memory under the LM environment. Hence, we could not test the 75 and 100 WH cases in the LM environment. It is worth noting that the performance decreases as the number of WH increases.

Fig. 6 Remote memory environment

Table 1 Results of the BenchmarkSQL test: the performance ratio is the ratio of the measured tpmC to that of the LM environment with 50 WH

# of WH               50 (LM)   50 (RM)   75 (RM)   100 (RM)
Measured tpmC         3,163     2,940     2,726     2,610
Performance Ratio     1         0.92      0.87      0.82
Remote Memory Used    N/A       6.4 GB    8.4 GB    10.6 GB

3.1.2 Experimental results

We measured the tpmC in each configuration. Table 1 shows the measured tpmC. As expected, we obtained the highest tpmC at the smallest WH instance in the LM environment. However, it is surprising that the performance of the R2 MS (RM environment) is very comparable to that of the local memory-only system (LM environment), notwithstanding the large amount of remote memory usage. As shown in the table, the OLTP performance in the RM environment is 92% of that in the LM environment when the number of WH is 50. Following a careful investigation, we found that, in the RM environment, the local memory worked as a fast cache for the remote memory. The ndbd process had a strong locality in its data access pattern when processing the TPC-C-like workload, and the remote memory system (R2 MS) could exploit this locality. Indeed, frequently accessed data, such as indices, were always in the local memory, while less frequently accessed data, such as unpopular records, were relocated to the remote memory. We can also see from the table that, although the ratio of remote to local memory increased as the number of WH increased (the ratio was approximately 3.2 and 5.3 for the 50 WH and 100 WH cases, respectively), the performance degradation remained within 10%. This implies that the R2 MS can provide good scalability for main memory databases.

Fig. 7 Trace of bandwidth usage for accessing remote memory with varying numbers of WH (# of clients = 5)

To compare the performance of the R2 MS with that of a disk-based DBMS, we measured the tpmC using the MyISAM engine in the LM environment (we also measured the tpmC using the HEAP engine in the LM environment and noted similar results to those of the MyISAM engine case). The measured values were

between 300 and 400. The MMDB built on R2 MS outperforms disk-based databases by a factor of 7 to 8, and the results are much closer to those of the LM environment than those of disk-based databases. Figure 7 shows the amount of data read from or written to the remote memory during the execution of the BenchmarkSQL. Clearly, more remote memory for data storage means that more data are transmitted between the front-end and back-end nodes. Therefore, it is natural that the amount of data transmission increases as the number of WH increases. To measure the network traffic that each transaction type incurs, we configured a single client to issue a total of 5,000 transactions, one by one, every 10 seconds. For each transaction, we captured the exact amount of data read from or written to the remote memory. We categorized the transactions into five types and calculated the average amount of data transfer for each transaction type. Table 2 shows the breakdown of network usage. Among the transaction types, the “Delivery” transaction incurs the highest traffic, because it consists of complex SQL statements. Since “Stock-Level” and “Order-Status” transactions consist of only a few select statements, they show relatively small amounts of remote write traffic. “New-Order” and “Payment” transactions contain a few update statements as well as several select and insert statements. Thus, they incur more traffic than the “Stock-Level” and “Order-Status” transactions. It is apparent that more insert or update statements result in higher remote write traffic and more select statements result in increased read traffic. Since the back-end nodes in the R2 MS do nothing but export memory space to the front-end node, increasing the number of back-end nodes only increases the remote memory space available to front-end node. Hence, given the numbers of clients and warehouses, the performance (tpmC value) does not increase as the number of back-

Table 2 The breakdown of network usage at a WH of 50: Remote-Write and Remote-Read are the average amounts of data written to and read from the remote memory per transaction. The number of page I/O is the total number of remote page transfers (read or write) for 5,000 transactions (New-Order: 45%, Payment: 43%, Delivery: 4%, Stock-Level: 4%, Order-Status: 4%).

Transaction Type   Remote-Write Traffic (KB)   Remote-Read Traffic (KB)   # of page I/O
Delivery           114.320                     123.139                    1,936
Stock-Level        24.260                      50.810                     1,991
Order-Status       25.810                      64.266                     1,942
Payment            27.629                      58.832                     21,460
New-Order          39.979                      68.461                     22,671

end nodes increases. However, it is possible that when the number of back-end nodes increases, the latency of the RDMA read/write operations would increase, resulting in even worse tpmC value. If this phenomenon occurs in the R2 MS, we cannot say that the R2 MS is scalable. To check whether this would occur in the R2 MS, we conducted experiments that measure the tpmC values for varying numbers of back-end nodes. Figure 8 shows the measured tpmC values with increasing numbers of back-end nodes. As we can see from the figure, the measured tpmC does not change irrespective of the number of participant back-end nodes. The results imply that the latency of the RDMA read/write operations does not increase as the number of back-end nodes increases, when the numbers of clients and warehouses are fixed. 3.2 Evaluation of the virtual memory file system This section details the evaluation results for the applications on the Virtual Memory File System (VMFS). The

VMFS keeps all file system structures, including metadata, in memory to benefit from fast RAM. Measuring the VMFS performance on the R2 MS reveals the potential benefits of using the VMFS as a fast local file system for disk-based databases, rather than a local disk. To measure the VMFS performance, Postmark (a file system benchmark), BenchmarkSQL, and TPC-H [5], a decision support benchmark, were used. The system configuration in these experiments is similar to that in the previous experiments. The only difference is that the number of back-end nodes is 7, and the front-end2 node runs mysqld (tpc clients are connected to the front-end2 node). Before executing the benchmark programs, the entire working directories are mounted on the tmpfs file system, which is a type of VMFS. Even though we use MySQL software, the database itself is stored in the tmpfs, not in the address space of the ndbd process.

3.2.1 File system benchmark

Postmark, which is designed to evaluate the performance of file servers for applications such as email, netnews, and web-based commerce, is performed in three phases: file creation, transaction execution, and file deletion. In this experiment, the numbers of transactions and subdirectories are 30000 and 100, respectively. Five experiments were performed by increasing the number of files. Table 3 shows the results of the Postmark benchmark when (1) memory is not shared in the VMFS, (2) memory is shared in the VMFS, and (3) in ReiserFS. In the cases of 400000 and 500000 files in the LM, the experiments could not be performed because the size of the data exceeds the memory of the local machine. Overall, both VMFS cases outperform ReiserFS by a factor of more than 20. These results indicate that remote memory has even better performance than disk.

In the VMFS cases, the LM case shows a better performance than the RM case. Because Postmark is an I/O-intensive benchmark, I/O operations mean memory operations in the VMFS, and since memory operations are likely to involve RDMA operations in the RM, the times for creating files and performing transactions in the RM are longer than those in the LM.

Interestingly, the deletion time in the RM is shorter than that in the LM. In the case of the RM, the inode reclamation that is needed when deleting a file or a directory entails only the removal of mapping information for remote memory pages. Conversely, in the LM, the inode reclamation performs search and reclamation procedures for the target data block. Thus, deletion performance is better in the RM.

Fig. 8 Results upon varying the number of back-end nodes

3.2.2 Results of BenchmarkSQL

We conducted five experiments by increasing the WH for the RM environment and one experiment (WH = 50) for the LM environment. Table 4 shows the results. As seen in the table, the R2 MS shows almost the same performance as the local memory-only system (when the number of WH is 50). The performance ratio exhibits a similar pattern to the one shown in the MMDB experiment. In the MMDB experiments, for each 2 GB increase of remote memory, the performance decreased by 5%. Likewise, in the VMFS experiment, for each roughly 2 GB increase of remote memory, the performance decreased by 2–3%. This means that mysqld with the MyISAM engine has a similar access pattern whether it fetches data from the ndbd process or from the VMFS. (MyISAM does not support transactions, while MySQL Cluster does. To execute TPC-C transactions in MyISAM, we removed the "BEGIN TRANSACTION", "COMMIT" and "ROLLBACK" statements from every TPC-C transaction and executed each SQL statement one by one.)

Surprisingly, the measured tpmC in the VMFS experiment is greater than that in the MMDB experiment. An MMDB, which is faster than a disk-based database, stores all data (including records, indices, and temporary space for relational algebraic operations) in the main memory, and this requires the implementation of distributed systems. Thus, the MMDB system invokes many control messages. In addition, MySQL Cluster supports only a very small temporary space and does not support temporary tables. Therefore, complex queries that require temporary space result in large overhead during the processing of relational algebraic operations. However, when a disk-based database system uses a virtual memory file system instead of a normal disk-based file system, all data are stored in the virtual memory space, and the R2 MS uses remote memory space instead of the swap space. Table 4 implies that the memory hierarchy of a virtual memory file system on the R2 MS (file system cache–swap cache–remote memory) is superior to the process architecture of MySQL, which relies on socket communication (mysqld–ndbd). Hence, special care should be taken with the memory space hierarchy when designing an MMDB architecture.

Figure 9 shows the results of the benchmark when the number of clients varies. The tpmC value increases until

Table 3 Results of the Postmark Benchmark. The number of transactions is set to 30000, and the number of subdirectories is set to 100. C1, C2, and C3 represent Creation Times in VMFS(LM), VMFS(RM), and ReiserFS, respectively. T1, T2, and T3 represent Transaction Times in VMFS(LM), VMFS(RM), and ReiserFS, respectively. D1, D2, and D3 represent Deletion Times in VMFS(LM), VMFS(RM), and ReiserFS, respectively.

# of files   C1      C2      C3        T1     T2      T3        D1     D2     D3
100000       4.25    9.50    216.94    1.95   2.33    738.53    0.90   0.88   27.39
200000       8.72    27.29   457.64    2.04   7.01    771.65    1.90   1.58   39.89
300000       13.15   44.93   730.50    1.98   7.68    787.99    2.67   1.95   58.28
400000       N/A     61.26   952.23    N/A    9.56    982.81    N/A    2.60   57.06
500000       N/A     88.23   1147.54   N/A    10.80   1102.68   N/A    3.12   72.82

Table 4 Results of the BenchmarkSQL: in this experiment, MySQL with the MyISAM engine, which is the default disk-based engine, was used to evaluate its performance on the VMFS (# of clients = 5)

# of WH    Measured tpmC   Remote Memory Used (MB)   Size of Database (MB)   Performance Ratio
50 (LM)    5,585           N/A                       3,921.5                 1
50 (RM)    5,502           2,744.9                   3,921.5                 0.985
75 (RM)    5,284           5,047.9                   5,787.3                 0.946
100 (RM)   5,179           7,208.6                   7,733.2                 0.927
125 (RM)   5,077           9,139.1                   9,627.2                 0.909
150 (RM)   4,872           11,104.3                  11,563.1                0.872

Fig. 9 Results of the benchmark with varying numbers of clients: # of WH = 50

the CPU is saturated, as the number of clients increases. In the case of five clients where the CPU is not fully utilized, the R2 MS shows almost the same tpmC value as the local memory-only system. In the cases of 10, 15, and 20 clients, the difference in performance increases because of the remote page access overhead. The tpmC values are similar for both 20 and 25 clients, indicating that the CPU is saturated. The performance gaps between the LM and RM environments lie within reasonable bounds. Figure 10 shows the tpmC trace for two failures. The R2 MS detects the first crash 5 seconds after the crash occurs (at time 20), and it completes the recovery approximately 10 seconds after detection. During the recovery, mysqld cannot perform any operation because the recovery is under the protection of a global spin lock. The second crash is forced by rebooting another machine at time 93. The R2 MS detects this crash 5 seconds later and it completes the recovery 17

Fig. 10 Crash/recovery experiment with a WH of 100

seconds after the time of detection. The time for the second recovery (22 sec) is longer than that of the first (17 sec) because the size of the data that each remote machine holds increases due to the first crash. The same amount of memory is provided by six back-end machines instead of seven. Notice that mysqld was able to serve transactions for a short time even after the failure, by using data in the local memory, although the performance gradually decreased. We can see from the figure that the R2 MS safely recovers from the single machine crash.

Fig. 11 Results of the benchmark while increasing the scale factor (SF): all of the measured values in the "RM" environments were normalized by that in the "LM" environment with SF = 4

Fig. 12 Two comparative queries

3.2.3 Results of the TPC-H benchmark

To evaluate the performance on the TPC-H benchmark, we conducted six experiments by increasing the Scale Factor (SF), which is related to the database size. For the LM environment, we conducted only one experiment, with an SF of 4, due to the size limitations of the available local memory. The consumed sizes of the remote memory for each RM experiment (as the SF increased from 4 to 8) were measured as 3.5 GB, 4.9 GB, 6.1 GB, 7.1 GB, and 8.1 GB. The TPC-H benchmark suite generates 22 test queries. We observed the execution times of the test queries in each experiment and calculated the normalized running times using the case of an SF of 4 in the LM. Of note is that the 1st, 13th and 18th queries could not be executed under the version of MySQL that we used in this evaluation. Figure 11 shows the normalized running times of the 19 remaining queries. The execution time at an SF of 4 in the LM is 1 for all of the queries. Clearly, for certain queries such as the 8th, 9th, 11th, 14th, 19th, and 20th, increasing the SF led to much longer running times and, in the other cases, to only slightly longer running times. The differences between the running times depend heavily on what type of record a join operation is performed on. For example, as shown in Fig. 12(a), the join operation of the 14th query is performed on the PART and LINEITEM tables. However, L_PARTKEY in the WHERE clause is not the primary key of the LINEITEM table, which is the biggest table in the database, nor does it have any index created for itself. Therefore, when it is executed in the RM environment, the 14th query requires many more database accesses,

which leads to an unbearably long execution time. The other queries that show much longer running times are similar to the 14th query. Meanwhile, the join operation of the 3rd query, as shown in Fig. 12(b), is performed on CUSTOMER, ORDERS, and LINEITEM tables. Both C_CUSTKEY and O_ORDERKEY in the WHERE clause are the primary keys in the table, and indices were built on them. Therefore, the query does not need to frequently access the database, which results in only a slight increase in running time. The evaluation results demonstrate that in read intensive transactions (heavy join operations), the overall performance in the RM environment is mostly affected by whether the database itself performs optimized read operations or not, regardless of the size of the database. For example, if we make appropriate indices for tables which are stored in the remote memory, the R2 MS would show a reasonable performance. 3.3 Evaluation of scientific applications 3.3.1 Experimental environment In this section, we evaluate the performance of scientific applications on the R2 MS. Since scientific applications generally show very low data access locality, we can predict that the remote memory system would show relatively poor performance. For the scientific applications, we used the serial version of NAS Parallel Benchmarks 3.2 (NPB3.2-SER) [6] and selected four applications of the benchmarks: bt, cg, ft, and sp. The problem sizes of bt and ft are class B, and


those of cg and sp are class C. Each application is abbreviated using the form "application name.problem size", e.g., bt.B and sp.C. The system configuration in these experiments is also similar to the configuration in the previous experiments, other than the fact that the client machine is not needed in this evaluation. Each application program is executed at the front-end2 node, and the back-end nodes provide remote memory when the amount of local memory in the front-end2 node is not sufficient. bt.B, cg.C, ft.B, and sp.C use 1,219 MB, 1,109 MB, 1,685 MB, and 1,262 MB of memory, respectively. To make the front-end2 node invoke remote memory accesses, we performed experiments with relatively small local memories (512 MB, 640 MB, 768 MB, 896 MB, and 1,024 MB). For the LM environment, we used 2 GB of local memory, which is far larger than the amount of memory required by each application.

3.3.2 Experimental results

Figure 13 shows the normalized running times of each experiment (the execution time in the LM environment is 1). As the amount of local memory increases, the running time decreases. In the case of bt.B, the R2 MS shows very poor

performance even when the local memory size is 1,024 MB. This is due to the behavior of the program. The main loop of bt proceeds from the compute_rhs function to successive functions (x_solve, y_solve and z_solve). The x_solve and y_solve functions access matrices while increasing the value in the z-direction. Whenever the value of the z-direction changes, the data accessed by these functions is always new. In the case of the z_solve function, whenever the value of the x-direction or y-direction changes, the data z_solve accesses is new. This pattern exhibits little temporal and spatial locality and, as a consequence, incurs a lot of page in/out traffic, as shown in Figs. 14(a) and 14(b). Meanwhile, ft.B in the RM environments is only slightly slower than in the LM environment, even though it uses far more memory space than bt.B. This is because ft.B performs Fast Fourier Transformations (FFTs) in the x-, y-, and z-directions successively and repeatedly. Since each FFT in any direction has strong locality, there are some phases that incur less page in/out traffic, as shown in Figs. 15(a) and 15(b). The results indicate that, to benefit greatly from using remote memory systems, developers need to write applications based on iterations of matrix operations and should also consider locality between iterations or successive functions, as illustrated by the sketch below.
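The following sketch (our own illustration, not NPB code) contrasts the two access patterns discussed above on the same three-dimensional array: a sweep whose innermost loop strides across planes touches a new, possibly remote, page on almost every access, while a plane-by-plane sweep reuses each fetched page many times before moving on.

#define NX 256
#define NY 256
#define NZ 256

static double u[NZ][NY][NX];    /* ~128 MB; contiguous in x, then y, then z */

/* Poor locality (bt-like): the innermost loop strides through z, so
 * consecutive accesses are NX*NY*sizeof(double) bytes apart and hit a
 * different page (possibly a remote one) almost every time. */
void sweep_strided(void)
{
    for (int y = 0; y < NY; y++)
        for (int x = 0; x < NX; x++)
            for (int z = 0; z < NZ; z++)
                u[z][y][x] += 1.0;
}

/* Good locality (ft-like): the innermost loop walks the contiguous x
 * dimension, so each 4 KB page is reused hundreds of times before the
 * next page, local or remote, is needed. */
void sweep_contiguous(void)
{
    for (int z = 0; z < NZ; z++)
        for (int y = 0; y < NY; y++)
            for (int x = 0; x < NX; x++)
                u[z][y][x] += 1.0;
}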

Fig. 13 The running time of each benchmark

3.4 RDMA vs. SSD

In this section, we compare MMDB on R2 MS to RDB on a solid-state drive (SSD) and a hard-disk drive (HDD), measuring OLTP performance. As we noted earlier, through this comparison, we can verify whether an SSD-based low cost system is more feasible than a low cost remote memory system, such as the R2 MS, for large memory data processing. For MMDB on the R2 MS, we used the same experimental configuration as was used in Sect. 4.1. For RDB on SSD and

Fig. 14 Trace of Memory Page In/Out (bt.B, the size of local memory: 768 MB)


Fig. 15 Trace of Memory Page In/Out (ft.B, the size of local memory: 768 MB)

Fig. 16 Trace of Average tpmC (SSD, ext3 and MyISAM)

HDD, we equipped the front-end1 node with an SSD (SAMSUNG MMCRE64G5MXP, 64 GB capacity) and an HDD, and we configured the MySQL parameters so as to use MyISAM as the storage engine. MyISAM is the fastest disk-based storage engine that MySQL provides. While the SSD is known to show outstanding performance, it has a few drawbacks inherited from the characteristics of NAND flash memory. Since an 'in-place update' is not possible in NAND flash memory, when an update occurs, either the data block containing the original data is erased before the updated data can be written, or the updated data is written to a separate page and the page storing the original data is marked as invalid. The invalidated pages can be recycled through garbage collection, during which the bulk of the block erasures and page copies are performed. Since the write operation in flash memory is an order of magnitude slower than the read operation, and the block erasure operation is an order of magnitude slower than the write operation, the overall update cost in an SSD is very high. In particular, random writes to an SSD, which incur large numbers of block

erasures and page copy operations internally, dramatically increase the response time to the file system. Figure 16 shows the tpmC trace under the configuration composed of SSD, the ext3 file system and the MyISAM storage engine. As seen in this figure, the measured performance increases from the beginning and degrades after 20 seconds. TPC-C-like applications lead to updates of many existing database tuples. At the increasing part, modified tuples reside in the MySQL buffer. However, after the buffer is overcommitted, a time-consuming erase operation must be performed before overwriting to update an existing tuple stored in SSD. The erase operation cannot be performed selectively on a particular tuple, but can only be done for an entire block (128 KB), which is larger than a page (4 KB). The performance degradation results from updates of many database tuples, which leads to many page updates in SSD. Figure 17 shows the tpmC trace under the configuration composed of SSD, the nilfs2 file system, and the MyISAM storage engine. Log structured file systems, such as nilfs2, are known to be appropriate for NAND flash memory-based storage devices because of the ‘not-in-place update’ char-


Fig. 17 Trace of Average tpmC (SSD, NILFS2 and MyISAM)

Fig. 18 VMFS (LM, RM, SSD) vs. ReiserFS (HDD): Postmark Result

For this experiment, we configured the SSD partition so that 44% of the partition was filled by the database population. The figure shows that the measured performance of this configuration is better than that of the MMDB experiment (Sect. 3.1). After about 600 seconds, however, the performance degrades sharply. In nilfs2, as in other log-structured file systems, updated tuples are stored as logs. Although logging avoids erase operations, the log volume in nilfs2 grows quickly. After 600 seconds, the partition was full of database data and logs, and the nilfs2 garbage collector was forced to reclaim disk space. This forced reclamation is extremely time-consuming, and while it runs nilfs2 does not respond to any file system operation; the unplotted regions in the figure correspond to intervals during which no disk reads or writes could be performed. Figure 18 shows the results of Postmark in the various environments. For this experiment, we used the SSD device as a swap device. The figure shows that the VMFS cases outperform ReiserFS; in particular, the VMFS cases with the RM and the LM outperform the one with SSD by a factor of more than 14. These results indicate that remote memory performs even better than SSD. Finally, we performed TPC-C experiments with injected failures to compare the recovery times of RM, SSD, and HDD.

Table 5 Recovery Time of RM, SSD, and HDD (VMFS, TPC-C, 100 WH)

                        RM    SSD    HDD
  Recovery Time (sec)   17     99    140

To simulate failures, we forcibly detached each device: the block device was forcibly unmounted in the SSD and HDD cases, and the remote node was rebooted in the RM case. Table 5 shows the recovery time of RM, SSD, and HDD. The RM case completes the recovery procedure faster than the SSD and HDD cases by factors of more than 5 and 8, respectively. This indicates that remote memory also recovers faster than SSD and HDD.

4 Discussion

In the previous section, we verified through rigorous evaluations that the R2 MS has performance benefits. In this section, we summarize our experiences.

4.1 Implementation Problem in the Linux swap system

From a top-down perspective, the entry point to the R2 MS is the swap index generation routine: in the R2 MS, the swap index is the unique key used to locate a remote data page. Unfortunately, the current implementation of index generation, even in the latest Linux, i.e., RedHat kernel-2.6.13-15, cannot be used as it is. Once the contiguous free index slots run out, the current implementation searches linearly for a free slot, starting from the lowest free position in the index slot array (as shown in Fig. 20(a)). Consequently, if the search reaches the end of the swap index array, the search time becomes sensitive to the memory access behavior of the running program. Figure 19 shows the results of our initial experience with the original index lookup algorithm.

Fig. 19 Trace of the loop count of the linear index lookup algorithm: the loop count is traced at the point at which a free swap index slot is searched for. Once the index lookup algorithm switches to linear scanning mode, the trace shows a large number of unexpected spikes due to the memory access behavior of the MySQL database (on the R2 MS), which has strong spatial locality. (0–1700 sec: data insertion period, 1700–2600 sec: a delimitation period, and 2600–9000 sec: TPC-C benchmark run)

The 'Loop count' in the figure is the number of array entries scanned in order to find an empty slot in the swap index array. A large count means that the kernel scanned almost the entire array to find a free slot, while a small count means that a free slot was found near the front of the search. The current Linux kernel maintains a single large array to keep track of swap index usage. As operating systems textbooks point out, activating the swap system is the last resort for preserving the illusion of a large 32/64-bit virtual address space, because swapping in and out can cause thrashing. This is why the search algorithm is so naively designed: its authors did not expect the swap system to be exercised heavily. For example, even if the total amount of used swap space is only slightly more than half of the total swap size, the occupied slots can be spread in a skewed way over the entire array, so that once the search switches to linear scanning (after 1,200 seconds in Fig. 19), the algorithm spends a significant amount of time looking for a free index slot. This unexpected pitfall occurs when the free index slots, which have a high hit ratio, are positioned at both ends of the index array, while the used slots, which have a very low hit ratio, are spread around the middle of the array. The phenomenon shows that an optimistically designed linear algorithm can easily be broken by a program whose memory access behavior has high spatial locality. A simplified sketch of this linear search is shown below.
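The following user-level C sketch approximates the behavior described above. It is not the kernel code itself (locking, slot clustering, and per-device details are omitted, and all names are ours), but it shows why a fragmented map degrades into an O(n) scan for every allocation.

#include <stddef.h>

struct swap_slots {
    unsigned char *map;         /* map[i] != 0  =>  slot i is in use     */
    size_t         nslots;      /* total number of index slots           */
    size_t         next;        /* next slot to try in the fast path     */
    size_t         lowest_free; /* where the fallback linear scan starts */
};

/* Returns a free slot index, or (size_t)-1 if the swap area is full.
 * Once the contiguous region behind 'next' is exhausted, every call
 * falls back to a linear scan that, with a skewed slot distribution,
 * may walk across almost the entire array.                             */
static size_t swap_slot_alloc(struct swap_slots *s)
{
    if (s->next < s->nslots && s->map[s->next] == 0) {
        size_t slot = s->next++;        /* fast path: contiguous slots   */
        s->map[slot] = 1;
        return slot;
    }
    for (size_t i = s->lowest_free; i < s->nslots; i++) {  /* O(n) scan  */
        if (s->map[i] == 0) {
            s->map[i] = 1;
            s->lowest_free = i + 1;
            return i;
        }
    }
    return (size_t)-1;                  /* no free slot found            */
}

static void swap_slot_free(struct swap_slots *s, size_t slot)
{
    s->map[slot] = 0;
    if (slot < s->lowest_free)
        s->lowest_free = slot;          /* the next scan must start lower */
}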

Fig. 20 Enhancement of swap index generation routine in Linux kernel

Enhancement. For efficiency, we rewrote Linux's swap index generation routine in a space- and time-efficient manner. We implemented a hierarchical bit-vector compression algorithm to find a free index slot and to return a slot after use, as shown in Fig. 20(b). Each bit represents a single index slot, which in turn points to a single page. To cover a maximum of 1 terabyte of swap space, we implemented the algorithm with 7 levels of hierarchy. The bit-vector structure consumes less than 4 MB to cover 64 GB worth of index slots and takes constant time to flip a used index slot or to find a free one, irrespective of the memory access behavior of the application. This eliminates the problem described above and plays a critical role in improving overall efficiency. A minimal two-level version of the idea is sketched below.
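The sketch shows a two-level version in C; the real implementation uses seven levels to cover up to 1 TB of swap space, and the names here are ours rather than the kernel's. A set bit means the slot is in use; the summary word records which leaf words are completely full, so both allocation and release take a constant number of word operations regardless of fragmentation. The sketch uses the GCC/Clang intrinsic __builtin_ctzll to find the first zero bit.

#include <stdint.h>

#define SLOTS 4096                      /* illustrative swap area size    */
#define WORDS (SLOTS / 64)              /* 64 leaf words of 64 bits each  */

static uint64_t leaf[WORDS];            /* bit set  => slot in use        */
static uint64_t summary;                /* bit set  => leaf word is full  */

/* Find and claim a free slot, or return -1 if the area is full. */
static int slot_alloc(void)
{
    if (~summary == 0)
        return -1;                               /* every leaf word full  */
    int w   = __builtin_ctzll(~summary);         /* first not-full word   */
    int bit = __builtin_ctzll(~leaf[w]);         /* first free bit in it  */
    leaf[w] |= 1ULL << bit;
    if (~leaf[w] == 0)
        summary |= 1ULL << w;                    /* word just became full */
    return w * 64 + bit;
}

/* Release a slot; both levels are updated with two word operations. */
static void slot_free(int slot)
{
    int w = slot / 64, bit = slot % 64;
    leaf[w] &= ~(1ULL << bit);
    summary &= ~(1ULL << w);                     /* word is no longer full */
}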


Fig. 21 Original Linux swap algorithm (L) vs. bit-vector compression algorithm (B): Postmark result

However, this modification can have negative effects on SSD-based storage devices. When a swap index must be chosen for a swap write, the bit-vector compression algorithm returns the earliest free index, which is very likely to have been freed only recently by a previous swap read; writing to it can therefore trigger an expensive block erase before the page update. Hence, even when many clean pages remain that could be written without an erase, the algorithm keeps returning recently freed dirty pages that do require one. Conversely, the index search algorithm in the native Linux kernel returns a free index slot by scanning linearly from the current position in the index slot array. Figure 21 shows the effect of the bit-vector compression algorithm on SSD. SSD (L) denotes SSD used as a swap device under the unmodified Linux kernel, while SSD (B) denotes SSD used as a swap device under our modified kernel, in which the bit-vector compression algorithm chooses the swap slot index. The figure shows that the performance of SSD (B) is much worse than that of SSD (L). Therefore, we used the original swap algorithm for all SSD-based experiments in Sect. 3.4.

4.2 Performance

It is well known that an MMDB is generally faster than a disk-based database, since an MMDB stores all data, including tuples, indices, and temporary data, in main memory. Nevertheless, the performance of TPC-C-like applications in the RM-based VMFS experiment exceeded that of the MMDB experiment. When a disk-based database system uses the VMFS instead of a normal disk-based file system, all of its data is stored in the virtual memory space backed by the VMFS, which naturally benefits from the large virtual memory space provided through the R2 MS and its high-speed, low-latency remote memory access. We therefore believe that a VMFS-based database can be a practical alternative to an MMDB. Recently, SSDs have been widely regarded as much faster storage devices than hard-disk drives, because they provide fast random reads and low latency. We compared an SSD-based system to the R2 MS under several configurations using the TPC-C-like benchmark and Postmark, with several noteworthy results. First, when the ext3 file system was used on the SSD, the performance of the TPC-C-like benchmark deteriorated, since the meta-data updates in ext3 trigger heavy erase operations. We believe the same phenomenon will appear with ext2, ReiserFS, and other Berkeley FFS-like file systems, since

they store parts of their meta-data at fixed locations, as ext3 does. Second, when the nilfs2 file system was used on the SSD, the performance of the TPC-C-like benchmark was also unsatisfactory: the log grew quickly and forced a complex and costly reclamation procedure to secure disk space. This rapid log growth, and the resulting performance degradation, can be expected in LFS-like file systems in general. Third, when the SSD was used as a swap device, the performance of Postmark was not comparable to that of the RMS (we also tried to evaluate the SSD configuration with the TPC-C-like benchmark, but had to abandon the attempt because database population was extremely slow). Moreover, the choice of the swap index generation algorithm matters, since an unsuitable algorithm can cause many erase operations on the SSD. From these lessons, we conclude that it is currently difficult to build robust and reliable systems on SSDs, since both the devices and the file systems designed for them are still immature. One might argue that a computer system with a very large amount of DRAM is already affordable. Such a system has clear benefits over the R2 MS, such as higher performance and no need for a crash/recovery mechanism. However, it is not cost-effective, since special hardware is generally required to accommodate that many DRAM modules, and such systems do not scale as well as the R2 MS precisely because of those hardware requirements. From this perspective, remote memory systems such as the R2 MS remain valuable. Random access memory is very attractive in terms of access time, and its characteristics have motivated many projects on MMDB systems, which are orders of magnitude faster than disk-based database systems. Nevertheless, current large memory processing systems, such as MMDBs, do not fully exploit remote memory because of the common belief that remote memory systems are neither as efficient nor as reliable as expected. Our evaluations show that this is not always true. For TPC-C-like applications, the R2 MS-based MMDB with a naive configuration is more than an order of magnitude faster than the disk-based databases, and it achieves reasonable performance compared with its LM-based counterpart. Moreover, a VMFS-based database on the R2 MS performed even better than the MMDB. For well-engineered scientific applications such as ft.B, the R2 MS showed only slightly lower (i.e., reasonable) performance than the LM.

5 Related works

Memory hierarchy and memory sharing have long been traditional research topics [8–11]. Closely related work on memory management for distributed architectures includes page placement strategies for distributed shared-memory architectures and shared virtual memory systems.


The cost of accessing local memory is significantly lower than that of accessing remote memory. Research in this area [12–15] has shown that dynamic page replacement is an effective solution to the problem. However, this technique is not recommended for cluster architectures that rely heavily on explicit message communication, unless they are outfitted with special hardware for direct access to remote memory. There have also been numerous studies on remote memory sharing. Markatos' work [16] discussed many important issues in implementing reliable network memory, one of which relates directly to our research. He implemented his system by adding a block device driver to the DEC OSF/1 operating system. Conversely, our work uses a much faster network to validate the efficacy of a reliable remote memory system and redesigns the poorly designed swap interface. Liang's work [17] was the first to exploit the InfiniBand network interface for remote memory. Its contribution is the use of InfiniBand rather than Ethernet; however, it lacks reliability support. Feeley's work [18] presented global memory management in a workstation cluster. The system employs a single but distributed memory management algorithm to manage all cluster-wide memory. The global memory manager can be regarded as a global paging system and thus applies a global page replacement policy. However, it involves explicit inter-node communication to exchange various control messages; this differs from our work, in which remote memory is managed without exchanging explicit control messages. In [19, 20], cooperative cache and buffer management techniques for distributed systems are proposed. These techniques are based on the degree of locality: data with high (low) locality scores are placed in a high-level (low-level) cache. It is currently difficult to determine the locality score of virtual memory pages in the R2 MS kernel, and devising such a locality metric for cooperative page placement is part of our future work. Comer [21] described a remote memory model in which the cluster contains workstations, disk servers, and remote memory servers. The remote memory servers are dedicated machines whose large primary memories can be allocated to workstations with heavy paging activity. No client-to-client resource sharing occurs, except through the servers. Franklin et al. [22] examined the use of remote memory in a client-server DBMS. Their system assumes a centralized database server that contains the disks for stable storage plus a large memory cache. Clients interact with each other via the central server. On a page read request, if the page is not cached in the server's memory, the server checks whether any client has that page cached; if so, the server asks that client to forward its copy to the workstation requesting the read. They evaluated several variants of this algorithm using a synthetic database workload.

Dahlin et al. [23] evaluated several algorithms for utilizing remote memory, the best of which is called N-chance forwarding. Under N-chance forwarding, when a node is about to replace a page, it checks whether that page is the last copy in the cluster; if so, the node forwards the page to a randomly picked node, otherwise it discards the page. Each page sent to remote memory carries a recirculation count, and the page is discarded after it has been forwarded N times. When a node receives a remote page, that page is made the youngest on its LRU list, possibly displacing another page on that node; if possible, a duplicate page or a recirculating page is chosen for replacement.
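A schematic rendering of the N-chance eviction decision, as summarized above, is given below; the data structures and helper functions are placeholders of our own, not an API of any existing system.

#include <stdbool.h>

#define N_CHANCES 2          /* illustrative value of the circulation count */

struct cached_page {
    int  forward_count;          /* times this page has been forwarded     */
    bool last_copy_in_cluster;   /* no other node holds a copy             */
};

/* Placeholder helpers, assumed to exist elsewhere. */
extern struct cached_page *choose_lru_victim(void);
extern int                 pick_random_peer(void);
extern void                forward_to_peer(struct cached_page *p, int peer);
extern void                discard(struct cached_page *p);

/* Invoked when a node must evict one page from its local cache. */
void evict_one_page(void)
{
    struct cached_page *victim = choose_lru_victim();

    if (victim->last_copy_in_cluster && victim->forward_count < N_CHANCES) {
        victim->forward_count++;                   /* give it another chance */
        forward_to_peer(victim, pick_random_peer());
    } else {
        discard(victim);         /* duplicate elsewhere, or out of chances */
    }
}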

6 Conclusion

This article investigated the potential benefits of a reliable remote memory system. In terms of performance, our evaluation with applications that have high data access locality confirmed the viability of a reliable remote memory system: some of the results approach those of pure in-memory data processing systems. Considering the speed gap between DRAM and networks, this gain is larger than we expected. In particular, the R2 MS outperforms HDD-based (SSD-based) systems by up to 13 (5) times in the TPC-C cases. Meanwhile, the evaluation with applications that have random data access patterns suggests that the R2 MS should be used with care when high throughput is required; in extreme cases (e.g., bt.B in the evaluation of scientific applications) we observed poor performance. This confirms that data access locality is, as we expected, the most important factor affecting application performance on remote memory systems. In addition, the mechanism used to achieve the reliability of the R2 MS proved robust against a series of single-point failures. The following conclusions can be drawn from this research: (1) a reliable remote memory system is a good choice for large memory data processing applications with good locality characteristics, and (2) the OS-level crash/recovery mechanism is robust enough to guard an entire data processing system against a single remote machine failure.

Acknowledgements This work was supported by the National Research Foundation (NRF) grant funded by the Korea government (MEST) (No. 2010-0014387). The ICT at Seoul National University provided research facilities for this study.

References

1. Vishnu, A., Gupta, P., Mamidala, A.R., Panda, D.K.: A software based approach for providing network fault tolerance in clusters with uDAPL interface: MPI level design and performance evaluation. In: SC'06: Proceedings of the 2006 ACM/IEEE Conference on Supercomputing (2006)

2. Vishnu, A., Gupta, P., Mamidala, A.R., Panda, D.K.: An efficient hardware-software approach to network fault tolerance with InfiniBand. In: Proceedings of the IEEE Cluster 2009 (2009)
3. MySQL. http://www.mysql.com
4. BenchmarkSQL. http://pgfoundry.org/projects/benchmarksql
5. Transaction Processing Performance Council. TPC-H. http://www.tpc.org/tpch
6. NAS Parallel Benchmarks. http://www.nas.nasa.gov/Resources/Software/npb.html
7. Shin, D.: About SSD. In: Proc. of the USENIX Linux Storage and Filesystem Workshop (LSF'08) (2008)
8. Amza, C., Cox, A.L., Dwarkadas, S., Keleher, P., Lu, H., Rajamony, R., Yu, W., Zwaenepoel, W.: TreadMarks: shared memory computing on networks of workstations. IEEE Comput. 29(2), 18–28 (1996)
9. Carter, J.: Munin: Efficient Distributed Shared Memory Using Multi-Protocol Release Consistency. Ph.D. Thesis, Rice University (1992)
10. Keleher, P., Cox, A.L., Dwarkadas, S., Zwaenepoel, W.: An evaluation of software based release consistent protocols. J. Parallel Distributed Syst. 29(2), 126–141 (1995)
11. Keleher, P.: Distributed Shared Memory Using Lazy Release Consistency. Ph.D. Thesis, Rice University (1994)
12. Black, D., Gupta, A., Weber, W.D.: Competitive management of distributed shared memory. In: Spring COMPCON 89 Digest of Papers (1989)
13. Bolosky, W., Scott, M., Fitzgerald, R.: Simple but effective techniques for NUMA memory management. In: Proc. of ACM SOSP (1989)
14. Holliday, M.: Reference history, page size, and migration daemons in local/remote architectures. In: Proc. of ACM ASPLOS (1989)
15. Bolosky, W., Scott, M., Fitzgerald, R., Fowler, R., Cox, A.: NUMA policies and their relationship to memory architecture. In: Proc. of ACM ASPLOS (1991)
16. Markatos, E.P., Dramitinos, G.: Implementation of a reliable remote memory pager. In: Proc. of the 1996 USENIX Technical Conference (1996)
17. Liang, S., Noronha, R., Panda, D.K.: Swapping to remote memory over InfiniBand: an approach using a high performance network block device. In: Proc. of the 2005 IEEE Cluster Computing (2005)
18. Feeley, M.J., Morgan, W.E., Pighin, F.H., Karlin, A.R., Levy, H.M.: Implementing global memory management in a workstation cluster. In: Proc. of ACM SOSP (1995)
19. Jiang, S., Petrini, F., Ding, X., Zhang, X.: A locality-aware cooperative cache management protocol to improve network file system performance. In: ICDCS'06: Proceedings of the 26th IEEE International Conference on Distributed Computing Systems (2006)
20. Jiang, S., Davis, K., Zhang, X.: Coordinated multilevel buffer cache management with consistent access locality quantification. IEEE Trans. Comput. 56(1) (2007)
21. Comer, D., Griffioen, J.: A new design for distributed systems: the remote memory model. In: Proc. of USENIX Tech. Conf. (1990)
22. Franklin, M.J., Carey, M.J., Livny, M.: Global memory management in client-server DBMS architectures. In: Proc. of VLDB (1992)
23. Dahlin, M.D., Wang, R.Y., Anderson, T.E., Patterson, D.A.: Cooperative caching: using remote client memory to improve file system performance. In: Proc. of USENIX OSDI (1994)

Hyuck Han received his B.S., M.S., and Ph.D. degrees in Computer Science and Engineering from Seoul National University, Seoul, Korea, in 2003, 2006, and 2011, respectively. Currently, he is a postdoctoral researcher at Seoul National University. His research interests are distributed computing systems and algorithms.

Hyungsoo Jung received the B.S. degree in mechanical engineering from Korea University, Seoul, Korea, in 2002; and the M.S. and the Ph.D. degrees in computer science from Seoul National University, Seoul, Korea in 2004 and 2009, respectively. He is currently a postdoctoral research associate at the University of Sydney, Sydney, Australia. His research interests are in the areas of distributed systems, database systems, and transaction processing.

Sooyong Kang received his B.S. degree in mathematics and the M.S. and Ph.D. degrees in Computer Science, from Seoul National University, Seoul, Korea, in 1996, 1998, and 2002, respectively. He was then a Postdoctoral researcher in the School of Computer Science and Engineering, SNU. He is now with the Division of Computer Science and Engineering, Hanyang University, Seoul. His research interests include Operating System, Multimedia System, Storage System, Flash Memories and Next Generation Nonvolatile Memories.

Heon Y. Yeom is a Professor with the School of Computer Science and Engineering, Seoul National University. He received his B.S. degree in Computer Science from Seoul National University in 1984 and his M.S. and Ph.D. degrees in Computer Science from Texas A&M University in 1986 and 1992 respectively. From 1986 to 1990, he worked with Texas Transportation Institute as a Systems Analyst, and from 1992 to 1993, he was with Samsung Data Systems as a Research Scientist. He joined the Department of Computer Science, Seoul National University in 1993, where he currently teaches and researches on distributed systems, multimedia systems and transaction processing.
