The Journal of Supercomputing, 37, 145–160, 2006. © 2006 Springer Science + Business Media, LLC. Manufactured in The Netherlands.

A Transparent Distributed Shared Memory for Clustered Symmetric Multiprocessors

JYH-BIAU CHANG [email protected]
CE-KUEN SHIEH [email protected]
Department of Electrical Engineering, National Cheng Kung University, Tainan, Taiwan, R.O.C.

TYNG-YEU LIANG [email protected]
Department of Electrical Engineering, National Kaohsiung University of Applied Science, Kaohsiung, Taiwan, R.O.C.

Abstract. A transparent distributed shared memory (DSM) system must achieve complete transparency in data distribution, workload distribution, and reconfiguration. The transparency of data distribution allows programmers to access and allocate shared data using the same user interface as is used in shared-memory systems. The transparency of workload distribution and reconfiguration optimizes parallelism at both the user level and the kernel level, and also improves the efficiency of run-time reconfiguration. In this paper, a transparent DSM system referred to as Teamster is proposed and implemented for clustered symmetric multiprocessors. With the transparency provided by Teamster, programmers can exploit all the computing power of the clustered SMP nodes in a transparent way, just as they do on a single SMP computer. Compared with the results of previous research, Teamster realizes the transparency of cluster computing while obtaining satisfactory system performance.

Keywords: distributed shared memory, symmetric multiprocessor, cluster computing, reconfiguration, thread architecture

1. Introduction

Due to rapid advances in processor capability and falling costs, symmetric multiprocessor (SMP) clusters have become popular platforms for large-scale scientific applications. More and more applications running on shared-memory systems are now being modified to run on distributed SMP systems. However, it is not an easy task for a user to modify a program dedicated to a single node so that it may run in parallel on multiple nodes. In order to reduce the programming difficulties involved in this task, past research has proposed many solutions, including MPI and PVM. Unfortunately, these solutions are not ideal since they are message-passing oriented, and as such they require users to employ explicit send/receive primitives in their programs. Therefore, it is not really convenient to use these methods to exploit SMP clusters. Recently, the distributed shared memory (DSM) approach has been used successfully to hide network messages from users on distributed systems [7, 10, 16–19]. A DSM run-time system emulates a virtual shared memory over a network of computers, each of which has its own physical memory. Consequently, two cooperating processes running on different nodes are able to communicate with each other through memory accesses rather than message passing. In this way, data communication between the two processes is rendered more transparent.


However, previous DSM systems have not been sufficiently transparent for users to make use of them in their work. A transparent DSM system must provide complete transparency in three respects: (1) data distribution, (2) workload distribution, and (3) reconfiguration.

In DSM systems which provide true data distribution transparency, programmers are able to declare and allocate shared data using the same user interface as is used in shared-memory systems. The majority of previously developed DSM systems [1, 10, 16] require the user to use additional annotations, such as the Tmk_malloc and Tmk_distribute calls used in the TreadMarks system [1]. The reason for this is that the cache of the virtual shared memory does not occupy the entire address space of each node. Therefore, if a particular variable is not located in the shared memory address region, the primitives mentioned above are necessary to allow the variable to be shared among the entire cluster. However, the addition of these operations causes a loss of transparency in data distribution.

Transparency of workload distribution in a DSM system means that the number of threads created for a program depends only upon the program algorithm, and is not influenced by the number of execution nodes, the number of processors in each node, or the capability of each processor. However, previous SMP DSMs [19] only supported kernel-level multi-threading rather than user-level multi-threading. Although this approach is able to fully exploit the computational power of all the processors in each node, programming independence is lost, and programmers must therefore partition their problems based on the factors listed above rather than on the program algorithm. Furthermore, since the resources assigned to an application are likely to change during program execution, because the resources within distributed systems are not dedicated to one single application, transparent run-time reconfiguration is essential if good performance is to be achieved. Unfortunately, SMP DSM systems which only support kernel-level multi-threading are not readily able to accomplish this goal, since the migration of kernel threads is almost impossible.

This paper presents a transparent SMP DSM system called Teamster, which is designed for, and implemented on, an SMP cluster running Sun Solaris 8 for x86. Teamster provides a Global Memory Image (GMI), whose purpose is to accomplish transparency of data distribution. With the GMI, the address space of every cooperating process is precisely identical. Programmers are able to access and allocate shared data in the same way as they do on a single SMP computer. Teamster uses a hybrid thread architecture [5] to achieve the transparency of workload distribution and reconfiguration. This thread architecture optimizes parallelism at both the user level and the kernel level, and also improves the efficiency of run-time reconfiguration.

The rest of this paper is organized as follows. Section 2 discusses several previous SMP DSMs. Section 3 introduces the system design of Teamster, while its implementation is described in Section 4. Section 5 evaluates the performance of Teamster, and Section 6 presents brief conclusions and discusses future work.

2. Related work

Several DSM systems have been proposed previously for SMP clusters, including Brazos [19], OpenMP for Networks of SMPs [10], Strings [16], SilkRoad [24–27], CPAR-Cluster [22] and Mome [21, 23]. However, these systems are not sufficiently transparent for users to be able to take advantage of them, as discussed below.

Brazos is a so-called third-generation SMP DSM, which is implemented on Microsoft Windows NT 4.0. Brazos provides software scope consistency and a superset of the PARMACS macro suite [4] for parallel programming. In Brazos applications, all shared data must be allocated using macros such as G_MALLOC, since otherwise the DSM subsystem is unaware of which region needs to be kept consistent. However, this use of explicit annotations leads to a loss of transparency in data distribution. Furthermore, when running user applications, Brazos uses the kernel threads provided by Windows NT in order to fully exploit the SMP computing power. Because thread management is controlled by the underlying operating system, and is limited by the volume of kernel resources, programmers are unable to create as many threads as are required by the application program algorithm. In addition, the migration and dynamic reconfiguration of application threads are impossible under the control of Windows NT, and therefore transparent workload distribution is also impossible.

OpenMP for Networks of SMPs is the first system to employ OpenMP on distributed-memory SMP clusters. OpenMP is an emerging standard for parallel programming on shared-memory multiprocessors. The authors adopted a modified TreadMarks beneath the OpenMP APIs to form a virtual shared memory among the SMP cluster. In this system, programmers use the directives and library of OpenMP to parallelize their applications and to allocate shared data. OpenMP takes care of data and workload distribution transparency, and users are able to develop their programs on this system without needing to consider the DSM. However, the modified TreadMarks underneath OpenMP uses POSIX threads for multi-threading and delegates thread management to another library or subsystem. This has the drawback that any further control of the threads, such as reconfiguration and automatic distribution, is not easily achieved.

Strings is an SMP DSM, modified from Quarks [12], running on Solaris. In this system, users must declare and allocate a DSM region by use of explicit Quarks primitives in order to share data within the SMP cluster. The declaration or allocation must then be broadcast across the network explicitly. When this action has been completed, all Strings nodes are notified and are then able to access the shared data via a memory address translation. In this type of region-based DSM, the complicated memory-access primitives violate the requirements of transparency, and therefore hinder users from developing applications on DSMs. Furthermore, the memory address translation performed for each remote memory access is detrimental to system performance. Another disadvantage is that Strings uses the POSIX threads supported by Solaris, and therefore the system may suffer in terms of thread migration and reconfiguration.

SilkRoad is a variant of Cilk [28]. It extends the memory consistency model of Cilk, which results in the RC dag consistency model. SilkRoad also provides a user-level shared virtual memory. However, programmers need to learn the programming paradigm of Cilk in order to use SilkRoad, and SilkRoad is still a region-based DSM system. Similarly, CPAR-Cluster is an extension of the CPAR parallel programming language [29]; programmers again need to learn a new programming paradigm.

Mome is a runtime system for several parallel languages.
Parallel programs share objects among the cluster, and the authors use a strong consistency model to maintain data consistency. The performance of a strong consistency model under high-degree sharing and false sharing is a concern. Programmers also need to learn a new programming paradigm in this system. Moreover, Mome is still a region-based DSM system.

3. System design of Teamster

Teamster is a transparent distributed shared memory system proposed for SMP clusters. This system has two main design characteristics which distinguish it from other DSM systems. One is the provision of a Global Memory Image (GMI) for all cooperating processes within an application. The other is the adoption of a hybrid thread architecture for thread management. By virtue of these design features, Teamster is able to provide transparency for users in data distribution, workload distribution, and reconfiguration. A detailed discussion of these two design characteristics now follows.

3.1. Global memory image

The Global Memory Image is a uniform and identical global memory image shared by all processes cooperating in the same computing job in Teamster. Under GMI, the entire address space of the cooperating processes is consistent, including code, static global variables and heaps. Only the stacks of the processes and some specific data are private. Memory allocations always take place within the shared memory address space, with no need for explicit annotations. Since all global data are accessed using the same memory addresses by all nodes within Teamster, programmers are able to develop their applications using the same programming paradigm as is used for shared-memory multiprocessors.

In contrast to GMI, most DSM systems are region-based [1, 10, 16], i.e. they use part of the processes' virtual memory space as a DSM region for the memory allocation of shared data. In these region-based DSM systems, a specific annotation is required in order to allocate, or to declare, a shared data variable. If an explicit annotation is not used, the variable will not be allocated into DSM space, and therefore it will not be shared among the cluster. In order to maintain memory consistency, programmers developing applications on these region-based DSM systems are obliged to use an explicit distribution mechanism, such as the Tmk_distribute() function call in TreadMarks, to propagate modifications to static global data. In this type of DSM system, the address of a shared variable varies from one node to another. If memory consistency is to be preserved, a memory address translation mechanism is required when transferring the same data from one node to another. Therefore, region-based DSM increases not only the burden of writing the applications which will run on the system, but also the memory access overhead due to the address translation which must take place between global and local space.

Figure 1 presents an example of a matrix multiplication ported to TreadMarks and to Teamster respectively. In TreadMarks, the globally declared pointer of the matrix, i.e. "A" in this example, must be propagated by the use of the explicit Tmk_distribute() annotation. Under Teamster, however, the "A" matrix pointer will be allocated into DSM space and will be consistent across all nodes automatically. Therefore, Teamster's Global Memory Image feature allows the programmers of Teamster applications to design their parallel programs in a transparent way.

Figure 1. Source codes of MM application in TreadMarks and Teamster.
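
As a rough illustration of the contrast shown in Figure 1 (the listing below is a sketch, not the figure's actual source; TreadMarks-style signatures for Tmk_malloc, Tmk_distribute, Tmk_barrier and Tmk_proc_id are assumed here):

// TreadMarks-style allocation in a region-based DSM (sketch).
// The pointer A_tmk is an ordinary global; only the Tmk_malloc'd
// region is shared, so the pointer value itself must be broadcast.
float **A_tmk;

void tmk_init_matrix(int n) {
    if (Tmk_proc_id == 0) {
        A_tmk = (float **) Tmk_malloc(n * sizeof(float *));
        for (int i = 0; i < n; i++)
            A_tmk[i] = (float *) Tmk_malloc(n * sizeof(float));
        Tmk_distribute((char *) &A_tmk, sizeof(A_tmk));   // propagate the pointer
    }
    Tmk_barrier(0);
}

// The same allocation under Teamster's GMI (sketch): the global variable
// and the heap it points to are already in shared space, so no annotation
// or distribution call is needed.
float **A_gmi;

void teamster_init_matrix(int n) {
    A_gmi = new float *[n];            // lands in the global dynamic heap
    for (int i = 0; i < n; i++)
        A_gmi[i] = new float[n];
}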

3.2. Hybrid thread architecture

The hybrid thread architecture of Teamster [5] fully exploits the multiprocessor computing power of the SMP clusters, and distributes the workload transparently across the SMP DSM cluster. Within the hybrid thread architecture, there are both user-level threads and kernel-level threads.

Figure 2. Teamster system architecture.

The user-level threads in each process act as application threads which execute the parallel sub-tasks, while the kernel-level threads serve as virtual processors to be dispatched among the physical multiprocessors. Teamster's hybrid thread architecture is shown in Figure 2.

Each node of Teamster has a scheduler whose role is to distribute user-level threads transparently across the cluster. The schedulers are also responsible for binding user-level threads to kernel-level threads dynamically within each SMP node. The kernel-level threads are then multiplexed among the underlying hardware processors. Multiplexing kernel-level threads among different processors provides the option of binding a kernel-level thread to one particular processor if desired. One advantage of doing so is that the cost of switching kernel-level threads between different processors is eliminated. However, a drawback of this approach is that system flexibility and high processor utilization are sacrificed. In the case where kernel-level threads are not bound to a particular processor, the scheduler can dispatch any kernel-level thread onto any available processor. The result is high utilization for only a marginal increase in switching time. In our design, kernel-level threads are not bound to any particular processors. Furthermore, although the proposed hybrid thread architecture allows users to specify any number of kernel-level threads, in this study the default number of kernel-level threads created is the same as the number of underlying hardware processors. Based upon our experimental results, this approach allows for full exploitation of the multiprocessors' computing power without incurring too much overhead in the context switching of kernel-level threads.

The effort involved in developing reconfiguration for Teamster is minimized by virtue of its hybrid thread architecture and the GMI, which allows all objects related to the user-level threads to be viewed by all nodes of the SMP cluster. Therefore, thread migration in the proposed hybrid thread architecture is easily accomplished.

When user-level threads are migrated, it is sufficient to pass only the pointers of the thread control blocks from the source nodes to the destination nodes. The other related movements, such as those of the stacks, memory pages and synchronization objects, are hidden by the GMI. For example, when a thread accesses its stack after it has been migrated to the destination node, the GMI will move the memory pages of the stack from the source node to the destination node in order to maintain memory consistency. Furthermore, it is possible to create different numbers of kernel threads for different system load situations. Finally, since the kernel-level threads are the basic units of CPU time dispatching, the computation resources occupied by DSM applications may be easily controlled by varying the number of kernel-level threads created.

Most SMP DSMs, e.g. Brazos and Strings, employ the thread library supported by the underlying operating system. Although this approach reduces the effort in system development, it also restricts the flexibility of thread management. These systems usually assume an equal number of kernel-level threads at each node of the cluster. This assumption can incur performance degradation if the computing power of the nodes across the DSM cluster is not equal at all times. Even when thread migration is supported, it is complicated by the involvement of the operating system kernel. Furthermore, dynamic creation and deletion of threads is not easily controlled. Therefore, it is clear from the preceding discussion that reconfiguration is not supported in most SMP DSM systems.
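
A minimal sketch of the hand-off described above, with hypothetical type and function names (TCB, send_msg and enqueue_ready are placeholders, not Teamster's internals): because the thread control block and stack live inside the GMI, migrating a user-level thread only requires shipping a pointer; the pages behind it follow on demand.

#include <ucontext.h>

// Hypothetical thread control block; in Teamster such objects live in the
// DSM objects segment, so the same pointer is meaningful on every node.
struct TCB {
    void       *stack_base;   // stack allocated inside the GMI
    ucontext_t  context;      // saved register state
    int         state;        // e.g. READY, RUNNING, BLOCKED
};

// Placeholders for the scheduler and communication subsystem (assumed).
void remove_from_ready_queue(TCB *t);
void enqueue_ready(TCB *t);
void send_msg(int node, int type, const void *data, int len);
enum { MSG_MIGRATE = 1 };

// Source node: only the TCB pointer travels in the migration message.
void migrate_thread(TCB *t, int dest_node) {
    remove_from_ready_queue(t);
    send_msg(dest_node, MSG_MIGRATE, &t, sizeof(t));
}

// Destination node: make the thread runnable; its stack and data pages are
// fetched later by the DSM layer when the thread first touches them.
void handle_migrate(TCB *t) {
    enqueue_ready(t);
}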

4. Implementation

Teamster is built on a cluster of symmetric multiprocessor machines connected by Fast Ethernet. The underlying operating system is Sun Solaris 8 for x86. An overview of the Teamster system is presented in Figure 3.

Figure 3. Teamster system overview.

Figure 4. Thread scheduling.

There are three main components within Teamster, namely the hybrid thread subsystem, the DSM subsystem and the communication subsystem. Applications of Teamster lie above these three subsystems, and employ the facilities of the hybrid thread subsystem to create the application threads necessary for parallel computation. These threads are managed and scheduled by the hybrid thread subsystem such that they are executed across the entire Teamster cluster. The hybrid thread subsystem is also responsible for multiplexing application threads over the kernel-level threads in such a way as to fully exploit the computing power of the multiprocessors. The DSM subsystem is the heart of Teamster, since it provides an abstraction of globally distributed shared memory space, namely the GMI, for the applications. Applications can allocate shared data into the DSM space and can then access them transparently within the Teamster cluster. For performance reasons, a simplified reliable communication subsystem based on a stop-and-wait mechanism is implemented over UDP, and all Teamster messages pass through this subsystem to accomplish information exchange.
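
The communication subsystem described above can be pictured roughly as follows: a stop-and-wait sender over a plain UDP socket that retransmits until a matching acknowledgement arrives. The header layout, timeout and retry count are illustrative assumptions, not Teamster's actual protocol.

#include <sys/socket.h>
#include <netinet/in.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>
#include <cstdint>
#include <cstring>

// Stop-and-wait send over UDP: transmit, wait briefly for an ACK carrying
// the same sequence number, and retransmit on timeout (sketch only).
bool sw_send(int sock, const sockaddr_in &peer,
             const void *payload, size_t len, uint32_t seq) {
    char buf[4096];
    if (len + sizeof(seq) > sizeof(buf)) return false;
    std::memcpy(buf, &seq, sizeof(seq));              // assumed 4-byte header
    std::memcpy(buf + sizeof(seq), payload, len);

    for (int attempt = 0; attempt < 10; ++attempt) {
        sendto(sock, buf, sizeof(seq) + len, 0,
               (const sockaddr *) &peer, sizeof(peer));

        fd_set rd;
        FD_ZERO(&rd);
        FD_SET(sock, &rd);
        timeval tv = {0, 100000};                     // 100 ms timeout
        if (select(sock + 1, &rd, nullptr, nullptr, &tv) > 0) {
            uint32_t ack = 0;
            if (recv(sock, &ack, sizeof(ack), 0) == (ssize_t) sizeof(ack)
                    && ack == seq)
                return true;                          // acknowledged
        }
        // timed out or saw an unexpected packet: retransmit
    }
    return false;
}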

4.1. Thread system

The multithreaded programming environment is the three-level hybrid thread architecture shown in Figure 2. Users create as many user-level threads as are required, and these threads are then multiplexed onto several kernel-level threads, i.e. light-weight processes (LWPs) in Solaris. These LWPs are then dispatched by the operating system onto the hardware processors. As shown in Figure 4, there is a scheduler within each node in Teamster. When a user program starts, the scheduler of the local node is responsible for creating LWPs and for preparing a ready queue array for the storage of ready threads in Teamster. Each LWP corresponds to an entry in the ready queue array. An LWP is a thread consumer, which receives a thread from its corresponding ready queue via the scheduler. If this ready queue is empty, it receives a thread from another LWP's ready queue.
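
The ready-queue arrangement just described might be organized roughly as in the sketch below (ReadyQueues and UserThread are illustrative names, not Teamster's own types): each LWP takes threads from its own queue and, when that queue is empty, falls back to another LWP's queue.

#include <deque>
#include <mutex>
#include <vector>

struct UserThread;                     // user-level thread descriptor (opaque here)

// One ready queue per LWP, as in Figure 4; all names are illustrative.
class ReadyQueues {
public:
    explicit ReadyQueues(int num_lwps) : queues_(num_lwps), locks_(num_lwps) {}

    void push(int lwp, UserThread *t) {
        std::lock_guard<std::mutex> g(locks_[lwp]);
        queues_[lwp].push_back(t);
    }

    // Called by an LWP: take from its own queue first, otherwise fall back
    // to the other LWPs' queues.
    UserThread *next(int lwp) {
        if (UserThread *t = pop(lwp)) return t;
        for (size_t other = 0; other < queues_.size(); ++other)
            if ((int) other != lwp)
                if (UserThread *t = pop(other)) return t;
        return nullptr;                // nothing runnable anywhere
    }

private:
    UserThread *pop(size_t q) {
        std::lock_guard<std::mutex> g(locks_[q]);
        if (queues_[q].empty()) return nullptr;
        UserThread *t = queues_[q].front();
        queues_[q].pop_front();
        return t;
    }

    std::vector<std::deque<UserThread *>> queues_;
    std::vector<std::mutex> locks_;
};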

Figure 5. Memory arrangement in Teamster.

In order to prevent load imbalance, the user-level threads are scheduled across several LWPs rather than being bound to a particular LWP. Furthermore, individual LWPs are not bound to any specific processors, since the kernel is then able to schedule the LWPs evenly across all available processors.

Teamster provides a dynamic thread migration mechanism, which ameliorates load imbalance and significantly reduces the amount of communication caused by the non-local data access present in DSM systems. While a thread is migrating, it is necessary to move the state of the thread along with the thread itself. The thread state consists of global data and thread-specific information, i.e. stack contents, register values and operating system internal control information. One of the most difficult problems encountered when migrating the state of a thread is dealing with the pointers held by the migrant thread. The stack and the registers may contain pointers to code, to global data, or to data in the stack. However, after thread migration to another node, these pointers may not retain their original meaning.

Four solutions have been proposed to address this problem. One solution, e.g. in [15], is to translate the pointers within the stack, together with the stack pointer and the stack frame pointer, when the stack is moved to a different address in the destination node. A second solution, adopted by some systems, is to use a specialized type of pointer, known as a global pointer [8]. A third solution, e.g. in Millipede [11], provides thread migration by ensuring that stacks allocated by the operating system occupy the same addresses on all hosts. The final solution ensures that the original meaning of the pointer is not lost after thread migration by reserving the virtual address space in advance. This solution is adopted by Amber [6] and Teamster. However, since the thread stacks in Teamster are allocated in the DSM objects segment of the GMI, the stack need not be explicitly moved at migration time. The DSM manager performs stack migration transparently when the stack is accessed for the first time at the new node. Since the DSM objects are kept sequentially consistent, the pointers that point into the stack itself are correct both before and after thread migration. Moreover, if the stack is larger than one page, it is not necessary to migrate every page of the stack; only the pages that are actually accessed at the new node are moved.
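
A rough sketch of the address-reservation approach adopted by Amber and Teamster: every node reserves the same range of virtual address space for thread stacks, so a pointer into a stack keeps its meaning after migration. The fixed base address and per-stack size below are illustrative assumptions; in Teamster the stack pages themselves live in the DSM objects segment and are faulted over by the DSM manager rather than copied eagerly.

#include <sys/mman.h>
#include <cstddef>
#include <cstdint>

// Illustrative layout: a region reserved at the same virtual address on
// every node, carved into fixed-size per-thread stacks.
static const uintptr_t kStackRegionBase = 0x60000000;   // assumed free range
static const size_t    kStackSize       = 1 << 20;      // 1 MB per stack

// Reserve the stack region at start-up on each node (MAP_FIXED assumes the
// range is unused; a real system would choose the range more carefully).
void *reserve_stack_region(size_t max_threads) {
    return mmap((void *) kStackRegionBase, max_threads * kStackSize,
                PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANON | MAP_FIXED, -1, 0);
}

// Thread i gets the same stack address on every node, so stack-internal
// pointers survive migration; consistency of the pages is left to the DSM.
void *stack_for_thread(size_t i) {
    return (void *) (kStackRegionBase + i * kStackSize);
}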

4.2. Global memory image

In order to implement the GMI in Teamster, the different memory segments are arranged as shown in Figure 5. Teamster provides an identical global memory image (GMI) over all nodes participating in the same computing task. The GMI is divided into four parts: code, static global variables, DSM objects, and the dynamic global heap. When an application is initialized, the code segment of the application is loaded into the first part of the GMI and then copied to each node of the Teamster cluster. The initialized global shared variables, and the pre-allocated DSM objects such as threads, thread stacks and synchronization objects, are then put into the GMI. Since these data and objects are declared and initialized at the beginning of the application's source files, the linker of the operating system helps us in forming this consistent part of the GMI. Finally, a dynamic global heap is prepared for dynamic memory allocation, in which programmers are able to allocate their dynamic shared data.

Since the code segment in the GMI is read-only, we copy it to each node of the Teamster cluster at initialization. Moreover, no further modification is ever made to the code segment, so we do not need to worry about consistency in this segment. The initialized global shared variables and the pre-allocated DSM objects are kept sequentially consistent, and therefore explicit annotation to propagate memory modifications is unnecessary [14]. On the other hand, the eager release consistency (ERC) model is adopted within the global dynamic heap segment [3]. This segment is used to allocate the computation data, and ERC therefore provides satisfactory performance. The new operator of C++ is overloaded in Teamster, so dynamic memory allocations written in the traditional way are placed into the global dynamic heap automatically, without any explicit annotation. The stacks of the applications are allocated on each individual node. A private heap is also maintained within each node for certain private data. These private data and stacks are specific to each node and are not shared within the Teamster cluster.

For example, as shown in Figure 1, the "A" matrix pointer is an initialized global shared variable. It is located in the second part of the GMI and is therefore sequentially consistent in Teamster. After using the overloaded new operator, the body of the matrix is allocated in the last part of the GMI and maintained under the ERC model. The "A" variable is written only once, when the body of the matrix is dynamically allocated, and is read many times during the execution of the matrix multiplication. Similarly, most of the static global shared variables and DSM objects in Teamster have the same write-once, read-many access characteristic. Therefore, applying the sequential consistency model to the first two parts of the GMI does not affect performance significantly. On the other hand, because the main bodies of the applications' computation data are always allocated in the last part of the GMI, applying the ERC model in this area gains better performance in situations of false sharing.
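
A minimal sketch of the overloaded new operator mentioned above, with dsm_heap_alloc and dsm_heap_free standing in for a hypothetical allocator over the global dynamic heap (the last part of the GMI):

#include <cstddef>
#include <new>

// Hypothetical allocator over the ERC-managed global dynamic heap segment.
void *dsm_heap_alloc(std::size_t size);
void  dsm_heap_free(void *p);

// Replacing the global operator new/delete redirects every ordinary "new"
// in application code into DSM space, with no annotation by the programmer.
// (The default operator new[] forwards to operator new, so array allocations
// such as "new double[n]" are covered as well.)
void *operator new(std::size_t size) {
    if (void *p = dsm_heap_alloc(size))
        return p;
    throw std::bad_alloc();
}

void operator delete(void *p) noexcept {
    dsm_heap_free(p);
}

// Usage as in the MM example: the global pointer A is an initialized shared
// variable (second part of the GMI, sequentially consistent), while the
// matrix body allocated through new lands in the dynamic heap under ERC.
double **A;

void allocate_matrix(int n) {
    A = new double *[n];
    for (int i = 0; i < n; ++i)
        A[i] = new double[n];
}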

4.3. Memory consistency handling

Memory consistency handling comprises a page fault handler and a page server in each node. When a user-level thread violates the access rights of any GMI page, a SIGSEGV signal is triggered, which causes the operating system to invoke the page fault handler.

The page fault handler then sends requests to other nodes for the update of that page. When the request arrives at a remote node, the page server is invoked by the triggering of the SIGIO signal. The page server then responds appropriately, depending upon the nature of the request.

Because there is more than one LWP inside a Teamster process, we must avoid races between the LWPs while handling any signal action. Therefore, the LWPs inside the Teamster processes are divided into two categories, namely working LWPs and a master LWP. The working LWPs execute the user-level threads and handle the actions of the SIGSEGV signal triggered by the memory access violations of the user-level threads. The master LWP, on the other hand, is responsible for handling the actions of the SIGIO signal. By handling different signals with different kinds of LWPs, we avoid conflicts in signal processing and the possibility of deadlocks.

Thanks to the benefits afforded by multiprocessors, asynchronous page fault handling is adopted to reduce the cost of accessing remote pages. In traditional page fault handling, a faulting thread sends a request for remote page access and then remains idle while it waits for the page reply. In asynchronous page fault handling, however, the processors are dispatched to other ready threads rather than waiting for the remote pages.

In the definition of ERC, a flush operation must be used to propagate the modifications of DSM pages when a user-level thread releases a lock or arrives at a barrier [3]. We apply a lazy flush mechanism to reduce the frequency of the flush operation in ERC. In Teamster, we can delay the flush operation until the lock is transferred to a different node or the last user-level thread arrives at a barrier. For example, in the original ERC there are n flush operations if n user-level threads arrive at a barrier, but with the lazy flush of Teamster there is only one flush operation, activated by the last user-level thread arriving at the barrier. Therefore, the lazy flush of Teamster is more scalable on a cluster of SMPs. The definition of release consistency with lazy flush, which is modified from the original ERC, is listed below:

Definition of release consistency with lazy flush
(1) Before an ordinary LOAD or STORE access is allowed to perform with respect to any other node, all previous acquire accesses must be performed.
(2) Before a release access is allowed to perform with respect to any other node, all previous ordinary LOAD and STORE accesses must be performed.
(3) Special accesses are sequentially consistent with respect to one another.
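
A rough sketch of the fault-driven path described above, assuming POSIX signal handling and mprotect-protected GMI pages (the dsm_* helpers are hypothetical placeholders, and a real handler would distinguish read from write faults and apply the appropriate consistency actions):

#include <signal.h>
#include <sys/mman.h>
#include <cstdint>

static const size_t kPageSize = 4096;

// Hypothetical DSM helpers: locate a copy of the page and fetch its
// contents through the communication subsystem.
int  dsm_page_owner(void *page);
void dsm_fetch_page(int owner, void *page);

// SIGSEGV handler executed by a working LWP: bring the faulting page
// up to date, then re-enable access so the user-level thread can resume.
static void segv_handler(int, siginfo_t *info, void *) {
    void *page = (void *) ((uintptr_t) info->si_addr & ~(uintptr_t)(kPageSize - 1));
    dsm_fetch_page(dsm_page_owner(page), page);
    mprotect(page, kPageSize, PROT_READ | PROT_WRITE);
}

void install_fault_handler() {
    struct sigaction sa = {};
    sigemptyset(&sa.sa_mask);
    sa.sa_sigaction = segv_handler;
    sa.sa_flags = SA_SIGINFO;            // pass the faulting address in siginfo
    sigaction(SIGSEGV, &sa, nullptr);
}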

5. Performance

Teamster comprises four SMP computers, each having four Intel Pentium III Xeon 500 MHz processors and 512 MB of main memory. These SMP computers are connected via 100 Mbps Fast Ethernet with Intel Fast Ethernet interface cards. The operating system of the SMP computers is Sun Solaris 8 for x86. The hybrid thread architecture of Teamster imposes no restrictions upon the total numbers of nodes, LWPs and user-level threads within the system's configuration.

Figure 6. Speedup under different configurations.

In the experimental measurements carried out in this section, the number of nodes, the number of LWPs in each node, and the number of application threads per node are represented by the letters N, P and T, respectively. For example, "4N 4P 8T" indicates that 4 nodes, each with 4 LWPs and 8 user-level threads, were used in the testing of the Teamster system. In other words, this particular configuration involved a total of 16 LWPs and 32 user-level threads.

Five applications were used to evaluate the performance of Teamster: matrix multiplication (MM), N-Body, successive over-relaxation (SOR), Vector Quantization (VQ) encoding for data compression, and Motion Estimation (ME) in the MPEG-4 encoder. MM computes C = A * B, where A, B, and C are three 2048 by 2048 square matrices. We divided the matrix A into several sub-arrays according to the number of user-level threads. N-Body is a force calculation problem found in the field of astrophysics. It describes the behavior of 81920 particles interacting through the force of gravity by calculating the total force perceived by each particle in a self-gravitating space system according to Newton's laws of motion [2, 13]. SOR solves a linear equation problem found in many fields of engineering. We use a 2048 by 2048 matrix to represent the grid of points in the problem domain. The matrix is divided into roughly equal bands of rows, and each band is assigned to a thread for processing. Each entry of the matrix is updated from its neighboring entries in each of 50 iterations. VQ encoding is a technique of image compression [9]. Ten 512 by 512 images were decomposed into 4-dimensional vectors (i.e., 2 by 2 blocks of pixel values), and each image vector was encoded with an 8-bit index into a code book of 1024 codevectors. ME, which is used in the MPEG-4 encoder, estimates the motion vectors of the macro-blocks between frames. In our experiments, the size of a macro-block is 16 by 16 pixels, the search range is 128 pixels, and the number of frames is 3.

Figure 6 shows the overall system performance of Teamster. Teamster performance was evaluated for 7 different system configurations. The first configuration, namely 1N 1P 1T, can be regarded as the sequential performance of these applications. The data of the 1N 4P 4T configuration can be considered to represent the performance of the applications when parallelized on a single SMP computer. Finally, the configuration of 4N 4P 4T fully exploits the total computing power of the Teamster system.

Ideally, the speedups should depend upon the total number of processors within the Teamster configuration, rather than upon the configuration itself. For example, the speedups of N-Body under a 1N 4P 4T configuration and under a 2N 2P 2T configuration are nearly identical because both configurations employ 4 processors. Furthermore, the same result is observed under 2N 4P 4T and 4N 2P 2T because both configurations use a total of 8 processors. This demonstrates that Teamster's thread architecture is able to support as many application threads as programmers require, with no appreciable slowdown in system performance. Therefore, programmers may parallelize their applications by considering only the program algorithm, and may ignore the precise configuration of the SMP cluster.

However, in the case of MM, the speedups of 1N 4P 4T and 2N 4P 4T are less than those of 2N 2P 2T and 4N 2P 2T respectively. This is caused by a high frequency of memory page-in/page-out movements and a high cache miss rate on the pages of the B matrix. While computing C = A * B, multiplying the columns of the B matrix by the rows of the A matrix accesses all memory pages of the B matrix on computers with row-major memory layout. For large problem sizes, the pages of the B matrix are paged in and out frequently. In order to eliminate these kinds of intra-node overheads, we transpose the B matrix in the MM (Modified) application. As a result, the B sub-matrix which each thread accesses is condensed into successive pages. This increases the cache hit rate and reduces the frequency of memory page-in/page-out movement. In this modified MM application, it is found that the speedups are nearly equal when using the same number of processors, regardless of the system configuration.

In the N-Body application, the problems of a high cache miss rate and a high frequency of memory page-in/page-out movement also produce super-linear speedup when running with a 4N 4P 4T configuration. The baseline of the speedup, i.e. the execution time of 1N 1P 1T, is longer than usual because much effort is expended on intra-node overhead. In order to observe the effect of this intra-node overhead, the N-Body application was also run with 40960 particles, and the results compared with those for 81920 particles. It was found that N-Body with 40960 particles gave more reasonable speedup results than the 81920-particle case for a 4N 4P 4T configuration.

Compared with the other applications, the computation time of VQ and ME is long enough to overlap with their communication time. Therefore, their speedups reach 15 when using 4 nodes, 4 LWPs per node, and 4 user-level threads per node in Teamster.

The transparent workload distribution feature of Teamster automatically distributes user-level threads across the cluster according to the underlying hardware configuration. For example, if a cluster consists of node 0 with 3 processors, nodes 1 and 2 with only 1 processor each, and node 3 with 3 processors, this hardware configuration is referred to as the "3113" configuration. In the "3113" case, Teamster distributes user-level threads according to the proportion 3:1:1:3. In traditional DSMs, the workloads are usually evenly distributed across the cluster, so the workloads would be distributed according to the proportion 2:2:2:2 in the same "3113" case. In this situation, a load imbalance occurs.
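
The 3:1:1:3 versus 2:2:2:2 split described above amounts to dividing the user-level threads in proportion to each node's processor count. A small illustrative sketch of that calculation (not Teamster's actual scheduler code):

#include <vector>

// Split nthreads across nodes in proportion to their processor counts.
// For processors {3, 1, 1, 3} and 8 threads this yields {3, 1, 1, 3},
// whereas an even split would give {2, 2, 2, 2} and unbalance the load.
std::vector<int> distribute_threads(int nthreads, const std::vector<int> &procs) {
    int total = 0;
    for (int p : procs)
        total += p;

    std::vector<int> share(procs.size());
    int assigned = 0;
    for (size_t i = 0; i < procs.size(); ++i) {
        share[i] = nthreads * procs[i] / total;   // proportional share
        assigned += share[i];
    }
    // Hand out any rounding remainder one thread at a time, round-robin.
    for (size_t i = 0; assigned < nthreads; i = (i + 1) % procs.size()) {
        ++share[i];
        ++assigned;
    }
    return share;
}
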
Oppositely, Teamster’s transparent workload distribution takes the consideration of load balance all the time. Figure 7 represents the speedups of Teamster’s transparent workload distribution. The experiments contain two

158

CHANG, SHIEH AND LIANG

Figure 7.

Speedup of transparent workload distribution.

Figure 8.

Occurrences of faults in sequential and release consistency under 4N 4P 4T.

The workloads in the "3113" case are distributed in the traditional way, i.e. according to the proportion 2:2:2:2. The workloads in the "3113 T" case, on the other hand, are distributed by Teamster's transparent workload distribution, i.e. according to the proportion 3:1:1:3. The speedup baselines of "3113 T" and "4224 T" are the execution times of "3113" and "4224" respectively. The results in Figure 7 show that Teamster nearly doubles the performance of the N-Body, SOR, and ME applications. In the "4114" case, the results are similar to those of "3113". Even though the execution times of MM and VQ are too short to manifest the effect of load imbalance, Teamster still performs better.

In Teamster, sequential consistency is used for the global static shared variables supporting the GMI in order to avoid the explicit primitives which would otherwise be required to notify modifications of those variables. Although some performance loss from applying sequential consistency in Teamster is to be expected, the results presented in Figure 8 show that the total numbers of sequential write faults (SEQ WR) and sequential read faults (SEQ RD) are far smaller than those of release write faults (REL WR) and release read faults (REL RD). The overhead induced by sequential consistency can therefore be tolerated, i.e. the GMI induces less fault handling under sequential consistency than under ERC.

6. Conclusions and future work

This paper has presented the design and implementation of a system referred to as Teamster, which is a transparent DSM system for clusters of symmetric multiprocessor computers. The hybrid thread architecture of Teamster provides a multi-threading environment and a transparent scheme for workload distribution. Using this system, users are able to parallelize their applications without needing to take the underlying hardware configuration into account. The Global Memory Image forms a uniform and identical memory image across the Teamster cluster, and data distribution under the GMI is just as transparent as on tightly-coupled multiprocessors. It has been demonstrated that Teamster is able to provide a transparent distribution of workload and data throughout a cluster of SMPs by virtue of its hybrid thread architecture and the GMI.

Research into dynamic thread migration and automatic reconfiguration on Teamster is now being explored. The goal of dynamic thread migration and automatic reconfiguration is to achieve high resource utilization of the cluster without disturbing the host applications which are already running on the nodes of the cluster. Since an SMP cluster is not usually dedicated only to the DSM system, resource utilization varies dynamically. The reconfiguration mechanism will re-distribute the data and the workload automatically in order to improve system performance. The reconfiguration of Teamster should not only improve the performance of DSM applications, but should also do so without sacrificing the performance of the host applications. It is our future intention to exploit the potential of the Teamster system in other application areas, such as internetworking. In this way, it is hoped that DSM will find a use in more realistic applications, rather than just within large-scale scientific computing.

References

1. C. Amza, A. Cox, S. Dwarkadas, P. Keleher, H. Ly, R. Rajamony, W. Yu, and W. Zwaenepoel. TreadMarks: Shared memory computing on networks of workstations. IEEE Computer, 29(2):18–28, 1996.
2. I. G. Angus, G. C. Fox, J. S. Kim, and D. Walker. Solving Problems on Concurrent Processors. Prentice-Hall International, 1988.
3. J. K. Bennett, J. B. Carter, and W. Zwaenepoel. Munin: Distributed shared memory based on type-specific memory coherence. In Second ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 168–175, 1990.
4. J. Boyle, R. Butler, T. Disz, B. Glickfeld, E. Lusk, R. Overbeek, J. Patterson, and R. Stevens. Portable Programs for Parallel Processors. Holt, Rinehart and Winston, Inc., 1987.
5. J. B. Chang, Y. J. Tsai, C. K. Shieh, and P. C. Chung. An efficient thread architecture for a distributed shared memory on symmetric multiprocessor clusters. In Proceedings of the International Conference on Parallel and Distributed Systems, 816–823, 1998.
6. J. S. Chase, F. G. Amador, E. D. Lazowska, H. M. Levy, and R. J. Littlefield. The Amber system: Parallel programming on a network of multiprocessors. In Proceedings of the 12th ACM Symposium on Operating System Principles, 147–158, 1989.
7. A. Erlichson, N. Nucholls, G. Chesson, and J. Hennessy. SoftFLASH: Analyzing the performance of clustered distributed virtual shared memory systems. In Proceedings of the 7th Symposium on Architectural Support for Programming Languages and Operating Systems, 1996.
8. I. Foster, C. Kesselman, R. Olson, and S. Tuecke. Nexus: An interoperability layer for parallel and distributed computer systems. Technical Report, Argonne National Laboratory, 1993.
9. A. Gersho and R. M. Gray. Vector Quantization and Signal Compression. Kluwer Academic Publishers, London, 1992.


10. Y. C. Hu, L. Honghui, A. L. Cox, and W. Zwaenepoel. OpenMP for networks of SMPs. In Proceedings of the 13th International and 10th Symposium on Parallel and Distributed Processing, 302–310, 1999.
11. A. Itzkovitz, A. Schuster, and L. Wolfovich. Millipede: Towards standard interface for virtual parallel machines on top of distributed environments. Technical Report 9607, Technion IIT, 1996.
12. D. Khandekar. Quarks: Portable Distributed Shared Memory on Unix. University of Utah, beta ed., 1995.
13. A. C. Lai. Design and Implementation of Release Consistency Protocol on Cohesion. Master thesis, Department of Electrical Engineering, National Cheng Kung University, R.O.C., 1994.
14. K. Li. IVY: A shared virtual memory system for parallel computing. In Proceedings of the 1988 IEEE International Conference on Parallel Processing, 94–101, 1988.
15. E. Mascarenhas and V. Rego. Architecture of a portable threads system supporting thread migration. Software: Practice & Experience, 26(3):327–356, 1996.
16. S. Roy and V. Chaudhary. Strings: A high-performance distributed shared memory for symmetric multiprocessor clusters. In Proceedings of the Seventh International Symposium on High Performance Distributed Computing, 90–97, 1998.
17. R. Samanta, A. Bilas, L. Iftode, and J. Singh. Home-based SVM protocols for SMP clusters: Design and performance. In Proceedings of the Fourth International Symposium on High-Performance Computer Architecture, 1998.
18. D. Scales, K. Gharachorloo, and A. Aggarwal. Fine-grain software distributed shared memory on SMP clusters. In Proceedings of the Fourth International Symposium on High-Performance Computer Architecture, 125–136, 1998.
19. E. Speight and J. K. Bennett. Brazos: A third generation DSM system. In Proceedings of the First USENIX Windows NT Workshop, 1997.
20. R. Stets, S. Dwarkadas, N. Hardavellas, H. Hung, L. Kontothanassis, S. Parthasarathy, and M. Scott. Cashmere-2L: Software coherent shared memory on a clustered remote-write network. In Proceedings of the 16th ACM Symposium on Operating Systems Principles, 170–183, 1997.
21. G. Antoniu, L. Bougé, and S. Lacour. Making a DSM consistency protocol hierarchy-aware: An efficient synchronization scheme. In Proceedings of the Workshop on Distributed Shared Memory on Clusters (DSM 2003), Tokyo, 516–523, May 2003.
22. G. da Silva Craveiro and L. M. Sato. CPAR-Cluster: A runtime system for heterogeneous clusters with mono and multiprocessor nodes. In Proceedings of the 2004 International Workshop on Distributed Shared Memory on Clusters (DSM 2004), Apr. 2004.
23. Y. Jégou. Implementation of page management in Mome, a user-level DSM. In Proceedings of the International Workshop on Distributed Shared Memory on Clusters (DSM 2003), Tokyo, Japan, 479–486, May 2003.
24. L. Peng, W. F. Wong, M. D. Feng, and C. K. Yuen. SilkRoad: A multithreaded runtime system with software distributed shared memory for SMP clusters. In Proceedings of the IEEE International Conference on Cluster Computing (CLUSTER 2000), 243–249, Dec. 2000.
25. L. Peng, W. F. Wong, and C. K. Yuen. SilkRoad II: A multi-paradigm runtime system for cluster computing. In Proceedings of the IEEE International Conference on Cluster Computing (CLUSTER 2002), poster, 443–444, Sep. 2002.
26. L. Peng, W. F. Wong, and C. K. Yuen. The performance model of SilkRoad, a multithreaded DSM system for clusters. In Proceedings of the 3rd IEEE/ACM International Symposium on Cluster Computing and the Grid (Workshop on Distributed Shared Memory on Clusters, DSM 2003), 495–501, May 2003.
27. L. Peng, W. F. Wong, and C. K. Yuen. SilkRoad II: Mixed paradigm cluster computing with RC dag consistency. Parallel Computing, 29(8):1091–1115, Aug. 2003.
28. K. H. Randall. Cilk: Efficient Multithreaded Computing. Ph.D. thesis, MIT Department of Electrical Engineering and Computer Science, June 1998.
29. L. M. Sato. Sistema de programação e processamento para sistema multiprocessadores. In Anais do VI Simpósio Brasileiro de Engenharia de Software, 1991.