Heterogeneous Distributed Shared Memory on Wide Area Network

Weisong Shi
Department of Computer Science, Courant Institute of Mathematical Sciences, New York University
[email protected]

Abstract

In this paper, we analyze the applicability of state-of-the-art software DSM techniques for supporting a single shared address space in a large, heterogeneous wide area network. The main contributions of this paper include the following three aspects. First, based on a detailed analysis, we list ten challenges related to implementing a single shared address space in a heterogeneous, dynamic scenario. Furthermore, for each challenge, we discuss the applicability of new techniques that are widely used in homogeneous software DSM systems. Second, for two kinds of typical applications, two hierarchical schemes, HSAS-LM for scientific applications and HSAS for information service applications, are proposed to implement a heterogeneous distributed shared memory system on a wide area network. Finally, four key problems inherent in both the HSAS-LM and HSAS schemes, namely coherence information maintenance, fault tolerance, resource discovery and join, and application adaptation, are analyzed and partial solutions are proposed.

1 Introduction

Distributed shared memory (DSM) is an effective method to provide a single shared address space on physically distributed memory systems. One of the most important advantages of a single address space lies in its provision of a natural extension of the sequential programming model to parallel and distributed systems, so that the underlying physical infrastructure is hidden from high-level users and programmers. Hardware DSM systems and software DSM systems are the two mainstream methods to implement distributed shared memory. Representative hardware DSM systems include DASH [26], FLASH [24], and the SGI Origin 2000 [25]. In comparison to software DSM systems, the low performance/cost ratio prevents hardware DSM systems from prevailing among ordinary users. On the contrary,

software DSM systems have been widely accepted because they are easy to implement and free. Since the 1990s, many novel techniques have been proposed to improve the performance, availability, and usability of software DSM systems. However, almost all of these systems are implemented on homogeneous dedicated hardware environments, such as intra-cluster networks and LANs, which limits their extensibility. Actually, many applications would benefit greatly from a single address space provided by the underlying system. However, the applicability of these state-of-the-art software DSM techniques to their heterogeneous counterparts is not clear yet. In this paper, we take up the challenge of analyzing the applicability of these techniques in this scenario. The main contributions of this paper include the following three aspects.

- Based on a detailed analysis, we list ten challenges to implementing a single shared address space in a heterogeneous, dynamic environment. Furthermore, for each challenge, we analyze the applicability of new techniques that are widely used in homogeneous software DSM systems.

- For two kinds of typical applications, two hierarchical schemes, HSAS-LM for scientific applications and HSAS for information service applications, are proposed to implement a heterogeneous DSM system on a wide area network.

- Four key problems inherent in both the HSAS-LM and HSAS schemes, namely coherence information maintenance, fault tolerance, resource discovery and join, and application adaptation, are analyzed and partial solutions are proposed.

The rest of this paper is organized as follows. Detailed background on software DSM systems, together with the major challenges related to heterogeneous, dynamic systems, is presented in Section 2. Based on this analysis, two different design schemes and their key issues are proposed and discussed in Section 3.

Finally, related work and concluding remarks are presented in Sections 4 and 5, respectively.

2 Background and Analysis

2.1 Software DSM Systems

Software distributed shared memory systems, also called shared virtual memory [27], have been extensively studied in the past 14 years because they present a shared memory abstraction to high-level applications on physically distributed memory systems, such as networks of workstations (NOWs) and clusters. Conceptually, compared with the message passing programming model, the shared memory programming model is easier to accept because it is a natural extension of the sequential programming model, which makes software distributed shared memory systems an ideal vehicle for parallel programming environments on many parallel systems. Since the first prototype, IVY, came to life in 1986, the development of software DSM systems can be divided into three important phases: ancient history (1986-1990), the renaissance period (1991-1996), and the present day (1997-2000). In ancient history, many systems were page-based, supported the sequential consistency model, and ran on uniprocessor-based experimental prototypes. The main contribution of this phase is that the idea of the software DSM system was evaluated, and many classical coherence algorithms were proposed. However, the serious performance bottleneck prevented software DSM systems from prevailing, which led to the renaissance phase. The introduction of the release consistency model and the multiple-writer protocol breathed new life into software DSM systems, and many novel ideas were proposed to improve performance, availability, and usability. The applicability of software DSM systems remained a problem, however, because their mainstream market in this stage was scientific applications, which has been shrinking. Furthermore, the lack of support for heterogeneity, persistence, and security limited their prospects.
Therefore, in the third phase, a different application focus, especially distributed applications and services, became an important direction. Furthermore, many issues, such as persistence, fault tolerance, security, and heterogeneity, have been widely studied recently. From a technical perspective, the development of software distributed shared memory systems can be depicted along four orthogonal aspects, which form a four-dimensional design space: heterogeneous or homogeneous; page-based or object-based; large memory or local memory limitation; and sequential or relaxed memory consistency model. To the best of our knowledge, though 2 of the 20 existing software DSM systems support heterogeneity, none of them can adapt to a dynamic scenario. The challenges related to heterogeneous, dynamic environments will be discussed in the next section.

Page-based and object-based are the two major implementation schemes of software DSM systems, where page-based systems depend more on the underlying operating system than object-based systems do. Generally, the granularity of the coherence unit in page-based systems is constrained by the size of a virtual memory page (4K bytes or larger), which entails false sharing and fragmentation. Although these two problems have been partially solved by the multiple-writer protocol, the performance of page-based software DSM systems remains undesirable. On the contrary, the coherence granularity of object-based systems is related to the size of an object, which varies according to the definition of each object. To support variable grains, one alternative is to augment accesses to shared objects with extra calls at the source code level, which places a heavy burden on the programmer; representative systems include CRL [21] and ORCA [1]. The other alternative is to augment the binary code automatically in the compiler; representative systems are Blizzard-S [34] and Shasta [36]. The substantial disadvantage of this scheme is performance.

Although providing a shared memory abstraction has been widely accepted as an advantage of software DSM systems, the original idea presented in [29] was to improve access time to the local disk by using remote idle memory to store replaced data.¹ Currently, few available software DSM systems can combine the main memory of different nodes to support a large single address space. To the best of our knowledge, the JIAJIA [19] system provides this function; the total memory space supported by many other systems is limited by the minimum, over all nodes, of the sum of local memory and swap space.
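The idea of combining node memories into one large address space can be sketched as follows (our toy illustration, not JIAJIA's actual mechanism): each node contributes a chunk of memory, and a global map translates a single-address-space address into a (node, local offset) pair.

```python
# A toy sketch (ours, not JIAJIA's actual mechanism) of combining the
# memories of several nodes into one large address space: each node
# contributes a chunk, and a global map translates a global address
# into (node, local offset).

def build_map(contributions):
    """contributions: list of (node, bytes contributed)."""
    layout, base = [], 0
    for node, size in contributions:
        layout.append((base, base + size, node))
        base += size
    return layout, base            # base is now the total address space

def translate(layout, addr):
    """Find which node backs a given global address."""
    for lo, hi, node in layout:
        if lo <= addr < hi:
            return node, addr - lo
    raise ValueError("address out of range")

layout, total = build_map([("A", 1 << 20), ("B", 2 << 20), ("C", 1 << 20)])
print(total)                              # 4 MB total from three nodes
print(translate(layout, (1 << 20) + 16))  # ('B', 16)
```

The total space grows with the number of contributing nodes, which is exactly the "large memory" property discussed above.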
Conceptually, a memory consistency model is a contract between the application programmer and the underlying hardware. The more relaxed the consistency model, the more difficult it is to program, but the greater the potential for optimization. As described earlier, the fast development of software DSM in the 1990s depended greatly on the relaxation of the memory consistency model, which made programming more and more difficult. Therefore, some researchers argue that this trend violates the motivation of software DSM systems, and advocate that the sequential consistency model should be supported. According to this classification, several representative systems are listed and compared in Table 1.

¹ At that time, main memory was expensive and small, while disk access was about ten times slower than remote memory access.

Recently, several state-of-the-art techniques have been proposed, such as multithreading for hiding remote latency [43, 40, 33], thread migration and affinity-based self-scheduling for load balancing [30, 44, 39], and hardware support to

simplify the coherence protocol [4, 6, 38], and block transfer for prefetching [12, 10]. The applicability of these techniques in a heterogeneous scenario is discussed in the next subsection.

System           Heterogeneity  Granularity   Large Memory  Consistency Model
IVY [28]         No             Page-based    Yes           Sequential
Mermaid [48]     Yes            Page-based    No            Sequential
ORCA [2]         No             Object-based  No            Relaxed
CRL [21]         No             Object-based  No            Relaxed
TreadMarks [23]  No             Page-based    No            Relaxed
HLRC [49]        No             Page-based    No            Relaxed
Shasta [35]      No             Object-based  No            Sequential
JIAJIA [19]      No             Page-based    Yes           Relaxed

Table 1. The representative software DSM systems.

2.2 Challenges of Heterogeneity

To provide a single shared address space on a heterogeneous, dynamic wide area network, many factors related to heterogeneity and dynamic behavior must be taken into account in the design, and therefore many techniques proposed for homogeneous software DSM systems must be revisited. Heterogeneity in a distributed system has many aspects. The hardware architectures of the machines may differ, including the instruction sets, the data representation formats, the hardware page sizes, the number of processors on a node, and the available computing capacity of each node. Furthermore, the operating systems, the system and application programming languages and their compilers, the types of distributed file systems, and the communication protocols may also differ. The different network protocols used by different LANs and WANs make communication more complicated than within a single LAN. The dynamic characteristics of a WAN result in three important requirements: fault tolerance, resource discovery/join, and application adaptation. Moreover, security plays an important role in protecting the interests of users and resource owners on a WAN. As such, to achieve our objective, all of the following challenges must be addressed.

- Data Conversion. Data items may be represented differently on various types of hosts. For data types as simple as integers, the byte order may differ. For floating point numbers, the lengths of the mantissa and exponent fields, as well as their positions, can differ. For higher-level structured data types (e.g., records), the alignment and order of the components in the data structure can differ between hosts. Sharing data among heterogeneous hosts means that the physical representation of the data has to be converted when the data is transferred between hosts of different types. Generally, data conversion not only incurs runtime overhead but may also be impossible because of nonequivalent data content. As a result, a heterogeneous single address space cannot always be constructed across arbitrary hardware platforms.
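The byte-order part of the problem can be made concrete with a small sketch (ours, not the paper's; the record layout is hypothetical): a shared record is packed into a canonical big-endian wire format before it crosses a host boundary, so that little- and big-endian hosts agree on its contents.

```python
# A minimal sketch of byte-order conversion (our illustration). A record
# of one 32-bit int and one 64-bit double is packed into a canonical
# big-endian wire format before crossing a host-type boundary.
import struct

WIRE_FMT = ">id"   # canonical format: big-endian int32 + float64

def to_wire(count, value):
    return struct.pack(WIRE_FMT, count, value)

def from_wire(data):
    return struct.unpack(WIRE_FMT, data)

wire = to_wire(1024, 3.5)
assert from_wire(wire) == (1024, 3.5)

# A host that naively reinterprets the first field as little-endian
# sees a different number -- exactly the conversion problem above:
print(struct.unpack("<i", wire[:4])[0])   # 262144, not 1024
```

Note that conversion like this only handles representable values; as the text says, nonequivalent data content (e.g., incompatible floating point ranges) may make conversion impossible altogether.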

- Page Size. In a page-based software DSM system, the coherence unit of data managed and transferred is a data block, which we call a DSM page. In a homogeneous DSM system, a DSM page usually has the same size as a native virtual memory page, so that the memory management hardware can be used to trigger a DSM page fault. In a heterogeneous DSM system, the hosts may have different virtual memory page sizes, which adds complexity to the coherence algorithm and increases the opportunity for false sharing and fragmentation. In the general case, using the largest VM page size is an appropriate choice. As such, the block transfer technique can be used here.

- Thread Management. As a means of providing a single shared address space, distributed shared memory usually allows multiple threads to share the same address space, making the programming of parallel applications particularly easy. In a heterogeneous system, however, the facilities for thread management may differ among the types of hosts. Thread migration is an efficient mechanism to balance the load in a homogeneous DSM system, and it is usually easy to implement, since minimal context is kept for the threads. For example, the per-thread stack is allocated in the shared address space, so the stack need not be moved explicitly; only the thread control block (TCB) needs to be moved at migration time. In a heterogeneous DSM system, however, thread migration is much more difficult. The binary images of the program are different, so it is hard to identify "equivalent points of execution" in the binaries. In particular, the formats of the thread stacks are likely to differ, due to architectural, language, and compiler differences. Therefore, converting the stacks at migration time may be very difficult, if not impossible. As such, using thread migration for load balancing in a heterogeneous, dynamic environment is not a good idea; affinity-based self-scheduling, however, is well suited to this terrain.
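The page-size choice suggested above can be sketched as follows (our illustration, not a prescribed algorithm): the DSM page size is the largest native VM page size among the hosts, and shared regions are rounded up to a multiple of that unit.

```python
# A sketch of the page-size suggestion above: pick the largest native VM
# page size among the hosts as the DSM coherence unit, and pad shared
# regions to a whole number of DSM pages.

def dsm_page_size(native_page_sizes):
    """Pick the DSM coherence unit for a set of heterogeneous hosts."""
    size = max(native_page_sizes)
    # VM page sizes are powers of two in practice, so every native size
    # divides the chosen DSM page size and faults map cleanly onto it.
    assert all(size % s == 0 for s in native_page_sizes)
    return size

def round_up(length, page):
    """Pad a shared region to a whole number of DSM pages."""
    return -(-length // page) * page   # ceiling division

hosts = [4096, 8192, 16384]            # hypothetical mix of hosts
page = dsm_page_size(hosts)
print(page)                            # 16384
print(round_up(100000, page))          # 114688: padded to 7 DSM pages
```

The padding also illustrates the fragmentation cost the text mentions: the larger the chosen DSM page, the more space is wasted at region boundaries.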

- Programming Language. The programming languages here include two kinds: system programming languages and application programming languages. In a homogeneous system, it is easy to support one specific system language. However, the system programming languages used on heterogeneous hosts may be very different. This implies that multiple equivalent implementations of a heterogeneous single address space may have to be written in the various languages. Application programming languages, however, need not be affected by the system programming languages, as long as a functionally equivalent application programming interface is supported on all hosts. If a common application programming language is available on all hosts, then the same program is usable on every host via recompilation. Otherwise, multiple (equivalent) implementations of an application have to be written, which increases the burden of using a heterogeneous system. Up to now, many software DSM systems have been built on UNIX platforms, such as Solaris, AIX, and Digital Unix, which limits them to scientific applications, a smaller and smaller market. With the prevalence of Windows NT, and especially the appearance of Windows 2000, which integrates Windows NT and Windows 9x, it is important to support a single address space on Windows NT/9x-based platforms and the corresponding programming languages on Windows platforms.

- Communication. The realization of a heterogeneous, dynamic single address space requires a common communication protocol among the different types of hosts involved. This requirement is not specific to heterogeneous distributed shared memory systems; with the prevalence of the Internet, common transport protocols do exist for communication among different hosts, including workstations, laptops, and palm PCs. However, combined with data conversion, supporting thread migration and the dynamic joining and/or leaving of hosts makes maintaining communication connectivity genuinely hard. Furthermore, DSM-specific user-space communication techniques, which have been studied recently [38, 5, 42], are difficult to adopt in heterogeneous distributed shared memory for now. For example, the remote DMA read/write supported by Memory Channel plays an important role in the performance of software DSM systems [13, 5], but cannot be used in a heterogeneous counterpart in the near future.

- Uniform File Access. Generally, in a homogeneous system, the distributed file system allows threads to open files and perform I/O in a uniform way. On heterogeneous hosts, however, multiple incompatible distributed file systems may coexist, due to the multiplicity of distributed file systems currently in existence. Therefore, in order to implement a heterogeneous single address space, a uniform file access interface should be provided to applications. Heterogeneous distributed file systems are a research topic in their own right, and their discussion is beyond the scope of this paper.

- Fault Tolerance. In a system as large as a wide area system, it is certain that at any given instant several hosts, communication links, or disks will have failed. Thus, dealing with failure and dynamic reconfiguration is a necessity, both for the system itself and for applications.

- Resource Discovery and Join. To support a dynamic computing environment, i.e., one where computing nodes can leave and join dynamically and freely, the underlying infrastructure must provide mechanisms for discovering and joining the computing community. Furthermore, in order to obtain a faulted page (located in remote memory), a lookup mechanism must be provided to decide where to fetch it from. Resource discovery and join is indeed a hot research topic at present.

- Application Adaptation and Load Balancing. Because of the dynamic characteristics of the underlying computing environment, applications must adapt themselves to the changing state, which is a tradeoff between performance and quality. Traditionally, to execute parallel programs on a cluster or network of workstations, one assumes a priori static knowledge of the number, relative speeds, and load of the nodes involved in the computation. With this information, a program can distribute its load evenly for efficient execution. This is not the case on a wide area network. That is, the application itself must have the capability to integrate new machines into a running computation, mask and remove failed machines, and balance the workload. Self-scheduling is an ideal way to cope with this type of dynamic environment; in particular, TIES (two-phase idempotent execution strategy) [22] is a good method for masking faults in this environment.
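The combination of self-scheduling and idempotent re-execution can be sketched as follows (a toy simulation under our own assumptions, not the actual TIES protocol): workers pull chunks from a shared queue, and a chunk whose first execution "fails" is simply requeued and re-executed, which is safe because each chunk is idempotent.

```python
# A toy sketch of self-scheduling with idempotent re-execution, in the
# spirit of TIES [22] (our simplification, not the actual protocol).
from queue import Queue

def run(chunks, fails_on_first_try):
    """Self-schedule idempotent chunks, re-executing any that fail once."""
    todo = Queue()
    for c in chunks:
        todo.put(c)
    done = {}
    attempted = set()
    while not todo.empty():
        chunk = todo.get()
        if chunk in fails_on_first_try and chunk not in attempted:
            attempted.add(chunk)
            todo.put(chunk)          # node "failed" mid-chunk: requeue it
            continue
        done[chunk] = chunk * chunk  # idempotent work: same input, same result
    return done

result = run(range(6), fails_on_first_try={2, 4})
print(sorted(result.items()))
# every chunk completes exactly once despite the two simulated failures
```

Because chunks are idempotent, re-executing a chunk whose node disappeared never corrupts the result, so new nodes can be folded in and failed nodes masked without programmer intervention.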

- Security. Security is one of the most important issues in a wide area system, yet it is neglected in many single-cluster or LAN systems. Here, security mechanisms are used to protect the interests of users and resource owners. We believe very firmly that, in such a large system, security must be built into the core from the very beginning.

3 Design of Heterogeneous DSM System

Keeping these challenges in mind, we study in this section two design schemes for implementing a heterogeneous DSM system on a dynamic WAN. First, we must ask what kinds of applications will be supported. Generally, two kinds of applications run on a WAN: computing-intensive applications and information services. Many computing-intensive scientific applications require not only performance but also large memory support, whereas performance and other issues are the main concerns of many information services, because their local memory is large enough. Furthermore, whether or not large memory is supported has a great impact on the design of a software DSM system, especially its memory organization scheme. As such, the two design schemes are discussed separately in the following two subsections, and some common key techniques are discussed in the last subsection.

3.1 With Large Memory Support

As discussed in the last section, the original idea and advantage of software DSM systems is to use remote memory in place of the local disk, reducing disk access time by about an order of magnitude. This idea is very similar to that of virtual memory in operating systems, so software DSM systems are also called shared virtual memory systems in much of the literature [20]. Therefore, supporting large memory is very important, especially for scientific applications that require it. In addition to supporting large memory, the system must adapt to a heterogeneous, dynamic environment. Thus, a CC-NUMA-based hierarchical scheme is proposed, as shown in Figure 1. This scheme is named HSAS-LM (Heterogeneous Single Address Space with Large Memory support). According to the geographical positions of the nodes, we divide them into supernodes (SNs). For example, a LAN in a research group can be defined as a supernode. Each supernode has a manager, which is responsible for the joining and leaving of the compute nodes within its supernode and maintains both intra-supernode and inter-supernode memory coherence. The memory organization of each supernode is home-based; that is, the same memory page can be replicated and migrated freely within the supernode, on the condition that the manager holds an up-to-date copy. Furthermore, this manager must respond to memory requests from other supernodes, and it forwards memory requests from nodes within its supernode to other supernodes (managers), similar to the home manager of CC-NUMA systems. In order to support the dynamic joining and leaving of supernodes and to manage the total memory space of the system, a central manager is required. This manager is responsible for maintaining the global memory state and the state of the supernodes. Theoretically, the memory space supported by this scheme scales linearly with the number of supernodes. However, the maximum memory space supported by this scheme is limited by the operating system; with today's technology, a 32-bit address space limits it to 4 GB. We hope that 64-bit OSes will be deployed in the near future.
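The three-step join protocol of Figure 1 can be sketched as follows (class names and the placement policy are our assumptions; the paper does not prescribe how the central manager picks a supernode):

```python
# A sketch of the three-step join protocol: a free node asks the central
# manager where to join (step 1), the central manager picks a supernode
# (step 2), and the free node registers with that supernode directly
# (step 3). The least-populated placement policy is our own toy choice.

class CentralManager:
    def __init__(self, supernodes):
        self.supernodes = supernodes          # supernode name -> member list

    def place(self, node):
        # step 2: toy policy -- assign to the least-populated supernode.
        return min(self.supernodes, key=lambda sn: len(self.supernodes[sn]))

class FreeNode:
    def __init__(self, name):
        self.name = name

    def join(self, cm):
        sn = cm.place(self.name)              # steps 1-2: request + reply
        cm.supernodes[sn].append(self.name)   # step 3: contact SN manager
        return sn

cm = CentralManager({"SN1": ["ws1", "ws2"], "SN2": ["ws3"], "SN3": ["ws4", "ws5"]})
print(FreeNode("laptop1").join(cm))           # joins SN2
```

In a real deployment the central manager would presumably also weigh geography, as the text divides supernodes by geographical position; the sketch only shows the message flow.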

[Figure 1: diagram showing a free node (FN), the central manager (CM), and three supernodes (SN1-SN3) containing node managers (NM), workstations (WS), clusters, and notebook or palm PCs.]

Figure 1. The memory organization scheme I (HSAS-LM): with large memory support. Supernode 1 (SN1) is responsible for maintaining the first memory block, while supernode 2 (SN2) and supernode 3 (SN3) are in charge of the second and third memory blocks, respectively. When a free node wants to join the system, it first sends a request to the central manager (step 1); the central manager decides which supernode the free node should join (step 2); the free node then contacts the appropriate supernode manager directly (step 3).

3.2 Without Large Memory Support

Though many scientific applications require large memory support, many information service applications do not. The main purpose of the system is then to use the underlying infrastructure to provide transparent virtualization and middleware for sequential and parallel applications. Conceptually, this kind of software DSM system provides a single address space on top of the underlying heterogeneous, dynamic, and distributed system; all of the participating nodes share the same memory space, and all the shared data can be replicated and migrated freely among those nodes according to a certain memory coherence protocol. The basic organization of this scheme, named HSAS, is similar to that of HSAS-LM, except that the total memory space is limited to the local memory size of the central manager, as shown in Figure 2. As in HSAS-LM, all of the nodes are divided into supernodes according to geographical position; each supernode has a manager, which is responsible for the joining and leaving of the compute nodes within its supernode and maintains both intra-supernode and inter-supernode memory coherence. Similarly, the memory organization of each supernode is home-based; that is, the same memory page can be replicated and migrated freely within the supernode, on the condition that the manager holds an up-to-date copy. Furthermore, this manager must respond to memory requests from other supernodes and forward memory requests from nodes within its supernode to other supernodes (managers). In addition to maintaining memory coherence within the system, the supernode managers and the central manager cooperate to support resource discovery and resource join/leave within the whole system.

3.3 Key Problems and Partial Solutions

In the last two subsections, two basic organization schemes, with and without large memory support, were discussed in detail. Though their memory organization schemes differ, both must solve several common key problems, such as coherence information maintenance, fault tolerance, resource (computing node) discovery and join, and application reconfiguration. We discuss these problems and propose some partial solutions in this subsection.

3.3.1 Coherence Information Maintenance

It is well known that the coherence of multiple copies is the core problem inherent in any shared memory system. This problem is more substantial in software DSM systems because of the distributed character of the underlying system. The memory coherence problem involves two closely related concepts: the memory consistency model and the cache coherence protocol. The memory consistency model is a contract between the hardware and the programmer; it determines what kind of programming model the system will support, and it greatly affects high-level programming. The more relaxed the memory consistency model, the more difficult programming becomes. The mechanism for maintaining the coherence of multiple copies of a shared data unit is the cache coherence protocol. More details on the relationship between these two concepts can be found in [37]. As mentioned above, inter-supernode coherence in the HSAS-LM scheme is maintained by a home-based cache coherence protocol, where the central manager keeps the most up-to-date directory information about all the shared data; the shared data itself is stored among the supernode managers. Intra-supernode coherence is managed by the supernode manager itself, which keeps not only the sharing information for each shared unit but also the shared data itself. All the nodes within a supernode ask the supernode manager for the corresponding shared data when a page fault occurs. In the HSAS scheme, both inter-supernode and intra-supernode coherence use the same protocol, which is similar to the intra-supernode protocol of HSAS-LM. An eager write-update protocol is adopted within the supernode in order to cope with the dynamic characteristics of the underlying hardware environment. Here, we assume that

[Figure 2: diagram showing a free node (FN), the central manager (CM), and three supernodes (SN1-SN3) containing node managers (NM), workstations (WS), clusters, and notebook or palm PCs.]

Figure 2. The memory organization scheme II (HSAS): without large memory support. The central manager is responsible for keeping the up-to-date copy of the whole memory block. When a free node wants to join the system, it first sends a request to the central manager (step 1); the central manager decides which supernode the free node should join (step 2); the free node then contacts the appropriate supernode manager directly (step 3).

all the managers, including the supernode managers and the central manager, are always in good state. Any failure of a supernode manager will affect all the other nodes within its supernode; however, this failure has no effect on other supernodes.
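The intra-supernode eager write-update protocol can be sketched as follows (a much-simplified illustration under our own assumptions; a real protocol must also handle ordering and failures): the supernode manager is the home of every page, and each write updates the home copy and is eagerly pushed to every replica, so all copies stay current after every write.

```python
# A much-simplified sketch of an eager write-update protocol: the
# supernode manager is the home of every page; writers update the home
# copy, and the home eagerly pushes the new value to every replica.

class SupernodeManager:
    def __init__(self):
        self.pages = {}        # page id -> current (home) contents
        self.copies = {}       # page id -> set of nodes holding a replica
        self.replicas = {}     # (node, page) -> that node's local copy

    def read(self, node, page):
        """A page fault: fetch the home copy and register the replica."""
        self.copies.setdefault(page, set()).add(node)
        self.replicas[(node, page)] = self.pages.get(page)
        return self.replicas[(node, page)]

    def write(self, node, page, value):
        self.pages[page] = value                   # update the home copy
        for n in self.copies.get(page, set()):     # eagerly update replicas
            self.replicas[(n, page)] = value
        self.copies.setdefault(page, set()).add(node)
        self.replicas[(node, page)] = value

mgr = SupernodeManager()
mgr.write("ws1", 0, "v1")
mgr.read("ws2", 0)
mgr.write("ws1", 0, "v2")
print(mgr.replicas[("ws2", 0)])   # ws2's replica was updated eagerly
```

The eager push is what makes the scheme tolerant of the dynamic environment described above: since every replica is current after each write, a node that disappears never holds the only valid copy.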

3.3.2 Fault Tolerance

Fault tolerance was neglected in the former two generations of software DSM systems, many of which run on dedicated hardware environments, such as clusters or networks of workstations. As communication becomes faster and faster, it becomes possible to implement software DSM on the Internet, which is characterized by heterogeneity, dynamism, and proneness to faults. Therefore, both the HSAS-LM and HSAS schemes should take fault tolerance into consideration in their design. Fault tolerance has two aspects. On the one hand, the underlying system should provide mechanisms to support fault tolerance. On the other hand, from the application point of view, applications should be reconfigurable to adapt to failures. For example, the TIES scheduling technique is a good way to overcome hardware faults, but this solution comes at the cost of burdening the programmer. What we should do is provide a middleware layer to mask all of this from the high-level end users. It can be seen from Figure 1 and Figure 2 that in both schemes the supernodes are completely autonomous, so the failure of one supernode does not greatly affect the whole system. Our hierarchical schemes therefore adapt effectively to the dynamic environment.

3.3.3 Resource Discovery and Join

With the rapid development of the Internet and the World Wide Web, resource discovery and join have become hot topics. Many groups and companies provide mechanisms to address this problem, such as Sun's Jini technology [31], the IETF Service Location Protocol (SLP) [45], the Simple Service Discovery Protocol [17], Berkeley's secure service discovery service [11], Globe [41], Globus [16], and Legion [18]. Note that "resource" is a very general notion: it covers information services, such as printer and mail servers, as well as computing capacity. For example, when a volunteer wants to contribute his or her computer to other users, the first step is to submit its information, including IP address, CPU type, memory size, and available time, to the central manager (step 1 in Figure 1 and Figure 2); the central manager then decides which supernode this free node should join (step 2), and the free node contacts the corresponding supernode manager directly (step 3). Naturally, security issues such as authentication, authorization, and digital signatures are involved in these steps. When a node wants to leave the system, it must notify its local supernode manager so that the corresponding information directory can be updated. When all the nodes in a supernode have left, that is, when the supernode manager is the last remaining node, the supernode manager notifies the central manager of its own departure. The storage format and management of this information are similar to the LDAP protocol [46]: every service has a name specifier. Clients use name specifiers in their messages to identify the desired resource requirements; service providers use them to advertise the available resources. The two main parts of a name specifier are the attribute and the value. An attribute is a category in which an object can be classified, for example "CPU type". A value is the object's classification within that category, for example "PIII-500". Attributes and values are strings defined by applications. A smart scheduler must run on each of these managers. When a user submits an application, a basic description of the application must be provided. According to this description, the supernode manager decides where to execute the application and whether it needs to ask other supernodes for help. To decide which supernode to ask, the local supernode manager contacts the central manager for the most recent information about the other supernodes; each supernode manager must therefore report its dynamic information to the central manager periodically.

3.3.4 Application Reconfiguration

To adapt to the heterogeneous, dynamic hardware environment, applications must be written in a reconfigurable form.
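The attribute/value matching of name specifiers described above can be sketched as follows. This is an illustrative, simplified model in the spirit of SLP/LDAP-style directories; the class names (`NameSpecifier`, `SupernodeDirectory`) and the sample attributes are assumptions of ours, not part of the paper's design.

```python
# Minimal sketch of name-specifier matching for resource discovery.

class NameSpecifier:
    """A set of attribute/value pairs describing a resource or a query."""
    def __init__(self, **attrs):
        self.attrs = {k: str(v) for k, v in attrs.items()}

    def matches(self, query):
        """A resource matches if every attribute the client asks for
        is present with the same value."""
        return all(self.attrs.get(k) == v for k, v in query.attrs.items())

class SupernodeDirectory:
    """Per-supernode directory of advertised resources (LDAP-like)."""
    def __init__(self):
        self.services = []          # list of (node_id, NameSpecifier)

    def register(self, node_id, spec):
        self.services.append((node_id, spec))

    def deregister(self, node_id):
        """Called when a node notifies the manager that it is leaving."""
        self.services = [(n, s) for n, s in self.services if n != node_id]

    def lookup(self, query):
        return [n for n, s in self.services if s.matches(query)]

# A volunteer node advertises itself (steps 1-3 in Figures 1 and 2).
directory = SupernodeDirectory()
directory.register("node-17",
                   NameSpecifier(ip="10.0.0.17", cpu_type="PIII-500",
                                 memory_mb=256, available="18:00-08:00"))

# A client asks for any PIII-500 machine.
print(directory.lookup(NameSpecifier(cpu_type="PIII-500")))   # -> ['node-17']
```

Deregistration on departure keeps the directory consistent, mirroring the leave protocol described above.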
However, the traditional programming model of software DSM systems is SPMD (single program, multiple data), which assumes that the number of nodes participating in the computation is fixed throughout execution. As such, this programming model cannot adapt to a dynamic environment. Fortunately, OpenMP, a recently proposed standard for shared-memory programming, takes this requirement into account directly: the number of computing threads can vary dynamically according to application requirements and/or changes in the underlying hardware. To support this capability, some directives must be added to the application. Moreover, applications themselves can adapt to the dynamic environment automatically; that is, an application should offer alternative trade-offs (tunable applications) between the number of processors and the desired results. For example, suppose an application is running with 8-way parallelism and, during execution, two processors are shut down. The application must then tune itself to run on 6 processors, resulting in a longer total execution time. Chang et al. [8] study this problem in detail. In any case, how to describe the tunability of an application requires further research in order to provide a convenient programming model.
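The 8-to-6-processor scenario above can be sketched as a repartitioning step: when processors leave, the application recomputes its work distribution over the survivors. The helper names (`partition`, `TunableApp`, `reconfigure`) are illustrative; in a real system the reconfiguration would be triggered by the runtime, not by an explicit call.

```python
# Sketch of a "tunable" application that repartitions its work when
# the number of available processors changes mid-run.

def partition(n_items, n_workers):
    """Split n_items into n_workers nearly equal contiguous chunks."""
    base, extra = divmod(n_items, n_workers)
    chunks, start = [], 0
    for w in range(n_workers):
        size = base + (1 if w < extra else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks

class TunableApp:
    def __init__(self, n_items, n_workers):
        self.n_items = n_items
        self.chunks = partition(n_items, n_workers)

    def reconfigure(self, n_workers):
        """Called when processors join or leave (e.g. 8 -> 6)."""
        self.chunks = partition(self.n_items, n_workers)

app = TunableApp(n_items=1000, n_workers=8)
app.reconfigure(6)      # two processors shut down mid-execution
print([len(c) for c in app.chunks])   # -> [167, 167, 167, 167, 166, 166]
```

Each surviving processor simply picks up a larger chunk, which is exactly why total execution time grows when parallelism drops from 8 to 6.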

4 Related Work

Software DSM systems have been widely studied over the past 14 years. M. Rasit Eskicioglu of the University of Alberta maintains an on-line bibliography of the distributed shared memory area, and the web site provided by Prof. Peter Keleher at the University of Maryland lists almost all related projects in this area. A survey of software DSM systems is presented in chapter 1 of [37]. The first heterogeneous distributed shared memory prototype, named Mermaid, was designed and implemented by Songnian Zhou et al. [48]. Mermaid was built on the IVY DSM system and supports the C language, and many fundamental issues related to heterogeneity were discussed in detail in that paper. However, the Mermaid prototype did not take the dynamic behaviour of the hardware environment into account. Furthermore, to perform data conversion automatically, Mermaid required the programmer to declare the type when allocating an object, and the conversion relies on non-standard compiler support. GMS [15] allows a homogeneous operating system to use cluster-wide main memory to avoid disk accesses. However, GMS was not designed with scalability, persistence, security, or interoperability in mind, which limits its applicability. Globus and Legion are two leading projects in wide-area high-performance computing. Both study issues related to heterogeneous and dynamic environments; however, the programming model they support is limited to message passing, which differs substantially from our objective. The Globe project, led by A. S. Tanenbaum at Vrije Universiteit [41], is a wide-area distributed object system whose objective is to provide a set of common distribution services on the Internet. The concept of a distributed shared object proposed there differs from that of distributed shared memory in that Globe's sharing granularity is more flexible and larger than that of DSM systems.
Actually, from the software viewpoint, Globe sits at a higher level. Building on distributed shared memory techniques, Khazana [7] provides shared-state management services to distributed application developers, including consistent caching, automatic replication and migration of data, location management, and access control. However, Khazana currently does not support heterogeneous systems, and its fault-tolerance capability is limited. The InterWeave project at Rochester [9] represents a merger and extension of their previous Cashmere and InterAct projects, combining hardware coherence within small multiprocessors, Cashmere-style release consistency within tightly coupled clusters, and InterAct-style version-based consistency for wide-area systems. Its unit of coherence across heterogeneous distributed machines is the shared segment, an object-based approach, whereas our approach is page-based. The MILAN project [3], which includes three prototypes, Calypso, Chime, and Charlotte, makes several fundamental contributions to metacomputing. Charlotte is the first parallel programming system to provide one-click computing on the World Wide Web. Both the system and its applications use Java as the programming language, and shared data is implemented with a shared class type. Although Java hides some aspects of heterogeneity, such as data conversion and page size, many legacy applications must be rewritten to run on this new platform. Most importantly, Charlotte does not provide a true single address space for high-level applications, so programs with complex pointers or large memory requirements cannot benefit from this platform. Finally, several ongoing projects address heterogeneous systems, such as MSHN [14] and Harness [32]; neither provides support for a single address space.

5 Conclusions

In this paper, we evaluate the applicability of state-of-the-art software DSM techniques for supporting a single shared address space in large, dynamic, distributed systems with heterogeneous computing components. The main contributions of this paper include the following three aspects:

1. Based on detailed analysis, ten challenges related to implementing a single shared address space in a heterogeneous, dynamic scenario are listed. Furthermore, for each challenge, we discuss the applicability of state-of-the-art techniques that are widely used in homogeneous software DSM systems.

2. For two kinds of typical applications, two hierarchical schemes, HSASLM for scientific applications and HSAS for information service applications, are proposed to implement a single shared address space on heterogeneous platforms.

3. Four key problems inherent in both the HSASLM and HSAS schemes, namely coherence information maintenance, fault tolerance, resource discovery and join, and application adaptation, are analyzed, and partial solutions are proposed.

References

[1] H. E. Bal, M. F. Kaashoek, and A. S. Tanenbaum. Orca: A language for parallel programming of distributed systems. IEEE Trans. on Software Engineering, 18(3):190–205, March 1992.
[2] H. E. Bal, A. S. Tanenbaum, and M. F. Kaashoek. Orca: A language for distributed programming. ACM SIGPLAN Notices, 25(5):17–24, May 1990.
[3] A. Baratloo, P. Dasgupta, V. Karamcheti, and Z. M. Kedem. Metacomputing with MILAN. In Eighth Heterogeneous Computing Workshop, April 1999.
[4] R. Bianchini, L. I. Kontothanassis, R. Pinto, M. De Maria, M. Abud, and C. L. Amorim. Hiding communication latency and coherence overhead in software DSMs. In Proc. of the 7th Symp. on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VII), pages 198–209, October 1996.
[5] A. Bilas, C. Liao, and J. P. Singh. Accelerating shared virtual memory using commodity NI support to avoid asynchronous message handling. In Proc. of the 26th Annual Int'l Symp. on Computer Architecture (ISCA'99), May 1999.
[6] M. A. Blumrich, C. Dubnicki, E. W. Felten, and K. Li. User-level DMA for the SHRIMP network interface. In Proc. of the Fifth Workshop on Scalable Shared Memory Multiprocessors, June 1995.
[7] J. B. Carter, A. Ranganathan, and S. Susarla. Building clustered services and applications using a global memory system. In Proc. of the 18th Int'l Conf. on Distributed Computing Systems (ICDCS-18), May 1998.
[8] F. Chang, V. Karamcheti, and Z. Kedem. Exploiting application tunability for efficient, predictable parallel resource management. In Proc. of the Second Merged Symp. IPPS/SPDP 1999, April 1999.
[9] D. Chen, S. Dwarkadas, S. Parthasarathy, and M. L. Scott. InterWeave: A middleware system for distributed shared state. In Proc. of the Fifth Workshop on Languages, Compilers, and Runtime Systems for Scalable Computers, May 2000.
[10] D. E. Culler and J. P. Singh. Parallel Computer Architecture: A Hardware/Software Approach, chapter 12. Morgan Kaufmann, Inc., 1998.
[11] S. Czerwinski, B. Zhao, T. Hodes, A. Joseph, and R. Katz. An architecture for a secure service discovery service. In Proc. of ACM/IEEE MOBICOM, pages 24–35, August 1999.
[12] C. Dubnicki and T. LeBlanc. Adjustable block size coherent caches. In Proc. of the 19th Annual Int'l Symp. on Computer Architecture (ISCA'92), pages 170–180, May 1992.
[13] S. Dwarkadas, N. Hardavellas, L. Kontothanassis, R. Nikhil, and R. Stets. Cashmere-VLM: Remote memory paging for software distributed shared memory. In Proc. of the Second Merged Symp. IPPS/SPDP 1999, April 1999.
[14] D. Hensgen et al. An overview of MSHN: The management system for heterogeneous networks. In Eighth Heterogeneous Computing Workshop, April 1999.
[15] M. J. Feeley, W. E. Morgan, F. H. Pighin, A. R. Karlin, and H. M. Levy. Implementing global memory management in a workstation cluster. In Proc. of the 15th ACM Symp. on Operating Systems Principles (SOSP-15), pages 201–212, December 1995.
[16] I. Foster and C. Kesselman. Globus: A metacomputing infrastructure toolkit. International Journal of Supercomputer Applications, 11(2):115–128, 1997.
[17] Y. Goland, T. Cai, P. Leach, Y. Gu, and S. Albright. Simple Service Discovery Protocol/1.0. Technical report, http://search.ietf.org/internet-drafts, December 1999.
[18] A. Grimshaw, F. Knabe, and M. Humphrey. Wide-area computing: Resource sharing on a large scale. IEEE Computer, 32(5):29–37, May 1999.
[19] W. Hu, W. Shi, and Z. Tang. JIAJIA: An SVM system based on a new cache coherence protocol. In Proc. of the High-Performance Computing and Networking Europe 1999 (HPCN'99), pages 463–472, April 1999.
[20] L. Iftode and J. P. Singh. Shared virtual memory: Progress and challenges. Proc. of the IEEE, Special Issue on Distributed Shared Memory, 87(3):498–507, March 1999.
[21] K. L. Johnson, M. F. Kaashoek, and D. A. Wallach. CRL: High-performance all-software distributed shared memory. In Proc. of the 15th ACM Symp. on Operating Systems Principles (SOSP-15), pages 213–228, December 1995.
[22] Z. M. Kedem, K. Palem, and P. Spirakis. Efficient robust parallel computations. In Proc. of the ACM Symp. on the Theory of Computing, April 1990.
[23] P. Keleher, S. Dwarkadas, A. L. Cox, and W. Zwaenepoel. TreadMarks: Distributed shared memory on standard workstations and operating systems. In Proc. of the Winter 1994 USENIX Conference, pages 115–131, January 1994.
[24] J. Kuskin, D. Ofelt, M. Heinrich, J. Heinlein, R. Simoni, K. Gharachorloo, J. Chapin, D. Nakahira, J. Baxter, M. Horowitz, A. Gupta, M. Rosenblum, and J. L. Hennessy. The Stanford FLASH multiprocessor. In Proc. of the 21st Annual Int'l Symp. on Computer Architecture (ISCA'94), pages 302–313, April 1994.
[25] J. Laudon and D. Lenoski. The SGI Origin: A ccNUMA highly scalable server. In Proc. of the 24th Annual Int'l Symp. on Computer Architecture (ISCA'97), pages 241–251, June 1997.
[26] D. E. Lenoski, J. Laudon, K. Gharachorloo, W.-D. Weber, A. Gupta, J. L. Hennessy, M. Horowitz, and M. S. Lam. The Stanford DASH multiprocessor. IEEE Computer, 25(3):63–79, March 1992.
[27] K. Li. Shared Virtual Memory on Loosely Coupled Multiprocessors. PhD thesis, Department of Computer Science, Yale University, September 1986.
[28] K. Li. IVY: A shared virtual memory system for parallel computing. In Proc. of the 1988 Int'l Conf. on Parallel Processing (ICPP'88), volume II, pages 94–101, August 1988.
[29] K. Li and P. Hudak. Memory coherence in shared virtual memory systems. In Proc. of the 5th Annual ACM Symp. on Principles of Distributed Computing (PODC'86), pages 229–239, August 1986.
[30] T.-Y. Liang, D.-Y. Chuang, and C.-K. Shieh. Thread selection in software DSM systems. In Proc. of the 1st Workshop on Software Distributed Shared Memory (WSDSM'99), June 1999.
[31] Sun Microsystems. Jini technology. Technical report, http://java.sun.com/products/jini, 1998.
[32] M. Migliardi and V. Sunderam. Heterogeneous distributed virtual machines in the Harness metacomputing framework. In Eighth Heterogeneous Computing Workshop, April 1999.
[33] T. C. Mowry, C. Chan, and A. Lo. Comparative evaluation of latency tolerance techniques for software distributed shared memory. In Proc. of the 4th IEEE Symp. on High-Performance Computer Architecture (HPCA-4), pages 300–311, February 1998.
[34] S. K. Reinhardt, J. R. Larus, and D. A. Wood. Tempest and Typhoon: User-level shared memory. In Proc. of the 21st Annual Int'l Symp. on Computer Architecture (ISCA'94), pages 325–337, April 1994.
[35] D. J. Scales and K. Gharachorloo. Shasta: A system for supporting fine-grain shared memory across clusters. In Proc. of the Eighth SIAM Conference on Parallel Processing for Scientific Computing, March 1997.
[36] D. J. Scales, K. Gharachorloo, and C. A. Thekkath. Shasta: A low overhead, software-only approach for supporting fine-grain shared memory. In Proc. of the 7th Symp. on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VII), pages 174–185, October 1996.
[37] W. Shi. Improving the Performance of Software DSM Systems. PhD thesis, Institute of Computing Technology, Chinese Academy of Sciences, available at http://www.cs.nyu.edu/˜weisong/pubs.html, November 1999.
[38] W. Shi, Y. Mao, and Z. Tang. Communication substrate for software DSMs. In Proc. of the 11th IASTED Int'l Conf. on Parallel and Distributed Computing and Systems (PDCS'99), November 1999.
[39] W. Shi and Z. Tang. Affinity-based self scheduling for software shared memory systems. In Proc. of the 6th Int'l Conf. on High Performance Computing (HiPC'99), December 1999.
[40] W. E. Speight and J. K. Bennett. Using multicast and multithreading to reduce communication in software DSM systems. In Proc. of the 4th IEEE Symp. on High-Performance Computer Architecture (HPCA-4), pages 312–322, February 1998.
[41] M. van Steen, P. Homburg, and A. S. Tanenbaum. Globe: A wide-area distributed system. IEEE Computer Magazine, 5(4):379–400, April 1999.
[42] R. Stets, S. Dwarkadas, L. Kontothanassis, U. Rencuzogullari, and M. L. Scott. The effect of network total order and remote-write capability on network-based shared memory computing. In Proc. of the 6th IEEE Symp. on High-Performance Computer Architecture (HPCA-6), January 2000.
[43] K. Thitikamol and P. Keleher. Per-node multithreading and remote latency. IEEE Transactions on Computers, 47(4):414–426, April 1998.
[44] K. Thitikamol and P. Keleher. Thread migration and communication minimization in DSM systems. Proc. of the IEEE, Special Issue on Distributed Shared Memory, 87(3):487–497, March 1999.
[45] J. Veizades, E. Guttman, C. Perkins, and S. Kaplan. Service Location Protocol, RFC 2165. Technical report, http://www.ietf.org/rfc/rfc2165.txt, June 1997.
[46] M. Wahl, T. Howes, and S. Kille. Lightweight Directory Access Protocol (version 3), RFC 2251. Technical report, http://www.ietf.org/rfc/rfc2251.txt, December 1997.
[47] W. M. Yu and A. L. Cox. Java/DSM: A platform for heterogeneous computing. In Proc. of Java for Computational Science and Engineering–Simulation and Modeling Conf., pages 1213–1224, June 1997.
[48] S. Zhou, M. Stumm, K. Li, and D. Wortman. Heterogeneous distributed shared memory. IEEE Trans. on Parallel and Distributed Systems, 3(5):540–554, September 1992.
[49] Y. Zhou, L. Iftode, and K. Li. Performance evaluation of two home-based lazy release consistency protocols for shared virtual memory systems. In Proc. of the 2nd Symp. on Operating Systems Design and Implementation (OSDI'96), pages 75–88, October 1996.