An Optically Interconnected Distributed Shared Memory System: Architecture and Performance Analysis

Kalyani Bogineni and Patrick W. Dowd
Department of Electrical and Computer Engineering, State University of New York at Buffalo, Buffalo, NY 14260
Abstract

This paper introduces an optically interconnected distributed shared memory (OIDSM) system. The distributed shared memory (DSM) approach integrates both shared memory and distributed memory system ideas to extract the strengths of each while balancing their respective weaknesses. The OIDSM system is a DSM system based on a photonic network to support the high communication requirement of DSM. The OIDSM employs wavelength division multiple access on the photonic network, enabling multiple channels to be formed on a single optical fiber. A result of the high communication capacity is the simplification of the global address mapping problem; this simplified, uniform address allocation scheme is introduced. The advantages of the proposed approach are examined through a performance analysis based on a closed queueing network which has been validated through extensive simulation. The performance of the OIDSM system is evaluated in terms of the transaction time of a memory request and the system throughput. The impact of variations in the number of channels and processors in the system on these metrics is studied. The effects of variations in memory and channel service times are also evaluated. Index Terms: parallel computer architecture, distributed shared memory, optical communication, wavelength division multiplexing, performance analysis.
1 Introduction

Distributed shared memory (DSM) systems have attracted much interest since their introduction in 1986 [1]. The DSM system introduced in this paper combines the concepts of shared memory and distributed memory with a fast interconnection network made possible through the use of optical interconnects. This system has the topological advantages of distributed memory systems, the programming advantages of shared memory systems, and superior performance through the high bandwidth characteristics of optical interconnections. There is currently much activity in the design and investigation of DSM systems [2, 3, 4, 5, 6, 7, 8]. Distributed shared memory systems have a global address space that is shared by all the processors in the system [9]. It can be implemented in different ways. One way is to assign address space to each processor, with accesses by other processors through message passing [2, 3]. DSM can also be implemented as a virtual memory shared by all the processors, as in [5, 6, 7], where memory mapping managers maintain memory coherence. Fleisch and Popek discuss the disadvantages of message passing in a multiprocessor environment and of tightly coupled processors with physical shared memory [4]. Data is stored in segments, and processes sharing data attach these segments to their virtual address space. They propose techniques for reducing the network traffic that arises in maintaining coherence between segments. Any updates to segments are visible to all sharing processes, thus simplifying memory coherence schemes. Their technique differs from that proposed in [5, 6, 7] in implementational aspects and provides additional flexibility to the user. Slow network performance has often been a disadvantage in tightly coupled multiprocessors, resulting in considerable delay for network accesses. Minnich and Farber overcome the problem of network delay by reducing the network load [8]. Transfer time is minimized by storing shared data in 32-byte pages.
This work was supported by the National Science Foundation under Grants CCR-9010774 and ECS-9112435.
Work presented in [2, 3] addresses the coherency problem in multiprocessors that support shared memory at the instruction-set level, while [5, 6, 7] deal with the design of shared virtual memory in a distributed environment and its implementation in the Aegis operating system of the Apollo domain. The problem of memory coherency is addressed, and data sharing in distributed memory is achieved through procedure calls. Process allocation and data allocation are closely related. This fact has been used in proposing a dynamic data allocation scheme for shared memory multiprocessors in [10]. Process allocation was considered static. A mechanism was proposed which tracks the access patterns to pages in the shared memory; migration of a page was initiated when an access imbalance was observed. This scheme aimed at reducing communication latency through a complex implementation of a migration mechanism that tracks page accesses. The network latency was also identified in [11] as the chief cause of memory latency in shared memory multiprocessors. The technique proposed in [11] to reduce the latency time on the network was to place directories at the network switches. The directories were used by memory access requests to identify the location of the memory module that contains the desired data. This scheme was further complicated by the requirement to maintain the directories by tracking the migration of data blocks.
Different dynamic memory allocation schemes have been proposed to reduce memory access times and thereby improve the performance of the system. Dynamic allocation needs the resource requirements of a job to be known before it is assigned to a particular node, and was used to reduce network accesses [10, 11]. An alternate approach is to uniformly allocate the address space to all the N nodes. The OIDSM system avoids the complexity of dynamic allocation and implements the simpler uniform allocation method, since network latency and bandwidth are no longer principal limiting factors. Several DSM systems have been commercially built, the first being the Heterogeneous Element Processor (HEP), a MIMD multiprocessor system [12]. Each process execution module (PEM) has its own memory, from which it regularly executes. A global shared memory is accessible by all PEMs through a packet switched network. Switches on the network maintain routing tables to route packets toward their destinations. I/O caches are also maintained to increase the transfer rate from the global memory. The Tera machine is also a commercial DSM system, which employs very fast context switching between many processes [13]. The memory organization and the interconnection network largely determine the overall system performance. The slow performance of the network has governed the design of DSM organization. Improved response time was achieved by reducing network traffic [8], reducing memory contention [4], and by operating system support [5, 6, 7]. The optically interconnected distributed shared memory (OIDSM) system introduced in this paper overcomes the problem of network delay through a high performance photonic network for interprocessor communication. This eliminates the need for a centralized shared memory and facilitates the use of distributed memory; the contention problem at the shared memory is reduced by distributing the memory. A result of the high communication capacity is the simplification of the global address mapping problem. The photonic network of the DSM system introduced here alleviates the traffic load concern which governed the design of earlier DSM systems, and enables the development of a fixed memory allocation scheme that has lower complexity than dynamic schemes. The photonic network employs wavelength division multiple access (WDMA), enabling multiple channels to be formed on a single optical fiber through wavelength-division multiplexing. A multiple access environment for the photonic network of the OIDSM can be achieved through a variety of optical channel topologies [14] (described in Section 2.2.2). System size limitations in terms of the optical power budget (OPB) were compared in [15, 16], and the star-coupled configuration was shown to exhibit superior fanout characteristics over optical bus-based systems. Star-coupled networks have high fault tolerance due to their passive nature and complete unity-distance connectivity [17, 16]. This high connectivity is achieved with low system complexity through the multiple access nature of the system. The OIDSM system is modeled using a closed queueing network which is solved using Mean Value Analysis. The performance is evaluated in terms of the transaction time of a memory request and the system throughput
for variations in the system size and the number of channels in the network. The system performance is predicted for large system sizes to demonstrate the scalable nature of the system if a large number of channels can be obtained in a WDM optical network. A uniform address allocation scheme is introduced for DSM systems and studied through a performance analysis. The analytic performance model has been compared to simulation results and is shown to accurately portray the system performance. The performance analysis presented in this paper illustrates the features of this system and provides a comparison with DSM systems not based on optical networks. The organization of this paper is as follows. Section 2 provides background information and the definitions used in this paper: parallel computer architectures are briefly described in Section 2.1, optical interconnects in Section 2.2, and the memory organization of OIDSM in Section 2.3. The closed queueing network used to model the OIDSM and the performance metrics are derived in Section 3. The analysis of the results is provided in Section 4.
2 Architecture of OIDSM

This section provides background information for the OIDSM system. Section 2.1 describes various parallel computer architectures, the motivation for this work, and the proposed OIDSM system. Section 2.2 provides information on optical components used in optical networks, optical network architectures, and media access control protocols for access arbitration of the channels in the WDM network. Section 2.3 presents memory allocation schemes typically employed in parallel computers and the memory organization of the OIDSM system.
2.1 Parallel Computer Architectures
Parallel computers may be classified according to memory organization as shared memory systems, distributed memory systems, or distributed shared memory systems.
Figure 1: General configurations of parallel computers. (a) Shared memory system. (b) Distributed memory system. (c) Distributed shared memory system.
Shared Memory Systems: In this class of systems, a single global address space is shared by all processors. Shared memory systems can be based on circuit or packet switching. A typical configuration of such a system with N processors and
M memory modules is depicted in Fig. 1(a). Task execution on this class of systems involves code and data references by the processors to the shared memory modules. Contention occurs at both the interconnection network and the memory modules. Private caches are a typical approach to reducing this problem: contention is reduced at the cost of the increased hardware complexity required to maintain coherence between the caches and main memory. Significant advantages of shared memory systems are the ease of programming and good performance with small systems. A main disadvantage is the limited system size due to the limited communication bandwidth and the increase in memory contention. Large shared memory systems face significant difficulty: (multi)bus and multi-stage interconnection network complexity, contention, and control.
Distributed Memory Systems: Each processor in this class of systems has its own memory and its own private address space. A general configuration of the distributed memory system is shown in Fig. 1(b). The processors are typically interconnected through a packet switched communication network. Application development for distributed memory systems faces difficulties: the message passing nature of sharing data and the relatively small size of the individual memory. The message passing mechanism involves packet formation, routing, and decoding. This complexity, together with the relative programming difficulty, has hampered distributed memory systems from becoming a more dominant approach to parallel architecture. Distributed memory systems have a strong advantage in their ability to support large system sizes, extending into the massively parallel region.
Distributed Shared Memory Systems: Each processor in a DSM system is associated with a portion of the system address space, which can be accessed by the other processors through message passing. A general configuration for a DSM system with N processors is shown in Fig. 1(c). Each processor typically executes from its associated memory, as in distributed systems; however, all processors have access to the entire address space, as in shared memory systems. The required blocks are brought into the address space allotted to the processor and used for process execution. This configuration is very attractive in that it combines the topological advantages of distributed systems with the programming advantages of shared memory systems. The application base of the resulting system is wider than for either of the two approaches individually, moving the system in the direction of general-purpose parallel computing. Distributed shared memory systems are constructed as a distributed memory system but appear to the programmer as a shared memory system. The shared memory appearance is obtained by mapping the memory modules located at the processors into a global address space. Each processor is associated with its own memory, which can also be accessed by the other processors in the system through message passing. However, the message passing is initiated by the system to support the shared memory appearance and is hidden from the programmer. Message passing involves transmitting and receiving request and response packets. For example, to support the view of shared memory in a distributed memory system, a node first identifies the target node that contains the target memory block and transmits a request packet to it; the target node then transmits the requested block back to the requesting node. This may create a contention problem at the network and the global memory module, increasing the average access delay. Memory allocation schemes have been designed to reduce network access by staging the data locally.
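To make this sequence concrete, the sketch below shows a remote block read under the shared-memory view; all names here (such as owner_of and the network object) are illustrative assumptions, not an implementation from the paper:

```python
def owner_of(block_id, num_nodes):
    # Uniform allocation (Section 2.3): blocks are interleaved across nodes.
    return block_id % num_nodes

def read_block(local_node, block_id, network, num_nodes):
    """System-initiated message passing, hidden from the programmer."""
    target = owner_of(block_id, num_nodes)
    if target == local_node.node_id:
        return local_node.lgm[block_id]   # local access: no network trip
    # Remote access: request packet out, response packet (the block) back.
    network.send(target, ("REQ", local_node.node_id, block_id))
    _, block = network.receive(local_node.node_id)
    return block
```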
Motivation for proposed work: Message passing involves transmitting requests to the target memory blocks and receiving the required blocks over the network. There is a contention problem at the network and the memory modules, since a block could be required by any processor, resulting in high network delay. The design of DSM systems has
hence evolved around network constraints. The design constraints have changed with the introduction of optical interconnection networks: the network is no longer a principal constraint. The limitations due to network bandwidth have essentially vanished, and performance improves due to the superior performance of the interconnection network. This enables issues such as process migration and load balancing to be viewed from a new perspective. The combination of the concepts of shared memory, distributed memory, and optical interconnects (described in Section 2.2) results in a system with the advantages of all three: ease of programming, easily expandable systems, and fast low-cost performance.
Proposed DSM system: The DSM structure proposed here differs significantly from DSM structures proposed in earlier papers: it is based on an optical interconnection network (OIDSM). The system proposed in [5] has the shared memory concept implemented as a virtual memory; the memory manager manages the page faults, and the unit of transfer between virtual memory and processors is a page. The DSM system proposed in this paper has a physical address space allotted to processors, and a large virtual address space that is mapped onto the physical address space through the use of a memory management unit at the operating system level (described in Section 2.3). The unit of transfer between shared memory modules is a block. The DSM system proposed in [2] has a local memory assigned to each processor, with sharing through message passing. That system can have copies of blocks in more than one local memory, so multiple copies of the same block may exist in the global physical address space. The memory organization of the OIDSM system is designed to ensure that only one copy of a block exists in the global physical address space at any given time. This paper develops a performance analysis of the OIDSM system. The following sections describe the interconnection scheme and the memory organization of the proposed system.
2.2 Optical Interconnections

Optical communication has emerged as a recent development in processor interconnection. Extensive work has been done in the area of optical interconnection networks and their suitability to parallel computers [17, 18, 16, 19, 20]. Optical interconnects may provide high performance interprocessor communication. Analysis of optical interprocessor communication in a distributed memory system showed that a large system can be built with low system complexity [18]. Optical interconnects also result in highly fault tolerant systems due to the large connectivity. Photonic switching can be implemented using space-division multiplexing, time-division multiplexing, or wavelength-division multiplexing [21]. It has been shown in [18] that Wavelength Division Multiple Access (WDMA) channels result in better performance characteristics of average distance, diameter, and average packet delay compared to traditional distributed memory architectures such as the hypercube. WDMA provides more efficient resource allocation, and the network also performs well under dynamically changing loads. The OIDSM interconnection network is based on WDMA channels. Characteristics such as increased fanout, very large bandwidth, high reliability, low power requirements, reduced crosstalk, and immunity to electromagnetic interference make optical networks highly desirable. Multiple optical channels can be formed on a single optical fiber through wavelength multiplexing to form Wavelength Division Multiple Access (WDMA) channels. This is an approach to circumvent the speed mismatch between the optics and the interface electronics: multiple channels are created on a single fiber rather than a single (very) fast channel. The WDM channels form a set that can be individually switched and routed. However, tunable transmitters and receivers are required to achieve wavelength division multiplexing. Potential areas of application for photonic networks are expanding rapidly due to the significant advances in wavelength tunable laser diode transmitters and wavelength tunable receivers. The following section briefly outlines some of the characteristics of wavelength tunable devices.
2.2.1 Optical Components

WDM networks require tunable transmitters and/or receivers to switch between the multiple channels created on the fiber. This section briefly outlines the functional characteristics of the tunable devices.
Transmitters: Tunable lasers can be achieved through thermal, mechanical, injection-current, and acousto-optic means [22, 23]. Thermal and mechanical approaches yield slow tunable devices (millisecond to microsecond tuning times), while injection-current techniques are capable of tuning speeds of a few nanoseconds. Dense WDM networks have become possible through the significant advances in narrow linewidth Distributed Feedback (DFB) and Distributed Bragg Reflector (DBR) tunable lasers and filters, and low cost star couplers [24]. The tuning range of a DFB laser is limited to 10-15 nm due to heating and nonradiative recombination [25]. A DBR laser diode was also demonstrated in [25] to achieve 50 separate channels with a switching time between channels of 15 nanoseconds. Spectral slicing is a low cost alternative to tunable laser diodes that has recently been introduced for this environment [26]. A tunable transmitter with C channels is constructed with C LEDs and a WDM multiplexer per node; the multiplexer extracts the desired wavelength for each channel and blocks the remaining spectrum of the LED. A system has been constructed with 16 channels, each operating at 500 Mbps, with off-the-shelf components.
Receivers: Wavelength selectivity can be achieved either with a coherent receiver or with a tunable filter and direct detection. The first approach is more expensive but has higher channel selectivity: channels can be placed closer together, so a greater number of channels can be formed in the tunable range. A lower cost alternative is a wavelength tunable filter with direct detection. Non-coherent wavelength tunable filters can be constructed with a variety of techniques: Fabry-Perot and Mach-Zehnder approaches based on the wavelength dependence of interferometric phenomena, wavelength-dependent coupling through acousto-optic or electro-optic techniques, and resonant amplification that provides gain as well as wavelength selectivity [23, 27]. Fabry-Perot and Mach-Zehnder filters have been constructed in which 30 channels have been separated with millisecond switching speed [24]. Active filters have either electro-optic or acousto-optic control; a filter bandwidth of 1 nm has been achieved with both approaches [23]. However, acousto-optic devices have a tuning range across the full 1.3-1.56 µm range, while electro-optic devices are limited to about 15 nm [23, 28]. Acousto-optic devices have slower switching speeds (milliseconds) than electro-optic devices (nanoseconds). Acousto-optic filters have the additional advantage that, by superimposing multiple acoustic control signals, multiple channels may be selected by a single acousto-optic filter [25, 28, 27]. The term receiver is used throughout the remainder of this paper to denote either a coherent receiver or a receiver based on a wavelength tunable filter with direct detection.
2.2.2 Network Architectures

Two classes of WDM networks are Wavelength Routing (WR) and Broadcast-Select (BS) [24]. Wavelength selectable devices are contained within a WR network, and the transmitted wavelength completely determines the path through the network. A BS network is passive with no internal wavelength selectivity; selectivity is obtained with tunable sources and/or tunable wavelength filters or receivers. BS networks can be used to create optical multiple access channels [29, 30] or virtual point-to-point interconnections through the incorporation of a multi-hop scheme in the wavelength domain [31, 32, 33, 34, 35, 36]. For example, the perfect shuffle point-to-point WDM network was introduced in [37] using fixed-channel transceivers. Tunable transceivers were considered in [36] to enable redefining the network's logical connectivity in response to traffic flow irregularities.
A major problem hindering the development of photonic based communication networks is the speed mismatch between the electronic and optical components. The low loss region of a single mode optical fiber has a bandwidth of about 30 THz [24]. This illustrates the speed mismatch problem: the optical media is capable of speeds far exceeding the maximum speeds of the electronic interface components. Wavelength-division multiplexing (WDM) is an approach that circumvents the speed mismatch problem. Multiple optical channels can be formed on a single optical fiber through wavelength multiplexing to form Wavelength Division Multiple Access (WDMA) channels, each operating at the data rate limited by the electronic interface components. This achieves a significant improvement in bandwidth utilization, allowing concurrent transmission along the multiple channels. Depending on the architecture, optical self-routing is achievable, where a node only receives data destined to it and the system has the non-blocking connectivity characteristics of a crossbar [16]. WDMA partitions the enormous optical bandwidth into multiple channels, and a media access control protocol is required to provide arbitration for each channel. Different schemes have been considered to exploit the unidirectional nature of the optical fiber and to harness its high capacity [19]. Optical self-routing partitions the traffic, relaxing the design constraints on the receiver subsystem since a node now only has to receive and process a fraction of the network traffic. This is an important characteristic, since the photonic network can support a throughput rate far beyond the packet processing capability of a typical node. This approach has only recently become feasible due to the advances in wavelength tunable transmitters and receivers and low cost coupling devices [24, 25]. In particular, this paper focuses on photonic star-coupled passive networks that use WDM to create wavelength division multiple access (WDMA) channels. The multiaccess approach targets higher topological connectivity, low system complexity, improved fault tolerance, and enhanced performance through multiple WDMA channels [15, 14, 17].
2.2.3 Star-coupled Interconnection

A multiple access environment for the photonic network of the OIDSM can be achieved through a variety of optical channel topologies [14], as shown in Fig. 2. System size limitations in terms of the optical power budget (OPB) were compared in [15, 16], and the star-coupled configuration was shown to exhibit superior fanout characteristics over optical bus-based systems. Analysis of optically based interprocessor communication methods in [16] showed that passive star-coupled systems support larger system sizes in comparison to optical bus-based systems. Star-coupled configurations can have a single channel or multiple channels. The single-channel architecture is based on fixed receivers and transmitters. A passive star-coupler broadcasts a message to all processors, uniformly distributing all incoming optical power among the output waveguides. All transmitters and receivers operate on the same wavelength, and a node must receive and process all channel traffic. After a node receives a packet (o/e and serial-to-parallel conversion), the header of the packet is decoded to determine its destination address. The packet is routed according to the address r-tuple comparison in a multi-dimensional system [19]; on a single channel system, the packet is discarded if the destination field and the local processor addresses do not match. The bandwidth of the link is not limited by the media but by the transformation and decoding process of the interface electronics described above: an optical fiber is capable of very high bandwidth. WDM based architectures attempt to harness the additional bandwidth capability of the optical fiber by partitioning the bandwidth into multiple channels, each operating at the maximum data rate of the interface components. Tunable optical components are used which can shift their operating wavelength to access different channels. This approach assumes the available optical bandwidth is partitioned into a comb of narrow channels using Wavelength Division Multiple Access (WDMA). A general star-coupled optically interconnected system with tunable transmitters and receivers is shown in Fig. 2(d). Although the optical star-coupled system logically appears to the system as a high speed bus, it has the additional advantage of optical self-routing, where a
node only has to process packets directed to it. The two main constraints that limit system size are the optical power budget and the number of channels created through WDM. The OPB limitation is due to the physical characteristics of the transmitters, receivers, fiber, and coupling devices. The number of channels restricts system size due to performance constraints: the total traffic to be supported increases as the system size increases. A regular hypercube-based structure has recently been introduced in [18, 19] that eliminates both constraints through the coupling of space and wavelength switching.

Figure 2: Optical multiple access channel configurations: (a) folded unidirectional bus, (b) dual unidirectional bus, (c) doubly folded unidirectional bus, and (d) star-coupled configuration shown with tunable transmitters and receivers to achieve WDM.
2.2.4 Media Access Control Protocols

The total bandwidth of the system is proportional to the number of channels created through WDM. Media access control protocols for this environment must be developed to obtain the increase in interconnection bandwidth over electronic links [20, 38]. The system has two levels of access protocols. The first is WDMA, which partitions the enormous bandwidth of a single fiber into multiple, more manageable channels. The object of this is to exploit the self-routing characteristic between source and destination along a single channel: a receiver subsystem only processes traffic destined to it. The design requirements on this subsystem are eased since the volume of traffic processed per node is reduced. This is an approach to obtain the high capacity characteristics of optical communication while circumventing the speed mismatch with electronic components. A second level of media access protocol provides access in the time domain along each channel. A system with multiple access channels can be designed either to allow collisions and rely on a positive acknowledgment scheme, or to avoid collisions through some reservation or fixed allocation approach. Media access control protocols developed for photonic star-coupled WDM networks may be broadly classified into reservation and pre-allocation strategies [20, 19].
Reservation techniques may designate one wavelength channel as the control channel, used to reserve access on the remaining channels (designated as data channels) for data packet transmission [38, 39]. The control channel is used to transmit control information and reserve access on the data channels; media access control protocols are required to provide arbitration on both the data and control channels. Pre-allocation techniques pre-assign the channels to the nodes, where each node has a home channel that it uses either for all data packet transmissions or all data packet receptions. This eliminates the requirement that a node possess both a tunable transmitter and a tunable receiver. Pre-allocation may be achieved by either specifying the channel a node will use to transmit (requiring a tunable receiver and a fixed transmitter) or to receive (requiring a tunable transmitter and a fixed receiver). In a system where channels are pre-allocated for data packet reception, a source node tunes its transmitter to the home channel of the destination node and transmits according to the media access protocol. A home channel may be shared by several nodes if the number of nodes exceeds the number of channels. Any node in the system can determine the home channel of any other node in a decentralized fashion with knowledge of the destination node number and the total number of nodes and channels in the system [20, 18, 40]. This approach does not require a control channel: all channels are used for data transmission [40, 20]. Random and static access approaches for networks with fixed transmitters and tunable receivers, and for networks with tunable transmitters and fixed receivers, have been considered in [40, 16, 20].
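As an illustration, the decentralized home-channel computation can be as simple as a modulo mapping; the exact assignment below is our assumption, since the paper only requires that it be computable from the destination node number and the system parameters:

```python
def home_channel(dest_node, num_channels):
    # Home channel under pre-allocation for reception: any node can compute
    # this locally, so no control channel is needed.
    return dest_node % num_channels

# Example: with N = 64 nodes and C = 32 channels, nodes 5 and 37 share home
# channel 5; a source tunes its transmitter to channel 5 to reach either one.
```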
Reservation protocols are often more complex than pre-allocation protocols since the transfer is based on two stages: reservation and transmission [20]. Depending on the implemented protocol, collisions may occur during control and/or data packet transmission which require a retransmission of both. Pre-allocation approaches appear to be very promising due to their low implementational and operational complexity.
2.3 Memory Organization of OIDSM

The memory organization of the OIDSM system mainly consists of a physical memory shared by all the processors in the system, as shown in Fig. 3. The memory hierarchy at each node is as follows: each processor has a conventional cache (CC), a local cache (LC), and a local portion of the global memory that it manages. The global physical memory is assigned to the processors based on the allocation scheme, and the sections of memory allocated to the processors are called memory modules. The memory that contains the local portion of the global memory associated with each processor is called local global memory (LGM); the other memory modules are non-local global memory (NLGM) to this processor. The unit of transfer between LC, LGM, and NLGM modules is a block. The unit of transfer between physical and virtual memory is a page. A DSM system uses the (programming) communication methods of shared memory systems in a distributed system environment. The two main sources of delay in satisfying memory requests in a shared memory system are the access time of the main memory and the delay due to communication on the interconnection network. With optical interconnects, the communication delay is reduced, thereby improving the performance of the system. The requirement of small block sizes in previous DSM approaches has been eased, since the fast transmission rate of optical networks enables larger blocks of data to be transferred from NLGM modules. As mentioned above, there is only one copy of a block in physical memory. This is possible because blocks resident in NLGM modules but needed by a processor are staged in the local cache. Since the processor has a separate fast cache, the local cache is used only for staging the large blocks being transmitted from other memory modules. Hence this cache can use the same inexpensive memory as the LGM, and can also be much larger than the conventional, expensive fast cache. By virtue of locality of reference, the presence of the large local cache results in faster response times. The CC is an expensive, fast cache from which the processor executes regularly; this is similar to the on-chip cache of the Intel i860 [41]. The LC and LGM are inexpensive memory constructed by logically partitioning the physical memory located at each node. The LC here is similar to the program memory associated with
each PEM of the HEP [12]. The hit ratio of the CC determines the frequency of requests to the memory modules. The LC is used for staging the large blocks of data being transferred from NLGM modules, and is much larger than the CC. By virtue of locality of reference, a process references blocks immediately before or after the currently executing block with higher probability; if larger blocks are transferred when a fault occurs, subsequent faults can be reduced. The presence of a large LC enables larger blocks to be transferred from the shared memory modules. This increases the hit ratio of the LC, thereby reducing the response time of a memory request. The servicing of a memory request is as follows: the processor executes regularly from its own CC. When a miss occurs at the CC, the request is directed to the LC. When a miss occurs at the LC, the required block could be present in the LGM or in any of the NLGM modules. If it is present in the LGM of the processor, the block is first copied into the LC, and the line needed is brought into the CC of the processor for execution. If the required block is not in the LGM, the request is transmitted via the photonic network to the target NLGM module, and the required block is copied into the LC before any reference to a line in this block is made. The presence of the LC at each node results in the system having only one copy of every block in the global physical memory. This is not the case in other proposed systems, where there can be multiple copies of the same block in physical memory [2]; read-only copies may be present in any of the LC modules.
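A sketch of this request path, with illustrative class and field names (the paper specifies the hierarchy, not this code):

```python
def service_request(node, line_addr, system):
    """Trace a memory reference through CC -> LC -> LGM/NLGM (illustrative)."""
    if node.cc.hit(line_addr):                 # conventional cache
        return node.cc.read(line_addr)
    block_id = line_addr // system.block_size
    if not node.lc.hit(block_id):              # local cache (staging area)
        owner = block_id % system.num_nodes    # uniform allocation
        if owner == node.node_id:
            block = node.lgm[block_id]         # block is in this node's LGM
        else:                                  # fetch over a WDMA channel
            block = system.remote_fetch(node.node_id, owner, block_id)
        node.lc.fill(block_id, block)          # stage the large block locally
    node.cc.fill(line_addr, node.lc.read(block_id))
    return node.cc.read(line_addr)
```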
Figure 3: Memory organization of OIDSM. (a) Memory management unit. (b) Memory mapping.

The OIDSM system has a two-level hierarchy for memory management, incorporating virtual memory: memory management is done at the operating system level and at the node level. Memory management at the node level handles all the page faults of the local processor and the requests from other processors; it maintains coherence between cache and the global shared memory (local and otherwise), managing the pages owned by the processor. Memory management at the operating system level allocates the physical address space to processors according to the allocation scheme; page faults at the physical memory level, and block replacements due to faults, are handled at this level. Two advantages result from this memory configuration. Message passing is used for transferring blocks between the local cache and other local shared memory modules, but it is
hidden from the programmer, and this is the cause of the simpler application programmer interface compared with traditional distributed memory systems. The problem of process migration is also reduced to a very large extent, since the total physical address space is shared by all processors: process migration just involves the transfer of pointers, with no physical transfer of data and code done immediately. Data and code are transferred on a demand basis. DSM systems incorporate both distributed and shared memory concepts; they are not limited by the memory assigned to each processor. Access to shared memory modules involves network access, so the network latency is a part of the memory latency. Techniques for reducing memory latency have been proposed in [5, 6, 7, 11, 10], and different dynamic allocation schemes have been proposed. Memory allocation and process allocation are closely related in parallel computers. Dynamic allocation needs the resource requirements of a job to be known before it is assigned to a particular node, and was used to reduce network accesses. An alternate approach is to uniformly allocate the address space to all the N nodes. The performance of the OIDSM with both types of memory allocation schemes has been studied in [42]. It has been shown that the OIDSM system can avoid the complexity of dynamic allocation and implement the simpler uniform allocation method, since network latency and bandwidth are no longer principal limiting factors.
This paper examines the resulting performance when an OIDSM system uniformly distributes the physical address space among the N processors. In the uniform allocation scheme, a request sent by a processor to the shared memory can be directed to any one of the N memory modules. This leads to the assumption that local cache misses result in requests to the shared memory modules owned by other nodes in a random and uniform manner. In a dynamic allocation scheme, there would be a very high probability of the request being directed to the LGM module. With an optical interconnection network, which has a very fast transmission rate and many channels, the fixed allocation scheme results in requests being dispersed uniformly over all the memory modules in the system. This results in less traffic to the LGM and lower delay for a request, because of reduced queue lengths at the slow memory modules. Contention at the network channels and at the memory modules between requests from different processors results in delays in servicing a request. The performance model introduced in the following section assumes that the processor may context switch to the next process when an external access occurs, to obtain the maximum utilization of the processor. However, the model could also examine the case where the processors remain blocked during an LC miss, by setting the number of jobs in the system equal to the number of nodes.
3 Queueing Network Model of OIDSM

A closed queueing network is used to model the OIDSM. The OIDSM system described in Section 2 is assumed to have N nodes, each with its own memory. The optical network is assumed to have C distinct channels. The operation of the system is as follows: jobs entering the system are forwarded to a processor selected at random. The jobs are queued at the processors. When a currently executing process requires an external access, the processor could either block and wait for the request to be satisfied, or context switch and allow a different process to begin execution. We present a non-blocking model in this section, where a processor context switches to the next job on an LC miss. A request to the shared memory can be directed to the LGM or to one of the NLGM modules. All the memory modules in the system service requests from local and non-local processors. Requests from the local processor compete with requests from other nodes and may be queued. After the memory request has been serviced, the process is returned to the ready state and is queued if the processor is executing another job. If the request is to an NLGM module, it contends with other accesses for transmission on a communication channel and is routed to the appropriate memory module. After the non-local memory request is serviced at the destination memory module, it is queued at the network again before finally returning to the processor queue.
$N$: system size
$C$: number of communication channels in the network
$\alpha$: probability of accessing a shared memory module
$\beta$: probability of spawning a new job
$V_{P_i}$: visit count to processor $i$
$V_{M_j}$: visit count to memory module $j$
$V_{C_k}$: visit count to communication channel $k$
$J$: total number of jobs circulating in the system
$X$: throughput of the system
$X_{P_i}$: throughput of processor $i$
$X_{M_j}$: throughput of memory module $j$
$X_{C_k}$: throughput of communication channel $k$
$R_{P_i}$: residence time at processor $i$
$R_{M_j}$: residence time at memory module $j$
$R_{C_k}$: residence time at communication channel $k$
$Q_{P_i}$: queue length at processor $i$
$Q_{M_j}$: queue length at memory module $j$
$Q_{C_k}$: queue length at communication channel $k$
$U_{P_i}$: utilization of processor $i$
$U_{M_j}$: utilization of memory module $j$
$U_{C_k}$: utilization of communication channel $k$
$t_A$: transaction time of a memory request

Table 1: Notation used in the OIDSM model.
The system has $N$ processors, denoted $P_1, P_2, \ldots, P_N$. The $N$ memory modules associated with the $N$ processors are denoted $M_1, M_2, \ldots, M_N$, and the $C$ communication channels are denoted $C_1, C_2, \ldots, C_C$. It is assumed that the characteristics of each set of processors, memory modules, and channels are independent and identically distributed. The probabilistic events used throughout this analysis are: $M$, the event of a memory request occurring; $S$, the event of spawning a new job; $A$, the event of accessing the LGM; $B$, the event of accessing an NLGM; $P$, the event of assigning a new job to the same processor; and $O$, the event of assigning a new job to another processor in the system. When a job enters the system, it is assumed that it can be assigned to any processor with equal probability. The probabilities of the above events are $\mathrm{Prob}[M] = \alpha$ and $\mathrm{Prob}[S] = \beta = 1 - \alpha$, since $\alpha + \beta = 1$. With uniform allocation of the address space to all the processors, $\mathrm{Prob}[A \mid M] = \frac{1}{N}$ and $\mathrm{Prob}[B \mid M] = \frac{N-1}{N}$, so the unconditional probabilities are $\mathrm{Prob}[A] = \frac{\alpha}{N}$ and $\mathrm{Prob}[B] = \frac{\alpha(N-1)}{N}$. Similarly, $\mathrm{Prob}[P] = \frac{\beta}{N}$ and $\mathrm{Prob}[O] = \frac{\beta(N-1)}{N}$.
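As a numeric illustration (ours, not from the paper): with $N = 64$ and $\alpha = 0.98$,

$$\mathrm{Prob}[A] = \frac{0.98}{64} \approx 0.015, \qquad \mathrm{Prob}[B] = \frac{0.98 \times 63}{64} \approx 0.965,$$

so under uniform allocation almost every shared-memory request is directed to a non-local module and must traverse the photonic network.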
An external request can be serviced by any one of the C channels with equal probability. Packets being transmitted through the channels are either request or service (response) packets. Request and response packets can be of different sizes, resulting in different channel service times; however, it is assumed in this analysis that both request and response packets have equal size, since media access protocols are typically slotted in this environment [20]. Since there is one response packet for each request packet, each occurs with equal frequency. We assume that a request packet can be directed to any of the $N - 1$ other memory modules with equal probability, and that the response packets are directed to the processors with equal probability. These assumptions follow from the uniform allocation scheme of the OIDSM.
External requests return to the processor queue after service at the memory modules and the communication
network. A job executes without interruption at a processor until a miss occurs at the local cache; at this stage, the processor context switches to another job. The time between requests to the shared memory is represented by the service time of the processor and is taken to be exponential. The assumptions for the analysis of this model are: uniformly distributed address space, fixed page size, fixed packet size, and a fixed number of communication channels (C). With the above assumptions, the memory module service times are deterministic. The service times at a communication channel depend on the optical media access control protocol. This paper assumes a static allocation scheme in both wavelength and time. Packet sizes are fixed for request and response packets, so the channel service times are deterministic [20]. However, the waiting times at these queues can vary. The arrivals at the processors include processes completing service at other centers with non-exponential service times, so the distribution of the arrival process at the processors is not known. The service times at the different service centers in this queueing network are not all exponential, so the model results in a non-product-form network.
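One static wavelength/time allocation consistent with these assumptions is sketched below (the paper assumes slotted, statically allocated channels but does not prescribe this exact layout):

```python
def slot_owner(channel, slot, num_nodes, num_channels):
    """Which node may transmit on `channel` during time `slot` (illustrative).
    Nodes are pre-allocated to channels round-robin, and nodes sharing a
    channel take turns in a fixed cyclic order, so transmissions are
    collision-free and the channel service time is deterministic."""
    sharers = [n for n in range(num_nodes) if n % num_channels == channel]
    return sharers[slot % len(sharers)]

# Example: N = 64, C = 32 gives two nodes per channel, each transmitting in
# every other slot of its pre-allocated channel.
```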
Figure 4: Closed queueing network model for analyzing OIDSM.

The assumptions for the queueing network model of OIDSM are: queues at all centers are infinite, and simultaneous requests to a resource are queued and serviced under a First Come First Served scheduling scheme. The closed queueing network model for the entire system is shown in Fig. 4. In addition to the above operations, a job being executed at a node can spawn new processes, which could be executed at the same node or migrated to any other node. This feature is also represented in Fig. 4. The next section describes the analysis of this non-product-form queueing network.
Table 2: Transition probability matrix of the model of OIDSM.
3.1 Mean Value Analysis

A solution technique for closed queueing networks is Mean Value Analysis [43]. The steps are:

1. Little's Law is applied to the queueing network as a whole to obtain the throughput of the system as the number of jobs completed per unit time. The number of jobs circulating in the system is specified.
2. The Forced Flow Law is applied to each service center to obtain its throughput.
3. Little's Law is applied to each service center individually to obtain the queue length at each center.
4. The Utilization Law is used to obtain the utilization of each service center, describing the time spent by each center in useful work.
5. The residence time at each service center includes the waiting time in the queue at the service center and the service time at that center.

With the aid of Fig. 4, the movement of jobs in the system can be traced. The transition probabilities from one service center to another are used to build the transition probability matrix, from which the steady-state transition probabilities can be computed. These also give the visit counts to each service center, i.e., the number of times a job visits a center before completion. The visit counts are then used to compute the residence time of a job at each service center, the throughput of each center, and the system throughput. Using the system throughput, the queue lengths at each service center can be computed, and the center throughputs can be used to compute the utilization of each center. The transition probability matrix is shown in Table 2. This matrix is used to obtain the steady-state probabilities of each service center, which are also the visit counts to those centers. The matrix of Table 2 is solved to obtain the steady-state equations:
for $1 \le i \le N$,

$$V_{P_i} = \frac{\beta}{N} V_{P_i} + \frac{\beta(N-1)}{N^2}\left(V_{P_1} + \cdots + V_{P_N}\right) + \frac{1}{N} V_{M_i} + \frac{1}{2N}\left(V_{C_1} + \cdots + V_{C_C}\right) \qquad (1)$$

for $1 \le j \le N$,

$$V_{M_j} = \frac{\alpha}{N} V_{P_j} + \frac{1}{2N}\left(V_{C_1} + \cdots + V_{C_C}\right) \qquad (2)$$

and for $1 \le k \le C$,

$$V_{C_k} = \frac{\alpha(N-1)}{NC}\left(V_{P_1} + \cdots + V_{P_N}\right) + \frac{N-1}{NC}\left(V_{M_1} + \cdots + V_{M_N}\right) \qquad (3)$$

with the normalizing condition

$$V_{P_1} + \cdots + V_{P_N} = 1 \qquad (4)$$
Eqn. (4) is the normalizing equation for the analysis; it derives from the fact that a job entering the system is assigned to exactly one of the processors for execution. Summing Eqn. (2) over $1 \le j \le N$, we obtain

$$V_{M_1} + \cdots + V_{M_N} = \frac{\alpha}{N}\left(V_{P_1} + \cdots + V_{P_N}\right) + \frac{1}{2}\left(V_{C_1} + \cdots + V_{C_C}\right) \qquad (5)$$

Similarly, summing Eqn. (3) over $1 \le k \le C$ results in

$$V_{C_1} + \cdots + V_{C_C} = \frac{\alpha(N-1)}{N}\left(V_{P_1} + \cdots + V_{P_N}\right) + \frac{N-1}{N}\left(V_{M_1} + \cdots + V_{M_N}\right) \qquad (6)$$
These linear equations are solved to obtain the visit counts: $V_{P_i} = \frac{1}{N}$, $V_{M_j} = \frac{\alpha}{N}$, and $V_{C_k} = \frac{2\alpha(N-1)}{NC}$. The performance metrics analyzed in this paper use these visit counts. The Forced Flow Law requires that the throughputs in all sections of the system be proportional to one another, with the visit counts as the proportionality factors. Therefore, the throughput of a processor is $X_{P_i} = X V_{P_i}$ for all $1 \le i \le N$, the throughput of a memory module is $X_{M_j} = X V_{M_j}$ for all $1 \le j \le N$, and the throughput of a communication channel is $X_{C_k} = X V_{C_k}$ for all $1 \le k \le C$, where $X$ is the system throughput. Little's Law is applied to each processor to obtain the queue length $Q_{P_i} = X_{P_i} R_{P_i}$, to each memory module to obtain $Q_{M_j} = X_{M_j} R_{M_j}$, and to each channel to obtain $Q_{C_k} = X_{C_k} R_{C_k}$.
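A quick numeric check (ours, not from the paper) confirms that these closed forms satisfy the balance equations at one parameter point:

```python
# Verify the closed-form visit counts against Eqs. (1)-(4) for N=64, C=32.
N, C, alpha = 64, 32, 0.98
beta = 1 - alpha
VP, VM, VC = 1 / N, alpha / N, 2 * alpha * (N - 1) / (N * C)

# Eq. (1): all N processor terms are equal, so V_P1 + ... + V_PN = N * VP.
rhs1 = (beta / N) * VP + (beta * (N - 1) / N**2) * (N * VP) \
       + VM / N + (C * VC) / (2 * N)
assert abs(VP - rhs1) < 1e-12
# Eq. (2)
assert abs(VM - ((alpha / N) * VP + (C * VC) / (2 * N))) < 1e-12
# Eq. (3)
rhs3 = (alpha * (N - 1) / (N * C)) * (N * VP) + ((N - 1) / (N * C)) * (N * VM)
assert abs(VC - rhs3) < 1e-12
# Eq. (4): the processor visit counts are normalized.
assert abs(N * VP - 1) < 1e-12
```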
The Utilization Law gives the utilization of each service center in terms of the center throughput and the service time: the utilization of a processor is $U_{P_i} = X_{P_i} S_{P_i}$, the utilization of a memory module is $U_{M_j} = X_{M_j} S_{M_j}$, and the utilization of a communication channel is $U_{C_k} = X_{C_k} S_{C_k}$. The average queue at a service center includes all the jobs waiting for service at the center and the jobs being serviced, so the sum of the queue lengths at all the service centers gives the total number of jobs in the system; using the notation of Table 1, $N Q_{P_i} + N Q_{M_j} + C Q_{C_k} = J$. Using the expressions obtained earlier for the queue lengths and center throughputs, the system throughput is obtained as

$$X = \frac{J}{N V_{P_i} R_{P_i} + N V_{M_j} R_{M_j} + C V_{C_k} R_{C_k}} \qquad (7)$$
The residence times at each of the service centers are obtained from the queue lengths and service times of the centers. The residence time at a processor is given by

$$R_{P_i} = S_{P_i}\left[1 + \frac{(J-1)/N}{J/N}\, Q_{P_i}\right]$$

where $S_{P_i}$ is the service time for the job just entering the queue and the second term is the time that a job entering the queue must wait for the jobs ahead of it to be serviced; taking the visit count into account gives the total residence time at the service center. Likewise, the residence time at a memory module is

$$R_{M_j} = S_{M_j}\left[1 + \frac{(J-1)/N}{J/N}\, Q_{M_j}\right]$$

and the residence time at a communication channel is

$$R_{C_k} = S_{C_k}\left[1 + \frac{(J-1)/N}{J/N}\, Q_{C_k}\right]$$

The utilization and queue length expressions are used to solve the residence time equations. The aggregate time for a request to the LGM or an NLGM to be serviced and returned to the processor can now be obtained as

$$t_A = \frac{1}{N} R_{M_j} + \frac{N-1}{N}\left[R_{C_k} + R_{M_j} + R_{C_k}\right] \qquad (8)$$

$$\phantom{t_A} = R_{M_j} + \frac{2(N-1)}{N}\, R_{C_k} \qquad (9)$$

The above equations are used to evaluate the performance of the OIDSM system; the results are presented in the next section.
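The model can be evaluated numerically by iterating these relations to a fixed point. The sketch below is our construction (the paper does not list its solution procedure); it uses the visit counts derived above and the correction factor $\frac{(J-1)/N}{J/N} = \frac{J-1}{J}$ from the residence time equations:

```python
def solve_oidsm(N, C, alpha, Sp, Sm, Sc, J, iters=100000, tol=1e-10):
    """Fixed-point evaluation of Eqs. (7)-(9); a sketch with simple damping."""
    VP, VM, VC = 1 / N, alpha / N, 2 * alpha * (N - 1) / (N * C)
    QP = QM = QC = 0.0                   # initial queue-length guesses
    f = (J - 1) / J                      # ((J-1)/N) / (J/N)
    for _ in range(iters):
        RP = Sp * (1 + f * QP)           # residence times at each center
        RM = Sm * (1 + f * QM)
        RC = Sc * (1 + f * QC)
        X = J / (N * VP * RP + N * VM * RM + C * VC * RC)      # Eq. (7)
        nQP, nQM, nQC = X * VP * RP, X * VM * RM, X * VC * RC  # Little's Law
        if max(abs(nQP - QP), abs(nQM - QM), abs(nQC - QC)) < tol:
            break
        # Damped update for robust convergence at high load.
        QP, QM, QC = (QP + nQP) / 2, (QM + nQM) / 2, (QC + nQC) / 2
    tA = RM + 2 * (N - 1) / N * RC                             # Eq. (9)
    return X, tA

X, tA = solve_oidsm(N=64, C=32, alpha=0.98, Sp=2, Sm=4, Sc=1, J=1000)
```

For these parameters the memory modules saturate first, bounding the throughput by $1/(V_{M_j} S_{M_j}) \approx 16$, consistent with the plateaus discussed in Section 4.2.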
4 Analysis of Performance Metrics

This section presents the results of the analysis given in the previous section with a uniform address space allocation scheme. The metrics analyzed are the transaction time of a memory request and the system throughput, under variations in the number of jobs circulating in the system, the system size, the number of channels in the communication network, the memory service time, and the channel service time. Other metrics considered are the queue length, residence time, and utilization of each center. The analytic model and its assumptions are validated through comparison to simulation. Unless otherwise noted, the following normalized service times are used in the sections that follow to illustrate the behavior of the OIDSM system: $S_{P_i} = 2$, $S_{M_j} = 4$, and $S_{C_k} = 1$. The relative service times were obtained as follows. The service time of a processor center is taken to be the interarrival time between two requests to the shared memory modules (a request that misses both the conventional cache and the local cache). The service time of a memory module depends on the block size, the memory bus width (in bytes per transfer), and the speed of transfer from memory (in CPU cycles per transfer):

$$S_{M_j} = \frac{\text{bytes/block}}{\text{bytes/transfer}} \times (\text{cycles/transfer})$$

The service time of a channel depends on the packet size (the block size) and the data rate of the communication channel:

$$S_{C_k} = \frac{\text{bytes/packet}}{\text{bytes/transfer}} \times (\text{cycles/transfer})$$

For example, suppose the OIDSM is based on 33 MHz processors with a CPI of 1.25, and the conventional cache and local cache have hit ratios of 98% and 80%, respectively. On average, an external memory request then occurs every 7.5 µs. Suppose the local cache is based on 1024-byte blocks; with a memory bus transfer rate of 4 bytes/transfer and 2 CPU cycles/transfer, the memory access time is 15.4 µs. For 1024-byte packets transferred over the optical network at a speed of 2 Gbps, the channel service time is 4 µs. Normalizing, we obtain the above relative service times.
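A sketch of the arithmetic behind this normalization (the parameter values are the paper's; the reckoning, including the assumed 1.3 memory references per instruction needed to roughly reproduce the 7.5 µs figure, is ours):

```python
cpu_hz, cpi = 33e6, 1.25        # 33 MHz processor, CPI of 1.25
cc_hit, lc_hit = 0.98, 0.80     # conventional- and local-cache hit ratios
refs_per_instr = 1.3            # our assumption (instruction + data fetches)
block_bytes = 1024              # local-cache block size = packet size
bus_bytes, bus_cycles = 4, 2    # memory bus: 4 bytes every 2 CPU cycles
channel_bps = 2e9               # 2 Gbps optical channel

ext_miss = (1 - cc_hit) * (1 - lc_hit)                # miss in both CC and LC
Sp = (cpi / cpu_hz) / (refs_per_instr * ext_miss)     # ~7.3 us between requests
Sm = (block_bytes / bus_bytes) * bus_cycles / cpu_hz  # ~15.5 us memory service
Sc = block_bytes * 8 / channel_bps                    # ~4.1 us channel service
scale = 2 / Sp                                        # normalize so Sp = 2
print(round(Sp * scale, 1), round(Sm * scale, 1), round(Sc * scale, 1))
# prints approximately 2, 4, 1
```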
4.1 Validation of the Model

This section contains the results pertaining to the computation of the transaction time in the OIDSM system through analytical modeling and simulation. The transaction time defined by Eqn. (9) is the time required for an access to a global shared memory module to complete service and return to the processor. The simulator developed to study the performance of the OIDSM system is based on a stochastic self-driven discrete event model [44, 45]. Steady-state transaction times were measured for the system with the same parameters considered in the above sections. Simulation convergence was obtained through the replication/deletion method [44], with 98% confidence in a less than 2% variation from the mean.
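For reference, a minimal sketch of that convergence test (the helper and the z-approximation to the Student-t quantile are our assumptions; the paper cites [44] for the method):

```python
import math
import statistics

def converged(rep_means, z98=2.326, rel=0.02):
    """True when the 98% confidence half-width is under 2% of the mean.
    `rep_means` holds one steady-state mean per replication, computed after
    deleting each run's warm-up transient (replication/deletion method)."""
    n = len(rep_means)
    mean = statistics.mean(rep_means)
    half_width = z98 * statistics.stdev(rep_means) / math.sqrt(n)
    return half_width / mean < rel
```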
Figure 5: Comparison of analytical and simulation results for transaction time with $\alpha = 0.98$, $S_p = 2$, $S_m = 4$, $S_c = 1$. (a) $N = 64$ and varying $C \in \{N, N/2, N/4\}$. (b) $C = N/2$ and varying $N \in \{16, 64, 256, 1024\}$.

Fig. 5(a) shows the transaction time for $N = 64$ with $C \in \{N, N/2, N/4\}$ and $\alpha = 0.98$, $S_p = 2$, $S_m = 4$, and $S_c = 1$, as a function of the number of jobs circulating in the system. The maximum number of jobs that the system can support under variations in the parameters $C$ and $N$ is also studied. From Fig. 5(a), it can be seen that when the number of channels $C$ is increased from $N/4$ to $N/2$, the system capacity increases by 25%; however, only a tiny improvement in capacity is obtained when $C$ is increased from $N/2$ to $N$. As will be seen in later sections, this is because the communication network ceases to be a system limitation for $C \ge N/2$. The transaction time of a memory request decreases as $C$ is increased for a fixed number of jobs in the system. For $J = 1000$ and $C = N/4$, the transaction time is about 120 time units, whereas it is only about 70 time units for $C = N/2$: doubling $C$ reduces the transaction time to roughly 58% of its former value. For a fixed system size, the transaction time can be maintained at a low value for a very large number of jobs by increasing the number of channels in the communication network; this holds as long as $C \le N/2$. The transaction times predicted by the analytic model and the simulation deviated by less than 5%. Fig. 5(b) compares the transaction time obtained through analysis and simulation as a function of $J$, with $C = N/2$ and the number of nodes varying as $N \in \{16, 64, 256, 1024\}$. As will be seen in later sections, increasing $C$ beyond $N/2$ does not improve the system performance, so the number of channels is chosen as $C = N/2$ for this comparison. The maximum number of channels and system size that can be achieved using WDM is limited with current technology; however, system sizes beyond $N = 64$ are shown to validate the proposed OIDSM analytic model even for very large systems. The saturation point decreases as $N$ decreases: the number of jobs that can be executed in the system with reasonable transaction time decreases. When $N$ is decreased from 64 to 16 (a decrease of 75%), the system capacity in terms of jobs also decreases
by 75%. A linear increase in N results in a linear increase in system capacity.
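The exact form of Eqn. (9) is given in Section 3. As an illustration of how such transaction-time curves can be generated, the sketch below applies standard exact mean value analysis [43] to a closed network with N processor, N memory, and C channel centers. The visit ratios (one memory visit per transaction spread uniformly by the address allocation scheme, and two channel visits, one each way) and the treatment of transaction time as the cycle time minus the processor share are our assumptions; they are consistent with the utilizations reported in Section 4.5, not a restatement of the paper's equations.

```python
def mva(N, C, Sp, Sm, Sc, J):
    """Exact MVA sketch for the closed OIDSM model (not Eqn. (9) itself).

    Centers: N processors, N memory modules, C channels, all FCFS.
    Assumed visits per transaction: 1/N per processor and per memory
    module (uniform allocation), 2/C per channel (request + response).
    """
    counts = (N, N, C)
    S = (Sp, Sm, Sc)
    V = (1.0 / N, 1.0 / N, 2.0 / C)
    Q = [0.0, 0.0, 0.0]                    # mean queue length at one center
    for j in range(1, J + 1):
        R = [S[k] * (1.0 + Q[k]) for k in range(3)]         # per-visit residence
        cycle = sum(counts[k] * V[k] * R[k] for k in range(3))
        X = j / cycle                                        # system throughput
        Q = [X * V[k] * R[k] for k in range(3)]
    transaction = cycle - R[0]             # cycle time minus the processor share
    return transaction, X, Q

# Example: reproduces the shape of Fig. 5(a); at J = 1000 the transaction
# time is near 120 time units for C = N/4 and near 70 for C = N/2.
for C in (16, 32, 64):
    print(C, mva(64, C, 2, 4, 1, 1000)[0])
```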
4.2 System Throughput

The system throughput is defined as the number of jobs completed per unit time. Fig. 6 shows the variation of system throughput as a function of the number of jobs circulating in the system (J). The factors governing the system throughput are the system size (N), the number of channels (C), J, and the address allocation scheme.
Figure 6: System throughput as J increases with hit ratio 0.98, Sp = 2, Sm = 4, and Sc = 1. (a) N = 64 with varying C ∈ {N, N/2, N/4}. (b) C = 16 with varying N ∈ {16, 32, 64}.

Fig. 6(a) plots the system throughput for N = 64 with C ∈ {16, 32, 64}. When C is increased from N/4 to N/2, the maximum system throughput increases from 8 to 16. But when C is increased from N/2 to N, there is no significant increase in the maximum system throughput. This shows that the number of communication channels limits the maximum system throughput for C < N/2, but ceases to be the limiting factor for C > N/2. Fig. 6(b) plots the system throughput as a function of J with C = 16 and N ∈ {16, 32, 64}. When N is increased from 16 to 32, the maximum system throughput increases from 4 to 8. When N is increased from 32 to 64, there is little increase in the maximum system throughput. As N increases, the ratio N/C increases since the number of channels remains fixed. The communication network becomes the bottleneck of the system when N/C > 2. The network is not the limiting factor for the N = C = 16 case, but is one of the limiting factors in the other two cases. These factors will be studied in the next few sections.
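The saturation levels in Fig. 6 are consistent with simple asymptotic bottleneck bounds [43]. The sketch below assumes, as before, one memory visit and two channel visits per transaction; with Sm = 4 and Sc = 1, the channel bound C/(2·Sc) equals the memory bound N/Sm exactly at C = N/2, which is one way to see why adding channels beyond that point does not raise throughput.

```python
def throughput_bounds(N, C, Sp, Sm, Sc):
    """Asymptotic (large-J) throughput bound for the closed OIDSM model.

    A sketch under the same assumed visit ratios as before: one processor
    visit, one memory visit, and two channel visits per transaction.
    """
    proc_bound = N / Sp         # all N processors busy
    mem_bound = N / Sm          # all N memory modules busy
    chan_bound = C / (2 * Sc)   # all C channels busy, two visits/transaction
    return min(proc_bound, mem_bound, chan_bound)

# Reproduces the saturation levels of Fig. 6:
print([throughput_bounds(64, C, 2, 4, 1) for C in (16, 32, 64)])  # [8.0, 16.0, 16.0]
print([throughput_bounds(N, 16, 2, 4, 1) for N in (16, 32, 64)])  # [4.0, 8.0, 8.0]
```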
Figure 7: Queue lengths for hit ratio 0.98, Sp = 2, Sm = 4, and Sc = 1. Graphs (a)-(c) consider N = 64 with C ∈ {N, N/2, N/4}, while graphs (d)-(f) have C = 16 with N ∈ {16, 32, 64}.
Figure 8: Residence times for hit ratio 0.98, Sp = 2, Sm = 4, and Sc = 1. Graphs (a)-(c) consider N = 64 with C ∈ {N, N/2, N/4}, while graphs (d)-(f) have C = 16 with N ∈ {16, 32, 64}.
Figure 9: Utilization for hit ratio 0.98, Sp = 2, Sm = 4, and Sc = 1. Graphs (a)-(c) consider N = 64 with C ∈ {N, N/2, N/4}, while (d)-(f) plot C = 16 with N ∈ {16, 32, 64}.
4.3 Queue Lengths at the Resources

Fig. 7 shows the queue lengths at each service center for two cases: fixed N = 64 with C varying as C ∈ {N, N/2, N/4}, and fixed C = 16 with N varying as N ∈ {16, 32, 64}. Fig. 7(a)-(c) show the queue lengths at each processor, each memory module, and each communication channel as a function of J for fixed N and varying C. From Fig. 7(a) it can be seen that as C is decreased from N/2 to N/4, the queue length at the processor decreases by 60% for J = 960. The change in processor queue length is not significant (4%) for a decrease in C from N to N/2 at the same J. Fig. 7(b) shows the number of jobs queued for service at the memory module. For a decrease in C from N to N/2, there is a 25% decrease in the jobs waiting for service at the memory module for J = 960. When C is decreased from N/2 to N/4, there is a 90% decrease in the number of jobs waiting at the memory module. The reason can be seen from Fig. 7(c), where the number of jobs waiting for service at the communication channels increases by 85% for a decrease in C from N/2 to N/4 at J = 960. When C is decreased from N to N/2, the number of jobs waiting for service increases by 90% for the same J. From this set of graphs, it can be concluded that memory is the principal limiting factor as long as there is a sufficient number of channels in the network. As the number of channels is decreased, the bottleneck shifts from the memory to the communication network. The number of jobs waiting for service at the processor is negligible compared to those waiting at the memory and network.

Fig. 7(d)-(f) show the queue lengths at each service center for fixed C and varying N. Fig. 7(d) plots the processor queue length for C = 16 and N ∈ {16, 32, 64}. The processor queue length decreases by 10% when N is increased from 16 to 32 at J = 480. When N is increased from 32 to 64, the queue length decreases by 62%. Fig. 7(e) shows the queue length at the memory modules: it decreases by 63% when N is increased from 16 to 32 at J = 480, and by 90% when N is increased from 32 to 64. The queue length at the communication channel increases by 89% when N is increased from 16 to 32 and by 70% when N is increased from 32 to 64 at J = 480. The processor queue is negligible compared to the memory and channel queues. The channel is the limiting factor as N increases with C fixed, as can be seen from Fig. 7(f); the bottleneck shifts to the memory when N = C, as observed in Fig. 7(e). Thus, the ratio of the number of channels to the system size is an important factor in determining system performance.
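As a usage example, the mva() sketch from Section 4.1 reproduces this bottleneck shift, assuming that function is in scope; the printed per-center queue lengths mirror the trends of Fig. 7(a)-(c).

```python
# Per-center mean queue lengths at J = 960 for N = 64 (cf. Fig. 7(a)-(c)).
for C in (64, 32, 16):
    _, _, Q = mva(64, C, 2, 4, 1, 960)   # mva() as sketched in Section 4.1
    print(f"C={C:2d}: processor={Q[0]:.2f} memory={Q[1]:.2f} channel={Q[2]:.2f}")
```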
4.4 Residence Times at the Resources

The residence time of a center is the time taken for a job to complete its work at that center, including the waiting time in the queue and the service time for all visits to that center. Fig. 8 shows the residence times at each of the service centers for the two cases: fixed N with varying C, and fixed C with varying N.

Fig. 8(a)-(c) show the residence times at each processor, each memory module, and each communication channel, respectively, for N = 64 with C varying as C ∈ {N, N/2, N/4}. Fig. 8(a) shows that the residence time at the processor decreases as C is decreased for fixed N. When C is decreased from N to N/2, the residence time at the processor decreases by 3% at J = 960. However, when C is decreased from N/2 to N/4, the residence time decreases by 28%. Fig. 8(b) shows that the memory residence time also decreases as C is decreased: by 23% when C is halved from C = N, and by 82% when C is halved again, at J = 960. Fig. 8(c) illustrates the magnitude of the increase in channel residence time as the number of channels is decreased. The channel residence time increases by 413% when C is decreased from N to N/2 and by 489% when C is decreased from N/2 to N/4 at J = 960. As described in the previous section, the ratio of the number of channels to the system size governs the performance of the system: for C ≥ N/2, memory is the limiting factor; for C < N/2, the communication network is the limiting factor.

Fig. 8(d)-(f) show the residence times at the service centers for C = 16 with N ∈ {16, 32, 64}. With C fixed, the network eventually becomes the factor limiting system performance as N is increased; for N = C, the memory is the bottleneck of the system. The processor residence time can be observed in Fig. 8(d). When N is increased from 16 to 32, the residence time decreases by 5%, and by 29% when N is increased from 32 to 64, at J = 480. From Fig. 8(e), the memory residence time decreases by 61% when N is increased from 16 to 32 and by 82% when N is increased from 32 to 64 at J = 480. Fig. 8(f) plots the channel residence time, which increases by 368% when N is increased by 100% from N = 16 and by 202% when N is further increased by 100%, at J = 480.
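The shape of these residence-time curves follows from the standard mean value analysis relations for closed product-form networks [43]; this is a textbook restatement consistent with the Section 3 model, not the paper's exact equations. With $J$ jobs, the per-visit residence time at center $k$ is the service time inflated by the queue an arriving job sees, and the queue length follows from Little's law:

$$R_k(J) = S_k\left(1 + Q_k(J-1)\right), \qquad Q_k(J) = X(J)\,V_k\,R_k(J).$$

The waiting component $S_k\,Q_k(J-1)$ is what grows without bound at the bottleneck center as $J$ increases, which is the behavior visible in Fig. 8(c) and Fig. 8(f).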
4.5 Utilization of Resources

The utilization of a service center gives the probability that the center is busy at any given time. The utilization of a center depends on the center throughput and the center service time, as shown in Section 3. Fig. 9 shows the utilization of processors, memory modules, and channels as a function of J for the two cases being considered: fixed N with C varying, and fixed C with N varying.

The impact of varying C as C ∈ {N, N/2, N/4} for N = 64 can be seen in Fig. 9(a)-(c). The graphs show that processor and memory utilization decrease whereas channel utilization increases as C is decreased. Fig. 9(a) plots processor utilization: when C is decreased from N to N/2, the processor utilization decreases from 47% to 46%, and to 25% when C is decreased to N/4, at J = 960. The memory utilization, plotted in Fig. 9(b), decreases from 93% to 91% when C is decreased from N to N/2, and to 50% when C is decreased to N/4, at J = 960. The impact on channel utilization can be observed in Fig. 9(c): it increases from 46% to 89% when C is halved from C = N, and to 98% for a further halving, at J = 960. The processor is not utilized to its maximum capacity. The memory is utilized to its maximum level for C = N and C = N/2. The channel achieves maximum utilization when C = N/4, since a reduction in the number of channels causes more jobs to be queued for channel service.

For fixed C = 16 and varying N ∈ {16, 32, 64}, the utilization of the available resources is plotted in Fig. 9(d)-(f). The processor utilization drops from 50% to 47% and then to 25% when N is increased from 16 to 32 and then to 64, at J = 480. The memory utilization decreases from 97% to 91% and then to 49% for the same variations in N. The channel utilization increases from 45% to 88% and then to 96%. With C fixed, increasing N results in an increased demand for the channel resource, which increases the channel utilization, as shown in Fig. 9(f). The memory is utilized to its maximum capacity for N = 16 and 32. From Fig. 9, it can be seen that optimum utilization of all the resources (processors, memory, and channels) is obtained when C = N/2; otherwise, either the memory or the channel reaches a maximum of only about 50% utilization.
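As a quick consistency check on these percentages, the Section 3 relation between utilization, throughput, and service time can be evaluated at the saturation throughput. The visit ratios below (1/N per processor and memory module, 2/C per channel) are the same assumption used in the earlier sketches.

```python
# Utilization check, U_k = X * V_k * S_k, for N = 64, C = N/2 = 32 at X = 16:
X, N, C = 16.0, 64, 32
U_proc = X * (1 / N) * 2   # = 0.50; Fig. 9(a) reports ~46-47%
U_mem  = X * (1 / N) * 4   # = 1.00; Fig. 9(b) reports ~91% (near saturation)
U_chan = X * (2 / C) * 1   # = 1.00; Fig. 9(c) reports ~89% (near saturation)
print(U_proc, U_mem, U_chan)
```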
4.6 Effect of Memory Service Time

Analyzing the system with varying service time at the memory module enables prediction of OIDSM performance as memory technology improves. Fig. 10 shows the effect of a faster memory on the transaction time and the system throughput. Fig. 10(a) shows that the transaction time is reduced by about 10% when the memory becomes 25% faster, and by about 11% when the memory becomes 50% faster. Fig. 10(b) shows that the system throughput increases by about 10% for a 25% increase in memory speed, and by only 1% for a further 33% increase. The graphs show the insensitivity of system performance to improvements in the memory service time. For the case considered here, with N = 64 and C = N/2, both the memory and the network limit the performance of the system. In cases where the memory is the sole limiting factor, a significant decrease in transaction time and increase in system throughput would result.
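The bound from Section 4.2 offers one reading of this insensitivity (under the same assumed visit ratios): at Sm = 4 the memory and channel bounds coincide, so speeding up the memory removes one bottleneck but leaves the channel ceiling in place.

```python
# Saturation bounds for N = 64, C = 32 as the memory service time drops:
N, C, Sp, Sc = 64, 32, 2, 1
for Sm in (4, 3, 2):
    print(Sm, min(N / Sp, N / Sm, C / (2 * Sc)))
# -> 16.0 in every case: the channel bound C/(2*Sc) already matches the
# memory bound at Sm = 4, so faster memory cannot raise the ceiling here.
```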
Figure 10: Effect of varying memory service time with hit ratio 0.98, Sp = 2, Sc = 1, N = 64, and C = N/2. (a) Transaction time and (b) system throughput.
4.7 Effect of Channel Service Time

Analyzing the distributed shared memory parallel computer with varying channel service time enables comparison of system performance under different interconnection techniques. Fig. 11 shows the variation of transaction time and system throughput as the service time of the channel increases. Fig. 11(a) shows that when the channel service time is increased by 100%, the transaction time increases by 86%, and by 276% when the channel service time is increased by 300%, at J = 960. Fig. 11(b) shows that the system throughput decreases by 44% and by 72% for a 100% and a 300% increase in the channel service time, respectively. For a service time four times faster, the throughput would increase by a factor of 8, and the transaction time would be only one fourth as large. The case of a large channel service time represents the situation with a slow network such as a bus.
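The same saturation bound (Section 4.2, same assumed visit ratios) tracks this sensitivity: once the channels dominate, doubling the channel service time halves the throughput ceiling.

```python
# Saturation bounds for N = 64, C = 32 as the channel service time grows:
N, C, Sp, Sm = 64, 32, 2, 4
for Sc in (1, 2, 4):
    print(Sc, min(N / Sp, N / Sm, C / (2 * Sc)))
# -> 16.0, 8.0, 4.0: each doubling of Sc halves the channel bound,
# consistent with the throughput decreases reported for Fig. 11(b).
```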
5 Conclusions

This paper introduced an optically interconnected distributed shared memory system. A closed queueing network model was used to provide a performance analysis. The performance was analyzed in terms of the response time of a memory request and the system throughput, for variations in the system size, the service times of the centers, the number of channels in the communication network, and the number of jobs circulating in the system. The study of the transaction time showed that there is no improvement in system performance for C ≥ N/2. The throughput analysis confirmed this and showed that the ratio N/C largely determines the system performance. The study of the queue lengths at the service centers gave insight into the performance of the centers, and also identified the bottleneck of the system, which depends on the ratio of the number of channels to the system size. The residence times at the service centers were analyzed in terms of the system size, the number of channels, and the number of jobs circulating in the system. The utilization of the resources was studied in relation to the same parameters. These metrics would be useful in determining system parameters such as C and N for optimum utilization. The effect of varying the number of communication channels, the memory service time, and the channel service time on the various performance metrics has also been studied.
Figure 11: Effect of varying channel service time with hit ratio 0.98, Sp = 2, Sm = 4, N = 64, and C = N/2. (a) Transaction time and (b) system throughput.
6 Acknowledgements

The authors wish to thank Michael Carrato for assisting with the simulation experiments of the OIDSM system.
References

[1] D. R. Cheriton, "Problem-oriented shared memory: A decentralized approach to distributed systems design," in Proc. 6th Dist. Comput. Sys., pp. 190–197, 1986.
[2] R. Bisiani and M. Ravishankar, "Plus: A distributed shared-memory system," in Proc. 17th International Symposium on Computer Architecture, pp. 115–124, May 1990.
[3] R. Bisiani, A. Nowatzyk, and M. Ravishankar, "Coherent shared memory on a distributed memory machine," in International Conference on Parallel Processing, pp. I-133–I-141, 1989.
[4] B. D. Fleisch and G. J. Popek, "Mirage: A coherent distributed shared memory design," Operating Systems Review, vol. 23, pp. 211–223, Dec. 1989.
[5] K. Li, "Ivy: A shared virtual memory system for parallel computing," in International Conference on Parallel Processing, pp. 94–101, 1988.
[6] K. Li and P. Hudak, "Memory coherence in shared virtual memory systems," ACM Trans. Computer Systems, vol. 7, pp. 321–359, Nov. 1989.
[7] K. Li and R. Schaefer, "A hypercube shared virtual memory system," in International Conference on Parallel Processing, pp. I-125–I-132, 1989.
[8] R. G. Minnich and D. J. Farber, "Reducing host load, network load, and latency in a distributed shared memory," in 10th International Conf. Distributed Computer Systems, pp. 468–475, May 1990.
[9] M. Tam, J. M. Smith, and D. J. Farber, "A taxonomy-based comparison of several distributed shared memory systems," Operating Systems Review, vol. 24, pp. 40–67, July 1990.
[10] C. Scheurich and M. Dubois, "Dynamic page migration in multiprocessors with distributed global memory," IEEE Trans. Computers, vol. 38, pp. 1154–1163, Aug. 1989.
[11] H. E. Mizrahi, J.-L. Baer, E. D. Lazowska, and J. Zahorjan, "Extending the memory hierarchy into multiprocessor interconnection networks: A performance analysis," in International Conference on Parallel Processing, pp. I-41–I-50, 1989.
[12] B. J. Smith, "Architecture and applications of the HEP multiprocessor computer system," Real Time Signal Processing IV, vol. 298, Aug. 1981.
[13] R. Alverson, D. Callahan, D. Cummings, B. Koblenz, A. Porterfield, and B. Smith, "The Tera computer system," in International Conference on Supercomputing, pp. 1–6, June 1990.
[14] M. Fine and F. A. Tobagi, "Demand assignment multiple access schemes in broadcast bus local area networks," IEEE Trans. Computers, vol. C-33, pp. 1130–1159, Dec. 1984.
[15] M. M. Nassehi, F. A. Tobagi, and M. E. Marhic, "Fiber optic configurations for local area networks," IEEE Journal on Selected Areas of Communications, vol. SAC-3, pp. 941–949, Nov. 1985.
[16] P. W. Dowd, "Optical bus and star-coupled parallel interconnection," in Proc. 4th International Parallel Processing Symposium, (Los Angeles, CA), pp. 824–838, Apr. 1990.
[17] P. W. Dowd, "Optical interconnections for computer communications," Tech. Rep. TR01.A961, IBM Corporation, Apr. 1989.
[18] P. W. Dowd, "Wavelength division multiple access channel hypercube processor interconnection," IEEE Trans. Computers, (accepted for publication), 1991.
[19] P. W. Dowd, "High performance interprocessor communication through optical wavelength division multiple access channels," in Proc. 18th International Symposium on Computer Architecture, (Toronto, Canada), pp. 96–105, May 1991.
[20] P. W. Dowd, "Random access protocols for high speed interprocessor communication based on a passive star topology," IEEE Journal on Lightwave Technology, vol. 9, pp. 799–808, June 1991.
[21] H. S. Hinton, "Switching to photonics," IEEE Spectrum, pp. 42–45, Feb. 1992.
[22] T. P. Lee and C. E. Zah, "Wavelength-tunable and single frequency semiconductor lasers for photonic communications networks," IEEE Communications Magazine, pp. 42–52, Oct. 1989.
[23] H. Kobrinski and K.-W. Cheung, "Wavelength-tunable optical filters: Applications and technology," IEEE Communications Magazine, pp. 53–63, Oct. 1989.
[24] C. A. Brackett, "Dense wavelength division multiplexing networks: Principles and applications," IEEE Journal on Selected Areas of Communications, vol. 8, pp. 948–964, Aug. 1990.
[25] H. Kobrinski, M. Vecchi, M. Goodman, E. Goldstein, T. Chapuran, J. Cooper, M. Tur, C. Zah, and S. Menocal, "Fast wavelength switching of laser transmitters and amplifiers," IEEE Journal on Selected Areas of Communications, vol. 8, pp. 1190–1202, Aug. 1990.
[26] M. M. Girard, C. R. Husbands, and R. Antoszewska, "Dynamically reconfigurable optical interconnect architecture for parallel multiprocessor systems," in SPIE International Symposium Optical Applied Science and Engineering, (San Diego, CA), July 1991.
[27] K. W. Cheung, D. A. Smith, J. E. Baran, and B. L. Heffner, "Multiple channel operation of integrated acousto-optic tunable filter," Electronics Letters, vol. 25, pp. 375–376, Mar. 1989.
[28] K. W. Cheung, "Acoustooptic tunable filters in narrowband WDM networks: System issues and network applications," IEEE Journal on Selected Areas of Communications, vol. 8, pp. 1015–1025, Aug. 1990.
[29] E. Arthurs, M. Goodman, H. Kobrinski, and M. Vecchi, "Hypass: An optoelectronic hybrid packet switching system," IEEE Journal on Selected Areas of Communications, vol. 6, pp. 1500–1510, Dec. 1988.
[30] M. S. Goodman, H. Kobrinski, M. Vecchi, R. Bulley, and J. Gimlett, "The lambdanet multiwavelength network: Architecture, applications, and demonstrations," IEEE Journal on Selected Areas of Communications, vol. 8, pp. 995–1004, Aug. 1990.
[31] A. S. Acampora, "A multichannel multihop local lightwave network," in Proc. GLOBECOM'87, (Tokyo, Japan), pp. 37.5.1–37.5.9, Nov. 1987.
[32] J. A. Bannister, L. Fratta, and M. Gerla, "Topological design of the wavelength-division optical network," in Proc. IEEE INFOCOM, pp. 1005–1013, 1990.
[33] I. Chlamtac, A. Ganz, and G. Karmi, "Circuit switching in multi-hop lightwave networks," in Proc. ACM SIGCOMM'88 Symposium, (Stanford, California), pp. 188–199, Aug. 1988.
[34] M. Eisenberg and N. Mehravari, "Performance of the multichannel multihop lightwave network under nonuniform traffic," IEEE Journal on Selected Areas of Communications, vol. 6, pp. 1063–1078, Aug. 1988.
[35] M. G. Hluchyj and M. J. Karol, "ShuffleNet: An application of generalized perfect shuffles to multihop lightwave networks," IEEE Transactions on Communications, vol. COM-37, 1990.
[36] J.-F. P. Labourdette and A. S. Acampora, "Logically rearrangeable multihop lightwave networks," IEEE Transactions on Communications, vol. 39, pp. 1223–1230, Aug. 1991.
[37] A. S. Acampora, M. J. Karol, and M. G. Hluchyj, "Multihop lightwave networks: A new approach to achieve terabit capabilities," in Proc. ICC'88, vol. 1, pp. 1478–1484, 1988.
[38] I. M. I. Habbab, M. Kavehrad, and C. E. W. Sundberg, "Protocols for very high-speed optical fiber local area networks using a passive star topology," IEEE Journal on Lightwave Technology, vol. LT-5, pp. 1782–1793, Dec. 1987.
[39] N. Mehravari, "Performance and protocol improvement for very high speed optical fiber local area networks using a passive star topology," IEEE Journal on Lightwave Technology, vol. 8, pp. 520–530, Apr. 1990.
[40] A. Ganz, "End-to-end protocols for WDM star networks," in IFIP/WG6.1-WG6.4 Workshop on Protocols for High-Speed Networks, (Zurich, Switzerland), May 1989.
[41] Intel Corporation, i860 64-Bit Microprocessor Hardware Reference Manual, 1990.
[42] K. Bogineni and P. W. Dowd, "Performance analysis of two address allocation schemes for an optically interconnected distributed shared memory system," in Sixth International Parallel Processing Symposium, (Beverly Hills, CA), Mar. 1992.
[43] E. D. Lazowska, J. Zahorjan, G. S. Graham, and K. C. Sevcik, Quantitative System Performance: Computer System Analysis Using Queueing Network Models. Prentice-Hall, 1984.
[44] A. M. Law and W. D. Kelton, Simulation Modeling and Analysis. McGraw Hill, 1991.
[45] M. MacDougall, Simulating Computer Systems. The MIT Press, 1987.