LIGHTNING Network and Systems Architecture

P. DOWD,1 J. PERREAULT,1 J. CHU,1 D. CROUSE,1 D. HOFFMEISTER,1 R. MINNICH,2 D. BURNS,2 F. HADY,3 Y.-J. CHEN,4 M. DAGENAIS,5 D. STONE6

1 Department of Electrical and Computer Engineering, State University of New York at Buffalo; 2 David Sarnoff Research Center, Princeton, NJ; 3 Supercomputing Research Center, Bowie, MD; 4 Department of Electrical Engineering, University of Maryland at Baltimore County; 5 Department of Electrical Engineering, University of Maryland at College Park; 6 Laboratory for Physical Sciences, College Park, MD

Abstract. LIGHTNING is a dynamically reconfigurable WDM network testbed project for supercomputer interconnection. This paper describes a hierarchical WDM-based optical network testbed that is being constructed to interconnect a large number of supercomputers and create a distributed shared memory environment. The objective of the hierarchical architecture is to achieve scalability while avoiding the requirement of multiple wavelength-tunable devices per node. Furthermore, single-hop all-optical communication is achieved: a packet remains in optical form from source to destination and does not require intermediate routing. The wavelength-multiplexed hierarchical structure features wavelength channel re-use at each level, allowing scalability to very large system sizes. It partitions the traffic between different levels of the hierarchy without electronic intervention, using a combination of wavelength- and space-division multiplexing. A significant advantage of this approach is its ability to dynamically vary the bandwidth provided to different levels of the hierarchy. Each node in LIGHTNING receives traffic on n channels in an n-level hierarchy, one channel for each level. Each node monitors the traffic intensities on each channel and can detect any temporal or spatial shift in traffic balance. LIGHTNING can dynamically reconfigure to balance the traffic at each level by moving wavelengths associated with each level up or down depending on need. Bandwidth re-allocation is completely decentralized: any node can initiate it, achieving highly fault-tolerant system behavior. This paper describes the system architecture, network and memory interface, and the optical devices that have been developed in this project.

1 Introduction

Computer systems have developed rapidly over the past few years. Specifically, processor speeds continue to increase while I/O capability continues to lag. This is not a new trend, yet no general solution to this perpetual problem has emerged. This paper describes an approach to supporting high-performance I/O, both interprocessor communication and remote filesystem access, through a high-speed optical network designed specifically for this purpose. The objective is to develop a scalable technique for clustering: a strategy that is effective (in both performance and price/performance) for workstation-class processor interconnection as well as high-performance supercomputer system-level interconnection. The interconnection strategy needs to be flexible enough to adapt to the severe cost constraints at the low end and the performance requirements at the high end. This paper defines the general architecture, and then defines more specifically an experimental testbed currently under construction, known as "LIGHTNING", that is targeted at supercomputer interconnection. This is a joint project involving the State University of New York at Buffalo (Buffalo, NY), University of Maryland (College Park and Baltimore County, MD), Laboratory for Physical Sciences (College Park, MD), Supercomputing Research Center (Bowie, MD), and David Sarnoff Research Center (Princeton, NJ). The optical network is hierarchical and based on wavelength-division multiple access (WDMA). WDM creates multiple channels, separated in wavelength, that are carried concurrently on a single fiber. The channels can be individually accessed, routed, and switched. The architecture can be viewed as a tree, where processors reside at the leaves

This work was supported by the U.S. Department of Defense, Laboratory for Physical Sciences, College Park, Maryland. Preprints are available via WWW at http://piranha.eng.buffalo.edu/.


and the internal nodes have wavelength- and spatial-switching capabilities. This class of architecture for processor interconnection was first described in [1]. Section 2 briefly describes the structure but mostly focuses on the new developments of the architecture. An advantage of this architecture is that bandwidth can be dynamically re-allocated throughout the system to adapt to shifts in traffic patterns. As described in Section 4, the bandwidth assigned to different levels of the hierarchical system can be dynamically increased or decreased based on the reference patterns and file activity of the system. In contrast to traditional reconfigurable architectures, however, the reconfiguration of the system is automatic and hidden from the user. The user does not have to logically map an application onto a specific topology or inform the operating system at compile or run time of its intended communication patterns. LIGHTNING senses the traffic patterns inherent to an application and adapts itself accordingly. This function is hidden from the user and the operating system, and avoids placing the burden of understanding the specifics of the system on the programmer. The programmer views the system as a pool of processors and is not involved with process placement. The operating system, during initial process placement and during migration, favors placing a process within a level-1 or level-2 cluster depending on size. Consider the Fat-Tree [2], the basis of the Thinking Machines CM-5 [3, 4]. This is a metal-interconnected hierarchical network with VLSI routing elements at each internal node. A typical problem with tree-type networks is congestion near the root. When traffic exhibits a strong degree of locality, communication is contained within the hierarchy and root congestion is reduced. However, severe root congestion occurs when traffic has a more general pattern, even for a short period of time. The Fat-Tree's solution to this problem is to double the bandwidth toward the parent node at each level closer to the root through multiple metal interconnects, creating multiple spatial channels. This design protects the system from the worst-case scenario (general traffic patterns) but does not optimize for the nominal case. Locality cannot be exploited, beyond the spatial separation the basic tree structure provides, since the communication resources are statically allocated. Spatial and temporal locality commonly exist, both because of inter-process communication patterns and because the operating system has control over process placement. LIGHTNING leverages this inherent and enhanced communication locality to improve performance and achieve better resource utilization. LIGHTNING provides bandwidth to the upper levels of the system to avoid root congestion when traffic patterns are non-local, but can exploit even short-term locality through its capability of pushing bandwidth down from the root closer to the leaf nodes (processors). The additional constraint placed on LIGHTNING is that the reconfiguration must be hidden from the programmer. LIGHTNING accomplishes this dynamic reconfiguration capability by monitoring the traffic intensities on channels at

each level of the hierarchy. Section 2 defines the node architecture and describes how this is accomplished with minimal increase in hardware complexity. When an imbalance in traffic intensities is detected between two levels of the hierarchy, the system reconfigures to balance the traffic by pushing channels either up or down the hierarchy. Note that the reconfiguration is completely decentralized for fault-tolerance reasons: there is no single watchdog monitor, and all nodes share the responsibility of monitoring and initiating reconfiguration. Since the hardware required on the network interface card to support this function is minimal, this approach is very cost-effective and achieves faster detection and reconfiguration than a centralized approach. The goal of the testbed currently being developed is to support a (weakly coherent) distributed shared memory environment among the interconnected processors. This is accomplished through a combination of the network architecture, operating system, and network/memory interface design. The goal is not just to provide a high-bandwidth network, but to deliver that bandwidth to the application rather than wasting it in operating system overhead. Building wavelength-tunable devices is not easy. A major objective of the architecture was to relax the constraints placed on the optical devices, since strong constraints usually translate into high cost. In particular, we wanted to reduce the number of required WDM channels. An objective of LIGHTNING was to devise an architecture that could effectively use as many channels as could be provided, while requiring no more than a single wavelength. In the single-wavelength case, spatial separation is used with time multiplexing. With multiple wavelengths, the increase in performance is magnified through wavelength re-use. The network architecture uses wavelength-, space-, and time-division multiplexing to meet the communication requirements of the system. In particular, wavelength re-use is achieved at each level


of the hierarchy to both magnify the usefulness of the WDM channels and reduce the number of channels required. Alternative approaches to the processor interconnection problem have been suggested. A recent approach to supercomputer interconnection, with additional support for integrated services, has been proposed for the Los Angeles area [5]. Another approach employs a bit-parallel strategy in which the 16 bits of a 16-bit word are transmitted across 16 wavelengths, one bit per wavelength [6]. It is a deflection-routed network, and the receiving node is responsible for reassembling the words into the PDU and supporting ARQ. A single-wavelength approach has also been suggested: the Scalable Coherent Interface (SCI) [7, 8]. This IEEE standard is a multiple-ring, single-wavelength network. However, there are no inherent architectural limitations that would prevent SCI from incorporating WDM in the future. SCI is designed to provide a coherent address space among the interconnected nodes. This paper briefly describes the basic architecture in Section 2 and then expands it to include a further generalization of the clustering strategy. Section 2.2 describes how media access is supported, and Section 4 describes the support for reconfiguration in greater detail, including its interaction with the media access protocol.

2 Lightning Architecture

This section describes the architecture of the LIGHTNING network. First, background information is presented to provide a framework for discussion; the architecture is then defined and expanded by generalizing its clustering strategy. The objective of the architecture is to provide a consistent and scalable technique for high-performance clustering of processors that are geographically distributed and vary widely in performance and cost characteristics. In particular, two main classes of interconnection are of primary interest: supercomputer-class system interconnection and high-end workstation-class processor interconnection. A constraint on the architecture is that it must present a consistent view to the programmer and operating system for both classes of interconnection. As defined below, the network architecture is based on a hierarchical WDM optical network. However, in order to amortize the cost of the optical components, the architecture consists of an electronic-interconnected and an optical-interconnected portion (the terms 'metal-interconnected' and 'electronic-interconnected' are used interchangeably in this paper). This dual-technology strategy is essential on the leading edge of the optical component development cycle: it allows multiple electronically interconnected processors to be connected in an "all-optical" fashion via a single WDM optical transceiver set. Early in the development cycle, when the optical components are very expensive, multiple nodes are interconnected per optical transceiver set to amortize the cost. When the costs are reduced, the performance/cost balance point shifts and the system becomes more completely optically interconnected. The LIGHTNING architecture was influenced by some of the strengths and weaknesses of the Fat-Tree [2]. LIGHTNING retains the advantages of the Fat-Tree structure, a low-latency, scalable interconnection, while avoiding some of its limitations through optical interconnects. Optical networks have many advantages over metallic interconnections, such as a relaxed bandwidth-distance product, large fan-out, low power consumption, reduced crosstalk, and immunity to electromagnetic noise [9]. Major considerations in determining the best placement for optical connections are the speed mismatch between the optical and electrical components [10] and the performance/cost requirements. Optical fiber has a 30 THz bandwidth [11], much larger than metallic bandwidth. Simply replacing metal connections with optical ones would result in many expensive, under-utilized links.

2.1 Lightning structure

LIGHTNING has a hierarchical structure in the form of a general $m_i$-ary tree. The leaves of the tree are processors and the interior nodes are routing elements.


Figure 1: Wavelength partitioning device used to achieve spatial separation of wavelengths. (a) Functional behavior of Lambda Partitioner. Wavelengths above and below the partition point are cross-routed and bar-routed, respectively. Selection of the partition point is electronically controllable by the LIM interface. (b) Structure of lambda partitioner.

A combination of electronic routing elements and optical, wavelength-selective routing elements is used. The top levels of the hierarchy form an all-optical network, while the lower levels can be electronically interconnected. Note that while the lower portion is described as "electronic", it is designed to be low cost: the physical links could be coax, twisted pair, or low-cost multi-mode fiber. The cross-over point between electronic and optical interconnects can vary on different legs of the tree (see Figure 10). The system uses time-, space-, and wavelength-separation to increase the usable bandwidth in the system, although the electronically interconnected levels use only time- and space-separation. As mentioned above, the cross-over point between electronic and optical interconnects can vary within legs of the hierarchy. In this manner, the system can be constructed from a heterogeneous collection of different classes of processors, with more bandwidth supplied to the higher classes. A strength of the architecture is that the asymmetric electronic-optical hierarchy is invisible to the user. This is important since the performance and cost of the optical and electronic components are constantly shifting, so the architecture must be sufficiently robust to handle this evolution. In the same way that designing an optimal topology for a wide-area computer network is viewed as futile, since the network is always in a state of flux, a network to support cluster-based computing should also be expected to constantly change. Internal nodes in the electronically interconnected levels provide spatial switching, while the internal nodes in the optical levels provide wavelength-selective routing. The spatial and wavelength optical routing element is denoted a Lambda Partitioner ($\Lambda_x$). The structure of a Lambda Partitioner is shown in Figure 1. The purpose of the $\Lambda_x$ is to provide a wavelength routing capability in which specific wavelengths can be routed to the upper level while other wavelengths are retained at the lower level. This enables wavelength re-use at peer levels of the hierarchy and reduces the total number of required WDM channels. A mixed-radix system [12] is used to represent the node numbering. Let $M$ (the total number of processors) be a decimal integer represented as a product of $r$ factors:

$$M = \prod_{i=1}^{r} m_i \qquad (1)$$

The processor identifier $P$, $1 \le P \le M$, can be represented as an $r$-tuple

$$P = (p_r, p_{r-1}, \ldots, p_1) \qquad (2)$$

where $1 \le p_i \le m_i$ for all $1 \le i \le r$. The term $r$ denotes the number of hierarchical levels.


Figure 2: Lightning building blocks. (a) Receiver subsystem, where each node can concurrently receive on any two of the wavelengths in this example; (b) a single-level example showing that each node contains the receiver described in part (a) and a 4-wavelength laser array, connected via the wavelength partitioning component. Note that a star configuration is illustrated, but a bus-type configuration is possible depending on power budget considerations.

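To make the mixed-radix numbering of Eqs. (1)-(2) concrete, the short C sketch below converts a processor identifier into its $r$-tuple. It is illustrative only: it uses 0-based digits for simplicity (the paper numbers digits from 1), and the names and example values are assumptions.

```c
#include <stdio.h>

/* Illustrative sketch of the mixed-radix node numbering of Eqs. (1)-(2):
 * convert a processor identifier P (0-based here for simplicity) into its
 * r-tuple (p_r, ..., p_1). */
#define R 3                              /* number of hierarchical levels */

static void id_to_tuple(unsigned p, const unsigned m[R], unsigned tuple[R])
{
    /* tuple[0] holds p_1 (least significant digit), tuple[R-1] holds p_r */
    for (int i = 0; i < R; i++) {
        tuple[i] = p % m[i];
        p /= m[i];
    }
}

int main(void)
{
    const unsigned m[R] = { 4, 4, 4 };   /* m_1, m_2, m_3: a 4x4x4 tree   */
    unsigned tuple[R];

    id_to_tuple(27, m, tuple);           /* 27 -> (p_3, p_2, p_1)         */
    printf("(%u,%u,%u)\n", tuple[2], tuple[1], tuple[0]); /* prints (1,2,3) */
    return 0;
}
```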

An example of a single-level network is shown in Figure 2, and a 3-level example is shown in Figure 3. A bus-based system is drawn for clarity in Fig. 3, but the actual implementation is a star-coupled system. The function of the $\Lambda_x$ is to serve as a $2 \times 2$ switch for each wavelength channel, as illustrated in Fig. 1. Denote the number of wavelength channels as $C$ and the ordered set of channels as $\Lambda = \{\lambda_1, \lambda_2, \ldots, \lambda_C\}$. All the wavelengths below the $i$-level partition point are retained within the $i$-level cluster, while all wavelengths above the partition point are routed to the $(i+1)$-level cluster. If the $i$-level partition point is set at $j$, then wavelengths $\{\lambda_1, \lambda_2, \ldots, \lambda_j\}$ are retained within level $i$, and $\{\lambda_{j+1}, \lambda_{j+2}, \ldots, \lambda_C\}$ are routed to level $i+1$. The set of partition points in an $r$-level system is denoted $\{X_0, X_1, \ldots, X_r\}$. With the partition points staggered at each level such that $X_0 < X_1 < \cdots < X_{r-1} < X_r$, a separate set of wavelength channels is dedicated to communication at each level. Denote $C_i$ as the set of channels allocated for $i$-level communication, $|C_i| = X_i - X_{i-1}$, where $X_0 = 0$ and $X_r = C$. Denote the collection of processors beneath an $i$-level node as an $i$-level cluster. If a processor wants to send a message to another processor in the same level-0 cluster, it uses a channel in $C_0$. If the second processor is not in the same level-0 cluster but is in the same level-1 cluster, a channel in $C_1$ is used. The determination of which channel to transmit on is discussed in Section 2.4. Note that spatial re-use of the wavelength channels occurs in different clusters. A symmetric network would have all the $i$-level wavelength partitioners in the network set to the same partition point. Alternatively, $\Lambda_x$s in spatially separate clusters could partition the wavelengths differently, as long as $X_i < X_{i+1}$. In either case, the partition points can be dynamically changed, allowing the network to adapt to changes in traffic patterns. This dynamic bandwidth reconfiguration capability is discussed in more detail in Section 4. The total number of spatial/wavelength channels in the system, denoted $C$, can be calculated to determine the advantage of wavelength re-use. For example, consider a symmetric configuration where $M_j = \prod_{i=j+1}^{r} m_i$ denotes the number of $j$-level nodes in the hierarchy, and $C_j$ the number of channels allocated for $j$-level communication. Since there are a total of $M_j$ $j$-level nodes, there are $C_j M_j$ channels providing $j$-level communication. Therefore the total number of channels in an $r$-level system is

$$C = \sum_{j=1}^{r} C_j \prod_{k=j+1}^{r} m_k \qquad (3)$$

Consider a small system of $M = 4 \times 4$ with $C = 4$. In this example, the total number of effective channels increases to a maximum of $C = 13$ through wavelength re-use.
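As a worked check of Eq. (3), allocating the four wavelengths as $C_1 = 3$ and $C_2 = 1$ and re-using the level-1 channels in each of the four level-1 clusters gives $3 \cdot 4 + 1 \cdot 1 = 13$ effective channels. The C sketch below is illustrative; the function and variable names are assumptions.

```c
#include <stdio.h>

/* Eq. (3): total effective channels
 * C_total = sum_{j=1..r} C_j * prod_{k=j+1..r} m_k. */
static int effective_channels(int r, const int Cj[], const int m[])
{
    int total = 0;
    for (int j = 1; j <= r; j++) {
        int nodes = 1;                       /* M_j = prod_{k=j+1..r} m_k */
        for (int k = j + 1; k <= r; k++)
            nodes *= m[k];
        total += Cj[j] * nodes;
    }
    return total;
}

int main(void)
{
    /* index 0 unused so indices match the paper's 1-based notation */
    const int Cj[] = { 0, 3, 1 };            /* channels per level        */
    const int m[]  = { 0, 4, 4 };            /* the M = 4x4 example       */
    printf("C = %d\n", effective_channels(2, Cj, m));  /* prints C = 13   */
    return 0;
}
```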


Figure 3: Example of a 3-level Lightning system. A bus-based system is shown for clarity, but the implementation is closer to a star-coupled configuration.


2.2 Protocols

The goal of LIGHTNING is not just to provide a high-bandwidth network, but to deliver that bandwidth to the application rather than squandering the high-performance capability in multiple buffer copies and other operating-system-level overhead. Our approach is to achieve a tighter integration of the optical, electronic, and software technologies within the system. This section describes the protocol suite. First, the low-level media access protocol is defined in Section 2.2.1. The upper-layer protocol is defined in Section 2.3.1, along with the memory interface, and the zero-copy nature of the network/memory interface is described.

Figure 4: Assignment map for the two-level FatMAC protocol for $M = 4 \times 4$ and $C = (2, 2)$. Note: not drawn to scale; a block slot is actually more than two orders of magnitude larger than a control slot.

2.2.1 Network protocol

The wavelength-multiplexed environment requires a media access protocol to arbitrate access. With existing upper-layer protocols in mind, the FatMAC protocol has been designed to exploit the bimodal pattern of network traffic. There have been many studies of typical Internet traffic with TCP/IP, and a common trait is the bimodal distribution of packet size: essentially all of the traffic consists of either large packets or small packets, with little traffic in between [13, 14]. Batched applications, such as ftp, tend to use large data packets to get high throughput, while interactive applications, like telnet, tend to use small packets for low latency. This characteristic also holds for distributed shared memory computer systems. The traffic generated by the memory coherence protocol needed to achieve DSM has two major forms: memory consistency control packets (such as memory block requests, invalidations, and acknowledgments) and memory block packets [15]. Memory consistency control packets are very small, while memory blocks can be up to 8 Kbytes. In a non-broadcast scheme like MNFS, there can be multiple memory consistency control packets for every memory block that needs to be transferred. The FatMAC protocol has been designed to exploit this characteristic. In addition to the memory consistency control and memory block packet types, there is a third type of packet called the hot (or button) packet. These packets are 32 bytes long and have a stringent low-latency constraint; they can be handled as special control packets. Refer to [16] for a detailed description of single-level FatMAC. The objective of this section is to briefly describe FatMAC and then generalize it to support a multi-level network architecture. FatMAC combines reservation access with pre-allocated receivers. Typical reservation-style WDMA protocols use one of the channels as a dedicated control channel [17]. However, the number of available WDM channels is usually small, and there are usually many more nodes than channels. In such a situation, reserving one channel as the control channel is not very attractive. FatMAC is a hybrid approach that uses reservation access but without a reserved control channel. Furthermore, it was designed specifically for the situation where each node has a fast-tunable transmitter (based on a laser array) and a slowly tunable receiver.


The protocol reserves access on a pre-allocated channel through control packets. Transmission on each channel consists of two phases: the reservation control block, followed by the memory block transmission phase. Figure 4 shows the allocation map for a two-level FatMAC where each level has two channels. The control packets are broadcast (on the subset of channels assigned to the level of the target node) while the data blocks are transmitted on the home channel of the target node. Note that this enables a collisionless environment while significantly reducing the overall cycle time compared with TDMA, since data block slots are only assigned when they are needed. This significantly reduces latency: Figure 4 is not drawn to scale, and data blocks are typically more than two orders of magnitude larger than a control slot. The laser-array transmitter facilitates broadcast by simultaneously transmitting a control packet on all channels assigned to the target level of the system. This enables LIGHTNING to obtain the advantages of a reservation-based protocol without dedicating a WDM channel as a control channel, which is especially important in a multi-level architecture since one control channel would have to be assigned per level. A basic constraint on LIGHTNING is that it must not require a large number of WDM channels, so LIGHTNING has no dedicated control channels and uses the channels in control mode for only a small fraction of the cycle duration.


Figure 5: Action upon request.

The control packet has multiple functions. In addition to reserving access in the following data cycle, it has a small payload that is used to transport hot packets for the system applications. As defined above, hot packets have very low latency requirements and are useful to applications, the operating system, and the memory coherence protocol. A control packet may contain operating system information or an application hot packet without needing to reserve a data slot; this allows the average cycle length to be dramatically reduced. A fast broadcast facility is achieved through this multi-purpose control packet. The small payload increases the control packet's usefulness: it serves not only media access purposes but also transports Operating System (OS) level information along with hot packets. If a data block needs to be transmitted, a control packet is used to reserve a space in the block cycle. The control packet specifies the data packet's channel. The actual implementation of the media access protocol differs from the one originally designed. The concept of home channels has been eliminated due to the use of fast-tunable receivers (see Figure 2(a)). Data packets can now be sent on any channel available to that level, regardless of destination. A data packet causes the data cycle to be extended by a slot only when all channels in the previous slot are full or there is a destination conflict. The only channel conflict that can occur during a data slot is when two or more nodes target the same destination; high utilization is maintained by having the second node transmit in the next data slot. This is an improvement over the original scheme, in which such conflicts occurred every time two or more nodes targeted any group of nodes sharing the same home channel. The actual implementation shows improved performance, both in throughput, due to the increased utilization of the data cycle, and in latency, due to the reduction in the length of the data cycle relative to the original FatMAC data cycle.
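The revised data-slot allocation can be viewed as a greedy packing of reservations into slots, opening a new slot only when the current slot is full or a destination conflict arises. The C sketch below is illustrative only; the slot counts, node IDs, and names are assumptions, not the card's actual logic.

```c
#include <stdio.h>

/* Greedy data-cycle packing: reservations fill slots of CHANNELS
 * wavelengths; a reservation moves to the next slot only when every
 * channel in the current slot is taken or the destination is already
 * targeted there. */
#define CHANNELS  2
#define MAX_SLOTS 8

int main(void)
{
    const int dest[] = { 5, 7, 5, 9, 7 };   /* reservation destinations  */
    int used[MAX_SLOTS] = { 0 };            /* channels used per slot    */
    int tgt[MAX_SLOTS][CHANNELS];           /* destinations per slot     */

    for (int i = 0; i < 5; i++) {
        for (int s = 0; s < MAX_SLOTS; s++) {
            int conflict = 0;
            for (int c = 0; c < used[s]; c++)
                if (tgt[s][c] == dest[i]) conflict = 1;
            if (used[s] < CHANNELS && !conflict) {
                tgt[s][used[s]++] = dest[i];
                printf("block for node %d -> slot %d\n", dest[i], s);
                break;                      /* placed; next reservation  */
            }
            /* else: slot full or destination conflict; try next slot   */
        }
    }
    return 0;
}
```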

2.3 Packet Format

There are two main packet types supported by the system in the multiprocessor environment. Both share a common header; the formats, with field widths in bits, are:

Field                    (a) Control   (b) Data block
Implementation Specific       40            40
SENDER ID                     16            16
DESTINATION ID                16            16
PKT TYPE                       8             8
CHAN ID                        4             4
RSVD                          20            20
PAYLOAD                      256         72928
CRC                           64            64

Figure 6: Control and data packet formats for FatMAC: (a) Control packet, (b) Data block packet.

Control Packets: transmitted during the reservation phase of the FatMAC protocol, these are 53 bytes in length and can be broadcast to all nodes with low latency. In addition to MAC-layer signaling, the payload contained within this packet can be injected directly into memory or placed into an Operating System (OS) buffer. Control packets include any type of packet that is sent and received by the OS with little or no card involvement; the glue logic just has to set up the data path to the appropriate FIFO. The packet types are:

- memory request packets
- memory acknowledgment packets
- traffic statistic packets
- system reconfiguration packets
- synchronization packets
- hot (button) packets

Request and acknowledgment packets are used by the memory consistency protocol. Traffic and reconfiguration packets are used for bandwidth allocation and are described in Section 4. Buttons are described below. A typical transaction is shown in Figure 5. This figure provides a simplified view of a memory request and illustrates how a response packet, which is about 200 times the size of a control packet, can be inserted directly into the memory space of an application. The requesting node sends its request as a control packet. In order for the card to place memory blocks directly into Network Memory upon reception, a tag has to be provided on the packet to determine where the response packet belongs. The tag on the outgoing packet points to an entry in a table maintained on the card; the table entry includes the physical address where the memory page is to be stored when it arrives. (This is a simplification of the actual implementation, and the additional security aspects are not discussed here.) The headers of a control packet and a data packet, shown in Fig. 6, are identical, which simplifies packet processing in the receiver state machine. Note that the Chan ID field is unused in a control packet, since a control packet is broadcast on all channels assigned to the intended level. The first 40 bits of the packet are specific to this particular implementation.
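The tag table can be sketched as follows. This is a simplified model in the spirit of the description above, explicitly not the actual implementation; all names and sizes are hypothetical.

```c
#include <stdint.h>

/* Hypothetical request-tag table kept on the card: the tag carried in an
 * outgoing control packet indexes this table; on arrival of the 8 KB
 * response the card injects it at the recorded physical address. */
#define TABLE_SIZE 256

struct pending_request {
    int       valid;
    uintptr_t phys_addr;             /* where the memory page must land  */
};

static struct pending_request table[TABLE_SIZE];

/* called when a request control packet is built; returns the tag */
unsigned register_request(uintptr_t phys_addr)
{
    static unsigned next;
    unsigned tag = next++ % TABLE_SIZE;
    table[tag].valid = 1;
    table[tag].phys_addr = phys_addr;
    return tag;
}

/* called when the data block packet arrives carrying `tag` */
uintptr_t injection_address(unsigned tag)
{
    if (tag < TABLE_SIZE && table[tag].valid) {
        table[tag].valid = 0;
        return table[tag].phys_addr; /* DMA the 8 KB block here          */
    }
    return 0;                        /* unknown tag: drop the packet     */
}
```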


Applications have access to very low latency communication through Hot Buttons. Applications often need to exchange small amounts of data; examples are synchronization variables such as semaphores or barriers. The control packet payload is used to support small data structures, 32 bytes in size, residing somewhere in network memory. It is designed to allow fast, non-coherent access to small pieces of memory, and corresponds to the short pages used in MNFS. Hot packets are supported through the payload of a control packet as shown above.

Memory Block Packets: approximately 8 Kbytes in length, these are not broadcast but are transmitted on a single wavelength channel and injected directly into the memory of the receiving processor. They are sent during the FatMAC data cycle. In order to send a data packet, two headers have to be written into the "From OS" FIFO: the first for the reservation cycle and the second as the header of the data packet itself. The card sends each data packet in the time slot the card reserves for it on the network. Each reservation is broadcast, and the card must keep track of every reservation made during the control cycle; it needs this to determine the length of the data cycle so that it knows when the control cycle begins again. This is required at each level of the multi-level architecture. Each receiver can hear the data packets sent on its channel but only stores the packets actually destined for it: it checks the header on each packet and drops those addressed elsewhere. Since data packets are large, this minimal packet processing is not a bottleneck. When a packet is accepted, as Figure 5 shows, it can be injected straight into memory because the card can determine, via table look-up, exactly where it should go.
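Receiver-side filtering and injection might look like the following sketch, which reuses the hypothetical injection_address() from the table sketch above; the field names and sizes are assumptions.

```c
#include <stdint.h>
#include <string.h>

#define MY_NODE_ID 42
#define BLOCK_SIZE 8192

/* Assumed on-the-wire shape of a received data packet (simplified). */
struct data_packet {
    uint16_t sender_id;
    uint16_t dest_id;
    uint32_t tag;                      /* indexes the card's tag table   */
    uint8_t  block[BLOCK_SIZE];
};

extern uintptr_t injection_address(unsigned tag); /* from sketch above   */

void on_data_packet(const struct data_packet *pkt)
{
    if (pkt->dest_id != MY_NODE_ID)
        return;                        /* not for us: drop the packet    */

    uintptr_t dst = injection_address(pkt->tag);
    if (dst)                           /* inject straight into memory    */
        memcpy((void *)dst, pkt->block, BLOCK_SIZE);
}
```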

2.3.1 Upper Layer Protocols

High performance is achieved by injecting data straight from the network into system memory, an idea that has been suggested before [18, 19]. LIGHTNING depends on fast application network access, and this approach avoids the multiple buffer copies that a network interface and operating system typically require.


Figure 7: Computer system bus structure organization.

This avoids the bottleneck of data copying over the system I/O bus [20] and allows the system to deliver truly usable high-bandwidth communication to an application in the form of low-latency interprocess communication. A weakly coherent distributed memory system with a very fast I/O subsystem is created, since the same network can be used in an integrated fashion.



Figure 8: Computer system organization and structure of memory interface.

A key to a fast network is to reduce, or if possible eliminate, the copying needed to go from application to application [18, 21, 20]. For example, a conventional protocol stack accesses the memory system five times per word sent over the network [18], so the processing time becomes a real bottleneck in a fast network. In the spirit of a card like Afterburner [22], LIGHTNING minimizes the copying of data through both hardware and software techniques. Unlike Afterburner, LIGHTNING is placed on the memory bus and has the potential to be a zero-copy network interface. Once data is in memory, it is essentially available for I/O; there is no need for even a single copy to I/O space. The structure and overall organization are shown in Fig. 7, and the organization of the memory interface is shown in Figure 8. In addition, it is not even necessary for an application to go through the operating system in order to do network I/O: the application, by the nature of the architecture, already has access to the network "buffers" and could make its requests to the card directly. The simplified interface also makes it possible to make the operating system interface fast, meaning fast TCP and NFS access. This architecture is being designed with current protocols in mind. It is being developed concurrently with MNFS [23], a coherent version of NFS with support for distributed shared memory. Other methods of implementing distributed shared memory have been explored [15]; this method, however, takes advantage of the virtual page mapping capabilities of UNIX as well as NFS to share memory pages ostensibly located at different nodes in a network. MNFS unites memory access with file access, allowing the application fast access to remote memory as well as remote files. MNFS is compatible with NFS, ensuring a solid base of existing applications that can take advantage of LIGHTNING's fast I/O to achieve fast file I/O. MNFS supports a weakly coherent model of files and memory. It has two fixed page sizes: access to the long page is coherent, while short-page access is non-coherent. This gives coherent access to data while allowing faster access to small pieces of data, such as messages a process on one node might send to a process on another node, or, by taking advantage of the media access protocol described in Section 2.2.1, a form of Eager Sharing [24] with in-place update of variables. By giving the network interface a dedicated port to main memory, high throughput is achieved without significantly disrupting the host's processing. The separation of channel initialization from message sends allows a user or OS

process to initiate a send with a single write, providing low-latency operation. This is a zero-copy interface. It is a multiuser interface that resides in the memory space of the computer system, as opposed to I/O space. Applications have direct access to the pages of memory used for network I/O; thus applications can send and receive packets with no operating system support. An application can initiate packet transmission with one memory write, and it can determine whether a packet has arrived by checking a per-circuit control and status word. At the same time, the operating system can use MINI for its own communications and for supporting standard networking software such as TCP/IP and NFS. The structure of the memory map is key to allowing a user process to send and receive data without OS involvement in a multiuser environment. At channel setup time, six pages are allocated to a channel: two pages in Main Memory space to hold receive data, two more to hold send data, one page within the VC/CRC Control Table space, allocated user-read-only and initialized by the OS at setup time, containing the send and receive control words and a send and receive CRC, and a final page that allows a user process to write to the High or Low Priority FIFO. Multiuser protection for the FIFOs is achieved by using the address bus to indicate the channel number on which the send is to take place. This unique configuration of the memory interface, achieving a zero-copy interface, enhances the capability of the system-level software: for example, TCP and NFS can be supported natively with a single buffer copy. The system software aspect of this project is described in [25].
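The six-page channel layout might be modeled as below. This is a sketch under assumed names and a 4 KB page size, not the interface's actual register map.

```c
#include <stdint.h>

#define PAGE_SIZE 4096               /* assumed page size                */

struct vc_control_page {             /* user-read-only, OS-initialized   */
    volatile uint32_t send_ctl;      /* send control and status word     */
    volatile uint32_t recv_ctl;      /* receive control and status word  */
    volatile uint32_t send_crc;
    volatile uint32_t recv_crc;
};

struct lightning_channel {
    uint8_t  recv_data[2][PAGE_SIZE];  /* receive buffers in main memory */
    uint8_t  send_data[2][PAGE_SIZE];  /* send buffers                   */
    struct vc_control_page *ctl;       /* page in VC/CRC table space     */
    volatile uint32_t *fifo;           /* write-only FIFO page; the page */
};                                     /* address encodes the channel    */

/* a send is initiated with a single write to the FIFO page */
static inline void channel_send(struct lightning_channel *ch,
                                uint32_t descriptor)
{
    *ch->fifo = descriptor;            /* one memory write, no OS call   */
}
```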

2.3.2 Synchronization

This section briefly describes the frame and slot synchronization strategies used in the prototype testbed. Due to space constraints, the performance analysis of the schemes, along with the implementation details, has been omitted from this paper but may be found in [26].

Network Synchronization A critical component of implementing the protocol is the network (frame) synchronization mechanism. This must account for propagation delay in the fiber, packet processing latency, and lambda partitioner latency. Packet processing is defined as the interval between packet arrival and the extraction of decoding and cycle-length information. Lambda partitioner latency is the time required for the packet to propagate through the various levels of the hierarchy before reaching the destination node. In a two-level system, a packet passes through one partitioner to reach nodes within the local cluster and through two partitioners to reach nodes outside the local cluster. Each cycle consists of a control phase and a data phase. In an ideal network with zero propagation delay, the data phase immediately follows the control phase. There is a processing latency following the control phase to allow all nodes to compute their offsets and the cycle length. In a network with large propagation delays, the control packets reach nodes after a significant amount of time. Rather than allow the network to idle in the meantime, the cycles can be staggered so that the data phase for each cycle does not immediately follow its corresponding control phase. Two network synchronization strategies have been examined:

Data Cycle Stall: for systems with small propagation delay and processing latency, the data cycle can immediately follow the control phase. This is simple to implement, but large processing and propagation latencies can cause nodes to remain idle for a significant amount of time.

In-flight Reservation: designed to overcome large propagation delay and processing latency; a pipelining approach in which the control phase reserves access for a later data cycle. The minimum time between the initiation of a control cycle and the start of its data cycle is given by $M + 2\tau + \Delta$, where $M$ is the number of nodes, $\tau$ is the maximum node-to-partitioner propagation delay, and $\Delta$ is the packet processing time.


Slot Synchronization The previous section described two possible choices for beginning the data cycle corresponding to a control cycle. The next issue in protocol implementation is the slot synchronization mechanism, which defines when each node accesses the network during the control phase. The objective is to develop a distributed synchronization algorithm with minimal central control. Several synchronization techniques have been studied for satellite systems using TDMA [27-29]. Two ways in which synchronization can be implemented are lock-step and distributed clock:

Lock-step mechanism: each node transmits after the transmission from its predecessor is complete. This is simpler to implement, but it is not suitable for large propagation delays, and the cycle length depends on node ordering.

Distributed Clock: each node that joins the system maintains a clock synchronized to the distributed clock. The local clock is set using the clock packet time and the propagation delays of the clock node and the receiving node, from which the node can calculate the time at which it should begin broadcasting its control packet. Sufficient preamble and post-amble must be provided in the transmitted packet to ensure collisionless transmission, and the clock node can change if the current clock node fails.
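As an illustration of the distributed-clock computation (all values and names here are hypothetical), a joining node can set its clock from a clock packet and derive its control-slot start time:

```c
#include <stdio.h>

/* Hypothetical distributed-clock bookkeeping: set the local clock from
 * the clock packet timestamp plus the two propagation delays, then place
 * this node's control slot within the next cycle. */
int main(void)
{
    double t_pkt       = 1000.0; /* timestamp in the clock packet        */
    double d_clocknode = 2.5;    /* clock node -> partitioner delay      */
    double d_self      = 1.5;    /* partitioner -> this node delay       */
    double slot_len    = 4.0;    /* control slot incl. pre/post-amble    */
    int    my_index    = 3;      /* this node's position in the cycle    */

    /* local time now, as seen when the clock packet arrives */
    double local_clock = t_pkt + d_clocknode + d_self;

    /* start of this node's control slot in the next cycle */
    double next_cycle = 1100.0;
    double my_slot    = next_cycle + my_index * slot_len;

    printf("clock=%.1f  transmit at %.1f\n", local_clock, my_slot);
    return 0;
}
```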

2.4 Support of Asymmetric Networks and a Consistent Name Space

It was mentioned above that physically asymmetric networks are permitted. In this context, an asymmetric network has an optical-electronic crossover point that varies from leg to leg of the tree, and a number of intermediate levels that is not constant across legs. It is important that a node's view of the network not change regardless of how or where it is connected to the network. In order to support a constant view, a physically asymmetric network must be mapped into a more symmetrical logical tree. The properties required of the logical tree are that 1. the crossover point should be at the same level throughout the network, and 2. the leaf nodes (processors) should all be at the same level. The purpose of these conditions is to enable a constant hierarchical naming scheme. With such a scheme, the home channel of a node can be determined directly from the address of the node. This is a two-stage process: 1. Determine the minimum level of communication required to reach the destination $Z$ through a digit-by-digit comparison of the source $A$ and destination address $r$-tuples: $i$-level communication is required if $z_i \neq a_i$ and $z_j = a_j$ for all $k < i < j \le r$, where $k$ is the level of the optical-electronic crossover point. 2. Determine the home channel of the destination processor $Z$ at that level. The $i$-level home channel of $Z$ is determined by the allocation policy, which in this case is interleaved, and is given by

$$H_{Z,i} = X_{i-1} + \big( Z \bmod (X_i - X_{i-1}) \big) \qquad (4)$$
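The two-stage lookup can be sketched as follows, using 0-based digits and channel indices for brevity; the names and example values are assumptions.

```c
#include <stdio.h>

/* Stage 1: find the minimum communication level by comparing the address
 * r-tuples top-down. Stage 2: apply the interleaved allocation of Eq. (4),
 * H_{Z,i} = X_{i-1} + (Z mod (X_i - X_{i-1})). */
#define R 3

static int comm_level(const int a[R], const int z[R], int k)
{
    /* digits stored most significant first: a[0] = a_r, ..., a[R-1] = a_1 */
    for (int lvl = R; lvl > k; lvl--)
        if (a[R - lvl] != z[R - lvl])
            return lvl;              /* highest differing digit           */
    return k;                        /* reachable below the crossover     */
}

static int home_channel(int Z, int i, const int X[R + 1])
{
    return X[i - 1] + Z % (X[i] - X[i - 1]);      /* Eq. (4), 0-based    */
}

int main(void)
{
    const int X[R + 1] = { 0, 2, 3, 4 };          /* partition points    */
    const int A[R] = { 1, 0, 2 }, Zt[R] = { 1, 2, 3 };
    int Z = 1 * 16 + 2 * 4 + 3;                   /* destination id      */

    int i = comm_level(A, Zt, 0);                 /* yields level 2      */
    printf("level %d, home channel lambda_%d\n",
           i, home_channel(Z, i, X) + 1);
    return 0;
}
```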

The only state information that is needed is the partition points of the $\Lambda_x$s on the path between the node and the root of the tree, and the optical-electronic crossover point. The above conditions also guarantee that only one receiver per level is required at the crossover node. Without the remapping, it would be possible for two nodes beneath an electronic router to be targeted at the same time on different channels: two nodes beneath the router would be assigned different home channels if more than one channel is allocated for communication at that level. Since the electronic router receives the messages sent to these nodes, it would have to receive messages simultaneously on the different channels, and it is feasible that the router could have to receive on every channel. The logical re-mapping prevents this by placing all electronic routers at the bottom of the optical hierarchy. In order to describe the mapping algorithm, the following parameters are needed: $B$ is the set of branches in the tree.


Figure 9: In order to support hierarchical naming and routing, the physical network must be mapped into a logical one. Logical nodes do not physically exist. (a) Network after step 2 of the algorithm. The optical-electronic cross-over point is at the same level in each branch. (b) Final logical mapping. Each branch of the tree terminates at the same level.

$n_c$ is the crossover node in the branch. $l_c$ is the level number of the crossover node that is farthest from the root, $0 \le l_c \le r - 1$. $l_{\max}$ is the level number of the node that is farthest from the root, $0 \le l_{\max} \le r - 1$.

Note that the algorithm inserts virtual nodes into the system. These nodes do not represent physical nodes that must be placed in the system; they exist only logically, to provide a consistent naming space. The algorithm is as follows: 1. Find the level number of the crossover point farthest from the root.

