Hierarchical Scalable Photonic Architectures for ... - Semantic Scholar

To Appear in IEEE Transactions on Computers, 1993

Hierarchical Scalable Photonic Architectures for High-Performance Processor Interconnection 1 Patrick W. Dowd, Kalyani K. Bogineni, Khaled A. Aly, James A. Perreault Department of Electrical and Computer Engineering State University of New York at Buffalo Buffalo, NY 14260 Abstract This paper introduces two hierarchical optical structures for processor interconnection and compares their performance through analytic models and discrete-event simulation. Both architectures are based on wavelength division multiplexing (WDM) which enables multiple multi-access channels to be realized on a single optical fiber. The objective of the hierarchical architectures is to achieve scalability yet avoid the requirement of multiple wavelength tunable devices per node. Furthermore, both hierarchical architectures are single-hop: a packet remains in the optical form from source to destination and does not require cross dimensional intermediate routing. The first structure is physically hierarchical but wavelength flat: all nodes share the same wavelength space. The second structure is a wavelength multiplexed hierarchical structure with wavelength channel re-use at each level, allowing it to be scaled to very large system sizes. It employs acousto-optic tunable filters in conjunction with passive couplers to partition the traffic between different levels of the hierarchy without electronic intervention. An advantage of the second structure is its ability to dynamically vary the bandwidth provided to different levels of the hierarchy. The architectures are compared in terms of complexity and performance scalability. Key Words: parallel computer architecture, processor interconnection, optical fiber communication, wavelength division multiplexing.

1 Introduction The Fat-Tree [1] was proposed as an interconnect strategy to support concurrent interprocessor communication through a hierarchy of limited-bandwidth spatial channels. Unlike conventional tree processor interconnection approaches, the Fat-Tree provides increased bandwidth at levels closer to the root. The processors of a Fat-Tree are located at the leaves of a complete binary tree, and the internal nodes are spatial switches. The number of links connecting a node to its parent increases as the level increases to provide increased communication bandwidth. The rate of growth is influenced by the desired scalability and cost-performance ratio. A variation of the Fat-Tree network has been employed in Thinking Machine’s CM-5 [2, 3]. This paper introduces an optical-based approach, denoted as the space-wavelength hierarchical architecture (SWHA), that achieves the Fat-Tree objectives while also obtaining significant improvement in flexibility, performance and fault tolerance through wavelength-division multiple access 1 This work was supported by the National Science Foundation under Grant CCR-9010774 and ECS-9112435; and utilized the computing facilities at the National Center for Supercomputing Applications (University of Illinois at Urbana-Champaign) through NSF Grant CCR-920001N.

1

(WDMA) photonic interconnection. Not only can the Fat-Tree strategy of providing increased bandwidth per link at levels closer to the root be obtained, but now adaptable bandwidth allocation can be achieved at each level: the bandwidth at each level does not need to remain fixed but can be dynamically reallocated to adapt to changing communication requirements. Furthermore, a highly scalable architecture is achieved, overcoming the fixed number of WDMA channels through spatial re-use. The hierarchy resembles a general mi -ary tree, where an i-level node branches to one parent and mi children. The processors (0-level nodes) are located at the leaves and the internal nodes are wavelength/spatial routing switches denoted as FatNodes due to its motivation from the Fat-Tree. The result is a system with a unity degree per processor yet unity diameter: any two processors can communicate without any intermediate routing. This is significant since the intermediate routing latencies may be much larger than the packet transmission time in an optical environment. One additional advantage of the proposed architecture is the independence of the number of channels in the WDMA network and the number of nodes interconnected by it. In addition to a very large bandwidth, optical interconnects offer many desirable features such as a relaxed distance-bandwidth product, large fanout capability, low power requirement, reduced crosstalk and immunity to electro-magnetic interference. Computer architecture may employ optical interconnects at multiple system levels: chip-to-chip, module-to-module, board-to-board and node-to-node [4]. However, little benefit is obtained by simply replacing metal interconnects with optical fiber in a one-to-one fashion since the result would be many expensive but under-utilized communication links. Furthermore, the true advantages of optical interconnections would not be harnessed due to the speed mismatch between the optical and electronic components. The speed mismatch is a major problem hindering the development of photonic based processor interconnection architectures. The low loss region of a single mode optical fiber has a bandwidth of about 30THz [5]: the optical media bit and packet throughput rates greatly exceed the capacity of the electronic interface components. Wavelength-division multiplexing (WDM) is an approach that circumvents the speed mismatch problem where the bandwidth is partitioned into many, more manageable, high speed channels that may be concurrently accessed. The WDM channels form a set that can be individually switched and routed. However, wavelength tunable transmitters and/or receivers are required. Section 2 contains a brief outline of their characteristics. Depending on the architecture, optical self-routing is achievable where a node only receives data destined to it and the system has the non-blocking connectivity characteristics of a crossbar [6]. Optical self-routing partitions the traffic, achieving a significant relaxation of the receiver subsystem design constraints since a node will not have to receive and process all network traffic. This is an important characteristic since the photonic network can support a throughput rate far beyond the packet processing capability of a typical processor. WMCH is a hypercube based structure that combines both spatial and wavelength multiplexing through optical WDMA channels spanning each dimensional axis [7]. The WMCH requires a greater than unity degree: an r-dimensional hypercube requires r transmitter/receiver pairs per node. Each processor in a r-dimensional structure is connected to r channels, each spanning different dimensions. An i-dimensional channel has mi processors attached for a total of M

=

r Y

i=1

mi processors. A major objective of the WMCH

was to support system size expansion beyond the optical power budget (OPB) limitation of the single star-coupled configuration shown in Fig. 1, while also extending the number of available channels through a combination of spatial and wavelength switching. The WMCH does not require optical amplifiers and uses a multi-dimensional structure to achieve scalability. This requires that each node have r transmitter/receivers for a r-dimensional structure, which has significant cost implications since the cost of wavelength tunable devices is not negligible. Another limitation of the WMCH is that the system is multi-hop: a packet may not remain completely in the optical domain between the source and destination. There is a latency incurred at each intermediate node due to the optical/electrical conversion, routing and re-transmission that is required 2

T

Node 0

T

...

Node 1 R

R

T

Node m-1 R

Figure 1: Optical multiple access channel achieved through a star-coupled configuration shown with wavelength tunable transmitters and receivers to achieve WDMA.

each time a packet crosses a dimensional boundary. Though the average distance and diameter of the WMCH is very low, the impact of routing latency on the total delay can be significant due to the high speeds of the optical media. Developing interconnection architectures that require a single transmitter/receiver pair per node yet achieve excellent performance through the avoidance of multi-hop is a major objective of this paper. System scalability in this environment is limited by three main issues: physical scalability which bounds the maximum number of interconnected nodes through the OPB, performance scalability which is bound in the optical domain by the tunability of the optical devices (transmitters and receivers/filters) that limit the number of wavelength channels that are created, and scalability limitations when the cost or complexity grows faster than the performance. Physical scalability is examined by developing a framework to extend the system range beyond the OPB limit of a single star-coupled network as shown in Fig. 1. Performance scalability is examined by using both spatial and wavelength multiplexing (multiple, but spatially separate, channels of the same wavelength) to extend the performance capacity beyond the limit of C channels imposed by the wavelength tunable device characteristics. Complexity scalability is supported through the hierarchical structures that retain a unity degree: a processor only requires a single transmitter and receiver regardless of system size. Our objective is to decouple the maximum system size and the number of wavelength channels, avoiding the situation where a system of M nodes requires M channels since this would create a situation where the scalability of the system is too dependent on the characteristics of the wavelength tunable devices. A wavelength-flat but physically hierarchical structure, denoted as FHA, is described that provides both physical and complexity scalability but is not performance scalable since the same wavelength space is shared by all processors. The resulting system is logically similar to a single star network of Fig. 1. The SWHA achieves (physical, performance and complexity) scalability through both spatial and wavelength multiplexing. The FHA and SWHA are both single hop architectures so the significant intermediate routing latencies of the WMCH, relative to the nominal packet transmission time, are eliminated. The proposed architectures are defined in Section 2, and the performance and cost behavior of each approach is considered in Section 3.

3

2 Photonic Interconnection Architectures Both the physical limitations and performance requirements must be considered when developing scalable photonic architectures. The objective of this section is to provide a framework where system size can be varied within a desired range while maintaining a specified level of performance. Photonic network architectures may adopt multi-access or virtual point-to-point communication strategies. Multichannel multiaccess protocols provide distributed access scheduling by employing either reservation or pre-allocation strategies [8]. Virtual point-to-point networks rely on embedding an interconnection topology in the wavelength domain [9]. The embedded network may be used as a virtual multi-hop interconnect [10, 11] (using only fixed wavelength transceivers) or as a topologically reconfigurable pointto-point interconnect [12, 13] (using tunable multi-channel receivers). Tunability may be employed to enable redefining the networks logical connectivity in response to traffic flow irregularities [14]. WDM networks require wavelength tunable transmitters and/or receivers to switch between the multiple channels created on the single optical fiber. A limited tuning range, transmitter linewidth and filter bandwidth constrain the maximum number of selectable wavelength channels. Wavelength tunable transmitters can be obtained through a number of techniques: fast tunable laser diodes, an array of non-tunable but different wavelength laser diodes, or LED spectral slicing [5, 15–17]. Wavelength selectivity can be achieved either with a coherent receiver, or a tunable filter with direct detection. The first approach is more expensive, but has higher channel selectivity. A lower cost alternative is a wavelength tunable filter with direct detection [18, 19]. Active filters have either electro-optic or acousto-optic control. Acousto-optic devices have a tuning range across the full 1.3 - 1.56 m range, while electro-optic devices are limited to about 15 nm [19]. Acousto-optic devices typically have slower switching speeds (s) than electro-optic devices (ns). Acousto-optic tunable filters (AOTF) have the advantages of a broad tuning range, electronic control of tunability, and narrow filter bandwidth. Several implementations of an optical cross-connect based on an AOTF have been described in [5, 19] where different wavelength channels can be routed independently to different output fibers. The selected wavelength is switched in an AOTF by varying the acoustic wave control frequency. This allows an individual wavelength to be switched or not switched. Multiple wavelengths may be switched with this device by the superposition of multiple acoustic control signals. A mixed radix system is used to represent the node numbering. Let M (the total number of processors) be a decimal integer represented as a product of r factors:

M=

r Y i=1

mi

(1)

The processor identifier P , 1 P

M , can be represented as an r-tuple P = (pr ; pr?1; : : :; p1) (2) where 1 pi mi for all 1 i r. The term r denotes the number of hierarchical levels with the FHA and SWHA. The term processor is used to denoted a node at the leaves of the tree (a 0-level node). An actual system might place m0 processors at each 0-level node, perhaps interconnected through a shared metal bus as in [20], to amortize the cost of the communication subsystem over multiple processors depending on the cost, performance and typical communication requirements of the individual processors. The address of a processor is defined by the r-tuple of Eqn. (2).

4

m1 x 1 coupler

(1,1)

...

(1,2) R

m1 x 1 coupler

...

(1,m1)

R

(m2,1)

R

...

(m2,2) R

R

1 x m1 splitter

m2 x m2 star-coupler

(m2,m1) R

1 x m1 splitter Amplifier

Amplifier

Figure 2: A 2-level wavelength-flat hierarchical star-coupled architecture (FHA). A system size of M = m2 m1 is achieved for this 2-level hierarchy but is still limited to C channels. A total of m2 optical amplifiers are required. Each processor is shown to possess a wavelength tunable transmitter and a fixed (slow tunable) receiver.

2.1

A Flat Hierarchical Architecture

A straightforward method to achieve physical scalability beyond the OPB limitation of the single star-coupled configuration of Fig. 1 is through the incorporation of optical amplifiers in a hierarchical fashion. This is now a practical approach due to the rapid advancement in the development of optical amplifiers [21–24]. However, the FHA is a physical hierarchy: the system is flat in the wavelength domain since all nodes share a common wavelength space. The FHA addresses the OPB limitation and achieves the complexity scalability objective but does not address the performance scalability limitation. This hierarchical approach (and also that of the SWHA) is assumed to have r levels, where an i-level node has one parent and mi children nodes for all 1 i r. The number and placement of optical amplifiers are highly dependent on the OPB characteristics of the system. Fig 2 illustrates the case where m2 amplifiers are used. This configuration shows a system with tunable transmitters and fixed (or slow tunable) receivers. Many alternative configurations, in terms of amplifier placement, are possible depending on cost constraints and fault tolerance requirements. Let Mj denote the total number of j -level nodes, 0 j r. The total number of processors is M = M0 , where M is defined by Eqn. (1) and Mr = 1. The total number of 1-level nodes (a m1 1 and 1 m1 passive coupler pair of Fig. 2) is M1 = M0 =m1 , so in general Mj = Mj ?1 =mj , or

Mj = for all 1 j

r ? 1.

r Y

i=j +1

mi

(3)

An advantage of this approach is that the system is not multi-hop: no intermediate routing between source and destination is needed. Depending on the access protocol and configuration, optical self routing may be achieved due to the traffic partitioning characteristic of WDM. A packet undergoes optical/electrical conversion only at the source and destination. A limitation is that the extended system is still limited to a total of C wavelength channels. Furthermore, the network is now not completely passive and fault 5

tolerance concerns arise since the failure of an optical amplifier disconnects a fraction of the network (m1 nodes in the example of Fig. 2). This structure has excellent topological characteristics: unity degree and unity diameter. Routing is very simple since it is a single hop network. High connectivity is achieved with optical multiple access channels. WDM creates multiple multi-access channels on a single fiber and a large number of nodes may be interconnected with a unity distance. A major concern with a multi-access approach is that a media access protocol is required. An efficient and simple (due to the high speeds of the optics) solution is essential to obtain the performance and complexity advantages of a multiple access environment. Reservation and pre-allocation are two possible strategies. Reservation techniques designate a channel as the control channel that is used to reserve access on the remaining channels for data packet transmission. Media access protocols are required to provide arbitration on both the data and control channels. Pre-allocation techniques pre-assign the channels to the nodes, where each node has a home channel that it uses either for all data packet transmissions or all data packet receptions. Reduced system complexity is achieved with pre-allocation since a node is not required to possess both a fast tunable transmitter and a fast tunable receiver. Furthermore, the difficulty of supporting a control channel is eliminated and all channels are used for data transmission [25]. A time multiplexed pre-allocation protocol is used in this paper to provide access arbitration. A system has M nodes and C WDM channels where each node has a tunable transmitter and a fixed (or slow tunable) receiver. The number of channels required for the interconnection pattern is independent of the number of processors. Techniques have been proposed to overlap the tuning time of the optical devices for this protocol [26]. This protocol is the basis for the multi-level protocol defined in Section 2.2. A source node tunes its transmitter to the home channel of the destination node and transmits according to the access protocol. A source node can determine the home channel of a destination node in a decentralized fashion. Node A is assigned cA as its home channel based on the allocation policy, where 0 cA C ? 1 and 1 A M . An interleaved home channel allocation policy is defined as cA = A mod C . Each node with this approach has the opportunity to transmit on each channel once per cycle as shown in Fig. 3. Each node maintains C transmit queues to avoid head-of-line effects [25]. Collisions are avoided and the difficulty of supporting acknowledgments and retransmissions is eliminated. A cycle, denoted by L, is defined as the length of time for all nodes to be assigned a slot on all channels. Typically, L = m, but may differ depending on the switching latency (the protocol processing overhead plus the switching time of the optical devices) hiding strategy [26]. The common set of channels between all nodes with the FHA enables a reservation-based media access protocol [27] and a snooping-style of cache coherence protocol to be considered. However, a reservationbased media access protocol has greater system complexity than a pre-allocation based configuration due to the requirement of multiple wavelength tunable components per node. A common channel would enable a snooping-based cache coherence scheme to be considered, rather than requiring explicit invalidations with a directory-based cache coherence scheme for distributed shared memory systems [20, 28]. The FHA supports physical scalability with stable complexity implying that the snooping-based cache coherence approach would be possible with systems scalable up to the point where the limit of the C channels in the common wavelength space and the overhead (mainly the synchronization delay due to the cycle length) of the media access protocol begin to restrict performance.

6

Channels

0

M-1

M

1

2

3

M-3

M-2

M-1

M

1

2

1

M

1

2

3

4

M-2

M-1

M

1

2

3

2

1

2

3

4

5

M-1

M

1

2

3

4

...

Transmitting Nodes

...

C-2

C-3

C-2

C-1

C

C+1

C-5

C-4

C-3

C-2

C-1

C

C-1

C-2

C-1

C

C+1

C+2

C-4

C-3

C-2

C-1

C

C+1

Slot M

Slot 1

Slot 2

Slot 3

Slot M-3

Slot M-2

Slot M-1

Slot M

Slot 1

Slot 2

Cycle

Figure 3: Allocation map for I-TDMA. Channel/Transmitting Node space-time diagram when M cycle length is L = M slots.

2.2

> C and

A Hierarchical Architecture in Space and Wavelength

The SWHA achieves the Fat-Tree objective of increased bandwidth per link at higher levels of the hierarchy while also obtaining a significant improvement in flexibility, performance and fault tolerance. The SWHA achieves adaptable bandwidth allocation: the bandwidth at each level does not need to remain fixed but can be dynamically reallocated to adapt to changing communication requirements. An advantage of this approach is that the system adapts to the users (programmers) instead of the users adapting to the system. Bandwidth reallocation is invisible to the users, and is initiated by the system to balance the per channel intensities described below. A highly scalable architecture is achieved, overcoming the fixed number of WDMA channels through wavelength channel spatial re-use. The SWHA interconnection network is based on a component called the FatNode which is a space/wavelength switch constructed from passive couplers and an acousto-optic tunable filter (AOTF). The AOTF is used as a wavelength-selective space-division switch to select a subset of wavelengths to switch. Section 2.2.2 presents the media access protocol employed in the SWHA architecture.

2.2.1

Architecture

As with the FHA, the term processor is used to denote a 0-level node which may actually be multiple (m0 ) processors. Note that only the processors possess transmitters and receivers. All other i-level nodes, 1 i r, are FatNodes that provide wavelength/spatial switching. An advantage of the SWHA, when compared with a structure such as the WMCH, is that only one tunable transmitter is required per processor. An i-level FatNode couples a total of mi + 1 input fibers and mi + 1 output fibers: one fiber to/from its parent and mi fibers to/from its children. Fig. 4 shows the architecture of the FatNode. The functionality of a FatNode can be described as a 2 2 switching element (the lower mi links are passively coupled to a single fiber), as found in multistage interconnection networks, where each wavelength channel can be individually selected for a bar-connect or cross-connect configuration by the AOTF. This is accomplished through the superposition characteristic of acousto-optic tunable filters. Although in principle

FatNode:

7

C-1 C-1 C-1

C-1

j = Xi

j = Xi

Σ λj

Input fiber from upper level

Σ λj

Input fibers from lower level C-1

Σ λj

j=0

Σ λj

j=0

FatNode

Upper Level

Σ λj

j = Xi

AOTF C-1

Σ λj

Output fiber to upper level

j = Xi

mi X 1 Passive Coupler

Output fibers to lower level C-1

Σ λj

Σ λj

j=0

electronic control for wavelength selectivity

j=0

(a)

Xi - 1

2 X mi Passive Coupler

(b)

Input fibers from lower level

Output fibers to lower level

Figure 4: An i-level FatNode. (a) Graphical representation and (b) Possible implementation with two passive couplers and an acousto-optic wavelength tunable filter. The filter partitions the wavelength channels f0 ; 1; : : :; C ?1 g arriving from the lower level into two groups: f0 ; 1; : : :; Xi?1 g, which are retained locally; and fXi ; Xi+1 ; : : :; C ?1 g which is the global traffic and passed to the upper level. Note that the link from the upper level contains no optical energy on channels f0 ; 1 ; : : :; Xi?1 g so all traffic arrives along channels fXi ; Xi+1 ; : : :; C ?1 g and is directly routed to the lower level. an AOTF can cross-connect or bar-connect wavelength channels individually through the superposition of multiple acoustic control frequencies, a FatNode does not require this degree of sophistication to relax device design constraints and reduce device cost. Spatial re-use of wavelength channels occurs at each of the r levels in this hierarchy. Limited by the characteristics of the optical devices, let C denote the total number of wavelength channels capable of being separated on a single fiber and = f0 ; 1 ; : : :; i?1 ; i; i+1 ; : : :; C ?1 g denote the set of wavelength channels. An i-level FatNode is electronically configured to a crossover channel denoted as Xi , where Xi 2 , such that bar-connections are established for channels channels f0 ; 1; : : :; Xi?1 g and cross-connections are established for channels fXi ; Xi+1 ; : : :; C ?1 g. Essentially, this mapping is used to retain the local i-level traffic f0 ; 1; : : :; Xi?1 g, and forward the global traffic fXi ; Xi+1 ; : : :; C ?1 g to the (i + 1)-level FatNode. The dynamic reconfigurability characteristic is obtained by altering the wavelength channel partition at each level. Fig. 4(b) shows an implementation of a FatNode. Due to the traffic partitioning characteristics, traffic arriving from the upper level does not require partitioning and may avoid the AOTF and be coupled directly to the mi nodes attached at the lower level. Only the lower level traffic must be partitioned. Bar-connections are established for traffic arriving on channels f0; 1; : : :; Xi?1 g, locally retaining the traffic under the i-level FatNode. Traffic arriving from the lower level along channels fXi ; Xi+1 ; : : :; C ?1 g is cross-connected and passed to the upper level as shown in Fig. 4(b). This exploits the locality of communication expected within a cluster: the communication generated and destined to nodes beneath the i-level FatNode remains local but global traffic, destined to a node under the j -level FatNode for i < j k, proceeds up the tree to the j -level FatNode. The term ”i-level cluster” is used to denote all processors below an i-level FatNode.

8

(1,1,1)

(1,1,2)

(1,1,3)

(1,1,4)

(1,2,1)

(1,2,2)

(1,2,3)

(1,2,4)

Level 1 FatNode

(1,3,1)

(1,3,2)

(1,3,3)

Level 1 FatNode

(1,3,4)

Level 1 FatNode Level 2 FatNode

Level 3 FatNode (2,1,1)

(2,1,2)

(2,1,3)

(2,1,4)

(2,2,1)

(2,2,2)

(2,2,3)

(2,3,1)

(2,2,4)

Level 1 FatNode

(2,3,2)

(2,3,3)

Level 1 FatNode

(2,3,4)

Level 1 FatNode Level 2 FatNode

Figure 5: Spatial-wavelength hierarchical star-coupled architecture. A 3- level structure with M nodes.

=

234

Topology: An i-level FatNode, 1 i r, (r is the number of levels in the structure) partitions the traffic it receives according to its space-wavelength configuration. An r-level system partitions the wavelength = fX0 ; X1; X2 ; : : :; Xr g channels into r non-overlapping subsets. The partition points are defined as where Xr = C , X0 = 0 and 0 < Xi < Xj < C if i < j for all 0 < i < r ? 1 and 1 < j < r. Let Ci = Xi ? Xi?1 denote the number of channels allocated for i-level communication, 1 i r, such that

X

C=

r X i=1

Ci. Channels fXi? ; : : :; Xi?1 g are used to support i-level communication. 1

The configuration remains fixed at the partition point and is not altered until the system is reconfigured to adapt to a change in the locality characteristics. The partition points are shifted during reconfiguration which allows the system to adapt to a shift in traffic locality characteristics and dynamically reconfigure the amount of communication bandwidth provided at each level of the hierarchy. This relaxes the design constraints on the FatNode AOTF since fast tunability beyond the currently achievable microsecond switching speed is not required. Furthermore, narrow filter bandwidth is not required since individual channel selection is not needed.

X

As with the FHA, let Mj denote the number of j -level nodes in this hierarchy, given by Eqn. (3), and M = M0 as defined in Eqn. (1). Since there are a total of Mj j -level nodes, a total of Cj Mj channels provide j -level communication. Therefore an r-level SWHA provides a total of

C=

r X j =1

Cj

r Y

k=j +1

mk

(4)

separate channels that may be concurrently accessed due to the combination of spatial and wavelength multiplexing. For example, consider the case of M = 24 in Fig. 5. If C = 7 and = f0; 4; 6; 7g, a total of C = 29 separate channels are effectively created.

X

Low-latency, high-throughput interprocessor communication is achieved with a unity diameter and a unity degree with SWHA. There are additional advantages with this approach:

.

The bandwidth provided to each level of the hierarchy can be dynamically increased or decreased 9

depending on the local/global communication characteristics. Adaptive bandwidth allocation is achieved by shifting the local/global crossover point Xi , 1 i r, at each level of the hierarchy through the tunable AOTF within each FatNode. At least one channel must be allocated to each level to maintain complete connectivity.

. .

The design of the FatNode does not require AOTF tuning speeds greater than the microsecond speeds already achieved. The design constraints on the AOTF filter within a FatNode are further relaxed since individual channels do not need to be selected which would require very narrow filter bandwidths. The control is simplified since only a single acoustic frequency needs to be specified to identify the crossover point.

Each processor is required to possess a receiver capable of receiving a total of r channels. A typical implementation of the receiver subsystem would be a single (fixed or slow tunable) multi-channel filter such as an AOTF followed by direct detection receivers. Let HY;i denote the i-level home channel of processor Y , 1 i r. Node Y would have its multi-channel filter tuned to receive channels fHY;1; HY;2; : : :; HY;r g where Xi?1 HY;i < Xi . Home channel definition for each level is defined below.

Routing:

Routing is simple since the network has unity diameter and intermediate routing is not necessary: all packets are transmitted directly from the source to the destination processor with no electronic intervention. The routing algorithm must only determine the appropriate level and channel based on the source and destination addresses, the channel partition , C , and M . Once the level and channel have been determined, the packet is transmitted according to the access protocol on the home channel of the destination.

X

A processor is denoted by an r-tuple as defined in Eqn. (2). Consider two processors: A = (ar ; ar?1 ; : : :; a1 ), the source node, and Z = (zr ; zr?1 ; : : :; z1 ), the destination node. Fig. 5 illustrates a 3-level structure where M = 2 3 4 and r = 3. In this example, processor A = (a3 ; a2; a1 ) uses channels f0 ; 1 ; : : :; X1?1 g to communicate to processors with an address (z3 ; z2 ; z1 ) if z3 = a3 and z2 = a2 but z1 6= a1 . Channels fX1 ; X1+1 ; : : :; X2 ?1 g would be used if z3 = a3 and z2 6= a2 ; and channels fX2 ; X2+1 ; : : :; C ?1 g would be used if z3 6= a3 . Determining the correct channel to transmit the packet is a two stage process: 1. Determine the minimum level of communication required to reach the destination Z through a digit-by-digit comparison of the source and destination address r-tuple: i-level communication is required if zi 6= ai and zj = aj for all i < j r. 2. Determine the home channel of the destination processor Z at that level. The i-level home channel of A is determined based on the allocation policy which in this case is interleaved and is given by

HZ;i = Xi?1 + Z mod Xi ? Xi?1

2.2.2

(5)

Multi-level Access Protocol

The SWHA architecture needs a multi-level media access protocol. Random access or static allocation protocols can be employed at the different levels of the hierarchy. The media access protocol considered in 10

Channels

000

001

002

003

004

005

006

007

λ0

XX0

XX1

XX2

XX3

XX4

XX5

XX6

XX7

012

013

015

016

017

XX0

λ1

XX1

XX2

XX3

XX4

XX5

XX6

XX7

λ2

XX2

XX3

XX4

XX5

XX6

XX7

XX0

λ3

X03

X04

X05

X06

X07

X00

λ4

X04

X05

X06

X07

X00

X01

λ5

005

006

007

000

001

002

003

010

XX1

XX2

XX3

XX4

XX5

XX6

XX7

XX0

XX1

XX2

XX3

XX4

XX5

XX6

XX7

XX1

XX2

XX3

XX4

XX5

XX6

XX7

XX0

X01

X02

X13

X14

X15

X16

X17

X10

X02

X03

X14

X15

X16

X17

X10

X11

004

015

016

017

010

011

012

Level-1 Cycle

011

014

103

104

105

106

107

XX0

100

XX1

101

XX2

102

XX3

XX4

XX5

XX6

XX7

XX0

XX1

XX2

XX3

XX4

XX5

XX6

XX7

XX1

XX2

XX3

XX4

XX5

XX6

XX7

XX0

X11

X12

X03

X04

X05

X06

X07

X00

X12

X13

X04

X05

X06

X07

X00

X01

013

014

105

106

107

100

101

102

Level-1 Cycle

113

114

115

116

117

XX0

110

XX1

111

XX2

112

XX3

XX4

XX5

XX6

XX7

XX0

XX1

XX2

XX3

XX4

XX5

XX6

XX7

XX0

XX1

XX2

XX3

XX4

XX5

XX6

XX7

XX0

XX1

X01

X02

X13

X14

X15

X16

X17

X10

X11

X12

X02

X03

X14

X15

X16

X17

X10

X11

X12

X13

103

104

115

116

117

110

111

112

113

114

Level-1 Cycle

Level-2 Cycle

Level-1 Cycle Level-2 Cycle

Level-3 Cycle

Figure 6: Assignment map for 3-level I-TDMA protocol with L1 = 8, L2 = 16 and L3 = 32.

M

=

2 2 8 with

X = f 0 ; 3 ; 5 ; 6 g.

this paper for the SWHA is a multi-level generalization of the single level WDMA protocol described in Section 2.1. Time is slotted on the channels at all levels of the hierarchy. The time-multiplexed protocol avoids collisions on all channels, eliminates the need to support media access level acknowledgments (coherence level acknowledgments are still required as in [20]), and is insensitive to propagation delay. Performance comparisons of random and static access protocols for this environment showed that timemultiplexed protocols exhibit excellent performance but are limited in terms of scalability due to the long cycle lengths [25]. However, the protocol presented in this section limits the performance degradation due to long cycle lengths since an i-level cycle only depends on the number of reachable processors through i-level communication rather than on all processors. With increased communication locality within clusters, the impact of large higher level cycle lengths is reduced. The multi-level protocol is described through a 3-level example. The time-wavelength diagram is shown in Fig. 6 where M = 32 = 2 2 8, C = 6, and = f0; 3; 5; 6g. The allocation map shows time slotted on all channels, and every node is assigned a slot on each channel per cycle. Slots on every channel are combined into groups of m1 . The 1-level clusters of the system are allocated slots in the group assigned to each of the M1 clusters. This ensures fairness among all the clusters in the system and among all the nodes within a cluster. Note that each processor does not maintain the table shown in Fig. 6: this figure is used only for illustrative purposes. An actual processor would determine its assigned slots in a decentralized fashion through the simple algorithm defined below based on , X, M , C and its own address. Execution of the slot assignment algorithm is not required on slot or cycle boundaries. It is executed infrequently, only to support the dynamic re-allocation of bandwidth through reconfiguration by altering the partition points X. The characteristics of this multi-level protocol are:

X

Cycle Length: Let Wi denote the number of processors reachable by a processor through i-level communication (W0 = 1):

Wi =

i Y

j =1

mj

(6)

The cycle length depends on the number of nodes connected at that level: Li = Wi . Every node is assigned a slot on every channel at every level. In the example of Fig. 6, L1 = 8, L2 = 16, and L3 = 32. Node address: Every node is represented by an r-tuple address given by Eqn. (2). In the example of Fig. 6 the nodes have addresses with 3 digits. 11

Slot number: Every slot is represented by an r-tuple as [sr ; sr?1 ; : : :; s2 ; s1 ], where 0 si

(mi ? 1).

Cluster Assignment: The level-1 cycle length is L1 = m1 . All level-1 clusters are assigned slots on level-1 channels during this cycle. The cluster assignment on higher level channels is determined by the slot digits and the node address digits. Nodes with ai = si for 2 i k (the address digits match the slot digits for 2 i k) are assigned a slot on level-k channels during the cycle with s1 2 f0; : : :; m1 ? 1g. The channel on which a node is permitted to transmit in any slot is described in channel assignment discussed below. Nodes with a2 = 0 are assigned level-2 channels during slots with s2 = 0. Nodes with a2 = 1 are assigned level-2 channels during slots with s2 = 1. Nodes with (a3 ; a2 ) = (0; 0) are assigned level-3 channels during slots with digits (s3 ; s2 ) = (0; 0), nodes with (a3 ; a2 ) = (0; 1) are assigned level-3 channels during slots with digits (s3 ; s2 ) = (0; 1), nodes with (a3 ; a2 ) = (1; 0) are assigned level-3 channels during slots with digits (s3 ; s2 ) = (1; 0), and nodes with (a3 ; a2) = (1; 1) are assigned level-3 channels during slots with digits (s3 ; s2 ) = (1; 1).

Channel Assignment: The assignment of slots for i-level channels, 1 i r, to the nodes can be summarized as follows. At higher levels, clusters are assigned to slots as described above. The channel ( (j ? k) if j k . on which a node with a1 = j in slot with s1 = k is given by c(i; j; k) = (m1 + j ? k) if j < k A node can transmit on the assigned channel if the assigned channel number is within the valid range at that level. Nodes which have slots assigned on channels that are not within the valid range for that level do not transmit in that slot.

2.2.3

Channel Partition

An objective of the SWHA is to support communication when there is very little inherent locality but can adapt to take advantage of locality when it exists by providing increased bandwidth to the lower levels. The SWHA allows the channel allocation to adapt to variations in reference locality on a process-by-process basis by shifting the partition configuration through the retuning of the FatNodes.

X

The model presented here assumes that locality may be exhibited in general at any level, only relative to the levels above it. Typically, reference locality would be exhibited within a cluster (level) range. Let the i-level uniform reference probability be denoted by p~i , and the actual i-level reference probability when locality is considered (non-uniform reference) be pi . This represents the probability of targeting a destination through i-level communication that can not be reached through (i ? 1)-level communication, 1 < i r, which implies that the source and destination have the same i-level parent but reside in different (i ? 1)-level clusters. Uniform reference probabilities are p ~i = Wi?1 (mi ? 1)=(M ? 1) for all 1 i r and W0 = 1. The i-level traffic intensity is defined as the mean total traffic destined to i-level nodes (the fraction of the total traffic targeted to a node reachable through i-level communication). Assuming a geometric traffic generation process with a per slot generation probability of g per node, the i-level traffic intensity is Ii = gpiWi (7 ) for all 1 i r. Ii represents the traffic intensity within a single i-level cluster. Mi is defined by Eqn. (3) to indicate the total number of i-level nodes so Ii Mi represents the spatial re-use of the channels allocated to support i-level communication. The total traffic intensity given is:

I=

r X i=1 12

Ii Mi

(8)

which reduces to I = gM . A fair number of channels need to be allocated to each level based on the relative traffic intensities with a minimum of one channel being assigned to each level to maintain complete connectivity. Let Ci = Xi ? Xi?1 denote the number of channels allocated to support communication within a particular i-level cluster. Fair allocation can be determined from a channel partitioning factor

! = CIi

i

for all i, 1 i r, so

r 1 X

!=C and therefore

i=1

Ii

(9)

Ci = I!i

(10)

for all 1 i r. Eqn. (10) can be expanded to show that it is just a weighted sum

Ci =

"

#

pi Qiu=1 mu C Qj Pr j =1 pj k=1 mk

(11)

Eqn. (11) shows that the channel partition is independent of m1 . The optimal channel allocation (OCA) for a 2-level SWHA is p 1 C1 = p + (1 ? p )m C (12) 1 1 2

with C2 = C ? C1 . Section 3.3 shows that this partition achieves a uniform saturation load at each level. Two assumptions are made in Section 3.3 when determining valid ranges of the reference probabilities at each level, based on the hierarchical configuration and the non-uniform reference model: p1 p~1 and pi pi?1 for all 1 < i r. For example, in a 3-level system p~1 p1 1:0, p2 p3, and p2 + p3 = 1 ? p1 .

3 Structural Analysis System scalability, defined in Section 1, is primarily governed by three limitations: physical limitation due to optical power budget considerations, complexity limitations, and performance limitations due to the finite number of channels within the tuning range of optical sources and filters. Section 3.1 examines the physical limitation through an optical power budget calculation and determines the required amplification for the hierarchical architectures. Section 3.2 compares the system complexity of the architectures under consideration. Section 3.3 evaluates the delay-throughput performance via analytic models and discrete-event simulation.

3.1

Physical Scalability

The power budget calculations presented in this section are based on the device loss models given in Appendix A. The fanout of a single star coupler is OPB limited to a maximum of a few hundreds of ports, depending on its structure and realization technology. However, current implementations have not reached the OPB limitations and current devices are limited to up to about 100 ports [29]. Physical scalability, defined in Section 1, for both the FHA and SWHA is achieved with optical amplifiers at the level hierarchical boundaries. 13

A tree coupler realization is assumed throughout the analysis. LT (n1 ; n2 ) denotes the power losses of an n1 n2 coupler and PM denotes the overall system power margin, as defined in Appendix A. The amplifier gain is assumed to be uniform at all level boundaries, denoted as G. The total losses of an FHA system are given by:

FHA:

LFHA (mr; : : :; m1) =

r X i=1

LT (mi ; mi) ? (r ? 1)G

(13)

The required amplifier gain for an r-level FHA system is given by:

GFHA;r = r ? 1 1

" r X

i=1

LT (mi; mi) ? PM

#

(14)

SWHA:

The incorporation of one amplifier within each FatNode is considered, being placed at the input to the AOTF. Due to practical device characteristics, it is assumed that the number of nodes per cluster at any level is less than 256 (64 is a typical figure). The power loss the signal encounters through a FatNode is dependent on its direction (up or down the hierarchy), as can be noted from Fig. 4. For a two-level system, it will be shown that one amplifier per FatNode is sufficient to support system size on the order of a few thousand nodes with a gain requirement of less than 25 dB. A signal path will encounter both the upward and downward directions of the FatNode component. The the FatNode loss at i-level can be defined as

LFN (mi) = LT (mi; 1) + Lw + LT (2; mi)

(15)

where Lw represents the AOTF insertion loss. It is desirable to have the amplification requirement uniform throughout all levels. This results in minimizing the maximum required gain at any level, maxfGk j1 k rg, since some configurations require higher gain at different levels. The initial, nonuniform, gain requirements are:

G~1 LFN (m1) ? PM G ~k

k X i=1

LFN (mi ) ?

kX ?1 i=1

G~ i ? PM

(16)

k r. Having uniform amplification, G1 = = Gr = G, requires that for all 1 k r; kG LFN (mi) ? PM , or

for all 2

k X i=1

(

GSWHA = max k 1

"

k X i=1

#

LFN (mi ) ? PM jk 2 [1; r]

)

(17)

It can be shown that in a two-level SWHA the gain requirement is often independent of the node hierarchic distribution (m2 ; m1 ). This characteristic is important since it alleviates any constraints on the node distribution due to amplification requirements so the distribution can be mainly determined by a

r X

LFN (mi) is a constant for a given M and does not depend on the node distribution. Consider two factorizations of M : M = m2 m1 and M = (m2 =k) (km1 ).

cost-performance metric. It is first shown that

i=1

14

25

SWHA

Amplifier Gain (dB)

20 15

FHA 10

r=2 PM = 40 dB

5 0 0

500

1000

1500

2000

2500

3000

3500

4000

System Size Figure 7: Required amplification (dB) at each level of the hierarchical architectures. Using the definitions of Eqns. (15), (29), we have LFN (m2 ) + LFN (m1 ) = LFN (m2 =k) + LFN (km1 ). This can be generalized to an r-factorization of M by induction. From Eqn. (17),

LFN (m1 ) ? PM > 12 [LFN (m2 ) + LFN (m1) ? PM ] () LFN (m1) > LFN (m2) + PM which implies that the gain is independent on (m1 ; m2 ) if

m1 < 2PM =(3+2Le) m2 For a relatively low power margin of 30 dB, m1 =m2 < 64 is a realistic relationship due to the limited

available star coupler dimension. The required amplifier gain for both the FHA and the SWHA are illustrated in Fig. 7 for two-level systems with PM = 40 dB, Le = 1:0 dB, Lc = 0:75 dB, and Lw = 5:0 dB.

3.2

Complexity

This section evaluates the complexity of the FHA, SWHA, and WMCH; and compares to the hypertree structure of the CM-5. The architectures employ tunable transmitters and fixed (slowtunable) receivers. The WMCH requires r transmitters per node, while both the FHA and the SWHA require only a single transmitter. A single receiver is sufficient for the FHA. An r-level SWHA may use a single receiver with an r-channel AOTF to monitor the incoming traffic from all r levels arriving on a single fiber. The WMCH needs one distinct receiver per dimension in each node as a result of its topological definition. Complexity expressions are summarized in Table 1, where the first two entries present the overall node complexity (transmitters and receivers) 15

Transmitters (tunable) Rcvr Filters (slow-tunable)

Star couplers

Amplifiers

FHA

SWHA

rM

M

M

rM

M

M

(single channel AOTF)

r r X Y

=1 = 1 j 6=i

i

AOTFs

WMCH

j

m

(multi-channel AOTF)

2

i

r X

=1

M

i

r X

=1

i

;

None

r X

None r ?1 X

None

=r2

rM

2

i

=2

M

i

i

r X

M

X

=1

i

M

i

i

=r2

M

i

i

i

Fiber links

2

M

i

2

X

=1

M

i

i

Table 1: Complexity expressions for the WDM-based photonic architectures for processor interconnection.

while the remaining entries present the interconnection network complexity (star couplers, AOTFs, amplifiers, and fiber links). Star couplers have different port configurations depending on the number of nodes attached to each. The number of fiber links in the system also represents the total number of star coupler ports employed. This could be used as a measure of the overall coupler complexity based on the way couplers are fabricated. The CM-5 employs a hierarchical network with increased capacity at higher hierarchical levels. The network is based on the Fat-Tree [1], but differs in being a quad-tree (rather than a binary tree) and in the rate of channel capacity increase at higher levels [3]. The CM-5 interconnection scheme is denoted as hypertree in [3] (not the same hypertree introduced in [30]). The CM-5 hypertree employs data routers at the internal nodes and processors at the leaves of the tree as in the Fat-Tree. However, the CM-5 hypertree data routers provide multiple paths between the lower and upper level nodes [3]. With no doubling, there are log4 M levels of data routers in a network supporting

M processors. There are M routers at level- i, where each router has 2 upward and 4 downward 2i links for 1 i log4 M . A total of M data routers are required when M is asymptotically large, log M X M 1 1 since = M + + ' M . The number of links in each direction approaches 4M i 2 4 i 1 2 when M is large. A CM-5 network with doubling employs twice the number of routers and links. 4

=

The FatNode components in the SWHA are the wavelength domain equivalent of the data routers in the Fat-Tree network but have the capability to support dynamic reallocation of bandwidth at each level. The number of FatNodes in the SWHA depends on its hierarchical configuration,

16

r X

Mi = M m + m m + + M 1 2 1 1 1

1

1

2M

m1 , i= so the number of FatNodes approaches M=32 for an arbitrarily large number of levels considering m1 = 64 and mi = 2 for 2 i r as a worst case example. The number of fiber links in each 2 direction in the SWHA is, using the same argument, upper bounded by M 1 + . . The number of FatNodes is necessarily lower than

m1

The above comparison shows that the SWHA network has a significant asymptotic complexity advantage. Routing is also simpler since the internal nodes are configured only upon bandwidth re-allocation and the actual destination routing is furnished via the multi-level access protocol in a completely decentralized fashion. However, the CM-5 employs conventional electronic routers that are more readily available at lower cost than the optical filters used by the SWHA. The relatively high data router complexity in the CM-5 provides more data transfer simultaneity that serves its architectural design goal of supporting a cooperative conflict-free communication model. The SWHA employs fewer space channels and bandwidth allocation components (routers) due to the availability of multiple WDM channels on a single fiber. Conflict-free communication is provided as well, due to the time-multiplexed access protocol. Comparing the effective throughput of both networks would largely depend on the objective functionality (code execution mode) in each as well as other factors such as the communication locality in the SWHA and the adopted routing policy in the CM-5 hypertree. The SWHA throughput is evaluated in Section 3.3 based on a global interprocessor communication model with variable spatial reference locality.

3.3

Performance

This section is concerned with the performance of the photonic architectures under consideration. Analytic models and stochastic self-driven discrete-event simulation models are used to evaluate the delay-throughput characteristics. Section 3.3.1 presents the analytic models developed for the hierarchical structures. Section 3.3.2 validates the analytic models through comparison to simulation. The impact of varying system parameters on the performance is discussed in Section 3.3.3. The performance metrics of interest are the total average access delay (D) and system throughput (?). The delay is defined as the time between the generation of a request and its delivery (and acknowledgment if necessary) to the destination. The total system throughput is defined as the total number of concurrent transfers per slot. Di and ?i denote the delay and throughput of transfers that require i-level communication, for 1 i r. Time is normalized to the channel slot length.

3.3.1

Analytic Models

The FHA architecture is only physically hierarchical: the communication medium is a flat shared wavelength spectrum of C channels. The hierarchy of the SWHA involves partitioning the channels among the system levels. The arrival process of new traffic generated at each processor has a geometric distribution with parameter g , so g represents the per slot probability of a node generating a communication request. The preallocation access protocol requires separate buffer 17

space for each channel to avoid head-of-line blocking [25]. This allows a simple but accurate analytic model to be developed based on Geom/D/1 queues.

FHA: A typical channel queue is considered. The geometric arrival rate to a given channel

queue is g=C , implying a uniform reference model and that C divides M exactly so that each channel queue is responsible for an equal number of destinations and therefore have equal arrival rates. The model developed for the SWHA considers in detail the case when the number of channels allocated to a level does not exactly divide the number of nodes communicating through that level.

Instead of a uniform reference model, a non-uniform client/server reference model was also considered. Let rij denote the probability of a packet being destined to node Pj after being generated at node Pi . Let P1 be the server, then for a client node Pi , 1 < i M : 8 >
:

if j

0

i=j

=1

? M ?2 if 1 < j 1

M and i 6= j

where r1i = 1=(M ? 1) and r1i 1:0. A uniform reference model is used in Section 3.3.3 because the simulation studies showed that traffic partitioning through the pre-allocation of channels limited the performance impact when traffic on one channel is significantly greater than on the other channels. Furthermore, the C transmit queues at each node eliminated any head-of-line effect due to the server traffic. This resulted in performance characteristics of non-server traffic not significantly different than with the uniform traffic model: the congestion was contained to the server channel and had limited impact on non-server channel traffic.

= L = M , so the channel queue utilization factor gM . An average cycle synchronization time of (M ? 1)=2 is incurred since arrivals are is = C The service time equals the cycle length

1

equally likely to occur at any time instant, so the average access delay is then:

DFHA

M ?1 =1+

The saturation load is determined by gsat

2

gM 1+ (C ? gM )

#

(18)

= C=M , so the total system throughput is: (

?FHA =

"

gM C

if 0 g gsat if g > gsat

SWHA: Channel queues of all levels behave as Geom/D/1 queues when m1

(19)

C.

This constraint guarantees that no source node is given a chance to access more than one destination in a single time slot as shown in the allocation map of Fig. 6. The constraint is practical since C is limited by the tuning range of the optical devices and the most cost-effective configuration of the system is with the largest possible value for m1. Removing this constraint results in service interruptions at lower levels. 18

Apart from the above constraint, the number of channels is not related to the system size or configuration. This section determines the delay and throughput of traffic that requires i-level communication, for 1 i r, and then determines the total average of the performance metrics. Wi denotes the number of processors reachable through i-level communication, and Ci is the number of channels allocated to this level so there are a total of Ci transmit queues used to pre-sort outgoing i-level traffic. The transmit queues may be further classified based on the number of nodes who use the corresponding channel as a home channel. In general, there will be

Nci = Wi mod Ci queues for channels that are home channels for dWi =Ci e destination nodes (called a ceil-queue), and

Nfi = Ci ? (Wi mod Ci ) queues belonging to channels that are home for bWi =Ci c destination nodes (called a floor-queue). The behavior of the ceil-queues and floor-queues need to be considered individually since a

ceil-queue is served by a channel that is home for one more destination node than a floor-queue therefore the likelihood of an outgoing request joining a ceil-queue is higher.

All queues are assumed for modeling purposes to have infinite capacity. The model considers one ceil-queue and one floor-queue at each level as typical queues. The probability of an outgoing packet joining an i-level ceil-queue is

qci = dWWi=Ci e i

and the probability of a packet joining a floor-queue is

qfi = bWWi =Cic i

A node does not usually target itself, but this is ignored for simplicity in determining the probability of joining a queue. This simplification is shown in Section 3.3.2 to have no effect on the model accuracy for various system parameters in the model validation graphs of Fig. 9. The arrival to an i-level queue has a geometric distribution with parameter gpi qci for ceil-queues and gpi qfi for floor-queues. The deterministic service time of any i-level queue is Wi . The average delay of an i-level ceil-queue is therefore

Dci = 1 + Wi2? 1

"

gpi qciWi 1+ 1 ? gpi qci Wi

#

(20)

and the average delay of an i-level floor-queue is

Dfi = 1 + Wi2? 1

"

gpi qfiWi 1+ 1 ? gpi qfiWi

#

(21)

and so the total average i-level delay is

Di = qciDci + qfiDfi 19

(22)

The overall average access delay for SWHA can now be expressed as a weighted sum of the individual level delays: r X DSWHA = pi Di (23) i=1 The i-level ceil- and floor-queues have a slightly different saturation point of gci;sat = 1=(pi dWi =Ci e) and gfi;sat = 1=(pi bWi =Ci c), respectively. The system saturation load is determined by gsat = minfgci;sat j 8i 2 [1; r]g, since gci;sat gfi;sat . Due to the characteristics of the media access protocol, the throughput of an i-level ceil-queue is (

?ci = and the floor-queue throughput is

?fi =

gpidWi =Cie

if 0 g gci;sat if g > gci;sat

(24)

gpibWi =Cic

if 0 g gfi;sat if g > gfi;sat

(25)

1 (

1

The total average i-level throughput can be determined including the spatial re-use: 8 > <

Nci ?ci + Nfi? fi Mi N ?i = > ci + Nfi ?fi Mi : CiMi

and the overall system throughput is ?SWHA

=

r X i=1

if 0 g gci;sat if gci;sat g gfi;sat if g > gfi;sat

(26)

?i . When the system has an optimal channel

allocation, gci;sat gcj;sat for 1 i r and 1 j r, so the system saturates at all levels at the same traffic load. Stable behavior is still maintained when there is an imbalance which the model accurately illustrates in the throughput graph of Fig. 10.

3.3.2

Validation of Analytic Models

This section validates the analytic models developed for FHA and SWHA through a comparison to discrete-event simulation. The simulators for FHA and SWHA are written in C , with a C -based library of routines that provide discrete-event and random variate facilities. Steady state delay and throughput are measured. Simulation convergence is obtained through the replication/deletion method [31] with a 98% confidence in a less than 2% variation from the mean. A comparison of the results predicted by the analytic model and simulation for FHA is presented in Fig. 8. A varying number of channels ( C 2 f8; 16; 32g) with a fixed system size (M = 256) is considered in Fig. 8(a), while Fig. 8(b) plots a fixed number of channels (C = 8) with a varying number of nodes (M 2 f64; 256; 512g). The graphs show that the analytic model predicts the performance of the system with a high degree of accuracy under conditions of varying system parameters. Fig. 9 plots the validation graphs for SWHA for variations in system size, the number of WDM channels, communication locality and channel partitioning. Fig. 9(a) plots the delaythroughput validation graphs for variations in number of channels. The system size is fixed at 20

Access Delay

Access Delay

600

600

Variation in C M=256

500

C=8

400

500 C=16

Variation in M C=8

400 M=512

300

300 C=32

200 100

M=256

200 M=128

100

M=64

0 1 (a)

0 0.1

10

System Throughput

1

Figure 8: Validation of the model for FHA. (a) Variations in C in M 2 f64; 128; 256; 512g with C = 8.

10

System Throughput

(b)

2 f8; 16; 32g with M = 256. (b) Variations

M = 256 = 16 16 with locality fp1 ; p2 g = f0:9; 0:1g. The number of channels is varied as C 2 f16; 8; 4g with optimum channel partitioning. Fig. 9(b) plots the delay-throughput validation graphs for variations in system size. The system size is varied as M 2 f64; 128; 256; 512g with a configuration of f2 32; 4 32; 8 32; 16 32g, C = 32, and optimum channel partitioning for fp1 ; p2 g = f0:9; 0:1g. Fig. 9(c) plots the validation graphs with variation in locality. The system size is fixed at M = 256 = 16 16, C = 16 and optimum channel partitioning for fp1 ; p2 g 2 ff0:9; 0:1g; f0:7; 0:3g; f0:5; 0:5g; f0:3; 0:7g; funiformgg. Fig. 9(d) plots the validation graphs with varying channel partitions for a system size of M = 256 = 16 16 with C = 16 and locality fp1 ; p2 g = f0:9; 0:1g. The channel partition is varied as X = ff0; 8; 16g; f0; 6; 16g; f0; 4; 16gg.

This set of graphs shows that the analytic model is accurate for a wide variation in system parameters. The behavior of SWHA with varying system parameters is examined in the following sections.

3.3.3

Impact of Varying System Parameters

The impact of variations in communication locality and channel partitioning is investigated first. The performance impact of variations in system size and channels is then considered

Variation in communication locality: This section studies the impact of variations in communication locality within a cluster with fixed channel partitioning. The average access delay and system throughput are examined to investigate the behavior of the SWHA and FHA at varying traffic rates. Fig. 10 plots the performance characteristics of a 2-level SWHA with a system size of M = 1024 and a configuration of m1 = 32 and m2 = 32. The total number of channels in the system is held constant at C = 32, and the partition is fixed 21

100

300

Variation in C

Variation in M 250

80

C=4

Access Delay

40

Access Delay

M=256=16 x 16 P={0.9,0.1} OCA

60

200

C=32 P={0.9,0.1} OCA M=2 x 32

150

M=4 x 32

C=8

C=16

20 0

100 M=16 x 32

50

M=8 x 32

0 1

(a)

10

100

1

System Throughput

10

100

System Throughput

(b)

1000

Variation in P

Variation in X

P={uniform}

M=16 x 16 C=16 OCA

200

100

X={0,6,16}

10

{0.9,0.1}

0

M=256=16 x 16 C=16 P={0.9,0.1}

1 1

(c)

X={0,8,16} X={0,4,16}

100

Access Delay

Access Delay

300

10

20

100

System Throughput

(d)

40

60

80

100

System Throughput

Figure 9: Validation of the analytic models for SWHA. All graphs use OCA except (d). (a) Variations in C 2 f4; 8; 16g with M = 256 = 16 16 and fp1 ; p2g = f0:9; 0:1g. (b) Variations in system size as M 2 f4 32; 8 32; 16 32g with C = 32, fp1 ; p2g = f0:9; 0:1g. (c) Variations in locality fp1; p2g 2 ff0:9; 0:1g; f0:7; 0:3g; f0:5; 0:5g; f0:3; 0:7g; funiformgg with M = 256 = 16 16 with C = 16. (d) Variations in partition for M = 256 = 16 16, C = 16, fp1; p2g = f0:9; 0:1g and X 2 f0; 4; 16g; f0; 6; 16g; f0; 8; 16g.

22

Access Delay 200

10000

System Throughput

SWHA FHA

{0.9,0.1}

150

{0.6,0.4} {0.4,0.6}

1000

{0.8,0.2}

{0.2,0.8}

100

{0.9,0.1} {0.6,0.4}

100

M=32 x 32 C=32 X={0,4,32} 10 0.001 (a)

{uniform}

50

P={0.8,0.2}

0 0.01

0

0.1 (b)

Traffic Rate

0.2

0.4

0.6

0.8

Traffic Rate

Figure 10: Access delay-throughput characteristics of SWHA for M = 1024 with m1 = 32 and m2 = 32, C = 32 and channel assignment is fixed as X = f0; 4; 32g. FHA Reference probabilities are varied as fp1 ; p2 g 2 has M = 1024 with C = 32. ff0:9; 0:1g; f0:8; 0:2g; f0:6; 0:4g; f0:4; 0:6g; f0:2; 0:8g; funiformgg where uniform denotes the uniform reference model. (a) Access delay, (b) System throughput.

X = f0; 4; 32g. Six cases of reference probabilities to the two levels are considered: fp1 ; p2 g = ff0:9; 0:1g; f0:8; 0:2g; f0:6; 0:4g; f0:4; 0:6g; f0:2; 0:8g; f0:06; 0:94gg. The case of fp1 ; p2 g = f0:06; 0:94g corresponds to a uniform reference traffic pattern. The performance of FHA for the same system size (M = 1024) with C = 32 is also plotted. at

Fig. 10(a) plots the average access time as a function of traffic rate for FHA and SWHA. The reference probability fp1 ; p2 g = f0:8; 0:2g indicates that 80% of all references generated by a processor are directed to processors within the same 1-level cluster, and 20% are directed to 2-level clusters. The channel partition X = f0; 4; 32g allocates 4 channels for 1-level traffic and 28 channels for 2-level traffic creating 156 separate channels. The uniform reference traffic pattern generates traffic destined to all the nodes with equal probability. The delay characteristics of the SWHA with a uniform reference probability is same as that of FHA. As the reference probability is varied to increase the references to 1-level nodes with a fixed channel partition, the average access delay decreases and capacity increases. The difference in cycle lengths at different levels together with the increased 1-level reference probability results in the decreased access delay. The given partition X = f0; 4; 32g is the optimum partition for the reference probability fp1 ; p2 g = f0:8; 0:2g. This case has the highest system capacity and can be seen in Fig. 10(a). Fig. 10(b) plots the throughput characteristics of SWHA and FHA. Although the FHA and SWHA with uniform reference probability have the same delay characteristics, Fig. 10(b) shows that the SWHA achieves a significant increase in capacity even when there is no locality due to the spatial re-use of wavelength channels at each level of the hierarchy. The FHA cannot exploit locality due to its common wavelength space. The maximum throughput of ?max = 156 is obtained with the 23

Access Delay

System Throughput

10000

600 X={0,16,32}

X={0,16,32}

500 1000

M=32 x 32 C=32 P={0.8,0.2}

Level-2

400 Total

100

300 X={0,2,32}

10

200

Level-1

X={0,16,32}

100 1 0.001 (a)

0 0.01

0.1

1

Traffic Rate

0 (b)

0.2

0.4

0.6

0.8

1

Traffic Rate

Figure 11: Delay-throughput characteristics of SWHA for M = 1024 with m1 = 32 and m2 = 32, C = 32 and channel assignment is varied as X 2 ff0; 2; 32g; f0; 4; 32g; f0; 8; 32g; f0; 12; 32g; f0; 16; 32gg. Reference probabilities are fixed at fp1 ; p2 g = f0:8; 0:2g. (a) Access delay, and (b) System throughput.

SWHA even with only a slight reference locality. The maximum throughput depends on m1 and m2 and the rate of increase of throughput depends on the reference probabilities as described in Section 3.3.1.

Variation in channel partition: The impact of variations in channel partitioning with a fixed communication locality is considered next. The behavior of SWHA at different levels is examined to investigate the impact of channel partitioning on the maximum capacity of each level. The system saturation point is determined by the individual saturation points as described in Section 3.3.1. Fig. 11 plots the access delay and throughput of a 2-level SWHA for M = 1024 = 32 32 and C = 32, where the partition is varied as X 2 ff0; 2; 32g; f0; 4; 32g; f0; 8; 32g; f0; 12; 32g; f0; 16; 32gg. The reference probability is fixed at fp1 ; p2 g = f0:8; 0:2g. Fig. 11 plots the characteristics of the individual levels in addition to the total average. Fig. 11(a) plots the average access time while Fig. 11(b) plots the average throughput. The delay characteristics are determined by the system configuration and the reference probabilities. The 1-level cycle length is L1 = 32 since W1 = m1 = 32 and the 2-level cycle length is L2 = W2 = M = 1024. When the traffic load is low, the average 1-level access delay is the average cycle synchronization time L1 =2 = 16 slots and the average 2-level access delay is L2 =2 = 512 slots as shown in Fig. 11(a). The low traffic delay is independent of channel partition. The total average access time is dependent on the reference probabilities and the level access delays as described in Section 3.3.1. As the partition is varied to increase the channels assigned for 1-level communication with fixed reference probabilities, the 1-level delay decreases but the overall performance of the system may not improve. One case of X is the optimum partition for a particular reference probability set, such 24

Access Delay M=2x32

10000

Access Delay 4x32

8x32 16x32

1000

SWHA M=32x32 P={0.8,0.2} OCA

1000

32x32

FHA M=1024

SWHA C=32 P={0.8,0.2} OCA

100

C=32 C=16 C=8

10

100 0

(a)

20

40

60

80

100

120

20

140

System Throughput

40

60

80

100

120

140

System Throughput

(b)

Figure 12: A comparison of delay verses system throughput with varying characteristics. (a) SWHA with varying system size with a constant m1 = 32, C = 32, fp1 ; p2g = f0:8; 0:2g and OCA. The system size varies through the second level m2 = f2; 4; 8; 16; 32g. (b) Comparison of FHA and SWHA with variations in channels C 2 f8; 16; 32g with a constant system size of M = 1024. The SWHA has 2-level structure of M = 32 32 with fp1; p2g = f0:8; 0:2g and optimum channel partition.

Access Delay

System Throughput 80

FHA M=8 x 128

1000

C=8 M=1024

70 60

P={0.8,0.2}

50

SWHA,2-l

M=64 x 16

M=32 x 32

M=16 x 64

40 100

30 P={0.8,0.1,0.1}

SWHA,3-l

P={0.8,0.15,0.05}

M=2 x 16 x 32

10 0.001 (a)

M=8 x 128

20 FHA

10 0

0.01

0.1

Traffic Rate

0 (b)

0.05

0.1

0.15

0.2

Traffic Rate

Figure 13: Impact of variations in system configuration with M = 1024, C = 8 for 2-level and 3-level SWHA. The 2-level SWHA has fp1 ; p2g = f0:8; 0:2g and optimum channel configuration with varying configurations of fm1 ; m2 g 2 ff64 16g; f32 32g; f16 64g; f8 128gg. The 3-level SWHA has fm1; m2; m3g = f32; 16; 2g with varying locality fp1; p2; p3g 2 ff0:8; 0:15; 0:05g; f0:8; 0:1; 0:1gg and optimum channel allocation. (a) Access delay, and (b) System throughput.

25

as fp1 ; p2 g = f0:8; 0:2g, as derived in Section 2.2.3. The optimum cases result in the minimum total delay and the maximum total system capacity. The optimum partition is the case when all levels in the system have identical maximum capacities. The system saturation point depends on the saturation points of the individual levels, gsat = minfgsat;1 ; gsat;2g, as shown in Fig. 11(a). This illustrates that maximum system performance can be maintained with varying reference locality through the dynamic re-allocation of bandwidth feature of the SWHA.

Variation in system size: The performance impact with varying system sizes is examined through a two-level configuration where m1 is held constant at m1 = 32. Variation in size

is accomplished for M 2 f64; 128; 256; 512; 1024g through an increase in m2. The reference probability is fixed at fp1 ; p2 g = f0:8; 0:2g, and C = 32. An optimum channel channel partition is used for each system configuration. Fig. 12(a) plots the access delay-throughput characteristics of SWHA. As described in the previous section, the average access delay at each level is dependent on the cycle length at the levels and the reference probabilities. The 1-level cycle is constant at L1 = 32 in this example, but the 2-level cycle length L2 increases as the system size increases. Expansion results in increased average access delay if the reference probabilities remain fixed. Although C remains constant, the number of concurrently usable channels increases as the system size increases through channel re-use so a larger system has larger capacity as seen in Fig. 12(a). However, the marginal increase in capacity decreases as the system size increases for this example of fixed m1 = 32 and r = 2. For example, a 60% increase in maximum throughput occurs when the system is increased from M = 2 32 to M = 4 32 but the increase is reduced to 13% when the system is expanded from M = 16 32 to M = 32 32.

Variation in number of channels: Fig. 12(b) illustrates the significant increase in capacity and lower average delay of the SWHA compared with the FHA even though they may have an identical number of WDM channels. The delay-throughput characteristics are plotted for a system size of M = 1024 processors with varying C 2 f8; 16; 32g. The SWHA is configured as M = 3232 with reference probabilities fp1 ; p2 g = f0:8; 0:2g and optimum channel partitioning. While the maximum capacity of the FHA is fixed at ?FHA;max = C , the SWHA through spatial separation of chan# " nels attains a maximum throughput of ?SWHA;max

= m1p1 + (1 ? p1 )m2

C

p1 + (1 ? p1 )m2

.

Variation in system configuration: The impact of variations in configuration for a particular system size is considered next. Fig. 13 plots the delay and throughput of 2-level and 3level SWHA with a fixed system size of M = 1024 and C = 8. The 2-level SWHA has fp1 ; p2 g = f0:8; 0:2g and optimum channel allocation for varying configurations of fm1; m2g 2 ff6416g; f3232g; f1664g; f8128gg. The 3-level SWHA is configured as M = 21632 with varying locality fp1 ; p2 ; p3 g 2 ff0:8; 0:15; 0:05g; f0:8; 0:1; 0:1gg. System expansion should take place, for both cost and performance reasons, by first expanding the lower levels size to the maximum as determined by OPB limitations. The improvement in throughput under this policy is due to the increase of spatial re-use of lower level channels (1-level in the figure). This policy results as well in lower system complexity, upon scaling up the system, as was noted by Table 1. The benefit of larger lower level sizes is maximized when communication 26

locality is higher, since the contribution of the lower level throughput to the system throughput is higher.

4 Conclusions This paper introduced photonic-based structures for low-cost high-performance processor interconnection. Hierarchical architectures based on time, space and wavelength multiplexing were examined. The hierarchical approach preserved low cost while the combination of spatial and wavelength multiplexing, with a time multiplexed access technique, achieved excellent performance characteristics. Optical single-hop architectures were devised that eliminated all intermediate routing. In general, system scalability is hindered by the OPB, the number of channels realizable on a link, and the overall system complexity. The FHA overcomes the OPB limitation through optical amplification, but is still limited by the number of channels since all processors share the same wavelength space. The SWHA overcomes OPB and channel limitation by employing wavelength multiplexing and spatial re-use of channels at each level of the hierarchical architecture. An adaptable system results since the channels can be dynamically allocated to each level of the SWHA based on the reference locality. The SWHA has lower node complexity than the WMCH, a hypercube based structure which also employs spatial and wavelength multiplexing. The performance of the architectures were examined through analytical models and discrete-event simulation. The analytical models are based on Geom/D/1 queues and shown through a comparison to simulation to be accurate. The impact on the system performance was considered through variations in the communication locality, system size, configuration, number of channels, and channel configuration. The performance comparison between the two hierarchical structures shows that SWHA has consistently better performance than FHA both in terms of average access delay and system throughput. The optimum channel allocation for the SWHA was derived based on reference probability and shown to provide the best performance. The configuration that provided maximum channel re-use for a given system size was shown to achieve optimal performance. The FHA could not take advantage of any reference locality while the SWHA was shown to achieve significant improvements in performance through its adaptability characteristic even when there was only a weak reference locality present.

A System Optical Power Budget The optical power budget (OPB) limits the maximum number of stations based on the optical power that must be delivered to each receiver to maintain a specified bit-error-rate. The OPB is mainly determined from three factors: power coupled from source, system losses, and receiver sensitivity. The difference between the source power and the receiver sensitivity is defined as the power margin (PM ) which also must account for device characteristic variation. This paper considers star-coupled systems due to their inherent OPB advantage over optical bus based configurations [32]. Losses in an n1 n2 star-coupled system are mainly due to the coupler since fiber loss is typically 27

less than 1 dB/Km and may be ignored for local interconnection [32]. The system losses are mainly affected by the coupler insertion losses due to the following three components: power split losses (Ls ), excess losses (Le ), and connector insertion losses ( Lc ). The losses are dependent on the size and configuration of the coupler. Three schemes to construct a general n1 n2 star-coupler with logarithmic loss factors have been considered: a star through, for example, fused fibers; dual binary tree, constructed from 2 1 (integrated or discrete) passive couplers; and multistage constructed from 2 2 (integrated or discrete) passive couplers. The star configuration is typical for fiber-based couplers while both the tree and multistage configurations have been considered for integrated couplers or for constructing large arrays out of basic discrete elements. The star and multistage configurations have the same power loss for an identical square configuration. They both introduce less excess losses by log2 n factor than the tree configuration, where n1 = n2 = n. Either a star or a single binary tree configuration can be used to construct 1 n combiner and n 1 splitter stages. The total insertion losses for the star and tree coupler configurations are characterized below:

LS (n) = (3 + Le) log2 n + 2Lc (27) LS (n1; n2 ) = LS (maxfn1 ; n2 g) (28) LT (n1; n2 ) = Le log2 n1 + (3 + Le ) log2 n2 + 2Lc (29) Typical values for the loss components are Le = 1:0 dB and Lc = 0:75 dB. Connector losses of light source and receiver are assumed to be accounted for in the power margin. The star and multistage configurations have superior fanout. However, fabrication limitations may inhibit the availability of a single coupler with hundreds of ports.

References [1] C. E. Leiserson, “Fat-trees: Universal networks for hardware-efficient supercomputing,” IEEE Transactions on Computers, vol. c-34, pp. 892–901, Oct. 1985. [2] “The Connection Machine CM-5 Technical Summary”. Thinking Machines Corporation, October 1991. [3] M. Lin, R. Tsang, D. Du, A. Klietz, and S. Saroff, “Performance evaluation of the CM-5 interconnection network,” Tech. Rep. AHPCRC Preprint 92-111, University of Minnesota, Army High Performance Computing Research Center, Oct. 1992. [4] P. Haugen, S. Rychnovsky, A. Husain, and L. Hutcheson, “Optical interconnects for high speed computing,” Optical Engineering, vol. 25, pp. 1076 – 1085, Oct. 1986. [5] C. A. Brackett, “Dense wavelength division multiplexing networks: Principles and applications,” IEEE Journal on Selected Areas of Communications, vol. 8, pp. 948–964, Aug. 1990. [6] P. W. Dowd, “Optical bus and star-coupled parallel interconnection,” in Proc. 4th International Parallel Processing Symposium, pp. 824–838, Apr. 1990.

28

[7] P. W. Dowd, “Wavelength division multiple access channel hypercube processor interconnection,” IEEE Transactions on Computers, vol. 41, pp. 1223–1241, Oct. 1992. [8] P. W. Dowd, “Random access protocols for high speed interprocessor communication based on a passive star topology,” IEEE Journal on Lightwave Technology, vol. 9, pp. 799–808, June 1991. [9] J. A. Bannister, L. Fratta, and M. Gerla, “Topological design of the wavelength-division optical network,” in Proc. IEEE INFOCOM’90, pp. 1005–1013, 1990. [10] M. G. Hluchyj and M. J. Karol, “ShuffleNet: An application of generalized perfect shuffles to multihop lightwave networks,” IEEE Journal on Lightwave Technology, vol. 9, pp. 1386– 1397, Oct. 1991. [11] B. Li and A. Ganz, “Virtual topologies for WDM star LANs: The regular structures approach,” in Proc. IEEE INFOCOM’92, 1992. [12] M. S. Goodman, H. Kobrinski, M. Vecchi, R. Bulley, and J. Gimlett, “The lambdanet multiwavelength network: Architecture, applications, an demonstrations,” IEEE Journal on Selected Areas of Communications, vol. 8, pp. 995–1004, Aug. 1990. [13] K. A. Aly and P. W. Dowd, “Parallel computer reconfigurability through optical interconnects,” in Proc. International Conference on Parallel Processing, pp. I105 – I108, August 1992. [14] J.-F. P. Labourdette and A. S. Acampora, “Logically rearrangeable multihop lightwave networks,” IEEE Transactions on Communications, vol. 39, pp. 1223–1230, Aug. 1991. [15] S. Wagner and H. Lemberg, “Technology and system issues for a WDM-based fiber loop architecture,” IEEE Journal on Lightwave Technology, vol. 7, pp. 1759–1768, Nov. 1989. [16] T. P. Lee and C. E. Zah, “Wavelength-tunable and single frequency semiconductor lasers for photonic communications networks,” IEEE Communications Magazine, pp. 42 – 52, Oct. 1989. [17] M. M. Girard, C. R. Husbands, and R. Antoszewska, “Dynamically reconfigurable optical interconnect architecture for parallel multiprocessor systems,” in SPIE Proceedings (Optical Applied Science and Engineering), (San Diego, CA), July 1991. [18] K. W. Cheung, D. A. Smith, J. E. Baran, and B. L. Heffner, “Multiple channel operation of integrated acousto-optic tunable filter,” IEE Electronic Letters, vol. 25, pp. 375–376, Mar. 1989. [19] K. W. Cheung, “Acoustooptic tunable filters in narrowband WDM networks: System issues and network applications,” IEEE Journal on Selected Areas of Communications, vol. 8, pp. 1015–1025, Aug. 1990. [20] D. Lenoski, J. Laudon, K. Gharachorloo, A. Gupta, and J. Hennessy, “The directorybased cache coherence protocol for the DASH multiprocessor,” in Proc. 17th International Symposium Computer Architecture, pp. 148–159, May 1990.

29

[21] A. E. Willner, A. Saleh, H. Presby, D. DiGiovanni, and C. Edwards, “Star Couplers with Gain Using Multiple Erbium-Doped Fibers Pumped with a Single Laser,” IEEE Photonic Technology Letters, vol. 3, pp. 250–252, Mar. 1991. [22] M. Irshid and M. Kavehrad, “Star Couplers with Gain Using Fiber Amplification,” IEEE Photonic Technology Letters, vol. 4, pp. 58–60, Jan. 1992. [23] H. M. Presby and C. R. Giles, “Amplified Integrated Star Couplers with Zero Loss,” IEEE Photonic Technology Letters, vol. 3, pp. 724–726, Aug. 1991. [24] H. Kobrinski et al., “Fast wavelength switching of laser transmitters and amplifiers,” IEEE Journal on Selected Areas of Communications, vol. 8, pp. 1190–1202, Aug. 1990. [25] K. Bogineni, K. M. Sivalingam, and P. W. Dowd, “Low complexity multiple access protocols for wavelength-division multiplexed photonic networks,” IEEE Journal on Selected Areas of Communications, vol. 11, May 1993. [26] P. W. Dowd and K. Bogineni, “Switching latency overlap techniques for WDM star-coupled media access protocols,” in Proc. IEEE INFOCOM’93, pp. 65–74, Apr. 1993. [27] B. Mukherjee, “Architectures and protocols for WDM-based local lightwave networks Part I: Single-hop systems,” IEEE Network, pp. 12–27, May 1992. [28] B. W. O’Krafka and A. R. Newton, “An empirical evaluation of two memory-efficient directory methods,” in Proc. 17th International Symposium Computer Architecture, pp. 138– 147, May 1990. [29] C. Dragone, “Efficient N N star couplers using fourier optics,” IEEE Journal on Lightwave Technology, vol. 7, pp. 479–489, March 1989. [30] J. R. Goodman and C. H. Sequin, “Hypertree: A multiprocessor interconnection topology,” IEEE Transactions on Computers, vol. C-30, pp. 923–933, Dec. 1981. [31] A. M. Law and W. D. Kelton, Simulation Modeling and Analysis. McGraw Hill, 1991. [32] M. M. Nassehi, F. A. Tobagi, and M. E. Marhic, “Fiber optic configurations for local area networks,” IEEE Journal on Selected Areas of Communications, vol. SAC-3, pp. 941–949, Nov. 1985.

30