IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 20, NO. 2, FEBRUARY 2012
Design and Implementation of Backtracking Wave-Pipeline Switch to Support Guaranteed Throughput in Network-on-Chip
Phi-Hung Pham, Student Member, IEEE, Jongsun Park, Member, IEEE, Phuong Mau, and Chulwoo Kim, Senior Member, IEEE
Abstract—It is a challenging task in a network-on-chip to design an on-chip switch/router to dynamically support (hard) guaranteed throughput under very tight on-chip constraints of power, timing, area, and time-to-market. This paper presents the design and implementation of a novel pipeline circuit-switched switch to support guaranteed throughput. The proposed circuit-switched switch, based on a backtracking probing path setup, operates with a source-synchronous wave-pipeline approach. The switch can support a dead- and live-lock free dynamic path-setup scheme and can achieve high bandwidth and high area and energy efficiency. A silicon-proven prototype of a 16-bit-data 5-bidirectional-port switch in a four-metal-layer 0.18-μm CMOS standard-cell technology can yield an aggregate data bandwidth of up to 73.84 Gb/s, while occupying only a modest area of 0.0315 mm². The synthesizable implementation of the proposed switch also results in a cost-effective design, fast development time, and portability.
Index Terms—Backtracking, circuit-switched, dynamic path-setup, guaranteed throughput, network-on-chip (NoC), on-chip switch, source synchronous, wave-pipeline.
I. INTRODUCTION
As the complexity of systems-on-chips (SoCs) increases, the network-on-chip (NoC) is being adopted as a scalable communication-centric solution for integrating numerous on-chip components [i.e., subblocks, processing elements (PEs), and intellectual properties (IPs)] [1]. One of the most difficult tasks in NoC design is guaranteeing throughput (or providing a quality-of-service (QoS) mechanism) of the traffic offered by the on-chip components, particularly with the very limited budget of the on-chip resources [2]–[4]. The requirements of the performance metrics for certain traffic classes, such as data loss, data rate (throughput), delay, and delay jitter, need complex QoS mechanisms involved in many levels of abstraction of NoC design (e.g., application and system level, transport level, or network level) [4]. Many NoC-based applications used in hard real-time systems, as well as possible future NoC usages, demand a hard QoS (strong guaranteed throughput)
Manuscript received April 08, 2010; revised July 24, 2010, November 04, 2010; accepted November 20, 2010. Date of publication December 30, 2010; date of current version January 18, 2012. This work was supported by the Korea government (MEST) through the Korea Science and Engineering Foundation (KOSEF) under Grant R0A-2007-000-20059-0. The authors are with the Department of Electronics and Electrical Engineering, Korea University, Seoul, 136-713, South Korea (e-mail:
[email protected];
[email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TVLSI.2010.2096520
requirement [4]–[6]. An efficient and proper design of on-chip switching node (i.e., the switch or the router) is needed with an adequate QoS property, while keeping compatibility with on-chip implementation (e.g., limited power and area overhead or clocking condition) to ensure the feasibility of such a hard QoS mechanism [3]. For clarity in this paper, the term “QoS property” is used to imply the property of a switching node, such as (low) fall-through latency, (high) throughput, being lossless, (no) data jitter, and ordered data delivery, that directly supports the specified QoS requirements. The two main themes in the literature for switching mechanisms used in the design of NoC routers are circuit switching and packet switching (at the granularity of packets or flits) that impact greatly on the implementation overhead and the QoS property.
A. Guaranteed Throughput Implementation With the Packet-Switching Approach
A time-division multiplexing (TDM) time slot [2], [7] and “logical lanes” (virtual circuits with priority) [8]–[10] solutions are used for the worst-case scenarios in most practical packet-switched NoCs to provide guaranteed data service. These solutions set up (or reserve) the dedicated path through a packet-based NoC for contention-free and deadlock avoidance of guaranteed data packets. The TDM approach faces difficulty in the management of huge time-slot tables at guaranteed service routers for contention-free communication, especially when the systems are scaled [2]. Restriction of the routing function for deadlock-free data transfer in the virtual circuits with a priority approach may lead to throughput degradation in packet-switched NoCs [9], [10]. Moreover, the implementation of queuing buffers in the packet-switched routers dramatically increases the cost in terms of area required and power consumption [6], [9].
B. Guaranteed Throughput Implementation With the Circuit-Switching Approach
The “pure” circuit-switching approach is favored to provide hard guaranteed throughput due to its attractive QoS property, once a circuit is set up [5], [6], [11]. After this setup, end-to-end data can be pipelined in order at the full rate of the dedicated links with low delay, no data jitter, and in a lossless manner (i.e., without data dropping) due to there being no collisions among the data streams. Importantly, without queuing buffers and complex routing/arbitrating implementation, the
1063-8210/$26.00 © 2010 IEEE
PHAM et al.: DESIGN AND IMPLEMENTATION OF BACKTRACKING WAVE-PIPELINE SWITCH TO SUPPORT GUARANTEED THROUGHPUT IN NOC
circuit-switched router results in a low-cost (i.e., area, power) design suitable for the limited on-chip budget [5], [6], [11]. However, a path-setup scheme used in circuit-switched NoCs needs critical considerations for proper functioning (i.e., dead- and live-lock free) and low-latency setup with minimization of introduced hardware overhead. Moreover, the dynamic and distributed feature of the path-setup in the circuit-switched NoC is also mandatory to ensure system flexibility and scalability in dynamic management (allocation) of the guaranteed communication circuits.
This work advocates the guaranteed throughput implementation with the “pure” circuit-switching approach due to the compact implementation of routers suitable for the on-chip environment and an intrinsic hard QoS property after a circuit has been set up. A novel, practical pipeline circuit-switched switch design is proposed, termed backtracking wave-pipeline switch (or BW switch), to support on-chip hard guaranteed throughput applications. The proposed BW switch can meet the requirements of system flexibility and scalability in managing the circuits by supporting a dynamic path-setup scheme in distribution. It is also highly suited to end-to-end source-synchronous multi-Gb/s data transfer crossing on-chip multiclock domains.
The remainder of this paper is organized as follows. Section II outlines the motivation for the proposal of the BW switch design and highlights the main contributions of the work. Section III presents the proposed BW switch architecture and its design with a switch-by-switch interconnection scheme that can support a backtracking path-setup and a source-synchronous wave-pipeline transmission of the data. Section IV addresses the benefit of the backtracking feature by analyzing the property and the network-level performance of the backtracking path-setup supported by the BW switch.
Section V deals with the implementation issue of a BW switch prototype, including implementation, verification, and its timing property with regard to different operating conditions and technology scale. Section V also validates the proposed BW switch by a proof-of-concept test chip with measurement results. Finally, the conclusion and discussion for further research are given in Section VI.
II. MOTIVATION AND CONTRIBUTION
The relevant design issues of a pipeline circuit-switched switch are considered from both the network-level and the implementation viewpoints. These lead to the key contributions of this paper. First, as introduced in Section I, a path-setup scheme is critical for circuit-switched NoC. The path-setup scheme (or path configuration) in circuit-switched NoC can be classified as static (at design time or at system boot up time) [11] or dynamic (at runtime) approaches [5], [6], [12]. The static path-setup approach lacks flexibility and scalability [11], while the dynamic path-setup scheme is flexible and more favored in some circuit-switched NoCs [5], [6]. The mesh-based SoCBUS presented in [5] can dynamically set up a path in distribution, but faces high path-setup latencies, because the occupied channels block the setup of a new path. The work in [6] proposes to use a supplemental packet-switched best-effort (BE) network for the delivery of the path configuration from a central control node (CCN). However, the use of a CCN limits system scalability, and
the use of a BE network may result in quite high path-setup latencies, as stated in [6]. At a neutral point, a hybrid packet-/circuit-switching architecture, as proposed in [12], requires an additional packet-switched network for circuit-setup and backpressure signaling, thus increasing overall system cost and complexity [12]. Moreover, if congestion of the setup headers occurs, the preferred circuit-switching mode has to fall back to packet-switching mode [12]. This degrades the QoS property of the circuit-switched data flows. Furthermore, no design efforts for dead-/live-lock free delivery of the setup headers are clearly mentioned in previous works [6], [12]. In a pipelined circuit-switching scheme, the data do not immediately follow the header into the network. Hence, it is first observed that it is beneficial if the setup header can flexibly backtrack (under a backtracking routing algorithm) to search for available alternative paths rather than wait (or queue) until the blocking channels become available [13]. Such a backtracking-based path-setup scheme can be implemented in distribution to easily scale the system without the need of an additional control network.
Second, the implementation of NoCs without global synchronization schemes, e.g., mesochronous or asynchronous, where inter-clock domain data transfers are needed, has become common in practice [9]–[11], [14]. In such schemes, source-synchronous¹ wave-pipeline data transfers with transceiver designs are efficiently used to handle multiple-clock-cycle multi-Gb/s transmissions in global (inter-router) links [9], [11], [14]–[17]. An early study presented in [16] reported that the on-chip wave-pipeline approach, used in global interconnections, offers high performance, while using a smaller area and being more energy-efficient than the pipeline approaches using latches or flip-flops.
In the design of a pipeline circuit-switched switch (or router), a separate implementation between the data path and the control part is feasible, since, after the path is set up, data can be directly pipelined from source to destination in a control-free manner. From this point, it can be observed that the design of a pipeline circuit-switched switch, in which its intra-switch data path allows direct-forwarding (i.e., wave-pipelining) of the source-synchronous data from the links, can result in the required latency/throughput property of an on-chip end-to-end path close to that of a dedicated interconnection. The above observations are the motivation for using both backtracking and wave-pipeline techniques for the proposed BW circuit-switched switch, to support on-chip hard guaranteed throughput. The main contributions of this paper are given here. 1) The proposed technique supports a dynamic dead- and livelock free path-setup scheme in distribution by backtracking the setup header (probe header), without the need of a central control node or any additional network as in previous studies [6], [12]. 2) The work in this paper provides a low fall-through latency2 and high multi-Gb/s bandwidth by direct-forwarding (i.e., wave-pipelining) of source-synchronous data suited to end-to-end source-synchronous data transfer. 1A source-synchronous data transfer denotes that a source clock (or strobe) from the sender is transmitted along with (or imbedded in) the data signals. 2The term “fall-through latency” denotes the data pipeline latency from an input to an output of the circuit-switched switch, once the path is set up.
TABLE I FORMAT OF SWITCH-BY-SWITCH HANDSHAKE
3) We present a low-cost silicon-proven prototype of the BW switch implemented rapidly in a standard-cell-based design flow for proof-of-concept demonstration.
III. BACKTRACKING WAVE-PIPELINE SWITCH ARCHITECTURE
Here, we propose the backtracking wave-pipeline switch architecture for use under a torus topology. The torus topology is chosen, as the folded torus, a laid-out version of the torus, can fit the tile-based NoC implementation in a conventional 2-D chip, and its good path diversity is naturally suited to a path-setup scheme with backtracking. The operation of end-to-end communication with the proposed probing path-setup scheme is explained from a network-transaction viewpoint before detailing the proposed BW switch architecture.
A. End-to-End Flow-Control Operation With Backtracking Probing Path-Setup Scheme
A typical end-to-end communication, as used in the pipelined circuit-switching approach, is described to provide a clear description of the backtracking probing path-setup scheme used with the BW switch. Communication includes three basic phases: path-setup (or probing), transmission, and release. In the path-setup phase, a probe header containing the destination address is sent from the source towards the destination to set up a communication circuit. When congestion occurs, the probe header needs to backtrack and search for alternative links, instead of waiting for a busy link to become idle. That is, the path is set up under a backtracking probing path-setup scheme. When the probe header reaches its destination, an ACK signal is returned to the source. Then, in the transmission phase, the source starts to transmit source-synchronous data through the established path to the destination. The source clock, pipelined with the data, is used to keep the timing when crossing on-chip multi-clock domains. In the release phase, the circuit is released in a hop-by-hop fashion from the source to the destination.
A compact switch-by-switch handshake is proposed to support such end-to-end communication with the probing path-setup scheme. Table I defines the bit format used. Fig. 1 illustrates the inter-switch and switch-wrapper interconnections with this handshake. Each switch has five bidirectional ports: four ports are connected to corresponding neighboring switches, and the remaining port is connected to the on-chip IP through a wrapper. According to this handshake scheme, one bit is used for the Request (Req) signal to denote the on-probing state (circuit request) and the circuit idling state. Two bits are used for the Answer (Ans) signal. This has one of three statuses to direct the backpressure flow-control to upstream switch. An Ans status of “01” denotes that the receiver is ready to accept data from the sender, whereas a status of “10” denotes that the intended path
Fig. 1. Switch-by-switch interconnection scheme.
is blocked in the network, forcing the probe header to backtrack to discover possible alternative paths. An Ans status of “11” denotes that the receiver is not ready to receive data (e.g., due to being busy, or having an overflow at the receiving buffer).
The use of the switch-by-switch handshake is illustrated through examples of end-to-end flow-control operations, as shown in Fig. 2. Fig. 2(a) denotes an end-to-end communication example where a successful path-setup without backtracking occurs. During the setup and the transmission phases, Req is set to “1.” While Ans is “00,” the probe header continuously advances until it reaches the destination. Then, the destination returns Ans = “01” to the source wrapper. When the source wrapper receives “01”, it immediately starts to transmit the pipelined data. Then the source wrapper sets Req to “0” to denote the release phase, immediately after the last data are sent. Fig. 2(b) shows an example similar to the one in Fig. 2(a), where backtracking occurs in the setup phase. In this example, due to a blocked link, an intermediate switch feeds back Ans = “10” to its upstream switch to force backtracking. Fig. 2(c) and (d) illustrate the use of this Ans signal when a path-setup fails, and then a retry is issued. In the example of Fig. 2(c), when a probe header reaches a busy destination, Ans = “11” is sent backward to the source wrapper and the occupied links are released accordingly. In the example of Fig. 2(d), if there are no available paths leading to the destination, Ans = “10” (i.e., Network Blocked) is fed back to the source wrapper. Regarding a dynamic setup scenario, “10” indicates the current network condition (i.e., blocking) to the sender. In cases with “11” (Busy Dest.) and “10” (Network Blocked), the source wrapper under control of a high-level protocol³ (e.g., scheduling) needs to iterate (regulate) a retry appropriately (e.g., after a waiting time).
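The handshake values described above can be summarized in a small sketch. It only restates the Req/Ans encoding given in the text; the enum member names, the function name `source_wrapper_action`, and the returned action strings are illustrative assumptions, not identifiers from the paper.

```python
from enum import Enum


class Ans(Enum):
    """2-bit Answer codes from the text; member names are assumed labels."""
    NO_ANSWER = 0b00        # probe header is still advancing
    ACK = 0b01              # receiver is ready to accept data
    NETWORK_BLOCKED = 0b10  # intended path is blocked, probe must backtrack
    BUSY_DEST = 0b11        # receiver is not ready (busy or buffer overflow)


def source_wrapper_action(req: int, ans: Ans) -> str:
    """Reaction of the source wrapper to the Ans value it observes.

    req is the 1-bit Request signal: 1 during setup and transmission,
    0 when the circuit is idle or being released.
    """
    if req == 0:
        return "idle"
    if ans is Ans.NO_ANSWER:
        return "wait"               # keep Req = 1 while the probe advances
    if ans is Ans.ACK:
        return "transmit"           # start sending source-synchronous data
    # "10" and "11" both end the attempt; a higher-level protocol
    # decides when to retry (e.g., after a waiting time).
    return "release_and_retry"
```

For instance, `source_wrapper_action(1, Ans(0b01))` yields `"transmit"`, matching the case of Fig. 2(a), while both failure codes lead to a release followed by a retry as in Fig. 2(c) and (d).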
In summary, from a transaction-level viewpoint, this subsection has introduced the concept of the probing path-setup scheme working with the compact switch-by-switch handshake. This is used in the proposed BW switch presented in the next subsection.
³Discussion of this high-level protocol is beyond the scope of this paper.
B. Proposed Switch Architecture and Design
As motivated in Section II, the key targets of the proposed BW switch architecture are to support the backtracking probing path-setup scheme and to allow direct-forwarding of source-synchronous data transmissions. As for the backtracking feature of the path-setup scheme, some design issues are first considered. Among varieties of backtracking protocol mentioned in [13], e.g., the Exhaustive
Fig. 2. Examples of the end-to-end flow-control operation where the switch-by-switch handshake is used. (a) Successful path-setup without backtracking. (b) Successful path-setup with backtracking. (c) Failed path-setup due to busy destination, and a retry. (d) Failed path-setup due to all possible paths are blocked, and a retry.
Misrouting Backtracking (EMB), the k-family protocol, and the Exhaustive Profitable Backtracking (EPB), we propose to use EPB to reduce on-chip implementation complexity. Moreover, the proposed EPB-based path-setup establishes only minimum paths that result in energy-efficient data transfer. The EPB-based probing path-setup performs a straightforward depth-first search of the network using only profitable links. It does not repeatedly search the same path, and guarantees to find a minimal path if one exists. Regarding the design of the non-repeated searching feature, a critical consideration is how to store the history information used for backtracking, i.e., keeping in the probe header, or distributed storing in the switching nodes? The former method significantly increases probe header size, and, consequently, increases the required processing time to route the probe header through the network. It is particularly a problem when the number of links traversed during the path setup becomes very high. Therefore, the latter method is selected, in which the history information is distributed throughout the switching nodes of the network, to reduce the probe header size. Since the history information for backtracking is stored in switches, the probe header contains only the destination address, e.g., 6 b for a 64-node network. The probe header is handled to move forward or backward according to the control signals in switch-by-switch handshake (as denoted in Table I). The incoming probe header can be transported through the data path to save the wiring costs due to the separation between the setup phase and the data transmission phase. The BW switch architecture is proposed, as shown in Fig. 3, based on these above considerations. These proposed switches can be networked together, according to the interconnection scheme mentioned in Fig. 1, to construct a guaranteed throughput lane in tile-based NoCs. 
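The EPB-style search described above can be sketched in software as a depth-first probe that uses only profitable links and keeps its history in the switches rather than in the header. This is a behavioral sketch under stated assumptions, not the hardware design: the function names, the coordinate model of a k × k torus, the `busy` set of occupied directed links, and the choice to clear a switch's history when the probe backtracks out of it are all assumptions made for illustration.

```python
def profitable_moves(cur, dst, k):
    """Neighbors of `cur` that lie on a minimal path to `dst` in a k x k torus."""
    moves = []
    for axis in (0, 1):
        fwd = (dst[axis] - cur[axis]) % k   # hops when increasing the address
        if fwd == 0:
            continue                        # already aligned on this axis
        steps = []
        if fwd <= k - fwd:
            steps.append(+1)
        if k - fwd <= fwd:
            steps.append(-1)                # both directions tie when fwd == k/2
        for step in steps:
            nxt = list(cur)
            nxt[axis] = (cur[axis] + step) % k
            moves.append(tuple(nxt))
    return moves


def epb_setup(src, dst, k, busy):
    """Depth-first probe from src to dst; returns the reserved path or None.

    busy: set of (from_node, to_node) links that are already occupied.
    """
    path = [src]
    tried = {}  # history kept per switch, not carried in the probe header
    while path:
        cur = path[-1]
        if cur == dst:
            return path
        history = tried.setdefault(cur, set())
        for nxt in profitable_moves(cur, dst, k):
            if nxt in history:
                continue                    # never re-probe the same output
            history.add(nxt)
            if (cur, nxt) in busy:
                continue                    # blocked link: try the next output
            path.append(nxt)                # reserve the link and advance
            break
        else:
            path.pop()                      # all outputs exhausted: backtrack
            tried.pop(cur, None)            # switch returns to idle (assumed)
    return None                             # "Network Blocked" at the source
```

Because every profitable move reduces the remaining distance, the probe cannot cycle, and the search either reaches the destination over a minimal path or returns to the source with no path left to try.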
1) Overall Architecture and Operation: The proposed switch architecture (Fig. 3) has the following main components that can be divided into two function groups.
• The data path includes CROSSBAR with internal transceivers to support direct-forwarding (wave-pipelining) of the source-synchronous data.
Fig. 3. Proposed BW switch architecture.
• The control part includes Ctrl Ins, Ctrl Outs, and ARBITER. Each pair of Ctrl In and Ctrl Out performs handshaking activity at a certain bi-directional port, namely, North, East, South, West, and IP.
Two kinds of clock are used in the BW switch (not shown in Fig. 3 for clarity): the probing clock fed into components of the control part and the pipeline clock (i.e., source clock) appearing in each direct-forwarding connection of the data path.
a) Function Blocks: The Ctrl Ins are in charge of processing the incoming probe headers from upstream switches or from the wrapper (IP). When an incoming probe header arrives at an input (with Req = “1”), the corresponding Ctrl In monitors the output status through the Monitor bus and requests ARBITER to grant it access to the desired Ctrl Out through the internal Request bus. Based on the output status or the feedback from ARBITER placed on the Grant & Answer bus, the Ctrl In operates appropriately and replies to the upstream switch through its Ans In.
Fig. 4. Simplified state diagram for FSM implementation of Ctrl In to support backtracking probing operation.
The ARBITER has two roles: first, for cross-connecting Ans Out signals back to Ctrl Ins, and second, as a referee for requests from Ctrl Ins. If a requesting Ctrl In is accepted, the first role of ARBITER keeps Ans signal from the requested Ans Out being connected back to the requesting Ctrl In through the Grant & Answer bus. By this way, in case of backtracking, the ARBITER can direct the probe header to backtrack to the corresponding input from which it arrived (reserved) previously. Contention resolution is required when several Ctrl Ins, upon observing the Monitor bus, request the same idling output in the same probing clock cycle. The second role of ARBITER (based on a static priority rule) allows only one request to be accepted, while the remainders are answered with “Network Blocked”. By receiving this answer, the requesting Ctrl In continues probing other outputs or returns a “Busy Dest.” to its corresponding upstream switch, depending on which probed port is the direction port (i.e., N, E, S, and W) or the IP port, accordingly. The Ctrl Outs, based on the command from the ARBITER placed in the Control bus, make requests to the downstream switches and control the CROSSBAR. When locked with a specific selecting value, the Ctrl Out handles the CROSSBAR to establish a direct connection from the Data In to the target Data out. The CROSSBAR has two functions, providing connection for probe headers in the setup phase and (with internal transceivers) direct-forwarding pipelined data in the transmission phase, under the control of Ctrl Outs. b) Operation: Fig. 4 depicts a simplified FSM diagram of the operation of Ctrl In with its states to support backtracking probing operation. The Ctrl In is triggered by the rising edge of the probing clock when processing an arriving probe header. Meanwhile, the ARBITER is triggered by the falling edge of the probing clock. 
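The refereeing role of the ARBITER can be illustrated with a short sketch. The text only states that a static priority rule is used; the specific rule below (the lowest-numbered input wins) and the names `arbitrate`, `"GRANT"`, and `"NETWORK_BLOCKED"` are assumptions chosen for illustration.

```python
def arbitrate(requests):
    """Resolve same-cycle requests for outputs with a static priority.

    requests: dict mapping input port index -> requested output port index.
    Returns a dict mapping each input to "GRANT" or "NETWORK_BLOCKED".
    Assumed priority rule: the lowest-numbered input wins a contested output.
    """
    granted_outputs = set()
    result = {}
    for inp in sorted(requests):            # fixed order = static priority
        out = requests[inp]
        if out in granted_outputs:
            result[inp] = "NETWORK_BLOCKED"  # loser keeps probing elsewhere
        else:
            granted_outputs.add(out)
            result[inp] = "GRANT"
    return result
```

With two inputs contending for one output, `arbitrate({4: 1, 3: 1})` grants the output to input 3 and answers input 4 with "Network Blocked", so that input 4 can move on to its next candidate output in the following probing clock cycle.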
The Ctrl Out is simply implemented as a retiming stage for the control signals from the ARBITER and for handshake signals (i.e., Req and Ans) with the downstream switch. This implementation basis in the BW switch ensures that an arriving probe header is always processed in every probing clock cycle. For example, assume that a probe header arrives at input 4. If no desired profitable outputs are available, the Ctrl In4 will immediately (after one probing clock cycle) change from the Probing state into the Backtrack state to force the probe header to backtrack. Then, the probe header backtracks to the upstream switch through its reserved link and
releases this link in the next probing clock cycle. If there are two desired outputs available, namely, Out1 and Out2, the Ctrl In4 will try to request Out1 first. If there is a case in which, another Ctrl In (e.g., Ctrl In3) tries to compete against the Ctrl In4 for the Out1 simultaneously, under refereeing by ARBITER, the Ctrl In4 fails to gain access to Out1. In this case, the Ctrl In4 will continue to probe Out2 in the next probing clock cycle. For a probe header returned from the downstream switch, depending on the feedback Ans (e.g., “Busy Dest.”, “Network Blocked” or “ACK”), the Ctrl In processes it accordingly, based on the next probing clock event. That is, under the BW switch operation, an arriving probe header is dynamically processed at every probing clock event, without waiting for occupied resources (links) to become available. The following subsections present the details of the Ctrl In and the CROSSBAR with internal transceivers. These are key components directly related to the design targets of the BW switch. 2) Design of Ctrl In: The Ctrl In is the key component to perform the backtracking probing task. This includes functions, such as processing history information of backtracking and dynamically constructing a table of possible output ports for probing (i.e., route-probing table). As shown in Fig. 4, when a request with the incoming probe header arrives, Ctrl In goes into the Probing state, compares the current switch address and the destination address (i.e., performs address decoding) to find possible outputs for probing. Based on history information of backtracking, current availability of the output ports, and/or the feedback from the downstream switches, Ctrl In may change into the ACK, nACK or Backtrack states, correspondingly. These operations of Ctrl Ins consistently constitute the backtracking probing path-setup scheme supported by BW switches throughout each guaranteed throughput lane of the NoC. 
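The address comparison performed in the Probing state can be sketched arithmetically. This is a stand-in for the bit-level design of Table II, not a reproduction of it: the function names, the assumption that "clockwise" means increasing address along a ring, and the use of modular subtraction instead of bit-level comparators are all illustrative choices. The mapping of turning directions to ports and the rule that no turn in either axis means arrival follow the text of the address-decoding discussion.

```python
def turning_direction(cur: int, dst: int, k: int = 8):
    """Return (clockwise, counterclockwise) flags for one axis of a k-node ring.

    Assumption: clockwise corresponds to increasing address modulo k.
    Both flags are set when the two directions are equally short.
    """
    fwd = (dst - cur) % k            # hops when moving clockwise
    clockwise = 0 < fwd <= k - fwd
    counterclockwise = fwd > 0 and k - fwd <= fwd
    return clockwise, counterclockwise


def profitable_ports(cur_xy, dst_xy, k: int = 8):
    """List the profitable output ports for a probe header at switch cur_xy."""
    cw_x, ccw_x = turning_direction(cur_xy[0], dst_xy[0], k)
    cw_y, ccw_y = turning_direction(cur_xy[1], dst_xy[1], k)
    ports = []
    if cw_x:
        ports.append("East")
    if ccw_x:
        ports.append("West")
    if cw_y:
        ports.append("North")
    if ccw_y:
        ports.append("South")
    # No turn in either axis means the probe header has reached its destination.
    return ports or ["IP"]
```

For example, in an 8 × 8 torus a header at (0, 0) addressed to (7, 0) is directed "West" through the wrap-around link, while a header addressed to (4, 2) may use either "East" or "West" in the X-axis because both directions are four hops long.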
Some main design aspects are considered to optimize Ctrl In, as follows: history information processing and address decoding.
a) History information processing: An appropriate design of the history information processing is important to ensure the nonrepeated search feature of the path-setup, while minimizing the processing time and the power overhead. The Probing state presented in Fig. 4 is simplified for clarity. In the actual design of each Ctrl In of the BW switch, the Probing state includes several states. These mark the process of output probing in a static order. Fig. 5 illustrates this design scheme with an example of probing activity. This example assumes that a probe header arrives at Input 3 (South) of the switch. The Ctrl In3, after decoding the address and observing the outputs, finds that there are possible routes to Outs 1, 2, and 4. The probing process will take place over these outputs in the static probing order (Fig. 5). If, after trying all the possible outputs in an orderly fashion, it still receives the “Network-Blocked” response, Ctrl In goes into the Backtrack state and releases the probed output. This mechanism avoids re-probing the same output at each arrival of the probe header. Therefore, the proposed backtracking probing path-setup scheme does not repeatedly search the same path. In this processing of history information, the FSM-based design can encode all the necessary probing states, while minimizing the number of power-hungry flip-flops and the processing time.
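The way the probing states double as history information can be sketched as a small state machine. The state names follow Fig. 4 (Probing, ACK, nACK, Backtrack); the class name `CtrlIn`, the method `step`, the answer strings, and the use of a list index in place of encoded FSM states are assumptions made for illustration.

```python
class CtrlIn:
    """Behavioral sketch of one Ctrl In probing outputs in a static order."""

    def __init__(self, candidates):
        self.candidates = list(candidates)  # profitable outputs, static order
        self.next_idx = 0                   # history: earlier outputs were probed
        self.state = "PROBING"

    def step(self, answer=None):
        """Advance on one probing clock event.

        answer is None for a newly arrived probe header, or the Ans value
        returned for the output probed last. Returns the output to probe
        next, or None when the FSM leaves the Probing state.
        """
        if answer == "ACK":
            self.state = "ACK"
            return None
        if answer == "BUSY_DEST":
            self.state = "NACK"
            return None
        # New header or "NETWORK_BLOCKED": move on to the next untried output.
        if self.next_idx < len(self.candidates):
            out = self.candidates[self.next_idx]
            self.next_idx += 1
            return out
        self.state = "BACKTRACK"            # every candidate was tried once
        return None
```

Taking the example of Fig. 5, a `CtrlIn([1, 2, 4])` instance probes Out 1, then Out 2, then Out 4 as successive "Network Blocked" answers arrive, and enters the Backtrack state on the next one without probing any output twice.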
Fig. 5. History information processing with FSM implementation.
In the proposed BW switch design, five flip-flops are used to encode all the necessary states (including probing states) in each Ctrl In.
b) Address Decoding Task: The address-decoding task is important for dynamic construction of a route-probing table in each Ctrl In. With regard to the implementation of EPB, this table must contain only profitable ports for minimum paths leading to the destination. A compact design of the address-decoding task is essential to minimize the critical time of the probing process. In the BW switch, a bit-level design method is applied for address decoding to achieve compactness [18]. Table II presents the details of the bit-level address-decoding algorithm based on the following assumptions. Regarding the X- and Y-axes in a 2-D 8 × 8 torus network, the following is assumed.
• The destination address (contained in the probe header) consists of an X-address and a Y-address.
• The intermediate switch address (statically assigned to switches at design time) likewise consists of an X-address and a Y-address.
In each axis, the address-decoding scheme returns the “turning direction” values, i.e., clockwise and counterclockwise, as shown in Fig. 6. These values show the desired outputs for the minimum paths leading to the destination. In the X-axis, the following is true:
• “clockwise” denotes “East”;
• “counterclockwise” denotes “West.”
In the Y-axis, the following is true:
• “clockwise” denotes “North”;
• “counterclockwise” denotes “South”.
Additionally, we define the bit-level designs of the comparison of two binary numbers according to two procedures: Procedure Equal (the two numbers are equal) and Procedure Smaller (the first number is smaller than the second).
Fig. 6. Binary addressing scheme and “turning-direction” values in each axis of a 2-D 8 × 8 torus network.
TABLE II BIT-LEVEL DESIGN OF ADDRESS DECODING SCHEME
Table II shows the bit-level design of the “turning direction” algorithm for the X-axis. This results in the corresponding values of clockwise and counterclockwise. Likewise, a bit-level design similar to the one in Table II is used to find the “turning direction” in the Y-axis with the corresponding Y-address input values. There being no “turns” in both axes denotes that the probe header has reached the destination. The combination of the “turning direction” values of the X- and Y-axes dynamically constructs a list of possible routes. Based on this list, the history information, and the output status, a route-probing table containing only profitable ports is dynamically constructed for the probing activity in each Ctrl In.
3) Design of CROSSBAR With Internal Transceivers: The CROSSBAR with internal transceivers is the key component to perform the wave-pipelining of source-synchronous data. Regarding the layered design concept in the NoC paradigm [1], [4], the router/switch and the transceiver (with inter-router link) can be designed independently. They can cooperate in NoCs, provided the interface between them is defined. As introduced in Section II, the design of source-synchronous
Fig. 7. Internal transceiver (gray boxes) in use with CROSSBAR.
transceivers (with wave-pipelined links) becomes common practice. It is studied in [14]–[16] to improve energy efficiency and data rate and to combat PVT variations, random mismatch, and crosstalk. In these transceivers, the received data can be realigned with the received (source) clock (as in circuit-switched NoC), or with a local (router) clock (often with synchronizing first-in first-out (FIFO), as in packet-switched NoC) [15]. In cooperating with a circuit-switched switch, like the BW switch, the realignment of data to the source clock can be applied. Regarding this scheme, the common interface between the switch and the source-synchronous transceiver is the data (from data registers) and source clock signals. In the BW switch, the concept of wave-pipeline is illustrated in the sense that it allows direct-forwarding of the source-synchronous data (i.e., data along with source clock) from its inputs to the corresponding outputs. In other words, the crossbar of the BW switch can be considered to be a set of intra-switch links that requires internal transceivers to work with a source-synchronous direct-forwarding scheme. A simple internal transceiver design (with Tx/Rx circuits) is used for direct-forwarding of intra-switch source-synchronous data, as shown in Fig. 7. In a practical implementation of NoC, this internal transceiver may be integrated as part of the source-synchronous (external) transceiver due to the common use of register-based data alignment. For instance, assuming that the internal transceiver is combined with the source-synchronous transceiver proposed in [15], the data registers of internal Tx become those of external Rx and vice versa. In this case, the difference is one inverter included into the internal source clock wire to form an internal source-synchronous transmission scheme (Fig. 7). The CROSSBAR is designed as a cross-connection matrix implemented with output multiplexers (Fig. 
8), in which each multiplexer is controlled by its corresponding Ctrl Out. In combination with the internal transceivers, this cross-connection structure is capable of direct-forwarding a source-synchronous transmission of an RZ clock along with data pipelined through the switch, as illustrated in Fig. 8. The structure of the MUX-based CROSSBAR with the internal transceivers can be easily implemented in a standard cell-based design flow.

In summary, the proposed BW switch architecture and its design can support the EPB-based probing path-setup scheme. It allows direct-forwarding (i.e., wave-pipelining) of source-synchronous data. Moreover, this proposed design also suggests a synthesizable implementation in a STD cell-based design flow.

Fig. 8. MUX-based crossbar structure supporting data wave-pipelining.

IV. BACKTRACKING PROBING PATH-SETUP SCHEME

As discussed in Section II, the path-setup scheme is essential and directly affects the overall performance of the circuit-switching approach. An analysis at the network level, performed in previous work [19], confirmed the good performance of the backtracked routing circuit-switched NoC with the torus topology under certain communication patterns. In particular, in communications with larger packets, the data transmission duration can be long. This outweighs the setup delay overhead, hence improving the overall network performance. This section focuses on analyzing the properties and the network-level performance of the proposed EPB-based probing path-setup scheme supported by the BW switch.
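The amortization argument above can be made concrete with a back-of-the-envelope calculation. The sketch below is not taken from the paper: the function name `transfer_efficiency` and the 20-cycle setup latency are hypothetical values chosen for illustration, and only the 400-probe-cycle transmission duration echoes the medium packet duration assumed in the simulations of this section.

```python
def transfer_efficiency(setup_cycles: float, data_cycles: float) -> float:
    """Fraction of a connection's lifetime spent transmitting data rather than probing."""
    if setup_cycles < 0 or data_cycles <= 0:
        raise ValueError("setup_cycles must be >= 0 and data_cycles must be > 0")
    return data_cycles / (setup_cycles + data_cycles)


# Hypothetical setup latency (in probe cycles); not a value reported in the paper.
SETUP_CYCLES = 20

for data_cycles in (40, 400, 4000):
    eff = transfer_efficiency(SETUP_CYCLES, data_cycles)
    print(f"data = {data_cycles:5d} probe cycles -> efficiency = {eff:.3f}")
```

Under these assumptions, the longer the data transmission phase, the smaller the share of time lost to path setup, which is the qualitative effect the paragraph above describes.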
A. Path-Setup Property

Here, the notations used in the discussion of a backtracking protocol using the pipeline circuit-switched (PCS) scheme in a given network are defined as follows: S is the source switch; D is the destination switch; and the search tree for (S, D) contains all the paths from S to D pruned by the restriction of the backtracking routing. As introduced in Sections I and III-B, the BW switch employs a PCS scheme, where the probe header in its setup phase, generated from the source wrapper of S, is sent through the network towards the destination wrapper of D to search for an available communication path. When there is congestion (i.e., a link is blocked), the proposed EPB-based path-setup scheme makes the probe header backtrack under Exhaustive Profitable Backtracking (EPB) to search for alternative profitable links rather than waiting for busy links to become idle.

1) Deadlock Freedom: As proved by Theorem 1 (Deadlock Freedom, Section V) of [20], any backtracking protocol using PCS, which constructs its search tree for (S, D), is deadlock free. The EPB (used in the proposed path-setup), constructing its search tree containing all the shortest paths pruned by the restriction of minimum routing, is one family of such protocols [20]. It follows that the EPB is deadlock free. Therefore, the proposed EPB-based probing path-setup is deadlock free.

From the implementation aspect, some techniques used for the BW switch to realize the deadlock-free property are addressed as follows. As discussed in Section III-B1a, when a probe header backtracks, Ctrl Out retimes the Ans signal from the downstream switch and passes it to the corresponding requesting Ctrl In through the cross-connecting role of the ARBITER. On receiving “Network Blocked”, Ctrl In changes into other states (e.g., Probing with other output or Backtrack, see Fig. 4) and stops requesting the current Ctrl Out. Hence, it releases the reserved link leading to the downstream switch. It is noted that the releasing of an occupied link is also applied when Ctrl In receives “Busy Dest.” In this case, Ctrl In changes into the nACK state and stops requesting the current Ctrl Out. This is to ensure that the released resource (i.e., Ctrl Out with its corresponding link) becomes available for other probe headers to advance.

Another important technique is to keep the probing process performed in a nonwaiting manner. As mentioned in Section III-B1b, an implementation principle of the BW switch is that an arriving probe header is always processed in every probe cycle, without waiting for any occupied resource (link) to become idle. Thus, when the probe header tries other outputs or backtracks to the upstream switch through its reserved link, it is processed immediately without waiting for a busy resource.

Fig. 9 illustrates an example of a setup scenario where four path-setups, i.e., P1, P2, P3, and P4, tend to form a “deadlock” cycle at a given probing clock event. As analyzed in Section III-B1, the Ctrl In of the BW switch, upon observing that the output is occupied, handles the probe header either to try other possible (profitable) outputs or to backtrack to the upstream switch in the next probing clock cycle. Thus, this “deadlock” cycle will be broken immediately at the next probing clock event. Therefore, the deadlock phenomenon cannot occur.

Fig. 9. Scenario with path-setups probing links in a cyclic manner.

2) Live-Lock Freedom: As proved by Theorem 2 (Livelock Freedom, Section V) of [20], any backtracking protocol using PCS that constructs its search tree for (S, D) is live-lock free. The EPB (used in the proposed path-setup), which constructs its search tree containing all the shortest paths pruned by the restriction of minimum routing, is one family of such protocols [20]. It follows immediately that the EPB is live-lock free. Therefore, the proposed EPB-based probing path-setup is live-lock free.

From an implementation viewpoint, some techniques are used to realize this property in the BW switch as follows. First, the BW switch is implemented to support the search within a set of shortest paths. As detailed in Section III-B2b, the bit-level design of address decoding dynamically constructs a route-probing table containing only the profitable outputs (i.e., minimum routing). Therefore, the Ctrl Ins handle the probe header to be forwarded or to backtrack along shortest paths only. Second, Ctrl In conducts the probing process in an orderly fashion once it receives a “Network Blocked” (see Section III-B2a). This is to avoid re-searching the same path; hence, the EPB-based path searching is nonrepeated. Third, the refereeing role of the ARBITER resolves contention when there are several arriving probe headers competing for a free profitable output (see Section III-B1a). After this refereeing, only one winning probe header advances through the profitable output. The unsuccessful probe header(s) is treated as “facing a Network Blocked.” This way of refereeing has two purposes.
• The requesting Ctrl In(s) considers the unsuccessful probe header as in the normal backtracking case and processes it conforming to the specified orderly minimum-path probing rule. Hence, no misrouting of probe headers is allowed, even after contention resolution.
• No dropping flow-control of probe headers needs to be applied in the case of contention, since if the probe header is dropped by a dropping flow-control (every time it reenters the network), it may never reach its destination, even though there exist available paths. This dropping flow-control differs from the proposed scheme, in which the probe header backtracks to the sending wrapper of the source switch because no available path exists after an exhaustive search. Additionally, based on this backtracking feedback, the sender knows the current network state and can iterate a new search appropriately.

In summary, these techniques ensure the realization of the live-lock free property in the BW switch implementation.

3) Dynamic and Distributed Scheme: To set up a path, each BW switch maintains the backtracking probing activity individually and decides to probe the possible outputs depending on the current state of the network. Therefore, the path is set up dynamically and in a distributed manner, without the need for a centralized control node or supplemental network, as in [6] and [12].

B. Performance

To examine the path-setup performance, the simulation framework described in [19], based on OMNeT++ (available online at http://www.omnetpp.org/), has been used to build a simulator. In the simulator, the path-setup scheme of the BW switch employs EPB [13] to route a probe header back and forth along minimum paths. The path-setup approach of the proposed BW switch is compared to another dynamic and distributed path-setup approach given in [5]. The mesh-based SoCBUS of [5] is conceptually proposed and employs a packet-connected circuit scheme. It can set up minimum paths without using an additional setup network or central control node. The key difference between our path-setup concept and that in [5] is that the proposed BW switch uses backtracking to proactively search for the available links, rather than entirely retracting the partially established circuit (to avoid deadlock), when there are no free outputs leading towards the destination.

This difference can be illustrated through a detailed example, as shown in Fig. 10. For simplicity, we estimate the travel of the probe header in terms of passing hops. We assume that the turn of the setup header, processed by each switching node under shortest-path routing, follows an order of E → S → W → N. A path from source S to destination D needs to be set up. In Fig. 10, a network state is assumed, where only link 4 is blocked and the others are free. With the backtracking path-setup approach, the probe header travels through the links in the following sequence: (S) → … (facing link blocked and backtrack) → … → (D). This travel requires six hops. With the retracting path-setup approach, the probe header's travel is in the following sequence: (S) → … (facing link blocked and retract) → (S), then a retry (S) → … (facing link blocked and retract) → (S), and so on. This travel requires more than six hops. Eventually, the retracting path-setup approach needs to wait until link 4 becomes free.

Fig. 10. Illustration of the difference between the backtracking scheme and the retracting scheme.

Since the path-setup approach of SoCBUS has been conceptually introduced without any detailed information on the switch design, we import the concept of its “retracting” to our network simulator for comparison. Without in-depth consideration of implementation issues in the network-level simulations, the values of the probing frequency (200 MHz), the source-synchronous frequency (1 GHz), and the fall-through latency (300 ps) are assumed to be most appropriate for a BW switch prototype implemented in a 0.13-μm CMOS standard-cell technology. The folded torus is used as a laid-out version of the torus topology in a 2-D tile-based NoC. Relating to the physical property of the tile-based layout, the link delay is estimated from the latency of the source-synchronous signal on a 16-bit-data top-metal link of a 1-mm²-tile-based NoC model (210 ps for a 1-mm link and 330 ps for a 2-mm link). Simulation is performed under uniform traffic, with packets resulting in a medium duration of data transmission (400 probe cycles), and a Poisson distribution of packet inter-arrival times. In all the simulation cases, a simple end-to-end control protocol is assumed for both schemes, in which the probe header is resent (i.e., retried) immediately after a previously unsuccessful try. In the simulations, each source sends 10 000 packets into the network, and the first 1000 packets are discarded in the warm-up phase. For the sake of comparison, the retracting path-setup concept (of SoCBUS) and the proposed backtracking path-setup are analyzed in mesh and torus topologies.
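The hop-count contrast between backtracking and retracting described in the example above can be reproduced with a small sketch. It is an illustration under stated assumptions, not the simulator of [19] and not the network state of Fig. 10: the 2-D mesh coordinates, the function names, the E, S, W, N probing order, and the single blocked link are all hypothetical, and the retracting model simply returns the probe to the source as soon as no free profitable output exists.

```python
from typing import Dict, List, Set, Tuple

Node = Tuple[int, int]      # (x, y): x grows eastward, y grows southward
Link = Tuple[Node, Node]    # directed link (from_node, to_node)

ORDER = ("E", "S", "W", "N")  # assumed probing order at every switch
STEP = {"E": (1, 0), "S": (0, 1), "W": (-1, 0), "N": (0, -1)}


def profitable(node: Node, dst: Node) -> List[str]:
    """Outputs that move the probe strictly closer to dst (minimal routing)."""
    dx, dy = dst[0] - node[0], dst[1] - node[1]
    return [d for d in ORDER if STEP[d][0] * dx > 0 or STEP[d][1] * dy > 0]


def neighbor(node: Node, d: str) -> Node:
    return (node[0] + STEP[d][0], node[1] + STEP[d][1])


def backtracking_setup(src: Node, dst: Node,
                       blocked: Set[Link]) -> Tuple[bool, int, List[Node]]:
    """Exhaustive profitable search with one-hop backtracking.

    Returns (path_found, hops_travelled, path); backward hops are counted too.
    """
    path: List[Node] = [src]
    next_choice: Dict[Node, int] = {src: 0}  # next untried output per node
    hops = 0
    while path:
        node = path[-1]
        if node == dst:
            return True, hops, path
        options = profitable(node, dst)
        moved = False
        while next_choice[node] < len(options) and not moved:
            nxt = neighbor(node, options[next_choice[node]])
            next_choice[node] += 1
            if (node, nxt) not in blocked:
                path.append(nxt)
                next_choice[nxt] = 0
                hops += 1
                moved = True
        if not moved:
            path.pop()          # no free profitable output: go back one hop
            if path:
                hops += 1
    return False, hops, []


def retracting_attempt(src: Node, dst: Node,
                       blocked: Set[Link]) -> Tuple[bool, int]:
    """One attempt that retracts all the way to src when it gets stuck."""
    node, hops = src, 0
    while node != dst:
        for d in profitable(node, dst):
            nxt = neighbor(node, d)
            if (node, nxt) not in blocked:
                node, hops = nxt, hops + 1
                break
        else:
            return False, 2 * hops  # forward hops plus full retraction
    return True, hops


# Hypothetical network state: only the link (2,0) -> (2,1) is blocked.
blocked_links = {((2, 0), (2, 1))}
print(backtracking_setup((0, 0), (2, 1), blocked_links))
print(retracting_attempt((0, 0), (2, 1), blocked_links))
```

In this hypothetical state, the backtracking probe steps back one hop and reaches the destination through the alternative profitable output, whereas each retracting attempt returns to the source and, with an unchanged probing order, can only succeed once the blocked link becomes free.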
The path-setup latency (presented in units of probe cycles) and throughput performance, under uniform traffic in an 8 × 8 network, are shown in Figs. 11 and 12, respectively. The backtracking path-setup (of the proposed BW switch) outperforms the retracting one (of SoCBUS) in terms of path-setup latency (see Fig. 11). When a mesh topology is assumed, the backtracking path-setup scheme reduces the setup latency by 18% (at 0.1 load) and 21% (at 0.2 load) compared to the retracting approach. Under the torus topology, the backtracking path-setup yields a significant reduction of setup latency as the load becomes higher, from 35% (at 0.3 load) up to 50% (at 0.4 load) compared to the retracting approach. This suggests that, by using backtracking (to avoid blocked links) combined with the good path diversity of the topology, the proposed path-setup scheme has a higher probability of establishing a path than the retracting one when the load increases. The throughput-saturation point provided by the backtracking path-setup of the BW switch increases by factors of 1.05 and 1.09 under mesh and torus topologies, respectively, compared with those provided by the retracting approach (see Fig. 12). In summary, this section has outlined the main properties and performance of the backtracking probing path-setup scheme
supported by the proposed BW switch. The proposed BW switch can set up a path in a dead- and live-lock free manner due to the implementation of backtracking. The backtracking path-setup scheme outperforms similar path-setup approaches in both path-setup latency and saturated throughput. Simulation results suggest that a torus, with its good path diversity, suits the proposed BW switch for minimizing path-setup latency and maximizing saturated throughput.

Fig. 11. Path-setup latency performance (8 × 8-network size, uniform traffic).

Fig. 12. Throughput performance (8 × 8-network size, uniform traffic).

V. IMPLEMENTATION OF THE BW SWITCH

Here, we examine various issues in the silicon implementation of the proposed BW switch, via a switch prototype designed in a standard cell-based design flow. The following subsections provide details of the implementation with the post-layout verification, and an investigation of the BW switch timing property under various working conditions. Then, a silicon-proven prototype, with measurement results, is presented for proof-of-concept demonstration.

A. Implementation and Post-Layout Verification

A proof-of-concept BW switch, assumed to be used in an 8 × 8 torus network, is implemented with a 16-bit-data 5-bidirectional-port configuration, using EPB [13] for the probing path-setup. The prototype of the BW switch is coded in VHDL, synthesized (targeting around 330 MHz in the worst case), and laid out in a typical four-metal 0.18-μm CMOS process, under a supply voltage of 1.8 V. The CROSSBAR is a critical component in the design of the BW switch, since it provides direct source-synchronous wave-pipelining connections from the Data Ins to the Data Outs. In the place-and-route phase, the CROSSBAR is laid out as a hard macro, in which all the logic cells constituting each MUX are planned and placed in groups to reduce the wiring differences among the data lines of direct connections. As observed from our layout case, the layout area of a MUX is quite small (700 μm²) for an acceptable relative wiring skew among the data wires of a direct connection.

The operation of the switch is verified through post-layout simulation with layout-extracted parasitic back-annotation. Fig. 13 illustrates a case of switching operation with the backtracking probing activity, and the pipelining of source-synchronous data through the switch. In Fig. 13, the connection from Input2 (East) to Output3 (South) is established successfully under a path-setup scheme (see notations 1–2–3–4). Then, the RZ source clock, along with the data signals crossing the BW switch, is “wave-pipelined” from Input2 to Output3 (see notations 5–6). Meanwhile, an incoming probe header arriving at Input4 (West), requesting a connection to the busy Output3, is blocked and backtracks to the upstream switch (see notations 7–8).

Fig. 13. Example of switching operation verified in postlayout simulation.

B. Timing Property

Considering the variation of the operating conditions, and scaling of implementation technologies that may affect the timing property of the BW switch, timing bounds in terms of the maximum pipelining frequency and the fall-through latency
Fig. 14. Maximum pipelining frequency versus technologies and PVT conditions.
Fig. 15. Fall-through latency versus technologies and PVT conditions.
are projected, based on post-layout simulation. In addition, the energy efficiency per transported bit provided by the BW switch is examined in each case.

Before investigating the timing property, we first discuss the analysis scenario in which the BW switch is simulated. The internal transceiver (Fig. 7) is included in each data port of the CROSSBAR, in which each data input/output is wired from/to a 1× D flip-flop and a 16× inverter is included in the source clock wire. These internal transceivers are also implemented in the test chip mentioned later in Section V-C. The pipeline clock (i.e., source clock) is received by a clock buffer and used to sample the received data. The source-synchronous scheme (Fig. 8) of an RZ source clock along with 16-bit data is simulated using these internal transceivers. In the analysis, the BW switch prototype with these internal transceivers is implemented, laid out, and simulated in several STD-cell CMOS technologies, namely, 0.18 μm, 0.13 μm, and 90 nm. Regarding the relative skew of the source clock line and the 16 data lines, the maximum pipelining frequency, at which the Rx circuit can still correctly sample the transmitted data from the Tx circuit, is estimated. In each given technology, three cases of operating conditions are considered, as follows.
• Typical case (TYPICAL): 25 °C, TT corner, nominal voltage VDD. (The nominal VDD is 1.8, 1.2, and 1 V in 0.18-μm, 0.13-μm, and 90-nm technology, respectively.)
• Worst case (WORST): 125 °C, SS corner, VDD offset from nominal by a percentage.
• Best case (BEST): 0 °C, FF corner, VDD offset from nominal by a percentage.

Five unidirectional connections, with data streams of 100% switching activity (SA), are assumed to be applied to the switch, and the worst skew values among these connections are chosen to estimate the fall-through latency. Figs. 14 and 15 show the simulated maximum data pipelining frequencies and fall-through latency values for each technology under the different working conditions.
In the typical condition, the technology scaling from 0.18 μm to 0.13 μm and 90 nm results in a maximum pipelining frequency increase of 1.4× and 1.7×, respectively. The scaling of the fall-through latency is almost proportional to the ratios of the gate lengths between the given CMOS technologies. In the typical condition, the energy per transported bit reduces dramatically, by up to 90% and 98%, when the technology scales from 0.18 μm to 0.13 μm and 90 nm, respectively (Fig. 16). If the power overhead of the internal transceivers is added, the energy increases by around 21% and 28% with data switching activity of 50% and 100%, respectively.

The post-layout simulation results suggest that the variation of operating conditions provides a wider range of maximum pipelining frequency and a narrower range of fall-through latency of the BW switch when the technology scales down (Figs. 14 and 15). However, this projection is based on the static timing analysis performed under a STD cell-based design with automated layout. Other effects, such as crosstalk and clock skew, are not considered. Therefore, a real chip implementation is discussed in the next subsection, in which all these other effects are inherently taken into account.

Fig. 16. Energy per transported bit versus technologies and PVT conditions.

C. Test Chip and Measurement Results

A test chip, including the tested switch prototype plus an “on-chip testbench”, has been designed and fabricated in a 0.18-μm STD-cell CMOS process to validate the concept of the proposed BW switch and to measure its maximum aggregate throughput in a real chip. The “on-chip testbench” comprises five Test wrappers, a Test control, and an Error monitor block, as shown in Fig. 17. The Test wrappers are implemented with FSMs to emulate the behavior of the IP and the four neighboring switches, according to some predefined communication scenarios. This emulation implementation helps verify the
Fig. 17. Test chip architecture.
Fig. 18. Measurement of maximum pipelining frequency. (a) No error shown at a data clk of 923 MHz. (b) Error shown at a data clk of 924 MHz.
switching operations (e.g., the backtracking) of the fabricated BW switch. Each Test wrapper also comprises Tx and Rx circuits in the data interface with the tested switch prototype (Fig. 17). The probing clock and pipeline (data) clock are fed directly from off-chip through clock I/O buffers. Each Test wrapper, also functioning as a traffic generator, can generate predefined data and send them to the receiving partner. The partner compares the data received from the originating wrapper with a predefined pattern (by an XOR circuit) to detect errors, and then sends the errors to an Error monitor. The Error monitor combines (by an OR circuit) all the errors from the five Test wrappers, and then accumulates these errors with a binary counter. A Logic Analyzer on the off-chip side monitors the parallel outputs of the binary counter. The Test control block is used to control the test scheme (e.g., reset/start, predefined data patterns, and predefined communication scenarios).

The test chip can operate properly at a measured maximum probing frequency of 345 MHz. In the measurement of the aggregate bandwidth of the switch, five guaranteed connections (with data switching activity of 100%) are activated. To measure the maximum pipelining frequency, we gradually increase the pipeline clock until accumulated errors appear on the Logic Analyzer. In this way, the maximum pipelining frequency has been found to be 923 MHz, under a 1.8-V VDD at room temperature (Fig. 18). This measured result is reasonable if other effects, such as noise (from the power supply or clock source), crosstalk, PCB degradation, or clock duty-cycle distortion, are considered. Fig. 19 shows the area and power breakdowns of the BW switch prototype. The CROSSBAR occupies 22% and each Ctrl In occupies around 8%–9% of the total BW switch area.
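The measurement procedure described above (XOR comparison in each Test wrapper, OR reduction and error counting in the Error monitor, and a gradual increase of the pipeline clock) can be mimicked in software. The sketch below is only an illustration: the function names, the `0xAAAA` test pattern, the 1-MHz step, and the `link_ok` stand-in for the fabricated chip are hypothetical, and the bandwidth conversion assumes one 16-bit word per link per pipeline clock cycle.

```python
from typing import Callable, List, Optional

DATA_WIDTH_BITS = 16  # data width of one link in the prototype
NUM_LINKS = 5         # five guaranteed connections activated in the measurement


def wrapper_error(received: int, expected: int) -> bool:
    """XOR comparison in the receiving Test wrapper: any differing bit is an error."""
    return (received ^ expected) != 0


def error_monitor(error_flags: List[bool], counter: int) -> int:
    """OR the per-wrapper error flags, then accumulate into a counter."""
    return counter + (1 if any(error_flags) else 0)


def find_max_frequency_mhz(link_ok: Callable[[int], bool],
                           start_mhz: int, stop_mhz: int,
                           pattern: int = 0xAAAA) -> Optional[int]:
    """Raise the pipeline clock in 1-MHz steps until the error counter is non-zero."""
    last_good = None
    for f_mhz in range(start_mhz, stop_mhz + 1):
        flags = []
        for _ in range(NUM_LINKS):
            # Hypothetical link model: a failing link flips the least-significant bit.
            received = pattern if link_ok(f_mhz) else pattern ^ 0x0001
            flags.append(wrapper_error(received, pattern))
        if error_monitor(flags, counter=0):
            break
        last_good = f_mhz
    return last_good


def bandwidth_gbps(f_mhz: int, links: int = 1) -> float:
    """Data bandwidth in Gb/s, assuming one 16-bit word per link per clock cycle."""
    return DATA_WIDTH_BITS * f_mhz * links / 1000


# Hypothetical stand-in for the chip: error-free up to and including 923 MHz.
f_max = find_max_frequency_mhz(lambda f: f <= 923, start_mhz=900, stop_mhz=950)
print(f_max, bandwidth_gbps(f_max), bandwidth_gbps(f_max, NUM_LINKS))
```

With the measured 923 MHz, this conversion gives about 14.77 Gb/s per link and 73.84 Gb/s for five links, which matches the bandwidth figures reported for the prototype.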
In the power breakdown (under MHz, MHz, and a data switching activity of 100%), the CROSSBAR consumes 43%, the ARBITER and Ctrl Outs consume 14%, and Ctrl Ins consume the remaining BW switch power. Through analysis, when five connections are activated
and the probing clock is fixed, the total switch power consumption varies depending on the data streams passed through the CROSSBAR. This suggests that an increase of the data width of the CROSSBAR in future configurations of the BW switch implementation is possible without sacrificing the (power) cost of the control part (i.e., Ctrl Ins, Ctrl Outs, and ARBITER).

Fig. 19. (a) Area and (b) estimated power breakdowns of the BW switch prototype.

Fig. 20. Die photograph of the test chip and summary of the BW switch prototype.

Fig. 20 shows the die photograph of the test chip and the summary of the BW switch prototype. The fabricated BW switch prototype can offer a maximum aggregate guaranteed bandwidth of 73.84 Gb/s, while occupying a compact area of 0.0315 mm². The measured link bandwidth of 14.77 Gb/s shows that the BW switch can easily satisfy the required
guaranteed throughput for various common wireless SoCs [5], [6], even when operating under the worst-case condition. In addition, the BW switch is suitable for other data-intensive SoC applications, such as DMA, where the delivery of huge data blocks may require high bandwidth and long transmission durations [21]. Furthermore, future NoC usage may also find suitable uses for the BW switch due to its dynamic path-setup feature. The fabricated BW switch can also be used directly in a mesochronous on-chip communication scheme, since its control part can sample the probe header and the handshaking signals under a local probing clock.

In summary, this section has discussed the implementation issues of a synthesizable BW switch prototype in a STD-cell based design flow. Moreover, a test chip in a 0.18-μm STD-cell CMOS process is presented for demonstration.

VI. CONCLUSION AND FURTHER STUDY

This paper has presented a practical and cost-effective design of the proposed BW switch to support guaranteed throughput that combines both backtracking and wave-pipelining features. The backtracking feature provides a dynamic and dead- and live-lock free path-setup scheme in a distributed fashion. The wave-pipelining (i.e., direct-forwarding) feature provides low fall-through latency and high multi-Gb/s bandwidth, and suggests suitability for an end-to-end source-synchronous data transmission. In addition, a BW switch prototype with a 16-bit-data 5-bidirectional-port configuration has been fabricated and tested in a typical 0.18-μm CMOS STD-cell technology for proof of concept.

In this paper, the HDL-based implementation of the BW switch, using standard cells, results in a short design time and good portability. There is room for further optimization due to the separate implementation of the data path from the control part. For example, the data path can be fully customized in a specific CMOS process for appropriate matching to a cascade scheme of inter-switch transceivers, such as the one presented in [15], to enable a robust and efficient end-to-end source-synchronous scheme. Furthermore, another idea from the BW switch architecture, although outside the scope of this paper, is the fault-tolerant property of backtracking. This property is useful in constructing a NoC with fault-tolerant performance. Regarding these open issues, extensions of the BW switch to support guaranteed throughput can be considered in future work with real-world NoC-based applications.
ACKNOWLEDGMENT

The authors would like to thank the anonymous reviewers for their valuable comments. The authors would also like to thank the IC Design Education Center (IDEC) and the Korea Ministry of Knowledge Economy (MKE) for chip fabrication.

REFERENCES

[1] L. Benini and G. De Micheli, “Networks on chips: A new SoC paradigm,” IEEE Computer, vol. 35, no. 1, pp. 70–78, Jan. 2002.
[2] K. Goossens, J. Dielissen, and A. Radulescu, “Æthereal network on chip: Concepts, architectures, and implementations,” IEEE Des. Test. Comput., vol. 22, no. 5, pp. 414–421, 2005.
[3] R. Marculescu, U. Y. Ogras, P. Li-Shiuan, N. E. Jerger, and Y. Hoskote, “Outstanding research problems in NoC design: System, microarchitecture, and circuit perspectives,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 28, no. 1, pp. 3–21, Jan. 2009.
[4] G. D. Micheli and L. Benini, Networks on Chips: Technology and Tools (Systems on Silicon). San Mateo, CA: Morgan Kaufmann, 2006.
[5] D. Wiklund and L. Dake, “SoCBUS: Switched network on chip for hard real time embedded systems,” in Proc. Int. Parallel Distrib. Process. Symp., 2003, p. 8.
[6] P. T. Wolkotte, G. J. M. Smit, G. K. Rauwerda, and L. T. Smit, “An energy-efficient reconfigurable circuit-switched network-on-chip,” in Proc. IEEE Int. Parallel Distrib. Process. Symp., 2005, p. 155a.
[7] Z. Lu and A. Jantsch, “TDM virtual-circuit configuration for network-on-chip,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 16, no. 8, pp. 1021–1034, Aug. 2008.
[8] S. Vangal, A. Singh, J. Howard, S. Dighe, N. Borkar, and A. Alvandpour, “A 5.1 GHz 0.34 mm² router for network-on-chip applications,” in VLSI Circuits Symp. Dig. Tech. Papers, 2007, pp. 42–43.
[9] S. R. Vangal, J. Howard, G. Ruhl, S. Dighe, H. Wilson, J. Tschanz, D. Finan, A. Singh, T. Jacob, S. Jain, V. Erraguntla, C. Roberts, Y. Hoskote, N. Borkar, and S. Borkar, “An 80-tile sub-100-W TeraFLOPS processor in 65-nm CMOS,” IEEE J. Solid-State Circuits, vol. 43, no. 1, pp. 29–41, Jan. 2008.
[10] D. Lattard, E. Beigne, F. Clermidy, Y. Durand, R. Lemaire, P. Vivet, and F. Berens, “A reconfigurable baseband platform based on an asynchronous network-on-chip,” IEEE J. Solid-State Circuits, vol. 43, no. 1, pp. 223–235, Jan. 2008.
[11] D. N. Truong, W. H. Cheng, T. Mohsenin, Y. Zhiyi, A. T. Jacobson, G. Landge, M. J. Meeuwsen, C. Watnik, A. T. Tran, X. Zhibin, E. W. Work, J. W. Webb, P. V. Mejia, and B. M. Baas, “A 167-processor computational platform in 65 nm CMOS,” IEEE J. Solid-State Circuits, vol. 44, no. 4, pp. 1130–1144, Apr. 2009.
[12] N. E. Jerger, M. Lipasti, and L.-S. Peh, “Circuit-switched coherence,” IEEE Comput. Archit. Lett., vol. 6, no. 1, pp. 5–8, Jan.-Jun. 2007.
[13] J. Duato, S. Yalamanchili, and L. Ni, Interconnection Networks: An Engineering Approach. San Mateo, CA: Morgan Kaufmann, 2003.
[14] S.-J. Lee, K. Lee, S.-J. Song, and H.-J. Yoo, “Packet-switched on-chip interconnection network for system-on-chip applications,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 52, no. 6, pp. 308–312, Jun. 2005.
[15] D. Schinkel, E. Mensink, E. Klumperink, E. van Tuijl, and B. Nauta, “Low-power, high-speed transceivers for network-on-chip communication,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 17, no. 1, pp. 12–21, Jan. 2009.
[16] L. Zhang, Y. Hu, and C. C.-P. Chen, “Wave-pipelined on-chip global interconnect,” in Proc. Asia South Pacific Design Autom. Conf., 2005, pp. 127–132.
[17] J. Xu, W. Wolf, and W. Zhang, “Double-data-rate, wave-pipelined interconnect for asynchronous NoCs,” IEEE Micro, vol. 29, no. 3, pp. 20–30, 2009.
[18] P.-H. Pham, Y. Kumar, and C. Kim, “A compact and high performance switch for circuit-switched network-on-chip,” in Proc. IEEE Int. SOC Conf., 2006, pp. 53–56.
[19] P. T. Hong, P.-H. Pham, X.-T. Tran, and C. Kim, “Analysis and evaluation of traffic-performance in a backtracked routing network-on-chip,” in Proc. Int. Conf. Commun. Electron., 2008, pp. 13–17.
[20] P. T. Gaughan and S. Yalamanchili, “A family of fault-tolerant routing protocols for direct multiprocessor networks,” IEEE Trans. Parallel Distrib. Syst., vol. 6, no. 5, pp. 482–497, May 1995.
[21] M. Kistler, M. Perrone, and F. Petrini, “Cell multiprocessor communication network: Built for speed,” IEEE Micro, vol. 26, no. 3, pp. 10–23, 2006.
Phi-Hung Pham (S’06) received the B.Sc. degree (with honors) and the M.Sc. degree in electronics and telecommunication engineering from Vietnam National University, Hanoi, Vietnam, in 1999 and 2003, respectively. He is currently working toward the Ph.D. degree at Korea University, Seoul, South Korea. Since 2000, he has been a Lecturer with the College of Technology, Vietnam National University, Hanoi, Vietnam. He has been a Korea Research Foundation (KRF) Fellow since 2006. His current research interests include design and implementation of on-chip networks for complex SoC applications in reconfigurable computing, MPSoC, LDPC/Turbo decoding, and bio-inspired systems.
Jongsun Park (M’05) received the B.S. degree in electronics engineering from Korea University, Seoul, South Korea, in 1998, and the M.S. and Ph.D. degrees in electrical and computer engineering from Purdue University, West Lafayette, IN, in 2000 and 2005, respectively. He joined the Electrical Engineering Faculty, Korea University, Seoul, South Korea, in 2008. From 2005 to 2008, he was with the Signal Processing Technology Group, Marvell Semiconductor Inc., Santa Clara, CA. He was also with the Digital Radio Processor System Design Group, Texas Instruments, Dallas, TX, during the summer of 2002. His research interests focus on variation-tolerant, low-power, and high-performance VLSI architectures and circuit designs for digital signal processing and digital communications.
Phuong Mau received the B.Sc. degree in electronics and telecommunication technology from Vietnam National University, Hanoi, Vietnam, in 2006. He is currently working toward the M.S. degree at Korea University, Seoul, South Korea. His interests include design and implementation of networks-on-chip.
Chulwoo Kim (S’98–M’02–SM’06) received the B.S. and M.S. degrees in electronics engineering from Korea University, Seoul, South Korea, in 1994 and 1996, respectively, and the Ph.D. degree in electrical and computer engineering from the University of Illinois at Urbana-Champaign, Urbana, in 2001. In 1999, he worked as a summer Intern with the Design Technology, Intel Corporation, Santa Clara, CA. In May 2001, he joined IBM Microelectronics Division, Austin, TX, where he was involved in cell processor design. Prior to joining IBM, he was a Research Staff Member with the University of California, Santa Cruz, in 2001. Since September 2002, he has been with the Department of Electronics Engineering, Korea University, Seoul, South Korea, where he is currently an Associate Professor. In 2008–2009, he was a Visiting Scholar with the University of California, Los Angeles. His current research interests are in the areas of wireline transceivers, memory, power management, and data converters. Dr. Kim was the recipient of the Samsung HumanTech Thesis Contest Bronze Award (1996), the ISLPED Low-Power Design Contest Award (2001), the DAC Student Design Contest Award (2002), SRC Inventor Recognition Awards (2002), the Young Scientist Award from the Ministry of Science and Technology of Korea (2003), the Seoktop Award for Excellence in Teaching (2006), and the 13th ASP-DAC Best Design Award (2008). He is currently on the editorial board of the IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS.