IEEE TRANSACTIONS ON COMPUTERS, VOL. 57, NO. 9, SEPTEMBER 2008

Low-Complexity Link Microarchitecture for Mesochronous Communication in Networks on Chip

Francesco Vitullo, Nicola E. L’Insalata, Student Member, IEEE, Esa Petri, Sergio Saponara, Luca Fanucci, Member, IEEE, Michele Casula, Riccardo Locatelli, and Marcello Coppola

Abstract—Clock distribution is an important issue when designing Multiprocessor Systems on Chip on deep-submicron technology nodes, and nonsynchronous approaches are becoming popular in this field. This work presents a low-complexity link microarchitecture for mesochronous on-chip communication that enables looser skew constraints in clock tree synthesis, frequency speedup, power consumption reduction, and faster back-end turnarounds. With respect to the state of the art, the proposed link architecture stands out for its low power and low complexity overheads. Moreover, it can be easily integrated into a conventional digital design flow since it is implemented by means of standard cells only. Results are presented by referring to the link integrated within a multiprocessor tiled architecture based on a Network-on-Chip communication backbone on a CMOS 65-nm technology.

Index Terms—Interconnection architectures, mesochronous communication, Network on Chip, System on Chip, VLSI systems.

1 INTRODUCTION

SYSTEM-ON-CHIP (SoC) evolution will soon see hundreds of predesigned Intellectual Property (IP) cores assembled on a single die to achieve complex functionalities. However, the performance of traditional interconnect systems (buses and point-to-point links) does not keep pace as system complexity increases and microelectronic technology enters the nanoscale domain. Future embedded systems will be dominated by communication rather than by computation [1], [2].

Connecting hundreds of IPs in an efficient way is a challenge that has channeled researchers’ interest toward the Network-on-Chip (NoC) design paradigm during the last few years [1]. Thanks to the abstraction layers (physical, data-link, network, transport, and application) of the ISO/OSI stack, NoCs provide a methodology for designing an interconnect architecture independent of the attached IP cores. Design-flow parallelization, scalability, and reusability all benefit from this approach; furthermore, NoCs usually allow IP macrocells to be connected to the network in a Plug-and-Play fashion, which represents a powerful advantage for system-level design exploration and performance analysis, thus reducing time-to-market and development costs.

Even in the presence of NoCs, clock distribution remains a major issue in complex systems because of the wire delay problem. As Fig. 1 shows, a single IP benefits from technology scaling, thanks to the reduced size of transistors and length of internal wires. However, each new technology node integrates a greater number of IPs in the same die area; thus, the delay of global wires (e.g., the clock) does not scale and becomes the actual bottleneck for system performance [3].

• F. Vitullo, N.E. L’Insalata, E. Petri, S. Saponara, L. Fanucci, and M. Casula are with the Department of Information Engineering, University of Pisa, via G. Caruso, 16, I-56122 Pisa, Italy. E-mail: {francesco.vitullo, nicola.linsalata, esa.petri, s.saponara, l.fanucci, michele.casula}@iet.unipi.it.
• R. Locatelli and M. Coppola are with AST Grenoble Labs STMicroelectronics, 12, rue Jules Horowitz, 38019 Grenoble Cedex, France. E-mail: {ricardo.locatelli, marcello.coppola}@st.com.

Manuscript received 29 June 2007; revised 13 Nov. 2007; accepted 14 Feb. 2008; published online 19 Mar. 2008. Recommended for acceptance by R. Marculescu. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TCSI-2007-06-0268. Digital Object Identifier no. 10.1109/TC.2008.48. 0018-9340/08/$25.00 © 2008 IEEE. Published by the IEEE Computer Society.

1.1 NoC Clocking Issues and Synchronization Strategies

A fully asynchronous approach would eliminate the clock distribution problem, but current design tools and IP libraries rely heavily on the synchronous paradigm. Many researchers have proposed alternative solutions for synchronization. One of the most popular is the Globally Asynchronous Locally Synchronous (GALS) approach: IP cores work synchronously with a clock signal whose internal skew is negligible (the IP occupies a relatively small area); conversely, the skew over long global wires is canceled by implementing an asynchronous handshake [4], [5], [6].

The mesochronous scheme is a trade-off between the synchronous and asynchronous approaches. As in a fully synchronous paradigm, the clock signal distributed to the various macrocells is the same, but all copies of the clock may have an arbitrary amount of skew. In modern SoCs, this is a realistic scenario since the clock signal is generated from a single source and is distributed across the chip floorplan with a space-dependent, time-invariant phase offset, which is the clock skew [7].

Future embedded systems will be based on multiprocessor tiled architectures, where each tile will itself be a complex communication-dominated HW/SW system [6]. In this context, the GALS and the mesochronous approaches will coexist: intertile communication will be GALS-like, while the mesochronous strategy will be useful at the intratile level.

Some techniques have been proposed in the literature to perform synchronization in mesochronous systems. An architectural model for a mesochronous link is described in [8]. With respect to this model, this paper provides implementation solutions for the reset procedure and the receiver block based on a two-stage wagging FIFO. Architectures based on m-stage FIFO synchronizers are proposed in [9], [10]. Here, the value of m is sized on the clock skew to leave enough time margin between the location where data are written and the one where data are read. The main drawback is that m must be evaluated for each physical link in the system or, alternatively, evaluated once and for all in the worst-case scenario at the cost of unnecessary area and power overheads. Implementation and experimental results for this solution are reported in [11], with a nonnegligible overhead of eight flip-flops (FFs) and a four-to-one mux for each bit line to be synchronized.

In [12], the operating principles of several periodic synchronizers are described. They avoid metastability by delaying either the data or the clock signal so that data are sampled when they are stable. These solutions rely on nontrivial components such as configurable digital delay lines, which are often not available in a standard-cell design flow. Moreover, the experiments in [12] are limited to low clock frequencies, about 25 MHz. Other works [13], [14], [15] achieve mesochronous data synchronization by using components such as Muller C-elements and digital delay lines that are typically designed with a full-custom approach. In [13], the maximum recovered clock skew is limited to only 25 percent of the clock period.

Mesochronous schemes are also used in the literature to implement high-speed pipelined circuits [16]. Here, the focus is on boosting the speed of basic blocks (such as multipliers) rather than recovering synchronization in complex SoCs. Table 1 summarizes the above considerations about the state of the art.

The wire delay problem is also addressed by the Latency-Insensitive (LI) approach based on a synchronous design style [17], [18]. LI removes timing violations of long wires by 1) breaking them up into shorter ones through the insertion of relay stations and 2) encapsulating IPs in a shell circuitry that supports a proper flow-control protocol. Compared to the mesochronous approach, LI can deal with arbitrary wire delays in single-clock and multiclock architectures [18], but it introduces circuit overheads for the relay stations and the IP shell circuitry.

To address the above issues, a low-complexity microarchitecture of a Skew-Insensitive Mesochronous Link (SIM-L), patent filed [19], is presented. This work presents the relevant implementation details at both the system and technology levels. As will be detailed in the next sections, SIM-L enables full-bandwidth communication between synchronous IP macrocells by absorbing their clock skew while keeping area and power overheads low. It is fully supported by standard-cell design flows and introduces two clock cycles of latency. Results in a 65 nm CMOS technology are reported.

In particular, Section 2 describes the mesochronous link architecture and its operating principles. Section 3 shows how SIM-L may be integrated into a Multiprocessor System on Chip (MPSoC) with reference to the tile-based Scalable software Hardware Architecture Platform for Embedded Systems (SHAPES) architecture [20], [21] featuring a Spidergon NoC. Section 4 reports circuit complexity results, and conclusions are drawn in Section 5.

Fig. 1. With technology scaling, a greater number of IP cores may be integrated on the same die area to achieve complex functionalities. While local IP performance improves, the global wires become a bottleneck for the overall system performance.

2 SIM-L ARCHITECTURE

SIM-L implements a mechanism that guarantees correct communication between a transmitter module and a receiver module even when the clock signals driving these modules are skewed. Fig. 2 shows a top-level view of the proposed link for unidirectional data transfer, including the clock skew between the two modules. SIM-L is composed of two units: SIM-L TX and SIM-L RX. The link output (data_sync) is a copy of the incoming data synchronized with the receiver local clock, thus absorbing the skew between the transmitter and the receiver modules. A TX/RX SIM-L pair supports communication in only one direction. If a full-duplex link is needed, two SIM-L TX/RX pairs must be instantiated, one for each direction. Figs. 3 and 4 show the SIM-L RX and SIM-L TX architectures, respectively. Their operating principles are detailed in Sections 2.1 and 2.2.

2.1 Steady-State Operation

During steady-state operation, the SIM-L TX module generates a signal (strobe) toggling at half the clock frequency. The strobe signal is routed together with the transmitted data and is used in the SIM-L RX module to synchronize them in the skewed clock domain. The SIM-L RX module is composed of an rx_synch block and a dual-stage buffer. Like the corresponding tx_synch block in SIM-L TX, the rx_synch block produces a strobe_rd signal that is synchronous with the skewed clock on the receiver side and toggles at half the clock frequency.

SIM-L guarantees that data are sampled when they are stable, thanks to the dual-stage buffer structure in Fig. 3, which is written by the transmitter (strobe) and read by the receiver (strobe_rd). Incoming data, which are synchronous with the transmitter clock, are alternately stored in registers A and B. Data stored in A and B are then read by registers A′ and B′ and alternately multiplexed toward the output. Fig. 5 shows the “ping-pong” operating principle of the two-stage buffer.

The key to correct data transfer during steady-state operation is reading and writing both the A/A′ and B/B′ register pairs without causing metastability. This is achieved by ensuring that the sampling edges of strobe_rd occur sufficiently later than the corresponding sampling edges of strobe. For instance, this happens when strobe and strobe_rd have opposite phases. In general, the valid phase relationships between strobe_rd and strobe are defined by a skewing window whose width is equal to the master clock period ($T_{clock}$), as shown in Fig. 6. Correct SIM-L operation for any skew is ensured when the skewing window is placed across the strobe falling edge, as shown in Fig. 6. The condition on $T_m$ is expressed as follows, where $T_{cq}$ is the propagation delay of the A/B registers:

$$T_{cq} + T_{setup} < T_m < T_{clock}. \qquad (1)$$

TABLE 1 Schematic Summary of State-of-the-Art Work on Mesochronous Synchronization


Fig. 2. SIM-L is composed of two subblocks: one for the transmitter and one for the receiver side.

Fig. 4. SIM-L subblock at the transmitter side: microarchitecture details. A strobe at half the frequency of the clock is generated.

The sizing of the delay S on the strobe line in Fig. 3 is fundamental to proper SIM-L operation. This delay guarantees that the $T_{setup}$ rule for registers A/B is not violated. In fact, both the data and strobe signals are generated synchronously with the transmitter clock and are routed together toward the receiver. Since strobe is used to sample data in the A/B registers, it must be delayed by $S \geq T_{setup}$ to give the data signal enough time to become stable. For this purpose, a cascade of a few standard-cell buffers or inverters is enough, since fine-tuning of S is not required (see the implementation results in Section 4).
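The ping-pong behavior described in this section can be illustrated with a cycle-level sketch. The Python model below is a simplification and not the authors' RTL: it assumes an ideal common clock (the skew and the delay S are ignored), lets even-numbered clock edges write A and copy B to B′, lets odd-numbered edges write B and copy A to A′, and assumes the output mux always selects the primed register that was loaded most recently. The function name `sim_l_link` and this edge assignment are illustrative assumptions.

```python
def sim_l_link(words):
    """Cycle-level sketch of the two-stage ping-pong buffer.

    Returns the value seen on data_sync at each clock edge. None marks
    cycles in which no valid word has reached the output yet.
    """
    a = b = None          # first-stage registers A and B (written by strobe)
    a_p = b_p = None      # second-stage registers A' and B' (loaded by strobe_rd)
    sampled = []
    stream = list(words) + [None, None]   # two idle cycles to flush the link
    for k, word in enumerate(stream):
        # The mux selects the primed register loaded at the previous edge:
        # A' is loaded on odd edges, B' on even edges.
        sampled.append(a_p if k % 2 == 0 else b_p)
        if k % 2 == 0:
            # Even edge: write A, copy the stable content of B into B'.
            a, b_p = word, b
        else:
            # Odd edge: write B, copy the stable content of A into A'.
            b, a_p = word, a
    return sampled


print(sim_l_link([10, 11, 12, 13]))
```

Under these assumptions, each word appears on data_sync two clock edges after it was launched, and each first-stage register is read one full clock period after it was written, which is the property the opposite-phase strobes are meant to provide.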

2.2 SIM-L Bootstrap

Bootstrap conditions are crucial to ensuring correct steady-state operation. When the reset condition is removed in the tx_synch block (see Fig. 4), the output of the first FF is set, thus enabling the strobe signal to toggle at half the master clock frequency. As described in Section 2.1, strobe and strobe_rd should toggle with opposite phases so that the time interval between two consecutive strobe/strobe_rd rising (falling) edges is long enough to allow the FFs to sample data outside the metastability window ($T_{setup} + T_{hold}$). To this end, the second FF in the rx_synch block is initialized to the opposite value (“1”) with respect to the second FF in tx_synch (initialized to “0”).

However, successful operation strongly depends on the delay S. Fig. 7 shows two different scenarios for the skewing window when the FFs in the tx_synch and rx_synch blocks are initialized with opposite values. Consider the scenario in Fig. 7a, where no delay on the strobe line is present. The amount of skew between the two copies of the clock signal is not known a priori, so the strobe_rd rising edge may fall on the right boundary of the skewing window; this situation may cause metastability. By contrast, in the scenario in Fig. 7b, the risk is avoided thanks to the delay S, which puts the skewing window in a safe position. Once bootstrap synchronization is accomplished, no other specific action is required during steady-state operation.
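The two scenarios of Fig. 7 can be explored with a small timing sketch. The model below is only one possible reading of Figs. 6 and 7 and rests on assumptions that the text does not state in this form: time is measured from a rising edge of the undelayed strobe; with opposite initial values and zero skew, strobe_rd rises one clock period later; the skew moves that edge by between zero and one clock period; and the delay S shifts the strobe edges seen by register A. The function name, the parameter names, and the 300 ps delay value are hypothetical.

```python
def read_edge_margins(t_clock, skew, strobe_delay):
    """Margins around one strobe_rd rising edge, in the unit of the inputs.

    Returns (after_write, before_next_write): the time elapsed since
    register A was last written and the time left before it is rewritten.
    """
    if not 0 <= skew <= t_clock:
        raise ValueError("skew is modeled as an offset within one clock period")
    write_edge = strobe_delay                    # delayed strobe rising edge
    next_write_edge = write_edge + 2 * t_clock   # strobe period is two clock periods
    read_edge = t_clock + skew                   # opposite phase, shifted by the skew
    return read_edge - write_edge, next_write_edge - read_edge


# Times in picoseconds, 1 GHz clock, largest modeled skew.
print(read_edge_margins(1000, 1000, 0))     # no delay on strobe, as in Fig. 7a
print(read_edge_margins(1000, 1000, 300))   # delayed strobe, as in Fig. 7b
```

In this model, the undelayed case leaves no margin before register A is rewritten when the skew reaches a full clock period, while the delayed case keeps a margin equal to the inserted delay. With zero skew, the margin after the write shrinks to the clock period minus the delay, which is why the delay cannot be made arbitrarily large.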

2.3 SIM-L Application Requirements and Limits

To summarize, SIM-L ensures correct data synchronization between modules with skewed clocks, provided that the following conditions are observed:

• The clock is generated by the same source (mesochronous scenario).
• The strobe signal is routed together with the data requiring synchronization.
• A delay $S \geq T_{setup}$ is inserted in the SIM-L RX module.
• The SIM-L pair is initialized such that strobe and strobe_rd are in a proper phase relationship.

The latency introduced by SIM-L amounts to two clock cycles due to the cascade of the A/A′ and B/B′ registers. Since the SIM-L data path is mainly a cascade of registers, no bottlenecks are introduced on the maximum achievable clock frequency.

The line delay (i.e., the signal propagation delay from the transmitter to the receiver) that SIM-L tolerates is around one clock cycle; in fact, if $S = T_{setup}$, then $T_m$ in Fig. 6 is $T_{clock} - T_{setup}$. Note that the line delay adds to S, thus shifting the strobe signal later in time and pushing its rising edge toward that of strobe_rd. If strobe is delayed by $T_{clock}$, there is a risk of metastability (a scenario similar to the one in Fig. 7a). In order to keep SIM-L operation safe, the line may add delay only up to $T_{clock} - 2T_{setup}$; the factor 2 keeps a safe margin around the left boundary of the skewing window.

Moreover, as will be demonstrated by the implementation results presented in Section 4, SIM-L overheads in terms of area and leakage power are negligible.
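The line-delay budget can be checked numerically. The sketch below assumes that $T_m$ equals the clock period minus all delay accumulated on the strobe path (S plus the line delay), which is a reading consistent with the statement that $T_m = T_{clock} - T_{setup}$ when $S = T_{setup}$ but is not spelled out in the text. The clock period and setup time are taken from the figures reported in Section 4; the clock-to-output delay of 100 ps and all function names are hypothetical.

```python
def margin_t_m(t_clock, strobe_delay, line_delay=0):
    """T_m under the assumed reading: clock period minus total strobe delay."""
    return t_clock - strobe_delay - line_delay


def satisfies_eq1(t_m, t_clock, t_cq, t_setup):
    """Condition (1): T_cq + T_setup < T_m < T_clock."""
    return t_cq + t_setup < t_m < t_clock


def max_line_delay(t_clock, t_setup):
    """Line-delay budget quoted in the text: T_clock - 2 * T_setup."""
    return t_clock - 2 * t_setup


T_CLOCK = 1000   # ps, 1 GHz (maximum frequency reported in Section 4)
T_SETUP = 250    # ps, largest setup time reported in Section 4
T_CQ = 100       # ps, hypothetical value, not given in the paper

print(max_line_delay(T_CLOCK, T_SETUP))
t_m = margin_t_m(T_CLOCK, strobe_delay=T_SETUP, line_delay=300)
print(t_m, satisfies_eq1(t_m, T_CLOCK, T_CQ, T_SETUP))
```

With these numbers the budget is 500 ps, half a clock period at 1 GHz. At the 500 MHz clock of the test case in Section 4, the same formula gives 1,500 ps, so slower clocks leave proportionally more room for long lines.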

Fig. 3. SIM-L subblock at the receiver side: microarchitecture details.

Fig. 5. Behavior of the two-stage register. Registers A/A′ and B/B′ are collapsed into a single abstract register.


Fig. 6. Phase relationship between strobe and strobe_rd signals. The rising edge of strobe_rd may fall in any point within the temporal interval denoted by the gray rectangle. It depends on the amount of skew between the transmitter clock and the receiver clock.

3 SIM-L DEPLOYMENT IN THE SHAPES MPSOC

Future MPSoCs will integrate an ever-increasing number of IP macrocells on a single chip. The major issue that should be dealt with when designing such complex systems is by far the communication architecture. NoCs are widely considered a suitable solution to this issue. Current research topics consider not only effective NoC architectures but also efficient mapping of software applications on the IPs such that the resulting traffic patterns would exploit the underlying communication architecture at its best [22].

NoCs bring into the microelectronic domain the layered approach that is typical of computer macronetworks. A layered paradigm decouples IP design from the underlying communication architecture, thus easing and speeding up the design flow of a complex system. Fig. 8 illustrates the building blocks of a typical NoC [23]:

• the Router (R), which is in charge of routing data packets and offering QoS features,
• the Network Interface (NI), which connects the IP macrocell domain to the NoC domain, providing data bus size, clock frequency, and protocol conversion features, and
• the Link, which represents the physical medium connecting two routers or a router to an NI.

Note that an NoC is based on a few predesigned, configurable, and reusable components. Application-specific communication architectures may be easily built by means of these basic components. Moreover, thanks to the protocol conversion performed by the NIs, IPs from different vendors may be connected to the same NoC in a Plug-and-Play fashion.


Fig. 8. (a) Typical ISO-OSI layers for Internet applications and their mapping onto NoC components. (b) An example of the NoC architecture. Usually, IPs run software applications that correspond to the highest ISO layer.

An interesting example of a tiled MPSoC architecture is given by the SHAPES Project [20], [21]: many identical IP tiles, which together build a complex scalable multiprocessor, are interconnected by means of a packet-switched network. A typical SHAPES tile contains, as sketched in Fig. 9, the following components [21]:

• a VLIW floating-point DSP based on the mAgicV architecture by ATMEL, featuring in 65 nm a complexity of 915 kgates plus 1 Mbit of program memory and 640 Kbits of RAM for a power cost of 200 mW/GFLOPS,
• a RISC processor based on the ARM926 core,
• a distributed network processor (DNP) for extratile communication, including the interface to the NoC (NI), and
• on-chip memories and a set of peripherals for off-chip communication.

For intertile interconnect, the Spidergon-STNoC (S-STNoC) architecture has been chosen. S-STNoC is based on an original topology [24] proposed by STMicroelectronics, which reaches a good trade-off between performance, configurability, and floorplanning versatility. The mesochronous link presented in this work is suitable for reliable high-bandwidth communication between NoC routers clocked at the same frequency with loose skew constraints. This is the scenario of the first generation of SHAPES, which targets a 65 nm CMOS implementation and features tens of tiles interconnected by a Spidergon NoC, where all routers run with the same clock (see Fig. 8b). Frequency conversion between each IP and the NoC is implemented at the network boundary by means of the NIs.

Future scaling of the SHAPES architecture in sub-45 nm CMOS technology foresees the integration of several hundreds of tiles. In such a case, even the NoC infrastructure will likely be partitioned into different synchronous regions: the GALS paradigm will be used for communication and synchronization over the long global wires, while the mesochronous strategy will still be used inside each synchronous island.

Fig. 7. Relative position of the strobe_rd skewing window (a) without delay S and (b) with delay. In these scenarios, FFs in the tx_synch and rx_synch blocks are initialized with opposite values.

Fig. 9. Content of a typical SHAPES tile. Interconnection among different tiles is achieved by means of Spidergon-STNoC [21].

4 IMPLEMENTATION RESULTS

The SIM-L architecture has been described at the Register Transfer Level (RTL) in VHDL and synthesized on a 65 nm CMOS standard-cell technology with a 1.1 V supply voltage, using the high-threshold-voltage (i.e., low-leakage) version of the library. In worst-case conditions and at a 125 °C operating point, the maximum clock frequency is 1 GHz. For a 32-bit-wide SIM-L TX/RX pair, the total area is about 1,600 µm², corresponding to 800 logic gates. The leakage power consumption at 1.1 V and 125 °C is 0.133 µW. Since the tx_synch and rx_synch blocks do not depend on data size, the area cost per synchronized bit amounts to roughly 50 µm²/bit. The leakage power consumption is 4 nW/bit.

With regard to the circuit for the insertion of the delay S in Fig. 3, note that $T_{setup}$ in the S-STNoC design ranges from 0.09 to 0.25 ns, considering variations of process, supply voltage, temperature, and load. To meet the coarse-grained constraint $S \geq T_{setup}$ (see Section 2.1), a cascade of four buffers is enough, since the delay of a single standard-cell buffer ranges from 0.067 to 0.12 ns under the same variations. Buffers are inserted during RTL synthesis by means of proper min/max delay constraints.

A test case system has been synthesized using the same 65 nm CMOS technology library (low-leakage, 1.1 V, and 125 °C) as SIM-L. The system includes eight tiles interconnected by a Spidergon NoC with eight NIs, eight S-STNoC routers, and 40 SIM-L instances running at 500 MHz and using a 72-bit-wide data packet bus. Synthesis results show that the ratio between the area of the links and that of the network blocks (routers + NIs) is limited to 3 percent. The leakage power consumption is negligible, i.e., less than 0.5 percent, when compared to that of the network components. Estimates based on previous industrial design experience indicate that the deployment of skew recovery techniques such as the mesochronous one allows for a gain of 25 percent in the maximum clock frequency.

In terms of complexity, the SIM-L architecture requires four FFs plus some logic gates for implementing the delay on strobe + (4 FFs + 2-to-1 mux) × bit-line. This may be compared, for instance, with the logic complexity of the architecture shown in [11], which requires 16 FFs + (8 FFs + 4-to-1 mux) × bit-line. This lower logic complexity comes at the expense of lower robustness with respect to the delay introduced by very long data lines: SIM-L tolerates up to almost one clock cycle of delay against the two clock cycles tolerated by the solution in [11].
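The figures quoted in this section can be cross-checked with a few lines of arithmetic. The sketch below is only an illustration built from the numbers given in the text; the function names are hypothetical, and the buffers that implement the delay on strobe are left out of the flip-flop counts because the text does not give their number as part of the formula.

```python
def sim_l_cells(bits):
    """(flip-flops, 2-to-1 muxes) for one SIM-L pair, per the count in the text."""
    return 4 + 4 * bits, bits


def fifo_sync_cells(bits):
    """(flip-flops, 4-to-1 muxes) for the architecture of [11], per the text."""
    return 16 + 8 * bits, bits


# 32-bit link of the stand-alone synthesis and 72-bit bus of the test case.
for width in (32, 72):
    print(width, sim_l_cells(width), fifo_sync_cells(width))

# Delay chain: four buffers at 67 to 120 ps each, against a worst-case
# setup time of 250 ps.
chain_min_ps, chain_max_ps = 4 * 67, 4 * 120
print(chain_min_ps, chain_max_ps, chain_min_ps >= 250)

# Area per synchronized bit for the 32-bit pair, in square micrometers.
print(1600 / 32)
```

For a 32-bit link this gives 132 flip-flops for SIM-L against 272 for the architecture of [11], and the slowest-case delay of the four-buffer chain (268 ps) still exceeds the largest setup time, in line with the coarse-grained sizing described above.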

5 CONCLUSIONS

Deep-submicron technologies allow for the integration of an ever-increasing number of IP macrocells on a single silicon die. Due to the well-known wire delay problem, clock distribution is becoming the actual bottleneck of system performance in fully synchronous systems.

The mesochronous paradigm is a way of modeling fully synchronous systems affected by clock skew issues. A real-world example of such a system is the tiled MPSoC architecture of SHAPES, which uses an NoC as the intertile communication backbone. This work has presented a mesochronous physical link microarchitecture for connecting NoC components. It enables looser skew constraints in clock tree synthesis, frequency speedup, power consumption reduction, and faster back-end turnarounds. With respect to the state of the art, the SIM-L architecture can be easily integrated into a conventional digital design flow since it is implemented by means of standard cells. In our experiments on a 65 nm CMOS standard-cell library, the link meets a maximum frequency of 1 GHz by using only low-leakage cells. SIM-L has been integrated into an eight-tile MPSoC with eight routers and NIs using a 72-bit-wide data packet bus. In this test case, SIM-L exhibits area and leakage penalties of only 3 percent and 0.5 percent, respectively, versus the NoC components. Although the GALS approach is expected to prevail when integrating the even higher complexity enabled by sub-45 nm technology nodes, the mesochronous paradigm will still be useful for addressing the larger synchronous islands of GALS systems.

ACKNOWLEDGMENTS

This work is the result of a close collaboration between the University of Pisa and STMicroelectronics within the European Sixth FP Project SHAPES “Scalable Software Hardware Architecture Platform for Embedded Systems” under Contract FP6-IST26825.

REFERENCES

[1] L. Benini and G. De Micheli, “Networks on Chip: A New SoC Paradigm,” Computer, vol. 35, no. 1, pp. 70-78, Jan. 2002.
[2] R. Saleh, S. Wilton, S. Mirabbasi, A. Hu, M. Greenstreet, G. Lemieux, P. Pande, C. Grecu, and A. Ivanov, “System-on-Chip: Reuse and Integration,” Proc. IEEE, vol. 94, no. 6, pp. 1050-1069, June 2006.
[3] R. Ho, K.W. Mai, and M.A. Horowitz, “The Future of Wires,” Proc. IEEE, vol. 89, no. 4, pp. 490-504, Apr. 2001.
[4] J. Muttersbach, T. Villiger, and W. Fichner, “Practical Design of Globally Asynchronous Locally Synchronous Systems,” Proc. Sixth Int’l Symp. Advanced Research in Asynchronous Circuits and Systems, pp. 52-59, 2000.
[5] A. Martin and M. Nystrom, “Asynchronous Techniques for System-on-Chip Design,” Proc. IEEE, vol. 94, no. 6, pp. 1089-1120, June 2006.
[6] R. Marculescu, D. Marculescu, and L. Pileggi, “Toward an Integrated Design Methodology for Fault-Tolerant Multiple Clock/Voltage Integrated Systems,” Proc. 22nd IEEE Int’l Conf. Computer Design, 2004.
[7] J. Rabaey, A. Chandrakasan, and B. Nikolic, Digital Integrated Circuits: A Design Perspective, second ed. Prentice Hall, 2003.
[8] D. Wiklund, “Mesochronous Clocking and Communication in On-Chip Networks,” Proc. Swedish System-on-Chip Conf., Apr. 2003.
[9] A. Edman and C. Svensson, “Timing Closure through Globally Synchronous, Timing-Portioned Design Methodology,” Proc. 45th Design Automation Conf., pp. 71-74, 2004.
[10] W.J. Dally and J.W. Poulton, Digital Systems Engineering. Cambridge Univ. Press, 1998.
[11] P. Caputa and C. Svensson, “An On-Chip Delay- and Skew-Insensitive Multicycle Communication Scheme,” Proc. IEEE Int’l Conf. Solid-State Circuits, pp. 1765-1774, Feb. 2006.
[12] Y. Semiat and R. Ginosaur, “Timing Measurements of Synchronization Circuits,” Proc. Ninth Int’l Symp. Advanced Research in Asynchronous Circuits and Systems, pp. 68-77, May 2003.
[13] S. Kim and R. Sridhar, “Self-Timed Mesochronous Interconnection for High-Speed VLSI Systems,” Proc. Sixth Great Lakes Symp. VLSI, pp. 122-125, 1996.
[14] B. Mesgarzadeh, C. Svensson, and A. Alvandpour, “A New Mesochronous Clocking Scheme for Synchronization in SoC,” Proc. IEEE Int’l Symp. Circuits and Systems, pp. 605-608, 2004.
[15] F. Mu and C. Svensson, “Self-Tested Self-Synchronization Circuit for Mesochronous Clocking,” IEEE Trans. Circuits Systems II, vol. 48, no. 2, pp. 129-140, Feb. 2001.
[16] S.B. Tatapudi and J.G. Delgado-Frias, “A Mesochronous Pipelining Scheme for High-Performance Digital Systems,” IEEE Trans. Circuits Systems I, vol. 53, no. 5, pp. 1078-1088, May 2006.
[17] L.P. Carloni and A.L. Sangiovanni-Vincentelli, “Coping with Latency in SoC Design,” IEEE Micro, vol. 22, no. 5, p. 12, Sept./Oct. 2002.
[18] M. Singh and M. Theobald, “Generalized Latency-Insensitive Systems for Single-Clock and Multi-Clock Architectures,” Proc. Int’l Conf. Design, Automation and Test in Europe, 2004.
[19] R. Locatelli, M. Coppola, D. Mangano, L. Fanucci, F. Vitullo, D. Zandri, and N.E. L’Insalata, “Synchronization System for Synchronizing Modules in an Integrated Circuit,” EU Patent Application 06291440.3-1237, Nov. 2006.
[20] P.S. Paolucci, A. Jerraya, R. Leupers, L. Thiele, and P. Vicini, “SHAPES: A Tiled Scalable Software Hardware Architecture Platform for Embedded Systems,” Proc. Fourth Int’l Conf. Hardware/Software Codesign and System Synthesis, pp. 167-172, 2006.
[21] P.S. Paolucci, F. Lo Cicero, A. Lonardo, M. Perra, D. Rossetti, C. Sidore, P. Vicini, M. Coppola, L. Raffo, G. Mereu, F. Palumbo, L. Fanucci, S. Saponara, and F. Vitullo, “Introduction to the Tiled HW Architecture of SHAPES,” Proc. Int’l Conf. Design, Automation and Test in Europe, vol. 1, pp. 77-82, Apr. 2007.
[22] U.Y. Ogras, J. Hu, and R. Marculescu, “Communication-Centric SoC Design for Nanoscale Domain,” Proc. 16th IEEE Int’l Conf. Application-Specific Systems, Architecture, and Processors, Apr. 2007.
[23] M. Coppola, R. Locatelli, G. Maruccia, L. Pieralisi, and A. Scandurra, “Spidergon: A Novel On-Chip Communication Network,” Proc. Int’l Symp. System-on-Chip, pp. 15-16, Nov. 2004.
[24] L. Bononi and N. Concer, “Simulation and Analysis of Network on Chip Architectures: Ring, Spidergon and 2D Mesh,” Proc. Int’l Conf. Design, Automation and Test in Europe, pp. 154-159, 2006.
