IEICE TRANS. FUNDAMENTALS, VOL.E88–A, NO.12 DECEMBER 2005
3531
PAPER
Special Section on VLSI Design and CAD Algorithms
A Binary Tree Based Methodology for Designing an Application Specific Network-on-Chip (ASNOC)
Yuan-Long JEANG†a), Member, Jer-Min JOU††, and Win-Hsien HUANG†, Nonmembers
SUMMARY In this paper, a methodology based on a mix-mode interconnection architecture is proposed for constructing an application specific network on chip that minimizes the total communication time. The proposed architecture uses a globally asynchronous communication network and locally synchronous buses (or crossbars or multistage interconnection networks, MINs). First, a local bus is assigned to a group of IP cores whose communications can be arranged to be exclusive in time. If the communications of some IP cores must be completed within a given amount of time, a non-blocking MIN or a crossbar switch is used for those IP cores instead of a bus. Then, the user provides a communication ratio (CR) for each pair of local buses and, following the Huffman coding philosophy, a binary tree (BT) is constructed with switches at the internal nodes and buses at the leaves. Since the binary tree is deadlock free (no path contains a cycle), the router is just a relatively simple and cheap switch. Simulation results show that the proposed methodology and architecture are better in switching circuit cost and performance than the SPIN architecture and the mesh architecture using our previously developed deadlock-free router.
key words: system-on-a-chip, application specific network on chip, globally asynchronous network, locally synchronous bus, wormhole routing, Huffman code
1. Introduction
System-on-a-chip (SOC) designs provide integrated solutions for various applications. One of the major challenges in designing an SOC is the communication architecture among heterogeneous components that operate at different frequencies and have different characteristics. This issue will become increasingly critical as SOCs integrate hundreds of heterogeneous components. Traditional bus-style architectures (such as AMBA [17] and WISHBONE [18]) are inefficient in performance, cost, and reliability because of the electrical properties of deep submicron processes [15]. A general-purpose network modeled on general-purpose computer system networks, however, may be unsuitable or sub-optimal for a specific application. Many methodologies have been proposed for creating an application specific network on chip. However, most of them approach the problem by selecting a topology and mapping the IP (Intellectual Property) cores onto the selected topology. To tackle the deadlock and livelock problems, a complex and expensive router must then be designed. A new, simple and cost-effective methodology is presented here to minimize the total communication time.
The remainder of this paper is organized as follows. Section 2 gives a review of related work. In Sect. 3, the basic features of our proposed methodology based on the binary tree architecture are presented together with an example. The experimental results are described in Sect. 4. Finally, some conclusions are given in Sect. 5.
Manuscript received March 16, 2005. Manuscript revised June 14, 2005. Final manuscript received July 29, 2005.
† The authors are with National Kaohsiung University of Applied Sciences, Kaohsiung, Taiwan.
†† The author is with National Cheng Kung University, Tainan, Taiwan.
a) E-mail: [email protected]
DOI: 10.1093/ietfec/e88–a.12.3531
2. Related Work
The Virtual Socket Interface (VSI) Alliance [8] has proposed a standard interface, to be used in conjunction with on-chip system buses, for point-to-point connections between high-performance virtual components (VCs). However, as systems grow larger, the bus becomes a performance bottleneck. To combat the problems of long, capacitive wires typically faced by bus-based architectures, researchers have proposed the concept of the network-on-chip (NOC), which consists of regular interconnection topologies and protocols. Several experimental NOCs are being developed. For instance, Kumar et al. [11] describe a mesh-based interconnection architecture. It consists of a mesh of computational resources (or IPs); each IP has an associated router switch, and each switch is connected to its four neighboring switches. Guerrier and Greiner proposed a generic architecture for on-chip packet-switched communication, called SPIN [4]. They chose a Fat-tree topology for the network because of its lower diameter compared with a mesh. An example is shown in Fig. 5(e) of this paper.
NOC architectures are also expected to meet reconfigurability and programmability requirements [4], [5], [7], [9], [11], [13], so a complex routing algorithm must be applied to avoid deadlock and livelock and to improve latency and throughput. Those methods therefore tend to design and evaluate a general-purpose network-on-chip, and the implementation of the routing algorithm creates an additional cost. We found that scalability and programmability need not be sacrificed: by carefully arranging the processor architecture, an application specific network can further reduce the cost and increase the performance compared with general-purpose NOCs.
Because of the deadlock and livelock problem [16] in some architectures (such as the mesh topology), a lock-free routing algorithm must be developed. For deadlock-free architectures, such as tree styles, the switch can be designed with a self-routing mechanism (such as wormhole routing for the Fat-tree [4]), which greatly reduces the cost of implementing a routing algorithm in the switches. However, to avoid congestion and to enhance the routing ability, Fat-tree based architectures add many roots to the tree. Such networks were originally constructed for general-purpose homogeneous multi-computer systems and assume a random communication pattern, so their cost is still high for specific applications. General-purpose networks cannot be optimized in cost and performance for application specific NOC systems, which motivates researchers to develop design methodologies for such systems.
Many methodologies [20]–[25] have been proposed for creating an application specific network on chip. However, most of them approach the problem by selecting a connection topology and mapping the IP cores onto the selected topology. To tackle the deadlock and livelock problems of the mesh topology, they either model the problem as an NP-hard problem or propose an iterative improvement algorithm. In either case, an expensive router is required. Indeed, implementing a mechanism that automatically detects and recovers from deadlock may not be affordable in terms of silicon resources; it may also lead to unpredictable delays [20]. The first paper discussing the mapping of IP cores to tiles and the routing path allocation problem for tile-based architectures was presented by Hu et al. [20]. However, since it is based on the mesh architecture style, the solution to the deadlock problem is still expensive.
Copyright © 2005 The Institute of Electronics, Information and Communication Engineers
NetChip [21] is a complete synthesis flow for customized NOC architectures that partitions the development work into topology mapping, selection, and generation, and provides tools for their automatic execution. For topology mapping and selection, a graph mapping problem is formulated, which is a special case of an intractable problem, the quadratic assignment problem. The solution to the deadlock problem is still expensive. In [22], a Communication Architecture Tuner (CAT) layer is added to each component of the existing communication architecture topology to monitor its internal state, analyze the communication transactions it generates, and predict the relative importance of those transactions in terms of their impact on system-level performance metrics. Sophisticated CATs incur a large additional hardware overhead. In [23], a communication graph and a topology graph are generated for the application. After a topology is selected, a clustering procedure and an iterative improvement procedure are applied to these graphs to map system communications onto physical links and obtain a near-optimal solution under performance constraints. Although dynamic conflict effects are considered, when the system becomes very large, computing all the possible deadlock preventions is also intractable. In [24], the Æthereal NOC system synthesizes a router with both guaranteed and best-effort services for a mesh network. However, the cost is still high [25].
Here, based on the Huffman coding philosophy, we present a specific architecture and methodology that let users build their own on-chip network on a binary tree architecture so as to minimize the total communication time and the total energy. Just as Huffman coding shortens the total code length, the design goal of the proposed methodology is to minimize the total communication time and therefore the average communication time. Moreover, since the cost is decreased, the energy consumed is decreased as well. Since the binary tree is deadlock free (no path contains a cycle), a router is just a relatively simple and cheap switch. Simulation results show that the proposed methodology and architecture are better in switching circuit cost and performance than the SPIN architecture [4] and the mesh architecture using our previously developed deadlock-free router [19].
3. The On-Chip Network Design
3.1 Interconnection Architecture
The proposed architecture uses a globally asynchronous communication network and locally synchronous buses (or crossbars or multistage interconnection networks, MINs). First, a local bus is assigned to a group of IP cores whose communications can be arranged to be exclusive in time. Then, the user provides a communication ratio (CR) for each pair of local buses and, following the Huffman coding philosophy, a binary tree (BT) is constructed with switches at the internal nodes and buses at the leaves. An interconnection architecture is shown in Fig. 1. The network topology affects the complexity of the routers. Because the binary tree is lock free, routing in it is easier than in tile-based (mesh-based) architectures. This results in potentially smaller switches, higher capacity, a shorter clock cycle, and better overall scalability.
Fig. 1 A specific binary tree network.
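The Huffman-like construction summarized in Sect. 3.1 and walked through in Sect. 3.3 can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' tool: the function names are hypothetical, CRs are written in percent so that the sums stay exact, and only the four edge weights marked "paper" are taken from the Fig. 2(a) example in the text; the remaining weights are invented so that the graph is complete.

```python
from itertools import combinations

# CRs in percent. Entries marked "paper" are quoted in Sect. 3.3;
# all other weights are HYPOTHETICAL fillers for illustration only.
EXAMPLE_CR = {
    ("BUS1", "BUS2"): 20,  # paper
    ("BUS4", "BUS5"): 20,  # paper
    ("BUS1", "BUS3"): 9,   # paper
    ("BUS2", "BUS3"): 11,  # paper
    ("BUS1", "BUS4"): 3, ("BUS1", "BUS5"): 2,  # hypothetical
    ("BUS2", "BUS4"): 4, ("BUS2", "BUS5"): 3,  # hypothetical
    ("BUS3", "BUS4"): 6, ("BUS3", "BUS5"): 6,  # hypothetical
}


def build_binary_tree(cr):
    """Repeatedly merge the two nodes joined by the largest CR.

    A leaf is a bus name (str); a switch is a (left, right) tuple.
    """
    weights = {frozenset(pair): w for pair, w in cr.items()}
    nodes = sorted({n for pair in weights for n in pair})
    while len(nodes) > 1:
        # On ties, max() keeps the first pair found: one merge at a time.
        a, b = max(combinations(nodes, 2),
                   key=lambda p: weights.get(frozenset(p), 0))
        switch = (a, b)
        weights.pop(frozenset((a, b)), None)
        rest = [n for n in nodes if n != a and n != b]
        for n in rest:
            # Edges from n to a and to b collapse into one edge to the switch.
            merged = (weights.pop(frozenset((a, n)), 0)
                      + weights.pop(frozenset((b, n)), 0))
            if merged:
                weights[frozenset((switch, n))] = merged
        nodes = rest + [switch]
    return nodes[0]


def switch_path(tree, leaf, path=()):
    """Switches from the root down to the parent of `leaf` (None if absent)."""
    if tree == leaf:
        return path
    if isinstance(tree, str):
        return None
    for child in tree:
        found = switch_path(child, leaf, path + (tree,))
        if found is not None:
            return found
    return None


def switches_between(tree, a, b):
    """Number of switches a packet crosses going from bus a to bus b."""
    pa, pb = switch_path(tree, a), switch_path(tree, b)
    common = 0
    for x, y in zip(pa, pb):
        if x != y:
            break
        common += 1
    return len(pa) + len(pb) - 2 * common + 1


tree = build_binary_tree(EXAMPLE_CR)
print(tree)
print(switches_between(tree, "BUS1", "BUS2"))
```

With these weights the merge order follows the one described for Fig. 2 (BUS1 with BUS2, then BUS3, then BUS4 with BUS5, then the two subtrees), and the heavily communicating pair BUS1/BUS2 ends up separated by a single switch.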
JEANG et al.: A BINARY TREE BASED METHODOLOGY FOR DESIGNING AN ASNOC
3.2 Bus System Construction
Owing to the popularity of bus systems such as AMBA [17], many existing or developing IPs are AMBA compatible. Grouping IPs on a single bus is therefore becoming easier when all the IPs have the same bus interface. A bus is a time-shared medium. Thus, IPs can be grouped on a bus if their communications are inherently exclusive in time or can be arranged to be so. Data are transferred on the bus in parallel rather than at bit-serial or flit-based granularity. However, the number of IPs on a bus should be limited to a suitable number; otherwise the bus becomes inefficient and costly, which is the very problem that NOCs are meant to resolve. If the communications of some IP cores must be completed within a given amount of time, a non-blocking MIN or a crossbar switch should be used for those IP cores instead of a bus. Access requests are issued to the bus arbiter to initiate a bus transaction. The arbiter decides the priorities when requests conflict.
3.3 Globally Asynchronous Communication System
For each pair of local bus groups, a communication ratio (CR) is provided by the user. The CR for a pair of IPs is the total number of bits transferred between the pair, divided by the total number of bits transferred in the system. After the first step of grouping IPs into buses, the CR for a pair of buses is the sum of the CRs of all pairs of IPs that span the two buses. The CR of each pair of buses can be obtained by simulation. As shown in Fig. 2(a), based on these CRs, an undirected graph is constructed in which each node represents a bus and each edge represents the existence of communications between the pair of buses. The weight on an edge is the CR of the corresponding buses.
Fig. 2 The construction of the tree network in Fig. 1.
Similar to the Huffman coding method, the two local buses with the highest CR are grouped first, forming the first switching point of the globally asynchronous network. The two groups joined by a switching point are then regarded as one new group. The two CRs from the two old groups to every other local bus group are added to form the CR between the new group and that group. In the example of Fig. 2(a), the largest CR (0.2) is on the edge between BUS1 and BUS2 (the CR on the edge between BUS4 and BUS5 is also 0.2, but only one pair is selected at a time). A new graph is then constructed in which BUS1 and BUS2 are grouped to form a switch S12, represented by a new node, as shown in Fig. 2(b). All edges connected to BUS1 or BUS2 in the original graph are now connected to the new node S12. That is, the two edges from a node to BUS1 and BUS2 are merged into one edge, and their CRs are summed to give the CR on the new edge to S12. In the example of Fig. 2(a), the CR on the edge between BUS3 and S12 is the sum of the CR between BUS3 and BUS1 (0.09) and the CR between BUS3 and BUS2 (0.11), which is 0.2. The same process forms the next switching point. As shown in Fig. 2(b), the largest CR is now on the edge between S12 and BUS3, so they are merged to form a new switch S123, as shown in Fig. 2(c). Then BUS4 and BUS5 are grouped as S45, as shown in Fig. 2(d), and S123 and S45 are grouped as S12345, as shown in Fig. 2(e). Finally, as shown in Fig. 1, a binary tree is built in which each internal node is a switching point and each leaf is a local bus (or crossbar or MIN).
3.4 Packet-Switching Technique
The packet-switching communication model and the wormhole routing method [1] are adopted.
Under the wormhole routing scheme, a message is divided into smaller units called flits, consisting of a header flit and several data flits. The first (header) flit of a packet contains the routing information; it enables the switches to establish the path, and the subsequent flits simply follow this path in a pipelined fashion by means of switch output port reservation. A flit is passed to the next switch as soon as enough space is available to store it, even if there is not enough space to store the whole packet. Thus, each port needs buffer registers only one flit long rather than a whole packet long. This reduces both the latency and the storage requirements of each node, and it also makes the switches smaller, more compact, and faster. As shown in Fig. 3, Type defines the type of the flit, namely header or data. Destination Address and Source Address are the fields that identify the target and sender IPs. How these addresses are obtained is discussed in the next section. Packet Length defines the number of flits in a packet.
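To make the flit structure concrete, the following Python sketch packs a message into one header flit followed by data flits. The field names come from the description above and the 32-bit flit width from the experimental setup in Sect. 4; everything else is an assumption made for illustration, in particular the bit positions, the 8-bit address and length widths, the encoding of Type, and the choice to count the header flit in Packet Length. The function names are hypothetical.

```python
FLIT_BITS = 32                  # flit width used in the experiments (Sect. 4)
TYPE_HEADER, TYPE_DATA = 1, 0   # ASSUMED encoding of the Type field
ADDR_BITS, LEN_BITS = 8, 8      # ASSUMED field widths
PAYLOAD_BITS = FLIT_BITS - 1    # data flit: 1 type bit + payload (assumed)


def make_header(dest, src, n_flits):
    """Pack Type | Destination | Source | Packet Length into one flit."""
    for value, bits in ((dest, ADDR_BITS), (src, ADDR_BITS), (n_flits, LEN_BITS)):
        if not 0 <= value < (1 << bits):
            raise ValueError("field does not fit its assumed width")
    return (TYPE_HEADER << 31) | (dest << 23) | (src << 15) | (n_flits << 7)


def make_packet(dest, src, payload_words):
    """Return the flits of one packet: a header flit, then the data flits."""
    flits = [make_header(dest, src, len(payload_words) + 1)]
    mask = (1 << PAYLOAD_BITS) - 1
    flits += [(TYPE_DATA << 31) | (word & mask) for word in payload_words]
    return flits


def parse_header(flit):
    """Unpack the header fields with the same assumed layout."""
    return {
        "type": flit >> 31,
        "dest": (flit >> 23) & 0xFF,
        "src": (flit >> 15) & 0xFF,
        "length": (flit >> 7) & 0xFF,
    }


packet = make_packet(dest=5, src=3, payload_words=[0x1234])
print([hex(f) for f in packet], parse_header(packet[0]))
```

A two-flit packet like the one built here corresponds to the packet size assumed in the experiments; a switch would only need to inspect the header flit to reserve an output port, after which the data flit follows the same path.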
Fig. 3 (a) Format of header flit, (b) format of data flit.
Fig. 4 (a) The architecture of a switch, (b) directed graph for deciding the priority, (c) competing for Rout, (d) competing for Pout, (e) competing for Lout.
3.5 Switch Implementation
As shown in Fig. 4(a), a switch has three ports, connected to the parent node, the left child node, and the right child node, and denoted P, L, and R, respectively. Each port has two channels for bidirectional communication, so two packets can be transmitted simultaneously in opposite directions between neighboring switches. Each channel has its own handshaking mechanism, called a link controller (LC), for asynchronous transfer. Each input channel has a buffer that stores one flit of a message. A message is usually divided into several flits. A flit is routed through the switch, across a physical channel, and into the input buffer of the next switch. The arbitration and priority logic shown in Fig. 4(a) implements the routing algorithm and decides which packet moves on to the next node when two packets try to advance to the same node at the same time.
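The port selection inside such a three-port switch can be illustrated with a small Python sketch. The paper does not spell out its addressing scheme, so the sketch assumes, purely for illustration, that each switch knows which bus addresses lie below its left and right children; the function names and the example sets are hypothetical.

```python
def output_port(dest, left_buses, right_buses):
    """Pick the output channel for a destination bus (assumed addressing)."""
    if dest in left_buses:
        return "Lout"
    if dest in right_buses:
        return "Rout"
    return "Pout"  # destination is elsewhere in the tree: go up


def route(in_port, dest, left_buses, right_buses):
    """Route a header flit arriving on Pin, Lin, or Rin."""
    out = output_port(dest, left_buses, right_buses)
    # In a tree a packet never leaves through the port it arrived on;
    # such a packet would already have been turned around lower down.
    if out[0] == in_port[0]:
        raise ValueError("packet would leave through its arrival port")
    return out


# Hypothetical switch with BUS1 and BUS2 below its left port, BUS3 below
# its right port, and every other bus reachable through the parent port.
left, right = {"BUS1", "BUS2"}, {"BUS3"}
print(route("Lin", "BUS3", left, right))
print(route("Lin", "BUS4", left, right))
print(route("Pin", "BUS1", left, right))
```

Under this assumption each input can only ever request the two other ports, so each output is contended by at most two inputs, which is why a simple two-way arbiter per output is sufficient.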
Following our philosophy, a communication with a higher communication ratio is given a higher priority. As shown in Fig. 4(a), two incoming messages may compete for each outgoing channel: Pin and Lin compete for Rout, Rin and Lin compete for Pout, and Pin and Rin compete for Lout. Therefore, whenever there is a competition, we have to decide which input has the higher priority. For this purpose, while constructing the undirected graph from which the binary tree is derived, we also construct a directed graph in which each node represents a bus and each directed edge represents the direction of message passing. The weight on an edge again represents the communication ratio (CR). In the example of Fig. 4(b), suppose NODE1 and NODE2 (a NODE may be a BUS or a switch) are merged by a switch whose left and right children are NODE1 and NODE2, respectively. The CR of Pin of this switch is the sum of the CRs on all edges entering NODE1 or NODE2, the CR of Lin is the sum of the CRs on all edges leaving NODE1, and, similarly, the CR of Rin is the sum of the CRs on all edges leaving NODE2. The CR of Pout is the sum of the CRs on all edges leaving NODE1 or NODE2, except the edges between NODE1 and NODE2. If this switch becomes the left or right child of an upper-level switch in our construction method, its CR of Pout is taken as the CR of Lin or Rin, respectively, of the upper-level switch. The CRs of Rout and Lout are not needed by our algorithm. For example, if in a given switch the CR of Lin is 0.2 and the CR of Pin is 0.3, and Lin and Pin compete for Rout, then Pin has the higher priority.
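The priority rule, combined with the per-output counter that the remainder of this subsection describes for avoiding starvation, can be sketched as a small arbiter model in Python. This is a behavioral sketch under our reading of the text, not the authors' circuit: the class and method names are hypothetical, and the model assumes that the counter only changes when both inputs request the output at the same time.

```python
import math


class OutputArbiter:
    """Two-input arbiter for one output channel of a switch (sketch)."""

    def __init__(self, cr):
        # cr maps the two competing input names to their CRs.
        if len(cr) != 2 or min(cr.values()) <= 0:
            raise ValueError("expected two inputs with positive CRs")
        self.high = max(cr, key=cr.get)  # input with the larger CR
        self.low = next(name for name in cr if name != self.high)
        # Initial counter value: ceiling of larger CR / smaller CR.
        self.initial = math.ceil(cr[self.high] / cr[self.low])
        self.counter = self.initial

    def grant(self, requests):
        """Return the input that wins the output, or None if idle."""
        requests = set(requests)
        if not requests:
            return None
        if len(requests) == 1:          # no competition: counter untouched
            return next(iter(requests))
        if self.counter > 0:            # high-priority input wins
            self.counter -= 1
            return self.high
        self.counter = self.initial     # low-priority input gets its turn
        return self.low


# Example from the text: CR(Pin) = 0.3 and CR(Lin) = 0.2 competing for Rout.
rout = OutputArbiter({"Pin": 0.3, "Lin": 0.2})
grants = [rout.grant({"Pin", "Lin"}) for _ in range(6)]
print(rout.initial, grants)
```

Under continuous contention the model grants Pin twice for every grant to Lin, so the lower-priority input is delayed but never blocked indefinitely.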
To prevent lower-priority communications from being blocked forever, or stalled so long that the average performance of the whole system degrades, we associate a simple counter with each output. Since two inputs may compete for each output, the initial value of the counter is set to the ceiling of the larger CR divided by the smaller CR. In the above example, the ceiling of 0.3/0.2 is 2, so the initial value is set to 2. Every time there is a competition and the communication from Pin obtains the output buffer, the counter of Rout is decremented. Once the counter reaches zero, the communication from Lin obtains Rout in the next competition with Pin. After each communication from Lin that is completed under competition, the counter of Rout is reset to 2.
4. Experimental Results
An experimental example is shown in Fig. 5. The simulation tool is Verilog-XL from Cadence. There are 16 IPs. Communications among the IPs in Group1={IP0, IP5, IP7, IP14}, in Group2={IP1, IP12, IP8, IP13}, in Group3={IP2, IP4, IP6, IP10}, or in Group4={IP3, IP11, IP9, IP15} are exclusive in time, and thus the IPs in each group can share a bus. The sum of CRs for all IPs inside each bus is assumed to be 0.04; that is, intra-bus communications account for 16% of the total. The initial CRs for the buses are shown in Fig. 5(a). The construction of the binary tree by our algorithm is shown in Figs. 5(a) to 5(c), and Fig. 5(d) shows the constructed binary tree architecture. For the SPIN architecture, if all IPs whose communications can be arranged to be exclusive in time are grouped under the same first-level switches (the switches connected to IPs), or, for the mesh architecture, if all such IPs are placed in the same neighborhood or as close together as possible, we say that the arrangement is bus-concerned. Figure 5(e) shows a SPIN-like architecture without bus concerns. Figure 5(f) shows a SPIN-like architecture with bus concerns, in which all IPs in each group are arranged in the first level of switches. Figure 5(g) shows a mesh-like architecture without bus concerns, while Fig. 5(h) shows a mesh-like architecture with bus concerns, that is, buses with higher CRs are placed near each other.
Fig. 5 (a)–(c) Construction of binary tree, (d) the BT, (e) SPIN without bus concerns (SPIN1), (f) SPIN with bus concerns (SPIN2), (g) mesh without bus concerns (mesh1), (h) mesh with bus concerns (mesh2).
Our experiments assume the following.
(1) Each packet has 2 flits and each flit has 32 bits.
(2) One to four packets are randomly generated in every unit of time until all the packets of the given number have been sent. Packets are sent by either unicasting or multicasting.
(3) The width of each port to or from a switch is 32 bits.
(4) Packets sent to IPs within a group are exclusive in time.
(5) If all IPs in a group are implemented as a bus (or crossbar switch or MIN), each communication within the group (only a data flit is needed) takes 3 units of time: 2 units for interrupt request and acknowledgment and one unit for the transfer.
(6) For the BT or SPIN network, sending a flit to a switch takes 4 units of time if the channel is available: 2 units for handshaking, one unit for the transfer, and one unit for decoding and routing.
(7) For the mesh network, sending a header flit takes 16 units of time while sending a data flit takes 4 units of time.
(8) Given a number of packets to be sent in an experiment for each type of architecture, we perform the experiment 5 times and take the average transfer time for completing all the communications of the given number of packets.
(9) The source (an IP on a bus) of a communication cannot deliver the next packet until the whole previous packet has been sent out. That is, an IP must wait at least 6 units of time after an inter-bus send and at least 2 units of time after an intra-bus send.
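The timing figures and the packet injection rule stated in the experimental assumptions (items (2) and (5) to (7)) can be written down as a small Python sketch. The constant and function names are hypothetical, and the sketch adds one assumption of its own: in the last time unit, the number of generated packets is truncated so that exactly the requested total is produced. Source and destination selection is not modeled.

```python
import random

# Timing constants taken from the stated assumptions (units of time).
BUS_TRANSFER = 2 + 1      # interrupt request/acknowledge + transfer
SWITCH_FLIT = 2 + 1 + 1   # handshaking + transfer + decoding/routing
MESH_HEADER_FLIT = 16
MESH_DATA_FLIT = 4


def injection_schedule(total_packets, rng=None, max_per_tick=4):
    """Packets generated per time unit: 1..max_per_tick until all are sent.

    The final entry is truncated to hit total_packets exactly (assumption).
    """
    rng = rng or random.Random()
    schedule = []
    remaining = total_packets
    while remaining > 0:
        count = min(rng.randint(1, max_per_tick), remaining)
        schedule.append(count)
        remaining -= count
    return schedule


schedule = injection_schedule(100, random.Random(0))
print(len(schedule), "time units to inject", sum(schedule), "packets")
```

Passing a seeded `random.Random` makes a run repeatable, which is convenient when the same experiment is repeated several times and averaged as in item (8).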
To prevent deadlock, our mesh router [19] uses two time-multiplexed virtual channels for each physical channel. It takes 16 units of time to route a header flit and 4 units of time to send a data flit. For the tree-type architecture with bidirectional channels, however, routing is relatively simple and effective. Under wormhole routing, when network contention occurs, the newcomer must wait until the previous packet, which holds the channel, has been completely transferred. Three factors may affect the performance under wormhole routing because of network contention: (1) the number of packets issued at the same time, (2) the average length (number of flits) of the packets issued, and (3) the average path length that a flit must traverse. The architecture and the connection arrangement of the IP cores both affect factor (3). Therefore, to evaluate the performance under a fixed architecture and connection arrangement (so that factor (3) is unchanged), one may either fix the number of flits in a packet and increase the number of packets, or fix the number of packets and increase the number of flits in a packet. However, in most high-performance computation schemes such as tightly coupled computer systems (of which an SOC is an example), short messages are more frequent than long ones. Thus, in our experiments, we fix the number of flits per packet and increase the number of packets to evaluate the performance.
4.1 Performance Comparisons
Figures 5(e) and 5(g) show a SPIN and a mesh architecture without bus concerns, respectively. Table 1 compares the performance of our design (Fig. 5(d)) with the SPIN1 (Fig. 5(e)) and mesh1 (Fig. 5(g)) architectures. The performance of our design is clearly better than that of SPIN1 and mesh1. The speed-up percentage over SPIN1 (as an example) is calculated as the difference between the average transfer time using the SPIN1 architecture and that using the binary tree, divided by the average transfer time using SPIN1.
Table 1 Comparisons of BT based design versus SPIN and mesh architectures without bus concerns.
Figures 5(f) and 5(h) show a SPIN and a mesh architecture with bus concerns, namely SPIN2 and mesh2, respectively. In Table 2, the BT based design still performs better than the SPIN and mesh architectures. However, the speed-up percentage grows more slowly as the number of packets increases. This is because, when the SPIN or mesh architecture is arranged with bus concerns, the time spent routing through switches is greatly reduced; that is, fewer packets enter the global network. As stated above, three factors may affect the performance: the larger the number of packets issued, the longer the packets, or the longer the average path length of all transfers, the higher the probability of network contention. Under wormhole routing, the higher the probability of network contention, the worse the performance.
Table 2 Comparison results of BT based design versus SPIN with bus concerns and mesh with bus concerns.
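The speed-up percentage defined in this subsection is a simple relative difference, shown here as a Python helper. The function name is hypothetical and the numbers in the example are made up for illustration; they are not values from Table 1 or Table 2.

```python
def speedup_percent(t_other, t_bt):
    """Speed-up of the binary tree over another architecture, in percent.

    t_other: average transfer time on the reference architecture (e.g. SPIN1)
    t_bt:    average transfer time on the binary tree
    """
    if t_other <= 0:
        raise ValueError("reference time must be positive")
    return (t_other - t_bt) / t_other * 100


# Hypothetical times: 200 units on the reference design, 150 on the BT.
print(speedup_percent(200, 150))
```

Because the difference is normalized by the reference architecture's time, the result reads as the share of the reference transfer time that the binary tree saves.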
Our design philosophy, like the Huffman coding philosophy, is to construct the network hierarchically: more frequent communications use shorter paths (through a local bus or a few switches), while less frequent communications use longer paths. Thus, the average data transfer path is clearly shorter than in the SPIN-style and mesh-style designs. Hence, as the number of packets issued increases, the probability of network contention in the SPIN and mesh architectures grows more quickly than in our design under the same conditions. That is why our design obtains a higher speed-up as the number of packets increases.
Huffman coding is provably optimal among symbol-by-symbol prefix codes for lossless compression, and the communication ratio plays the same role in our method as the occurrence frequency does in Huffman code compression. The purpose of Huffman coding is to shorten the total code length, while the purpose of the binary tree we build is to shorten the total communication length. The main idea of Huffman coding is that the higher the occurrence frequency of a symbol, the shorter its code. Similarly, the main idea of our methodology is that the higher the communication ratio, the shorter the communication path.
4.2 Cost Comparisons
Figure 6 shows the layout of the switch for our binary tree. Table 3 compares the costs of the designs in Fig. 5(d), Fig. 5(f) and Fig. 5(h). Note that, because our mix-mode design uses local buses instead of separate transmission lines routed to switches, the area overhead of the buses in our design should be less than that of the separate transmission lines in the SPIN or mesh style. Furthermore, comparisons of the switch area, the number of switches, the number of buffers, and the hardware complexity of each switch also show that our design has a lower area overhead than the SPIN or mesh style.
Fig. 6 The layout of the switch for our binary tree.
Table 3 Comparisons of interconnection cost in Fig. 6.
5. Conclusion
This paper has shown that a mix-mode binary tree based methodology is cost-effective. A Huffman coding method is used to minimize the total communication time. The proposed bottom-up methodology helps users construct a better NOC system under their expected requirements and specifications. It has been shown that this architecture can be implemented compactly, so that it can be integrated with a variety of cores through a well-known bus-compatible interface.
References
[1] L.M. Ni and P.K. McKinley, “A survey of wormhole routing techniques in direct networks,” Computer, vol.26, no.2, pp.62–76, Feb. 1993.
[2] J.A.J. Leijten, J.L. van Meerbergen, A.H. Timmer, and J.A.G. Jess, “Stream communication between real-time tasks in a high-performance multiprocessor,” Proc. 1998 DATE Conference, Paris, France, March 1998.
[3] A. Hemani, A. Jantsch, S. Kumar, A. Postula, J. Öberg, M. Millberg, and D. Lindqvist, “Network on chip: An architecture for billion transistor era,” Proc. IEEE NorChip Conference, Nov. 2000.
[4] P. Guerrier and A. Greiner, “A generic architecture for on-chip packet-switched interconnections,” Proc. Conference on Design, Automation and Test in Europe, pp.250–256, March 2000.
[5] J. Liang, S. Swaminathan, and R. Tessier, “aSOC: A scalable, single-chip communications architecture,” 2000 International Conference on Parallel Architectures and Compilation Techniques (PACT’00), Philadelphia, PA, Oct. 2001.
[6] D. Wingard, “MicroNetwork-based integration for SOCs,” Proc. IEEE Design Automation Conference, pp.673–677, 2001.
[7] W.J. Dally and B. Towles, “Route packets, not wires: On-chip interconnection networks,” Proc. IEEE Design Automation Conference, pp.684–689, 2001.
[8] “Virtual component interface standard version 2 on-chip bus DWG,” VSI, April 2001.
[9] M. Forsell, A. Hemani, A. Jantsch, S. Kumar, M. Millberg, J. Öberg, J.-P. Soininen, and K. Tiensyrjä, “A network on chip architecture and design methodology,” Proc. ISVLSI 2002, pp.105–112, 2002.
[10] L. Benini and G. De Micheli, “Networks on chips: A new SoC paradigm,” Computer, vol.35, no.1, pp.70–78, Jan. 2002.
[11] S. Kumar, A. Jantsch, J.-P. Soininen, M. Forsell, M. Millberg, J. Öberg, K. Tiensyrjä, and A. Hemani, “A network on chip architecture and design methodology,” Proc. ISVLSI 2002, pp.117–124, 2002.
[12] W. Cesário, A. Baghdadi, L. Gauthier, D. Lyonnard, G. Nicolescu, Y. Paviot, S. Yoo, A.A. Jerraya, and M. Diaz-Nava, “Component-based design approach for multicore SoCs,” Proc. 39th Design Automation Conf. (DAC 02), pp.789–794, ACM Press, New York, 2002.
[13] P.P. Pande, C. Grecu, A. Ivanov, and R. Saleh, “Design of a switch for network on chip applications,” ISCAS’03, vol.5, pp.217–220, May 2003.
[14] D. Wiklund and D. Liu, “SoCBUS: Switched network on chip for hard real time embedded systems,” Proc. Parallel and Distributed Processing Symposium, 2003, pp.78–85, April 2003.
[15] K. Keutzer, “Chip level assembly (and not integration of synthesis and physical) is the key to DSM design,” Proc. ACM/IEEE International Workshop on Timing Issues in the Specification and Synthesis of Digital Systems (Tau’99), Monterey, CA, March 1999.
[16] W.J. Dally and C.L. Seitz, “Deadlock-free message routing in multiprocessor interconnection networks,” IEEE Trans. Comput., vol.36, no.5, pp.547–553, May 1987.
[17] ARM, AMBA Specification, available from www.arm.com.
[18] W. Peterson, “Design philosophy of the WISHBONE SoC architecture,” Silicore Corporation, 1999, please refer to http://www.silicore.net/wishbone.htm
[19] S.-H. Hsu, J.-M. Jou, and Y.-C. Wu, “New routing algorithms and router architecture design for NOC,” 15th VLSI Design/CAD Symposium, Taiwan, Aug. 2004.
[20] J. Hu and R. Marculescu, “Energy- and performance-aware mapping for regular NOC architectures,” IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol.24, no.4, pp.551–562, April 2005.
[21] D. Bertozzi, A. Jalabert, S. Murali, R. Tamhankar, S. Stergiou, L. Benini, and G. De Micheli, “NoC synthesis flow for customized domain specific multiprocessor system-on-chip,” IEEE Trans. Parallel Distrib. Syst., vol.16, no.2, pp.113–129, Feb. 2005.
[22] K. Lahiri, A. Raghunathan, C. Lakshminarayana, and S. Dey, “Design of high-performance system-on-chips using communication architecture tuners,” IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol.23, no.5, pp.620–636, May 2004.
[23] K. Lahiri, A. Raghunathan, and S. Dey, “Design space exploration for optimizing on-chip communication architectures,” IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol.21, no.6, pp.952–961, June 2004.
[24] K. Goossens, J. Dielissen, O.P. Gangwal, and S.G. Pestana, “A design flow for application-specific networks on chip with guaranteed performance to accelerate SOC design and verification,” Proc. Design, Automation and Test in Europe Conference and Exhibition (DATE’05), 2005.
[25] E. Rijpkema, K.G.W. Goossens, A. Radulescu, J. Dielissen, J. van Meerbergen, P. Wielage, and E. Waterlander, “Trade offs in the design of a router with both guaranteed and best-effort services for networks on chip,” Proc. Design, Automation and Test in Europe Conference and Exhibition (DATE’03), 2003.
Yuan-Long Jeang received his Bachelor’s degree in Engineering Sciences from National Cheng Kung University, Taiwan, in 1978, his M.S. degree in computer science from Stevens Institute of Technology, U.S.A., in 1983, and his doctoral degree in Electrical Engineering from National Cheng Kung University, Taiwan, in 1991. He is currently an associate professor in the Electronic Engineering Department, National Kaohsiung University of Applied Sciences, Kaohsiung City, Taiwan. He is currently working on the design of a reconfigurable microprocessor generator and its embedded in-circuit emulator, the exploration of application specific network on chip (NOC) architecture and router design, and research on the hardware design of embedded wavelet image coding and decoding.
Jer-Min Jou received the Ph.D. degree in electrical engineering and computer science from National Cheng Kung University, Tainan, Taiwan, R.O.C., in 1987. In 1989, he was an associate professor in the Department of Electrical Engineering, National Cheng Kung University, R.O.C. Currently he is a professor at the same university. His research interests include SoC hardware-software codesign, system design, ASIC design and synthesis, VLSI CAD, and asynchronous circuit design. Dr. Jou was the recipient of a Distinguished Paper Citation at the 1987 IEEE ICCAD Conference in Santa Clara, CA. He received the Long-term Best Paper Award from the Acer Foundation in 1998 and 1999, and the First Level of the 2001 IP Competition sponsored by the Ministry of Education, R.O.C. He holds three R.O.C. patents and one U.S. patent, and has been a reviewer for many IEEE journals.
Win-Hsien Huang received his Bachelor’s degree in Electronic Engineering from Cheng-Shiu University, Kaohsiung County, Taiwan. He is now a graduate student in the Electronic Engineering Department, National Kaohsiung University of Applied Sciences, Kaohsiung City, Taiwan.