Article  Computer Science & Technology

October 2010  Vol.55  No.29: 3363–3371  doi: 10.1007/s11434-010-4118-z

Computationally efficient locality-aware interconnection topology for multi-processor system-on-chip (MP-SoC)

Haroon-Ur-Rashid Khan*, SHI Feng, JI WeiXing, GAO YuJin, WANG YiZhuo, LIU CaiXia, DENG Ning & LI JiaXin

School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China

Received March 16, 2009; accepted May 1, 2010
This paper evaluates the Triplet Based Architecture (TriBA), a new idea in chip multiprocessor architectures and a class of Direct Interconnection Network (DIN). TriBA consists of a 2D grid of small, programmable processing units, each physically connected to its three neighbors, so that the advantageous features of group locality can be fully and efficiently utilized. Any communication model can be characterized by locality properties, and any topology has its own intrinsic, structural locality characteristics. We propose a new criterion for performance evaluation based on the concept of locality in an interconnection network, the "lower layer complete connect". The proposed criterion depicts how completely a processing node is connected to all its neighbors. TriBA is compared with the 2D mesh and the binary tree as static interconnection networks. The comparison is made from three orthogonal viewpoints, viz., computational speed, physical layout and cost. Our analysis concludes that TriBA is a computationally efficient interconnection strategy that exploits group locality among processing nodes.

multiprocessor, locality, interconnection network, VLSI layout, performance evaluation

Citation:  Khan H U R, Shi F, Ji W X, et al. Computationally efficient locality-aware interconnection topology for multi-processor system-on-chip (MP-SoC). Chinese Sci Bull, 2010, 55: 3363–3371, doi: 10.1007/s11434-010-4118-z
Multiprocessor Systems on Chip (MPSoC) combine the advantages of the parallel computing of multiprocessors with the single-chip integration of SoCs. MPSoCs are employed in embedded systems that require high-performance data processing capabilities [1–4]. Examples include network processors (NPs), parallel multimedia processors (PMPs) and other application-specific array processors (ASAPs). Improvements in semiconductor technology have made it possible to include multiple processor cores on a single die. Chip Multi-Processors (CMPs) are an attractive choice for future billion-transistor architectures due to their low design complexity, high clock frequency, and high throughput. Multi-Processor (MP-SoC) platforms are emerging as the latest trend in SoC design. These MP-SoCs consist of a large number of Intellectual Property (IP) blocks in the form of functionally homogeneous/heterogeneous embedded processors. In this new design paradigm, IP blocks need to be integrated using a structured interconnect template, for example one modeled on high-performance parallel computing architectures. A formal evaluation process is required before adopting a specific parallel architecture in the SoC domain [1,5]. Complex Systems on Chip (SoCs) consisting of billions of transistors can be realized in 65 nm technology [5,6]. The emergence of SoC platforms consisting of large, heterogeneous sets of embedded processors is imminent [1,2,6]. A key focus of such multiprocessor SoC platforms is the interconnect topology, which should therefore resemble the interconnect architecture of high-performance parallel computing systems [1,2]. Many interconnection networks for on-chip multiprocessor architectures have been proposed in the literature over the past three decades; extensive accounts of these networks and their performance evaluation have been reported in [2–4].

*Corresponding author (email: [email protected])
© Science China Press and Springer-Verlag Berlin Heidelberg 2010
csb.scichina.com  www.springerlink.com

In this article we emphasize the evaluation and comparison of the Triplet Based Architecture (TriBA) [7] with two widely accepted network topologies: the 2D mesh and the binary tree. The primary reason for comparison with the 2D mesh is the wide acceptability of the mesh topology. The computational model used here to asymptotically evaluate the performance of the networks is similar to Thompson's [8] grid model for VLSI implementation. The most important classical quantitative topological principle for designing and evaluating interconnection networks is locality [6,9,10]. The advantage of the triplet based topology over other 2D topologies such as the mesh, binary tree and hypercube is its efficient exploitation of locality characteristics in complex scientific computations. This paper introduces a new criterion for performance evaluation based on the concept of group locality in an interconnection network, the "lower layer complete connect". The proposed criterion depicts how completely a processing node is connected to all its neighbors. In order to fully exploit the advantages of locality, an important consideration in a VLSI environment is the spatial distribution of the processors, which ultimately plays an important role in the chip area. Besides this, the regularity of the network topology decides the layout cost, and fault tolerance has an additional role to play. To improve the device yield, and thereby reduce the overall cost, it is sometimes necessary to introduce redundant processors; but since on-chip devices have high fault tolerance, the concept of the redundant processor has lost importance in the VLSI world. The criteria of evaluation are enumerated from three orthogonal viewpoints, viz., area, speed and cost. The speed of computation depends both on the topology of the network and on the presence of long interconnects.
The number of links and the interconnection structure also decide a network's message traffic density and serve as an important measure for determining its communication bottleneck. The cost aspects consider the fabrication cost and the replacement cost due to poor reliability of the network. In addition to the evaluation criteria already known in the literature, our proposed criterion, based on group locality, adds another dimension to the evaluation of interconnection networks.
The most frequently used on-chip interconnect architecture is the shared-medium arbitrated bus, in which all communicating devices share the same transmission medium. For a relatively long bus line, the intrinsic parasitic resistance and capacitance can be quite high. As the bus length and/or the number of IP blocks increases, the associated delay in bit transfer over the bus grows and will eventually exceed the targeted clock period. This limits, in practice, the number of IP blocks that can be connected to a bus and thereby limits system scalability [11]. One solution in such cases is to split the bus into multiple segments and introduce a hierarchical architecture [12]. However, this is ad hoc in nature and retains the inherent limitations of bus-based systems. For SoCs consisting of tens or hundreds of IP blocks, bus-based interconnect architectures will lead to serious bottleneck problems, as all attached devices must share the bandwidth of the bus [9,11,12].

1 Triplet based architecture

Triplet based architecture (TriBA) is a new solution for computer architecture, which is believed to be suitable for sophisticated embedded applications with multiple concurrent processing centers. TriBA consists of a 2D grid of small, programmable processing units, each physically connected to its three neighbors [7]. TriBA is a hierarchical network in which the number of processing nodes increases by a power of three at each stage. If K represents the level of the hierarchy, then K=0 represents a single node. The interconnection strategy for different levels of TriBA is shown in Figure 1.

Figure 1  Triplet based interconnection.

1.1 Tiled layout for TriBA

Each of the nodes in TriBA is designed as a tile using the Tiled Architecture approach, so that no link is longer than the length or width of a tile. Note that each core/node has a private L1 cache; however, a triplet of three cores shares a common L2 cache, as shown in Figure 2. It is worth mentioning here that the L2 cache has been designed so that the maximum amount of data related to the three nodes is present in it; we believe this helps reduce inter-node communication traffic. A detailed design of the L2 cache is beyond the scope of this article.

Figure 2  A triplet of cores sharing an L2 cache.
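To make the growth of the hierarchy concrete, the following Python sketch (our own illustration, not part of the original design) counts nodes and links per level using the power-of-three structure described above:

```python
def triba_counts(k):
    """Node and link counts for a level-k TriBA/THIN network.

    Level 0 is a single node.  Each level i groups three copies of a
    level-(i-1) network and adds 3 links to join them; those 3 links
    appear once in each of the 3^(k-i) copies of the level-i
    sub-network contained in the full network.
    """
    nodes = 3 ** k
    links = sum(3 * 3 ** (k - i) for i in range(1, k + 1))
    return nodes, links

# level: 0 -> (1, 0), 1 -> (3, 3), 2 -> (9, 12), 3 -> (27, 39)
for k in range(4):
    print(k, triba_counts(k))
```

The link total agrees with the closed form given later in the analysis, L(N_T) = Σ_{i=0}^{l−1} N_T/3^i.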
1.2 Interconnection network and layout

The triangular structure is a special case of the WK-Recursive networks [13]. The triplet based hierarchical interconnection network (THIN), introduced in [14], is not only a new kind of direct network but also a kind of hierarchical interconnection network (HIN). Figure 3 shows the strategy for address assignment to each node with increasing THIN level K. THIN is a hierarchical and scalable interconnection network: we define a single node as a level-0 THIN, and a level-1 THIN is constructed by connecting three nodes with three communication links, forming a triangle. A new routing algorithm for THIN, DDRA (Distributed Deterministic Routing Algorithm), has been proposed in [14]. In this approach each node, upon receiving a message, decides whether the message should be delivered to the local processor or to which neighboring node it should be forwarded. Each node is connected to three other nodes, except the vertex nodes, so the failure of a link or a node always leaves a choice of an alternate path. This would, no doubt, result in performance degradation, but it will not have an adverse effect on the operation of the network, such as deadlock; we therefore believe that TriBA has good fault tolerance capability. The source routing algorithm and the table look-up routing algorithm are two well-known algorithms for hierarchical interconnection networks; DDRA, however, uses a simple calculation to determine the route, without extra storage in each node [14,15].
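The exact DDRA rules are given in [14]; as a rough illustration of how hierarchical base-3 addresses can drive a routing decision by calculation alone, without routing tables, consider this simplified sketch (our own; the digit-string addresses are hypothetical, patterned on the addressing of Figure 3):

```python
def divergence_level(src, dst):
    """Smallest sub-network level containing both nodes.

    Addresses are hypothetical base-3 digit strings, most significant
    digit first, in a network whose level equals the string length.
    A result of 0 means src == dst; a result of L means the message
    must cross a level-L link to reach its destination.
    """
    for i, (a, b) in enumerate(zip(src, dst)):
        if a != b:
            return len(src) - i
    return 0

# '001' and '002' share a level-1 triplet; '001' and '021' only share
# the level-2 sub-network, so a level-2 link must be crossed.
```

A node can make this comparison locally on each hop, which is the spirit of a distributed deterministic scheme.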
2 Analysis of triplet based network

We now proceed with the analysis of the triplet based architecture; in later sections we compare it with other static interconnection networks well known in the literature for VLSI implementation. Since the three nodes of a triplet are completely connected to each other by independent links, it is reasonable to ignore the area required by the interconnection links in the graphical layout of Figure 1. If N_T is the total number of processing nodes, then the area required for N_T processors is
Area(N_T) = 4N_T/3.        (1)
Hence the longest link in the layout is of size √N_T/2. The total length of the links in the layout is given by the approximate relationship

L(N_T) = (N_T/3) × 3^(K−(K−1)) + ⌊√N_T/2⌋ × N_T/3 = N_T + ⌊√N_T/2⌋ × N_T/3 = N_T[1 + ⌊√N_T/2⌋/3],        (2)

which gives a solution L(N_T) = O(N_T log N_T). The worst case delay occurs when a message is propagated between the vertices of the triangle, and is O(√N_T). Note that this delay is slightly smaller than that in a typical mesh network, since approximately 2 log N_T processors are visited as opposed to 2√N_T − 1 in a mesh network, especially as the processing node count increases; this figure is comparable to the binary tree topology. The average message delay between processors on the vertices is of the order of O(√N_T). The generalized form of the equation for the total number of links in the triplet architecture is L(N_T) = Σ_{i=0}^{l−1} (N_T/3^i), which is O(N_T). Assuming all nodes issue messages simultaneously (worst case), the average message traffic density is then
N_T × [average message delay between nodes at vertices] / [total number of links]
= N_T × O(√N_T) / O(N_T) = O(√N_T).        (3)
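The hop-count comparison underlying this analysis can be checked numerically. A small sketch (our own; we take the TriBA path length to base-3 logarithms as an illustrative reading, since the text leaves the base implicit):

```python
import math

def triba_worst_hops(nt):
    """Approximate processors visited on a worst-case vertex-to-vertex
    path in TriBA: about 2*log(nt), taken here to base 3."""
    return round(2 * math.log(nt, 3))

def mesh_worst_hops(nt):
    """A corner-to-corner path in a sqrt(nt) x sqrt(nt) mesh visits
    2*sqrt(nt) - 1 processors."""
    return int(2 * math.sqrt(nt)) - 1

for nt in (9, 81, 729):
    print(nt, triba_worst_hops(nt), mesh_worst_hops(nt))
```

The gap widens quickly with the node count, which is the point made above about increasing processing node counts.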
Since the average delay is O(√N_T) and the chip size is O(N_T), the average chip dissipation is O(N_T^{3/2}). The layout can be constructed hierarchically: each level of embedding needs O(1) cost, and the overall connection cost for a network of N_T processors is O(log_3 N_T). The total number of links in the triplet architecture is O(N_T); thus the regularity factor is O(N_T/log N_T).

Figure 3  Addressing strategy. (a) Level-1 THIN; (b) Level-3 THIN.

The key advantage of the triplet based architecture is that external communication can be done through any of the processors on the vertices of the structure. The fault tolerance capability of the network under the failure of a single processor depends generally on the location of that processor. A similar failure of any link cannot have a disastrous effect on the performance of the IC, due to the presence of many alternate paths. Also, since this is an on-chip topology connecting different processing nodes, the probability of failure is reduced. Interconnection reliability can therefore be considered high, since the network consists of nearest-neighbor connections, and the degradation factor for TriBA is very low.

If R_p = e^{−λ_p t} is the functional reliability of each processor, then the overall reliability of the triplet based network, without any redundancy, is R_T = R_p^{N_T}. This reliability can be sufficiently ameliorated by adding redundant processors, and the overall network can be made at least (√N_T − 1) fault-tolerant. The reliability improvement factor, RIF, is the ratio of the overall network reliability with redundant processors to the reliability without them:

RIF_T = R_rT / R_T.        (4)
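The reliability relations above can be written out directly. A minimal sketch (our own; the redundant-design reliability R_rT is supplied as a parameter, since its form depends on the particular redundancy scheme):

```python
import math

def r_processor(lam_p, t):
    """R_p = exp(-lambda_p * t): functional reliability of one processor."""
    return math.exp(-lam_p * t)

def r_network(lam_p, t, n_t):
    """Without redundancy every node must survive: R_T = R_p ** N_T."""
    return r_processor(lam_p, t) ** n_t

def rif(r_redundant, lam_p, t, n_t):
    """Reliability improvement factor, eq. (4): RIF_T = R_rT / R_T."""
    return r_redundant / r_network(lam_p, t, n_t)
```

For example, `r_network(0.001, 100.0, 9)` gives the bare reliability of a level-2 triplet network over the chosen mission time; any redundancy scheme with R_rT above that value yields RIF > 1.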
3 Performance comparison of triplet based network

In a VLSI environment, the spatial distribution of the processors and the length of the interconnection links per node play an important role in the total chip area, signal propagation time and design cost. In this section we compare the performance of TriBA with other networks widely accepted and recognized by the research community and industry. The evaluation criteria have been selected from three orthogonal entities: computational speed (message delay), physical (chip area and dissipation), and cost (chip yield, layout cost).

3.1 Computational aspect
The computational model used here to asymptotically evaluate the performance of the networks is similar to Thompson's model [8] and additionally accounts for device faults and chip yield. We introduce another criterion for the computational aspect of performance evaluation, defined below. The distribution of the nodes by distance d to an arbitrarily chosen origin is one of the primary evaluation parameters of an interconnection topology. The intercommunication structural potential of a DIN is optimally used in a communication process characterized by any routing distribution Φ if the lower level structure is completely connected. The criterion can be used to design new topologies for maximum efficiency.

Definition 1 (Lower level interconnect). The lower level of interconnection between nodes at distance d in a Direct Interconnection Network (DIN) is the subset of the minimum number of nodes N_min (≥2) that can be assigned a task out of a set of tasks for parallel execution:

N_min = Min{ Σ_{i=1}^{n} N_i(d) | d = 1 }.

Definition 2 (Complete connect ratio). A set or subset of nodes is said to be completely connected (CCR = 1) iff all the constituent nodes of the set/subset are connected to each other via independent links. The CCR is the ratio of the available connections to the total number of connections needed so that every node in the subset has an independent connection with every other node:

CCR = Available connections / Total links required for complete connect.
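Definition 2 is easy to mechanize. The sketch below (our own, with illustrative edge sets for the lowest-layer subsets discussed in this section) computes the CCR of a node subset from an undirected edge list:

```python
from itertools import combinations

def ccr(nodes, edges):
    """Complete connect ratio: links available within the subset over
    the links required for every pair to be directly connected."""
    available = sum(1 for a, b in combinations(sorted(nodes), 2)
                    if (a, b) in edges or (b, a) in edges)
    required = len(nodes) * (len(nodes) - 1) // 2
    return available / required

triplet = ccr({0, 1, 2}, {(0, 1), (1, 2), (0, 2)})             # 3/3 = 1.0
mesh4   = ccr({0, 1, 2, 3}, {(0, 1), (1, 2), (2, 3), (3, 0)})  # 4/6
btree   = ccr({0, 1, 2}, {(0, 1), (0, 2)})                     # 2/3
```

The three values reproduce the TriBA, 2D mesh and binary tree rows of Table 1.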
The lower layer connections for TriBA, the two-dimensional mesh, the binary tree and the CCC of degree 3 are shown in Table 1. Solid lines indicate a connection present in the respective topology, while a dotted line signifies a connection that is missing for the subset to qualify as a lower layer complete connect. This format of interconnection can be exploited in software during job partitioning and task assignment to complete a task more efficiently, for example by reducing the message delay. Every node in TriBA is completely connected to all its neighbors, and thus the communication delays within a triplet are negligibly small. For the 2D mesh, however, two connections are missing, and one is missing for the binary tree (shown by dotted lines in Table 1). This increases the communication delays within a sub-mesh of processors when a job is assigned collectively to a set of nodes.

Table 1  Lower level complete connect (the "lower level interconnect" column of the original table shows the corresponding connection diagrams)

    Interconnection topology             N_min    CCR
    TriBA                                  3      1 (3/3)
    2D Mesh                                4      0.667 (4/6)
    Binary Tree                            3      0.667 (2/3)
    Cube Connected Cycles (Degree=3)       3      1 (3/3)

The average message delay of a typical two-dimensional mesh has already been documented in the literature as O(√N_T) [16,17]. For a binary tree network connected in an H-Tree topology it is O(√N_T) [18], and it is O(√(N_T/log N_T)) for cube-connected cycles. Similarly, from our analysis in the previous section, it turns out to be O(√N_T) for TriBA. However, the average message density has been calculated to be
O(√N_T), which is again similar to that of the two-dimensional mesh. For binary trees, however, it is of the order of the total number of processors, O(N_T), and it is O(log N_T) for cube-connected cycles. The performance of interconnection networks is also determined by throughput, the maximum rate at which the network can accept data; our simulation results for the throughput of the three networks are shown in Figure 4. The average message delay for TriBA is better than that of the binary tree topology because each processor in the triplet based architecture has a direct link with every other processor in its triplet. We introduced, in this paper, the interconnection cost for lowest layer complete connects, which reflects how completely a processing node is connected to its neighbors. TriBA has all three processors connected in such a way that each processor has an independent link to its neighboring processors, strengthening its usability and efficiency for better computational performance with respect to the other networks.

Figure 4  Throughput vs. network size.

Algorithms involving matrices and vectors are applied in several numerical and non-numerical contexts. Due to their regular structure, parallel computations involving matrices lend themselves to data decomposition. We discuss the implementation of the multiplication of a dense square n × n matrix with an n × 1 vector on the topologies under consideration. A cost-optimal parallel implementation of matrix-vector multiplication with block 2-D partitioning of the matrix can be obtained if the granularity of computation at each process is increased by using fewer than n² processes. In a p-processor interconnection, each processor owns an (n/√p × n/√p) block of the matrix. The vector is distributed in portions of n/√p elements assigned to processes according to the topology. For example, n/√p elements may be assigned to a column of processes, with the entire vector distributed evenly over each row. In a binary tree this assignment can be to each leaf of the topology; in the CCC (of degree 3) and the triplet based topology, each portion of n/√p elements is assigned to a triplet of processors. A one-to-all broadcast of these elements then takes place. Each process performs n²/p multiply-add operations and locally adds the n/√p sets of products. At the end of this step each process has n/√p partial sums that must be accumulated to obtain the resultant vector; hence the last step is an all-to-one reduction of the n/√p values.
T′_p = (aligning the vector) + (one-to-all broadcast) + (computation) + (all-to-one reduction).

If t_s is the startup time required to handle a message and t_w is the per-word transfer time, then the first step of aligning the vector in each topology takes t_s + t_w·n/√p. The one-to-all broadcast and the all-to-one reduction each take approximately (t_s + t_w·n/√p) log p. Assuming for simplicity that a multiply-add pair takes unit time, each processor spends approximately n²/p time in computation. The parallel run time for this procedure is therefore

T′_p = [t_s + t_w·n/√p] + [(t_s + t_w·n/√p) log p] + [n²/p] + [(t_s + t_w·n/√p) log p]
     ≈ n²/p + t_s log p + t_w (n/√p) log p.
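The run-time expression above can be explored numerically. A sketch (our own, with illustrative t_s/t_w values and base-2 logarithms, since the text does not fix the logarithm base):

```python
import math

def parallel_time(n, p, ts=1.0, tw=0.1):
    """T'_p for block 2-D partitioned matrix-vector multiply:
    vector alignment + one-to-all broadcast + local computation
    + all-to-one reduction, with a unit-time multiply-add."""
    comm = ts + tw * n / math.sqrt(p)   # one n/sqrt(p)-word transfer
    return comm + 2 * comm * math.log2(p) + n * n / p

# For a fixed problem size the time first drops with p, then the
# log-p communication terms start to dominate.
for p in (4, 16, 64, 256):
    print(p, round(parallel_time(1024, p), 1))
```

This kind of fixed-problem-size sweep is what underlies the execution-time, speedup and efficiency comparisons of Figures 5–7.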
According to the definitions and the CCR values in Table 1, it is fair to compare the 2D mesh with TriBA. The parallel execution time has been calculated for the TriBA, 2D mesh and binary tree topologies; Figure 5 shows the comparison of these network topologies. The results show TriBA's advantage in the parallel execution time required for a fixed problem size. Further, it can be deduced that TriBA shows its strength as the number of processors increases. To further strengthen our argument, we have calculated the speedup and efficiency of the two structures for a fixed problem size, shown in Figures 6 and 7 respectively.

3.2 Physical aspects
The physical aspects of performance evaluation are judged by two parameters: chip area and power consumption. Interconnect is a critical design element of a multi-core architecture; on an 8-core processor, for example, the interconnect can consume the power equivalent of one core and occupy the area equivalent of three cores [6]. A system-level model of an MPSoC can be used to estimate the power consumption and power dissipation characteristics of the many-core SoC during design. Geometric parameters of the interconnection links have an effect on power consumption and power density [18].

Figure 5  Parallel time for a fixed-size problem.

Figure 6  Computational speedup comparison.

Figure 7  Efficiency comparison.

We have considered a system-level tentative wire model for the
triplet based architecture for CMP design. The basis of our analysis of the physical aspects is to take into account the geometric parameters of the interconnection links.

(i) Tentative wire model. In Thompson's VLSI model, a network is represented by a graph: its nodes correspond to processing elements (PEs) and its edges correspond to wires. Our model uses the same theoretical basis, with a PE as a square of side h (area h² units). However, our model differs in the following aspects. The processing elements are not placed at regular intervals on the grid; rather, their placement is dictated by the fact that each triplet forms an equilateral triangle. The wires can run vertically, horizontally, at 60 degrees and at 120 degrees along the grid lines, as shown in Figure 8. The triplet based architecture therefore requires at least 3 metal layers for interconnection; this holds iff the PE placement is perfect, otherwise 4 layers are needed. We introduce a novel approach to the physical layout of PEs. Recent research on diagonal interconnects has opened a new paradigm in the orientation of layers that can be used for interconnection [19,20]. Experiments have shown that, compared with the Manhattan architecture, the Y-architecture demonstrates a throughput improvement of 30.7% for a square chip [19]; by applying diagonal routing, net length is reduced by 36% and path delay is reduced by 14 ps per net on average [20]. The technology that we consider provides 10 layers of metal. In many technologies the chip size is related, to some extent, to the total area used for interconnections on the chip; the latter parameter can be estimated with the tentative wire model [18]. Note that the interconnection wiring in Figure 9 has two strong features. First, all connecting wires are of the same length and exhibit the same path delay between adjacent PEs. Second, the PEs are placed symmetrically, so that the structure fits well on a square chip.

(ii) Area estimation.
Figure 8  Interconnection layers. (a) 0; (b) 60; (c) 120 degrees.

Figure 9  Interconnection for 2-level THIN. (a) Orthogonal; (b) diagonal interconnects.

For our study, the chip area is calculated considering a stripped version of out-of-order Power4-like cores, listed at http://www.research.ibm.com/power4. We determine the area taken up by such a core at 65 nm technology to be 10 mm². The area and power determination methodology is similar to that presented in [21]. The area occupied by a bus is determined by three factors: the number of wires, the effective pitch of the wires, and the bus length. In this paper we only consider the area
occupied by the PEs. However, we recognize that the area required by busses is also important in the overall area estimation; we consider the contribution of busses in the power estimation.

(iii) Power dissipation. The power taken up by a core is determined to be 10 W, including leakage. The total area occupied by connecting wires is determined by the number of links in the network times the number of wires in a bus times the effective pitch of the wires times the length. We have compared and plotted the number of links with increasing number of processing nodes in Figure 10. For the sake of simplicity, the number of wires in a bus is assumed in our analysis to be 64 (a constant). Although pitch varies with the interconnection layer, we take it to be the same for all layers, i.e., the signal wiring pitch for wires in the different metal planes at 65 nm; typical pitch values vary from 0.2 μm to 1.6 μm, and the exact values are estimated conforming to the considerations mentioned in [22]. A diagonal wire is longer than a horizontal or vertical wire; by geometry, the ratio between a diagonal and a straight wire is 1.154. Table 2 shows the power calculations based on the above considerations.

3.3 Cost aspects

The cost aspects consider the fabrication cost and the replacement cost due to the poor reliability of the networks. The manufacturing cost of an IC is related to the total chip area and the regularity of the layout, while the fault tolerance capability largely decides the reliability of a working chip. Due to a host of causes, such as electromigration, Kirkendall effects and hot electron effects, a processor or a link may fail during the normal use of the chip. Depending on the topology of the network, the failure of a single processor or link may adversely affect the operation of the network. Normally the level of masking and processing associated with the interconnect is far simpler than that of the processors; hence the reliability of the interconnect is higher than that of the processor, and the reliabilities due to processor failure and interconnect failure are treated separately, with different measures used here. Since the probability of interconnect failure is directly proportional to its length, the total length of the interconnects in the network is used as the measure of fault tolerance due to interconnect failure and is denoted by R_l = xλ_l, where x < 1 and λ_l is the mean life of a wire of length l. The failure of a single processor results in performance degradation because it may isolate one or more processors. In a two-dimensional network the failure of a single processor does not impair the performance of the network drastically: due to the presence of many parallel paths in the square grid, it results in the isolation of the failed processor only. The same is true for TriBA. The degradation factor is thus
δ = (N_T − 1)/N_T. If R_p = e^{−λ_p t} is the functional reliability of each processor in the mesh, then the overall reliability of the network is R_M = R_p^N. This reliability can be sufficiently ameliorated by adding a redundant row, and the overall network can be made (√N_T − 1) fault-tolerant. For binary tree networks routed and placed in the H-Tree configuration, even if the leaf processors and their fathers are not replicated while all other nodes in the tree are replicated, the reliability improvement factor RIF is poorer than that of the two-dimensional mesh.

Figure 10  Comparison of links with increasing PE count.

Table 2  Area-power calculations (mesh links: 2N(N−1); triplet links: Σ_{i=0}^{l−1} N_T/3^i)

    Network size N    Links (Mesh / TriBA)    Power consumed (W)    Power in interconnection wires (mW, Mesh / TriBA)    % improvement (overall)
    3×3               12 / 12                 90                    153 / 176                                            0
    9×9               144 / 120               810                   1843 / 1772                                          3.85
    27×27             1404 / 1092             7290                  17971 / 15934                                        11.33
    81×81             12960 / 9840            65610                 178688 / 143585                                      20
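The link counts compared in Figure 10 and Table 2 follow directly from the two closed-form expressions used in the analysis. A sketch reproducing them:

```python
def mesh_links(n):
    """Links in an n x n 2-D mesh: 2n(n-1)."""
    return 2 * n * (n - 1)

def triplet_links(nt):
    """Links in a triplet network of nt = 3^l nodes:
    the sum of nt / 3^i while each term is at least 3."""
    total = 0
    term = nt
    while term >= 3:
        total += term
        term //= 3
    return total

# Table 2 rows pair an n x n mesh with a triplet network of nt = n*n nodes:
for n, nt in ((3, 9), (9, 81), (27, 729), (81, 6561)):
    print(n, mesh_links(n), triplet_links(nt))
```

The TriBA column grows noticeably more slowly, which is the source of the wire-power advantage claimed for larger networks.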
4 Conclusions

This paper has presented the results of modeling the triplet based interconnection architecture. The results show that the triplet based interconnection is advantageous over the 2D mesh topology in the computational aspect. The present analysis is, however, purely analytical in nature, and a deeper investigation using standard benchmarks is in the pipeline of our present and future work on TriBA. Although its physical aspect still needs deeper investigation, TriBA is competitive with the 2D mesh, especially in link count, which indicates that TriBA is the more power-conscious topology. We also believe that diagonal interconnects will soon be a reality, as their need is being justified in many recent research papers. TriBA can therefore be realized as a useful interconnection topology for future multi-core systems. In addition, we intend to take the area requirements of different cache sizes into account in our current and future work. The basic conclusion of the analyses done here is that applications which require a high-performance multi-node computing resource, such as cellular networks and multimedia applications, can use the best network topology to obtain better overall performance. Presently, structures similar to the mesh are cost-effectively implemented in VLSI and are highly suitable for VLSI parallel processing. We have evaluated the triplet based architecture, TriBA, as another useful multi-core topology for exploiting locality. Moreover, with diagonal interconnects very likely to become available in the near future and core counts increasing into the hundreds, TriBA can be a power-efficient topology for a single-chip multi-core system.

This work was supported by the National Natural Science Foundation of China (60973010) and the Doctoral Fund of Ministry of Education of China (200800071005).
1  Jayanta B, Ekaterina T, Magdy S A. Validating power architecture topology-based MPSoC through executable specifications. IEEE Trans VLSI, 2008, 16: 388–396
2  Jerraya A A, Bouchhima A, Petrot F. Programming models and HW-SW interfaces abstraction for multi-processor SoC. In: Proceedings of the 43rd ACM/IEEE Design Automation Conference, San Francisco, 2006. 280–285
3  Magarshack P, Paulin P G. System-on-chip beyond the nanometer wall. In: Proceedings of the 40th ACM/IEEE Design Automation Conference, Anaheim, 2003. 419–424
4  Jerraya A A, Wolf W. Multiprocessor Systems-on-Chip. San Francisco: Elsevier Morgan Kaufmann, 2005. 431–462
5  Pande P P, Grecu C, Jones M, et al. Performance evaluation and design trade-offs for network-on-chip interconnect architectures. IEEE Trans Comput, 2005, 54: 1025–1040
6  Rakesh K, Victor Z, Dean M T. Interconnections in multi-core architectures: understanding mechanisms, overheads and scaling. In: Proceedings of the 32nd International Symposium on Computer Architecture, Madison, 2005. 408–419
7  Shi F, Ji W X, Qiao B J, et al. A triplet based computer architecture supporting parallel object computing. In: Proceedings of the 18th IEEE International Conference on Application-specific Systems, Architectures and Processors, Montreal, 2007. 192–197
8  Thompson C D. Generalized connection networks for parallel processor intercommunication. IEEE Trans Comput, 1978, C-27: 1119–1125
9  Horowitz M, Dally B. How scaling will change processor architecture. In: Proceedings of the IEEE International Solid-State Circuits Conference, San Francisco, 2004. 132–133
10  Lupu C, Niculiu T. Interconnection locality and group locality. In: Proceedings of the International Conference on Computer as a Tool (EUROCON), Belgrade, 2005. 656–659
11  Grecu C, Pande P P, Ivanov A, et al. Structured interconnect architecture: a solution for the non-scalability of bus-based SoCs. In: Proceedings of the ACM Great Lakes Symposium on VLSI, Boston, 2004. 192–195
12  Hsieh C, Pedram M. Architectural energy optimization by bus splitting. IEEE Trans CAD, 2002, 21: 408–414
13  Fu J S. Hamiltonian-connectedness of the WK-recursive network. In: Proceedings of the 7th International Symposium on Parallel Architectures, Algorithms and Networks, Hong Kong, 2004. 569–574
14  Qiao B J, Shi F, Ji W X. A new hierarchical interconnection network for multi-core processor. In: Proceedings of the 2nd IEEE Conference on Industrial Electronics and Applications, Harbin, 2007. 246–250
15  Qiao B J, Shi F, Ji W X. A new routing algorithm in triple-based hierarchical interconnection network. In: Proceedings of the 1st International Conference on Innovative Computing, Information and Control, Beijing, 2006. 725–728
16  Laxmi N B, Qing Y, Dharma P A. Performance of multiprocessor interconnection networks. Computer, 1989, 42: 25–37
17  Crispín G, María E G, Pedro L, et al. Exploiting wiring resources on interconnection networks: increasing path diversity. In: Proceedings of the Euromicro Conference on Parallel, Distributed and Network-based Processing, Toulouse, 2008. 20–29
18  Yuriy S, Elena S, Felix S. Complexity and low power issues for on-chip interconnection in MPSoC system level design. In: Proceedings of the IEEE Computer Society Annual Symposium on Emerging VLSI Technologies and Architectures, Karlsruhe, 2006
19  Hongyu C, Bo Y, Feng Z, et al. The Y-architecture: yet another on-chip interconnect solution. In: Proceedings of the IEEE ASP-DAC, 2005. 588–599
20  Noriyuki I, Hideaki K, Ryoichi Y, et al. Diagonal routing in high performance microprocessor design. In: Proceedings of the 11th South Pacific Design Automation Conference, Yokohama, 2006. 624–629
21  Kumar R, Farkas K I, Jouppi N P, et al. Single-ISA heterogeneous multi-core architectures: the potential for processor power reduction. In: Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture, San Diego, 2003. 81–92
22  Theis T N. The future of interconnection technology. IBM J Res Dev, 2000, 44: 379–390