Modeling Computational Limitations in H-Phy and Overlay-NoC Architectures

Dawid Zydek, Department of Electrical Engineering, Idaho State University, USA
[email protected]

Grzegorz Chmaj, Department of Electrical and Computer Engineering, University of Nevada, Las Vegas, USA
[email protected]

Steve Chiu, Department of Electrical Engineering, Idaho State University, USA
[email protected]
Author's copy. The original publication is available at www.springerlink.com
Abstract. High performance computing demands constant growth in the computational power and services that modern supercomputers can offer. This requires technological and design advances in multiprocessor internal structures, as well as novel computing models that address very high computing demands. One of the increasingly important requirements of computing platforms is functionality that allows efficient management of computational resources, i.e., monitoring them, restricting access to some of the resources, accounting for computational services, and ensuring reliability and quality of service when some resources are broken or disabled. In this paper, we present a new model describing computational limitations for processing tasks on multiprocessor systems. The model is implemented in Hardware-Physical (H-Phy) and Overlay Network-on-Chip (Overlay-NoC) architectures. Both architectures and the model are described and analyzed. An experimentation system is also presented, together with simulation assumptions, research results, and their analysis. The paper provides complete models of H-Phy and Overlay-NoC structures with the ability to restrict processing resources.
Keywords: CMP, H-Phy, Overlay-NoC, Computational Limitations, Experimentation System.
1 Introduction

The area of High Performance Computing (HPC) is currently under extensive research with the aim of achieving exascale systems by the end of the decade. The research covers both aspects that are important for HPC applications: hardware [3] and software [5]. On the software side, a key problem is how to efficiently solve complex processes, e.g., nuclear engineering or weather forecasting, described by mathematical equations, typically partial differential equations [15]. On the hardware side, effort is directed towards scaling architectures to billions of processors and increasing the efficiency of individual processing nodes [9]. The nodes currently used and developed are Chip MultiProcessors (CMPs) with core counts ranging from a couple to tens [13]. The internal organization of a CMP is governed by a Network-on-Chip (NoC) that ensures effective communication among the cores. Low-dimensional topologies are used for NoC implementations, especially the 2D-mesh and 2D-torus [17], [21]. The physical structure of modern CMPs with NoC attracts a lot of research effort, e.g., [6], [8] and [18]. Computation problems based on distributed structures at higher architecture levels are also the subject of wide research today [16].

The performance of a CMP depends highly on a well-designed internal architecture; however, efficient utilization of the cores and their management are also important factors. This optimization is usually done together with execution time minimization. A model of an interconnected processor structure and algorithms for minimizing energy usage are described in [10]. The authors assume that the CPUs are fully interconnected and have defined energy costs, but omit the NoC energy consumption; it was shown in previous works that this is an important factor [17], [18], [20], [22]. A multi-level connection model was presented in [11]: multiple clusters of CPUs were connected to each other using the TCP/IP protocol, connections inside clusters are implemented using regular buses, and the energy required by the transport network is not considered. Distributed processing systems may also include non-CPU elements, such as FPGA boards [14]. There, one board contains processors, shared memory, and inter-communication elements; the system implements local on-board communication between CPUs using shared memory, while, as in [11], distant (board-to-board) communication is implemented using TCP/IP. The core frequencies of the processors are individually controlled to achieve energy management, using temperature sensors assigned to each CPU.

In our research, we use the simulation approach in order to determine the quality of our solutions. This technique is very popular in many fields of study [2], [3], [12], [23]. The following ideas used in our experimentation system are shared with the common shape of the simulation approach [12], [20]: framework construction, modular design, and a data generation part. Energy consumption of multiprocessor structures was analyzed in [24], along with a system model; however, the analysis is done for high-level peer-to-peer structures operating on an overlay network. Algorithms for power reduction were described together with experimentation results. A hybrid approach to distributed processing along with energy minimization is described in [26]: the authors propose to use mixed offline and online allocation, exploiting the advantages of the offline approach without the need to completely solve the MIP problem.
Problem formulation, algorithms, and promising results are presented. In [3], we described a model of a system capable of analyzing a CMP using two approaches: 1) the Hardware-Physical (H-Phy) approach and 2) the Overlay-NoC approach. Both of them use the same input data set for processing; however, they apply approach-specific factors, making the approaches unique not only in the algorithms used. In this paper, we create a new model based on the one presented in [3]. In the new system, computational limitations are implemented, which, besides efficient utilization, also allow flexible management and monitoring of the cores in the CMP. Thus, issues like core reliability, accounting, and access restriction can be modeled. The authors of [25] consider heterogeneous processors and define two types of CPU; they present theory and experiments for their algorithms. We also define our system as heterogeneous, as individual processors can have different parameters. Based on the new idea, an experimentation system is built, where the H-Phy and Overlay-NoC approaches are developed independently. Based on the presented evaluation system, examples of experiments are conducted and results are presented. Details of both the H-Phy and Overlay-NoC approaches are presented in the form of mathematical models, which is a common way of describing such systems [24], [26].

The remainder of this paper is organized as follows: in Section 2, we describe the topologies and the energy model used in the research. The H-Phy and Overlay-NoC approaches are presented in Section 3 and Section 4, respectively. Section 5 contains a description of the experiments and an analysis of results. Conclusions and final remarks are presented in Section 6.
2 Topologies and Energy Model

2.1 Description of Topologies

We define a universal description for the 2D-mesh and 2D-torus in order to express the distance between two nodes in the network. First, let us define the notation:
- nX = horizontal size of the mesh
- nY = vertical size of the mesh
- node_t = destination node (expressed as the sequential number of the node)
- node_f = source node (expressed as the sequential number of the node)
- m(value, divisor) = the modulo division of value by divisor

For the 2D-mesh, the internode distance d is as follows (divisions are rounded to the nearest integer):

  d = |m(node_t, nX) - m(node_f, nX)| + |node_t/nX - node_f/nX|

For the 2D-torus, we make the following assumptions: all nodes are numbered from 0 to nX*nY-1, and the node number is its position in the torus; columns are numbered from 0 to nX-1; rows are numbered from 0 to nY-1; wrapping rows are the rows denoted by 0 and nY-1; wrapping columns are the columns denoted by 0 and nX-1; by wrapping we mean changing node position across a wrapping row or wrapping column; all divisions are rounded to the nearest integer. The node denoted as 0 (its x,y position is 0,0) may be selected arbitrarily. This kind of node numbering and positioning allows precise navigation and operations on nodes. We also propose to define the distance from node 0,0 to all other nodes, which makes it easy to express the torus as a triangular matrix of distances. In this paper, we introduce a universal way to define all distances. The distance in the torus, with regard to nX and nY, requires distinguishing between four cases:
- d1(node_f, node_t, nX) – distance between nodes in the torus with no wrapping
- d2(node_f, node_t, nX, nY) – distance between nodes in the torus when wrapping through wrapping rows
- d3(node_f, node_t, nX) – distance between nodes in the torus when wrapping through wrapping columns
- d4(node_f, node_t, nX, nY) – distance between nodes in the torus when wrapping through both wrapping rows and columns

The distance d1 is defined the same way as d for the 2D-mesh: d1 = d. Distances d2, d3 and d4 involve wrapping, so we describe them in pseudocode:

d2 from node_f to node_t {
    determine the distance dh to the closer wrapping row wr for node_f and which of the two (0 or nY-1) it is: dr;
    determine the position of the first node node_i after the wrap, using wr and dr;
    determine the distance dw from node_i to node_t;
    d2 = dh + dw.
}

d3 from node_f to node_t {
    determine the distance dh to the closer wrapping column wc for node_f and which of the two (0 or nX-1) it is: dc;
    determine the position of the first node node_i after the wrap, using wc and dc;
    determine the distance dw from node_i to node_t;
    d3 = dh + dw.
}
d4 from node_f to node_t {
    determine the distance dh to the closer wrapping row wr for node_f and which of the two (0 or nY-1) it is: dr;
    determine the position of the first node node_i after the up/down wrap, using wr and dr;
    determine the distance dw to the closer wrapping column wc for node_i and which of the two (0 or nX-1) it is: dc;
    determine the position of the first node node_j after the left/right wrap, using wc and dc;
    daw = |m(node_t, nX) - m(node_j, nX)| + |node_t/nX - node_j/nX|;
    d4 = dh + dw + daw.
}

2.2 Energy Model Used

We implement NoC architectures with: 1) 2D-mesh and 2D-torus topologies; 2) virtual-channel flow control; and 3) Dimension Order Routing (DOR) [18]. Each NoC node contains a Virtual Channel (VC) router (R in Fig. 1). A packet traversing from a node (tile) v to a neighboring tile w (one NoC channel, or 1 hop) needs to be processed by the VC router, where the next destination node is selected, and it also needs to traverse an NoC channel. The average energy consumption in pJ of sending one bit of data from tile v to tile w is expressed by:

For the 2D-mesh:
  E_bit^{v,w} = 0.98 (N_hops^{VC} + 1) + 0.57 N_hops    (1)

For the 2D-torus:
  E_bit^{v,w} = 0.98 (N_hops^{VC} + 1) + 0.75 N_hops    (2)

where N_hops^{VC} is the number of VC routers traversed by a packet between tiles v and w, and N_hops is the number of NoC channels traversed by the packet. The values 0.98 and 0.57 were obtained from a hardware implementation of the NoC on an FPGA device [19]. We consider a system built with Intel Core i5-660 processors with a 3.6 GHz clock. These units include two physical cores, but we treat one i5 chip as one PE. According to the Intel technical specification [7], the i5-660 processor has a Thermal Design Power (TDP) equal to 73 W and a computing power of 29 GFLOPS (billion FLOPS). We use the TDP as the operating power of the processors to give a good estimate of energy consumption. We convert the TDP into the energy Ec consumed in one cycle [19] according to the formula:

  Ec = TDP * (1 / Fmax)  [µJ]    (3)

where Fmax is the maximum frequency of the Intel Core i5-660 processor in [MHz].
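To make the distance and energy definitions concrete, the following Python sketch (our own illustration, not code from the paper) computes minimal hop counts under XY/DOR routing and the per-bit energies of equations (1) and (2). It assumes that a packet traverses one VC router per channel plus the source router, i.e. N_hops^{VC} = N_hops, and the torus function collapses the four cases d1..d4 into a per-dimension choice between the direct and the wrapped path.

    # Illustrative sketch: hop distances under XY/DOR routing and per-bit
    # energy from Eqs. (1)-(2); assumes N_hops_VC == N_hops (one VC router
    # per traversed channel plus the source router).

    def mesh_hops(node_f: int, node_t: int, nX: int) -> int:
        """Hop count in a 2D-mesh; nodes numbered row-major from 0."""
        dx = abs(node_t % nX - node_f % nX)      # horizontal offset
        dy = abs(node_t // nX - node_f // nX)    # vertical offset
        return dx + dy

    def torus_hops(node_f: int, node_t: int, nX: int, nY: int) -> int:
        """Minimal hop count in a 2D-torus: per dimension, take the shorter
        of the direct path and the wrap-around path (covers cases d1..d4)."""
        dx = abs(node_t % nX - node_f % nX)
        dy = abs(node_t // nX - node_f // nX)
        return min(dx, nX - dx) + min(dy, nY - dy)

    def energy_per_bit_pj(n_hops: int, torus: bool) -> float:
        """Eqs. (1)/(2): 0.98 pJ per router traversal, plus 0.57 (mesh)
        or 0.75 (torus) pJ per traversed NoC channel."""
        link = 0.75 if torus else 0.57
        return 0.98 * (n_hops + 1) + link * n_hops

    if __name__ == "__main__":
        nX, nY, W = 10, 10, 32                    # 10x10 CMP, 32-bit flits
        h = torus_hops(0, 55, nX, nY)             # node 0 to node 55: 10 hops
        print(energy_per_bit_pj(h, torus=True) * W, "pJ per flit")

Averaged over all source-destination pairs and multiplied by the flit width W, such a routine reproduces the order of magnitude of the average packet-transfer costs quoted in Section 5.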
3 H-Phy Approach

The H-Phy architecture for the simulated CMP is presented in Fig. 1; a detailed description can be found in [18]. In the figure, a 5×5 CMP with the 2D-torus topology is shown. The chip area is divided into tiles, which ensures scalability and effective use of the resources available on the chip. Each tile contains networking elements (router, network interface, network channels) and a Processing Element (PE), i.e., processor, cache memory, etc. Communication among tiles is done using routers that are connected by the NoC. Both the H-Phy and Overlay-NoC structures consider a heterogeneous architecture, where each PE in the CMP may have individual computational power. The efficient utilization of PEs is done by a Processor Allocator (PA) and a Job Scheduler (JS). The PA is implemented in hardware and placed as a tile on the same die as the PEs in the CMP, which delivers better performance and efficiency [18].

3.1 Simulation Methodology

The simulation starts when the JS receives a request to allocate a job. The JS is responsible for job scheduling, which deals with the selection of the job to be executed next. In the presented system, the JS processes jobs in First Come First Served (FCFS) fashion. A job may contain one or more tasks. In the H-Phy approach, a job is described by the size of the requested subgrid. Contiguous processor allocation is considered, thus the processors allocated to a job are physically adjacent and have the same topology as the NoC. The scheduled job is moved to the PA, which assigns the job to available PEs according to an allocation algorithm. Two best allocation schemes are used: the IFF algorithm for 2D-mesh [4], [18] and the BMAT technique for 2D-torus [17]. Once the PA finds available PEs to accommodate the job (all the job's tasks), the PA sends an allocation message to the PEs to reserve them for the job. Jobs are allocated in such a manner that they do not overlap with each other, and once allocated, they run until completion. If there are no free PEs, the PA waits until another job releases some PEs. After execution, the PEs send a release message to the PA, which updates the status of the processors. All messages in the system are sent over the implemented NoC. We assume that allocation and release messages take one flit each; e.g., if a job requires 4 processors, 4 flits have to be sent from the PA to the 4 PEs assigned to the job.
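The scheduling and allocation loop described above can be sketched as follows; this is our own illustration, in which a naive first-fit subgrid search stands in for the far more refined IFF and BMAT allocators, and all names are hypothetical.

    # Sketch of the FCFS loop of the PA&JS with contiguous allocation.
    # first_fit_subgrid is a simplified stand-in for IFF (mesh) / BMAT (torus).
    from collections import deque

    def first_fit_subgrid(busy, X, Y, w, h):
        """Return the tiles of a free w x h subgrid, or None if none fits."""
        for j in range(Y - h + 1):
            for i in range(X - w + 1):
                tiles = [(i + e, j + f) for e in range(w) for f in range(h)]
                if not any(t in busy for t in tiles):
                    return tiles
        return None

    def run_fcfs(jobs, X, Y):
        """jobs: iterable of (job_id, width, height, duration) requests;
        assumes every job fits on the X x Y chip."""
        queue, busy, running, now = deque(jobs), set(), [], 0
        while queue or running:
            for job in [r for r in running if r[0] <= now]:
                running.remove(job)            # PEs send release messages
                busy.difference_update(job[1])
            while queue:                       # FCFS: only the queue head
                job_id, w, h, dur = queue[0]
                tiles = first_fit_subgrid(busy, X, Y, w, h)
                if tiles is None:
                    break                      # wait for a release message
                queue.popleft()
                busy.update(tiles)             # one-flit allocation messages
                running.append((now + dur, tiles, job_id))
            now += 1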
Fig. 1. Sample of H-Phy architecture: tiled 5×5 2D-torus (PE = Processing Element, R = Router, PA = Processor Allocator, JS = Job Scheduler; on-chip tiles are connected by network channels).
3.2 Assumptions and Description

The model for the H-Phy approach is defined as follows:

Indices:
- v, w = 1, 2, ..., V: PEs
- M: PA&JS
- b = 1, 2, ..., B: jobs to process
- s = 1, 2, ..., S: sizes of jobs
- t = 1, 2, ..., T: time slots
- i, j = 1, 2, ..., N: indices for the position of a PE

Binary variables:
- qbs = 1 when job b has horizontal size s or less, 0 otherwise
- rbs = 1 when job b has vertical size s or less, 0 otherwise
- ybMvt = 1 when job b is sent from the PA to PE v in time slot t, 0 otherwise
- xbvt = 1 when job b is computed at PE v in time slot t, 0 otherwise
- gvij = 1 when PE v resides at position i,j (i = horizontal, j = vertical) in the mesh or torus structure

Constants:
- E_bit^{v,w}: energy consumption to send one bit from v to w
- pv: computation power of PE v
- W: word length
- X, Y: size of the mesh/torus (horizontal/vertical)
Criterion function:

  minimize_y  F = 2W ∑w∑b∑t ybMwt E_bit^{M,w} ∑s qbs ∑s rbs

(The factor ∑s qbs ∑s rbs is the number of PEs requested by job b, and the factor 2W accounts for the allocation and release messages, one W-bit flit each.)
Constraints:

All jobs have to be computed:
  ∑b∑v xbv = B    (4)

Each job is computed once:
  ∑v xbv = 1,  b = 1, 2, …, B    (5)

PEs do not exchange data packets between each other:
  ∑b∑v∑w∑t ybvwt = 0,  v ≠ w ≠ M    (6)

A job is allocated only to a PE that is not occupied:
  ∑b xbvt ≤ 1,  t = 1, 2, …, T,  v = 1, 2, …, V    (7)

Each PE has limited computation power:
  ∑b∑t xbvt ≤ pv,  v = 1, 2, …, V    (8)

Mesh-specific constraint: a job is allocated to adjacent PEs and must not overlap the mesh edge:
  ∑v∑e∑f xbvt gv(i+e)(j+f) = ∑s qbs ∑s rbs
  1 ≤ t ≤ T, b = 1, 2, …, B, 0 ≤ e < ∑s qbs, 0 ≤ f < ∑s rbs, 1 ≤ i < X-∑s qbs, 1 ≤ j < Y-∑s rbs    (9)

Torus-specific constraint: a job is allocated to adjacent PEs and can wrap around the torus:
  ∑v∑e∑f xbvt gv[(i+e)%X,(j+f)%Y] = ∑s qbs ∑s rbs
  1 ≤ t ≤ T, b = 1, 2, …, B, 0 ≤ e < ∑s qbs, 0 ≤ f < ∑s rbs, 1 ≤ i ≤ X, 1 ≤ j ≤ Y    (10)
(The % symbol denotes the modulo operation.)

A job may contain one or many tasks that are adjacent to each other. A job has a shape that is a subgrid of the NoC topology, and it is described by the size of the subgrid it requires (Fig. 2a). Other types of jobs, such as those presented in Fig. 2b and 2c, are not considered in this work. The authors of [27] consider an interesting idea of modeling tasks in the form of graphs, where nodes represent tasks and edges represent communication; it allows combining tasks into jobs, similarly to the way we use shapes. All PEs in our CMP have an assigned processing power pv (8), and each PE may process only one task at a time (7), so jobs containing more than one task need more PEs; e.g., the job from Fig. 2a needs six adjacent PEs, and they assume the shape illustrated in the figure. Once a job is allocated to PEs, it runs until completion. PEs are connected to each other by the NoC; the one researched in this paper has NoC channels 32 bits wide (W = 32). Thus, one flit is 32 bits wide, and for simplicity we assume that one packet contains one flit. The PA is located at position 0,0, and this node is not able to process data (there is no PE at 0,0).
Fig. 2. A job containing 6 tasks: (a) a job whose shape is a subgrid of the NoC; (b) a job with adjacent tasks; (c) a job with tasks that are not adjacent.
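The difference between the mesh constraint (9) and the torus constraint (10) is only the placement domain: on the mesh, the job's subgrid must fit inside the X×Y grid, while on the torus every anchor position is valid because coordinates wrap modulo X and Y. A short sketch of the two placement enumerations (our own illustration, not from the paper):

    # Candidate anchor positions of a w x h job: constraint (9) clips at
    # the mesh edge, constraint (10) wraps occupied tiles modulo X and Y.

    def mesh_placements(X, Y, w, h):
        return [(i, j) for i in range(X - w + 1) for j in range(Y - h + 1)]

    def torus_placements(X, Y, w, h):
        return [(i, j) for i in range(X) for j in range(Y)]

    def occupied_tiles(i, j, w, h, X, Y, torus):
        """Tiles covered by the job anchored at (i, j)."""
        return [((i + e) % X if torus else i + e,
                 (j + f) % Y if torus else j + f)
                for e in range(w) for f in range(h)]

    # A 3 x 2 job on the 5 x 5 chip of Fig. 1: 12 mesh anchors, 25 torus anchors.
    assert len(mesh_placements(5, 5, 3, 2)) == 12
    assert len(torus_placements(5, 5, 3, 2)) == 25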
3.3 System Algorithms

From the algorithmic point of view, the H-Phy strategy considers the system as one part. The core of the system is the PA, which governs the entire process. PEs are reduced to pure processing elements, and their other activities are limited to informing the PA when a job is done.

PE Algorithm:
1. If task computation has just finished, send a release message to the PA.
2. If an allocation message was received, wait for arguments and start computing.

PA&JS Algorithm:
1. If the queue is not empty, make the JS select a job. Select the PEs to which the job will be allocated using the IFF or BMAT algorithm, considering current knowledge about their state (free/busy and computation limit status).
2. If the PA has found available PEs, send allocation messages to the selected PEs.
3. If the PA has not found available PEs, wait for a release message from any PE and try to allocate the job again.
4. If the PA has received a release message from a PE, update the computation limit.
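The PA&JS algorithm, together with the per-PE computation limits introduced in this paper, can be rendered as a small event-driven state holder; the sketch below is our own, and all names are hypothetical.

    # Sketch of PA-side state for the PA&JS algorithm with computation
    # limits (-1 = unlimited PE, 0 = PE that may not compute any task).

    class ProcessorAllocatorState:
        def __init__(self, limits):
            self.limits = dict(limits)   # remaining tasks per PE
            self.busy = set()

        def usable(self, v):
            """Free/busy and computation-limit status used in step 1."""
            return v not in self.busy and self.limits[v] != 0

        def allocate(self, pes):
            """Steps 2-3: reserve PEs if all are usable, else caller waits."""
            if all(self.usable(v) for v in pes):
                self.busy.update(pes)    # allocation messages, one flit each
                return True
            return False

        def on_release(self, v):
            """Step 4: release message received; update the computation limit."""
            self.busy.discard(v)
            if self.limits[v] > 0:
                self.limits[v] -= 1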
4 Overlay-NoC Approach

The NoC communication network connects the on-chip processors (Fig. 1). In the overlay approach, the processors on which a job is executed do not have to be adjacent; this is equivalent to dividing jobs into single-PE tasks. Fig. 3 shows the structure of the Overlay-NoC system: R elements are not connected in a specific structure, but topologies such as mesh or torus can be applied by assigning values to the R-to-R parameters (the constants k in our model). In this approach, a heterogeneous CMP architecture can be considered (PEs do not have to be identical), i.e., the cost of computing a data unit may differ among them. This way we can define a structure containing PEs with different energy consumption and different computing power (though we are able to set equal costs and power for all PEs to obtain uniform parameters). The presented mathematical model is valid for the heterogeneous architecture.
Fig. 3. Overlay-NoC approach architecture.

The Overlay-NoC approach uses a messaging protocol to communicate between all system units. Network elements R pass messages among each other and have Messaging Interfaces (MI) built in. PEs are also equipped with MIs, as is the PA. An MI is able to receive and send messages and manage the available bandwidth. The following messages are used: compute request, compute acknowledgement, compute denial, compute task, and computation result.

4.1 Simulation Methodology

In this paper, we use a heuristic simulation solution approach that gives us results reflecting the H-Phy structure modeled during the research and described in Section 3. However, the research system, according to the presented models, also allows Mixed Integer Programming optimal solutions and heuristic static-based solutions. The idea of time-scale slicing is used together with the message exchanging system.
This makes the research system scalable and extendable for new mechanisms. The system also includes robust input data management, a trend-research module, and scenario-driven experiment capabilities.
4.2 Assumptions and Description

For each processor pair, the distance is known (interpreted as the number of hops between the two processors). The distance also refers to the NoC energy cost, as more energy has to be consumed to transfer data over longer distances, so we define the following cost parameter for each PE pair: kvw = const, which is the energy cost of sending a data packet between tiles v and w. The parameters defined for each PE v are: computing power pv = const and computation cost (for one data unit) cv = const. The overlay NoC and the connected PEs also have limited data transfer capabilities: each PE v is able to send at most uv = const data packets and receive at most dv = const data packets in a given time slot t. The time scale is divided into T time slots, which is an additional constraint for the problem being solved (it must be finished within the limited amount of time T). The PE structure performs the computation and result-distribution tasks, so the overall structure's energy cost consists of computation and data transfer costs. Finally, our complete model for the Overlay-NoC based processor structure is as follows:

Indices:
- b = 1, 2, …, B: indices for data packet blocks (the problem contains B 1×1 tasks)
- t = 1, 2, …, T: indices for time slots (the problem has to be resolved in at most T time slots)
- v, w = 1, 2, …, V: indices for PEs
- M: special unit: Processor Allocator (PA)

Constants:
- cv: cost of task computation at PE v (energy)
- kwv: cost of packet transfer between tiles w and v (energy)
- pv: computation power of PE v
- dv: packet receive limit of PE v
- uv: packet send limit of PE v

Binary variables:
- xbv = 1 when task b is computed at PE v; 0 otherwise
- ybwvt = 1 when task b is transferred from PE v to PE w in time slot t; 0 otherwise

Criterion function:
The criterion function (computation + transfer) for the model is formulated as:

  minimize_{x,y}  F = ∑v∑b xbv cv + ∑v∑w∑b∑t ybwvt kvw    (11)
Constraints:
We define the following constraints to determine the additional assumptions that make the model complete:

Each PE has to compute at least one task:
  ∑b xbv ≥ 1,  v = 1, 2, …, V    (12)

Each task may be computed on only one PE:
  ∑v xbv = 1,  b = 1, 2, …, B    (13)

Each PE has limited computation power:
  ∑b xbv ≤ pv,  v = 1, 2, …, V    (14)

PE's sending data link capacity:
  ∑b∑v ybwvt ≤ uw,  w = 1, 2, …, V,  t = 1, 2, …, T    (15)

PE's receiving data link capacity:
  ∑b∑w ybwvt ≤ dv,  v = 1, 2, …, V,  t = 1, 2, …, T    (16)

Data packets containing tasks are sent from the PA to PEs; each such packet is sent once:
  ∑b∑v∑t ybMvt = B    (17)

The result of task computation has to be sent back, enclosed in packet b, from the computing PE v to the PA:
  ∑b∑v∑t ybvMt xbv = B    (18)

PEs do not exchange data packets between each other:
  ∑b∑v∑w∑t ybvwt = 0,  v ≠ w ≠ M    (19)
Unlike the H-Phy model, mesh- and torus-specific elements are provided by constant values of the input data. The presented model is applied as a rule set to the simulation system.
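Because the overlay model is a plain MIP, it can be prototyped directly with an off-the-shelf solver. The sketch below encodes criterion (11) and constraints (12)-(17) and (19) with the open-source PuLP library; it is our own illustration, not the authors' implementation. Constraint (18), which is nonlinear as written, is replaced by an equivalent linear coupling (a task and its result travel if and only if the PE computes the task), constraint (19) is enforced by creating only PA-to-PE and PE-to-PA transfer variables, and the PA send limit u_M is our own parameter.

    # Sketch of the Overlay-NoC MIP in PuLP; (18) is linearized as
    # sum_t y_up[b][v][t] == x[b][v], and (19) holds by construction.
    from pulp import LpProblem, LpMinimize, LpVariable, lpSum, LpBinary

    def overlay_noc_model(B, V, T, c, k, p, d, u, u_M):
        x = LpVariable.dicts("x", (range(B), range(V)), cat=LpBinary)
        y_dn = LpVariable.dicts("y_dn", (range(B), range(V), range(T)), cat=LpBinary)
        y_up = LpVariable.dicts("y_up", (range(B), range(V), range(T)), cat=LpBinary)

        prob = LpProblem("overlay_noc", LpMinimize)
        # (11): computation cost plus PA<->PE transfer cost (k[v] = k_Mv = k_vM)
        prob += (lpSum(x[b][v] * c[v] for b in range(B) for v in range(V)) +
                 lpSum((y_dn[b][v][t] + y_up[b][v][t]) * k[v]
                       for b in range(B) for v in range(V) for t in range(T)))
        for v in range(V):
            prob += lpSum(x[b][v] for b in range(B)) >= 1        # (12)
            prob += lpSum(x[b][v] for b in range(B)) <= p[v]     # (14)
        for b in range(B):
            prob += lpSum(x[b][v] for v in range(V)) == 1        # (13)
            for v in range(V):
                # (17)/(18): task b and its result move iff v computes b
                prob += lpSum(y_dn[b][v][t] for t in range(T)) == x[b][v]
                prob += lpSum(y_up[b][v][t] for t in range(T)) == x[b][v]
        for t in range(T):
            prob += lpSum(y_dn[b][v][t]                          # PA send limit
                          for b in range(B) for v in range(V)) <= u_M
            for v in range(V):
                prob += lpSum(y_dn[b][v][t] for b in range(B)) <= d[v]  # (16)
                prob += lpSum(y_up[b][v][t] for b in range(B)) <= u[v]  # (15)
        return prob, x

Calling prob.solve() on a small instance yields an optimal allocation; as noted above, the simulation system applies the same rule set heuristically for large instances.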
4.3 System Algorithms

Each system unit's work is determined by its internal algorithm.

PE Algorithm:
1. If task computation has just finished, send the computation result to the PA.
2. If a compute request message was received, the PE is not currently in the computation phase, and the computation limit has not been reached, reply to the PA with a compute acknowledgement message. Otherwise send a compute denial.
3. If a compute task was received, the PE is not currently in the computation phase, and the computation limit has not been reached, start computing the task attached to the compute task message.
4. If a compute task was received and computation was not started in step 3, send a compute denial message to the PA.

By sending the computation result to the PA, a PE signals its readiness to compute further tasks. If a PE receives two compute request messages and the first one did not result in computation, the second one cancels the first.

PA&JS Algorithm:
1. If the queue is not empty, make the JS select a job. Select the PEs to which the job will be allocated, according to current knowledge about their state. Send compute request messages to the selected PEs.
2. If a compute denial message was received, repeat target node selection for the task related to the compute denial message and send compute request messages.
3. If compute acknowledgement messages were received from all PEs assigned to the selected task, send a compute task to all of them. Each compute task message contains a part of the selected job.
4. If computation result messages were received from all PEs related to a specific task, combine all messages into the job computation result and store it.

The JS selects a task from the task queue using a determined job selection strategy. Similarly to the H-Phy system, the FCFS scheme is used, but the system and model are ready to implement other strategies, which we consider future work.
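The message vocabulary and the PE-side steps above can be captured in a compact protocol sketch; this is our own illustration, and the type names are hypothetical.

    # PE-side handler for the Overlay-NoC messaging protocol
    # (steps 1-4 of the PE algorithm; message names follow the text).
    from enum import Enum, auto

    class Msg(Enum):
        COMPUTE_REQUEST = auto()
        COMPUTE_ACK = auto()
        COMPUTE_DENIAL = auto()
        COMPUTE_TASK = auto()
        COMPUTATION_RESULT = auto()

    class PE:
        def __init__(self, limit):
            self.limit = limit       # remaining computation limit of this PE
            self.busy = False

        def can_compute(self):
            return not self.busy and self.limit != 0

        def handle(self, msg):
            """Return the reply this PE sends back to the PA, if any."""
            if msg is Msg.COMPUTE_REQUEST:                       # step 2
                return Msg.COMPUTE_ACK if self.can_compute() else Msg.COMPUTE_DENIAL
            if msg is Msg.COMPUTE_TASK:
                if self.can_compute():                           # step 3
                    self.busy = True
                    if self.limit > 0:
                        self.limit -= 1
                    return None      # result is sent when computing ends
                return Msg.COMPUTE_DENIAL                        # step 4
            return None

        def finish(self):
            """Step 1: computation finished; the result signals readiness."""
            self.busy = False
            return Msg.COMPUTATION_RESULT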
5 Experimentation System and Results

The evaluation of the considered models was done in an experimentation system with the logical structure presented in Fig. 4.

Fig. 4. Block diagram of the experimentation system as an input-output system.

The elements of the system are:
- Input A: queue with jobs.
- Problem parameters: P1: approach used (H-Phy or Overlay-NoC); P2: size of the evaluated CMP; P3: topology (2D-Mesh or 2D-Torus); P4: computational limits for PEs.
- Outputs: O1: Computation Energy; O2: NoC Energy.
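The parameters P1-P4 and outputs O1-O2 map naturally onto a small configuration record; a sketch with hypothetical field names:

    # Experiment configuration and outputs mirroring Fig. 4.
    from dataclasses import dataclass
    from typing import Dict, Literal, Tuple

    @dataclass
    class ExperimentConfig:
        approach: Literal["H-Phy", "Overlay-NoC"]  # P1
        cmp_size: Tuple[int, int]                  # P2, e.g. (10, 10)
        topology: Literal["mesh", "torus"]         # P3
        limits: Dict[int, int]                     # P4: per-PE task limits

    @dataclass
    class ExperimentOutputs:
        computation_energy_uJ: float               # O1
        noc_energy_uJ: float                       # O2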
Experimental research performed for both models included experiments for both mesh and torus CMPs. In both experiments, we investigated 10×10 and 15×10 CMP structures. The randomly generated queues (discrete uniform distribution) contained 1,000 jobs that varied in size. To calculate computation energy, we assumed that one task requires 4 cycles for completion [1]. The average cost of transferring one data packet between two processors in the whole structure equals: for the 10×10 structure, 0.48 nJ (mesh) and 0.28 nJ (torus); for 15×10, 11.57 nJ (mesh) and 6.27 nJ (torus). The experiment was performed for uniform energy costs, which models a cluster containing identical processors. Jobs for the H-Phy approach were used in their original size; the Overlay-NoC approach required dividing them into 1×1 tasks. The H-Phy model does not include uv, dv and cv; thus, in order to ensure the ability to compare the H-Phy model with Overlay-NoC, these constants are not considered in the Overlay-NoC experiments either.

5.1 Energy Consumption with No Computational Limitations

In the first experiment, we used two queues requiring B = 6,210 and B = 8,798 cores if single cores are requested. In this investigation, no computation limit was assumed for the PEs; in other words, PEs could compute an infinite number of tasks.

Table 1. Energy consumption based on the approach used and topology of the CMP.
                                  H-Phy             Overlay-NoC
                                  Mesh     Torus    Mesh     Torus
  B=6210, CMP 10×10
    Computation Energy [µJ]       503.70   503.70   503.70   503.70
    NoC Energy [µJ]                 6.28     3.78     5.66     3.38
    Total Energy [µJ]             509.98   507.48   509.36   507.08
  B=8798, CMP 10×10
    Computation Energy [µJ]       713.61   713.61   713.61   713.61
    NoC Energy [µJ]                 8.80     5.34     8.03     4.80
    Total Energy [µJ]             722.41   718.95   721.64   718.41
  B=8798, CMP 15×10
    Computation Energy [µJ]       713.61   713.61   713.61   713.61
    NoC Energy [µJ]                10.80     7.37     9.90     6.31
    Total Energy [µJ]             724.41   720.98   723.51   719.92

Fig. 5. NoC Energy comparison for experiments with no computation limit (x-axis: size of mesh/torus and number of 1×1 tasks; series: Overlay and H-Phy on mesh and torus).
As we can see in Table 1, the computation energy consumed by PEs is significantly higher than the NoC energy. In the H-Phy case, the computation energy is on average 98 times higher than the NoC energy, while in the overlay model it is 110 times higher. This result is not a surprise, and it supports the correctness of the models, since similar outcomes were reported in [19] and [3]. Both simulation tactics confirmed the significant advantage of a NoC based on the torus topology (Table 1 and Fig. 5). In the H-Phy approach, the average NoC energy consumption for meshes was
59% higher in comparison to the torus topology; for the overlay tactic, the ratio was 63%. Additionally, the average NoC energy consumption using the H-Phy approach was 11% higher than in the overlay case. This is expected, since the overlay approach uses a non-contiguous allocation strategy while a contiguous scheme is used in the H-Phy method; the value reflects the profit achieved by the ability to divide tasks into single-PE size.

5.2 Energy Consumption Considering Computational Limitations

In the next experiment, we considered the same queue of jobs used in the previous experiment, with 1,000 jobs varying in size and requiring B = 8,798 cores if single cores are requested. The average number of tasks per core is AvgT = B / NoPE, where NoPE = nX*nY is the number of PEs in the CMP. Thus, for the 10×10 and 15×10 CMP structures, AvgT equals 88 and 59, respectively. Additionally, the performed investigations included computational limitations of PEs; in other words, PEs could process a limited number of tasks. Computational restrictions were assigned individually to each PE in the CMP. We investigated many scenarios with the following sets of restrictions (a generator sketch for these sets is given below):
- Set 1: no restrictions for PEs;
- Set 2: restrictions randomly generated for each PE using a discrete uniform distribution between -1 and AvgT, where -1 means the PE is able to compute an infinite number of tasks, 0 means the PE cannot compute any task, and 1..AvgT is the number of tasks that can be processed by the PE;
- Set 3: the computational limit for all PEs equals AvgT;
- Set 4: the two bottom rows of PEs in the mesh/torus cannot compute any task;
- Set 5: the four bottom rows of PEs cannot compute any task;
- Set 6: the six bottom rows of PEs cannot compute any task;
- Set 7: the two left columns of PEs cannot compute any task;
- Set 8: the four left columns of PEs cannot compute any task;
- Set 9: the six left columns of PEs cannot compute any task;
- Set 10: only for the 15×10 CMP, the eight left columns cannot compute any task;
- Set 11: only for the 15×10 CMP, the ten left columns cannot compute any task;
- Set 12: only for the 15×10 CMP, the eleven left columns cannot compute any task.

Results for the first three sets are presented in Table 2. In the Set 2 case, we analyzed several randomly generated sets of restrictions, and in the table we present the two most representative ones (Sets 2a and 2d). For all randomly generated sets, we were unable to process the queue of jobs (Sets 2a and 2d in Table 2); in other words, the PEs' restrictions did not allow processing all jobs from the queue. It was necessary to increase the number of tasks that can be processed by PEs (decrease the limitations), which is represented by Sets 2b, 2c, 2e, and 2f in the table, where:
i. in Set 2b, limitations are decreased 10 times for the 10×10 CMP (each PE was allowed to process 10 times more tasks than with Set 2a) and 6.5 times for the 15×10 CMP;
ii. in Set 2c, limitations are decreased 13 times for the 10×10 CMP and 7.5 times for the 15×10 CMP, in comparison to Set 2a;
iii. in Set 2e, limitations are decreased 11 times for the 10×10 CMP and 6.5 times for the 15×10 CMP, in comparison to Set 2d;
iv. in Set 2f, limitations are decreased 15.5 times for the 10×10 CMP and 7 times for the 15×10 CMP, in comparison to Set 2d.

We also had to decrease limitations in the case of Set 3, as follows:
i. in Set 3a, limitations are decreased by 9% for the 10×10 CMP (each PE was allowed to process 9% more tasks than with Set 3) and by 6% for the 15×10 CMP;
ii. in Set 3b, limitations are decreased by 16% for the 10×10 CMP and by 12% for the 15×10 CMP, in comparison to Set 3.

The impact of computational limitations on PEs is significant. It can happen that there is not enough computational time to complete the given queue (Sets 2a and 2d in Table 2). The overlay approach is better in this case, since by applying the non-contiguous allocation strategy, a queue that cannot be completed under computational restrictions using the H-Phy approach can be completed with the overlay tactic (Sets 2b, 2e and 3). Using the H-Phy approach, limitations need to be decreased 2.25 times on average to achieve the same results as with the overlay strategy. As in the previous experiment, results from both models confirmed the advantage of the torus over the mesh topology. The decrease of limitations needed to ensure completion of the queue of jobs was always lower for torus NoCs; in other words, restrictions in meshes have to be lower in comparison to the torus topology (Sets 2b, 2e, and 3a in Table 2). This makes torus systems more reliable and flexible, since losses of some PEs due to defects or temporal outages can be better accommodated.
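The restriction sets above can be generated programmatically; the sketch below is our own illustration (-1 encodes an unlimited PE, 0 a disabled PE), with rows assumed to be numbered from the top.

    # Generators for the computational-restriction sets of Section 5.2.
    import random

    def set_1(nX, nY):
        return {v: -1 for v in range(nX * nY)}               # no restrictions

    def set_2(nX, nY, B, seed=0):
        avg_t = round(B / (nX * nY))                         # AvgT = B / NoPE
        rng = random.Random(seed)
        return {v: rng.randint(-1, avg_t) for v in range(nX * nY)}

    def set_3(nX, nY, B):
        avg_t = round(B / (nX * nY))
        return {v: avg_t for v in range(nX * nY)}

    def disable_bottom_rows(nX, nY, rows):
        """Sets 4-6: the bottom `rows` rows cannot compute any task."""
        return {v: 0 if v // nX >= nY - rows else -1 for v in range(nX * nY)}

    def disable_left_columns(nX, nY, cols):
        """Sets 7-12: the left `cols` columns cannot compute any task."""
        return {v: 0 if v % nX < cols else -1 for v in range(nX * nY)}

    # AvgT for the B = 8798 queue: 88 on the 10x10 CMP, 59 on the 15x10 CMP.
    assert round(8798 / 100) == 88
    assert round(8798 / 150) == 59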
Interestingly, restricting the computation ability of PEs can provide NoC energy savings over systems with no restrictions, e.g., Sets 2b and 2c in H-Phy, or Sets 2e, 2f, 3, 3a, and 3b in overlay torus systems (Table 2). Thus, failure or temporal downtime of a few units is not always equivalent to losses in energy. Results of disabling more PEs for a longer time are presented in Fig. 6 and Fig. 7, where disabling columns of PEs (Fig. 6) and rows of PEs (Fig. 7) is considered. On the x-axes of the figures, only the considered sets are marked; NoC energy values shown in the ranges between the considered sets are estimated energy consumptions, e.g., in the middle of the range between Set 1 and Set 7 in Fig. 6a, the NoC energy value represents the energy consumption when one left column cannot compute any task. For meshes, NoC energy consumption increases as more columns or rows are disabled. For both the H-Phy and overlay models, disabling columns increased NoC energy by 33% and 49% on average, while disabling rows increased the energy by 31% and 22% on average, for the 10×10 and 15×10 CMPs, respectively.
Table 2. NoC Energy [µJ] under computational restrictions (Sets 1, 2a–2f, 3, 3a, 3b) for the H-Phy and Overlay approaches with mesh and torus topologies, on 10×10 and 15×10 CMPs; '−' entries denote configurations in which the queue could not be completed. [The individual table values are not recoverable from this copy.]
For toruses, disabling rows or columns can decrease NoC energy. If the number of restricted rows/columns is low (Sets 4 and 5, or Sets 7 and 8), the energy increases slightly (by around 10%). However, as the number of disabled rows/columns increases further (Set 6, or Sets 9, 10, 11, and 12), the energy consumption decreases. Moreover, it can even drop below the energy in the case with no restrictions (Fig. 6b), by up to 15% and 20% in the H-Phy and overlay models, respectively.
Fig. 6. NoC Energy [µJ] based on the number of columns restricted: (a) 10×10 CMP (Sets 1, 7, 8, 9); (b) 15×10 CMP (Sets 1, 7–12); series: Overlay and H-Phy on mesh and torus.
Fig. 7. NoC Energy [µJ] based on the number of rows restricted: (a) 10×10 CMP; (b) 15×10 CMP (Sets 1, 4, 5, 6); series: Overlay and H-Phy on mesh and torus.
6 Conclusions

In this paper, we proposed a new model of NoC-based CMPs with the ability to limit and monitor the computational power offered by PEs on the chip. The model was constructed using the mixed integer programming optimization technique for two completely different approaches: H-Phy and Overlay-NoC. The H-Phy strategy models contiguous processor allocation algorithms, while the overlay scheme is designed for non-contiguous allocation. We have proposed a simulation system in which both models can be implemented and evaluated. Aspects of modern CMPs such as efficient utilization of cores, core reliability, monitoring, accounting, and access restriction can be investigated and analyzed. The experimental results presented in the paper confirm the lower NoC energy consumption of a CMP when the torus topology is used. Similarly, the energy consumption using the non-contiguous processor allocation strategy is lower in comparison to the contiguous approach. The energy consumption in the torus topology is also less dependent on computation limitations; moreover, our results revealed that limiting or even disabling the computation power of some PEs can yield NoC energy savings. This is not the case for mesh networks, where any computing constraints caused the energy consumption to be higher. These observations were confirmed for both the H-Phy and overlay strategies. The presented evaluation system and experimental results also reveal that the overlay NoC simulation approach models modern CMPs as accurately as the H-Phy tactic. It gives enormous flexibility and possibilities for simulating NoC-based CMPs, their utilization characteristics, and issues related to computation power monitoring and limitations.
References

[1] K. A. Bowman, A. R. Alameldeen, S. T. Srinivasan, C. B. Wilkerson, "Impact of Die-to-Die and Within-Die Parameter Variations on the Throughput Distribution of Multi-Core Processors," Proceedings of the 2007 International Symposium on Low Power Electronics and Design (ISLPED '07), 2007, pp. 50-55, doi: 10.1145/1283780.1283792
[2] G. Chmaj, K. Walkowiak, "Decision Strategies for a P2P Computing System," Journal of Universal Computer Science, vol. 18, no. 5, 2012, pp. 599-622
[3] G. Chmaj, D. Zydek, Y. Z. Elhalwagy, H. Selvaraj, "Overlay-NoC and H-Phy based computing using Modern Chip MultiProcessors," Proceedings of the 2012 IEEE International Conference on Electro/Information Technology (EIT 2012), IEEE Computer Society Press, 2012, pp. 1-6, doi: 10.1109/EIT.2012.6220604
[4] L. B. Daoud, M. E. Ragab, V. Goulart, "Faster Processor Allocation Algorithms for Mesh-Connected CMPs," Proceedings of the 14th Euromicro Conference on Digital System Design (DSD 2011), 2011, pp. 805-808, doi: 10.1109/DSD.2011.107
[5] C. Feichtinger, S. Donath, H. Köstler, J. Götz, U. Rüde, "WaLBerla: HPC software design for computational engineering simulations," Journal of Computational Science, vol. 2, no. 2, 2011, pp. 105-112, doi: 10.1016/j.jocs.2011.01.004
[6] G. Hendry et al., "Time-division-multiplexed arbitration in silicon nanophotonic networks-on-chip for high-performance chip multiprocessors," Journal of Parallel and Distributed Computing, vol. 71, no. 5, 2011, pp. 641-650, doi: 10.1016/j.jpdc.2010.09.009
[7] Intel Microprocessor Export Compliance Metrics, http://download.intel.com/support/processors/corei5/sb/core_i5-600_d.pdf, 2011
[8] D. N. Jayasimha, B. Zafar, Y. Hoskote, "On-Chip Interconnection Networks: Why They Are Different and How to Compare Them," Intel, 2006
[9] D. Jensen, A. Rodrigues, "Embedded Systems and Exascale Computing," Computing in Science & Engineering, vol. 12, no. 6, 2010, pp. 20-29, doi: 10.1109/MCSE.2010.95
[10] Y. Lee, A. Zomaya, "Energy Conscious Scheduling for Distributed Computing Systems under Different Operating Conditions," IEEE Transactions on Parallel and Distributed Systems, vol. 22, no. 6, 2011, pp. 1374-1381, doi: 10.1109/TPDS.2010.208
[11] Y. Lin, W. Chen, A. Su, D. Chang, "A Low Cost, Low Power, High Scalability and Dependability Processor-Cluster Platform," 6th IEEE International Symposium on Industrial Embedded Systems (SIES), 2011, pp. 95-98, doi: 10.1109/SIES.2011.5953689
[12] M. F. Nadeem, S. A. Ostadzadeh, S. Wong, K. Bertels, "Task Scheduling Strategies for Dynamic Reconfigurable Processors in Distributed Systems," Proceedings of the International Conference on High Performance Computing and Simulation (HPCS), 2011, pp. 90-97, doi: 10.1109/HPCSim.2011.5999811
[13] N. Satish et al., "Fast Sort on CPUs, GPUs and Intel MIC Architectures," Technical Report, Intel, 2010
[14] H. Shen, Q. Qiu, "An FPGA-based Distributed Computing System with Power and Thermal Management Capabilities," Proceedings of the 20th International Conference on Computer Communications and Networks (ICCCN), 2011, pp. 1-6, doi: 10.1109/ICCCN.2011.6005802
[15] A. Voigt, T. Witkowski, "A multi-mesh finite element method for Lagrange elements of arbitrary degree," Journal of Computational Science, vol. 3, no. 5, 2012, pp. 420-428, doi: 10.1016/j.jocs.2012.06.004
[16] G. Chmaj, K. Walkowiak, M. Tarnawski, M. Kucharzak, "Heuristic Algorithms for Optimization of Task Allocation and Result Distribution in Peer-to-Peer Computing Systems," International Journal of Applied Mathematics and Computer Science, vol. 22, no. 3, 2012, pp. 733-748, doi: 10.2478/v10006-012-0055-0
[17] D. Zydek, H. Selvaraj, "Fast and Efficient Processor Allocation Algorithm for Torus-Based Chip Multiprocessors," Computers & Electrical Engineering, vol. 37, no. 1, 2011, pp. 91-105, doi: 10.1016/j.compeleceng.2010.10.001
[18] D. Zydek, H. Selvaraj, "Hardware Implementation of Processor Allocation Schemes for Mesh-Based Chip Multiprocessors," Microprocessors and Microsystems, vol. 34, no. 1, 2010, pp. 39-48, doi: 10.1016/j.micpro.2009.11.003
[19] D. Zydek, H. Selvaraj, G. Borowik, T. Luba, "Energy Characteristic of Processor Allocator and Network-on-Chip," International Journal of Applied Mathematics and Computer Science, vol. 21, no. 2, 2011, pp. 385-399, doi: 10.2478/v10006-011-0029-7
[20] D. Zydek, H. Selvaraj, L. Koszalka, I. Pozniak-Koszalka, "Evaluation Scheme for NoC-based CMP with Integrated Processor Management System," International Journal of Electronics and Telecommunications, vol. 56, no. 2, 2010, pp. 157-168, doi: 10.2478/v10177-010-0021-4
[21] D. N. Jayasimha, B. Zafar, Y. Hoskote, "On-Chip Interconnection Networks: Why They Are Different and How to Compare Them," Intel, 2006
[22] E. Antunes et al., "Partitioning and dynamic mapping evaluation for energy consumption minimization on NoC-based MPSoC," 13th International Symposium on Quality Electronic Design (ISQED), 2012, pp. 451-457, doi: 10.1109/ISQED.2012.6187532
[23] A. M. Law, "How to build valid and credible simulation models," Proceedings of the 2009 Winter Simulation Conference (WSC), 2009, pp. 24-33, doi: 10.1109/WSC.2009.5429312
[24] T. Enokido, A. Aikebaier, M. Takizawa, "Process Allocation Algorithms for Saving Power Consumption in Peer-to-Peer Systems," IEEE Transactions on Industrial Electronics, vol. 58, no. 6, 2011, pp. 2097-2105, doi: 10.1109/TIE.2010.2060453
[25] B. Andersson, "Assigning Real-Time Tasks on Heterogeneous Multiprocessors with Two Unrelated Types of Processors," Real-Time Systems Symposium (RTSS), 2010, pp. 239-248, doi: 10.1109/RTSS.2010.32
[26] A. Schranzhofer, J. Chen, L. Thiele, "Dynamic Power-Aware Mapping of Applications onto Heterogeneous MPSoC Platforms," IEEE Transactions on Industrial Informatics, vol. 6, no. 4, 2010, pp. 692-707, doi: 10.1109/TII.2010.2062192
[27] L. De Giusti, F. Chichizola, M. Naiouf, A. De Giusti, "Mapping Tasks to Processors in Heterogeneous Multiprocessor Architectures: The MATEHa Algorithm," International Conference of the Chilean Computer Science Society, 2008, pp. 85-91, doi: 10.1109/SCCC.2008.11