Bandwidth-Constrained Multi-Objective Segmented Brute-Force ...

Dec 1, 2016 - Bandwidth-Constrained Multi-Objective Segmented Brute-Force Algorithm for Efficient Mapping of Embedded Applications on NoC Architecture.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2017.2778340, IEEE Access

Date of publication xxxx 00, 0000, date of current version xxxx 00, 0000. Digital Object Identifier xxx/ACCESS.2017.DOI

Bandwidth-Constrained Multi-Objective Segmented Brute-Force Algorithm for Efficient Mapping of Embedded Applications on NoC Architecture

SARZAMIN KHAN1, SHERAZ ANJUM2, USMAN ALI GULZARI1, TARIQ UMER2 (Senior Member, IEEE), and BYUNG-SEO KIM3 (Senior Member, IEEE)

1 Department of Electrical Engineering, COMSATS Institute of Information Technology, Pakistan (e-mail: [email protected])
2 Department of Computer Science, COMSATS Institute of Information Technology, Pakistan (e-mail: [email protected])
3 Department of Computer and Information Communication Engineering, Hongik University, Sejong, South Korea (e-mail: [email protected])

Corresponding author: Sarzamin Khan (e-mail: [email protected]). “This research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (2015R1D1A1A01059186).”

ABSTRACT Network-on-Chip (NoC) is an emerging alternative to address the communication problem in embedded System-on-Chip (SoC) designs. One of the key issues is the optimized mapping of embedded applications onto the underlying NoC platform. In this paper, we propose the Bandwidth-Constrained Multi-Objective Segmented Brute-Force Mapping (SBMAP) algorithm, which minimizes the communication energy consumption and reduces the computational complexity of NoC designs. The algorithm generates an efficient mapping of the embedded applications onto the Processing Elements (PEs) of the NoC system by dividing the application into multiple segments. It utilizes the property of modular systematic search, which produces high-performance results with optimized simulation time. We compared the SBMAP algorithm with state-of-the-art mapping techniques, namely Branch and Bound (BB), Near-optimal Mapping (NMAP), and Random Mapping, on real-world application workloads. The experimental results validate the efficiency of the proposed algorithm against its competitors for most of the performance parameters of NoC designs. The improvement in energy consumption of the SBMAP algorithm is up to 53% for 2D mesh and 62% for torus topology compared to the NMAP, BB, and Random algorithms on the VOPD, PIP, WiFi receiver, and MMS real-time application benchmarks.

INDEX TERMS Algorithm, Brute-Force, Communication, Network-on-Chip, Topology.

I. INTRODUCTION

THE recent trend in the deployment of power-efficient System-on-Chip (SoC) designs has prompted the research community to develop NoC-based designs for power and performance improvements. Classical bus-based, power-hungry systems have scalability and performance issues. Network-on-Chip (NoC) has emerged as a promising solution for embedded System-on-Chip designs [1], [2]. An NoC is a packet-based, on-chip switching network designed for communication among the Intellectual Property (IP) cores of SoC systems [3]. NoCs use packets to exchange data between processing elements (PEs) via a network fabric that consists of Resource Network Interfaces (RNIs), routers, and interconnecting links, as shown in Figure 1. There are various open research issues [4] in NoC designs, and researchers are addressing them with a range of methodologies [5]–[7]. The design flow of the NoC

VOLUME x, 2017

architecture consists of Application Partitioning, Task Scheduling and Application Mapping on the target NoC platform. Application Partitioning and Task Scheduling are related to task deadlines and execution time. These are traditional CAD problems and have been addressed by the CAD community in the hardware/software co-design phase [28], [29]. In NoC designs, however, one of the most important issues is the mapping of applications onto the underlying NoC platform [8]. Application mapping determines the topological placement of the IPs onto the NoC platform in order to optimize certain performance metrics, e.g. energy, latency, throughput, and power. A hard NoC is an NoC architecture with pre-designed computation and communication infrastructure; there is no flexibility to change the placement of the IP cores. Firm NoCs have a pre-designed communication architecture, but the topological placements of the IPs are still to be

2169-3536 (c) 2017 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.


decided in the architecture. Application mapping is an NP-hard (non-deterministic polynomial-time hard) problem [9], because the search space grows factorially with the system size. To map k Intellectual Property (IP) cores onto n network nodes (k ≤ n), the number of possible core arrangements (S) on the NoC network is given by:

$$S_n^k = \frac{n!}{(n-k)!} \tag{1}$$

When the number of IP cores equals the number of network nodes (n = k), the number of possible IP mappings on the network nodes becomes n!. Application mapping is therefore a combinatorial optimization problem that requires efficient heuristic algorithms for optimized solutions. Integer Linear Programming (ILP) produces the best solution for small problem sizes of real-world applications [10], [11]. If the NoC size scales up, ILP methods cannot solve the problem in polynomial time, as is evident from (1). Different heuristics proposed in the literature speed up the computation [12]–[16], but compromise the best feasible solution.

Modern Android and embedded System-on-Chip designs are battery powered and consume high power due to their complexity, so the optimization of energy and performance parameters is an important aspect of developing these systems. Application mapping optimization is the most important part of the design phase of embedded systems because it can greatly affect energy consumption and performance [31]–[35].

In this research work we propose a Bandwidth-Constrained Multi-Objective Segmented Brute-Force Mapping algorithm (SBMAP) for NoC application mapping. SBMAP utilizes the property of modular systematic search to reduce the computational time complexity and to increase the quality of the feasible solution for the performance parameters compared to other heuristics. The algorithm divides the input stream of the application into multiple segments and solves each one with a permutation-based modular systematic search method.
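To give a feel for the growth described by Equation (1), the following Python sketch evaluates the search-space size and contrasts it with the number of orderings visited when tasks are permuted one segment at a time. The function names and the per-segment count are illustrative assumptions rather than part of the SBMAP implementation; the count assumes that each segment is permuted independently over its own tile slots.

```python
import math


def search_space(n: int, k: int) -> int:
    """Equation (1): number of ways to place k IP cores on n nodes, n! / (n - k)!."""
    if not 0 <= k <= n:
        raise ValueError("require 0 <= k <= n")
    return math.perm(n, k)


def segmented_orderings(k: int, segment_size: int = 10) -> int:
    """Orderings visited if k tasks are split into segments that are
    permuted independently (an assumption made for illustration only)."""
    if segment_size <= 0:
        raise ValueError("segment_size must be positive")
    full, rest = divmod(k, segment_size)
    count = full * math.factorial(segment_size)
    if rest:
        count += math.factorial(rest)
    return count


if __name__ == "__main__":
    for cores in (9, 16, 25):
        print(cores, search_space(cores, cores), segmented_orderings(cores))
```

For 16 cores on 16 nodes, `search_space(16, 16)` is 16!, roughly 2.1 × 10^13 arrangements, whereas one segment of ten tasks plus one of six gives 10! + 6! = 3,629,520 orderings under the stated assumption.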
The input stream is the collection of IP cores and tasks, together with the required bandwidth of the embedded application. Tasks that have high communication demands with their neighbors are grouped into distinct segments for the initial mapping. The initially mapped segments are then iteratively optimized by the algorithm for the best possible solution. For each partially generated mapping, the algorithm calculates the energy and cost of the network and retains the mapping that has the minimum cost and energy consumption. The algorithm keeps track of the previously processed segments in order to generate a cumulative optimized solution. The SBMAP algorithm is embedded in the NoCTweak simulator [26] for generating optimized power, throughput and latency under the constraint of bandwidth reservation.

The rest of this paper is organized as follows: In Section II, we present related research work on NoC application mapping. Section III briefly describes the problem formulation and mathematical models for the NoC performance parameters. Section IV presents the proposed Bandwidth-Constrained Multi-Objective SBMAP algorithm. Section V

FIGURE 1. Network-on-Chip (NoC). Labeled components: PE, RNI, router, link.

analyzes the simulation results, and finally, in Section VI, we present concluding remarks and future work.

II. RELATED WORK

Application mapping is an NP-hard problem and can be handled through heuristic or systematic search techniques. Various algorithms, including Branch and Bound (BB), Near-optimal Mapping (NMAP), Random mapping, Simulated Annealing (SA), Tabu Search (TS), Genetic Algorithm (GA) and Particle Swarm Optimization (PSO), are used for energy, power, latency, throughput, and bandwidth optimization of NoCs [17], [30].

J. Hu et al. [18] presented the Branch and Bound (BB) mapping algorithm for the topological placement of IPs onto an NoC platform to minimize total energy consumption under a link-bandwidth constraint. The authors compared the algorithm with an ad hoc Simulated Annealing (SA) method and showed that BB is faster, although it produced sub-optimal results relative to the SA technique.

T. Lei et al. [19] proposed a delay-based two-step Genetic Algorithm (GA) for on-chip communication in NoCs. The objective function for both mapping and scheduling of the IPs is the minimization of the total execution time of the tasks. The application is mapped onto a 2D mesh topology. For mapping the IP cores on the NoC, message delays from source to destination node are estimated using a delay model. To find the critical path and schedule the vertices of the task graph on NoC nodes, Asynchronous as Late as Possible (ALAP) and Asynchronous as Soon as Possible (ASAP) scheduling are proposed. This approach did not consider power and energy consumption, which are important aspects of application mapping.

S. Murali et al. [20] presented an algorithm for mapping IP cores onto an NoC architecture. The network traffic is divided among the IPs across multiple links on a 2D mesh topology


with the constraint of bandwidth reservation. The authors proposed a heuristic approach with a minimal-path routing algorithm for mapping the cores onto a 2D mesh topology. The approach includes initialization of the mapping, minimal-path computation, and pairwise swapping of the vertices. The resulting NMAP algorithm splits the traffic across different minimum paths such that the bandwidth constraint is satisfied. When the NMAP results were compared with the BB algorithm, BB produced better solutions for small numbers of cores, but NMAP performed better as the system size scaled up.

Z. Lu et al. [21] proposed Cluster-based Simulated Annealing (CSA) to reduce the simulation time of the annealing process for large systems. CSA clusters the IPs according to pre-defined rules and treats each cluster as a single entity, which reduces the effective system size, speeds up the simulation and yields a near-optimal solution. In the clustering process, the edge tiles are combined in the same group while the central tiles are shared with a different group, based on their neighborhood. The clustering technique sped up the computation but compromised the optimality of the results.

C. Radu and L. Vintan [9] presented an Optimized Simulated Annealing (OSA) algorithm for NoC application mapping on a 2D mesh topology. In this method, the authors tuned the annealing parameters to produce optimal results in a shorter time than conventional simulated annealing. The annealing schedule, number of iterations per temperature level, acceptance function, Probability Distribution Function (PDF) based swapping, and stopping conditions are modified to speed up the simulation. The simulation time of this method is still higher than that of other heuristic techniques.

G. Ascia et al. [22] proposed a multi-objective mapping heuristic for the 2D mesh topology of the NoC architecture to optimize network performance and power consumption.
A Multi-Objective Genetic Algorithm is presented for mapping the IP cores. R. K. Jena et al. [23] proposed an algorithm for IP mapping on the 2D mesh topology of the NoC architecture. A multi-objective GA-based heuristic is used to map the IP cores and find the mappings that optimize network power consumption, link bandwidth and network performance. M. J. Sepulveda et al. [24] proposed MAIA (Multi-Objective Adaptive Immune Algorithm), a multi-application evolutionary algorithm for solving the NoC mapping problem. H. M. Harmanani et al. [25] presented a Simulated Annealing (SA) algorithm for task assignment to network nodes on a 2D mesh topology. The authors also proposed a routing algorithm that optimizes message blocking, bandwidth and throughput of the network.

Most of the mapping algorithms described above compromise either simulation time or the optimality of the results. The proposed SBMAP algorithm optimizes both the computation time and the resulting performance parameters of the NoC system.

III. PROBLEM FORMULATION

In NoC designs, an application is represented by a Network Task Graph (NTG), which is subsequently scheduled by a scheduler onto the available IP cores through a Network Core Graph (NCG). The NCG is then transformed and mapped by an efficient mapping algorithm onto the NoC topology through a NoC Architecture Graph (NAG).

Definition 1: A Network Task Graph (NTG) is a directed acyclic graph, $N = N(T, C)$, in which each vertex represents a task ($t_i \in T$, $i = 1, 2, 3, \ldots$) for the computational resource of the application. Each task is associated with an execution time, energy consumption and resource deadlines. The directed arc ($c_{i,j} \in C$, $i = 1, 2, 3, \ldots$, $j = 1, 2, 3, \ldots$) represents either the data volume or the information communicated between the tasks $(t_i, t_j)$.

Definition 2: A Network Core Graph (NCG), $G = G(P, A)$, is a directed graph in which each vertex ($p_i \in P$) represents an intellectual property (IP) core or Processing Element (PE). The directed arc ($a_{i,j} \in A$) carries the characteristic parameters and the required bandwidth between the IP cores ($p_i$ to $p_j$).

Definition 3: A NoC Architecture Graph (NAG), $A = A(R, H)$, is an architecture graph in which each vertex ($r_i \in R$) represents a router node, and the directed arc ($h_{i,j} \in H$) represents the routing channel or link from router $r_i$ to router $r_j$. The router transmits and receives the data volume of the associated IP cores, and the routing channels ($N_h$) provide the routing paths for the communicated packets. Each routing channel is associated with data bandwidth information, $B_{t_i,t_j}$.

The following mathematical models are used for the energy, power, throughput and latency calculations of the SBMAP algorithm for optimized application mapping on the NoC architecture.

A. BIT ENERGY MODEL

The Bit Energy [8] of the NoC platform is given by:

$$E_{Bit} = E_S + E_L \tag{2}$$

$E_{Bit}$ is the energy consumed in sending one bit of data from the source to the destination node and includes the Switch Energy ($E_S$) and Link Energy ($E_L$) of the NoC network. The average network energy ($E_{t_i,t_j}$) consumed in transmitting one bit of data from source tile $t_i$ to destination tile $t_j$ is given by:

$$E_{t_i,t_j} = N_h \times E_S + (N_h - 1) \times E_L \tag{3}$$

where $N_h$ is the Manhattan distance from the source node $(x_i, y_i)$ to the destination node $(x_j, y_j)$ of the NoC architecture and is given by:

$$N_h = |x_i - x_j| + |y_i - y_j| \tag{4}$$

The total energy consumption of the network is therefore given by:

$$E_{Total} = \sum_{i,j}^{N_h} \left( B_{t_i,t_j} \times E_{t_i,t_j} \right) \tag{5}$$
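Equations (3) to (5) can be evaluated directly once tile coordinates and arc bandwidths are known. The sketch below is a minimal Python illustration, not the authors' SystemC code: the function names, the dictionary layout and the numeric values chosen for $E_S$ and $E_L$ are hypothetical placeholders used only to make the arithmetic visible. It assumes every arc connects two distinct tiles, so that $N_h \geq 1$.

```python
from typing import Dict, Tuple

Tile = Tuple[int, int]  # (x, y) coordinates of a tile on the mesh


def manhattan_hops(src: Tile, dst: Tile) -> int:
    """Equation (4): N_h = |x_i - x_j| + |y_i - y_j|."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def pair_bit_energy(src: Tile, dst: Tile, e_switch: float, e_link: float) -> float:
    """Equation (3): E = N_h * E_S + (N_h - 1) * E_L, for distinct tiles."""
    n_h = manhattan_hops(src, dst)
    return n_h * e_switch + (n_h - 1) * e_link


def total_energy(bandwidth: Dict[Tuple[Tile, Tile], float],
                 e_switch: float, e_link: float) -> float:
    """Equation (5): sum of B * E over all communicating tile pairs."""
    return sum(b * pair_bit_energy(src, dst, e_switch, e_link)
               for (src, dst), b in bandwidth.items())


# Hypothetical example: two arcs, with placeholder energies in arbitrary units.
example_bandwidth = {((0, 0), (1, 0)): 64.0, ((0, 0), (1, 1)): 16.0}
example_total = total_energy(example_bandwidth, e_switch=1.0, e_link=0.5)
```

With these placeholder values the first arc spans one hop (energy 1.0 per bit) and the second spans two hops (energy 2.5 per bit), so `example_total` is 64 × 1.0 + 16 × 2.5 = 104.0.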


$B_{t_i,t_j}$ is the arc bandwidth from tile $t_i$ to tile $t_j$. Therefore,

$$E_{Total} = \sum_{i,j}^{N_h} \left[ B_{t_i,t_j} \times \left( N_h \times E_S + (N_h - 1) \times E_L \right) \right] \tag{6}$$

The cost can be computed using the following equation:

$$Cost = \sum_{i,j}^{N_h} \left( B_{t_i,t_j} \times N_h \right) \tag{7}$$

Different mappings will generate different cost and energy solutions, and our objective is to find a mapping function that produces minimum cost and energy for the entire network operation. In this research work, the cost and energy of the NoC applications are used as the performance indicators for different applications, mapped to their placeholders.

B. CMOS CELL LIBRARY MODEL

The CMOS Cell Library Model utilizes post-layout cell data of standard CMOS libraries for timing and power estimation [26]. Our proposed SBMAP algorithm uses the CMOS standard Cell Library Model to compute the average latency, throughput, power and energy consumption of the NoC system. The average latency of the network under this model is given by:

$$Lt_{av} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{N_i} \sum_{j=1}^{N_i} Lt_{i,j} \tag{8}$$

where $Lt_{i,j}$ is the packet latency of packet $j$, $N_i$ is the number of packets received by processor $i$ after the warm-up time, and $N$ is the number of processors in the platform. The average throughput of the network is given by:

$$Th_{av} = \frac{1}{N (T_{sim} - T_{wrm})} \sum_{i=1}^{N} N_i \tag{9}$$

where $T_{sim}$ is the simulation time and $T_{wrm}$ is the warm-up time of the simulations. The average power of the network is calculated by:

$$Pw_{av} = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{N} \left[ \alpha_{i,j} Pw_{act,j} + (1 - \alpha_{i,j}) Pw_{inact,j} \right] \tag{10}$$

where $Pw_{act,j}$ is the post-layout active power and $Pw_{inact,j}$ is the post-layout inactive power of component $j$. The parameter $\alpha_{i,j}$ is the active percentage of component $j$ in router $i$ (after $T_{wrm}$). Finally, the average energy consumed by each packet in the network is given by:

$$En_p = \frac{T_{sim} - T_{wrm}}{N N_P} \sum_{i=1}^{N} \sum_{j=1}^{N} \left[ \alpha_{i,j} Pw_{act,j} + (1 - \alpha_{i,j}) Pw_{inact,j} \right] \tag{11}$$

where $N_P$ is the total number of packets traversing the network of the NoC architecture.

FIGURE 2. SBMAP algorithm. Flowchart steps: (1) input the Network Task Graph; (2) order IPs in segments having maximum communications; (3) calculate the cost and energy of the segment; (4) permute the segment and retain the minimum cost; (5) if the last segment has not yet been searched for minimum cost, set Segment = Segment + 1 and repeat; otherwise (6) calculate the total minimum cost and generate the best mapping; (7) calculate energy, power, latency and throughput.

C. ORION MODEL

The ORION Model [27] calculates power and energy at the discrete and component level of the network and can be embedded within the simulation environment for total energy calculations of the network. The ORION model is not used in our proposed SBMAP algorithm but can be utilized as an extension of this research work.

IV. SBMAP ALGORITHM

The proposed Segmented Brute-Force Mapping (SBMAP) algorithm takes the Network Task Graph (NTG) and Network Core Graph (NCG) as inputs and performs the topological placement of the tasks on the available tiles of the NoC platform to generate an efficient NoC Architecture Graph (NAG). The transformation from NTG to NCG intrinsically requires scheduling the tasks (T) on the available processors (P) for execution. When T = P or T < P, a single task can be assigned to each PE; if T > P, two or more tasks have to be scheduled on a single PE to accommodate all the tasks of the NTG. For this purpose, a scheduler is required before the performance simulations. The scheduler handles the control and data dependencies. It accomplishes activities


| Task | 1. NTG and NCG mapping (tile) | 2. Initial mapping with heavily communicating modules (tile) | 3. Splitting initial mapping and applying systematic search method (tile) | 4. SBMAP optimized mapping (tile) |
|------|------|------|------|------|
| T0 | 0 | 5 | 5 | 3 |
| T1 | 1 | 4 | 4 | 0 |
| T2 | 2 | 2 | 2 | 2 |
| T3 | 3 | 3 | 3 | 1 |
| T4 | 4 | 1 | 1 | 7 |
| T5 | 5 | 0 | 0 | 4 |
| T6 | 6 | 15 | 15 | 15 |
| T7 | 7 | 7 | 7 | 5 |
| T8 | 8 | 14 | 14 | 8 |
| T9 | 9 | 6 | 6 | 13 |
| T10 | 10 | 10 | 10 | 9 |
| T11 | 11 | 11 | 11 | 12 |
| T12 | 12 | 12 | 12 | 11 |
| T13 | 13 | 13 | 13 | 10 |
| T14 | 14 | 9 | 9 | 6 |
| T15 | 15 | 8 | 8 | 14 |

FIGURE 3. SBMAP illustration with example.

such as the execution time, deadlines and priorities of the processing tasks. In this research work, we consider a one-to-one mapping of tasks onto the IP cores, i.e., T = P; scheduling is therefore not required at this stage, and the transformation of the NTG to the NCG has no timing bounds or deadline constraints for the simulations. In the proposed algorithm (Algorithm 1), the initial mapping is based on the bandwidth requirements and the communication workload between PEs. The communication-intensive tasks are grouped into segments of ten tasks or fewer to minimize computation time and network energy consumption. These segments are then iteratively checked by the SBMAP algorithm for minimum cost, using the modular systematic search method. The mapping and cost of the preceding groups are retained and used sequentially for the cost calculation of the application, as shown in Figure 2.

The Segmented Brute-Force Mapping (SBMAP) algorithm systematically searches the problem space for an optimized mapping of an application onto the NoC platform, using the following sequence of steps:
• Initializes the task graph by keeping the most heavily communicating tasks in the same segment.
• Divides the search space into small segments of IPs to minimize execution time.
• Searches for the best solution for each segment by permutation with the modular systematic search method.
• Retains and updates the best mapping of each segment to minimize energy consumption.
• Explores all the segments for an optimized mapping, based on the cumulative cost comparison.
• Calculates power, energy, latency and throughput of the best mapping found.

The proposed algorithm is multi-objective in nature, with the constraint of bandwidth reservation. The main objectives

Algorithm 1: Segmented Brute-Force Mapping

INPUT: G(P, A) or N(T, C)
OUTPUT: Mapping G(P, A), N(T, C) → A(R, H)
Number of Segments (N) = Total tasks / 10
Initialize task mapping, N = 1, min Cost = ∞
Do {
    Calculate Manhattan distance
    Calculate Communication Bandwidth
    If the bandwidth constraint is satisfied, then find the total Cost
        Total Cost = Manhattan distance × Communication Bandwidth
        Compute Bit energy
    If (Total Cost < min Cost)
        min Cost = Total Cost
        Task mapping = min Cost mapping
        Total Energy
    N++
} While (Next Mapping)
Return best cumulative mapping with lowest Cost of all segments
Calculate application Energy, Power, Latency, and Throughput
End

of the proposed algorithm are to optimize performance parameters such as power, energy, latency and throughput of


FIGURE 4. Network Task Graphs with communication bandwidth (MB/s): (a) MMS application, (b) WiFi receiver application, (c) VOPD application, (d) PIP application.

the embedded application. Figure 3 shows the flow of the mapping algorithm using an arbitrary example with 16 tasks. The application mapping is performed on the NTG and NCG input files that the user supplies to the simulator. In the first phase, the algorithm generates an initial mapping by grouping heavily communicating tasks into distinct segments. The SBMAP algorithm then splits the initial mapping into modules and applies the modular systematic mapping optimization technique. The module size is user-selectable and can be customized in the user configuration settings. The algorithm uses the Bit energy model to find the optimized mapping through the minimum-cost computation method. The NoC performance parameters are calculated using the CMOS cell library model.
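The flow described above (initial ordering by communication volume, splitting into segments, exhaustive permutation of each segment, retention of the cheapest mapping) can be sketched in Python as follows. This is an illustrative reading of the text, not the SystemC code embedded in NoCTweak. All names are hypothetical, tiles are assumed to be numbered row by row on a mesh, the cost is bandwidth multiplied by Manhattan hop count, and the bandwidth constraint is reduced to an optional `feasible` callback because the text does not spell out how the check is performed.

```python
from itertools import permutations
from typing import Callable, Dict, List, Tuple

Edge = Tuple[int, int]  # (source task, destination task)


def hops(a: int, b: int, width: int) -> int:
    """Manhattan distance between two tile indices on a row-major mesh."""
    return abs(a % width - b % width) + abs(a // width - b // width)


def cost(mapping: Dict[int, int], bw: Dict[Edge, float], width: int) -> float:
    """Communication cost: bandwidth times hop count, summed over all arcs."""
    return sum(b * hops(mapping[s], mapping[d], width) for (s, d), b in bw.items())


def initial_order(n_tasks: int, bw: Dict[Edge, float]) -> List[int]:
    """Order tasks so that heavily communicating pairs sit next to each other."""
    order: List[int] = []
    for (s, d), _ in sorted(bw.items(), key=lambda kv: -kv[1]):
        for task in (s, d):
            if task not in order:
                order.append(task)
    order += [t for t in range(n_tasks) if t not in order]
    return order


def sbmap_sketch(n_tasks: int, bw: Dict[Edge, float], width: int,
                 seg_size: int = 10,
                 feasible: Callable[[Dict[int, int]], bool] = lambda m: True):
    """Permute one segment at a time and keep the cheapest feasible mapping."""
    order = initial_order(n_tasks, bw)
    mapping = {task: tile for tile, task in enumerate(order)}
    best = cost(mapping, bw, width) if feasible(mapping) else float("inf")
    for start in range(0, n_tasks, seg_size):
        seg_tasks = order[start:start + seg_size]
        seg_tiles = [mapping[t] for t in seg_tasks]
        for perm in permutations(seg_tiles):
            trial = dict(mapping)
            trial.update(zip(seg_tasks, perm))
            if not feasible(trial):
                continue
            trial_cost = cost(trial, bw, width)
            if trial_cost < best:  # retain the cheapest mapping seen so far
                best, mapping = trial_cost, trial
    return mapping, best
```

On a four-task ring with bandwidths 100, 90, 80 and 70 placed on a 2x2 mesh, the initial row-major placement costs 500 in this model, and the sketch reduces it to 340 even with `seg_size=2`. A full segment of ten tasks means 10! permutations, so pure Python is only practical here for small segments.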


TABLE 1. Platform Description for Simulation Environment

required bandwidth, traffic flow and interdependencies. As mentioned in section III, we consider one to one mapping of tasks on the IP cores of the network, therefore the MMS application is mapped on 5x5, 2D mesh and torus topologies for analysis. The WiFi application has 24 tasks and it is therefore, mapped on 5x5, mesh and torus topologies. VOPD occupies 16 tasks, and it is mapped on 4x4, while PIP requires 9 tasks and hence mapped on 3x3, 2D mesh and torus topologies. The simulation settings, utilized for the comparison of the application mapping of the embedded applications, under the uniform NoCTweak platform are shown in Table. I. These configuration settings are only considered for fair comparison of the algorithms and have no effect on the design structure of the proposed algorithm.

Network type

2-D mesh and torus

Platform type

EMBEDDED

Embedded applications

VOPD, MMS, PIP, 80211arx

Packet delivery type

WITHOUT ACK

Sending ACK policy

SEND ACK OPTIMALLY

Packet distribution

EXPONENTIAL

Fixed packet length

10 (flits)

Flit injection rate

0.1 (flits/cycle/node)

Mapping algorithm

BB, SBMAP, NMAP, RANDOM

Router type

WORMHOLE-PIPELINE

Routing algorithm

XY DIMENSION-ORDERED

Output channel selection

XY-ORDERED

Buffer size

8 (flits)

Inter-router link length

1000 (um)

A. 2D MESH TOPOLOGY

Pipeline type

8

Pipeline stages

4

Input voltage

1 (V)

Input clock frequency

1000 (MHz)

Operating clock frequency

1000 (MHz)

Warm-up time

20000 cycles

The results obtained by our proposed SBMAP algorithm for 2D mesh topology are compared with NMAP, BB and Random algorithms as shown in Table II and III. Table II shows the simulated results and the percentage savings of the SBMAP algorithm for power, energy and cost computations. Similarly, Table III shows latency, simulation time, throughput, and the percentage improvements of the SBMAP algorithm over BB, NMAP and Random mapping algorithms. NMAP is a standard algorithm for the comparison of mapping algorithms and therefore, we normalized the results to the NMAP algorithm as shown in Figure 5 and Figure 6. SBMAP algorithm has 28.8%, 0.5% and 38.7% improvement in power consumption than BB, NMAP and Random algorithm respectively, for VOPD application as shown in Table II and Figure 5(a). For PIP application, it has 24.5% power savings than NMAP algorithm and 32.2% improvement over Random algorithm. For MMS application, SBMAP has 23.7% improved performance than Random algorithm. Power consumption improvement for 80211arx is 15.4%, 28% and 45.3% as compared to BB, NMAP and Random algorithm respectively. The reduction in the energy consumption is 28.8% and 36.4% as compared to BB and Random algorithm for VOPD application as shown in Table II and Figure 5(b). For PIP application, the energy savings are 24.5% and 31.3% as compared to NMAP and Random algorithm, respectively. MMS has 24.1% more energy consumption by mapping through Random algorithm. SBMAP has 20%, 35.6% and 53.5% lower energy consumption for 80211arx application as compared to BB, NMAP and Random algorithm respectively. The absolute values and savings of the communication cost are also shown in Table II. Figure 5(c) shows the normalized cost measurements of SBMAP for different applications, which are lower as compared to BB, NMAP and Random algorithm. Network latency of SBMAP algorithm has better improvements for these applications than BB, NMAP and Random algorithm, as shown in Table III and Figure 6(a). 
The simulation time in Figure 6(b) of the SBMAP algorithm is slightly higher than its prior arts, but it is within a range of a few seconds (Table III). However, the quality of the

The code of the proposed algorithm is written in SystemC and embedded in NoCTweak simulator. NoCTweak is an open source SystemC based simulator developed by Anh T. Tran for NoC design simulations [26]. The simulator integrates two algorithms (NMAP and Random algorithm) for application mapping of the embedded systems. In addition Branch and Bound (BB) algorithm of the NOCMAP simulator [18] is also implanted in the NoCTweak simulator for comparison and analysis of the proposed as well as its competitor algorithms. The NoCTweak platform is utilized to provide a fair and uniform simulation environment for comparison and analysis of energy, power, throughput and latency parameters of all the underlined mapping algorithms.

V. RESULTS AND ANALYSIS

To verify the effectiveness of the SBMAP algorithm, two topologies, namely 2D mesh and torus, are used for mapping and comparative analysis. As a case study, four real-time benchmarks, Multimedia System (MMS) in Figure 4(a), WiFi Receiver (80211arx) in Figure 4(b), Video Object Plane Decoder (VOPD) in Figure 4(c), and Picture In Picture (PIP) in Figure 4(d), are collected from the literature and the NoCTweak simulator [17], [26] for mapping and evaluation of their performance parameters. The Network Task Graphs (NTGs) of the benchmarks in Figure 4 show the tasks, the communication workloads (MB/s), the interdependencies, and the traffic flow between communicating tasks. For example, the MMS application shown in Figure 4(a) requires 25 processing tasks, from T0 to T24. The communication bandwidth from T0 to T1 is 38106 MB/s over a unidirectional link. Similarly, the remaining nodes are represented in the graph with the


2169-3536 (c) 2017 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.


TABLE 2. Network Power, Energy and Cost Improvements for 2D Mesh Topology

Power (uW) & Savings (%)

| Application | BB | Savings (%) | NMAP | Savings (%) | RANDOM | Savings (%) | SBMAP |
|---|---|---|---|---|---|---|---|
| VOPD | 27551.1 | 28.8 | 21508.6 | 0.5 | 29685.1 | 38.7 | 21396.1 |
| PIP | 25964.2 | 0.5 | 32151.5 | 24.5 | 34133.6 | 32.2 | 25823.1 |
| MMS | 26869.1 | 0.6 | 26759.8 | 0.2 | 33033.4 | 23.7 | 26701.2 |
| 80211arx | 70187.4 | 15.4 | 77871.1 | 28.0 | 88380.6 | 45.3 | 60825.0 |

Energy (uJ/Packet) & Savings (%)

| Application | BB | Savings (%) | NMAP | Savings (%) | RANDOM | Savings (%) | SBMAP |
|---|---|---|---|---|---|---|---|
| VOPD | 368.6 | 28.8 | 287.9 | 0.6 | 390.4 | 36.4 | 286.2 |
| PIP | 178.7 | -0.4 | 223.3 | 24.5 | 235.5 | 31.3 | 179.3 |
| MMS | 451.7 | 0.8 | 449.3 | 0.2 | 556.2 | 24.1 | 448.2 |
| 80211arx | 264.5 | 20.0 | 299.0 | 35.6 | 338.3 | 53.5 | 220.4 |

Cost & Savings (%)

| Application | BB | Savings (%) | NMAP | Savings (%) | RANDOM | Savings (%) | SBMAP |
|---|---|---|---|---|---|---|---|
| VOPD | 9053 | 119.5 | 4265 | 3.4 | 10552 | 155.8 | 4125 |
| PIP | 640 | 0 | 704 | 10 | 1088 | 70 | 640 |
| MMS | 693802 | 5.0 | 667628 | 1.0 | 760000 | 15.0 | 660931 |
| 80211arx | 13158 | 3.2 | 16185.6 | 26.9 | 18500 | 45.1 | 12749.8 |
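The savings columns of Table 2 appear to be consistent with a relative difference taken with respect to the SBMAP value, that is, savings = (other - SBMAP) / SBMAP x 100. This section does not state the formula explicitly, so the following sketch treats it as an assumption and simply recomputes the VOPD power row; the function name `savings_percent` is illustrative.

```python
def savings_percent(other: float, sbmap: float) -> float:
    """Relative saving of SBMAP against another algorithm, in percent.

    Assumed formula (inferred from the table values, not stated by the
    authors in this section): (other - sbmap) / sbmap * 100.
    """
    return (other - sbmap) / sbmap * 100.0


# VOPD power values (uW) for the 2D mesh, copied from Table 2.
vopd_power = {"BB": 27551.1, "NMAP": 21508.6, "RANDOM": 29685.1}
sbmap_power = 21396.1

# Round to one decimal place, as in the table.
vopd_savings = {
    name: round(savings_percent(value, sbmap_power), 1)
    for name, value in vopd_power.items()
}
print(vopd_savings)
```

Under this assumption the recomputed values match the 28.8%, 0.5% and 38.7% reported for VOPD. A saving above 100% (as in the cost rows) then simply means the other algorithm's value is more than twice that of SBMAP.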

TABLE 3. Network Latency, Simulation Time and Throughput Improvements for 2D Mesh Topology

Latency (Cycles) & Savings (%)

| Application | BB | Savings (%) | NMAP | Savings (%) | RANDOM | Savings (%) | SBMAP |
|---|---|---|---|---|---|---|---|
| VOPD | 24.5 | 28.0 | 19.2 | 0.5 | 26.4 | 38.2 | 19.1 |
| PIP | 37170.3 | -0.9 | 37507.3 | 0.0 | 37256.0 | -0.7 | 37508.9 |
| MMS | 18.8 | 1.0 | 18.7 | 0.3 | 25.6 | 37.3 | 18.7 |
| 80211arx | 18218.0 | -6.9 | 18929.1 | -3.3 | 18781.9 | -4.1 | 19578.7 |

Simulation Time (S) & Savings (%)

| Application | BB | Savings (%) | NMAP | Savings (%) | RANDOM | Savings (%) | SBMAP |
|---|---|---|---|---|---|---|---|
| VOPD | 16 | 14.3 | 14 | 0 | 14 | 0 | 14 |
| PIP | 10.0 | -8.9 | 13 | 18.2 | 10 | -9.1 | 11 |
| MMS | 22.6 | -39.1 | 20 | -45.9 | 20 | -45.9 | 37 |
| 80211arx | 33.7 | -17.9 | 24 | -41.5 | 25 | -39.0 | 41 |

Throughput (Cycles/Packet) & Savings (%)

| Application | BB | Savings (%) | NMAP | Savings (%) | RANDOM | Savings (%) | SBMAP |
|---|---|---|---|---|---|---|---|
| VOPD | 0.047 | 0 | 0.047 | 0 | 0.048 | -2.13 | 0.047 |
| PIP | 0.016 | 0 | 0.016 | 0 | 0.016 | 0 | 0.016 |
| MMS | 0.002 | 0 | 0.002 | 0 | 0.002 | 0 | 0.002 |
| 80211arx | 0.011 | 0 | 0.01 | 9.09 | 0.01 | 9.09 | 0.011 |

solution, and therefore the energy consumption of portable Android devices, is more important than a slight increase in simulation time. Simulation time is only involved in the design phase of the system and can be reduced by using faster computers or accelerators. Throughput is almost constant except for the 80211arx application, for which SBMAP shows a 9% improvement over the NMAP and Random algorithms, as shown in Figure 6(c). The throughput is constant because the network does not saturate under the traffic injected

into the network. The mapping results for communication cost, latency, power and energy consumption show that SBMAP is more efficient than the BB, NMAP and Random algorithms for most of the listed applications. These improvements stem from the modular systematic search method used by the algorithm, in contrast to the blind search or random heuristic nature of the other algorithms.
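The contrast between a modular systematic search and a blind or random placement can be illustrated with a toy example. The sketch below is not the authors' SBMAP procedure: it omits the bandwidth constraint and the multi-objective cost, and the segment size, task order, bandwidth values and function names are all assumptions made for illustration. It places tasks segment by segment and searches exhaustively within each segment, using the sum of bandwidth x hop count as the cost.

```python
from itertools import permutations


def hops(a, b):
    """Manhattan hop distance between two mesh tiles."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def partial_cost(edges, placement):
    """Sum of bandwidth * hops over edges whose endpoints are both placed."""
    return sum(
        bw * hops(placement[u], placement[v])
        for (u, v), bw in edges.items()
        if u in placement and v in placement
    )


def segmented_search(tasks, edges, tiles, segment_size):
    """Place tasks segment by segment, exhaustively within each segment."""
    placement = {}
    free = list(tiles)
    for start in range(0, len(tasks), segment_size):
        segment = tasks[start:start + segment_size]
        best = None
        # Try every assignment of the segment's tasks to the free tiles.
        for chosen in permutations(free, len(segment)):
            trial = {**placement, **dict(zip(segment, chosen))}
            cost = partial_cost(edges, trial)
            if best is None or cost < best[0]:
                best = (cost, trial)
        placement = best[1]
        free = [t for t in free if t not in placement.values()]
    return placement, partial_cost(edges, placement)


# Toy task graph (bandwidths are made-up numbers) on a 2x2 mesh.
edges = {("T0", "T1"): 100, ("T1", "T2"): 50, ("T2", "T3"): 10}
tiles = [(0, 0), (0, 1), (1, 0), (1, 1)]
placement, cost = segmented_search(["T0", "T1", "T2", "T3"], edges, tiles, 2)
print(placement, cost)
```

With a segment size of 2, each step examines at most 12 placements instead of the 24 permutations of a full brute-force search over four tasks; the gap widens quickly as the task count grows, which is the motivation for segmenting the search space.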


FIGURE 5. NoC performance parameters of 2D mesh, normalized to NMAP: (a) Network Power consumption, (b) Network Energy consumption, (c) Network Communication Cost. Each panel compares BB, SBMAP, NMAP and RANDOM for the VOPD, PIP, MMS and 80211arx applications.

FIGURE 6. NoC timing parameters of 2D mesh, normalized to NMAP: (a) Network Latency, (b) Network Simulation time, (c) Network Throughput. Each panel compares BB, SBMAP, NMAP and RANDOM for the VOPD, PIP, MMS and 80211arx applications.
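Figures 5 and 6 report each metric normalized to NMAP. Assuming this means dividing every algorithm's value by the NMAP value for the same application (the usual reading of "normalized to", although it is not spelled out here), the bars of the VOPD power panel could be reproduced as follows. The helper `normalize_to` is a hypothetical name.

```python
def normalize_to(baseline: str, values: dict[str, float]) -> dict[str, float]:
    """Divide every entry by the baseline entry so the baseline becomes 1.0."""
    ref = values[baseline]
    return {name: value / ref for name, value in values.items()}


# VOPD power (uW) on the 2D mesh, copied from Table 2.
vopd_power = {
    "BB": 27551.1,
    "SBMAP": 21396.1,
    "NMAP": 21508.6,
    "RANDOM": 29685.1,
}

normalized = normalize_to("NMAP", vopd_power)
for name, ratio in normalized.items():
    print(f"{name}: {ratio:.3f}")
```

A ratio below 1.0 means the algorithm consumes less than NMAP for that application; for power, energy, cost, latency and simulation time, lower is better.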

B. TORUS TOPOLOGY

To analyze and compare the SBMAP algorithm against its competitors, we also considered the torus topology for the aforementioned embedded applications. Table IV shows the absolute values and percentage savings of the SBMAP algorithm in terms of power, energy, and cost measurements. The results normalized to NMAP values are also shown in Figure 7. The latency, simulation time and throughput results are shown in Table V and Figure 8. The results show that SBMAP outperforms the BB, NMAP and Random algorithms on most of the performance parameters under different embedded application workloads. The results also confirm the efficiency of the proposed mapping algorithm when applied to the torus topology as compared to the 2D mesh. The improvements in power consumption of the SBMAP algorithm for the VOPD application are 2.8%, 4.9% and 32.3% as compared to the BB, NMAP and Random algorithms, respectively. The SBMAP power savings for PIP, MMS and 80211arx applications are


TABLE 4. Network Power, Energy and Cost Improvements for Torus Topology

Power (uW) & Savings (%)

| Application | BB | Savings (%) | NMAP | Savings (%) | RANDOM | Savings (%) | SBMAP |
|---|---|---|---|---|---|---|---|
| VOPD | 20099.8 | 2.8 | 20506.3 | 4.9 | 25868.1 | 32.3 | 19551.5 |
| PIP | 25964.2 | 0.5 | 32151.5 | 24.5 | 34133.6 | 32.2 | 25823.1 |
| MMS | 24739.1 | -1.1 | 26689.7 | 6.7 | 30635.1 | 22.5 | 25012.2 |
| 80211arx | 64768.0 | 28.2 | 57676.2 | 14.1 | 78662.8 | 55.7 | 50531.7 |

Energy (uJ/Packet) & Savings (%)

| Application | BB | Savings (%) | NMAP | Savings (%) | RANDOM | Savings (%) | SBMAP |
|---|---|---|---|---|---|---|---|
| VOPD | 270.2 | 2.3 | 275.3 | 4.2 | 347.5 | 31.6 | 264.0 |
| PIP | 213.9 | 33.0 | 165.0 | 2.6 | 177.6 | 10.4 | 160.8 |
| MMS | 439.9 | 0.6 | 466.6 | 6.7 | 507.2 | 16.0 | 437.3 |
| 80211arx | 238.9 | 27.8 | 206.9 | 10.7 | 302.7 | 62.0 | 186.9 |

Cost & Savings (%)

| Application | BB | Savings (%) | NMAP | Savings (%) | RANDOM | Savings (%) | SBMAP |
|---|---|---|---|---|---|---|---|
| VOPD | 4119.0 | 0.4 | 4119.0 | 0.4 | 6758.0 | 64.7 | 4103.0 |
| PIP | 832.0 | 44.4 | 640.0 | 11.1 | 960.0 | 66.7 | 576.0 |
| MMS | 664727 | 1.5 | 658551 | 0.5 | 709040 | 8.2 | 655220 |
| 80211arx | 12733.4 | 0.0 | 14078.6 | 10.6 | 20030.0 | 57.3 | 12733.5 |

TABLE 5. Network Latency, Simulation Time and Throughput Improvements for Torus Topology

Latency (Cycles) & Savings (%)

| Application | BB | Savings (%) | NMAP | Savings (%) | RANDOM | Savings (%) | SBMAP |
|---|---|---|---|---|---|---|---|
| VOPD | 17.0 | -2.2 | 17.6 | 1.3 | 22.1 | 27.0 | 17.4 |
| PIP | 9209.5 | 0.4 | 9291.9 | 1.3 | 9443.6 | 2.9 | 9174.2 |
| MMS | 16.7 | 0.0 | 22.0 | 31.4 | 17.0 | 1.7 | 16.7 |
| 80211arx | 4987.7 | 1.7 | 4857.1 | -1.0 | 4769.0 | -2.8 | 4905.3 |

Simulation Time (S) & Savings (%)

| Application | BB | Savings (%) | NMAP | Savings (%) | RANDOM | Savings (%) | SBMAP |
|---|---|---|---|---|---|---|---|
| VOPD | 14 | -22.2 | 14 | -22.2 | 14 | -22.2 | 18 |
| PIP | 9 | 12.5 | 7 | -12.5 | 6 | -25.0 | 8 |
| MMS | 30 | -21.1 | 24 | -36.8 | 23 | -39.5 | 38 |
| 80211arx | 39 | 0.0 | 28 | -28.2 | 30 | -23.1 | 39 |

Throughput (Cycles/Packet) & Savings (%)

| Application | BB | Savings (%) | NMAP | Savings (%) | RANDOM | Savings (%) | SBMAP |
|---|---|---|---|---|---|---|---|
| VOPD | 0.005 | 0 | 0.005 | 0.0 | 0.005 | 0.0 | 0.005 |
| PIP | 0.016 | 0 | 0.016 | 0.0 | 0.016 | 0.0 | 0.016 |
| MMS | 0.011 | 0 | 0.011 | 0.0 | 0.01 | 9.1 | 0.011 |
| 80211arx | 0.011 | 0 | 0.01 | 9.1 | 0.01 | 9.1 | 0.011 |

up to 55.7% as shown in Table IV and Figure 7(a). Similarly, the reduction in energy consumption of the proposed algorithm ranges from 0.6% to 62% as compared to the BB, NMAP and Random algorithms (see Table IV and Figure 7(b)). The cost improvements are up to 44.4%, 11.1% and 66.7% as compared to the BB, NMAP and Random algorithms, respectively, as shown in Figure 7(c). Table V shows latency, simulation time and throughput comparisons of the listed algorithms for the torus topology. The

results reveal that the proposed mapping algorithm incurs lower latency on the torus topology than on the 2D mesh. Across the VOPD, PIP, MMS and 80211arx applications, the latency improvements are up to 1.7%, 31.4% and 27% as compared to the BB, NMAP and Random algorithms, respectively (see Table V and Figure 8(a)). The simulation time shown in Figure 8(b) is slightly compromised in order to obtain optimized results, but it remains low, and customarily the mapping process is carried out prior to the design


implementation in most cases. The simulation time can be further reduced by using state-of-the-art computers or accelerators. Throughput is almost constant because the network is not congested at this traffic level, as shown in Figure 8(c). The optimized results obtained for the SBMAP algorithm confirm its efficiency for both the 2D mesh and torus topologies against its competitor algorithms.

FIGURE 7. NoC performance parameters of torus topology, normalized to NMAP: (a) Network Power Consumption, (b) Network Energy Consumption, (c) Network Communication Cost. Each panel compares BB, SBMAP, NMAP and RANDOM for the VOPD, PIP, MMS and 80211arx applications.

FIGURE 8. NoC timing parameters of torus topology, normalized to NMAP: (a) Network Latency, (b) Network Simulation time, (c) Network Throughput. Each panel compares BB, SBMAP, NMAP and RANDOM for the VOPD, PIP, MMS and 80211arx applications.
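One general reason a torus can yield lower latency and communication cost than a 2D mesh is its wraparound links, which shorten the hop distance between tiles near opposite edges. The sketch below compares minimal hop counts on both topologies; the 4x4 size, the tile coordinates and the function names are illustrative assumptions and are not taken from the experimental setup of this paper.

```python
def mesh_hops(src: tuple[int, int], dst: tuple[int, int]) -> int:
    """Manhattan distance between two tiles of a 2D mesh."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def torus_hops(src: tuple[int, int], dst: tuple[int, int],
               cols: int, rows: int) -> int:
    """Minimal hop count on a 2D torus, where each dimension wraps around."""
    dx = abs(src[0] - dst[0])
    dy = abs(src[1] - dst[1])
    # In each dimension, take the shorter of the direct and wraparound paths.
    return min(dx, cols - dx) + min(dy, rows - dy)


corner_a, corner_b = (0, 0), (3, 3)
mesh_distance = mesh_hops(corner_a, corner_b)
torus_distance = torus_hops(corner_a, corner_b, cols=4, rows=4)
print(mesh_distance, torus_distance)
```

For opposite corners of a 4x4 network this gives 6 hops on the mesh and 2 on the torus. Tasks that a mapping algorithm is forced to place far apart on a mesh may therefore still communicate over a short path on a torus.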

VI. CONCLUSION

This research work addressed the important and demanding problem of mapping real-time embedded applications onto NoC-based systems. We presented the Bandwidth-Constrained Multi-Objective Segmented Brute-Force Mapping (SBMAP) algorithm and compared it with the Branch and Bound (BB), NMAP and Random algorithms in terms of average latency, throughput, power, and energy consumption of the on-chip networks. SBMAP adds the property of segmentation to the exact mapping techniques of the IP cores on NoC tiles. The algorithm interactively searches for the optimal mapping of the different segments of the task graph and achieves its results by applying a modular systematic search method. Experimental results show a significant reduction in power and energy consumption for the proposed mapping algorithm as compared to its competitors on real-world application benchmarks. Improved results are also obtained for throughput, latency, and cost measurements on the 2D mesh and torus topologies of the NoC architecture. These improvements arise because the proposed algorithm searches systematically for the best mapping solution, whereas other available algorithms rely on blind search. In return for these performance results, the simulation time of the algorithm remains comparable with that of the existing algorithms. Hence, our mapping algorithm can be used as a good mapping heuristic owing to its better performance in terms of average latency, power, and energy consumption. Applying the algorithm to other topologies, such as folded torus, butterfly and fat tree, remains a future extension of this research work.

REFERENCES

[1] L. Benini, G. De Micheli, “Networks on Chips: A New SoC Paradigm,” IEEE Computer Society, Vol. 35, No. 1, pp. 70-78, January, 2000.
[2] W. J. Dally, B. Towles, “Route Packets, Not Wires: On-Chip Interconnection Networks,” in Proceedings of 38th DAC, pp. 684-689, 2001.
[3] W. C. Tsai, Y. C. Lan, Y. H. Hu et al., “Networks on Chips: Structure and Design Methodologies,” Hindawi Journal of Electrical and Computer Engineering, Vol. 2012, Article ID 509465, 15 pp., 2012.
[4] Marculescu, U.Y. Ogras, “Outstanding research problems in NoC design: systems, micro architecture, and circuit perspectives,” Transactions on Computer-aided Design of Integrated Circuits and Systems, IEEE, Vol. 28, No. 1, pp. 3-21, January, 2009.
[5] Walter, I. Cidon, A. Kolodny, “The era of many-modules SoC: revisiting the NoC mapping problem,” in 2nd Int. Workshop on Network on Chip Architectures, (NoCArc 2009), pp. 43-48, December, 2009.
[6] U. A. Gulzari, S. Khan, S. Anjum, F. S. Torres, “An Efficient and Scalable Cross-By-Pass-Mesh Architecture for On-Chip Communication,” IET Computers & Digital Techniques, pp. 70-78, February, 2017.
[7] U. A. Gulzari, S. Anjum, F. S. Torres, “A New Cross-By-Pass-Torus Architecture Based on CBP-Mesh and Torus Interconnection for On-Chip Communication,” PLOS ONE, Vol. 11, No. 12, 18 pp., e0167590, December 1, 2016.
[8] J. Hu, R. Marculescu, “Energy-aware mapping for tile-based NoC architectures under performance constraints,” in Asia and South Pacific Design Automation Conference, pp. 233-239, January, 2003.
[9] C. Radu, L. Vintan, “Optimized Simulated Annealing for Network-on-Chip Application Mapping,” in Proceedings of the 18th International Conference on Control Systems and Computer Science, Romania, pp. 452-459, 2011.
[10] S. Tosun, O. Ozturk, M. Ozen, “An ILP formulation for application mapping onto Network-on-Chips,” in International Conference on Application of Information and Communication Technologies, pp. 1-5, 2009.
[11] C. Ostler, K.S. Chatha, “An ILP formulation for system-level application mapping on network processor architecture,” in Proceedings of Design, Automation and Test in Europe, pp. 1-6, 2007.
[12] S. Tosun, “New heuristic algorithm for energy aware application mapping and routing on mesh-based NoCs,” Journal of System Architecture, Vol. 57, No. 1, pp. 69-78, January, 2011.
[13] T. Shen, C.H. Chao, Y.K Lien et al., “A new binomial mapping and optimization algorithm for reduced-complexity mesh-based on-chip network,” in Proceedings of NOCS, pp. 317-322, May, 2007.
[14] W. Lei, L. Xiang, “Energy- and latency-aware NoC mapping based on discrete particle swarm optimization,” in Proceedings of International Conference on communications and Mobile Computing, IEEE, pp. 263-268, 2010.
[15] G. Fen, W. Ning, “Genetic algorithm based mapping and routing approach for network on chip architectures,” Chinese Journal of Electronics, pp. 91-96, 2010.
[16] K. Bhardwaj, R.K. Jena, “Energy and bandwidth aware mapping of IPs onto regular NoC architectures using multi-Objective genetic algorithms,” in International Symposium on System-on-Chip, pp. 27-31, 2009.
[17] P. Kumar Sahu et al., “A survey on application mapping strategies for Network-on-Chip design,” Journal of Systems Architecture, Vol. 59, pp. 60-76, 2013.
[18] J. Hu, “Energy and Performance Aware Mapping for Regular NoC Architectures,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 24, No. 4, April, 2005.
[19] T. Lei, S. Kumar, “A Two-step Genetic Algorithm for Mapping Task Graphs to a Network on Chip Architecture,” in Proceedings of the Euromicro Symposium on Digital System Design, IEEE, pp. 180-187, September, 2003.
[20] S. Murali, G. De Micheli, “Bandwidth-Constrained mapping of cores onto NoC architectures,” in Proceedings of IEEE/ACM Design, Automation and Test in Europe Conference, Paris, France, Vol. 2, pp. 896-901, February, 2004.
[21] Z. Lu, L. Xia, A. Jantsch, “Cluster-based simulated annealing for mapping cores onto 2D mesh networks on chip,” in Proceedings of the 11th Workshop on Design and Diagnostics of Electronic Circuits and Systems, IEEE Computer Society, USA, pp. 1-6, April, 2008.
[22] G. Ascia, V. Catania, M. Palesi, “Multi-Objective Mapping for Mesh Based NoC Architectures,” in Proceedings of the ICHSC/ICSS, Stockholm, Sweden, pp. 182-187, September, 2004.
[23] R.K. Jena, “Application mapping of mesh based NoC using Evolutionary algorithm,” Journal of Information Systems and Communication, Vol. 3, No. 1, pp. 203-206, 2012.
[24] M. J. Sepulveda, M. Strum, W.J. Chau, G. Gogniat, “A Multi-Objective Approach for Multi-Application NoC Mapping,” in IEEE Latin American Symposium on Circuits and Systems (LASCAS), pp. 1-4, 2011.
[25] H. M. Harmanani, “A Method for Efficient Mapping and Reliable Routing for NoC Architectures with Minimum Bandwidth and Area,” in IEEE International Workshop on Circuits and systems and TAISA Conference (NEWCAS-TAISA), pp. 29-32, 2008.
[26] Anh T. Tran, Bevan M. Baas, “NoCTweak: A Highly Parameterizable Simulator for Early Exploration of Performance and Energy of Networks On-Chip,” in Technical Report, VLSI Computation Lab, ECE Department, UC Davis, 2012.
[27] A. B. Kahng, “ORION 2.0: A Power-Area Simulator for Interconnection Networks,” IEEE Transactions on VLSI Systems, Vol. 20, No. 1, pp. 191, January, 2012.
[28] J. Chang, M. Pedram, “Codex-dp: co-design of communicating systems using dynamic programming,” IEEE Tran. on CAD of Integrated Circuits and Systems, Vol. 19, No. 7, 14 pp., July, 2002.
[29] M. H. Mottaghi, H. R. Zarandi, “DFTS: A dynamic fault-tolerant scheduling for real-time tasks in multicore processors,” Elsevier journal of Microprocessors and Microsystems, Vol. 38, No. 1, pp. 88-97, February, 2014.
[30] S. Khan, S. Anjum, U.A. Gulzari, F. S. Torres, “Comparative Analysis of Network-on-Chip Simulation Tools,” IET Computers & Digital Techniques, 11 pp., September, 2017.
[31] T. Liu, S. Yin, J. Liu, L. Teng, “Hybrid Quantum Genetic Algorithm Used for Low-power Mapping in Network-on-chip,” Journal of Software Engineering, Vol. 11, No. 2, pp. 194-201, 2017.
[32] C. Xua, Y. Liu, Z. Zhu, et al., “An efficient energy and thermal-aware mapping for regular network-on-chip,” IEICE Electronics Express, Vol. 14, No. 17, pp. 1-11, 2017.
[33] Y. Hu, D. Muller-Gritschneder, M. J Sepulveda, “Automatic ILP-based Firewall Insertion for Secure Application-Specific Networks-on-Chip,” Interconnection Network Architectures, On-Chip, Multi-Chip (INAOCMC), pp. 9-12, 2015.
[34] C. Wu, C. Deng, L. Liu, “An efficient application mapping approach for the co-optimization of reliability, energy, and performance in reconfigurable noc architectures,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 34, No. 8, pp. 1264-1277, 2015.
[35] C. Wu, C. Deng, L. Liu et al., “A multi-objective model oriented mapping approach for noc based computing systems,” IEEE Transactions on Parallel Distributed Systems, Vol. 28, No. 3, pp. 662-676, March, 2017.


SARZAMIN KHAN received his B.E. degree in Electronic Engineering and M.S. degree in Electrical Engineering from NED University of Engineering and Technology, Karachi, Pakistan, in 2001 and 2003, respectively. He has more than 10 years of industrial experience in different positions. He is currently a PhD student in the Department of Electrical Engineering at COMSATS Institute of Information Technology, Pakistan. His research interests include the design and analysis of Analog and Digital Systems, Communication Systems, System-on-Chip, Network-on-Chip, Embedded Systems, and Optimization Techniques.

SHERAZ ANJUM received the PhD degree in Microelectronics and Solid-State Electronics from the Institute of Microelectronics, Graduate University of Chinese Academy of Sciences, Beijing, China, in 2008, the M.Sc. degree in Computer Engineering from University of Engineering and Technology, Taxila, Pakistan, in 2005, and the M.Sc. degree in Electronics from Quaid-E-Azam University, Islamabad, Pakistan, in 1999. His Ph.D. study was supervised by Dr. Chen Jie and his M.S. research by Dr. Shoab A. Khan. Between 1999 and 2001, he served at Barani Institute of Information Technology as a Research Associate. From 2001 to 2005 and from 2008 to 2012, he worked for COMSATS Institute of Information Technology, Wah Cantt, Pakistan, as Lecturer and Assistant Professor, respectively. He is currently an Associate Professor at COMSATS Institute of Information Technology, Wah Cantt, Pakistan. He has more than eighteen years of teaching and research experience. From 2010 to 2017, he served on the TPC of the international conference FIT. He is currently serving as a reviewer for many journals of international repute. Since 2015, he has been a member of IET and an executive member of the IET Islamabad local network. He is registered as a Chartered Engineer with ECUK and as a Registered Engineer with PEC. His research interests include, but are not limited to, Digital System Design, Design and Analysis of Networks on Chip architectures and algorithms, Reconfigurable Architectures, Multi-Processor heterogeneous computing and advanced DSP Architectures.

USMAN ALI GULZARI is currently a faculty member and Assistant Professor in the Electrical Engineering Department at The University of Lahore, Islamabad, Pakistan. He has more than 12 years of industrial and academic experience. He did his BS and MS in Computer Engineering (2000-2006), and is a PhD scholar in Network-on-Chip, at COMSATS Institute of Information and Technology Islamabad. His areas of interest are Computer Architecture, Embedded Systems, System-on-Chip and Network-on-Chip.

TARIQ UMER (M’16-SM’16) received his Ph.D. in Communication Systems in 2012 from the School of Computing and Communications, Lancaster University, U.K., and his Master in Computer Science in 1997 from Bahauudin Zakariya University, Multan, Pakistan. He served the IT education sector in Pakistan for more than 13 years. Since January 2007, he has been working as an Assistant Professor in the CS department of COMSATS Institute of Information Technology, Wah Cantt. His research interests include cognitive radio ad hoc networks, Internet of Things, Wireless Sensor Networks, Telecommunication Network Design, Vehicular Ad hoc Networks and Internet of Vehicles (IoV). He is currently serving as an Editorial Board member of Elsevier Future Generation Computer System (FGCS) and as an Associate Editor of the IEEE Access journal. He served on the TPC for the international conference FIT 15, 16, IEEE PGnet 2011, 2012, IEEE FMEC 2016, WPMC 2017, CSCN 2017 and ANTS 2017 conferences. He is currently serving as a reviewer for IEEE Communications Letters, IEEE ACCESS, Computers and Electrical Engineering (Elsevier), Journal of Network and Computer Applications (Elsevier), Wireless Networks (Springer), and the Journal of Communications and Networks. He is also serving as a Guest Editor of Future Generation Computer Systems (Elsevier) and IEEE ACCESS. He is an active member of the Pakistan Computer Society and Internet Society Pakistan.

BYUNG-SEO KIM (M’02-SM’17) received his B.S. degree in Electrical Engineering from In-Ha University, In-Chon, Korea, in 1998 and his M.S. and Ph.D. degrees in Electrical and Computer Engineering from the University of Florida in 2001 and 2004, respectively. His Ph.D. study was supervised by Dr. Yuguang Fang. Between 1997 and 1999, he worked for Motorola Korea Ltd., PaJu, Korea, as a Computer Integrated Manufacturing (CIM) Engineer in Advanced Technology Research and Development (ATR&D). From January 2005 to August 2007, he worked for Motorola Inc., Schaumburg, Illinois, as a Senior Software Engineer in Networks and Enterprises. His research at Motorola Inc. focused on designing the protocols and network architecture of wireless broadband mission-critical communications. He is currently an Associate Professor in the Department of Computer and Information Communication Engineering at Hongik University, Korea. He was Chairman of the department from 2012 to 2014. He served as the General Chair of the 3rd IWWCN 2017 and as a TPC member for the IEEE VTC 2014-Spring, the EAI FUTURE2016, and the ICGHIC 2016-2018 conferences. He served as a Guest Editor of special issues of International Journal of Distributed Sensor Networks (SAGE), IEEE Access, and Journal of the Institute of Electrics and Information Engineers. He also served as a member of the Sejong-city Construction Review Committee and the Ansan-city Design Advisory Board. His work has appeared in around 141 publications and 22 patents. He is an IEEE Senior Member. His research interests include the design and development of efficient wireless/wired networks, including link-adaptable/cross-layer-based protocols, multi-protocol structures, wireless CCNs/NDNs, Mobile Edge Computing, physical layer design for broadband PLC, and resource allocation algorithms for wireless networks.
