

Dynamic Thread Mapping for High-Performance, Power-Efficient Heterogeneous Many-core Systems

Guangshuo Liu, Department of ECE, Carnegie Mellon University, Pittsburgh, PA ([email protected])
Jinpyo Park, Samsung Electronics, Yong-In, South Korea (jinpyo11.park@samsung.com)
Diana Marculescu, Department of ECE, Carnegie Mellon University, Pittsburgh, PA ([email protected])

Abstract—This paper addresses the problem of dynamic thread mapping in heterogeneous many-core systems via an efficient algorithm that maximizes performance under power constraints. Heterogeneous many-core systems are composed of multiple core types with different power-performance characteristics. As well documented in the literature, the generic mapping problem is NP-complete; it can be formulated as a 0-1 integer linear program and is therefore prohibitively expensive to solve optimally in an online scenario. In real applications, however, thread mapping decisions need to be responsive to workload phase changes. This paper proposes an iterative approach, with runtime bounded by O(n²/m), for mapping multi-threaded applications onto n cores comprising m core types. Compared with an optimal solution, the proposed algorithm produces results less than 0.6% away from the optimum on average, with two orders of magnitude improvement in runtime. Results show that the performance improvement can reach 16% under iso-power constraints compared to a random mapping. The algorithm can be brought online for hundred-core heterogeneous systems, as it scales to systems of 256 cores with less than one millisecond of overhead.

Index Terms—Heterogeneous many-core, dynamic thread mapping, heuristic

I. INTRODUCTION

Technology scaling continues to allow more transistors to be integrated onto a single processor, while power dissipation increasingly constrains the design of many-core processors. Under such a scenario, it becomes desirable to take advantage of application variability and design many-core systems composed of multiple core types with different power-performance characteristics. ARM's big.LITTLE [1] and NVIDIA's Tegra 3 [2] are well-known industrial examples of heterogeneous multicore computing platforms. Exploiting inter-core heterogeneity is challenging, as it boils down to mapping threads to the most appropriate cores. As it becomes possible to integrate more cores than can be active simultaneously within the power budget (commonly referred to as the dark silicon era [3]), the mapping problem faces the following challenges. First, one needs to select a subset of the available cores to power on, and the total number of cores can be large. Second, more than two types of cores are expected to be built into a single platform for a more complete power/performance spectrum. Third, and more importantly, the mapping scheme should be aware of runtime application variability, such as program phase changes, and therefore needs to be efficient enough to run online. In short, we foresee the demand for online dynamic thread-to-core mapping mechanisms that scale to a large number of cores comprising multiple core types.

This paper proposes a generic approach that formulates the mapping problem as a 0-1 integer linear program (ILP) for any number of threads, cores, and core types. Further, an efficient heuristic-based algorithm is proposed and validated. The algorithm produces mappings less than 0.6% away from the optimum in terms of total throughput under power constraints, and its computation time fits within milliseconds for hundreds of cores. As illustrated in Figure 1, the proposed algorithm first attempts to achieve the highest possible throughput regardless of power constraints, and afterward performs virtual swapping of threads between adjacent core types to fit within the power budget. Several effective metrics to quantify the suitability of a thread for aggressive cores and the priority of swapping threads are also proposed as part of the algorithm's implementation.

Fig. 1: Overview of the heuristic-based dynamic thread mapping (figure omitted: threads on a tiled many-core with Little, Mid1, Mid2, and Big cores sharing a cache are first given the highest-throughput mapping by a heuristic-based optimizer, then swapped between adjacent core types until the power budget is satisfied).

The rest of the paper is organized as follows: Section II summarizes prior work. Section III describes the notation and assumptions. Section IV provides the mathematical formulation of the problem and describes in detail the algorithm

with illustrative examples. Section V describes the simulation infrastructure along with system configurations. Section VI demonstrates experimental results in terms of optimality, scalability and runtime overhead. Section VII concludes the paper and offers future directions.

II. RELATED WORK AND PAPER CONTRIBUTIONS

Kumar et al. [4] were the first to demonstrate the potential of improving energy efficiency with minor performance loss via dynamic workload mapping for heterogeneous multicore systems. Their follow-up work and the IPC-based scheduling proposed by Becchi and Crowley [5] enforce periodic sampling of workloads on all core types to gather performance/power readings, in order to perform workload migrations that save power without losing much performance. Koufaty et al. [6] propose bias scheduling, which inspects the program's CPI stack and determines a workload to be either little-core biased or big-core biased based on the ratio of cycles spent on execution versus stalls. As a limitation, these scheduling schemes apply only to two core types: the frequent forced migrations of the sampling approaches are not affordable for more core types, and the number of core types supported by bias scheduling is inherently limited. In contrast, the heuristic proposed in this paper theoretically scales to an arbitrary number of core types, and the performance/power metrics of threads on other core types are efficiently predicted rather than sampled.

Prior research also focuses on metrics that effectively quantify the suitability of a program-core mapping. Chen and John [7] utilize the weighted Euclidean distance between a program's resource demand and a core's configuration. HASS [8], proposed by Shelepov et al., employs program signatures, which indicate the distance between consecutive accesses to the same memory location, generated by offline profiling and embedded in the program binaries. Although these methods are scalable, they require workloads to be profiled beforehand and need customized compiler support, which can be impractical. They are also limited to static scheduling and do not exploit the runtime behavior of mapped workloads. Although our approach needs offline profiling to calibrate the scheduler, it does not require prior knowledge of the workloads, and it provides an efficient algorithm that can be brought online for dynamic thread mapping.

Existing global power management methods [9], [10] attempt to maximize performance under power constraints by selecting the optimal voltage/frequency level for each core in the context of heterogeneity arising from process variation. Teodorescu and Torrellas [10] formulate the problem as a linear optimization problem and solve it globally. Winter et al. [11] evaluate the effectiveness and runtime complexity of different power management algorithms. These approaches treat workload scheduling and power management as independent problems, which to a certain extent may lose the optimality of the solution. Our approach integrates thread mapping for high performance and low power consumption into a single constrained 0-1 assignment problem and provides an efficient close-to-optimum heuristic that scales to hundreds of cores with affordable runtime overhead.

PIE [12] is a recently proposed scheduling framework that predicts workload-core mappings that improve performance. It incorporates hardware parameters and the CPI stack to predict the ratios of memory-level parallelism and instruction-level parallelism between the currently running core and another core, and it scales to more than two types of cores. However, the framework does not explicitly model power consumption and is therefore not fully power-aware, nor does it provide an efficient solution for obtaining optimal mappings for large-scale heterogeneous systems. Our approach can be built on top of PIE, or any similar prediction model, to produce a fully power-aware, high-performance dynamic thread mapping framework.

Several works are related from other perspectives. Gupta et al. [13] take the uncore into account for dynamic thread mapping in heterogeneous multicore systems. Turakhia et al. [14] propose HaDeS in the domain of synthesis for heterogeneous multicore systems, formulating a non-linear optimization problem and developing a heuristic that transforms it into an integer linear program in polynomial time. Flicker [15], proposed by Petrica et al., takes a different approach by dynamically scaling core resources to create adaptive, configurable heterogeneity in hardware. Our work can also be adapted to Flicker as a supplementary scheduling method whenever resource scaling is not sufficiently flexible.

Compared to prior art, this work makes the following contributions:
• We formulate the mapping problem as a constrained 0-1 integer linear program (ILP), which serves as a generic model for optimizing system throughput under power constraints and scales to an arbitrary number of core types via an efficient iterative heuristic, as described below.
• We propose an iterative O(n²/m) heuristic-based algorithm for solving the 0-1 ILP thread mapping problem, thereby providing, to the best of our knowledge, a novel scalable approach to thread mapping for maximizing throughput on many-core heterogeneous systems.
• The algorithm is validated using the Sniper [16] multi-core simulator and multi-threaded workloads. It produces mappings less than 0.6% away from the optimum on average and fits within milliseconds-long control epochs for hundred-core systems, beating a state-of-the-art ILP solver by orders of magnitude in runtime.

III. ASSUMPTIONS AND MODELS

We begin with the definition of a typical heterogeneous many-core system and our basic assumptions. In this paper, we consider heterogeneous many-core systems comprising multiple core types. A core type is defined by a tuple of micro-architectural features and an associated nominal voltage/frequency. For example, cores that differ in architectural parameters such as issue width, cache size, or number of function units are considered different core types. In addition, even if two cores are microarchitecturally identical but run at different nominal frequencies, they are considered distinct core types. For a system comprised of m core types and n total cores, the number of cores of type j is denoted core_count_j, and ∑_{j=1}^{m} core_count_j = n. The core types are assumed to follow a total ordering in nominal power consumption: a core is "bigger" than another if its nominal power consumption is higher under identical workloads, and the overall performance of a big core is also expected to be higher than that of a little core.

We consider multi-threaded workloads and assume that each core runs at most one thread at a time; in other words, we assume single-threaded execution without simultaneous multithreading (SMT). Without loss of generality, the total number of threads is assumed to be n, identical to the total number of cores. If there are fewer threads than cores, dummy threads with zero throughput and power are considered mapped to the idle cores. We assume the threads are spawned at the beginning of execution and that mapping decisions are taken at fixed time intervals, or control epochs. The number of threads is assumed fixed throughout a control epoch; this is reasonable since the granularity of control epochs can be adapted to the granularity of thread spawning and joining.

Throughout the paper, we use the following notation for throughput, power consumption, and thread-to-core assignment:
• The throughput matrix is denoted by T ∈ R^{n×m}, in which each element T_ij represents the throughput of thread i running on a core of type j. Throughput is defined as the total number of instructions committed per unit of time.
• The power matrix is denoted by P ∈ R^{n×m}, in which each element P_ij represents the total power consumption of thread i running on a core of type j.
• The assignment matrix X ∈ {0, 1}^{n×m} represents the assignment of threads to core types. Thread i is mapped to core type j if and only if X_ij = 1.
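In array form, this notation translates directly into code. The following minimal Python sketch (values invented purely for illustration) shows the shapes involved and how the total throughput and power of a mapping fall out of the matrices:

    import numpy as np

    n, m = 4, 2                       # 4 threads, 2 core types (0 = Little, 1 = Big)
    T = np.array([[1.0, 2.4],         # T[i, j]: throughput of thread i on type j
                  [0.8, 2.9],
                  [1.2, 1.5],
                  [0.5, 2.0]])
    P = np.array([[0.4, 2.0],         # P[i, j]: power of thread i on type j
                  [0.3, 2.5],
                  [0.5, 1.8],
                  [0.2, 2.2]])
    X = np.zeros((n, m), dtype=int)   # X[i, j] = 1 iff thread i runs on core type j
    X[[0, 1], 1] = 1                  # threads 0 and 1 on Big cores
    X[[2, 3], 0] = 1                  # threads 2 and 3 on Little cores
    assert (X.sum(axis=1) <= 1).all() # each thread maps to at most one core type
    total_throughput = (T * X).sum()  # the objective of Section IV, Eq. (1)
    total_power = (P * X).sum()       # constrained by the budget, Eq. (2)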

IV. PROBLEM FORMULATION AND PRACTICAL SOLUTION

A. Generic Problem Formulation

The goal of the proposed thread mapping approach is to maximize throughput while keeping total power under a given budget. Typically, mapping is done so as to keep total power consumption under the thermal design power (TDP) constraint, which guarantees reliable operation given the thermal characteristics. Mathematically, the problem can be formulated as a 0-1 integer linear program.

Objective function: maximize performance, defined as the total throughput:

    maximize ∑_{i,j} T_ij X_ij   (1)

Constraints: The following constraints apply to the problem. First, the power budget needs to be satisfied:

    ∑_{i,j} P_ij X_ij ≤ total TDP of cores   (2)

Second, a thread can be mapped to at most a single core:

    ∀i: ∑_j X_ij ≤ 1   (3)

Third, the core count of each type is given:

    ∀j: ∑_i X_ij ≤ core_count_j   (4)

Finally, the mapping is a 0-1 assignment problem, and therefore

    ∀i, j: X_ij ∈ {0, 1}   (5)
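The formulation in Equations (1)-(5) can be handed directly to an off-the-shelf solver. The paper uses Gurobi as its reference ILP solver; the encoding below is our own minimal sketch of that model using the gurobipy API (the function name is ours):

    import gurobipy as gp
    from gurobipy import GRB

    def solve_mapping_ilp(T, P, core_count, budget):
        """Solve the 0-1 thread-mapping ILP of Eqs. (1)-(5) exactly."""
        n, m = len(T), len(T[0])
        model = gp.Model("thread_mapping")
        model.Params.OutputFlag = 0                   # silence solver logs
        X = model.addVars(n, m, vtype=GRB.BINARY, name="X")   # Eq. (5)
        model.setObjective(                           # Eq. (1)
            gp.quicksum(T[i][j] * X[i, j] for i in range(n) for j in range(m)),
            GRB.MAXIMIZE)
        model.addConstr(                              # Eq. (2): power budget
            gp.quicksum(P[i][j] * X[i, j] for i in range(n) for j in range(m))
            <= budget)
        model.addConstrs(X.sum(i, "*") <= 1 for i in range(n))          # Eq. (3)
        model.addConstrs(X.sum("*", j) <= core_count[j] for j in range(m))  # Eq. (4)
        model.optimize()
        if model.Status != GRB.OPTIMAL:
            return None
        return {(i, j) for i in range(n) for j in range(m) if X[i, j].X > 0.5}

As argued next in Section IV-B, this exact route is only practical offline; it serves as the optimality reference in Section VI.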

B. An O(n²/m) Algorithm

A 0-1 ILP formulation is, in its most general form, NP-complete [17]. Although efficient ILP solvers exist, as the problem size increases, the runtime grows exponentially and can easily exceed the duration of a control epoch, which is typically in the range of milliseconds. In addition, as the power constraint becomes tighter, the computation can take orders of magnitude longer than a relatively unconstrained problem of the same size. The reason behind this inefficiency is that typical ILP algorithms are essentially branch-and-bound approaches [18] that iteratively search for the global optimum, which can result in exceedingly high time complexity. Therefore, ILP solvers are impractical for online dynamic mapping in heterogeneous many-core systems.

Instead of relaxing the optimization problem and performing a branch-and-bound search, we directly manipulate the assignment matrix without violating Equations (3), (4), and (5). The proposed heuristic, bounded by O(n²/m) in runtime, contains two phases: maximization and swapping. The basic idea is to first aggressively assign threads so that the highest possible throughput is achieved, and then, for adjacent core types, swap threads to reduce total power. Note that the ordering here refers to the total ordering of core types in nominal power consumption; it has no relation to the topological placement of cores. Finally, if no swap is possible, we consider the case infeasible: the power budget is too constrained to fit all tasks using the heuristic. In that event, the scheduler has to time-share cores among tasks or stall a subset of the tasks, which is beyond the scope of this paper and will be considered in future work on dark silicon scenarios. Figure 2 illustrates how the heuristic works: it first maximizes throughput and then swaps threads to reduce power. In the rest of the paper, we refer to the proposed approach as the maximization-then-swapping (MTS) heuristic. The pseudocode for the algorithm is presented in Figure 3.



Fig. 2: An illustrative example of the MTS heuristic (figure omitted: threads are first mapped across Big, Medium2, Medium1, and Little cores for maximum throughput; if the power budget is not satisfied, the swap with the highest power/throughput priority is repeatedly performed until it is).

Fig. 3: Pseudocode of the MTS heuristic algorithm (reconstructed):

 1: // Core types are ordered from smallest to biggest and indexed 1 to m
 2: M_ij ← T_ij
 3: for j ← m down to 1 do
 4:     {i_1, ..., i_{n/m}} ← indices of the largest n/m elements in M_{*j}
 5:     X_{i_1 j}, ..., X_{i_{n/m} j} ← 1
 6:     M_{i_1 *}, ..., M_{i_{n/m} *} ← 0
 7: end for
 8:
 9: // swaps_j := vector of candidate downward swaps for core type j
10: for j ← m down to 2 do
11:     swaps_j ← {}
12:     for each thread t1 assigned to core type j do
13:         for each thread t2 assigned to core type j−1 do
14:             if Δpower < 0 then
15:                 priority ← Δpower / Δthroughput
16:                 swaps_j ← swaps_j ∪ {(priority, t1, t2)}
17:             end if
18:         end for
19:     end for
20:     prune off impossible elements in swaps_j
21:     build swaps_j into a max-heap on priority
22: end for
23:
24: while ∃j : swaps_j is not empty do
25:     for j ← 2 to m do
26:         (t_down, t_up) ← pop_heap(swaps_j)
27:         swap t_down and t_up in assignment X
28:         update heaps
29:         if budget is met then
30:             return X
31:         end if
32:     end for
33: end while
34:
35: return CASE_INFEASIBLE

In the maximization phase (lines 2-7), the power budget is temporarily assumed to be sufficiently large, so threads are mapped from the biggest cores down to smaller cores in descending order of throughput, achieving the highest possible throughput with best effort. The second phase (lines 10-33) attempts to swap threads to reduce total power. We define a downward swap as a swap of threads between one core type and the next smaller core type, and a swap is legal only if it results in saving power. For each core type, all downward swaps are examined for legality; a legal swap is associated with a priority Δpower/Δthroughput (that is, it is prioritized by maximum power savings with minimum performance loss) and the thread pair (t1, t2) involved in the swap. Starting from the second smallest core type, the downward swap of highest priority is performed, and the same process is then launched for the next bigger core type. After the swap at the biggest core type is done, another round starts from the second smallest core type. The iterations execute until there is no legal downward swap left or the power constraint is met. If the power constraint is satisfied, the assignment, a feasible solution close to the optimum, is returned. Note that during this phase, the swaps are steps of the algorithm, not physical thread migrations; only the final mapping is committed when the algorithm terminates. Finally, if the above process is still unable to produce a feasible solution, it is very likely that the budget is so tight that the problem is infeasible.
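For concreteness, the following runnable Python sketch is our rendering of Figure 3, not the authors' code; the function name and tie-breaking details are ours, and for clarity it rebuilds affected heaps after each swap rather than performing the paper's O(n/m) incremental update:

    import heapq

    def mts(T, P, core_count, budget):
        """Maximization-then-swapping (MTS) heuristic, as a sketch.
        T[i][j], P[i][j]: throughput/power of thread i on core type j,
        with types ordered from smallest (0) to biggest (m-1). Returns
        assign[i] = core type of thread i, or None if infeasible."""
        n, m = len(T), len(T[0])
        assign = [None] * n
        free = set(range(n))
        # Maximization phase (lines 2-7): biggest type first; the
        # highest-throughput threads claim its slots.
        for j in range(m - 1, -1, -1):
            for i in heapq.nlargest(core_count[j], free, key=lambda t: T[t][j]):
                assign[i] = j
                free.discard(i)
        power = sum(P[i][assign[i]] for i in range(n))

        def candidates(j):
            # Legal downward swaps between types j and j-1 (lines 10-22),
            # as a min-heap on -priority (heapq is a min-heap; the paper
            # uses a max-heap on priority).
            heap = []
            for t1 in [i for i in range(n) if assign[i] == j]:
                for t2 in [i for i in range(n) if assign[i] == j - 1]:
                    dp = P[t1][j-1] + P[t2][j] - P[t1][j] - P[t2][j-1]
                    if dp >= 0:
                        continue                 # not legal: must save power
                    dt = T[t1][j-1] + T[t2][j] - T[t1][j] - T[t2][j-1]
                    prio = dp / dt if dt < 0 else float("inf")  # save much, lose little
                    heap.append((-prio, dp, t1, t2))
            # Prune (line 20): at most ~n/m swaps can ever execute per type,
            # so keep only the best n/m (a sorted list is a valid heap).
            return heapq.nsmallest(max(1, n // m), heap)

        heaps = {j: candidates(j) for j in range(1, m)}
        # Swapping phase (lines 24-33): round-robin from the second
        # smallest core type upward, one best swap per type per round.
        while power > budget and any(heaps.values()):
            for j in range(1, m):
                if not heaps[j]:
                    continue
                _, dp, t1, t2 = heapq.heappop(heaps[j])
                assign[t1], assign[t2] = j - 1, j   # execute the downward swap
                power += dp
                for k in (j - 1, j, j + 1):         # refresh affected heaps
                    if 1 <= k < m:
                        heaps[k] = candidates(k)
                if power <= budget:
                    return assign
        return assign if power <= budget else None  # CASE_INFEASIBLE

On the toy matrices from the Section III sketch, mts(T.tolist(), P.tolist(), [2, 2], budget=4.5) performs a single downward swap and returns [1, 0, 1, 0], while budget=4.0 is infeasible once no legal swap remains.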

C. Implementation Details and Efficiency

In our problem formulation, the core count of each type can differ, but the worst case occurs when the cores are evenly distributed across the types. The reason is that most of the computation is spent on comparing swap priorities: for two adjacent core types with a fixed total number of cores, the number of swaps to examine is largest when the two types contain the same core count. In general, for m core types and n total cores, the most computation-intensive case is the one in which each type comprises n/m cores. If n/m is not an integer, one can add non-movable (not subject to swaps) dummy cores to make n a multiple of m.

To ensure implementation efficiency, candidate swaps are maintained in a max-heap per core type. For two adjacent core types, all thread combinations are inspected and the legal swaps are stored. Note that there can be up to n²/m² legal swaps per core type, but not all of them can ever be performed. For example, if the swap between thread 1 and thread 3 has the highest priority, then even if the thread 2-thread 3 swap has the second highest priority, it cannot be performed: once threads 1 and 3 have swapped, the priority of the thread 2-thread 3 swap is obsolete and the heap must be updated. Therefore, after all legal swaps are recorded, the impossible swaps are pruned from the heap (line 20). The pruning shrinks each heap to at most n/m elements, instead of n²/m², and small heaps make updates relatively inexpensive.

Initializing a heap is O(n²/m²), since all pairwise thread combinations are examined, so the entire initialization takes O(n²/m²) × m = O(n²/m). Executing a swap followed by a heap update takes O(n/m) time, since the legality and viability of all n/m new downward swaps caused by switching t1 and t2 must be examined. For each core type there can be up to n/m rounds of swapping and updating, and there are m core types, so the total runtime of executing swaps is O(n/m) × n/m × m = O(n²/m). Therefore the swapping phase, which determines the algorithm's runtime complexity, is bounded by O(n²/m).
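To make the accounting concrete, here is the arithmetic for one instance (the numbers are ours, not from the paper):

    n, m = 256, 4                        # 256 cores/threads, 4 core types
    pairs = (n // m) ** 2                # 64*64 = 4096 candidate pairs per adjacent type
    init_work = pairs * m                # heap initialization: 16384 = n*n/m inspections
    swap_work = (n // m) * (n // m) * m  # up to n/m rounds * O(n/m) update * m types
    assert init_work == swap_work == n * n // m   # both phases are O(n^2/m)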

w=

arg min workload∈P ARSEC

|Tworkload,k − Tt,k |

(6)

Then, the power and throughput values of t are predicted by ∀j ∈ {1, 2, ..., m} : Tt,j = Tw,j and Pt,j = Pw,j

(7)

To validate the prediction model, we use workloads mixes of multithreaded benchmarks randomly selected from PARSEC and SPLASH-2 [21] benchmark suites. Figure 4 demonstrates the runtime prediction accuracy of both power and performance for a workload mix including blackscholes, swaptions, streamcluster and fft (four threads per benchmark) on a 16-core heterogeneous configuration in a period of 11 milliseconds. The throughput is measured in instructions per nanosecond. On average, the prediction error is 7% for power and 14.3% for throughput. V. E XPERIMENTAL S ETUP
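Equations (6) and (7) amount to a nearest-neighbor table lookup; a minimal sketch (our illustration, with profile_T and profile_P as hypothetical offline profile tables indexed by benchmark name and core type):

    def predict(profile_T, profile_P, t_throughput, k):
        """Predict a thread's power/throughput on every core type.
        profile_T[w][j], profile_P[w][j]: offline-profiled throughput and
        power of PARSEC benchmark w on core type j (assumed available).
        t_throughput: measured throughput of the thread on its current
        core type k."""
        # Eq. (6): nearest neighbor by throughput on the current type k.
        w = min(profile_T, key=lambda b: abs(profile_T[b][k] - t_throughput))
        # Eq. (7): adopt the neighbor's per-core-type profile wholesale.
        return profile_T[w], profile_P[w]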

V. EXPERIMENTAL SETUP

We use the Sniper [16] multi-core simulator to conduct the simulations. Sniper supports heterogeneous configurations and dynamic thread migration, and employs McPAT [22] as its power estimation engine. A core library of four core types is presented in Table I; the voltage/frequency pairs are assumed to be fixed as shown.

TABLE I: Core configurations

Core type   Frequency   Voltage   Issue width   ROB size   Integer ALUs   L2 cache
Little      2.0 GHz     0.56 V    1             16         1              128 KB
Mid1        2.5 GHz     0.71 V    2             32         3              256 KB
Mid2        3.0 GHz     0.86 V    4             64         6              256 KB
Big         3.5 GHz     1.00 V    8             128        12             512 KB

All types: L1-I cache: 32 KB / L1-D cache: 32 KB / Floating point units: 2 / Complex ALUs: 1 / Technology node: 22 nm


Fig. 5: Throughput / runtime comparison with the ILP solver (one thread per core). (a) 16 cores, 2 core types; (b) 16 cores, 4 core types; (c) 64 cores, 2 core types; (d) 64 cores, 4 core types. Each panel plots normalized throughput and scheduling runtime (elapsed time, ms) of the ILP solver and the MTS heuristic against the normalized power budget. (Plots omitted.)

All L1 and L2 caches are private to the cores, and there is no L3 cache. For the cases in which the number of cores of each type is identical, the network hierarchy is configured as a mesh of tiles, similar to the one shown in Figure 1, with each tile containing one core of each type. If the core count differs across types, some tiles contain fewer core types. To demonstrate the optimality and scalability of the MTS heuristic, we randomly generate workloads mixing benchmarks from the PARSEC suite. To obtain optimal solutions of the constrained mapping problem, we use the Gurobi [23] optimizer, a commercial optimization toolbox, as the ILP solver.


VI. EXPERIMENTAL RESULTS

A. Comparison with ILP solver

To demonstrate that the MTS heuristic efficiently produces mappings close to the optimum, we compare our approach against the commercial ILP solver Gurobi in both total throughput and elapsed time. For a given configuration, a workload comprised of PARSEC benchmarks is randomly selected, with the total number of threads equal to the number of cores. 1000 random thread-to-core mappings are generated, and the median power of these random mappings is set as the baseline power budget. We sweep the power budget from 90% to 120% of the baseline, representing the range from fairly constrained to relatively unconstrained cases, and solve each instance with both the MTS heuristic and the ILP solver. Throughput is normalized to the mean throughput of the 1000 random mappings. The reported runtimes are obtained by running both MTS and the Gurobi ILP solver on a dual-core x86 workstation running at 2.5 GHz; MTS runtimes are expected to improve further on a deeply scaled platform running at the clock speeds reported in Table I.

Results for 16 and 64 cores are shown in Figure 5. The configurations with two core types comprise Little and Big cores, and those with four core types comprise Little, Mid1, Mid2 and Big, as shown in Table I. On average, the solution provided by MTS is less than 0.6% away from the optimum and achieves up to 16% higher total throughput under iso-power constraints compared with the baseline. Figure 5 also compares the runtime overhead of the two approaches: on average, the proposed heuristic is more than two orders of magnitude faster than the commercial ILP solver.
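The baseline construction is straightforward to reproduce; the sketch below reflects our reading of the setup (random_assignment and baseline are names of our choosing):

    import random, statistics

    def random_assignment(n, core_count):
        # One random one-thread-per-core mapping: shuffle the core slots.
        slots = [j for j, c in enumerate(core_count) for _ in range(c)]
        random.shuffle(slots)
        return slots                     # slots[i] = core type of thread i

    def baseline(T, P, core_count, trials=1000):
        n = len(T)
        powers, tputs = [], []
        for _ in range(trials):
            a = random_assignment(n, core_count)
            powers.append(sum(P[i][a[i]] for i in range(n)))
            tputs.append(sum(T[i][a[i]] for i in range(n)))
        # Median random power defines the budget baseline; mean random
        # throughput is the normalization reference for reported results.
        return statistics.median(powers), statistics.mean(tputs)

    # Budgets are then swept from 0.9x to 1.2x the returned median power.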

B. Scalability

Figure 6 presents the elapsed time of MTS for several problem sizes, with the number of cores ranging from 64 to 1024. The power budgets in these instances are set sufficiently small that all possible swaps are performed in the swapping phase, and cores are assumed to be evenly distributed across the four core types. A fitted quadratic curve is also shown in the figure; the fit validates the O(n²/m) runtime complexity analysis, with m = 4 in this case. The heuristic scales to 256 cores with less than 1 ms of computation time. We therefore claim that the MTS heuristic fits within typical milliseconds-long scheduling epochs and can be brought online for hundred-core heterogeneous systems.

Fig. 6: Runtime overheads for different problem sizes (plot omitted: elapsed time in ms versus number of cores, with sample points and a fitted quadratic curve).

C. Migration Cost

A typical control epoch length for MTS is 1 ms. At that granularity, prior work [24] has shown that the cost of context migration and the cold-cache effect for a single thread is well amortized, so the performance loss is negligible. This is consistent with our simulations: for PARSEC benchmarks, after the caches are warmed up, migrating a parallel thread to a core with a cold cache causes no significant performance drop.

For much finer-grained cases, migration cost can be more significant. We quantify dynamic mapping migration cost as the number of thread migrations required to go from the original mapping to the produced mapping in one control epoch. To minimize migration cost, we further enable the MTS heuristic to take the original mapping (history) into account; the extended version is referred to as history-based MTS. In the maximization phase (lines 2-7 in Figure 3), history-based MTS performs a partial, rather than a complete, highest-throughput mapping. A new metric, the history ratio, is introduced, ranging from 0 to 1: the number of threads considered in the maximization phase is limited to n × (1 − history_ratio). With a high history ratio, the mapping used at line 10 in Figure 3 is biased towards the original mapping, in effect executing fewer thread migrations; as the history ratio approaches 0, history-based MTS behaves like the original MTS in determining the best-effort mapping for throughput.

As shown in Figure 7, there is a clear tradeoff between the resulting throughput and the cost of migration. Such a phenomenon is expected, since starting from a mapping closer to the historical one works against the objective in Equation (1); indeed, as the history ratio increases, the migration cost drops because fewer threads are migrated between the original mapping and the final one. Depending on the usage scenario, one can select the most appropriate history ratio. For example, if the program phase does not change often, a low history ratio is preferable, moving directly to a high-throughput mapping; the relatively high migration cost is then likely to be compensated in follow-up stable control epochs. On the other hand, when program phases switch frequently, a high history ratio incrementally updates the mapping without taking too much risk on outdated migrations. Note that this adaptability to program behavior is not available to the Gurobi ILP solver, which always strives to solve the mapping problem optimally, without trading off migration cost against distance from the optimal throughput.

Fig. 7: Throughput-migration tradeoff (64 cores, one thread per core). (Plots omitted: normalized throughput and migrations per control epoch versus history ratio, for the ILP solver and the history-based MTS heuristic.)
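The history-aware extension modifies only the maximization phase. The sketch below is our rendering; since the paper does not specify which threads are exempted from remapping, the choice of the movable subset here is an assumption:

    import heapq

    def history_maximization(T, core_count, prev_assign, history_ratio):
        """Partial maximization phase of history-based MTS (sketch).
        Only n*(1 - history_ratio) threads are remapped for throughput;
        the rest stay on their previous core types. The swapping phase
        of MTS then runs unchanged on the resulting assignment."""
        n, m = len(T), len(T[0])
        n_remap = int(n * (1 - history_ratio))
        movable = set(range(n_remap))   # assumption: the paper leaves this choice open
        assign = list(prev_assign)      # pinned threads keep their history
        slots = list(core_count)        # free slots per core type
        for i in range(n):
            if i not in movable:
                slots[assign[i]] -= 1
        free = set(movable)
        for j in range(m - 1, -1, -1):  # biggest type first, as in plain MTS
            for i in heapq.nlargest(slots[j], free, key=lambda t: T[t][j]):
                assign[i] = j
                free.discard(i)
        return assign

With history_ratio = 0 this degenerates to the original maximization phase, and with history_ratio = 1 it returns prev_assign unchanged, so only the power-driven swaps move threads.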


VII. CONCLUSION

In this paper, we have mathematically formulated the dynamic thread mapping problem for heterogeneous many-core systems as a 0-1 integer linear program and proposed a quadratic-time heuristic-based algorithm. Results have shown that (1) the algorithm achieves a runtime improvement of more than two orders of magnitude compared with an efficient commercial ILP solver, while losing less than 0.6% of total throughput on average; (2) up to 16% performance improvement is demonstrated under iso-power constraints; and (3) the heuristic scales to hundred-core systems with runtime overhead of less than 1 ms, so it can be brought online for large-scale thread mapping with relatively fine-grained control epochs. We have also extended the heuristic to be history-aware, minimizing migration cost in dynamic thread mapping, and demonstrated a tradeoff between throughput and migration cost. Future directions include taking fairness and load balancing into account for high-performance, power-constrained thread mapping, especially for systems primarily composed of dark silicon. We are also working on a more accurate performance prediction model than the simple nearest-neighbor approach.

ACKNOWLEDGEMENTS

This research was supported in part by an Intel URO grant, NSF Grant CNS-1128624, and a grant from Samsung Electronics. The authors would also like to thank Suvrat Alshi for his help in obtaining quantitative thread migration cost.

REFERENCES

[1] P. Greenhalgh, "big.LITTLE processing with ARM Cortex-A15 & Cortex-A7," White Paper, ARM, 2011.
[2] "Variable SMP: A multi-core CPU architecture for low power and high performance," White Paper, NVIDIA, 2011.
[3] B. C. Lee and D. M. Brooks, "Illustrative design space studies with microarchitectural regression models," in Proc. HPCA, 2007.
[4] R. Kumar, K. Farkas, N. Jouppi, P. Ranganathan, and D. Tullsen, "Single-ISA heterogeneous multi-core architectures: The potential for processor power reduction," in Proc. MICRO, 2003.
[5] M. Becchi and P. Crowley, "Dynamic thread assignment on heterogeneous multiprocessor architectures," in Proc. CF, 2006.
[6] D. Koufaty, D. Reddy, and S. Hahn, "Bias scheduling in heterogeneous multi-core architectures," in Proc. EuroSys, 2010.
[7] J. Chen and L. K. John, "Efficient program scheduling for heterogeneous multi-core processors," in Proc. DAC, 2009.
[8] D. Shelepov, J. C. S. Alcaide, S. Jeffery, A. Fedorova, N. Perez, Z. F. Huang, S. Blagodurov, and V. Kumar, "HASS: A scheduler for heterogeneous multicore systems," ACM SIGOPS Operating Systems Review, 2009.
[9] C. Isci, A. Buyuktosunoglu, C.-Y. Cher, P. Bose, and M. Martonosi, "An analysis of efficient multi-core global power management policies: Maximizing performance for a given power budget," in Proc. MICRO, 2006.
[10] R. Teodorescu and J. Torrellas, "Variation-aware application scheduling and power management for chip multiprocessors," in Proc. ISCA, 2008.
[11] J. Winter, D. Albonesi, and C. Shoemaker, "Scalable thread scheduling and global power management for heterogeneous many-core architectures," in Proc. PACT, 2010.
[12] K. Van Craeynest, A. Jaleel, L. Eeckhout, P. Narvaez, and J. Emer, "Scheduling heterogeneous multi-cores through performance impact estimation (PIE)," in Proc. ISCA, 2012.
[13] V. Gupta, P. Brett, D. Koufaty, and D. Reddy, "The forgotten 'uncore': On the energy-efficiency of heterogeneous cores," in Proc. USENIX Annual Technical Conference, 2012.
[14] Y. Turakhia, B. Raghunathan, S. Garg, and D. Marculescu, "HaDeS: Architectural synthesis for heterogeneous dark silicon chip multiprocessors," in Proc. DAC, 2013.
[15] P. Petrica, A. Izraelevitz, D. Albonesi, and C. Shoemaker, "Flicker: A dynamically adaptive architecture for power limited multicore systems," in Proc. ISCA, 2013.
[16] T. Carlson, W. Heirman, and L. Eeckhout, "Sniper: Exploring the level of abstraction for scalable and accurate parallel multi-core simulation," in Proc. SC, 2011.
[17] R. Karp, Reducibility Among Combinatorial Problems. Springer, 1972.
[18] J. Mitchell, "Branch-and-cut algorithms for combinatorial optimization problems," in Handbook of Applied Optimization, 2002.
[19] R. Kumar, D. M. Tullsen, P. Ranganathan, N. P. Jouppi, and K. I. Farkas, "Single-ISA heterogeneous multi-core architectures for multithreaded workload performance," in Proc. ISCA, 2004.
[20] C. Bienia, S. Kumar, J. Singh, and K. Li, "The PARSEC benchmark suite: Characterization and architectural implications," in Proc. PACT, 2008.
[21] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta, "The SPLASH-2 programs: Characterization and methodological considerations," in Proc. ISCA, 1995.
[22] S. Li, J. Ahn, R. Strong, J. Brockman, D. Tullsen, and N. Jouppi, "McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures," in Proc. MICRO, 2009.
[23] "Gurobi optimizer." [Online]. Available: www.gurobi.com
[24] T. Constantinou, Y. Sazeides, P. Michaud, D. Fetis, and A. Seznec, "Performance implications of single thread migration on a chip multicore," ACM SIGARCH Computer Architecture News, 2005.