On Better Performance from Scheduling Threads According to Resource Demands in MMMP

Lichen Weng and Chen Liu
Computer Architecture and Microprocessor Engineering Lab (CAMEL)
Department of Electrical and Computer Engineering
Florida International University, Miami, Florida 33174
Email: {lichen.weng, chen.liu}@fiu.edu
Abstract—The Multi-core Multi-threading Microprocessor introduces not only resource sharing among threads in the same core, e.g., computation resources and private caches, but also isolation of those resources within different cores. Moreover, when the Simultaneous Multithreading architecture is employed, the execution resources are fully shared among the concurrently executing threads in the same core, while the isolation worsens as the number of cores increases. Even though fetch policies, which decide how to assign priorities in the fetch stage, are well designed to manage the shared resources within a core, it is actually the scheduling policy that makes the distributed resources available to workloads, by deciding how to assign their threads to cores. On the other hand, threads consume various resources in different phases, and the Cycles Per Instruction Spent on Memory (CPImem) is used to express their resource demands. Consequently, aiming at better performance via scheduling according to resource demands, we propose Mix-Scheduling to evenly mix threads across cores, so that it achieves thread diversity, i.e., CPImem diversity, in every core. As a result, we observe in our experiment a 63% improvement in overall system throughput and a 27% improvement in average thread performance when comparing the Mix-Scheduling policy with the reference Mono-Scheduling policy, which keeps CPImem uniformity among threads in every core. Furthermore, Mix-Scheduling also makes an essential step toward shortening load latency, as it reduces the L2 Cache miss rate by 6% relative to Mono-Scheduling.
I. INTRODUCTION

Instead of solely increasing the clock frequency for better performance, it is parallelism that finally lights the path forward for microprocessor design. The Multi-core Multi-threading Microprocessor (MMMP) yields increasingly powerful systems with reduced cost and improved efficiency [1]. Meanwhile, the multi-threading architecture introduces resource sharing among threads in the same core, including the computation resources and the private caches. In particular, the execution resources become fully shared by the concurrently executing threads in the same core under the Simultaneous Multithreading (SMT) architecture. Nevertheless, the resource distribution among threads determines not only the individual thread performance, but also the overall system throughput [2]. Without a well-defined resource management scheme, a monopoly on the shared resources may arise in SMT and lead to performance degradation [3]. Taking the long-latency load as an example, when a thread experiences a long-latency load, it occupies the
shared resources without making any progress, and prevents other threads from utilizing those resources [4]. Therefore, fetch policies, which decide how to assign priorities in the fetch stage, are well designed to allocate the shared resources among the simultaneously running threads. Furthermore, other schemes taking workload behavior and Quality-of-Service (QoS) into consideration [5] have also been proposed to manage the shared resources.

On the other hand, aside from the resource sharing in a single core, the multi-core architecture also isolates resources within different cores throughout MMMP. The multi-core architecture may refer to either the heterogeneous or the homogeneous multi-core architecture. In the former, such as the IBM Cell Broadband Engine™, the cores on the chip differ. Better performance is achieved especially when the scheduling policy makes decisions according to the thread resource demands following Amdahl's Law, i.e., executing the parallel threads on the simple cores and the serial thread on the powerful one [6]. The latter is greatly promoted by the Intel® Single-Chip Cloud Computer (SCC) [7], in which each core holds 1/48 of the integer units, floating-point units, and Level 1 and 2 cache resources of the whole processor. Hence, the scheduling policy bears the responsibility of fully utilizing the resources distributed across the 48 cores.

Furthermore, it is widely accepted that application behavior can be divided into different phases, which are impermanent periods with steady performance in throughput, branch prediction, cache behavior, etc. [8]. Therefore, although applications require various resources during execution, their resource requirements can be analyzed statistically and grouped for further study. Consequently, considering a workload consisting of multiple threads, each corresponding to one benchmark, the threads compete for the same resource in the same core when they exhibit similar behavior in their execution phases, while other resources, not mainly consumed by these threads, stay relatively idle. Nonetheless, this kind of idle resource may be heavily demanded in other cores of the MMMP. As a result, the distributed resources are not fully utilized, and the workload performance suffers from the scheduling policy, rather than from limited resources in the MMMP. We strive to validate the necessity of a scheduling policy
based on the thread resource demands, for better utilization and less competition. The rest of this paper is organized as follows. A brief review of the related work is presented in Section II. Section III illustrates our proposed scheduling policy and the experimental results. The conclusions are drawn in Section IV.

II. RELATED WORK

Transistor count on a chip increased rapidly over the past several decades, driven by the famous Moore's Law [10]. The pressure, however, now falls onto the parallelism in a processor [11]. As a result, the multi-core architecture was employed to exploit Job Level Parallelism. Furthermore, the parallelism is developed further within the multi-threading architecture. In particular, in the SMT architecture, both horizontal waste and vertical waste are minimized [12]. Raasch et al. [13] concluded that the true power of SMT lies in its ability to issue and execute instructions from different threads at every clock cycle. Thus fetch policies were proposed to manage the shared resources within the SMT architecture. The ICOUNT policy proposed by Tullsen et al. [4] grants priority to threads according to the number of instructions in the front-end stages. Other resource management schemes were proposed to allocate the shared resources based on thread behavior. Hill-Climbing by Choi et al. [14] observes the impact of different resource distribution decisions, and uses the current optimal result to direct future resource allocation. The Adaptive Resource Partitioning Algorithm (ARPA) [15] allocates the Instruction Queue, Reorder Buffer, etc., among threads depending on Committed Instructions Per Resource Entry (CIPRE), so that it improves the resource entry utilization.

On the other hand, the execution time of an application is spent either on processing or on memory access [16], and Cycles Per Instruction (CPI) portions are used by Zhu et al. [9] to measure the two. They expressed CPI as CPIproc and CPImem, so that the application resource demands are described statistically. An application mainly consumes computation resources when most of its average CPI is spent on processing (CPIproc), or equivalently when the CPI portion spent on memory access (CPImem) is small. On the contrary, the larger the CPImem, the more memory resources the application requests. Therefore, they also argue that applications with large CPImem values are memory-bound, while those with small CPImem values are computation-bound. The same categorization of the SPEC CPU2000 benchmarks is also given by Cazorla et al. in [17], using the average Level 2 Cache miss rate.

Weinberg et al. [18] proposed symbiotic space-sharing in a many-node supercomputer. They utilized the idle shared resources in nodes by executing background jobs at a lower priority than the primary jobs in the same node. They thereby showed that utilizing the idle shared resources requires extra effort from the scheduling policy in a many-core, single-threading environment. In the environment of a Single-core Multi-threading Microprocessor (SMMP) with SMT, Snavely et al. [19] proposed the symbiotic scheduling
policy to mix jobs with different priorities together, so that the system throughput was increased through multithreading and coscheduling. Furthermore, Knauerhase et al. [20] proposed a scheduling policy in the Operating System (OS) to achieve better performance. The authors modified the default scheduling policy in Linux so that it achieved cache-weight balance by migration. However, they did not address the unique characteristic of MMMP with SMT, namely that potential for better performance can be explored by fully utilizing the function units with several concurrent threads. Instead of considering cache weight in a multi-core architecture, Sodan et al. [21] achieved performance improvement by scheduling jobs in a many-node cluster according to their resource demands, e.g., CPU-bound, disk-bound and network-bound. Even though they considered the SMT architecture, each node in their cluster was equipped with an independent hierarchical memory system, rather than resources shared among different cores as in MMMP, e.g., the off-chip bus, cache and memory. Similarly, in a 256-processor Alpha cluster without SMT, Frachtenberg et al. [22] proposed to classify processes into three categories with descending priorities for scheduling onto the same computing node, according to their synchronization requirements and CPU time utilization. Although they provided some hints on classifying the processes dynamically, their criteria were mainly based on CPU time and network communication time, rather than on the competition for the shared resources in a single core with SMT.

III. PROPOSED SCHEDULING POLICY AND EXPERIMENT

In this section we propose our scheduling policy, in an effort to fully utilize the distributed resources in MMMP with minimum competition. We also design an experiment to evaluate its performance; the results are presented in the later parts of the section.

A. To Schedule According to Resource Demands

Given the fact that resources are distributed among cores in MMMP, the scheduling decision should overcome the resource isolation among different cores, so that workloads can fully utilize the distributed resources in MMMP. On the other hand, taking the fully shared resources in SMT into consideration, the scheduling policy should also pair threads in the same core so as to reduce competition. Therefore, we propose the Mix-Scheduling policy (MIX) to maintain thread diversity in every core. Thread diversity means evenly mixing threads of various CPImem in every core, so that the variance of the CPImem values among the threads in the same core is maximized. As a direct result of the proposed policy, the difference in CPImem values between threads is expected to be significant. Hence, given the discussion in Section II, the threads with smaller CPImem values mainly consume the computation resources, while the throughput of the other threads, with larger CPImem values, depends on the memory resources. Consequently, threads in the same core demand different resources, rather than competing severely for the same resource. Moreover, from the perspective of a multi-threading workload,
it utilizes more resources globally if its threads are able to access various resources in their cores, and thus achieves better workload performance.

An on-line model of Mix-Scheduling would significantly increase the implementation complexity, due to the real-time analysis of applications as well as dynamic migration. As discussed in the introduction, there is hardly any algorithm that clearly defines a scheduling policy according to workload resource demands in MMMP with SMT, and the necessity of such a policy in this environment has not been validated specifically yet. Moreover, real-time analysis is not necessary when the applications are more or less fixed while their inputs vary. As a result, instead of an adaptive model, our experiment is simplified in two aspects. First, the off-line CPI-portion analysis is performed for the benchmarks employed in our experiment. Second, the benchmarks are further divided into two categories based on their CPImem values, following methods similar to [16]. Some benchmarks perform many operations on a single datum fetched from the memory system, so they have small CPImem values and mainly consume the computation resources. These benchmarks are capable of providing more Instruction Level Parallelism (ILP) and fall into the Computation-Bound (ILP) category. On the contrary, other benchmarks process large amounts of data during execution but perform relatively few operations on each datum, so they have large CPImem values. Hence, they belong to the Memory-Bound (MEM) category. The SPEC CPU2000 [26] benchmarks used in this study, with their categories, are shown in Table I. The Memory-Bound category demands memory resources, while the Computation-Bound category demands computation resources.

Given the goal of maximizing the thread CPImem variance in each core, the Mix-Scheduling policy in our experiment schedules threads according to their resource demands. Hence, benchmarks belonging to different categories should be scheduled onto the same core. As a result, considering a workload consisting of four threads, each executing one benchmark, we schedule one MEM thread and one ILP thread onto each of the two cores in the Mix-Scheduling policy. In such a case, the MEM thread relies more on the memory resources in the core, while the ILP thread relies more on the computation resources, which matches our goal to minimize competition and optimize utilization. The Mix-Scheduling decisions for workloads 1–23 are shown in Table II. On the contrary, the Mono-Scheduling policy schedules benchmarks from the same category onto the same core, i.e., it schedules the two MEM threads onto one core and the two ILP threads onto the other, maintaining CPImem uniformity among the threads in every core. It is considered the opposite scenario of Mix-Scheduling, providing worst-case results, while other scheduling decisions made without considering the workload resource demands would be expected to perform between Mono-Scheduling and Mix-Scheduling. The Mono-Scheduling decisions for workloads 1–23 are listed in Table III.
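For concreteness, the sketch below shows how the two policies would map a four-thread workload onto the dual-core, two-thread processor, assuming per-thread CPImem values are available from the off-line analysis. The 1.0 category threshold and the per-benchmark CPImem numbers are illustrative assumptions of ours, not values reported in this paper.

    # Illustrative sketch of the Mix- and Mono-Scheduling decisions for a
    # four-thread workload on a dual-core, two-thread SMT processor.
    # The CPImem values and the 1.0 category cutoff are hypothetical;
    # the paper derives categories from an off-line CPI-portion analysis.

    CPIMEM_THRESHOLD = 1.0  # assumed cutoff between ILP and MEM categories

    def classify(cpi_mem):
        """Large CPImem -> Memory-Bound (MEM); small -> Computation-Bound (ILP)."""
        return "MEM" if cpi_mem >= CPIMEM_THRESHOLD else "ILP"

    def mix_schedule(threads):
        """Pair one MEM thread with one ILP thread on each core,
        maximizing CPImem diversity within every core."""
        mem = sorted((t for t in threads if classify(threads[t]) == "MEM"), key=threads.get)
        ilp = sorted((t for t in threads if classify(threads[t]) == "ILP"), key=threads.get)
        return {"core0": (mem[0], ilp[0]), "core1": (mem[1], ilp[1])}

    def mono_schedule(threads):
        """Place both MEM threads on one core and both ILP threads on the
        other, keeping CPImem uniform within each core (the reference policy)."""
        mem = [t for t in threads if classify(threads[t]) == "MEM"]
        ilp = [t for t in threads if classify(threads[t]) == "ILP"]
        return {"core0": tuple(mem), "core1": tuple(ilp)}

    # Hypothetical CPImem values for one workload (benchmark names from Table I).
    workload = {"mcf": 5.4, "equake": 1.8, "gzip": 0.3, "gcc": 0.5}
    print(mix_schedule(workload))   # {'core0': ('equake', 'gzip'), 'core1': ('mcf', 'gcc')}
    print(mono_schedule(workload))  # {'core0': ('mcf', 'equake'), 'core1': ('gzip', 'gcc')}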
TABLE I
EIGHT SPEC CPU2000 BENCHMARKS EMPLOYED

Benchmark   Type   Category
Gcc         INT    ILP
Gzip        INT    ILP
Crafty      INT    ILP
Bzip2       INT    ILP
Equake      FP     MEM
Mcf         INT    MEM
Parser      INT    MEM
Twolf       INT    MEM
TABLE II
THREAD ASSIGNMENT IN MIX-SCHEDULING

WL   CORE0 (MEM0, ILP0)    CORE1 (MEM1, ILP1)
1    Equake, Gzip          Mcf, Gcc
2    Equake, Gzip          Twolf, Gcc
3    Parser, Gzip          Twolf, Gcc
4    Mcf, Gcc              Parser, Gzip
5    Mcf, Gcc              Twolf, Gzip
6    Equake, Bzip2         Twolf, Gcc
7    Equake, Bzip2         Parser, Gcc
8    Equake, Gzip          Mcf, Bzip2
9    Equake, Gcc           Mcf, Bzip2
10   Parser, Gzip          Twolf, Bzip2
11   Mcf, Bzip2            Twolf, Gcc
12   Mcf, Bzip2            Parser, Gcc
13   Mcf, Bzip2            Parser, Gzip
14   Mcf, Bzip2            Twolf, Gzip
15   Equake, Gzip          Mcf, Crafty
16   Equake, Gzip          Twolf, Crafty
17   Mcf, Crafty           Twolf, Gcc
18   Mcf, Crafty           Parser, Gzip
19   Mcf, Crafty           Twolf, Gzip
20   Parser, Gzip          Twolf, Crafty
21   Equake, Gcc           Mcf, Crafty
22   Mcf, Crafty           Twolf, Bzip2
23   Parser, Bzip2         Twolf, Crafty
We omit from our study the cases where all threads have uniform CPImem or come from the same category, because the scheduling policy makes its decisions by comparing categories. In such omitted cases, scheduling policies based on other parameters, e.g., network communication, may be able to provide better performance, which is not discussed in this paper.

B. Experiment Environment

We use the SuperESCalar Simulator (SESC) by Renau et al. [23] in our experiment. The MMMP parameters of the baseline configuration are shown in Table IV. We configure a dual-core, two-thread-per-core microprocessor in our research. The L1 cache is composed of an I-cache and a D-cache for every core, and the L2 cache is shared by all cores. Instead of executing each benchmark to its final completion, we study representative regions composed of several typical execution phases for performance evaluation. Perelman et al. proposed the Early Single Simulation Points, which predict whole-benchmark performance with acceptable error rates but fewer fast-forwarded instructions [24]. Therefore, we simulate 100 million instructions for every benchmark from the Early Single Simulation Points presented in [25]. In total, 400 million instructions are executed with the reference input for every four-thread workload in the experiment. The fetch policy accompanying both Mix-Scheduling and Mono-Scheduling in the experiment is ICOUNT, proposed by Tullsen et al. in
[4].

TABLE III
THREAD ASSIGNMENT IN MONO-SCHEDULING

WL   CORE0 (MEM0, MEM1)    CORE1 (ILP0, ILP1)
1    Equake, Mcf           Gzip, Gcc
2    Equake, Twolf         Gzip, Gcc
3    Parser, Twolf         Gzip, Gcc
4    Mcf, Parser           Gzip, Gcc
5    Mcf, Twolf            Gzip, Gcc
6    Equake, Twolf         Gcc, Bzip2
7    Equake, Parser        Gcc, Bzip2
8    Equake, Mcf           Gzip, Bzip2
9    Equake, Mcf           Gcc, Bzip2
10   Parser, Twolf         Gzip, Bzip2
11   Mcf, Twolf            Gcc, Bzip2
12   Mcf, Parser           Gcc, Bzip2
13   Mcf, Parser           Gzip, Bzip2
14   Mcf, Twolf            Gzip, Bzip2
15   Equake, Mcf           Gzip, Crafty
16   Equake, Twolf         Gzip, Crafty
17   Mcf, Twolf            Gcc, Crafty
18   Mcf, Parser           Gzip, Crafty
19   Mcf, Twolf            Gzip, Crafty
20   Parser, Twolf         Gzip, Crafty
21   Equake, Mcf           Gcc, Crafty
22   Mcf, Twolf            Crafty, Bzip2
23   Parser, Twolf         Crafty, Bzip2

TABLE IV
BASELINE PARAMETERS

Parameter                        Value
Instruction Fetch/Retire Width   8/9
Instruction Window Size          160 INT, 64 FP
Function Unit                    3 Load/Store, 3 INT Mul/Div, 5 FP Mul/Div
ReOrder Buffer Size              320 entries
L1 D-Cache                       32KB, 4-way
L1 I-Cache                       32KB, 4-way
L2 Cache                         512KB, 8-way associative
L1 Cache Hit Latency             2 cycles
L2 Cache Hit Latency             9 cycles
Main Memory Hit Latency          469 cycles

C. Method of Performance Evaluation

The average Instruction Per Cycle (avg IPC) used in [15, 26] is one critical metric of the overall system throughput, defined as the total number of instructions executed over the time they cost. The formula for avg IPC is as follows:

    average IPC = (Σ_{i=1}^{N} IPC_i) / N    (1)

where N is the number of threads. Nevertheless, in judging the performance of a new architecture, Sazeides et al. [27] proposed the Average Baseline Weighted IPC (abw IPC). It is calculated as the average per-thread improvement of the new architecture over the old one, which makes it a good reference for observing the average thread performance. The formula for abw IPC is as follows:

    abw IPC = (Σ_{i=1}^{N} IPC_new,i / IPC_baseline,i) / N    (2)
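As a quick illustration of the two metrics, the short sketch below computes avg IPC and abw IPC for a four-thread workload; the per-thread IPC numbers are invented for the example and do not come from the experiment.

    # Minimal sketch of the two evaluation metrics, Eq. (1) and Eq. (2).
    # The IPC values below are invented for illustration only.

    def avg_ipc(ipcs):
        """Eq. (1): overall system throughput as the mean IPC of N threads."""
        return sum(ipcs) / len(ipcs)

    def abw_ipc(ipcs_new, ipcs_baseline):
        """Eq. (2): Average Baseline Weighted IPC, the mean per-thread
        speedup of the new configuration over the baseline."""
        return sum(n / b for n, b in zip(ipcs_new, ipcs_baseline)) / len(ipcs_new)

    mix  = [0.9, 1.4, 0.5, 1.1]   # hypothetical per-thread IPC under Mix-Scheduling
    mono = [0.6, 1.0, 0.4, 0.9]   # hypothetical per-thread IPC under Mono-Scheduling

    print(avg_ipc(mix) / avg_ipc(mono))  # normalized avg IPC, cf. Figure 2
    print(abw_ipc(mix, mono))            # abw IPC of MIX with MONO as baseline, cf. Figure 1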
D. Investigation of Results

The normalized abw IPC values of Mix-Scheduling over Mono-Scheduling are shown in Figure 1. They show a 27% improvement in abw IPC, with a standard deviation of 0.263, when Mix-Scheduling is compared with Mono-Scheduling in our experiment. Because of the thread CPImem diversity in every core, the threads consume different shared resources in each core, and thus their performance depends on different resources. Since depending on different resources causes less competition for the same resource under Mix-Scheduling than under Mono-Scheduling, the Mix-Scheduling policy fulfills the basic concept of the SMT architecture: to better utilize the fully shared resources in every core with minimum competition. On the contrary, under the Mono-Scheduling policy, threads in the same core have similar CPImem values, and their performance relies mainly on the same resource. When severe competition for that shared resource happens, thread performance is limited by it, which results in the poorer average thread performance of Mono-Scheduling compared with Mix-Scheduling.

Furthermore, when the threads are distributed across different cores, they are able to access more of the distributed resources throughout the MMMP. Because both the computation resources and the memory resources are utilized by threads under the Mix-Scheduling policy, it improves the resource utilization in MMMP with the SMT architecture. Meanwhile, from the perspective of the workload, when it can utilize more of the distributed resources in MMMP, its threads can turn those resources into more throughput. The normalized avg IPC values of Mix-Scheduling over Mono-Scheduling are shown in Figure 2. The avg IPC increases under Mix-Scheduling by 63%, with a standard deviation of 0.672, over Mono-Scheduling, indicating improved overall system throughput.

Aside from resource utilization, the Mix-Scheduling policy also enlarges the decision space for the fetch policy. The basic concept of ICOUNT is to favor the threads with higher IPC in an SMT core at the fetch stage. Therefore, when the threads have similar IPC, the decision space for ICOUNT is relatively small. Taking the benchmark MCF as an example, it has not only a large CPImem value but also a large total CPI [9]. In our experiment, the top four improvements in abw IPC happen in workloads containing MCF, while MCF never appears in the bottom four workloads. This results from Mix-Scheduling sending an ILP thread, with considerably higher IPC than MCF, onto the same core as MCF. Hence, the IPC gap between the two threads is noticeable to ICOUNT. On the contrary, MCF has an IPC similar to that of the other MEM thread in the same core under Mono-Scheduling.
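To make the decision-space argument concrete, below is a minimal sketch of an ICOUNT-style fetch choice, assuming per-thread counts of in-flight front-end instructions are available each cycle; the counts shown are hypothetical, and the sketch follows the description of [4] rather than any exact hardware implementation.

    # Minimal sketch of an ICOUNT-style fetch decision: each cycle, grant
    # fetch priority to the thread with the fewest in-flight front-end
    # instructions. The instruction counts below are hypothetical.

    def icount_pick(in_flight):
        """Return the thread with the fewest instructions in the front-end."""
        return min(in_flight, key=in_flight.get)

    # Under Mix-Scheduling, a stalled MEM thread (e.g., mcf) and its ILP
    # partner show a large gap, so the choice is decisive:
    print(icount_pick({"mcf": 28, "crafty": 6}))    # -> crafty

    # Under Mono-Scheduling, two MEM threads show similar counts, leaving
    # ICOUNT little room to steer the fetch bandwidth:
    print(icount_pick({"mcf": 28, "twolf": 26}))    # -> twolf, a marginal call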
Fig. 1. Normalized Average Baseline Weighted IPC in Mix-Scheduling over Mono-Scheduling
Fig. 2. Normalized Average IPC in Mix-Scheduling over Mono-Scheduling
TABLE V
CORRELATION COEFFICIENT

           abw IPC    avg IPC    L1 Miss    L2 Miss
abw IPC    1.0000
avg IPC    0.9227     1.0000
L1 Miss    0.5484     0.7179     1.0000
L2 Miss    -0.7780    -0.7351    -0.4492    1.0000
Consequently, by making decisions based on the number of in-flight instructions, ICOUNT is able to be more effective under Mix-Scheduling than under Mono-Scheduling, and thus MCF is less likely to reduce the system throughput.

It is usually argued that more L1 Cache misses possibly lead to more L2 Cache misses. For example, DCache Warn [29] uses an L1 miss as a clear indicator of a possible L2 miss. To examine this relationship, we compute the correlation coefficients among abw IPC, avg IPC, the L1 Cache miss rate (L1 Miss) and the L2 Cache miss rate (L2 Miss), which are presented in Table V. A coefficient of 1 indicates an increasing linear relationship, e.g., Y = X; -1 expresses a decreasing linear relationship, e.g., Y = -X; while 0 means hardly any linear relationship [30]. As a result, considering the issue of the Memory Wall, it makes sense that a lower L2 Miss leads to better performance, i.e., higher IPC, shown by the correlation coefficients of -0.7780 between L2 Miss and abw IPC and -0.7351 between L2 Miss and avg IPC. However, the correlation coefficient between the two cache miss rates, -0.4492, indicates neither a linear relationship as significant as that between cache misses and IPC, nor firm support for the statement that more L1 misses clearly indicate more L2 misses. Therefore, our future research will revisit this interesting observation to pursue a detailed explanation and/or an adaptive methodology.
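The coefficients in Table V are standard Pearson correlation coefficients; a minimal sketch of the computation is given below, with made-up sample vectors standing in for the per-workload measurements.

    # Minimal sketch of the Pearson correlation coefficient used in Table V.
    # The sample vectors are invented; the paper correlates per-workload
    # measurements of abw IPC, avg IPC, and the L1/L2 miss rates.
    import math

    def pearson(x, y):
        """Pearson correlation: covariance of x and y over the product of
        their standard deviations; 1, -1 and 0 mean increasing linear,
        decreasing linear, and hardly any linear relationship."""
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sx = math.sqrt(sum((a - mx) ** 2 for a in x))
        sy = math.sqrt(sum((b - my) ** 2 for b in y))
        return cov / (sx * sy)

    l2_miss = [0.30, 0.22, 0.40, 0.18, 0.35]   # hypothetical per-workload L2 miss rates
    avg_ipc = [0.80, 1.10, 0.60, 1.30, 0.70]   # hypothetical per-workload avg IPC
    print(pearson(l2_miss, avg_ipc))           # strongly negative, cf. Table V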
Fig. 3. Normalized L1 Data Cache Miss Rate in Mix-Scheduling over Mono-Scheduling
Fig. 4. Normalized L2 Data Cache Miss Rate in Mix-Scheduling over Mono-Scheduling
Moreover, Knauerhase et al. [20] consider the memory-bound benchmarks cache-heavy, suggesting that they utilize the cache more than other benchmarks. Therefore, when the Memory-Bound benchmarks are distributed across the two cores, they update both private L1 caches frequently, so the Computation-Bound benchmarks in both cores suffer from high L1 Cache miss rates. Consequently, the two L1 caches can hardly provide better performance under the Mix-Scheduling policy than under Mono-Scheduling. The normalized L1 and L2 Cache miss rates of Mix-Scheduling over Mono-Scheduling are shown in Figures 3 and 4, respectively. Figure 3 shows that the L1 Cache miss rate increases by 10%, with a standard deviation of 0.0972, under Mix-Scheduling. Given the observation above that the L1 and L2 Cache miss rates have a correlation coefficient of only -0.4492, it is reasonable to find that the L2 Cache miss rate decreases by 6%, with a standard deviation of 0.411, under the Mix-Scheduling policy, as shown in Figure 4. Since the Memory Wall is one of the major obstacles to better performance, the Mix-Scheduling policy contributes not only to better utilization, but also to shortening the average load latency.

IV. CONCLUSION

In the SMT architecture, concurrently executing threads fully share the resources, i.e., the computation resources and
memory resources in a core. This kind of resource sharing was proposed for better utilization, but it also introduces competition for the same resource. Moreover, even if there is severe competition in a core, the threads cannot utilize resources in other cores, because the resources are isolated among the cores. This combination of resource sharing and isolation intensifies as MMMP develops. Meanwhile, the execution of an application divides into several phases with varying behavior, such that the performance of an application in an execution phase depends statistically on certain resources, e.g., memory resources or computation resources.

To fully utilize the shared resources distributed throughout MMMP, we propose the Mix-Scheduling policy to ensure CPImem diversity in every core, so that threads in the same core can better utilize the shared resources with minimum competition. After dividing the benchmarks into the Computation-Bound category and the Memory-Bound category, the system
can schedule threads according to their resource demands. With the distributed resources better utilized by the proposed Mix-Scheduling policy, we achieved improvements in overall system throughput, illustrated by the 63% increase in avg IPC, and in average thread performance, shown by the 27% growth in abw IPC. In conclusion, the Mix-Scheduling policy makes an essential step toward fully utilizing the distributed resources throughout MMMP, and neutralizes the severe competition for the shared resources in a single core with SMT. Moreover, the Mix-Scheduling policy helps with the long-latency load problem, considering the 6% reduction in L2 Cache miss rate relative to the Mono-Scheduling policy.

As far as future research is concerned, we are interested in an adaptive scheduling policy, so that the scheduling decision is made based on on-line analysis. In designing such a policy, the overhead of on-line analysis and dynamic migration will be weighed carefully against the improvement from Mix-Scheduling. Meanwhile, we also attach importance to mathematical criteria for classifying the resource demands of workloads. Furthermore, given that the correlation coefficient between L1 Cache misses and L2 Cache misses does not indicate as simple a linear relationship as sometimes assumed, an efficient description is needed, so that the prediction of L2 Cache misses based on L1 Cache behavior can be implemented.

REFERENCES

[1] K.J. Nesbit, J.E. Smith, M. Moreto, F.J. Cazorla, A. Ramirez and M. Valero, "Multicore resource management," IEEE Micro, vol. 28, no. 3, pp. 6-16, 2008.
[2] D.M. Tullsen, S.J. Eggers, J.S. Emer and H.M. Levy, "Exploiting choice: instruction fetch and issue on an implementable simultaneous multithreading processor," Proceedings of the 23rd Annual International Symposium on Computer Architecture, pp. 191-202, 1996.
[3] C. Liu and J. Gaudiot, "The impact of resource sharing control on the design of multicore processors," Proceedings of the 9th International Conference on Algorithms and Architectures for Parallel Processing, pp. 315-326, 2009.
[4] D.M. Tullsen and J.A. Brown, "Handling long-latency loads in a simultaneous multithreading processor," Proceedings of the 34th Annual International Symposium on Microarchitecture, pp. 318-327, 2001.
[5] F.J. Cazorla, A. Ramirez, M. Valero, P.M.W. Knijnenburg, R. Sakellariou and E. Fernández, "QoS for high-performance SMT processors in embedded systems," IEEE Micro, vol. 24, no. 4, pp. 24-31, 2004.
[6] R. Kumar, D.M. Tullsen, N.P. Jouppi and P. Ranganathan, "Heterogeneous chip multiprocessors," Computer, vol. 38, no. 11, pp. 32-38, 2005.
[7] J. Howard, S. Dighe, Y. Hoskote, S. Vangal, D. Finan, G. Ruhl, et al., "A 48-core IA-32 message-passing processor with DVFS in 45nm CMOS," International Solid-State Circuits Conference, pp. 108-109, 2010.
[8] T. Sherwood and B. Calder, "Time varying behavior of programs," Technical Report CS99-630, University of California at San Diego, 1999.
[9] Z. Zhu and Z. Zhang, "A performance comparison of DRAM memory system optimizations for SMT processors," Proceedings of the 11th International Symposium on High-Performance Computer Architecture, pp. 213-224, 2005.
[10] G.E. Moore, "Cramming more components onto integrated circuits," Electronics, vol. 38, no. 8, pp. 114-117, 1965.
[11] K. Asanovic, R. Bodik, B.C. Catanzaro, J.J. Gebis, P. Husbands, K. Keutzer, et al., "The landscape of parallel computing research: a view from Berkeley," Technical Report UCB/EECS-2006-183, University of California at Berkeley, 2006.
[12] D.M. Tullsen, S.J. Eggers and H.M. Levy, "Simultaneous multithreading: maximizing on-chip parallelism," Proceedings of the 22nd International Symposium on Computer Architecture, pp. 317-328, 1995.
[13] S. Raasch and S. Reinhardt, "The impact of resource partitioning on SMT processors," Proceedings of the 12th International Conference on Parallel Architectures and Compilation Techniques, pp. 15-25, 2003.
[14] S. Choi and D. Yeung, "Learning-based SMT processor resource distribution via hill-climbing," Proceedings of the 2006 International Symposium on Computer Architecture, pp. 239-251, 2006.
[15] H. Wang, I. Koren and C.M. Krishna, "An adaptive resource partitioning algorithm for SMT processors," Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques, pp. 230-239, 2008.
[16] A. Kagi, J.R. Goodman and D. Burger, "Memory bandwidth limitations of future microprocessors," Proceedings of the 23rd Annual International Symposium on Computer Architecture, pp. 78-89, 1996.
[17] F.J. Cazorla, P.M.W. Knijnenburg, R. Sakellariou, E. Fernández, A. Ramirez and M. Valero, "Predictable performance in SMT processors: synergy between the OS and SMTs," IEEE Transactions on Computers, vol. 55, no. 7, pp. 785-799, 2006.
[18] J. Weinberg and A. Snavely, "Symbiotic space-sharing on SDSC's DataStar system," The 12th Workshop on Job Scheduling Strategies for Parallel Processing, pp. 192-209, 2006.
[19] A. Snavely, D.M. Tullsen and G. Voelker, "Symbiotic jobscheduling with priorities for a simultaneous multithreading processor," ACM SIGMETRICS Performance Evaluation Review, vol. 30, no. 1, pp. 66-76, 2002.
[20] R. Knauerhase, P. Brett, B. Hohlt, T. Li and S. Hahn, "Using OS observations to improve performance in multicore systems," IEEE Micro, vol. 28, no. 3, pp. 54-66, 2008.
[21] A.C. Sodan and L. Lan, "LOMARC: lookahead matchmaking for multiresource coscheduling on hyperthreaded CPUs," IEEE Transactions on Parallel and Distributed Systems, vol. 17, no. 11, pp. 1360-1375, 2006.
[22] E. Frachtenberg, D.G. Feitelson, F. Petrini and J. Fernández, "Adaptive parallel job scheduling with flexible coscheduling," IEEE Transactions on Parallel and Distributed Systems, vol. 16, no. 11, pp. 1066-1077, 2005.
[23] J. Renau, B. Fraguela, J. Tuck, W. Liu, M. Prvulovic, L. Ceze, S. Sarangi, et al., "SESC Simulator," http://sesc.sourceforge.net, 2005.
[24] E. Perelman, G. Hamerly and B. Calder, "Picking statistically valid and early simulation points," Proceedings of the 12th International Conference on Parallel Architectures and Compilation Techniques, pp. 244-255, 2003.
[25] SimPoint, "Early Single Simulation Points," http://cseweb.ucsd.edu/~calder/simpoint/points/early/spec2000/single-early-100M.html.
[26] J.L. Henning, "SPEC CPU2000: measuring CPU performance in the new millennium," Computer, vol. 33, no. 7, pp. 28-35, 2000.
[27] Y. Sazeides and T. Juan, "How to compare the performance of two SMT microarchitectures," The IEEE International Symposium on Performance Analysis of Systems and Software, pp. 180-183, 2001.
[28] K. Luo, J. Gummaraju and M. Franklin, "Balancing throughput and fairness in SMT processors," Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, pp. 164-171, 2001.
[29] F.J. Cazorla, A. Ramirez, M. Valero and E. Fernández, "DCache Warn: an I-Fetch policy to increase SMT efficiency," Proceedings of the 18th International Parallel and Distributed Processing Symposium, pp. 74-83, 2004.
[30] R.D. Yates and D.J. Goodman, "Probability and Stochastic Processes: A Friendly Introduction for Electrical and Computer Engineers," 2nd ed., John Wiley & Sons, 2004.