Dynamic Scheduling Issues in SMT Architectures∗

Chulho Shin
System Design Technology Laboratory
Samsung Electronics Corporation
[email protected]

Seong-Won Lee
Dept. of Electrical Engineering - Systems
University of Southern California
[email protected]

Jean-Luc Gaudiot
Dept. of Electrical Engineering and Computer Science
University of California, Irvine
[email protected]

Abstract

Simultaneous Multithreading (SMT) attempts to attain higher processor utilization by allowing instructions from multiple independent threads to coexist in a processor and compete for shared resources. Previous studies have shown, however, that its throughput may saturate as the number of threads increases. One reason is that a fixed thread scheduling policy cannot be optimal for the varying mixes of threads an SMT processor may face. Our Adaptive Dynamic Thread Scheduling (ADTS) was previously proposed to achieve higher utilization by allowing a detector thread to make use of wasted pipeline slots at nominal hardware and software cost. The detector thread adaptively switches between various fetch policies. Our previous study showed that a single fixed thread scheduling policy leaves considerable room (some 30%) for improvement compared to an oracle-scheduled case. In this paper, we take a closer look at ADTS. We implemented a functional model of ADTS and its software architecture to evaluate various heuristics for determining a better fetch policy for the next scheduling quantum. We report that performance could be improved by as much as 25%.

1. Introduction

Simultaneous Multithreading (SMT), or multithreaded superscalar architecture [4, 10, 21, 20, 5, 8], can achieve high processor utilization by allowing multiple independent threads to coexist in the processor pipeline and share its resources, with the support of multiple hardware contexts. SMT is an attempt to overcome the low resource utilization of wide-issue single-threaded superscalar processors by exploiting Thread-Level Parallelism (TLP) at the relatively low hardware cost of supporting the multiple hardware contexts. Studies by Tullsen et al. and Ungerer et al. [21, 16] have shown that when the number of threads simultaneously active in an SMT processor grows beyond four, performance often saturates and in some cases even degrades. In these studies, an attempt was made to overcome the saturation effect by finding a better fetch mechanism or by increasing the number and availability of resources that would otherwise become bottlenecks (such as register files and instruction queues).

∗ The material reported in this paper is based upon work supported in part by the National Science Foundation under Grants No. CSA-0073527 and INT-9815742. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

It was also shown that increasing the size of the caches can raise the saturation point. Unfortunately, such remedies do not work in all cases because their effectiveness depends heavily on the properties of the application mixes. We believe that a single fixed thread scheduling policy that performs better than others "on average" cannot deliver the performance we anticipate in SMT processors with more than four thread contexts. We will show that with our adaptive dynamic thread scheduling policy [15], we can significantly improve the performance of SMT processors and prevent the saturation and degradation effects alluded to earlier.

Our work focuses on multiprogrammed or multi-user environments, where the combinations of threads an SMT processor faces vary significantly over time. For multiprogramming or multi-user workloads consisting of threads that run on the processor independently of one another, no information about interactive behavior between threads may be known in advance. Consequently, a more "intelligent" and more dynamic thread scheduling capability is indispensable if we are to sustain high throughput. When an application is parallelized into multiple threads, the role of thread scheduling is to eliminate resource conflicts and avoid data dependencies in order to expose more parallelism. In contrast, the role of scheduling for multiple independent threads (of multiprogrammed workloads) is to perform better "traffic control" so as to sustain higher throughput by keeping interference between threads low.

Tullsen et al. [20] evaluated several fetch policies and showed that the ICOUNT policy yields the best average performance. ICOUNT gives priority to the threads with fewer instructions in the decode stage, the rename stage, and the instruction queues. ICOUNT best accounts for what is taking place in SMT pipelines in general: since it gives priority to the threads that have fewer instructions in the earlier stages of the pipeline, a balanced use of the instruction window results; and since it gives more opportunities to the threads whose instructions drain through the pipeline more rapidly, a more efficient use of the pipeline follows. While ICOUNT is the scheduling policy that works best on average, it does not address specific problems as directly as other policies such as BRCOUNT and MISSCOUNT do (see Section 5 for definitions of the various fetch policies; BRCOUNT, for example, prioritizes threads with fewer conditional branches). Assume, for example, that the set of applications

in an SMT processor consists of four control-intensive applications (with many conditional branches) and four other applications. Further assume that these four control-intensive applications are currently suffering many branch mispredictions. The processor will then waste slots on wrong-path instructions from the four control-intensive applications, while the other four threads are prevented from exploiting the resources in the pipeline. In this specific case, if BRCOUNT had been used, the four control-intensive threads would have had fewer chances to be fetched. Consequently, the number of fetched instructions from control-intensive threads would diminish while the number of instructions from the other four threads would increase, evening out the number of effective instructions among all threads. (A code sketch contrasting the two policies appears after the list below.)

The main goal of a hardware thread scheduler is to avoid imbalance among threads, where imbalance on a resource means that use of the resource is not evenly distributed among the threads. For example, if one thread has many more instructions in the early stages of the pipeline (the decode and rename stages and the instruction queue) than the others do, we have an imbalance in terms of instruction count. Imbalance adversely affects throughput (it lowers the achievable Thread-Level Parallelism) for the following reasons:

• Since a small number of threads occupy one type of resource, the other threads cannot gain access to that same resource.

• The average number of independent, "issuable" instructions per thread becomes lower for the other threads, lowering the average number of instructions that can proceed through the pipeline.
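To make the contrast concrete, the following is a minimal sketch, in C, of how ICOUNT-style and BRCOUNT-style fetch priorities could be derived from per-thread counters. The structure and field names are our own illustration, not the hardware interface of [20].

#include <stdio.h>

#define NUM_THREADS 8

/* Hypothetical per-thread counters sampled from the pipeline. */
typedef struct {
    int front_end_insts;  /* insts in decode, rename, and the IQs */
    int pending_branches; /* conditional branches in flight       */
} thread_counters_t;

/* ICOUNT: favor threads with the fewest instructions in the early
 * pipeline stages, balancing use of the instruction window. */
static int icount_priority(const thread_counters_t *c) {
    return -c->front_end_insts; /* fewer insts => higher priority */
}

/* BRCOUNT: favor threads with the fewest conditional branches,
 * throttling threads likely to fill slots with wrong-path work. */
static int brcount_priority(const thread_counters_t *c) {
    return -c->pending_branches;
}

/* Pick the highest-priority thread under the given policy. */
static int select_thread(const thread_counters_t cnt[],
                         int (*priority)(const thread_counters_t *)) {
    int best = 0;
    for (int t = 1; t < NUM_THREADS; t++)
        if (priority(&cnt[t]) > priority(&cnt[best]))
            best = t;
    return best;
}

int main(void) {
    thread_counters_t cnt[NUM_THREADS] = {
        {12, 5}, {3, 0}, {9, 4}, {4, 1}, {15, 6}, {2, 0}, {8, 3}, {6, 2}
    };
    printf("ICOUNT picks thread %d\n", select_thread(cnt, icount_priority));
    printf("BRCOUNT picks thread %d\n", select_thread(cnt, brcount_priority));
    return 0;
}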

With adaptive dynamic thread scheduling, when a change in the system environment is detected, the fetch policy to be used during the next interval is decided upon and put into effect to eliminate the problematic imbalance. However, implementing multiple fetch policies and decision-making algorithms in hardware could translate into high hardware complexity. In our previous work [15], we proposed a detector thread approach which lowers the hardware requirements and makes use of unused pipeline slots to run the decision-making algorithms and fetch policies. Our approach has the further advantage that thread scheduling behavior can be modified even after the chip has been produced, because the detector thread is programmable. The detector thread can also lower the overhead of the system job scheduler by shortening its stay in the processor and analyzing information before the job scheduler needs it.

In this paper, we take a closer look at the software aspect of ADTS. We propose an effective software architecture for the detector thread. The core of this software is the set of heuristics for determining the fetch policy to be used in the next scheduling quantum. We implement and evaluate functional models of these heuristics.

This paper is organized as follows. In Section 2, previous work related to ours is summarized. Adaptive dynamic thread scheduling is reviewed in Section 3, its software architecture is discussed in Section 4, and our evaluation methodology is described in Section 5. The results of our simulation experiments are presented and analyzed in Section 6. Summary and conclusions appear in Section 7.

2. Related Work

Wang et al. investigated the use of a special thread to realize speculative precomputation in one of the two threads available on the Hyper-Threading architecture [22]. The study targets improving the performance of single-threaded applications on two-context SMT processors. DanSoft [6] proposed the idea of nanothreads, in which a nanothread is given control of the processor when the main thread stalls. The idea was based on a CMP with dual VLIW single-threaded cores, and its success hinges on the effectiveness of the compiler. Assisted Execution [18] extended the nanothread idea to architectures that allow simultaneous execution of multiple threads, including SMT. It attempts to improve the performance of a main thread by having multiple nanothreads perform prefetching, and its success likewise hinges on the compiler. Speculative data-driven multithreading [14] takes advantage of a speculative thread, called a data-driven thread (DDT), to pre-execute critical computations and absorb latency on behalf of the main thread on SMT; this study also focused on improving the performance of a main thread. Luk [11] likewise proposed pre-execution in idle threads for more effective prefetching of hard-to-predict data addresses, to boost the performance of a primary thread. Simultaneous Subordinate Microthreading (SSMT) [3] was proposed in an attempt to improve the performance of a single thread by having multiple subordinate microthreads perform useful work, such as running sophisticated branch prediction algorithms. The idea was not based on an SMT architecture and also requires effective compiler technology.

Parekh et al. [13] investigated issues related to job scheduling for SMT processors. They compared the performance of oblivious and thread-sensitive scheduling: oblivious scheduling means round-robin or random selection, while thread-sensitive scheduling takes into account the resource demands and behavior of each thread. The study concluded that thread-sensitive, IPC-based scheduling can achieve a significant speedup over round-robin methods. However, that study concerns system job scheduling and is not directly related to dynamic thread scheduling. Moreover, the job scheduler must be brought into the processor, causing a context switch of user threads. Such a job scheduler can, however, take advantage of our detector thread approach, as discussed in Section 3. Another study [17] also investigated job scheduling for SMT processors. It proposed a job scheduling scheme called SOS, in which an overhead-free sample phase measures the performance of various schedules (mixes), and the results inform the selection of tasks for the next time slice. This strategy can also benefit from our approach, because the detector thread is always active: it can use unused pipeline slots and resources to determine which threads should not be selected in the next job scheduling time slice, lowering the burden on the job scheduler.

Our adaptive dynamic thread scheduling approach [15] should not be confused with adaptive process scheduling [12], which addresses O/S job scheduling issues for SMT processors: the goal of our approach is to offer more efficient thread scheduling at the individual instruction level in the SMT pipeline.


Suh et al. [19] examined approaches that detect per-thread cache behavior using hardware counters and use the obtained information to guide job scheduling on SMT. This approach is similar to our idea of coupling the detector thread with job schedulers; however, it does not aim at controlling thread fetch policies.

3. Adaptive Dynamic Thread Scheduling (ADTS) with a Detector Thread (DT)

Our Adaptive Dynamic Thread Scheduling (ADTS) was introduced and discussed in detail in [15], along with its implementation using a detector thread (DT). ADTS with a DT tackles two problems: first, a new fetch policy can be activated if the system is suffering from low throughput; second, unused pipeline slots can be used to detect adverse changes in the system, identify threads that clog the pipeline, and take the actions needed to sustain high throughput. The actions that can be taken include context-switching a thread and preventing a specific thread from being fetched. A detector thread is a special thread which reads the thread status indicators and, based on their current values, updates the thread control flags so that the thread control hardware can take any action necessary to improve the performance of an SMT processor. The per-thread status indicators are updated by circuitry located throughout the processor pipeline, based upon specific events such as cache misses, pipeline stalls, and the population at each stage.

Figure 1. How a Detector Thread works with normal threads. (The per-thread counters feed the detector thread (DT), which updates the per-thread flags for threads A through H; the thread selection units consult these flags.)

Our previous work [15] proposed a way to implement the detector thread, based on another study [3]. The detector thread has its own program cache, sufficiently large (2 or 4 KB) to hold its small program image, and its data accesses are mostly to special registers, such as the per-thread counters, and to general-purpose registers. Most of the time, the detector thread is the lowest-priority thread. When the pipeline slots are almost fully occupied by normal threads, the detector thread will not obtain any scheduling slots; this is acceptable because it means the pipeline slots are already enjoying high utilization. Fetching the detector thread's instructions does not add significant overhead either: since its instructions come from its own isolated program cache, they do not compete for fetch bandwidth with the normal threads. Nor does it affect data memory bandwidth, because its data mostly comes from special registers. It was also shown that the detector thread's job fits within the cycle budget available in realistic situations [15].

The detector thread plays the central role in this process, as shown in Figure 1. It continually watches the per-thread status indicators and updates the flags according to its active policy. The indicators are updated by hardware on predetermined events at points spread across the pipeline. Because the detector thread has the lowest priority among threads, it will not often be activated as long as the pipeline is well utilized. Can the detector thread starve in such cases? That depends on the occupancy of the instruction fetch buffer: as long as the fetch buffer is full, no instructions from the detector thread can be fetched. For this detector thread approach to work successfully, it must be equipped with intelligent heuristics or algorithms to dynamically detect clogging (low throughput) and to choose a better fetch policy for the next time frame. However, since the resources allotted to the detector thread are quite limited in order to minimize hardware overhead, the algorithm is also limited in the data to which it can refer. This is the topic of the next section.

4. Software Architecture of the Detector Thread

The role of the detector thread is to check the values of the various thread status indicators and, based on conditions dynamically defined in software, to update the thread control flags accordingly, as shown in Figure 1. Each thread has its own set of flags: one flag may indicate whether the thread can be fetched in the next cycle, while another may indicate whether it should be context-switched at the next opportunity. When the system thread is loaded, it can look at these flags and suspend a clogging thread without going through the process of determining which thread to suspend. The thread selection unit then simply issues instructions from threads in their order of priority. Although the per-thread status indicators, thread control flags, and thread selection units are fixed in hardware, the thread control behavior built around these resources can be changed by writing different program code for the detector thread.
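To make this interface concrete, one possible layout of the per-thread indicators and flags is sketched below in C. The field names and widths are our own assumptions: the paper fixes the concept (counters written by hardware, flags written by the detector thread) but not an encoding.

#include <stdbool.h>
#include <stdint.h>

/* Per-thread status indicators: read-only to the detector thread,
 * updated by circuitry throughout the pipeline on events such as
 * cache misses, stalls, and per-stage population changes. */
typedef struct {
    uint32_t front_end_insts;   /* insts in decode, rename, and the IQs */
    uint32_t pending_branches;  /* unresolved conditional branches      */
    uint32_t outstanding_misses;
    uint32_t committed_insts;   /* committed during the current quantum */
} thread_status_t;

/* Per-thread control flags: written by the detector thread, read by
 * the thread control hardware and by the system job scheduler. */
typedef struct {
    bool    fetch_enable;  /* may this thread be fetched next cycle?  */
    bool    suspend_hint;  /* clogging: context-switch at next chance */
    uint8_t priority;      /* consulted by the thread selection unit  */
} thread_flags_t;

/* Example flag update: a thread monopolizing the front end is made
 * non-fetchable and hinted for suspension by the job scheduler. */
static void update_flags(const thread_status_t *st, thread_flags_t *fl,
                         uint32_t clog_threshold) {
    fl->fetch_enable = st->front_end_insts < clog_threshold;
    fl->suspend_hint = !fl->fetch_enable;
}

int main(void) {
    thread_status_t st = { .front_end_insts = 48 };
    thread_flags_t  fl = { 0 };
    update_flags(&st, &fl, 32);  /* 48 >= 32: marked as clogging */
    return fl.suspend_hint ? 0 : 1;
}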

The software architecture of the detector thread for adaptive thread scheduling is shown in Figure 2. The status counters are updated at each cycle throughout the pipeline. For every period of 8K cycles, the number of committed instructions is counted, along with the maximum number of instructions that could have been executed (8K x 8); if the interval remains constant, the maximum need not be counted. The detector thread then checks whether the IPC (the number of committed instructions per cycle) is below a threshold; if so, the previous time frame is identified as low-throughput. Once a previous scheduling quantum (not to be confused with the job scheduler's quantum, which is typically milliseconds, or on the order of a million cycles) is determined to be low-throughput, a new fetch policy has to be chosen, because the incumbent policy (the one currently engaged) has turned out to perform poorly. The policy chosen to replace the incumbent is then activated for the next scheduling interval.
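Section 6 evaluates the actual heuristics; as a purely illustrative stand-in (not one of the heuristics evaluated in this paper), the sketch below attributes low throughput to the resource showing the largest per-thread imbalance and picks the policy that targets it. All names and the imbalance metric are assumptions of ours.

#include <stdio.h>

#define NUM_THREADS 8

typedef enum { POLICY_ICOUNT, POLICY_BRCOUNT, POLICY_MISSCOUNT } fetch_policy_t;

/* Hypothetical per-thread totals accumulated over one quantum. */
typedef struct {
    int front_end_insts[NUM_THREADS];   /* decode/rename/IQ occupancy */
    int branch_mispredicts[NUM_THREADS];
    int dcache_misses[NUM_THREADS];
} quantum_stats_t;

/* Spread between the heaviest and lightest thread: a crude proxy
 * for how unevenly one resource is being used. */
static int spread(const int v[NUM_THREADS]) {
    int lo = v[0], hi = v[0];
    for (int t = 1; t < NUM_THREADS; t++) {
        if (v[t] < lo) lo = v[t];
        if (v[t] > hi) hi = v[t];
    }
    return hi - lo;
}

/* Blame the most imbalanced resource and pick the policy that
 * directly attacks it in the next quantum. */
static fetch_policy_t choose_next_policy(const quantum_stats_t *s) {
    int iq = spread(s->front_end_insts);
    int br = spread(s->branch_mispredicts);
    int ms = spread(s->dcache_misses);
    if (br >= iq && br >= ms) return POLICY_BRCOUNT;
    if (ms >= iq)             return POLICY_MISSCOUNT;
    return POLICY_ICOUNT;
}

int main(void) {
    quantum_stats_t s = {
        { 9, 8, 10, 9, 8, 9, 10, 9 },  /* front end fairly even    */
        { 0, 1, 25, 0, 1, 0, 30, 1 },  /* two branch-heavy threads */
        { 2, 3, 2, 3, 2, 2, 3, 2 }     /* caches behaving          */
    };
    printf("next policy: %d\n", (int)choose_next_policy(&s)); /* BRCOUNT */
    return 0;
}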


In the meantime, during the remaining idle slots, other functions can be performed. The first is to identify the clogging threads: by examining the per-thread status counters, the threads that are clogging the pipeline for various reasons can be identified and marked, so that the job scheduler, once loaded, can suspend them without going through the possibly long process of identifying them itself. This results in a shorter period of activity for the job scheduler. The second is to enforce the incumbent policy: the per-thread status counters are checked and the priority array is updated depending on the values of the counters. The thread selection unit then looks at this array to decide which two threads should be selected for instruction fetch at each cycle.
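Assembled from the steps above, the detector thread's outer loop might look like the following sketch. The 8K-cycle quantum and the IPC test come from the text; the 50% threshold, the helper names, and the stubbed hardware interface are illustrative assumptions.

#include <stdint.h>
#include <stdio.h>

#define QUANTUM_CYCLES 8192u  /* 8K-cycle scheduling quantum          */
#define MAX_ISSUE      8u     /* 8-wide machine: 8K x 8 slots/quantum */

/* --- Stand-ins for the hardware interface (illustrative only) --- */
static uint64_t committed_insts;  /* special register, reset per quantum */
static void activate_policy(int p)      { printf("switch to policy %d\n", p); }
static int  choose_next_policy(void)    { return 1; /* e.g., BRCOUNT */ }
static void mark_clogging_threads(void)    { /* set suspend hints    */ }
static void enforce_incumbent_policy(void) { /* refresh priority array */ }

/* Work done by the detector thread at each quantum boundary. */
static void detector_quantum_end(void) {
    /* Low-throughput test: did the previous quantum commit less than
     * half of the 8K x 8 available issue slots? (Fraction assumed.) */
    const uint64_t threshold = (QUANTUM_CYCLES * MAX_ISSUE) / 2;
    if (committed_insts < threshold)
        activate_policy(choose_next_policy());
    committed_insts = 0;

    /* Remaining idle slots: flag clogging threads for the job
     * scheduler and keep the priority array up to date for the TSU. */
    mark_clogging_threads();
    enforce_incumbent_policy();
}

int main(void) {
    committed_insts = 20000;  /* IPC ~ 2.4 on an 8-wide machine */
    detector_quantum_end();   /* below threshold: policy switch */
    return 0;
}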

Figure 2. The software architecture of the detector thread (flowchart: the status counters are updated each cycle, and the committed IPC is tested against the threshold at each quantum boundary).

The priority array thus reflects both the per-thread counter state and the incumbent policy, while the thread selection unit (TSU) examines this array to determine the threads for instruction fetch at each cycle. The TSU selects up to two threads at each cycle because we are using ICOUNT.2.8 [20].
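As a toy illustration of what ICOUNT.2.8-style selection amounts to, the sketch below picks the two highest-priority threads from the priority array each cycle; the two selected threads then share the eight-instruction fetch bandwidth. The array contents and names are illustrative.

#include <stdio.h>

#define NUM_THREADS       8
#define FETCH_WIDTH       8  /* .8: up to 8 instructions fetched per cycle */
#define THREADS_PER_CYCLE 2  /* .2: up to 2 threads share that bandwidth   */

/* Pick the two highest-priority threads from the priority array the
 * detector thread maintains (higher value = higher priority). */
static void tsu_select(const int priority[NUM_THREADS],
                       int picked[THREADS_PER_CYCLE]) {
    picked[0] = picked[1] = -1;
    for (int t = 0; t < NUM_THREADS; t++) {
        if (picked[0] < 0 || priority[t] > priority[picked[0]]) {
            picked[1] = picked[0];  /* old leader becomes runner-up */
            picked[0] = t;
        } else if (picked[1] < 0 || priority[t] > priority[picked[1]]) {
            picked[1] = t;
        }
    }
}

int main(void) {
    int priority[NUM_THREADS] = { 3, 7, 1, 5, 2, 8, 4, 6 };
    int picked[THREADS_PER_CYCLE];
    tsu_select(priority, picked);
    printf("fetch from threads %d and %d, %d slots total\n",
           picked[0], picked[1], FETCH_WIDTH);
    return 0;
}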
