2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum

Reducing Cache Pollution of Threaded Prefetching by Controlling Prefetch Distance

Yan Huang 1,2, Zhi-min Gu 1, Jie Tang 1, Min Cai 1, Jianxun Zhang 1, Ninghan Zheng 1

1 School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China

2 Software Engineering College, Zhengzhou University of Light Industry, Zhengzhou, China

[email protected]

[email protected]

Abstract—Threaded prefetching based on Chip Multiprocessors (CMP) issues memory requests for data needed later by the main computation, and therefore may increase the pressure on limited shared cache space and bus bandwidth. In our earlier work, we proposed an effective threaded prefetching technique that selects a proper prefetch distance for a specific application to improve the timeliness of prefetching. In this paper, we first estimate the upper limit of the prefetch distance for a specific application under our proposed threaded prefetching technique, and then analyze the effect of increasing the prefetch distance on shared cache pollution. Our experimental evaluations indicate that the bounded range of effective prefetch distances can be determined using our method, and that shared cache pollution can be reduced by controlling the prefetch distance in our proposed threaded prefetching technique.

Keywords—Chip Multiprocessor (CMP); threaded prefetching; cache pollution; hot function; performance analysis

I. INTRODUCTION

Long-latency memory access is one of the major performance bottlenecks of modern computing platforms [1]. Cache behavior largely determines system performance because the cache bridges the speed gap between the processor and main memory. To tolerate memory access latency, a plethora of data prefetching techniques has been proposed [2-8]. Data prefetching improves performance by predicting future memory accesses and fetching the corresponding data into the cache before it is accessed, thereby hiding memory access latency. However, these traditional data prefetching techniques can lead to bandwidth waste and cache pollution [4-8]. Moreover, they mainly target array accesses and are not applicable to linked data structure (LDS) traversals with irregular access patterns. With the advent of Chip Multiprocessor (CMP) architectures, thread-based prefetching and speculative execution techniques [9-14, 16-21] utilize a helper thread to boost the performance of the main thread by prefetching data into the cache. Helper thread based prefetching techniques [20-21, 23-29] are promising methods for LDS traversals whose accesses are hard to predict. However, because LDS are traversed in a way that prevents individual accesses from being overlapped, conventional helper thread based prefetching techniques cannot issue prefetches in time if little time is available between data accesses. In our earlier work [35], we proposed an improved threaded prefetching technique, Skip Helper Threaded Prefetching (SP), to overcome this problem.

Our proposed solution staggers and balances memory accesses between the main thread and the helper thread based on the characteristics of the operations in hotspots. By selecting a proper prefetch ratio and prefetch distance for a specific application, it improves the timeliness of prefetching, reduces contention for shared resources and avoids shared cache pollution. The experimental evaluations showed that our proposed mechanism can clearly improve the effectiveness of helper threaded prefetching. In our earlier work, the proper prefetch distance was selected empirically and experimentally to avoid cache pollution. In this paper, the upper limit of the prefetch distance is determined by analyzing the data access characteristics of hot loops, and the effect of a growing prefetch distance on shared cache pollution is evaluated. Specifically, in this paper we make the following contributions:
• First, we analyze the possible shared cache pollution caused by early prefetches when applying our proposed threaded prefetching mechanism.
• Second, we exploit the features of the data access stream to find the upper limit of the prefetch distance for a specific application in our proposed mechanism.
• Third, we demonstrate experimentally that shared cache pollution can be reduced considerably by controlling the prefetch distance within the estimated range.
The rest of the paper is organized as follows: in section 2 we state the background and motivation of our work. The control of the prefetch distance to reduce cache pollution is described in section 3. The experimental methodology is presented in section 4. In section 5, we evaluate and analyze the experimental results. We review related work on prefetching in section 6. Finally, in section 7, conclusions are drawn and future work is discussed.

II. BACKGROUND AND MOTIVATION

In this section, we first describe the Skip helper threaded prefetching technique proposed in our earlier work [35]. Then we analyze the possible cache pollution of the Skip helper threaded prefetching technique, which our proposal builds upon.


A. Skip helper threaded prefetching (SP)
Helper threaded prefetching [20-21, 23-29] is a technique that utilizes a second core or logical processor in a multithreaded system to improve the performance of the main thread. A helper thread executes in parallel with the main thread and generates prefetch requests for the loads that would trigger cache misses in the main thread. The helper thread executes only the load's address computation; as a result, it can issue the load earlier than the main thread. By the time the main thread reaches the problematic load, the corresponding block is hopefully already cached.

Although conventional helper threaded prefetching [10, 23, 24, 28] is an attractive technique for prefetching LDS, it has a major deficiency in its identification of addresses to prefetch: the helper thread generates prefetch requests for all delinquent loads. Greedily prefetching all delinquent loads results in low prefetch efficiency and significantly increased bandwidth consumption when little time is available between the issues of loads, which often happens in memory intensive applications with LDS. To improve the effectiveness of threaded prefetching, our earlier work [35] proposed Skip Helper Threaded Prefetching (SP). Like conventional helper threaded prefetching mechanisms [10, 23, 24, 28], our SP mechanism focuses on loops with heavy cache misses. However, instead of blindly prefetching all delinquent loads, SP is designed to skip some indirect references or second-level direct references of LDS based on profile information, thus improving the timeliness of prefetching. To describe our SP mechanism, we use the following notation for the basic parameters of our solution:
• CALR (Computation/Access Latency Ratio): the ratio of cycles spent on computation to cycles spent on data accesses in the hot loop.
• A_SKI: the prefetch distance of the helper thread, i.e., the number of iterations by which prefetches are scheduled ahead of the main thread in each round.
• A_PRE: the prefetch degree of the helper thread, i.e., the number of iterations for which prefetches are issued in each round.
• RP = A_PRE / (A_SKI + A_PRE): the prefetch ratio of the helper thread.
Figure 1 shows the concept of our SP mechanism. A hotspot code example (showing only the loop and the problematic loads) is given in Figure 1(a). The example in Figure 1(a) shows that the majority of last-level cache misses are caused by the second-level traversals in the inner for-loop. As indicated in our earlier works [21, 35], for LDS programs with low CALR such as Figure 1(a), the helper thread may be too stressed to run ahead of the main thread, yielding no performance gain. This problem can be solved if the helper thread ignores some problem loads. To issue prefetches in time, the helper thread in our solution launches prefetches for only part of the delinquent loads. Figure 1(b) shows the corresponding helper thread code for Figure 1(a). The hot loop is divided into rounds of equal size.

In each round, the helper thread skips the inner loop for A_SKI iterations of the outer loop and then pre-executes the inner loop for A_PRE iterations of the outer loop. In other words, in every round the helper thread performs the full two-level traversal for A_PRE outer iterations after skipping the second-level traversal for A_SKI outer iterations. By presetting appropriate A_PRE and A_SKI according to CALR, we can make sure that the helper thread runs neither behind nor too far ahead of the main thread.

    for (curr_node = nodelist; curr_node; curr_node = curr_node->next) {   /* outer loop */
        for (j = 0; j < ......; j++) {                                      /* inner loop */
            ......
            count = (other_node->from_length)++;        /* delinquent load */
            otherlist = other_node->from_values;        /* delinquent load */
            ......
        }
    }

(a)

    ......
    for (i = 0; i < A_SKI && node_index; i++, node_index = node_index->next)
        ;   /* A_SKI iterations of the outer loop, omitting the inner loop */
    for (i = 0; i < A_PRE && node_index; i++, node_index = node_index->next) {
        /* A_PRE iterations of the outer loop, executing the inner loop */
        for (tmp_j = 0; tmp_j < tmp_degree && flag; tmp_j++) {              /* inner loop */
            ......
            tmp_count = tmp_other_node->from_length;      /* prefetch load */
            tmp_otherlist = tmp_other_node->from_values;  /* prefetch load */
            ......
        }
    }
    ......

(b)

Figure 1. Constructing a prefetching helper thread. (a) Hotspot of EM3D. (b) Helper thread of EM3D.

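The code in Figure 1(b) runs as a separate thread on another core of the CMP. As a concrete illustration only, the following is a minimal sketch, assuming a pthread-based implementation, of how the main thread could spawn such a prefetching helper before entering the hot loop; the names run_helper, hot_loop and run_with_sp are illustrative and are not taken from our implementation.

    #include <pthread.h>

    /* Illustrative only: spawn the prefetching helper on another core, then run
     * the main hot loop. The helper body corresponds to Figure 1(b), the main
     * loop to Figure 1(a); both traverse the same node list. */
    extern void *run_helper(void *nodelist);   /* hypothetical, body as in Figure 1(b) */
    extern void  hot_loop(void *nodelist);     /* hypothetical, body as in Figure 1(a) */

    void run_with_sp(void *nodelist)
    {
        pthread_t helper;
        /* Launch the helper first so its prefetches can run ahead of the
         * main computation; on a CMP the OS schedules it on an idle core. */
        pthread_create(&helper, NULL, run_helper, nodelist);
        hot_loop(nodelist);                    /* main thread computation */
        pthread_join(helper, NULL);
    }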
In our SP mechanism, the helper thread is designed to share the load operations with the main thread instead of taking over all of them. Our SP mechanism attempts to find the best A_PRE and A_SKI according to CALR so that the memory load latency of the helper thread can be overlapped with the main thread's work as much as possible. In this way, one part of the prefetches is overlapped with the main thread's arithmetic computations, and the other part is overlapped with the main thread's memory accesses. This is the key idea of our SP mechanism.

B. Determining A_SKI and RP for SP
To maximize the performance of SP, the selection of A_SKI and RP is most important. The right RP improves the parallelism between the helper thread and the main thread, and the proper A_SKI improves the timeliness of prefetches and reduces cache pollution. The selection of proper A_SKI and RP values is described in our earlier work [35]. For our targeted applications with CALR close to 0, we use RP ≈ 0.5 (i.e., A_SKI = A_PRE), which means that the helper thread takes over half of the problem loads from the main thread. For our targeted applications with CALR higher than 1, we use RP ≈ 1 (i.e., A_SKI = 0); in that case the helper thread takes over all of the load operations from the main thread, just like conventional helper threaded prefetching. In our earlier work [35], the specific range of A_SKI was first determined empirically, and then the best values were obtained by experiments. An illustrative sketch of this selection rule is given below.

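The sketch below is ours, not code from the paper's toolchain: it maps a measured CALR to a prefetch ratio following the two endpoint cases stated above and then splits a fixed round length into A_PRE and A_SKI. The round length and the linear interpolation between the endpoints are assumptions made only for illustration.

    /* Illustrative sketch: derive A_PRE/A_SKI from CALR for one round.
     * Endpoints follow the text: CALR close to 0 -> RP = 0.5, CALR >= 1 -> RP = 1.
     * The interpolation in between and the round length are assumptions. */
    static void choose_round(double calr, int round_len, int *a_pre, int *a_ski)
    {
        double rp;
        if (calr <= 0.0)       rp = 0.5;
        else if (calr >= 1.0)  rp = 1.0;
        else                   rp = 0.5 + 0.5 * calr;   /* assumed interpolation */

        *a_pre = (int)(rp * round_len + 0.5);   /* iterations the helper pre-executes */
        *a_ski = round_len - *a_pre;            /* iterations the helper skips */
    }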

C. Cache pollution analysis of SP
The selection of A_SKI has an important effect on SP's performance. First, A_SKI must be large enough to ensure timely prefetches. Second, A_SKI must not be so large that it gives rise to ineffective prefetching and cache pollution. In the context of helper threaded prefetching, cache pollution is said to occur when newly fetched data replaces or evicts more useful data from the cache. Cache pollution due to threaded prefetching can happen in several cases:
1. A prematurely prefetched block displaces data in the cache that will be reused by the processor.
2. A prematurely prefetched block displaces data in the cache that has just been fetched by the helper thread but has not yet been used by the processor.
3. A prematurely prefetched block displaces data in the cache that has just been prefetched by the hardware prefetchers but has not yet been used by the processor.
This observation suggests that, to avoid cache pollution from the helper thread in SP, we should select a proper prefetch distance to ensure that the prefetched data does not arrive too early.

Figure 2. Performance effect of prefetch distance on EM3D: runtime, memory accesses, and hot-function L2 misses, normalized to the original results, plotted against prefetch distance.

Figure 2 shows the runtime, memory accesses, and L2 cache misses of SP for EM3D, normalized to those of its original execution. From Figure 2, it can be seen that the runtime, memory accesses, and L2 cache misses show a similar increasing trend with growing prefetch distance, suggesting that a larger prefetch distance introduces cache pollution and degrades the performance of EM3D. In this paper, we analyze the effect of prefetch distance on cache pollution after applying our SP mechanism, aiming to reduce cache pollution and improve the performance of our proposed solution by controlling the prefetch distance.

III. REDUCING CACHE POLLUTION BY CONTROLLING PREFETCH DISTANCE OF SP

In this section, we estimate the upper limit of the prefetch distance to prevent prefetched data from arriving too early, giving a practical guideline for reducing cache pollution by controlling the prefetch distance.

A. Prefetch distance and cache pollution
An application that often accesses cached data mapped to the same cache set is vulnerable to cache pollution, because it needs a large amount of set space to keep the data it actively uses. An application that rarely accesses cached data mapped to the same cache set will not suffer from cache pollution, because it needs very little set space to keep the data it actively uses. In our SP mechanism, the helper thread brings needed data into the cache in advance, and the data fetched by the helper thread must be held in the shared cache for some period of time before it is used by the processor. As mentioned in section 2, the bigger the prefetch distance A_SKI, the larger the active data set, since the prefetched data must be kept in the shared cache for a longer time. If A_SKI is set too large (i.e., prefetches are issued too early), there is a high chance that the prefetched data will displace other useful data, or be displaced itself before use, introducing cache pollution.

B. The upper limit of prefetch distance
To describe the impact of prefetch distance on cache pollution, we identify the execution points of hot loops at which the accessed blocks mapped to the same cache set exceed the number of blocks a shared cache set can hold, which we call Set Affinity.

Definition 1. Set Affinity. Given the cache set address of an accessed block, its Set Affinity is the iteration count of the outer hot loop at which the sequentially accessed blocks mapped to that cache set exceed its capacity.

Definition 2. Original Set Affinity. Original Set Affinity is the Set Affinity when an application runs alone, i.e., the L2 hardware prefetchers are disabled and helper thread prefetching is not applied.

Definition 3. Set Affinity with Helper Thread. Set Affinity with Helper Thread is the Set Affinity when an application runs with helper thread prefetching applied.

The Set Affinity of memory accesses reveals the interference among references from different data access entities, and Set Affinity with Helper Thread is closely related to Original Set Affinity. Assume, for example, that a specific cache set's Original Set Affinity is 40 iterations of the outer hot loop. This means that the cached data in this set will be replaced by a new reference once the program has executed 40 iterations of the outer hot loop, because the main thread is the only data access entity. After applying threaded prefetching, there are at least six data access entities: the main thread, the helper thread, two Streaming Prefetchers and two DPL prefetchers. Among these, the helper thread has a Set Affinity similar to the main thread's due to their similar access streams, while the Set Affinity of the hardware prefetchers is uncertain because of their uncertain impact on the program's behavior. The extra access entities (such as the helper thread) aggravate cache interference. Thus, whether or not hardware prefetchers are involved, we have:

2 × Set Affinity with Helper Thread ≤ Original Set Affinity

This relation indicates that, after enabling hardware prefetchers and applying the helper thread, the cached data in a specific set will be replaced by a new reference after the program has executed at most half of the Original Set Affinity iterations of the outer hot loop. As a result, to avoid cache pollution caused by interfering references from the helper thread, we should have:

Prefetch Distance < Set Affinity with Helper Thread

or

Prefetch Distance < Original Set Affinity / 2

Considering the example above, where a specific cache set's Original Set Affinity is 40 iterations of the outer hot loop: when the prefetch distance of the helper thread is larger than 20, the prefetched data will evict reusable data, or data already fetched by the hardware prefetchers or by the helper thread itself, before it is used, because the data fetched by all access entities exceeds the capacity of the cache set. Thus, to avoid introducing cache pollution, the upper limit of the prefetch distance should be the minimum Set Affinity with Helper Thread.

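As a minimal sketch of how this bound could be computed from a per-set profile, the following function is ours and its inputs are assumptions: an array holding the Original Set Affinity of each touched L2 set, as produced by the profiling described next.

    /* Illustrative: upper limit of prefetch distance (in outer-loop iterations).
     * If per-set "Set Affinity with Helper Thread" were available, the bound would
     * be its minimum; otherwise the relation 2 * SA_helper <= SA_original lets us
     * fall back to min(Original Set Affinity) / 2. */
    static unsigned int prefetch_distance_limit(const unsigned int *orig_sa,
                                                int num_touched_sets)
    {
        unsigned int min_sa = (unsigned int)-1;
        for (int s = 0; s < num_touched_sets; s++)
            if (orig_sa[s] < min_sa)
                min_sa = orig_sa[s];
        return min_sa / 2;   /* e.g. minimum Original Set Affinity 40 -> A_SKI < 20 */
    }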
C. Data access stream profiling
To predict the upper limit of the prefetch distance, we further profile the data access behavior of hot loops and extract the Set Affinity range of the outer hot loop with respect to the L2 cache sets, as defined above. In our earlier work [36], we found that data accesses in our selected hot functions show phase behavior, which results from repeated operations in hot loops or repeated calls to hot functions. Therefore, the profiling mechanism in this paper is implemented using an interval-based burst sampling technique. The data access phases of each hot function are detected first. Then we obtain the data access stream of each phase by interval-based burst sampling. Finally, we analyze the data access stream samples to obtain the Set Affinity range of the hot loop with respect to the L2 cache sets.

    for (each accessed cache block B in the hot function) {
        if (the set address of B is in the list of touched cache sets) {
            if (the block address of B is not in the block address list of the mapped cache set) {
                increase the accessed block count of the mapped cache set by 1
                if (the accessed block count of the mapped cache set is less than the ways of a cache set)
                    append the block address of B to the block address list of the mapped cache set
                else
                    record the iteration count of the outer loop as the Set Affinity of the mapped cache set
            }
        } else {
            append the set address of B to the touched cache set list
            append the block address of B to the block address list of the mapped cache set
            set the accessed block count of the mapped cache set to 1
        }
    }

Figure 3. Analysis algorithm for Set Affinity of outer hot loop.

Figure 3 shows pseudo-code for the Set Affinity analysis. Let B be the currently accessed cache block. If B's mapped cache set has already been touched and B is not yet recorded for that set, we increase the access count of the mapped cache set; otherwise we set the access count of the mapped cache set to 1. If the access count of the mapped cache set reaches the capacity of a shared cache set, we record the current iteration count of the outer hot loop as the Set Affinity of the mapped cache set. This process repeats until the end of the data access stream is reached or Set Affinity has been recorded for all touched cache sets. For each representative data access stream sample of every application, we analyze the Set Affinity of the outer hot loop for all touched cache sets, and then sort the Set Affinity values of all touched cache sets to show the distribution range clearly. The last column in Table 2 gives the Set Affinity range of the outer hot loop to the L2 cache sets in each application. A compact C sketch of this analysis is given below.

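The sketch assumes our own representation of a recorded access stream (an array of byte addresses, each tagged with the outer-loop iteration in which the access occurred) and the cache geometry of Table 1 (4MB per L2 cache, 16 ways, 64B lines, hence 4096 sets). It is illustrative and is not the profiling tool itself.

    #include <stdint.h>

    #define LINE_SIZE   64
    #define NUM_SETS    4096          /* 4MB / (16 ways * 64B) per L2 cache */
    #define NUM_WAYS    16

    typedef struct { uint64_t addr; unsigned outer_iter; } access_t;

    /* sa[s] receives the Set Affinity (outer-loop iteration count) of set s,
     * or 0 if the set never accumulates more distinct blocks than it can hold. */
    static void set_affinity(const access_t *trace, long n, unsigned sa[NUM_SETS])
    {
        static uint64_t blocks[NUM_SETS][NUM_WAYS]; /* distinct block addresses seen */
        static int      count[NUM_SETS];            /* distinct blocks per set       */

        for (int s = 0; s < NUM_SETS; s++) { count[s] = 0; sa[s] = 0; }

        for (long i = 0; i < n; i++) {
            uint64_t blk = trace[i].addr / LINE_SIZE;
            int set = (int)(blk % NUM_SETS);
            if (sa[set])                       /* affinity already recorded */
                continue;
            int seen = 0;
            for (int w = 0; w < count[set]; w++)
                if (blocks[set][w] == blk) { seen = 1; break; }
            if (seen)
                continue;
            if (count[set] < NUM_WAYS)
                blocks[set][count[set]++] = blk;   /* still fits in the set */
            else
                sa[set] = trace[i].outer_iter;     /* capacity exceeded: record iteration */
        }
    }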
IV. EXPERIMENTAL METHODOLOGY

A. Hardware system
Our threaded prefetching technique is evaluated on a real physical system. As shown in Table 1, we use a system with an Intel Core 2 Quad Q6600 processor, which integrates two dual-core dies; the two cores on each die share an on-chip 4MB unified level 2 (last level) cache. Unless otherwise mentioned, L2 misses in this paper refer to last-level misses.

TABLE I. HARDWARE AND SOFTWARE CONFIGURATION

Processor          Intel Core 2 Quad Processor Q6600
L1 ICache          32KB * 4, 8-way set-associative, 64B line size
L1 DCache          32KB * 4, 8-way set-associative, 64B line size
L2 unified cache   4MB * 2, 16-way set-associative, 64B line size
OS                 Fedora 9 with kernel 2.6.25

B. Benchmarks
The targeted applications of our experiments are mainly memory-intensive applications. To decide the benchmarks used in our experiments, we first ran the entire SPEC2006 and Olden suites under VTune [22] and collected their L2 cache miss profiles. We then selected the applications with a significant number of cycles attributed to L2 cache misses: EM3D and MST from the Olden suite, and MCF from the SPEC2006 suite. Table 2 shows the characteristics of each application. The third column in Table 2 gives the iteration count or iteration range of the outer hot loop in each hot function ("0" denotes that the hot function contains only a single-level loop). The benchmarks are compiled with gcc-4.3 at optimization level O2.

TABLE II. BENCHMARK CHARACTERISTICS

Benchmark   Input                      Iterations of outer hot loop   SA(L, Sx) (Set Affinity range)
EM3D        4*10^5 nodes, arity 128    4*10^5                         [40, 360]
mcf         ref                        [1.4*10^4, 5*10^4]             [3000, 46000]
mst         1*10^4 nodes               [1, 1*10^4]                    [6300, 10000]

V. EXPERIMENTAL EVALUATION

In this section, we begin with the analysis of the upper limit of prefetch distance using the Set Affinity of hot loops to L2 cache sets. Then we evaluate the effect of prefetch distance on cache pollution in SP through experimental results.

A. The upper limit of prefetch distance
As discussed in section 3, to avoid cache pollution we can bound the prefetch distance using the Set Affinity of hot loops to L2 cache sets. As shown in Table 2, the Set Affinity of the hot loop in EM3D to L2 cache sets varies within the small range [40, 360], suggesting that the prefetch distance for EM3D should be less than 20 to avoid cache pollution. The Set Affinity of the hot loop in MCF to L2 cache sets stays within the range [3000, 46000], which indicates that the prefetch distance for MCF should be less than 1500 to minimize cache pollution. For MST, the Set Affinity of its hot loop to L2 cache sets is always within the range [6300, 10000]. As a result, the prefetch distance for MST should be less than 3150 to avoid cache pollution.

Figure 4. Behavior change of EM3D with increasing prefetch distance. (a) Access behavior. (b) Normalized runtime.

B. The effect of prefetch distance on cache pollution in SP
In this section, we run our SP mechanism with a growing prefetch distance to evaluate the effect of prefetch distance on cache pollution. As described in section 2.2, we select a prefetch ratio of 0.5 for our three targeted applications (EM3D, MCF and MST) according to their very low CALR. Moreover, we estimated the upper limit of the prefetch distance in section 5.1 based on profiling and analysis. Next, we examine the cache pollution of SP by gradually increasing the prefetch distance. To measure the effect of the prefetch distance in SP on cache pollution, we use the following notation for the measured access behavior (a small illustrative sketch restating these categories follows the list):
• Totally cache miss: the demanded data does not arrive in the cache until its memory request is serviced.
• Totally cache hit: the demanded data is already held in the cache.
• Partially cache hit: the demanded data arrives in the cache after its memory request is issued but before the request is serviced.
• Memory access: the demanded data misses in the L2 cache; this includes totally cache misses and partially cache hits.
The effectiveness of SP in tolerating memory latency depends on its ability to decrease totally cache misses and increase cache hits. In other words, the proper prefetch distance should be selected to eliminate totally L2 cache misses and to turn partially L2 cache hits into totally cache hits. If cache pollution happens, the prefetched data will displace other data that will be used in the future; as a result, the prefetched data will not increase totally L2 cache hits or decrease memory accesses.

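The following small classifier is only an illustration of the three categories above, with assumed timestamp parameters; it only restates the definitions and is not how the measurements in our experiments were collected.

    /* Illustrative only: restates the three access-behavior categories.
     * arrive_time:  when the demanded block becomes present in the L2 cache
     *               (e.g. filled by an earlier prefetch);
     * issue_time / service_time: when the demand request is issued / completed.
     * "Memory access" = TOTALLY_MISS + PARTIALLY_HIT. */
    typedef enum { TOTALLY_HIT, PARTIALLY_HIT, TOTALLY_MISS } access_kind;

    static access_kind classify(long arrive_time, long issue_time, long service_time)
    {
        if (arrive_time <= issue_time)
            return TOTALLY_HIT;      /* data already held in cache when demanded      */
        if (arrive_time < service_time)
            return PARTIALLY_HIT;    /* data arrives while the request is in flight   */
        return TOTALLY_MISS;         /* data only arrives when the request is serviced */
    }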
Figure 4(a), Figure 5(a) and Figure 6(a) show the change of totally L2 cache hits, totally L2 cache misses and partially L2 cache hits for EM3D, MCF and MST respectively. The results shown in Figure 4(a), Figure 5(a) and Figure 6(a) are normalized to the memory accesses of the original programs. The difference between the changes of access behavior with different prefetch distances gives an idea of the amount of pollution.

Figure 5. Behavior change of MCF with increasing prefetch distance. (a) Access behavior. (b) Normalized runtime.

Figure 6. Behavior change of MST with increasing prefetch distance. (a) Access behavior. (b) Normalized runtime.

From Figure 4(a), it can be seen that SP for EM3D eliminates a large fraction (up to 41.27% of original memory accesses) of totally L2 cache misses and increases partially L2 cache hits dramatically (up to 78.56% of original memory accesses), but noticeably decreases totally L2 cache hits (by up to 48.38% of original memory accesses). The decrease of totally L2 cache hits demonstrates that SP causes L2 cache pollution for EM3D. It can also be observed that selecting a larger prefetch distance further decreases totally L2 cache hits. Therefore, a large prefetch distance in SP for EM3D causes more L2 cache pollution than a smaller one.

From Figure 5(a), it can be seen that SP for MCF eliminates a large part (up to 17.29% of original memory accesses) of totally L2 cache misses, and increases partially L2 cache hits (up to 13.45% of original memory accesses) and totally L2 cache hits (up to 6.74% of original memory accesses). The decrease of totally L2 cache misses and the increase of totally L2 cache hits show that SP is effective in tolerating memory latency for MCF. However, totally L2 cache hits decrease as the prefetch distance grows, which suggests that a larger prefetch distance in SP for MCF causes more L2 cache pollution than a smaller one.

From Figure 6(a), it can be seen that SP for MST eliminates many totally L2 cache misses (up to 27.83% of original memory accesses) and increases partially L2 cache hits (up to 29.71% of original memory accesses). However, totally L2 cache hits increase with a smaller prefetch distance but decrease with a larger one. This implies that a smaller prefetch distance is more effective in SP for MST, and that a larger prefetch distance causes more L2 cache pollution than a smaller one.

Figure 4(b), Figure 5(b) and Figure 6(b) show the performance results for EM3D, MCF and MST respectively. The runtimes shown in Figure 4(b), Figure 5(b) and Figure 6(b) are normalized to the runtime of the original programs. The difference between the normalized runtimes with different prefetch distances shows the effect of prefetch distance on performance. From Figure 4(b), Figure 5(b) and Figure 6(b), we can see that, in general, the runtimes increase with growing prefetch distance. The reason is that a larger prefetch distance leads to a large fraction of early prefetches, which in turn causes more cache pollution, wastes precious bandwidth and limits the effectiveness of SP. It is worth noting that cache pollution can occur even when the prefetch distance of SP is smaller than the Set Affinity of the hot loops to L2 cache sets, because the initial state of the L2 cache is uncertain (it is not empty), which also affects the cache behavior of the applications. Specifically, the sensitivity of runtime to prefetch distance in EM3D is larger than that in MCF and MST. As mentioned in section 5.1, relative to the hot function of EM3D, the hot functions of MCF and MST have larger Set Affinity to L2 cache sets. As a result, the runtime does not change much when the prefetch distance exceeds 800 for MCF and 30 for MST. We do not display results for prefetch distances beyond 2000 for MCF and 100 for MST due to limited space. In all cases, SP maximizes performance when the prefetch distance is within the range estimated in section 5.1, which shows that controlling the prefetch distance by the Set Affinity of hot loops to L2 cache sets is effective. We will discuss the performance influence of memory access patterns in our future work.

The results in Figure 4, Figure 5 and Figure 6 reveal that the major reason for performance degradation with larger prefetch distances is L2 cache pollution: the cache pollution caused by a larger prefetch distance offsets the benefits of timely prefetches. In light of these results, to improve the performance of SP, we should control the prefetch distance to eliminate L2 cache pollution as much as possible without compromising the timeliness of prefetching.

VI. RELATED WORK

Several prior studies have considered employing a helper thread to hide memory latency [10, 23-32]. Lee et al. [24] use a helper thread based prefetching scheme for loosely-coupled processors and present a synchronization mechanism to prevent the helper thread from running too far ahead of or behind the application thread. Kim et al. [26] employ a similar scheme in which helper threads running in spare hardware contexts start memory operations ahead of the main computation so that memory latency can be tolerated. Liao et al. [27] propose post-pass binary analysis to construct p-slices at the binary level. Song et al. [23] propose a detailed compiler framework that generates helper threaded prefetching for dual-core SPARC microprocessors, in which candidate loops are selected carefully by a profitability test and two-version code is generated for cases where no profile feedback information is available, trying to ensure that the helper thread does useful work. Collins et al. [29] exploit Dynamic Speculative Precomputation (DSP) to identify delinquent loads and construct chaining threads via hardware code slicing. Ganusov and Burtscher [30] propose a prefetching technique called Future Execution, which is similar to DSP except that it uses predicted values as initial live-in values to start prefetching threads a few iterations ahead. These threaded prefetching techniques are successful in tolerating the memory latency of general memory intensive applications. However, they lack a cache pollution analysis of the helper thread. This drawback can lead to a large fraction of early prefetches, which in turn limits the


effectiveness of prefetching and wastes precious bandwidth. In this paper, we conduct a cache pollution analysis for our proposed threaded prefetching technique, aiming to reduce cache pollution by controlling the prefetch distance.

Lu et al. [25] design and implement the ADORE dynamic optimization framework, which generates helper thread prefetches at runtime using information obtained from hardware monitors. Lu et al. [28] dynamically construct p-slices via a dynamic optimization system running on an idle core. Kim et al. [10] reduce the p-threads' impact on the main thread's performance by judiciously invoking p-threads. Zhang et al. [31] present several techniques to accelerate pre-computation threads, including collocation of p-threads with hot traces, dynamic stride prediction, and automatic adaptation of run-ahead and jump-start distance. Chilimbi et al. [32] describe a dynamic software prefetching framework that implements and evaluates a dynamic prefetching scheme for general-purpose programs, targeting a program's consecutive data reference sequences that frequently repeat in the same order. Although these works differ in how they generate helper threads and synchronize them, they are similar in constructing helper threads at runtime and in trying to make the helper thread prefetch all delinquent loads. Our approach differs from these efforts by prefetching only part of the delinquent loads to avoid a lagging helper thread. The threaded prefetching mechanisms above are all driven by cache miss behavior at runtime, while our work focuses on the reduction of cache pollution by controlling the prefetch distance.

VII. CONCLUSIONS AND FUTURE WORK

In our earlier work, we proposed an effective threaded prefetching technique that improves the timeliness of prefetching by presetting a proper prefetch distance. In this paper, we have described a method to reduce cache pollution by controlling the prefetch distance in our proposed threaded prefetching technique. Based on data access streams obtained from a low-overhead profile run of the applications, we estimated the upper limit of the prefetch distance by analyzing the affinity of hot loops to the shared cache sets. To guide the selection of a proper prefetch distance for our proposed threaded prefetching technique, we also examined the effect of a growing prefetch distance on shared cache pollution. The evaluation results show that a larger prefetch distance introduces more shared cache pollution. By controlling the prefetch distance within the estimated range, L2 cache pollution can be reduced considerably and the effectiveness of our proposed threaded prefetching technique can be improved notably. In our future research, we will analyze the effect of memory access patterns on prefetching performance.

ACKNOWLEDGMENT

We would like to thank all the members of our research group for their contributions, as well as the anonymous reviewers. This research is supported by MOE-Intel-08-10; the Natural Science Foundation of China (No. 61070029); the basic and frontier technology research projects of Henan in 2010 (No. 102300410110); and the Beijing Key Discipline Program.

REFERENCES

[1] Smith, A.J.: Cache Memories. Comput. Surv. 14(3), 473-530 (1982)
[2] Chen, T.F., Baer, J.-L.: Effective Hardware-Based Data Prefetching for High-Performance Processors. IEEE Trans. Comput. 44(5), 609-623 (1995)
[3] Mowry, T.: Tolerating Latency in Multiprocessors through Compiler Inserted Prefetching. ACM Trans. Comput. Syst. 16(1), 55-92 (1998)
[4] Collins, J.D., Sair, S., Calder, B., Tullsen, D.M.: Pointer cache assisted prefetching. In: MICRO-35, pp. 62-73 (2002)
[5] Chen, Y., Zhu, H., Sun, X.-H.: An Adaptive Data Prefetcher for High-Performance Processors. In: CCGRID 2010, pp. 155-164 (2010)
[6] Srinath, S., Mutlu, O., Kim, H., Patt, Y.: Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers. In: Proc. of the 13th International Symposium on High Performance Computer Architecture, pp. 63-74 (2007)
[7] Peir, J., Lai, S., Lu, S., Stark, J., Lai, K.: Bloom Filtering Cache Misses for Accurate Data Speculation and Prefetching. In: Proc. of the 16th International Conference on Supercomputing, pp. 189-198 (2002)
[8] Ebrahimi, E., Mutlu, O., Lee, C.J., Patt, Y.N.: Coordinated control of multiple prefetchers in multi-core systems. In: Proceedings of MICRO, pp. 316-326 (2009)
[9] Zilles, C., Sohi, G.: Execution-based prediction using speculative slices. In: ISCA-28, pp. 2-13 (2001)
[10] Kim, D., Liao, S.S., Wang, P.H., del Cuvillo, J., Tian, X., Zou, X., Wang, H., Yeung, D., Girkar, M., Shen, J.P.: Physical experimentation with prefetching helper threads on Intel's Hyper-Threaded processors. In: Proceedings of the 2004 Annual Conference on Code Generation and Optimization (CGO-3), pp. 27-38, March 2004
[11] Tang, J., Liu, S., Gu, Z., Liu, C., Gaudiot, J.L.: Prefetching in Embedded Mobile Systems Can Be Energy-Efficient. IEEE Comput. Archit. Lett. 10(1), 8-11 (2011)
[12] Liu, S., Eisenbeis, C., Gaudiot, J.L.: Speculative Execution on GPU: An Exploratory Study. In: Proceedings of the 39th International Conference on Parallel Processing, pp. 453-461, September 2010
[13] Liu, S., Eisenbeis, C., Gaudiot, J.L.: A Theoretical Framework for Value Prediction in Parallel Systems. In: Proceedings of the 39th International Conference on Parallel Processing, pp. 11-20, September 2010
[14] Sohi, G.S., Roth, A.: Speculative multithreaded processors. IEEE Comput. 34(4), 66-73 (2001)
[15] Doweck, J.: White Paper: Inside Intel Core Microarchitecture and Smart Memory Access. Intel Corporation (2006)
[16] Liu, S., Gaudiot, J.L.: Potential Impact of Value Prediction on Communication in Many-Core Architectures. IEEE Trans. Comput. 58(6), 759-769 (2010)
[17] Byna, S., Chen, Y., Sun, X.H.: A Taxonomy of Data Prefetching Mechanisms. J. Comput. Sci. Technol. 24(3), 405-417 (2009)
[18] Zilles, C., Sohi, G.: Master/slave speculative parallelization. In: Proceedings of the 35th International Symposium on Microarchitecture, pp. 85-96, November 2002
[19] Liu, S., Eisenbeis, C., Gaudiot, J.L.: Value Prediction and Speculative Execution on GPU. Int. J. Parallel Program. 39(5), 533-552 (2010)
[20] Gu, Z., Zheng, N., Zhang, Y.: The Stable Conditions of a Task-Pair with Helper-Thread in CMP. In: Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications, pp. 125-130 (2009)
[21] Huang, Y., Gu, Z.: Performance Analysis of Prefetching Thread for Linked Data Structure in CMPs. In: Proceedings of the International Conference on Computational Intelligence and Software Engineering, pp. 1-4 (2009)
[22] Intel VTune, http://www.intel.com/cd/software/products/apac/zho/245112.htm
[23] Song, Y., Kalogeropulos, S., Tirumalai, P.: Design and Implementation of a Compiler Framework for Helper Threading on Multi-Core Processors. In: Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques (PACT'05), pp. 99-109 (2005)
[24] Lee, J., Jung, C., Lim, D., Solihin, Y.: Prefetching with Helper Threads for Loosely Coupled Multiprocessor Systems. IEEE Trans. Parallel Distrib. Syst. 20(9), 1309-1324 (2009)
[25] Lu, J., Chen, H., Yew, P.C., Hsu, W.C.: Design and implementation of a lightweight dynamic optimization system. J. Instr. Level Parallel. 6, 1-24 (2004)
[26] Kim, D., Yeung, D.: Design and Evaluation of Compiler Algorithms for Pre-Execution. In: ASPLOS, pp. 159-170 (2002)
[27] Liao et al.: Post-Pass Binary Adaptation for Software-Based Speculative Pre-computation. In: PLDI, pp. 117-128 (2002)
[28] Lu et al.: Dynamic Helper Threaded Prefetching on the Sun UltraSPARC CMP Processor. In: MICRO, pp. 93-104 (2005)
[29] Collins, J.D., Tullsen, D.M., Wang, H., Shen, J.P.: Dynamic Speculative Precomputation. In: 34th International Symposium on Microarchitecture, pp. 306-317 (2001)
[30] Ganusov, I., Burtscher, M.: Future execution: A hardware prefetching technique for chip multiprocessors. In: International Conference on Parallel Architectures and Compilation Techniques, pp. 350-360, September 2005
[31] Zhang, W., Tullsen, D.M., Calder, B.: Accelerating and adapting pre-computation threads for efficient prefetching. In: Proceedings of the 13th Symposium on High-Performance Computer Architecture, pp. 85-95 (2007)
[32] Chilimbi, T.M., Hirzel, M.: Dynamic hot data stream prefetching for general-purpose programs. In: Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 199-209 (2002)
[33] Lee, C., Mutlu, O., Narasiman, V., Patt, Y.: Prefetch-Aware DRAM Controllers. In: Proceedings of MICRO, pp. 200-209 (2008)
[34] Srinath et al.: Feedback directed prefetching: Improving the performance and bandwidth-efficiency of hardware prefetchers. In: HPCA-13, pp. 63-74 (2007)
[35] Huang, Y., Tang, J., Gu, Z., Cai, M., Zhang, J., Zheng, N.: The performance optimization of threaded prefetching for linked data structures. International Journal of Parallel Programming (2011). DOI: 10.1007/s10766-011-0172-7
[36] Gu, Z., Fu, Y., Zheng, N., Zhang, J., Cai, M., Huang, Y., Tang, J.: Improving Performance of the Irregular Data Intensive Application with Small Computation Workload for CMPs. In: Proceedings of the 40th International Conference on Parallel Processing Workshops (ICPPW) (2011)

