PMU Guided Structure Data-Layout Optimization - IEEE Xplore

TSINGHUA SCIENCE AND TECHNOLOGY ISSN 1007-0214 05/15 pp145-150 Volume 16, Number 2, April 2011

PMU Guided Structure Data-Layout Optimization

YAN Jianian (闫家年), CHEN Wenguang (陈文光)**, ZHENG Weimin (郑纬民)

Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China

Abstract: Existing methods of obtaining runtime feedback for structure data-layout optimization have several drawbacks, such as large overhead and difficulty composing training sets. As a result, structure data-layout optimization is not widely used. To overcome these drawbacks, a performance monitoring unit (PMU) sampling method was developed with much less overhead and better portability and usability. An algorithm was developed to correct incomplete and inaccurate PMU sampling. With the corrected PMU feedback, a structure data-layout optimizer achieved a 45.1% performance improvement compared to a design without data-layout optimization, which is 97.6% of the performance improvement achieved with instrumented feedback. Calculation of the PMU feedback increased the execution time by 12.3%, compared to the overhead for the instrumented feedback of 341.5%. Tests show that the PMU feedback is efficient and effective for structure data-layout optimization.

Key words: structure data layout; performance monitoring unit (PMU); compiler optimization; program locality

Received: 2010-03-15; revised: 2010-10-30
** To whom correspondence should be addressed. E-mail: [email protected]; Tel: 86-10-62785592

Introduction

Structure data-layout optimization is an important optimization method for modern compilers. It changes the layout of structure fields to improve cache line utilization in a program. Several studies[1-9] have demonstrated its effectiveness, with reported performance improvements as high as 96%[3].

The structure data layout can be optimized based on estimates of the program’s memory access behavior. Most current structure data-layout optimization methods acquire these estimates from program runtime feedback, obtained either by analyzing memory access traces or by instrumenting programs to collect basic block (BB) execution counts. However, these two methods have several drawbacks that limit the adoption of structure data-layout optimization in industrial applications. Examples include the following.

High overhead. Programs usually run ten to several thousand times slower while memory access traces are collected. Instrumenting a program usually slows it threefold. Such overhead is not acceptable for most developers. Furthermore, the runtime overhead may change the execution path of time-critical applications, which may result in misleading optimization results.

Difficulty finding a proper working set. Small but representative input data sets for industrial applications are difficult to compose. For example, input for a web service application needs representative data for both the database content and customer queries.

Representative instrumented feedback has many restrictions. Utilizing instrumented feedback requires that the source code, the compiling options, and the compiler version for the final version be the same as for the instrumented binary.

Previous studies[10-13] have analyzed techniques for utilizing performance monitoring unit (PMU) sampling to compose an edge profile of the control flow graph (CFG) and to guide compiler optimization. One problem of PMU sampling is its imprecision. Levin et al.[12] proposed a method to improve the PMU feedback. They assumed that creating a legal network flow while minimizing the weighted changes to the given flow will provide a better estimate of the actual flow. They constructed a fixup graph from the program’s CFG and used a minimum-cost circulation (MCC) algorithm[14] to find the fixup vector. Chen et al.[13] extended their work with a machine learning algorithm that finds additional hardware performance events to help overcome sources of error. One such source is the aggregation effect[13], in which instructions that trigger long stalls, such as cache misses, receive abnormally high sample counts.

1 Correcting PMU Feedback for Data-Layout Optimization

The raw PMU feedback data cannot be used for structure data-layout optimization because of its inaccuracy. Our evaluations show that many of the benchmarks perform worse with the data-layout strategy derived from the raw PMU feedback. Using the PMU feedback correction algorithm of Levin et al.[12] resulted in data-layout strategies whose performance was better than, or at least close to, that without data-layout optimization. However, three out of nine benchmarks had much worse performance with the corrected feedback than with the data-layout strategies derived from instrumented feedback. Thus the PMU feedback needs improvement.

1.1 Removing flow consistency restrictions

In structure data-layout optimization, BB execution counts are used to calculate the number of memory accesses that have the same preferences for a specified data-layout strategy. The final data-layout strategy is determined by the number of memory accesses sharing the same data-layout preference. Therefore, for a structure data layout, the difference that matters between the PMU feedback and the instrumented feedback lies in the ratio of the execution count of each BB to the total for all the BBs in the entire program. The BB execution count proportions for the raw PMU feedback, the corrected PMU feedback[12], and the instrumented feedback were therefore compared to find ways to improve the PMU feedback.

One problem with the raw PMU feedback is that some BB execution counts are missing, especially for BBs that have relatively few instructions. The MCC[12] correction algorithm is capable of alleviating the difference between the PMU feedback and the instrumented feedback introduced by missing BB execution counts.

One problem with the corrected PMU feedback is that some BB execution counts are over-counted. For BBs whose predecessors or successors missed being counted in the PMU feedback, the MCC algorithm over-corrects the execution counts, which aggravates the difference between the PMU feedback and the instrumented feedback.

The MCC algorithm requires that the corrected graph be legal. For each BB, the sum of the execution counts of all incoming edges should be equal to the sum for the outgoing edges and should be equal to the BB execution count. The legality of the CFG is critical for some optimizations, for example, instruction scheduling relying on branch prediction. However, the legality is not critical for structure data-layout optimization.

To overcome the over-correction of the execution counts by the MCC algorithm, heuristics are used that are free from the restrictions of graph legality. For simplicity, the first heuristic is based on the MCC, but without negative values in the fixup vector, so that no BB execution count is reduced. This heuristic is named the MCCN (minimum cost circulation without negative fixup values). This algorithm gives better structure data layouts than the MCC algorithm for two of the benchmarks.

1.2 Weighted BB execution count compensation

The second heuristic only eliminates negative fixup values for BBs whose predecessors or successors have no execution counts in the PMU feedback. This heuristic alleviates the difference between the PMU feedback and the instrumented feedback. However, results with structure data-layout optimization with this heuristic were worse than with the MCCN heuristic.

In MCCN, BBs with instructions that triggered long stalls are over-sampled due to the aggregation effect observed by Chen et al.[13] The second heuristic alleviates such over-sampling; however, tests show that this correction results in worse performance. Therefore memory accesses causing cache misses should be properly favored to improve the data-layout strategy.
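The legality condition and the MCCN idea from Section 1.1 can be illustrated with a minimal Python sketch. It assumes that a fixup vector has already been produced by some MCC solver (not shown) and, as a simplification, that the fixup holds one value per BB. The function names `is_legal` and `apply_mccn` and the four-block toy CFG are hypothetical and are not taken from the authors' implementation.

```python
from typing import Dict, Tuple

Edge = Tuple[str, str]


def is_legal(bb_counts: Dict[str, int], edge_counts: Dict[Edge, int],
             entry: str, exit_bb: str) -> bool:
    """Flow conservation: incoming edges, outgoing edges, and the BB count agree."""
    for bb, count in bb_counts.items():
        incoming = sum(c for (_, dst), c in edge_counts.items() if dst == bb)
        outgoing = sum(c for (src, _), c in edge_counts.items() if src == bb)
        # The entry block has no incoming edges; the exit block has no outgoing ones.
        if bb != entry and incoming != count:
            return False
        if bb != exit_bb and outgoing != count:
            return False
    return True


def apply_mccn(raw_counts: Dict[str, int], fixup: Dict[str, int]) -> Dict[str, int]:
    """MCCN-style correction: negative fixup entries are dropped, so no count shrinks."""
    return {bb: raw_counts[bb] + max(0, fixup.get(bb, 0)) for bb in raw_counts}


# Toy diamond CFG: A -> B, A -> C, B -> D, C -> D (hypothetical numbers).
raw_counts = {"A": 100, "B": 0, "C": 55, "D": 90}   # B was missed by sampling
fixup = {"B": 50, "C": -5, "D": 10}                 # assumed output of an MCC solver
edge_counts = {("A", "B"): 50, ("A", "C"): 50, ("B", "D"): 50, ("C", "D"): 50}

mcc_counts = {bb: raw_counts[bb] + fixup.get(bb, 0) for bb in raw_counts}
mccn_counts = apply_mccn(raw_counts, fixup)

print(mcc_counts, is_legal(mcc_counts, edge_counts, "A", "D"))
print(mccn_counts, is_legal(mccn_counts, edge_counts, "A", "D"))
```

In this toy case the full fixup yields a legal flow, while the MCCN result keeps the larger raw count for block C and is therefore no longer legal, which the text argues is acceptable for structure data-layout optimization.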


The weighted minimum cost circulation without negative fixup values (WMCCN) heuristic was then developed based on the MCCN but with increased weights for memory accesses causing cache misses. In addition to instruction retirements, the WMCCN heuristic samples the cache miss counts for each BB to calculate the BB execution counts. For a basic block $BB_i$, assume its execution count in MCCN is $IR_i$ and its cache miss count is $CM_i$. Assume the sum of the execution counts of all BBs for the program is $IR_a$ and the total cache miss count for the program is $CM_a$. Let $Cnt_i$ be the BB execution count for $BB_i$ after the WMCCN heuristic is applied. The WMCCN algorithm is then as described by Algorithm 1.

Algorithm 1 Correction of execution counts by weighting the BB with the cache misses.

input: $IR_i$, $CM_i$, $i \in \{BB\}$, $IR_a$, $CM_a$
output: $Cnt_i$, $i \in \{BB\}$
for all $BB_i \in BB$ do
    if $\frac{CM_i}{CM_a} - \frac{IR_i}{IR_a} > 0$ then
        $Cnt_i = IR_i \left( 1 + \frac{CM_i}{CM_a} - \frac{IR_i}{IR_a} \right)$;
    end
    else
        $Cnt_i = IR_i$;
    end
end
return.
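Algorithm 1 translates almost directly into code. The following Python sketch is one possible rendering, not the authors' implementation; the function name `wmccn`, the dictionary-based inputs, and the guard against zero totals are assumptions added for illustration. The totals $IR_a$ and $CM_a$ are computed as sums over all BBs, as defined in the text.

```python
from typing import Dict


def wmccn(ir: Dict[str, float], cm: Dict[str, float]) -> Dict[str, float]:
    """Weight each BB count by its excess share of cache misses (Algorithm 1).

    ir[bb] is the MCCN execution count IR_i, cm[bb] the cache miss count CM_i.
    """
    ir_total = sum(ir.values())   # IR_a
    cm_total = sum(cm.values())   # CM_a
    corrected = {}
    for bb, ir_i in ir.items():
        excess = 0.0
        # Guard against empty profiles (an assumption, not part of Algorithm 1).
        if ir_total > 0 and cm_total > 0:
            excess = cm.get(bb, 0.0) / cm_total - ir_i / ir_total
        if excess > 0:
            corrected[bb] = ir_i * (1.0 + excess)
        else:
            corrected[bb] = float(ir_i)
    return corrected


# Hypothetical counts: bb3 executes rarely but causes most of the cache misses.
ir_counts = {"bb1": 600, "bb2": 300, "bb3": 100}
cm_counts = {"bb1": 10, "bb2": 10, "bb3": 80}
weighted = wmccn(ir_counts, cm_counts)
print(weighted)
```

Here only `bb3` has a larger share of cache misses (0.8) than of executions (0.1), so its count is scaled by 1.7, while the other two blocks keep their MCCN counts.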

2 System Design and Implementation

The system was implemented based on the previous algorithm ASLOP[9]. Figure 1 shows the system framework. Instead of using instrumented feedback, all the feedback was obtained from the PMU, including instruction retirement events and cache miss events for different cache levels. The cache miss events are used to calculate the intra- and inter-instance affinity as well as the weighting of the BB execution counts. The instruction retired events are used to generate preliminary BB execution counts. All the PMU event counts are first mapped from the instruction address to the source code position. The mapping information is available as the debug information in the program binary. The mapped PMU event counts are written to a feedback file which is used later by the compiler.

 

Fig. 1 System framework
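The mapping of sampled instruction addresses to source code positions can be sketched as follows. This Python example is only a rough illustration: it assumes a line table of (start address, file, line) entries has already been extracted from the debug information by some other tool, and the event names, the file name `example.c`, and the helper names are all hypothetical.

```python
import bisect
from collections import Counter
from typing import Callable, List, Optional, Tuple

SourcePos = Tuple[str, int]


def build_lookup(line_table: List[Tuple[int, str, int]]) -> Callable[[int], Optional[SourcePos]]:
    """Return a function mapping an instruction address to (file, line)."""
    table = sorted(line_table)
    starts = [addr for addr, _, _ in table]

    def lookup(addr: int) -> Optional[SourcePos]:
        # Find the last entry whose start address is <= addr.
        idx = bisect.bisect_right(starts, addr) - 1
        if idx < 0:
            return None  # address lies before the first known entry
        _, path, line = table[idx]
        return (path, line)

    return lookup


def aggregate(samples: List[Tuple[str, int]], lookup) -> Counter:
    """Count samples per (source position, event); unmapped samples are dropped."""
    counts: Counter = Counter()
    for event, addr in samples:
        pos = lookup(addr)
        if pos is not None:
            counts[(pos, event)] += 1
    return counts


# Hypothetical line table and PMU samples.
line_table = [(0x400, "example.c", 10), (0x410, "example.c", 11), (0x420, "example.c", 15)]
samples = [("INST_RETIRED", 0x404), ("INST_RETIRED", 0x412),
           ("CACHE_MISS", 0x412), ("INST_RETIRED", 0x3F0)]
counts = aggregate(samples, build_lookup(line_table))
print(counts)
```

A real line table would also carry end addresses, so that addresses past the last entry are not attributed to it; that detail is omitted here for brevity.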

In the local phase, the program source code is processed procedure-by-procedure. The PMU event counts are attached to the BBs using the source code position information present in the compiler’s intermediate representation (IR). The instruction retired counts for a BB are summed and divided by the number of instructions that have been sampled. The quotient is used as an estimate of the BB execution count, and the MCCN heuristic is applied as the first-phase correction. The characteristics of the memory accesses are then summarized and safety checks are performed. All the information is saved in the IR for further processing.

In the global phase, the BB execution counts are further corrected using the WMCCN heuristic. The data-layout strategies are generated for the structure types that can be safely transformed. The data-layout strategies are then transferred to the transform phase.
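The per-BB estimate from the local phase amounts to a simple average. The sketch below assumes that "the number of instructions that have been sampled" means the instructions in the BB that received at least one sample; that reading, the function name `estimate_bb_count`, and the sample numbers are assumptions made for illustration.

```python
from typing import Sequence


def estimate_bb_count(samples_per_instruction: Sequence[int]) -> float:
    """Average the instruction-retired samples over the sampled instructions of a BB."""
    sampled = [count for count in samples_per_instruction if count > 0]
    if not sampled:
        return 0.0  # the BB was missed entirely by the PMU
    return sum(sampled) / len(sampled)


# Hypothetical BB with four instructions, one of which was never sampled.
estimate = estimate_bb_count([12, 0, 9, 15])
print(estimate)  # (12 + 9 + 15) / 3
```

A BB with no samples at all gets an estimate of zero here, which corresponds to the missing counts that the MCCN correction is meant to compensate for.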

3 Evaluations

The algorithm was evaluated using different cache configurations on two different testing platforms. The configurations for the two platforms, Aries and Virgo, are listed in Table 1.

Table 1 Platform configurations

Platform  CPU           L1D cache per core (KB)  L2 cache per core (KB)  L3 cache (MB)
Aries     Opteron 8214  64                       1024                    None
Virgo     Xeon E5504    32                       256                     4

A total of 9 benchmarks were used in the evaluations. 179.art and 181.mcf were from SPEC CPU2000[15]; 462.libquantum, 429.mcf, and 472.moldyn (a candidate benchmark for the SPEC CPU 2006 suite) were from SPEC CPU2006[16]; and em3d, health, treeadd, and tsp were from the Olden benchmarks[17]. The Olden benchmarks provided a common reference base from previous work on structure data-layout optimization, while the others were from standard benchmark suites to provide insight into complicated programs. The SPEC CPU benchmarks that were not selected failed the legality test, so they were not suitable for data-layout transformation. The benchmarks from the SPEC CPU suites used SPEC’s train data set to get the runtime feedback and the ref data set for the performance evaluations. The differences between the training data set and the testing data set for the benchmarks from the Olden suite are shown in Table 2. The table also gives the memory footprint when running the testing data set on each benchmark.

The benchmarks were compiled with the -O3 -ipa options. Each benchmark was executed 6 times and the average time was used. The normalized execution times are shown in Fig. 2. In the legend, Original represents tests without any data-layout transformation, while ASLOP represents tests with the data-layout strategy generated with instrumented feedback. The MCC, MCCN, and WMCCN results are for data-layout strategies generated with PMU feedback corrections using those heuristics.

Table 2 Benchmark characteristics

Benchmark       Training input      Testing input        Memory footprint
179.art         train               ref                  4140 KB
181.mcf         train               ref                  112 MB
429.mcf         train               ref                  1.6 GB
462.libquantum  train               ref                  95 MB
472.moldyn      train               ref                  105 MB
em3d            12 000, 600, 75, 1  24 000, 1200, 75, 1  1.3 GB
health          4, 3500, 7          4, 7000, 7           19 MB
treeadd         22, 1               24, 1                512 MB
tsp             3 276 800           16 384 000           1.0 GB

Fig. 2 Normalized execution times compared to benchmarks without data-layout optimization. (a) Testing platform: Aries; (b) Testing platform: Virgo

The PMU feedback gave the same data-layout strategy as the instrumented feedback in 5 of the 9 benchmarks: 179.art, 462.libquantum, 472.moldyn, treeadd, and tsp. For 429.mcf, the PMU feedback result was better than the instrumented feedback result. Although the information provided by instrumented feedback is accurate, the data-layout algorithm is a heuristic and cannot guarantee that it outputs the optimal data-layout strategy; 429.mcf illustrates such a suboptimal result. For the other three benchmarks, the PMU feedback result was worse than the instrumented feedback result. On average, the PMU feedback with WMCCN achieved 97.6% of the performance achieved with instrumented feedback on Aries and 99.4% of that on Virgo. The average performance improvement with PMU feedback with the WMCCN, compared to the test without data-layout optimization, was 45.1% on Aries and 26.9% on Virgo.

Figure 3 shows the normalized last-level cache miss rates of the benchmarks on the two platforms. The data agree with the results in Fig. 2.

Fig. 3 Normalized last-level cache misses compared to benchmarks without data-layout optimization. (a) Testing platform: Aries; (b) Testing platform: Virgo

Table 3 lists the time and space overhead for the structure data-layout optimization with instrumented feedback and PMU feedback. On average, the instrumented method increased the program execution time 3.4-fold to collect the profile data, while the PMU method only increased the program execution time by about 12%. Both the instrumented and PMU methods had low storage requirements.

Table 3 Overhead comparison

                Normalized runtime  Storage (KB)
Benchmark       Inst.    PMU        Inst.   PMU
179.art         2.309    1.081      19.0    10.0
181.mcf         3.544    1.094      20.3    19.2
429.mcf         3.058    1.033      21.5    18.6
462.libquantum  8.092    1.63       32.9    11.4
472.moldyn      5.059    1.102      14.2    6.2
em3d            5.081    1.061      10.6    5.0
health          2.138    1.102      8.2     7.2
treeadd         8.954    1.069      2.1     3.3
tsp             6.398    1.033      7.9     8.9
Average         4.415    1.123      15.2    10.0
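The Average row of Table 3 can be cross-checked with a short calculation. The paper does not state how the averages were computed; the Python sketch below suggests that the runtime averages are consistent with a geometric mean and the storage averages with an arithmetic mean. This is an observation from the listed numbers, not a statement from the authors. It uses `statistics.geometric_mean`, which requires Python 3.8 or later.

```python
from statistics import geometric_mean, mean

# Per-benchmark values copied from Table 3 (in table order).
inst_runtime = [2.309, 3.544, 3.058, 8.092, 5.059, 5.081, 2.138, 8.954, 6.398]
pmu_runtime = [1.081, 1.094, 1.033, 1.63, 1.102, 1.061, 1.102, 1.069, 1.033]
inst_storage_kb = [19.0, 20.3, 21.5, 32.9, 14.2, 10.6, 8.2, 2.1, 7.9]
pmu_storage_kb = [10.0, 19.2, 18.6, 11.4, 6.2, 5.0, 7.2, 3.3, 8.9]

inst_avg = geometric_mean(inst_runtime)   # close to the reported 4.415
pmu_avg = geometric_mean(pmu_runtime)     # close to the reported 1.123

# Overhead in percent relative to an uninstrumented run.
inst_overhead_pct = (inst_avg - 1.0) * 100.0
pmu_overhead_pct = (pmu_avg - 1.0) * 100.0

print(f"runtime: inst {inst_avg:.3f}, PMU {pmu_avg:.3f}")
print(f"overhead: inst {inst_overhead_pct:.1f}%, PMU {pmu_overhead_pct:.1f}%")
print(f"storage: inst {mean(inst_storage_kb):.1f} KB, PMU {mean(pmu_storage_kb):.1f} KB")
```

Under this reading, the overheads of roughly 341.5% and 12.3% quoted in the abstract correspond to the normalized runtime averages of 4.415 and 1.123.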

4 Conclusions

Structure data layout is an important optimization method that can dramatically improve program performance. However, its use is limited by restrictions on obtaining program execution feedback. This paper developed a framework that uses hardware performance event counters to provide the feedback needed by the structure data layout. The PMU-based feedback is free from the restrictions of other methods. In addition, an algorithm was developed to correct the PMU feedback and thereby improve the data-layout strategy. Tests show that this method is competitive with instrumented feedback for improving program performance. Furthermore, this PMU-based method provides better portability, better usability, and less overhead.

References

[1] Chilimbi T M, Davidson B, Larus J R. Cache-conscious structure definition. In: Proceedings of the ACM SIGPLAN 1999 Conference on Programming Language Design and Implementation. New York: ACM, 1999: 13-24.
[2] Chilimbi T M, Hill M D, Larus J R. Cache-conscious structure layout. In: Proceedings of the ACM SIGPLAN 1999 Conference on Programming Language Design and Implementation. New York: ACM, 1999: 1-12.
[3] Kistler T, Franz M. Automated data-member layout of heap objects to improve memory-hierarchy performance. ACM Transactions on Programming Languages and Systems, 2000, 22(3): 490-505.
[4] Rabbah R M, Palem K V. Data remapping for design space optimization of embedded memory systems. ACM Transactions on Embedded Computing Systems, 2003, 2(2): 186-218.
[5] Zhong Yutao, Orlovich M, Shen Xipeng, et al. Array regrouping and structure splitting using whole program reference affinity. In: Proceedings of the ACM SIGPLAN 2004 Conference on Programming Language Design and Implementation. New York: ACM, 2004: 255-266.
[6] Hundt R, Mannarswamy S, Chakrabarti D R. Practical structure layout optimization and advice. In: Proceedings of the Fourth IEEE/ACM International Symposium on Code Generation and Optimization. Washington: IEEE, 2006: 233-244.
[7] Zhao Peng, Cui Shimin, Gao Yaoqing, et al. Forma: A framework for safe automatic array reshaping. ACM Transactions on Programming Languages and Systems, 2007, 30(1): 42-72.
[8] Curial S, Zhao P, Amaral J N, et al. MPADS: Memory-pooling-assisted data splitting. In: Proceedings of the 7th International Symposium on Memory Management. New York: ACM, 2008: 101-110.
[9] Yan Jianian, He Jiangzhou, Chen Wenguang, et al. ASLOP: A field-access affinity-based structure data layout optimizer. To appear in Science in China Series F: Information Sciences.
[10] Conte T M, Patel B A, Menezes K N, et al. Hardware-based profiling: An effective technique for profile-driven optimization. International Journal of Parallel Programming, 1996, 24(2): 187-206.
[11] Zhang Xiaolan, Wang Zheng, Gloy N C, et al. System support for automated profiling and optimization. In: Proceedings of the Sixteenth ACM Symposium on Operating Systems Principles. New York: ACM, 1997: 15-26.
[12] Levin R, Newman I, Haber G. Complementing missing and inaccurate profiling using a minimum cost circulation algorithm. In: Proceedings of the High Performance Embedded Architectures and Compilers, Third International Conference. Berlin: Springer, 2008: 291-304.
[13] Chen Dehao, Vachharajani N, Hundt R. Taming hardware event samples for FDO compilation. In: Proceedings of the Eighth International Symposium on Code Generation and Optimization. Washington: IEEE, 2010.
[14] Goldberg A V, Tarjan R E. Finding minimum-cost circulations by canceling negative cycles. Journal of the ACM, 1989, 36(4): 873-886.
[15] SPEC CPU 2000 Benchmark Suite. http://www.spec.org, 2000.
[16] SPEC CPU 2006 Benchmark Suite. http://www.spec.org, 2006.
[17] Rogers A, Carlisle M C, Reppy J H, et al. Supporting dynamic data structures on distributed memory machines. ACM Transactions on Programming Languages and Systems, 1995, 17(2): 233-263.
