A Compiler Framework for Recovery Code Generation in General Speculative Optimizations

Jin Lin, Wei-Chung Hsu, Pen-Chung Yew
Department of Computer Science and Engineering
University of Minnesota
Email: {jin, hsu, yew}@cs.umn.edu

Roy Dz-Ching Ju and Tin-Fook Ngai
Microprocessor Research Lab, Intel Corporation
Santa Clara, CA 95052 U.S.A
Email: {roy.ju, tin-fook.ngai}@intel.com

Abstract
A general framework that integrates both control and data speculation using alias profiling and/or compiler heuristic rules has been shown to improve SPEC2000 performance on Itanium systems. However, speculative optimizations require check instructions and recovery code to ensure correct execution when speculation fails at runtime. How to generate check instructions and their associated recovery code efficiently and effectively is an issue yet to be well studied. It is also very important that the recovery code generated in the earlier phases integrates gracefully with the later optimization phases. At the very least, it should not hinder later optimizations, thus ensuring overall performance improvement. This paper proposes a framework that uses an if-block structure to facilitate check instruction and recovery code generation for general speculative optimizations. It allows speculative instructions and their recovery code generated in the early compiler optimization phases to be integrated effectively with the subsequent optimization phases. It also allows multi-level speculation for multi-level pointers and multi-level expression trees to be handled with no additional complexity. The proposed recovery code generation framework has been implemented in the Open Research Compiler (ORC).

1. Introduction Control and data speculation [2,9,16,17,19] has been used effectively to improve program performance. Ju et al. [9] proposed a unified framework to exploit both control and data speculation targeting specifically for memory latency hiding. In their scheme, the speculation is exploited by hoisting load instructions

across potentially aliasing store instructions, thus more effectively hiding the memory latency. In [16], a more general framework is proposed to support general speculative optimizations, such as speculative register promotion and speculative partial redundancy elimination. However, the crucial issue of check instruction and recovery code generation has so far been mostly overlooked. The recovery code generation scheme used by Ju, Norman, Mahadevan and Wu in [9] (referred to as the JNMW algorithm in this paper) during code scheduling is quite straightforward. When a load instruction is speculatively moved up to an earlier program point, a special check instruction, chk, is generated at its original location. Some of the instructions that are flow dependent on the load instruction could also be moved up with the speculative load instruction. To facilitate recovery code generation, the compiler introduces additional dependence edges for the new chk instruction. This simple scheme works fine because the generated recovery code rarely interacts with other compiler optimizations (instruction scheduling is usually performed at the very end of the compiler optimization phases). In [16], a check instruction is explicitly modeled as an assignment statement and the corresponding recovery code is generated separately. These additional assignment statements could hinder subsequent optimizations and impact the performance of the speculative optimizations unless we mark these special assignment statements explicitly and force later analyses and optimizations to deal with them specifically using ad hoc approaches. This makes the later analyses and optimizations more complex and less effective. In this paper, we propose a unified framework for recovery code generation that supports general speculative optimizations, including speculative PRE

Proceedings of the 13th International Conference on Parallel Architecture and Compilation Techniques (PACT’04) 1089-795X/04 $ 20.00 IEEE

(usually performed in an early phase of the compiler optimizations) and instruction scheduling (usually performed late). It uses an explicit control flow structure to model check instructions and recovery code. Using this model, a check instruction and its recovery code can be treated as a highly biased if-block in the later analysis and optimization phases. The check instructions and their recovery blocks can thus be integrated gracefully in the later phases, where they are analyzed and optimized like any other instructions without having to be treated separately and explicitly as in other ad hoc approaches [9][16]. This is most obvious for multi-level speculation, such as cascaded speculation for multi-level pointers [9], which can be handled easily in the later optimizations without additional complexity. As far as we know, this is the first general compiler framework for recovery code generation. Our experimental results show that the proposed framework can effectively support general speculative optimizations. The rest of the paper is organized as follows. Section 2 describes our scheme to model check instructions and their recovery code generation using explicit if-block structures. It also discusses how to generate the recovery block in both speculative partial redundancy elimination (SPRE) and instruction scheduling. In Section 3, we show how our new framework generates recovery code for both single-level and multi-level speculation in a unified way. In Section 4, we discuss how to eliminate redundant check instructions. Section 5 presents some experimental results. In Section 6, we review some of the previous recovery code generation algorithms and explain why their check and recovery code generation approaches are inadequate for general speculative optimizations. We give our conclusions in Section 7.

2. A Framework for Check Instructions and Recovery Code Generation
A proper representation of a check instruction and its corresponding recovery block in a compiler's internal representation (IR) is important because they could interact with later optimizations in complicated ways. (A recovery block is the recovery code for a specific check instruction; in the rest of the paper, we use recovery block and recovery code interchangeably.) In this section, we present our model for check instructions and recovery code generation.

2.1. Representation of Check Instructions and Recovery Blocks
We explicitly represent the semantics of the check instruction and its corresponding recovery block as a conditional if-block, as shown in Figure 1(c). The if-block tests whether the speculation was successful; if not, the recovery block is executed. Initially, only the original load instruction associated with the speculative load is included in the recovery block, shown as s5 in Figure 1(c). A φ-function is inserted after the if-block for r in the SSA form (in statement s7). Also, since mis-speculation has to be rare for data speculation to be worthwhile, the if-condition is set to be highly unlikely to be true. Based on such an explicit representation, later analyses and optimizations can treat it as any other highly biased if-block. There is no need to distinguish speculative code from non-speculative code during the analyses and optimizations, as will be discussed in Section 2.3.
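The semantics that the if-block encodes can be sketched in plain C. This is an illustrative simulation only: the valid flag stands in for the hardware ALAT entry that a real ld.a/chk.a pair would consult, and the pointer-equality alias test is an assumption of the sketch, not part of the framework.

```c
#include <assert.h>

/* A minimal C sketch of the if-block semantics in Figure 1(c): a
 * speculative (advanced) load of *p, a possibly-aliasing store through q,
 * and a check that re-executes the load when speculation failed.  The
 * "valid" flag stands in for the hardware ALAT entry; all names here are
 * illustrative. */
static int use_after_check(int *p, int *q) {
    int r = *p;           /* s1: ld.a r = [p]  (advanced load)        */
    int valid = (p != q); /* ALAT stand-in: the store kills the entry
                             only when q aliases p                    */
    *q = 99;              /* s2: may update *p                        */
    if (!valid) {         /* s4: chk flag -- highly unlikely path     */
        r = *p;           /* s5: recovery block: re-load *p           */
    }
    /* s7: r3 = phi(r1, r2) is implicit in the single variable r      */
    return r + 10;        /* s3                                       */
}
```

When p and q do not alias, the speculative value survives; when they do, the recovery path reloads the freshly stored value.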

2.2. Recovery Code Generation for Speculative PRE
We show how check instructions and their recovery code are generated with our proposed model in speculative SSAPRE form [16]. Our strategy is to generate the check instruction at the point of the last weak update but before the next use. Starting from a speculatively redundant candidate, we trace back to the latest weak update. There, we insert an if-then statement to model the check instruction and its corresponding recovery block, which initially has just one load instruction (i.e. the original load instruction) in the then part of the if-then statement. Figure 2(a) gives an example to show the effect of the proposed algorithm. We assume that **q in s2 could potentially update the loads *p and **p with a very low probability. The second occurrence of the expression *p in s3 has been determined to be speculatively redundant to the first one in s1, as shown in Figure 2(b). The compiler inserts an if-then statement after the weak update **q in s2 for the speculatively redundant expression *p in s3, as shown in Figure 2(c). The chk flag is used to mark those if-blocks that correspond to check instructions. Those if-blocks will be converted to a chk.a instruction, or an ld.c instruction (if there is only one load instruction in the recovery block), during code generation.
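The placement rule above can be sketched as a small routine over an abstracted statement list. The statement kinds and the check_position helper are hypothetical names for illustration; the real SSAPRE implementation works over the compiler's IR and SSA versions, not over a flat array.

```c
#include <assert.h>

/* Sketch of the check-placement rule in Section 2.2: the check (and its
 * if-block) goes at the point of the LAST weak update of the candidate
 * load that occurs before its speculatively redundant next use.
 * Statements are abstracted as an array of kinds; this illustrates the
 * placement rule only, not the ORC implementation. */
enum kind { LOAD, WEAK_UPDATE, OTHER, USE };

/* Return the index of the statement after which the check/recovery
 * if-block should be inserted, or -1 if no weak update separates the
 * load from the use (no check is needed). */
static int check_position(const enum kind *stmts, int n, int use_idx) {
    int last = -1;
    for (int i = 0; i < use_idx && i < n; i++)
        if (stmts[i] == WEAK_UPDATE)
            last = i;   /* remember the latest weak update seen */
    return last;
}
```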


(a) The source program:

  s1: ... = *p
  s2: *q = ...        (may update *p)
  s3: ... = *p + 10

(b) The check instruction is modeled as an assignment statement:

  s1:  r1 = *p        /* ld.a flag */
  s1': ... = r1
  s2:  *q = ...
  s3': r2 = *p        /* chk flag */
  s3:  ... = r2 + 10

(c) The check instruction is modeled as an explicit control flow structure:

  s1:  r1 = *p        /* ld.a flag */
  s1': ... = r1
  s2:  *q = ...
  s4:  if (r1 is invalid) {   /* chk flag */
  s5:    r2 = *p
  s6:  }
  s7:  r3 ← φ(r1, r2)
  s3:  ... = r3 + 10

Figure 1 Different representations used to model a check instruction and its recovery block

(a) source program:

  s1: ... = **p
  s2: **q = ...       (may update **p and *p)
  s3: ... = **p + 10

(b) output of the renaming step for speculative PRE for expression *p, where h is a hypothetical temporary for the load of *p:

  s1: ... = **p [h1]
  s2: **q = ...
  s3: ... = **p + 10 [h1]

(c) output of the code motion step for speculative PRE for expression *p:

  s1:  r1 = *p        /* ld.a flag */
  s1': ... = *r1
  s2:  **q = ...
  s4:  if (r1 is invalid) {   /* chk flag */
  s5:    r2 = *p
  s6:  }
  s7:  r3 ← φ(r1, r2)
  s3:  ... = *r3 + 10

Figure 2 Example of recovery code generation in speculative PRE

2.3. Interaction of the Early Introduced Recovery Code with Later Optimizations
The recovery blocks, represented as highly biased if-then statements, can easily be maintained in later optimizations such as instruction scheduling. In this section, we use the partially ready code motion algorithm [1] in the instruction scheduling phase as an example. Consider the example in Figure 3(a). If we assume that the right branch is rarely taken, the instruction b=y in the basic block D can be identified as a P-ready (i.e. partially ready) candidate [1]. It will then be scheduled along the most likely path A→B→D, instead of the unlikely path A→C→D. Figure 3(b) shows the result of such a code motion. The instruction b=y is hoisted to the basic block A and a compensation copy is placed in the basic block C. The recovery code, represented by an explicit and highly biased if-then statement, is a good candidate for P-ready code motion. If there exists a dependent instruction that can be hoisted before the recovery block, this instruction is duplicated as a compensation instruction in the recovery block. At the end of instruction scheduling, the recovery block would

have been updated accordingly and is already well formed. There is no need to trace the flow dependence chain again, as in the JNMW scheme, and to generate the recovery block in a separate effort. After instruction scheduling, we can easily convert these if-then blocks into check instructions and their recovery blocks.

[Figure 3: a diamond-shaped CFG in which block A branches to B (probability 0.9) and C (probability 0.1), and both rejoin at D; block B contains a=x, block C contains y=2, and block D contains b=y. (a) instruction b=y is partially ready at basic block D; (b) partial ready code motion hoists b=y into block A and duplicates it as a compensation copy in block C.]

Figure 3 Example of partial ready code motion for latency hiding

Similar to its use in speculative PRE, the proposed scheme can also be used in speculative instruction scheduling. Using the example in Figure 4(a), the output of the speculative PRE (shown in Figure 4(b)) is further optimized with instruction scheduling. As can be seen in Figure 4(c), if the compiler determines that the instruction s = r+10 can be speculatively hoisted across the store *q statement and the check instruction, it will duplicate the instruction into the recovery block. The generated recovery code is shown in Figure 4(d).

(a) source program:

  ... = *p
  *q = ...            (may update *p)
  ... = *p + 10

(b) output after speculative PRE:

  r = *p              /* ld.a flag */
  ... = r
  *q = ...
  if (r is invalid) { /* chk flag */
    r = *p
  }
  s = r + 10
  ...

(c) output after instruction scheduling:

  r = *p              /* ld.a flag */
  ... = r
  s = r + 10
  *q = ...
  if (r is invalid) { /* chk flag */
    r = *p
    s = r + 10
  }
  ...

(d) final output:

        ld.a   r = [p]
        add    s = r, 10
        ...
        st     [q] = ...
        chk.a  r, recovery
  label:
        ...
  recovery:
        ld     r = [p]
        add    s = r, 10
        br     label

Figure 4 Recovery code generation for a speculative PRE after instruction scheduling
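The effect of hoisting a dependent instruction past the check and duplicating it into the recovery block can be written out in plain C. As before, valid stands in for the ALAT entry and the pointer-equality alias test is an assumption of the sketch, not ORC output.

```c
#include <assert.h>

/* A C rendering of the schedule in Figure 4(c): the dependent
 * instruction s = r + 10 has been hoisted above the store and the check,
 * so the recovery block must recompute both r and s. */
static int hoisted_use(int *p, int *q) {
    int r = *p;            /* ld.a  r = [p]              */
    int s = r + 10;        /* hoisted dependent add      */
    int valid = (q != p);  /* ALAT stand-in (assumption) */
    *q = 1;                /* st [q] = ...               */
    if (!valid) {          /* chk.a r, recovery          */
        r = *p;            /* recovery: ld  r = [p]      */
        s = r + 10;        /*           add s = r, 10    */
    }
    return s;
}
```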

3. Recovery Code Generation in Multi-Level Speculation

3.1 Check Instructions and Recovery Code Representation for Multi-Level Speculation

(a) Source program:

  s1: ... = **p
  ...
  s2: **q = ...
  ...
  s3: ... = **p

(b) p and *p could potentially be updated by the intervening store **q:

        ld.a   r = [p]
        ld.a   f = [r]
        ...
        st     ...
        chk.a  r, recovery_1
  label_1:
        chk.a  f, recovery_2
  label_2:
        ...
  recovery_1:
        ld     r = [p]
        ld     f = [r]
        br     label_1
  recovery_2:
        ld     f = [r]
        br     label_2

(c) Only p could potentially be updated by the intervening store **q:

        ld.a   r = [p]
        ld     f = [r]
        ...
        st     ...
        chk.a  r, recovery_1
  label_1:
        ...
  recovery_1:
        ld     r = [p]
        ld     f = [r]
        br     label_1

(d) Source program:

  s1: ... = *p + *q
  s2: *a = ...        (may update *p or *q)
  s3: ... = *p + *q

(e) Speculation for both the variables and the value in a computation expression:

        ld.a   r = [p]
        ld.a   f = [q]
        add    s = r, f
        *a = ...
        chk.a  r, recovery_1
  label_1:
        chk.a  f, recovery_2
  label_2:
        ...
  recovery_1:
        ld     r = [p]
        add    s = r, f
        br     label_1
  recovery_2:
        ld     f = [q]
        add    s = r, f
        br     label_2

Figure 5 Two examples of multi-level speculation

Multi-level speculation refers to data speculation that occurs in a multi-level expression tree, as in Figure 5(a) and Figure 5(d). In Figure 5(d), the expression tree *p + *q in s3 could be speculatively redundant to *p + *q in s1, if *a in s2 is a potential weak update to *p and/or *q. In this case, we have to generate two speculative loads and two check instructions for *p and *q, as shown in Figure 5(e). Each check instruction has a different recovery block. Another typical example of multi-level speculation is shown in Figure 5(a); it is also referred to as cascaded speculation in [9]. In Figure 5(a), **q could be a potential weak update to *p and/or **p. We need to generate two check instructions, for both *p and **p, as shown in Figure 5(b), each with a different recovery block. If **q is a potential weak update only to *p, then the check instruction and recovery block will be as shown in Figure 5(c). According to our previous study [3], there are many multi-level indirect references (e.g. multi-level field accesses) in the SPEC2000 C programs. Those references present opportunities for multi-level speculative PRE.

(a) Source program:

  s1: ... = **p
  ...
  s2: **q = ...
  ...
  s3: ... = **p

(b) p and *p could potentially be updated by the intervening store **q:

  r1 = *p             /* ld.a flag */
  f1 = *r1            /* ld.a flag */
  ... = f1
  **q = ...
  if (r1 is invalid) {    /* chk flag */
    r2 = *p
    f2 = *r2
  }
  r3 ← φ(r1, r2)
  f3 ← φ(f1, f2)
  if (f3 is invalid) {    /* chk flag */
    f4 = *r1
  }
  f5 ← φ(f3, f4)
  ... = f5

(c) Only p could potentially be updated by the intervening store **q:

  r1 = *p             /* ld.a flag */
  f1 = *r1
  ... = f1
  *q = ...
  if (r1 is invalid) {    /* chk flag */
    r2 = *p
    f2 = *r2
  }
  r3 ← φ(r1, r2)
  f3 ← φ(f1, f2)
  ... = f3

(d) Source program:

  s1: ... = *p + *q
  s2: *a = ...        (may update *p or *q)
  s3: ... = *p + *q

(e) Speculation for both the variables and the value in a computation expression:

  r1 = *p             /* ld.a flag */
  f1 = *q             /* ld.a flag */
  s1 = r1 + f1
  *a = ...
  if (r1 is invalid) {    /* chk flag */
    r2 = *p
    s2 = r2 + f1
  }
  r3 ← φ(r1, r2)
  s3 ← φ(s1, s2)
  if (f1 is invalid) {    /* chk flag */
    f2 = *q
    s4 = r3 + f2
  }
  f3 ← φ(f1, f2)
  s5 ← φ(s3, s4)
  ... = s5

Figure 6 Examples of check instructions and their recovery blocks in multi-level speculation

The recovery code representation discussed in Section 2 can be applied directly to multi-level speculation without any change. In Figure 6(b), both the address expression p and the value expression *p are candidates for speculative register promotion. Hence, two if-blocks are generated. In Figure 6(c), only the address *p may be modified by the potential weak update. The representation in Figure 6 matches the final assembly code shown in Figure 5 very well.

3.2. Recovery Code Generation for Multi-Level Speculation
Using an if-block to represent a check instruction and its recovery block, all previous discussions on recovery code generation for speculative SSAPRE can be applied directly to multi-level speculation without any change. In our speculative SSAPRE framework, an expression tree is processed in a bottom-up order. For example, given an expression ***p, we begin by processing the sub-expression *p for the first-level speculation, then the sub-expression **p for the second-level speculation, and lastly ***p for the third-level speculation. This processing order also guarantees that the placement of check instructions and the recovery code generation are optimized level by level. Therefore, when a sub-expression is processed for the n-th level speculation, every sub-expression at the (n-1)-th level has been processed, and its corresponding recovery block (i.e. if-block) has also been generated. Here, we assume an if-block is generated at the last weak update of the sub-expression. A re-load instruction for the value of the sub-expression is included initially in the if-block. Figure 7(b) shows the result after the first-level speculation on *p for the program in Figure 7(a). We assume **q is aliased with both *p and **p with a very low probability. Hence, **q in s2 is a speculative weak update for both *p and **p. During the first-level speculation, the sub-expression *p is speculatively promoted to the register r. The subscript of r represents the version number of r. As can be seen in Figure 7(b), the value of *p in s1' becomes unavailable to the use in s3 along the true path of the if-block because of the re-load instruction in s5. (The value of an expression is unavailable to a later use if it is redefined before the use.) This is reflected in the version number of r, which changes from r1 to r3 due to the φ(r1, r2) in s6.
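The cascaded checks of the two-level case can likewise be emulated in C, with memory modeled as an array and pointers as indices (so *p is mem[p] and **p is mem[mem[p]]). The valid flags and the aliasing tests are illustrative assumptions standing in for ALAT entries; the function name is hypothetical.

```c
#include <assert.h>

/* A plain-C sketch of the two-level speculation of Figure 7(e) for
 * "... = **p + 10".  Each speculation level gets its own check/recovery
 * if-block, mirroring statements s4-s6 and s8-s10. */
static int two_level(int *mem, int p, int store_addr, int store_val) {
    int r = mem[p];                 /* s1 : ld.a r = [p]  (address *p) */
    int s = mem[r];                 /* s1': ld.a s = [r]  (value  **p) */
    int valid1 = (store_addr != p); /* does the store kill *p ?        */
    int valid2 = (store_addr != r); /* does the store kill **p ?       */
    mem[store_addr] = store_val;    /* s2 : intervening weak update    */
    if (!valid1) {                  /* s4 : first-level check          */
        r = mem[p];                 /* s5 : re-load *p                 */
        s = mem[r];                 /* s5': re-load **p as well        */
        valid2 = 1;                 /* value is now consistent         */
    }
    if (!valid2) {                  /* s8 : second-level check         */
        s = mem[r];                 /* s9 : re-load **p                */
    }
    return s + 10;                  /* s3                              */
}
```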


(a) Source program:

  s1: ... = **p
  s2: **q = ...       (may update **p and *p)
  s3: ... = **p + 10

(b) The recovery code for 1st-level speculation:

  s1:  r1 = *p        /* ld.a flag */
  s1': ... = *r1
  s2:  **q = ...
  s4:  if (r1 is invalid) {   /* chk flag */
  s5:    r2 = *p
       }
  s6:  r3 ← φ(r1, r2)
  s3:  ... = *r3 + 10

(c) Output of the φ-insertion step for 2nd-level speculation (h is a hypothetical temporary for the load of **p):

  s1:  r1 = *p        /* ld.a flag */
  s1': ... = *r1 [h]
  s2:  **q = ...
  s4:  if (r1 is invalid) {   /* chk flag */
  s5:    r2 = *p
       }
  s6:  r3 ← φ(r1, r2)
  s6': h ← φ(h, h)
  s3:  ... = *r3 [h] + 10

(d) Output of the rename step for 2nd-level speculation:

  s1:  r1 = *p        /* ld.a flag */
  s1': ... = *r1 [h1]
  s2:  **q = ...
  s4:  if (r1 is invalid) {   /* chk flag */
  s5:    r2 = *p
       }
  s6:  r3 ← φ(r1, r2)
  s6': h2 ← φ(h1<speculative>, ⊥)
  s3:  ... = *r3 [h2] + 10

(e) Output of the recovery code for 2nd-level speculation:

  s1:  r1 = *p        /* ld.a flag */
  s1': s1 = *r1       /* ld.a flag */
  s1": ... = s1
  s2:  **q = ...
  s4:  if (r1 is invalid) {   /* chk flag */
  s5:    r2 = *p
  s5':   s2 = *r2
       }
  s6:  r3 ← φ(r1, r2)
  s7:  s3 ← φ(s1, s2)
  s8:  if (s1 is invalid) {   /* chk flag */
  s9:    s4 = *r3
       }
  s10: s5 ← φ(s3, s4)
  s3:  ... = s5 + 10

(f) Output of the recovery code for 1st-level speculation using the assignment statement representation:

  s1:  r1 = *p        /* ld.a flag */
  s1': ... = *r1
  s2:  **q = ...
  s4:  r2 = *p        /* chk flag */
  s3:  ... = *r2 + 10

Figure 7 Examples of recovery code generation in speculative PRE

In the second-level speculation for **p, we can also speculatively promote **p to register s. A hypothetical variable h is created for *r (i.e. **p) in s1'. Figure 7(c) shows the result after the φ-insertion phase in SSAPRE [5]. A φ-function φ(h, h) in s6' is inserted because of the if-block in s4. The first operand of φ(h, h), which corresponds to the false path of the if-block in s4, will have the same version number as that in s1' after the rename step in SSAPRE [5], because of the speculative weak update in s2. However, the second operand of φ(h, h), which corresponds to the true path, is replaced by ⊥, because the value of h (i.e. *r) becomes unavailable due to the re-load (i.e. a redefinition) of r in s5. The result is shown in Figure 7(d). The if-block in s8 is inserted after the speculative weak update of s2, with a re-load of *r in s9 (Figure 7(e)). Due to the second operand of φ(h1, ⊥) in s6', the algorithm will insert a re-load of *r (in s5') along the true path of the if-block (in s4) to make the loading of

*r3 in s3 fully redundant. That is, the existing SSAPRE algorithm automatically updates the recovery block from the first level (i.e. *p) without additional effort. This advantage of the if-block representation for the recovery block can be better appreciated by looking at the difficulties we would encounter if we represented the check instruction as an assignment statement, as mentioned in Section 2.1. The expressions *p and **p are speculatively redundant candidates. Using an assignment statement to represent a check instruction, the recovery block generated for the expression *p is shown in Figure 7(f). Since the value r is redefined at s4, the compiler cannot detect the expression *r (i.e. **p) in s3 as speculatively redundant (because r has been redefined). In order to support multi-level speculation, we would have to significantly modify the SSAPRE representation and substantially increase the complexity of the algorithm.


4. Optimizing the Placement of Check Instructions and Recovery Blocks in Speculative PRE

The newly generated check instructions and recovery blocks after the speculative PRE may introduce additional redundancies. For example, the algorithm described in Section 3.2, which inserts a check instruction after the last weak update, may introduce further redundancy if the check instructions cross control flow merge points on the way to their uses. Those extraneous check instructions may impact the later instruction scheduling and degrade performance. In this section, we examine how to eliminate such redundant check instructions.

There are two ways to place a check instruction during the speculative PRE: one is at the use location of the speculatively redundant expression; the other is after the last weak update but before its next use, as already mentioned. Both schemes may introduce redundant check instructions.

Consider the example in Figure 8(a). The occurrence of expression *p in basic block D is assumed to be speculatively redundant to *p in basic block C, and non-speculatively redundant to *p in basic block B. In this case, it is better to generate a check instruction after the weak update *q in basic block C than at the use location in basic block D, because a check placed at the use location would be executed redundantly when the program runs along the path A→B→D.

[Figure 8: two diamond-shaped CFGs A→{B,C}→D in which block C contains a load of *p followed by the weak update *q = ... (may modify *p), and block D uses *p; in (a) block B also loads *p, while in (b) block B redefines *p (*p = exp). (a) A check instruction is generated after the weak update. (b) A check instruction is generated at the use location.]

Figure 8 Examples of effective placement of check instructions in speculative PRE

The example in Figure 8(b) shows the opposite situation. There, it is better to generate a check instruction at the use location, because a check instruction generated after the weak update *q would introduce a redundant check if the path A→B is executed. It is difficult to determine which check placement scheme is superior, because it depends on the structure of the application program.

5. Performance Measurements

[Figure 9: bar chart (increase percentage, 0-25% scale) showing, for ammp, art, equake, bzip2, gap, mcf, parser and twolf, the increase in the number of basic blocks and in code size.]

Figure 9 Number of basic blocks and code size increase with the if-block based recovery code generation scheme

We implemented our new check instruction and recovery code generation algorithm for general speculative PRE optimizations in the ORC compiler version 2.0, and tested it on an HP zx2000 workstation equipped with an Itanium 2 900MHz processor and 2GB of memory. Eight SPEC2000 programs are used in this measurement. The base versions used for comparison are compiled at optimization level O3. Our experiments contain two parts. In the first part, we study the impact of representing the check instruction and the recovery code as an if-block on the code generated by the compiler and its runtime performance. In the second part, we evaluate the performance impact of recovery code generation with multi-level speculation support. The performance data is collected using the Pfmon tool [21].

5.1. Impact of Recovery Code Generation

Our recovery code generation algorithm is quite effective in practice. We show the increase in the number of basic blocks and the code size of the eight benchmarks from the speculative PRE optimization in Figure 9. We collected the number of basic blocks at compile time, and the code size is measured using the size of the text segments. The inserted ld.c and/or chk.a instructions and their recovery blocks contribute only slightly to the static code size. Note that since our general speculative optimizations eliminate speculatively redundant expressions, the savings offset the code increase from the check instructions and their recovery blocks. As can be seen in Figure 9, the impact on code size is marginal. We also collected the

instruction miss stalls at runtime, as shown in Table 1. We observe that the instruction cache misses increased a little in our recovery code generation version compared to the base version. For benchmark gap, the instruction cache miss rate is high due to frequent indirect procedure calls in the program. Figure 10 shows the performance comparison of recovery code generation in speculative PRE using the assignment representation, the improved assignment representation, and the if-block representation. The main difference between the assignment based scheme and the improved assignment based scheme is that, for the latter, we spent much effort modifying the later analysis and optimization phases to minimize the negative impact of the assignment statements generated for the check instructions. Overall, Figure 10 shows that the if-block based scheme outperforms the assignment based schemes. For some benchmarks, such as parser and bzip2, the performance difference is rather small, because the opportunity for data speculation in these two benchmarks is low. Also, despite our numerous efforts to minimize the negative impact of the assignment statements introduced by check instructions, the performance improvement with the improved assignment based scheme is still less than that of the if-block based scheme. However, we should not focus only on the performance aspect of the if-block based representation. Its key contribution is that we can avoid the tedious tuning work required in later optimization phases to overcome the ill effects of the assignment based representation of check instructions.

Table 1 Percentage of instruction miss stalls with the if-block based recovery code scheme

         ammp    art     equake  bzip2   gap     mcf     parser  twolf
  *1     0.31%   0.01%   0.05%   0.04%   8.62%   0.02%   0.43%   9.46%
  *2     0.33%   0.02%   0.06%   0.05%   8.95%   0.03%   0.46%   10.57%

*1: The percentage of instruction miss stalls in total CPU cycles in the base version
*2: The percentage of instruction miss stalls in total CPU cycles after recovery code generation

[Figure 10: bar chart showing, for each benchmark, the improvement (0-40%) in CPU cycles, data access stalls, and loads retired under the assignment based, improved assignment based, and if-block based schemes.]

Figure 10 Overall performance comparison in speculative PRE using the assignment based schemes and the if-block based scheme

5.2. Performance Comparison with Single-Level and Multi-Level Data Speculation
Figure 11 shows the performance difference in speculative PRE with single-level and multi-level speculation support. We can observe some performance difference for benchmarks ammp, equake and gap when multi-level data speculation is enabled. Overall, the impact on performance is not substantial. Consider equake, for example: many intervening stores exist between multiple redundant multi-level memory loads. Since the multi-level memory references here are of pointer type, and type-based alias analysis assumes that aliases can occur only among memory references of the same type, the possible aliasing caused by the intervening stores can be ignored. However, multi-level speculation opportunities can still be exploited even when type-based alias analysis is enabled. In the code segment shown in Figure 13(a), the types of the expressions w[col][0], A[Anext], and A[Anext][1] are different, so the compiler can assume there is no alias among them. We can see that the expressions A[Anext] and A[Anext][1] can be promoted to registers simply using type-based alias analysis. Nevertheless, data speculation can still be applied to pointer references of the same type. For example, the expression A[Anext][1][1] cannot be promoted to a register without speculation because it has the same type as the store expression w[col][0]. In Figure 12, we can see that both single-level speculation and multi-level


speculation increase performance for some benchmarks even when type-based alias analysis is enabled.

40% 35% 30% 25% 20% 15% 10% 5% 0% ammp

art

equake

bzip2

cpu cycles(single-level) data access stall(single-level) loads retired(single-level)

gap

mcf

parser

twolf

cpu cycles(multi-level) data access stall(multi-level) loads retired(multi-level)

Improvement percentage

Figure 11 Performance comparison using if-block based recovery code generation between single level and multi-level data speculation support, where the base version used for comparison is compiled with –O3. 35% 30% 25% 20% 15% 10% 5% 0% ammp

[Figure 12 chart: improvement percentage in cpu cycles, data access stalls, and loads retired (single-level vs. multi-level) for the same eight benchmarks.]

Figure 12 Performance comparison using if-block based recovery code generation between single-level and multi-level data speculation support, where the base version used for comparison is compiled with –O3 and type-based alias analysis.

for (i = 0; i < nodes; i++) {
  ...
  while (Anext < Alast) {
    col = Acol[Anext];
    sum0 += A[Anext][0][0] * ...
    sum1 += A[Anext][1][1] * ...
    sum2 += A[Anext][2][2] * ...
    w[col][0] += A[Anext][0][0] * v[i][0] + ...
    w[col][1] += A[Anext][1][1] * v[i][1] + ...
    w[col][2] += A[Anext][2][2] * v[i][2] + ...
    Anext++;
  }
}

(a) Multi-level speculation for multi-level memory loads

*u = (cos(Src.rake) * sin(Src.strike) -
      sin(Src.rake) * cos(Src.strike) * cos(Src.dip));
*v = (cos(Src.rake) * cos(Src.strike) +
      sin(Src.rake) * sin(Src.strike) * cos(Src.dip));
*w = sin(Src.rake) * sin(Src.dip);

(b) Multi-level speculation for computation expressions including intrinsic procedure calls such as sin and cos

Figure 13 Two examples of multi-level speculation opportunities in benchmark Equake
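To make the register-promotion transformation concrete, the following C sketch contrasts the Figure 13(a) loop shape before and after speculative promotion of the multi-level loads. This is an illustration under invented data and sizes, not the equake source: the names A, w, v, Acol and Anext follow Figure 13(a), while the array dimensions, initialization, and the simplified loop body are ours. The promoted version is only legal if the speculation that w[col][k] never aliases A holds; a real compiler would pair each promoted load with a chk.a and a recovery block that re-loads on failure.

```c
#include <assert.h>

/* Reference version of the Figure 13(a) loop shape: every use of
 * A[Anext][k][k] is a fresh multi-level memory load. */
static void smvp_ref(int n, double A[][3][3], double v[][3],
                     double w[][3], const int *Acol) {
    for (int Anext = 0; Anext < n; Anext++) {
        int col = Acol[Anext];
        w[col][0] += A[Anext][0][0] * v[Anext][0];
        w[col][1] += A[Anext][1][1] * v[Anext][1];
        w[col][2] += A[Anext][2][2] * v[Anext][2];
    }
}

/* Speculatively promoted version: the diagonal elements are loaded
 * once into temporaries ("registers"), speculating that the stores to
 * w[col][k] never alias A. On IA-64 each promoted load would be an
 * ld.a, with a chk.a guarding its uses. */
static void smvp_promoted(int n, double A[][3][3], double v[][3],
                          double w[][3], const int *Acol) {
    for (int Anext = 0; Anext < n; Anext++) {
        int col = Acol[Anext];
        double a00 = A[Anext][0][0];  /* speculative multi-level loads */
        double a11 = A[Anext][1][1];
        double a22 = A[Anext][2][2];
        w[col][0] += a00 * v[Anext][0];
        w[col][1] += a11 * v[Anext][1];
        w[col][2] += a22 * v[Anext][2];
    }
}
```

When w and A do not overlap, the two versions compute identical results; the promoted form simply trades three repeated multi-level loads per iteration for three register-resident temporaries.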


The type-based alias analysis would not help when the intervening stores have the same type as the multi-level memory loads or computation expressions. Consider the code in Figure 13(b). The expressions Src.rake, Src.strike and Src.dip are of the same type as the expressions *u and *v. Therefore, with single-level data speculation, our compiler can speculatively promote the expressions Src.rake, Src.strike and Src.dip into registers. With multi-level speculative PRE, our compiler can further speculatively promote the values of the expressions sin(Src.rake), sin(Src.strike), cos(Src.rake), cos(Src.strike) and cos(Src.dip) to registers. This effectively eliminates six expensive function calls. Such performance opportunities may exist in other important applications. Finally, it should be noted that type-based alias analysis is not always safe and can only be used for languages that are type-safe. In the near future, we will further investigate the impact of multi-level speculation on recovery code generation using more benchmark programs.

6. Related Work

Most of the recovery code generation schemes proposed so far deal only with specific optimizations. If the expressions *p and *q are potential aliases with a very low probability, the load instruction can be speculatively moved across the store instruction. A chk.a instruction is inserted at its original location to check for possible mis-speculation at runtime. It jumps to a recovery block if a mis-speculation occurs. This algorithm works very well for code scheduling, but not for more general speculative optimizations such as speculative partial redundancy elimination. Consider the example in Figure 15(a), and assume *q and *p are unlikely to be aliases. The compiler can speculatively replace the expression *p+b (s4) by a+b (s1) as shown in Figure 14(c). Using the JNMW algorithm, the later recovery code generation phase cannot find the instruction on the dependence chain between the ld.a and the chk.a because it has been speculatively eliminated. This can be seen in the output in Figure 15(b), where the instruction for *p+b (i.e., add r3 = r4, r2 in Figure 15(c)) is eliminated. Keeping such eliminated instructions on the dependence chains in the IR can significantly complicate later optimizations because they must be handled differently from regular instructions.

Another difficulty with the JNMW algorithm is that it implicitly assumes that there exists only one speculative load (ld.a) instruction for each chk.a. This assumption is only true for instruction scheduling. For speculative PRE, one chk.a instruction may correspond to multiple ld.a instructions, so multiple flow dependence chains may exist for a single chk.a instruction. This can be seen in the example in Figure 15(d). We can speculatively eliminate *p (s6) and use the value loaded in s1 or s3. A ld.a instruction is generated for s1 and s3, respectively (as shown in Figure 15(e)). The chk.a instruction in s6 then corresponds to multiple ld.a instructions (at s1 and s3). If the compiler selects the flow dependence chain based on the ld.a instruction in s3, according to the JNMW algorithm, the statement s4, i.e., add r2 = r1, 1, would be included in the recovery block. However, this instruction is non-speculative and should not be included in the recovery code. The correct recovery block is shown in Figure 15(g). This example shows the need to distinguish non-speculative instructions from speculative instructions in the IR for correct recovery code generation. All later optimization phases must also be aware of the existence of possible recovery blocks and check instructions.

In [16], a check instruction is represented implicitly as an assignment statement, and its recovery block as the associated flow dependence chain. Consider the simple example in Figure 15(a). The second occurrence of expression *p in s3 can be regarded as speculatively redundant to the first *p in s1. Therefore, we can use a ld.a for the first load of *p and a check instruction for the second *p. In Figure 15(b), an assignment statement shown as r2 = *p with a check flag is used to represent the chk.a or ld.c, respectively. The recovery block is implicitly assumed to contain the original load instruction for *p in s3.
In the later code scheduling, if some instructions can be speculatively moved up, as in the example in Figure 15, the flow dependence chain between the ld.a and the check instruction implicitly represents the remaining code inside the recovery block. However, as shown in Figure 15(b), the live range of the register r has been changed, and the first definition of r in s1 can no longer reach the last use of r in s3. The version number of r is changed in s3' from r1 to r2 because the check instruction is represented as an assignment statement. This in some sense violates the semantics of the check instruction, which requires that both the speculative load and the corresponding check instruction use the same register; such a representation does not allow that to happen. In addition, this implicit representation can inhibit later optimizations, such as code scheduling, for instructions


related to the speculative load. For example, it no longer allows the instruction r+10 in s3 to be speculatively hoisted across the store instruction *q in s2. Existing check and recovery code generation approaches are thus inadequate for general speculative optimizations.

s1: ... = a+b
s2: *p = a
s3: *q = ...
s4: ... = *p+b

(a) original program

s1: ld r1 = [&a]
s2: ld r2 = [&b]
s3: add r3 = r1, r2
s4: st [p] = r1
s5: ld.a r4 = [p]
s6: st [q] = ...
s7: chk.a r4, recovery
label: ...

recovery: /* incorrect */
ld r4 = [p]
br label

(b) incorrect recovery code for global value numbering based speculative redundancy elimination

recovery: /* correct */
ld r4 = [p]
add r3 = r4, r2 /* missing in the speculative chain */
br label

(c) correct recovery code for global value numbering based speculative redundancy elimination

s1: t = *p+2
...
s3: t = *p+1
...
s5: *q = ...
s6: ... = *p
...

(d) original program

s1: ld.a r1 = [p]
s2: add r2 = r1, 2
...
s3: ld.a r1 = [p]
s4: add r2 = r1, 1
...
s5: st [q] = ...
s6: chk.a r1, recovery
label: ...

(e) output after speculative redundancy elimination

recovery: /* incorrect */
ld.a r1 = [p]
add r2 = r1, 1
br label

(f) incorrect recovery code for speculative redundancy elimination

recovery: /* correct */
ld.a r1 = [p]
br label

(g) correct recovery code for speculative redundancy elimination

Figure 15 Examples to show that the scheme proposed in [9] is insufficient to handle general speculative optimizations. (a), (b) and (c) show one situation where some instructions could be missing in the recovery block. (d), (e) and (f) show another situation where some instructions could be mistakenly added into the recovery block.
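The ld.a/chk.a protocol underlying these examples can be mimicked in portable C. The sketch below is a toy simulation, not the IA-64 ALAT (which tracks target registers and address ranges in hardware): a single-entry "ALAT" records the address of the last advanced load, any store to that address invalidates the entry, and the check runs the recovery code (here just a re-load, matching the correct recovery block of Figure 15(g)) only on invalidation. The function and variable names are ours.

```c
#include <assert.h>
#include <stddef.h>

/* Toy single-entry ALAT: ld.a records the loaded address; a store to
 * that address invalidates the entry; chk.a fails iff invalidated. */
static const int *alat_addr = NULL;

static int ld_a(const int *p) { alat_addr = p; return *p; }   /* ld.a */
static void st(int *p, int val) {                             /* any store */
    if (p == alat_addr) alat_addr = NULL;
    *p = val;
}
static int chk_a_failed(const int *p) { return alat_addr != p; } /* chk.a */

/* Figure 15(d) after speculative redundancy elimination, following
 * (e) with the correct recovery of (g): the earlier loads of *p are
 * advanced loads, the intervening store may alias p, and the final
 * use of *p checks once and re-loads only on mis-speculation. Note
 * the recovery block contains just the re-load: the non-speculative
 * adds (s2, s4) already executed with the pre-store value. */
static int example(int *p, int *q) {
    int r1, t;
    r1 = ld_a(p);  t = r1 + 2;   /* s1, s2 */
    r1 = ld_a(p);  t += r1 + 1;  /* s3, s4 */
    st(q, 99);                   /* s5: may or may not alias p */
    if (chk_a_failed(p))         /* s6: chk.a r1, recovery */
        r1 = ld_a(p);            /* recovery (g): re-load only */
    return t + r1;               /* s6: ... = *p */
}
```

With distinct p and q the check succeeds and the promoted value is used; with p == q the store invalidates the entry and only the final use is recomputed, which is exactly why including s4 in the recovery block, as in Figure 15(f), would be wrong.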

7. Conclusions

In this paper, we propose a compiler framework to generate check instructions and recovery code for general speculative optimizations. The contributions of this paper are as follows. First, we propose a simple if-block representation for check instructions and their corresponding recovery blocks. It allows speculative recovery code introduced early during program optimization to be seamlessly integrated with subsequent optimizations. We use this framework to generate the recovery code for speculative PRE-based optimizations that include partial redundancy

elimination, register promotion and strength reduction. We then show that the recovery code generated for speculative PRE can be integrated seamlessly with later optimizations such as instruction scheduling. Furthermore, this recovery code representation can be applied to other speculative optimizations such as speculative instruction scheduling. Finally, we show that our proposed framework can efficiently support the compiler in generating recovery code for both single-level and multi-level speculation. We have implemented the proposed recovery code generation framework in Intel's ORC compiler. In the past, we had used the assignment-based representation for check


instructions. We ended up spending numerous hours tuning later optimizations to minimize the negative impact of assignment-based check instructions. Using the if-block based representation, the later analysis and optimization phases remain unchanged, saving substantial development effort. As for future work, we would like to support more speculative optimizations, such as value speculation and thread-level speculation, under the same framework to ensure its generality and to exploit additional performance opportunities.

Acknowledgement

The authors wish to thank Sun Chan (Intel), Peng Tu (Intel), and Raymond Lo for their valuable suggestions and comments. This work was supported in part by the U.S. National Science Foundation under grants EIA-9971666, CCR-0105571, CCR-0105574, and EIA-0220021, and by grants from Intel.

References

[1] J. Bharadwaj, K. Menezes, and C. McKinsey. Wavefront Scheduling: Path Based Data Representation and Scheduling of Subgraphs. MICRO'32, December 1999.
[2] R.A. Bringmann, S.A. Mahlke, R.E. Hank, J.C. Gyllenhaal, and W.W. Hwu. Speculative Execution Exception Recovery Using Write-Back Suppression. MICRO'26, December 1993.
[3] T. Chen, J. Lin, W.-C. Hsu, and P.-C. Yew. An Empirical Study on the Granularity of Pointer Analysis in C Programs. In 15th Workshop on Languages and Compilers for Parallel Computing, Maryland, July 2002.
[4] F. Chow, S. Chan, S. Liu, R. Lo, and M. Streich. Effective Representation of Aliases and Indirect Memory Operations in SSA Form. In Proc. of the Sixth Int'l Conf. on Compiler Construction, April 1996.
[5] F. Chow, S. Chan, R. Kennedy, S. Liu, R. Lo, and P. Tu. A New Algorithm for Partial Redundancy Elimination Based on SSA Form. PLDI'97, pages 273-286, Las Vegas, Nevada, May 1997.
[6] R. Cytron et al. Efficiently Computing Static Single Assignment Form and the Control Dependence Graph. ACM Transactions on Programming Languages and Systems, 13(4):451-490, 1991.
[7] C. Dulong. The IA-64 Architecture at Work. IEEE Computer, 31(7):24-32, July 1998.
[8] K. Ishizaki, T. Inagaki, H. Komatsu, and T. Nakatani. Eliminating Exception Constraints of Java Programs for IA-64. PACT'2002, pages 259-268, September 2002.
[9] R. D.-C. Ju, K. Nomura, U. Mahadevan, and L.-C. Wu. A Unified Compiler Framework for Control and Data Speculation. PACT'2000, pages 157-168, October 2000.
[10] R. D.-C. Ju, S. Chan, and C. Wu. Open Research Compiler (ORC) for the Itanium Processor Family. Tutorial presented at MICRO'34, 2001.
[11] R. D.-C. Ju, S. Chan, F. Chow, and X. Feng. Open Research Compiler (ORC): Beyond Version 1.0. Tutorial presented at PACT'2002.
[12] M. Kawahito, H. Komatsu, and T. Nakatani. Effective Null Pointer Check Elimination Utilizing Hardware Trap. ASPLOS-IX, November 12-15, 2000.
[13] R. Kennedy, F. Chow, P. Dahl, S.-M. Liu, R. Lo, and M. Streich. Strength Reduction via SSAPRE. In Proc. of the Seventh Int'l Conf. on Compiler Construction, pages 144-158, Lisbon, Portugal, April 1998.
[14] R. Kennedy, S. Chan, S. Liu, R. Lo, P. Tu, and F. Chow. Partial Redundancy Elimination in SSA Form. ACM Trans. on Programming Languages and Systems, 21(3):627-676, May 1999.
[15] J. Knoop, O. Ruthing, and B. Steffen. Lazy Code Motion. PLDI'92, pages 224-234, San Francisco, California, June 1992.
[16] J. Lin, T. Chen, W.-C. Hsu, P.-C. Yew, R. D.-C. Ju, T.-F. Ngai, and S. Chan. A Compiler Framework for Speculative Analysis and Optimizations. PLDI'03, June 2003.
[17] J. Lin, T. Chen, W.-C. Hsu, and P.-C. Yew. Speculative Register Promotion Using Advanced Load Address Table (ALAT). CGO'03, pages 125-134, San Francisco, California, March 2003.
[18] R. Lo, F. Chow, R. Kennedy, S. Liu, and P. Tu. Register Promotion by Sparse Partial Redundancy Elimination of Loads and Stores. PLDI'98, pages 26-37, Montreal, 1998.
[19] S.A. Mahlke. Exploiting Instruction Level Parallelism in the Presence of Conditional Branches. Ph.D. Thesis, University of Illinois, Urbana, IL, 1996.
[20] U. Mahadevan, K. Nomura, R. D.-C. Ju, and R. Hank. Applying Data Speculation in Modulo Scheduled Loops. PACT'2000, pages 169-176, October 2000.
[21] pfmon: ftp://ftp.hpl.hp.com/pub/linux-ia64/pfmon-1.10.ia64.rpm
[22] Y. Wu and J.R. Larus. Static Branch Frequency and Program Profile Analysis. MICRO'27, pages 1-11, November 1994.

