Enabling Code Transformation via Synthetic Composition

Andrew Friedley    Christopher Mueller    Andrew Lumsdaine

Open Systems Laboratory
Indiana University, Bloomington IN 47405, USA

Abstract

Modern compilers have not kept up with the growth of new features (e.g., SIMD instructions, multiple cores) in processor architectures, and therefore do not fully exploit these features for optimal performance. The recently developed CorePy library has addressed this shortcoming by placing the architecture in developers' hands, allowing them to map software solutions directly to the architecture's performance features. However, CorePy lacked support for more advanced code organization, transformation, and optimization techniques. To enable the application of code transformations (e.g., those commonly found in compilers), we extend CorePy to support synthetic components and their composition. The capabilities of this extension are demonstrated by implementing an instruction scheduling optimization for the Cell BE platform. Finally, we further demonstrate how synthetic composition accelerates assembly development by automating performance tuning.

1. Introduction

The recently developed CorePy library is a system for rapidly prototyping assembly language code in Python. CorePy brings the benefits of rapid prototyping to low-level performance tuning by making all architectural features accessible to the developer in a high productivity environment. Code developed using CorePy can be used as a drop-in replacement for accelerating the performance-critical portions of an application. Developers use a collection of classes representing aspects of the processor architecture (such as registers and instructions) to generate high-performance machine code directly from Python. An executable kernel is created that can be invoked immediately at runtime, or later compiled into an application. We refer to this technique as synthetic programming [22].
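As a concrete illustration, the following sketch synthesizes and executes a trivial kernel. The Program, get_stream, and Processor names follow the objects described in Section 3; the module paths, the prgm.add call, and the mode argument are assumptions based on CorePy conventions and may differ between versions.

import corepy.arch.x86_64.isa as x86
import corepy.arch.x86_64.platform as env
from corepy.arch.x86_64.types.registers import rax

prgm = env.Program()              # container for rendered code (Section 3.1)
code = prgm.get_stream()
code.add(x86.mov(rax, 42))        # return value is passed back in rax
prgm.add(code)                    # compose the stream into the program

proc = env.Processor()
result = proc.execute(prgm, mode='int')   # synthesize, run, and return 42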


The synthetic programming approach is particularly effective in enabling programmers to control high-level, domain-specific optimizations—optimizations that are typically not the strong point of general-purpose compilers. On the other hand, the basic synthetic programming approach does not provide mechanisms to directly optimize the machine code that it synthesizes. Even in a rapid development environment, optimizing low-level code by hand is a repetitive and labor-intensive process. Just as with traditional compilers, automating this process can significantly reduce the time needed to develop optimized code. However, using an optimizer in past versions of CorePy was difficult due to the manner in which code was generated: code was generated en masse, so transformations had to be applied to an entire synthetic program. In keeping with the original spirit of CorePy, we have developed an approach for modularizing synthetic code generation and for optimizing synthetically generated code that puts these processes directly under the control of the programmer in rapid development fashion. The following example illustrates how code composition could be used in conjunction with an instruction scheduling optimization to generate unrolled and optimized loop code in CorePy:

loop_code = generate_loop_iter(0)
loop_code += generate_loop_iter(1)
main_code += isched(loop_code)

The first two lines above generate code for the first two iterations of the loop, concatenating them together (composing two components) to form unrolled loop code. This code is then input to the instruction scheduler, whose optimized output is then concatenated (composed) onto the main program code. By organizing both generative and generated code into components, our approach allows transformations to be applied to arbitrary code segments in a synthetic program.

CorePy was originally motivated by the observation that modern compilers have had difficulty exploiting new performance-specific features (e.g., SIMD instructions, multiple cores) in processor architectures [20]. CorePy addressed this shortcoming by exposing the underlying architecture to the developer, allowing direct access to performance features. However, CorePy lacked support for more advanced code organization and transformation techniques. To enable the application of code transformations (e.g., those commonly found in compilers [1, 2, 8, 19]), we extend CorePy to support synthetic components and their composition. The capabilities of this extension are demonstrated by implementing an instruction scheduling optimization for the Cell BE platform [14, 15]. Finally, we further demonstrate how synthetic composition accelerates assembly development by automating performance tuning. When used as a tool, our instruction scheduler allows the programmer to develop readable, maintainable assembly code while matching (or exceeding) the performance of hand-optimized code with less effort.

2. Related Work

Code optimization, whether in interpreters for high-level languages or compilers for low-level languages, is an area of ongoing research. High-level approaches such as telescoping languages [16] generate a domain-specific language compiler that achieves high performance, avoiding the need to write performance-critical code in lower-level languages. Another high-level strategy is the Task Graph Library [3], which enables dynamic code generation using a domain-specific language (a subset of C) more conducive to optimization.

Other work focuses on lower-level optimization. For example, the GNU Compiler Collection (GCC) [13] supports reordering of architecture-specific intrinsics (C-language functions that often map directly to an assembly instruction) using instruction scheduling techniques found in the compiler literature [1, 8, 19]. Spiral [6] is a performance library that generates intrinsic-based code from high-level formulas (a form of domain-specific language), relying on GCC's ability to optimize the intrinsics for optimal performance. LLVM [17] implements a virtual instruction set that also contains high-level type information, enabling advanced low-level code transformations. The LLVM compiler system implements a virtual-ISA optimization backend primarily for use by frontends of existing languages, while CorePy tightly integrates machine-level code into a scripting language.

Scripting languages often have functionality or support libraries allowing for in-line assembly programming; PyASM [24] is one example. While this approach is suitable for writing small code segments, development of nontrivial code sequences is difficult. CorePy itself is an evolution of in-line assembly programming that includes input-dependent code generation and is more tightly integrated into the host language to facilitate easier debugging and development.

Several projects have explored the compilation of scripting languages into lower-level languages, and CorePy can be synergistically combined with these to accelerate run-time code generation. The telescoping languages project seeks to generate new languages better suited for optimization, while projects such as Psyco [25] can accelerate unmodified Python using an optimizing just-in-time (JIT) compiler. Cython [4] takes a hybrid approach, similar to the Task Graph Library, in which a high-level language very similar to Python is compiled into a C++ extension module.

3. CorePy

CorePy is an evolution of the Synthetic Programming Environment [20, 22], an object-oriented run-time system for generating and executing high-performance computational kernels in Python. The CorePy library presents a collection of classes representing aspects of the processor architecture; the most fundamental of these objects and their relationships are discussed below.

3.1 CorePy Object Hierarchy

Figure 1 depicts the overall CorePy object hierarchy. Our work introduces the Program object, which inherits some functionality previously found on InstructionStream objects; the details and motivation for this change are discussed below.

At the top are the most basic objects—Registers and Labels. In addition, standard Python integer and boolean types are used directly as immediate operands and special flags. A label is unique in that it is used as an operand for an instruction, as well as for a branch target in an instruction stream.

Every instruction is represented by its own Instruction object. The set of all instructions for a particular architecture (e.g., PowerPC, x86) is referred to as an ISA module. Instructions require some number of operands—registers, labels, immediate values, flags—in order to be instantiated. When constructed, an Instruction checks that it has valid types and values for each of its operands. For example, an error will be raised if a label is passed where a register is expected, or if an immediate value is too large in magnitude to be encoded into the instruction.

Instruction objects are added to an InstructionStream object to form a synthetic code segment. A stream (shorthand for InstructionStream) is a container for instructions and labels. Labels may be added directly to a stream to mark their locations in the code; they act as targets for branch instructions. Furthermore, the instructions and labels from one stream may be added to another, forming the composition operation critical to supporting component-oriented programming. Objects (instructions, labels, and other streams) are added to a stream via the add method or using the overloaded + and += operators, which behave like concatenation. Section 4 discusses the use of this functionality as a code composition operation in detail.

To support composition, we introduced a new Program object to the hierarchy. Program objects centralize the code rendering and resource management functionality previously found in InstructionStream objects. Code composition introduces the possibility for a synthetic program to contain multiple instruction streams, yet this functionality should only exist in one place for a synthetic program.

[Figure 1. CorePy Object Hierarchy: a diagram relating the ISA, Register, Label, Instruction, InstructionStream, StorageRef, Program, and Processor objects. Registers and labels are acquired from a Program, which is executed by a Processor.]

Before a synthetic program can be executed, all the instructions in the program must be rendered into raw machine code and wrapped into an executable function. This process is performed by the cache_code method, and ensures that generated code complies with the architecture- and operating system-dependent application binary interface (ABI).

Program objects now serve as the central point for resource management. There are three types of resources: registers, labels, and storage references. A processor's register file is represented as pools of registers, one pool for each register type. Registers are acquired and released from the pool as needed; if no register of a particular type is available, an error is raised. Labels are acquired by calling a factory method with a desired label name. If another Label already exists with a particular name, that label is returned; otherwise, a new label object is returned. A second factory method exists for creating a label object with a guaranteed unique name. Finally, due to Python's automatic garbage collection, it is often desirable to maintain references to objects containing data used by a synthetic program. A pair of set/get methods allow for adding and retrieving such references as key/value pairs.

A Processor object represents available run-time execution resources. An execute method invokes a specified synthetic program, passing along any provided parameters and returning the program's return value. An asynchronous execution mode is supported, in which the synthetic program executes in a separate thread and the execute call returns immediately. When asynchronous execution mode is enabled, a join method is used to block until the program completes, and to retrieve the return value. Internally, the Processor object utilizes one of a number of small C libraries written to support each architecture and operating system. The machine code rendered by the Program object is passed through to the C library as an array, where it is cast to a function pointer and called to execute the program. The code never sees a compiler during this process; it is rendered and executed directly.
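The resource-management interface just described can be sketched as follows. The acquire_register/release_register names appear in Figure 2 and get_unique_label in Section 4.1; get_label and the set_storage/get_storage pair are hypothetical names for the label factory and reference set/get methods described above.

r_tmp = prgm.acquire_register()       # taken from the shared per-Program pool
loop = prgm.get_label("loop")         # returns the existing label if the name is taken
tmp_lbl = prgm.get_unique_label()     # guaranteed-unique name

code.add(loop)                        # labels mark branch targets in a stream
code.add(x86.add(r_tmp, 1))
prgm.release_register(r_tmp)          # return the register to the pool

prgm.set_storage('data', ext_arr)     # keep a Python data reference alive (avoids GC)
ext_arr = prgm.get_storage('data')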

3.2 Supporting Libraries

In addition to the core object hierarchy, a variety of utility libraries are provided. On the Cell SPE architecture, a synthetic code library is provided for managing data transfers and interprocess communication. Each architecture has its own synthetic iterator library, which provides Python objects that are used to generate assembly code for various types of loops using the built-in Python for-loop syntax. The Cell SPE, PowerPC, and AltiVec architectures each have a synthetic expression library [21] that defines basic datatypes with overloaded arithmetic operators that generate assembly code to perform the operations. Python's built-in arithmetic expression syntax can then be used to generate synthetic programs.

The Printer module supports the output of generated code to any file descriptor (defaulting to standard output, or the display), using a variety of assembly syntax formats. The default format is useful for debugging and inspection purposes, while several assembler-specific formats allow generated code to be compiled and linked into another application, or input to a source code analysis tool (e.g., profiler, debugger).

All of the primary architectures supported by CorePy have some set of SIMD instructions available, generally with strict memory alignment requirements. Existing Python data containers do not provide suitable alignment guarantees, so CorePy provides an extended array class that guarantees page-aligned memory. The ExtArray interface is designed to match that of the existing Python array interface.
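For example, an aligned array suitable for SIMD loads might be created as follows; the module path is an assumption, but the interface mirrors Python's built-in array module as described above.

import corepy.lib.extarray as extarray

a = extarray.extarray('f', 1024)   # 1024 single-precision floats, page-aligned
a[0] = 1.5                         # standard array-style element access
addr = a.buffer_info()[0]          # base address for use in generated loads/stores
assert addr % 4096 == 0            # page alignment guarantee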

4. Synthetic Composition

A synthetic component is a function, class, or module that can be called to generate code at different times during synthesis. Here, we refine the definition of synthetic component to focus on making components more generally reusable, requiring not only that the component be callable multiple times to generate code, but also that the generated code itself be reusable. Taking advantage of this refinement, we now define synthetic composition as the process in which programs are constructed by connecting multiple synthetic components together. This process enables more direct component-oriented programming [23] using CorePy. CorePy supports composition with the ability to add one instruction stream object to another via the add method or + and += operators (as seen in Figure 2).

The generative aspects of CorePy enable the possibility of generating entire synthetic programs from a collection of components without ever directly writing or generating assembly code. Domain-specific languages can be built by providing a set of components and abstracting the composition process into a language. In addition, portable assembly programs can be achieved by building synthetic components with matching interfaces, but with differing implementations. Code can be generated for different architectures by swapping the underlying component implementations while leaving the overall construction process unchanged.

Figure 2 demonstrates various aspects of component-oriented programming in CorePy. The main program body loads several parameters passed in from Python, then adds the code generated by the add_array component. Four parameters are required by the component. Registers containing source and destination pointers and array length account for the first three, while the fourth parameter is a component function that generates code for the element-wise copy operation. Parameterizing this operation makes add_array highly generic, allowing for operations other than just addition. A synthetic iterator [21] is used to generate the code for the loop itself. Inside the loop, the element-wise operation is generated once and used twice to generate the final loop body with two iterations unrolled. This strategy exploits component reuse to perform basic loop unrolling, and has the added side effect of increasing code generation performance.

# Add one array to another, element-by-element
def add_array(prgm, r_dst, r_src, r_len, elem_fn):
    r_cnt = prgm.acquire_register()
    code = prgm.get_stream()

    # Iterate over the elements of the array
    iter = syn_iter(code, r_len, count = r_cnt, step = 8)
    for r_i in iter:
        # Generate the loop body
        body = elem_fn(prgm, r_i)

        # Unroll the body by adding it twice
        code.add(body)
        code.add(x86.add(r_i, 8))
        code.add(body)

    prgm.release_register(r_cnt)
    return code

# Add a source element to a destination element
def add_elem(prgm, r_i):
    code = prgm.get_stream()
    r_data = prgm.acquire_register()

    # Load source value into a register,
    # then add to destination.
    code.add(x86.mov(r_data, MemRef(r_src, 0, r_i)))
    code.add(x86.add(MemRef(r_dst, 0, r_i), r_data))

    prgm.release_register(r_data)
    return code

# Main program body
prgm = env.Program()
code = prgm.get_stream()
code.add(x86.mov(rdi, MemRef(rbp, 16)))
code.add(x86.mov(rsi, MemRef(rbp, 24)))
code.add(x86.mov(rdx, MemRef(rbp, 32)))

# Compose add_array component into main program
code += add_array(prgm, rdi, rsi, rdx, add_elem)

Figure 2. Component-oriented synthetic program.

4.1 Assembly Resource Management

An assembly-level code composition system raises some issues with regard to resource management. Registers are a finite resource, and are generally hardcoded as instruction operands. When two components use the same registers and are composed together, incorrect code can result due to register values being overwritten. We solve this problem by defining register allocation and management to be a function of the central Program object rather than of each InstructionStream object. All components used in a particular program acquire and release registers from the same shared pool. Therefore, a component cannot generate code using a register already in use by another component. The downside to this approach is that generated code can only safely be used in one program; however, in practice this is a minor limitation, since a component can generate code again in the context of another program.

Sometimes code requires the use of a specific register; this is commonly the case on x86, due to instructions that implicitly use certain registers as operands. CorePy supports the use of specific registers; however, this poses a problem for the same reasons as stated above—composing components that use a specific register can lead to incorrect code. This problem is solved by not holding 'live' data in the set of registers used as implicit operands across code composition operations.

Label names are another resource that must be managed carefully, as valid assembly code cannot contain multiple labels with the same name. The get_unique_label method of the Program object guarantees a new label with a name unique to that program, ensuring no naming conflicts will occur within the program. However, a limitation still exists—a generated code segment containing a label cannot be added twice. Even if a unique label name is used, the generated code contains the hardcoded label name, and will cause a conflict when the same label occurs twice in the program. The solution in this case is to generate new code instead of reusing generated code that uses a label.
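The label limitation can be made concrete with a short sketch; gen_body is a hypothetical component whose generated code contains a unique label.

body = gen_body(prgm)      # generated code contains a hardcoded (unique) label
code.add(body)
code.add(body)             # conflict: the same label name now occurs twice

code.add(gen_body(prgm))   # safe: regenerate, acquiring a fresh unique label
code.add(gen_body(prgm))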

5. Code Transformation

Our generative code composition system allows for the possibility of applying transformations to generated code as part of the composition process. We define a code transformation as an operation that takes some generated code as input and outputs a modified version of that code. When the transformations are explicit, programmers can control which transformations are applied, and to which code. A single transformation may even be applied multiple times as part of a multi-stage optimization process, as might be done in a compiler [2, 19].

Conversely, our system is also an excellent environment for developing and experimenting with transformations themselves. A transformation can impose requirements on the type of code it accepts as input; this way a transformation need not handle all cases to be useful. For example, a transformation may require that input code contain only branches with static targets, or no branches at all. Since Python is used as the implementation language, transformations can be developed rapidly and in a high-level manner. Such a development environment is not found in compilers or other code transformation systems. Another unique aspect is that transformations are applied directly to assembly code, rather than to some intermediate form or language. However, a transformation is free to internally translate code to a suitable intermediate form if desired.
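Under this definition, a transformation is just a Python function from generated code to new generated code. The following sketch shows the shape of such a transformation as a hypothetical pass that drops nop instructions; is_nop and the assumption that streams are iterable element-by-element (as in Figure 3 later) are illustrative.

def remove_nops(prgm, code):
    out = prgm.get_stream()
    for element in code:            # instructions and labels, in order
        if not is_nop(element):     # is_nop: hypothetical predicate
            out.add(element)        # copy everything except nops
    return out

code += remove_nops(prgm, generate_code())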

5.1 Instruction Scheduling for Cell SPE

To demonstrate the effectiveness of applying code transformations using our system, we present an instruction scheduler targeted at the Synergistic Processing Elements (SPEs) of the Cell BE [14, 15] architecture. Framed as a code transformation, our instruction scheduler takes some code as input and outputs the same instructions rearranged for optimized performance. The instruction scheduler transformation is applied as follows:

code += inst_sched(generate_code())

Using an instruction scheduling optimization at the assembly level solves the problem of developing and maintaining optimized assembly code. A common approach to assembly code development is to first develop correct (but not necessarily fast) code, and then iteratively tune and test the code by hand until the desired level of optimization is achieved. This optimization process is repetitive and time consuming. An instruction scheduler can be used to largely automate this process, always generating correct and optimized code in one pass. Rather than focusing on performance tuning, programmers can instead focus on quickly developing correct, maintainable code.

We chose to implement our instruction scheduler for the Cell SPE due to its simple, well-defined RISC architecture. Latencies for each instruction are static (i.e., they do not change based on input), making it easy to calculate instruction stall times due to data dependences. The instructions themselves are simple, and do not have side effects (other than load/store instructions) or implicit operands. The SPE has two execution pipelines, referred to as even and odd. Each instruction in the architecture executes on only one of the pipelines—arithmetic operations on the even pipeline; load/store, branching, and other operations on the odd pipeline. Two instructions may be issued per cycle, one to each pipeline. 128 registers are available, enabling a high degree of instruction-level parallelism. Together these features make the SPE an interesting architecture for instruction scheduling, while remaining simple enough to make optimal scheduling feasible in most cases. A similar approach to instruction scheduling for the Cell SPE can be found in the IBM XL compiler [10]; however, we frame our scheduler in the context of our new synthetic composition framework rather than a compiler.

The first stage of our implementation builds a data-flow dependence graph, segmenting it across a series of code blocks whose boundaries are defined by labels and branch instructions. In order to preserve correctness, labels and branch instructions are treated as scheduling barriers, across which no instructions may be moved during the scheduling process. This organization is similar to the use of Hierarchical Task Graphs to do instruction scheduling across basic blocks, described in previous work [27]. In the second stage, a set of heuristics is used to guide a topological sort of the graph, which generates an optimized version of the input code. This approach yields a result quickly, but does not necessarily find the most optimal scheduling possible [12].

5.2 Dependence Graph Generation

Figure 3 illustrates the construction of the blocks and the dependence graph they encapsulate. The algorithm begins with an empty block and adds elements from the input code stream one by one. Depending on its type, each element is handled as follows:

Labels. If a label is encountered and another label or code has already been added to the current block, a new block is started—labels must always occur at the start of a block. If the current block is empty, the label can be assigned to that block.

Branches. Branches must always occur at the end of a block. When a branch instruction is encountered in the instruction stream, that branch is added to the current block and a new block is started.

Instructions. When an instruction is added, its operands are examined to determine if any edges (representing dependences) need to be added to the graph. For each register operand that is encountered, a dependence record for that register is retrieved. A dependence record consists of a reference to the last instruction to write to the register and a list of instructions that have read from the register since the last write. If the current instruction writes to a register operand, several dependence edges are added: one to the last instruction to write the register, and one to each of the instructions that read the register since the last write. The dependence record is updated to contain the current instruction, and the list of read instructions is cleared. If the current instruction reads from this register operand, a dependence edge is added to the last write instruction, and the instruction is added to the dependence record's list of read instructions. In either case, the latency (in cycles) of the last write instruction is associated with each edge in the graph for later use by stall-minimization heuristics.

Note that dependence edges are added such that one graph is formed across all of the blocks, allowing the instruction scheduler to account for latencies crossing block boundaries and achieve higher scheduled code performance.

def generate_blocks(code):
    block = sched_block()
    block_list = [block]

    for element in code:
        if isinstance(element, label):
            # Create a new block if the current
            # block already has a label or graph
            if block.label is not None or block.graph is not empty:
                block = sched_block()
                block_list.append(block)
            block.label = element

        elif isinstance(element, branch):
            # Add branch to current block and
            # create a new block.
            block.branch = element
            block = sched_block()
            block_list.append(block)

        else:
            # Add instruction to the block's graph
            # by adding dependence edges based on
            # register operands
            for each register op in element:
                dep = dep_records[op]

                if op is written:
                    add_dep_edge(block, element, dep.last_write, dep.last_write.latency)
                    for reader in dep.read_list:
                        add_dep_edge(block, element, reader, 1)
                    dep.read_list = []
                    dep.last_write = element

                if op is read:
                    add_dep_edge(element, dep.last_write)
                    dep.read_list.append(element)

            if element has no dependences in block:
                block.start_set.append(element)

    return block_list

Figure 3. Pseudo-code generating blocks containing a dependence graph.

5.3 Topological Sort

A topological sort algorithm is repeatedly applied to each block in order to build a new instruction stream from the dependence graph. Before considering any instructions in the graph, the block's starting label is added to the output stream if it exists. Likewise, the block's ending branch instruction is added after the sort algorithm completes for the current block. The set of instructions with no dependences in the same block is used as the initial set S. Sorting proceeds by selecting the 'best' instruction in S and appending it to the output instruction stream. Normally, a topological sort would delete dependence edges to this instruction, and add any instructions that no longer have any outgoing dependence edges to S. Our implementation keeps a counter of valid dependence edges, decrementing this counter instead of deleting edges. This removes the need to modify the graph and leaves the edge information in place for later use. When an instruction's dependence counter reaches zero, it is added to S.

We select the 'best' instruction from the start set by evaluating several heuristics in a specific order. The goal is to choose the instruction that leads to the schedule yielding optimal performance.

5.4 Cell SPE-Specific Heuristics

Branch Hints. Branch hint instructions indicate to the hardware that an upcoming branch will be taken (the SPE by default assumes no branches are taken), and must be placed no more than 255 instructions away from the branch they are hinting. This heuristic ensures that this requirement is met by calculating the number of instructions remaining to schedule before the hint's target branch. We do this by iteratively adding the number of instructions remaining to be scheduled, starting from the current block and moving forward to the block containing the target branch. If this number is greater than 255, the heuristic ensures the branch hint is not selected. Otherwise, the branch hint is subjected to the remaining heuristics and scheduled like any other instruction.

Stall Minimization. The first heuristic for most instructions calculates how many cycles each instruction in the start set would stall the processor if it were to execute next. Thus, our scheduler places the most importance on keeping the processor busy, minimizing the time each instruction spends waiting on prior instructions. For each instruction, we calculate the maximum delay of its dependences using the algorithm in Figure 4. Instructions with a positive stall count will stall if issued to the processor, while instructions with a zero or negative count can execute immediately. Negative stall counts are clamped to -1. This treats all instructions able to execute without stalling equally, and allows later heuristics to decide which of these instructions is best for issuing this cycle. We choose -1 instead of 0 because an instruction may be dual-issued and therefore execute one cycle earlier than expected. An instruction with a stall time of 0 would then have a stall time of 1, preventing the dual-issue from occurring due to a one-cycle dependence stall.

def stall_cycles(inst):
    stallcycles = 0
    for each outgoing dep of inst:
        # Compute stall time for this dependence
        stall = dep.latency - (curcycle - dep.cycle)
        stallcycles = MAX(stallcycles, stall)
    return stallcycles

Figure 4. Pseudo-code computing the stall time of an instruction.

Pipeline Matching. After minimizing processor stalls, we emphasize matching instructions to both execution pipelines each cycle to take advantage of the processor's superscalar capability. This is done by comparing the execution pipeline of each instruction against the processor's currently active execution pipeline. The active execution pipeline is implied by the address of the next instruction to execute. An address congruent to 0 modulo 8 indicates the even pipeline is currently active, while an address congruent to 4 modulo 8 indicates the odd pipeline (all instructions are 4 bytes long and must be 4-byte aligned in memory, thus only 0 or 4 modulo 8 are possible). The heuristic chooses the instruction(s) that will execute on the active pipeline, if possible. Otherwise, if none of the potential instructions will execute on the active pipeline, this heuristic has no effect.

Critical Path. Before the topological sort stage begins, we precompute the cost of each instruction's critical path [8, 9], or the maximum number of cycles required to execute the instructions that depend on each instruction. This type of heuristic is common; similar descriptions can be found in compiler textbooks [8, 19]. Figure 5 illustrates the algorithm in pseudo-code. At each instruction (node), we compute the distance as the latency of this instruction plus the maximum of the distances of the depending instructions. Using this information, our goal is to minimize the time required to execute the instructions that have yet to be issued, reducing the overall execution time of the output code. Therefore, we pick the instruction with the longest distance.

def critpath(start):
    def critpath_recur(inst):
        max = 0
        for each incoming dep on this inst:
            # Recurse if dependence's critical path
            # cost is not yet computed
            if dep.critpath == 0:
                critpath_recur(dep)
            max = MAX(max, dep.critpath)

        # Critical path cost = max + latency
        inst.critpath = max + inst.latency

    for s in start:
        critpath_recur(s)

Figure 5. Critical path algorithm pseudo-code.

Tie Breaking. Occasionally several instructions appear to be equally suitable even after all three heuristics. In this case, the first one in the set is chosen. However, one more technique is used to guide the instruction selection when ties occur after all the heuristics are applied. Each time an instruction is chosen and added to the output stream, the instructions depending on it are evaluated. As described above, if an instruction has no remaining dependences, it is added to the start set. Otherwise, it still has some dependences. If those dependences exist in the start set, they are moved to the beginning of the set. The purpose is to satisfy instruction dependences more quickly, which will tend to create a larger start set from which to choose instructions.
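The interaction of these heuristics can be summarized by the following selection sketch. stall_cycles is the routine from Figure 4; the pipeline and critpath attributes and the select_best helper itself are hypothetical names, and the branch-hint check is omitted for brevity.

def select_best(start_set, active_pipe):
    # 1. Stall minimization: smallest stall count wins (clamped to -1)
    stalls = {i: max(stall_cycles(i), -1) for i in start_set}
    best = min(stalls.values())
    cands = [i for i in start_set if stalls[i] == best]

    # 2. Pipeline matching: prefer the currently active pipeline, if possible
    piped = [i for i in cands if i.pipeline == active_pipe]
    if piped:
        cands = piped

    # 3. Critical path: longest precomputed distance first
    cands.sort(key=lambda i: i.critpath, reverse=True)
    return cands[0]    # remaining ties: first instruction in the set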

6. Experimental Results

To demonstrate the effectiveness of the instruction scheduler and its use as a code transformation in the CorePy environment, we present two examples. First, we consider vectorized 16- and 64-point FFTs (Fast Fourier Transforms) and evaluate the performance gain achieved by the instruction scheduler. Second, we consider polynomial approximations of the sine and cosine functions to show how the scheduler can be used to merge two independent code segments for improved performance.

6.1 FFT Butterfly

The Fourier Transform is fundamental to a wide variety of applications (particularly signal and image processing), making it an important target for optimization. We implemented 16- and 64-point radix-4 FFT synthetic components using CorePy. A radix-4 FFT repeatedly performs a butterfly operation with four inputs and four outputs, then breaks the problem into four sub-problems and solves each in the same manner. Figure 6 gives a pseudo-code overview of the 16-pt implementation, which performs two stages. The 64-pt FFT performs an additional butterfly stage at the beginning, then does the same computation as the 16-pt FFT on each of the four sub-problems. Rather than targeting optimal performance, the code was developed and organized to minimize complexity and overall length; our goal was maintainable code, not performance. The bulk of the butterfly computation and matrix transpositions were implemented as subcomponents.

Load 16 input points into four vectors i0-i3
Load 3 twiddle values into vectors t0-t2
Duplicate each value across a vector

SIMD compute four 4-point radix-4 butterflies:
  v0 = (i0 + i2) + (i1 + i3)
  v1 = ((i0 + i2) - (i1 + i3)) * t1
  v2 = ((i0 - i2) - j * (i1 - i3)) * t0
  v3 = ((i0 - i2) + j * (i1 - i3)) * t2

Matrix transpose vector registers v0-v3

SIMD compute four 4-point radix-4 butterflies:
  o0 = (v0 + v2) + (v1 + v3)
  o1 = (v0 + v2) - (v1 + v3)
  o2 = (v0 - v2) - j * (v1 - v3)
  o3 = (v0 - v2) + j * (v1 - v3)

Matrix transpose vector registers o0-o3
Store 16 output points from vectors o0-o3

Figure 6. Pseudo-code for a vectorized 16-point radix-4 FFT.

We then benchmarked the FFTs' performance with and without the use of our instruction scheduler, as seen in Figures 7 and 8. Times were obtained using the special decrementer register on the SPE. To obtain adequate resolution, 10,000 FFTs were executed in a loop to obtain an average time. Pseudo-Gflop/sec are calculated using the conventional formula 5n lg n for counting the number of floating point operations executed by a naive FFT implementation.

[Figure 7. Optimized and unoptimized 16-pt FFT performance (higher is better). Bar chart of Pseudo-Gflop/sec for CorePy, Spiral, and FFTW.]

[Figure 8. Optimized and unoptimized 64-pt FFT performance (higher is better). Bar chart of Pseudo-Gflop/sec for CorePy, Spiral, and FFTW.]

Figures 7 and 8 indicate a performance increase of 24% for 16 points and 40% for 64 points when using our instruction scheduler, compared to no optimization. This result is not surprising; no attempt was made to optimize the code other than using the scheduler. Our purpose is to show the speedups that can be obtained when applying the instruction scheduler to hand-written (and readable) assembly code. Inspecting the scheduler-optimized code output by hand using the spu_timing profiler tool (part of the IBM Cell SDK [15]) uncovered no potential performance improvements; the code contains no stalls to eliminate, nor instructions that might be arranged for pipelined dual-issue. This inspection indicates that we have achieved our goal of both optimal and maintainable code. The optimized FFT presented here is significantly faster than the Cell implementation found in the popular FFTW library [11] and is roughly comparable to the performance of Spiral [6], which is one of the fastest FFT implementations available for the Cell processor to date.
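For reference, the pseudo-Gflop/sec figure of merit can be computed as below; the time passed in is a placeholder, not a measured value.

from math import log2

def pseudo_gflops(n, seconds):
    flops = 5 * n * log2(n)          # conventional FFT operation count: 5n lg n
    return flops / seconds / 1e9

rate = pseudo_gflops(64, 20e-9)      # hypothetical 20 ns per 64-pt FFT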

6.2 Sine and Cosine

The purpose of this example is to demonstrate how two or more independent code segments can be merged together to hide instruction latency and improve performance. The sine and cosine trigonometric operations are used here, though the code segments could be anything; for example, several iterations of an unrolled loop body. Sine and cosine are often repeatedly called together in pairs (e.g., in a convolution algorithm), making them an ideal case for code merging and optimization. Our implementation approximates the sine and cosine functions by computing a polynomial chosen for good performance without significant loss of accuracy. Composing and optimizing the combined sine and cosine operation is done as follows:

sincos = generate_sin(r_sx)
sincos += generate_cos(r_cx)
code += isched(sincos)

Figure 9 illustrates the performance of the merged sine/cosine code. Times were again measured using the SPU decrementer timer.

[Figure 9. (Un)optimized sine and cosine performance (lower is better). Bar chart of Time (ns) for Sine, Cosine, and Interleaved Sine/Cosine, unoptimized and optimized.]

Concatenating the sine and cosine code together yields a combined sine/cosine time approximately equal to the sum of the times required to compute the sine and cosine separately. Applying the instruction scheduler to the code yields performance nearly as fast as an individual sine or cosine operation, an almost perfect speedup. This is because the separate sine and cosine code spends much of its time stalled, waiting on intermediate results. Hand inspection of the output code shows that the scheduler is intelligently interleaving the code, taking advantage of pipelined dual-issue opportunities when possible. In general, the performance gained by merging multiple code sections will depend on the amount of stall time in the code. Code with few or no stalls will see little speedup, while code with many stalls benefits greatly.
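The same merging pattern generalizes beyond sine and cosine. The sketch below interleaves any number of independently generated segments; the generator signature is a hypothetical simplification of generate_sin/generate_cos above.

def merge_segments(prgm, generators):
    merged = prgm.get_stream()
    for gen in generators:
        merged += gen(prgm)    # concatenate independent code segments
    return isched(merged)      # schedule once to interleave and hide latency

code += merge_segments(prgm, [generate_sin, generate_cos])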

7. Future Work

Future work in this area will largely focus on addressing the limitations of the composition system and potential code transformations. Management of register resources could be made more flexible by moving away from statically assigning physical registers to register objects, and instead allocating physical registers when machine code is rendered from the synthetic objects. An existing register allocation algorithm [5, 7] could easily be applied. Synthetic components would no longer be tied to specific registers or programs, allowing code generated by a component to be reused in any number of synthetic programs. This idea could be implemented in several ways: integrated into the CorePy system, as a standalone transformation, or included as part of the instruction scheduling transformation. Performing register allocation as part of the scheduler would allow false write-after-read dependences to be eliminated, as well as set up the possibility for more advanced code optimizations and register management.

The current composition operation duplicates code when appending one stream to another. Instead, adding a reference to the source stream would create a more flexible and Python-like composition system. Additions or changes made to a sub-component would propagate to all the prior uses of that sub-component. A side effect of this approach would be improved code rendering performance; instructions in a sub-component that is reused many times would only need to be rendered once. A challenge, however, arises when this idea is combined with the implementation of register allocation. The instruction and register objects from a single component are reused (and referenced directly) in multiple places, implying the same registers must be allocated to each use of that component. A register allocation algorithm must then account for this behavior and allocate registers appropriately. Similar problems arise for instruction scheduling; the same component would likely need to be scheduled differently depending on the surrounding code.

An alternative to the heuristic-based instruction scheduling approach would be to use a provably optimal algorithm [18, 26]. Although more computationally intensive, these approaches always output the best possible instruction schedule. With an additional scheduler implementation, one possibility would be to run both schedulers in parallel on multiple cores and use the best solution found after some specified time elapses. This effectively allows a developer to set the point at which optimization is no longer worthwhile, preventing optimization from consuming too much time for a particular application.

8. Conclusion

We have extended the CorePy library to support componentoriented programming via synthetic code composition, facilitating code reuse at both the Python and generated code levels. The resulting system supports the development and user-directed application of code transformations to code generated by synthetic components, outside the context of a compiler. As an example, we have developed an instruction scheduling optimizer for the Cell SPE architecture. Using the scheduler as part of the code composition system, we made assembly code development an easier and faster process. CorePy source code, including synthetic composition and the instruction scheduler, is available under a BSD software license at http://www.corepy.org.

Acknowledgments Laura Hopkins, Jeremiah Willcock, and Benjamin Martin assisted in editing this paper. This work was funded by a grant from the Lilly Endowment.

References

[1] A. V. Aho, M. S. Lam, R. Sethi, and J. D. Ullman. Compilers: Principles, Techniques, and Tools (2nd Edition). Addison Wesley, August 2006.

[2] D. F. Bacon, S. L. Graham, and O. J. Sharp. Compiler transformations for high-performance computing. ACM Comput. Surv., 26(4):345–420, 1994.

[3] O. Beckmann, A. Houghton, M. Mellor, and P. H. J. Kelly. Runtime code generation in C++ as a foundation for domain-specific optimisation. In Proceedings of the 2003 Dagstuhl Workshop on Domain-Specific Program Generation, pages 291–306, 2003.

[4] S. Behnel, R. Bradshaw, and D. S. Seljebotn. Cython User's Guide, 2009. http://cython.org/ (Accessed September 2009).

[5] P. Briggs, K. D. Cooper, and L. Torczon. Improvements to graph coloring register allocation. ACM Transactions on Programming Languages and Systems, 16:428–455, 1994.

[6] S. Chellappa, F. Franchetti, and M. Püschel. Computer generation of fast Fourier transforms for the Cell Broadband Engine. In ICS, New York, NY, USA, 2009.

[7] F. C. Chow and J. L. Hennessy. The priority-based coloring approach to register allocation. ACM Transactions on Programming Languages and Systems, 12(4):501–536, 1990.

[8] K. D. Cooper and L. Torczon. Engineering a Compiler. Morgan Kaufmann, 2004.

[9] T. Cormen, C. Leiserson, and R. Rivest. Introduction to Algorithms. The MIT Press, 1990.

[10] A. E. Eichenberger, K. O'Brien, K. O'Brien, P. Wu, T. Chen, P. H. Oden, D. A. Prener, J. C. Shepherd, B. So, Z. Sura, A. Wang, T. Zhang, P. Zhao, and M. Gschwind. Optimizing compiler for the Cell processor. In PACT, pages 161–172, Washington, DC, USA, 2005.

[11] M. Frigo and S. G. Johnson. The design and implementation of FFTW3. In Proceedings of the IEEE, pages 216–231, 2005.

[12] P. B. Gibbons and S. S. Muchnick. Efficient instruction scheduling for a pipelined architecture. In SIGPLAN, pages 11–16, New York, NY, USA, 1986.

[13] GNU. GNU Compiler Collection (GCC). http://gcc.gnu.org/ (Accessed May 2009).

[14] International Business Machines Corporation (IBM). Cell Broadband Engine Architecture, August 2005.

[15] International Business Machines (IBM). Cell Broadband Engine Resource Center. http://www.ibm.com/developerworks/power/cell/ (Accessed May 2009).

[16] K. Kennedy, B. Broom, A. Chauhan, R. Fowler, J. Garvin, C. Koelbel, C. McCosh, and J. Mellor-Crummey. Telescoping languages: A system for automatic generation of domain languages. Proceedings of the IEEE, 93(3):387–408, 2005.

[17] C. Lattner. LLVM: An Infrastructure for Multi-Stage Optimization. Master's thesis, Computer Science Dept., University of Illinois at Urbana-Champaign, Urbana, IL, Dec 2002.

[18] A. M. Malik, J. McInnes, and P. van Beek. Optimal basic block instruction scheduling for multiple-issue processors using constraint programming. In ICTAI, pages 279–287, Washington, DC, USA, 2006.

[19] S. Muchnick. Advanced Compiler Design and Implementation. Morgan Kaufmann, August 1997.

[20] C. Mueller. Synthetic Programming: User-Directed Run-Time Code Synthesis for High Performance Computing. PhD thesis, Indiana University, 2007.

[21] C. Mueller and A. Lumsdaine. Expression and loop libraries for high-performance code synthesis. In LCPC, November 2006.

[22] C. Mueller and A. Lumsdaine. Runtime synthesis of high-performance code from scripting languages. In DLS, October 2006.

[23] O. Nierstrasz, S. Gibbs, and D. Tsichritzis. Component-oriented software development. Commun. ACM, 35(9):160–165, 1992.

[24] G. Olson. PyASM User's Guide V.0.2. http://mysite.verizon.net/olsongt/usersGuide.html (Accessed May 2009).

[25] A. Rigo. The Ultimate Psyco Guide, 1.6 edition, February 2005. http://psyco.sourceforge.net/psycoguide/ (Accessed May 2009).

[26] P. van Beek and K. D. Wilken. Fast optimal instruction scheduling for single-issue processors with arbitrary latencies. In CP, pages 625–639, London, UK, 2001.

[27] D. R. Wallace. Low level scheduling using the hierarchical task graph. In ICS, pages 72–81, New York, NY, USA, 1992. ACM.
