Compact Multi-Dimensional Kernel Extraction for Register Tiling ∗
Lakshminarayanan Renganarayana1, Uday Bondhugula1, Salem Derisavi2, Alexandre E. Eichenberger1, Kevin O’Brien1
1 IBM T.J. Watson Research Center, Yorktown Heights, New York, USA
2 IBM Toronto Lab, Ontario, Canada
(lrengan,ubondhug,alexe,caomhin)@us.ibm.com and [email protected]
ABSTRACT
To achieve high performance on multi-cores, modern loop optimizers apply long sequences of transformations that produce complex loop structures. Downstream optimizations such as register tiling (unroll-and-jam plus scalar promotion) typically provide a significant performance improvement, but typical register tilers deliver this improvement only when applied to simple loop structures. They often fail to operate on complex loop structures, leaving a significant amount of performance on the table. We present a technique called compact multi-dimensional kernel extraction (COMDEX) which can make register tilers operate on arbitrarily complex loop structures and enable them to deliver their performance benefits. COMDEX extracts compact unrollable kernels from complex loops. We show that by using COMDEX as a pre-processing step to register tiling we can (i) enable register tiling on complex loop structures and (ii) realize a significant performance improvement on a variety of codes.
1. INTRODUCTION
Register tiling [8, 13] is a well-known technique that increases register locality and the amount of available Instruction Level Parallelism (ILP), while simultaneously decreasing the loop overhead and computational redundancy within the loop body. In its simplest one-dimensional form, this technique reduces to loop unrolling, where a loop body is replicated a fixed number of times. Code quality is generally improved by unrolling, as it enables the elimination of redundant computations, the reuse of values via registers, and increased scheduling flexibility between instances of the replicated loop body. Register tiling applies this loop unrolling technique to two or more dimensions of loop structures, and enables higher levels of register locality and ILP. This technique, also known as unroll-and-jam [5, 19] followed by scalar promotion [4, 6], can often result in significant (2x or more) speedups.

∗ This material is based upon work supported by the Defense Advanced Research Projects Agency under its Agreement No. HR0011-07-9-0002.
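As a rough, hand-written illustration (not taken from this paper), the sketch below applies unroll-and-jam by a factor of two plus scalar promotion to a matrix-vector product; the function name and the assumption that N is even are ours.

    /* Illustrative sketch: unroll-and-jam of the i-loop by 2, followed by
     * scalar promotion. Assumes N is a multiple of 2 (no remainder loop). */
    void matvec_unroll_jam(int N, double A[N][N], double x[N], double y[N])
    {
        for (int i = 0; i < N; i += 2) {        /* i unrolled by 2 and jammed      */
            double y0 = y[i], y1 = y[i + 1];    /* y[i], y[i+1] promoted to scalars */
            for (int j = 0; j < N; j++) {
                double xj = x[j];               /* x[j] loaded once, reused twice   */
                y0 += A[i][j]     * xj;
                y1 += A[i + 1][j] * xj;
            }
            y[i]     = y0;
            y[i + 1] = y1;
        }
    }

The accumulators y0 and y1 stay in registers across the inner loop, and each load of x[j] now feeds two multiply-adds, which is the register locality and ILP benefit described above.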
SC09 November 14-20, 2009, Portland, Oregon, USA (c) 2009 ACM 978-1-60558-744-8/09/11 ...$10.00.
Though register tiling is one of the most profitable loop transformations for improving register locality and instruction level parallelism (ILP) [7, 8, 15], it is often not applied because register tilers usually can handle only simple rectangular loops as inputs. Unfortunately, there is an increasingly large set of loops that have complex (deep, imperfectly nested) loop structures. The complexity comes from two sources. First, many applications spend a significant amount of time in imperfectly nested loops. Examples are matrix factorizations such as LUD and Cholesky, finite difference methods such as FDTD and ADI, and matrix operations from BLAS level-3. The second source of complexity is the use of long sequences of loop transformations to optimize locality and coarse-grained parallelism for multi-cores [9, 3]. Though the original source program might have a simpler loop structure, the result of these long sequences of transformations is a complex imperfectly nested loop structure. The loops that would benefit from register tiling are often buried deep inside these complex loop nests. Register tilers that attempt to work directly on such complex loop nests often fail to separate out or extract the unrollable loops and end up not applying register tiling at all, as shown in our experimental evaluation (cf. Section 5).

This paper proposes a technique that enables register tilers to handle complex inputs and successfully bring their performance benefits to programs optimized for multi-cores. We refer to the task of extracting a loop nest with simple rectangular bounds as kernel extraction, and the extracted loop nest as a kernel. We propose the use of kernel extraction as an explicit transformation which extracts unrollable kernels and transforms them so that scalar promotion is profitable. Kernel extraction can be viewed as an enabling transformation that makes register tilers more robust against complex code structures (imperfect loop nests, if-conditions, etc.).

To successfully reap the benefits of register tiling, kernel extractors need to have at least the following three features: (i) support for complex imperfect loop nests as input, (ii) compact code size after kernel extraction, and (iii) support for coarse-grain parallelism enabling transformations. As discussed earlier, support for complex loops as inputs is essential. The code size after kernel extraction is becoming increasingly important as accelerators with small, dedicated local memory are becoming prevalent. The third feature, support for coarse-grain parallelism enabling transformations, is a subtle but important one. At a high level, to achieve scalable parallelization on multi-cores we need coarse-grained parallel units. The transformations (e.g., loop skewing) used for extracting this coarse-grained parallelism require the use of a particular model of loop tiling [3]. We would like the kernel extractor to work with this tiling model so that it can be used on loops optimized for coarse-grained parallelism. This issue is further discussed in Section 2.3.
Method              | Input loop nest structure | Support for (coarse-grain) parallelism enabling transformations (e.g., skewing of tiles) | Code size after kernel extraction
TLOG [18]           | Perfect                   | No                                                                                        | Compact – minimal
Jimenez et al. [13] | Perfect                   | Yes                                                                                       | Long – exponential
PrimeTile [10]      | Imperfect                 | No                                                                                        | Long – exponential
COMDEX              | Imperfect                 | Yes                                                                                       | Compact – minimal
Table 1: Comparison of kernel extraction methods. Imperfect loop nests appear in several important scientific applications and support for them is essential. Support for (coarse-grain) parallelism enabling transformations is required for scalable parallelization on multi-cores [3]. Compact code size is preferred in compilation for accelerators with small local stores/I-caches. COMDEX, the method proposed in this paper, achieves the best in all the desired features.

Table 1 compares our method (COMDEX) with the previous methods for kernel extraction on each of the three features. COMDEX supports all three features, whereas none of the previous methods supports more than one of them. The limitations of TLOG [18] and PrimeTile [10] are partly due to their intertwining of the extraction of unrollable loop nests with the tiled code generation technique and partly due to their use of unconventional tiling models (see Section 6 for more details). By separating out kernel extraction as an independent transformation and by carefully combining it with conventional tiling models, COMDEX supports transformations of tiles. By using appropriate slices of the iteration space to extract kernels, COMDEX supports imperfect loop nests and compact code size. The code size after kernel extraction in PrimeTile [10] and in the scheme of Jimenez et al. [13] grows exponentially with the number of loops tiled, whereas in COMDEX (as in TLOG [18]) the code size growth is minimal: just another version of the extracted kernel and an if-else condition.

The contributions of this paper are listed below.

• A kernel extraction technique, COMDEX, that supports imperfect loop nests and (parallelism enabling) transformations of tiled loops, and produces compact code after kernel extraction. To the best of our knowledge, COMDEX is the first kernel extractor with all these features.

• An implementation of the kernel extractor in the IBM XL compiler, and a performance and code size comparison of COMDEX with several other techniques.

The next section provides the background on kernel extraction for register tiling. Section 3 describes the issues involved in extracting kernels for imperfect loop nests and in supporting transformations of tiles. Section 4 introduces the notion of a generalized inset and describes our kernel extraction algorithm. Section 5 describes the implementation of our technique and evaluates it on a variety of scientific kernels. We outline related work in Section 6 and provide conclusions and directions for future work in Section 7.
2. BACKGROUND: REGISTER TILING AND KERNEL EXTRACTION
Register tiling involves three distinct phases: (i) tiling or blocking the loops that need to be unrolled, (ii) fully unrolling the loops, and (iii) applying scalar promotion to the unrolled loop body. The unrolling phase consists of unrolling multiple loops and jamming, or combining, all the unrolled iterations. Such unrolling exposes ILP that can be exploited by the instruction scheduler down the line. The unrolling also exposes opportunities for scalar promotion, where invariant array references are promoted to scalars, which are then assigned to registers by a register allocator. Together, the three phases improve ILP and register locality. A good overview of unroll-and-jam and scalar promotion can be found in [1].

The sufficient legality condition for register tiling of a set of loops is the same as that of regular tiling [23], viz., full permutability of the loops. In this paper, we assume that the loops have been appropriately transformed earlier so that register tiling is legal. For our experiments, we use a scheme based on Bondhugula et al. [2, 3] to find an initial compound transformation which optimizes for cache locality and coarse-grained parallelism and makes rectangular tiling valid on the transformed loop nest. Given a loop nest for which register tiling is legal, we address the issues involved in generating the register tiled code, viz., extracting an unrollable kernel from it and transforming it so that scalar promotion can be applied. The rest of this section introduces the concepts involved in register tiling and kernel extraction.
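To make the three phases concrete, the fragments below apply them by hand to a simple rank-1 update; this is an illustrative sketch, not the output of our implementation, and it assumes M and N are multiples of the 2 × 2 tile size so that no partial tiles arise.

    /* Original loop nest: rank-1 update C += a * b^T. */
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++)
            C[i][j] += a[i] * b[j];

    /* Phase (i): 2 x 2 tiling (blocking) of the i and j loops. */
    for (int it = 0; it < M; it += 2)
        for (int jt = 0; jt < N; jt += 2)
            for (int i = it; i < it + 2; i++)
                for (int j = jt; j < jt + 2; j++)
                    C[i][j] += a[i] * b[j];

    /* Phases (ii) and (iii): fully unroll the 2 x 2 point loops, then promote
     * the now-invariant a[] and b[] references to scalars (register candidates). */
    for (int it = 0; it < M; it += 2)
        for (int jt = 0; jt < N; jt += 2) {
            double a0 = a[it], a1 = a[it + 1];
            double b0 = b[jt], b1 = b[jt + 1];
            C[it][jt]         += a0 * b0;
            C[it][jt + 1]     += a0 * b1;
            C[it + 1][jt]     += a1 * b0;
            C[it + 1][jt + 1] += a1 * b1;
        }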
2.1 Full tiles and kernels
Figure 1(a) shows a 2D loop nest typically found in stencil computations. The parallelogram-shaped iteration space (for a small program size) is shown in Figure 2. Figure 2 also shows two levels of tiling: an outer level of 4 × 4 tiling (marked as cache tiles) and an inner level of 2 × 2 tiling (marked as register tiles). We distinguish two types of register tiles, viz., partial and full. Full tiles are completely contained in the iteration space; partial tiles are only partly contained in it. The origin of a tile is its lexicographically earliest iteration point; the origins of these tiles are called the full tile origin and the partial tile origin, respectively. Since the full tiles are completely contained in the iteration space, the loops that iterate over the points in them can have simple constant bounds that only need to check whether the iterations lie within a given tile. This is the key property that is exploited to extract a kernel. We call the loops that enumerate tile origins tile-loops and those that enumerate points within a tile point-loops. We would like to fully unroll the 2 × 2 register tiles to expose ILP and exploit register locality. Unrollers require loops with a constant number of iterations, i.e., the difference between the lower and upper bounds of a given loop must be a constant. We call a loop nest in which each loop has a constant number of iterations a kernel, and we call the process of separating out the kernel from a loop nest with complex bounds kernel extraction.
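The following hand-written sketch (not the code generated by HiTLOG, and using a rectangular N × N iteration space rather than the parallelogram of Figure 2) illustrates tile-loops, point-loops, and why the point-loops of a full tile have constant trip counts; the statement macro S(i, j) and the min macro are placeholders we introduce here.

    #define min(a, b) ((a) < (b) ? (a) : (b))   /* helper macro, introduced for the sketch */

    for (int it = 0; it < N; it += 2)            /* tile-loops: enumerate tile origins */
        for (int jt = 0; jt < N; jt += 2) {
            if (it + 2 <= N && jt + 2 <= N) {
                /* Full tile: each point-loop runs exactly 2 iterations, so this
                 * nest is a kernel and can be fully unrolled. */
                for (int i = it; i < it + 2; i++)
                    for (int j = jt; j < jt + 2; j++)
                        S(i, j);
            } else {
                /* Partial tile: point-loop bounds are clipped to the iteration space. */
                for (int i = it; i < min(it + 2, N); i++)
                    for (int j = jt; j < min(jt + 2, N); j++)
                        S(i, j);
            }
        }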
2.2 Kernel extraction using inset
Figure 1(b) shows the tiled loop nest (generated using HiTLOG [11, 18]). As observed earlier, the loops that enumerate the iterations contained in the full tiles have a constant number of iterations and are the ones that constitute a kernel – a loop nest that can be directly unrolled. Based on this key observation, we can reduce the problem of extracting the kernel to that of splitting a given set of point-loops into two sets of loops: one enumerating the
for ( i = 1; i