Cache Pre-fetching for Image Processing

R. Cucchiara, M. Piccardi
Dipartimento di Ingegneria, University of Ferrara
Via Saragat 1, I-44100 Ferrara, Italy
E-mail: {rcucchiara, mpiccardi}@ing.unife.it

Abstract. This paper analyzes hardware pre-fetching techniques for caching images. Performance is evaluated with respect to different classes of image processing algorithms, namely raster-scan and propagative algorithms, common in computer vision and multimedia applications. Sequential pre-fetching and adaptive pre-fetching are compared with the proposed adaptive local pre-fetching, which proves to be more efficient by mirroring the two-dimensional spatial locality of image processing algorithms.

1. Introduction

This paper focuses on strategies for caching images in computer vision and multimedia applications. The interest in caching images follows the emerging trend of including functional units for handling multimedia data, and in particular images, also in general-purpose processors. However, although effective research activity has addressed dedicated execution units (see the Intel MMX, HP MAX and Sun VIS pixel-oriented instruction sets), no large advances have been made yet in specialized or reconfigurable memory units. In this context, this work analyzes the behavior of classes of programs working on images from the point of view of memory access, in order to propose new cache solutions oriented to image processors. Images are normally structured as very large 2D arrays, stored in memory row by row. The main difference between working on general 2D data arrays and on image arrays is that in the latter case the choice of which matrix element to compute often depends on the previous data computation, i.e. the data access can be data-dependent and not follow a rigid row-by-row scheme. Therefore, with a standard cache architecture, which privileges the horizontal direction, neither spatial nor temporal locality can be fully exploited and a large number of cache misses may occur. In this paper we analyze current trends in cache pre-fetching [1-3] with specific reference to some classes of very general image processing algorithms, and we propose two new approaches for managing two-dimensional image types: 1) performing an adaptive local pre-fetching in the data cache for image data types; 2) splitting the data cache and adopting a separate cache for images. The first point explores specific techniques for identifying which data are best to fetch and allocate in the cache. We propose to fetch data so as to maintain in cache the spatial locality embodied also in those image processing applications that do not adopt a rigid raster-scan access.
The second point refers to the possibility of providing specialized memory units for exploiting specific kinds of locality, which may significantly improve performance in many applications [4]. To prove the effectiveness of the proposed approach, we have modified existing cache simulators and analyzed program traces of algorithms extracted from the DARPA image understanding benchmark [5]; their use is not restricted to computer vision and image understanding, but is now common in many multimedia applications. The paper is structured as follows: in the next section we analyze common requirements of image processing applications and outline the concept of 2D spatial locality. Section 3 describes the compared pre-fetching strategies: in particular, sequential and adaptive pre-fetching are described and compared to a

novel approach called adaptive local pre-fetching. Section 4 reports experimental results for the various pre-fetching techniques and compares the use of a unified cache and a split cache in the tested image processing programs. Finally, conclusions are presented in Section 5.

2. 2D spatial locality in images

Images used in most multimedia and image processing applications are defined in high-level language programs as large two-dimensional arrays, generally stored row by row in statically or dynamically allocated variables in central memory. Many algorithms, such as image compression, filtering, convolution, segmentation processes and feature extractors, are based on a main loop iterated for each image point, containing the instruction stream typical of that application for each single data item. Since each image point is referenced in a nested loop on rows and columns, they are usually called raster-scan algorithms [6]. Unlike this class, another large class of image algorithms exhibits a 2D spatial locality that is unpredictable, since the data access is mainly data-dependent and not performed in raster-scan order. Many algorithms belong to this class; they are called propagative algorithms, since the computational flow is propagated along the image, in an a-priori unknown direction depending on the data themselves [6]. Typical examples are contour tracing algorithms that, after selecting a starting point belonging to the object boundary, follow the object contour in order to compute some geometrical properties. Other examples are area-tracing or region-growing algorithms that propagate over an image area: the widely used labeling algorithms, which identify homogeneous parts of the image through unique labels, surface extractors from range images, used for 3D object recognition, and many others. These algorithms prove the two-dimensional spatial locality of image algorithms, and at the same time the impossibility of predicting the data access.
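The contrast between the two classes of algorithms can be sketched as follows (a minimal illustration, not taken from the benchmark programs; the per-pixel operation, the 4-connected neighborhood and all names are our own assumptions):

```python
# Illustrative sketch: predictable raster-scan access vs. data-dependent
# propagative access. The image is a row-major list of lists.

def raster_scan(image, n):
    """Raster-scan algorithm: a nested loop on rows and columns.
    The access order is fixed and row-major, so the horizontal spatial
    locality is fully predictable."""
    result = [[0] * n for _ in range(n)]
    for row in range(n):
        for col in range(n):
            result[row][col] = image[row][col] + 1  # any per-pixel operation
    return result

def region_growing(image, n, seed, value):
    """Propagative algorithm: the next pixel to visit depends on the data
    themselves, so the access direction is unknown a priori."""
    visited = set()
    frontier = [seed]
    while frontier:
        row, col = frontier.pop()
        if (row, col) in visited:
            continue
        visited.add((row, col))
        # Propagate to 4-connected neighbors holding the same value.
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            r, c = row + dr, col + dc
            if 0 <= r < n and 0 <= c < n and image[r][c] == value:
                frontier.append((r, c))
    return visited
```

In the second function the sequence of visited addresses follows the shape of the region being grown, so consecutive accesses may jump a whole image row (N elements) forward or backward, which is precisely the access pattern a purely horizontal cache organization fails to capture.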
Moreover, temporal locality is difficult to exploit: points involved in a neighborhood computation may be used again in the future, but possibly only after a long computation, so capacity misses are likely to occur. In this case, with a standard cache architecture that fetches and stores only physically adjacent blocks, neither the vertical spatial locality nor the temporal locality can be fully exploited. In this context standard cache optimization techniques are of little use: blocking [1] cannot be performed, since no a-priori data partitioning is possible; large block sizes, often used to reduce cache misses, may not capture the vertical spatial locality; even software or hardware sequential pre-fetching does not overcome the outlined drawbacks.

3. Cache pre-fetching strategies in image processing

We look for cache optimization techniques that suitably mirror the 2D spatial locality of images, in order to reduce cache misses. In this paper we focus on hardware pre-fetching strategies. Pre-fetching reduces the miss rate by bringing data into the cache before their use, so that they can be accessed without delay [1-3]. In this paper we provide a performance analysis in terms of miss rate only, without evaluating a complete execution-time model, which would instead be required for cache architecture design. Moreover, due to the specificities of the simulators used, we explore the prefetch-on-miss technique only, which initiates the pre-fetch of a further block when the access to another block results in a cache miss; more aggressive pre-fetching strategies could achieve a substantial decrease in the number of misses. Nevertheless, the main aim of this work is to prove that a specific form of spatio-temporal locality exists in image analysis that is not fully captured by traditional pre-fetching techniques.

We have compared five different caching strategies, specifically oriented to managing the image data type, as data arrays of N x N elements:

a) simple line caching without pre-fetching;
b) sequential pre-fetching: fetching two consecutive lines;
c) constant stride pre-fetching: fetching the line in which the miss occurred and the corresponding one in the next image row, i.e. with a displacement of N elements;
d) adaptive pre-fetching: first, the current line is fetched; then, a stride is computed as the difference between the current line address and the line address of the previous miss [2]; the stride is added to the current line address and used for fetching another line;
e) adaptive local pre-fetching: if the stride is as large as a single block (in the forward or backward direction), same as d) in the two directions; otherwise, if the stride is greater than a block size, the corresponding line in either the previous or the next row is fetched.

All of these strategies adopt fetch on miss. All approaches fetch an identical amount of data, i.e. they have the same line size (called B); nevertheless, apart from a), in the pre-fetching techniques two lines of B data are fetched: the one required by the cache miss and the predicted one. Options b)-e) differ in the adopted reference prediction scheme [2]. The basic idea is that when managing large data arrays like images, pre-fetching can be improved by a correct prediction of which array element will be accessed next by load or store instructions. In approach c) a constant stride is adopted: if we call current_addr the address of the current block to be fetched due to a data miss, the address of the block to be pre-fetched is given by adding to current_addr a fixed stride, known at compile time, equal to the image row length.
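The address prediction schemes of options b)-e) can be sketched as follows (a simplified sketch under our own naming conventions: addresses are block-aligned line addresses, B is the line size in bytes and ROW is the image row length in bytes):

```python
# Sketch of the pre-fetch address prediction schemes b)-e).
# All names and parameter conventions are illustrative, not from the paper.

def predict_sequential(current_addr, B):
    """b) Sequential: pre-fetch the next consecutive line."""
    return current_addr + B

def predict_constant_stride(current_addr, ROW):
    """c) Constant stride: pre-fetch the corresponding line in the next
    image row, using a fixed stride known at compile time."""
    return current_addr + ROW

def predict_adaptive(current_addr, prev_miss_addr):
    """d) Adaptive: the stride is the difference between the current miss
    address and the address of the previous miss [2]."""
    stride = current_addr - prev_miss_addr
    return current_addr + stride

def predict_adaptive_local(current_addr, prev_miss_addr, B, ROW):
    """e) Adaptive local: if the dynamic stride is within one block
    (forward or backward), behave as d); otherwise assume the data flow
    moved vertically and pre-fetch the line in the adjacent image row."""
    stride = current_addr - prev_miss_addr
    if abs(stride) <= B:
        return current_addr + stride
    # Larger stride: fetch from the next or previous image row,
    # following the direction of the last miss.
    return current_addr + ROW if stride > 0 else current_addr - ROW
```

The adaptive local scheme thus falls back on the 2D neighborhood assumption exactly when the observed stride stops being a small horizontal step.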
The adoption of this fixed displacement can be justified by the 2D spatial locality exhibited by many image processing algorithms. Theoretically it should perform similarly to b), but exploiting a "vertical" pre-fetching instead of the "horizontal" one of the sequential case. Conversely, the adaptive approaches d) and e) require dynamic stride detection. In d), the stride detection computes the difference between the current address of the referenced data (when a miss occurs) and the address previously referenced by the same instruction. The last approach, which we propose in this work, is a modified version of adaptive pre-fetching: adaptive local pre-fetching is based on the assumption that even if the direction of the spatial locality is not always predictable (and thus adaptive detection is needed), the data-flow variation is mainly local, within a 2D neighbourhood. Thus, if the current stride is as large as one single block (in the forward or backward direction), the scheme is the same as d); otherwise, if the stride is greater than one block size, we fetch a line from the upper or lower image row.

4. Performance comparison of pre-fetching techniques

In this work we explored a suite of standard programs, in particular 1) Convolution, 2) Trace Contour, and 3) Labeling. Since in cases 2) and 3) the computation behavior is strongly data-dependent, we have tested the algorithms on a large set of variously oriented banners, as well as on the original DARPA benchmark data sets. Cache performance has been evaluated by tracing the program's memory references on a Sun Sparc 10 workstation and using the trace as input for a cache simulator [7]. The simulator performs allocation on misses, both for reads and writes, exploiting an exact LRU replacement policy. As is common in modern uniprocessors, we have considered separate instruction and data caches.
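The evaluation methodology (allocation on miss, exact LRU, prefetch-on-miss) can be illustrated with a minimal trace-driven simulator sketch. This is not the simulator of [7]: it is fully associative rather than set-associative, and all names and parameters are our own assumptions:

```python
# Minimal trace-driven data-cache sketch with allocation on miss, exact
# LRU replacement and prefetch-on-miss. Fully associative for brevity
# (the experiments in the paper use a 2-way set-associative cache).
from collections import OrderedDict

class PrefetchCache:
    def __init__(self, size=16 * 1024, block=16, predictor=None):
        self.block = block
        self.capacity = size // block      # number of cache lines
        self.lines = OrderedDict()         # line address -> True, in LRU order
        self.predictor = predictor         # (miss_line, prev_miss_line) -> line
        self.prev_miss = None
        self.misses = 0

    def _install(self, line_addr):
        if line_addr in self.lines:
            self.lines.move_to_end(line_addr)
            return
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)  # evict the exact-LRU line
        self.lines[line_addr] = True

    def access(self, addr):
        """Process one trace reference; returns True on hit, False on miss."""
        line = addr - addr % self.block
        if line in self.lines:
            self.lines.move_to_end(line)    # hit: refresh LRU position
            return True
        self.misses += 1                    # miss: allocate on miss
        self._install(line)
        if self.predictor is not None:      # prefetch-on-miss
            self._install(self.predictor(line, self.prev_miss))
        self.prev_miss = line
        return False
```

For example, plugging in a sequential predictor (`lambda line, prev: line + 16`) and replaying an address trace yields the miss counts of strategy b); swapping the predictor reproduces strategies c)-e) without touching the simulator core.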

We compared the five strategies on the banner image database with different cache organizations. Due to lack of space, we report results only for the case of two-way associative caches (A = 2) with a 16 Kbyte cache size and 16 byte lines; similar results hold for other cache sizes, line sizes and associativities. Figs. 1.a and 1.b compare the number of cache misses that occurred for the references to image data types only. The figures concern two different input data (namely, the USA banner and the rotated USA banner), representing two opposite cases: the Labeling and Trace Contour computation direction is mainly horizontal in the former image (following the banner stripes), while it is mainly vertical in the latter.

Type of fetching         Trace Contour    Labeling    Convolution
(N. references)                 631348     3926944        7022700
a) Simple                        71998      138588          32704
b) Sequential PF                 38191       74495          16352
c) Constant Stride PF            37738       83035          16352
d) Adaptive PF                   36166       72663          16355
e) Adaptive local PF             36180       72252          16355

Fig. 1.a. Number of misses for image references (A = 2, B = 16, S = 16K, USA banner).

Type of fetching         Trace Contour    Labeling    Convolution
(N. references)                 864405     3975237        7022700
a) Simple                        66346      138270          32704
b) Sequential PF                 33353       77743          16352
c) Constant Stride PF            33351       72443          16352
d) Adaptive PF                   33297       75867          16355
e) Adaptive local PF             33348       72654          16355

Fig. 1.b. Number of misses for image references (A = 2, B = 16, S = 16K, USA rotated).

In the case of Convolution, misses are mainly compulsory: thus, there is a substantial benefit from pre-fetching, and it does not depend on the specific pre-fetching strategy. A different behavior is shown by the propagative algorithms, where the locality is not regular. Trace Contour and Labeling have fewer data references than Convolution, since not all the image points are involved in the propagative computation; Trace Contour has a smaller computational load than Labeling, as indicated by its number of references, lower by one order of magnitude. Nevertheless, the unpredictable direction of the data flow increases the number of misses with respect to Convolution (see rows a) in Figs. 1.a and 1.b): a performance optimization for these cases is therefore even more critical and effective, although it is often neglected in the literature. The reported data show that adaptive local pre-fetching achieves results close to the best-performing technique in each case: for Trace Contour and Labeling in Fig. 1.a, adaptive pre-fetching and adaptive local pre-fetching outperform the sequential and constant stride schemes, while for Labeling in Fig. 1.b constant stride and adaptive local pre-fetching are the most effective.

The previous results address image references only, but non-image references must be accounted for, too, in the complete program. A possible solution is to supply image processors with a split data cache, consisting of a dedicated cache for image data in addition to the standard data cache for the other data. This idea follows the trend of providing separate special-purpose caches for dealing with different kinds of locality [4]. To this aim, Fig. 2 compares the performance obtained with a split cache and adaptive local pre-fetching against that of a standard unified cache with the classic sequential pre-fetching. In the first case, 16 Kbyte and 8 Kbyte caches are used for image and scalar data, respectively, while in the second case we simulated a 16 Kbyte cache. This choice (mainly due to difficulties in simulating a 24 Kbyte cache, whose size is not a power of two) is not particularly restrictive, as similar performance is achieved by unified caches of 16 Kbyte and 32 Kbyte.

                                 Trace Contour         Labeling           Convolution
Type of fetching               Image     Scalar     Image     Scalar    Image     Scalar
                                data       data      data       data     data       data
(N. references)               631348     545912   3926944    1320144  7022700    2379264
Split data cache,
  Adaptive local PF            36180       3453     72252      27412    16355       2064
Unified cache,
  Sequential PF                39633      42298     99664     102395    18419      22459

Fig. 2. Number of total misses (A = 2, B = 16, S = 16K).

Fig. 2 shows that the split cache outperforms the unified one in each program. In the Convolution program the advantage is mainly due to the lack of interference between image and non-image data, while in the other two algorithms the improvement achieved with our method is also due to the adoption of a pre-fetching approach that better mirrors the local computation.

5. Conclusions

The present work has analyzed hardware cache pre-fetching techniques on image processing algorithms typical of computer vision and multimedia applications. An original technique, called adaptive local pre-fetching, which exploits in an adaptive manner the 2D data locality embedded in many image processing algorithms, proved to perform equally to or better than all the other compared techniques. In addition, we have compared the performance obtained with a unified data cache against that of a cache split between image and non-image data; although the comparison has been made only at a rough parity of resources, it shows that the split cache can be considered a valid solution for achieving low overall data cache misses in image processing programs. Further extensions of this work will address a complete execution-time model for adaptive local pre-fetching, and the integration of other multimedia algorithms into the tested programs.

References

[1] J.L. Hennessy, D.A. Patterson, Computer Architecture: A Quantitative Approach, 2nd edition, Morgan Kaufmann Publ., 1996.
[2] S. Vander Wiel, D. Lilja, "When caches aren't enough: data prefetching techniques", IEEE Computer, vol. 30, n. 7, pp. 23-30, 1997.
[3] T.F. Chen, J.L. Baer, "Effective hardware-based data prefetching for high performance processors", IEEE Trans. on Computers, vol. 44, n. 5, pp. 609-622, 1995.
[4] V. Milutinovic, B. Markovic, M. Tomasevic, M. Tremblay, "The split temporal/spatial cache: initial performance analysis", Proc. of the SCIzzL-5, Santa Clara, CA, USA, 1996.
[5] C. Weems, E. Riseman, A. Hanson, A. Rosenfeld, "The DARPA image understanding benchmark for parallel computers", Journal of Parallel and Distributed Computing, vol. 11, pp. 1-24, 1991.
[6] W.K. Pratt, Digital Image Processing, New York: John Wiley and Sons, 1991.
[7] ACME Cache Simulator, http://atanasoff.nmsu.edu/~acme/acs.html.
[8] N.P. Jouppi, "Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers", Proc. 17th Annual Int. Symp. on Computer Architecture, 1990, pp. 364-373.
