Cache Tiling for High Performance Morphological Image Processing

Craig M. Wittenbrink and Arun K. Somani†

Dept. of Electrical Engineering (and Dept. of Computer Science and Engineering)†, University of Washington, MS FT-10, Seattle, WA USA 98195

Appears as: Cache tiling for high performance morphological image processing, Craig M. Wittenbrink and A. K. Somani, Machine Vision and Applications, Vol. 7, No. 1, Winter 1993, pages 12-22. Short version in CAMP 91, Computer Architecture for Machine Perception, Paris, France, 1991, pages 427-438.

Author's Biographies:

Craig M. Wittenbrink received his B.S. in electrical engineering and computer science from the University of Colorado, Boulder, CO in 1987, and an M.S. in electrical engineering from the University of Washington, Seattle, WA in 1990. He is a Ph.D. candidate at the University of Washington studying parallel volume rendering with support from a NASA fellowship. He was a digital design engineer with Boeing Aerospace from 1987 to 1989, working on firmware design and testing for computer image generation hardware. Mr. Wittenbrink's primary research interests include computer graphics, computer architecture, and image processing.

Arun K. Somani received the B.E. (Hons.) degree in electronics engineering from Birla Institute of Technology and Science, Pilani, India in 1973, and the M.Eng. and Ph.D. degrees in electrical engineering from McGill University, Montreal, Canada, in 1983 and 1985, respectively. From 1974 to 1982, he worked as a Scientist in the System Group, Department of Electronics, Government of India, New Delhi. Currently he is an Associate Professor of Electrical Engineering and Computer Science and Engineering at the University of Washington, Seattle, WA. Dr. Somani's research interests include computer architecture, fault tolerant computing, parallel computer systems and algorithms, and computer interconnection networks. He is the architect of the Proteus Computer System, a large reconfigurable network of processors for image and vision processing applications being built at the University of Washington.


Abstract

Morphological image analysis is a technique for processing images through shape characteristics (Jain 1989). Because images are regular data structures, the memory access patterns of morphology algorithms are predictable. Using these read and write patterns, we derive a model of processing to examine inefficiencies in cache processing. We then develop a cache architecture for windowed processing that reduces cache thrashing. Our caching technique, cache tiling, allows dramatic improvements in caching efficiency for small caches, independent of compiler optimizations. Programs are not affected, providing a transparent solution that improves caching. System code, compilers, or profiling programs can determine the blocking necessary for the best performance. An analytical model of the memory characteristics of morphological processing is presented which provides exact cache analysis and prediction; the model is validated against address traces. Other algorithms, such as inner product, matrix multiplication, and convolution, also benefit from the architecture presented herein.

Key Words: cache tiling, computer architecture, performance, windowed or blocked processing, morphological image processing

1 Introduction

This paper discusses tiling to achieve high performance while processing images. Small caches, with sizes from 2 k to 64 k bytes, are implemented on the same chip as the microprocessor. Such on-chip caches do not perform well for windowed image processing under conventional control. We propose a new architecture to overcome cache conflicts and thrashing (Wittenbrink 1990). We begin by reviewing morphological algorithms and their caching problems. Algorithms used in morphological image analysis work within windows of the image; if not constrained to a window, they work along rows or columns. The accessing characteristics are algorithm specific, but may use memory locations that map to the same places in the cache. Dilation (⊕) and erosion (⊖) are the basic morphological operators (Jain 1989), defined as

$$\mathrm{Dilation}(i, j) = \max_{(k,l)\in SE} \{\, \mathrm{Image}(i - k,\, j - l) + SE(k, l) \,\}, \qquad (i, j) \in \mathrm{Image} \oplus SE$$  (EQ 1)

$$\mathrm{Erosion}(i, j) = \min_{(k,l)\in SE} \{\, \mathrm{Image}(i + k,\, j + l) - SE(k, l) \,\}, \qquad (i, j) \in \mathrm{Image} \ominus SE$$  (EQ 2)


where (i, j) is a location in the resulting image and (k, l) is a location in the structuring element, denoted SE. Dilation expands an image and erosion shrinks it. Operations built from erosion and dilation include opening (∘) and closing (•): a morphological opening is an erosion followed by a dilation, and a closing is a dilation followed by an erosion. Each operation's memory accesses depend on the size and shape of the structuring element, and the structuring element used is application dependent. In the example shown in FIGURE 1 (Haralick and Shapiro 1992), the domain of the structuring element is the two points of the input image I that the structuring element covers. The structuring element is centered on the left point. The result is I ⊕ SE.

Processing with large structuring elements in small caches maps input data to the same locations, meaning that previous data are replaced by new data. If the replaced data are to be reused later, they must be fetched again. Repeated replacement, called thrashing, is inefficient, and is one shortcoming of conventional caches for image processing. We propose a mechanism, tiling, to eliminate thrashing for morphological processing. Tiling improves caching when the cache stores the window of interest. The proposed scheme accommodates different structuring elements and image sizes, achieving a general solution for windowed processing. Cache tiling improves similar windowed processing such as median filtering, bit block transfers, and matrix operations.

Tiling is used frequently in other applications. Examples include Blinn's use of tiling for texture maps (Blinn 1990), and the MPP's and IUA's use of tiling to download image regions to their SIMD arrays (Batcher 1980; Weems 1987). Blinn's application improved the performance of the virtual memory system by avoiding page faults to disk. The SIMD architectures require tiling so that their fixed amount of memory can hold the region of interest for the processing array. Our scheme uses tiling to efficiently cache images. Tiling is efficient because the operations are based on near neighbor processing. There are machines that do efficient near neighbor processing, including the MPP (Batcher 1980), the Warwick Pyramid Machine (Nudd et al. 1989), and the IUA (Weems 1987). But these architectures, and fine granularity pipeline processors (Abbott, Haralick and Zhuang 1988) such as the Cytocomputer (Loughheed and McCubbrey 1980), cannot process with structuring elements of a large domain, with full color image formats, or with grey scale structuring elements. Cache optimization is considered in (Somani et al. 1991), where the difficulty is partially solved with large caches. A general shared memory architecture with tiling can process these formats and achieve the highest performance possible using small caches.

Recent image processing software has been designed to isolate hardware specifics from the software, but in the proposed solutions the software has to explicitly describe the format of the data (Sato et al. 1990), and requirements are put on the hardware for address generation to achieve high performance (Flickner, Lavin, and Das 1990). In the VIEW-Station Environment (Sato et al. 1990), the image format must be changed for the required processing by having the virtual memory controller reformat the data and copy the image to a new location. We avoid reformatting the data, and allow a variety of processing to be done with the highest performance. The control of our architecture is similar to the cache control required of many commercial microprocessors. The image processing software may be used unchanged; the system software sets up the cache depending on the characteristics of the algorithm. Lam, Rothberg, and Wolf (1990) have recently proposed blocking, a concept similar to tiling, for such algorithms. They showed that cache efficiency is improved by blocking program loops. Their work relies on the compiler to restructure the loop; however, partitioning the loop results in additional execution time overhead. Our approach is based on hardware and has minimal execution time overhead, and combined with blocking it can further optimize processing with very large matrices.

In section 2, we discuss tiling and show performance advantages using simulation. In section 3 we discuss implementation and design issues. A tiling trace analysis is done in section 4. An exact analysis is done in section 5, and the trace results are compared with the exact analysis. The paper is concluded in section 6.
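Before turning to tiling, a minimal C sketch of grey scale erosion per (EQ 2) may help fix the access pattern in mind. This is our illustrative code, not the implementation traced in section 4; the row major layout, one byte per pixel, and clamping of negative results at 0 are all assumptions.

    void erosion(const unsigned char *image, const unsigned char *se,
                 unsigned char *result, int rows, int cols, int m, int n)
    {
        /* result is the (rows-m+1) x (cols-n+1) valid region */
        for (int i = 0; i <= rows - m; i++) {
            for (int j = 0; j <= cols - n; j++) {
                int min = 255;
                for (int k = 0; k < m; k++)        /* every structuring element  */
                    for (int l = 0; l < n; l++) {  /* member is read, as in EQ 2 */
                        int v = image[(i + k) * cols + (j + l)] - se[k * n + l];
                        if (v < min) min = v;
                    }
                result[i * (cols - n + 1) + j] = (unsigned char)(min < 0 ? 0 : min);
            }
        }
    }

The inner two loops sweep an m × n window of the image for every output pixel; it is exactly this repeated sweep that determines the caching behavior analyzed below.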

2 Tiling

The normal addressing scheme of an image is either by rows or columns. We discuss row major addressing; the same discussion follows for column major addressing by replacing rows with columns and vice versa. Each pixel in a row is given successive addresses as the row is traversed. Matrices are stored in the same way, and an image can be thought of as a matrix of pixel values. Linearization of a two dimensional matrix causes elements in the same column but consecutive rows to be stored far apart. Depending on the cache size and associativity, column elements may map into the same area in the cache. Algorithms that use the columns repeatedly cannot cache them simultaneously. Tiling addresses an image so that spatial locality creates address locality. This section shows how tiling improves cache utilization while processing images.

2.1 Tiling Principles

Tiling an image breaks the image into a mosaic of windows as shown in FIGURE 2. Addresses in a window are row major, and successive tiles follow in row major order. The addresses within each window are contiguous, the property we exploit for optimal caching in windowed or blocked algorithms. By the notation in FIGURE 2, a rows × cols image consists of rows · cols pixels. The image is addressed by a (row, column) address; (i, j) denotes any pixel in the image, 0 ≤ i ≤ rows − 1, 0 ≤ j ≤ cols − 1. A pixel at (i, j) is stored at address i · cols + j = ⟨i|j⟩, a concatenation of the binary strings representing i and j. The tiled image in the right half of FIGURE 2 is broken into (rows/m) · (cols/n) tiles. Tiles are m × n pixels, with m the height and n the width of the tile. We assume that rows, cols, m, and n are powers of two. Thus a given pixel (i, j) in an image belongs to tile (r, s), where r = ⌊i/m⌋ and s = ⌊j/n⌋, and its address in the tile is given by (k, l), where k = i mod m and l = j mod n. The address ⟨i j⟩ can be viewed as consisting of four parts ⟨r k s l⟩, where the bits representing i are a concatenation of the bits representing r and k, and j is a concatenation of s and l. ⟨r s⟩ gives the tile address and ⟨k l⟩ gives the address in the tile. Finally, if an image is stored in a tiled form, then the address in the tiled form is given by

$$\mathit{tiled\_address} = r \cdot cols \cdot m + s \cdot m \cdot n + k \cdot n + l = \langle r\ s\ k\ l \rangle.$$  (EQ 3)

To create a tile address from a row major address, the k and s bit fields are exchanged: ⟨r k s l⟩ → ⟨r s k l⟩.
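The bit exchange of (EQ 3) can be written directly. The following C sketch is ours (not the hardware translator of section 3); it assumes rows, cols, m, and n are powers of two and uses the GCC/Clang builtin __builtin_ctz to take exact base-2 logarithms.

    #include <stdint.h>

    /* Convert a row major pixel address <r k s l> to a tiled address <r s k l> (EQ 3). */
    uint32_t tile_address(uint32_t addr, uint32_t cols, uint32_t m, uint32_t n)
    {
        uint32_t bl = __builtin_ctz(n);          /* bits of l (within-tile column) */
        uint32_t bk = __builtin_ctz(m);          /* bits of k (within-tile row)    */
        uint32_t bs = __builtin_ctz(cols) - bl;  /* bits of s (tile column)        */
        uint32_t l = addr & (n - 1);
        uint32_t s = (addr >> bl) & ((1u << bs) - 1);
        uint32_t k = (addr >> (bl + bs)) & (m - 1);
        uint32_t r = addr >> (bl + bs + bk);
        return (((((r << bs) | s) << bk) | k) << bl) | l;   /* reassemble <r s k l> */
    }

For an 8 × 8 image with 2 × 4 tiles, this maps bit pattern ⟨i2 i1 i0 j2 j1 j0⟩ to ⟨i2 i1 j2 i0 j1 j0⟩; the pixel at (1, 0), row major address 8, maps to tiled address 4.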

2.2 Caching

A cache memory (Stone 1987) is a small fast memory which sits between the processor and the main memory. A cache stores the most commonly used data. It automatically fetches required data from the main memory and keeps track of the data it stores. For this purpose, both cache and main memories are divided into lines of multiple bytes, and the cache fetches a line at a time from main memory. The cache may be direct-mapped, where each line in main memory is stored in one specific line in the cache. In an A-way set-associative cache, each line in main memory may be stored in any one of A banks. An address generated by a processor is split into three fields by the cache, as shown in FIGURE 3. The least significant bits, Word = log2(bytes/line), select the word within a line. The remaining two fields are the tag and the set address. A cache of size cache_size and associativity A has Set = log2(cache_size/A) − Word bits to select the set. The remaining bits, Tag, establish the correspondence between the main memory line and the line stored in the cache. Further details of cache architecture can be found in Stone (1987).
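As a sketch of that decomposition (our example code, with power-of-two sizes assumed, not part of the paper), the three fields fall out of an address as follows:

    #include <stdint.h>

    typedef struct { uint32_t tag, set, word; } Fields;

    Fields split_address(uint32_t addr, uint32_t cache_size, uint32_t A, uint32_t line_size)
    {
        uint32_t word_bits = __builtin_ctz(line_size);                  /* Word = log2(bytes/line)         */
        uint32_t set_bits  = __builtin_ctz(cache_size / A) - word_bits; /* Set = log2(cache_size/A) - Word */
        Fields f;
        f.word = addr & (line_size - 1);
        f.set  = (addr >> word_bits) & ((1u << set_bits) - 1);
        f.tag  = addr >> (word_bits + set_bits);
        return f;
    }

For the 8 K byte, two-way, 32 byte line cache used later in section 3.2, this gives 5 word bits and 7 set bits (128 sets).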

2.3 Tiling and Caching

Tiling makes a cache memory more efficient. As the processor reads the image, it needs only the window of interest to be available in cache memory. In morphology, if the rows covered by the structuring element overfill the cache, thrashing occurs as the image is processed, even though there may be sufficient space to store the window of interest: cache lines are replaced by data in the same window, or domain of the structuring element. The same is true when considering column major addressing of images stored in row major order.

We present an example, using a scaled down dilation, to illustrate thrashing. The same problem occurs with typical images and current on-chip cache sizes. Consider dilation of an 8 pixel wide image with a 2x4 structuring element: m = 2, n = 4, rows = 8, and cols = 8. The structuring element is used like a convolution mask. Let (k0, l0) be the center of the structuring element, in this example (k0 = 1, l0 = 3). The algorithm is shown in FIGURE 4. The cache has 8 lines and is direct mapped. The behavior of the cache for computing results 0 to 4 is shown in FIGURE 5. The lines are denoted line 0, line 1, etc., and pixels from the input image are shown stored in the cache as p0, p1, ..., p15. As the window is covered, the cache reads pixels in and replaces each previously read pixel, leaving half of the cache unused. Notice that each pixel read in after pixel 3 replaces one already in the cache. The processing in the 2x4 window of interest uses these pixels again, and they have to be repeatedly fetched from main memory. While a direct mapped cache exacerbates the problem shown above, the same effect occurs in 2 or 4 way set associative caches when the number of rows exceeds the number of sets for that address.

By tiling the previous example with a 2x4 tile, we can make better use of all 8 lines of the cache. The algorithm is the same as shown in FIGURE 4; now the cache transforms the addresses. The image in main memory is stored the same, but tiling occurs when caching. An image, its tiled form, and the cache replacement are shown in FIGURE 6. The tile address is calculated by using the address reshuffling formula (EQ 3), as shown in FIGURE 7. Computing results 0 to 4 without tiling, there are 40 reads from main memory (FIGURE 5); with tiling, there are 16 reads (FIGURE 6). The number of reads to compute all results in this example is 390 without tiling and 120 with tiling.

To make caching more efficient without tiling, we can reuse pixels by alternating the traversal order of k and l when reading the structuring element. Such snake-like processing (Hwang and Briggs 1984) still requires 28 reads to compute results 0 to 4. The pixels most recently read in are reused for the following loop. In FIGURE 5, after 8 pixels have been read for the first result, the order of reading for the second result is p12 (miss), p11 (hit), p10 (hit), and p9 (hit); then p1, p2, p3, and p4 (misses) are read. This saves misses on the reads of p9, p10, and p11. But the software overhead grows, and compilers and the replacement policies of the cache may defeat the coding intentions. Because the algorithm must be changed and tiling gives better results, we do not use snake-like processing; snake-like processing with tiling is briefly discussed in section 5.

3 Cache Tiling Design

Tiling can be done in software by storing the image in tiled form in main memory, but software tiling is not the best solution. In fact, there are four important advantages to storing images untiled in main memory. First, the image storing and transferral mechanisms, such as DMA or I/O devices, are independent of the tiling scheme. Second, different tilings may be used on the same image. Third, the same tile windows can be used to process arbitrary sized images. Fourth, and possibly most important, the address translation is transparent to user software. These advantages dictate that the address conversion take place when caching, and not in software: hardware tiling improves performance. System software calls are made by users to set up the cache for higher performance. For example, to allocate tiled memory, a call to a custom allocation routine can set up the memory area to be tiled while cached.

The tiling example shown in the previous section requires a fast address translator to exchange the k and s fields, as shown in (EQ 3). For general use, the translator should be able to implement a set of tilings that may be used with any algorithm of interest. By minimizing tilings and image formats, a reasonable translator can be designed. The cache translates the address for all reads and writes to determine a hit or miss, as shown in FIGURE 8. Because the translation is done for every access, the operation is time critical. The mode information regarding the size of the image and tile is stored in a programmable register, selected by address bits, or activated by the memory management unit.

3.1 Design Methodology

In image processing, several image frames may be processed together to compute a result. Let J be the number of windows from different image frames required simultaneously for processing. To avoid thrashing, the cache should be partitioned to hold J windows at once. If the cache is A-way set associative, A > J, and each cache memory bank is large enough to hold a tile, then this is achieved automatically. Otherwise, the cache set addresses are split into P = ⌈J/A⌉ partitions. In this case, a memory address consists of six fields, as shown in FIGURE 9. Here, field q represents the upper bits of the address which are not used for tiling. Field p includes the bits representing the number of partitions, and r, s, k, and l are as defined earlier. To partition the cache to accommodate tiles from all images, p directs images to different sets in the cache. Care must be taken to use images with different partition addresses (bits in the p field), or the images will collide, defeating the purpose of the mapping.

An image processing system computes with different image and window sizes. Each set of specifications is a mode, denoted φ. Let (rows_φ × cols_φ) be the image size, J_φ the number of images, and (m_φ × n_φ) the size of the tile window required for mode φ. If m_φ · n_φ · J_φ is smaller than cache_size, then all images can be tiled and cached simultaneously. Combining software blocking (Lam, Rothberg, and Wolf 1990) with hardware tiling can allow even larger tiles to be used by reducing the (m_φ × n_φ) required for the algorithm; we discuss only hardware tiling and design. Tiling improves cache performance when the number of image rows that can be cached is less than the rows in the window of interest, or cache_size < m_φ · cols_φ · J_φ. For any mode, tiling is useful and possible if

$$m_\phi \cdot n_\phi \cdot J_\phi \le \mathit{cache\_size} < m_\phi \cdot cols_\phi \cdot J_\phi, \qquad \phi \in \mathit{Modes}.$$  (EQ 4)

When (EQ 4) is satisfied for all modes, we create a tile translator by mapping mode parameters into design parameters, φ → dφ, and derive a circuit. The number of partitions P_dφ is determined from J_φ and the associativity A of the cache:

$$P_{d\phi} = \lceil J_\phi / A \rceil, \qquad J_\phi \ge A.$$  (EQ 5)


When J_φ < A, P_dφ = 1 minimizes the required tiling. The given image size dictates that rows_dφ = rows_φ and cols_dφ = cols_φ. We partition the cache into storage for each image that will be simultaneously in the cache; the cache associativity and the partition bits divide up the cache. Tile dimensions m_dφ and n_dφ are chosen by tessellating the partition of size

$$\mathit{p\_size} = \mathit{cache\_size} / (P_{d\phi} \cdot A).$$  (EQ 6)

Width n_dφ is at least as large as the line size of the cache, n_dφ ≥ line_size. Also, m_dφ and n_dφ are chosen in such a way that

$$m_{d\phi} \ge m_\phi, \qquad n_{d\phi} \ge n_\phi, \qquad m_{d\phi} \times n_{d\phi} = \mathit{p\_size}.$$  (EQ 7)

A user can adjust m_dφ and n_dφ to reduce the overall cost as long as m_dφ × n_dφ ≤ p_size. Width n_dφ should be maximized if the processing is done in row major order, and m_dφ should be maximized if the processing is done in column major order. All these parameters have to be powers of 2 for these formulas and for efficient implementation. If n_dφ is greater than or equal to the width of the image (cols_dφ), then no tessellation is required. Each mode requires the address translation performing the partitioning and tiling of FIGURE 8, determined by P_dφ, m_dφ, n_dφ, and rows_dφ × cols_dφ. A reduction of modes is done by removing those covered by the original cache, and equivalent modes. By multiplexing the input address bits to the output address bits for all of the modes dφ, plus regular caching, a (Φ + 1)-way multiplexor directly implements the circuit, where Φ is the number of modes. The range of bits that are translated is computed using (EQ 8) through (EQ 12). The field lengths are computed using the following relations:

$$(b_p)_{d\phi} = \log_2 P_{d\phi}, \qquad (b_r)_{d\phi} = \log_2 rows_{d\phi} - \log_2 m_{d\phi}, \qquad (b_k)_{d\phi} = \log_2 m_{d\phi},$$
$$(b_s)_{d\phi} = \log_2 cols_{d\phi} - \log_2 n_{d\phi}, \qquad (b_l)_{d\phi} = \log_2 n_{d\phi}.$$  (EQ 8)

The least significant bit of the translator is

$$\mathit{b\_right} = \min \{ (b_l)_{d\phi} \}, \qquad d\phi \in \mathit{Modes}.$$  (EQ 9)

When no partitioning is required, the most significant bit of any tiling mode is

$$\mathit{b\_left} = \max \{ ((b_l)_{d\phi} + (b_k)_{d\phi} + (b_s)_{d\phi}) - 1 \}, \qquad d\phi \in \mathit{Modes}.$$  (EQ 10)

Otherwise, the most significant bit of a combination of tiling and partitioning is

$$\mathit{b\_left\_p} = \max \{ ((b_l)_{d\phi} + (b_k)_{d\phi} + (b_s)_{d\phi} + (b_r)_{d\phi} + (b_p)_{d\phi}) - 1 \}, \qquad d\phi \in \mathit{Modes},\ P_{d\phi} > 1.$$  (EQ 11)

It follows that the width of the translator required is

$$\#(\mathit{bits}) = \max \{ \mathit{b\_left}, \mathit{b\_left\_p} \} - \mathit{b\_right} + 1.$$  (EQ 12)

In the next section we present detailed examples.

3.2 Cache Address Translator Example

We present translators with and without partitioning. Suppose we are interested in image sizes 256x256, 512x512, and 1024x1024, and structuring elements 32x1 and 1x32. The cache we are using is 8 K bytes, two-way set associative (A = 2), with a line size of 32 bytes. Tiles can be any size as long as they cover the required window. To hold one image, J = 1. Using equations (EQ 4) to (EQ 12), the design parameters are P_dφ = 1 and m_dφ × n_dφ = 64 × 64 for all image sizes. The field widths computed using the relations in (EQ 8) are given in TABLE 1 in the P_dφ = 1 column. The translator locations are b_right = 6, b_left = 15, and #(bits) = 10. If we require many images to be simultaneously cached, say J = 8, then P_dφ = 4 and m_dφ × n_dφ = 32 × 32 for all image sizes. In this case the field widths are given in TABLE 1 in the P_dφ = 4 column. The bit locations are b_right = 5, b_left_p = 21, and #(bits) = 17. A circuit to implement either translation is shown in FIGURE 10. The translator in each case consists of 4 to 1 multiplexors for each output and uses only two levels of logic for high speed.

TABLE 1  Cache Address Translator Tilings

                 P_dφ = 1 (64 × 64 tile)      P_dφ = 4 (32 × 32 tile)
image size       b_r   b_k   b_s   b_l        b_p   b_r   b_k   b_s   b_l
256x256           2     6     2     6          2     3     5     3     5
512x512           3     6     3     6          2     4     5     4     5
1024x1024         4     6     4     6          2     5     5     5     5
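The parameter selection of (EQ 8) through (EQ 12) is mechanical, as the following sketch shows for the P_dφ = 1 design above. The code is ours, assuming power-of-two sizes; it prints the P_dφ = 1 column of TABLE 1 and the values b_right = 6, b_left = 15, #(bits) = 10.

    #include <stdio.h>

    static int log2i(int x) { int b = 0; while (x > 1) { x >>= 1; b++; } return b; }

    int main(void)
    {
        int sizes[] = { 256, 512, 1024 };
        int m = 64, n = 64;                   /* 64 x 64 tiles, P = 1 */
        int b_right = 32, b_left = 0;
        for (int i = 0; i < 3; i++) {
            int bk = log2i(m), bl = log2i(n);            /* EQ 8 */
            int br = log2i(sizes[i]) - bk;
            int bs = log2i(sizes[i]) - bl;
            printf("%4dx%-4d  br=%d bk=%d bs=%d bl=%d\n",
                   sizes[i], sizes[i], br, bk, bs, bl);
            if (bl < b_right) b_right = bl;              /* EQ 9  */
            if (bl + bk + bs - 1 > b_left)               /* EQ 10 */
                b_left = bl + bk + bs - 1;
        }
        printf("b_right=%d b_left=%d #(bits)=%d\n",
               b_right, b_left, b_left - b_right + 1);   /* EQ 12 */
        return 0;
    }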

Implementations that tile into the line size (n_φ < line_size) require advanced control to perform fills and flushes. Tiling by at least the line size allows for simpler control: a single line's data are fetched from contiguous addresses. The same translator may be used for different cache sizes by reducing the tile width n and shifting the translator, or by reducing b_right. The next section discusses a dilation address trace analysis that illustrates the performance improvement from cache tiling.

4 Tiling Trace Analysis

A good performance estimate for comparing cache architectures is the average cycle time (Stone 1987). Total time for an application is roughly the sum of the cycles spent executing and the cycles spent waiting for the memory system (Hennessy and Patterson 1990). The majority of processor stalls result from cache misses. The hit ratio, or percentage of accesses satisfied in the cache, can be used as an implementation independent measure of cache performance. The cycle time depends upon the technology used, VLSI line spacing, process, and other production factors; the hit ratio depends only upon the cache behavior and the program behavior. Therefore, an accepted technique for cache performance comparisons is to measure the hit ratio.

We developed a simulator that processes address traces captured on a workstation to compare hit ratios with and without cache tiling. Dilation using a structuring element of 32 rows and one column, a rod, was used to generate the address trace. We used tiles of size 32 × n_dφ, extending n_dφ as much as required. Dilations of image sizes 256x256, 512x512, and 1024x1024 were traced. Two-way set associative caches from 2 kilobytes to 1 megabyte were used to process the traces. The results are given in FIGURE 11 and FIGURE 12; see also TABLE 2, TABLE 3, and TABLE 4. The trace contains all data addresses generated while executing the dilation. Erosion, opening, and closing operations give similar results, as will be shown in the analytical analysis section. The trace results do not depend on the data, only upon the size of the image and the structuring element.

FIGURE 11 shows the hit ratios for the input of the 256x256 image with four cache sizes from 2 k bytes to 16 k bytes. Caches larger than 16 k do not further improve the hit ratios. Only the input image is cached; the structuring element is not. The results show that a 2 k byte regular cache gives a hit ratio of zero, while the tiling cache gives a hit ratio of 0.96875. This shows the strength of tiling. When the regular cache reaches a size of 8 k bytes, it has enough capacity to store all the rows of interest in the image, so at this threshold tiling is no longer needed; this is shown analytically in section 5. FIGURE 12 shows the effect of tiling for all three image sizes. The cache size necessary to achieve a high hit ratio without tiling doubles as the image width doubles. Tiling is nearly optimal for all sizes and formats.
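A trace-driven hit-ratio simulator of this kind is compact. The sketch below is our reconstruction, not the original tool: a two-way set associative cache with LRU replacement. For the tiling runs, each traced address is first passed through a translation such as tile_address from section 2.1; that is the only difference from the regular runs.

    #include <stdint.h>

    #define LINE_BYTES 32
    #define WAYS 2

    typedef struct { uint32_t tag[WAYS]; uint8_t valid[WAYS], lru; } Set;

    static Set sets[1 << 16];        /* statics zero-initialize; up to a 4 MB cache */
    static uint64_t hits, accesses;

    void touch(uint32_t addr, uint32_t nsets)   /* nsets = cache_size/(WAYS*LINE_BYTES) */
    {
        uint32_t line = addr / LINE_BYTES;
        Set *s = &sets[line % nsets];
        uint32_t tag = line / nsets;
        accesses++;
        for (int w = 0; w < WAYS; w++)
            if (s->valid[w] && s->tag[w] == tag) { hits++; s->lru = 1 - w; return; }
        int victim = s->lru;         /* replace the least recently used way */
        s->valid[victim] = 1;
        s->tag[victim] = tag;
        s->lru = 1 - victim;
    }

    /* hit ratio after feeding a trace: (double)hits / accesses */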

5 Exact Cache Analysis

Dilation and erosion are regular and predictable, so an analytical analysis can determine the cache hit ratio for ideal cache behavior. This is done to show that tiling approximates ideal cache behavior for the algorithms of interest and is therefore optimal. Dilation and erosion are used to determine closed form solutions for hit ratios, assuming grey scale structuring elements. Such structuring elements do not allow row and column decomposition of dilation and erosion, so the overall computation is more demanding: the calculation requires the explicit use of all members of the structuring element. These assumptions are valid for grey scale operations with large structuring elements that cannot be stored in registers. The results can be interpreted for processing where the structuring element is held in registers, or that does not iterate through the structuring element, simply by removing the structuring element reads from the analysis.

A structuring element is centered at (k = k0, l = l0), where the structuring element is indexed by k = 1 to m and l = 1 to n. The center affects the number of reads and cache misses for dilation and erosion when the input or the output is limited. We consider erosion first and then dilation for the number of reads, writes, and cache misses.

Two different sizes of cache are used. To simplify the derivations, the caches use OPT, or optimal replacement (Stone 1987), and are fully associative. In this way dependencies on memory alignment and order of replacement do not affect the analysis, and no cache may perform better than OPT. Cache 1 holds the input window, m · n pixels, and the structuring element, m · n, so cache_size = 2 · m · n. The second cache, Cache 2, has cache_size = mC + mn, where C is the number of columns in the input; the number of columns changes for each variety of algorithm, as discussed later. This cache holds all rows of interest and optimally caches the input. Cache 1 is a fully associative cache that holds the window of interest; tiling caches approximate Cache 1. Cache 2 gives the best performance that can be expected with any cache. Cache 1 and Cache 2 have b bytes per cache line. The units of m, n, and C are pixels. If a pixel does not equal a byte, then the bytes per line factor b should contain a pixel to byte conversion, b (pixel/byte)(byte/line), for the computation of misses. This section concludes by directly comparing the analytical models to the trace analysis of section 4.


5.1 Erosion

To derive the number of reads and writes, we examine the input and result of erosion. For grey scale erosion the input is rows × cols, and the result, (rows − m + 1) × (cols − n + 1), is smaller than the input. The result's coordinate frame is translated relative to the original input by (k0 − 1, l0 − 1). The images used for erosion are shown in FIGURE 13. For each result pixel, we read m · n input pixels and m · n structuring element values. The numbers of reads and writes for the erosion are

$$\#(\text{reads}) = 2(rows - m + 1)(cols - n + 1) \cdot m \cdot n,$$  (EQ 13)

$$\#(\text{writes}) = (rows - m + 1)(cols - n + 1).$$  (EQ 14)

Assuming that the image is processed in row order, we need to read an m × cols window each time one row of output is processed. To read the structuring element, we need to perform mn/b reads. The number of bytes per cache line, b, helps to significantly reduce the number of misses. The number of read misses for Cache 1 is

$$\text{Cache 1:}\quad \#(\text{misses}) = (rows - m + 1)\, cols \cdot \frac{m}{b} + \frac{mn}{b}.$$  (EQ 15)

Cache 2 is large enough to hold all rows in the window of interest, so the entire image is read only once. The width of the input gives C = cols. The number of read misses is

$$\text{Cache 2:}\quad \#(\text{misses}) = rows \cdot \frac{cols}{b} + \frac{mn}{b}.$$  (EQ 16)

Calculating the hit ratio from these equations is straightforward. We count writes as misses, and the number of hits is the total number of reads minus the number of cache read misses. For both Cache 1 and Cache 2 the hit ratio is

$$\text{hit ratio} = \frac{\#(\text{reads}) - \#(\text{misses})}{\#(\text{reads}) + \#(\text{writes})}.$$  (EQ 17)

5.2 Dilation

Using the same derivation as for erosion, the reads, writes, and misses for dilation are calculated. For dilation, a rows × cols image processed with an m × n structuring element requires as input an enlarged (rows + 2m − 2) × (cols + 2n − 2) image. The result is larger than the original input, with size (rows + m − 1) × (cols + n − 1). The expanded input and output images are shown in FIGURE 14. The expanded image can be created by inserting the minimum pixel value before dilation, or the edges of the original image may be extended. The expanded input image coordinates are shifted in relation to the original image by (−m + 1, −n + 1). The numbers of reads, writes, and misses are (with C = cols + 2n − 2):

$$\#(\text{reads}) = 2(rows + m - 1)(cols + n - 1) \cdot m \cdot n,$$  (EQ 18)

$$\#(\text{writes}) = (rows + m - 1)(cols + n - 1),$$  (EQ 19)

$$\text{Cache 1:}\quad \#(\text{misses}) = (rows + m - 1)(cols + 2n - 2)\frac{m}{b} + \frac{mn}{b},$$  (EQ 20)

$$\text{Cache 2:}\quad \#(\text{misses}) = \frac{(rows + 2m - 2)(cols + 2n - 2)}{b} + \frac{mn}{b}.$$  (EQ 21)

As an example, FIGURE 15 depicts dilation of a 4x4 image by a 2x3 structuring element with k0 = 2, l0 = 3. FIGURE 15 (A) shows the reads of the input image only; the number of reads is twice that amount when the structuring element reads are added. FIGURE 15 (B) shows the number of misses for Cache 1, and FIGURE 15 (C) shows the number of misses for Cache 2. To perfectly cache the image means having to read the image only once; because the line size is b = 1, every pixel must be read. (EQ 18), (EQ 19), (EQ 20), and (EQ 21) are used to calculate the reads, writes, and misses, and (EQ 17) gives the hit ratio, which for this example is 0.702564 for Cache 1 and 0.784615 for Cache 2.
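The example is easy to check numerically. This throwaway calculation (ours, not from the paper) evaluates (EQ 18) through (EQ 21) and (EQ 17) for the FIGURE 15 parameters and prints the two ratios quoted above.

    #include <stdio.h>

    int main(void)
    {
        double rows = 4, cols = 4, m = 2, n = 3, b = 1;   /* FIGURE 15 parameters */
        double reads  = 2 * (rows + m - 1) * (cols + n - 1) * m * n;            /* EQ 18 */
        double writes = (rows + m - 1) * (cols + n - 1);                        /* EQ 19 */
        double miss1  = (rows + m - 1) * (cols + 2*n - 2) * m / b + m * n / b;  /* EQ 20 */
        double miss2  = (rows + 2*m - 2) * (cols + 2*n - 2) / b + m * n / b;    /* EQ 21 */
        printf("Cache 1: %f\n", (reads - miss1) / (reads + writes));  /* 0.702564 */
        printf("Cache 2: %f\n", (reads - miss2) / (reads + writes));  /* 0.784615 */
        return 0;
    }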

5.3 Dilation With Restricted Input and Output

For memory limited applications, the dilation can be limited to produce an output of only rows × cols, and not the (rows + m − 1) × (cols + n − 1) image. The dilation operates on a rows × cols image and creates a rows × cols image. We consider processing in this manner and derive optimal hit ratios for dilation. Both the rows × cols input and the extended (rows + m − 1) × (cols + n − 1) input cases are considered, as shown in FIGURE 16; both produce a rows × cols output image. The output is in the same position as the input. The expanded input has its coordinates translated by (k0 − m, l0 − n) in relation to the original input.

The number of reads of input data may be calculated by summing the number of reads for each pixel. We derive the number of reads, and the number of misses in the cache, to calculate the cache hit ratios. The number of reads forms a two dimensional histogram that returns the number of reads for any pixel in the enlarged input image when dilating to produce a rows × cols image. The histogram of reads is given by

$$F(i, j) = \begin{cases} i \cdot f(j) & 1 \le i \le m - 1 \\ m \cdot f(j) & m \le i \le rows \\ [(rows + m) - i] \cdot f(j) & rows + 1 \le i \le rows + m - 1 \end{cases} \qquad f(j) = \begin{cases} j & 1 \le j \le n - 1 \\ n & n \le j \le cols \\ (cols + n) - j & cols + 1 \le j \le cols + n - 1 \end{cases}$$  (EQ 22)

If we consider all of the reads in the extended image plus the structuring element reads, we have

$$\#(\text{reads}) = 2\sum_{i=1}^{rows+m-1}\ \sum_{j=1}^{cols+n-1} F(i, j) = 2 \cdot rows \cdot cols \cdot m \cdot n.$$  (EQ 23)

If we want only the reads that fall within the original rows × cols image, as in the conditional dilation used in sections 2.3 and 4, we limit the range of F(i, j). For structuring elements centered at (k = k0, l = l0), consider the general form

$$\#(\text{reads}) = 2\sum_{i=m-k0+1}^{rows+m-k0}\ \sum_{j=n-l0+1}^{cols+n-l0} F(i, j).$$  (EQ 24)

If the structuring element is centered in the upper left hand corner (k0 = 1, l0 = 1) or the lower right hand corner (k0 = m, l0 = n), the number of reads in the image and in the structuring element is

$$\#(\text{reads}) = 2\sum_{i=1}^{rows}\sum_{j=1}^{cols} F(i, j) = 2\left[\, \sum_{i=1}^{m-1}\sum_{j=1}^{n-1} ij + \sum_{i=1}^{m-1}\sum_{j=n}^{cols} in + \sum_{i=m}^{rows}\sum_{j=1}^{n-1} jm + \sum_{i=m}^{rows}\sum_{j=n}^{cols} mn \,\right]$$
$$= 2\left(rows - \frac{m}{2} + \frac{1}{2}\right)\left(cols - \frac{n}{2} + \frac{1}{2}\right) \cdot m \cdot n.$$  (EQ 25)

(EQ 24) and (EQ 25) give the number of reads of pixels in the input image plus the number of reads of the structuring element. There are closed form solutions for each of the ranges, so four solutions can be derived from (EQ 24). The ranges are: k0 = 1 or m, l0 = 1 or n (EQ 25); 1 < k0 < m, l0 = 1 or n; k0 = 1 or m, 1 < l0 < n; and 1 < k0 < m, 1 < l0 < n. Dilating an image with a dense structuring element requires that many reads. The number of writes for the dilation is independent of the form of the input. For both the expanded input and the rows × cols input, the number of writes is

$$\#(\text{writes}) = rows \cdot cols,$$  (EQ 26)

because we have chosen to limit the size of the result to the size of the input.

Next we derive the number of misses for Cache 1 using both input cases shown in FIGURE 16. The number of misses depends upon both the initial structuring element reads and the input image reads. For processing the extended input image (rows + m − 1) × (cols + n − 1), the number of cache read misses while processing with the cache of size 2 · m · n is

$$\text{Cache 1:}\quad \#(\text{misses}) = rows \cdot m \cdot \frac{cols + n - 1}{b} + \frac{m \cdot n}{b}.$$  (EQ 27)

For the rows × cols input, the number of misses is

$$\text{Cache 1:}\quad \#(\text{misses}) = \left[\, rows \cdot m - \frac{1}{2}k0(k0 - 1) - \frac{1}{2}(m - k0 + 1)(m - k0) \,\right]\frac{cols}{b} + \frac{m \cdot n}{b}.$$  (EQ 28)

Cache 2 holds all rows of interest, with C = cols + n − 1 for the extended input and C = cols for the reduced input. The number of read misses in Cache 2 for the two input variations is

$$\text{Cache 2:}\quad \#(\text{misses}) = \frac{(rows + m - 1)(cols + n - 1)}{b} + \frac{mn}{b},$$  (EQ 29)

$$\text{Cache 2:}\quad \#(\text{misses}) = \frac{rows \cdot cols}{b} + \frac{mn}{b},$$  (EQ 30)

for extended input images and reduced input, respectively. Again, the hit ratios are computed using (EQ 17). FIGURE 17 shows an example of the number of reads (FIGURE 17 A) and the misses for Cache 1 (FIGURE 17 B) and Cache 2 (FIGURE 17 C). The structuring element is centered at (k0 = m, l0 = n). Only image reads, and not structuring element reads or result writes, are shown, to simplify the example. Hit ratios calculated from (EQ 25), (EQ 26), (EQ 28), (EQ 30), and (EQ 17) are 0.647887 for Cache 1 and 0.732394 for Cache 2.

5.4 Snake-Like Processing for Reducing Misses

When the cache is large enough to hold the window of interest for calculating one result, the input is read only once per result (Cache 1). The input will be read again when the rows are traversed for following results. We may save a small number of misses by changing the direction of processing on each row. This is called snake-like row major processing because the coverage path of the input winds (Hwang and Briggs 1984). Normal processing, as given by the algorithm in section 2.3, and snake-like processing are shown in FIGURE 18. Snake-like processing appears quite beneficial for small examples, but for typical sized images the savings are small and the programming is more difficult. The number of misses saved for the reduced rows × cols input producing a reduced rows × cols output is given by

$$\#(\text{misses saved}) = (rows - 1)(m - 1)\frac{n}{b}.$$  (EQ 31)

The savings for the dilation in section 4 would have been only 7,905 misses, or 12%, using m = n = 32 and b = 32, because the structuring element has to be in units of cache lines, or 32 bytes. The hit ratio would improve from 0.96875 to 0.9726. In addition to snake-like processing by rows, the structuring element may be traversed in this way so that values are reused as much as possible. By not using snake-like processing to evaluate the number of reads, we present a worst case estimate of the number of reads, which is consistent with avoiding compiler and programming dependence.
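In code, snake-like traversal is only a change of loop direction on alternate rows; process(i, j) in this sketch is a hypothetical stand-in for the per-pixel work of FIGURE 4.

    /* snake-like row major traversal (FIGURE 18) */
    void snake(int rows, int cols)
    {
        for (int i = 0; i < rows; i++) {
            if (i % 2 == 0)
                for (int j = 0; j < cols; j++) process(i, j);       /* left to right */
            else
                for (int j = cols - 1; j >= 0; j--) process(i, j);  /* right to left */
        }
    }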

5.5 Analytical Analysis and Trace Analysis Comparison

To examine how the analytical results compare to practice, we consider the statistics from the dilation traces presented in section 4. First, we consider the impact of input image reads; the results are shown in TABLE 2 and TABLE 3. The trace driven simulation matches the analytical results for the appropriate cache sizes. Both the trace (for the 2 k and 4 k tiling caches) and (EQ 28) (using only the term for input image reads) give 63488 misses. Because the tiling caches have the same number of misses, the hit ratios are the same also: from (EQ 17) both have hit ratios of 0.96875.


TABLE 2  256x256 Dilation Analytical and Trace Hit Ratios, Input Only

                   2 k       4 k       8 k       16 k      32 k
Regular Trace      0.0       0.01406   0.94040   0.99899   0.99899
Tiling Trace       0.96875   0.96875   0.99749   0.99899   0.99899
Cache 1 (2k)       0.96875   0.96875   -         -         -
Cache 2 (8K+32)    -         -         0.99899   0.99899   0.99899

TABLE 3  256x256 Dilation #(misses), Input Only

                   2 k       4 k       8 k       16 k      32 k
Regular Trace      2031616   2003056   121088    2048      2048
Tiling Trace       63488     63488     5108      2048      2048
Cache 1 (2k)       63488     63488     -         -         -
Cache 2 (8K+32)    -         -         2048      2048      2048

The reason the regular cache does not match Cache 1 is that the mapping thrashes, as discussed in section 1. Tiling gives the same results because tiling approximates the full associativity used in the analytical models. The second cache model, Cache 2, is duplicated by the caches that optimally cache the input. When considering only the input reads, the trace exactly matches the analytical results: caches of 16 k or greater, tiling or regular, give 2048 misses, 2,029,568 hits, and a hit ratio of 0.99899 for input reads only. The 8 k tiling cache is slightly less efficient than optimal because the order of processing returns to the image edges after the middle has been read; loop unrolling is used to optimize the algorithm. The equations for the second cache model give 2048 misses (EQ 30), 2,031,616 reads, and likewise a hit ratio of 0.99899.

The complete traces, including structuring element reads and image writes, show how the model matches the trace. Image reads, structuring element reads, and result writes are used for the hit ratios given in TABLE 4, calculated from (EQ 17). The analytical cache model, Cache 1, and the 4 k tiling cache trace analysis both yield a hit ratio of 0.96875. The 2 k tiling cache yielded 0.937912; it does not match the analytical result because the structuring element gets replaced several times as it collides with the input image, and in addition the order of processing in the input image returns to the edges as mentioned before. The 8 k tiling cache was nearly optimal, 0.98340 versus 0.98363, but lacks storage for the 32 byte structuring element, which misses. Cache 2 yields a hit ratio of 0.983631. From the trace analysis, all caches of 16 k or greater, tiling or regular, give hit ratios of 0.983630. A cache size of 16 k yields the optimal hit ratio predicted by Cache 2; the cache size needed to yield the optimal hit ratio is 8 k (rows of interest) + 32 bytes (structuring element size).

TABLE 4  256x256 Dilation Analytical and Trace Hit Ratios

                   2 k       4 k       8 k       16 k      32 k
Regular Trace      0.48827   0.49769   0.97516   0.98363   0.98363
Tiling Trace       0.93791   0.96875   0.98340   0.98363   0.98363
Cache 1 (2k)       0.96875   0.96875   -         -         -
Cache 2 (8K+32)    -         -         0.98363   0.98363   0.98363

The analytical results predict the trace analysis. They differ in cases where the cache cannot approximate a fully associative cache, where the structuring element and image overlap, or where the processing does not truly cover the entire image in increasing row or column order. The read only trace shows that the first cache model is valid for tiling caches, and not regular caches; this shows how tiling helps to approximate full associativity. When the cache is large enough for all rows, Cache 2 predicts performance for both tiling and regular caches.

6 Conclusions

Tiling improves the hit ratios of small caches, as shown through analysis. Optimal caching occurs when the cache is large enough to store all rows of the image and the structuring element being used in processing; the data are then read exactly once. Windowed processing allows near optimal caching with much less storage, and is advantageous when the data structures being processed are large but the window of processing is not. We showed how to achieve ideal windowed caching with direct mapped and associative caches through cache tiling. Trace analysis duplicated the hit ratios of the derived analysis, confirming the analytical model, and showed that tiling matches the fully associative cache that holds the window. A larger percentage of hits speeds up processing. Both tiling and partitioning allow smaller on chip caches to achieve the same performance as larger external caches. The implementation is straightforward, and can be minimized for specific requirements to reduce chip area. Tiling remains useful for on chip cache control because as cache sizes increase, image sizes will also increase.

Tiling improves caching for any image or matrix processing that reuses data. The same improvements as shown for dilation will carry over to other morphological and matrix algorithms. For example, in calculating the matrix product C = AB, the matrices must be accessed by row (A) and by column (B). One way to make this efficient for memory accesses is to transpose B before the multiplication, so that all data fetched into a line are used; this requires reformatting the data, similar to software tiling. Otherwise, the column traversed matrix will miss on every read, while the row traversed matrix will miss on each new line, with all elements within a line being hits. Depending upon the size of the matrix and the size of the cache, the values to be cached can be A and/or B and/or C. As one alternative, by caching A and B we can avoid caching the writes of C. Using cache tiling, matrix B is tiled to maximize the number of columns held in the cache: the cache remaps the addresses to store the data by columns. Because we tile by at least the line size, the rows × line_size elements stored in the cache will all be saved, and following column reads of B will hit. Thus cache tiling avoids the reformatting step, allows natural indexing in the user program, and provides highly efficient memory system performance. With larger matrices, where an entire row cannot be stored in the cache, we can use block processing, and tiling will allow the blocks to be stored in the cache without reformatting. Other image and windowed processing, where the cache cannot store the entire data structure but may store the window of processing (in this example a column or block), can use tiling to achieve fully associative performance with direct mapped or set associative caches.

Higher cache efficiency improves MIMD shared memory performance for image processing. Furthermore, tiling affects only the private copy of the data, thereby not impacting other processors, shared memory, or DMA devices. Bus snoopers and other coherency protocols are independent of the tiling, but must use the same mapping as the tiling translator to compare tags. Tiling is a method of increasing the apparent associativity of the cache for matrices and images. Further research may show how tiling with a direct mapped cache removes dependence upon the order of replacement, a weakness of set associative caches; compiler optimization of variable use can be defeated if the cache replacement policy is not deterministic. We will also analyze other algorithms using tiling caches.

References

Abbott L, Haralick RM, Zhuang X (1988) Pipeline Architectures for Morphological Image Analysis. Machine Vision and Applications 1:23–40

Batcher KE (1980) Design of a Massively Parallel Processor. IEEE Transactions on Computers 29(9):836–840

Blinn JF (1990) Jim Blinn's Corner: The Truth About Texture Mapping. IEEE Computer Graphics and Applications, pp 78–83, March

Flickner M, Lavin M, Das S (1990) An Object-oriented Language for Image and Vision Execution (OLIVE). In: 10th Annual IEEE International Conference on Pattern Recognition, June

Haralick RM, Shapiro LG (1992) Computer and Robot Vision, Vol. 1. Addison-Wesley, Reading, MA, pp 157–262

Hennessy JL, Patterson DA (1990) Computer Architecture: A Quantitative Approach. Morgan Kaufmann, San Mateo, CA, pp 416–419

Hwang K, Briggs FA (1984) Computer Architecture and Parallel Processing. McGraw-Hill, pp 362–363

Jain AK (1989) Fundamentals of Digital Image Processing. Prentice Hall, Englewood Cliffs, NJ

Lam M, Rothberg EE, Wolf ME (1990) The Cache Performance and Optimizations of Blocked Algorithms. In: Proceedings of ASPLOS-IV, pp 63–74

Loughheed RM, McCubbrey DL (1980) The Cytocomputer: A Practical Pipelined Image Processor. In: Proceedings of the 7th Annual Symposium on Computer Architecture, pp 271–277

Nudd GR, Atherton TJ, Francis ND, Howarth RM, Kerbyson DJ, Packwood RA, Vaudin GJ (1989) WPM: A multiple-SIMD architecture for image processing. In: Proceedings Third International IEE Conference on Image Processing and Its Applications, London, pp 161–165, July

Sato H, Okazaki H, Kawai T, Yamamoto H, Tamura H (1990) The VIEW-Station Environment: Tools and Architecture for a Platform-Independent Image-Processing Workstation. In: 10th Annual IEEE International Conference on Pattern Recognition, pp 576–583, June

Somani AK, Wittenbrink CM, Haralick RM, Shapiro LG, Hwang JN, Chen CH, Johnson R, Cooper K (1991) Proteus System Architecture and Organization. In: Proceedings of the Fifth International Parallel Processing Symposium, pp 276–284, April

Stone HS (1987) High Performance Computer Architecture. Addison-Wesley, Reading, MA

Weems CC, Levitan SP, Hanson AR, Riseman EM, Nash JG, Shu DB (1987) The Image Understanding Architecture. COINS Technical Report 87–76, University of Massachusetts

Wittenbrink CM (1990) Directed Data Cache for High Performance Morphological Image Processing. Master's thesis, Dept. of Electrical Engineering, University of Washington, Seattle, WA


Figures

FIGURE 1  Dilation (numeric example of a two-point input I, a structuring element SE centered on the left point, and the result I ⊕ SE)

FIGURE 2  Mosaic, Image and Tile Notation (the row major image with (row, column) addresses (0,0) through (rows−1, cols−1); the tiled image with tile coordinates (r, s); and an expanded m × n tile with within-tile coordinates (k, l))

FIGURE 3  Cache Architecture (the address is split into Tag, Set, and Word fields; the tag store comparison signals a hit and selects the data out, or goes to memory on a miss)

for i = 0 to rows+m-2
    for j = 0 to cols+n-2
        Dilation(i, j) = 0
        for k = 0 to m-1
            for l = 0 to n-1
                ii = i - k + k0
                jj = j - l + l0
                if (ii < rows && ii >= 0) && (jj < cols && jj >= 0)
                    Dilation(i, j) = Max[ Dilation(i, j), Image(ii, jj) + SE(k, l) ]

FIGURE 4  Dilation program

FIGURE 5  Caching into an 8 line direct mapped cache (the 8x8 image pixels 0–63 map onto lines 0–7; the order of cache line replacement shows window pixels p0–p15 repeatedly displacing one another while half the cache goes unused)

FIGURE 6  Caching into an 8 line cache with tiling (the image, its tile addresses and tile locations, and the order of cache line replacement; the window pixels p0–p15 occupy all 8 lines)

FIGURE 7  Address Transformation for a 2x4 tile: ⟨i2 i1 i0 j2 j1 j0⟩ → ⟨i2 i1 j2 i0 j1 j0⟩

FIGURE 8  The Cache Architecture with Tiling (a tile translator, controlled by a mode input, remaps the address ahead of the Tag/Set/Word split of FIGURE 3)

FIGURE 9  Partitioning and Tiling (the six address fields q, p, r, k, s, l of the row,column address and their rearrangement in the tiled order, with the tag, set, and byte in line boundaries marked)

FIGURE 10  Multiplexing address solution (a multiplexor per output bit selects among the input address bits b_right through b_right + #(bits) − 1 to produce the r, s, p, k, l fields)

FIGURE 11  Hit Ratios for Tiling and Regular Caches (256x256 image, image reads only)

FIGURE 12  Hit Ratios for Tiling and Regular Caches (all image sizes; reads, writes, and structuring element reads)

FIGURE 13  Image input and output for erosion (the rows × cols original input and the (rows − m + 1) × (cols − n + 1) erosion output, offset by (k0 − 1, l0 − 1))

FIGURE 14  Image expansion for input to dilation for a (rows + m − 1) × (cols + n − 1) result (the rows × cols original input, the (rows + 2m − 2) × (cols + 2n − 2) expanded input offset by (−m + 1, −n + 1), and the dilation output)

FIGURE 15  Example of cache read misses for full dilation (rows = cols = 4; 2x3 structuring element; b = 1)

(A) Reads            (B) Cache 1 misses    (C) Cache 2 misses
1 2 3 3 3 3 2 1      1 1 1 1 1 1 1 1       1 1 1 1 1 1 1 1
2 4 6 6 6 6 4 2      2 2 2 2 2 2 2 2       1 1 1 1 1 1 1 1
2 4 6 6 6 6 4 2      2 2 2 2 2 2 2 2       1 1 1 1 1 1 1 1
2 4 6 6 6 6 4 2      2 2 2 2 2 2 2 2       1 1 1 1 1 1 1 1
2 4 6 6 6 6 4 2      2 2 2 2 2 2 2 2       1 1 1 1 1 1 1 1
1 2 3 3 3 3 2 1      1 1 1 1 1 1 1 1       1 1 1 1 1 1 1 1

FIGURE 16  Image expansion for input to dilation for a rows × cols result (the rows × cols original input and the (rows + m − 1) × (cols + n − 1) expanded input, offset by (k0 − m, l0 − n); both produce a rows × cols output)

FIGURE 17  Example of Cache Read Misses for Restricted Input and Output (rows = cols = 4; 2x3 structuring element with k0 = 2, l0 = 3)

(A) Reads       (B) Cache 1     (C) Cache 2
1 2 3 3 2 1     1 1 1 1 1 1     1 1 1 1 1 1
2 4 6 6 4 2     2 2 2 2 2 2     1 1 1 1 1 1
2 4 6 6 4 2     2 2 2 2 2 2     1 1 1 1 1 1
2 4 6 6 4 2     2 2 2 2 2 2     1 1 1 1 1 1
1 2 3 3 2 1     1 1 1 1 1 1     1 1 1 1 1 1

FIGURE 18  Normal Traversal vs. Snake-like (normal row wise processing restarts each row from the same side; snake-like processing reverses direction on alternate rows)
