Data Cache and Direct Memory Access in Programming Mediaprocessors

Mediaprocessors provide high performance by using both instruction- and data-level parallelism. Because of the increased computing power, transferring data between off- and on-chip memories without slowing down the core processor's performance is challenging. Two methods, data cache and direct memory access, address this problem in different ways.
Donglok Kim, Ravi Managuli, and Yongmin Kim, University of Washington
Mediaprocessors are programmable processors specifically designed to provide flexible, cost-effective, high-performance computing platforms for multimedia computing.1,2 Multimedia streams carry enormous amounts of data, and efficiently handling that data is increasingly important in high-performance multimedia computing. Moreover, the data- and instruction-level parallelism in mediaprocessors shortens processing time, making more functions and applications I/O bound rather than compute bound. The growing speed disparity between the processor and off-chip memory further exacerbates the problem of how to move data efficiently. Most general-purpose processors use data caches to transfer data between slower off-chip and faster on-chip memories. Some digital signal processors (DSPs), on the other hand, use direct memory access (DMA) controllers to move data between off- and on-chip memories, controlling memory access more directly than the data cache mechanism does. DMA's direct
control allows a predictable access time, which is critical in many real-time DSP applications. A double-buffering technique lets a DMA controller work independently of the core processing unit. Most mediaprocessors have either a data cache or a DMA controller, but not both, so programmers must use whichever is provided. However, some mediaprocessors support both data cache and DMA on a single chip—for example, the MAP1000 and the TMS320C64x.3,4 In this case, programmers must answer three questions:

• How much performance can the selected data transfer mechanism deliver?
• How can programmers optimally use the data cache or DMA capability?
• What additional programming effort is required to reduce the memory access overhead by developing cache- or DMA-based programs?

We compare the data cache and DMA
methods by selecting several image computing functions whose memory access patterns are representative of most other functions. We implement the selected functions on the MAP1000 with both cache- and DMA-based approaches.

2D block transfer DMA mode and data cache
The most frequently used DMA mode in image computing is the 2D block transfer. When an entire input image does not fit on a chip, the DMA controller transfers the input image from off- to on-chip memory, and the core processor processes it in small blocks. The 2D block transfer DMA then moves each processed block back to the off-chip memory. Moreover, the DMA controller can manage the data movements concurrently with the processor's computation. Figure 1 illustrates this technique, called double buffering. The processor allocates four buffers, two for input blocks (ping_in_buffer and pong_in_buffer) and two for output blocks (ping_out_buffer and pong_out_buffer), in the on-chip memory. While the core processor works on a current image block (for example, block 2) from pong_in_buffer and stores the result in pong_out_buffer, the DMA controller moves the previously calculated output block (block 1) in ping_out_buffer to the external memory and brings the next input block (block 3) from the external memory into ping_in_buffer. When the computation and data movements are complete, the core processor and the DMA controller switch buffers, with the core processor using the ping buffers and the DMA controller working on the pong buffers.

Figure 1. Double buffering in the data flow. The DMA controller is loading block #3 while storing block #1. The core processor is processing block #2.
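To make this control flow concrete, the following minimal C sketch implements the same ping-pong scheme. The dma_start_load, dma_start_store, dma_wait, and process_block names are hypothetical placeholders, not the MAP1000's actual Data Streamer interface; only the double-buffering structure mirrors the description above.

```c
#include <stddef.h>

#define BLOCK_BYTES 4096                        /* illustrative block size   */

/* Hypothetical DMA primitives; a handle identifies an in-flight transfer.  */
typedef int dma_handle;
dma_handle dma_start_load(void *onchip_dst, const void *offchip_src, size_t n);
dma_handle dma_start_store(void *offchip_dst, const void *onchip_src, size_t n);
void dma_wait(dma_handle h);                    /* block until transfer done */
void process_block(const unsigned char *in, unsigned char *out, size_t n);

void run_double_buffered(const unsigned char *src, unsigned char *dst,
                         size_t nblocks)
{
    /* Four on-chip buffers: ping/pong pairs for input and output. */
    static unsigned char ping_in_buffer[BLOCK_BYTES], pong_in_buffer[BLOCK_BYTES];
    static unsigned char ping_out_buffer[BLOCK_BYTES], pong_out_buffer[BLOCK_BYTES];
    unsigned char *in[2]  = { ping_in_buffer,  pong_in_buffer  };
    unsigned char *out[2] = { ping_out_buffer, pong_out_buffer };
    dma_handle store[2] = { -1, -1 };

    dma_handle load = dma_start_load(in[0], src, BLOCK_BYTES);
    for (size_t b = 0; b < nblocks; b++) {
        int cur = (int)(b & 1);                 /* buffer pair the core uses */
        dma_wait(load);                         /* input block b is on chip  */
        if (b + 1 < nblocks)                    /* prefetch block b+1 into   */
            load = dma_start_load(in[cur ^ 1],  /* the other input buffer    */
                                  src + (b + 1) * BLOCK_BYTES, BLOCK_BYTES);
        if (store[cur] != -1)
            dma_wait(store[cur]);               /* out[cur] already drained  */
        process_block(in[cur], out[cur], BLOCK_BYTES);
        store[cur] = dma_start_store(dst + b * BLOCK_BYTES, out[cur],
                                     BLOCK_BYTES);
    }
    if (store[0] != -1) dma_wait(store[0]);     /* drain the last stores     */
    if (store[1] != -1) dma_wait(store[1]);
}
```

While the core works on the current block, the next input block streams in and the previous output block streams out, exactly the overlap Figure 1 depicts.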
On the other hand, several programming techniques use the data cache efficiently—for example, blocking, loop fusion, and loop interchange.5 In particular, blocking lets the program process the data, block by block, increasing the memory access locality. Blocking is useful when the processor is accessing large 2D data in column-major order. Figure 2 illustrates the blocking technique used for 2D image transposition. Figure 2a shows the transposition of a 2D image as the processor accesses the input image, row by row, and stores each row in column-major order. Each store operation requires loading the entire cache line in the data cache (we assume a write-allocate data cache). If the data cache is smaller than N × cache line size, where N is the number of rows in the input image, then excessive cache misses could occur as more columns are accessed. Figure 2b shows the blocking mechanism, which accesses the input image in 2D blocks and transposes it, storing the results in the output image. Sizing the two 2D blocks—one for the input image and another for the output image—to fit in the data cache improves cache utilization and performance.

Figure 2. Image transposition using data cache: without (a) and with (b) blocking.
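A minimal C sketch of the blocked transposition follows; the 32-pixel tile size is illustrative, and in practice it would be chosen so that an input tile and an output tile together fit in the data cache:

```c
#define TILE 32   /* illustrative: pick so two TILE x TILE tiles fit in cache */

/* Transpose an n x n byte image tile by tile so that both the input and the
 * output tiles stay cache-resident while they are being touched. */
void transpose_blocked(const unsigned char *in, unsigned char *out, int n)
{
    for (int bi = 0; bi < n; bi += TILE)
        for (int bj = 0; bj < n; bj += TILE)
            for (int i = bi; i < bi + TILE && i < n; i++)
                for (int j = bj; j < bj + TILE && j < n; j++)
                    out[j * n + i] = in[i * n + j];   /* out(j, i) = in(i, j) */
}
```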
MAP1000
The MAP1000 architecture supports both data cache and DMA mechanisms. With the DMA controller (called the Data Streamer in the MAP1000), shown in Figure 3, the 16-Kbyte data cache can act as on-chip memory, with multiple ping-pong blocks allocated in the MAP1000 data cache for double buffering. This dual capability on the same platform lets us objectively evaluate the advantages and disadvantages of cache- and DMA-based approaches in implementing image computing functions.

Figure 3. MAP1000 block diagram. The core processor consists of two clusters (each having two execution units, an IALU and an IFGALU) and 16-Kbyte data and 16-Kbyte instruction caches. The MAP1000 can use either conventional cache miss service or DMA (using the Data Streamer) to transfer data between the data cache, SDRAM, and two PCI ports. (DMA: direct memory access; IALU: integer arithmetic logic unit; IFGALU: integer floating-point graphics arithmetic logic unit; PCI: peripheral component interconnect; SDRAM: synchronous dynamic RAM.)
Image computing functions and data flow
The image computing functions selected for evaluating cache- and DMA-based approaches include 2D convolution, affine warp, 2D fast Fourier transform (FFT), invert,
and add. These functions have different data flows and reveal the relative advantages and disadvantages of the two approaches in handling data in a mediaprocessor.
2D convolution
Convolution computes each output pixel as a weighted average of several neighboring input pixels. In its simplest form, the generalized 2D convolution of an input image with an M × M convolution kernel, where M is the kernel width and height, is defined as

$$b(x, y) = \frac{1}{s} \sum_{i=x}^{x+M-1} \sum_{j=y}^{y+M-1} f(i, j)\, h(i - x,\, j - y)$$
where f is the input image, h is the kernel, s is the scaling factor, and b(x, y) is the convolved output pixel value at location (x, y). Managuli et al. discuss optimally mapping this algorithm on mediaprocessors using powerful inner-product instructions.6 The data access pattern for convolution is regular. For example, convolution with a 3 × 3 kernel on the current pixel (x5) requires eight neighboring
pixels (x1 through x9), as shown in Figure 4. Also, computing this equation for the next adjacent pixel (x8) requires a new set of eight neighboring pixels (x4 through x12). Both the 2D-block-transfer DMA and the cache-based approach would work well in moving the data between the off- and on-chip memories.

Figure 4. Convolution data access pattern. Variables x1 to x12 are input pixels. A 3 × 3 convolution kernel (M = 3) is overlaid on the input image with its center on x5. The convolved output pixel that corresponds to the current location of the kernel center is obtained from the sum of products between the input pixel values and overlapped kernel values. The kernel is then moved horizontally or vertically, and the process repeats.
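For reference, here is a direct scalar rendering of the convolution equation in C. The optimized MAP1000 implementation instead uses the inner-product instructions discussed in reference 6, and border handling (the last M − 1 rows and columns) is omitted here for brevity:

```c
/* Generalized M x M convolution: b(x, y) = (1/s) * sum_i sum_j f(i, j) *
 * h(i - x, j - y), computed wherever the kernel fits inside the image. */
void convolve2d(const unsigned char *f, const short *h, unsigned char *b,
                int width, int height, int M, int s)
{
    for (int y = 0; y + M <= height; y++) {
        for (int x = 0; x + M <= width; x++) {
            int acc = 0;
            for (int j = y; j < y + M; j++)          /* kernel rows    */
                for (int i = x; i < x + M; i++)      /* kernel columns */
                    acc += f[j * width + i] * h[(j - y) * M + (i - x)];
            acc /= s;                                /* scaling factor */
            b[y * width + x] =
                (unsigned char)(acc < 0 ? 0 : acc > 255 ? 255 : acc);
        }
    }
}
```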
Affine warp
A subset of image warping algorithms, affine warp redefines the spatial relationship between points in the input image to generate the resultant output image while preserving parallel lines. Affine warp transformations include rotation, scaling, shearing, flipping, and translation—applied individually or in any combination—as shown in Figure 5. A mathematical equation for affine warp,
called inverse mapping, relates the output image to the input image:

$$\begin{bmatrix} x_i \\ y_i \end{bmatrix} = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \end{bmatrix} \begin{bmatrix} x_o \\ y_o \\ 1 \end{bmatrix}, \qquad 0 \le x_o \le \text{image\_width} - 1, \quad 0 \le y_o \le \text{image\_height} - 1$$
where xo and yo are the discrete output image locations, xi and yi are the inverse-mapped input locations, and a11 through a23 are the six affine warp coefficients. For each discrete pixel in the output image, an inverse transformation with this equation yields a nondiscrete subpixel location within the input image, from which the output pixel value is computed. Determining the gray-level output value at this nondiscrete location requires some form of interpolation (for example, bilinear) using the pixels around the mapped location.7

In a DMA-based implementation of affine warp, the programmer can segment the output image into multiple blocks, as shown in Figure 6. A given output block maps to a quadrilateral in the input space. The solid line in the input image represents the rectangular bounding block encompassing each quadrilateral. If the bounding block contains no source pixels (case 3 in Figure 6), the DMA controller writes zeros in the output image block without bringing any input pixels onto the chip. If the bounding block is partially filled (case 2 in Figure 6), the DMA controller brings only the valid input pixels onto the chip and fills the rest of the output image block with zeros. If the bounding block is completely filled with valid input pixels (case 1 in Figure 6), the DMA controller brings the entire block onto the chip.
Figure 5. Affine warp examples: original image (a); image shrunk in the x direction and sheared horizontally 30 degrees (b); and image rotated 87 degrees (c).
In addition, double buffering lets the data-transfer and processing times overlap.

In a cache-based implementation of affine warp that computes output pixels in row-major order, the data cache behavior depends strongly on the warping parameters. For example, the input image in Figure 5a is accessed in raster-scan order for the warping shown in Figure 5b, whereas it is accessed in almost column-major order for the warping in Figure 5c. If a mapped pixel location in the input image lies outside the boundary, the program should fill the corresponding output pixel with a zero value. The warping in Figure 5c might cause many cache misses because each access to an input pixel requires filling an entire cache line, and accessing the pixels in almost column-major order can cause a large number of data cache misses. To overcome this problem, programmers can reorganize the program to produce output pixels block by block, using the blocking technique. If a 2D block of output pixels is generated at a time, the memory accesses to the input image have better cache locality. The memory access pattern in this case is similar to that of the DMA-based approach, except that the core processing unit, rather than the DMA controller, handles cases 2 and 3 in Figure 6, and there is no double buffering.
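A scalar sketch of the per-pixel inverse mapping with bilinear interpolation follows; the function and parameter names are illustrative. The out-of-bounds test in the inner loop is exactly the work that the text above assigns to the core processor in the cache-based version (and to the DMA controller's zero fill in the DMA-based version):

```c
/* Inverse-mapped affine warp with bilinear interpolation. a[] holds the six
 * coefficients a11, a12, a13, a21, a22, a23 from the equation above. */
void affine_warp(const unsigned char *in, int iw, int ih,
                 unsigned char *out, int ow, int oh, const float a[6])
{
    for (int yo = 0; yo < oh; yo++) {
        for (int xo = 0; xo < ow; xo++) {
            float xi = a[0] * xo + a[1] * yo + a[2];   /* inverse-mapped   */
            float yi = a[3] * xo + a[4] * yo + a[5];   /* input location   */
            int x0 = (int)xi, y0 = (int)yi;
            if (xi < 0 || yi < 0 || x0 + 1 >= iw || y0 + 1 >= ih) {
                out[yo * ow + xo] = 0;                 /* outside: zero    */
                continue;
            }
            float fx = xi - x0, fy = yi - y0;          /* subpixel parts   */
            const unsigned char *p = in + y0 * iw + x0;
            float v = (1 - fx) * (1 - fy) * p[0] + fx * (1 - fy) * p[1]
                    + (1 - fx) * fy * p[iw] + fx * fy * p[iw + 1];
            out[yo * ow + xo] = (unsigned char)(v + 0.5f);
        }
    }
}
```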
Figure 6. Inverse mapping in affine warp. The corresponding bounding block in case 1 lies completely within the input image, in case 2 crosses the input image boundary, and in case 3 is outside the input image.
2D fast Fourier transform
A direct 2D FFT using Cooley-Tukey decomposition,8 though computationally efficient, leads to data references that are highly scattered throughout the image. For example, the first 2 × 2 butterfly on a 512 × 512 image would require the x(0, 0), x(0, 256), x(256, 0), and x(256, 256) pixels. (A butterfly is the innermost computation loop that includes complex multiplications and additions in an FFT.) The large distances between the data references make it difficult to keep the necessary data for the same butterfly in the data cache or on-chip memory. An alternative is to decompose a 2D FFT into row-column 1D FFTs by performing the 1D FFT on all the rows (row-wise FFTs) followed by all the columns (column-wise FFTs) of the row FFT results. This separable 2D FFT algorithm requires 33 percent more multiplications and 9.1 percent more additions than the direct 2D FFT approach. However, most programmers have used this approach because all the data for the row- or column-wise 1D FFT currently being computed can easily be stored in the data cache or on-chip memory.9 Transposing the intermediate image after the row-wise 1D FFTs enables performing another set of row-wise 1D FFTs. This transposition reduces the number of costly synchronous dynamic RAM (SDRAM) page misses; if the 1D FFTs were performed on the columns of the intermediate image, these page misses would occur many times. One more transposition is needed before storing the final results.
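The overall structure, assuming an existing 1D complex FFT routine fft_1d and a transpose helper (both placeholders here), is simply two row-wise passes separated by transpositions:

```c
#include <complex.h>

void fft_1d(float complex *row, int n);          /* assumed 1D FFT routine  */
void transpose(const float complex *in, float complex *out, int n);

/* Separable 2D FFT on an n x n complex image: every 1D FFT reads sequential
 * row-major memory, and the transpositions turn column access into row
 * access, avoiding SDRAM page misses as described in the text. */
void fft_2d(float complex *img, float complex *tmp, int n)
{
    for (int r = 0; r < n; r++) fft_1d(img + r * n, n);  /* row-wise FFTs   */
    transpose(img, tmp, n);                              /* columns -> rows */
    for (int r = 0; r < n; r++) fft_1d(tmp + r * n, n);  /* column FFTs     */
    transpose(tmp, img, n);                              /* restore layout  */
}
```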
Invert and add
Invert (subtracting each 8-bit pixel of the image from 255 to produce an inverse image) and add (adding two 8-bit images) are point operations, where each output pixel depends only on the input pixels at the corresponding position. Both the cache-based and DMA-based mechanisms can access the input images in row-major order and produce the output image accordingly. However, the simple processing requirement in each tight loop makes these functions I/O bound.
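The tight loops are one line each, which is why data movement rather than computation dominates. (Whether add saturates or wraps on overflow is not specified in the text; saturation is assumed here.)

```c
#include <stddef.h>

/* Point operations: each output pixel depends only on the input pixel(s)
 * at the same position. */
void invert(const unsigned char *in, unsigned char *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = (unsigned char)(255 - in[i]);
}

void add_images(const unsigned char *a, const unsigned char *b,
                unsigned char *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        unsigned int s = a[i] + b[i];
        out[i] = (unsigned char)(s > 255 ? 255 : s);   /* assumed saturating */
    }
}
```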
Results and discussion
We implemented these image computing functions on the MAP1000 using both cache- and DMA-based approaches. The functions are highly optimized. We wrote the functions in assembly language and in C with a high-level
Table 1. Performance of image computing functions (cycle counts in millions).

Function name                    Implementation   Cache           Cache           DMA             DMA             DMA total       Ratio*
                                                  tight-loop      total           tight-loop      total           execution
                                                  cycles          cycles          cycles          cycles          time (ms)
2D convolution (7 × 7 kernel)    Assembly         1.28            1.69            1.26            1.48            4.93            1.14
Invert                           C                0.0349          0.326           0.0396          0.229           0.76            1.42
Add                              C                0.0512          0.437           0.0541          0.325           1.08            1.34
Affine warp, case 1*             Assembly         2.78            3.17            0.902           1.50            5.00            2.11
Affine warp, case 2*             Assembly         2.90            8.36            1.47            1.98            6.60            4.22
2D FFT*                          Assembly         2.83            7.86            2.80            5.12            17.07           1.54
                                                  (2.25 + 0.577)  (4.83 + 3.03)   (2.22 + 0.576)  (2.61 + 2.51)

*Affine warp case 1 has a 0° rotation angle, 0.5 horizontal scale, 1.0 vertical scale, and 30° horizontal shear. Affine warp case 2 has an 87° rotation angle, 1.0 horizontal scale, 1.0 vertical scale, and 0° horizontal shear. The 2D fast Fourier transform consists of two sets of row-wise 1D FFTs and two transpositions, listed in parentheses as (FFT cycles + transposition cycles). The last column lists the ratio of the total number of execution cycles in the cache-based implementation to that in the DMA-based implementation.
language extension, called intrinsics, that acts like a C function call and guides the compiler to use a specific assembly instruction.1 Table 1 lists these functions' performance on 512 × 512 × 8-bit images. We measured the number of tight-loop execution cycles and the total number of cycles for each implementation on the MAP1000 cycle-accurate simulator, which models the SDRAM running at half the frequency of the core processor. The total number of cycles includes those for tight-loop execution, data cache service, instruction cache service, and other miscellaneous tasks such as dataflow control. Table 1 also shows the total execution time of each DMA-based implementation, assuming the core processor runs at 300 MHz. These times demonstrate the high computing power of the MAP1000. In particular, the 2D FFT performance of 17.07 ms is the most cost-effective and one of the fastest among published hardwired implementations and programmable processors.9
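As an illustration of the intrinsics style (the intrinsic name and vector type below are invented for this sketch and are not actual MAP1000 intrinsics), a partitioned subtraction might be exposed to C code like this:

```c
/* Hypothetical intrinsic-style inner loop for invert: one partitioned
 * subtraction processes eight 8-bit pixels packed in a 64-bit word. */
typedef unsigned long long v8u8;                 /* eight packed 8-bit pixels   */
v8u8 _psub8(v8u8 a, v8u8 b);                     /* partitioned a - b (assumed) */

void invert_partitioned(const v8u8 *in, v8u8 *out, int nwords)
{
    const v8u8 all_255 = 0xFFFFFFFFFFFFFFFFull;  /* 255 in every byte lane   */
    for (int w = 0; w < nwords; w++)
        out[w] = _psub8(all_255, in[w]);         /* 255 - pixel, 8 at a time */
}
```

The compiler maps such a call directly onto the corresponding partitioned instruction instead of a scalar code sequence.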
Effect of DMA on overall performance
The functions listed in Table 1 show a performance improvement by a factor of 1.14 to 4.22 with the DMA-based approach. For 2D convolution, the cache-based implementation's performance is comparable to that of the DMA-based implementation. The reason is that 2D convolution
is compute bound, so the tight-loop execution cycles account for the majority of the total cycles. Therefore, the achievable savings in data transfer time using double buffering with the DMA controller is small. For invert and add, the memory access overhead dominates the total cycles because the data processing in the tight loop is very simple: the two clusters in the MAP1000 can subtract 16 pixels (in invert) or add 16 pairs of pixels (in add) in one partitioned instruction. In the cache-based implementation, a cache miss occurs at the end of every cache line (32 bytes), which happens every two partitioned instructions as the tight loop traverses the entire input image. These cache misses incur a large overhead; in the absence of a writeback operation, each cache-miss service takes about 25 cycles. On the other hand, the DMA controller improves performance by allowing larger data blocks than the 32-byte cache line size, thus avoiding the overhead of servicing many cache misses. Therefore, when the memory access pattern is sequential and the program is I/O bound, the DMA-based approach can handle the data transfers more quickly, improving overall performance by a factor of 1.34 to 1.42.
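As a rough check of this overhead (our back-of-the-envelope arithmetic, using the figures above): a 512 × 512 × 8-bit image spans 512 × 512 / 32 = 8,192 cache lines, so at roughly 25 cycles per miss the line fills alone cost about 8,192 × 25 ≈ 0.2 million cycles, which is on the order of the gap between invert's tight-loop count (0.0349 million cycles) and its total count (0.326 million cycles) in Table 1.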
The DMA controller can also support several simple operations—such as masking, alignment, and indirect addressing—while transferring the data. In the DMA-based affine warp, the DMA controller writes zeros (mask operations) in the output block that is mapped partially or completely outside the input image boundary—cases 2 and 3 in Figure 6. With the cache-based approach, on the other hand, the core processor consumes more tight-loop execution cycles because it must test whether the inverse-mapped pixels lie outside the input image boundary. The savings in execution cycles with the DMA approach are high in the first affine warp case (0.9 million cycles, compared with 2.78 million cycles with the data cache), where many output pixels are mapped outside the input image, as shown in Figure 5b. In the second affine warp case (Figure 5c), the tight-loop execution cycles with the DMA-based approach increase to 1.47 million because there are few zero output pixels.

The performance improvement with DMA-based data handling is more evident in the second affine warp case (Figure 5c). Because the image is rotated 87 degrees, the input image access pattern in the cache-based implementation has little locality, causing an excessive number of cache-miss service cycles. The DMA-based approach, on the other hand, uses 2D block transfers, bringing the input blocks into contiguous memory space—ping_in_buffer or pong_in_buffer—in the background. Because of DMA's efficient data transfer, the performance of affine warp with the DMA controller is better than that with the data cache by a factor of 4.22.

To remedy the inefficient cache performance in the second affine warp case, we tested the cache-based affine warp with the blocking technique for different blocking sizes. As shown in Figure 7, blocking reduced the number of cache misses and significantly improved performance for the second case (from 8.36 million cycles with the original method to 4.44 million cycles with the 16 × 16 × 16-byte blocking). Thus, the performance ratio between the DMA- and cache-based approaches was 2.24. However, the first affine warp case's performance decreases with smaller blocking sizes because the first case prefers row-major-order memory access. Hence, deciding which blocking size is optimal for a given set of affine parameters is not straightforward. Perhaps the programmer could prepare a large lookup table that lists the optimal blocking size for all the different combinations of parameters.

Figure 7. Affine warp performance with blocking: total number of cycles (millions) versus block size (bytes) for affine warp cases 1 and 2.

The 2D FFT consists of two sets of row-wise 1D FFTs and two sets of transpositions. The total computing time for the two sets of row-wise 1D FFTs improves by a factor of 1.85 with the DMA-based approach (2.61 million cycles, compared with 4.83 million cycles with the data cache) because the DMA-based approach hides the data transfer time behind the processing time. The FFT processing time on the MAP1000 is short, thanks to powerful multimedia instructions such as complex-multiply, which performs two partitioned complex multiplications in one instruction. Therefore, data transfers, rather than data processing, cause bottlenecks on the MAP1000, contrary to the characteristics seen in general-purpose processors or other DSPs when performing FFTs. The DMA-based transpositions have only a slight advantage over the cache-based approach (2.51 million cycles versus the cache-based approach's 3.03 million cycles in Table 1) when we used the blocking technique for the cache-based 2D FFT transposition stage. The DMA-based approach uses a block size of 16 × 32 × 4 bytes to optimize the on-chip memory space for double buffering. Combining the total cycles for all the row-wise 1D FFTs and the two transpositions for the 2D FFT function makes the DMA-based
implementation 1.54 times faster than the cache-based implementation.

DMA programming
Because the cache-based approach does not involve the DMA controller, programming and debugging are more straightforward than with the DMA-based approach. Therefore, the cache-based approach is preferable as long as the required performance can be met. Otherwise, the DMA controller can speed up performance and remove the system's I/O bottlenecks—crucial for building low-cost programmable real-time embedded systems such as digital televisions and set-top boxes. However, DMA-based algorithm design, programming, and debugging require more programmer effort than the cache-based approach. The programmer must handle the data I/O for the intended algorithm, specifying the data transfer timing, amount, location, transfer type, and so on. The programmer must also synchronize data transfers (for example, DMA start and stop) with the core processor.

Using a software development tool that aids dataflow planning and programming can avoid the burden associated with the DMA-based approach. One example of such a DMA programming tool is our prototypical dataflow code generator, shown in Figure 8, where the user can create and connect glyphs (input images, tight loops, and output images) and fill out the information for each glyph. This information includes

• the input and output image sizes,
• the tight-loop function name and its arguments,
• whether the ping-pong block can be used in place (that is, whether the tight loop's ping-in or pong-in buffers can be overwritten with the tight loop's output), and
• whether the data flow needs to include neighborhood pixels (for example, in convolution).

One way to picture this per-glyph information is the record sketched after Figure 8.

Figure 8. Dataflow code generator example.
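The following C struct is a hypothetical rendering of that per-glyph record; the field names are ours, not the tool's actual format:

```c
/* Per-glyph information collected by the dataflow code generator
 * (illustrative field names). */
typedef struct {
    int         in_width,  in_height;   /* input image size                  */
    int         out_width, out_height;  /* output image size                 */
    const char *tight_loop_name;        /* tight-loop function to invoke     */
    void       *tight_loop_args;        /* its argument block                */
    int         in_place;               /* nonzero if ping-in/pong-in buffers
                                           may be overwritten with output    */
    int         halo_pixels;            /* neighborhood pixels the data flow
                                           must include (for example, M - 1
                                           for an M x M convolution kernel);
                                           0 if none                         */
} glyph_info;
```

From these records and the connections between glyphs, the tool derives the dataflow code described next.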
The tool then determines the appropriate ping-pong block sizes based on the parameters specified by the user and on the number of input/output images specified by the connections between the glyphs. It then generates the dataflow code. The user does not have to specify how to perform double buffering or when to handle dataflow synchronization; the tool automatically generates all the necessary code. This kind of tool increases software development productivity. One limitation, however, is that it supports only common dataflow patterns; manual code development is still necessary when a specialized data flow is needed.
Future mediaprocessors
Wider data paths and more powerful multimedia instructions in mediaprocessors let many traditionally compute-bound functions execute faster; increasingly, these functions will become I/O bound in the future. To provide higher-level integration in a system on a chip (SOC), memory hierarchies in future mediaprocessors will support very large on-chip memories, providing shorter latency and higher bandwidth than external memory. Figure 9 shows an example memory hierarchy for a mediaprocessor with a large on-chip memory (L2) along with the smaller, faster memory (L1). L1 can be either a data cache or memory-mapped synchronous RAM (SRAM). L2 can be a data cache, memory-mapped SRAM, or memory-mapped embedded dynamic RAM. Whereas DMA can transfer entire frames, rather than blocks, of video or images between L2 (due to
its large size) and external devices, the core processor can access the data in L2 through one of three methods:

• L1 data cache misses, where the missed data cache lines are mapped to L2;
• addressable load and store instructions from and to L2; or
• DMA transfers between the allocated memory spaces of L1 and L2, followed by accesses to L1 (see the sketch below).

The first method provides almost the same programming environment as the cache-based approach; thus, programming is easier here than in the other methods. To reduce the data-cache-miss service penalty, computer architects can provide data cache prefetching in hardware, software, or a combination of both.10 The second method provides direct access to the large L2 memory but takes longer to access data in L2 than in L1. If the data from L2 is accessed only once, this penalty might be tolerable; however, if the data is reused often, it will hinder performance. The last method requires more programming effort because it involves another level of DMA, between L1 and L2, in addition to the one between L2 and the external devices. This method, however, would give better performance than the other two. Although the architectural decision on which method to support in a specific future mediaprocessor will largely depend on the trade-offs between programming ease and overall performance, supporting multiple methods would give the user more flexibility. For example, if a mediaprocessor with L1 as a data cache and L2 as memory-mapped memory has a DMA controller, the programmer could use either the first or the third method.

Another effect of the large on-chip memory is that if L2 is large enough to contain the entire input, intermediate, and output images, the unit of DMA data transfers could become a frame rather than a small 2D block. The algorithms would then have to be modified to fully harness this large on-chip memory. For example, the optimal affine warp algorithm would differ from the one we discussed earlier (in the "Affine warp" section). Furthermore, computing a direct 2D FFT using 2 × 2 butterflies without transpositions would be more advantageous than performing separate row-column 1D FFTs with transpositions using two-point butterflies.
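The third method might look like the following C sketch, reusing the hypothetical DMA primitives from the earlier double-buffering example; l2_frame, l1_tile, and the tile size are illustrative:

```c
#include <stddef.h>

/* Hypothetical primitives, as in the earlier double-buffering sketch. */
typedef int dma_handle;
dma_handle dma_start_load(void *dst, const void *src, size_t n);
void dma_wait(dma_handle h);
void process_block(const unsigned char *in, unsigned char *out, size_t n);

#define L1_TILE 4096                            /* illustrative tile size     */

extern unsigned char l2_frame[512 * 512];       /* frame resident in large L2 */
static unsigned char l1_tile[2][L1_TILE];       /* ping-pong tiles in L1      */

/* Third access method: the frame already sits in L2 (brought there by a
 * frame-sized DMA from external memory); a second level of DMA streams it
 * through small L1 tiles that the core then loads and stores directly. */
void process_frame_via_l1(void)
{
    size_t ntiles = sizeof l2_frame / L1_TILE;
    dma_handle h = dma_start_load(l1_tile[0], l2_frame, L1_TILE);
    for (size_t t = 0; t < ntiles; t++) {
        int cur = (int)(t & 1);
        dma_wait(h);                            /* tile t now resident in L1 */
        if (t + 1 < ntiles)                     /* prefetch tile t+1         */
            h = dma_start_load(l1_tile[cur ^ 1],
                               l2_frame + (t + 1) * L1_TILE, L1_TILE);
        process_block(l1_tile[cur], l1_tile[cur], L1_TILE);  /* in place     */
        /* Results would be DMA'd back to L2 here; omitted for brevity.     */
    }
}
```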
Figure 9. Memory hierarchy with a large on-chip memory: the core processor has multiple processing elements (PEs) and on-chip L1 and L2 (a), and the core processor can access L2 in various ways depending on the configurations of L1 and L2 (b):

L2 data access method                               L1               L2
L1 miss                                             Data cache       Data cache
                                                    Data cache       Memory mapped
Load/store from/to L2                               Data cache       Memory mapped as noncacheable
                                                    Memory mapped    Memory mapped
Load/store from/to L1 with DMA between L1 and L2    Data cache       Memory mapped
                                                    Memory mapped    Memory mapped
Despite its higher performance, the DMA approach has disadvantages. Many programmers have had difficulty using the DMA controller to its full potential, because it requires a good understanding of the mediaprocessor architecture, as well as the target algorithm and its data flow. Therefore, although the DMA is important for maximizing mediaprocessor performance, the software development environment is essential for easing the dataflow layout, programming, and debugging. The decision on whether to use the cache- or DMA-based approach should depend not only on the achievable performance but also on the programming expertise available and the possible return on the extra programming investment.

As the tight loops become shorter with wider data paths and powerful multimedia
instructions, more functions are becoming I/O bound. The DMA-based approach can alleviate this speed gap between the processor and external memory. Because the disparity between processor and memory speeds is likely to increase further, the DMA controller and efficient DMA programming will become even more important in future mediaprocessors and their applications. MICRO

References
1. P. Faraboschi, G. Desoli, and J.A. Fisher, "The Latest Word in Digital and Media Processing," IEEE Signal Processing Magazine, vol. 15, no. 2, Mar. 1998, pp. 59-85.
2. I. Kuroda and T. Nishitani, "Multimedia Processors," Proc. IEEE, vol. 86, no. 6, June 1998, pp. 1203-1221.
3. C. Basoglu, W. Lee, and J.S. O'Donnell, "The MAP1000A VLIW Mediaprocessor," IEEE Micro, vol. 20, no. 2, Mar./Apr. 2000, pp. 48-59.
4. "DSP Products," Texas Instruments, Dallas; http://dspvillage.ti.com/docs/dspproducthome.jhtml.
5. D.A. Patterson and J.L. Hennessy, Computer Architecture: A Quantitative Approach, 2nd ed., Morgan Kaufmann, San Francisco, 1996.
6. R. Managuli et al., "Mapping of Two-Dimensional Convolution on Very Long Instruction Word Media Processors for Real-Time Performance," J. Electronic Imaging, vol. 9, no. 3, July 2000, pp. 327-335.
7. O. Evans and Y. Kim, "Efficient Implementation of Image Warping on a Multimedia Processor," Real-Time Imaging, vol. 4, no. 6, Dec. 1998, pp. 417-428.
8. D.E. Dudgeon and R.M. Mersereau, Multidimensional Digital Signal Processing, Prentice Hall, Upper Saddle River, N.J., 1984.
9. C. Basoglu, W. Lee, and Y. Kim, "An Efficient FFT Algorithm for Superscalar and VLIW Microprocessor Architectures," Real-Time Imaging, vol. 3, no. 6, Dec. 1997, pp. 96-106.
10. T.-F. Chen and J.-L. Baer, "A Performance Study of Software and Hardware Data Prefetching Schemes," Proc. 21st Int'l Symp. Computer Architecture (ISCA 94), ACM Press, New York, 1994, pp. 223-232.
Donglok Kim is a research assistant professor in the Department of Electrical Engineering
at the University of Washington, Seattle. His research interests include mediaprocessor architectures, processor-memory interface, multimedia video and graphics processing, and portable image computing libraries. Kim has a PhD in electrical engineering from the University of Washington, Seattle.

Ravi Managuli is a research assistant professor in the Department of Bioengineering at the University of Washington, Seattle. His research interests include algorithms and systems for medical imaging, digital signal processing, image processing, computer architecture, and real-time multimedia applications. Managuli has a PhD in electrical engineering from the University of Washington, Seattle.

Yongmin Kim is a professor and chair of bioengineering and a professor of electrical engineering, both at the University of Washington, Seattle, where he is also an adjunct professor of computer science and engineering, and of radiology. His research interests include mediaprocessor architecture and algorithms, and systems for multimedia, image processing, and medical imaging. Kim has a PhD in electrical engineering from the University of Wisconsin, Madison.

Direct questions and comments about this article to Yongmin Kim, Depts. of Electrical Engineering and Bioengineering, Univ. of Washington, Box 352500, Seattle, WA 98195;
[email protected].