Streaming Processors for Next Generation Mobile Imaging Applications
Sek M. Chai*, Silviu Chiricescu*, Ray Essick*, Abelardo López-Lagunas†, Brian Lucas*, Phil May*, Kent Moat*, James M. Norris*, Mike Schuette*

* Embedded Systems and Physical Science Center of Excellence, Motorola Labs (contact: [email protected])
† Instituto Tecnológico y de Estudios Superiores de Monterrey, Campus Toluca, México (contact: [email protected])

Abstract— Next generation mobile devices will continue to demand high processing power for imaging applications. The expected performance is in the class of supercomputers, but it must be delivered within the limited energy and memory bandwidth of embedded systems. This paper advocates a streaming computation model that leverages the deterministic access patterns in imaging applications to deliver the necessary processing throughput. A reconfigurable datapath connects a set of functional units, forming a computation pipeline that offers energy efficiency. The architecture and implementation of a stream processor are presented, along with the memory subsystem that supports stream data transfers. Results show speedups ranging from a factor of 2 to 28 for imaging applications, comparing favorably against scalar processors.

Index Terms—Stream processing, smart camera, mobile imaging, memory hierarchy
INTRODUCTION
The growth of the digital imaging industry is driven primarily by mobile phones with embedded cameras, i.e. camera-phones, first introduced in 2000. Camera-phone sales have since surpassed sales of both film and digital cameras, and the entire digital imaging market continues to accelerate [1]. Applications and features in camera-phones follow those of high-end digital cameras, as users expect common capabilities from all their camera devices. As ubiquitous cameras gain mass-market acceptance, mobile imaging applications will continue to need more performance and features. Today, these applications involve the capture, processing, and transmission of visual content on wireless mobile devices for small images with sizes up to one megapixel. Next generation camera-phones will demand higher resolution (multiple megapixels), better image quality through image finishing, and faster frame rates (video quality at 30 frames per second). In addition, the leading marketable feature will shift from pixel resolution to new product features based on computer vision algorithms, much akin to those in smart cameras [2,3].

The higher performance requirements of next generation mobile imaging applications cannot be met with a traditional scalar processor that relies on increasing clock rates and large caches, both of which consume energy inefficiently. The system architecture must include a flexible computing platform that leverages the inherent and abundant parallelism in these applications. In addition, the memory subsystem that supports the processors and hardware accelerators must sustain their increased memory bandwidth demands. This paper presents a system architecture that delivers high processing throughput, with a configurable network to handle the different kinds of application parallelism and to attain more efficient energy consumption. It is based on a computation model that treats images as streams of data, leveraging deterministic memory access patterns to improve memory bandwidth utilization. Design perspectives are discussed with details of the system-on-chip (SoC) and development platforms. In addition, performance numbers are provided to give the reader a point of reference for the stream processor's capabilities.

MOBILE IMAGING APPLICATIONS

Current mobile imaging applications include color interpolation, white balance, gamma correction, and compression/decompression. For sub-megapixel resolutions, these functions can be handled by today's imaging DSPs, with performance ranging up to 500 MOPs (million operations per second). Next generation mobile imaging applications will include image finishing for higher resolution images, compression/decompression at video frame rates, and computer vision algorithms to enable smart camera features. These applications can demand more than twice the performance of such an imaging DSP [2,3]. In general, imaging applications exhibit the following properties:
• Vast amounts of data are manipulated, requiring a large number of memory accesses to transfer image data at video frame rates.
• Computations are localized into groups, typically embedded in nested loops. Many groups are self-contained and have no data dependencies on other computation stages.
• Computation groups are regular and repetitive.
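As a concrete illustration of these properties, consider the following sketch (our own illustrative C code, not taken from an actual product pipeline) of a 3x3 convolution, a building block used throughout this section:

    #include <stdint.h>

    /* Illustrative sketch: a 3x3 convolution showing the pattern described
     * above -- heavy pixel traffic, computation localized to a small
     * neighborhood, and a regular, repetitive nested-loop structure. */
    void conv3x3(const uint8_t *in, uint8_t *out, int w, int h,
                 const int k[3][3], int shift)
    {
        for (int y = 1; y < h - 1; y++) {          /* regular, repetitive groups */
            for (int x = 1; x < w - 1; x++) {
                int acc = 0;
                for (int i = -1; i <= 1; i++)      /* localized 3x3 neighborhood */
                    for (int j = -1; j <= 1; j++)
                        acc += k[i + 1][j + 1] * in[(y + i) * w + (x + j)];
                acc >>= shift;                     /* fixed-point normalization */
                out[y * w + x] = acc < 0 ? 0 : acc > 255 ? 255 : (uint8_t)acc;
            }
        }
    }

Every iteration touches nine input pixels with a fully deterministic address pattern; this determinism is exactly what the stream model described later exploits.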
This section briefly describes these applications in order to highlight the computation and memory access characteristics listed above. It motivates the stream processing model and processor design described in the following sections, whereby memory accesses are overlapped with computation and processing elements are organized for parallelism.

Image Finishing

Image finishing consists of a series of processing steps that render raw sensor data into user-visible images [4]. These steps include a variety of algorithms: color interpolation, white balance, gamma correction, and compression. Virtually all mobile imaging devices implement these algorithms either in image sensor companion chips or in application processors.

Color interpolation converts a single sensor's output from a mosaic of colored pixels into a full color image. The algorithm typically includes a filtering operation over a range of 3x3 to 5x5 tiles for the three RGB (red, green, blue) color planes. White balance and gamma correction tune the image pixel data to match the intended color range. For example, a white color should be visually white, with the RGB colors saturated over their range of values. In general, the processing redistributes and scales the energy content of the RGB values. The algorithms typically include a 3x3 matrix multiplication of pixel data with a filter, and look-up tables (LUTs) to map the result to a new non-linear color space (a sketch of this step appears below). Finally, compression standards such as JPEG include steps to convert the RGB color space into a YCbCr color space, with luminance (Y), red chrominance (Cr), and blue chrominance (Cb), before the compression procedure. The compression algorithm involves block comparisons, typically 8x8, and LUTs to encode the image data. Memory access patterns are 2-D blocks based on the filter tile sizes (3x3 or 5x5 for color interpolation, and 8x8 for JPEG compression).
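The sketch below illustrates the color correction and gamma steps just described: a fixed-point 3x3 matrix multiply per pixel followed by a LUT mapping into a non-linear space. It is our own illustrative code; the matrix format (Q8) and the LUT contents are assumptions, not specified in the text.

    #include <stdint.h>

    /* Illustrative sketch of the 3x3 matrix multiply plus LUT mapping
     * described above. The matrix is Q8 fixed point; gamma_lut encodes a
     * non-linear transfer curve (both contents are assumed). */
    static uint8_t clamp255(int v)
    {
        return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
    }

    void color_correct(uint8_t *pix, int npix,
                       const int16_t m[3][3],      /* Q8 color correction matrix */
                       const uint8_t gamma_lut[256])
    {
        for (int i = 0; i < npix; i++, pix += 3) {
            const int in[3] = { pix[0], pix[1], pix[2] };   /* R, G, B */
            for (int c = 0; c < 3; c++) {
                int v = (m[c][0] * in[0] + m[c][1] * in[1] + m[c][2] * in[2]) >> 8;
                pix[c] = gamma_lut[clamp255(v)];            /* non-linear mapping */
            }
        }
    }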
Video Codec

MPEG compression standards are widely used in digital video encoding products such as digital camcorders. Video messaging and two-way video conferencing are already available in camera-phones today, but popular demand for such services takes time to develop. As content providers are eager to create a new market in mobile devices, wider bandwidth networks will enable new applications such as video-on-demand and mobile TV broadcast. In the future, content from home entertainment systems will be shown seamlessly on mobile devices, adding mobility to user access.

In general, the encoding process begins with preprocessing, which may include format conversion into the YCbCr color space and subsampling to a desired image size. Motion estimation, 2-D DCT, quantization, and entropy coding are the next stages in the coding process. For a CIF-sized video (352x240), the encoding process requires more than 1000 MOPs, which is beyond the capabilities of today's embedded scalar processors [5]. The decoding process follows a simpler inverse procedure and requires only 200 MOPs (easily achievable on scalar processors). Memory access patterns are based on macroblock and DCT filter sizes, which are typically 8x8, over the entire image.

Computer Vision

Smart cameras use computer vision algorithms to "see and understand" the scene. The "see" portion uses computer vision algorithms to segment, classify, and identify objects; the "understand" portion includes complex learning algorithms to model and track objects. There are already application examples in security, entertainment, safety, and transportation, although not all of these applications apply to camera-phones. An early example is camera-phones that read business cards. In general, these applications have two abilities: to extract features and structures from images, and to derive semantic meaning in the form of metadata through a series of pattern matching operations.

Computer vision algorithms identify objects in a scene, and consequently produce one region of an image that has more importance than other regions. For example, in body gesture recognition [3], the region of the video image that includes portions of the hand (a region of interest, or ROI) may be studied to find the hand posture over a series of frames. A video image may also be processed to track different parts of the body. Each object lies in a different region of the captured images, and these regions must be processed simultaneously to find the overall effect of the body gesture. Memory access patterns depend on the object size in the image, ranging from a few pixels to about 1000 pixels in order to have enough resolution for feature detection. Furthermore, the object moves in subsequent frames, which corresponds to fetching a "moving" block of pixels from memory.

STREAM PROCESSORS

Stream processing is a computation model that operates on sequences of data using computation kernels to expose and exploit concurrency and locality for efficiency [6]. While both industry and academia have studied the concurrency of computation and data movement, this streaming model provides a new and interesting framework that brings together both task and data level parallelism within the same context. This stream computation model is important for the following reasons:
• Design issues such as limited memory bandwidth and latency are resurfacing. These issues not only appear in the interface between memory controllers and external memories, but also manifest in the form of wire interconnect delays that
add to inter-chip communication latency. Furthermore, caching techniques are not effective for imaging applications due to the poor temporal locality of the data, requiring a new approach to the bandwidth problem.
• The microprocessor architecture community needs to improve performance by looking at opportunities beyond traditional instruction-level parallelism (ILP) and data-level parallelism (DLP) techniques. Processor designs have evolved from the standard RISC design into multi-issue superscalar processors, stretching complexity, and consequently clock frequencies, to physical limits. SIMD (Single Instruction Multiple Data) instruction set extensions, such as MMX and AltiVec, have seen limited success due to problems with data alignment and with the compiler's ability to automatically extract data parallelism [7].
• Computer vision and imaging applications exhibit computation and data movement characteristics that are a good match for the stream model of computation. Making data movement explicit and describing which portions of the application can be computed in parallel enable compilers to optimize data movement and match it to the available processing units. This also helps hardware designers simplify the processor design by relying on the compiler to keep the processing elements busy all the time, leading to stream processors capable of achieving high performance with low energy consumption.

Stream processors are similar to vector processors in their ability to hide latency, amortize instruction overhead, and expose data parallelism by operating on large sets of data. However, stream processors support more complex access patterns by allowing the programmer to explicitly define the data movement. This enables the compiler to schedule data movement as part of the computation and/or allows dedicated data movement hardware decoupled from computation. Examples of stream processors include RAW [8], Imagine [9], Merrimac [10], and the RSVP™ architecture [11,12].

The RAW processor is a tiled architecture where each tile has a RISC processor, a single-precision FPU, instruction and data caches, and four communication networks. The compiler assigns computation kernels to adjacent tiles to reduce data movement and memory bandwidth. The RAW processor exposes all of its resources to the compiler, which schedules data movement and computation in fine-grained detail. Imagine and Merrimac consist of a large number of functional units grouped into computational clusters. They rely on a memory hierarchy to keep those functional units busy while reducing memory bandwidth. Merrimac replicates the same functional unit to form a homogeneous computing cluster. In addition, it not only adds a cache in front of the memory controller to fully exploit the memory controller bandwidth, but also incorporates a conventional processing core alongside the computation clusters.

In all of the above architectures, stream data appear sequential to the computation kernels even though they are scattered throughout memory. Merrimac and Imagine rely on a memory hierarchy implemented by stream buffers and stream register files to assemble data into contiguous streams. The RSVP™ architecture, on the other hand, uses stream descriptors, a language extension that specifies memory access patterns, along with dedicated stream units to assemble data. The programmer describes the computation independently of the stream descriptors, and a compiler then configures the hardware appropriately for stream processing.
RECONFIGURABLE STREAM VECTOR PROCESSOR (RSVP™)

The RSVP™ architecture is a stream-oriented, memory-to-memory vector co-processor designed to handle streaming data functions. It can operate synchronously with the host processor instruction flow, presenting the familiar single-core programming model. With more complex synchronization, it can also operate asynchronously for potentially higher system performance. It complements the data processing instructions of the host scalar processor, but does not participate in control flow. Data access and data processing are decoupled. To program the RSVP™ architecture, the user expresses the shape and location of data streams in memory and describes a data flow graph of the operations to be performed [11,12].

Architecture

The RSVP™ architecture improves performance by increasing parallelism through decoupled memory access, deep pipelining (chaining of functional units), and SIMD vector processing. Each of these techniques can be applied independently or in combination, and all are hidden from the programmer by the compiler.

Decoupled memory access allows data prefetching to occur during computation. It is achieved by having the programmer describe the shape and location of data in memory using stream descriptors, which are discussed in the next section. Stream units then use the stream descriptors to fetch data from memory and present the aligned data in the order required by the computing platform. This decoupling allows the stream units to take advantage of available bandwidth to prefetch data before it is needed. Performance then depends on the average bandwidth of the memory subsystem, with less sensitivity to the peak latency of accessing any individual data element. The architecture also benefits from having fewer stalls due to slow memory accesses, alleviating memory wall issues [13]. The stream units replace the large caches that are ineffective for video data because of its poor temporal locality [7]. Furthermore, memory accesses no longer need to be scheduled amongst compute operations, thereby allowing for more optimal scheduling of operations.
// In-lined C code
void cfir32(const short *x, const short *h, short *r,
            const short nh, const short nr)
{
    extern void d_cfir32;                        // entry point to the DFG
    _vload(&d_cfir32);                           // load the DFG

    // Set up stream descriptors
    _vihalf(1, (unsigned short *) &x[2*nh - 1]);
    _vishape(1, -1, 2*nh, 2*nh + 2);
    _vihalf(2, (unsigned short *) &h[0]);
    _vishape(2, 1, 2*nh, -2*nh);
    _vohalf(0, r);

    // Start RSVP™ co-processor
    _vloop2(&d_cfir32, nh, nr);                  // execute the DFG
}
(a)
// DFG for the complex FIR
vname  d_cfir32
vbegin L_end-L_start,0
L_start:
L0:  vconst 0            // clear the accumulators
L1:  vputa  L0, a0
L2:  vputa  L0, a1
vinner
L3:  vld.u16 (v1)        // load the input data
L4:  vld.u16 (v1)        // load the input data
L5:  vld.u16 (v2)        // load the coefficients
L6:  vld.u16 (v2)        // load the coefficients
L7:  vmul.s32 L4,L5      // first multiply
L8:  vmul.s32 L3,L6      // second multiply
L9:  vsub.s32 L7,L8      // real part
L10: vadd.s32 L7,L8      // imaginary part
vadda L9,a0              // accumulate real
vadda L10,a1             // accumulate imaginary
vpost
L11: vgeta.s16 a0
L12: vgeta.s16 a1
L13: vst L11, (v0)       // store the real part
L14: vst L12, (v0)       // store the imaginary part
L_end:
vend
(b)

[Graphical DFG omitted: a pre-outer stage (vconst, vputa to a0/a1), an inner loop (vld from v1 and v2, vmul, vsub, vadd, vadda to a0/a1), and a post-outer stage (vgeta from a0/a1, vst to v0).]

(c)
Figure 1. Example RSVP™ program using a DFG: (a) in-lined C function call, (b) textual representation of the DFG, (c) graphical representation of the DFG.

Deep pipelining allows multiple functional units to be chained together, eliminating large register files for the storage of temporary data. It is achieved by having the programmer describe the data flow graph (DFG) of the operations to be performed by the RSVP™ co-processor (usually kernel inner loops). A small number of DFG operations is mapped onto each RSVP™ functional unit, and a reconfigurable interconnect network connects all components. If there are sufficient functional units that each operation in the DFG maps to a unique functional unit, then the data stream is routed directly from one functional unit to the next, possibly with delays. If there are more operations than functional units, then the functional units and interconnect are "virtualized", performing only a subset of the DFG operations and data movement in each cycle. The architecture benefits by removing the need for multi-ported register files, which are large and slow. Furthermore, many operations specified in the DFG execute in the same clock cycle; a scheduler tool selects these operations in a way that maximizes the use of the functional units. Additionally, compiler loop unrolling enables stream data elements to be processed in SIMD fashion, provided that sufficient resources are available to map multiple iterations of the loop.

Stream Descriptors

The RSVP™ architecture includes several independent stream units that prefetch data from memory and turn streams into FIFO queues of vector elements. Additional stream units write vector elements back to memory as streams. Each stream unit handles all aspects of loading and storing data: address calculation, byte alignment, data ordering, and the memory bus interface. Data is transferred through the stream units, which are programmed using stream descriptors. A compiler can also schedule the loading of a stream descriptor that depends on run-time values. Although stream descriptors are emphasized in this paper for the movement of image data, they can also describe the movement of non-image data or be used with peripherals in the system [14].

In the RSVP™ architecture, a stream descriptor is represented by the 7-tuple (start_address, stride, span0, skip0, span1, skip1, type) where:
• start_address is the memory location of the first data element.
• stride is the spacing between two consecutive data elements. This is useful for describing subsampling in an image.
• span0 is the number of data elements gathered before the skip0 offset is applied.
• skip0 is the offset applied between groups of span0 elements.
• span1 is the number of data elements gathered before the skip1 offset is applied.
• skip1 is the offset applied between groups of span1 elements.
• type indicates the number of bytes in each data element: for example, 8-bit pixels have a type value of zero, 16-bit pixels a value of one, and so on.
The stride, span0, skip0, and type fields define the shape of the data stream; start_address defines the location of the first data element; and span1 and skip1 describe uniform motion of a data block within the image. The stride, skip0, and skip1 offsets can be positive or negative, while span0 and span1 are always positive.
The template described above allows the description of two-dimensional sub-arrays within larger arrays (useful in image convolution filters), uniform sub-samples of arrays (useful in image sub-sampling or decimation), circular access of a sub-array (using negative skip values), a sliding 2D window with uniform window motion, or a combination of these.
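To make the descriptor semantics concrete, the sketch below enumerates the element addresses implied by a 7-tuple. This is our own software model, not the stream unit's interface; we assume the skip offsets are applied in addition to the stride at group boundaries, which is consistent with the coefficient stream in Figure 1a (stride=1, span0=2*nh, skip0=-2*nh re-reads h[0..2*nh-1] circularly), and we interpret span1 as a count of span0 groups (the text counts elements; the two agree when the element count is a multiple of span0).

    #include <stdint.h>
    #include <stdio.h>

    /* Software model of a 7-tuple stream descriptor and the address
     * sequence it implies. Offsets are in elements; skip0/skip1 are
     * applied in addition to the stride at group boundaries (assumed). */
    typedef struct {
        uintptr_t start_address;  /* location of the first data element         */
        long stride;              /* spacing between consecutive elements       */
        long span0, skip0;        /* group length / extra offset between groups */
        long span1, skip1;        /* groups per block / extra offset per block  */
        int  type;                /* log2(bytes per element): 0 = 8-bit, ...    */
    } stream_desc;

    void print_addresses(const stream_desc *d, long nblocks)
    {
        long esize = 1L << d->type;            /* bytes per element */
        uintptr_t a = d->start_address;
        for (long b = 0; b < nblocks; b++) {
            for (long g = 0; g < d->span1; g++) {
                for (long e = 0; e < d->span0; e++) {
                    printf("%#lx\n", (unsigned long)a);
                    a += d->stride * esize;    /* element-to-element step */
                }
                a += d->skip0 * esize;         /* between groups of span0 */
            }
            a += d->skip1 * esize;             /* between blocks */
        }
    }

For example, under this model a 3x3 sliding window over an 8-bit image of width W is stride=1, span0=3, skip0=W-3 (jump to the next row after three pixels), span1=3, skip1=-3*W+1 (jump back to the top of the next window position after three rows), type=0; the same span1/skip1 mechanism describes the "moving" ROI blocks of the computer vision applications.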
[Figure 2. Current RSVP™ implementation: block diagram showing the VBus2 master interface with cache, the 32KB Tile Buffer, the stream units (VSUs), the Fabric-and-Queue reconfigurable interconnect, the Mult/ALU/Shifter function units, the AMBA AHB slave interface, the sequencer with code cache, and status/control logic.]

Stream Computation

RSVP™ programs use a data-flow graph (DFG) language to express vector operations in a machine-independent manner. A DFG consists of nodes, representing arithmetic and logical operations, and directed edges, representing the dependencies (data, order, iteration) between operations. In this DFG language, all dependencies are explicitly stated. This simplifies the scheduler's task of identifying dependencies and determining which operations can be scheduled in parallel, resulting in schedules that are often optimal given the functional unit and interconnect limits of the underlying RSVP™ implementation. Each node in the DFG is denoted by a descriptor, which specifies:
• The input operands. These are specified as relative references to previous nodes rather than named registers. This eliminates unnecessary contention for named registers as well as the overhead associated with register renaming.
• The operation to be performed by the node.
• The minimum precision of its output value. This can be derived from the precision of the input operands and from the operation performed by the node; implementations are allowed to use more precision when necessary.
• The signedness of the node.
(These fields are collected into a small C sketch at the end of this section.)

The DFG is then mapped onto the current hardware implementation by a micro-architecture aware DFG compiler. This compiler exploits both instruction and data level parallelism, which manifest as pipelined and SIMD operations in the RSVP™ co-processor. To initiate execution of a kernel loop on the RSVP™ co-processor, the host loads the stream descriptors, accumulators, and scalars (loop invariants). RSVP™ execution starts by passing it the address of the compiled DFG and the number of iterations to execute. The RSVP™ architecture supports up to two levels of loop nesting, with different iteration counts for the inner and outer loops.

Figure 1 illustrates the process of programming the RSVP™ co-processor with an example DFG for a complex FIR filter; details of the FIR filter can be found in [15]. Figure 1a shows the in-lined function call that runs on the host processor. In this function call, the DFG and stream descriptors are loaded in a series of steps before execution starts on the RSVP™ co-processor. Figures 1b and 1c show the textual and graphical representations of the complex FIR filter. The two representations are equivalent, and the programmer can choose either method for creating the DFG.
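Before turning to scheduling, the node-descriptor fields listed above can be collected into a minimal C sketch. The field names and widths are our own assumptions for illustration, not the hardware encoding:

    #include <stdint.h>

    /* Sketch of the information carried by one DFG node descriptor, per
     * the four fields described above (names and widths are assumed). */
    typedef struct {
        uint8_t src_a;       /* input operand: relative reference, nodes back   */
        uint8_t src_b;       /* second input operand, also a relative reference */
        uint8_t op;          /* operation performed (vadd, vmul, vsub, ...)     */
        uint8_t min_bits;    /* minimum precision of the output value, in bits  */
        uint8_t is_signed;   /* signedness of the node's result                 */
    } dfg_node;

Because operands are relative references rather than named registers, register allocation and renaming disappear from the scheduling problem.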
[Figure 3. RSVP™ die photo and development board: the SoC integrates an ARM946ES™ core, the RSVP™ co-processor with sequencer RAM and tile buffer, and an image sensor interface (SIF); the board adds an SDRAM SODIMM, SRAM/FLASH, an LCD connector, UART/RS232, GPIO, PLLs and LDO voltage regulators, a JTAG debugger interface, and logic analyzer test points.]
During compilation, the DFG nodes are mapped onto the RSVP™ functional units by a micro-architecture aware scheduler, which creates a single modulo schedule for the entire nested loop construct. This allows full pipelining of inner and outer loop operations in the FIR.

Mapping DFG descriptions of algorithms onto hardware has been studied previously as part of high-level synthesis (HLS) research [16-18]. Most of these approaches rely on a number of fixed functional units connected through fixed bus structures to centralized and/or distributed register files. In our case, the functional units (which can be configured for either SIMD or MIMD (Multiple Instruction Multiple Data) parallelism) are interconnected by a flexible network, shown as Fabric_and_Queue in Figure 2, which can be reconfigured every cycle. The intermediate results are stored in distributed queues that are part of the network. Compared with schedules created by other HLS tools, our DFG compiler produces modulo schedules that: (i) do not require a prolog and epilog; (ii) can handle nested loop constructs without having to create different schedules for different parts of the nested loop construct; (iii) may be iteration dependent; and (iv) deal with limited logical connectivity between functional units.

Implementations

The block diagram of the current RSVP™ implementation is shown in Figure 2. A bus interface gasket (AHB Interface) connects to the host processor as a memory-mapped peripheral on the bus. A 32KB tile buffer is included as a scratch memory, providing significant speedup to algorithms that map their data there. There is one output stream unit and three input stream units (VSUs) in this implementation. The inputs can source up to four 32-bit data elements per cycle, and the output can sink up to two 32-bit data elements per cycle.

The datapath consists of a reconfigurable interconnect network (Fabric_and_Queue) that stores intermediate results. The network is configured to connect functional units to take advantage of both pipeline and SIMD parallelism. The interconnect fabric consists of 16 links, each of which can transfer 32 bits of data from its source to its destination. Additionally, the 32-bit links can swap 16-bit halves during a route, or "splat" (duplicate across elements in a vector) either of the 16-bit input halves to both halves of the 32-bit output. These links can be reconfigured every clock cycle.

Each function unit is 64 bits wide and sliced on 16-bit boundaries. These units can operate as four 16-bit units, one 32-bit and two 16-bit units, two 32-bit units, or one 64-bit unit. In contrast to traditional wide-word SIMD implementations, each of the slices in this RSVP™ implementation has its own control, so the slices constitute a MIMD unit rather than a SIMD unit (although they can be scheduled as a SIMD unit by applying the same control to all slices); a software model of this per-slice control appears at the end of this subsection. All function units are fully pipelined with result latching (including the Fabric_and_Queue), allowing the units to be chained together to form deep reconfigurable pipelines customized to each application.

The first implementation has been fabricated in TSMC 0.18µm CMOS technology. It is integrated into an ARM946ES-based SoC with a complete set of peripherals for an embedded camera. Figure 3 shows a die photo of the SoC, which contains 9.5M transistors in a 5.04x9.03 mm² die. Readers are referred to [11] for more detail.
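The sketch below is our own software model (not the hardware) of the sliced function unit just described: a 64-bit unit operating as four 16-bit lanes, where per-lane opcodes give MIMD behavior and passing the same opcode to all lanes recovers SIMD.

    #include <stdint.h>

    /* Software model of a sliced 64-bit function unit: four 16-bit lanes,
     * each with its own control (MIMD). Using the same op for every lane
     * degenerates to SIMD. Illustration only. */
    typedef enum { OP_ADD, OP_SUB, OP_MUL } lane_op;

    uint64_t sliced_alu(uint64_t a, uint64_t b, const lane_op op[4])
    {
        uint64_t r = 0;
        for (int lane = 0; lane < 4; lane++) {
            uint16_t x = (uint16_t)(a >> (16 * lane));
            uint16_t y = (uint16_t)(b >> (16 * lane));
            uint16_t z;
            switch (op[lane]) {               /* independent per-lane control */
            case OP_ADD: z = (uint16_t)(x + y); break;
            case OP_SUB: z = (uint16_t)(x - y); break;
            default:     z = (uint16_t)(x * y); break;
            }
            r |= (uint64_t)z << (16 * lane);
        }
        return r;
    }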
Performance Results

This section presents performance results for the described RSVP™ implementation. The selected benchmarks are part of next generation mobile imaging applications. Some benchmarks are "kernels", which represent portions of an application that are mapped onto the RSVP™ co-processor using the stream descriptors and DFG language. Other benchmarks are full applications whose execution spans both the RSVP™ co-processor and the host processor. For comparison against an ARM946ES core, the benchmarks were written in C and compiled using an ARM C compiler.

The kernel benchmarks are presented in Table 1. Two sets of results are presented for each processor: the column labeled 1 lists results for the benchmarks as described in the table footnote, and the column labeled 32 lists results with the iteration count of each benchmark increased by a factor of 32. Consequently, the columns labeled 1 are more heavily influenced by the overhead of setup and internal pipeline priming/draining, while the columns labeled 32 show the performance gained from deep pipelining and vector execution. The kernel benchmarks assume an ideal memory model with a 1-cycle access time and a 128-bit bus.

Table 1. Performance comparisons on kernel benchmarks† (cycles)

Benchmark          ARM946ES           TI-C64xx          SC140    RSVP™ Co-processor   Speedup versus
(iterations)       1        32        1       32        1        1       32           ARM946ES
IDCT               5661     180935    154     3006      1130     252     5470         22.4-33
Motion Estimation  780      24588     32      1024      192      104     357          7.5-68
FFT                41581    1312483   1243    40923     1632     1763    55659        23.6
FIR                88965    2924132   2584    81944     2612     2729    80706        32.6-36.2
Convolution        64322    -         342     -         1736     1045    -            61.5
SobelFilter        734401   -         22199   -         80280    33463   -            21.9

† IDCT: IEEE 1180-1990 compliant, performed on one 8x8 block of pixels; the input is 16 bits/pixel. Motion Estimation: executed on an 8x8 window, with 8 bits/pixel. FFT: a 256-point complex FFT, 16 bits/input, normally ordered in, bit-reversed out. FIR: an 80-sample, 32-tap complex FIR, 16 bits/input. Convolution: a 3x3 convolution executed on a 250x3 image with 8 bits/pixel. SobelFilter: a Sobel filter executed on a 128x128 image with 8 bits/pixel.

The speedups for the kernels range from 7.5 to 68, with an average of 34.5. These results show a performance improvement beyond what is obtained with the 4-8 wide SIMD extensions in today's scalar processors, and they indicate that there is significant parallelism to exploit in these benchmarks. The RSVP™ architecture, and stream processors in general, can achieve higher performance than wide-word SIMD because they scale more readily to larger numbers of functional units. Furthermore, the stream programming model with decoupled memory access allows the compiler to better utilize resources such as functional units, the interconnect network, and memory bandwidth.

The kernel speedups are comparable to those of leading DSP processors with a similar number of functional units, such as the TI-C64xx and SC140. However, the RSVP™ speedups are obtained without the laborious assembly language programming and scheduling that is often required for DSPs to achieve their published performance. Moreover, DSP code is often scheduled assuming a low-latency memory, which can lead to the need for large on-chip memories that dwarf the DSP core itself. The decoupled memory access of the RSVP™ architecture, coupled with a highly capable memory controller and the predictable nature of streaming data access patterns, can obviate the need for large on-chip memories in many cases.

Table 2. Performance comparison on application benchmarks‡

                Frame size   ARM946ES             RSVP™ Co-processor   Speedup versus
Memory timing                1       40(22)-2     1       40(22)-2     ARM946ES
ISP†            VGA          228     229          7.96    8.85         28-26
JPEG encode     VGA          51.5    52.2         6.92    7.29         7.4-7.2
MPEG4 encode    QCIF         8.75    10.7         1.83    2.73         4.8-3.9
MPEG4 decode    QCIF         2.62    3.29         1.07    1.42         2.44-2.32
H264 decode     VGA          36      50           15.5    -            2.3

‡ Expressed in 10^6 cycles/frame.
† The ISP benchmark is composed of white balance, bilinear color interpolation, a 3x3 low-pass filter, gamma correction, color correction, a 3x3 high-pass filter on the Y component, a 3x3 pseudo-median filter on the Cr/Cb components, RGB-to-YCC color conversion, and chroma decimation.

The benchmarks presented in Table 2 are full-blown applications.
They include an internally developed image processing chain (ISP), JPEG encode, MPEG4 encode and decode, and H264 decode. Table 2 presents performance numbers assuming a 1-cycle and a 40(22)-2-cycle memory access timing (40 cycles for the first 32-bit access to DRAM outside the contents of the row buffer, 22 cycles for the first 32-bit access to DRAM in the row buffer, and 2 cycles for each 32 bits thereafter). These timings model the entire hierarchical memory system, both on- and off-chip. The speedup column presents a range of speedups, which depend on the memory characteristics. It is important to note that the RSVP™ architecture is largely insensitive to memory latency.

Compared to an ARM946ES, the RSVP™ co-processor achieves a speedup in the range of 2-28x. In most cases, the speedup is ultimately limited by the fraction of the application that is streaming in nature and can be executed on the RSVP™ co-processor. Additional speedup can be obtained if the code is restructured to allow the ARM and RSVP™ programs to run asynchronously, or if the code is restructured to operate on longer vectors. Asynchronous operation has been explored for H264 and JPEG, resulting in speedups of 2.8x and 10.4x over the ARM946ES, respectively.

The power consumption of the implementation shown in Figure 3 has been measured to be comparable to that of the ARM946ES core and its 8KB caches. This shows the ability of an optimized co-processor to offload heavy streaming computation while reducing power. A second-generation implementation in 90nm with extensive clock gating is underway to further improve power consumption.

Future Work

The RSVP™ architecture is being extended, and its implementation modified, as experience is gained in applying it to mobile imaging applications. Current experience indicates that the overhead of setting up the RSVP™ co-processor for execution, and the limits on the address patterns that the stream units can generate, are performance issues. New stream descriptors are being developed and extended to other system components [14,19]. Work on new algorithms, and on restructuring current algorithms, is underway to take full advantage of stream processing.

SUMMARY

Mobile imaging applications continue to demand more processing capability to improve quality and performance. Furthermore, emerging features based on computer vision algorithms will soon become commonplace. This paper presents the streaming computation model with an architecture that supports these applications in an embedded system. To improve performance within limited energy and memory bandwidth, data access and data processing are decoupled in the stream processor. Stream descriptors are used to leverage the deterministic memory access patterns, while data flow graphs are used to reconfigure the datapath to achieve higher computation parallelism.

ACKNOWLEDGMENT

The authors thank the Motorola Labs development team in the Embedded Imaging Systems Lab, and Dr. Yi Wei for obtaining the die photo.

REFERENCES
[1] Brian O'Rourke, "CCDs & CMOS: Zooming in on the Image Sensor Market," In-Stat Report IN030702MI, September 2003.
[2] D. S. Wills, J. M. Baker, Jr., H. H. Cat, S. M. Chai, L. Codrescu, J. Cruz-Rivera, J. C. Eble, A. Gentile, M. A. Hopper, W. S. Lacy, A. López-Lagunas, P. May, S. Smith, and T. Taha, "Processing architecture for smart pixel systems," IEEE J. Select. Topics Quantum Electron., vol. 2, no. 1, pp. 24-34, 1996.
[3] Wayne Wolf, Burak Ozer, and Tiehan Lv, "Smart cameras as embedded systems," IEEE Computer, September 2002, pp. 48-53.
[4] J. Adams, K. Parulski, and K. Spaulding, "Color processing in digital cameras," IEEE Micro, vol. 18, pp. 20-30, 1998.
[5] Vasudev Bhaskaran and Konstantinos Konstantinides, Image and Video Compression Standards: Algorithms and Architectures, 2nd edition, Kluwer Academic Press, 1997.
[6] Saman P. Amarasinghe and William Thies, "Architecture, Languages and Compilers for the Streaming Domain," PACT 2003 Tutorial, http://cag.lcs.mit.edu/wss03/.
[7] Parthasarathy Ranganathan, Sarita Adve, and Norman P. Jouppi, "Performance of image and video processing with general-purpose processors and media ISA extensions," Proceedings of the 26th Annual International Symposium on Computer Architecture (ISCA'99), May 1999, pp. 124-135.
[8] Michael Bedford Taylor, et al., "Evaluation of the Raw microprocessor: An exposed-wire-delay architecture for ILP and streams," Proceedings of the 31st Annual International Symposium on Computer Architecture (ISCA'04), June 2004, pp. 2-14.
[9] Scott Rixner, William J. Dally, Ujval J. Kapasi, Brucek Khailany, Abelardo López-Lagunas, Peter R. Mattson, and John D. Owens, "A bandwidth-efficient architecture for media processing," Proceedings of the 31st Annual ACM/IEEE International Symposium on Microarchitecture, November 1998, pp. 3-13.
[10] William J. Dally, Patrick Hanrahan, Mattan Erez, Timothy J. Knight, François Labonté, Jung-Ho Ahn, Nuwan Jayasena, Ujval J. Kapasi, Abhishek Das, Jayanth Gummaraju, and Ian Buck, "Merrimac: Supercomputing with streams," Proceedings of the SuperComputing SC'03 Conference, Phoenix, Arizona, November 2003, pp. 35-43.
[11] S. Chiricescu, R. Essick, B. Lucas, P. May, K. Moat, J. Norris, M. Schuette, and A. Saidi, "The Reconfigurable Streaming Vector Processor (RSVP™)," Proceedings of the 36th International Symposium on Microarchitecture, December 2003, pp. 141-150.
[12] S. Chiricescu, M. Schuette, R. Essick, B. Lucas, P. May, K. Moat, and J. Norris, "RSVP™: An automotive vector processor," Proceedings of the IEEE Intelligent Vehicles Symposium, June 2004, pp. 563-568.
[13] W. A. Wulf and S. A. McKee, "Hitting the memory wall: Implications of the obvious," ACM SIGARCH Computer Architecture News, vol. 23, no. 1, March 1995, pp. 20-24.
[14] S. M. Chai and A. López-Lagunas, "Streaming I/O for imaging applications," IEEE International Conference on Computer Architecture for Machine Perception, July 2005, pp. 178-183.
[15] Texas Instruments, TMS320C64x DSP Library Programmer's Reference, Literature Number SPRU565B, October 2003, p. 4-57.
[16] S. Ohm, F. Kurdahi, and N. Dutt, "A unified lower bound estimation technique for high-level synthesis," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 16, no. 5, May 1997, pp. 458-472.
[17] J. Jeon, D. Kim, D. Shin, and K. Choi, "High-level synthesis under multi-cycle interconnect delay," ASP-DAC, 2001, pp. 662-667.
[18] S. Tarafdar and M. Leeser, "The DT-model: High-level synthesis using data transfers," Proceedings of the 35th Annual Conference on Design Automation, 1998, pp. 114-121.
[19] A. López-Lagunas and S. M. Chai, "Memory bandwidth optimization through stream descriptors," Memory Performance: Dealing with Applications, Systems and Architecture (MEDEA) Workshop, St. Louis, Missouri, September 2005, pp. 59-66.

RSVP™ is a trademark of Motorola Inc. Other product names are the property of their respective owners. A patent is pending that claims aspects of items and methods described in this paper.