The MAP-CA VLIW-based Media Processor From Equator Technologies, Inc. and Hitachi, Ltd.

Chris Basoglu, PhD*, Karl Zhao, PhD*, Keiji Kojima†, Atsuo Kawaguchi†

* Equator Technologies, Inc., 1300 White Oaks Road, Campbell, CA 95008, USA

† Hitachi, Ltd., Systems Development Laboratory, 1099 Ozenji, Asao-ku, Kawasaki 215-0013, Japan

1 Introduction

The Media Accelerated Processor for Consumer Appliances (MAP-CA) is the most recent member of a family of VLIW media processors built from a processor platform technology centered on an advanced VLIW architecture and a mature trace-scheduling compiler. The key characteristics of the MAP-CA processor are:

• The MAP-CA is a fusion of the best of next-generation general-purpose RISC and high-performance digital signal processor (DSP) architectures, seamlessly integrated into a VLIW architectural framework. Very Long Instruction Word (VLIW) processors are emerging as the predominant next-generation processor architecture, one that both Equator and Hitachi have been working on since the late 1970s.



• A VLIW compiler is the very foundation of a VLIW architecture and an integral part of the MAP-CA architecture design. Many compiler-friendly hardware features were embedded in the MAP-CA architecture, which fully supports a 100% C programming paradigm even for signal- and image-processing tasks.



• The MAP-CA delivers unparalleled DSP computational power through support of data-level parallelism using SIMD partitioned operations on a patent-pending hardware implementation. At 300 MHz, the MAP-CA mediaprocessor delivers more than 11 GOPS of sustained 16-bit SIMD DSP operations, more than 24 GOPS of sustained 8-bit SIMD operations, and more than 30 GOPS for sum-of-absolute-difference (SAD) block matching. This SIMD hardware implementation, supporting in excess of 1000 instructions, has DSP performance far beyond any available or even announced next-generation VLIW DSP.



• Besides the powerful VLIW core within the MAP-CA mediaprocessor, additional co-processors such as the VLx, VF, and DataStreamer™ provide a powerful and sophisticated architecture for sustained performance very close to peak performance over a broad range of applications using simple C programming techniques.



• On-chip memories supporting these multiple processing elements include a very high-bandwidth, multi-ported Data Cache. The Data Cache is connected through a high-bandwidth crossbar switch to external SDRAM, DataStreamer memory, and VLx co-processor memory. A traditional Instruction Cache is also resident on the MAP-CA and fully supports patented VLIW memory-compression techniques.



• A set of media I/O interfaces, both digital and analog, along with a 66 MHz, 32-bit PCI bus makes the MAP-CA a complete system-on-chip solution. The patent-pending DataStreamer™ supports a set of complex I/Os, so minimal VLIW-core interaction is required on I/O transactions.

The flexibility and performance of Equator and Hitachi's jointly developed MAP-CA platform technology match the requirements of the rapidly evolving and computationally demanding nature of broadband digital media, high-performance digital imaging, and advanced networking products. The recent start of commercial digital-TV broadcasting, the rapid adoption of broadband Internet streaming video, and the rapidly growing acceptance of personal video recording (PVR) services create a need for multi-function digital TVs and set-top boxes (STBs). Most of today's digital-TV sets are built with fixed-function ASICs managed by conventional RISC microprocessors. With today's rapidly evolving functional requirements, these fixed-function devices plus RISC processors are prone to rapid obsolescence, while evolving functionality requires new and specialized chips. This combination of rapidly increasing functionality and the inflexibility of fixed-function ASICs draws the interest of consumer OEMs and cable MSOs to the MAP-CA.


A multitude of office-automation products, such as color printers and copiers, demands very high-performance digital image processing. Traditionally, these image-processing engines were implemented with fixed-function ASICs and RISC processors. The MAP-CA enables designers to implement the various image-processing engines found in different office-automation products in software. This dramatically reduces development cost and time to market compared with the traditional fixed-function-plus-RISC design approach: for each ASIC that must be designed, even one offering only slight improvements over existing devices, considerable engineering resources must be deployed, and time to market suffers greatly relative to a mediaprocessor approach. Although the use of VLIW mediaprocessors in networking products is still somewhat limited, the demand for faster Internet access through wired and wireless connections creates a strong need for such a processor in advanced networking products. Today the turnover rate for Internet hardware is extremely high, making the life spans of proprietary fixed-function ASICs very short. Interest in mediaprocessors as a way to maximize product life cycles in networking and communications products is therefore growing rapidly.

In Section 2 we discuss the fundamentals of Equator's platform technology, which includes the VLIW architecture, the trace-scheduling compiler, and various architecture simulators. We also review the predecessor of the MAP-CA, the MAP1000A, and its applications. In Section 3 we provide an architectural overview of the MAP-CA, including the VLIW core, the DataStreamer, the VLx co-processor, and the I/O sub-system. We also discuss the performance requirements for various applications. Section 4 concludes the article.

2 Equator's Platform Technology

2.1 The VLIW Approach

Instruction-level parallelism (ILP) has been a central focus of computer-architecture research over the past 20 years. There are two schools of thought on how to achieve a high degree of instruction-level parallelism: the VLIW and super-scalar architectural approaches. The MAP processor family is a new architecture based on VLIW, whereas today's popular microprocessors are based on the super-scalar approach. Real-time processing of multimedia data stresses a processor architecture in many different ways. There are three basic ways to increase a processor's performance: decrease the cycle time, decrease the number of cycles required to execute an instruction, and execute more instructions per cycle. The first two are becoming increasingly difficult to improve beyond process improvements, while the last is still under-exploited.

Executing more instructions per cycle exploits the natural parallelism available in most applications. Very Long Instruction Word (VLIW) processors describe this parallelism explicitly by packing multiple operations into a single instruction word, which is then executed as one very wide instruction. VLIW differs from super-scalar architectures in that the grouping and scheduling of instructions for execution is done at compile time rather than at execution time. Therefore, VLIW processors do not need the significant on-chip hardware required to resolve data dependencies and issue parallel operations. By moving the difficult task of finding parallelism to the compiler, VLIW architectures dramatically simplify the CPU design by reducing the gate count, freeing valuable die area for other performance enhancements or significantly reducing cost through a smaller die. While the VLIW architectural approach is primarily designed to exploit parallelism, its simplification of the processor architecture inherently supports reduced cycle times as well.

The scheduling of all parallel operations is the responsibility of the compiler. Thus the compiler must understand the target architecture intimately to be able to schedule any given algorithm for optimal performance. The compiler searches for eligible operations, checks for data dependencies, controls any resource conflicts, and packages these operations into VLIW instruction words. The compiler can explore across natural code boundaries such as branches for opportunities for increased parallelism, far beyond the limited search window of existing super-scalar architectures. The Equator compiler uses a technique known as trace scheduling to search a whole routine for eligible operations.
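To make the idea concrete, the following is a minimal illustrative C fragment (it is not taken from the MAP-CA toolkit): the statements in the loop body are mutually independent, so a VLIW compiler can schedule them into the same long instruction word instead of issuing them sequentially, with no programmer involvement beyond writing ordinary C.

    /* Minimal sketch: independent operations that a VLIW compiler can pack
       into one wide instruction word.  The two output streams have no
       dependencies on each other, so their multiplies and adds can be
       scheduled in parallel across the available functional units. */
    void scale_two_streams(int *dst0, int *dst1, const int *src0,
                           const int *src1, int n, int gain, int offset)
    {
        for (int i = 0; i < n; i++) {
            int a = src0[i] * gain;      /* independent of the b stream */
            int b = src1[i] * gain;      /* independent of the a stream */
            dst0[i] = a + offset;
            dst1[i] = b + offset;
        }
    }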


2.2 Trace Scheduling Compiler

The most important requirement for the MAP processor is to free the developer from the difficulty of application development through the use of high-level language programming while still delivering full processor performance. The goal of supporting product developers from first prototype through production shipments while staying 100% in C, avoiding the hand-written, hand-scheduled assembly code typical of today's DSP-based systems, has been achieved. Benefits of programming in C include reduced development costs, reduced time to market, lower system costs, reduced maintenance time, and field software-based upgrades.

The core technology used in the Equator C compiler was initially developed in 1983 for the Multiflow TRACE series of mini-supercomputers [3]. The Multiflow machines for which the compiler was developed had a 1024-bit VLIW architecture with up to 28 parallel functional units and sixteen banks of registers, posing severe challenges in identifying and scheduling application parallelism. Equator has extended and improved this base compiler technology, focusing on minimizing object-code size and supporting much more aggressive approaches to optimization and scheduling. The Equator compiler uses complex inline expansion, assertions, and loop unrolling with trace-frequency-estimation algorithms to maximize C code efficiency. These optimizations include:
• uncovering instruction-level parallelism,
• managing register scheduling, pipeline loading, and functional units,
• generating instruction schedules that exploit parallelism,
• supporting extensive global optimization, analysis, and scheduling,
• restructuring loops through software pipelining, unrolling, and preconditioning,
• providing local scheduling and optimization,
• supporting media-oriented machine facilities,
• reducing data dependencies to maximize scheduling efficiency.

The compiler also takes advantage of predication on the VLIW core to eliminate unnecessary branch operations. Even when control flow cannot be eliminated, for example when a subroutine is called conditionally, the compiler is able to compact code. Thus the C code in Figure 1a is compiled into the code in Figure 1b. As can be seen, the entire code segment executes in only 4 cycles.

 2:
 3: if (i > j) {
 4:     if (flag) {
 5:         i++;
 6:     }
 7:     else {              /* Inner else clause */
 8:         sub(i);
 9:     }
10: } else ...              /* Outer else clause */

(a)

c0: pr0 =/p gr0 > gr1                                  -- line 3 (means pr0 = gr0 > gr1 and pr1 = !(gr0 > gr1))
c1: pr2 =/p&(pr0) gr2 != 0                             -- line 4 (means pr2 = pr0 & (gr2 != 0) and pr3 = pr0 & !(gr2 != 0))
c2: if pr1, branchto LABEL_WHERE_OUTER_ELSE_CLAUSE_IS  -- line 3
c3: if pr2, gr0 = gr0 + 1                              -- line 5
    if pr3, branchto LABEL_WHERE_SUBROUTINE_CALL_IS    -- line 4

(b)

Figure 1: C code and the corresponding code generated by the Equator C compiler.

The Multiflow compiler relied on a combination of aggressive loop unrolling and trace-scheduling techniques to exploit parallelism. This technique, while tolerant of a wide range of code structures and machine resources, produces larger object code than software pipelining. Equator has focused on achieving small object-code size along with high performance, and has built an integrated back end that combines both software pipelining and trace scheduling. Many of the applications envisioned for the MAP-CA are embedded: code changes infrequently and production volumes are high. Compared with a general-purpose computing environment, much longer compile times are acceptable for final compilation if even small improvements in application performance can be achieved. Equator has explored several approaches to optimization and scheduling that evaluate multiple alternatives, including genetic algorithms for operand placement and scheduling.


In several key instances, allowing aggressive backtracking has produced performance improvements on the order of 5%-7% compared to the existing single-pass heuristic approaches.
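As a minimal sketch of the kind of restructuring discussed above (written out by hand here only for illustration; on the MAP-CA the compiler performs this transformation automatically from the plain rolled loop), unrolling a reduction loop with separate partial sums exposes independent multiply-adds that a software-pipelining scheduler can overlap:

    #include <stdint.h>

    /* Dot product unrolled by four with independent partial sums.  The
       separate accumulators break the single serial accumulation chain,
       so multiply-adds from different iterations can be scheduled in
       parallel and pipelined. */
    int32_t dot16(const int16_t *a, const int16_t *b, int n)   /* n assumed a multiple of 4 */
    {
        int32_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        for (int i = 0; i < n; i += 4) {
            s0 += (int32_t)a[i]     * b[i];
            s1 += (int32_t)a[i + 1] * b[i + 1];
            s2 += (int32_t)a[i + 2] * b[i + 2];
            s3 += (int32_t)a[i + 3] * b[i + 3];
        }
        return s0 + s1 + s2 + s3;
    }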

2.3 Instruction Set Architecture and Software Tools

As previously discussed, the roots of the MAP architecture are a VLIW architectural approach and a highly integrated C compiler. The micro-architecture of MAP processors is driven specifically by the requirements of digital signal and image computing at both the algorithmic and system levels. This requires the concurrent development of many digital-media, digital-imaging, and communications applications together with the development of architecture simulators and refinements to the C compiler. As part of the instruction set architecture (ISA) development process, a full machine simulator and an optimizing compiler were developed so that the ISA could be evaluated from the perspective of compiler efficiency on a variety of media algorithms. Fast algorithms for motion estimation, the inverse discrete cosine transform (IDCT), the fast Fourier transform (FFT), finite-impulse-response (FIR) filtering, and others were developed, compiled, and simulated before the final chip design was completed. This early effort to tune the combination of compiler, operating system, and application software allowed the chip architecture to reach very high performance with a minimum of tuning after first silicon 18 months ago.

The Equator C compiler allows direct access to almost all of the MAP-CA opcodes from C source code through a mechanism called media intrinsics. Many compilers provide asm statements that allow direct embedding of assembly-language instructions in a C program, with inline expansion to insert low-level code; however, asm statements are not an efficient approach to optimization because they defeat the purpose of an optimizing compiler. Asm statements require the programmer to manually allocate registers and schedule instructions, and they usually require disabling compiler optimizations, which can lead to inefficiently compiled code. The Equator compiler accepts media intrinsics just like any other C operators (e.g., integer addition, floating-point square root, Boolean and bitwise operations). All the advanced features of the optimizing compiler are applied to media-intrinsic operations just as they are to the usual C operators. The media intrinsics can therefore be considered an integrated extension of the C language for the Equator C compiler.
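The fragment below is purely illustrative of this programming model; the intrinsic name mentioned in the comment is a hypothetical placeholder, not an actual identifier from the Equator toolkit. The same computation is shown as ordinary C, with a note on how an intrinsic form differs from an asm statement.

    #include <stdint.h>

    /* Ordinary C: the optimizing compiler must recognize the pattern and map
       it onto a partitioned SIMD operation on its own. */
    void add_saturate_u8(uint8_t *dst, const uint8_t *a, const uint8_t *b, int n)
    {
        for (int i = 0; i < n; i++) {
            int s = a[i] + b[i];
            dst[i] = (uint8_t)(s > 255 ? 255 : s);   /* saturating byte add */
        }
    }

    /* With a media intrinsic, an eight-byte partitioned saturating add could
       be written directly, e.g.
           d = hypothetical_partitioned_add_sat_u8(x, y);
       (hypothetical name).  Unlike an asm statement, the compiler still
       performs register allocation, scheduling, and all other optimizations
       on the intrinsic, treating it like any other C operator. */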

The software developer's toolkit includes three software simulators: trsim, wimsim, and casim.

• Trsim is a high-speed, instruction-level simulator of the MAP-CA core unit. It works on an intermediate representation of a program and can be used to develop applications initially and to experiment with the performance of different algorithms and compiler options.



• Wimsim is also a high-speed, instruction-level simulator, but it operates on actual MAP-CA binaries. Wimsim is a functional simulator of the MAP-CA core unit with the Data Cache, Data Streamer, and a subset of I/O devices. It provides runtime checks against resource constraints and all MAP-CA features necessary to simulate a MAP-CA running a real-time operating system and applications.



• Casim is a nearly cycle-accurate simulator of the MAP-CA that models in more detail the VLIW core, Data Streamer, VLx, Video Filter (VF), instruction and data caches, memories, buses, and a subset of supported I/O devices. Like wimsim, it operates on actual MAP-CA binaries and provides more detailed runtime checks against resource constraints. Casim gives visibility into internal machine state and bandwidth, which aids debugging and tuning of the interactions between the VLIW core, Data Streamer, and VLx programs.

Equator's development environment includes a source-level debugger based on gdb (the GNU debugger). Equator's gdb (egdb) runs on both Windows NT and Linux. The egdb debugger allows the user to:
• load a MAP-CA application from the host PC file system onto the MAP-CA and run it,
• set, list, and clear software and hardware breakpoints,


• single-step through both C source code and assembly instructions,
• debug optimized C code at the source level,
• examine and deposit values into local variables, global variables, and PIO space,
• examine and deposit values into all registers,
• examine the stack, including stack backtracing.

Figure 2 illustrates how quickly C source code can be optimized with the Equator C compiler. There are three points on each curve of Figure 2. The first point represents the cycle count obtained using only automatic optimization. The second point is achieved by adding memory-disambiguation directives for the compiler. The third point results from modifying the source code to use the media intrinsics.

[Figure 2 plots cycle counts (vertical axis) against days of optimization effort (horizontal axis, 0 to 6 days) for FIR, FIR with redundant loads removed, DCT, vector multiply, MAC, and VQ codebook-search kernels; for example, the FIR kernel drops from 13163 cycles to 558 cycles.]

Figure 2. Engineering effort spent to make a particular set of algorithms run efficiently on the MAP-CA. These functions may execute even more quickly with different algorithmic approaches.


By using the compiler and the software tools available for the MAP processors, engineers have more time to improve their algorithms instead of spending months or years hand-tuning assembly code. When an improved algorithm comes along, it can be implemented and tested quickly. With new and more efficient algorithms, companies can build more significant intellectual property, and the results can be propagated to end customers more quickly.

2.4 MAP1000A Processor

The MAP1000A is the predecessor of the MAP-CA processor. Both the MAP1000A and the MAP-CA are the result of Equator's software-first design methodology discussed in the previous sections. A rich set of real-time applications from Equator and Hitachi is supported on the MAP1000A [6-8], all written in high-level C. These applications include an MPEG-2 MP@ML encoder and decoder, an all-format decoder (AFD™), an H.263 encoder, an MPEG-4 decoder, Dolby AC-3 and MP3 audio decoders, a G.729a speech codec, and a V.34 modem. Equator and Hitachi's third-party software vendors have developed many other applications for communications and DTV. These applications have also been incorporated into system-level demonstrations such as a personal video recorder, analog time-shifting, streaming video, and voice over IP. Several real-time operating systems (RTOS) have been ported to the MAP1000A, including the popular VxWorks RTOS from Wind River Systems. Over the last year the MAP1000A has been designed into many products, including advanced digital set-top boxes, cable head-end equipment, color printers and copiers, video teleconferencing systems, video non-linear editing systems, IP voice and video telephones, and xDSL gateways. To date there are over 40 customers, 60 design wins, and over 430 customer engineers working on products based on the MAP1000A.

3 A New Member in the MAP Family: MAP-CA

The MAP-CA is a new member of Equator/Hitachi's MAP processor family. It is very similar to the MAP1000A in terms of core architecture, but it is more compact in the sense that (1) it removes the floating-point unit, (2) it eliminates the 3D-acceleration unit, and (3) it streamlines the I/O systems. On the other hand, the MAP-CA offers more fixed-point computational power at 300 MHz, twice the MAP1000A data cache and instruction cache, improved VLx and Video Filter co-processors, and increased co-processor memory. The MAP-CA delivers cost, performance, and power characteristics suitable for mass-production devices such as digital set-top boxes, broadband home gateways, color copiers and printers, and communications equipment.

The MAP-CA's major functional units consist of a VLIW core, programmable co-processors, on-chip memories, and I/O interfaces. The VLIW core operates at 300 MHz, executing four operations in parallel, and supports partitioned SIMD operations for 8/16/32/64-bit data types. The co-processors are the VLx, the VF, and the DataStreamer™. The VLx offloads bit-oriented serial operations from the MAP-CA core, while the VF accelerates video filtering and scaling operations. The DataStreamer™ is a high-performance, programmable DMA engine that provides buffered data transfers between different MAP-CA memory subsystems or between memories and I/O devices. On-chip memories include a 32 KB Data Cache, a 32 KB Instruction Cache, 4 KB of data memory and 4 KB of program memory for the VLx, a 6 KB line-buffer memory for the VF, and an 8 KB buffer memory for the DataStreamer™. Key media I/O interfaces are integrated on the processor along with one 32-bit 33/66 MHz PCI bus. Figure 3 shows the block diagram of the MAP-CA.


[Figure 3 (block diagram): the VLIW core (two clusters, each with an I-ALU and an IG-ALU sharing a register file), Instruction Cache, Data Cache, the VLx co-processor with its local data memories, the Video Filter with its line-buffer memory, the Data Streamer, the data-transfer switch (DTS), the SDRAM memory controller, the PCI interface, PLLs, the display refresh controller, and media I/O interfaces (ITU-R BT.656 inputs/outputs, transport channel interfaces, I2S/IEC958 audio, I2C, analog RGB, flash ROM, and JTAG) connected through an I/O selector.]

Figure 3: Block diagram of the MAP-CA

3.1 VLIW Core

3.1.1 Execution Units

The MAP-CA has two clusters. Each cluster contains two functional units: an I-ALU and an IG-ALU. Each I-ALU contains a load/store unit, an integer ALU, and a branch unit. Each IG-ALU contains an integer ALU and a general-purpose signal- and image-processing operation unit. The architecture of the MAP-CA VLIW core is shown in Figure 4.


[Figure 4 (block diagram): two identical clusters, each with a register file, predicate registers, and special registers; each cluster's I-ALU (first integer unit) connects to its own Data Cache port, and each IG-ALU comprises a second integer unit, a graphics unit, and a media unit.]

Figure 4. Architecture of the MAP-CA VLIW Core

Every MAP-CA VLIW instruction contains four operations. The basic operation format for the MAP-CA is a three-operand register-to-register operation. Native data types include 1-bit logical values; 8-, 16-, 32-, and 64-bit integers; and 32-bit addresses. The media-intrinsic operations include partitioned operations over these data types. Load and store operations can perform 1-, 2-, 4-, and 8-byte accesses, with support for both little- and big-endian byte orderings. Dynamic address translation and virtual-memory protection are fully supported. The 1-bit logical values are also used to support predicated execution, which substantially enhances available parallelism by allowing partial speculation while eliminating branches. Each cluster has its own register file containing sixty-four 32-bit registers usable in pairs as 64-bit registers, sixteen 1-bit predicate registers, and four special 128-bit registers. These large register files help minimize unnecessary instruction dependencies caused by logically distinct reuses of registers.

3.1.2 VLIW Core Operations

The I-ALU mainly performs load and store, branch, 32-bit integer arithmetic, logical and bitwise logical, address-calculation, and system-control operations. IG-ALU operations include:
• 32-bit integer arithmetic, logical, and bitwise logical operations,
• 64-bit integer arithmetic operations,
• shift/extract/merge operations,
• 64-bit SIMD operations (with 8-bit, 16-bit, and 32-bit partitions), including selection, comparison, maximum and minimum selection, addition, multiply-add, complex multiplication, inner product, and sum of absolute differences,
• 128-bit partitioned SIMD operations (with 8-bit, 16-bit, and 32-bit partitions), including an inner product with partition shift-in for efficient FIR filtering and a sum of absolute differences with partition shift-in for efficient block matching.

Certain operations (such as the DSP SIMD operations) require more than one cycle to complete. No hardware interlocks are needed to prevent the issue of an operation that attempts to read a result that has not yet completed; the compiler, not the hardware, is responsible for correct scheduling. Register scoreboarding is supported for outstanding loads.


Nearly all operations can have their effect controlled by the value of a selected 1-bit predicate register. A predicate register is tested to determine whether or not the operation should be performed. This allows the compiler to aggressively convert control flow into data flow, enabling a substantially higher degree of instruction-level parallelism. This use of predicate registers also greatly reduces branching penalties, without the cost and complexity of hardware branch prediction.

The IG-ALU is also capable of performing 128-bit integer multiply-add (MAD) operations as well as 128-bit vector-sum operations. This allows the sum of absolute differences, Σ_{i=0}^{n−1} |a(i) − b(i)|, and the inner product, Σ_{i=0}^{n−1} a(i)·b(i), to be computed with a single operation, greatly improving the performance of video compression and decompression applications. Thus one IG-ALU executes eight 16-bit fixed-point MADs every cycle. Since there are two IG-ALUs, the MAP-CA can perform 3.52 billion 16-bit multiply-add operations per second at 300 MHz.
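For reference, the two reductions above correspond to the following scalar C. This is a plain reference sketch of the mathematics only, not the partitioned MAP-CA implementation, which computes eight 16-bit terms per IG-ALU per operation.

    #include <stdint.h>
    #include <stdlib.h>

    /* Sum of absolute differences: sum over i of |a(i) - b(i)|. */
    int32_t sad16(const int16_t *a, const int16_t *b, int n)
    {
        int32_t acc = 0;
        for (int i = 0; i < n; i++)
            acc += abs(a[i] - b[i]);
        return acc;
    }

    /* Inner product: sum over i of a(i) * b(i). */
    int32_t inner_product16(const int16_t *a, const int16_t *b, int n)
    {
        int32_t acc = 0;
        for (int i = 0; i < n; i++)
            acc += (int32_t)a[i] * b[i];
        return acc;
    }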

3.2 SOC – System-Level Parallelism

3.2.1 Cache and Data Streamer

The MAP-CA supports several on-chip memories with access to high-speed, low-cost SDRAM and other memories. The VLIW core is equipped with a 32 KB Instruction Cache and a four-port 32 KB Data Cache for caching instructions and data from SDRAM. In addition to the two I-ALU ports, the Data Cache supports a port to the DTS (data-transfer switch) and a port to the DataStreamer™. The VLx co-processor uses 4 KB of instruction memory and 4 KB of data memory. The Video Filter uses a 6 KB line-buffer memory. The VLx and VF memories (14 KB in total) are also accessible by the VLIW core through uncached load/store operations, and they are available to the DataStreamer™ and for external use via the PCI bus. The line-buffer memory is used to hold the contents of the flash ROM at system boot.

The Data Streamer provides a flexible mechanism to orchestrate the transfer of data between external memory, external buses, the processor cache, and on-chip devices. It gives the application programmer the means to develop image-processing software that avoids cache misses and cache-refill events. In inner loops operating over large images or streams of data, the programmer uses the Data Streamer to move strips, blocks, or patches of data on and off the chip and between functional blocks within the chip. The Data Streamer accepts a complex chain of commands and executes them in a separate thread of control with minimal CPU support, using internal buffer memory to stream data to and from the chip and the external SDRAM. The programmer can use a conventional double-buffering scheme to fully overlap data movement with VLIW-core processing. This double-buffering technique, coupled with multidimensional transfers, eliminates the bottleneck experienced on conventional microprocessors when operating on image data larger than the cache.

The Data Streamer is a 64-channel DMA controller with a dedicated 4 KB on-chip buffer memory. Concurrent streaming of 64 channels of data allows system-wide channel allocation to application programs and adds tremendous flexibility to the overall architecture. A data-transfer operation involving the Data Streamer is initiated by defining the source and destination blocks using the VLIW core and then enabling the corresponding Data Streamer channels to run independently of the VLIW core. A block that the DataStreamer operates on is specified by its width, height, and the number of bytes to skip between rows.


Other high-performance microprocessors attempt to address data-cache misses with non-blocking caches and software-directed cache prefetching. These capabilities are included in the MAP-CA architecture as well. However, both Hitachi and Equator engineers found these traditional prefetch mechanisms insufficient to sustain high levels of performance when computing over multidimensional data sets, and extremely wasteful of external memory bandwidth. The conventional RISC approach to processing two-dimensional image data stored in memory is to access the data with load/store operations. Such loads and stores systematically miss in the data cache and yield very low processor efficiency. To reduce the penalty of cache misses, data can be prefetched prior to its use. However, prefetch instructions themselves consume execution cycles that could have been used for actual computation. Moreover, the locality of the prefetched data is usually low because the original data is two-dimensional: loading a full cache line at a time brings in significant unused data, polluting the data cache and wasting external memory bandwidth. A key insight for many of these large computations is that the address patterns are highly predictable by the programmer. The Data Streamer can execute these address patterns directly and transfer exactly the desired two-dimensional block of image data, linearizing it into cache lines that the VLIW core can then access without cache misses or cache pollution.
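The double-buffering pattern described above can be sketched as follows. This is a hedged illustration of the control flow only: the ds_* functions are hypothetical placeholders, not the actual Data Streamer API, and the strip size is arbitrary.

    #define STRIP_BYTES (720 * 16)   /* illustrative: one 16-line strip of a 720-pixel-wide image */

    /* Hypothetical placeholders for Data Streamer channel control. */
    extern void ds_start_strip_read(int channel, int strip_index, void *dst);
    extern void ds_wait(int channel);
    extern void process_strip(void *strip);                 /* VLIW-core compute kernel */

    static unsigned char bufA[STRIP_BYTES], bufB[STRIP_BYTES];

    void process_image(int num_strips)
    {
        void *cur = bufA, *next = bufB;
        ds_start_strip_read(0, 0, cur);                     /* prime the pipeline */
        for (int s = 0; s < num_strips; s++) {
            ds_wait(0);                                     /* strip s is now on-chip */
            if (s + 1 < num_strips)
                ds_start_strip_read(0, s + 1, next);        /* fetch strip s+1 in the background */
            process_strip(cur);                             /* compute overlaps the next transfer */
            void *tmp = cur; cur = next; next = tmp;        /* swap buffers */
        }
    }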

3.2.2 VLx Co-processor

The variable-length encoder/decoder (VLx) is a 16-bit RISC co-processor with thirty-two 16-bit registers. It offloads from the VLIW core bit-sequential tasks such as variable-length encoding and decoding (VLE/VLD), and it accelerates video and communications applications such as JPEG, MPEG, H.263, JBIG, and DV. It includes special-purpose hardware for bitstream processing, hardware-accelerated MPEG-2 table lookup, and general-purpose variable-length decoding.
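As a rough illustration of the bit-serial work being offloaded (a generic scalar sketch, not the VLx's table format or microcode), variable-length decoding repeatedly peeks at the next few bits of an MSB-first bitstream and matches them against a code table:

    #include <stdint.h>

    typedef struct { uint16_t code; uint8_t len; int16_t symbol; } VlcEntry;  /* illustrative entry */
    typedef struct { const uint8_t *buf; uint32_t bitpos; } BitReader;

    /* Peek the next n bits (n <= 16), MSB first; assumes at least two more
       readable bytes follow the current byte. */
    static uint32_t peek_bits(const BitReader *br, int n)
    {
        uint32_t byte = br->bitpos >> 3, shift = br->bitpos & 7;
        uint32_t window = ((uint32_t)br->buf[byte] << 16) |
                          ((uint32_t)br->buf[byte + 1] << 8) |
                           (uint32_t)br->buf[byte + 2];
        return (window >> (24 - shift - n)) & ((1u << n) - 1);
    }

    /* Linear-search decode; correct for prefix-free code tables. */
    int16_t decode_symbol(BitReader *br, const VlcEntry *table, int table_len)
    {
        for (int i = 0; i < table_len; i++) {
            if (peek_bits(br, table[i].len) == table[i].code) {
                br->bitpos += table[i].len;                 /* consume the matched code */
                return table[i].symbol;
            }
        }
        return -1;                                          /* no match: bitstream error */
    }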

3.2.3 Video Filter

A polyphase (8-phase) 2D Video Filter takes a 4:2:0 or 4:2:2 YUV stream as input and scales it up or down as required. A 4 (vertical) × 5 (horizontal) tap filter supports up to 768 horizontal pixels, 3 × 5 taps support up to 1024 horizontal pixels, and 2 × 5 taps support up to 1536 horizontal pixels. The Video Filter writes scaled 4:4:4 YUV data to the SDRAM or the DRC through the video bus.
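A minimal scalar sketch of one-dimensional polyphase scaling follows. It is purely illustrative (generic 8-phase, 5-tap horizontal resampling of one 8-bit component) and does not reflect the Video Filter's internal structure or coefficient set.

    #include <stdint.h>

    #define PHASES 8
    #define TAPS   5

    /* coeffs[p][t]: Q8 fixed-point taps for phase p (assumed supplied elsewhere). */
    extern const int16_t coeffs[PHASES][TAPS];

    void scale_line(uint8_t *dst, int dst_w, const uint8_t *src, int src_w)
    {
        uint32_t step = ((uint32_t)src_w << 8) / dst_w;     /* source step per output pixel, Q8 */
        uint32_t pos = 0;
        for (int x = 0; x < dst_w; x++, pos += step) {
            int xi    = pos >> 8;                           /* integer source position        */
            int phase = (pos >> 5) & (PHASES - 1);          /* top 3 fraction bits pick phase */
            int32_t acc = 0;
            for (int t = 0; t < TAPS; t++) {
                int s = xi + t - TAPS / 2;                  /* centered 5-tap window */
                if (s < 0) s = 0;
                if (s >= src_w) s = src_w - 1;              /* clamp at image edges  */
                acc += coeffs[phase][t] * src[s];
            }
            acc >>= 8;                                      /* undo Q8 coefficient scale */
            dst[x] = (uint8_t)(acc < 0 ? 0 : acc > 255 ? 255 : acc);
        }
    }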

3.2.4 Integrated I/O System

Several audio/video interfaces are supported, including ITU-R BT.656 input and output, MPEG-2 transport channel interface (TCI) inputs, a selectable I2C interface, and IEC958 and I2S digital audio interfaces. Two video inputs (two BT.656 inputs, one BT.656 and one TCI, or two TCIs) can be used at the same time. The DRC unit not only supports RGB computer-screen refresh but also has hardware support for overlaying a hardware cursor and graphics/text or a secondary video channel on the primary video channel. A glueless SDRAM controller supports SDRAM at up to 150 MHz; the maximum supported memory size is 128 MB. A 32-bit 33/66 MHz PCI bus interface is also provided. The MAP-CA can be booted either over the PCI bus or from a flash ROM interface. These I/O functions execute in parallel with the CPU and eliminate the need for several external ASICs, with their associated cost and memory-bandwidth issues. Three on-chip PLLs (core/SDRAM, pixel, and audio) generate all internal clocks from a single 27 MHz external clock input. A software PLL mechanism is provided to track the encoder clock coded in the PCR of a transport stream. An IEEE-compliant JTAG interface is provided for manufacturing test.

3.3 Application Performance

When performing typical video-processing functions, the MAP-CA's VLIW core is capable of executing 102 operations per cycle. At a clock frequency of 300 MHz, the MAP-CA is thus capable of a sustained computational performance of 30.7 billion operations per second (GOPS). When performing the DSP staple of the multiply-add operation, the MAP-CA can achieve 25.2 GOPS with 8-bit data and 9.7 GOPS with 16-bit data.


These performance figures do not include other operations that can be performed concurrently by the other on-chip units, such as the VLx co-processor, the Video Filter (VF), and the Data Streamer, each of which is also programmable. Taking into account all of these computing resources, the MAP-CA can achieve up to 45 GOPS.

A rich set of applications is supported by the MAP-CA [6-8], since applications running on the MAP1000A can be recompiled to run on the MAP-CA without modifying the source code. The MAP-CA has sufficient processing power to concurrently perform full 601-resolution MPEG-2 MP@ML encoding and decoding, suitable for analog time-shifting and other personal-video-recording applications. The MAP-CA is also capable of performing all-format decoding (AFD) and Dolby AC-3 audio decoding simultaneously, allowing consumers to view HD programs on standard TV sets. Due to its high video-processing performance coupled with its flexibility, the MAP-CA is an ideal platform for emerging video codecs delivering high video quality at low bit rates, e.g. 1 Mbps. The list of applications shown in Table 1 indicates some of the capabilities of the MAP-CA processor. Note that between 20 and 45 GOPS are available on the MAP-CA, depending on the efficiency of the algorithm and the compiler.
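As a simple arithmetic restatement of the headline figure above (a consistency check only, not an additional measurement):

    102 operations/cycle × 300 × 10^6 cycles/s ≈ 30.7 × 10^9 operations/s = 30.7 GOPS.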

4 Conclusion

We have introduced the MAP-CA, with nearly 45 GOPS, as a revolutionary component for meeting the computing demands of multimedia products such as digital TVs, digital-imaging office products, and Internet communications and networking equipment. Concurrent operation of a variety of applications, including MPEG audio and video encoding and decoding combined with video post-processing, becomes possible with a single processor. With full programmability in high-level languages, and without requiring traditional assembly coding, the MAP-CA is a flexible solution for constantly evolving multimedia communications standards and applications. The MAP-CA's scalable architecture and its portable media-processing C code enable a wide range of products, from the smallest handheld media devices to desktop imaging systems, limited only by the availability of semiconductor technologies. With a flexible, cost-effective, and powerful solution delivering between 20 and 45 billion operations per second, Equator and Hitachi anticipate rapid adoption of the MAP-CA into a variety of consumer, commercial, and professional multimedia communications products.


Table 1. Processing Requirements for Various “Software-only” Media Functions

Video functions (real-time at 30 fps, 33 ms/frame)

  Soft media function                               Million cycles   Million cycles   % of 300 MHz
                                                      per frame        per second       MAP-CA
  MPEG-2 MP@ML Video Decoder (601 resolution)           2.38              72              24%
  MPEG-2 MP@ML Video Encoder (601 resolution) *         6.20             188              63%
  MPEG-4 Simple Profile Decoder (CIF resolution)        1.33              40              13%
  MPEG-4 Simple Profile Encoder (CIF resolution)        5.00             150              50%
  All Format Decoder (decoding all HD formats) *        8.33             250              75%
  H.263 Video Coder/Decoder *                           2.50              76              25%
  De-blocking (601 resolution)                          0.99              30              10%
  De-interlace (601 resolution)                         1.32              40              13%

Imaging functions, 512 × 512 image size (real-time at 30 fps, 33 ms/frame)

  Soft media function                               Million cycles   Million cycles   % of 300 MHz
                                                      per frame        per second       MAP-CA
  2D 512×512-point Complex FFT                          5.76             174              58.2%
  2D Affine Transform (8-bit/16-bit/32-bit)        1.82/1.94/3.24      55/58.8/98     18.3%/18.7%/33%
  2D Perspective Warp                                   3.2               97              32.7%
  2D Canny's Edge Detection                             5.82             176              58.8%
  2D Convolution (7×7 kernel)                           1.75              53              17.7%
  2D Binary Morphological Opening (5×5 kernel)          0.89              27               8.9%

Audio functions (real-time at 30 fps, 33 ms/frame)

  Soft media function                               Million cycles   Million cycles   % of 300 MHz
                                                      per frame        per second       MAP-CA
  AC-3 Audio Decoder, 5.1 channel                       0.75             22.7              7.6%

Communications functions (real-time at 100 fps, 10 ms/frame)

  Soft media function                               Million cycles   Million cycles   % of 300 MHz
                                                      per frame        per second       MAP-CA
  V.34 Modem Transceiver *                              0.12              12               4%
  G.729a Voice Codec *                                  0.15              15               5%
  G.992.2 ADSL Modem Transceiver (ATU-R) *              1.00             100              33%

Note: * indicates that the particular function has not been fully optimized by Equator/Hitachi; the performance number is based on projection.

5 References

1. Chris Basoglu, Donglok Kim, Robert Gove, and Yongmin Kim, "Parallel Image Computing in Modern Commercially Available Processors," International Journal of Imaging Systems and Technology 9, 407-415 (1998).

2. J. Turley, "Multimedia Chips Complicate Choices," Microprocessor Report 10(2), 14-19 (1996).

3. R. P. Colwell, R. P. Nix, J. S. O'Donnell, D. B. Papworth, and P. K. Rodman, "A VLIW Architecture for a Trace Scheduling Compiler," IEEE Transactions on Computers 37, 967-979 (1988).

4. Steve S. Muchnick, Advanced Compiler Design and Implementation, San Francisco: Morgan Kaufmann, 1997.

5. Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman, Compilers: Principles, Techniques, and Tools, Reading, Massachusetts: Addison-Wesley, 1986.

6. Woobin Lee and Chris Basoglu, "MPEG-2 Decoder Implementation on MAP-CA Mediaprocessor Using the C Language," in SPIE Proceedings on Media Processors 2000, Vol. 3970, 2000, in press.

7. Inga Stotland, Donglok Kim, and Yongmin Kim, "Image Computing Library for a Next-Generation VLIW Multimedia Processor," in SPIE Proceedings of Electronic Imaging, Vol. 3655, pp. 47-55, 1999.

8. Robert Gove, Woobin Lee, Chris Basoglu, and Yongmin Kim, "Next-Generation Media Processors and Their Impact on Medical Imaging," in SPIE Proceedings of the Medical Imaging Conference, 1998.

9. P. Geoffrey Lowney, Stefan M. Freudenberger, Thomas J. Karzes, W. D. Lichtenstein, Robert P. Nix, John S. O'Donnell, and John C. Ruttenberg, "The Multiflow Trace Scheduling Compiler," The Journal of Supercomputing 7, 51-142 (1993).