multithread video coding processor for the videophone - CiteSeerX

1 downloads 348 Views 282KB Size Report
In contrast, the programmable architecture is a natural solution to get better exibility .... We designed a video codec for the videophone or teleconference on the.
MULTITHREAD VIDEO CODING PROCESSOR FOR THE VIDEOPHONE Jeong-Min Kim, Seok-Kyun Hong, Eel-Wan Lee, and Soo-Ik Chae Department of Electronics Engineering and Inter-university Semiconductor Research Center Seoul National University 151-742, Seoul, Korea

Abstract - The architecture of a programmable video codec IC is described that employs multiple vector processors in a single chip. The vector processors operate in parallel and communicate with one another through on-chip shared memories. A single scalar control processor schedules each vector processor independently to achieve real-time video coding with special vector instructions. With programmable interconnection buses, the proposed architecture performs multi-processing of tasks and data in video coding. Therefore, it can provide good parallelism as well as good programmability. Especially, it can operate multithread video coding, which processes several independent image sequences simultaneously. We explain its scheduling, multithread video coding, and vector processor architectures. We implemented a prototype video codec with a 0.8um CMOS cell based technology for the multi-standard videophone. This codec can execute video encoding and decoding simultaneously for the QCIF image at a frame rate of 30Hz.

I. INTRODUCTION

Because of increasing demands for digital image transmission and multimedia applications, several international standards [1],[2] for video coding have been established. Many VLSI architectures for the digital video coding systems have been proposed, which are classi ed as dedicated systems and programmable systems from the viewpoint of programmability. Dedicated systems [3] often lead to a compact and cost-e ective hardware realization, but they have some limitation on exibility in adapting di erent coding standards or modi ed algorithms. In contrast, the programmable architecture is a natural solution to get better exibility in the video coding system. However, general purpose DSPs have relatively low computational performance in video applications. Therefore, multi-processing architectures, such as SIMD(single instruction stream, multiple data stream) and MIMD(multiple instruction stream, multiple data stream), have been proposed [4],[5],[6],[7]. Although scalability and modularity of SIMD and MIMD architectures are the merits for VLSI realization, it is

Figure 1: Conceptual diagram of proposed architecture. dicult to distribute resources and to schedule tasks for their best utilization. We propose a multi-processing architecture that has several vector units and processes multiple issue vector operations to control them concurrently with a scalar control processor. A thread is an independent image sequence in coding operation, and the proposed architecture processes multithread video coding. For example, each encoding and decoding image in the one-to-one communication environment such as videophones, constitutes a thread respectively, and in the one-to-many system such as videoconference, each decoding image from di erent channel constitutes a thread also. Therefore, this method is cost-e ective in comparison to the conventional approaches that adopts several video codecs; one for each thread. This paper is organized as follows. In Section II we discuss the proposed architecture and its scheduling for tasks and data distribution, and its instruction set design. In Section III the video codec system and the major functional units, such as SPU, VPU, VLC, ME, are described. In Section IV the overall performance of the video codec and its implementation is described and in Section V conclusions are drawn.

II. PROPOSED ARCHITECTURE A. Multiple Issue Vector Operation

A hybrid coding is composed of motion estimation, transform coding and variable length coding. It has two characteristics in hardware implementation. First, image processing in all hybrid coding is based on the macro block. Data dependency within each macro block exists, but does not among them. Therefore, the data processing in each macro block must be executed sequentially according to the dependency. However, it can be executed concurrently for di erent blocks. Second, each operation of hybrid coding has di erent characteristics. Therefore, the system with identical processing elements to execute all coding operations show lower performance. To design an ecient hybrid coding system, both characteristics must be taken into consideration. The conventional MIMD architecture in which identical processors deal with di erent block data concurrently, exploits the rst characteristics only. We propose a new architecture that exploits both characteristics. The pro-

Figure 2: Conceptual diagram of multiple issue vector operations. posed architecture consists of multiple vector processors and a scalar processing unit(SPU). The vector processors are dedicated function units to process each operation in the video coding. The SPU is a modi ed RISC controller, that has some additional instructions to the conventional RISC, such as vector instructions and synchronization instructions. And it manages the vector processors and the ow of video coding. The conceptual con guration of the proposed architecture are shown in Fig. 1. The vector processors, which have their own local controller, execute intensive computations or regular operations in hybrid coding. The SPU executes complex operations that does not require intensive computations and are not suitable for dedicated implementation. Therefore, we can achieve highthroughput video compression with deep parallel processing of several vector processors. Furthemore, we can obtain better exibility with the SPU. The vector processor processes the data for di erent blocks in parallel. They are stored in an on-chip shared memory, which consists of multi-banks. Each bank can be accessed through the programmable interconnection bus network(IBN). Using multiple issue vector instructions, the SPU controls each vector processor independently, and schedules vector processors in accordance with the coding programs. Therefore, the order of vector operations is preserved. Once the SPU initializes each vector processor with a vector instruction, its own local controller manages it until the completion of the instruction. The SPU can issue other vector instructions without waiting for the completion of previously issued vector instructions. Some vector instructions take several tens or hundreds machine cycles, and they can overlap in time. In the proposed architecture, 4 vector instructions can be executed concurrently. It is easier to schedule the coding operation with a single processor than with multiprocessors because of no synchronizations required to communicate with other processors. It can also reduce the control hardware, which are duplicated in each processor in the multiprocessor system. The conceptual diagram of

Figure 3: An example of multithread video coding composed of encoding and decoding. multiple vector instruction is depicted in Fig. 2.

B. Multithread Video Coding

Two methods exist in multithread video coding with a single processor. One is simultaneous video coding, which processes several threads simultaneously by sharing processor resources. The other is time-sharing video coding, which allocates all processor resources to one thread in one time-quantum and each thread is executed sequentially. The simultaneous video coding is suitable if a little resource sharing exists between threads, such as the videophone. The time-sharing video coding is suitable if frequent resource con icts between threads exist, such as the videoconferencing that have many threads. In the simultaneous multithread, each thread has its own space in the shared memory and has dedicated vector processors if they are used frequently like the DCT module. The other vector processors that are not used frequently can be shared. Each thread has its own coding program and is scheduled at run-time by the SPU. The instruction issue is stalled automatically if the vector instruction to be issued need to use the same vector processor that is already being executed by a previous instruction. Or it is stalled also, if it need to access the shared memory that is being used by others. If any resource con ict is detected, the SPU switches the context to another thread or waits for the time when the con ict is resolved by the completion

Type vector-op

Examples V DCT, V IDCT, V Q, V IQ, V ADD, V VLE HEADER, V CO LD, V ME... vector-sync V VPU SYNC, V COPU SYNC... Table 1: Examples of vector instructions. of previous vector operations. If the SPU detects any resource con icts in the new thread, it will switch the context again to another thread in the same way. If a stopped thread is revisited with thread switching, the SPU starts to execute the coding program from the previously stopped instruction. Therefore, the execution order of threads can be changed at run-time, but the execution order of instructions in each thread is maintained. Consequently, the data and control consistencies can be preserved. With this thread switching, we can process the vector instruction in a the new thread, if it has no data dependency with the vector instructions in the previous thread. Therefore, this codec can execute multiple vector operations from the di erent threads simultaneously using the multiple issue vector instructions mentioned above. This thread switching reduces the time of the vector processors being idle with data or resource con icts, compared to the conventional coding without adopting multithread. For example, Fig. 3 shows the order of vector instruction issue and that of execution in the multi-thread video coding, which is composed of one encoding and one decoding process. In this architecture, the fast thread switch is performed by dividing the data memory of the SPU into several logical pages and by decreasing the data movements for a context switch. This memory is used for storing the parameters of each thread, such as the quantization parameter or macro-block number etc., and for a temporary workspace. A logical page is the address space to be accessed by the SPU at once. Each logical space has the required parameters or data of its own thread, and the SPU can switch the thread fast by changing the content of logical page-pointer register.

C. Instruction Set Design

The SPU instruction set consists of scalar instructions and vector instructions. In the scalar instructions, there are ALU instructions such as add, shift, and bit operations and control instructions such as jump, jump and link. They are sequentially issued and executed. The vector instructions are classi ed to vector command instructions, and vector synchronization instructions. The vector command instructions send commands to the vector processors and vector synchronization instructions make the SPU wait for the completion of previous vector instruction pointed by these instructions. If a vector instruction try to use the same vector processor or shared memory, which are occupied by the previously issued vector instruction, the

Figure 4: The scheme of vector scoreboarding. SPU stalls the instruction and stop fetching and issuing a next instruction until this dependency is cleared. We adopted the scoreboarding of vector operations to resolve the resource con ict. This scheme allows instructions to be executed out of order if there are available resources and no data dependencies. If a vector instruction is issued, its vector processor ID and shared memory ID are recorded in the scoreboard. Before the next vector instruction is issued, the IDs of this instruction are compared at the scoreboard in parallel. If any resource con ict is detected, the SPU stalls this instruction decoding. Otherwise, the SPU continues execution after recording each ID to scoreboard in the same way. Fig. 4 shows the processing of score-boarding.

III SYSTEM ARCHITECTURE DESIGN A. Overall Structure

We designed a video codec for the videophone or teleconference on the PSTN and ISDN together with the proposed architecture. In this application, the frame size, rate, or bit stream syntax are not xed because the channel bandwidth and coding standard are variable. For example, if it is connected to ISDN, it can transmit the CIF image at a frame rate of 30Hz based on H.261. And if it is connected to PSTN, which network has lower bandwidth than ISDN, it can only transmit the QCIF image at a frame rate of 15Hz base on H.26P. This diculty can be resolved by using a variable coding scheme selected in the frame header. So, the video codec satisfying this requirement must be programmable processor. This architecture consists of three shared bu ers and several vector processors. The vector processors are two DCT modules, two VLC modules(VLE and VLD), one memory management module(CoPU), one motion estimator(ME), a PPU(pre-post processing) module, which is the part of handling the camera and display interface, video format conversion and ltering. All vector operations uses the common shared memories except motion estimation, whose execution cycles are ten times longer than others in the normal

Figure 5: Data ow diagram for inter-frame encoding with H.261.

Figure 6: Block diagram of overall system. mode. The ow of video coding is based on the macro pipelining that the motion estimation is performed one macro block earlier than other operations, which process the macro block of which motion vector is obtained previously. For example, the data ow composed of three shared memories, is shown in the case of inter-frame encoding based on H.261 in Fig. 5, which shows the operations in each shared memory with time. In this system, the result of quantization is written simultaneously into the two shared memories, and the next operations such as VLC and inverse quantization can be executed concurrently. The next block data from the frame bu er can be loaded immediately to the freed space in the shared bu er to hide the latency of memory operation. So, the idle time of each processor can be decreased and the hardware complexity of the processor can be reduced with sharing memory and the control units. Fig. 6 is the block diagram of the overall system.

Figure 7: The con guration of IBN.

B. Bus Structure and Memory Hierarchy

The IBN consists of SPU bus, VPU bus, CoPU bus, Bit bus and SRAM bus. It plays an important role in e ective distribution of job and data between vector processors and memories. Fig. 7 show the con guration of the IBN. The external SRAM is used as both the SPU instruction cache and the data cache of the search area for motion estimation, and accessed by time sharing each other. Three rows of macro blocks in an image frame are stored in the SRAM for re-use of data and reduction of memory access of the DRAM in motion estimation. The DRAM is used for storing the image frames. The bit width of the DRAM is 32bits, and the basic con guration size is 8Mbits. This can be extended to 16Mbits which size can accommodate not only a single thread frame but also several thread frames. In addition to the storage of frames, the DRAM can be used as the lter workspace by the PPU and the bit bu er of the VLC and external modem. With using the DRAM, the external modem and host can interface the input and output bitstream burst. The CoPU performs all memory operation such as block-based load and store except the decision of the base address and addressing mode, which are calculated by the SPU before the initialization of the CoPU operations. Our system can be multiprocessor system with adding two or more codecs without additional external hardware. Fig. 8 represents the dual processor system connected each other with the same DRAM bus. One operates in the master mode, which takes responsibility for memory operations, and the other do in slave mode.

C. Designs of Each Sub-Blocks

The SPU is a 16bit modi ed RISC controller that controls the data ow of the overall system and vector processors. The SPU is composed of 16

Figure 8: Dual processor con guration. GPRs(General Purpose Registers), 32 SPRs(Special Purpose Registers), dual port 128words data memory and the ALU. From the external SRAM, the SPU fetches instructions, and executes 5-stage pipeline, which consists of fetch, pre-decoding, decoding, execution and write-back. Pre-decoding stage checks the control and data dependencies of vector operations and resource con icts of vector processors, and issues a vector instruction that has no hazards. The VPU is a vector processor that processes the transform coding such as DCT and quantization(include IQ also) in hybrid coding. Two VPUs are designed, which are composed of VPUI processing DCT/Q/IQ/IDCT and VPUII processing IQ/IDCT. DCT/Q/IQ have 7 stage pipelining and IDCT 9 stage pipelining. The VLC processes entropy coding, which includes the variable length coder, variable length decoder, and the stream server which multiplexes or demultiplexes the coded data. If an over ow appears in the internal bit bu er, the stream server creates an interrupt to the SPU automatically. Then, the SPU enters into the state of interrupt handling, and it move the data of bit bu er into the DRAM temporarily and save the situation. The motion estimator searches the motion vector of 16x16 reference frame in 31x31 search area on 2-steps. In 2-step motion estimation, rst it nds the motion vector MV0 using the exhaustive block matching method in the 31x31 search area. And it nds the motion vector MV1 on half-pel basis in the 3x3 search area around MV0 found in the rst step. We designed the motion estimator to be adjusted for the target execution cycle time by varying search windows and search methods.

IV. PERFORMANCE ANALYSIS AND IMPLEMENTATION

Some vector processing examples of video coding with the proposed video codec are in Table 2. An ITU recommended H.261 coding function for a 64 kb/s video codec can be achieved. This codec can process simultaneously

Type of operations Exection cycles 8x8 2D-DCT/IDCT 540 8x8 2D-Q/IQ 75 8x8 2D-half-pel motion estimation 115 8x8 2D-image load/save 65 8x8 2D-variable length encoding* 135 8x8 2D-variable length decoding* 85 full half-pel motion estimation(-16 15.5) 17600 fast integer-pel motion estimation(-8 7) 2400 scalar operation(add, bits, sll, jmp...) 1 thread context switch 15 Table 2: Execution cycles of operations. (* for typical image) Operation conditions Encoding Decoding ( No. codec, search mode,video format, frame/sec frame/sec type of thread schdeduling) single, full half-pel, QCIF, simultaneous 19 36 single, interlace half-pel, QCIF, simultaneous 29 42 single, interlace half-pel, QCIF, timesharing 33 33 dual, interlace half-pel, CIF, none 28 Table 3: Simulation results of execution time base on H.26P video coding. (assumed that video codec operates at 24MHz,bandwidth is 500kbps, and the range of ME is 31x31.) video encoding and decoding for QCIF video sequences at a frame rate of 30Hz, which was veri ed by simulations. More detail simulation results are in Table 3. The proposed video codec had designed with the standard-cell design approach, and it is in the process of fabrication. The summary of our video codec features is in Table 4.

V. CONCLUSION

The architecture of a new programmable video codec is presented for the videophone or teleconference on ISDN and PSTN. By using multiple-issue vector instructions, specialized vector processors and a RISC processor, both

exibility and performance required for complex video coding algorithms is realized. We showed that the concept of thread context and vector scoreboarding is especially suitable for the multithread video coding. With changing the number of applied vector processors and shared memories, or multiprocessing, the proposed architecture is scalable. Therefore, this architecture can be used for both low-end systems and high-end systems.

Pin counts 240(140 signals, 17 tests, 37 Vdds, 46 GNDs) Equivalent gates* 162228 gates Operating Frequency** 25MHz Technology 0.8um CMOS standard-cell Package PowerQuad2 Internal memory(dual) 4banks(=5kbits) Internal memory(single) 5banks(=11kbits) Required external memory DRAM(8Mbits)+SRAM(>16Kbits) Table 4: Hardware summary. (* except internal single-port RAMs, ** estimated)

ACKNOWLEDGMENTS

The authors thank Korea Telecom for its nantial support and help on the design of this IC.

References

[1] CCITT Study Group XV, \Recommendation H.261, Video Codec for Audiovisual Services at p x 64 kbit/s," Report R37 Geneva, July 1990S. [2] ITU-T Draft Recommendation H.26P, Video coding for narrow telecommunication channels at < 64 kbit/s. [3] H. Fujiwara, et al., \An All-ASIC Implementation of a Low Bit- Rate Video Codec," IEEE Trans. on Circuit and Systems for Video Technology, no. 2, pp. 123{134, Feb. 1992. [4] H. Yamauchi, et al. \Architecture and Implementation of Highly Parallel Single-Chip Video DSP," IEEE Trans. on Circuit and Systems for Video Technology , no. 2, pp. 207{220, Feb. 1992 [5] K. Balmer, et al., \A Single Chip Multimdedia Video Processor," in IEEE Custom Integrated Circuit Conf., 1994. pp. 6.1.1{6.1.4. [6] K. Aono, et al, \A Video Digital Signal Processor with a Vector-Pipeline Architecture," IEEE J. of Solid-State Circuits, vol. 27, pp. 1886{1894, Dec. 1992. [7] B. W. Lee, et al., \Data Flow Processor for Multi-standard Video Codec," in IEEE Custom Integrated Circuit Conf., 1994. pp. 6.4.1{6.4.4.

Suggest Documents