A Memory-based Parallel Processor for Vector Quantization
K. Kobayashi, M. Kinoshita, M. Takeuchi, H. Onodera, K. Tamaru
Department of Electronics and Communication, Graduate School of Engineering, Kyoto University, Kyoto, Japan

ABSTRACT We propose a memory-based parallel processor for vector quantization (FMPP-VQ). It accelerates the nearest neighbor search of vector quantization. All distances between an input vector and the reference vectors in a codebook are computed simultaneously in all PEs, and the minimum of all distances is found in parallel. The nearest vector is obtained in O(k), where k stands for the dimension of the vectors. An LSI including four PEs has been implemented; it operates at a 25 MHz clock rate.

1 Introduction Vector quantization[1] is one of the promising candidates for low bit rate image compression. It requires very little hardware for decompression. For compression, however, a large amount of computation is required. The most time-consuming part of compression is the ''nearest neighbor search'': finding the vector nearest to an input vector among a large number of reference vectors. A set of reference vectors is called a codebook. We have developed a memory-based parallel processor called a functional memory type parallel processor for vector quantization: FMPP-VQ. It has as many PEs as reference vectors, all connected by a shared bus, and the nearest vector is searched exhaustively in parallel. Each PE has conventional memories to store a reference vector and a logic unit to compute the distance between the input vector and the reference vector. The nearest vector is then obtained using a CAM-based parallel search. These procedures are done in O(k), where k stands for the dimension of the vectors; the number of reference vectors does not affect the computation time. During the nearest neighbor search, only input vectors are supplied over the shared data bus, and distance computation is done locally in each PE. Thus, this memory-based SIMD architecture computes efficiently through a shared bus. Reference vectors can be easily updated, since they are stored in conventional memories. All PEs can be laid out in a memory-like regular-array structure, which minimizes the hardware cost, so a large number of PEs can be integrated on a single LSI. For example, a 0.7 µm CMOS process allows 256 PEs on a 1 cm² die. For low bit rate transmission, compression hardware consisting of a simple processor, a video memory, and an FMPP-VQ with 64 PEs can compress an 8-bit videophone image to 0.52 bpp. We have implemented an FMPP-VQ LSI including four PEs in a 0.7 µm CMOS process. The LSI operates at a 25 MHz clock frequency.

2 Functional Memory Type Parallel Processor for Vector Quantization In the nearest neighbor search, for every input vector $\vec{x}$ we must find the nearest vector $\vec{q}$ among the reference vectors $\vec{y}$ in a codebook $Y$ according to (1).

$\vec{q} = \arg\min_{\vec{y} \in Y} d(\vec{x}, \vec{y})$   (1)

The definition of $d(\vec{x}, \vec{y})$ is one of the important factors in hardware cost. The FMPP-VQ uses the absolute error (AE), $d(\vec{x}, \vec{y}) = \sum_i |x_i - y_i|$, to minimize the hardware cost. The AE gives sufficient quality for image processing, since input vectors from images spread out sparsely in Euclidean space. Figure 3 shows a block diagram of the FMPP-VQ. A processing element (PE) of the FMPP-VQ is called a ''block.'' All blocks are laid out in a two-dimensional regular array and connected through a global data bus, like a conventional memory. Figure 4 shows a simplified schematic diagram of a block. A block consists of memories and a small logic unit (LU). The codebook words $w_i$ are conventional six-transistor SRAM cells that store all elements $y_0, y_1, \ldots, y_{k-1}$ of a reference vector $\vec{y}$. For vector quantization of images, the dimension k is usually 16; thus the FMPP-VQ has 16 codebook words. The LU computes $d(\vec{x}, \vec{y})$, and the result word R stores it. The operand word B stores an operand for addition, subtraction, and comparison. There is no conventional adder in the LU; instead, the operand word, the carry chain, and other logic gates work together to perform addition or subtraction according to (2).

$A + B = P \oplus C, \qquad C_i = G_i + P_i \cdot C_{i-1}, \qquad P = A \oplus B, \qquad G = A \cdot B$   (2)
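A bit-level software sketch of (2) may help: it assumes the usual propagate/generate semantics, with each sum bit formed from P and the incoming carry, and subtraction done by adding the 2's complement. Function names are illustrative, not from the paper.

```python
def add_pg(a, b, width=12, carry_in=0):
    """Add two unsigned integers with the P/G/carry recurrence of (2),
    modulo 2**width; returns (sum, carry_out)."""
    s, c = 0, carry_in
    for i in range(width):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        p = ai ^ bi              # propagate: P = A xor B
        g = ai & bi              # generate:  G = A and B
        s |= (p ^ c) << i        # sum bit from P and the incoming carry
        c = g | (p & c)          # C_i = G_i + P_i . C_{i-1}
    return s, c

def sub_pg(a, e, width=12):
    """Subtraction as in the LU: add the 2's complement of E
    (bitwise invert E and set the carry-in to 1)."""
    mask = (1 << width) - 1
    return add_pg(a, (~e) & mask, width, carry_in=1)
```

In the actual LU the carry chain evaluates all carries combinationally, so the whole sum is obtained in a single step rather than in a sequential loop.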

The operand word is designed on the basis of a conventional SRAM-based CAM[2]. The logical operations (P, G) required for addition are executed by four pass transistors in the operand word. The carry chain produces the carries from P and G, and the XOR gate at the bottom of the carry chain finally produces the sum, which is stored in the temporary word T. Note that these operations, including carry generation, are done in bit parallel, so the sum is obtained in a single step. A codebook word is 8 bits wide, while the logic unit is 12 bits wide, since absolute distances may reach 12-bit values. Two flags, an overflow flag and a search flag, are located to the left of the codebook words. The overflow flag v stores the overflow from addition or subtraction; the search flag s stores a search result. Control logic at the left of the codebook words inhibits operations according to the state of these flags. The output signals of all search flags are wired-OR connected to the signal ER, which becomes low when no search flag is true; this accelerates the minimum search as described below. The MRR resolves the location of the block whose search flag is true, and the address encoder generates the address of that block. The possible operations of the LU are listed as follows.

- Addition and subtraction: T = B ± E, v = overflow
- Transfer: wi → B; T → B or R; R → B
- Negative operation: T = ¬B
- Comparison: if B = E, then s = 1

The value E is an external value. Subtraction is done by adding the 2's complement of E. The negative operation uses the XOR functionality of the operand word. The distance between a reference vector ~y and an input vector ~x is computed as follows. Each element xi of ~x is given one by one to every block through the global data bus. Every time an element xi arrives at a block, |xi − yi| is computed according to the procedure shown in Fig. 1. To obtain the absolute value, the blocks where yi − xi < 0 perform the negative operation on yi − xi, and the LSB carry is set to 1 on the following addition, completing the 2's complement. The overflow flag makes this data-dependent step possible within a uniform SIMD control sequence. These computations are done simultaneously in all blocks. After all elements of ~x have been given, a CAM-based parallel search procedure finds the minimum of Σi |xi − yi|, as shown in Fig. 2. The result of each search is rapidly confirmed by the state of ER. The nearest neighbor search requires 165 clocks in total: the distance measure takes 112 and the minimum value search takes 53. Note that the clock count does not depend on the number of blocks (the codebook size).
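The per-element procedure can be imitated in software as below. This is a sketch of one reading of the flag-controlled steps (compute yi − xi, negate in the blocks whose overflow flag is set, accumulate into R), not a cycle-accurate model of the hardware; names are illustrative.

```python
WIDTH = 12                            # the LU is 12 bits wide
MASK = (1 << WIDTH) - 1

def accumulate_abs_diff(r, yi, xi):
    """One per-element step: add |xi - yi| into the 12-bit result register r."""
    t = (yi - xi) & MASK              # subtraction via 2's complement arithmetic
    v = 1 if yi - xi < 0 else 0       # overflow flag: the difference was negative
    if v:                             # only flagged blocks run the negative operation
        t = (~t + 1) & MASK           # invert, then add 1 via the LSB carry
    return (r + t) & MASK

# Feeding all 16 elements x_0..x_15 through this step leaves the AE distance in R.
```

Because the conditional negation is gated by the per-block overflow flag, every block executes the same instruction stream, which is what keeps the control purely SIMD.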


3 Experimental Results and Performance Evaluation We have implemented an LSI including four PEs and some test circuitry in a 0.7 µm CMOS process. The four PEs and the IO circuitry occupy 2.43 mm². They are shown in the chip microphotograph

Figure 1: How to compute |xi − yi| (the per-element sequence of transfer, subtraction, flag-controlled negation, and accumulation in the LU)

    min = X            # every bit is masked
    for i = m-1 downto 0:
        min[i] = 0
        search(min)
        if ER = 1 then min[i] = 1

Figure 2: Minimum value search
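In software, the masked-search loop of Fig. 2 amounts to the classic bit-serial CAM minimum search. The sketch below assumes that search() keeps only the blocks whose stored distance matches the unmasked prefix; the candidate list plays the role of the blocks whose search flag is still true.

```python
def cam_minimum(distances, width=12):
    """Find min(distances) bit-serially from the MSB down, as in Fig. 2."""
    candidates = list(distances)          # blocks with a true search flag
    minimum = 0
    for i in reversed(range(width)):      # MSB first
        zeros = [d for d in candidates if not (d >> i) & 1]
        if zeros:                         # some block matched with bit i = 0
            candidates = zeros            # the bit of the minimum stays 0
        else:                             # no match reported on ER
            minimum |= 1 << i             # the bit of the minimum must be 1
    return minimum
```

Each loop iteration costs a constant number of clocks regardless of how many blocks participate, which is why the 53-cycle minimum search is independent of the codebook size.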

in Fig. 5. The area of one block (PE) is 706 µm × 389 µm; thus 256 blocks correspond to 70 mm². Table 1 shows several specifications of the LSI. Primitive control signals increase the number of IO pins and the chip area, but they enhance the testability of the FMPP-VQ; future FMPP-VQ LSIs will have fewer IO pins. All functions for computing the nearest vector work properly at a 25 MHz clock frequency. Fig. 6 shows a measured result of subtraction, where one division corresponds to two cycles. The first four cycles write four values (0x37, 0x50, 0x0a, 0x0f) to word w0 of the four blocks. The next three cycles transfer data from w0 to the operand word B and subtract 0x15 from B in every block. The last eight cycles read all results of the subtraction ((0x37, 0x50, 0x0a, 0x0f) − 0x15 = (0x22, 0x3b, 0xff5, 0xffa)). Computer simulations show that compression hardware including a simple processor, a video memory, and an FMPP-VQ with 64 PEs can compress an 8-bit 176×144 standard monochrome videophone image to 0.52 bpp (bits/pixel), a compression ratio of 0.065. The codebook is updated at every frame with the LBG algorithm[3]; the FMPP-VQ can easily update reference vectors because they are stored in SRAMs. The SNR of the reconstructed images is 36.5 dB, while a fixed codebook yields only 28.3 dB. As for performance, an FMPP-VQ working at 25 MHz performs 152,000 nearest neighbor searches per second.
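The quoted search rate follows directly from the cycle counts; a quick arithmetic check:

```python
CLOCK_HZ = 25_000_000          # 25 MHz clock
CYCLES_PER_SEARCH = 112 + 53   # distance measure + minimum search = 165 clocks

searches_per_second = CLOCK_HZ // CYCLES_PER_SEARCH
print(searches_per_second)     # 151515, i.e. roughly the 152,000 searches/s quoted
```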

4 Conclusion The FMPP-VQ accelerates the nearest neighbor search of vector quantization, achieving both high performance and small silicon area. The nearest vector is found in O(k), where k stands for the dimension of the vectors; the time does not depend on the number of reference vectors. Each PE consists of conventional SRAMs and pass-transistor logic that computes the distances between vectors, and a CAM-based parallel search procedure rapidly finds the minimum distance. All PEs are connected by a shared bus and can be integrated into a two-dimensional array like a conventional memory, so a large number of PEs can be integrated on a single LSI. An LSI including four PEs has been implemented; it operates at a 25 MHz clock frequency.

References
[1] A. Gersho and V. Cuperman. Vector quantization. IEEE Commun. Mag., 21(9):15--21, 1983.
[2] T. Ogura, S. Yamada, and T. Nikaido. A 4Kb associative memory LSI. IEEE J. Solid-State Circuits, SC-20(6):1277--1282, 1985.
[3] Y. Linde, A. Buzo, and R. M. Gray. An algorithm for vector quantizer design. IEEE Trans. Commun., COM-28(1):84--95, 1980.

Figure 3: Block Diagram of the FMPP-VQ

Figure 4: Schematic diagram of a block

Figure 6: Measured results from subtraction (0x37−0x15=0x22, 0x50−0x15=0x3B, 0x0a−0x15=0xff5, 0x0f−0x15=0xffa)

Figure 5: Chip microphotograph of the four block FMPP-VQ.

Table 1: LSI Specifications

    Process             0.7 µm CMOS
    Die size            26.3 mm²
    Area of Fig. 5      2.43 mm²
    # IOs               116
    Frequency           25 MHz
    Power dissipation   19.5 mW @ 20 MHz
