

Optimizing Vector-Quantization Processor Architecture for Intelligent Query-Search Applications

Huaiyu Xu, Yoshio Mita and Tadashi Shibata¹

Department of Electronic Engineering, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8656, Japan

¹Department of Frontier Informatics, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8656, Japan

(Received )

The architecture of a very-large-scale-integration (VLSI) vector-quantization processor (VQP) has been optimized to develop a general-purpose intelligent query-search agent. The agent performs a similarity-based search in a large-volume database. Although similarity-based search processing is computationally very expensive, latency-free searches have become possible owing to the highly parallel maximum-likelihood search architecture of the VQP chip. Three architectures of the VQP chip have been studied and their performances compared. To give reasonable search results under different policies, the concept of a penalty function has been introduced into the VQP. An E-commerce real-estate agency system has been developed using the VQP chip implemented in a field-programmable gate array (FPGA), and the effectiveness of such an agency system has been demonstrated.

KEYWORDS: vector quantization, VLSI processor, the Internet, query search, similarity-based search, E-commerce, real-estate agency


1. Introduction

World-Wide-Web (WWW)-based information searches play an important role in our daily life.1) However, what is most frustrating is that a search query often results in a “No matches found” response, although the database may contain very useful information on the requested subject. This is because most search engines employ a word-matching technique. It is therefore essential to develop an intelligent query-search engine that is capable of finding the required information more flexibly, taking the individual’s preferences and specific interests into account.

A similarity-based search is a good candidate for this purpose.2, 3) However, to find the best results that match the query, the similarities between the query and all template vectors must be calculated. In addition, to take the individual’s preferences and specific interests into account, the weight of each element must also be included in the calculation. This is indeed a time-consuming computation. Algorithms that take soft similarity values into account have been developed to reduce the retrieval time. One example is the use of k-d trees, which reduces the retrieval time significantly.2) Another example is the case retrieval net, which has also been used successfully in some applications.4) However, neither of them is efficient if the database is updated often;2, 5) frequent updating of databases is the key feature of on-line applications. One example is the real-estate agency application, which should allow owners to add or remove information to or from the server freely. The authors claim that the introduction of a similarity-computing integrated-circuit engine (vector-quantization processor: VQP) can solve this problem. Namely, a very fast similarity-based search can be achieved even for databases in which updates occur very frequently.
The purpose of this study is to develop an optimized hardware organization for intelligent query-search engines employing the vector-quantization processor (VQP) architecture.6) Owing to the parallel processing on the chip, a search more than 10⁴ times faster than sequential processing on a general-purpose processor has been enabled for 40,000 items. Typical architectures have been implemented either in a VLSI chip or in a field-programmable gate array (FPGA) for performance demonstration.


The concept of the variable penalty function has been introduced to implement real intelligent query-search systems. An interactive search engine for an E-commerce real-estate agency system has been developed and demonstrated with an FPGA-based implementation.

2. VQP Architectures

2.1 Requirements of intelligent query search and specifications for VQP

To apply the VQP to intelligent query-search applications, several specific requirements are considered, as summarized in Table I. In an agent to which a search is submitted, the VQP stores the information as template vectors, calculates the similarity between an input vector (query) and the template vectors, and returns the address of the maximal-likelihood vector (Top 1). The second, third, ... and Mth most similar vectors are also retrieved when a Top 2, 3, ..., M search is requested. To be applicable to a variety of intelligent query-search engines, in which the dimension of the vector is unpredictable, the VQP should allow the vector dimension to be altered. Owing to the nature of similarity, if the Top M most similar results are given to a customer, the customer can negotiate with the search agent. To weigh each element according to personal preference in an Internet search, weights should also be available in the VQP. Moreover, the VQP must hold a huge database and carry out a fast search.

2.2 Optimizing the architecture of VQP

The architecture of the VQP has been optimized to meet the specifications in Table I. We first designed a very flexible VLSI VQP. It allows the use of an arbitrary number of elements in a vector and various template-grouping structures. Despite such versatile features, the chip can handle only 64 vectors (128 dimensions) using external memories for templates (Fig. 1). Since this is by no means practical for the present applications, a more area-efficient organization that allows a larger number of templates to be integrated on a chip has been studied.
Three different architectures, named Type 1-1-1, Type N-1-N and Type N-1-1, are proposed in Figs. 2(a), 2(b), and 2(c), respectively. The area, the number of storable vectors, and the calculation time are discussed to find the best architecture.

Type 1-1-1, shown in Fig. 2(a), represents an organization similar to that in ref. 6, where a template memory (TM, 128 dimensions), a distance module (DM), and a winner-take-all (WTA) module are provided for each template vector. Here, the DM calculates the dissimilarity between the input vector and each template vector (usually, the Manhattan distance is employed, which is defined as the sum of the absolute values of the differences in each element). The WTA module is utilized to locate the minimum-distance vector, i.e., the winner. To weigh each element according to personal preference in an Internet search, a 15-bit (7-bit × 8-bit) multiplier is added to the DM. A penalty function module, which will be introduced in the next section, is also included in the DM to calculate various kinds of dissimilarities in addition to the Manhattan distance. Such functional enhancement has substantially increased the area of a DM compared to that in ref. 6. To obtain the Top M most similar results, a control module is also added to the WTA module, thus also increasing the area of one WTA module. The area ratios of the TM, DM and WTA modules were estimated based on the gate counts obtained from FPGA implementation and the layout information obtained in the work in ref. 6. The result is shown in the figure.

To increase the number of template vectors on one chip, Type N-1-N, shown in Fig. 2(b), was examined first in this work. A single DM is provided for N TMs and N WTAs in Type N-1-N. One DM calculates all distances between the input vector and N template vectors. As a result, the distance calculation time is N times longer than that in Type 1-1-1. On the other hand, the area occupied by DMs is reduced as shown in the figure, where the area was estimated for N = 100, thus leaving more space for TMs. The address of the Top M maximal-likelihood vector is returned in the same way as in Type 1-1-1.

Type N-1-1, shown in Fig. 2(c), was also studied in the present work. One DM and one WTA are provided for N TMs in Type N-1-1.
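For reference, the quantities handled by the DM and the WTA module as described above can be written compactly. The symbols are notation introduced here for illustration and do not appear in the original figures: q_i is the i-th query element, t_{j,i} the i-th element of template j, w_i the preference weight of element i, n the vector dimension, D_j the dissimilarity of template j, and j* the address of the winner. The formula covers only the weighted Manhattan case, not the penalty functions of Section 2.3.

```latex
D_j = \sum_{i=1}^{n} w_i \,\lvert q_i - t_{j,i} \rvert , \qquad n \le 128,
\qquad\qquad
j^{*} = \arg\min_{j} D_j .
```

With all w_i = 1 this reduces to the plain Manhattan distance; a Top M search repeats the minimization M times, excluding the winners already found.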
In Type N-1-1, a buffer memory storing distance values and a comparator are added for a sequential winner search, resulting in a slight area penalty. The distance is calculated in the same way as in Type N-1-N. In Type N-1-1, the WTA processing time is several times greater than that in Type 1-1-1 and Type N-1-N, depending on the value of N. The areas of the WTA module and the DM are reduced significantly, as shown in the figure.

Five architectures are considered. Type N-1-N configurations with N = 10 and N = 100 are represented as Type 10-1-10 and Type 100-1-100, respectively. Type N-1-1 configurations are represented similarly in the figure. The number of storable template vectors, which is one of the most important features, is estimated assuming a die size of 1 cm × 1 cm and 0.18 µm design rules. The area data obtained in designing the chip in ref. 6 were simply scaled down from 0.6 µm to 0.18 µm. The performance comparison among the various architectures is summarized in Table II, where the vector dimension was assumed to be 128. The number of clock cycles is that required to return the top 100 most similar results. The chip is configured such that the dimension of a vector can be arbitrarily specified up to a maximum of 128. If the maximum vector dimension is limited, the number of storable vectors increases. Over 20,000 vectors are storable in one chip having the architecture Type 100-1-1 with 16-dimensional vectors.

The figure of merit of an architecture is defined as the number of template vectors divided by the number of clock cycles. The larger the figure, the larger the number of template vectors and the faster the search. The various VQP configurations are compared in terms of the figure of merit in Fig. 3. From the results in Fig. 3 and Table II, the architecture Type 10-1-1 is shown not only to contain 2519 template vectors of 128 dimensions on one chip, but also to perform a very fast search. A Type 10-1-1 VQP running at 30 MHz needs only about 9.18 × 10⁻⁵ s to give the top 100 results out of 2519 vectors. If 20 VQP chips are mounted on a board and operated in fully parallel processing, a top-100 search service can be provided to all simultaneous visitors to the site within one second. In addition, it should be noted that regardless of how often the template vectors are updated, the computation time is not affected.
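The figures quoted above for Type 10-1-1 can be cross-checked with a few lines of arithmetic. The sketch below is not taken from Table II: the cycle count is derived here from the quoted search time and clock frequency, and the throughput figure assumes back-to-back searches with no I/O overhead. All variable names are illustrative.

```python
# Values quoted in the text for a Type 10-1-1 VQP.
CLOCK_HZ = 30e6          # clock frequency
SEARCH_TIME_S = 9.18e-5  # time to return the top 100 results
TEMPLATES = 2519         # 128-dimensional template vectors on one chip

# Clock cycles implied by the quoted time (derived here, not read from Table II).
cycles = round(SEARCH_TIME_S * CLOCK_HZ)

# Figure of merit as defined in the text: template vectors per clock cycle.
figure_of_merit = TEMPLATES / cycles

# Assumption: searches run back to back on one chip with no I/O overhead.
searches_per_second = 1.0 / SEARCH_TIME_S

print(cycles, round(figure_of_merit, 3), round(searches_per_second))
```

Under these assumptions a top-100 search takes a few thousand clock cycles, so the figure of merit for Type 10-1-1 is close to one template vector per cycle.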
2.3 Implementing penalty function in the DM

For real intelligent query-search applications, it is not sufficient to calculate the dissimilarity between the query and template vectors using only the Manhattan distance. Versatile dissimilarities need to be introduced. The concept is explained in Fig. 4, taking the real-estate agency application as an illustrative example. Here, Q_PR, Q_SQ and Q_DT are the elements of a query vector specified by a customer, representing the price of the house, the size of the house, and the distance to a station from the house, respectively. T_PR, T_SQ and T_DT represent the corresponding elements in template vectors. The example shows that the Manhattan distance gives a reasonable penalty (dissimilarity measure) only for the last case, “distance to station”. This means that the maximum-likelihood result might not always be the answer. One example in Fig. 4 is the price of the house, where the policy is that “lower is better”. Another example is the size of the house, where the policy is that “larger is better”. Therefore, various forms of penalty functions are required for real intelligent query-search applications.

In the first approach, the penalty function was implemented as shown in Fig. 5. The policy depended on whether the element in the query vector was smaller or larger than the element in the template vector. Therefore, a set of two slope values, called Sk and Lk in Fig. 5, is utilized to represent the non-Manhattan-distance penalty functions. As an example, a penalty function for the “price” is implemented using Sk_PR and Lk_PR. If Q_PR is larger than the template value T_PR, the subtraction result is multiplied by Lk_PR. If Q_PR is smaller than T_PR, the subtraction result is multiplied by Sk_PR. The dissimilarity between the query and template vectors is calculated by accumulating the penalties of all elements. This CISC-like approach increased the area of the DM by more than 30%. Therefore, the authors studied another, RISC-like approach to obtain the same function with a smaller area increase.

In the optimized architecture shown in Fig. 6, the increase in the DM area is reduced by a factor of three compared to the first implementation. In this architecture, an identical element is stored at two locations in each template vector. At each location, a different penalty function is applied, and both results are accumulated to obtain the dissimilarity between the query and template vectors.
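The two formulations can be compared in a short software sketch. This is a minimal model under stated assumptions, not the circuit of Figs. 5 and 6: the function names are hypothetical, the slopes Sk and Lk are treated as signed numbers, and the one-sided behaviour of the two stored copies follows the description given in the text for the “price” element.

```python
def two_slope_penalty(q, t, sk, lk):
    """First approach: one element, slope lk when q > t and slope sk when q < t."""
    diff = q - t
    if diff > 0:
        return diff * lk
    if diff < 0:
        return diff * sk
    return 0


def split_penalty(q, t, sk, lk):
    """Optimized approach: the element is stored twice and each copy is one-sided."""
    first = (q - t) * lk if q > t else 0   # non-zero only when the query is larger
    second = (q - t) * sk if q < t else 0  # non-zero only when the query is smaller
    return first + second                  # both results are accumulated


def dissimilarity(query, template, slopes):
    """Accumulate the element penalties; slopes is a list of (sk, lk) pairs."""
    return sum(two_slope_penalty(q, t, sk, lk)
               for q, t, (sk, lk) in zip(query, template, slopes))


# Hypothetical slopes: price is "lower is better", distance is Manhattan-like.
price_slopes = (-2, 0)     # a template dearer than the query is penalized
distance_slopes = (-1, 1)  # sk = -1, lk = +1 gives |q - t|
print(dissimilarity([300, 10], [350, 7], [price_slopes, distance_slopes]))
```

With sk = -1 and lk = +1 the penalty reduces to the absolute difference, so the Manhattan distance is a special case of this two-slope form, and the two functions return the same value for every input.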
In the figure, identical elements, which represent the “price”, are stored as the 1st and 2nd elements. The 1st element carries the penalty when the given query element is larger than the template element. Namely, if the given query value for “price” (Q_PR) is larger than the value in the template (T_PR), the 1st element returns the non-zero result (Q_PR − T_PR) × Lk_PR, whereas the 2nd element returns zero. If the given query element is smaller than the template element, the 2nd element returns the non-zero negative result (Q_PR − T_PR) × Sk_PR, while the 1st element returns zero. By summing the calculation results for all elements, the dissimilarity between the query and template vectors is calculated with various penalty functions implemented. This approach not only reduced the number of registers but also reduced the number of necessary operations, thus reducing the area increase by a factor of three.

3. Circuit Implementation of VQP

In this section, a block diagram of the DM and the processing scheme of the Top M search are illustrated. To show the feasibility of single-chip implementation, the layout of the best architecture, Type N-1-1 with N = 10, was designed using 0.6 µm triple-metal CMOS technology.

A block diagram of the DM is shown in Fig. 7. In the first step, the element of the query vector (Q_i) is subtracted from the element of the template vector (T_i). After the subtraction, the carry bit is examined to see whether the result is positive or negative. If the result is positive, it is multiplied by the relevant element of the weight vector (W_i) to give the result D_i. If it is negative, the absolute value of the result is calculated before multiplication. The absolute-value function is implemented using exclusive-OR gates. The weight is prepared externally by multiplying the preference value coming from the user by the penalty slope value (represented as Sk_PR, Lk_PR, ... in the preceding section) from the agency. The instruction code (I_i) and the carry flag together represent a policy. As an example from the truth table in the figure, the condition I_i = 2, Carry = 1, and Dissimilarity = −D_i means that when the element of the query vector is smaller than the element of the template vector, the dissimilarity between these elements is −D_i. The penalty function is implemented depending on the instruction code (I_i) and the carry flag, as shown in the truth table. In the best architecture layout, similarities between the query and 10 template vectors are calculated by a 22-bit accumulator.

The processing scheme of the Top M search is illustrated in Fig. 8. It is composed of 100-vector matching circuits.
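As a software analogue of the scheme in Fig. 8, whose operation is described in the text that follows, a bit-serial minimum search with winner inhibition can be sketched as below. This is a functional model only, not a gate-level description: the names `bit_serial_min`, `top_m` and `width` are hypothetical, and resolving ties in favour of the lowest address is an assumption, since the text does not state how ties are handled.

```python
def bit_serial_min(values, width, inhibited=()):
    """Return the addresses still flagged as winners after an MSB-first scan."""
    # Flag is True for entries still competing; inhibited entries start as losers.
    flags = [i not in inhibited for i in range(len(values))]
    for bit in range(width - 1, -1, -1):
        # A loser presents a forced 1, so it cannot influence the wide AND.
        seen = [((v >> bit) & 1) if f else 1 for v, f in zip(values, flags)]
        if not all(seen):
            # Some competing entry shows a 0 at this bit: entries showing 1 drop out.
            flags = [f and s == 0 for f, s in zip(flags, seen)]
    return [i for i, f in enumerate(flags) if f]


def top_m(values, m, width):
    """Find the m smallest values by repeating the search and inhibiting winners."""
    winners = []
    for _ in range(min(m, len(values))):
        survivors = bit_serial_min(values, width, inhibited=winners)
        winners.append(survivors[0])  # assumption: ties go to the lowest address
    return winners


distances = [13, 7, 22, 7, 3]  # hypothetical dissimilarity values, 5 bits wide
print(top_m(distances, 3, width=5))
```

Each scan costs one step per bit of the distance word, independent of the number of competing vectors, which is why the hardware can compare all inputs in parallel.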
Dissimilarity values are input to the minimum filter in a bit-serial manner, from the most significant bit to the least significant bit. A state register is provided for each distance signal to store a flag indicating the status of the competition: “0” means the loser and “1” means the winner. At each cycle, only the smallest bit values are passed through the minimum filter, and the other inputs are withdrawn from the subsequent competition. The minimum filter is composed of 100 identical circuits. Each circuit is shown on the right-hand side of Fig. 8. The flag stored in each state register controls the OR gate. If the flag is “0,” the output of the OR gate is “1,” and thus the input signal is neglected in the subsequent competition. If the flag is “1,” the distance signal is passed. The 100-input AND gate detects the minimum value of the signals passed through the upper OR gates. The lower AND gate compares the output of the 100-input AND gate with the signal from the upper OR gate. The result of the lower AND gate is stored in the state register as a flag to control the next competition step. Finally, flag “1” remains only at the location of the minimum-distance template vector (the winner). After the winner is obtained, the inhibit flag is set so that the dissimilarity value of the winner will be ignored in the next cycle. The calculation is continued until the addresses of the top M results are obtained.

The layout of the best architecture, Type N-1-1 with N = 10, was designed in a 0.6 µm triple-metal CMOS technology, as shown in Fig. 9. It contains 256 vectors with 128 dimensions on one chip. To increase the number of template vectors on one chip, the 32 KB on-chip SRAM was designed manually at the layout level to fit the maximal amount of memory in the limited area. The DM and WTA modules were designed using Verilog-HDL. Synopsys Design Analyzer synthesized the design, and the layout was then obtained using Apollo. The chip size is 9 mm × 9 mm.

4. An Example of Intelligent Query Search Application

4.1 Demonstration

We are all aware that looking for a new house is very difficult. The reason is that a house is not a standard product.2) In choosing a new house, factors such as individual preference and specific interest play very important roles. Therefore, a normal database and keyword searching cannot be of help.
An intelligent query-house-search application based on the VQP has been developed to solve this difficult problem. Experiments on intelligent query searches were carried out using an FPGA prototype.

An overall block diagram of the searching system is shown in Fig. 10. First, the WWW server receives a query through the common gateway interface (CGI). The query is parsed and the commands for the AP are generated. The AP VLSI searches and passes the best or top-M matching houses to the analysis agent program. All the information is stored in the cache memory of the AP VLSI. Depending on the scope of interest of the query, the server simply enables or disables each datum for the similarity search. The masking is performed by a simple one-bit signal for each datum so that the transfer speed is sufficiently high. Therefore, a latency-free search is possible on the AP VLSI. The Database Update Daemon is intended to update the information continuously; this function was not implemented for this experiment. In the analysis agent program (referred to as Result Parser in the figure), the results are analyzed and a balance sheet is returned to the customer through the WWW server. In the balance sheet, all the elements surpassing the customer’s request (happy results) are shown by upward bars. On the other hand, downward bars indicate the elements that do not satisfy the customer’s expectation (unhappy results).

Figures 11 and 12 show the sales process of the E-commerce house-search application. First, a search is performed with the preference weights all set at 1. The agent reports the locations of the top-n houses. By double-clicking a location, a picture, the balance sheet, and the percent satisfaction are displayed. Figure 11 shows the balance sheet of the most suitable house for this search. The percent satisfaction is 95%. The balance sheet reports that the size of the third room (TR) is very good; however, the distance to the station (DT) from the house is very bad. If DT is important to the customer, she or he may simply launch the query again with a larger preference value for DT, such as P_DT = 5. After searching with a higher priority on DT, DT will be improved.
Figure 12 shows the best house under more complicated preferences: {P_DT = 5, P_FR = 7, and P_PR = 7}, where P_FR and P_PR refer to the preferences for the number of family rooms and the price, respectively. The balance sheet with expert-evaluated values also reports that the percent satisfaction is 87%, showing a slight decline from the previous value (95%). This is exactly what most people would expect: the more preferences are specified, the lower the satisfaction. However, the result is very similar to the query when all of the customer’s preferences are considered. Not only is the result reasonable, but “No matches found” will never be returned to the customer.


4.2 Performance comparison

The intelligent query search was demonstrated on a hardware system using a Type 10-1-10 VQP having 100 template vectors (Fig. 13). The demonstration system consists of an interface board and the VQP implemented on a 400-kgate FPGA (ALTERA APEX 20k400EFC672-1). The design of the VQP, written in Verilog-HDL, was synthesized and mapped to the FPGA using ALTERA Quartus software. The same searching function was also implemented purely in software, in C++. The computation times of the FPGA and the software were measured for comparison.

The computation time of the software system was measured by the following procedure. First, all the template vectors are loaded into memory before computing. Second, the distances between the query and the template vectors are calculated. Third, the results are sorted using the quick sort algorithm and the Top M results are obtained. Among these steps, sorting is the most time-consuming. We did not use the most efficient indexing structures (k-d trees and the case retrieval net) mentioned in § 1. Although these structures can reduce the searching time significantly, one is obliged to wait too long for reconstruction of the structures when the contents of the database need to be updated. Frequent updating of the database, however, is a key feature of on-line searching applications on the Internet: the number of database updates is on the order of the number of read accesses. For a fair comparison, the measured computation time of the software does not include the I/O time.

Based on the measured computation time (shown with underlines in Table III) and the assumption that computation is performed in a fully parallel manner, the necessary computation time was extrapolated in Table III for performance comparison. The clock frequency of the VQP is 30 MHz and that of the Intel Pentium III processor used for the software computation is 600 MHz.
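The three-step software procedure can be sketched as follows. This is an illustrative Python sketch, not the authors' C++ program: the function names and sample data are hypothetical, the distance is assumed to be the weighted Manhattan distance, and Python's built-in `sorted` (a different algorithm from quick sort) stands in for the sorting step.

```python
def weighted_manhattan(query, template, weights):
    """Sum of weighted absolute differences over all elements."""
    return sum(w * abs(q - t) for q, t, w in zip(query, template, weights))


def software_top_m(query, templates, weights, m):
    """Return the addresses of the m templates closest to the query."""
    # Step 1 is implicit: all templates are already held in memory as a list.
    # Step 2: compute the distance from the query to every template.
    distances = [weighted_manhattan(query, t, weights) for t in templates]
    # Step 3: sort the template addresses by distance and keep the first m.
    order = sorted(range(len(templates)), key=distances.__getitem__)
    return order[:m]


# Hypothetical three-element vectors (e.g. price, size, distance to station).
templates = [[300, 80, 10], [250, 60, 5], [320, 85, 12]]
query = [300, 80, 5]
print(software_top_m(query, templates, [1, 1, 1], 2))
print(software_top_m(query, templates, [1, 1, 10], 2))
```

Because every template is visited for each query and no index is built, adding or removing a template costs nothing beyond the list update, at the price of a full distance pass and a sort per query.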
With 2519 template vectors, which can be stored in one chip with the Type 10-1-1 architecture, the VQP calculates 490 times faster than the software. With 40,000 template vectors, which would be necessary to search a sufficient number of houses in a city, the search time of the software becomes unsatisfactory. On the other hand, the Type 10-1-1 VQP returns the result without any appreciable delay if the computation is performed in a fully parallel manner. The VQP is 45,000 times faster than the software at a clock frequency of only 30 MHz (however, sixteen Type 10-1-1 VQP chips are needed on a board). Therefore, similarity-based intelligent query search becomes feasible by employing the VQP chips.

5. Conclusions

Several optimized architectures of the VQP have been studied, aiming at intelligent query-search applications. Since the VQP has a very flexible architecture and penalty function, it can be used in almost every kind of intelligent query-search application. The search engine for the Internet house-search application has been developed and successfully implemented in the VQP architecture using the FPGA prototype. Searching more than 10⁴ times faster is possible for over 40,000 template vectors at a clock frequency of 30 MHz, thus allowing high-performance and low-power implementation.


References

1) A. Chavez and P. Maes: Proc. First Int. Conf. Practical Application of Intelligent Agents and Multi-Agent Technology, London (1996) p. 75.
2) W. Wilke: Int. Conf. Case-Based Reasoning (ICCBR-97), Providence, Rhode Island (1997) p. 509.
3) A. Aamodt and E. Plaza: AI Communications, ed. D. Leake (IOS Press, Amsterdam, 1994) 7, p. 39.
4) Advances in Artificial Intelligence, eds. H.D. Burkhard and M. Lenz (Springer, Berlin, 1995) p. 103.
5) S. Wess, K.-D. Althoff and G. Derwand: Lecture in Artificial Intelligence, Topics in Case-Based Reasoning (Springer, Berlin, 1994) p. 167.
6) A. Nakata, T. Shibata, M. Konda, T. Morimoto and T. Ohmi: IEEE J. Solid-State Circ. 34 (1999) 822.


Figure captions

Fig. 1. Photomicrograph of the VQP test chip featuring variable vector dimension and grouping structure. (The chip was implemented in a 0.6 µm triple-metal CMOS technology.)
Fig. 2. (a) Type 1-1-1: Distance module and winner-take-all module are provided for each template memory. (b) Type N-1-N: A single DM is provided for N TMs and N WTAs. (c) Type N-1-1: One DM and one WTA are provided for N TMs.
Fig. 3. Figure of merit, defined as the number of template vectors divided by the number of clock cycles.
Fig. 4. Penalty function is needed for real intelligent query search applications.
Fig. 5. CISC-like implementation of the penalty function.
Fig. 6. Optimized RISC-like implementation of the penalty function.
Fig. 7. Block diagram for DM.
Fig. 8. Processing scheme of the Top M.
Fig. 9. Layout of the best architecture. (The chip was implemented in a 0.6 µm triple-metal CMOS technology.)
Fig. 10. An overall block diagram of the searching system.
Fig. 11. House search with VQP-powered intelligent search engine. All the preference weights are set to 1.
Fig. 12. Priority search on certain values. Priority weights P_DT, P_FR and P_PR are set to 5, 7 and 8, respectively.
Fig. 13. VQP prototype was implemented on a 400-kgate FPGA. It was connected directly to the host PC through a hand-made interface board.


Table I. Requirements of intelligent query search and specifications for VQP.

Table II. Comparison of various VQP architectures.

Table III. Comparison of computation time in seconds. Assumption: Top 100 vectors were searched. Type 10-1-10 and Type 10-1-1 have fully parallel processing, regardless of the number of template vectors from 2519 to 40,000.

