HARDWARE IMPLEMENTATION OF SIMILARITY FUNCTIONS
Michael Freeman, Michael Weeks, Jim Austin
Advanced Computer Architecture Group, Dept. Computer Science, University of York, York, YO10 5DD, UK
(mjf,mweeks,austin)@cs.york.ac.uk
ABSTRACT
A number of applications, varying from music to document classification, require the similarity between a collection of objects to be calculated. To achieve this, features of these objects are extracted, e.g. keywords, shapes, colours, frequency components etc., to produce an N-dimensional feature vector representing a point in an N-dimensional feature space. A database containing these feature vectors can be constructed, allowing query vectors to be applied and the distance between the query vector and those stored in the database to be calculated. From the results of these comparisons, similar objects can be identified and retrieved from the database for further processing by the application. There exist a number of commonly used distance or similarity measures, e.g. city block, Euclidean, weighted cosine distance etc., with varying processing requirements and performance characteristics. This paper investigates the possibility of accelerating these distance measures using FPGA-based hardware IP cores and compares this to a software implementation running on a Sun Blade 2000 computer.
KEYWORDS
Euclidean, weighted cosine, IP core, FPGA.
1. INTRODUCTION
A common system requirement is the ability to compare the similarity, or difference, between a collection of objects. To achieve this, characteristics or features of these objects must be extracted, e.g. keywords, shapes, colours, frequency components etc., to produce an N-dimensional feature vector representing a point in an N-dimensional feature space. Once a set of objects has been converted into a feature vector representation, a query vector can be applied to this database of objects and compared with the stored vectors using standard similarity measures. The result of this search allows a system to identify and retrieve similar objects from the database for further processing by the application. Commonly used distance or similarity measures include city block, Euclidean, weighted cosine distance etc., each with varying processing requirements and performance characteristics. The selected features and distance measure should give perceptually similar objects a smaller distance between them and perceptually different objects a larger distance between them, improving the retrieval accuracy of these systems. This approach to object comparison has been applied to a wide range of applications, from document [A. Schenker et al, 2003] to music [J.T. Foote, 1997] classification. To minimize the average retrieval time for these systems, a hardware implementation of the required similarity function can be constructed, e.g. an application-specific co-processor. Another application area for this implementation is future ubiquitous and pervasive systems. In these types of systems traditional general purpose processor implementations may not be acceptable, owing to the limited power resources found in such embedded applications. Therefore, a more application-specific processor, integrated into a system on a chip (SOC) architecture, would be more desirable, i.e. increasing parallelism in order to reduce the required system clock speed and, therefore, power consumption. In section 2 we introduce the distance measures that are to be implemented, briefly describing their characteristics and implementation issues. Section 3 presents the hardware design approach taken in implementing these distance measures as hardware IP cores and compares their performance to a software implementation executed on a Sun Blade 2000. Finally we close this paper with conclusions and future work.
2. SIMILARITY MEASUREMENTS
In this section the similarity measurements that can be used to compare feature vectors are described. Each is defined such that as two vectors become more similar, their distance approaches 0.0. The first of these can be represented in a general form as the Minkowski distance:

$$d_m(Q,D) = \left( \sum_{i=0}^{n-1} |Q_i - D_i|^{\lambda} \right)^{1/\lambda}$$

where Q = {Q0, Q1, Q2, … Qn-1} is a query vector and D = {D0, D1, D2, … Dn-1} is the database vector to which it is being compared. When λ = 1 this is the city block or Manhattan distance:

$$d_{cb}(Q,D) = \sum_{i=0}^{n-1} |Q_i - D_i|$$

When λ = 2 this is the Euclidean distance:

$$d_e(Q,D) = \sqrt{\sum_{i=0}^{n-1} (Q_i - D_i)^2}$$

There are obvious computational advantages for the city block distance over the Euclidean distance. However, the regions described by a fixed distance differ significantly, which can affect the retrieval accuracy of these systems, e.g. the Euclidean distance defines a circle in two dimensions, a sphere in three etc., whilst the city block distance defines a square, a cube etc. For some applications it is more appropriate to base the comparison metric on the angle between vectors rather than the distance, i.e. irrespective of vector lengths. The cosine distance is derived from the scalar or dot product of these vectors:

$$d_c(Q,D) = 1 - \cos\theta = 1 - \frac{\sum_{i=0}^{n-1} Q_i D_i}{\sqrt{S_q S_d}}, \qquad S_q = \sum_{i=0}^{n-1} Q_i^2, \quad S_d = \sum_{i=0}^{n-1} D_i^2$$

where S_q and S_d are the sums of the squared values of the query and database vectors respectively.
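As an illustration, the four measures named in this section can be written in a few lines of software. This is a minimal floating-point sketch, not the implementation evaluated in this paper; the function names are chosen here for illustration, and the cosine form assumes neither vector is all zeros.

```python
import math


def minkowski(q, d, lam):
    """Minkowski distance of order lam between two equal-length vectors."""
    if len(q) != len(d):
        raise ValueError("vectors must have the same length")
    return sum(abs(qi - di) ** lam for qi, di in zip(q, d)) ** (1.0 / lam)


def city_block(q, d):
    """Minkowski distance with lam = 1: sum of absolute differences."""
    return sum(abs(qi - di) for qi, di in zip(q, d))


def euclidean(q, d):
    """Minkowski distance with lam = 2: root of the summed squared differences."""
    return math.sqrt(sum((qi - di) ** 2 for qi, di in zip(q, d)))


def cosine_distance(q, d):
    """1 - cos(theta); assumes both vectors have non-zero length."""
    dot = sum(qi * di for qi, di in zip(q, d))
    s_q = sum(qi * qi for qi in q)   # sum of squared query values
    s_d = sum(di * di for di in d)   # sum of squared database values
    return 1.0 - dot / math.sqrt(s_q * s_d)


q = [1, 2, 3]
d = [4, 6, 3]
print(city_block(q, d))                  # 7
print(euclidean(q, d))                   # 5.0
print(cosine_distance(q, [2, 4, 6]))     # 0.0 (parallel vectors)
```

The last line shows the property that motivates the cosine distance: a vector and a scaled copy of itself are at distance 0.0 regardless of their lengths.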
For each of these distance metrics it is important that each feature in the feature vector contributes equally to the total distance, e.g. if one feature spans the range [0.0,1.0] and another spans [0.0,1000.0], the maximum variation in the first will have little effect on the total distance, whilst even a small variation in the second will have a much larger effect. To remove this effect a weights vector W = {W0, W1, W2, … Wn-1} can be applied to each vector in order to standardize these values.
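The effect of the weights vector can be seen with a small example. The paper does not state how W is chosen; the sketch below assumes, purely for illustration, that each weight is the reciprocal of the range of that feature across the database. The helper names range_weights and apply_weights are hypothetical.

```python
def range_weights(database):
    """Assumed weighting scheme: reciprocal of each feature's observed range."""
    weights = []
    for column in zip(*database):          # iterate over features, not vectors
        span = max(column) - min(column)
        weights.append(1.0 / span if span else 1.0)
    return weights


def apply_weights(vector, weights):
    """Element-wise product W_i * X_i."""
    return [w * x for w, x in zip(weights, vector)]


def city_block(q, d):
    return sum(abs(qi - di) for qi, di in zip(q, d))


# Feature 0 spans [0.0, 1.0]; feature 1 spans [0.0, 1000.0].
database = [[0.0, 0.0], [1.0, 1000.0], [0.5, 500.0]]
w = range_weights(database)

q = [1.0, 10.0]
d = [0.0, 0.0]

# Unweighted: the small change in feature 1 (10 out of 1000) outweighs
# the maximum possible change in feature 0.
raw = city_block(q, d)

# Weighted: both features are on a common [0, 1] scale, so feature 0 dominates.
weighted = city_block(apply_weights(q, w), apply_weights(d, w))
print(raw, weighted)
```

Because the weights depend only on the database, the weighted database vectors can be computed once in advance rather than at query time.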
3. HARDWARE IMPLEMENTATIONS
To accelerate the performance of the distance metrics described in the previous section, hardware IP cores written in VHDL have been designed. The intended target for these IP cores is a Xilinx Virtex-II FPGA [Xilinx, 2004]; however, these designs can equally be applied to structured ASIC based systems. To minimize development times Xilinx's CORE generator has been used to produce the simple arithmetic operators required by these distance metrics, e.g. adder, multiplier, divider. The square root function is based on a two's complement, non-restoring square root algorithm [K. Piromsopa et al, 2001]. These cores have been optimized for the chosen FPGA architecture and combined to form the systems shown in figures 1 and 2. To simplify the design of these systems each is based around a soft core MAC unit. The city block IP core uses the Euclidean distance hardware with one input of the multiplier tied to a constant 1 and no final square root unit. The reason for this approach is to simplify the transition to Virtex-IV FPGA devices, i.e. to take advantage of the hardwired MAC cells in this FPGA. The query and database vectors for these systems can contain 10 to 400 features, represented using signed 32-bit integer values. In theory, higher computational performance could be achieved by replicating adder and multiplier units for each pair of vector elements; however, the required memory bandwidth would make this solution impractical, e.g. up to 12800 bits per cycle. Therefore, a streaming architecture has been used, based around an existing burst optimized SDRAM controller. When in operation the query vector is loaded into a looping FIFO buffer implemented in internal FPGA Block RAM. The required database vectors are then sequentially accessed from SDRAM memory and streamed through the similarity function, with the distance results stored in an output FIFO queue. To minimize hardware and improve performance, the weights multiplication and the sums of the squared vector values, i.e. Sq and Sd used in the cosine distance, are pre-processed and stored in SDRAM memory. The size and speed performance of each system for vectors of 50 features are shown in table 1.
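The streaming arrangement described above can be modelled in software to make the data flow concrete. The following is a behavioural sketch only, not the VHDL design: the function name stream_distances is hypothetical, the absolute-value step in the city block path is an assumption about how the difference is presented to the multiplier, Python's math.isqrt stands in for the non-restoring square root core, and the cosine path is omitted.

```python
from collections import deque
from math import isqrt


def stream_distances(query, sdram_words, mode="euclidean"):
    """Behavioural model of a single-MAC streaming datapath (illustrative).

    query       -- list of ints, held in a looping FIFO
    sdram_words -- flat iterable of ints: database vectors stored back to back
    mode        -- "euclidean" or "city_block"
    """
    if mode not in ("euclidean", "city_block"):
        raise ValueError("unknown mode: " + mode)

    n = len(query)
    query_fifo = deque(query)   # looping FIFO holding the query vector
    output_fifo = deque()       # one distance result per database vector
    acc = 0                     # MAC accumulator
    count = 0                   # elements consumed for the current vector

    for d_i in sdram_words:
        q_i = query_fifo[0]
        query_fifo.rotate(-1)   # head moves to the tail, so the query loops

        diff = q_i - d_i
        if mode == "euclidean":
            acc += diff * diff      # multiply the difference by itself
        else:
            acc += abs(diff) * 1    # same MAC, one multiplier input tied to 1

        count += 1
        if count == n:              # end of one database vector
            result = isqrt(acc) if mode == "euclidean" else acc
            output_fifo.append(result)
            acc = 0
            count = 0

    return list(output_fifo)


query = [1, 2, 3]
sdram = [4, 6, 3,  1, 2, 3]          # two database vectors, back to back
print(stream_distances(query, sdram, "euclidean"))    # [5, 0]
print(stream_distances(query, sdram, "city_block"))   # [7, 0]
```

In this model one element is consumed per loop iteration, which mirrors why the cycle count of a single-MAC design grows with vector length. A trailing partial vector in the stream produces no result.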
Figure 1 Euclidean distance hardware

Figure 2 Cosine distance hardware

Similarity function   City block   Euclidean   Cosine
Size (slices)         283          344         1753
Max freq (MHz)        108          108         108
Speed (cycles)        55           55          59
Latency (cycles)      55           106         98
Table 1 Hardware performance

From table 1, it can be seen that the common MAC cell dominates each system's performance, allowing a similarity comparison to be performed approximately every 600ns. Significant hardware resources are also required by the divider unit, making the cosine distance IP core considerably larger than the other two cores. It should be noted that the multiplier units do not add to the slice count as they are hardwired units within the FPGA. The performance of the cosine distance function was then compared to a software based system implemented on a Sun Blade 2000, with an UltraSPARC-III+ Cu processor running at 1200MHz with 4GB of RAM and the Solaris 8 OS, using feature vector lengths of 10 to 400. The results of this comparison are shown in figure 3. This graph shows that for vector lengths less than 50, the delay of the square root function dominates, which is independent of vector length. For vector lengths greater than 50, the MAC delay dominates, which is proportional to vector length. The performance of this hardware cosine distance function is comparable to the software implementation, but requires significantly less power and hardware resources. Processing performance can be greatly improved by replicating the MAC units, i.e. processing Q0D0, Q1D1 etc. in parallel and summing the final result. At present a 72-bit data word is used, limiting the number of MAC units to two 32-bit units or three 24-bit units. Alternatively, performance can be improved with increased clock speeds, by switching to higher speed grade devices, e.g. Virtex-IV FPGAs or structured ASICs. Both these schemes require increased memory bandwidth to supply data to these units, which may not be desirable for low power embedded applications.
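The figures quoted in this section can be related to one another with some simple arithmetic. The sketch below assumes the cores run at the 108MHz maximum frequency listed in table 1 and ignores any memory or control overhead, so the times it produces are lower bounds rather than measured values; the helper names are illustrative.

```python
CLOCK_MHZ = 108  # maximum frequency from table 1 (assumed operating frequency)


def cycles_to_ns(cycles, clock_mhz=CLOCK_MHZ):
    """Convert a cycle count to nanoseconds at the given clock frequency."""
    return cycles * 1000.0 / clock_mhz


def mac_units_per_word(word_bits, operand_bits):
    """Number of whole operands, and so parallel MAC units, per data word."""
    return word_bits // operand_bits


def bits_per_cycle(features, operand_bits=32):
    """Bandwidth needed to deliver a whole vector in a single cycle."""
    return features * operand_bits


# 50-feature vectors: 55 cycles (city block, Euclidean) and 59 cycles (cosine).
for name, cycles in [("city block", 55), ("Euclidean", 55), ("cosine", 59)]:
    print(name, round(cycles_to_ns(cycles)), "ns")

print(mac_units_per_word(72, 32))   # 2
print(mac_units_per_word(72, 24))   # 3
print(bits_per_cycle(400))          # 12800
```

The cycle counts alone correspond to roughly 510 to 550ns per comparison, the same order as the approximately 600ns quoted above. The word-width results reproduce the limit of two 32-bit or three 24-bit MAC units per 72-bit word, and the final line reproduces the 12800 bits per cycle needed to process a 400-feature vector fully in parallel.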
Figure 3 Cosine distance performance comparison (time in ns against vector length; series: Software, Hardware non-pipelined, Hardware pipelined, Hardware parallel MAC)
4. CONCLUSION
This paper has presented initial work on developing hardware IP cores for similarity functions used in vector comparison. Application areas for this implementation include application-specific co-processors, to minimize the average retrieval time of comparisons based on these functions, and future low-power, low-cost ubiquitous and pervasive systems. Future work on these IP cores will focus on minimizing power requirements for this type of application by increasing parallelism in order to reduce the required system clock speed, and therefore power consumption, when compared to traditional general purpose processor implementations.
ACKNOWLEDGEMENT
The work presented in this paper was supported by the DTI Next Wave Technologies and Markets programme and Cybula Ltd, as part of the AMADEUS virtual research centre [AMADEUS, 2004].
REFERENCES
AMADEUS, (2004), AMADEUS website, WWW: http://www.cs.york.ac.uk/amadeus
A. Schenker, M. Last, H. Bunke, A. Kandel, (2003), Classification of web documents using a graph model, Proceedings of the Seventh International Conference on Document Analysis and Recognition (ICDAR 2003).
J.T. Foote, (1997), Content-based retrieval of music and audio, Multimedia Storage and Archiving Systems II, Proc. of SPIE, Vol. 3229, pp. 138-147.
K. Piromsopa, C. Aporntewan, P. Chongstitvatana, (2001), An FPGA implementation of a fixed-point square root operation, Inter. Symposium on Communications and Information Technology, November 14-16, Thailand, pp. 587-589.
Xilinx, (June 2004), Virtex-II Platform FPGAs: Complete Data Sheet, DS031 (v3.3), WWW: http://direct.xilinx.com/bvdocs/publications/ds031.pdf