FPGA-based Accelerator Design for RankBoost in Web Search Engines

Ning-Yi XU, Xiong-Fei CAI, Rui GAO, Lei ZHANG, Feng-Hsiung HSU
Microsoft Research Asia
ningyixu, xfcai, ruigao, leizhang, [email protected]

Abstract

Search relevance is a key measure of the usefulness of search engines. A shift of search relevance among search engines can easily change a search company's market cap by tens of billions of dollars. With the ever-increasing scale of the Web, machine learning technologies have become important tools to improve search relevance ranking. RankBoost is a promising algorithm in this area, but it is not widely used due to its long training time. To reduce the computation time of RankBoost, we designed an FPGA-based accelerator system. The accelerator, plugged into a commodity PC, increased the training speed on MSN search engine data by two orders of magnitude compared to the original software implementation on a server. The proposed accelerator has been successfully used by researchers in the area of search relevance ranking.
1. Introduction

Web-search-based ad services have become a huge business in the last few years, with billions of dollars in annual earnings and more than a hundred billion dollars of total generated market cap for search engine companies. For a search engine, the key to attracting more users and obtaining larger market share is its search relevance [3], which is determined by the ranking function that ranks the resulting documents according to their similarity to the input query. Information retrieval (IR) researchers have studied the search relevance problem for many decades. Representative methods include the Boolean model, the vector space model, the probabilistic model, and the language model [1]. Earlier search engines were mainly based on such IR algorithms. When the internet bubble crashed in 2000, most observers believed that search was a minor business and that search relevance was not an important problem. But just a year before, in 1999, Google had proposed the PageRank algorithm to measure the importance of Web pages [16], creating a quantum leap in search relevance. This leap caused a rapid decline in the user populations of the then-existing search engines and proved beyond any doubt the importance of search relevance. Many factors affect the ranking function for search
relevance, such as page content, title, anchor text, URL, spam, and page freshness. It is extremely difficult to manually tune ranking function parameters to combine these factors in an ever-growing, Web-scale system. Instead, machine learning algorithms have been applied to automatically learn complex ranking functions from large-scale training sets. Earlier algorithms for ranking function learning include polynomial-based regression [2], genetic programming [3], RankSVM [4] and classification-based SVM [5]. However, all of these algorithms were evaluated only on small-scale datasets due to their high computational cost. RankNet [6] was the first to be evaluated on a commercial search engine. Motivated by RankNet, we revisited other machine learning algorithms and identified RankBoost [10] as a promising candidate: it has relevance performance comparable to RankNet and a faster ranking function, even though its learning process has a higher computational cost.

To learn a ranking function for a real search engine, large-scale training sets are usually used for better generalization, and numerous runs are required to obtain the best results on frequently updated datasets. Thus computational performance, measured by the time to obtain an acceptable model over a large dataset, is critical for a ranking function learning algorithm. With an initial software implementation, a typical run to obtain a model on MSN search engine data for RankBoost still took two days, which is too slow for training and tuning the ranking function.

To further reduce the computation time of RankBoost, we designed an FPGA-based accelerator system. As an alternative to multi-core CPUs and GPUs, field-programmable gate arrays (FPGAs) can be used as customized computing engines to accelerate a wide range of applications. High-end FPGAs can fully exploit bit-level parallelism and provide very high memory access bandwidth. For example, the Altera Stratix-II FPGA used in our accelerator contains millions of gates and provides terabits per second of access bandwidth to several megabytes of on-chip memory [13]. Due to their great design flexibility and performance close to that of application-specific integrated circuits (ASICs), FPGAs have been recognized as highly competitive with general-purpose CPUs for high-performance computing across a large number of applications [14][15].

In this paper, we propose FAR: an FPGA-based Accelerator for RankBoost. We first develop an approximation to the RankBoost algorithm, which reduces the scale of computation and storage from O(N²) to O(N) and speeds up the software by 3.4 times. We then map the algorithm and the related data structures to our FPGA-based accelerator platform. An SIMD (Single Instruction Multiple Data) architecture with multiple processing engines (PEs) is implemented in the FPGA. The training data are organized to enable high-bandwidth streaming memory access, which dramatically reduces processing time. In a typical experimental run on MSN search engine data, training time is reduced from 46 hours to 16 minutes on a commodity PC equipped with FAR, a 169.9× performance improvement. Several FAR systems have been successfully used in research work by the Web Search and Mining Group at Microsoft Research Asia.

The remainder of this paper is organized as follows. Section 2 describes the RankBoost algorithm for search relevance and discusses the bottleneck in the software implementation. Section 3 presents the design of FAR and illustrates how the hardware architecture exploits the locality and parallelism in the algorithm. Section 4 describes the performance model and discusses the experimental results. Section 5 concludes the paper and discusses future work.
2. Application: RankBoost for Web search relevance

To better understand the application, this section gives a brief introduction to RankBoost and describes how RankBoost is used to learn ranking functions in a Web search engine. The computational complexity is also discussed.

RankBoost is a variant of AdaBoost, which was proposed by Freund and Schapire [17] in 1995. For a typical classification problem, the AdaBoost training process builds a strong classifier by combining a set of weak classifiers, each of which is only moderately accurate. AdaBoost has been successfully applied to many practical problems (including image classification [20], face detection [21], and object detection [22]) and has demonstrated both effectiveness and efficiency. The proposed FPGA-based RankBoost accelerator could easily be extended to support such applications with similar performance improvements. For detailed background and theoretical analysis of boosting algorithms, please refer to [7][8][9][10][19].
2.1. RankBoost for the standard ranking problem

In a standard ranking problem, the goal is to find a ranking function to order a given set of objects. Such an object is denoted as an instance x in a domain (or instance space) X. As a form of feedback, information about which instance should be ranked above (or below) another is provided for every pair of instances. This feedback is denoted as a function Φ: X × X → R, where Φ(x0, x1) > 0 means x1 should be ranked above x0, and vice versa. The learner then attempts to find a ranking function H: X → R that is as consistent as possible with the given feedback, by asserting that x1 is preferred to x0 if H(x1) > H(x0).

The RankBoost algorithm learns the ranking function H by combining a given collection of ranking functions. The pseudo code is given in Fig. 1. RankBoost operates in an iterative manner. In each round, the procedure WeakLearn is called to select the best weak ranker from a large set of candidate weak rankers. A weak ranker has the form ht: X → R, and ht(x1) > ht(x0) means that instance x1 is ranked higher than x0 in round t. A distribution Dt over X × X (i.e., document pairs) is maintained during the training process to reflect the importance of ranking each pair correctly. The weight Dt(x0, x1) is decreased if ht ranks x0 and x1 correctly (ht(x1) > ht(x0)), and increased otherwise. Thus, Dt tends to concentrate on the pairs that are hard to rank. The final strong ranker H is a weighted sum of the weak rankers selected in each round.

RankBoost Algorithm
Initialize: initial distribution D1 over X × X.
Do for t = 1, ..., T:
  (1) Train WeakLearn using distribution Dt.
  (2) WeakLearn returns a weak hypothesis ht.
  (3) Choose αt ∈ R.
  (4) Update weights: for each pair (x0, x1):
      $$D_{t+1}(x_0, x_1) = \frac{D_t(x_0, x_1)\,\exp(\alpha_t (h_t(x_0) - h_t(x_1)))}{Z_t}$$
      where Zt is the normalization factor:
      $$Z_t = \sum_{(x_0, x_1)} D_t(x_0, x_1)\,\exp(\alpha_t (h_t(x_0) - h_t(x_1))).$$
Output: the final hypothesis:
$$H(x) = \sum_{t=1}^{T} \alpha_t h_t(x).$$

Fig. 1. The original RankBoost algorithm.
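To make the update rule concrete, here is a minimal Python sketch of one round's weight update (our illustration, not the authors' code; the function name and the dict-of-pairs representation are assumptions):

```python
import math

def rankboost_update(D, h, alpha):
    """One weight update from Fig. 1.  D maps a pair (x0, x1), where
    x1 should rank above x0, to its weight; h maps an instance to a
    real score; alpha is the weight chosen for this round's ranker."""
    updated = {}
    for (x0, x1), w in D.items():
        # Correctly ranked pairs (h(x1) > h(x0)) are down-weighted;
        # incorrectly ranked pairs are up-weighted.
        updated[(x0, x1)] = w * math.exp(alpha * (h(x0) - h(x1)))
    z = sum(updated.values())  # normalization factor Z_t
    return {pair: w / z for pair, w in updated.items()}
```

The final strong ranker is then evaluated simply by summing the selected weak rankers, H(x) = Σt αt ht(x).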
2.2. Extending RankBoost to Web relevance ranking

To extend RankBoost to Web relevance ranking, two issues need to be addressed. The first issue is how to generate training pairs. The instance space for search engines does not consist of a single set of instances to be ranked against each other, but is instead partitioned by the queries issued by users. For each query q, some of the returned documents are manually rated with a relevance score, from 1 (meaning 'poor match') to 5 (meaning 'excellent match'). Unlabeled documents are given a relevance score of 0. Based on these rating scores (the ground truth), the training pairs for RankBoost are generated from the returned documents for each query.

The second issue is how to define the weak rankers. In this work, each weak ranker is defined as a transformation of a document feature, which is a one-dimensional real-valued number. Document features can be classified as query-dependent features (such as query term frequencies in a document and term proximity) or query-independent features (such as PageRank [16]). Thus, the same document may be represented by different feature vectors for different queries due to the existence of query-dependent features. The extended RankBoost algorithm for Web relevance ranking is shown in Appendix 1.
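To illustrate the first issue, here is a minimal sketch of per-query pair generation (our reading of the scheme above; the data layout, and the choice to pair documents whenever their scores differ, are assumptions):

```python
def generate_pairs(queries):
    """queries: dict mapping query id -> list of (doc_id, score),
    with scores in 0..5 (0 = unlabeled).  Returns training pairs
    (d0, d1) where d1 should be ranked above d0, following the
    pair convention of Section 2.1."""
    pairs = []
    for q, docs in queries.items():
        for i, (d_a, s_a) in enumerate(docs):
            for d_b, s_b in docs[i + 1:]:
                if s_a == s_b:
                    continue  # equally rated: no preference
                lo, hi = (d_a, d_b) if s_a < s_b else (d_b, d_a)
                pairs.append(((q, lo), (q, hi)))
    return pairs
```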
2.3. WeakLearn design for performance and computational complexity analysis

In both the original and extended RankBoost (Fig. 1 and Appendix 1), WeakLearn is the main routine in each round of the algorithm. Clearly, its design is critical to the quality and speed of RankBoost. The main contribution of this paper is the design of the WeakLearn algorithm and its mapping to an efficient hardware accelerator. This subsection briefly describes various designs of WeakLearn and analyzes the computational complexity of each. The core algorithm and its hardware implementation are addressed in the next section.

As described in [10], WeakLearn is a routine that chooses, from among Nf document features (as weak rankers), the one that yields minimum pair-wise disagreement relative to the distribution D over Npair document pairs. Here, we first need to determine the form of the weak rankers. We tried several forms of weak ranker proposed in [10] to decide on the best one. For low complexity and good ranking quality, we define the weak ranker in the form:

$$h(d) = \begin{cases} 1 & \text{if } f_i(d) > \theta \\ 0 & \text{if } f_i(d) \le \theta \text{ or } f_i(d) \text{ is undefined} \end{cases} \qquad (1)$$

Here, fi(d) denotes the value of feature fi for document d, and θ is a threshold value. To find the best h(d), WeakLearn needs to check all possible combinations of feature fi and threshold θs. Freund [10] presents three methods for this task. The first method (noted as M1) is a numerical search, which is not recommended due to its complexity [19]. The second method (M2) needs to accumulate the distribution D(d0, d1) over each document pair to check each (fi, θs) combination, and thus has a complexity of O(Npair·Nf·Nθ) per round. Our initial software (M2.2Dint) reduces the computational complexity to O(Npair·Nf) using 2-D integral histograms¹, but a complete run still needs about 150 hours.

¹ This algorithm accumulates D(d0, d1) into a 2-D integral histogram with (fi(d0), fi(d1)) as coordinates in O(Npair) time, then selects the best threshold of fi in O(Nθ) time. Details are beyond the scope of this paper.

For speedup, we implement the third method (M3) (shown in Appendix 1, Algorithm 2), which finds the optimal r(f, θ) by generating a temporary variable π(d) for each document. Intuitively, π contains information regarding labels and pair weights. WeakLearn only needs to access π in a document-wise manner for each feature and each threshold, that is, O(Ndoc·Nf·Nθ) in a straightforward implementation (noted as Naïve M3). We further reduce the computational complexity to O(Ndoc·Nf) using integral histograms; the new algorithm is noted as M3.int. The computational complexities of these routines are listed in Table 1 for comparison. Substituting Npair, Nf and Nθ with the benchmark values given in Section 4.1, it is clear that M3.int has the smallest computational complexity of all the algorithms. Therefore, we focus on M3.int in this paper. Both software and hardware implementations of M3.int are presented in the following section.

Table 1. Computational complexity of the main routines in each round of different versions of the RankBoost algorithm.

Main routine                       | Pair-wise algorithms             | Document-wise algorithms
                                   | Naïve M2        | M2.2Dint       | Naïve M3        | M3.int
WeakLearner: π calculation         | —               | —              | O(Npair)        | O(Npair)
WeakLearner: best ranker search    | O(Npair·Nf·Nθ)  | O(Npair·Nf)    | O(Ndoc·Nf·Nθ)   | O(Ndoc·Nf)
Weight update                      | O(Npair)        | O(Npair)       | O(Npair)        | O(Npair)
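For concreteness, Equation (1) can be written as a small Python helper (an illustrative sketch; how undefined feature values are encoded is an assumption):

```python
def weak_ranker(f_i, theta):
    """Equation (1): returns h(d) for feature accessor f_i and
    threshold theta.  Here an undefined feature value is assumed to
    be encoded as None."""
    def h(d):
        v = f_i(d)
        if v is None:
            return 0  # undefined feature value -> 0
        return 1 if v > theta else 0
    return h
```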
3. RankBoost accelerator design

In this section, we describe the M3.int algorithm and its accelerator design. Devising an efficient algorithm and mapping that algorithm to an efficient hardware architecture are the key challenges addressed in this work.
3.1. M3.int: document-wise WeakLearn based on integral histograms

The weak learner M3 (shown in Appendix 1, Algorithm 2) finds the weak ranker with optimal r(f, θ) by generating a temporary variable π(d) for each document. r is defined as:

$$r(f_k, \theta_s^k) = \sum_{d:\, f_k(d) > \theta_s^k} \pi(d) \qquad (2)$$

A straightforward implementation calculates r in O(Ndoc·Nf·Nθ) time per round, and it is apparent that the calculation of r is the bottleneck of the whole RankBoost algorithm. The first reason is the huge number of r values to check: assuming 400 rounds and 256 threshold levels on the dataset described in Section 4.1, tens of millions of |r(fk, θsk)| values are generated. The second reason is that during the r calculation, WeakLearn needs to iteratively access a large amount of data, including all of the training feature values, which may occupy several gigabytes of memory.

M3.int efficiently calculates r with integral histograms in O(Ndoc·Nf) time. This gives close to two orders of magnitude reduction in computational complexity compared to Naïve M3 (when Nθ = 256). The algorithm is described as follows. For each feature fk, k = 1, ..., Nf, the feature values {fk(d) | d = 1, ..., Ndoc} of all documents can be classified into Nbin bins. The boundaries of these bins are:

$$\theta_s^k = f_{min}^k + \frac{s\,(f_{max}^k - f_{min}^k)}{N_{bin}}, \quad s = 0, 1, \dots, N_{bin}$$

where $f_{max}^k$ (resp. $f_{min}^k$) is the maximum (resp. minimum) value of fk over the training dataset. Each document d can then be mapped to one of the bins according to the value of fk(d):

$$Bin_k(d) = \left\lfloor \frac{f_k(d) - f_{min}^k}{f_{max}^k - f_{min}^k}\,(N_{bin} - 1) \right\rfloor \qquad (3)$$

The histogram of π(d) over feature fk is then built using:

$$Hist_k(i) = \sum_{d:\, Bin_k(d) = i} \pi(d), \quad i = 0, \dots, N_{bin} - 1 \qquad (4)$$

Then, we can build an integral histogram by summing the elements of the histogram from the right (i = Nbin − 1) to the left (i = 0). That is,

$$Integral_k(i) = \sum_{a \ge i} Hist_k(a), \quad i = 0, \dots, N_{bin} - 1 \qquad (5)$$

Based on (2) ~ (5), it can be deduced that the value Integralk(i) equals r(fk, θik). The pseudo code of the r calculation algorithm is summarized in Fig. 2. With this algorithm, we obtain more than 3 times acceleration over M2.2Dint. However, runtime profiling (Subsection 4.3) reveals that the r calculation still occupies almost 99% of the CPU time. Thus, our attention is next focused on this task. We observe that the integral histograms for different features can be built independently of each other. This feature-level parallelism makes it possible to accelerate the algorithm using various techniques, such as distributed computing and an FPGA-based accelerator, as described in the next subsection.
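The same procedure, shown as hardware-oriented pseudocode in Fig. 2 below, can be sketched in a few lines of NumPy (our illustration, assuming features have already been quantized to bin indices per Equation (3)):

```python
import numpy as np

def weak_ranker_search(pi, bins, n_bin=256):
    """pi: (N_doc,) array of pi(d) values for the current round.
    bins: (N_f, N_doc) integer array with bins[k, d] = Bin_k(d).
    Returns (|r*|, feature index k*, threshold bin s*)."""
    best = (0.0, -1, -1)
    for k in range(bins.shape[0]):
        # Equation (4): weighted histogram of pi over feature k's bins.
        hist = np.bincount(bins[k], weights=pi, minlength=n_bin)
        # Equation (5): integral histogram, summed from the right,
        # so that integral[i] = r(f_k, theta_i^k).
        integral = np.cumsum(hist[::-1])[::-1]
        i = int(np.argmax(np.abs(integral)))
        if abs(integral[i]) > best[0]:
            best = (abs(integral[i]), k, i)
    return best
```

Note the per-round cost is exactly the O(Ndoc·Nf) claimed above: one histogram pass and one cumulative sum per feature.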
Input: π(d) and the feature vector (f_1(d), ..., f_Nf(d)) of all documents
for (k = 0; k < N_f; k++)
    for (d = 0; d < N_doc; d++)
        bin = floor((f_k(d) - f_min_k) / (f_max_k - f_min_k) * (N_bin - 1));
        Hist_k(bin) += π(d);
    endfor;  // Build histogram for feature k
    Integral_k(N_bin - 1) = Hist_k(N_bin - 1);
    for (i = N_bin - 2; i >= 0; i--)
        Integral_k(i) = Hist_k(i) + Integral_k(i + 1);
    endfor;  // Build integral histogram (r values)
endfor;  // All features
Output: Maximum(|Integral_k(i)|)

Fig. 2. The pseudo code for weak ranker selection in M3.int (steps (2) and (3) of Algorithm 2, Appendix 1).

3.2. Mapping M3.int to the FPGA-based accelerator

To further improve the performance of RankBoost, we investigate an implementation of M3.int on an FPGA. This subsection introduces the accelerator system, describes the SIMD architecture for building the integral histograms, and discusses implementation issues.

3.2.1 Accelerator architecture: FPGA-based PCI card with large on-board memory. The accelerator is a PCI card carrying an FPGA and memories (SDRAM, SRAM), as shown in Fig. 3. On the board, an Altera Stratix-II FPGA (EP2S60F1020C5) is used as the core computation component. The on-board memories include DDR SDRAM (1 GB; the maximum possible is 2 GB) and 32 MB of SRAM (not used in this work). The board was designed by the authors of this paper and can also be used to accelerate other, similar algorithms.
Fig. 3. (A) The accelerator: an FPGA-based PCI card with on-board memory. [Diagram: the host computer's application and driver communicate over the PCI bus with the board; the FPGA holds the PCI controller, computation logic and MMU, alongside a CPLD, power circuitry, SRAM and SDRAM.] (B) The card used in the experiments.

3.2.2 Building integral histograms with an SIMD architecture. The most time-consuming part of the RankBoost algorithm is building the integral histograms. We propose an SIMD architecture to build several integral histograms simultaneously with multiple PEs, as shown in Fig. 4.

At the initialization step, the software on the host computer sends the quantized feature values to the DDR memory through the PCI bus, the PCI controller and the FPGA. Note that the data are organized in such a way as to enable streaming memory access, making full use of the DDR memory bandwidth. In each training round, the software calls WeakLearn to compute π(d) for every document and sends the π(d) values to a FIFO in the FPGA. The control unit (CU) in the FPGA then instructs the PE array to build the histograms and integral histograms, and sends the results r(f, θ) to the output FIFO. The CU is implemented as a finite state machine (FSM) that halts or resumes the pipeline in the PEs depending on the status of each FIFO. When the CU indicates that the calculation of r has completed, the software reads back the r values and selects the maximum. The software then updates the distribution D(d0, d1) over all pairs and begins the next round.

Fig. 4. The SIMD architecture of the RankBoost accelerator. [Diagram: a DDR interface streams feature values fi(d) through a FIFO to eight PEs (PE0–PE7) coordinated by a control unit; FIFOs carry π(d) in and r(f, θ) out via the PCI local interface and PCI controller.]
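The initialization-time data organization can be pictured with a host-side sketch (our assumptions: 8-bit quantization per Equation (3), and an interleaving in which one DDR burst feeds all eight PEs; the actual driver format is not given in the paper):

```python
import numpy as np

def pack_features(features, n_bin=256, n_pe=8):
    """features: (N_f, N_doc) float array, N_f assumed to be a
    multiple of n_pe (pad with dummy features otherwise).  Quantizes
    each feature to an 8-bit bin index and interleaves groups of n_pe
    features so sequential addresses feed the whole PE array."""
    f_min = features.min(axis=1, keepdims=True)
    f_max = features.max(axis=1, keepdims=True)
    span = np.where(f_max > f_min, f_max - f_min, 1.0)  # avoid /0
    bins = np.floor((features - f_min) / span * (n_bin - 1))
    bins = bins.astype(np.uint8)
    n_f, n_doc = bins.shape
    assert n_f % n_pe == 0
    # For each group of n_pe features, store one document's n_pe
    # values contiguously, then advance to the next document: the
    # PE array then reads strictly sequential (burst-friendly)
    # addresses, one byte per PE per cycle.
    groups = bins.reshape(n_f // n_pe, n_pe, n_doc)
    return np.ascontiguousarray(groups.transpose(0, 2, 1))
```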
Fig. 5. The PE micro-architecture. [Diagram: f(d) drives the read/write address logic (with offset addresses, shift registers and multiplexers) of a dual-port RAM; a floating-point adder accumulates π(d) into the addressed histogram entry, with 32'b0 selectable for initialization; r(f, θ) is read out.]

We design the micro-architecture of the PE (Fig. 5) to support fully pipelined operation, which is important for hardware performance. In a naïve implementation of the PE, direct computation of Equation (4) processes the π(d) values for a single feature sequentially. When adjacent feature values fk(di) and fk(di+1) are the same, π(di) and π(di+1) must be added to the same position in the histogram, and the floating-point adder then has to insert bubbles into its pipeline to avoid a read-after-write hazard. The design shown in Fig. 5 avoids such data hazards by building multiple histograms (instead of a single histogram) in the pipeline, where the same π(d) for a sequence of different features is added to the corresponding different histograms.

The data input to the PE array is stored in the DDR memory on the board. In order to use the available bandwidth efficiently, we organize the data according to the sequence of access by the PE array, so that the data can be read in the burst mode of the DDR memory.

The next challenging task is to select the parameters of the PE and the PE array, such as the input data precision, the number of pipeline stages, and the number of PEs. These parameters should be selected to achieve the performance goal of the accelerator while meeting multiple conflicting constraints such as area, frequency, memory bandwidth, precision and scalability. For example, increasing the number of PEs will not increase system performance if the DDR memory cannot send enough data to the PE array due to limited bandwidth. After several trial implementations and experiments, we drew the following conclusions: i) Increasing the number of threshold levels beyond 256 does not improve the generalization ability of the final strong ranker; thus, each feature can be quantized to 8 bits or fewer. ii) The DDR memory on our board can provide feature data at a bandwidth of about 1 GB/s in streaming access mode, and the PE array should be designed to handle that amount of input feature data. iii) When features are quantized to 8 bits, the PE array consumes 1 G feature values per second, i.e., 1G/Nf documents per second. Thus the bandwidth for accessing the π(d) values (single-precision floating point) is about 4 bytes × 1G/Nf per second, which is much lower than the bandwidth of the PCI bus. This means the PCI interface will not become the bottleneck. iv) The PEs can run at 180 MHz at most after optimization using aggressive pipelining (16 stages) and high place-and-route effort in the Quartus-II software; the critical path lies in the floating-point adder IP from the Altera library.

Following arguments i) through iv), a PE array with 8 PEs would consume feature values at a bandwidth of 1.44 GB/s (8 bytes × 180 MHz), which is beyond the upper bound of the DDR memory bandwidth on our board. Thus, increasing the number of PEs beyond 8 will not shorten the computation time further. Finally, we implement 8 PEs with 16-stage pipelines in the FPGA. This configuration fully utilizes the DDR memory bandwidth, which is the bottleneck of our accelerator. The final implementation of FAR on the Altera Stratix-II FPGA EP2S60 costs 11,357 ALUTs (23%) and 2,102K memory bits (83%). The design scale is limited by the size of the embedded memory in the FPGA. The PE runs at 120 MHz.
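The hazard-avoidance scheme can be mimicked behaviorally (a software model of one PE, not RTL; the division of features among PEs is our assumption):

```python
def pe_schedule(pi, bins_group, n_bin=256):
    """Behavioral model of one PE building several histograms at
    once.  bins_group: list of per-feature bin-index arrays handled
    by this PE.  Iterating documents in the outer loop and features
    in the inner loop means successive additions target different
    histograms, so in hardware an accumulation is never immediately
    followed by a read of the same RAM address, and the
    floating-point adder pipeline stays full."""
    hists = [[0.0] * n_bin for _ in bins_group]
    for d in range(len(pi)):
        for f, bins in enumerate(bins_group):
            hists[f][bins[d]] += pi[d]
    return hists
```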
4. Results

This section presents the results of the performance experiments and the performance model of the FAR accelerator.
4.1. Experimental setup and overall results

To compare the performance of the different implementations, we use a benchmark dataset whose parameters are shown in Table 2.

Table 2. Benchmark dataset parameters.
Number of queries:     25,724
Number of documents:   1,196,711
Number of pairs:       15,146,236
Number of features:    646
Original data size:    3.43 GB
Compressed data size:  748 MB

The performance of each implementation is shown in Table 3. We run the RankBoost algorithm for 300 rounds, which is the typical training length to obtain a model with good ranking accuracy. The initial Naïve M3 algorithm (run on a quad-core Intel Xeon 2.33 GHz processor with 7.99 GB RAM) took about 46 hours to complete the training rounds. The M3.int version took only 4.4 hours. The distributed version with 4 computation threads² further reduced the time to slightly below one hour. Finally, the hardware accelerator plugged into a commodity PC (a Dell GX620 3.00 GHz Intel Pentium 4 system with 2 GB RAM) took only 16 minutes, a speedup of around 16.3 times over the accelerator's software version (M3.int).

Table 3. Computation time for a complete run of 300 rounds.
Implementation                    | Time (hours) | Speedup
SW (Naïve M3)                     | 46.06        | 1
SW (M3.int)                       | 4.43         | 10.40
4-thread distributed SW (M3.int)  | 0.78         | 58.71
HW accelerator (FAR)              | 0.27         | 169.90

² The main program of the distributed implementation runs on the same machine as M3.int, while the four computation threads execute on an 8-core AMD Opteron 2.40 GHz processor with 33.80 GB RAM.
4.2. Performance model

As the program initialization time for the software and hardware implementations is relatively small, we only model the execution time of each round in the training process. Let tSW (tHW) be the mean time of one round in the software (hardware) implementation. In each round, the main routines for both hardware and software are (1) calculating π, (2) searching for the best weak ranker, and (3) updating the pair weights. According to the computational complexities listed in Table 1, we can use a linear model to estimate tSW:

$$t_{SW} = \beta N_{pair} + \alpha_{SW} N_f N_{doc}$$

The first term, βNpair, represents the time for the π calculation and the weight update; the second term is the time for the weak ranker search. This formulation is empirically justified by our observations on real datasets (Subsection 4.3). The first term also applies to tHW, because the hardware version uses the same software routines for these two tasks. Apart from that, the time for the weak ranker search in the hardware implementation can be divided into communication time tcomm and computation time tcomp. Communication refers to the transfer of the integral histograms from the FPGA to the host computer through the 33 MHz PCI bus in DMA mode. These histograms occupy a total of 4·Nf·Nθ bytes of RAM, and the time for the PCI bus to transfer this amount of data can be written as

$$t_{comm} = a N_f N_\theta + b$$

where b is the system overhead of a single DMA transfer. The computation component involves going through all the data in the DDR memory in a streaming manner. Because the DDR memory bandwidth is the bottleneck of the FPGA computation (as argued in Section 3.2), the hardware computation time can be modeled as

$$t_{comp} = \alpha_{HW} N_f N_{doc}$$

where αHW is the reciprocal of the DDR bandwidth and Nf·Ndoc is the size of the compressed feature data in the DDR memory. The total time for one round of hardware training is then

$$t_{HW} = \beta N_{pair} + \alpha_{HW} N_f N_{doc} + a N_f N_\theta + b.$$

The ratio tSW/tHW is the speedup obtained through acceleration.
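Written out as code, the model reads as follows (a direct transcription of the equations above; the coefficient values are placeholders to be fitted, not numbers from the paper):

```python
def t_sw(n_pair, n_f, n_doc, beta, alpha_sw):
    """Software time per round: pi calculation and weight update
    (first term) plus weak ranker search (second term)."""
    return beta * n_pair + alpha_sw * n_f * n_doc

def t_hw(n_pair, n_f, n_doc, n_theta, beta, alpha_hw, a, b):
    """Hardware time per round: the same pi/weight-update term, the
    DDR-bandwidth-bound FPGA computation (alpha_hw = 1/bandwidth),
    and the PCI transfer of the integral histograms (a per
    histogram entry, b per DMA transfer)."""
    return beta * n_pair + alpha_hw * n_f * n_doc + a * n_f * n_theta + b
```

The estimated speedup for a given dataset is then simply t_sw(...) / t_hw(...).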
4.3. Parameter estimation and estimated speedup

In order to estimate the parameters of our performance model, we measured the processing time of the main routines on the benchmark dataset, averaged over 300 rounds, as shown in Table 4. The best-ranker-search routine is clearly the bottleneck of the M3.int software implementation, occupying 98.7% of the per-round computation time. FAR accelerates this task by 67.01 times, giving a per-round speedup of 17.28 times.

Table 4. Time (seconds) of the main routines in each round. Machine specifications are as listed in Subsection 4.1. The optimized SW is described in Subsection 4.4.
Routine              | SW (M3.int) | FAR                          | Optimized SW
π calculation        | 0.110       | 0.176                        | 0.110
Best ranker search   | 52.068      | tcomp: 0.719, tcomm: 0.058   | 26.157
Weight update        | 0.958       | 2.122                        | 0.959
Total time per round | 53.136      | 3.075                        | 27.226
We measured the runtime on datasets with varying Nq, Npair, Nf, Ndoc and Nθ settings, and then applied linear regression to infer the parameters of the performance model. Fig. 7 presents the speedup estimated from the model. The surface represents the speedup (tSW/tHW) with fixed Nθ = 256 and varying Npair, Ndoc and Nf. The values on the Ndoc×Nf axis also represent the size of the compressed feature data to be loaded into the DDR memory of the accelerator; Ndoc×Nf ranges within [0, 2×10⁹] because the board supports up to 2 GB of memory. As the feature data size increases, more speedup is achieved. On the other hand, as Npair increases, the time share of the π calculation and weight update grows, and the overall speedup decreases.

Fig. 7. Speedup of FAR over the software implementation of M3.int for each round. Execution times for varying Npair, Ndoc and Nf values are estimated using the derived performance model.
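Inferring the model parameters is an ordinary least-squares problem; here is a minimal sketch (the argument arrays are placeholders for the per-round times recorded on each dataset configuration):

```python
import numpy as np

def fit_sw_model(n_pair, n_f_times_n_doc, round_times):
    """Least-squares fit of t_SW = beta*N_pair + alpha_SW*N_f*N_doc.
    Each argument is a 1-D array with one entry per measured dataset
    configuration."""
    A = np.column_stack([n_pair, n_f_times_n_doc])
    (beta, alpha_sw), *_ = np.linalg.lstsq(A, round_times, rcond=None)
    return beta, alpha_sw
```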
4.4. Scalability of the FAR accelerator and software optimization

The FAR accelerator can support up to 2 GB of compressed feature data, which corresponds to 8 GB of uncompressed data. To support larger datasets, we can apply distributed computing: different portions of the feature dataset are distributed among multiple FAR-equipped computers, and in each round the host computer calculates and distributes π to the client computers, then collects the results and updates the weights. Another possible solution is to build accelerators with either larger on-board DDR memory or multiple DDR memories.

The dataset size for the software version is limited only by the capacity of the hard disk, because it loads feature data from disk to memory only when the feature is used in a round. However, disk access slows down the software, and faster speed is possible by loading all the feature data beforehand, provided there is enough user memory space to store the whole dataset. The performance of this version is listed in Table 4, and the speedup estimate derived for it is shown in Fig. 8. We note that FAR is still superior to the optimized software version in terms of speed.

Fig. 8. Speedup of FAR over the optimized software version for each round.

4.5. Algorithm quality experiments and new opportunities enabled by FAR

To compare the relevance quality of the generated ranking models, we ran RankBoost with the same settings as RankNet. Ranking accuracy was computed using the Normalized Discounted Cumulative Gain (NDCG) measure [23]. As shown in Fig. 9, RankBoost yields results comparable to RankNet. This baseline performance, as computed using the algorithm in Appendix 1, has since improved significantly as more research effort is put into it [11][12].

Fig. 9. NDCG comparison between RankNet and RankBoost.

Several FAR systems have been built and successfully used in research work by the Web Search and Mining Group at Microsoft Research Asia. The speedup enabled by FAR has made RankBoost more attractive than other candidate algorithms (such as RankNet and RankSVM). For illustration, an application for a commercial search engine requires about 200,000 rounds to obtain a meaningful model, which takes FAR less than 2.5 days to complete. According to the performance model (Subsection 4.2), the distributed software (M3.int) on four servers would need a week to obtain the same result. It is fair to say that our accelerator opens up the opportunity to accomplish such compute-heavy tasks in much less time and at much less expense.
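For reference, the NDCG measure used in Fig. 9 can be sketched as follows (our rendering of the standard definition [23]; gain and discount conventions vary across papers):

```python
import math

def ndcg_at(scores, p):
    """scores: relevance ratings of the returned documents, in ranked
    order.  Returns NDCG@p with gain 2^rel - 1 and log2 discount."""
    def dcg(ranked):
        return sum((2 ** rel - 1) / math.log2(i + 2)
                   for i, rel in enumerate(ranked[:p]))
    ideal = dcg(sorted(scores, reverse=True))  # best possible ordering
    return dcg(scores) / ideal if ideal > 0 else 0.0
```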
5. Conclusions This paper described the design of FAR, an FPGA-based accelerator for the RankBoost algorithm in Web search engines. The main
contributions of this work include (1) the extension of the RankBoost algorithm to improve Web search relevance ranking, (2) an efficient computation method for RankBoost, and (3) the mapping of the algorithm to an efficient hardware accelerator architecture. Empirical results show that FAR accelerates the original software by 169.9 times in a typical run. It has been successfully used by researchers in the area of search relevance ranking. Future work will focus on extending the experience of designing this accelerator to other similar machine learning algorithms.
6. Acknowledgements

The authors would like to thank Qing YU, Chao ZHANG, Tao QIN and Vivy SUHENDRA for their support of this work, and Wei LAI, Jun-Yan CHEN and Tie-Yan LIU for their useful feedback. The authors would also like to thank our editor, Dwight Daniels, for his wonderful work.
7. References
[1] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison Wesley, 1999.
[2] N. Fuhr. Optimum polynomial retrieval functions based on the probability ranking principle. ACM TOIS, 7(3):183–204, July 1989.
[3] W. Fan, et al. Ranking function optimization for effective web search by genetic programming: An empirical study. HICSS, pages 8–16, January 2004.
[4] T. Joachims. Optimizing search engines using clickthrough data. SIGKDD, pages 133–142, 2002.
[5] R. Nallapati. Discriminative models for information retrieval. SIGIR, pages 64–71, 2004.
[6] C. Burges, et al. Learning to rank using gradient descent. ICML, pages 89–96, 2005.
[7] R.E. Schapire. A brief introduction to boosting. In Proc. 6th International Joint Conf. on Artificial Intelligence, 1999.
[8] R.E. Schapire and Y. Singer. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3):297–336, December 1999.
[9] R.E. Schapire. The boosting approach to machine learning: An overview. In MSRI Workshop on Nonlinear Estimation and Classification, 2002.
[10] Y. Freund, et al. An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research, 4:933–969, 2003.
[11] M-F. Tsai, et al. FRank: A ranking method with fidelity loss. Technical Report, Microsoft, November 2006.
[12] Tao Qin, et al. Learning to search web pages with query-level loss functions. Technical Report, Microsoft, November 2006.
[13] Altera Incorporated. http://www.altera.com.
[14] K.D. Underwood and K.S. Hemmert. Closing the gap: CPU and FPGA trends in sustainable floating-point BLAS performance. FCCM, pages 219–228, April 2004.
[15] T. El-Ghazawi, et al. Is high-performance reconfigurable computing the next supercomputing paradigm? Panel session, Proc. ACM/IEEE Conference on Supercomputing, 2006.
[16] S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. In Proc. WWW, pages 107–117, Brisbane, Australia, 1998.
[17] Y. Freund and R.E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, August 1997.
[18] C. Burges. Ranking as learning structured outputs. Proc. NIPS 2005 Workshop on Learning to Rank, pages 7–11, December 2005.
[19] R.D. Iyer, et al. Boosting for document routing. CIKM, pages 70–77, 2000.
[20] X. Liu, et al. Boosting image classification with LDA-based feature combination for digital photograph management. Pattern Recognition: Special Issue on Image Understanding for Digital Photos, 2004.
[21] P. Viola and M.J. Jones. Robust real-time object detection. Technical Report, Compaq Cambridge Research Lab, 2001.
[22] I. Laptev. Improvements of object detection using boosted histograms. In Proc. BMVC, Edinburgh, UK, 2006.
[23] K. Järvelin and J. Kekäläinen. Cumulated gain-based evaluation of IR techniques. ACM TOIS, 20(4):422–446, October 2002.
8. Appendix

Algorithm 1: RankBoost for Relevance Ranking
Given:
  Nq queries {qi | i = 1, ..., Nq}.
  Nd_i documents {dij | j = 1, ..., Nd_i} for each query qi, where $\sum_{i=1}^{N_q} N_{d_i} = N_{doc}$.
  Nf features {fk(dij) | k = 1, ..., Nf} for each document dij.
  Nθ thresholds {θsk | s = 1, ..., Nθ} for each feature fk.
  Npair pairs (dij1, dij2) generated from the ground-truth ratings {R(qi, dij)} (or {Rij}).
Initialize: initial distribution D1(dij1, dij2) over the pairs.
Do for t = 1, ..., T:
  (1) Train WeakLearn using distribution Dt.
  (2) WeakLearn returns a weak hypothesis ht and its weight αt.
  (3) Update weights: for each pair (d0, d1):
      $$D_{t+1}(d_0, d_1) = \frac{D_t(d_0, d_1)\,\exp(\alpha_t (h_t(d_0) - h_t(d_1)))}{Z_t}$$
      where Zt is the normalization factor:
      $$Z_t = \sum_{(d_0, d_1)} D_t(d_0, d_1)\,\exp(\alpha_t (h_t(d_0) - h_t(d_1))).$$
Output: the final hypothesis:
$$H(x) = \sum_{t=1}^{T} \alpha_t h_t(x).$$

Algorithm 2: WeakLearn for RankBoost in Web Relevance Ranking (M3)
Given: distribution D(d0, d1) over all pairs.
Compute:
  (1) For each document d:
      $$\pi(d) = \sum_{d'} \left( D(d', d) - D(d, d') \right)$$
  (2) For every feature fk and every threshold θsk:
      $$r(f_k, \theta_s^k) = \sum_{d:\, f_k(d) > \theta_s^k} \pi(d)$$
  (3) Find the maximum |r(fk*, θs*k*)|.
  (4) Compute:
      $$\alpha = \frac{1}{2} \ln\!\left( \frac{1 + r^*}{1 - r^*} \right)$$
Output: the weak ranking (fk*, θs*k*) and α.
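Step (1) of Algorithm 2 admits a one-pass sketch over the pair weights (illustrative only; the dict representation is an assumption):

```python
from collections import defaultdict

def compute_pi(D):
    """D: dict mapping (d0, d1) -> weight, where d1 should be ranked
    above d0.  Returns pi(d) = sum_{d'} (D(d', d) - D(d, d'))."""
    pi = defaultdict(float)
    for (d0, d1), w in D.items():
        pi[d1] += w  # d1 should win this pair
        pi[d0] -= w  # d0 should lose this pair
    return dict(pi)
```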