Improved Residual Resampling Algorithm and Hardware Implementation for Particle Filters Shaohua Hong, Jianxing Jiang, Lin Wang Department of Communication Engineering Xiamen University Xiamen, Fujian, 361005, P.R. China
[email protected]
Abstract—In this paper, an improved residual resampling (RR) algorithm and a hardware architecture for the efficient hardware implementation of particle filters (PFs) are proposed. By rounding the accumulated product of the non-normalized particle weight and the number of particles, the proposed improved RR algorithm avoids resampling the residuals and thus has only one loop. Mathematical analysis and simulation results confirm that the proposed algorithm guarantees the correct number of resampled particles and performs approximately as well as the traditional systematic resampling (SR) and residual systematic resampling (RSR) algorithms. A compact hardware architecture for the proposed resampling is presented, and the bearings-only tracking (BOT) problem is used for illustration and evaluation. Experimental results indicate that this hardware architecture is efficient in terms of low resource usage and low latency.
Keywords—algorithm; hardware architecture; improved residual resampling; particle filters
I. INTRODUCTION
Particle filters (PFs) [1]-[3], also known as sequential Monte Carlo (SMC) filters, are particularly appropriate for nonlinear and/or non-Gaussian applications. The key idea is to represent the probability density of the state by a weighted set of particles $\{x_k^j, w_k^j\}_{j=1}^{N}$, where $j$ is the particle index and $N$ is the number of particles used. From this weighted set of particles, an optimal estimate can be calculated with respect to different criteria. Generally, there are three important operations in the implementation of PFs: sampling, weight calculation, and resampling. The sampling step generates samples (particles) from the importance density, and the weight calculation step assigns importance weights to the particles based on the observations. The resampling step discards the particles with low normalized importance weights and replaces them with copies of the particles with high normalized importance weights. It is critical because it is an effective way to reduce the degeneracy problem in PFs. Due to its sequential nature, the resampling operation is generally regarded as a bottleneck. There has been considerable interest in this research topic, and various resampling algorithms have been proposed [4]-[9], such as residual resampling
(RR) [4], systematic resampling (SR) [5], and residual systematic resampling (RSR) [6], [7]. For the SR and RSR algorithms, the authors of [10], [11] have proposed corresponding generic hardware architectures and memory schemes for the resampling and sampling steps. For the RR algorithm, however, there is to our knowledge no generic hardware architecture. In the RR algorithm, the number of replicated particles is first calculated by truncating the product of the particle weight and the number of particles. After this step, the number of particles produced is generally less than the required number because of the truncation, so the residues must be resampled to make up the difference. The hardware implementation of the traditional RR algorithm is therefore complicated, because it must decide which particles to replicate additionally so that the total number of particles is correct. A modification is necessary for efficient hardware implementation. In 2004, Hong et al. investigated an RR scheme suitable for high-speed physical realization [12], which processes the residues using a memory-addressing scheme and a tagging method. The scheme guarantees the correct number of particles after resampling; however, the complexity of its hardware implementation is still relatively high. Recently, based on the idea of the RSR algorithm, Feng et al. proposed an improved RR method [13]. This improved RR scheme rounds the accumulated product of the normalized particle weight and the number of particles and thus avoids resampling the residual particles as in the traditional RR algorithm. In this paper, we modify the improved RR algorithm for efficient hardware implementation and propose a compact hardware architecture for it. The rest of the paper is organized as follows. In Section 2, we briefly describe the traditional RR algorithm. 
The proposed resampling algorithm and architecture are described in Section 3. Section 4 evaluates the resource utilization and latency of the architecture on an FPGA platform for the bearings-only tracking (BOT) problem. Finally, the paper concludes with a brief summary in Section 5.
II. REVIEW OF THE TRADITIONAL RR ALGORITHM
Let N be the input number of particles, M be the number of resampled particles, w be an array of importance weights of particles, and the output r be an array of replication factors. The traditional RR algorithm is summarized in Pseudocode 1.
This work was supported by the National Natural Science Foundation of China under Grant No. 61102134.
978-1-4673-5829-3/12/$26.00 ©2012 IEEE
Pseudocode 1: The traditional RR algorithm

(r) = RR(w, N, M)
M_r = M;
for m = 1:N                      // the first step
    r(m) = floor(w(m) * M);
    w_r(m) = w(m) * M - r(m);
    M_r = M_r - r(m);
end
if M_r > 0                       // the second step
    for m = 1:N
        w_r(m) = w_r(m) / M_r;
    end
    (r_r) = SR(w_r, N, M_r);
    for m = 1:N
        r(m) = r(m) + r_r(m);
    end
end

For a more detailed discussion of the traditional RR algorithm, please refer to [4]. Pseudocode 1 shows that the traditional RR algorithm consists of two steps. The first step calculates the number of replications of each particle by truncating the product $w(m) M$ and then computes the residual number of particles $M_r$. The second step resamples the residues, which produces $M_r$ of the final $M$ particles. Finally, the number of replications of each particle is obtained by summing the replication factors produced in these two steps. Note that the second step in Pseudocode 1 is realized by the SR scheme with $N$ input particles and $M_r$ output particles, and that the normalization $w_r(m)/M_r$, $m = 1, \ldots, N$, must be performed before the SR algorithm.
The best case of the traditional RR algorithm arises when $M_r = 0$; in this case there is only one for loop. In all other situations, both steps are required. The worst case occurs when $M_r = N - 1$, where the second step must generate $N - 1$ resampled particles. From an implementation viewpoint, implementing the traditional RR algorithm directly is complicated. Thus, a modification is necessary for efficient hardware implementation.
III. PROPOSED ALGORITHM AND ARCHITECTURE
In this section, we propose an improved RR algorithm and its hardware architecture for the efficient implementation of PFs.
A. Improved Residual Resampling
Recently, the authors of [13] proposed an improved method for the traditional RR algorithm. The key idea is to round the accumulated product of the normalized particle weight and the number of particles. In this subsection, we modify the improved RR algorithm proposed in [13] by incorporating the sum of the weights into the resampling operation, which allows resampling with non-normalized weights. The $N$ divisions required for normalization are thereby replaced with a single division, which is very advantageous for hardware implementation. Let $S$ be the sum of the particle weights. The proposed improved RR algorithm with non-normalized weights is shown in Pseudocode 2.

Pseudocode 2: The proposed improved RR algorithm with non-normalized weights

(i, r) = improved_RR(w, N, M, S)
K = M / S, wcp = 0, cp(0) = 0;
ind_r = 0, ind_d = N - 1;
for j = 1:N
    wcp = wcp + w(j) * K;
    cp(j) = [wcp];               // rounding
    fact = cp(j) - cp(j-1);
    if fact > 0
        i(ind_r) = j, r(ind_r) = fact, ind_r = ind_r + 1;
    else
        i(ind_d) = j, r(ind_d) = fact, ind_d = ind_d - 1;
    end
end
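As an illustration, the single-loop procedure with non-normalized weights can be sketched in Python. This is a minimal sketch rather than the authors' implementation: the function name `improved_rr`, the use of 0-based particle indices, floating-point arithmetic, and the round-half-up rule are all assumptions made here for the example.

```python
import math


def improved_rr(w, M):
    """Single-loop residual resampling with non-normalized weights.

    w: list of non-normalized weights (length N); M: number of output particles.
    Returns (i, r): the index memory and the replication-factor memory.
    Replicated particles fill the bottom of both arrays, discarded ones the top.
    """
    N = len(w)
    S = sum(w)                 # sum of the non-normalized weights
    K = M / S                  # the only division in the algorithm
    i = [None] * N
    r = [0] * N
    ind_r, ind_d = 0, N - 1    # write pointers: bottom (replicated), top (discarded)
    wcp = 0.0                  # accumulated product
    cp_prev = 0                # rounded accumulation of the previous particle, cp(0) = 0
    for j in range(N):
        wcp += w[j] * K
        cp = math.floor(wcp + 0.5)   # rounding mode is an assumption (half up)
        fact = cp - cp_prev          # replication factor of particle j
        if fact > 0:
            i[ind_r], r[ind_r] = j, fact
            ind_r += 1
        else:
            i[ind_d], r[ind_d] = j, fact
            ind_d -= 1
        cp_prev = cp
    return i, r


i_mem, r_mem = improved_rr([1, 4, 2, 9, 4], 5)
print(i_mem)  # [1, 2, 3, 4, 0]
print(r_mem)  # [1, 1, 2, 1, 0]
```

In this small example particle 0 is discarded (its index lands at the top of the index memory) and particle 3 is replicated twice, so the replication factors sum to $M = 5$.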
It is seen from Pseudocode 2 that the number of replications of the $i$th particle in the proposed improved RR algorithm is given by

$$r(i) = cp(i) - cp(i-1) = \left[\sum_{k=1}^{i} w_k \frac{M}{S}\right] - \left[\sum_{k=1}^{i-1} w_k \frac{M}{S}\right] \tag{1}$$

where $[\cdot]$ denotes rounding. Because the sum over $i$ telescopes and $cp(0) = 0$, the total number of particles after resampling is given by

$$\sum_{i=1}^{N} r(i) = \left[\sum_{k=1}^{N} w_k \frac{M}{S}\right] = \left[\frac{M}{S}\sum_{k=1}^{N} w_k\right] = \left[\frac{M}{S}\, S\right] = M \tag{2}$$

Hence the proposed improved RR algorithm guarantees the correct number of particles after resampling.
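The counting argument in (1) and (2) can be checked numerically. The sketch below is illustrative only: the helper names, the random test weights, and the round-half-up rule are assumptions. It contrasts the rounded accumulation with the truncation used in the first step of the traditional RR algorithm, which generally leaves a residual $M_r > 0$.

```python
import math
import random


def rounded_accumulation_factors(w, M):
    """r(i) = cp(i) - cp(i-1), where cp(i) is the rounded accumulated product."""
    K = M / sum(w)
    wcp, cp_prev, r = 0.0, 0, []
    for wj in w:
        wcp += wj * K
        cp = math.floor(wcp + 0.5)   # assumed rounding rule (half up)
        r.append(cp - cp_prev)
        cp_prev = cp
    return r


def truncated_factors(w, M):
    """First step of the traditional RR algorithm on normalized weights."""
    S = sum(w)
    return [math.floor(wj / S * M) for wj in w]


random.seed(0)
M = 1000
w = [random.random() for _ in range(1000)]   # arbitrary non-normalized weights

r_new = rounded_accumulation_factors(w, M)
r_trunc = truncated_factors(w, M)

print(sum(r_new))          # 1000: the sum telescopes to cp(N) = M
print(M - sum(r_trunc))    # residual M_r that the second RR step would have to resample
```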
We evaluate the performance of the proposed algorithm by applying it to the BOT problem and computing the root mean square error (RMSE) of the position estimate. The RMSE is defined by

$$\mathrm{RMSE} = \sqrt{\frac{1}{P_l}\sum_{t=1}^{P_l}\frac{1}{N_{mc}}\sum_{i=1}^{N_{mc}}\left[\left(\hat{x}_t^i - x_t^{true}\right)^2 + \left(\hat{y}_t^i - y_t^{true}\right)^2\right]} \tag{3}$$

where $P_l$ is the simulation path length, $N_{mc}$ is the number of Monte Carlo runs, and $\hat{x}_t^i$, $\hat{y}_t^i$ are the filter estimates at time $t$ in the $i$th Monte Carlo run. Fig. 1 shows the RMSE values for position as a function of the number of particles for the BOT problem [5], where the simulation path length is $P_l = 24$ and the number of Monte Carlo runs is $N_{mc} = 1000$. One can observe from Fig. 1 that the proposed improved RR algorithm shows approximately the same RMSE performance as the SR and RSR algorithms.

[Fig. 1. RMSE values for position. Axes: position RMSE versus number of particles (512 to 8192). Curves: SR algorithm, RSR algorithm, proposed improved RR algorithm.]

From Pseudocode 2, one can also see that the proposed improved RR algorithm has a single for loop and that its processing time is independent of the input data. This allows a regular pipelined structure that is easy to implement in hardware. Further, resampling with the proposed improved algorithm takes only $N + L$ clock cycles in hardware, where $L$ is the latency of the data path. Table I shows the number of operations for a MATLAB-type implementation of the different resampling algorithms.

TABLE I. COMPARISON OF THE NUMBER OF OPERATIONS FOR DIFFERENT RESAMPLING ALGORITHMS

Algorithms        SR      RR (best case)   RR (worst case)   RSR   Proposed improved RR
Multiplications   0       N                N                 N     N
Additions         2M+N    2N               6N                3N    2N
Comparisons       N+M     0                3N                0     0

Upon inspection of Table I, the number of operations required by the proposed improved RR algorithm is equal to that of the best case of the traditional RR algorithm. Thus, the computational complexity of the proposed resampling is reduced and the processing efficiency is improved. When the input number of particles equals the output number of particles, that is, $N = M$, the proposed algorithm requires $N$ fewer additions than the SR and RSR algorithms, and both the proposed improved RR and RSR algorithms perform $N$ multiplications. Since multiplication is more complex than addition, the SR algorithm is the least complex, and the proposed improved RR algorithm is less complex than the RSR algorithm. The corresponding operations in the sampling step for the proposed resampling algorithm are the same as those of the RSR algorithm, shown in Pseudocode 3.

Pseudocode 3: Memory-related operations of sampling step [11]

(x) = Sampling(i, r, x)
ind_r = 0, ind_d = M - 1;
for ind_r = 1 to length(ind_r)
    Reg = x(i(ind_r));
    x(i(ind_r)) = sample(Reg), ind_r = ind_r + 1;
    for j = r(ind_r) - 1 down to 1
        x(i(ind_d)) = sample(Reg), ind_d = ind_d - 1;
    end
end

B. Proposed Hardware Architecture
Fig. 2 shows the architecture for the proposed improved RR algorithm. The weights are stored in the memory labeled MEMW and addressed by the address counter C1, which counts from 0 to $N-1$. Each weight (W) is multiplied by the value $K = M/S$, and the product is summed in an accumulator. The accumulated value is rounded and stored in a temporary register Reg. Subtracting the previous rounded value held in Reg from the current rounded value gives the number of replications of the current particle. If the particle is replicated, its index is written to the bottom part of the index memory labeled MEMi; that is, the counter labeled Counter_r counts up from the minimum address of MEMi. Otherwise, the index of the discarded particle is written to the top part of MEMi; that is, the counter labeled Counter_d counts down from the maximum address of MEMi. The appropriate replication factor is written to the corresponding location in the factor memory labeled MEMr. 
The indexes of the discarded particles are recorded to indicate where the information of the replicated particles should be stored in the memory of the sample unit.
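To make the role of the two memory regions concrete, the following Python sketch shows one way the recorded indexes and replication factors could drive an in-place update of the particle memory. It is an interpretation under stated assumptions, not the exact memory scheme of [10], [11]: the function name `replicate_in_place`, the `propagate` callback standing in for the sampling step, 0-based indices, and the restriction $N = M$ are all introduced here for illustration.

```python
def replicate_in_place(x, i_mem, r_mem, propagate=lambda p: p):
    """Overwrite discarded slots of particle memory x with copies of replicated particles.

    x: particle memory (length N, assumed equal to M).
    i_mem: index memory, replicated indexes at the bottom, discarded ones at the top.
    r_mem: replication factors stored at the same addresses as i_mem.
    propagate: hypothetical stand-in for drawing a new sample from a particle.
    """
    ind_r, ind_d = 0, len(x) - 1
    # Walk the bottom part of the memories until the first discarded entry.
    while ind_r < len(x) and r_mem[ind_r] > 0:
        reg = x[i_mem[ind_r]]                 # read the replicated particle once
        x[i_mem[ind_r]] = propagate(reg)      # first copy stays in its own slot
        for _ in range(r_mem[ind_r] - 1):     # extra copies go to discarded slots
            x[i_mem[ind_d]] = propagate(reg)
            ind_d -= 1
        ind_r += 1
    return x


particles = ["a", "b", "c", "d", "e"]
# Particle 0 is discarded and particle 3 is replicated twice in this example.
print(replicate_in_place(particles, [1, 2, 3, 4, 0], [1, 1, 2, 1, 0]))
# ['d', 'b', 'c', 'd', 'e']
```

Because every extra copy is written to an address taken from the top of the index memory, no surviving particle is overwritten before it has been read.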
[Fig. 2. The architecture for the proposed improved RR algorithm combined with the particle propagation. Blocks shown: weight memory (MEMW), address counter C1, multiplier by K = M/S, accumulator, rounding unit, register Reg, comparator, index generator with Counter_r and Counter_d, replicated and discarded index memory (MEMi), and replication factor memory (MEMr).]
Once the resampling is done, the replicated and discarded particle indexes are located in the bottom and top parts of the index memory, respectively. The architecture for the memory-related operations in the sampling step is the same as that of the RSR algorithm. For a more detailed discussion, please refer to [10], [11].
IV. EXPERIMENTAL RESULTS AND EVALUATION
In this section, the results of implementing the proposed architecture are presented, and the proposed scheme is compared with the SR and RSR schemes.
[Fig. 3. Results of the BOT problem utilizing the proposed improved RR scheme. Axes: y position versus x position. Legend: true track, measurement point, Matlab, FPGA.]
TABLE II. RESOURCE UTILIZATION OF THE RESAMPLING ARCHITECTURE ON THE XC2VP50-FF1152 DEVICE

Resource            SR scheme   RSR scheme   Improved RR scheme
Slices              199         294          279
Slice Registers     130         224          286
4-input LUTs        232         348          384
Block RAMs          7           6            6
Block Multipliers   0           1            1
The results of implementing the sampling importance resampling filter (SIRF) with 2048 particles on a Xilinx Virtex-II Pro device (XC2VP50-FF1152) for the BOT problem are shown in Fig. 3. The state in this model is four-dimensional and consists of the position coordinates and velocities in the x and y directions. The observation is the time-varying bearing of the moving target with respect to a fixed measurement point. Fig. 3 shows that the result of the hardware experiment agrees well with the simulated one and that both are close to the true track. Thus, one can conclude that the proposed improved RR algorithm is effective in PFs.
A. Resource Utilization
Table II shows the utilization of various resources by the proposed architecture. For comparison, the SR and RSR schemes are also listed. The evaluation is conducted using $N = M = 2048$ particles and an 18-bit representation for the particle weights. The data in Table II for the SR and RSR schemes are taken directly from [11]. Upon inspection of Table II, the proposed improved RR scheme requires more slices, slice registers, 4-input LUTs, and block multipliers but fewer block RAMs than the SR scheme, and the resources used by the proposed scheme are almost the same as those required by the RSR scheme. Thus, the proposed scheme has low complexity.
[Fig. 4. Timing of operations in SIRF. Sample, importance, and resample stages of consecutive iterations, annotated with $L_S$, $L_I$, $N$, $T_{res}$, and $T_{SIRF}$.]
TABLE III. EXECUTION TIME FOR THE RESAMPLING ALGORITHMS

Algorithms           Execution time (clock cycles)
SR scheme            N + M - 1
RSR scheme           N + L
Improved RR scheme   N + L
B. Execution Time
Fig. 4 shows the timing of operations of the SIRF. The resampling step is a bottleneck due to its sequential nature. Thus, the development of faster and more efficient resampling algorithms is vital to the implementation of real-time particle filters in high-speed applications. Table III shows the execution time of the different resampling algorithms. The execution time of the proposed improved RR algorithm is the same as that of the RSR scheme, namely $N + L$ cycles. Both are lower than that of the SR scheme, which requires $N + M - 1$ cycles. For the BOT problem, the sample and importance computation units have latencies of $L_S = 8$ cycles and $L_I = 53$ cycles, respectively. With $N = M = 2048$ particles, the cycle time of the SIRF using the proposed improved RR scheme is given by

$$T = (N + L_S + L_I + T_{res})\,T_{clk} = [2048 + 8 + 53 + (2048 + 1)]\,T_{clk} = 4158\,T_{clk} \tag{4}$$
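The arithmetic in (4) can be reproduced with a few lines of Python. This is only a worked check under stated assumptions: the function name `sirf_cycles` is hypothetical, the data-path latency $L = 1$ is inferred from the $(2048 + 1)$ term in (4), and the 100 MHz clock is the operating frequency quoted in the text.

```python
def sirf_cycles(n, l_s, l_i, t_res):
    """Clock cycles per SIRF iteration: N + L_S + L_I + T_res."""
    return n + l_s + l_i + t_res


N = 2048             # number of particles (N = M)
L_S, L_I = 8, 53     # latencies of the sample and importance units, in cycles
L = 1                # resampling data-path latency implied by (4)
T_RES = N + L        # execution time of the improved RR scheme, in cycles

cycles = sirf_cycles(N, L_S, L_I, T_RES)
f_clk = 100e6        # clock frequency in Hz
rate = f_clk / cycles

print(cycles)        # 4158
print(rate)          # filter iterations per second, roughly 24 kHz
```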
The designed hardware can support clock frequencies of up to 120 MHz. With a clock frequency of 100 MHz, one filter iteration takes 4158 clock cycles, so the processing speed of the proposed scheme is about 24 kHz.
V. CONCLUSION
In this paper, we presented an improved RR algorithm and architecture for the hardware implementation of PFs. Mathematical analysis and simulation results have confirmed that the proposed improved algorithm guarantees the correct number of particles after resampling and performs approximately as well as the traditional SR and RSR algorithms. An experimental study on a Xilinx Virtex-II Pro FPGA platform shows that this hardware architecture is efficient in terms of resource usage and latency.
REFERENCES
[1] M. S. Arulampalam, S. Maskell, N. Gordon, and T. Clapp, “A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking,” IEEE Trans. Signal Processing, vol. 50, no. 2, pp. 174-187, Feb. 2002.
[2] A. Doucet, N. de Freitas, and N. Gordon, Eds., Sequential Monte Carlo Methods in Practice, New York, NY, USA: Springer-Verlag, 2001.
[3] B. Ristic, S. Arulampalam, and N. Gordon, Beyond the Kalman Filter: Particle Filters for Tracking Applications, Boston, London: Artech House, 2004.
[4] E. R. Beadle and P. M. Djurić, “A fast weighted Bayesian bootstrap filter for nonlinear model state estimation,” IEEE Trans. Aerospace and Electronic Systems, vol. 33, pp. 338-343, 1997.
[5] N. J. Gordon, D. J. Salmond, and A. F. M. Smith, “A novel approach to nonlinear and non-Gaussian Bayesian state estimation,” IEE Proceedings F, vol. 140, pp. 107-113, 1993.
[6] M. Bolić, P. M. Djurić, and S. Hong, “New resampling algorithms for particle filters,” IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 2, pp. 589-592, 2003.
[7] M. Bolić, P. M. Djurić, and S. Hong, “Resampling algorithms for particle filters: a computational complexity perspective,” EURASIP Journal of Applied Signal Processing, no. 15, pp. 2267-2277, 2004.
[8] X. Fu and Y. Jia, “An improvement on resampling algorithm of particle filter,” IEEE Trans. Signal Processing, vol. 58, no. 10, pp. 5414-5420, Oct. 2010.
[9] S. H. Hong, Z. G. Shi, J. M. Chen, and K. S. Chen, “A low-power memory-efficient resampling architecture for particle filters,” Circuits, Systems, and Signal Processing, vol. 29, no. 1, pp. 155-167, 2010.
[10] A. Athalye, M. Bolić, P. M. Djurić, and S. Hong, “Architectures and memory schemes for sampling and resampling in particle filters,” Proc. of Digital Signal Processing Workshop, 2004.
[11] A. Athalye, M. Bolić, S. Hong, and P. M. Djurić, “Generic hardware architectures for sampling and resampling in particle filters,” EURASIP Journal of Applied Signal Processing, no. 17, pp. 2888-2902, 2005.
[12] S. Hong, M. Bolić, and P. M. Djurić, “An efficient fixed-point implementation of residual resampling scheme for high-speed particle filters,” IEEE Signal Processing Letters, vol. 11, no. 5, pp. 482-485, 2004.
[13] C. Feng, N. Zhao, and M. Wang, “Improving the residual resampling algorithm,” Journal of Harbin Engineering University, vol. 31, no. 1, pp. 120-124, 2010.