RadioVetenskap och Kommunikation, Linköping, June 14-16, 2005
CARRY-SAVE ADDER BASED DIFFERENCE METHODS FOR MULTIPLE CONSTANT MULTIPLICATION IN HIGH-SPEED FIR FILTERS
Oscar Gustafsson, Henrik Ohlsson, and Lars Wanhammar
Department of Electrical Engineering, Linköping University, SE-581 83 Linköping, Sweden
Tel: +46-13-28 40 59, Fax: +46-13-13 92 82, E-mail: {oscarg, henriko, larsw}@isy.liu.se
ABSTRACT
Multiple constant multiplication (MCM) is the problem of realizing several constant multiplications of one input data sample using as few additions/subtractions as possible. Almost all previously proposed MCM algorithms have used carry-propagation adders. However, for high-speed applications carry-save adders are a better choice. Although it is possible to map a carry-propagation adder design to carry-save adders, this mapping is inconsistent: the number of carry-save adders required for a given number of carry-propagation adders varies between multiplier blocks. In this work a difference-based MCM algorithm for carry-save adders is proposed and evaluated.
Figure 1. (a) Carry-propagation adder symbol and structure. (b) Carry-save adder symbol and structure.
1. INTRODUCTION
In many DSP algorithms one data sample is multiplied by several constants. A typical example is the transposed form FIR filter, where one input data sample is multiplied by all the filter coefficients. By expressing each multiplication as shifts and additions or subtractions, a multiplierless realization is obtained that is efficient for implementation. As additions and subtractions are similar operations, we consider them equal and use the term addition for both. The hardware cost can be further decreased by utilizing redundancy between the coefficients [1]–[11]. This is known as multiple constant multiplication (MCM).

Previous work in this area can be divided into three techniques. The first is based on pattern matching [2]–[4], [7], [9], and the result depends on the initial representation of the constants; these algorithms are often referred to as subexpression sharing or subexpression elimination. The second technique constructs the MCM block independently of the representation and only considers the values after each addition [1], [10]. Finally, the difference methods work by pairing coefficients so that only a simple difference coefficient is required to compute one coefficient from the other [5], [6], [8], [11].

However, all algorithms except the one proposed in [10] are based on carry-propagation adders (CPAs) with two inputs and one output, while for many high-speed applications carry-save arithmetic is preferred [12], [13]. By using carry-save adders (CSAs) the need for carry propagation within the adder is avoided, and the latency of one addition equals the gate delay of a full adder, independently of the data wordlength.
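As a small illustration of the shift-and-add principle described above (the helper names below are only illustrative, not taken from the paper), a constant multiplication costs one addition or subtraction per non-zero signed digit beyond the first:

    # 41 = 2^5 + 2^3 + 2^0, so 41*x costs two additions.
    def times_41(x: int) -> int:
        return (x << 5) + (x << 3) + x

    # 31 = 2^5 - 2^0 in canonic signed digit form: one subtraction replaces four additions.
    def times_31(x: int) -> int:
        return (x << 5) - x

    assert times_41(3) == 123 and times_31(3) == 93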
Figure 2. FIR filter using carry-save arithmetic. Bold lines indicate data in carry-save representation.
The carry-save adder has three inputs and two outputs, where the two outputs together form the result: one output is the sum word and the other is the carry word. As the output from the CSA is in a redundant format, sign-extension is not straightforward. However, a simple correction scheme is presented in [12]; this correction must be applied before each sign-extension of data in carry-save representation. A carry-propagation adder is shown in Fig. 1(a) and a carry-save adder in Fig. 1(b). The length of the carry output increases by one bit after each addition, but by using the carry overflow detection proposed in [12] the length can be kept constant.

If the input to the multiplier is in carry-save format, previously proposed multiplier blocks can be used by replacing each adder with two carry-save adders. However, if the input is a single binary word this mapping to carry-save adders is suboptimal. This is the case in, for example, transposed direct form FIR filters, as shown in Fig. 2, where the bold lines denote data in carry-save representation. Carry propagation is performed only in the final vector merging adder (VMA). This has been shown to be an efficient way of implementing high-speed FIR filters [14], [15]. It has been shown that the mapping from CPAs to CSAs is inconsistent in the number of adders required [10], [16], since a CPA can represent zero, one, or two CSAs depending on the data types of its inputs (non-redundant binary or carry-save representation).
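To make the carry-save representation concrete, the following word-level sketch (an illustrative helper, not from the paper) compresses three operands into a sum word and a carry word with no carry propagation; a single carry-propagate addition, corresponding to the VMA, then recovers the non-redundant result:

    def carry_save_add(a: int, b: int, c: int) -> tuple[int, int]:
        # Each bit position acts as an independent full adder:
        # sum bit = a ^ b ^ c, carry bit = majority(a, b, c), weighted one position up.
        s = a ^ b ^ c
        carry = ((a & b) | (a & c) | (b & c)) << 1
        return s, carry

    # The redundant pair (s, carry) represents a + b + c; one carry-propagate
    # addition (the vector merging adder) produces the ordinary binary result.
    s, carry = carry_save_add(11, 22, 33)
    assert s + carry == 11 + 22 + 33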
Figure 3. (a) Computing one coefficient from another coefficient and a difference. (b) Difference stage computing several coefficients from a set of differences.
Figure 4. Graph for the coefficient set {41, 55, 79} with edge costs and corresponding difference sets indicated.
From a power consumption point of view, it has been shown that a combination of CPAs and CSAs may be the best choice [17]. However, for high-speed applications a pure CSA implementation is preferred.
2. DIFFERENCE METHODS
The key idea in difference methods for MCM is illustrated in Fig. 3(a). Here, the coefficient h1 is computed from the coefficient h2 using a difference d1. Note that h2 can equally well be derived from h1 and d1. A difference method tries to find the best combination of h1, h2, and d1 so that the total number of additions/subtractions is minimized. When all coefficients, the coefficient set, are considered, a difference stage as illustrated in Fig. 3(b) is obtained. The difference stage realizes all coefficients by combining coefficients and differences in an arbitrary manner. The inputs to the difference stage are the required differences. As these correspond to an MCM problem themselves, the same approach can be applied again, until only the coefficient 1 is required.

The concept of MST-based MCM was introduced simultaneously in [5] and [6]. Combining those approaches led to the algorithm in [8], which was further improved in [11]. In MST-based MCM a graph is formed where each node (vertex) corresponds to a coefficient and each edge (arc) corresponds to one or more minimum cost differences. The edge weight is a measure of the implementation cost of the difference(s). Hence, finding a connected subgraph spanning all nodes with minimum total edge weight (a minimum spanning tree, MST) leads to a difference stage with small implementation cost. The graph always includes the coefficient 1, which acts as a root node¹ for the minimum spanning tree. The cost of a difference is here defined as the number of non-zero terms in a canonic signed digit representation of the difference, which is computed as
c_{i,j} = \min_{k,l,m} \#\mathrm{CSD}\left( 2^{k} h_i + (-1)^{m} 2^{l} h_j \right) \qquad (1)

where #CSD denotes the number of non-zero terms in the canonic signed digit representation. Hence, the sign and the shifts of the coefficients are selected to minimize the complexity of the difference.
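A minimal sketch of this cost measure, assuming the usual non-adjacent-form recoding for counting CSD digits and a bounded search over shifts (csd_cost, difference_cost, and max_shift are illustrative names, not from the paper):

    def csd_cost(n: int) -> int:
        # Number of non-zero digits in the canonic signed-digit (non-adjacent form) of n.
        n = abs(n)
        count = 0
        while n:
            if n & 1:
                count += 1
                n -= 2 - (n & 3)  # digit +1 if n mod 4 == 1, digit -1 if n mod 4 == 3
            n >>= 1
        return count

    def difference_cost(hi: int, hj: int, max_shift: int = 12) -> int:
        # Eq. (1): minimize #CSD(2^k*hi + (-1)^m * 2^l*hj) over shifts k, l and sign m.
        return min(csd_cost((hi << k) + sign * (hj << l))
                   for k in range(max_shift + 1)
                   for l in range(max_shift + 1)
                   for sign in (+1, -1))

    assert csd_cost(7) == 2              # 7 = 8 - 1
    assert difference_cost(55, 41) == 2  # matches the edge costs of 2 in Fig. 4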
Figure 5. One possible MST for the graph in Fig. 4.
Figure 6. (a) Graph for the second iteration using the differences {3, 9}. (b) MST for the graph in (a).
In Fig. 4 a graph for the coefficient set {41, 55, 79} is illustrated. From the graph it is possible to derive a minimum spanning tree, i.e., a set of edges with minimum total weight such that all nodes are connected. This is a polynomial time problem and several different algorithms are available [18]. In Fig. 5 one of several possible MSTs is illustrated. Here, the total cost is six. This means that six adders are required if the MCM is implemented as shown, with each difference implemented as a separate multiplier. However, it is possible to apply the same methodology to the selected differences. Hence, the MST weight is an upper bound on the number of adders required for each stage. Selecting the differences 3 and 9 for the next stage, the number of differences is reduced from three to two, which should correspond to a saving in adders in most cases. The graph and MST for the coefficient set {3, 9} are illustrated in Fig. 6. Here, the cost of the MST is two and the difference to realize in the next stage is 1. As 1 is a trivial coefficient to realize, no more iterations need to be performed. For this example a realization with five adders was found. The initial MST indicated a cost of six adders, but as one of the required differences was used twice, the cost was reduced.
¹ Once the MST is computed, a directed acyclic graph (DAG) may be derived by directing all edges from the root node.
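As a rough illustration of one iteration of this procedure (a simplified sketch, not the exact algorithm of [8] or [11]), the code below builds the complete graph over the root node 1 and the coefficient set, weights the edges with the illustrative difference_cost() helper from the earlier sketch, and extracts an MST with Prim's algorithm; the selected differences would then form the coefficient set of the next stage:

    def prim_mst(nodes, cost):
        # Minimum spanning tree of the complete graph on `nodes` (Prim's algorithm).
        nodes = list(nodes)
        in_tree = [nodes[0]]      # start from the root node, 1
        edges = []
        while len(in_tree) < len(nodes):
            u, v = min(((a, b) for a in in_tree for b in nodes if b not in in_tree),
                       key=lambda e: cost(e[0], e[1]))
            edges.append((u, v, cost(u, v)))
            in_tree.append(v)
        return edges

    # Coefficient set {41, 55, 79} from Fig. 4, with the root node 1 included.
    tree = prim_mst([1, 41, 55, 79], difference_cost)
    print(tree)                                      # three edges, each of cost 2
    print("MST weight:", sum(w for *_, w in tree))   # 6, as in the walkthrough above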
3. PROPOSED DIFFERENCE METHOD FOR CARRY-SAVE ADDERS
The proposed algorithm is based on the algorithm in [11]. However, there are several modifications due to the structure of CSAs. First, coefficients with two non-zero digits do not require any CSA, as the two words to be added (or subtracted) can be seen as the sum and carry words. The difference cost corresponds to the number of CSAs required to compute a coefficient from another coefficient plus a difference. For the CSA case there are two different situations, depending on whether the output of the coefficient used is in carry-save representation or in non-redundant binary representation. However, the only coefficient with an output in non-redundant binary representation is the initial node, 1. Hence, two different cost measures must be used depending on whether one of the coefficients is 1. The difference cost for the CSA case is then

c_{i,j} = \begin{cases}
  \min_{k,l,m} \#\mathrm{CSD}\left( 2^{k} h_i + (-1)^{m} 2^{l} h_j \right), & h_i, h_j \neq 1 \\
  \min_{k,l,m} \#\mathrm{CSD}\left( 2^{k} h_i + (-1)^{m} 2^{l} h_j \right) - 1, & \text{otherwise.}
\end{cases} \qquad (2)
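A direct transcription of Eq. (2), again reusing the illustrative difference_cost() helper from the earlier sketch (names are assumptions, not from the paper):

    def csa_difference_cost(hi: int, hj: int) -> int:
        # Eq. (2): one CSA is saved when one of the coefficients is the root node 1,
        # since its output is a single non-redundant binary word. Assumes hi != hj.
        cost = difference_cost(hi, hj)
        return cost if hi != 1 and hj != 1 else cost - 1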
Figure 7. Average number of adders required using different methods for sets of 25 coefficients. Averages over 100 sets.
The number of CSAs required in a difference stage also depends in the same way on whether a coefficient is derived from 1 or not. If the edge originates from the root node, 1, only one CSA is required, as one of the inputs to the CSA is in non-redundant binary format. The same is true if the edge weight is 1, as this implies that either the edge originates from the root node and a difference with two non-zero terms is added, or the difference to be added is 1 (or a power of two times 1). For the remaining cases the origin of the incoming edge corresponds to data in carry-save representation and the difference is also represented in carry-save form. This gives the following expression for the number of CSAs required to realize each coefficient in a difference stage (corresponding to each node in an MST):

n_{\mathrm{CSA}} = \begin{cases}
  1, & f = 1 \\
  1, & f \neq 1,\; c = 1 \\
  2, & f \neq 1,\; c \neq 1
\end{cases} \qquad (3)

where c is the weight (cost) of the incoming edge and f is the coefficient corresponding to the origin of the incoming edge.
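And a direct transcription of Eq. (3) (the function name is illustrative):

    def csas_per_coefficient(c: int, f: int) -> int:
        # Eq. (3): CSAs needed for a node whose incoming MST edge has weight c
        # and whose origin (parent) coefficient is f.
        if f == 1:   # parent output is a single non-redundant binary word
            return 1
        if c == 1:   # edge weight 1: the difference is trivial (1 or a power of two)
            return 1
        return 2     # parent output and difference are both in carry-save form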
Figure 8. Average number of adders required using different methods for sets with 11-bit coefficients. Averages over 100 sets.
4. RESULTS

To evaluate the proposed algorithm, a number of multiple constant multiplication algorithms are used to compute the required number of CSAs: the algorithm in [10], here denoted RCSAG-n, the subexpression sharing algorithm in [4], modified for CSAs, and the RAG-n algorithm in [1]. As the RAG-n algorithm uses CPAs, the resulting structure is transformed to CSAs. Also, for comparison, the actual number of CPAs is shown.

In Fig. 7, the average number of adders for sets of 25 coefficients is shown for varying coefficient wordlengths. In Fig. 8, the average number of adders for sets with 11-bit coefficients is shown for a varying number of coefficients. As can be seen from Figs. 7 and 8, the average number of CSAs for the proposed algorithm is lower than for the subexpression sharing method in [4] and for the transformed RAG-n approach, while comparison with the RCSAG-n algorithm shows that the proposed approach requires a slightly higher number of CSAs. Compared with the method in [4], the savings are mainly due to the fact that our method is representation independent. Hence, slightly better results might have been obtained with the method in [4] if more possible representations were examined; however, this would drastically increase the computational complexity.
Compared with the transformed RAG-n approach, the proposed approach is better due to the inconsistency in the mapping from CPAs to CSAs: the RAG-n algorithm minimizes the number of CPAs, which does not necessarily minimize the number of CSAs.
5. CONCLUSIONS

In this work a difference method for the multiple constant multiplication problem using carry-save adders has been proposed. The results indicate that the proposed method gives a slightly higher average number of carry-save adders compared with the RCSAG-n method, but requires fewer adders than methods based on subexpression sharing. However, compared with the RCSAG-n method the proposed method has a smaller computational complexity.
REFERENCES
[1] A. G. Dempster and M. D. Macleod, "Use of minimum-adder multiplier blocks in FIR digital filters," IEEE Trans. Circuits Syst.–II, vol. 42, no. 9, pp. 569–577, Sept. 1995.
[2] M. Potkonjak, M. B. Srivastava, and A. P. Chandrakasan, "Multiple constant multiplications: efficient and versatile framework and algorithms for exploring common subexpression elimination," IEEE Trans. Computer-Aided Design, vol. 15, no. 2, pp. 151–165, Feb. 1996.
[3] R. I. Hartley, "Subexpression sharing in filters using canonic signed digit multipliers," IEEE Trans. Circuits Syst.–II, vol. 43, pp. 677–688, Oct. 1996.
[4] R. Pasko, P. Schaumont, V. Derudder, S. Vernalde, and D. Durackova, "A new algorithm for elimination of common subexpressions," IEEE Trans. Computer-Aided Design Integrated Circuits, vol. 18, no. 1, pp. 58–68, Jan. 1999.
[5] K. Muhammad and K. Roy, "A graph theoretic approach for synthesizing very low-complexity high-speed digital filters," IEEE Trans. Computer-Aided Design, vol. 21, no. 2, Feb. 2002.
[6] O. Gustafsson and L. Wanhammar, "A novel approach to multiple constant multiplication using minimum spanning trees," in Proc. IEEE Midwest Symp. Circuits Syst., Tulsa, OK, Aug. 4–7, 2002, vol. 3, pp. 652–655.
[7] M. Martínez-Peiró, E. Boemo, and L. Wanhammar, "Design of high speed multiplierless filters using a nonrecursive signed common subexpression algorithm," IEEE Trans. Circuits Syst.–II, vol. 49, no. 3, pp. 196–203, Mar. 2002.
[8] H. Ohlsson, O. Gustafsson, and L. Wanhammar, "Implementation of low-complexity FIR filters using a minimum spanning tree," in Proc. IEEE Mediterranean Electrotechnical Conf., Dubrovnik, Croatia, May 12–15, 2004, pp. 261–264.
[9] F. Xu, C.-H. Chang, and C.-C. Jong, "Efficient algorithms for common subexpression elimination in digital filter design," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, May 17–21, 2004, vol. 5, pp. 137–140.
[10] O. Gustafsson, A. G. Dempster, and L. Wanhammar, "Multiplier blocks using carry-save adders," in Proc. IEEE Int. Symp. Circuits Syst., Vancouver, Canada, May 23–26, 2004.
[11] O. Gustafsson, H. Ohlsson, and L. Wanhammar, "Improved multiple constant multiplication using minimum spanning trees," in Proc. Asilomar Conf. Signals, Syst., Comp., Monterey, CA, Nov. 7–10, 2004.
[12] T. G. Noll, "Carry-save architectures for high-speed digital signal processing," J. VLSI Signal Processing, vol. 3, pp. 121–140, 1991.
[13] M. Martínez-Peiró and L. Wanhammar, "High-speed low-complexity FIR-filter using multiplier block reduction and polyphase decomposition," in Proc. IEEE Int. Symp. Circuits Syst., Geneva, Switzerland, May 2000.
[14] R. Jain, P. Young, and T. Yoshino, "FIRGEN: A computer-aided system for high performance FIR filter integrated circuits," IEEE Trans. Signal Processing, vol. 39, pp. 1655–1668, July 1991.
[15] R. A. Hawley, B. C. Wong, T.-J. Lin, J. L. Laskowski, and H. Samueli, "Design techniques for silicon compiler implementations of high-speed FIR digital filters," IEEE J. Solid-State Circuits, vol. 31, pp. 656–667, May 1996.
[16] O. Gustafsson, H. Ohlsson, and L. Wanhammar, "Minimum-adder integer multipliers using carry-save adders," in Proc. IEEE Int. Symp. Circuits Syst., Sydney, Australia, May 6–9, 2001, vol. II, pp. 709–712.
[17] V. A. Bartlett and A. G. Dempster, "Using carry-save adders in low-power multiplier blocks," in Proc. IEEE Int. Symp. Circuits Syst., Sydney, Australia, May 6–9, 2001, vol. IV, pp. 222–225.
[18] C. Bazlamaçcı and K. Hindi, "Minimum-weight spanning tree algorithms: A survey and empirical study," Computers & Operations Research, vol. 28, no. 8, pp. 767–785, 2001.