IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 48, NO. 3, MARCH 2000
sion logic back end, we add high-resolution timing and sophisticated decision logic. The detector is the first reported ITU-compliant detector that can be implemented on an 8-bit microcontroller. It is also amenable to multichannel DTMF detection on a digital signal processor and to the single-instruction multiple-data processing found in modern general-purpose processors, such as Intel's Pentium MMX extensions. We have released a Matlab version of the detector at ftp://pepperoni.ece.utexas.edu/pub/dtmf/mat8bit.zip. For future research, reliable detection of tones shorter than the ITU requires may be possible by increasing the number of estimators to four or even eight.
A Hardware Efficient Control of Memory Addressing for High-Performance FFT Processors

Yutai Ma and Lars Wanhammar
Abstract—The conventional memory organization of fast Fourier transform (FFT) processors is based on Cohen's scheme. Compared with that scheme, ours reduces the hardware complexity of address generation by about 50% while improving the memory access speed. Considerable power consumption in memory is saved, since only half of the memory is activated during each access, and the number of coefficient accesses is reduced to a minimum by a new ordering of the FFT butterflies. The new scheme is therefore well suited to constructing high-performance FFT processors.

Index Terms—Conflict-free memory addressing, fast Fourier transform, FFT coefficient access, low-power FFT processors.
I. INTRODUCTION

High-performance fast Fourier transform (FFT) processors are widely used in communications, image processing, and radar signal processing. Previously, the major concern was processing speed, but power consumption has become increasingly important in recent years. Improving FFT algorithms and processor architectures thus remains an active research topic. Many architectures for FFT processors have been proposed, including single recursive FFT processors, cascaded FFT processors [4], parallel FFT processing systems [8], and the recently proposed main memory-cache architecture [1]–[3]. These architectures are targeted at different applications. Our focus in this paper is on memory addressing and its control in FFT processors; our work is applicable to all four types of architectures mentioned above.

II. RELATED WORKS AND DISCUSSIONS

For the radix-2 in-place FFT algorithm, each butterfly reads two inputs and produces two outputs. Memory partitioning is therefore necessary to ensure that the four memory operations can be performed simultaneously in each clock cycle. Pease [9] found that the address parities of the two butterfly inputs differ, and he proposed segmenting the memory into several banks using this property to meet the memory access requirement. Cohen [5] and Johnson [6] considered hardware-efficient implementations for radix-2 and radix-4 FFT algorithms, respectively. In [5], Cohen proposed a simple way to generate the data and coefficient addresses and to segment the data memory into two banks according to address parity. However, the address parity must be checked to identify the correct memory bank. In addition, the two data addresses, the two butterfly inputs, and the two butterfly outputs may all need to be interchanged. Therefore, the delay of memory access is large
Manuscript received March 24, 1999; revised September 8, 1999. The associate editor coordinating the review of this paper and approving it for publication was Dr. Edwin Hsing-Men Sha. Y. Ma is with the Department of Electronics, Royal Institute of Technology, Stockholm, Sweden (e-mail:
[email protected]). L. Wanhammar is with the Department of Electrical Engineering, Linköping University, Linköping, Sweden. Publisher Item Identifier S 1053-587X(00)01596-8. 1053-587X/00$10.00 © 2000 IEEE
for long FFT's. Ma's scheme [7] shortens the delay of address generation at comparable hardware cost. A common drawback of these two schemes is that two memory banks of size 2^{n-1} are activated for each butterfly. Is it possible to reduce the size of the active memory in order to save power? In addition, is it possible to simplify the hardware of the address generation? So far, little attention has been paid to the local activity of the coefficient access. In fact, the coefficient access exhibits very low activity, and we can make efficient use of this property: power consumption can be reduced if the number of coefficient accesses is reduced. Reducing the number of executed operations, lowering the complexity of circuits, and making the active circuits smaller are three approaches to reducing power consumption at the algorithm and architecture levels; they set the directions of our work in this correspondence. We will extend the principle in [7].

III. A NEW ORDERING FOR FFT BUTTERFLIES

Generally, if the number of ROM accesses can be reduced, then the power consumption in the ROM can be lowered. In addition, since precharging the bit lines and charging/discharging the word lines are two major power consumption contributors in a dynamic ROM, power is consumed even when the same coefficient is accessed repeatedly. Therefore, a significant amount of power can be saved by using an address-transition-detection (ATD) circuit that enables a coefficient access only when a change appears on the address bus; otherwise, the coefficient access is disabled. These two insights motivate the reordering of the FFT butterflies.
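Both motivations can be made quantitative with a short sketch. This is our own Python illustration, assuming a textbook per-pass DIT schedule rather than the paper's reordering of Table I; the fold of upper-quarter coefficient addresses relies on the quarter-period symmetry discussed in Section III-B.

```python
# Estimate coefficient-ROM activity under address-transition detection
# (ATD): the ROM is enabled only when the coefficient address changes.
# We compare a textbook per-pass DIT butterfly order against the same
# accesses regrouped so that butterflies sharing a coefficient are
# consecutive (the effect the reordering in Section III aims at), with
# upper-quarter addresses folded down via the quarter-period symmetry.
import cmath

def pass_exponents(n_bits: int, p: int):
    """Twiddle exponents of pass p in the usual DIT order (assumed)."""
    N = 1 << n_bits
    span = 1 << p
    return [j << (n_bits - 1 - p)
            for _k in range(0, N, span << 1)
            for j in range(span)]

def rom_enables(passes):
    """ATD-gated ROM accesses: one enable per change of address."""
    flat = [a for seq in passes for a in seq]
    return 1 + sum(a != b for a, b in zip(flat, flat[1:]))

n = 10                                    # 1024-point FFT
N = 1 << n

# Folding is legal because W_N^(k + N/4) = -j * W_N^k: a real/imaginary
# swap plus one negation, so no second ROM word is needed.
for k in range(N // 4):
    w = cmath.exp(-2j * cmath.pi * k / N)
    w4 = cmath.exp(-2j * cmath.pi * (k + N // 4) / N)
    assert abs(w4 - complex(w.imag, -w.real)) < 1e-12

natural = [pass_exponents(n, p) for p in range(n)]
folded = [sorted(a % (N // 4) for a in seq) for seq in natural]
raw = n * N // 2                          # one ROM word per butterfly
print(rom_enables(natural), rom_enables(folded), raw)
```

Under these assumptions, the grouped-and-folded schedule needs on the order of 2^{n-1} ≈ 512 enables out of 5120 raw accesses, i.e., roughly the 10% activity figure derived in Section V, whereas the natural order changes the address at almost every butterfly.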
Fig. 1. Signal flow graph of the radix-2 in-place 16-point FFT algorithm.

TABLE I. Reordering of the butterfly sequence (N = 16).
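For reference, the algorithm of Fig. 1 has the following textbook form: bit-reversed loading followed by n passes of in-place butterflies, producing natural-order output. This is a plain-Python sketch checked against a direct DFT, not the paper's hardware description.

```python
# Textbook radix-2 in-place decimation-in-time FFT matching the
# structure of Fig. 1: bit-reversed load, then n passes of butterflies.
import cmath

def bit_reverse(x: int, n_bits: int) -> int:
    """Reverse the n_bits-bit binary representation of x."""
    r = 0
    for _ in range(n_bits):
        r = (r << 1) | (x & 1)
        x >>= 1
    return r

def fft_dit(a):
    N = len(a)
    n = N.bit_length() - 1
    a = [a[bit_reverse(i, n)] for i in range(N)]   # bit-reversed load
    for p in range(n):                             # pass p: span 2**p
        span = 1 << p
        for k in range(0, N, span << 1):           # butterfly groups
            for j in range(span):
                w = cmath.exp(-2j * cmath.pi * j / (span << 1))
                t = w * a[k + j + span]
                a[k + j + span] = a[k + j] - t
                a[k + j] = a[k + j] + t
    return a

def dft(a):
    """Direct DFT used as a correctness reference."""
    N = len(a)
    return [sum(a[t] * cmath.exp(-2j * cmath.pi * k * t / N)
                for t in range(N)) for k in range(N)]

x = [complex(i, 0) for i in range(16)]
assert max(abs(u - v) for u, v in zip(fft_dit(x), dft(x))) < 1e-9
print("16-point FFT matches the direct DFT")
```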
A. Cohen's Butterfly Sequence

The radix-2 in-place FFT algorithm is illustrated in Fig. 1. We assume that the inputs are arranged in bit-reversed order and the outputs are produced in natural order. Define B = b_{n-2} ... b_{p+1} b_p b_{p-1} ... b_0 with N = 2^n, which is a counter used for generating butterfly indices. Let A_s(r), A_t(r), A_s(w), and A_t(w) be the addresses of the two butterfly inputs and the two butterfly outputs, respectively. In Cohen's scheme [5], the data addresses and the butterfly sequence are generated based on the following principle:

    A_s(r) = A_s(w) = RL(2B + 0, p)    (1)
    A_t(r) = A_t(w) = RL(2B + 1, p)    (2)

where p = 0, 1, ..., n - 1 and RL(x, i) denotes rotating x left over i bits.

B. Our Butterfly Sequence

From Fig. 1, we find that the coefficients used for the butterfly calculations have two features. 1) The butterflies at pass p are divided into 2^{n-1-p} groups, with p = 0, 1, ..., n - 1, and the coefficients used within a group are the same. 2) The coefficients used in each group fall into two classes: those whose addresses have MSB "0" and those whose addresses have MSB "1." Let C and C' be two coefficients with addresses A_c and A'_c, respectively, and express C as a complex number C = Re(C) + j Im(C). If A'_c = A_c + N/4, then C' = Re(C') + j Im(C') = -Im(C) + j Re(C). This shows that C' can be generated from C by interchanging its real and imaginary components (with a sign change). These two features can be exploited to reduce the number of coefficient accesses by reordering the sequence of butterflies at each pass. The new butterfly sequence is illustrated in Table I.

IV. MEMORY PARTITION

In [5] and [7], the data memory is divided into two banks to meet the memory access requirement. If we further segment the two banks into four and enable only the banks where the butterfly operands and results reside during read and write operations, then nearly half of the power consumption in the memory can be saved.

A. Data Assignment

We know that the two butterfly input addresses generated by (1) and (2) at pass p differ in the pth bit, with p = 0, 1, ..., n - 1. Thus, the two butterfly inputs can be placed in two memory banks determined by 0 b_{n-2} and 1 b_{n-2}, respectively, with butterfly counter B = b_{n-2} b_{n-3} ... b_1 b_0. The butterfly outputs should be stored in two memory banks determined by b_1 0 and b_1 1, respectively, to ensure correct access at the next pass. This is the main idea of our memory partition. This partition brings three benefits. 1) By removing the pth bit from the address at pass p (p = 0, 1, ..., n - 1), we have A_s(r) = A_t(r) and A_s(w) = A_t(w); thus, only two barrel shifters are needed instead of four. 2) No address parity check is required to choose the correct memory bank. 3) The decoding for choosing a memory bank is performed on only one bit and is therefore very simple.
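The parity facts behind benefits 2) and 3) are easy to verify exhaustively. This is our own illustration in Python, with RL implemented as a rotate-left over n bits as defined in Section III-A.

```python
# Verify, for N = 16, the facts behind benefits 2) and 3): the two
# butterfly addresses generated by (1) and (2) differ exactly in bit p,
# so they always have opposite bit-parity (Pease's property, which
# Cohen's scheme must compute with an XOR tree over all address bits),
# and together they sweep the whole address space once per pass.

def rl(x: int, i: int, width: int) -> int:
    """RL(x, i): rotate the width-bit word x left over i bits."""
    i %= width
    mask = (1 << width) - 1
    return ((x << i) | (x >> (width - i))) & mask if i else x

def parity(x: int) -> int:
    """XOR of all bits of x: Cohen's bank-select function."""
    p = 0
    while x:
        p ^= x & 1
        x >>= 1
    return p

n = 4                                    # N = 16
N = 1 << n
for p in range(n):
    touched = set()
    for B in range(N // 2):
        a_s = rl(2 * B + 0, p, n)        # eq. (1)
        a_t = rl(2 * B + 1, p, n)        # eq. (2)
        assert a_s ^ a_t == 1 << p       # addresses differ in the pth bit
        assert parity(a_s) != parity(a_t)
        touched.update((a_s, a_t))
    assert touched == set(range(N))      # every word accessed once per pass
print("Cohen addressing properties hold for N = 16")
```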
TABLE II. Memory partition and data assignment.
B. Memory Addressing
In Section III-A, we denoted by B = b_{n-2} b_{n-3} ... b_1 b_0 the butterfly counter. In order to generate the desired butterfly sequence, we reconstruct the butterfly counter output and denote it by B_bfly = b_0 b_{n-2} b_{n-3} ... b_1. The butterfly to be performed is then identified by B_bfly. Taking the 16-point FFT as an example, when B = 001 at pass 0, B_bfly is set to 100, and the butterfly (8, 9, 0) is calculated, as shown in Table I. We also use two variables B_r = b_1 b_{n-2} ... b_2 and B_w = b_0 b_{n-2} ... b_2, with N = 2^n, to describe the data address generation in the following. Based on the principle presented in Section IV-A and Cohen's algorithm, the addresses of the butterfly inputs and outputs are

    A_s(r) = A_t(r) = RL(B_r, p)    (3)
    A_s(w) = A_t(w) = RL(B_w, p)    (4)

with p = 0, 1, ..., n - 1. The memory bank selection is based on the decoding of 0 b_0 and 1 b_0 for read operations and of b_1 0 and b_1 1 for write operations, respectively. It should be noted that the initial data assignment must be done carefully to allow addressing the four memory banks based on (3) and (4). Denote the sample index by D = d_{n-1} d_{n-2} ... d_1 d_0. When a sample is loaded into the FFT processor, a memory bank is selected by decoding d_0 d_{n-1}, and the address within the identified bank is d_1 d_{n-2} d_{n-3} ... d_3 d_2. For the butterflies at pass 0 (p = 0), based on Cohen's algorithm (1), (2), the addresses of the butterfly inputs are

    A_s(r) = A_s(w) = RL(2 B_bfly + 0, 0) = b_0 b_{n-2} b_{n-3} ... b_2 b_1 0
    A_t(r) = A_t(w) = RL(2 B_bfly + 1, 0) = b_0 b_{n-2} b_{n-3} ... b_2 b_1 1.

Fig. 2. Address generation of read operations for the 1024-point FFT.
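The counter reconstruction is simply a one-bit rotation of the (n - 1)-bit butterfly counter. The following sketch (our own Python illustration) reproduces the 16-point example above: B = 001 gives B_bfly = 100, which under Cohen's pass-0 addressing (1), (2) selects the butterfly with inputs 8 and 9 and coefficient 0, as listed in Table I.

```python
def reorder(B: int, width: int) -> int:
    """B_bfly = b0 b_{n-2} ... b_1: rotate the width-bit counter B
    right by one bit."""
    return (B >> 1) | ((B & 1) << (width - 1))

def rl(x: int, i: int, width: int) -> int:
    """RL(x, i): rotate the width-bit word x left over i bits."""
    i %= width
    mask = (1 << width) - 1
    return ((x << i) | (x >> (width - i))) & mask if i else x

n = 4                                   # N = 16; B is n - 1 = 3 bits wide
B = 0b001
B_bfly = reorder(B, n - 1)              # 0b100
a_s = rl(2 * B_bfly + 0, 0, n)          # pass-0 input/output address (1)
a_t = rl(2 * B_bfly + 1, 0, n)          # pass-0 input/output address (2)
print(B_bfly, a_s, a_t)                 # -> 4 8 9
```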
Based on our initial data assignment and our algorithm (3), (4), 0 b_0 and 1 b_0 are used for choosing the memory banks, and the addresses are b_1 b_{n-2} b_{n-3} ... b_2 for read operations and b_0 b_{n-2} b_{n-3} ... b_2 for write operations. This small difference between our algorithm and Cohen's does not affect reading out the correct operands for the butterflies at pass 0. Through the write operations at pass 0 using (4), the address space becomes normal; i.e., the address space generated by (3) and (4) at the other passes (p = 1, 2, ..., n - 1) is restored to the address space generated by Cohen's scheme (1), (2). This means that correct memory access is guaranteed. For the FFT output, the memory addressing must also be done carefully. Based on Cohen's scheme (1), (2), the addresses of the two butterfly outputs at pass n - 1 are 0 b_0 b_{n-2} b_{n-3} ... b_2 b_1 and 1 b_0 b_{n-2} b_{n-3} ... b_2 b_1, whereas based on our algorithm (4), both addresses are b_{n-2} b_{n-3} ... b_2 b_0. Therefore, the FFT output should be performed as follows. Let the FFT outputs be indexed by a counter D = d_{n-1} d_{n-2} ... d_1 d_0. We use d_0 d_{n-1} to choose a memory bank, and the address of the output is d_{n-3} ... d_1 d_{n-2}. The memory partition and data assignment are shown in Table II, and the address generation is shown in Figs. 2 and 3. An example of the data assignment for the 16-point FFT computation is shown in Table III.

C. Conflict-Free Memory Access

Obviously, there are no memory access conflicts between the two read operations or between the two write operations of a butterfly. However, we must also ensure that the write operations do not destroy (overwrite) results produced in the previous pass.
Fig. 3. Address generation of write operations for 1024-point FFT.
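The bank selections of Section IV-B can be checked exhaustively for a small case. This is our own Python illustration; the rotation amount in (3) and (4) is taken modulo the (n - 2)-bit bank-address width, which is our reading of the scheme and stated here as an assumption.

```python
# Check, for N = 16, that the four-bank addressing of (3) and (4) is
# conflict-free: within a butterfly the two reads (banks (0, b0) and
# (1, b0)) and the two writes (banks (b1, 0) and (b1, 1)) never collide,
# and each pass touches every (bank, address) cell exactly once.

def rl(x: int, i: int, width: int) -> int:
    """Rotate the width-bit word x left over i bits."""
    i %= width
    mask = (1 << width) - 1
    return ((x << i) | (x >> (width - i))) & mask if i else x

n = 4                                     # N = 16; counter B = b2 b1 b0
N = 1 << n

def cells(p: int, write: bool):
    """(bank, address) pairs touched by the N/2 butterflies of pass p."""
    out = []
    for B in range(N // 2):
        b0, b1, b2 = B & 1, (B >> 1) & 1, (B >> 2) & 1
        if write:
            addr = rl((b0 << 1) | b2, p, n - 2)   # RL(B_w, p), B_w = b0 b2
            out += [((b1, 0), addr), ((b1, 1), addr)]
        else:
            addr = rl((b1 << 1) | b2, p, n - 2)   # RL(B_r, p), B_r = b1 b2
            out += [((0, b0), addr), ((1, b0), addr)]
    return out

all_cells = {((x, y), a) for x in (0, 1) for y in (0, 1)
             for a in range(N // 4)}
for p in range(n):
    for write in (False, True):
        c = cells(p, write)
        assert len(set(c)) == N == len(c)     # 16 distinct cells: no conflict
        assert set(c) == all_cells            # whole memory, once per pass
print("four-bank access is conflict-free for N = 16")
```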
From the butterfly sequence illustrated in Table I and the memory addressing described in the last subsection, we know that four butterflies are grouped together and scheduled as shown in Table IV. Since each butterfly consists of memory reads, arithmetic operations, and memory writes, the pipeline has at least three stages and is usually designed with four or more stages because the delay of the butterfly arithmetic is large. Therefore, any data that would be overwritten has already been read out before its location receives a new butterfly result. This analysis indicates that the new algorithm leads to conflict-free memory access.

TABLE III. Data assignment after the butterfly calculations at each pass (N = 16).

TABLE IV. Butterfly scheduling and data assignment at pass 3.

Fig. 4. Address-transition detection on the coefficient address for a 32-point FFT (B = b_3 b_2 b_1 b_0 and its reordered output B_bfly = b_0 b_3 b_2 b_1).

Fig. 5. Enable-signal generation for coefficient access (R is initially set to "1").

V. FFT COEFFICIENT ACCESS

A. Reduction of Coefficient Access

From Tables I and IV, we know that the coefficient addresses of the four-butterfly groups at the first pass (p = 0) are all 0; thus, only one coefficient access is issued. At pass 1 (p = 1), the addresses are
0 and 2^{n-2}. At the other passes (p = 2, ..., n - 1), the addresses are 2^{n-1-p} (0 b_{n-2} ... b_{n-p}) and 2^{n-1-p} (1 b_{n-2} ... b_{n-p}). Based on the principle behind the new butterfly sequence, 2^{p-1} coefficient accesses are issued at pass p, with p = 1, 2, ..., n - 1. Thereby, the total number of coefficient accesses issued is

    1 + sum_{i=1}^{n-1} 2^{i-1} = 2^{n-1}

although the total number of coefficient accesses without reordering is n 2^{n-1}. For example, for a 1024-point FFT, the activity of the coefficient access is reduced to 10%.

B. Coefficient ROM Addressing

Since interchanging the coefficient real and imaginary parts also appears in the complex multiplier [10], in which a complex multiplier can be constructed at a cost corresponding to that of two real multipliers, these two interchanges can be combined, and no extra cost is incurred. Because the coefficient accesses whose address MSB is "1" have been removed, the coefficient ROM can be cut in half. The MSB of the coefficient address is known when the address is generated. Coefficient address generation here follows the methods in [5] and [8]. The difficulty lies in detecting a change in the coefficient address if we want to reduce the number of coefficient accesses further. When the algorithm in [8] is used, it is beneficial to design a circuit to detect the change because the cost is small in this case. The algorithm in [8] decomposes an N-point FFT into N_1 FFT's of N_2 points and N_2 FFT's of N_1 points, with N = N_1 N_2. The algorithm can be used to construct a parallel system and can be applied to Baas's main memory-cache architecture [1]-[3]. Because the detection is performed only on the n_1-bit and n_2-bit coefficient addresses of the short N_1-point and N_2-point FFT's, respectively, the circuit cost is greatly reduced. The control logic for a 1024-point FFT processor based on the main memory-cache architecture is shown in Figs. 4 and 5.

C. Application to Cascaded FFT Processors

For cascaded DIT FFT processors, the coefficient access also exhibits low activity. The activity is given by

    [ sum_{i=2}^{n-1} 2^{i-1} ] / [ (n - 1) 2^{n-1} ] = ( 2^{n-1} - 2 ) / ( (n - 1) 2^{n-1} ).
This shows that for a 1024-point FFT, the activity of the coefficient access is reduced to 11%. Therefore, making efficient use of this feature helps reduce the power consumption in the coefficient ROM, and the ROM size can be reduced by half. In addition, the control logic for coefficient access is very simple in cascaded FFT processors.

VI. COMPARISON WITH COHEN'S SCHEME

A. Comparison of Memory Addressing Speed

In our scheme, the delay of the address generation is dominated by that of the barrel shifters, that is, T_d = T_RL + T_MUX. For Cohen's scheme, the delay is T_d = max{T_RL, T_parity} + 2 T_MUX, where T_parity is the delay of the XOR tree for the address parity check. We know that T_parity is larger than T_RL. Therefore, the memory access of our scheme is faster than that of Cohen's scheme. Since the active memory is also smaller (reduced by half), the actual performance is even better.

B. Comparison of Hardware Cost

The hardware cost reduction lies in the simpler address generation. Two shifters are used in our scheme and four in Cohen's scheme, respectively. More importantly, the address parity check circuits and the interchange circuits for the two data addresses, the two butterfly inputs, and the two butterfly outputs are all removed, whereas these circuits are required in Cohen's scheme. A detailed comparison is given in Table V. We see that the hardware cost is reduced by more than 50%, disregarding the butterfly and pass counters.

TABLE V. Comparison of hardware cost between our scheme and Cohen's scheme (w: memory bit width).

C. Comparison of Power Consumption

The power saved by our algorithm comes from three sources: 1) a smaller active memory during access; 2) simplified address generation; and 3) far fewer coefficient accesses. In our scheme, significant power consumption in the memory is saved because the active memory is reduced by half compared with Cohen's scheme. Our analysis in the last subsection shows that the hardware complexity of the address generation is reduced by about 50%. In addition, we have shown that the activity of the coefficient access is reduced to 1/n; taking the 1024-point FFT as an example and using the control logic described in Section V-B, the activity of the coefficient access is reduced to 20% (1/n_1 or 1/n_2). Therefore, a significant amount of power is saved in the memory, its control logic, and the coefficient ROM.

VII. CONCLUSION

We have extended the work in [7]. The new scheme reduces the hardware cost of memory addressing by nearly 50%, yields significant power savings, and improves the memory access speed as well. It is applicable to single recursive FFT processors, parallel FFT processing systems, and the main memory-cache architecture for low-power FFT processors. The new butterfly ordering makes it possible to reduce the number of coefficient accesses to a minimum; the idea is also applicable to cascaded DIT FFT processors.

REFERENCES

[1] B. M. Baas, "A 9.5 mW 330 µsec 1024-point FFT processor," in Proc. Custom Integrated Circuits Conf., May 1998.
[2] B. M. Baas, "An energy-efficient single-chip FFT processor," in Proc. IEEE Symp. VLSI Circuits, June 1996.
[3] B. M. Baas, "An energy-efficient FFT processor architecture," StarLab Tech. Rep. NGT-70 340-1994-1, Jan. 25, 1994.
[4] E. Bidet, D. Castelain, C. Joanblanq, and P. Senn, "A fast single-chip implementation of 8192 complex point FFT," IEEE J. Solid-State Circuits, vol. 30, Mar. 1995.
[5] D. Cohen, "Simplified control of FFT hardware," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-24, pp. 577–579, Dec. 1976.
[6] L. G. Johnson, "Conflict free memory addressing for dedicated FFT hardware," IEEE Trans. Circuits Syst. II, vol. 39, pp. 312–316, May 1992.
[7] Y. Ma, "An effective memory addressing scheme for FFT processors," IEEE Trans. Signal Processing, vol. 47, pp. 907–911, Mar. 1999.
[8] Y. Ma, "A VLSI oriented parallel FFT algorithm," IEEE Trans. Signal Processing, vol. 44, pp. 445–448, Feb. 1996.
[9] M. C. Pease, "Organization of large scale Fourier processors," J. Assoc. Comput. Mach., vol. 16, pp. 474–482, July 1969.
[10] L. Wanhammar, DSP Integrated Circuits. New York: Academic, 1999.

Real-Time Sonar Beamforming on Workstations Using Process Networks and POSIX Threads
Gregory E. Allen and Brian L. Evans
Abstract—We present a scalable framework for real-time data-intensive systems on commodity multiprocessor workstations. The framework is an extension of the process network model, which captures parallelism, guarantees determinate execution, and executes in bounded memory. We implement the framework using lightweight POSIX threads and prototype a 4-GFLOP sonar beamformer on a 12-processor 336-MHz Sun Enterprise server. The beamformer scales nearly linearly from 1 to 12 processors. Index Terms—Beamforming, high-performance computing, models of computation, multiprocessor programming, native signal processing, process networks, real-time systems, scalable software.
I. INTRODUCTION

High-resolution sonar beamforming algorithms require on the order of a billion multiply-accumulates (MAC's) per second and have traditionally required custom parallel hardware for real-time implementation. Because these systems are not typically sold in high volumes, the nonrecoverable engineering cost for custom hardware development may make this approach prohibitively expensive. The

Manuscript received April 6, 1999; revised September 28, 1999. This work was supported by the Independent Research and Development Program, Applied Research Laboratories, The University of Texas at Austin; the Defense Advanced Research Projects Agency (DARPA) and the U.S. Army under DARPA Grant DAAB07-97-C-J007 through a subcontract from the Ptolemy project at the University of California at Berkeley; and the U.S. National Science Foundation CAREER Award under Grant MIP-9702707. The associate editor coordinating the review of this paper and approving it for publication was Dr. Edwin Hsing-Men Sha.
G. E. Allen is with the Applied Research Laboratories, The University of Texas at Austin, Austin, TX 78713-8029 USA.
B. L. Evans is with the Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX 78712-1084 USA.
Publisher Item Identifier S 1053-587X(00)01551-8.