Nuha A. S. Alwan
EFFICIENT SYSTOLIC SINUSOIDAL SEQUENCE GENERATION

Nuha A. S. Alwan*
Department of Electrical Engineering, College of Engineering, University of Baghdad
Abstract: In this work, the systolic parallel design approach is applied to the computation of sinusoidal sequences from truncated trigonometric series. A higher speed than that of sequential Von Neumann machines is thereby obtained when the series is evaluated for successive input values. The efficiency of the proposed design stems from the use of Horner’s method of polynomial evaluation, which yields a higher function generation rate than previous work on the direct systolic evaluation of the series. In addition, using Horner’s method makes it possible to avoid division inside the processing element. The function generation rate and the number of operations per second of the systolic parallel system, compared with serial processing, increase as the number of terms used in the series increases.
* Address for correspondence: Department of Electrical Engineering, College of Engineering, P. O. Box 47037, University of Baghdad, Jadiriya, Baghdad, Iraq. E-mail: [email protected]
Paper Received 26 June 2006; Revised 29 June 2007; Accepted 30 June 2007
The Arabian Journal for Science and Engineering, Volume 33, Number 1B
April 2008
ABSTRACT

The numerical computation of sinusoidal sequences from truncated trigonometric series can be achieved systolically, resulting in a substantial speed advantage over Von Neumann machines when the series is evaluated for successive values of the input argument. The computation is made especially efficient by using Horner’s method of polynomial evaluation, which yields a higher generation rate than previous work involving the direct systolic evaluation of the series. In addition, division inside the processing element (PE) can be avoided when Horner’s method is employed. Relative to serial processing, the improvement in function generation rate and throughput of the systolic parallel system grows with the number of terms used in the truncated series.

Key words: trigonometric series, systolic arrays, sinusoidal sequence generation, Horner polynomial evaluation.
1. INTRODUCTION
Systolic architectures are typically formed by interconnecting a set of identical processing elements (PEs), or cells, in a highly regular array in a nearest-neighbor fashion. Data is processed where it falls, flowing synchronously from cell to cell, with each cell performing a small step in the overall computation of the array. The result is a high degree of parallelism with a reduced I/O interface [1, 2]. Systolic arrays are typically used as special-purpose devices to meet high performance requirements, which was made possible by the advent of VLSI technology. Programmable systolic array hardware is also widely used, especially since the advent of field programmable gate arrays (FPGAs), which are constructed by tiling identical memory and logic blocks along with supporting mesh interconnection networks in a way that matches systolic array architectures [3].

The series for trigonometric functions such as sine and cosine can be derived from the Maclaurin series for the exponential function. In a recent study [4], a truncated series for the sine function is computed systolically for consecutive values of the argument: a specified digital frequency is entered as an input to the array, and the sinusoidal output sequence is produced one sample at a time from one end of the array. The sine sequence is thus generated in a considerably shorter time than with a Von Neumann machine. The sequence generated in [4] finds application in situations where computation speed is crucial, such as offline technical computing for the design, simulation, and testing of digital filters and other digital signal processing systems. Offline examination, analysis, and measurement of system characteristics and parameters also call for fast sinusoidal sequence generation.

This paper improves on the work in [4] by using Horner’s method of polynomial evaluation [5, 6].
The efficiency of Horner’s scheme stems from the reduction of the number of multiplications/divisions per cell compared to the direct evaluation of the truncated sine series. This results in a systolic array for sinusoidal sequence generation with a higher generation rate than that in [4]. The systolic computation under consideration can be implemented on both special-purpose processing devices and general-purpose programmable parallel architectures.

The organization of the paper is as follows: In Section 2, Horner’s method of polynomial evaluation is presented and applied to systolic sinusoidal sequence generation. Throughput and generation rate issues are discussed in Section 3. Finally, Section 4 concludes the paper.

2. SYSTOLIC SINUSOIDAL SEQUENCE GENERATION USING HORNER’S METHOD

The Maclaurin series for $e^x$ [7] is given by

$$e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \frac{x^4}{4!} + \cdots$$
Replacing x by a complex value and simplifying, the series for cos x can be shown to be

$$\cos x = 1 - \frac{x^2}{2!} + \frac{x^4}{4!} - \frac{x^6}{6!} + \cdots \qquad (1)$$
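The simplification step is not spelled out in the text. One standard route, sketched here under the assumption that the complex value substituted is the purely imaginary argument $ix$, uses Euler’s formula and keeps the real part of the series:

```latex
\begin{aligned}
e^{ix} &= 1 + ix + \frac{(ix)^2}{2!} + \frac{(ix)^3}{3!} + \frac{(ix)^4}{4!} + \cdots \\
       &= \left(1 - \frac{x^2}{2!} + \frac{x^4}{4!} - \cdots\right)
          + i\left(x - \frac{x^3}{3!} + \frac{x^5}{5!} - \cdots\right), \\
\cos x &= \operatorname{Re}\, e^{ix} = \frac{e^{ix} + e^{-ix}}{2}
        = 1 - \frac{x^2}{2!} + \frac{x^4}{4!} - \frac{x^6}{6!} + \cdots
\end{aligned}
```

The imaginary part gives the corresponding sine series.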
Now consider an nth-degree polynomial given by

$$y(x) = c_0 + c_1 x + c_2 x^2 + \cdots + c_n x^n .$$

This polynomial can be rearranged using Horner’s method [5, 6]. For example, if n = 3, then

$$y(x) = [(c_3 x + c_2)x + c_1]x + c_0 .$$
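The nested rearrangement can be illustrated with a short Python sketch. The function name `horner` and the coefficient ordering (lowest degree first) are choices made here for illustration and do not come from the paper.

```python
def horner(coeffs, x):
    """Evaluate c0 + c1*x + ... + cn*x**n by Horner's nested form.

    coeffs is [c0, c1, ..., cn] (an ordering assumed for this sketch).
    Each loop iteration costs one multiplication and one addition.
    """
    result = 0.0
    # Start from the highest-degree coefficient and work inward-out:
    # (((cn)*x + c(n-1))*x + ...)*x + c0
    for c in reversed(coeffs):
        result = result * x + c
    return result


# Example: 1 + 2x + 3x^2 + 4x^3 at x = 2
value = horner([1, 2, 3, 4], 2)
```

An nth-degree polynomial evaluated this way needs n multiplications, whereas forming each power of x separately needs more.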
Truncating the cosine series in (1) to four terms and expressing it by Horner’s method yields

$$y(x) = \cos(x) = \left[\left(-\frac{1}{6!}x^2 + \frac{1}{4!}\right)x^2 - \frac{1}{2!}\right]x^2 + 1 \qquad (2)$$
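As a quick numerical check of Equation (2), the following Python sketch evaluates the nested form step by step and compares it with the library cosine. The function name `cos_horner4` is hypothetical; the three nested steps correspond to the three stages of the computation.

```python
import math


def cos_horner4(x):
    """Four-term truncated cosine series in the nested form of Equation (2)."""
    x2 = x * x
    y = -1.0 / math.factorial(6)          # innermost coefficient, -1/6!
    y = y * x2 + 1.0 / math.factorial(4)  # (-x^2/6! + 1/4!)
    y = y * x2 - 1.0 / math.factorial(2)  # (...) x^2 - 1/2!
    y = y * x2 + 1.0                      # [...] x^2 + 1
    return y


# Truncation error at x = 0.5 is on the order of x^8/8!
error = abs(cos_horner4(0.5) - math.cos(0.5))
```

The truncation is only accurate for small arguments; the error grows roughly as x^8/8!.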
Figure 1. (a) Dependence graph for the computation of cos x; (b) operations assigned to a vertex of the graph
A dependence graph for a word-level systolic array that may be used to compute Equation (2) is shown in Figure 1(a). Three PEs are needed. The x-axis of the dependence graph indicates progress in time, whereas the y-axis indicates processor location. The nested form of Equation (2) suggests that the operations assigned to a vertex of the dependence graph be as shown in Figure 1(b). The dashed arrows connecting the vertices in Figure 1(a) represent intermediate quantities in the calculation of the nested form of Equation (2), as clarified in Figure 1(b). The dependence graph of Figure 1 maps into the fully pipelined systolic array of Figure 2. The small black boxes represent clocked delay elements.
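The pipelined behaviour can be mimicked in software with a minimal Python sketch. It makes simplifying assumptions that are not in the paper: each stage holds its Horner coefficient as a constant (the paper’s cells compute the factorials internally), the first stage squares x, and the clocked delay elements are modelled as a list of registers. The names `STAGE_COEFFS`, `Y_INIT`, and `run_pipeline` are hypothetical.

```python
import math

# One Horner coefficient of Equation (2) per stage (assumed cell-resident here).
STAGE_COEFFS = [1 / math.factorial(4), -1 / math.factorial(2), 1.0]
Y_INIT = -1 / math.factorial(6)  # innermost coefficient fed to the first stage


def run_pipeline(xs):
    """Clock a 3-stage pipeline; return (cycle, value) pairs as results emerge."""
    n_stages = len(STAGE_COEFFS)
    regs = [None] * n_stages  # clocked register at the output of each stage
    outputs = []
    # Feed all inputs, then idle cycles so the pipeline can drain.
    stream = list(xs) + [None] * n_stages
    for cycle, x in enumerate(stream):
        # Stage 0 reads the fresh input; stage k reads the register of stage k-1.
        sources = [None if x is None else (x * x, Y_INIT)] + regs[:-1]
        new_regs = [None] * n_stages
        for k, src in enumerate(sources):
            if src is not None:
                x2, y = src
                new_regs[k] = (x2, y * x2 + STAGE_COEFFS[k])
        regs = new_regs  # all registers update together on the clock edge
        if regs[-1] is not None:
            outputs.append((cycle, regs[-1][1]))
    return outputs


results = run_pipeline([0.0, 0.5, 1.0])
```

In this model the first result appears after the pipeline fills (cycle index 2), and one result follows on every subsequent cycle, which is the behaviour the pipelined array is meant to provide.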
Figure 2. Systolic array for the computation of cos x truncated to four terms
Figure 1(b) shows that the factorials are computed inside the cells rather than stored as cell-resident values. When a sequence of x-values is entered at the bottommost cell of the systolic array, values of cos x are produced at the topmost cell, one value per clock cycle, at a rate three times faster than that achieved with a serial computational system.

A digital cosinusoidal signal, or cosinusoidal sequence, is denoted by cos(wn), where w is the digital frequency in radians and n is the dimensionless sample number. As in [4], instead of continuously feeding multiples of w to the systolic array of Figure 2 to produce the cosinusoidal sequence, we can input just the digital frequency and slightly modify the computations inside the PE so that the cosine argument advances in equal increments as the computation proceeds. The functional block diagram of the PE and the operations associated with it are then as shown in Figure 3.

Figure 3 shows that each PE performs six multiplications. Two of these are simply sign inversions: one is represented by the equation uo = –ui in the algorithm of Figure 3, and the other by the term (uo fo) in the last equation of Figure 3. The latter is either a sign inversion of fo or no change at all, depending on whether uo is –1 or 1, where ui, uo, and fo are intermediate quantities needed in the computation of the sinusoidal sequence. Therefore, the number of true multiplications per PE is four, with no division.

The efficiency of Horner’s scheme stems from the reduction in the number of multiplications/divisions compared to the direct evaluation of Equation (1). Implementing Equation (1) systolically as in [4] would require four multiplications and one division inside the PE. A further advantage of Horner’s scheme over the method in [4] is that no division need be performed inside the PE; since division requires more logic than multiplication, avoiding it reduces cell complexity. The disadvantage of Horner’s scheme, however, is that the number of terms in the truncated series must be determined beforehand. Changing the number of terms would entail changing the inputs to the array.
[Figure 3: block diagram of the PE, with inputs yi, w, ui, fi, pi, outputs yo, w, uo, fo, po, and internal quantity w′]

Operations associated with the PE:

w′ = 0 initially in each PE
yi = –1/6!, fi = 1/6!, pi = 6, and ui = –1 initially in the first PE
w′ = w + w′
po = pi – 2
fo = pi (pi – 1) fi
uo = –ui
yo = yi w′² + uo fo

Figure 3. Functional block diagram of the constituent PE of the cosinusoidal sequence generation systolic array together with the associated operations
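The operations listed with Figure 3 can be traced in software with a small functional model. This Python sketch is a sketch only: it steps the three PEs one after another for each sample instead of modelling the clocked, time-skewed hardware, and the names `PE`, `step`, and `cos_sequence` are hypothetical. Because w′ is incremented before use, the first sample in this model corresponds to n = 1.

```python
import math


class PE:
    """Functional model of one processing element of Figure 3."""

    def __init__(self):
        self.w_acc = 0.0  # w' = 0 initially in each PE

    def step(self, w, y_i, f_i, p_i, u_i):
        self.w_acc = w + self.w_acc              # w' = w + w'
        p_o = p_i - 2                            # po = pi - 2
        f_o = p_i * (p_i - 1) * f_i              # fo = pi (pi - 1) fi
        u_o = -u_i                               # uo = -ui
        y_o = y_i * self.w_acc ** 2 + u_o * f_o  # yo = yi w'^2 + uo fo
        return y_o, f_o, p_o, u_o


def cos_sequence(w, n_samples):
    """Return approximations of cos(w*n) for n = 1..n_samples using three PEs."""
    pes = [PE() for _ in range(3)]
    samples = []
    for _ in range(n_samples):
        # Initial values entering the first PE, as listed with Figure 3.
        y, f, p, u = -1.0 / math.factorial(6), 1.0 / math.factorial(6), 6, -1
        for pe in pes:
            y, f, p, u = pe.step(w, y, f, p, u)
        samples.append(y)
    return samples


seq = cos_sequence(0.1, 5)
```

Tracing the first PE gives fo = 6·5/6! = 1/4! and uo = 1, so yo = –w′²/6! + 1/4!, which is the innermost bracket of Equation (2); the second and third PEs contribute –1/2! and +1 in the same way. Since w′ keeps growing, the four-term truncation stays accurate only while the accumulated argument remains small.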
Any desired phase shift can be added to w′ in each PE in Figure 3 before squaring w′. If the phase shift is –π/2, we obtain a true sine sequence, since cos(θ – π/2) = sin θ. The phase shift must be preloaded into each PE.

3. THROUGHPUT AND GENERATION RATE

The throughput of a systolic array is a measure of the processing rate and is defined as [8]:

$$\text{Throughput} = N_I \, \frac{f_c}{N_c} \ \text{operations per second} \qquad (3)$$
where $N_I$ is the number of operations (or instruction executions) in the algorithm, $N_c$ is the number of cycles to compute the algorithm, and $f_c$ is the clock frequency. Considering the operations to be multiplications and disregarding additions, we find that the throughput found from Equation (3) is $12 f_c$ ops/s. The generation rate of the successive function values is equal to $f_c$. This rate can be increased by making the clock period almost as small as the time required by four multiplications. Therefore, the generation rate is greater than that in [4]. The throughput, however, is the same if $f_c$ is maximized in both cases.

For a serial system computing the series with the same number of terms, the clock period must be three times longer (for four terms in the series), thereby decreasing $f_c$, the generation rate, and the throughput. As the number of PEs increases, the truncated series becomes a closer approximation and higher precision is achieved. In addition, the improvement in generation rate and throughput compared to the serial system with the same number of terms increases.

Table 1 compares the design presented in [4] with the new design of the systolic sinusoidal sequence generation array using Horner’s method.

Table 1. Comparison of the Systolic Sinusoidal Sequence Generation Array Design of [4] with the New Systolic Array Design Using Horner’s Method

| | Design presented in [4] | New design |
|---|---|---|
| Number of multiplications/divisions per PE | Five | Four |
| Need for division inside PE | One division needed | No divisions needed |
| Area requirement | Moderate | Lower |
| Generation rate | High compared to serial processing | Higher |
| Throughput | High compared to serial processing | Same as in [4] when clock frequency is maximized in both cases |
| Number of terms in truncated series | Can be changed without changing array inputs | Changing number of terms entails changing array inputs |
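The throughput figure quoted in Section 3 can be reproduced with a short Python sketch of Equation (3). The breakdown used here is an interpretation, not something stated in the paper: three PEs with four multiplications each give N_I = 12, and one result per clock cycle in steady state gives N_c = 1. The clock frequency of 100 MHz is an arbitrary illustrative value, and the serial case simply applies the statement that the serial clock period is three times longer.

```python
def throughput(n_ops, n_cycles, f_clock_hz):
    """Equation (3): operations per second = N_I * f_c / N_c."""
    return n_ops * f_clock_hz / n_cycles


N_PES = 3          # PEs for the four-term series
MULTS_PER_PE = 4   # true multiplications per PE (sign inversions excluded)
f_c = 100e6        # hypothetical clock frequency in Hz, for illustration only

# Systolic array: 12 multiplications overlap in every cycle once the pipe is full.
systolic_ops_per_s = throughput(N_PES * MULTS_PER_PE, 1, f_c)

# Serial system (assumed reading): same 12 multiplications per result,
# but the clock period is three times longer, so the clock rate is f_c / 3.
serial_ops_per_s = throughput(N_PES * MULTS_PER_PE, 1, f_c / 3)

speedup = systolic_ops_per_s / serial_ops_per_s
```

Under these assumptions the systolic throughput is 12 f_c and the ratio to the serial case is three, matching the factor of three quoted for the generation rate.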
4. CONCLUSIONS

A cosinusoidal sequence $y(n) = \cos(wn)$ computed by a truncated cosinusoidal series is generated systolically upon entering a digital frequency w at the boundary cell of the systolic array. The computation is made efficient, compared to previous work [4], by employing Horner’s method for polynomial evaluation. This method requires fewer computational operations inside the PE than the direct evaluation of the truncated series in [4] and can be arranged to avoid division inside the PE. Therefore, the generation rate is higher than in the previous method, enabling faster processing. The method has the disadvantage that the number of terms in the truncation must be predefined.
ACKNOWLEDGMENT

The author would like to thank the anonymous reviewers for their helpful comments.

REFERENCES
[1] N. Petkov, Systolic Parallel Processing. North-Holland: Elsevier Science, 1993.
[2] H. T. Kung, “Why Systolic Architectures”, Computer, 15 (1982), pp. 37–46.
[3] J. G. Nash, “Automatic Generation of Systolic Array Designs for Reconfigurable Computing”, Proceedings Engineering of Reconfigurable Systems and Algorithms (ERSA ’02), International Multiconference in Computer Science, Las Vegas, Nevada, June 2002, pp. 176–182.
[4] N. A. S. Alwan, “A Fully Pipelined Systolic Array for Sinusoidal Sequence Generation”, IEEE Transactions on Computers, 55 (2006), pp. 636–639.
[5] S. S. Epp, Discrete Mathematics with Applications. Belmont, CA: Brooks Cole Publishing Company, 3rd edn, 2004.
[6] D. Grover and J. R. Deller, Digital Signal Processing and the Microcontroller. New Jersey: Prentice-Hall, 1999.
[7] G. B. Thomas and R. L. Finney, Calculus and Analytic Geometry. Manila: Addison-Wesley, 5th edn, 1979.
[8] S. M. Chai, “Real-Time Image Processing on Parallel Arrays for Gigascale Integration”, Ph.D. Thesis, Georgia Institute of Technology, November 1999.