Split-Radix Algorithms for Discrete Trigonometric Transforms
Gerlind Plonka and Manfred Tasche
Abstract
In this paper, we derive new split-radix DCT algorithms of radix-2 length, which are based on real factorizations of the corresponding cosine matrices into products of sparse, orthogonal matrices. These algorithms use only permutations, scaling with √2, butterfly operations, and plane rotations/rotation-reflections. They can be seen in analogy with the well-known split-radix FFT. Our new algorithms have a very low arithmetical complexity which compares with the best known fast DCT algorithms. Further, a detailed analysis of the roundoff errors for the new split-radix DCT algorithm shows its excellent numerical stability, which outperforms the real fast DCT algorithms based on polynomial arithmetic.

Mathematics Subject Classification 2000. 65T50, 65G50, 15A23.

Key words. Split-radix algorithm, split-radix FFT, discrete trigonometric transform, discrete cosine transform, arithmetical complexity, numerical stability, factorization of cosine matrix, direct sum decomposition, sparse orthogonal matrix factors.
1 Introduction
Discrete trigonometric transforms are widely used in processing and compression of signals and images (see [17]). Examples of such transforms are the discrete cosine transforms (DCT) and the discrete sine transforms (DST) of types I-IV. These transforms are generated by orthogonal cosine and sine matrices, respectively. Especially the DCT-II and its inverse DCT-III have been shown to be well applicable for image compression. The roots of these transforms go back to trigonometric approximation in the eighteenth century (see [10]). Discrete trigonometric transforms have also found important applications in numerical Fourier methods, approximation via Chebyshev polynomials, quadrature methods of Clenshaw-Curtis type, numerical solution of partial differential equations (fast Poisson solvers), singular integral equations, and Toeplitz-plus-Hankel systems.

Since DCTs are the most widely used transforms, we shall concentrate on the construction of real, fast, and numerically stable DCT algorithms in this paper. For large transform lengths n, and even for relatively small lengths n = 8 and n = 16 used in image compression, one needs to have fast DCT algorithms. There is a close connection between fast DCT algorithms and factorizations of the transform matrix. Let C_n ∈ R^{n×n} be an orthogonal cosine matrix of large radix-2 order n. Assume that we know a factorization of C_n into a product of sparse matrices

  C_n = M_n^{(m-1)} ⋯ M_n^{(0)},   1 < m ≤ n.          (1.1)

Then the DCT-transformed vector C_n x with arbitrary x ∈ R^n can be computed recursively by x^{(s+1)} := M_n^{(s)} x^{(s)} (x^{(0)} := x) for s = 0, …, m−1 such that x^{(m)} = C_n x. Since all matrix factors in (1.1) are sparse, the arithmetical complexity of this method will be low, so that the factorization (1.1) of C_n generates a fast DCT algorithm. For an algebraic approach to fast algorithms of discrete trigonometric transforms we refer to [16]. We may distinguish the following three methods to obtain fast DCT algorithms:

1. Fast DCT algorithms via FFT: It is natural to focus first on the computation of the DCT by using the FFT (see [15, 25, 7], [17], pp. 49-53, and [24], pp. 229-245). Such DCT algorithms are easy to implement using standard FFT routines and possess a good numerical stability (see [2, 23]). However, the real matrix-vector product C_n x is computed in complex arithmetic. Therefore the arithmetical complexity of these DCT algorithms is relatively high. The best FFT-based DCT algorithms require about 2.5 n log n flops, where a flop is a real arithmetical operation.

2. Fast DCT algorithms via polynomial arithmetic: All components of C_n x can be interpreted as values of one polynomial at n nodes. Reducing the degree of this polynomial by a divide-and-conquer technique, one can get a real fast DCT algorithm with low arithmetical complexity (see [9, 20, 21, 16]). The best DCT algorithms require about 2 n log n flops. A polynomial DCT algorithm generates a factorization (1.1) of C_n with sparse, non-orthogonal matrix factors M_n^{(s)}, i.e., the factorization (1.1) does not preserve the orthogonality of C_n (see e.g. [19, 2, 23]). This fact leads to a bad numerical stability of these DCT algorithms [19, 2, 23].

3. Fast DCT algorithms via direct matrix factorization: Using simple properties of trigonometric functions one may find direct factorizations of the transform matrix C_n into a product of real, sparse matrices. The trigonometric approximation algorithm of Runge [18] can be considered as a first example of this approach. Results on direct matrix factorizations of C_n into orthogonal, sparse matrices are due to Chen et al. [4] and Wang [26]. Note that various factorizations of C_n use non-orthogonal matrix factors (see [17], pp. 53-62, [3, 12, 13]). Many results were published without proofs. A direct "orthogonal" factorization of the cosine matrix of type II and of order 8 was given by Loeffler et al. [14] and is used even today in the JPEG standard (cf. Example 2.8). Improving the earlier results in [4, 26], Schreiber has given a constructive proof of a factorization of some cosine matrices of radix-2 order into a product of sparse, orthogonal matrices in [19]. Unfortunately, the construction of this factorization is not simple. However, another important result in [19] (see also [23]) says that a fast DCT algorithm possesses an excellent numerical stability if the algorithm is based on a factorization of C_n into sparse, orthogonal matrices. Therefore, in order to get real, fast, and numerically stable DCT algorithms, one should be especially interested in a factorization (1.1) with sparse, orthogonal matrix factors.

In this paper, we shall derive new split-radix DCT algorithms of radix-2 length, which are based on a real factorization of the corresponding cosine matrix into a product of sparse, orthogonal matrices. Here a matrix factor is said to be sparse if each row and column contains at most 2 nonzero entries. The new DCT algorithms require only permutations, scaling (with √2), butterfly operations, and plane rotations/rotation-reflections with small rotation angles. In contrast with many other factorizations, this split-radix approach admits a recursive factorization of a cosine matrix into a product of sparse, orthogonal matrices. We shall show that the split-radix DCT algorithms can be seen in analogy with the well-known split-radix FFT.
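The recursive evaluation scheme behind (1.1) is easy to sketch. In the following Python fragment (all names are illustrative, not from the paper), each hypothetical sparse factor is stored as a list of (row, column, value) triples and the factors are applied in the order M^(0), …, M^(m-1); the two toy factors below merely demonstrate the mechanism.

```python
# Sketch of the recursion x^(s+1) := M^(s) x^(s) from (1.1).
# Each sparse factor is a list of (row, col, value) triples.

def apply_sparse(factor, x):
    """Multiply a sparse matrix, given as a triple list, with the vector x."""
    y = [0.0] * len(x)
    for row, col, val in factor:
        y[row] += val * x[col]
    return y

def apply_factorization(factors, x):
    """Compute M^(m-1) ... M^(0) x by applying the factors in turn."""
    for factor in factors:          # factors[0] acts first
        x = apply_sparse(factor, x)
    return x

# Toy example: a butterfly factor followed by a diagonal scaling.
butterfly = [(0, 0, 1.0), (0, 1, 1.0), (1, 0, 1.0), (1, 1, -1.0)]
scaling   = [(0, 0, 0.5), (1, 1, 0.5)]
print(apply_factorization([butterfly, scaling], [3.0, 1.0]))  # [2.0, 1.0]
```

Since every factor has O(n) nonzero entries, m sparse factors give an O(mn) algorithm instead of the O(n^2) dense matrix-vector product.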
Our new algorithms have a very low arithmetical complexity which compares with the best known fast DCT algorithms. In particular, the obtained factorization of the cosine matrix of type II and of order 8 is very similar to that in [14] and needs exactly the same number of multiplications and additions. Algorithms with better arithmetical complexity can only be achieved using scaling matrices which are incorporated into the quantization process afterwards (see e.g. [8]). Further, a detailed analysis of the roundoff errors for the new split-radix DCT-II algorithm shows its excellent numerical stability, which outperforms the real fast DCT-II algorithms based on polynomial arithmetic.

The paper is organized as follows: In Section 2 we introduce different types of cosine and sine matrices and derive factorizations of the cosine matrices of types I-IV. All proofs are based on a divide-and-conquer technique applied directly to a matrix. That is, a given trigonometric matrix can be represented as a direct sum of trigonometric matrices of half size (and maybe of different type). While the factorization in Lemma 2.2 has already been used before (see e.g. [18, 4, 26]), the factorization of the cosine matrix of type IV in Lemma 2.4 is new and one of the basic results of this paper. It enables us to derive the corresponding split-radix DCT algorithms. In Section 3 we revisit the split-radix FFT and represent it in a new form which is completely based on a matrix factorization of the Fourier matrix. In Sections 4, 5 and 6 we derive some split-radix algorithms for DCT-II, DCT-IV, and DCT-I. A split-radix DCT-III algorithm follows immediately from a split-radix DCT-II algorithm. A comparison of these algorithms with the split-radix FFT shows the close connection between these approaches. We also compute the arithmetical complexity of the new algorithms. Section 7 is devoted to a comprehensive analysis of the numerical stability of a split-radix DCT-II algorithm. It can be shown that the new algorithm has a very low arithmetical complexity and a better numerical stability than the DCT algorithms based on polynomial arithmetic.

2 Trigonometric matrices
Let n ≥ 2 be a given integer. In the following, we consider cosine and sine matrices of types I-IV with order n which are defined by

  C^I_{n+1} := √(2/n) (ε_n(j) ε_n(k) cos(jkπ/n))_{j,k=0}^{n},
  C^II_n := √(2/n) (ε_n(j) cos(j(2k+1)π/(2n)))_{j,k=0}^{n-1},        C^III_n := (C^II_n)^T,
  C^IV_n := √(2/n) (cos((2j+1)(2k+1)π/(4n)))_{j,k=0}^{n-1},          (2.1)
  S^I_{n-1} := √(2/n) (sin((j+1)(k+1)π/n))_{j,k=0}^{n-2},
  S^II_n := √(2/n) (ε_n(j+1) sin((j+1)(2k+1)π/(2n)))_{j,k=0}^{n-1},  S^III_n := (S^II_n)^T,
  S^IV_n := √(2/n) (sin((2j+1)(2k+1)π/(4n)))_{j,k=0}^{n-1}.

Here we set ε_n(0) = ε_n(n) := √2/2 and ε_n(j) := 1 for j ∈ {1, …, n−1}. In our notation a subscript of a matrix denotes the corresponding order, while a superscript signifies the "type" of the matrix. First cosine and sine matrices appeared in connection with trigonometric approximation (see [10, 18]). In signal processing, cosine matrices of type II and III were introduced in [1]. The above classification was given in [26] (cf. [17], pp. 12-21). The cosine and sine matrices of types I-IV are orthogonal (see e.g. [17], pp. 13-14, [22, 19]). Strang [22] pointed out that the column vectors of each cosine matrix are eigenvectors of a symmetric second difference matrix and are therefore orthogonal.

In the following, I_n denotes the identity matrix and J_n := (δ(j+k−n+1))_{j,k=0}^{n-1} the counteridentity matrix, where δ means the Kronecker symbol. Blanks in a matrix indicate zeros or blocks of zeros. The direct sum of two matrices A, B is defined to be the block diagonal matrix A ⊕ B := diag(A, B). Let Σ_n := diag((−1)^k)_{k=0}^{n-1} be the diagonal sign matrix. In the following lemma we describe the intertwining relations of the above cosine and sine matrices.

Lemma 2.1 The cosine and sine matrices in (2.1) satisfy the intertwining relations

  C^I_{n+1} J_{n+1} = Σ_{n+1} C^I_{n+1},   S^I_{n-1} J_{n-1} = Σ_{n-1} S^I_{n-1},
  C^II_n J_n = Σ_n C^II_n,   S^II_n J_n = Σ_n S^II_n,          (2.2)
  C^IV_n J_n Σ_n = J_n Σ_n C^IV_n,   S^IV_n J_n Σ_n = J_n Σ_n S^IV_n

and

  J_n C^II_n = S^II_n Σ_n,   C^IV_n J_n = Σ_n S^IV_n.          (2.3)
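The definitions (2.1) and the relations of Lemma 2.1 are straightforward to check numerically. The following sketch (Python, pure standard library; all function names are ad hoc) builds C^II_n and S^II_n from their entry formulas and tests the orthogonality of C^II_n together with the first relation of (2.3).

```python
from math import cos, sin, pi, sqrt

def eps(n, j):                    # eps_n(0) = eps_n(n) = sqrt(2)/2, else 1
    return sqrt(2) / 2 if j in (0, n) else 1.0

def dct2_matrix(n):               # C^II_n from (2.1)
    return [[sqrt(2.0 / n) * eps(n, j) * cos(j * (2 * k + 1) * pi / (2 * n))
             for k in range(n)] for j in range(n)]

def dst2_matrix(n):               # S^II_n from (2.1)
    return [[sqrt(2.0 / n) * eps(n, j + 1) * sin((j + 1) * (2 * k + 1) * pi / (2 * n))
             for k in range(n)] for j in range(n)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def max_dev(A, B):
    return max(abs(x - y) for ra, rb in zip(A, B) for x, y in zip(ra, rb))

n = 8
C, S = dct2_matrix(n), dst2_matrix(n)
I = [[float(i == j) for j in range(n)] for i in range(n)]
J = [[float(i + j == n - 1) for j in range(n)] for i in range(n)]    # counteridentity
Sig = [[(-1.0) ** i if i == j else 0.0 for j in range(n)] for i in range(n)]  # sign matrix
CT = [list(r) for r in zip(*C)]
print(max_dev(matmul(C, CT), I) < 1e-12)               # C^II_n is orthogonal
print(max_dev(matmul(J, C), matmul(S, Sig)) < 1e-12)   # J_n C^II_n = S^II_n Sigma_n
```

Both checks print True up to floating-point roundoff of order 1e-15.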
The proof is straightforward and is omitted here (see [19]). The corresponding properties of C^III_n and S^III_n follow from (2.2)-(2.3) by transposing.

A discrete trigonometric transform is a linear mapping generated by a cosine or sine matrix in (2.1). Especially, the discrete cosine transform of type II (DCT-II) of length n is generated by C^II_n. The discrete sine transform of type I (DST-I) of length n−1 is generated by S^I_{n-1}. In this paper, we are interested in fast and numerically stable algorithms for discrete trigonometric transforms. As an immediate consequence of Lemma 2.1, we only need to construct algorithms for DCT-I, DCT-II, DCT-IV, and DST-I.

For even n ≥ 4, P_n and P̄_{n+1} denote the even-odd permutation matrices (or 2-stride permutation matrices) defined by

  P_n x := (x_0, x_2, …, x_{n-2}, x_1, x_3, …, x_{n-1})^T,   x = (x_j)_{j=0}^{n-1},
  P̄_{n+1} y := (y_0, y_2, …, y_n, y_1, y_3, …, y_{n-1})^T,   y = (y_j)_{j=0}^{n}.

Note that P_n^{-1} = P_n^T is the n_1-stride permutation matrix and P̄_{n+1}^{-1} = P̄_{n+1}^T is the (n_1+1)-stride permutation matrix with n_1 := n/2.

Lemma 2.2 Let n ≥ 4 be an even integer.
(i) The matrix C^II_n can be factorized in the form

  C^II_n = P_n^T (C^II_{n_1} ⊕ C^IV_{n_1}) T_n(0)          (2.4)

with the orthogonal matrix

  T_n(0) := (1/√2) [ I_{n_1}    J_{n_1}
                     I_{n_1}   −J_{n_1} ].
(ii) The matrix C^I_{n+1} can be factorized in the form

  C^I_{n+1} = P̄_{n+1}^T (C^I_{n_1+1} ⊕ C^III_{n_1}) T_{n+1}(2)          (2.5)

with the orthogonal matrix

  T_{n+1}(2) := (1/√2) [ I_{n_1}          J_{n_1}
                                  √2
                         I_{n_1}         −J_{n_1} ],

where the middle row and column correspond to the index n_1.

(iii) The matrix S^I_{n-1} can be factorized in the form
  S^I_{n-1} = P_{n-1}^T (S^III_{n_1} ⊕ S^I_{n_1-1}) T_{n-1}(2),          (2.6)

where the orthogonal matrix T_{n-1}(2) of order n−1 is built as in (2.5) with blocks I_{n_1-1} and J_{n_1-1}:

  T_{n-1}(2) := (1/√2) [ I_{n_1-1}          J_{n_1-1}
                                    √2
                         I_{n_1-1}         −J_{n_1-1} ].

Proof. We show (2.4) by a divide-and-conquer technique. First we permute the rows of C^II_n by multiplying with P_n and write the result as a block matrix:

  P_n C^II_n = (1/√n_1) [ (ε_n(2j) cos(2j(2k+1)π/(2n)))_{j,k=0}^{n_1-1}       (ε_n(2j) cos(2j(n+2k+1)π/(2n)))_{j,k=0}^{n_1-1}
                          (ε_n(2j+1) cos((2j+1)(2k+1)π/(2n)))_{j,k=0}^{n_1-1} (ε_n(2j+1) cos((2j+1)(n+2k+1)π/(2n)))_{j,k=0}^{n_1-1} ].

By (2.1) and

  cos(2j(n+2k+1)π/(2n)) = cos(2j(n−2k−1)π/(2n)),   cos((2j+1)(n+2k+1)π/(2n)) = −cos((2j+1)(n−2k−1)π/(2n)),

it follows immediately that the four blocks of P_n C^II_n can be represented by C^II_{n_1} and C^IV_{n_1}:

  P_n C^II_n = (1/√2) [ C^II_{n_1}    C^II_{n_1} J_{n_1}
                        C^IV_{n_1}   −C^IV_{n_1} J_{n_1} ]
             = (C^II_{n_1} ⊕ C^IV_{n_1}) (1/√2) [ I_{n_1}    J_{n_1}
                                                  I_{n_1}   −J_{n_1} ]
             = (C^II_{n_1} ⊕ C^IV_{n_1}) T_n(0).

Since P_n^{-1} = P_n^T and T_n(0) T_n(0)^T = I_n, the matrices P_n and T_n(0) are orthogonal. The proofs of (ii) and (iii) follow similar lines. □

Remark 2.3 The trigonometric approximation algorithm of Runge [18] is equivalent to (2.5) and (2.6). Similar factorizations of C^II_n, C^I_{n+1}, and S^I_{n-1} have also been presented in [26], but by using modified even-odd permutation matrices Q_n and Q̄_{n+1} defined by

  Q_n x := (x_0, x_2, …, x_{n-2}, x_{n-1}, x_{n-3}, …, x_1)^T,   x = (x_j)_{j=0}^{n-1},
  Q̄_{n+1} y := (y_0, y_2, …, y_n, y_{n-1}, y_{n-3}, …, y_1)^T,   y = (y_j)_{j=0}^{n}.

From (2.4) we obtain the following factorization of C^III_n:

  C^III_n = T_n(0)^T (C^III_{n_1} ⊕ C^IV_{n_1}) P_n.          (2.7)
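The factorization (2.4) can be checked numerically. The sketch below (Python, illustrative names) assembles P_n^T (C^II_{n_1} ⊕ C^IV_{n_1}) T_n(0) from the entry formulas in (2.1) and compares it with the directly defined C^II_n.

```python
from math import cos, pi, sqrt

def eps(n, j):
    return sqrt(2) / 2 if j in (0, n) else 1.0

def dct2_mat(n):   # C^II_n from (2.1)
    return [[sqrt(2.0 / n) * eps(n, j) * cos(j * (2 * k + 1) * pi / (2 * n))
             for k in range(n)] for j in range(n)]

def dct4_mat(n):   # C^IV_n from (2.1)
    return [[sqrt(2.0 / n) * cos((2 * j + 1) * (2 * k + 1) * pi / (4 * n))
             for k in range(n)] for j in range(n)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def factorized_dct2(n):
    """Assemble P_n^T (C^II_{n1} direct-sum C^IV_{n1}) T_n(0), cf. Lemma 2.2 (i)."""
    n1 = n // 2
    A, B = dct2_mat(n1), dct4_mat(n1)
    D = [[0.0] * n for _ in range(n)]          # direct sum of the half-size blocks
    for j in range(n1):
        for k in range(n1):
            D[j][k] = A[j][k]
            D[n1 + j][n1 + k] = B[j][k]
    T = [[0.0] * n for _ in range(n)]          # T_n(0) = 2^{-1/2} [I J; I -J]
    r = 1 / sqrt(2)
    for j in range(n1):
        T[j][j] = T[n1 + j][j] = r
        T[j][n - 1 - j] = r
        T[n1 + j][n - 1 - j] = -r
    M = matmul(D, T)
    # P_n^T puts row j at position 2j and row n1+j at position 2j+1
    return [M[j // 2] if j % 2 == 0 else M[n1 + j // 2] for j in range(n)]

n = 8
dev = max(abs(a - b) for ra, rb in zip(factorized_dct2(n), dct2_mat(n))
          for a, b in zip(ra, rb))
print(dev < 1e-12)
```

The deviation is of the order of the machine precision, confirming the sparse orthogonal splitting for n = 8.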
Now we introduce the modified identity matrices

  I'_n := diag(ε_n(j)^{-1})_{j=0}^{n-1},   I''_n := diag(ε_n(j+1)^{-1})_{j=0}^{n-1}.

Note that I'_n and I''_n differ from I_n in only one position. The matrix I'_n has the entry √2 in the left upper corner and I''_n has the entry √2 in the right lower corner. Let V_n be the forward shift matrix V_n := (δ(j−k−1))_{j,k=0}^{n-1}, i.e., V_n has nonzero entries 1 only in the lower secondary diagonal. Note that for arbitrary x = (x_j)_{j=0}^{n-1} ∈ R^n we have

  V_n x = (0, x_0, x_1, …, x_{n-2})^T,   V_n^T x = (x_1, x_2, …, x_{n-1}, 0)^T.

Further we define cosine and sine vectors by

  c_n := (cos((2k+1)π/(8n)))_{k=0}^{n-1},   s_n := (sin((2k+1)π/(8n)))_{k=0}^{n-1}.

Lemma 2.4 For even n ≥ 4, the matrix C^IV_n can be factorized in the form

  C^IV_n = P_n^T A_n(1) (C^II_{n_1} ⊕ C^II_{n_1}) T_n(1)          (2.8)

with the modified addition matrix

  A_n(1) := (1/√2) [ I'_{n_1}    −V_{n_1} Σ_{n_1} J_{n_1}
                     V_{n_1}^T    I''_{n_1} Σ_{n_1} J_{n_1} ]

and the cross-shaped twiddle matrix

  T_n(1) := (I_{n_1} ⊕ Σ_{n_1}) [ diag(c_{n_1})           J_{n_1} diag(J_{n_1} s_{n_1})
                                  J_{n_1} diag(s_{n_1})  −diag(J_{n_1} c_{n_1}) ].

The two matrices A_n(1) and T_n(1) are orthogonal.
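The factorization (2.8) can also be verified numerically. The Python sketch below (names are ad hoc) applies T_n(1), the two half-size DCT-II blocks, A_n(1), and finally P_n^T exactly in the order of the lemma; the half-size DCT-II is taken directly from its definition (2.1) rather than computed fast, since only the splitting itself is being checked.

```python
from math import cos, sin, pi, sqrt

def eps(n, j):
    return sqrt(2) / 2 if j in (0, n) else 1.0

def dct2_mat(n):
    return [[sqrt(2.0 / n) * eps(n, j) * cos(j * (2 * k + 1) * pi / (2 * n))
             for k in range(n)] for j in range(n)]

def dct4_mat(n):
    return [[sqrt(2.0 / n) * cos((2 * j + 1) * (2 * k + 1) * pi / (4 * n))
             for k in range(n)] for j in range(n)]

def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def dct4_via_lemma(x):
    """C^IV_n x = P_n^T A_n(1) (C^II_{n1} direct-sum C^II_{n1}) T_n(1) x, cf. (2.8)."""
    n = len(x); n1 = n // 2
    c = [cos((2 * k + 1) * pi / (4 * n)) for k in range(n1)]   # c_{n1}
    s = [sin((2 * k + 1) * pi / (4 * n)) for k in range(n1)]   # s_{n1}
    # u := T_n(1) x  (n1 plane rotations/rotation-reflections)
    u = [c[j] * x[j] + s[j] * x[n - 1 - j] for j in range(n1)]
    u += [(-1) ** j * (s[n1 - 1 - j] * x[n1 - 1 - j] - c[n1 - 1 - j] * x[n1 + j])
          for j in range(n1)]
    # apply C^II_{n1} to both halves
    a = matvec(dct2_mat(n1), u[:n1])
    b = matvec(dct2_mat(n1), u[n1:])
    # z := A_n(1) (a; b)  (modified addition matrix)
    z = [a[0]]
    z += [(a[j] + (-1) ** j * b[n1 - j]) / sqrt(2) for j in range(1, n1)]
    z += [(a[j + 1] + (-1) ** j * b[n1 - 1 - j]) / sqrt(2) for j in range(n1 - 1)]
    z += [(-1) ** (n1 - 1) * b[0]]
    # y := P_n^T z  (even-odd interleaving)
    y = [0.0] * n
    for j in range(n1):
        y[2 * j], y[2 * j + 1] = z[j], z[n1 + j]
    return y

n = 8
x = [1.0, -2.0, 0.5, 3.0, -1.5, 2.5, 0.0, 1.25]
dev = max(abs(p - q) for p, q in zip(dct4_via_lemma(x), matvec(dct4_mat(n), x)))
print(dev < 1e-12)
```

The agreement up to machine precision confirms that A_n(1) and T_n(1), as stated above, split C^IV_n into two DCT-II blocks of half size.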
Proof. We show (2.8) again by a divide-and-conquer technique. Therefore we permute the rows of C^IV_n by multiplying with P_n and write the result as a block matrix:

  P_n C^IV_n = (1/√n_1) [ (cos((4j+1)(2k+1)π/(4n)))_{j,k=0}^{n_1-1}   (cos((4j+1)(n+2k+1)π/(4n)))_{j,k=0}^{n_1-1}
                          (cos((4j+3)(2k+1)π/(4n)))_{j,k=0}^{n_1-1}   (cos((4j+3)(n+2k+1)π/(4n)))_{j,k=0}^{n_1-1} ].
Now we consider the single blocks of P_n C^IV_n and represent every block by C^II_{n_1} and S^II_{n_1}.

1. By (2.2) and

  cos((4j+1)(2k+1)π/(4n)) = cos(j(2k+1)π/n) cos((2k+1)π/(4n)) − sin(j(2k+1)π/n) sin((2k+1)π/(4n))

it follows that

  (1/√n_1) (cos((4j+1)(2k+1)π/(4n)))_{j,k=0}^{n_1-1}
    = (1/√2) ( I'_{n_1} C^II_{n_1} diag(c_{n_1}) − V_{n_1} S^II_{n_1} diag(s_{n_1}) )
    = (1/√2) ( I'_{n_1} C^II_{n_1} diag(c_{n_1}) − V_{n_1} Σ_{n_1} S^II_{n_1} J_{n_1} diag(s_{n_1}) ).          (2.9)

2. By (2.2) and

  cos((4j+3)(2k+1)π/(4n)) = cos((j+1)(2k+1)π/n) cos((2k+1)π/(4n)) + sin((j+1)(2k+1)π/n) sin((2k+1)π/(4n))

we obtain that

  (1/√n_1) (cos((4j+3)(2k+1)π/(4n)))_{j,k=0}^{n_1-1}
    = (1/√2) ( V_{n_1}^T C^II_{n_1} diag(c_{n_1}) + I''_{n_1} S^II_{n_1} diag(s_{n_1}) )
    = (1/√2) ( V_{n_1}^T C^II_{n_1} diag(c_{n_1}) + I''_{n_1} Σ_{n_1} S^II_{n_1} J_{n_1} diag(s_{n_1}) ).          (2.10)

3. By (2.2) and

  cos((4j+1)(n+2k+1)π/(4n)) = (−1)^j ( cos(j(2k+1)π/n) sin((n−2k−1)π/(4n)) − sin(j(2k+1)π/n) cos((n−2k−1)π/(4n)) )

it follows that

  (1/√n_1) (cos((4j+1)(n+2k+1)π/(4n)))_{j,k=0}^{n_1-1}
    = (1/√2) ( Σ_{n_1} I'_{n_1} C^II_{n_1} diag(J_{n_1} s_{n_1}) − Σ_{n_1} V_{n_1} S^II_{n_1} diag(J_{n_1} c_{n_1}) )
    = (1/√2) ( I'_{n_1} C^II_{n_1} J_{n_1} diag(J_{n_1} s_{n_1}) + V_{n_1} Σ_{n_1} S^II_{n_1} diag(J_{n_1} c_{n_1}) ).          (2.11)

Here we have used that Σ_{n_1} I'_{n_1} = I'_{n_1} Σ_{n_1} and Σ_{n_1} V_{n_1} = −V_{n_1} Σ_{n_1}.

4. By (2.2) and

  cos((4j+3)(n+2k+1)π/(4n)) = −(−1)^j ( cos((j+1)(2k+1)π/n) sin((n−2k−1)π/(4n)) + sin((j+1)(2k+1)π/n) cos((n−2k−1)π/(4n)) )
we obtain that

  (1/√n_1) (cos((4j+3)(n+2k+1)π/(4n)))_{j,k=0}^{n_1-1}
    = −(1/√2) ( Σ_{n_1} V_{n_1}^T C^II_{n_1} diag(J_{n_1} s_{n_1}) + Σ_{n_1} I''_{n_1} S^II_{n_1} diag(J_{n_1} c_{n_1}) )
    = (1/√2) ( V_{n_1}^T Σ_{n_1} C^II_{n_1} diag(J_{n_1} s_{n_1}) − I''_{n_1} Σ_{n_1} S^II_{n_1} diag(J_{n_1} c_{n_1}) )
    = (1/√2) ( V_{n_1}^T C^II_{n_1} J_{n_1} diag(J_{n_1} s_{n_1}) − I''_{n_1} Σ_{n_1} S^II_{n_1} diag(J_{n_1} c_{n_1}) ).          (2.12)

Using the relations (2.9)-(2.12) we get the following factorization of P_n C^IV_n:

  P_n C^IV_n = (1/√2) [ I'_{n_1}    −V_{n_1} Σ_{n_1}
                        V_{n_1}^T    I''_{n_1} Σ_{n_1} ] (C^II_{n_1} ⊕ S^II_{n_1}) [ diag(c_{n_1})           J_{n_1} diag(J_{n_1} s_{n_1})
                                                                                     J_{n_1} diag(s_{n_1})  −diag(J_{n_1} c_{n_1}) ].

Thus (2.8) follows by the intertwining relation (2.3), namely S^II_{n_1} = J_{n_1} C^II_{n_1} Σ_{n_1}. The orthogonality of A_n(1) and T_n(1) can be shown by simple calculation. Note that T_n(1) consists only of n_1 plane rotations/rotation-reflections. □

Corollary 2.5 Let n ∈ N, n ≥ 8 with n ≡ 0 (mod 4) be given. Then the matrices C^II_n,
C^IV_n, and C^I_{n+1} can be factorized as follows:

  C^II_n = P_n^T (P_{n_1}^T ⊕ P_{n_1}^T)(I_{n_1} ⊕ A_{n_1}(1))(C^II_{n_2} ⊕ C^IV_{n_2} ⊕ C^II_{n_2} ⊕ C^II_{n_2})(T_{n_1}(0) ⊕ T_{n_1}(1)) T_n(0),
  C^IV_n = P_n^T A_n(1)(P_{n_1}^T ⊕ P_{n_1}^T)(C^II_{n_2} ⊕ C^IV_{n_2} ⊕ C^II_{n_2} ⊕ C^IV_{n_2})(T_{n_1}(0) ⊕ T_{n_1}(0)) T_n(1),
  C^I_{n+1} = P̄_{n+1}^T (P̄_{n_1+1}^T ⊕ T_{n_1}(0)^T)(C^I_{n_2+1} ⊕ C^III_{n_2} ⊕ C^III_{n_2} ⊕ C^IV_{n_2})(T_{n_1+1}(2) ⊕ P_{n_1}) T_{n+1}(2)

with n_1 := n/2 and n_2 := n/4.

Now we give a detailed description of the structures of C^II_4, C^IV_4, C^II_8, and C^IV_8. Here, we want to assume that a matrix-vector product of the form

  [ a_0    a_1    [ x_0
   −a_1    a_0 ]    x_1 ]
is realized by 2 additions and 4 multiplications (see Remarks 2.10 and 2.11). As we shall see, the above factorization of C^II_n for n = 8 can be used to construct a fast algorithm which needs only 10 butterfly operations and 3 rotations/rotation-reflections.

Example 2.6 For n = 4 we obtain by Lemma 2.2 that C^II_4 = P_4^T (C^II_2 ⊕ C^IV_2) T_4(0) with

  C^II_2 = (1/√2) [ 1    1
                    1   −1 ],   C^IV_2 = [ cos(π/8)    sin(π/8)
                                           sin(π/8)   −cos(π/8) ],   T_4(0) = (1/√2) [ I_2    J_2
                                                                                       I_2   −J_2 ].

Note that P_4^T = P_4. This yields

  C^II_4 = (1/2) P_4 ( [ 1    1          ⊕  √2 [ cos(π/8)    sin(π/8)   ) [ I_2    J_2
                         1   −1 ]                sin(π/8)   −cos(π/8) ]     I_2   −J_2 ]

such that C^II_4 x with x ∈ R^4 can be computed with 8 additions and 4 multiplications. The final scaling with 1/2 is not counted. We see that only 3 butterfly operations and 1 scaled rotation-reflection are required.
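The operation count of Example 2.6 can be made concrete. The sketch below (Python; the function name is ad hoc) performs exactly the 8 additions and 4 multiplications of the factorized form and compares the result with the defining matrix-vector product.

```python
from math import cos, sin, pi, sqrt

def dct2_4_fast(x):
    """C^II_4 x via Example 2.6: 8 additions, 4 multiplications,
    plus the final scaling by 1/2 (not counted in the text)."""
    # butterfly [I_2 J_2; I_2 -J_2]                       -- 4 additions
    u0, u1, u2, u3 = x[0] + x[3], x[1] + x[2], x[0] - x[3], x[1] - x[2]
    # butterfly sqrt(2) C^II_2 on (u0, u1)                -- 2 additions
    v0, v1 = u0 + u1, u0 - u1
    # scaled rotation-reflection sqrt(2) C^IV_2 on (u2, u3): 4 mult, 2 add
    c, s = sqrt(2) * cos(pi / 8), sqrt(2) * sin(pi / 8)
    v2, v3 = c * u2 + s * u3, s * u2 - c * u3
    # permutation P_4 and final scaling by 1/2
    return [v0 / 2, v2 / 2, v1 / 2, v3 / 2]

def dct2_4_direct(x):
    """C^II_4 x directly from the entry formula (2.1)."""
    n = 4
    e = lambda j: sqrt(2) / 2 if j == 0 else 1.0
    return [sum(sqrt(2.0 / n) * e(j) * cos(j * (2 * k + 1) * pi / (2 * n)) * x[k]
                for k in range(n)) for j in range(n)]

x = [4.0, 2.0, -1.0, 3.0]
dev = max(abs(p - q) for p, q in zip(dct2_4_fast(x), dct2_4_direct(x)))
print(dev < 1e-12)
```

The two scaled rotation-reflection rows account for all 4 multiplications; everything else is butterflies.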
Example 2.7 For n = 4 we obtain by Lemma 2.4 that

  C^IV_4 = P_4^T A_4(1)(C^II_2 ⊕ C^II_2) T_4(1) = (1/2) P_4^T (√2 A_4(1))(√2 C^II_2 ⊕ √2 C^II_2) T_4(1)

with

  √2 A_4(1) = [ √2    0     0     0
                 0    1     0    −1
                 0    1     0     1
                 0    0   −√2     0 ],

  T_4(1) = [ cos(π/16)     0            0            sin(π/16)
             0             cos(3π/16)   sin(3π/16)   0
             0             sin(3π/16)  −cos(3π/16)   0
            −sin(π/16)     0            0            cos(π/16) ].

Hence, we can compute C^IV_4 x with x ∈ R^4 with 10 additions and 10 multiplications. We see that only 3 butterfly operations, 1 rotation, and 1 rotation-reflection are required.

Example 2.8 For n = 8 we obtain by Corollary 2.5 that

  C^II_8 = P_8^T (P_4^T ⊕ P_4^T)(I_4 ⊕ A_4(1))(C^II_2 ⊕ C^IV_2 ⊕ C^II_2 ⊕ C^II_2)(T_4(0) ⊕ T_4(1)) T_8(0)

with

  T_8(0) = (1/√2) [ I_4    J_4
                    I_4   −J_4 ].

Note that B_8 := P_8^T (P_4^T ⊕ P_4^T) coincides with the bit reversal matrix (see [24], pp. 36-43). This yields the factorization

  C^II_8 = (√2/4) B_8 (I_4 ⊕ A_4(1))(√2 C^II_2 ⊕ √2 C^IV_2 ⊕ √2 C^II_2 ⊕ √2 C^II_2)(√2 T_4(0) ⊕ √2 T_4(1)) [ I_4    J_4
                                                                                                              I_4   −J_4 ],

which is very similar to that of Loeffler et al. [14]. Note that

  √2 C^II_2 = [ 1    1
                1   −1 ]

is a butterfly matrix. If we compute C^II_8 x for arbitrary x ∈ R^8, then the algorithm based on the above factorization requires only 26 additions and 14 multiplications (not including the final scaling by √2/4). Further we see that only 10 (scaled) butterfly operations, 1 scaled rotation, and 2 scaled rotation-reflections are used.

Example 2.9 For n = 8 we obtain by Corollary 2.5 that

  C^IV_8 = P_8^T A_8(1)(P_4^T ⊕ P_4^T)(C^II_2 ⊕ C^IV_2 ⊕ C^II_2 ⊕ C^IV_2)(T_4(0) ⊕ T_4(0)) T_8(1)
         = (√2/4) P_8^T (√2 A_8(1))(P_4^T ⊕ P_4^T)(√2 C^II_2 ⊕ √2 C^IV_2 ⊕ √2 C^II_2 ⊕ √2 C^IV_2)(√2 T_4(0) ⊕ √2 T_4(0)) T_8(1)

with the cross-shaped twiddle matrix

  T_8(1) = [  c_1    0      0      0      0      0      0      s_1
              0      c_3    0      0      0      0      s_3    0
              0      0      c_5    0      0      s_5    0      0
              0      0      0      c_7    s_7    0      0      0
              0      0      0      s_7   −c_7    0      0      0
              0      0     −s_5    0      0      c_5    0      0
              0      s_3    0      0      0      0     −c_3    0
             −s_1    0      0      0      0      0      0      c_1 ],

where c_k := cos(kπ/32) and s_k := sin(kπ/32) for odd k, and the modified addition matrix

  A_8(1) = (1/√2) [ √2    0     0     0     0     0     0     0
                     0    1     0     0     0     0     0    −1
                     0    0     1     0     0     0     1     0
                     0    0     0     1     0    −1     0     0
                     0    1     0     0     0     0     0     1
                     0    0     1     0     0     0    −1     0
                     0    0     0     1     0     1     0     0
                     0    0     0     0   −√2     0     0     0 ].
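The complete length-8 pipeline of Example 2.8 fits in a few lines. The Python sketch below (function names are ad hoc) executes exactly the 26 additions and 14 multiplications of the factorization, applied right to left, and compares the output with the defining C^II_8.

```python
from math import cos, sin, pi, sqrt

C8, S8 = sqrt(2) * cos(pi / 8), sqrt(2) * sin(pi / 8)
C116, S116 = sqrt(2) * cos(pi / 16), sqrt(2) * sin(pi / 16)
C316, S316 = sqrt(2) * cos(3 * pi / 16), sqrt(2) * sin(3 * pi / 16)
R2 = 1 / sqrt(2)

def dct2_8_fast(x):
    """C^II_8 x via the factorization of Example 2.8:
    26 additions, 14 multiplications, final scaling sqrt(2)/4."""
    # butterfly [I_4 J_4; I_4 -J_4]                         -- 8 additions
    u = [x[i] + x[7 - i] for i in range(4)] + [x[i] - x[7 - i] for i in range(4)]
    # sqrt(2) T_4(0): butterfly on the first half           -- 4 additions
    v = [u[0] + u[3], u[1] + u[2], u[0] - u[3], u[1] - u[2],
         # sqrt(2) T_4(1): two scaled rotations             -- 4 add, 8 mult
         C116 * u[4] + S116 * u[7], C316 * u[5] + S316 * u[6],
         S316 * u[5] - C316 * u[6], -S116 * u[4] + C116 * u[7]]
    # cosine blocks sqrt(2)C^II_2 / sqrt(2)C^IV_2           -- 8 add, 4 mult
    w = [v[0] + v[1], v[0] - v[1], C8 * v[2] + S8 * v[3], S8 * v[2] - C8 * v[3],
         v[4] + v[5], v[4] - v[5], v[6] + v[7], v[6] - v[7]]
    # I_4 direct-sum A_4(1)                                 -- 2 add, 2 mult
    z = [w[0], w[1], w[2], w[3], w[4], R2 * (w[5] - w[7]), R2 * (w[5] + w[7]), -w[6]]
    # bit reversal B_8 and final scaling by sqrt(2)/4
    f = sqrt(2) / 4
    return [f * z[0], f * z[4], f * z[2], f * z[6], f * z[1], f * z[5], f * z[3], f * z[7]]

def dct2_8_direct(x):
    e = lambda j: sqrt(2) / 2 if j == 0 else 1.0
    return [sum(0.5 * e(j) * cos(j * (2 * k + 1) * pi / 16) * x[k] for k in range(8))
            for j in range(8)]

x = [3.0, 1.0, -2.0, 0.5, 4.0, -1.0, 2.0, 0.25]
dev = max(abs(p - q) for p, q in zip(dct2_8_fast(x), dct2_8_direct(x)))
print(dev < 1e-12)
```

All 14 multiplications sit in the three scaled rotations/rotation-reflections (12) and in A_4(1) (2); this is the count quoted in Example 2.8.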
Remark 2.10 In [9, 14] one can find the identity

  [ a_0    a_1    [ x_0       [ 1  1  0                                 [ 1    0
   −a_1    a_0 ]    x_1 ]  =    0  1  1 ] diag(a_0+a_1, −a_1, a_0−a_1)    1   −1    [ x_0
                                                                          0    1 ]    x_1 ].

If the terms a_0 + a_1 and a_0 − a_1 are precomputed, then this formula suggests an algorithm with 3 additions and 3 multiplications. Using this method, the DCT-II of length 8 in Example 2.8 needs only 11 multiplications and 29 additions. However, the numerical stability of this method is worse than that of the usual algorithm with 2 additions and 4 multiplications (see [27]).
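As a concrete illustration of this 3-multiplication scheme (the helper name below is our own), the three products are taken with a_0 + a_1, a_1, and a_0 − a_1, where the sum and difference are assumed precomputed:

```python
def rot_3mul(a0, a1, x0, x1):
    """[a0 a1; -a1 a0] (x0, x1)^T with 3 multiplications and 3 additions,
    assuming a0 + a1 and a0 - a1 are precomputed (cf. Remark 2.10)."""
    a_sum, a_diff = a0 + a1, a0 - a1   # precomputed, not counted
    d = x0 - x1                        # addition 1
    m = a1 * d                         # multiplication 1
    y0 = a_sum * x0 - m                # multiplication 2, addition 2
    y1 = a_diff * x1 - m               # multiplication 3, addition 3
    return y0, y1

a0, a1, x0, x1 = 0.8, -0.35, 1.5, 2.0
y = rot_3mul(a0, a1, x0, x1)
print(abs(y[0] - (a0 * x0 + a1 * x1)) < 1e-12,
      abs(y[1] - (-a1 * x0 + a0 * x1)) < 1e-12)
```

Trading a multiplication for an addition in this way is what reduces the counts of Example 2.8 from 14/26 to 11/29.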
Remark 2.11 For a rotation matrix

  [ cos φ    sin φ
   −sin φ    cos φ ]   (φ ∈ (−π, π)),

there is a second way to realize the matrix-vector multiplication with only 3 multiplications and 3 additions. Namely, using 3 lifting steps [5], one finds the factorization

  [ cos φ    sin φ     [ 1   tan(φ/2)   [ 1        0   [ 1   tan(φ/2)
   −sin φ    cos φ ] =   0   1        ]  −sin φ    1 ]   0   1        ].

The above matrix factors are not orthogonal. However, it has been shown (see [27]) that for a small rotation angle φ the roundoff error is less than for the classical rotation. The same method is also applicable to the rotation-reflection matrix

  [ cos φ    sin φ
    sin φ   −cos φ ]   (φ ∈ (−π, π)).
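The three lifting steps read off directly from the factorization above; in code (the function name is our own, with tan(φ/2) and sin φ treated as precomputable constants):

```python
from math import sin, cos, tan

def rot_lifting(phi, x0, x1):
    """[cos phi, sin phi; -sin phi, cos phi] (x0, x1)^T via 3 lifting steps
    (cf. Remark 2.11): 3 multiplications and 3 additions."""
    t = tan(phi / 2)       # precomputable
    s = sin(phi)           # precomputable
    x0 = x0 + t * x1       # lifting step 1
    x1 = x1 - s * x0       # lifting step 2
    x0 = x0 + t * x1       # lifting step 3
    return x0, x1

phi, x0, x1 = 0.3, 1.2, -0.7
y = rot_lifting(phi, x0, x1)
print(abs(y[0] - (cos(phi) * x0 + sin(phi) * x1)) < 1e-12,
      abs(y[1] - (-sin(phi) * x0 + cos(phi) * x1)) < 1e-12)
```

Each lifting step is trivially invertible by negating its coefficient, which is why lifting is also popular for integer-to-integer transforms.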
3 Split-radix FFT

The unitary Fourier matrix

  F_n := (1/√n) (w_n^{jk})_{j,k=0}^{n-1}

with w_n := exp(−2πi/n) is closely related to the cosine and sine matrices. A standard FFT is based on a factorization of the Fourier matrix into a product of sparse, (almost) unitary matrices. A complex matrix A_n is called almost unitary if A_n Ā_n^T = γ I_n for some γ > 0. A real, almost unitary matrix is called almost orthogonal.
Lemma 3.1 For even n ≥ 4, the Fourier matrix F_n can be factorized as follows:

  F_n = P_n^T (F_{n_1} ⊕ F_{n_1}) U_n,          (3.1)
  F_n = U_n^T (F_{n_1} ⊕ F_{n_1}) P_n          (3.2)

with the unitary matrix

  U_n := (I_{n_1} ⊕ diag(w_n^k)_{k=0}^{n_1-1}) (1/√2) [ I_{n_1}    I_{n_1}
                                                        I_{n_1}   −I_{n_1} ].

Proof. We only sketch the proof of (3.1). Multiplying F_n with the permutation matrix P_n we obtain the block matrix

  P_n F_n = (1/√n) [ (w_n^{2jk})_{j,k=0}^{n_1-1}       (w_n^{2j(n_1+k)})_{j,k=0}^{n_1-1}
                     (w_n^{(2j+1)k})_{j,k=0}^{n_1-1}   (w_n^{(2j+1)(n_1+k)})_{j,k=0}^{n_1-1} ].

The four blocks can be expressed in the form

  P_n F_n = (1/√2) [ F_{n_1}                                F_{n_1}
                     F_{n_1} diag(w_n^k)_{k=0}^{n_1-1}     −F_{n_1} diag(w_n^k)_{k=0}^{n_1-1} ]
          = (F_{n_1} ⊕ F_{n_1}) (I_{n_1} ⊕ diag(w_n^k)_{k=0}^{n_1-1}) (1/√2) [ I_{n_1}    I_{n_1}
                                                                               I_{n_1}   −I_{n_1} ]
          = (F_{n_1} ⊕ F_{n_1}) U_n.

Obviously we have U_n Ū_n^T = I_n. This completes the proof of (3.1). □
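One step of the radix-2 splitting (3.1) is easy to exercise numerically. The sketch below (Python with the standard cmath module; names are illustrative) applies U_n, two half-size Fourier matrices, and the even-odd interleaving P_n^T, and compares the result with the full F_n.

```python
import cmath
from math import sqrt

def fourier(n):
    """Unitary Fourier matrix F_n = n^{-1/2} (w_n^{jk}), w_n = exp(-2*pi*i/n)."""
    w = cmath.exp(-2j * cmath.pi / n)
    return [[w ** (j * k) / sqrt(n) for k in range(n)] for j in range(n)]

def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def fft_step(x):
    """One radix-2 step: F_n x = P_n^T (F_{n1} direct-sum F_{n1}) U_n x, cf. (3.1)."""
    n = len(x); n1 = n // 2
    w = cmath.exp(-2j * cmath.pi / n)
    # U_n x: butterfly, then twiddle factors w_n^k on the second half
    u = [(x[k] + x[n1 + k]) / sqrt(2) for k in range(n1)]
    u += [w ** k * (x[k] - x[n1 + k]) / sqrt(2) for k in range(n1)]
    # half-size transforms and even-odd interleaving P_n^T
    a = matvec(fourier(n1), u[:n1])
    b = matvec(fourier(n1), u[n1:])
    y = [0j] * n
    for j in range(n1):
        y[2 * j], y[2 * j + 1] = a[j], b[j]
    return y

x = [complex(k, 1 - k) for k in range(8)]
err = max(abs(p - q) for p, q in zip(fft_step(x), matvec(fourier(8), x)))
print(err < 1e-12)
```

Applying this step recursively to the half-size blocks yields the radix-2 FFT mentioned below.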
Remark 3.2 In signal processing, the even-odd permutation of the rows of F_n (as realized in the above proof of (3.1)) is called "decimation in frequency", since the frequency-dependent row indices are decimated into even and odd indices. The even-odd permutation of the columns of F_n (corresponding to (3.2)) is called "decimation in time", since the time-dependent column indices are split (see [24], pp. 67-68).
In the case n = 2^t (t ≥ 2), recursive application of (3.1) provides the Gentleman-Sande factorization of F_n (see [24], pp. 65-67). Recursive application of the transposed factorization yields the well-known Cooley-Tukey factorization (see [24], pp. 17-21). The corresponding FFTs are also known as radix-2 algorithms. A better arithmetical complexity can be achieved using the so-called split-radix algorithm. The idea of the split-radix FFT, due to [6, 7], can be described as a clever synthesis of radix-2 and radix-4 FFTs (see [24], pp. 111-119). In this section we want to derive a new approach to the split-radix FFT which is completely based on the matrix factorization of F_n. In the next sections, we will apply the same ideas to derive new split-radix DCT algorithms.

Let n = 2^t (t ≥ 2). We set n_s := n 2^{−s} = 2^{t−s} for s = 0, …, t−1. Further, we introduce the odd Fourier matrix

  F_n^{(1)} := (1/√n) (w_{2n}^{(2j+1)k})_{j,k=0}^{n-1}.

This matrix is unitary, too. Note that F_n is also called the even Fourier matrix. We shall show that F_n can be split into F_{n_1} ⊕ F_{n_1}^{(1)} in a first step, then F_{n_1} ⊕ F_{n_1}^{(1)} can be split into F_{n_2} ⊕ F_{n_2}^{(1)} ⊕ F_{n_2} ⊕ F_{n_2} in a second step, and so on. This procedure follows from the next lemma.

Lemma 3.3 For even n ≥ 4, the matrices F_n and F_n^{(1)} can be factorized as follows:

  F_n = P_n^T (F_{n_1} ⊕ F_{n_1}^{(1)}) W_n(0),          (3.3)
  F_n^{(1)} = P_n^T (F_{n_1} ⊕ F_{n_1}) W_n(1)          (3.4)

with the unitary matrices

  W_n(0) := (1/√2) [ I_{n_1}    I_{n_1}
                     I_{n_1}   −I_{n_1} ],   W_n(1) := (D_{n_1} ⊕ D_{n_1}^3) W_n(0) (I_{n_1} ⊕ (−i I_{n_1})),

where D_{n_1} := diag(w_{2n}^k)_{k=0}^{n_1-1}. The proof is similar to that of Lemma 3.1 and is omitted here.

For n = 2^t (t ≥ 3), we conclude from Lemma 3.3 that

  F_n = P_n^T (P_{n_1}^T ⊕ P_{n_1}^T)(F_{n_2} ⊕ F_{n_2}^{(1)} ⊕ F_{n_2} ⊕ F_{n_2})(W_{n_1}(0) ⊕ W_{n_1}(1)) W_n(0).

Recursive application of Lemma 3.3 provides the split-radix factorization of F_n on which the split-radix FFT is based. The factorization steps can be illustrated in the following diagram:

  step s = 0:  F_n
  step s = 1:  F_{n_1}, F_{n_1}^{(1)}
  step s = 2:  F_{n_2}, F_{n_2}^{(1)}, F_{n_2}, F_{n_2}
  step s = 3:  F_{n_3}, F_{n_3}^{(1)}, F_{n_3}, F_{n_3}, F_{n_3}, F_{n_3}^{(1)}, F_{n_3}, F_{n_3}^{(1)}

Now we shall answer the following question: at which positions k ∈ {1, …, 2^s} in step s ∈ {0, …, t−1} do F_{n_s} and F_{n_s}^{(1)} occur? We introduce the binary vectors β_s := (β_s(1), …, β_s(2^s)). If F_{n_s} stands at position k in step s, then let β_s(k) := 0. We put β_s(k) := 1 if F_{n_s}^{(1)} stands at position k in step s (see [24], pp. 115-119). By Lemma 3.3, from β_s(k) = 0 it follows immediately that β_{s+1}(2k−1) = 0 and β_{s+1}(2k) = 1. Further, from β_s(k) = 1 it follows that β_{s+1}(2k−1) = β_{s+1}(2k) = 0. Now, our diagram looks as follows:

  β_0 = (0)
  β_1 = (0, 1)
  β_2 = (0, 1, 0, 0)
  β_3 = (0, 1, 0, 0, 0, 1, 0, 1)
Lemma 3.4 Let t ∈ N with t ≥ 2 be given and β_0 := (0). Then

  β_{s+1} = (β_s, β̃_s),   s = 0, …, t−2,          (3.5)

where β̃_s equals β_s with the exception that the last bit position is reversed. Further,

  ‖β_s‖_1 = Σ_{k=1}^{2^s} β_s(k) = (1/3)(2^s − (−1)^s)          (3.6)

is the number of ones in the binary vector β_s.
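Both statements of Lemma 3.4 are easy to reproduce in code. The sketch below (names are ours) generates the pointer vectors by the recursion (3.5) and checks the closed form (3.6) for the number of ones.

```python
def beta_vectors(t):
    """Pointer vectors beta_0, ..., beta_{t-1} of Lemma 3.4:
    beta_{s+1} = (beta_s, beta_s with the last bit flipped)."""
    betas = [[0]]
    for _ in range(t - 1):
        b = betas[-1]
        betas.append(b + b[:-1] + [1 - b[-1]])
    return betas

bs = beta_vectors(6)
print(bs[3])                                     # [0, 1, 0, 0, 0, 1, 0, 1]
for s, b in enumerate(bs):
    assert sum(b) == (2 ** s - (-1) ** s) // 3   # closed form (3.6)
```

The vector bs[3] reproduces the diagram above, and the assertion verifies (3.6) for s = 0, …, 5.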
Proof. The assertion (3.5) follows by induction. For s = 0 we have β_0 = (0) and β_1 = (0, 1) by Lemma 3.3, such that β_1 = (β_0, β̃_0). Assume that for an arbitrary s ∈ {1, …, t−2},

  β_s = (β_{s−1}, β̃_{s−1}).

Then we have by definition

  β̃_s = (β_{s−1}, β_{s−1}).          (3.7)

In step s+1, the first part of the vector β_s, namely β_{s−1}, turns into (β_{s−1}, β̃_{s−1}) by our assumption. The second part of β_s, namely β̃_{s−1}, changes into (β_{s−1}, β_{s−1}) by assumption. Consequently, we obtain β_{s+1} = (β_{s−1}, β̃_{s−1}, β_{s−1}, β_{s−1}) = (β_s, β̃_s) by (3.7).

The 1-norm of β_s equals the number of ones in β_s. By (3.3) we know that β_s(k) = 0 implies β_{s+1}(2k−1) = 0 and β_{s+1}(2k) = 1. By (3.4) we see that β_s(k) = 1 implies β_{s+1}(2k−1) = β_{s+1}(2k) = 0. Thus we have

  ‖β_{s+1}‖_1 + ‖β_s‖_1 = 2^s,   ‖β_0‖_1 = 0.

Using classical difference equation theory we obtain (3.6). □

We introduce the permutation matrices

  P_n(s) := P_{n_s}^T ⊕ ⋯ ⊕ P_{n_s}^T   (2^s summands),   s = 0, …, t−2,

and for each pointer β_s

  W_n(β_s) := W_{n_s}(β_s(1)) ⊕ ⋯ ⊕ W_{n_s}(β_s(2^s)),   s = 0, …, t−1,

with W_{n_s}(0) and W_{n_s}(1) as in Lemma 3.3. Note that B_n := P_n(0) ⋯ P_n(t−2) is the bit reversal matrix (see [24], pp. 36-43). Further, W_n(β_{t−1}) is a diagonal block matrix, where each block in the diagonal is either

  F_2 = (1/√2) [ 1    1      or   F_2^{(1)} = (1/√2) [ 1   −i
                 1   −1 ]                              1    i ].

Recursive application of Lemma 3.3 provides the split-radix factorization of F_n.

Theorem 3.5 Let n = 2^t (t ≥ 3). Then the Fourier matrix F_n can be factorized into the following product of sparse unitary matrices:

  F_n = B_n W_n(β_{t−1}) W_n(β_{t−2}) ⋯ W_n(β_0).

In order to reduce the number of multiplications, the factor √2/2 in each matrix W_n(β_s) is moved to the end of the calculation. From

  F_n = (1/√n) B_n (√2 W_n(β_{t−1})) (√2 W_n(β_{t−2})) ⋯ (√2 W_n(β_0))

we obtain the following split-radix FFT:
Algorithm 3.6 [Split-radix FFT]
Input: n = 2^t (t ≥ 3), x ∈ C^n.
1. For s = 3, …, t precompute the roots of unity w_{2^s}^k (k = 1, …, 2^{s−2} − 1).
2. For s = 0, …, t−2 form β_{s+1} := (β_s, β̃_s) with β_0 = (0).
3. Put x^{(0)} := x and compute for s = 0, …, t−1

  x^{(s+1)} := √2 W_n(β_s) x^{(s)}.

4. Permute x^{(t+1)} := B_n x^{(t)}.
5. Multiply with the scaling factor y := (1/√n) x^{(t+1)}.
Output: y = F_n x.

The split-radix FFT is known to be one of the FFTs with the lowest arithmetical complexity; in particular, it has a lower complexity than the radix-2 and radix-4 FFTs (see e.g. [24], p. 119).
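The radix-2/4 splitting underlying Algorithm 3.6 can also be written recursively. The following Python sketch (an unnormalized, decimation-in-time variant for illustration, not the matrix-factorized Algorithm 3.6 itself) splits the input into even samples, samples 4j+1, and samples 4j+3, and is compared against a naive DFT.

```python
import cmath

def split_radix_fft(x):
    """Unnormalized split-radix DFT: X_k = sum_j x_j w_n^{jk}, w_n = exp(-2*pi*i/n).
    Recursive sketch of the radix-2/radix-4 synthesis behind Algorithm 3.6."""
    n = len(x)
    if n == 1:
        return list(x)
    if n == 2:
        return [x[0] + x[1], x[0] - x[1]]
    u = split_radix_fft(x[0::2])      # even indices: length n/2
    z = split_radix_fft(x[1::4])      # indices 4j+1: length n/4
    zp = split_radix_fft(x[3::4])     # indices 4j+3: length n/4
    X = [0j] * n
    m = n // 4
    for k in range(m):
        w = cmath.exp(-2j * cmath.pi * k / n)       # twiddle w_n^k
        w3 = cmath.exp(-6j * cmath.pi * k / n)      # twiddle w_n^{3k}
        p, q = w * z[k], w3 * zp[k]
        X[k] = u[k] + (p + q)
        X[k + n // 2] = u[k] - (p + q)
        X[k + m] = u[k + m] - 1j * (p - q)
        X[k + 3 * m] = u[k + m] + 1j * (p - q)
    return X

def naive_dft(x):
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

x = [complex(j % 3, -j) for j in range(16)]
err = max(abs(a - b) for a, b in zip(split_radix_fft(x), naive_dft(x)))
print(err < 1e-9)
```

The twiddles w_n^k and w_n^{3k} on the two odd quarters are exactly the diagonal blocks D_{n_1} and D_{n_1}^3 of W_n(1).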
4 Split-radix DCT-II algorithms
In this section, we present split-radix algorithms for the DCT-II of radix-2 length. Recursive application of (2.4) and (2.8) provides the split-radix factorization of C^II_n. The corresponding complete factorization can be derived similarly as worked out in Section 3 for the Fourier matrix. Note that there does not exist a factorization of C^II_n or C^IV_n which is analogous to that of Lemma 3.1. Therefore a split-radix DCT-II algorithm seems to be very natural. A split-radix DCT-III algorithm follows immediately from a split-radix factorization of C^II_n by transposing.

Let n = 2^t (t ≥ 2) be given. Further, let n_s := 2^{t−s} (s = 0, …, t−1). In the first factorization step, C^II_n is split into C^II_{n_1} ⊕ C^IV_{n_1} by (2.4). Then in the second step, we use (2.4) and (2.8) in order to split C^II_{n_1} ⊕ C^IV_{n_1} into C^II_{n_2} ⊕ C^IV_{n_2} ⊕ C^II_{n_2} ⊕ C^II_{n_2}. We continue this procedure until blocks of order 2 are reached. Finally, we obtain the split-radix factorization of C^II_n. The first factorization steps are illustrated by the following diagram:

  step s = 0:  C^II_n
  step s = 1:  C^II_{n_1}, C^IV_{n_1}
  step s = 2:  C^II_{n_2}, C^IV_{n_2}, C^II_{n_2}, C^II_{n_2}
  step s = 3:  C^II_{n_3}, C^IV_{n_3}, C^II_{n_3}, C^II_{n_3}, C^II_{n_3}, C^IV_{n_3}, C^II_{n_3}, C^IV_{n_3}

As in Section 3, we introduce binary vectors β_s = (β_s(1), …, β_s(2^s)) for s ∈ {0, …, t−1}. We put β_s(k) := 0 if C^II_{n_s} stands at position k ∈ {1, …, 2^s} in step s, and β_s(k) := 1 if C^IV_{n_s} stands at position k in step s. These pointers β_s have the same properties as in Section 3.
s
s
s
s
s
s
s
s
s
s
s
s
s
s
0
1
1
2
2
of sparse orthogonal matrices
CnII = (Pn (0)An ( 0)) : : : (Pn (t
2)An ( t ))Tn( t ) : : : Tn( ): (4.1) Proof. The split{radix factorization (4.1) follows immediately from Cn ( s ) = Pn (s)An ( s )Cn ( s )Tn ( s ); s = 0; : : : ; t 2: (4.2) By de nition of Pn (s), An( s), Cn ( s) and Tn( s), these formulas (4.2) are direct consequences of Lemma 2.2 and Lemma 2.4. 2 p In order to reduce the number of multiplications, we move the factors 1= 2 in the matrices Tn ( s ), s = 0; : : : ; t 1 to the p end of the calculation and obtain with the scaled, almost orthogonal matrices S ( ) := 2 T ( ) the factorization 2
1
0
+1
n
n
s
s
CnII = p1n (Pn (0)An ( 0)) : : : (Pn (t
2)An( t ))Sn( t ) : : : Sn( ): (4.3) This implies the following fast algorithm for the DCT{II of radix{2 length: Algorithm 4.2 [Split{radix DCT{II algorithm] Input: n = 2t (t 2); x 2 Rn. p 1. Precompute 2=2 pand for s = 0; : : : ; t 2 and k = 0; : : : ; 2s 1 the values p k 2 cos +3 and 2 sin k +3 . 2. For s = 0; : : : ; t 2 form s := ( s ; ~ s ) with := (0). 3. Put x := x and compute for s = 0; : : : ; t 1 x s := Sn ( s )x s : (2 +1) 2s
4. For s = 0, …, t−2 compute

  x^{(t+s+1)} := P_n(t−2−s) A_n(β_{t−2−s}) x^{(t+s)}.
5. Multiply with scaling factor y := pn x t . Output: y = CnII x. Now we determine the arithmetical complexity of this split{radix DCT{II algorithm. As usual, the nal scaling by n = is not counted. For an arbitrary real matrix An of order n, let (An ) and (An ) denote the number of additions and multiplications for computing An x with arbitrary x 2 Rn. Theorem 4.3 Let n = 2t (t 2) be given. Then the split{radix Algorithm 4.2 for the DCT{II of length n possesses the following arithmetical complexity n ( 1)t + 1; (CnII ) = nt (CnII ) = nt n + ( 1)t + 1: Proof. By (4.3) it follows that 1
\alpha(C_n^{II}) = \sum_{s=0}^{t-1} \alpha(S_n(\varepsilon_s)) + \sum_{s=0}^{t-2} \alpha(A_n(\varepsilon_s)),   (4.4)

\mu(C_n^{II}) = \sum_{s=0}^{t-1} \mu(S_n(\varepsilon_s)) + \sum_{s=0}^{t-2} \mu(A_n(\varepsilon_s)).   (4.5)

By definition of A_n(\varepsilon_s) and S_n(\varepsilon_s) we obtain

\alpha(S_n(\varepsilon_s)) = n, \quad \mu(S_n(\varepsilon_s)) = 2 n_s\, \|\varepsilon_s\|_1, \quad \alpha(A_n(\varepsilon_s)) = \mu(A_n(\varepsilon_s)) = (n_s - 2)\, \|\varepsilon_s\|_1.

By (3.6) we have \|\varepsilon_s\|_1 = \frac{1}{3}(2^s - (-1)^s). Inserting these values into (4.4) and (4.5) yields the assertions of the theorem.
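The summation in the proof can be carried out mechanically. The following sketch (our own helper names, assuming the per-factor counts and the pointer weight \|\varepsilon_s\|_1 = (2^s - (-1)^s)/3 stated above) compares the sums (4.4)-(4.5) with the closed forms of Theorem 4.3:

```python
from fractions import Fraction

def w(s):
    # ||eps_s||_1 = (2^s - (-1)^s) / 3, cf. (3.6); always an integer
    return (2**s - (-1)**s) // 3

def counts_by_summation(t):
    # Per-factor counts from the proof of Theorem 4.3, with n_s = 2^(t-s):
    #   alpha(S_n(eps_s)) = n,           mu(S_n(eps_s)) = 2 n_s ||eps_s||_1,
    #   alpha(A_n(eps_s)) = mu(A_n(eps_s)) = (n_s - 2) ||eps_s||_1.
    n = 2**t
    adds = t * n + sum((2**(t - s) - 2) * w(s) for s in range(t - 1))
    muls = sum(2 * 2**(t - s) * w(s) for s in range(t)) \
         + sum((2**(t - s) - 2) * w(s) for s in range(t - 1))
    return adds, muls

def counts_closed_form(t):
    # Closed forms of Theorem 4.3 (exact rational arithmetic).
    n = 2**t
    adds = Fraction(4, 3) * n * t - Fraction(8, 9) * n - Fraction(1, 9) * (-1)**t + 1
    muls = n * t - Fraction(4, 3) * n + Fraction(1, 3) * (-1)**t + 1
    return adds, muls

for t in range(2, 14):
    assert counts_by_summation(t) == counts_closed_form(t)
```

For instance, t = 3 (n = 8) gives 26 additions and 14 multiplications.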
We want to derive a second algorithm which possesses an even lower arithmetical complexity. We slightly change the derived orthogonal matrix product (4.1) in the following way. Instead of A_n(\varepsilon_s), consider the modified addition matrices

A'_n(\varepsilon_s) = A'_{n_s}(\varepsilon_s(1)) \oplus \ldots \oplus A'_{n_s}(\varepsilon_s(2^s)), \quad s = 0, \ldots, t-2,

with A'_n(0) := A_n(0) = I_n and A'_n(1) := \sqrt{2}\, A_n(1). Hence A_n(\varepsilon_s) and A'_n(\varepsilon_s) are connected by A'_n(\varepsilon_s) = D_n(\varepsilon_s)\, A_n(\varepsilon_s) with the diagonal matrix

D_n(\varepsilon_s) := (\sqrt{2})^{\varepsilon_s(1)} I_{n_s} \oplus \ldots \oplus (\sqrt{2})^{\varepsilon_s(2^s)} I_{n_s}, \quad s = 0, \ldots, t-2.   (4.6)

Further, consider the modified twiddle matrices

S'_n(\varepsilon_s) := S'_{n_s}(\varepsilon_s(1)) \oplus \ldots \oplus S'_{n_s}(\varepsilon_s(2^s)), \quad s = 0, \ldots, t-2,

with S'_n(0) := \sqrt{2}\, T_n(0) and S'_n(1) := T_n(1), such that S'_n(\varepsilon_s) = (D_n(\varepsilon_s))^{-1} S_n(\varepsilon_s). As before, the matrices A'_n(\varepsilon_s) and S'_n(\varepsilon_s) are sparse matrices with at most 2 nonzero entries in each row. More precisely, after suitable permutations, A'_n(\varepsilon_s), s = 1, \ldots, t-2, contains only butterfly matrices (with entries \pm 1) and diagonal matrices (with entries 1 and \sqrt{2}). The matrices S'_n(\varepsilon_s), s = 0, \ldots, t-2, contain only butterfly matrices (with entries \pm 1), rotation matrices, and rotation-reflection matrices. Note that A'_n(\varepsilon_0) = A_n(0) = I_n and S'_n(\varepsilon_0) = \sqrt{2}\, T_n(0). With the changed matrices we find from (4.1) the factorization

C_n^{II} = \frac{1}{(\sqrt{2})^{t-1}}\, P_n(0)\, A'_n(\varepsilon_0) \cdots P_n(t-2)\, A'_n(\varepsilon_{t-2})\, T_n(\varepsilon_{t-1})\, S'_n(\varepsilon_{t-2}) \cdots S'_n(\varepsilon_0),

since for each s = 0, \ldots, t-2 the product of diagonal matrices D_n(\varepsilon_s) \cdots D_n(\varepsilon_0) commutes with T_n(\varepsilon_s) and with A_n(\varepsilon_s). Observe that the inner matrix T_n(\varepsilon_{t-1}) is not modified. For the algorithm, we multiply this matrix by \sqrt{2} and finally obtain

C_n^{II} = \frac{1}{\sqrt{n}}\, P_n(0)\, A'_n(\varepsilon_0) \cdots P_n(t-2)\, A'_n(\varepsilon_{t-2})\, S_n(\varepsilon_{t-1})\, S'_n(\varepsilon_{t-2}) \cdots S'_n(\varepsilon_0).   (4.7)

This factorization leads to the following modified split-radix DCT-II algorithm:

Algorithm 4.4 [Modified split-radix DCT-II algorithm]
Input: n = 2^t (t \ge 2), x \in \mathbb{R}^n.
1. Precompute \sqrt{2}, \sqrt{2} \cos \frac{\pi}{8}, \sqrt{2} \sin \frac{\pi}{8}, and for s = 1, \ldots, t-2, k = 0, \ldots, 2^s - 1 precompute the values \cos \frac{(2k+1)\pi}{2^{s+3}} and \sin \frac{(2k+1)\pi}{2^{s+3}}.
2. For s = 0, \ldots, t-2 form \varepsilon_{s+1} := (\varepsilon_s, \tilde\varepsilon_s) with \varepsilon_0 := (0).
3. Put x^{(0)} := x and compute for s = 0, \ldots, t-2
   x^{(s+1)} := S'_n(\varepsilon_s)\, x^{(s)}.
4. Compute x^{(t)} := S_n(\varepsilon_{t-1})\, x^{(t-1)}.
5. For s = 0, \ldots, t-2 compute
   x^{(t+1+s)} := P_n(t-s-2)\, A'_n(\varepsilon_{t-s-2})\, x^{(t+s)}.
6. Multiply with the scaling factor, y := \frac{1}{\sqrt{n}}\, x^{(2t-1)}.
Output: y = C_n^{II} x.
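Both Algorithm 4.2 and Algorithm 4.4 rest on the recursive splitting of C_n^{II} into a half-length DCT-II (even-indexed outputs) and a half-length DCT-IV (odd-indexed outputs). As an illustration only, and not the paper's in-place factorization with its P_n, A_n, S_n factors, this recursion can be sketched directly; the function names are ours, and the DCT-IV leaves are evaluated by a plain matrix product:

```python
import numpy as np

def dct2_matrix(n):
    # Orthogonal DCT-II matrix C_n^II: sqrt(2/n) cos(j(2k+1)pi/(2n)),
    # with the first row additionally scaled by 1/sqrt(2).
    j, k = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    C = np.sqrt(2.0 / n) * np.cos(j * (2 * k + 1) * np.pi / (2 * n))
    C[0, :] /= np.sqrt(2.0)
    return C

def dct4_matrix(n):
    # Orthogonal DCT-IV matrix C_n^IV: sqrt(2/n) cos((2j+1)(2k+1)pi/(4n)).
    j, k = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    return np.sqrt(2.0 / n) * np.cos((2 * j + 1) * (2 * k + 1) * np.pi / (4 * n))

def dct2_split(x):
    # One split step of the C^II -> C^II (+) C^IV recursion:
    # even outputs come from the half-length DCT-II of u_j = x_j + x_{n-1-j},
    # odd outputs from the half-length DCT-IV of v_j = x_j - x_{n-1-j};
    # the factor 1/sqrt(2) keeps the transform orthonormal.
    n = x.size
    if n == 1:
        return x.copy()
    h = n // 2
    u = x[:h] + x[h:][::-1]
    v = x[:h] - x[h:][::-1]
    y = np.empty(n)
    y[0::2] = dct2_split(u) / np.sqrt(2.0)
    y[1::2] = (dct4_matrix(h) @ v) / np.sqrt(2.0)
    return y
```

For a radix-2 length the recursion reproduces the direct transform: dct2_split(x) agrees with dct2_matrix(n) @ x up to roundoff.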
The arithmetical complexity of this modified split-radix DCT-II algorithm can be determined analogously as in Theorem 4.3. By (4.7) it follows that

\alpha(C_n^{II}) = \alpha(S_n(\varepsilon_{t-1})) + \sum_{s=0}^{t-2} \alpha(S'_n(\varepsilon_s)) + \sum_{s=0}^{t-2} \alpha(A'_n(\varepsilon_s)),

\mu(C_n^{II}) = \mu(S_n(\varepsilon_{t-1})) + \sum_{s=0}^{t-2} \mu(S'_n(\varepsilon_s)) + \sum_{s=0}^{t-2} \mu(A'_n(\varepsilon_s)).

By definition of A'_n(\varepsilon_s) and S'_n(\varepsilon_s) we obtain for s = 0, \ldots, t-2

\alpha(S'_n(\varepsilon_s)) = n, \quad \mu(S'_n(\varepsilon_s)) = 2 n_s\, \|\varepsilon_s\|_1, \quad \alpha(A'_n(\varepsilon_s)) = (n_s - 2)\, \|\varepsilon_s\|_1, \quad \mu(A'_n(\varepsilon_s)) = 2\, \|\varepsilon_s\|_1,

\alpha(S_n(\varepsilon_{t-1})) = n, \quad \mu(S_n(\varepsilon_{t-1})) = 4\, \|\varepsilon_{t-1}\|_1.

By (3.6) we have \|\varepsilon_s\|_1 = \frac{1}{3}(2^s - (-1)^s). Inserting these values into the above equations yields for Algorithm 4.4

\alpha(C_n^{II}) = \frac{4}{3}\, n t - \frac{8}{9}\, n - \frac{1}{9}\, (-1)^t + 1, \quad \mu(C_n^{II}) = \frac{2}{3}\, n t - \frac{1}{9}\, n + \frac{1}{9}\, (-1)^t - 1.

Remark 4.5 For comparison, the algorithm presented by Wang [26] (which already outperforms the algorithm in [4]) for the DCT-II of length n = 2^t needs \frac{3}{4} n t - n + 3 multiplications and \frac{7}{4} n t - 2n + 3 additions. The complexity of our new Algorithm 4.4 is even comparable with fast DCT-II algorithms based on polynomial arithmetic, which need \frac{1}{2} n t multiplications and \frac{3}{2} n t - n + 1 additions (see e.g. [12, 13, 20, 21]). Namely, using the method of Remark 2.10 (or Remark 2.11) for computing a (scaled) rotation matrix/rotation-reflection matrix with only 3 multiplications and 3 additions, we obtain for Algorithm 4.4

\alpha(S'_n(\varepsilon_s)) = \frac{1}{2}\, n_s\, \|\varepsilon_s\|_1 + n, \quad \mu(S'_n(\varepsilon_s)) = \frac{3}{2}\, n_s\, \|\varepsilon_s\|_1, \quad s = 0, \ldots, t-2,

\alpha(T_n(\varepsilon_{t-1})) = \|\varepsilon_{t-1}\|_1 + n, \quad \mu(T_n(\varepsilon_{t-1})) = 3\, \|\varepsilon_{t-1}\|_1.

Hence we get \alpha(C_n^{II}) = \frac{3}{2}\, n t - n + 1 and \mu(C_n^{II}) = \frac{1}{2}\, n t - 1. The goal of [24], p. 229, was to develop FFT-based DCT-II algorithms that require 2.5 n \log_2 n flops, assuming that the transform length n is a power of 2. We have shown that this is also possible with 2 n \log_2 n flops using only operations of simple structure, namely permutations, scaling with \sqrt{2}, butterfly operations, and plane rotations/rotation-reflections with small rotation angles contained in (0, \pi/4).
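Remarks 2.10 and 2.11 are not reproduced here, but the standard way to realize a plane rotation with 3 multiplications and 3 additions is the following factorization (the precomputed coefficients are c, s - c and s + c; the function name is ours):

```python
import math

def rotation_3mul(c, s, x0, x1):
    # Plane rotation y0 = c*x0 + s*x1, y1 = -s*x0 + c*x1 computed with
    # 3 multiplications and 3 additions, assuming the three coefficients
    # c, (s - c) and (s + c) are precomputed.
    z = c * (x0 + x1)          # 1 mult, 1 add
    y0 = z + (s - c) * x1      # 1 mult, 1 add:  c*x0 + s*x1
    y1 = z - (s + c) * x0      # 1 mult, 1 add: -s*x0 + c*x1
    return y0, y1
```

The same trick applies verbatim to the scaled rotations and rotation-reflections appearing in S'_n(\varepsilon_s); only the precomputed coefficients change.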
5 Split-radix DCT-IV algorithms
The same split-radix technique as in Sections 3 and 4 can be applied to the DCT-IV of radix-2 length. Let n = 2^t (t \ge 2) be given. In the first step we can split C_n^{IV} into C_{n_1}^{II} \oplus C_{n_1}^{II} by (2.8). Then in the second step, we use (2.4) in order to split C_{n_1}^{II} \oplus C_{n_1}^{II} into C_{n_2}^{II} \oplus C_{n_2}^{IV} \oplus C_{n_2}^{II} \oplus C_{n_2}^{IV}. In the case n > 4, we continue the procedure. This method is illustrated by the following diagram:

step s = 0:   C_n^{IV}
step s = 1:   C_{n_1}^{II} \oplus C_{n_1}^{II}
step s = 2:   C_{n_2}^{II} \oplus C_{n_2}^{IV} \oplus C_{n_2}^{II} \oplus C_{n_2}^{IV}
step s = 3:   C_{n_3}^{II} \oplus C_{n_3}^{IV} \oplus C_{n_3}^{II} \oplus C_{n_3}^{II} \oplus C_{n_3}^{II} \oplus C_{n_3}^{IV} \oplus C_{n_3}^{II} \oplus C_{n_3}^{II}
Analogously as in Sections 3 and 4, we introduce binary vectors \eta_s := (\eta_s(1), \ldots, \eta_s(2^s)), s \in \{0, \ldots, t-1\}, as pointers. We put \eta_s(k) := 0 if C_{n_s}^{II} stands at position k \in \{1, \ldots, 2^s\} of step s, and \eta_s(k) := 1 if C_{n_s}^{IV} stands at position k of step s. These pointers possess different properties than those in Lemma 3.4.

Lemma 5.1 Let t \in \mathbb{N} (t \ge 2) and \eta_0 := (1). Then

\eta_{s+1} = (\tilde\eta_s, \tilde\eta_s), \quad s = 0, \ldots, t-2,

where \tilde\eta_s equals \eta_s with the exception that the last bit position is reversed. Further,

\|\eta_s\|_1 = \sum_{k=1}^{2^s} \eta_s(k) = \frac{1}{3}\, (2^s + 2 (-1)^s).

The proof is similar to that of Lemma 3.4 and is omitted here. Now, for each pointer \eta_s we define A_n(\eta_s) and T_n(\eta_s) (or their modified versions) in the same way as A_n(\varepsilon_s) and T_n(\varepsilon_s) in Section 4.
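The pointer recursion of Lemma 5.1 can be checked directly by generating the \eta_s from the splitting rules of the diagram above (a C^{II} block splits into (C^{II}, C^{IV}), a C^{IV} block into (C^{II}, C^{II})); the helper name is ours:

```python
def split_dct4_pointers(t):
    # eta_0 = (1): the root is C^IV. One split step replaces bit 0 (a C^II)
    # by (0, 1) and bit 1 (a C^IV) by (0, 0).
    etas = [(1,)]
    for _ in range(t - 1):
        etas.append(tuple(b for a in etas[-1]
                            for b in ((0, 1) if a == 0 else (0, 0))))
    return etas

etas = split_dct4_pointers(8)
for s, eta in enumerate(etas):
    # closed form of Lemma 5.1: ||eta_s||_1 = (2^s + 2(-1)^s)/3
    assert sum(eta) == (2**s + 2 * (-1)**s) // 3
    if s + 1 < len(etas):
        # recursion eta_{s+1} = (eta~_s, eta~_s), last bit reversed
        tilde = eta[:-1] + (1 - eta[-1],)
        assert etas[s + 1] == tilde + tilde
```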
Theorem 5.2 Let n = 2^t (t \ge 2). Then the matrix C_n^{IV} can be factorized into the following product of sparse orthogonal matrices:

C_n^{IV} = P_n(0)\, A_n(\eta_0) \cdots P_n(t-2)\, A_n(\eta_{t-2})\, T_n(\eta_{t-1}) \cdots T_n(\eta_0).

The proof follows directly from Lemma 2.2 and Lemma 2.4.

The factorization of C_n^{IV} in Theorem 5.2 (slightly changed by diagonal matrices as in Section 4) implies fast DCT-IV algorithms analogous to Algorithm 4.2 or Algorithm 4.4. For computing C_n^{IV} x for x \in \mathbb{R}^n using Lemma 2.4 and Algorithm 4.4 we derive an arithmetical complexity of

\alpha(C_n^{IV}) = 2\, \alpha(C_{n_1}^{II}) + \alpha(\sqrt{2}\, A_n(1)) + \alpha(T_n(1)) = \frac{4}{3}\, n t - \frac{2}{9}\, n + \frac{2}{9}\, (-1)^t,

\mu(C_n^{IV}) = 2\, \mu(C_{n_1}^{II}) + \mu(\sqrt{2}\, A_n(1)) + \mu(T_n(1)) = \frac{2}{3}\, n t + \frac{11}{9}\, n - \frac{2}{9}\, (-1)^t + 1.

We see that the DCT-IV of radix-2 length n can be computed with (1 + 2 \log_2 n)\, n flops. With the method of Remark 2.10 (or Remark 2.11), the number of multiplications can be reduced even further.
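As a quick numerical sanity check of the transform treated in this section (of the matrix itself, not of the factorization in Theorem 5.2): the cosine matrix C_n^{IV}, built directly from its definition, is symmetric and orthogonal, hence its own inverse:

```python
import numpy as np

def dct4_matrix(n):
    # C_n^IV = sqrt(2/n) [cos((2j+1)(2k+1)pi/(4n))]_{j,k=0}^{n-1}
    j, k = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    return np.sqrt(2.0 / n) * np.cos((2 * j + 1) * (2 * k + 1) * np.pi / (4 * n))

C = dct4_matrix(16)
assert np.allclose(C, C.T)                 # symmetric
assert np.allclose(C @ C.T, np.eye(16))    # orthogonal, so C^IV is self-inverse
```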
6 Split-radix DCT-I algorithms

A split-radix DCT-I algorithm is based on (2.5), (2.7), and (2.8). Let n = 2^t (t \ge 2) be given. Further, let n_s := 2^{t-s} (s = 0, \ldots, t-1). In the first factorization step, C_{n+1}^{I} is split into C_{n_1+1}^{I} \oplus C_{n_1}^{III} by (2.5). Then in the second step, we use (2.5) and (2.7) in order to split C_{n_1+1}^{I} \oplus C_{n_1}^{III} into C_{n_2+1}^{I} \oplus C_{n_2}^{III} \oplus C_{n_2}^{III} \oplus C_{n_2}^{IV}. In the case n > 2 we continue this procedure. For C_{n_s}^{IV} we use the factorization

C_{n_s}^{IV} = T_{n_s}(1)^{\mathrm{T}}\, (C_{n_{s+1}}^{III} \oplus C_{n_{s+1}}^{III})\, A_{n_s}(1)^{\mathrm{T}}\, P_{n_s},

which follows from (2.8) by transposing. Finally, we obtain the split-radix factorization of C_{n+1}^{I}. Note that (2.5) is in some sense also true for n = 2:

C_3^{I} = P_3^{\mathrm{T}}\, (C_2^{II} \oplus 1)\, T_3(2),

i.e., the DCT-I of length 3 can be computed by 2 (scaled) butterfly operations. The first factorization steps can be illustrated by the following diagram:

step s = 0:   C_{n+1}^{I}
step s = 1:   C_{n_1+1}^{I} \oplus C_{n_1}^{III}
step s = 2:   C_{n_2+1}^{I} \oplus C_{n_2}^{III} \oplus C_{n_2}^{III} \oplus C_{n_2}^{IV}
step s = 3:   C_{n_3+1}^{I} \oplus C_{n_3}^{III} \oplus C_{n_3}^{III} \oplus C_{n_3}^{IV} \oplus C_{n_3}^{III} \oplus C_{n_3}^{IV} \oplus C_{n_3}^{III} \oplus C_{n_3}^{III}

Now we have to indicate at which position k \in \{1, \ldots, 2^s\} in step s \in \{0, \ldots, t-1\} the matrices C_{n_s+1}^{I}, C_{n_s}^{III}, and C_{n_s}^{IV} stand. We introduce triadic vectors \delta_s = (\delta_s(1), \ldots, \delta_s(2^s)) for s \in \{0, \ldots, t-1\} as pointers. Since C_{n_s+1}^{I} stands at the first position of each step s, we set \delta_s(1) := 2. We put \delta_s(k) := 0 if C_{n_s}^{III} stands at position k \in \{2, \ldots, 2^s\} in step s \in \{1, \ldots, t-1\}, and \delta_s(k) := 1 if C_{n_s}^{IV} stands at position k in step s:

\delta_0 = (2)
\delta_1 = (2, 0)
\delta_2 = (2, 0, 0, 1)
\delta_3 = (2, 0, 0, 1, 0, 1, 0, 0)
These pointers \delta_s have similar properties as the pointers \varepsilon_s in Section 3.

Lemma 6.1 Let t \in \mathbb{N} with t \ge 2 be given and \delta_0 := (2). Then

\delta_{s+1} = (\delta_s, \varepsilon_s), \quad s = 0, \ldots, t-2,   (6.1)

where \varepsilon_s is determined by the recursion (3.5). Further,

\|\delta_s\|_1 - 2 = \frac{1}{3}\, 2^s + \frac{1}{6}\, (-1)^s - \frac{1}{2}

is the number of ones in the triadic vector \delta_s.
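The counting formula of Lemma 6.1 can again be checked by generating the \delta_s from the splitting rules of the diagram (C^{I} splits into (C^{I}, C^{III}), C^{III} into (C^{III}, C^{IV}), C^{IV} into (C^{III}, C^{III})); the helper name is ours:

```python
def split_dct1_pointers(t):
    # delta_0 = (2): the root is C^I. One split step replaces entry 2 (C^I)
    # by (2, 0), entry 0 (C^III) by (0, 1), and entry 1 (C^IV) by (0, 0).
    rules = {2: (2, 0), 0: (0, 1), 1: (0, 0)}
    deltas = [(2,)]
    for _ in range(t - 1):
        deltas.append(tuple(b for a in deltas[-1] for b in rules[a]))
    return deltas

deltas = split_dct1_pointers(8)
assert deltas[3] == (2, 0, 0, 1, 0, 1, 0, 0)
for s, d in enumerate(deltas):
    # number of ones in delta_s, i.e. (1/3) 2^s + (1/6)(-1)^s - 1/2
    assert d.count(1) == (2 * 2**s + (-1)**s - 3) // 6
```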
The proof is similar to that of Lemma 3.4 and is omitted here for shortness. For each pointer \delta_s we define the matrix

B_{n+1}(\delta_s) := P_{n_s+1}^{\mathrm{T}} \oplus B_{n_s}(\delta_s(2)) \oplus \ldots \oplus B_{n_s}(\delta_s(2^s)), \quad s = 0, \ldots, t-2,

with B_n(0) := T_n(0)^{\mathrm{T}} and B_n(1) := T_n(1)^{\mathrm{T}} as in Lemma 2.2 and Lemma 2.4. Further, we introduce the matrix

U_{n+1}(\delta_s) := T_{n_s+1}(2) \oplus U_{n_s}(\delta_s(2)) \oplus \ldots \oplus U_{n_s}(\delta_s(2^s)), \quad s = 0, \ldots, t-2,

with U_n(0) := P_n and U_n(1) := A_n(1)^{\mathrm{T}}\, P_n as in Lemma 2.2 and Lemma 2.4, and finally the cosine matrix

C_{n+1}(\delta_s) := C_{n_s+1}^{I} \oplus C_{n_s}(\delta_s(2)) \oplus \ldots \oplus C_{n_s}(\delta_s(2^s)), \quad s = 0, \ldots, t-1,

with C_n(0) := C_n^{III} and C_n(1) := C_n^{IV}. Note that C_{n+1}(\delta_0) = C_{n+1}^{I}. We shall see in the following that the cosine matrix C_{n+1}(\delta_s) appears as an intermediate result in our recursive factorization of C_{n+1}^{I}. By construction, all matrices B_{n+1}(\delta_s) and U_{n+1}(\delta_s) are sparse and orthogonal.

Theorem 6.2 Let n = 2^t (t \ge 2). Then the matrix C_{n+1}^{I} can be factorized into the following product of sparse orthogonal matrices:

C_{n+1}^{I} = B_{n+1}(\delta_0) \cdots B_{n+1}(\delta_{t-2})\, C_{n+1}(\delta_{t-1})\, U_{n+1}(\delta_{t-2}) \cdots U_{n+1}(\delta_0).   (6.2)

The proof follows directly from Lemma 2.2 and Lemma 2.4.

The matrix C_{n+1}(\delta_{t-1}) in (6.2) is a block matrix consisting only of C_3^{I} (in the first block), C_2^{III}, and C_2^{IV}. The factorization of C_{n+1}^{I} in Theorem 6.2 implies a fast DCT-I algorithm with

\alpha(C_{n+1}^{I}) = \frac{5}{3}\, n t - \frac{22}{9}\, n + \frac{9}{2}\, t - \frac{1}{18}\, (-1)^t + 2,

\mu(C_{n+1}^{I}) = \frac{4}{3}\, n t - \frac{14}{9}\, n + \frac{7}{2}\, t + \frac{1}{18}\, (-1)^t + 2.

We see that the DCT-I of length n+1 can be computed with about 3 n \log_2 n flops if n is a power of 2. This split-radix DCT-I algorithm uses only permutations, scaled butterfly operations, and plane rotations/rotation-reflections and works without additional scaling. Therefore we obtain a higher arithmetical complexity. For comparison, the FFT-based DCT-I algorithm in [24], pp. 238-239, requires 2.5 n \log_2 n flops, but it possesses a worse numerical stability, since it requires divisions by small sine values.
7 Numerical stability of split-radix DCT algorithms
In the following we use Wilkinson's standard model of binary floating point arithmetic for real numbers (see [11], p. 44). If x \in \mathbb{R} is represented by the floating point number \mathrm{fl}(x), then

\mathrm{fl}(x) = x\, (1 + \delta) \quad (|\delta| \le u),

where u denotes the unit roundoff or machine precision, as long as we disregard underflow and overflow. For arbitrary x_0, x_1 \in \mathbb{R} and any arithmetical operation \circ \in \{+, -, \times, /\}, the exact value y = x_0 \circ x_1 and the computed value \tilde y = \mathrm{fl}(x_0 \circ x_1) are related by

\mathrm{fl}(x_0 \circ x_1) = (x_0 \circ x_1)(1 + \delta_\circ) \quad (|\delta_\circ| \le u).   (7.1)

In the IEEE arithmetic of single precision (24 bits for the mantissa with 1 sign bit, 8 bits for the exponent), we have u = 2^{-24} \approx 5.96 \times 10^{-8}. For arithmetic in double precision (53 bits for the mantissa with 1 sign bit, 11 bits for the exponent), we have u = 2^{-53} \approx 1.11 \times 10^{-16} (see [11], p. 45). Usually the total roundoff error in the result of an algorithm is composed of a number of such errors. To make the origin of the relative errors \delta_k^\circ clear in this notation, we use superscripts for the operation \circ and subscripts for the operation step k.

In this section we show that, under weak assumptions, a split-radix DCT-II algorithm possesses a remarkable numerical stability. We consider Algorithm 4.2 in detail. The other algorithms can be analysed similarly. The roundoff errors of Algorithm 4.2 are caused by multiplications with the matrices S_n(\varepsilon_s), s = 0, \ldots, t-1, and A_n(\varepsilon_s), s = 1, \ldots, t-2. These matrices have a very simple structure. After suitable permutations, every matrix is block-diagonal with blocks of order 1 or 2. All blocks of order 1 are equal to 1. Every block of order 2 is either a (scaled) butterfly matrix

\begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}, \quad \frac{\sqrt{2}}{2} \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix},

or a scaled rotation matrix/rotation-reflection matrix

\begin{pmatrix} a_0 & a_1 \\ -a_1 & a_0 \end{pmatrix}, \quad \begin{pmatrix} a_0 & a_1 \\ a_1 & -a_0 \end{pmatrix}

with

a_0 = \sqrt{2} \cos \frac{(2k+1)\pi}{2^{s+3}}, \quad a_1 = \sqrt{2} \sin \frac{(2k+1)\pi}{2^{s+3}}, \quad s = 0, \ldots, t-2, \quad k = 0, \ldots, 2^s - 1.
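The model (7.1) can be observed directly in IEEE single precision, e.g. with NumPy's float32 type (a small experiment of ours, not part of the analysis; for float32 operands, double precision serves as a practically exact reference):

```python
import numpy as np

u_single = 2.0**-24          # unit roundoff of IEEE single precision
rng = np.random.default_rng(0)
x = rng.standard_normal(10_000).astype(np.float32)
y = rng.standard_normal(10_000).astype(np.float32)

for op in (np.add, np.subtract, np.multiply):
    exact = op(x.astype(np.float64), y.astype(np.float64))
    computed = op(x, y).astype(np.float64)   # the float32 result fl(x o y)
    nz = exact != 0
    # fl(x o y) = (x o y)(1 + delta) with |delta| <= u, cf. (7.1)
    assert np.all(np.abs(computed[nz] - exact[nz]) <= u_single * np.abs(exact[nz]))
```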
First we analyse the roundoff errors of simple matrix-vector products, where the matrix of order 2 is a (scaled) butterfly matrix or a scaled rotation matrix. Before starting the detailed analysis, we show the following useful estimate.

Lemma 7.1 For all a, b, c, d \in \mathbb{R}, we have

(|ac| + |bd| + |ac - bd|)^2 + (|ad| + |bc| + |ad + bc|)^2 \le \frac{16}{3}\, (a^2 + b^2)(c^2 + d^2),

where the constant 16/3 is best possible.

Proof. Without loss of generality, we can assume that a, b, c, d \ge 0 and ac \ge bd. Then the above inequality reads as follows:

4\, (ac)^2 + 4\, (ad + bc)^2 \le \frac{16}{3}\, (a^2 + b^2)(c^2 + d^2),

i.e., (ac)^2 + (ad + bc)^2 \le \frac{4}{3}\, (a^2 + b^2)(c^2 + d^2). This inequality is equivalent to

0 \le (ad - bc)^2 + (ac - 2bd)^2

and is obviously true. For a = c = \sqrt{2} and b = d = 1 we have equality.
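The inequality of Lemma 7.1 and its equality case are easy to probe numerically (random sampling is, of course, only a plausibility check, not a proof):

```python
import math
import random

def lhs(a, b, c, d):
    return ((abs(a*c) + abs(b*d) + abs(a*c - b*d))**2
            + (abs(a*d) + abs(b*c) + abs(a*d + b*c))**2)

def rhs(a, b, c, d):
    return (16.0 / 3.0) * (a*a + b*b) * (c*c + d*d)

random.seed(1)
for _ in range(100_000):
    a, b, c, d = (random.uniform(-2.0, 2.0) for _ in range(4))
    assert lhs(a, b, c, d) <= rhs(a, b, c, d) + 1e-12

# equality for a = c = sqrt(2), b = d = 1: both sides equal 48
r2 = math.sqrt(2.0)
assert math.isclose(lhs(r2, 1.0, r2, 1.0), rhs(r2, 1.0, r2, 1.0))
```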
Lemma 7.2 (i) For the butterfly operation y_0 := x_0 + x_1, y_1 := x_0 - x_1 with \tilde y_0 := \mathrm{fl}(x_0 + x_1) and \tilde y_1 := \mathrm{fl}(x_0 - x_1), the roundoff error can be estimated by

(\tilde y_0 - y_0)^2 + (\tilde y_1 - y_1)^2 \le 2\, u^2\, (x_0^2 + x_1^2).   (7.2)

(ii) If the scaling factor a \notin \{0, \pm 1\} is precomputed by \tilde a = a + \Delta a with |\Delta a| \le c_1 u, then for the scaled butterfly operation y_0 := a(x_0 + x_1), y_1 := a(x_0 - x_1) with \tilde y_0 := \mathrm{fl}(\tilde a (x_0 + x_1)) and \tilde y_1 := \mathrm{fl}(\tilde a (x_0 - x_1)), the roundoff error can be estimated by

\|(\tilde y_0 - y_0,\, \tilde y_1 - y_1)\|_2 \le \big( 2\sqrt{2}\, |a| + \sqrt{2}\, c_1 + O(u) \big)\, u\, \|(x_0, x_1)\|_2.   (7.3)

(iii) If the different entries a_k \notin \{0, \pm 1\} with a^2 := a_0^2 + a_1^2 > 0 are precomputed by \tilde a_k = a_k + \Delta a_k with |\Delta a_k| \le c_2 u for k = 0, 1, then for the scaled rotation y_0 := a_0 x_0 + a_1 x_1, y_1 := -a_1 x_0 + a_0 x_1 with \tilde y_0 := \mathrm{fl}(\tilde a_0 x_0 + \tilde a_1 x_1) and \tilde y_1 := \mathrm{fl}(-\tilde a_1 x_0 + \tilde a_0 x_1), the roundoff error can be estimated by

\|(\tilde y_0 - y_0,\, \tilde y_1 - y_1)\|_2 \le \Big( \frac{4}{\sqrt{3}}\, |a| + \sqrt{2}\, c_2 + O(u) \Big)\, u\, \|(x_0, x_1)\|_2.   (7.4)

Proof. (i) By (7.1) we have

\tilde y_0 = (x_0 + x_1)(1 + \delta_0) = y_0 + (x_0 + x_1)\, \delta_0, \quad \tilde y_1 = (x_0 - x_1)(1 + \delta_1) = y_1 + (x_0 - x_1)\, \delta_1

with |\delta_k| \le u for k = 0, 1, such that by |\tilde y_0 - y_0| \le |x_0 + x_1|\, u and |\tilde y_1 - y_1| \le |x_0 - x_1|\, u we obtain (7.2).

(ii) Putting z_0 := \tilde a (x_0 + x_1) and z_1 := \tilde a (x_0 - x_1), it follows from (7.1) that

\tilde y_0 = \tilde a (x_0 + x_1)(1 + \delta_0^+)(1 + \delta_0^\times) = z_0 + \tilde a (x_0 + x_1)(\delta_0^+ + \delta_0^\times + \delta_0^+ \delta_0^\times),
\tilde y_1 = \tilde a (x_0 - x_1)(1 + \delta_1^+)(1 + \delta_1^\times) = z_1 + \tilde a (x_0 - x_1)(\delta_1^+ + \delta_1^\times + \delta_1^+ \delta_1^\times)

with |\delta_k^+| \le u, |\delta_k^\times| \le u, k = 0, 1. Thus, by |\tilde y_0 - z_0| \le |\tilde a (x_0 + x_1)|(2u + u^2) and |\tilde y_1 - z_1| \le |\tilde a (x_0 - x_1)|(2u + u^2), we get the estimate

(\tilde y_0 - z_0)^2 + (\tilde y_1 - z_1)^2 \le 2\, \tilde a^2\, u^2 (2 + u)^2\, (x_0^2 + x_1^2)

with \tilde a^2 = a^2 + O(u), i.e.

\|(\tilde y_0 - z_0,\, \tilde y_1 - z_1)\|_2 \le (2\sqrt{2}\, |a| + O(u))\, u\, \|(x_0, x_1)\|_2.

By

(z_0 - y_0,\, z_1 - y_1)^{\mathrm{T}} = \Delta a \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix} (x_0, x_1)^{\mathrm{T}}

we obtain

\|(z_0 - y_0,\, z_1 - y_1)\|_2 \le \sqrt{2}\, c_1\, u\, \|(x_0, x_1)\|_2,

and finally by the triangle inequality

\|(\tilde y_0 - y_0,\, \tilde y_1 - y_1)\|_2 \le (2\sqrt{2}\, |a| + \sqrt{2}\, c_1 + O(u))\, u\, \|(x_0, x_1)\|_2.

(iii) Introducing z_0 := \tilde a_0 x_0 + \tilde a_1 x_1 and z_1 := -\tilde a_1 x_0 + \tilde a_0 x_1, it follows from (7.1) that

\tilde y_0 = [\tilde a_0 x_0 (1 + \delta_0^\times) + \tilde a_1 x_1 (1 + \delta_1^\times)](1 + \delta_0^+),
\tilde y_1 = [-\tilde a_1 x_0 (1 + \delta_2^\times) + \tilde a_0 x_1 (1 + \delta_3^\times)](1 + \delta_1^+)

with |\delta_j^\times| \le u for j = 0, \ldots, 3 and |\delta_k^+| \le u for k = 0, 1. Hence we obtain

|\tilde y_0 - z_0| \le (|\tilde a_0 x_0| + |\tilde a_1 x_1| + |\tilde a_0 x_0 + \tilde a_1 x_1|)\, u + (|\tilde a_0 x_0| + |\tilde a_1 x_1|)\, u^2,
|\tilde y_1 - z_1| \le (|\tilde a_1 x_0| + |\tilde a_0 x_1| + |\tilde a_0 x_1 - \tilde a_1 x_0|)\, u + (|\tilde a_1 x_0| + |\tilde a_0 x_1|)\, u^2,

and hence

(\tilde y_0 - z_0)^2 + (\tilde y_1 - z_1)^2 \le \big[ (|\tilde a_0 x_0| + |\tilde a_1 x_1| + |\tilde a_0 x_0 + \tilde a_1 x_1|)^2 + (|\tilde a_1 x_0| + |\tilde a_0 x_1| + |\tilde a_0 x_1 - \tilde a_1 x_0|)^2 \big]\, u^2 (1 + u)^2.

Applying Lemma 7.1, we find

(\tilde y_0 - z_0)^2 + (\tilde y_1 - z_1)^2 \le \frac{16}{3}\, (\tilde a_0^2 + \tilde a_1^2)(x_0^2 + x_1^2)\, u^2 (1 + u)^2 = \frac{16}{3}\, (a^2 + O(u))\, u^2\, (x_0^2 + x_1^2),

since \tilde a_0^2 + \tilde a_1^2 = a_0^2 + a_1^2 + O(u) = a^2 + O(u) by assumption. Therefore we obtain

\|(\tilde y_0 - z_0,\, \tilde y_1 - z_1)\|_2 \le \Big( \frac{4}{\sqrt{3}}\, |a| + O(u) \Big)\, u\, \|(x_0, x_1)\|_2.

By

(z_0 - y_0,\, z_1 - y_1)^{\mathrm{T}} = \begin{pmatrix} \Delta a_0 & \Delta a_1 \\ -\Delta a_1 & \Delta a_0 \end{pmatrix} (x_0, x_1)^{\mathrm{T}}   (7.5)

we conclude that

\|(z_0 - y_0,\, z_1 - y_1)\|_2 \le \sqrt{2}\, c_2\, u\, \|(x_0, x_1)\|_2,

since the matrix in (7.5) is almost orthogonal and therefore its spectral norm equals

\big( (\Delta a_0)^2 + (\Delta a_1)^2 \big)^{1/2} \le \sqrt{2}\, c_2\, u.

Using the triangle inequality, we obtain (7.4).

For an arbitrary input vector x \in \mathbb{R}^n, let y := C_n^{II} x \in \mathbb{R}^n be the exact transformed vector. Further, let \tilde y \in \mathbb{R}^n be the output vector computed by Algorithm 4.2 using floating point arithmetic with unit roundoff u. Since C_n^{II} is nonsingular, \tilde y can be represented in the
form \tilde y = C_n^{II}(x + \Delta x) with \Delta x \in \mathbb{R}^n. An algorithm for computing C_n^{II} x is called normwise backward stable (see [11], p. 142) if there is a positive constant k_n such that

\|\Delta x\|_2 \le (k_n u + O(u^2))\, \|x\|_2   (7.6)

for all vectors x \in \mathbb{R}^n and k_n u \ll 1. The constant k_n measures the quality of numerical stability. Since C_n^{II} is orthogonal, we conclude that \|\Delta x\|_2 = \|C_n^{II}(\Delta x)\|_2 = \|\tilde y - y\|_2 and \|x\|_2 = \|C_n^{II} x\|_2 = \|y\|_2. Hence we also have normwise forward stability by

\|\tilde y - y\|_2 \le (k_n u + O(u^2))\, \|y\|_2,

if (7.6) is satisfied.

Now let us look closer at the computation steps in Algorithm 4.2. In step 1 of Algorithm 4.2, for every s = 0, \ldots, t-2, all values

\sqrt{2} \cos \frac{(2k+1)\pi}{2^{s+3}}, \quad \sqrt{2} \sin \frac{(2k+1)\pi}{2^{s+3}}, \quad k = 0, \ldots, 2^s - 1,   (7.7)

are precomputed. If cosine and sine are internally computed to higher precision and the results are afterwards rounded to the next machine number, then we obtain very accurate values of (7.7) with an error constant c_2 = 1 (see Lemma 7.2 (iii)). We use the matrices \tilde S_n(\varepsilon_s), s = 1, \ldots, t-1, with precomputed entries (7.7) instead of S_n(\varepsilon_s) = \sqrt{2}\, T_n(\varepsilon_s). Assume that the value \sqrt{2}/2 is precomputed with an error constant c_1 (see Lemma 7.2 (ii)). We use the matrices \tilde A_n(\varepsilon_s), s = 1, \ldots, t-2, with the precomputed scaling factors \sqrt{2}/2 instead of A_n(\varepsilon_s). The vectors \varepsilon_s in step 2 of Algorithm 4.2 are generated without producing roundoff errors.

Let \tilde x^{(0)} = x^{(0)} := x. We denote the vectors computed in step 3 of Algorithm 4.2 by

\tilde x^{(s+1)} := \mathrm{fl}(\tilde S_n(\varepsilon_s)\, \tilde x^{(s)}), \quad s = 0, \ldots, t-1.

Further, we introduce the error vectors e^{(s+1)} \in \mathbb{R}^n by

\tilde x^{(s+1)} = S_n(\varepsilon_s)\, \tilde x^{(s)} + e^{(s+1)}.   (7.8)

Note that e^{(s+1)} describes the precomputation error and roundoff error of one step in step 3 of Algorithm 4.2. The matrix-vector product \tilde S_n(\varepsilon_0)\, \tilde x^{(0)} = S_n(\varepsilon_0)\, x involves only butterfly operations, such that by Lemma 7.2 (i)

\|e^{(1)}\|_2 \le \sqrt{2}\, u\, \|x\|_2.   (7.9)

Every matrix-vector product \tilde S_n(\varepsilon_s)\, \tilde x^{(s)}, s = 1, \ldots, t-1, consists of butterfly operations and rotations/rotation-reflections scaled by \sqrt{2}, such that by Lemma 7.2 (i) and (iii) we obtain

\|e^{(s+1)}\|_2 \le \Big( \frac{4\sqrt{2}}{\sqrt{3}} + \sqrt{2}\, c_2 + O(u) \Big)\, u\, \|\tilde x^{(s)}\|_2, \quad s = 1, \ldots, t-1.   (7.10)

Now we introduce the vectors computed in step 4 of Algorithm 4.2 by

\tilde x^{(t+s+1)} := \mathrm{fl}(P_n(t-s-2)\, \tilde A_n(\varepsilon_{t-s-2})\, \tilde x^{(t+s)}), \quad s = 0, \ldots, t-2,

and the corresponding error vectors e^{(t+s+1)} \in \mathbb{R}^n by

\tilde x^{(t+s+1)} = P_n(t-s-2)\, A_n(\varepsilon_{t-s-2})\, \tilde x^{(t+s)} + e^{(t+s+1)}.   (7.11)

The vector e^{(t+s+1)} describes the precomputation error and roundoff error of one step in step 4 of Algorithm 4.2. Permutations do not produce roundoff errors. Every matrix-vector product \tilde A_n(\varepsilon_{t-s-2})\, \tilde x^{(t+s)}, s = 0, \ldots, t-3, consists of identities and scaled butterfly operations (with precomputed scaling factor \sqrt{2}/2), such that by Lemma 7.2 (ii) we can estimate

\|e^{(t+s+1)}\|_2 \le (2 + \sqrt{2}\, c_1 + O(u))\, u\, \|\tilde x^{(t+s)}\|_2, \quad s = 0, \ldots, t-3.   (7.12)

Note that by \tilde A_n(\varepsilon_0) = I_n we have e^{(2t-1)} = 0. The final part of Algorithm 4.2 is the scaling y := \frac{1}{\sqrt{n}}\, x^{(2t-1)}. Let \tilde y := \mathrm{fl}(\frac{1}{\sqrt{n}}\, \tilde x^{(2t-1)}). By (7.1) we find the estimate

\|\tilde y - y\|_2 \le \|\tilde y - \tfrac{1}{\sqrt{n}}\, \tilde x^{(2t-1)}\|_2 + \tfrac{1}{\sqrt{n}}\, \|\tilde x^{(2t-1)} - x^{(2t-1)}\|_2 \le \tfrac{u}{\sqrt{n}}\, \|\tilde x^{(2t-1)}\|_2 + \tfrac{1}{\sqrt{n}}\, \|\tilde x^{(2t-1)} - x^{(2t-1)}\|_2.   (7.13)

We are now ready to estimate the total roundoff error \|\tilde y - y\|_2 of Algorithm 4.2 under the assumption that \sqrt{2}/2 and the trigonometric values (7.7) in the factor matrices are precomputed with error bounds c_1 u and c_2 u.

Theorem 7.3 Let n = 2^t (t \ge 3). Assume that \sqrt{2}/2 is precomputed with absolute error bound c_1 u and that the values in (7.7) are precomputed with absolute error bound c_2 u, respectively. Then the split-radix Algorithm 4.2 is normwise backward stable with the constant

k_n = \Big( \frac{4}{\sqrt{3}} + 2 + \sqrt{2}\, c_1 + c_2 \Big) (\log_2 n - 1) - \sqrt{2}\, c_1.

Proof. First we estimate the roundoff error \|\tilde x^{(2t-1)} - x^{(2t-1)}\|_2. Applying (7.8) and (7.11) repeatedly, we obtain

\tilde x^{(2t-1)} = x^{(2t-1)} + P_n(0) A_n(\varepsilon_0) \cdots P_n(t-2) A_n(\varepsilon_{t-2})\, S_n(\varepsilon_{t-1}) \cdots S_n(\varepsilon_1)\, e^{(1)} + \ldots
+ P_n(0) A_n(\varepsilon_0) \cdots P_n(t-2) A_n(\varepsilon_{t-2})\, S_n(\varepsilon_{t-1})\, e^{(t-1)}
+ P_n(0) A_n(\varepsilon_0) \cdots P_n(t-2) A_n(\varepsilon_{t-2})\, e^{(t)} + \ldots
+ P_n(0) A_n(\varepsilon_0)\, e^{(2t-2)}.   (7.14)

The matrices S_n(\varepsilon_s), s = 0, \ldots, t-1, are almost orthogonal with the spectral norm \|S_n(\varepsilon_s)\|_2 = \sqrt{2}. The matrices P_n(s)\, A_n(\varepsilon_s), s = 0, \ldots, t-2, are orthogonal, such that \|P_n(s)\, A_n(\varepsilon_s)\|_2 = 1. By (7.8) and (7.11) we can estimate

\|\tilde x^{(s+1)}\|_2 \le \sqrt{2}\, \|\tilde x^{(s)}\|_2 + \|e^{(s+1)}\|_2, \quad s = 0, \ldots, t-1,
\|\tilde x^{(t+s+1)}\|_2 \le \|\tilde x^{(t+s)}\|_2 + \|e^{(t+s+1)}\|_2, \quad s = 0, \ldots, t-2.

Thus by (7.10) and (7.12) we see that

\|\tilde x^{(s+1)}\|_2 \le (\sqrt{2} + O(u))\, \|\tilde x^{(s)}\|_2, \quad s = 0, \ldots, t-1,
\|\tilde x^{(t+s+1)}\|_2 \le (1 + O(u))\, \|\tilde x^{(t+s)}\|_2, \quad s = 0, \ldots, t-2.

Since \tilde x^{(0)} = x, this implies

\|\tilde x^{(s)}\|_2 \le (2^{s/2} + O(u))\, \|x\|_2, \quad s = 0, \ldots, t-1,   (7.15)
\|\tilde x^{(t+s)}\|_2 \le (2^{t/2} + O(u))\, \|x\|_2, \quad s = 0, \ldots, t-2.   (7.16)

From (7.10), (7.12), (7.15), and (7.16) it follows that

\|e^{(s+1)}\|_2 \le 2^{(s+1)/2}\, \Big( \frac{4}{\sqrt{3}} + c_2 + O(u) \Big)\, u\, \|x\|_2, \quad s = 1, \ldots, t-1,
\|e^{(t+s+1)}\|_2 \le 2^{t/2}\, \big( 2 + \sqrt{2}\, c_1 + O(u) \big)\, u\, \|x\|_2, \quad s = 0, \ldots, t-3.

We obtain from (7.14) that

\|\tilde x^{(2t-1)} - x^{(2t-1)}\|_2 \le \|S_n(\varepsilon_{t-1}) \cdots S_n(\varepsilon_1)\|_2\, \|e^{(1)}\|_2 + \ldots + \|S_n(\varepsilon_{t-1})\|_2\, \|e^{(t-1)}\|_2 + \|e^{(t)}\|_2 + \ldots + \|e^{(2t-2)}\|_2
\le (\sqrt{2})^{t-1}\, \|e^{(1)}\|_2 + \ldots + \sqrt{2}\, \|e^{(t-1)}\|_2 + \|e^{(t)}\|_2 + \ldots + \|e^{(2t-2)}\|_2,

and hence by (7.9)

\|\tilde x^{(2t-1)} - x^{(2t-1)}\|_2 \le 2^{t/2}\, u\, \Big[ \Big( \frac{4}{\sqrt{3}} + 2 + \sqrt{2}\, c_1 + c_2 \Big)(t-1) - 1 - \sqrt{2}\, c_1 + O(u) \Big]\, \|x\|_2.   (7.17)

The final part of Algorithm 4.2 is the scaling y = 2^{-t/2}\, x^{(2t-1)}. Let \tilde y = \mathrm{fl}(2^{-t/2}\, \tilde x^{(2t-1)}). For even t this does not produce an additional roundoff error. By (7.13), (7.16), and (7.17) we get the final estimate

\|\tilde y - y\|_2 \le u\, \Big[ \Big( \frac{4}{\sqrt{3}} + 2 + \sqrt{2}\, c_1 + c_2 \Big)(t-1) - \sqrt{2}\, c_1 + O(u) \Big]\, \|x\|_2.

This completes the proof.

Remark 7.4 In [2] it has been shown that for computing y = C_n^{II} x one has normwise backward stability with
(i) k_n = \sqrt{2}\, n^{3/2} for classical matrix-vector computation,
(ii) k_n = (4\sqrt{2} + 2) \log_2 n + 2 for an FFT-based DCT-II algorithm, and
(iii) k_n = 2\sqrt{3}\, (n - 1) for a real fast algorithm based on polynomial arithmetic.
In these algorithms the nontrivial entries of the factor matrices were assumed to be precomputed exactly. As shown in Theorem 7.3, the new Algorithm 4.2 is extremely stable, with a constant which is comparable with the constant for the FFT-based DCT-II algorithm.
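The butterfly estimate (7.2) of Lemma 7.2 (i), the building block of the analysis above, can be checked empirically in single precision (u = 2^{-24}); for float32 operands, double precision serves as a practically exact reference (a small experiment of ours):

```python
import numpy as np

u = 2.0**-24
rng = np.random.default_rng(2)
x0 = rng.standard_normal(100_000).astype(np.float32)
x1 = rng.standard_normal(100_000).astype(np.float32)

# exact butterfly, evaluated in double precision
y0 = x0.astype(np.float64) + x1.astype(np.float64)
y1 = x0.astype(np.float64) - x1.astype(np.float64)
# computed butterfly in single precision
yt0 = (x0 + x1).astype(np.float64)
yt1 = (x0 - x1).astype(np.float64)

# (7.2): (yt0 - y0)^2 + (yt1 - y1)^2 <= 2 u^2 (x0^2 + x1^2)
err = (yt0 - y0)**2 + (yt1 - y1)**2
bound = 2.0 * u * u * (x0.astype(np.float64)**2 + x1.astype(np.float64)**2)
assert np.all(err <= bound)
```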
Remark 7.5 Considering the numerical stability of Algorithm 4.4, we obtain a constant k_n = O(\sqrt{n}\, \log_2 n). Hence Algorithm 4.4 is less stable than Algorithm 4.2. This is due to the fact that the matrices A'_n(\varepsilon_s) and S'_n(\varepsilon_s), s = 1, \ldots, t-2, as well as S_n(\varepsilon_{t-1}), are no longer orthogonal, but have the spectral norm \sqrt{2}.
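To close, a small experiment of ours relating forward and backward error for an orthogonal transform. It applies C_n^{II} by a plain single precision matrix-vector product, so only the classical bound k_n = \sqrt{2}\, n^{3/2} of Remark 7.4 (i) is guaranteed; it does not reproduce the much smaller split-radix constant of Theorem 7.3:

```python
import numpy as np

def dct2_matrix(n):
    # orthogonal DCT-II matrix C_n^II
    j, k = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    C = np.sqrt(2.0 / n) * np.cos(j * (2 * k + 1) * np.pi / (2 * n))
    C[0, :] /= np.sqrt(2.0)
    return C

u = 2.0**-24                     # single precision unit roundoff
n = 1024
rng = np.random.default_rng(3)
x = rng.standard_normal(n)

C = dct2_matrix(n)
y = C @ x                                           # double precision reference
y_tilde = (C.astype(np.float32) @ x.astype(np.float32)).astype(np.float64)

# Since C is orthogonal, ||y~ - y||_2 / ||y||_2 equals the normwise
# backward error ||dx||_2 / ||x||_2 with y~ = C(x + dx).
rel = np.linalg.norm(y_tilde - y) / np.linalg.norm(y)
assert rel <= np.sqrt(2.0) * n**1.5 * u    # classical bound of Remark 7.4 (i)
```

In practice rel is far below this worst-case bound; a split-radix implementation with the constant of Theorem 7.3 would guarantee an error of order u \log_2 n instead.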
References
[1] N. Ahmed, T. Natarajan, and K. R. Rao, Discrete cosine transform, IEEE Trans. Comput. 23 (1974), 90–93.
[2] G. Baszenski, U. Schreiber, and M. Tasche, Numerical stability of fast cosine transforms, Numer. Funct. Anal. Optim. 21 (2000), 25–46.
[3] V. Britanak, A unified discrete cosine and discrete sine transform computation, Signal Process. 43 (1995), 333–339.
[4] W. H. Chen, C. H. Smith, and S. Fralick, A fast computational algorithm for the discrete cosine transform, IEEE Trans. Comm. 25 (1977), 1004–1009.
[5] I. Daubechies and W. Sweldens, Factoring wavelet transforms into lifting steps, J. Fourier Anal. Appl. 4 (1998), 247–269.
[6] P. Duhamel and H. Hollmann, Split radix FFT algorithm, Electron. Lett. 20 (1984), 14–16.
[7] P. Duhamel, Implementation of the split-radix FFT algorithms for complex, real, and real-symmetric data, IEEE Trans. Acoust. Speech Signal Process. 34 (1986), 285–295.
[8] E. Feig, A scaled DCT algorithm, Proc. SPIE 1244 (1990), 2–13.
[9] E. Feig and S. Winograd, Fast algorithms for the discrete cosine transform, IEEE Trans. Signal Process. 40 (1992), 2174–2193.
[10] M. T. Heideman, D. H. Johnson, and C. S. Burrus, Gauss and the history of the fast Fourier transform, Arch. Hist. Exact Sci. 34 (1985), 265–277.
[11] N. J. Higham, Accuracy and Stability of Numerical Algorithms, SIAM, Philadelphia, 1996.
[12] H. S. Hou, A fast recursive algorithm for computing the discrete cosine transform, IEEE Trans. Acoust. Speech Signal Process. 35 (1987), 1455–1461.
[13] B. Lee, A new algorithm to compute the discrete cosine transform, IEEE Trans. Acoust. Speech Signal Process. 32 (1984), 1243–1245.
[14] C. Loeffler, A. Ligtenberg, and G. S. Moschytz, Practical fast 1-D DCT algorithms with 11 multiplications, Proc. IEEE ICASSP (1989), 989–991.
[15] M. J. Narasimha and A. M. Peterson, On computing the discrete cosine transform, IEEE Trans. Commun. 26 (1978), 934–936.
[16] M. Püschel and J. M. Moura, The algebraic approach to the discrete cosine and sine transforms and their fast algorithms, Preprint, Carnegie Mellon Univ., Pittsburgh, 2001.
[17] K. R. Rao and P. Yip, Discrete Cosine Transform: Algorithms, Advantages, Applications, Academic Press, Boston, 1990.
[18] C. Runge, On the decomposition of empirically given periodic functions into sine waves (in German), Z. Math. Phys. 48 (1903), 443–456, and 52 (1905), 117–123.
[19] U. Schreiber, Fast and numerically stable trigonometric transforms (in German), Thesis, Univ. of Rostock, 1999.
[20] G. Steidl, Fast radix-p discrete cosine transform, Appl. Algebra Engrg. Comm. Comput. 3 (1992), 39–46.
[21] G. Steidl and M. Tasche, A polynomial approach to fast algorithms for discrete Fourier-cosine and Fourier-sine transforms, Math. Comput. 56 (1991), 281–296.
[22] G. Strang, The discrete cosine transform, SIAM Rev. 41 (1999), 135–147.
[23] M. Tasche and H. Zeuner, Roundoff error analysis for fast trigonometric transforms, in: Handbook of Analytic-Computational Methods in Applied Mathematics, G. Anastassiou (ed.), Chapman & Hall/CRC, Boca Raton, 2000, pp. 357–406.
[24] C. F. Van Loan, Computational Frameworks for the Fast Fourier Transform, SIAM, Philadelphia, 1992.
[25] M. Vetterli and H. J. Nussbaumer, Simple FFT and DCT algorithms with reduced number of operations, Signal Process. 6 (1984), 267–278.
[26] Z. Wang, Fast algorithms for the discrete W transform and for the discrete Fourier transform, IEEE Trans. Acoust. Speech Signal Process. 32 (1984), 803–816.
[27] H. Zeuner, A general theory of stochastic roundoff error analysis with applications to DFT and DCT, J. Comput. Anal. Appl., to appear.

INSTITUTE OF MATHEMATICS, GERHARD-MERCATOR-UNIVERSITY OF DUISBURG, D-47048 DUISBURG, GERMANY
DEPARTMENT OF MATHEMATICS, UNIVERSITY OF ROSTOCK, D-18051 ROSTOCK, GERMANY
E-mail addresses:
[email protected] [email protected]