Algorithms to solve block Toeplitz systems and least-squares problems by transforming to Cauchy-like matrices

K. Gallivan, S. Thirumalai, P. Van Dooren

1 Introduction

Fast algorithms to factor Toeplitz matrices have existed since the beginning of this century [17]. Several new fast and superfast algorithms to solve structured matrices such as Toeplitz and Hankel matrices have been proposed and studied over the last couple of decades. A fundamental problem with these fast methods is that they do not allow any form of pivoting because the structure of the matrices is destroyed. In [12], [11] the authors suggest ways to overcome this problem by transforming one class of structured matrices to another using fast trigonometric transforms in such a way that pivoting may be incorporated in the factorization algorithms. These algorithms factor Toeplitz, Hankel and Vandermonde matrices by converting them to Cauchy matrices and performing Gaussian elimination with partial pivoting. It was shown that the special displacement structure of Cauchy matrices is conducive to pivoting strategies such as partial pivoting. The algorithms suggested in [12], [11], however, did not exploit properties such as realness and symmetry simultaneously in the matrices. Recently there has been a surge of activity in this area and new variants are constantly being developed to solve various structured matrix problems with pivoting. This paper attempts to collect all the variants to solve Toeplitz matrices and compare them from a computational standpoint. Emphasis is given to variants that exploit properties such as realness and symmetry. We also present some new extensions of these algorithms to solve Toeplitz least-squares problems. These extensions are based on the normal equations and the augmented system of equations. Finally, we also suggest how a rank-revealing QR algorithm can be obtained for Toeplitz matrices by converting them to Cauchy-like matrices. Most problems in signal processing and systems theory yield block Toeplitz and block Hankel systems that are real and symmetric. Least-squares problems in these areas also give rise to real matrices. 
In [10], Gallivan et al. propose an algorithm based on the idea of converting real, symmetric positive semi-definite block Toeplitz systems into Cauchy-like matrices using the Hartley transform. This algorithm preserves desirable properties such as realness and symmetry, thereby significantly reducing the complexity of the method. The algorithm has one drawback: in some cases there is a possibility of breakdown. In such cases, the authors suggest that realness be given up and that the real Cauchy-like matrix be converted to a complex Cauchy-like matrix. The factorization then preserves the Hermitian property of the complex Cauchy-like matrix. In this paper we discuss variants of the original Cauchy-based algorithm to factor real, symmetric Toeplitz matrices that do not have the drawback mentioned above and still exploit the properties of realness and symmetry simultaneously. Section 2 reviews some fundamental concepts such as the displacement structure of a matrix and shows how one class of structured matrices may be transformed to another class using trigonometric transforms. Section 3 reviews the Gaussian elimination method applied to structured matrices as proposed by Gohberg et al. in [11]. A variant of this algorithm that uses a symmetric form of the displacement equation, [10], is also reviewed in section 3. Section 4 presents a variant to factor Hermitian Toeplitz matrices that exploits the Hermitian symmetry property. This section also estimates the computational savings that result from exploiting the Hermitian property. Section 5 deals with factoring real unsymmetric Toeplitz matrices and estimates the computational cost. Section 6 deals with factoring real and symmetric Toeplitz matrices and estimates the savings in computation that result from exploiting both properties simultaneously. Section 6 also presents a new algorithm to convert a Hermitian Toeplitz matrix to a real, symmetric Cauchy-like matrix. The factorization of this resulting real, symmetric Cauchy-like matrix is significantly less expensive than the algorithm that converts a Hermitian Toeplitz matrix to a Hermitian Cauchy-like matrix before factorization. Section 7 discusses algorithms to solve real Toeplitz least-squares problems using the method of normal equations and the augmented system of equations. Section 8 derives a new rank-revealing QR factorization algorithm based on the conversion of Toeplitz matrices to Cauchy-like matrices. Section 9 discusses generalization of the algorithms to block Toeplitz matrices. We also present results from experiments on several high performance architectures such as the Cray Y-MP, J90, and T90.

2 Transformations between classes of structured matrices

In [14] Kailath et al. introduced the idea of displacement structure to describe the structure in Toeplitz matrices that makes them conducive to fast factorization schemes. Consider a real Toeplitz matrix $T$. Let $Z$ be the down-shift matrix that shifts a matrix one row down when it is applied from the left. The matrix $T - Z T Z^T$ can immediately be recognized as a sparse matrix in which only the first row and column contain non-zero elements. This matrix has a rank two factorization. The equation

$$T - Z T Z^T = G H^T \tag{1}$$

is called the displacement equation of the Toeplitz matrix w.r.t. $(Z, Z^T)$, and the rank of the matrices $G$ and $H$ is the displacement rank. For Toeplitz matrices this rank is 2. The matrix $T$ can be easily constructed from $G$ and $H$, and hence these matrices are referred to as the generators of $T$ w.r.t. $(Z, Z^T)$. The displacement equation may also be written as

$$Z T - T Z = G H^T. \tag{2}$$

We refer to Equation (1) as the displacement equation of type I and to (2) as the displacement equation of type II. For real, symmetric Toeplitz matrices the displacement equation is of the following form:

$$T - Z T Z^T = G J G^T \tag{3}$$

where $J$ is a symmetric (diagonal) matrix and $G$ has rank 2. Factorizations of the forms described above can be obtained analytically from the entries of the Toeplitz matrix without the need for explicit rank factorizations. Fast algorithms based on the above displacement structure to factor Toeplitz matrices are commonly referred to as Schur or Bareiss-type algorithms. These are in contrast to Levinson-type algorithms that factor the inverse of the Toeplitz matrix. In [8], Chun et al. describe a method to generalize the Schur algorithm for factoring Toeplitz matrices to the QR factorization of Toeplitz matrices. They do this by demonstrating that the matrix $T^T T$ has a displacement rank of 4 w.r.t. $(Z, Z^T)$. Since this algorithm is a generalization of the classical Schur algorithm (displacement rank > 2), it is referred to as the generalized Schur algorithm. Again, as in the case of Toeplitz matrices, the generators can be computed analytically without the need for explicit rank factorization. All this extends quite naturally to block Toeplitz matrices, which have similar displacement structure properties. If $T$ is a block Toeplitz matrix with a block size of $m$, and $Z$ is the down-shift matrix that shifts $T$ down by $m$ rows when applied from the left, then $T$ has a displacement rank of $2m$ w.r.t. $(Z, Z^T)$. Similarly, $T^T T$ has a displacement rank of $4m$ w.r.t. the same matrices. In general, any matrix $A$ that has a low displacement rank w.r.t. the matrices $(F_l, F_r)$ is called a structured matrix. The following equation represents a general displacement equation of type II:

$$F_l A - A F_r = G H^T \tag{4}$$
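The displacement ranks quoted above can be checked numerically. The following is a minimal sketch, assuming NumPy and a dense random real Toeplitz matrix; the helper names `toeplitz` and `displacement_rank` and the rank tolerance are illustrative choices, not part of the original text.

```python
import numpy as np

def toeplitz(col, row):
    """Dense Toeplitz matrix with first column `col` and first row `row`."""
    n = len(col)
    i, j = np.indices((n, n))
    return np.where(i >= j, col[np.abs(i - j)], row[np.abs(i - j)])

def displacement_rank(A, tol=1e-10):
    """Numerical rank of A - Z A Z^T, with Z the (non-circulant) down-shift."""
    n = A.shape[0]
    Z = np.diag(np.ones(n - 1), -1)
    return np.linalg.matrix_rank(A - Z @ A @ Z.T, tol=tol)

rng = np.random.default_rng(0)
n = 8
col = rng.standard_normal(n)
row = rng.standard_normal(n)
row[0] = col[0]                      # first row and column share the diagonal entry
T = toeplitz(col, row)

rank_T = displacement_rank(T)        # expected to be at most 2
rank_TtT = displacement_rank(T.T @ T)  # expected to be at most 4
print(rank_T, rank_TtT)
```

The explicit tolerance is used because rounding errors in forming $T^T T$ can otherwise be mistaken for additional rank.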

Cauchy-like matrices have the property that the displacement matrices $F_l$ and $F_r$ are diagonal. Any transform that diagonalizes the matrices $F_l$ and $F_r$ can, therefore, be used to convert the given structured matrix $A$ to a Cauchy-like matrix. We show that the displacement structure of a Cauchy-like matrix is invariant to pivoting. Let the displacement equation for a Cauchy-like matrix $C$ be:

$$D_l C - C D_r = G H^T \tag{5}$$

Let $P$ be a permutation matrix ($P^T P = I$) corresponding to a partial pivoting operation. Applying this permutation to the Cauchy-like matrix $C$ yields

$$\begin{aligned}
P D_l C - P C D_r &= P G H^T \\
(P D_l P^T)(P C) - (P C) D_r &= (P G) H^T \\
\hat{D}_l \hat{C} - \hat{C} D_r &= \hat{G} H^T.
\end{aligned} \tag{6}$$
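The invariance under row permutations can be illustrated with a small numerical sketch. It assumes NumPy, random real generators and arbitrarily chosen distinct diagonal entries; all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n, alpha = 6, 2
dl = np.arange(1.0, n + 1)           # diagonal of D_l
dr = -np.arange(1.0, n + 1)          # diagonal of D_r, disjoint from D_l
G = rng.standard_normal((n, alpha))
H = rng.standard_normal((n, alpha))

# Cauchy-like matrix defined by D_l C - C D_r = G H^T
C = (G @ H.T) / (dl[:, None] - dr[None, :])

# A row permutation (as produced by partial pivoting) only permutes
# the rows of G and the diagonal of D_l; D_r and H are untouched.
perm = rng.permutation(n)
C_hat, dl_hat, G_hat = C[perm], dl[perm], G[perm]

residual = np.diag(dl_hat) @ C_hat - C_hat @ np.diag(dr) - G_hat @ H.T
print(np.abs(residual).max())
```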

It is clear that $\hat{D}_l$ has the same structure as $D_l$ (diagonal) and that the equation remains unchanged in structure. It is also easily verified that if $D_l$ were not diagonal, such a permutation would destroy the displacement structure. For symmetric matrices with symmetric pivoting we would require both $D_l$ and $D_r$ to be diagonal. In particular, consider type II of the displacement equation for a Toeplitz matrix $T$ of size $n$ w.r.t. $(Z, Z)$, where $Z$ is the circulant down-shift matrix:

$$Z T - T Z = G H^T \tag{7}$$

It is known that the discrete Fourier transform (DFT) diagonalizes the displacement matrix $Z$. Let $F$ be the DFT matrix of size $n$ defined by $F = \frac{1}{\sqrt{n}} \left[ e^{\frac{2\pi i}{n} k j} \right]_{0 \le k, j \le n-1}$; then

$$F Z F^* = \Lambda \tag{8}$$

where $\Lambda$ is a diagonal matrix with

$$\Lambda(j, j) = e^{\frac{2\pi i}{n} j} \quad \text{for } j = 0, \ldots, n-1, \quad \text{where } i = \sqrt{-1}.$$

Applying the transformation $F(\cdot)F^*$ to the displacement equation we have

$$\begin{aligned}
F Z T F^* - F T Z F^* &= F G H^T F^* \\
(F Z F^*)(F T F^*) - (F T F^*)(F Z F^*) &= \hat{G} \hat{H}^* \\
\Lambda C - C \Lambda &= \hat{G} \hat{H}^*
\end{aligned} \tag{9}$$

where the displacement matrices $D_l$ and $D_r$ of (5) are both equal to $\Lambda$ and $C$ is a Cauchy-like matrix. Note that if $T$ were a real matrix, $G$ and $H$ would also be real. The DFT transformation converts these real matrices to complex matrices. This is undesirable because it increases the amount of computation involved in the factorization and the triangular solves. It will be shown later that at each step of the factorization of Cauchy-like matrices of the form shown in (5) one has to solve Lyapunov equations derived from the displacement equation. It is well known that the Lyapunov equation shown in (5) can be solved uniquely only if $F_l$ and $F_r$ have no eigenvalue in common. This in turn implies that the diagonal matrices $D_l$ and $D_r$ must have no entry in common. If they have some identical eigenvalues, then one has to compute certain additional parameters which need to be updated throughout the factorization algorithm. This increases the complexity of the algorithm. Keeping this in mind, Gohberg et al. [11] introduce a different form of the displacement equation to solve Toeplitz matrices. Consider the displacement matrices $Z_1$ and $Z_{-1}$ defined as:

$$Z_1 = \begin{pmatrix} 0 & 0 & \cdots & 0 & 1 \\ 1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & \ddots & & \vdots \\ \vdots & & \ddots & \ddots & \vdots \\ 0 & \cdots & 0 & 1 & 0 \end{pmatrix}, \qquad
Z_{-1} = \begin{pmatrix} 0 & 0 & \cdots & 0 & -1 \\ 1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & \ddots & & \vdots \\ \vdots & & \ddots & \ddots & \vdots \\ 0 & \cdots & 0 & 1 & 0 \end{pmatrix} \tag{10}$$

The displacement equation for Toeplitz matrices can be written as

$$Z_1 T - T Z_{-1} = G H^T \tag{11}$$

with a displacement rank of 2. It is well known that the DFT matrix $F$ diagonalizes both $Z_1$ and $Z_{-1}$ as:

$$\begin{aligned}
F Z_1 F^* &= \Lambda_F \quad \text{and} \quad F (D Z_{-1} D^{-1}) F^* = \Lambda_{F_-}, \quad \text{where} \\
\Lambda_F &= \mathrm{diag}\left(1, e^{\frac{2\pi i}{n}}, \ldots, e^{\frac{2\pi i}{n}(n-1)}\right), \\
\Lambda_{F_-} &= \mathrm{diag}\left(e^{\frac{\pi i}{n}}, e^{\frac{3\pi i}{n}}, \ldots, e^{\frac{(2n-1)\pi i}{n}}\right) \quad \text{and} \\
D &= \mathrm{diag}\left(1, e^{\frac{\pi i}{n}}, \ldots, e^{\frac{(n-1)\pi i}{n}}\right).
\end{aligned} \tag{12}$$

The displacement equation can now be rewritten as:

$$\begin{aligned}
(F Z_1 F^*)(F T D^{-1} F^*) - (F T D^{-1} F^*)(F D Z_{-1} D^{-1} F^*) &= (F G)(H^T D^{-1} F^*) \\
\Lambda_F C - C \Lambda_{F_-} &= \hat{G} \hat{H}^*
\end{aligned} \tag{13}$$

where $C$ is a Cauchy-like matrix defined by $C = F T D^{-1} F^*$. It can be seen from the above displacement equation that if $T$ is a real matrix, the DFT matrix destroys the realness property. Also, if $T$ is symmetric, the Cauchy-like matrix that it is transformed to is no longer Hermitian. In the next few subsections we review several forms of displacement equations and the corresponding fast trigonometric transforms that convert the Toeplitz matrices to Cauchy-like matrices.
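The conversion of a Toeplitz matrix to a Cauchy-like matrix with the DFT can be sketched numerically as follows. This is only an illustration, assuming NumPy, dense matrices (so no FFT is used) and a random real unsymmetric Toeplitz matrix; the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 8
k = np.arange(n)

# Random real unsymmetric Toeplitz matrix T[i, j] = t[i - j]
col, row = rng.standard_normal(n), rng.standard_normal(n)
i, j = np.indices((n, n))
T = np.where(i >= j, col[np.abs(i - j)], row[np.abs(i - j)])

Z1 = np.roll(np.eye(n), 1, axis=0)   # circulant down-shift, +1 in the corner
Zm1 = Z1.copy()
Zm1[0, -1] = -1.0                    # same shift with -1 in the corner

F = np.exp(2j * np.pi * np.outer(k, k) / n) / np.sqrt(n)   # DFT matrix
Fh = F.conj().T
D = np.diag(np.exp(1j * np.pi * k / n))
Dinv = D.conj()                      # D is diagonal and unitary

lam_F = np.exp(2j * np.pi * k / n)           # eigenvalues paired with Z_1
lam_Fm = np.exp(1j * np.pi * (2 * k + 1) / n)  # eigenvalues paired with Z_{-1}

diag1_ok = np.allclose(F @ Z1 @ Fh, np.diag(lam_F))
diag2_ok = np.allclose(F @ D @ Zm1 @ Dinv @ Fh, np.diag(lam_Fm))

R = Z1 @ T - T @ Zm1                 # Toeplitz displacement, rank at most 2
rank_R = np.linalg.matrix_rank(R, tol=1e-10)

C = F @ T @ Dinv @ Fh                # Cauchy-like matrix
residual = np.diag(lam_F) @ C - C @ np.diag(lam_Fm) - F @ R @ Dinv @ Fh
print(diag1_ok, diag2_ok, rank_R, np.abs(residual).max())
```

Because the two sets of eigenvalues are the $n$-th roots of $1$ and of $-1$ respectively, they are disjoint, so every entry of the Cauchy-like matrix can be recovered from its generators.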

2.1 Non-Hermitian Toeplitz matrices

Consider a non-Hermitian Toeplitz matrix $T$. The displacement equation for such a matrix using the displacement matrices $Z_1$ and $Z_{-1}$ mentioned above would be of the form

$$Z_1 T - T Z_{-1} = G H^*. \tag{14}$$

The Toeplitz matrix in the above equation could be converted to a Cauchy-like matrix as demonstrated in (13).

2.2 Hermitian Toeplitz matrices

The technique described in section 2.1 could be applied to Hermitian Toeplitz matrices. However, doing so would convert the Hermitian Toeplitz matrix to a non-Hermitian Cauchy-like matrix. In order to maintain Hermitian symmetry, one may use the following displacement equation:

$$T - Z_1 T Z_1^T = G J G^* \tag{15}$$

Applying the DFT transformation $F(\cdot)F^*$ to the above equation we get the following displacement equation for the Hermitian Cauchy-like matrix $C = F T F^*$:

$$C - \Lambda C \Lambda^* = \hat{G} J \hat{G}^* \tag{16}$$

2.3 Real unsymmetric Toeplitz matrices

The techniques described in sections 2.1 and 2.2 would destroy the realness property of Toeplitz matrices. One would like to preserve the property of realness to avoid complex arithmetic, which is in general more expensive than real arithmetic. Just as the discrete Fourier transform was used to convert from complex Toeplitz matrices to complex Cauchy-like matrices, several real trigonometric transforms such as the discrete Sine, Cosine and Hartley transforms can be used to convert real Toeplitz matrices to real Cauchy-like matrices. In this section we demonstrate how these transforms may be used. We first review the displacement matrices and the real trigonometric transforms that diagonalize them [4]. We then show how they may be used to convert real Toeplitz matrices into real Cauchy-like matrices. Consider the general displacement matrix $Z_{\gamma\delta}$ of size $n$ defined as:

$$Z_{\gamma\delta} = \begin{pmatrix} \gamma & 1 & 0 & \cdots & 0 \\ 1 & 0 & 1 & \ddots & \vdots \\ 0 & \ddots & \ddots & \ddots & 0 \\ \vdots & \ddots & 1 & 0 & 1 \\ 0 & \cdots & 0 & 1 & \delta \end{pmatrix} \tag{17}$$

It can be easily verified that when $\gamma = \delta = 0$, the discrete Sine transform (DST) matrix defined as

$$S_{00} = \sqrt{\frac{2}{n+1}} \left[ \sin \frac{i j \pi}{n+1} \right], \quad i, j = 1, \ldots, n \tag{18}$$

diagonalizes $Z_{00}$. Specifically, we have:

$$S_{00} Z_{00} S_{00} = 2\, \mathrm{diag}\left( \cos \frac{j \pi}{n+1} \right), \quad j = 1, \ldots, n \tag{19}$$

When $\gamma = \delta = 1$, the discrete Cosine transform-II (DCT-II) $S_{11}$ diagonalizes $Z_{11}$:

$$\begin{aligned}
S_{11} &= \sqrt{\frac{2}{n}} \left[ k_j \cos \frac{(2i+1) j \pi}{2n} \right], \quad i, j = 0, \ldots, n-1 \\
S_{11}^T Z_{11} S_{11} &= 2\, \mathrm{diag}\left( \cos \frac{j \pi}{n} \right), \quad j = 0, \ldots, n-1
\end{aligned} \tag{20}$$

Here $k_j = 1/\sqrt{2}$ for $j = 0$ and $k_j = 1$ otherwise. For $\gamma = \delta = -1$, the discrete Sine transform-II (DST-II) $S_{-1-1}$ diagonalizes $Z_{-1-1}$:

$$\begin{aligned}
S_{-1-1} &= \sqrt{\frac{2}{n}} \left[ k_j \sin \frac{(2i-1) j \pi}{2n} \right], \quad i, j = 1, \ldots, n \\
S_{-1-1}^T Z_{-1-1} S_{-1-1} &= 2\, \mathrm{diag}\left( \cos \frac{j \pi}{n} \right), \quad j = 1, \ldots, n
\end{aligned} \tag{21}$$

Here, $k_j = 1/\sqrt{2}$ for $j = n$ and $k_j = 1$ otherwise. If $\gamma = -1$ and $\delta = 1$ or vice-versa, then the discrete Sine transform-IV (DST-IV) $S_{-11}$ diagonalizes $Z_{-11}$ and the discrete Cosine transform-IV (DCT-IV) $S_{1-1}$ diagonalizes $Z_{1-1}$:

$$\begin{aligned}
S_{-11} &= \sqrt{\frac{2}{n}} \left[ \sin \frac{(2i+1)(2j+1) \pi}{4n} \right], \quad i, j = 0, \ldots, n-1 \\
S_{-11} Z_{-11} S_{-11} &= 2\, \mathrm{diag}\left( \cos \frac{(2j+1) \pi}{2n} \right), \quad j = 0, \ldots, n-1 \\
S_{1-1} &= \sqrt{\frac{2}{n}} \left[ \cos \frac{(2i+1)(2j+1) \pi}{4n} \right], \quad i, j = 0, \ldots, n-1 \\
S_{1-1} Z_{1-1} S_{1-1} &= 2\, \mathrm{diag}\left( \cos \frac{(2j+1) \pi}{2n} \right), \quad j = 0, \ldots, n-1
\end{aligned} \tag{22}$$
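The five transform and displacement-matrix pairs listed above can be verified numerically. The sketch below assumes NumPy and builds each transform as a dense matrix; the function names `z_matrix` and `transform_and_eigs`, and the use of two corner parameters `gamma` and `delta`, are illustrative conventions rather than notation confirmed by the source.

```python
import numpy as np

def z_matrix(n, gamma, delta):
    """Tridiagonal displacement matrix with corner entries gamma and delta."""
    Z = np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)
    Z[0, 0], Z[-1, -1] = gamma, delta
    return Z

def transform_and_eigs(n, gamma, delta):
    """Orthogonal trigonometric transform S and eigenvalues of Z_{gamma,delta}."""
    i0, j0 = np.arange(n)[:, None], np.arange(n)[None, :]
    i1, j1 = i0 + 1, j0 + 1
    if (gamma, delta) == (0, 0):        # DST-I
        S = np.sqrt(2 / (n + 1)) * np.sin(i1 * j1 * np.pi / (n + 1))
        lam = 2 * np.cos(j1 * np.pi / (n + 1))
    elif (gamma, delta) == (1, 1):      # DCT-II
        kj = np.where(j0 == 0, 1 / np.sqrt(2), 1.0)
        S = np.sqrt(2 / n) * kj * np.cos((2 * i0 + 1) * j0 * np.pi / (2 * n))
        lam = 2 * np.cos(j0 * np.pi / n)
    elif (gamma, delta) == (-1, -1):    # DST-II
        kj = np.where(j1 == n, 1 / np.sqrt(2), 1.0)
        S = np.sqrt(2 / n) * kj * np.sin((2 * i1 - 1) * j1 * np.pi / (2 * n))
        lam = 2 * np.cos(j1 * np.pi / n)
    else:                               # DST-IV for (-1, 1), DCT-IV for (1, -1)
        trig = np.sin if (gamma, delta) == (-1, 1) else np.cos
        S = np.sqrt(2 / n) * trig((2 * i0 + 1) * (2 * j0 + 1) * np.pi / (4 * n))
        lam = 2 * np.cos((2 * j0 + 1) * np.pi / (2 * n))
    return S, lam.ravel()

n = 7
checks = {}
for gamma, delta in [(0, 0), (1, 1), (-1, -1), (-1, 1), (1, -1)]:
    S, lam = transform_and_eigs(n, gamma, delta)
    Z = z_matrix(n, gamma, delta)
    orthogonal = np.allclose(S.T @ S, np.eye(n))
    diagonalizes = np.allclose(S.T @ Z @ S, np.diag(lam))
    checks[(gamma, delta)] = (orthogonal, diagonalizes)
print(checks)
```

In a fast implementation these dense products would be replaced by fast sine and cosine transform routines; the dense form is used here only to make the identities easy to inspect.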

The displacement matrices $Z_{\gamma\delta}$ can be used to formulate the displacement equations for unsymmetric Toeplitz matrices. However, unlike the case of $Z_1$, the displacement rank w.r.t. $Z_{\gamma\delta}$ will be 4 and not 2. This is the penalty one incurs for staying in the real domain. It will be shown in a later section that this increase in the displacement rank does not result in an algorithm more expensive than the one that transforms the real matrix to a complex Cauchy-like matrix.

Consider a real unsymmetric Toeplitz matrix $T$. One could use any of the following displacement equations:

$$\begin{aligned}
Z_{00} T - T Z_{11} &= G_1 H_1^T \\
Z_{00} T - T Z_{-1-1} &= G_2 H_2^T \\
Z_{00} T - T Z_{1-1} &= G_3 H_3^T \\
Z_{00} T - T Z_{-11} &= G_4 H_4^T
\end{aligned} \tag{23}$$

Each of the equations shown above results in a displacement rank of 4. The corresponding real trigonometric transformations can then be applied to obtain the displacement equation for the corresponding real Cauchy-like matrix. It must be mentioned that the matrices $G_1, \ldots, G_4$ and $H_1, \ldots, H_4$ can be calculated analytically without the need for a rank factorization routine.
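As an illustration of the first of these displacement equations, the sketch below forms the displacement of a random real unsymmetric Toeplitz matrix with respect to the tridiagonal matrices with corner values 0 and 1, checks its numerical rank, and applies the DST-I and DCT-II to obtain a real Cauchy-like matrix. It assumes NumPy and dense matrices; the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 8
j0, j1 = np.arange(n), np.arange(1, n + 1)

# Random real unsymmetric Toeplitz matrix
col, row = rng.standard_normal(n), rng.standard_normal(n)
i, j = np.indices((n, n))
T = np.where(i >= j, col[np.abs(i - j)], row[np.abs(i - j)])

Z00 = np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)
Z11 = Z00.copy()
Z11[0, 0] = Z11[-1, -1] = 1.0

S00 = np.sqrt(2 / (n + 1)) * np.sin(np.outer(j1, j1) * np.pi / (n + 1))  # DST-I
kj = np.where(j0 == 0, 1 / np.sqrt(2), 1.0)
S11 = np.sqrt(2 / n) * kj[None, :] * np.cos(np.outer(2 * j0 + 1, j0) * np.pi / (2 * n))  # DCT-II

R = Z00 @ T - T @ Z11                       # displacement, rank at most 4
rank_R = np.linalg.matrix_rank(R, tol=1e-10)

dl = 2 * np.cos(j1 * np.pi / (n + 1))       # eigenvalues of Z00
dr = 2 * np.cos(j0 * np.pi / n)             # eigenvalues of Z11
C = S00.T @ T @ S11                         # real Cauchy-like matrix
residual = np.diag(dl) @ C - C @ np.diag(dr) - S00.T @ R @ S11
gap = np.abs(dl[:, None] - dr[None, :]).min()   # positive: the two spectra are disjoint
print(rank_R, np.abs(residual).max(), gap)
```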

2.4 Real Symmetric Toeplitz matrices

If the Toeplitz matrix is real and symmetric, then the techniques described in section 2.3 would convert it to an unsymmetric Cauchy-like matrix. One may, however, use symmetric forms of the displacement equations described in section 2.3:

$$Z_{\gamma\gamma} T - T Z_{\gamma\gamma} = G \Sigma G^T \tag{24}$$

where $T$ is a real symmetric Toeplitz matrix of size $n$, $\Sigma$ is a skew-symmetric signature matrix and $\gamma$ may be either 0 or 1. By construction, it is easy to see that the displacement of $T$ w.r.t. $Z_{\gamma\gamma}$ is of the form:

$$Z_{\gamma\gamma} T - T Z_{\gamma\gamma} = \begin{pmatrix} 0 & -a^T & 0 \\ a & 0 & e a \\ 0 & -a^T e & 0 \end{pmatrix} \tag{25}$$

where $a$ is a vector of length $n-2$ that depends on the displacement matrix $Z_{\gamma\gamma}$ and $e$ is the reflection permutation matrix of size $n-2$. From the above displacement equation, we see that the generator $G$ can be written as:

$$G = \begin{pmatrix} 0 & 1 & 0 & 0 \\ a & 0_{n-2} & e a & 0_{n-2} \\ 0 & 0 & 0 & 1 \end{pmatrix} \tag{26}$$

In the above equation $0_{n-2}$ is a vector of zeros of size $(n-2) \times 1$. The Sine-I ($S_{00}$) or the Cosine-II ($S_{11}$) transforms can be used to diagonalize the displacement matrices $Z_{00}$ and $Z_{11}$. The corresponding displacement equation for the Cauchy-like matrix $C$ is:

$$L C - C L = \hat{G} \Sigma \hat{G}^T \tag{27}$$

where $L = S_{\gamma\gamma}^T Z_{\gamma\gamma} S_{\gamma\gamma}$, $C = S_{\gamma\gamma}^T T S_{\gamma\gamma}$ and $\hat{G} = S_{\gamma\gamma}^T G$. The displacement rank of the above equations is 4. Interestingly, the Cauchy-like matrix $C$ has a lot of sparsity that can be exploited during the factorization algorithm. Specifically, if $P$ is the odd-even sort permutation matrix (i.e. $P x = [x_1\ x_3\ \cdots\ x_2\ x_4\ \cdots]^T$), then

$$P C P^T = \begin{pmatrix} M_1 & 0 \\ 0 & M_2 \end{pmatrix} \tag{28}$$

where $M_1$ is a Cauchy-like matrix of size $\lceil n/2 \rceil$ and $M_2$ is of size $\lfloor n/2 \rfloor$. In addition, it can be shown that the matrices $M_1$ and $M_2$ have a displacement rank of 2 as opposed to $C$, which has a displacement rank of 4. We prove this for both even and odd $n$. First, consider the case when $n$ is even. From the definitions of $S_{00}$ and $S_{11}$, it can be seen that they satisfy the following condition when $n$ is even:

$$P S P^T = \begin{pmatrix} S_1 & S_2 \\ E S_1 & -E S_2 \end{pmatrix}, \tag{29}$$

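where $S_1$ and $S_2$ are submatrices of size $n/2$ that depend on the trigonometric transform $S_{00}$ or $S_{11}$ and $E$ is the reflection permutation matrix of size $n/2$. Let us partition the matrix $P T P^T$ as

$$P T P^T = \begin{pmatrix} T_1 & T_2 \\ T_2^T & T_1 \end{pmatrix}. \tag{30}$$

Here $T_1$ is a symmetric Toeplitz matrix and $T_2$ is a non-symmetric Toeplitz matrix of size $n/2$. It follows that

$$\begin{aligned}
P C P^T &= (P S^T P^T)(P T P^T)(P S P^T) \\
&= \begin{pmatrix} S_1^T & S_1^T E \\ S_2^T & -S_2^T E \end{pmatrix}
\begin{pmatrix} T_1 & T_2 \\ T_2^T & T_1 \end{pmatrix}
\begin{pmatrix} S_1 & S_2 \\ E S_1 & -E S_2 \end{pmatrix}.
\end{aligned} \tag{31}$$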

The $(2,1)$ entry in the above matrix equation is

$$(S_2^T T_1 - S_2^T E T_2^T) S_1 + (S_2^T T_2 - S_2^T E T_1) E S_1.$$

Rearranging the terms we have

$$S_2^T (T_1 - E T_1 E) S_1 + S_2^T (T_2 E - E T_2^T) S_1 = 0,$$

since $T_1$ and $T_2$ are Toeplitz matrices. The matrix $P C P^T$ is symmetric, so the $(1,2)$ sub-matrix is also zero. This proves that one can now solve 2 smaller systems of size $n/2$ instead of one large system of size $n$. Having proved that $P C P^T$ is of the form shown in (28), we now show that $M_1$ and $M_2$ have a displacement rank of 2. Applying the permutation matrix $P$ to the displacement equation (27), we have

$$(P L P^T)(P C P^T) - (P C P^T)(P L P^T) = (P S^T P^T)(P G)\, \Sigma\, (G^T P^T)(P S P^T). \tag{32}$$

From (26), we see that $P G$ can be partitioned as:

$$P G = \begin{pmatrix} g_1 & g_3 & E g_2 & E g_4 \\ g_2 & g_4 & E g_1 & E g_3 \end{pmatrix} \tag{33}$$

where $g_1, g_2, g_3$ and $g_4$ are vectors of length $n/2$. From (33) and (29), we have

$$(P S^T P^T)(P G) = \begin{pmatrix}
S_1^T g_1 + S_1^T E g_2 & S_1^T g_3 + S_1^T E g_4 & S_1^T g_1 + S_1^T E g_2 & S_1^T g_3 + S_1^T E g_4 \\
S_2^T g_1 - S_2^T E g_2 & S_2^T g_3 - S_2^T E g_4 & -S_2^T g_1 + S_2^T E g_2 & -S_2^T g_3 + S_2^T E g_4
\end{pmatrix}. \tag{34}$$

From the above equation, it is clear that the generators of $M_1$ and $M_2$ have rank 2. This proves that the displacement rank of $M_1$ and $M_2$ is 2. We now outline the proof for odd $n$. If $n$ is odd, then the permuted trigonometric transform $P S P^T$ has the following property:

$$P S P^T = \begin{pmatrix} S_1 & S_3 \\ S_2 & S_4 \end{pmatrix} \tag{35}$$

where $S_1 = E_1 S_1$, $S_2 = E_2 S_2$, $S_3 = -E_1 S_3$, $S_4 = -E_2 S_4$. $E_1$ and $E_2$ are reflection permutation matrices of size $\lceil n/2 \rceil$ and $\lfloor n/2 \rfloor$ respectively. Now partitioning $P T P^T$ conformally as

$$P T P^T = \begin{pmatrix} T_1 & T_2 \\ T_2^T & T_3 \end{pmatrix}$$

we have

$$\begin{aligned}
P C P^T &= (P S^T P^T)(P T P^T)(P S P^T) \\
&= \begin{pmatrix} S_1^T & S_2^T \\ S_3^T & S_4^T \end{pmatrix}
\begin{pmatrix} T_1 & T_2 \\ T_2^T & T_3 \end{pmatrix}
\begin{pmatrix} S_1 & S_3 \\ S_2 & S_4 \end{pmatrix}.
\end{aligned} \tag{36}$$

Again, the $(2,1)$ sub-matrix in the above matrix equation is

$$S_3^T T_1 S_1 + S_4^T T_2^T S_1 + S_3^T T_2 S_2 + S_4^T T_3 S_2.$$

We can show that each term in the above expression evaluates to zero and hence the $(2,1)$ block is zero. For example, consider the first term: $S_3^T T_1 S_1 = S_3^T T_1 E_1 S_1 = S_3^T E_1 T_1 S_1 = -S_3^T T_1 S_1 = 0$. Similarly, all other terms in the expression evaluate to zero and the $(2,1)$ block is zero. Since the Cauchy-like matrix $P C P^T$ is symmetric, the $(1,2)$ block is also zero. The permutation matrix $P$, therefore, separates the system of equations into two systems about half the size of the original system that can be solved independently. Now we show that the displacement rank of $M_1$ and $M_2$ is 2. If $n$ is odd, then $P G$ can be partitioned as:

$$P G = \begin{pmatrix} g_1 & g_3 & E_1 g_1 & E_1 g_3 \\ g_2 & g_4 & E_2 g_2 & E_2 g_4 \end{pmatrix} \tag{37}$$

where $g_1$ and $g_3$ are vectors of length $\lceil n/2 \rceil$ and $g_2$ and $g_4$ are vectors of length $\lfloor n/2 \rfloor$. From (37) and (35), we have

$$\begin{aligned}
(P S^T P^T)(P G) &= \begin{pmatrix}
S_1^T g_1 + S_2^T g_2 & S_1^T g_3 + S_2^T g_4 & S_1^T E_1 g_1 + S_2^T E_2 g_2 & S_1^T E_1 g_3 + S_2^T E_2 g_4 \\
S_3^T g_1 + S_4^T g_2 & S_3^T g_3 + S_4^T g_4 & S_3^T E_1 g_1 + S_4^T E_2 g_2 & S_3^T E_1 g_3 + S_4^T E_2 g_4
\end{pmatrix} \\
&= \begin{pmatrix}
S_1^T g_1 + S_2^T g_2 & S_1^T g_3 + S_2^T g_4 & S_1^T g_1 + S_2^T g_2 & S_1^T g_3 + S_2^T g_4 \\
S_3^T g_1 + S_4^T g_2 & S_3^T g_3 + S_4^T g_4 & -S_3^T g_1 - S_4^T g_2 & -S_3^T g_3 - S_4^T g_4
\end{pmatrix}
\end{aligned} \tag{38}$$

The above equation shows that the displacement rank of $M_1$ and $M_2$ is 2 because the generators of $M_1$ and $M_2$ have rank 2. In this section we have shown how the odd-even permutation matrix can be used to decouple a Cauchy-like matrix arising from a real symmetric Toeplitz matrix of size $n$ into two Cauchy-like matrices of half the size and half the displacement rank. This yields a substantial savings over the unsymmetric forms of the displacement equation, as we shall see in a later section.
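The decoupling can be observed numerically. The sketch below is a minimal check, assuming NumPy, the DST-I as the transform, and a random real symmetric Toeplitz matrix; the function name `decouple` and the rank tolerance are illustrative.

```python
import numpy as np

def decouple(n, seed=0):
    """Return (largest off-diagonal-block entry, disp. rank of M1, disp. rank of M2)."""
    rng = np.random.default_rng(seed)
    t = rng.standard_normal(n)
    idx = np.arange(n)
    T = t[np.abs(idx[:, None] - idx[None, :])]            # real symmetric Toeplitz
    j = idx + 1
    S = np.sqrt(2 / (n + 1)) * np.sin(np.outer(j, j) * np.pi / (n + 1))  # DST-I
    lam = 2 * np.cos(j * np.pi / (n + 1))                 # eigenvalues of Z_00

    perm = np.r_[0:n:2, 1:n:2]                            # odd-even sort (1-based x1, x3, ..., x2, x4, ...)
    m = (n + 1) // 2                                      # size of the leading block
    Cp = (S.T @ T @ S)[np.ix_(perm, perm)]                # P C P^T
    M1, M2 = Cp[:m, :m], Cp[m:, m:]
    off = np.abs(Cp[:m, m:]).max()

    L1, L2 = np.diag(lam[perm][:m]), np.diag(lam[perm][m:])
    r1 = np.linalg.matrix_rank(L1 @ M1 - M1 @ L1, tol=1e-9)
    r2 = np.linalg.matrix_rank(L2 @ M2 - M2 @ L2, tol=1e-9)
    return off, r1, r2

results = {n: decouple(n) for n in (8, 9)}                # one even and one odd size
print(results)
```

Each half-size block can then be factored independently with half the generator width.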

2.4.1 Converting Hermitian Toeplitz matrices to real Cauchy-like matrices

In section 2.2, a displacement equation for Hermitian Toeplitz matrices was suggested using $Z_1$ as the displacement matrix. The discrete Fourier transform was used to convert the Hermitian Toeplitz matrix to a Hermitian Cauchy-like matrix. In this section, we show how the displacement matrix $Z_{\gamma\gamma}$ may be used, along with the odd-even permutation matrix, to convert a Hermitian Toeplitz matrix into a real, symmetric Cauchy-like matrix. The factorization of the Cauchy-like matrix can be done in real arithmetic and the savings in computation is significant. Consider a Hermitian Toeplitz matrix $T$ of size $n$. The displacement equation of $T$ w.r.t. $Z_{\gamma\gamma}$ can be written as

$$Z_{\gamma\gamma} T - T Z_{\gamma\gamma} = G \Sigma G^* \tag{39}$$

where $\Sigma$ is a skew-symmetric matrix and $G$ has rank 4. In the previous section on real, symmetric Toeplitz matrices, we proved that $P S^T \mathrm{Real}(T)\, S P^T$ is of the form

$$P S^T \mathrm{Real}(T)\, S P^T = \begin{pmatrix} M_1 & 0 \\ 0 & M_2 \end{pmatrix} \tag{40}$$

where $M_1$ and $M_2$ are real Cauchy-like matrices of size $\lceil n/2 \rceil$ and $\lfloor n/2 \rfloor$ respectively. In addition, one can prove, through construction, that the imaginary part of $T$ satisfies the equation

$$P S^T \mathrm{Imag}(T)\, S P^T = \begin{pmatrix} 0 & -M_3^T \\ M_3 & 0 \end{pmatrix}, \tag{41}$$

where $M_3$ is a real Cauchy-like matrix of size $\lceil n/2 \rceil \times \lfloor n/2 \rfloor$. If we define a matrix $D$ to be of the form

$$D = \begin{pmatrix} I_1 & 0 \\ 0 & i I_2 \end{pmatrix}, \tag{42}$$

where $I_1$ and $I_2$ are identity matrices of size $\lceil n/2 \rceil$ and $\lfloor n/2 \rfloor$ respectively and $i = \sqrt{-1}$, then

$$D^* P S^T T S P^T D = \begin{pmatrix} M_1 & M_3^T \\ M_3 & M_2 \end{pmatrix}. \tag{43}$$

The matrix on the right hand side is a real symmetric Cauchy-like matrix. Since the rank of the generator $G$ of the Hermitian Toeplitz matrix was 4, the generator matrix of the corresponding real, symmetric Cauchy-like matrix will also be of rank 4. This section shows how a Hermitian Toeplitz matrix may be converted to a real symmetric Cauchy-like matrix. The factorization may proceed in real arithmetic. The savings in computation resulting from this conversion will be calculated in section 6.
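The conversion of a Hermitian Toeplitz matrix to a real symmetric matrix can be checked with a short sketch. It assumes NumPy, the DST-I as the real transform, and the diagonal scaling with ones on the first block and the imaginary unit on the second block; the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 7
idx = np.arange(n)

# Hermitian Toeplitz matrix: T[i, j] = t[i - j] for i >= j, conj(t[j - i]) otherwise
t = rng.standard_normal(n) + 1j * rng.standard_normal(n)
t[0] = t[0].real                       # real diagonal
diff = idx[:, None] - idx[None, :]
T = np.where(diff >= 0, t[np.abs(diff)], np.conj(t[np.abs(diff)]))

j = idx + 1
S = np.sqrt(2 / (n + 1)) * np.sin(np.outer(j, j) * np.pi / (n + 1))   # DST-I

perm = np.r_[0:n:2, 1:n:2]             # odd-even sort permutation
m = (n + 1) // 2
X = (S.T @ T @ S)[np.ix_(perm, perm)]  # P S^T T S P^T, complex Hermitian

D = np.diag(np.concatenate([np.ones(m), 1j * np.ones(n - m)]))
R = D.conj().T @ X @ D                 # should be real and symmetric

max_imag = np.abs(R.imag).max()
max_asym = np.abs(R.real - R.real.T).max()
print(max_imag, max_asym)
```

The real part of the transformed matrix occupies the diagonal blocks and the imaginary part the off-diagonal blocks, which is why a single diagonal scaling removes all imaginary entries.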

3 Factorization of Cauchy-like matrices with pivoting

In this section we review factorization algorithms for Cauchy-like matrices that allow for various pivoting strategies to be incorporated. We first discuss the algorithm due to Gohberg et al. to factor non-Hermitian Cauchy-like matrices with displacement matrices of the form shown in (5). We then present another algorithm [10] to factor Hermitian Cauchy-like matrices with displacement equations of the form $C - D_l C D_l^* = G \Sigma G^*$. These can then be trivially adapted to suit the kind of Cauchy-like matrix at hand.

3.1 Factoring non-Hermitian Cauchy-like matrices

Consider a complex non-Hermitian Cauchy-like matrix $C$ of size $n$, defined by the displacement equation

$$D_l C - C D_r = G H^*. \tag{44}$$

Here $D_l$ and $D_r$ are diagonal matrices and $G$ and $H$ are matrices of size $n \times \alpha$ with rank equal to $\alpha$. Let us further assume that $D_l$ and $D_r$ do not have any entries on the diagonal that are equal. We relax this restriction later. We also show how partial pivoting may be incorporated into the factorization algorithm. From the above displacement equation, it is clear that any column of $C$ can be obtained by solving the following Sylvester equation:

$$D_l C(:, j) - C(:, j) D_r(j, j) = G H(j, :)^* \tag{45}$$

and the $(i, j)$th element of $C$ can then be computed as

$$C(i, j) = \frac{G(i, :)\, H(j, :)^*}{D_l(i, i) - D_r(j, j)}. \tag{46}$$

This indicates that unless the diagonal elements of $D_l$ and $D_r$ are distinct, one cannot construct all elements of $C$. Specifically, if $D_l(k, k) = D_r(l, l)$, then the element $C(k, l)$ cannot be computed. Such elements would have to be known prior to the start of the factorization. If it so happens that $D_l = D_r$, then the entire diagonal of $C$ would have to be known a priori. We now proceed to describe the LU factorization algorithm with partial pivoting. The first step of the algorithm would be to compute the first column of $C$. This can be done as described in the previous paragraph. Let the permutation matrix that brings the pivot element to the $(1,1)$ position be $P_1$. Applying this permutation to the displacement equation, we get:

$$(P_1 D_l P_1^T)(P_1 C) - (P_1 C) D_r = P_1 G H^* \tag{47}$$

Let us partition the matrix $P_1 C$ as:

$$P_1 C = \begin{pmatrix} d & u \\ l & C_1 \end{pmatrix} \tag{48}$$

Let us define two matrices $X$ and $Y$ as:

$$X = \begin{pmatrix} 1 & 0 \\ l d^{-1} & I \end{pmatrix}, \qquad Y = \begin{pmatrix} 1 & d^{-1} u \\ 0 & I \end{pmatrix} \tag{49}$$

then $P_1 C$ can be factored as:

$$P_1 C = X \begin{pmatrix} d & 0 \\ 0 & C_{sc} \end{pmatrix} Y \tag{50}$$

Further, let $P_1 D_l P_1^T$ and $D_r$ be conformally partitioned as:

$$P_1 D_l P_1^T = \begin{pmatrix} D_{l_1} & 0 \\ 0 & D_{l_2} \end{pmatrix}, \qquad D_r = \begin{pmatrix} D_{r_1} & 0 \\ 0 & D_{r_2} \end{pmatrix} \tag{51}$$

Let us apply the transformation $X^{-1}(\cdot)Y^{-1}$ to (47). Using (50) and (51), we can write the transformed equation as:

$$X^{-1}(P_1 D_l P_1^T) X\, (X^{-1} P_1 C Y^{-1}) - (X^{-1} P_1 C Y^{-1})\, Y D_r Y^{-1} = X^{-1} P_1 G H^* Y^{-1} \tag{52}$$

The above equation can be rewritten after simplification as:

$$\begin{pmatrix} D_{l_1} & 0 \\ D_{l_2} l d^{-1} - l d^{-1} D_{l_1} & D_{l_2} \end{pmatrix}
\begin{pmatrix} d & 0 \\ 0 & C_{sc} \end{pmatrix}
- \begin{pmatrix} d & 0 \\ 0 & C_{sc} \end{pmatrix}
\begin{pmatrix} D_{r_1} & d^{-1} u D_{r_2} - D_{r_1} d^{-1} u \\ 0 & D_{r_2} \end{pmatrix}
= X^{-1} P_1 G H^* Y^{-1} \tag{53}$$

Equating the $(2,2)$ position in the above equation we have:

$$D_{l_2} C_{sc} - C_{sc} D_{r_2} = G_1 H_1^* \tag{54}$$

where $G_1$ is the portion of $X^{-1} P_1 G$ from the second row down and $H_1$ is the portion of $Y^{-*} H$ from the second row down. The first column of $L$ in the LU factorization would be $\begin{pmatrix} 1 \\ l d^{-1} \end{pmatrix}$ and the first row of $U$ would be $\begin{pmatrix} d & u \end{pmatrix}$. This completes one step of the LU factorization algorithm. The process can now be repeated on the displacement equation of the Schur complement of $P_1 C$ w.r.t. $d$ ($C_{sc}$) to get the second column of $L$ and row of $U$. After $n$ steps, one would have the LU factorization of a permuted Cauchy-like matrix. If the displacement matrices $D_l$ and $D_r$ have diagonal entries that are identical, then, as we pointed out earlier, some elements corresponding to these entries would have to be known a priori. In addition, these elements would have to be updated with the transformation $X^{-1}(\cdot)Y^{-1}$ to reflect their values in the Schur complement $C_{sc}$. To avoid this extra step in the algorithm, it is often desirable to have $D_l$ and $D_r$ distinct. In some cases, such as Hermitian Cauchy-like matrices, however, one cannot satisfy this condition because doing so would destroy the symmetry. In this case the extra computation at the end of each step of the algorithm to update the diagonal elements of $C$ is unavoidable if symmetry is to be maintained.

If the permutations at each step are accumulated into the matrix $P$, then we see that the above algorithm produces a factorization of the Toeplitz matrix of the form

$$T = F^* P^T L U F. \tag{55}$$
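The elimination procedure described in this section can be sketched in a few lines. The following is a minimal, unoptimized illustration, assuming NumPy, dense storage of $L$ and $U$, and displacement matrices with no diagonal entries in common; the function name `gko_lu` and its interface are hypothetical and are not taken from [11].

```python
import numpy as np

def gko_lu(dl, dr, G, H):
    """LU with partial pivoting of the Cauchy-like C given by
    diag(dl) C - C diag(dr) = G H^*, working only on the generators.
    Returns (perm, L, U) with C[perm] = L @ U.
    Assumes dl[i] != dr[j] for all i, j and a nonsingular C."""
    dl = np.array(dl, dtype=complex)
    dr = np.asarray(dr, dtype=complex)
    G = np.array(G, dtype=complex)
    H = np.array(H, dtype=complex)
    n = len(dl)
    L = np.zeros((n, n), dtype=complex)
    U = np.zeros((n, n), dtype=complex)
    perm = np.arange(n)
    for k in range(n):
        # Column k of the current Schur complement, from the generators
        col = (G[k:] @ H[k].conj()) / (dl[k:] - dr[k])
        p = k + int(np.argmax(np.abs(col)))
        if p != k:
            # A row swap only touches G, D_l, the permutation and computed rows of L
            for arr in (G, dl, perm, L):
                arr[[k, p]] = arr[[p, k]]
            col[[0, p - k]] = col[[p - k, 0]]
        d = col[0]
        # Row k of the current Schur complement
        row = (G[k] @ H[k:].conj().T) / (dl[k] - dr[k:])
        L[k:, k] = col / d
        U[k, k:] = row
        # Generators of the next Schur complement
        G[k + 1:] -= np.outer(L[k + 1:, k], G[k])
        H[k + 1:] -= np.outer((row[1:] / d).conj(), H[k])
    return perm, L, U

# Usage on a random Cauchy-like matrix with roots of 1 and of -1 as nodes
rng = np.random.default_rng(7)
n, alpha = 7, 2
k = np.arange(n)
dl = np.exp(2j * np.pi * k / n)
dr = np.exp(1j * np.pi * (2 * k + 1) / n)
G = rng.standard_normal((n, alpha)) + 1j * rng.standard_normal((n, alpha))
H = rng.standard_normal((n, alpha)) + 1j * rng.standard_normal((n, alpha))
C = (G @ H.conj().T) / (dl[:, None] - dr[None, :])
perm, L, U = gko_lu(dl, dr, G, H)
print(np.abs(C[perm] - L @ U).max())
```

Each step costs a number of operations proportional to $n$ times the generator width, so the whole factorization costs on the order of $n^2$ operations instead of $n^3$; the dense $C$ is formed here only to verify the result.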

3.2 Factoring Hermitian Cauchy-like matrices

Now consider a Hermitian Cauchy-like matrix $C$ with the following displacement equation:

$$C - D_l C D_l^* = G \Sigma G^* \tag{56}$$

Any column of $C$ can be obtained by solving the following Lyapunov equation:

$$C(:, j) - D_l\, C(:, j)\, D_l(j, j)^* = G \Sigma G(j, :)^* \tag{57}$$

If $D_l(j, j)$ is equal to any eigenvalue of $D_l$, then the corresponding element of $C$ will have to be computed a priori and updated during the course of the algorithm as described earlier. For the moment, let us assume that this is not the case. Further, let us assume that the pivot block is in the right location. Since we have a symmetric Cauchy-like matrix, a pivoting strategy like Bunch-Kaufman would have to be used during the factorization. As a result, the pivot block may be either $1 \times 1$ or $2 \times 2$. Let us partition the matrix $C$ as

$$C = \begin{pmatrix} d & l^* \\ l & C_1 \end{pmatrix}. \tag{58}$$

Let us define the matrix $X$,

$$X = \begin{pmatrix} I & 0 \\ l d^{-1} & I \end{pmatrix} \tag{59}$$

then applying $X^{-1}(\cdot)X^{-*}$ to (56) we obtain

$$\begin{pmatrix} d & 0 \\ 0 & C_{sc} \end{pmatrix}
- \begin{pmatrix} A_{11} & 0 \\ A_{21} & A_{22} \end{pmatrix}
\begin{pmatrix} d & 0 \\ 0 & C_{sc} \end{pmatrix}
\begin{pmatrix} A_{11}^* & A_{21}^* \\ 0 & A_{22}^* \end{pmatrix}
= X^{-1} G \Sigma G^* X^{-*} \tag{60}$$

where $A_{11} = D_{l_1}$, $A_{22} = D_{l_2}$ and $A_{21} = D_{l_2} l d^{-1} - l d^{-1} D_{l_1}$. If the Bunch-Kaufman pivoting strategy results in a $1 \times 1$ pivot, then $D_{l_1}$ is a $1 \times 1$ matrix, otherwise it is of size $2 \times 2$. To proceed with the factorization of the Cauchy-like matrix, we have to obtain a displacement equation of the form (56) for the Schur complement of $C$, i.e. for $C_{sc}$. The displacement equation will have to be of the form

$$C_{sc} - A_{22} C_{sc} A_{22}^* = G_{sc} \Sigma_{sc} G_{sc}^*. \tag{61}$$

Partitioning $G$ conformally as $G^* = [G_1^* \;\; G_2^*]$ and equating the $(2,2)$ position in (60) we have

$$C_{sc} - A_{22} C_{sc} A_{22}^* = (G_2 - l d^{-1} G_1)\, \Sigma\, (G_2 - l d^{-1} G_1)^* + A_{21} d A_{21}^* \tag{62}$$

where $A_{21} = A_{22} l d^{-1} - l d^{-1} A_{11}$. The last term of (62), $A_{21} d A_{21}^*$, can be expanded as

$$\begin{aligned}
A_{21} d A_{21}^* &= (A_{22} l d^{-1} - l d^{-1} A_{11})\, d\, (d^{-1} l^* A_{22}^* - A_{11}^* d^{-1} l^*) \\
&= (A_{22} l A_{11}^* - l d^{-1} A_{11} d A_{11}^*)\, A_{11}^{-*} d^{-1} A_{11}^{-1}\, (A_{11} l^* A_{22}^* - A_{11} d A_{11}^* d^{-1} l^*).
\end{aligned} \tag{63}$$

From the displacement equation (56), we can also write

$$d - A_{11} d A_{11}^* = G_1 \Sigma G_1^* \tag{64}$$

$$l - A_{22} l A_{11}^* = G_2 \Sigma G_1^* \tag{65}$$

Inserting the above equations in (63) yields

$$A_{21} d A_{21}^* = (G_2 - l d^{-1} G_1)\, (\Sigma G_1^* A_{11}^{-*} d^{-1} A_{11}^{-1} G_1 \Sigma)\, (G_2 - l d^{-1} G_1)^* \tag{66}$$

Substituting (66) in (62), $\nabla C_{sc} = C_{sc} - A_{22} C_{sc} A_{22}^*$ has the form

$$\nabla C_{sc} = (G_2 - l d^{-1} G_1)\, (\Sigma + \Sigma G_1^* A_{11}^{-*} d^{-1} A_{11}^{-1} G_1 \Sigma)\, (G_2 - l d^{-1} G_1)^* \tag{67}$$

Using the Sherman-Morrison-Woodbury formula and (64) it can be shown that

$$(\Sigma + \Sigma G_1^* A_{11}^{-*} d^{-1} A_{11}^{-1} G_1 \Sigma) = (\Sigma^{-1} - G_1^* d^{-1} G_1)^{-1} \tag{68}$$

Hence, the update equations for the generator and the signature matrices are

$$G_{sc} = G_2 - l d^{-1} G_1, \qquad \Sigma_{sc}^{-1} = \Sigma^{-1} - G_1^* d^{-1} G_1 \tag{69}$$

At this point, all the elements of $C$ that were computed a priori would have to be updated to reflect their values in the Schur complement $C_{sc}$. Since we now have the same displacement for the Schur complement $C_{sc}$, the factorization algorithm can proceed in the same manner to the next step and eventually to completion.
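The generator and signature updates derived above can be checked on a small example with a $1 \times 1$ pivot. The sketch assumes NumPy, real data, a real diagonal displacement matrix with entries of modulus less than one (so that every entry of the matrix is determined by its generators), and no pivoting; all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
n, alpha = 6, 2
dvec = np.linspace(0.2, 0.8, n)          # real diagonal displacement matrix
Sigma = np.diag([1.0, -1.0])             # signature matrix
G = rng.standard_normal((n, alpha))
G[0] = [2.0, 1.0]                        # keeps the pivot C[0, 0] away from zero

# Symmetric matrix satisfying C - D C D^T = G Sigma G^T
C = (G @ Sigma @ G.T) / (1.0 - np.outer(dvec, dvec))

# One elimination step with the 1x1 pivot d = C[0, 0]
d, l = C[0, 0], C[1:, 0]
C_sc = C[1:, 1:] - np.outer(l, l) / d    # explicit Schur complement

# Updated generator and signature, computed without touching C_sc
G_sc = G[1:] - np.outer(l / d, G[0])
Sigma_sc = np.linalg.inv(np.linalg.inv(Sigma) - np.outer(G[0], G[0]) / d)

D2 = np.diag(dvec[1:])
lhs = C_sc - D2 @ C_sc @ D2
rhs = G_sc @ Sigma_sc @ G_sc.T
print(np.abs(lhs - rhs).max())
```

Note that the signature matrix is generally no longer diagonal after the update, which is why the recurrence is stated for its inverse.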

4 Factoring Hermitian Toeplitz matrices

In this section we present an algorithm to compute a symmetric factorization of a Hermitian Toeplitz matrix by converting it to a Hermitian Cauchy-like matrix. We then compare this method in complexity to the method for factoring non-Hermitian Cauchy-like matrices. In section 2.4.1, we presented an alternate algorithm to factor Hermitian Toeplitz matrices. This was based on the conversion of Hermitian Toeplitz matrices to real, symmetric Cauchy-like matrices. The factorization of the Cauchy-like matrices is then done in real arithmetic. This results in substantial savings in computation. We postpone the discussion of this algorithm to section 6 however, because it is similar to the algorithm to factor real, symmetric Toeplitz matrices. The comparison of the complexity of the two methods can be found in table 3 in section 6. Consider a Hermitian Toeplitz matrix T of size n. The displacement equation of type I for such a matrix and the corresponding Cauchy-like matrix were shown in section 2.2 to be : T ? Z1 T Z1T = H H C ? C = H^ H^ (70) Since the matrix T is Hermitian, the Cauchy-like matrix C is also Hermitian. The displacement matrices and are diagonal and have entries that are complex conjugates of each other. Also, from the de nition of we see that the (j + 1; j + 1)th entry of and the (n ? j + 1; n ? j + 1)th entry of are identical for j = 1; ; n ? 1 : ?j)

$$\Lambda^*(n-j+1, n-j+1) = e^{-\frac{2\pi i (n-j)}{n}} = e^{-\frac{2\pi i n}{n}}\, e^{\frac{2\pi i j}{n}} = e^{\frac{2\pi i j}{n}} = \Lambda(j+1, j+1) \tag{71}$$

In addition, the $(1,1)$ elements of the two displacement matrices are identical. As indicated in section 3, this means that we have to compute the following elements of $C$ a priori: $C(1,1)$ and $C(i,j)$ for $j = 2, \ldots, n$ and $i = n - j + 2$. This set of elements includes some diagonal and other non-diagonal elements. If $n$ is even, then for $i = j = 1$ and $i = j = n/2 + 1$ the elements $C(i,j)$ are diagonal and the rest are non-diagonal. If $n$ is odd, then the only diagonal element is $C(1,1)$. We first present a fast method to compute the non-diagonal elements of $C$ and later indicate how the diagonal elements may be computed.

To compute the non-diagonal elements of $C$ that are needed a priori, we set up a non-Hermitian form of the displacement equation for $T$ and the corresponding Cauchy-like matrix $C$ as:

$$Z_1 T - T Z_1 = G_1 G_2^*, \qquad \Lambda C - C \Lambda = F G_1 G_2^* F^* \tag{72}$$

Since $\Lambda(i,i) \neq \Lambda(j,j)$ for $i \neq j$, any non-diagonal element of $C$ can be easily computed using (46). To show how the diagonal elements of $C$ are computed, we make use of the following theorem.

Theorem 1 For any matrix $A$ of size $n$, if $F$ is the DFT matrix of size $n$ and $\mathcal{C}$ is a circulant matrix that minimizes the Frobenius norm of $(A - \mathcal{C})$, then the diagonal of $F A F^*$ is equal to the eigenvalues of the minimizer $\mathcal{C}$.
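Theorem 1 can be checked numerically on a small example. The sketch below is only an illustration under stated assumptions: it takes $F$ to be the unitary DFT matrix with entries $e^{-2\pi i jk/n}/\sqrt{n}$ (the opposite sign convention would reorder the eigenvalues), and it builds the circulant minimizer by averaging the wrapped diagonals of $A$, which costs $O(n^2)$ and is not the $O(n)$ Toeplitz construction of [7]. The function names are hypothetical.

```python
import numpy as np

def circulant_minimizer_first_column(A):
    """First column of the circulant closest to A in the Frobenius norm.

    A circulant entry depends only on (row - col) mod n, so the minimizer
    is the average of A over each wrapped diagonal.
    """
    A = np.asarray(A, dtype=complex)
    n = A.shape[0]
    rows, cols = np.indices((n, n))
    c = np.zeros(n, dtype=complex)
    np.add.at(c, (rows - cols) % n, A)  # accumulate every wrapped diagonal
    return c / n

def unitary_dft(n):
    """Assumed convention: F[j, k] = exp(-2*pi*i*j*k/n) / sqrt(n)."""
    return np.fft.fft(np.eye(n)) / np.sqrt(n)

def cauchy_diagonal_via_circulant(A):
    """diag(F A F^*), obtained as the DFT of the minimizer's first column."""
    return np.fft.fft(circulant_minimizer_first_column(A))
```

Comparing `cauchy_diagonal_via_circulant(T)` with `np.diag(F @ T @ F.conj().T)` for a random Hermitian Toeplitz `T` should give agreement to rounding error. Only one or two of these entries are needed by the algorithm, so in practice a single inner product per entry suffices.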

Since the Cauchy-like matrix $C$ is defined as $F T F^*$, its diagonal elements can be obtained from the eigenvalues of the circulant $\mathcal{C}$ that minimizes the Frobenius norm of $(T - \mathcal{C})$. Further, it can be easily proved that the eigenvalues of a circulant matrix are obtained from the DFT of the first column of the matrix: $\sqrt{n}\, F\, \mathcal{C}(:,1)$. For Toeplitz matrices, the circulant minimizer $\mathcal{C}$ can be computed in $O(n)$ flops as demonstrated in [7]. Since we only require a few diagonal elements of $C$ (one if $n$ is odd and two if $n$ is even), we can use the matrix-vector product form of the DFT (instead of an FFT) to compute them. This means that the required diagonal entries of $C$ can be computed in $O(n)$ flops.

Having computed the elements of $C$ that are needed a priori, we can now proceed with a symmetric factorization of the matrix with symmetric pivoting. The Bunch-Kaufman algorithm can be used as a symmetric pivoting strategy. We outline the first step of such an algorithm. The Bunch-Kaufman pivoting strategy requires the computation of either one or two columns of $C$. The computation of columns of $C$ was described in the previous section. Let $P_1$ be the permutation that permutes the $1 \times 1$ or $2 \times 2$ pivot block to the proper place. The displacement equation is then written as:

$$(P_1 C P_1^T) - (P_1 \Lambda P_1^T)(P_1 C P_1^T)(P_1 \Lambda^* P_1^T) = P_1 \hat{H} \Sigma \hat{H}^* P_1^T \tag{73}$$

The recurrence relations between the generators of the Cauchy-like matrix and its Schur complement w.r.t. the pivot blocks were given in section 3 by (69). The obvious advantage in the Hermitian case is that only half the computation needs to be performed. However, we have to compute some elements of $C$ a priori because $\Lambda$ and $\Lambda^*$ have common eigenvalues. These elements have to be updated at each step of the factorization to obtain their values in the Schur complement of $C$. The reduction in computation due to the Hermitian property of $T$ is therefore offset, to some extent, by the additional work needed at the beginning to compute some elements of $C$ a priori and at every step to update these elements.

We now determine the complexity of the two algorithms to factor non-Hermitian and Hermitian Toeplitz matrices. In all the calculations we assume that a complex multiplication requires 6 flops, a complex division requires 9 flops (assuming that a real division requires 1 flop) and a complex addition requires 2 flops. We ignore the computation required to set up the displacement equation for Toeplitz matrices, since this can be done in $O(n)$ time. Let us first consider the non-Hermitian case. Transforming the displacement equation of a Toeplitz matrix (11) to that of a Cauchy-like matrix (13) requires $2\alpha$ FFTs of length $n$ (the displacement rank is $\alpha = 2$). The cost of computing these FFTs is $2\alpha K_1 n \log n$ flops. The value of $K_1$ is small if $n$ is a highly composite number.
If $n$ is not so composite (or prime), then the constant $K_1$ can be quite large. Computing each row or column of the factorization using (46) at the $k$th step of the factorization requires

$$6\alpha(n-k) + 2(\alpha-1)(n-k) + 2(n-k) + 9(n-k) = 8\alpha(n-k) + 9(n-k)$$

flops. Since a column and a row have to be computed at each step, the total work at each step to obtain a row and a column of the matrix is $16\alpha(n-k) + 18(n-k)$ flops. Having computed the required row and column of the factorization, we must now update the generators of the $k$th step to obtain the generators of the $(k+1)$th step. The update of each generator requires $8\alpha(n-k)$ flops, for a total of $16\alpha(n-k)$ flops. Thus the total number of flops required to factor the matrix would be

$$\text{Flops} = 2\alpha K_1 n \log(n) + \sum_{k=1}^{n-1} (32\alpha + 18)(n-k) = 2\alpha K_1 n \log(n) + (16\alpha + 9)(n^2 - n) \approx (16\alpha + 9)\, n^2 \tag{74}$$

For Toeplitz matrices with $\alpha = 2$, the algorithm requires approximately $41n^2$ flops.

Now in the Hermitian case, one would use the Bunch-Kaufman pivoting strategy. At each step of the factorization the Bunch-Kaufman algorithm checks either one or two rows of the matrix and selects either a $1 \times 1$ or a $2 \times 2$ pivot. The worst case scenario would be that at each step two rows are checked but a $1 \times 1$ pivot is used. The best case, however, would be if a $2 \times 2$ pivot were used every time 2 rows were checked. We can, therefore, only estimate the complexity of the algorithm in the Hermitian case. Since the matrix is Hermitian, the number of FFTs needed to transform the generators of the Toeplitz-like matrix to those of a Cauchy-like matrix is exactly half of that in the non-Hermitian case. The complexity for this step is $\alpha K_1 n \log(n)$. However, extra work is necessary to compute some elements of $C$ a priori. Of these elements, $n - 1$ ($n - 2$) elements are off-diagonal if $n$ is odd (even). The off-diagonal elements are computed by solving the corresponding Lyapunov equations in (72). The complexity to do this is $2\alpha K_1 n \log(n)$ to compute $F G_1$ and $G_2^* F^*$ and $6\alpha + 2\alpha + 2 + 9$ flops to solve the Lyapunov equation for each element. In addition, there are 1 (2) elements of $C$ that are on the diagonal if $n$ is odd (even). These elements are computed from the DFT of the first column of the circulant minimizer discussed earlier. These elements require $2n$ ($10n$) flops if $n$ is odd (even). Hence, the total complexity in computing the elements of $C$ needed a priori is $2\alpha K_1 n \log(n) + 8\alpha n + O(n)$.

We now compute the complexity of the factorization algorithm. In the worst case, at every step $k$ in the factorization, two rows are computed from the generators and tested for a $1 \times 1$ pivot. The work to compute 2 rows from the generators is $(16\alpha + 18)(n-k)$. The work to update the generators of the $k$th step to those of the $(k+1)$th step is $8\alpha(n-k)$. In addition to this, some extra work is required to update the elements along the diagonal of $C$ that had to be computed a priori. This adds an extra $14(n-k)$ flops at each step. The worst case complexity would, therefore, be

$$\text{Flops} = 3\alpha K_1 n \log(n) + 8\alpha n + O(n) + \sum_{k=1}^{n-1} (24\alpha + 18)(n-k) + \sum_{k=1}^{n-1} 14(n-k)$$
$$= 3\alpha K_1 n \log(n) + 8\alpha n + O(n) + (12\alpha + 9 + 7)(n^2 - n) \approx (12\alpha + 16)\, n^2 \tag{75}$$

For $\alpha = 2$, the total complexity is $40n^2$. This shows that, in the worst case, the complexity of the Hermitian algorithm is essentially the same as that for the non-Hermitian case. However, in the best case scenario, only one row is checked at each step and a $1 \times 1$ pivot block is used. The complexity in this situation would be:

$$\text{Flops} = 3\alpha K_1 n \log(n) + 8\alpha n + O(n) + \sum_{k=1}^{n-1} (16\alpha + 9)(n-k) + \sum_{k=1}^{n-1} 14(n-k)$$
$$= 3\alpha K_1 n \log(n) + 8\alpha n + O(n) + (8\alpha + 11.5)(n^2 - n) \approx (8\alpha + 11.5)\, n^2 \tag{76}$$

For $\alpha = 2$, the complexity is $27.5n^2$. This indicates that the complexity of the factorization algorithm using the Bunch-Kaufman pivoting scheme can vary from $27.5n^2$ to $40n^2$ depending on the pivot sequence obtained. Preserving the Hermitian structure of the factorization reduces the complexity of the factorization algorithm to some extent. Further reduction in complexity can be obtained if the Hermitian Toeplitz matrix is converted to a real, symmetric Cauchy-like matrix. This is discussed in section 6.

If the Toeplitz matrix is real, one would like to preserve this property as well because computation in complex arithmetic is very expensive. In the following sections we present algorithms to factor real Toeplitz matrices that are either symmetric or unsymmetric. We also compare them in complexity to the complex arithmetic cases.

5 Real unsymmetric Toeplitz matrices

In this section we present an algorithm to factor real unsymmetric Toeplitz matrices by converting them to real Cauchy-like matrices. We then compare this method to the algorithms discussed in the previous sections and show how maintaining the realness property leads to significant savings in computation.

Consider a real unsymmetric Toeplitz matrix $T$ of size $n$. Following the notation of section 2.3, we write the displacement equation for $T$ as:

$$Z_{00} T - T Z_{11} = G H^T. \tag{77}$$

The displacement rank of this equation is 4. If we were to use $Z_1$ and $Z_{-1}$ as the displacement matrices, the displacement rank would have been 2. Since $Z_1$ and $Z_{-1}$ are diagonalized by the DFT matrix, a real Toeplitz matrix would be converted to a complex Cauchy-like matrix and all subsequent factorization would have to be done in complex arithmetic. If, however, we use $(Z_{00}, Z_{11})$ as the displacement matrix pair, then real trigonometric transforms can be used to convert a real Toeplitz matrix into a real Cauchy-like matrix. The subsequent factorization is done in real arithmetic. We show that the reduction in complexity due to real arithmetic more than offsets the increased displacement rank. From (19) and (20), we can write:

$$(S_{00} Z_{00} S_{00})(S_{00} T S_{11}) - (S_{00} T S_{11})(S_{11}^T Z_{11} S_{11}) = (S_{00} G)(H^T S_{11})$$
$$D_l C - C D_r = \hat{G} \hat{H}^T \tag{78}$$

where $D_l$ is a diagonal matrix containing the eigenvalues of $Z_{00}$ as defined in (19), $D_r$ is also a diagonal matrix containing the eigenvalues of $Z_{11}$ as defined in (20), $\hat{G} = S_{00} G$ and $\hat{H} = S_{11}^T H$. For all $n$, the eigenvalues of $Z_{00}$ and $Z_{11}$ are distinct and hence one would not have to compute any elements of the real Cauchy-like matrix $C$ a priori. One could easily derive a real arithmetic version of the algorithm described in section 3. If the permutations at every step of the algorithm are accumulated in $P$ and the upper and lower triangular factors are denoted by $U$ and $L$, then we obtain a factorization of $T$ as:

$$T = S_{00} P^T L U S_{11}^T$$
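The conversion in (77) and (78) can be illustrated with a small dense check. The sketch below is not the fast algorithm: it forms all matrices explicitly in $O(n^3)$ operations. It also rests on assumptions, since the definitions from section 2.3 and equations (19) and (20) are not reproduced here: $Z_{00}$ and $Z_{11}$ are taken to be tridiagonal with unit off-diagonals and corner entries 0 and 1 respectively, $S_{00}$ is the orthonormal Sine-I matrix and $S_{11}$ the orthonormal Cosine-II matrix. All function names are hypothetical.

```python
import numpy as np

def z_matrix(n, corner):
    """Assumed form of Z_00 (corner=0) and Z_11 (corner=1): unit
    off-diagonals, `corner` in the (1,1) and (n,n) positions."""
    Z = np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)
    Z[0, 0] = Z[-1, -1] = corner
    return Z

def sine1(n):
    """Orthonormal Sine-I matrix (symmetric); diagonalizes Z_00."""
    j = np.arange(1, n + 1)
    return np.sqrt(2.0 / (n + 1)) * np.sin(np.outer(j, j) * np.pi / (n + 1))

def cosine2(n):
    """Orthonormal Cosine-II matrix; its columns are eigenvectors of Z_11."""
    j = np.arange(1, n + 1)[:, None]
    k = np.arange(n)[None, :]
    S = np.sqrt(2.0 / n) * np.cos((2 * j - 1) * k * np.pi / (2 * n))
    S[:, 0] /= np.sqrt(2.0)
    return S

def toeplitz_to_cauchy_like(T):
    """Return C = S00 T S11 rebuilt entrywise from the displacement,
    together with the diagonals d_l, d_r and the displacement R."""
    n = T.shape[0]
    S00, S11 = sine1(n), cosine2(n)
    d_l = 2.0 * np.cos(np.arange(1, n + 1) * np.pi / (n + 1))  # eig(Z_00)
    d_r = 2.0 * np.cos(np.arange(n) * np.pi / n)               # eig(Z_11)
    R = z_matrix(n, 0.0) @ T - T @ z_matrix(n, 1.0)  # rank <= 4 for Toeplitz T
    GH = S00 @ R @ S11                               # equals G_hat @ H_hat.T
    C = GH / (d_l[:, None] - d_r[None, :])           # d_l[i] != d_r[j] always
    return C, d_l, d_r, R
```

Because no entry of `d_l` coincides with an entry of `d_r`, every element of `C` is recovered from the rank-4 right-hand side, which is the property that removes the need for a priori elements in this case.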

We now compute the complexity of the factorization algorithm for real Toeplitz matrices. Transforming the Toeplitz matrix to a real Cauchy-like matrix involves applying the Sine-I and Cosine-II transforms to the generators. Let the displacement rank be $\alpha$ ($= 4$ for real Toeplitz matrices). The complexity of the transformation would be $2\alpha K_2 n \log(n)$. If $n$ were a power of 2, then $K_2 = 2.5$. Computing a row or column of the factorization at the $k$th step using a real arithmetic version of (46) requires $(2\alpha - 1)(n-k) + 2(n-k)$ flops. Since both a row and a column of the matrix are to be computed at every step, the total flops for this operation is $(4\alpha - 2)(n-k) + 4(n-k)$. Having computed the row and column of the factorization, we must update the generators of the $k$th step to those of the $(k+1)$th step using a real arithmetic version of (53). The complexity of this step is $4\alpha(n-k)$. The total number of flops for the entire factorization algorithm would then be:

$$\text{Flops} = 2\alpha K_2 n \log(n) + \sum_{k=1}^{n-1} (8\alpha + 2)(n-k). \tag{79}$$

The asymptotic complexity would, therefore, be $4\alpha n^2 + n^2$. For real Toeplitz matrices, since $\alpha = 4$, the complexity is $17n^2$. If the complex arithmetic version were used, the complexity would have been $41n^2$ flops. It can therefore be seen that staying in the real domain leads to significant savings in computation.

6 Real, Symmetric Toeplitz matrices

In this section we present an algorithm to factor real, symmetric Toeplitz matrices by converting them to real, symmetric Cauchy-like matrices. In addition, we also present an algorithm that converts a Hermitian Toeplitz matrix to a real, symmetric matrix and proceeds to factor it in real arithmetic. In section 2.4, we showed that a significant reduction in complexity may be obtained if we exploit both realness and symmetry in the Toeplitz matrix simultaneously. It was shown that for a real symmetric Toeplitz matrix $T$ of size $n$, if the symmetric form of the displacement equation is used with a displacement matrix $Z_{00}$ or $Z_{11}$, then the corresponding Cauchy-like matrix $C$ can be decoupled into two Cauchy-like matrices of half the size. Further, it was shown that the two smaller Cauchy-like matrices have a displacement rank of 2. These two smaller Cauchy-like matrices can be factored independently of each other.

Since we use the symmetric form of the displacement equation (27), the diagonal elements of $C$ cannot be obtained by solving the corresponding Lyapunov equation. One has to compute these elements a priori. In the following paragraphs we show how the diagonal elements of $C$ may be computed. We demonstrate this for the displacement matrix $Z_{00}$. The construction for $Z_{11}$ is similar. The diagonal elements of $C_{00} = S_{00} T S_{00}$ can be computed using the following theorem.

Theorem 2 Let $\mathcal{S}$ be a vector space containing all $n \times n$ matrices that can be diagonalized by the Sine-I transform. Then, for any matrix $A$ of size $n$, if we obtain a matrix $S$ in this space that minimizes the Frobenius norm of $(A - S)$, then the diagonal of $S_{00} A S_{00}$ ($S_{00}$ is the Sine-I transform of size $n$) is identical to the eigenvalues of $S$.

In addition, it was proved independently in [1], [3] and [13] that a matrix belongs to the vector space $\mathcal{S}$ if and only if the matrix can be expressed as a special sum of a Toeplitz and a Hankel matrix. This is outlined in the following theorem.

Theorem 3 Any matrix $S$ in $\mathcal{S}$ can be written as $S = X - Y$, where $X$ is a symmetric Toeplitz matrix with first column $x = [\, x_1 \; x_2 \; \cdots \; x_n \,]^T$, and $Y$ is a Hankel matrix with first column $[\, x_3 \; \cdots \; x_n \; 0 \; 0 \,]^T$ and last column $[\, 0 \; 0 \; x_n \; \cdots \; x_3 \,]^T$.

In [5], R. Chan et al. show how the minimizer $S$ may be constructed for any matrix $A$ in $O(n^2)$ flops. If $A$ is Toeplitz, then they show that this computation requires only $O(n \log(n))$ flops. The algorithm proceeds by setting the partial derivatives of $\|A - S\|_F$ w.r.t. $x_1, x_2, \ldots, x_n$ equal to zero. We summarize the lemmas and algorithms that are important to this discussion. An important lemma due to Boman and Koltracht [3] gives a basis for the vector space $\mathcal{S}$.

Lemma 1 Let $Q_i$, $i = 1, \ldots, n$ be $n \times n$ matrices with the $(j,k)$ entry being given by

$$Q_i(j,k) = \begin{cases} 1 & \text{if } |j - k| = i - 1 \\ -1 & \text{if } j + k = i - 1 \\ -1 & \text{if } j + k = 2n - i + 3 \\ 0 & \text{otherwise.} \end{cases}$$

Then $\{Q_i\}_{i=1}^{n}$ is a basis for $\mathcal{S}$.

Let us define a vector $r = [\, r_1 \; r_2 \; \cdots \; r_n \,]$, where

$$r_i = 1_n^T (Q_i \circ A)\, 1_n. \tag{80}$$

Here $1_n$ is a vector of ones of length $n$ and $\circ$ denotes the element-wise product. The following corollary due to Chan et al. [5] gives an explicit formula for the entries of the first column of the minimizer $S$ for any matrix $A$.

Corollary 1 Let $A$ be a symmetric matrix of size $n$ and let $S$ be the minimizer of $\|A - S\|_F$ over all matrices in the vector space $\mathcal{S}$. Let $z$ be the first column of $S$ and $r_i = 1_n^T (Q_i \circ A)\, 1_n$. If $s_o$ and $s_e$ are defined to be the sum of the odd and even entries of the vector $r$, then we have:

$$z_1 = \frac{1}{2(n+1)} (2 r_1 - r_3), \qquad z_i = \frac{1}{2(n+1)} (r_i - r_{i+2}), \quad i = 2, \ldots, n-2,$$

and

$$z_{n-1} = \frac{1}{2(n+1)} (s_o + r_{n-1}), \qquad z_n = \frac{1}{2(n+1)} (2 s_e + r_n)$$

if $n$ is even; and

$$z_{n-1} = \frac{1}{2(n+1)} (s_e + r_{n-1}), \qquad z_n = \frac{1}{2(n+1)} (2 s_o + r_n)$$

if $n$ is odd.

The eigenvalues of the minimizer $S$ can now be calculated from the first column of $S$. Writing $\Lambda = \mathrm{diag}(\lambda)$ for the eigenvalues of $S$, we have

$$S = S_{00} \Lambda S_{00} \;\Rightarrow\; S_{00} S e_1 = \Lambda S_{00} e_1 \;\Rightarrow\; \lambda = D^{-1} S_{00} S e_1, \quad \text{where } D = \mathrm{diag}(S_{00} e_1). \tag{81}$$

For any arbitrary matrix $A$, it is clear that the vector $r$ can be computed in $O(n^2)$ flops and the diagonal of $S_{00} A S_{00}$ in $O(n^2 + n \log(n))$ flops. If, however, $A$ were Toeplitz, then $r$ can be computed in $O(n)$ flops and the diagonal of the Cauchy-like matrix $C_{00} = S_{00} A S_{00}$ can be computed in $O(n \log(n))$ flops. In [5], the authors present the following $O(n)$ algorithm to obtain the vector $r$ given a symmetric Toeplitz matrix $T$ of size $n$ whose first column is $[\, t_1 \; t_2 \; \cdots \; t_n \,]^T$.


Algorithm to compute r (for Sine-I transform):

    r_1 = n t_1
    r_2 = 2(n - 1) t_2
    w_1 = -t_1
    v_1 = -2 t_2
    for k = 2 : floor(n/2)
        r_{2k-1} = 2(n - 2k + 2) t_{2k-1} + 2 w_{k-1}
        w_k = w_{k-1} - 2 t_{2k-1}
        r_{2k} = 2(n - 2k + 1) t_{2k} + 2 v_{k-1}
        v_k = v_{k-1} - 2 t_{2k}
    end
    if n is odd
        r_n = 2 t_n + 2 w_{(n-1)/2}
    end
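The chain from the vector $r$ through Corollary 1 to the eigenvalue formula (81) can be sketched in a few lines. The code below is an illustration under assumptions rather than the authors' implementation: $S_{00}$ is taken to be the orthonormal Sine-I matrix with entries $\sqrt{2/(n+1)}\,\sin(jk\pi/(n+1))$, the transform is applied as a dense matrix-vector product instead of a fast sine transform, $n \geq 3$ is assumed, and the function names are hypothetical.

```python
import numpy as np

def sine1(n):
    """Orthonormal Sine-I matrix (assumed definition of S_00)."""
    j = np.arange(1, n + 1)
    return np.sqrt(2.0 / (n + 1)) * np.sin(np.outer(j, j) * np.pi / (n + 1))

def r_vector(t):
    """O(n) recurrence for r_i = 1^T (Q_i o T) 1, where T is the symmetric
    Toeplitz matrix with first column t (0-based storage of t_1..t_n)."""
    t = np.asarray(t, dtype=float)
    n = len(t)
    r = np.zeros(n)
    r[0] = n * t[0]
    r[1] = 2 * (n - 1) * t[1]
    w, v = -t[0], -2.0 * t[1]
    for k in range(2, n // 2 + 1):
        r[2 * k - 2] = 2 * (n - 2 * k + 2) * t[2 * k - 2] + 2 * w
        w -= 2 * t[2 * k - 2]
        r[2 * k - 1] = 2 * (n - 2 * k + 1) * t[2 * k - 1] + 2 * v
        v -= 2 * t[2 * k - 1]
    if n % 2 == 1:
        r[n - 1] = 2 * t[n - 1] + 2 * w
    return r

def minimizer_first_column(r):
    """First column z of the minimizer, following Corollary 1 (n >= 3)."""
    n = len(r)
    c = 1.0 / (2 * (n + 1))
    s_o, s_e = r[0::2].sum(), r[1::2].sum()  # sums over odd / even indices
    z = np.zeros(n)
    z[0] = c * (2 * r[0] - r[2])
    for i in range(2, n - 1):                # 1-based i = 2, ..., n-2
        z[i - 1] = c * (r[i - 1] - r[i + 1])
    if n % 2 == 0:
        z[n - 2] = c * (s_o + r[n - 2])
        z[n - 1] = c * (2 * s_e + r[n - 1])
    else:
        z[n - 2] = c * (s_e + r[n - 2])
        z[n - 1] = c * (2 * s_o + r[n - 1])
    return z

def cauchy_diagonal(t):
    """Diagonal of C_00 = S_00 T S_00 via equation (81)."""
    S = sine1(len(t))
    return (S @ minimizer_first_column(r_vector(t))) / S[:, 0]
```

For a random first column `t`, the result of `cauchy_diagonal(t)` can be compared against `np.diag(S @ T @ S)` with `T` formed explicitly; replacing the dense product `S @ z` by a fast sine transform would give the $O(n \log(n))$ cost quoted in the text.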

From the above discussion, it can be seen that the total complexity of computing the diagonal elements of the Cauchy-like matrix $C_{00} = S_{00} T S_{00}$ is $O(n \log(n))$. The next step in the factorization of the Cauchy-like matrix $C_{00}$ is the application of the odd-even sort permutation matrix $P$ as shown in (28) in order to expose the sparsity of $C_{00}$ and separate the large system of equations into two independent systems of size $\lceil n/2 \rceil$ and $\lfloor n/2 \rfloor$ respectively. Each sub-system can be solved using the real arithmetic variant of the algorithm to factor non-Hermitian Cauchy-like matrices discussed in section 3. Since the two Cauchy-like matrices are symmetric, we use the Bunch-Kaufman algorithm to search for a pivot.

We now estimate the complexity of the factorization algorithm. Consider a Cauchy-like matrix of size $m$ with the displacement equation $D_l C - C D_l = G \Sigma G^T$ and displacement rank $\alpha$. At the $k$th step of the factorization algorithm, computing a row or column of the matrix requires $(2\alpha + 1)(m-k)$ flops. The worst case scenario is one in which, at each step, 2 rows are computed and checked and only a $1 \times 1$ pivot block is used. The complexity to update the generators for the next step is $2\alpha(m-k)$. In addition, the elements of the Cauchy-like matrix that were computed a priori have to be updated. The complexity to do this at the $k$th step is $3(m-k)$. The total complexity in the worst case scenario is, therefore,

$$\text{Flops} = \sum_{k=1}^{m-1} (6\alpha + 5)(m-k) \approx (3\alpha + 2.5)\, m^2 \tag{82}$$

In the best case scenario, at each step, either 2 rows are computed and checked and a $2 \times 2$ pivot is used, or only 1 row is checked and a $1 \times 1$ pivot is used. The best case complexity is

$$\text{Flops} = \sum_{k=1}^{m-1} (4\alpha + 4)(m-k) \approx (2\alpha + 2)\, m^2 \tag{83}$$

As shown in section 2.4, there are two independent systems, each of size approximately $n/2$ and having a displacement rank of $\alpha = 2$. The total complexity, in the worst case, for factoring real symmetric Toeplitz matrices of size $n$ by converting them to Cauchy-like matrices is

$$\text{Flops} = K_2 (\alpha + 1)\, n \log(n) + O(n) + 4.25 n^2 \tag{84}$$

and in the best case, the complexity is

$$\text{Flops} = K_2 (\alpha + 1)\, n \log(n) + O(n) + 3 n^2 \tag{85}$$

A similar algorithm can be used if we choose to use the displacement matrix $Z_{11}$ instead of $Z_{00}$. The diagonal elements of $C_{11}$ can be computed in a similar manner [6].


6.1 Implementation on the Cray J90 and T90

We now present the results of some implementations of this algorithm on Cray Parallel Vector Processor (PVP) systems, namely the Cray J90 and T90. The J90 is a PVP system with up to 32 processors. Each processor is rated as having a peak performance of 200 MFlops. The T90, on the other hand, is the latest PVP system, with a peak performance of 2 GFlops per processor.

The algorithm presented in this section splits a large real, symmetric Toeplitz system into two real, symmetric Cauchy-like systems of half the size. These systems may be solved independently. For a two processor system, this yields perfect parallelism. At each step of the factorization of the two smaller systems, one has additional parallelism in the following tasks: computing a row of the factorization from the generators, searching for the pivot elements and updating the generators for the next step of the factorization. On Cray PVP systems a list of autotasking directives is provided to the user to exploit parallelism in the program. The limitation of these directives is that once parallelism has been invoked at the higher level by breaking the work into several tasks, each task can only be executed on a single processor. So if we choose to invoke parallelism at the highest level by considering the factorization of the two Cauchy-like matrices to be two concurrent tasks, then we cannot exploit the parallelism at the lower level. Table 1 shows the time in milliseconds to factor a $4095 \times 4095$ real, symmetric Toeplitz matrix using 1 and 2 processors. It can be seen that the speedup for 2 processors is almost 2.

    Number of CPUs    J90    T90
    1                 872    143
    2                 448     74

Table 1: Time in milliseconds to factor a $4095 \times 4095$ real, symmetric Toeplitz matrix while exploiting parallelism at the higher level

Since two levels of parallelism cannot be exploited using the autotasking directives provided on the Cray PVP systems, we could instead ignore the concurrency in the two factorization tasks and exploit the parallelism at the lower level. This would mean that the two Cauchy-like matrices are factored one after the other and the entire set of processors works to factor each Cauchy-like matrix. The effectiveness of this method is limited by the fact that the amount of work decreases linearly at each step of the factorization. A fixed amount of overhead is incurred each time autotasking is invoked to exploit the parallelism in computing a row of the factorization or updating the generators of the next step. Beyond a certain point in the factorization, due to reduced work and constant overhead, it is better to disregard the parallelism and exploit only vectorization. This means that increasing the number of processors beyond a certain point will only yield diminishing returns. For the same problem of factoring a $4095 \times 4095$ real, symmetric Toeplitz matrix, table 2 shows the time in milliseconds on the J90 when autotasking is invoked at each step of the factorization of the two Cauchy-like matrices. For the sake of comparison, the time in milliseconds to factor the matrix using the LAPACK routine SSYTRF has also been tabulated in table 2. From tables 1 and 2, it can be seen that exploiting the higher level parallelism with only 2 CPUs gives the best performance on the Cray PVP machines. This would be true for larger problem sizes as well, because the total amount of work in factoring a real, symmetric Toeplitz matrix using this algorithm is quite small (roughly $3n^2$ to $4n^2$ flops) and the parallelism reduces linearly at each step of the factorization.
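The speedups implied by the timings can be worked out directly. The short sketch below only restates the measurements reported in tables 1 and 2 and divides the single-CPU time by the multi-CPU time; the dictionary and function names are hypothetical.

```python
def speedup(times_ms):
    """Map CPU count -> speedup relative to the single-CPU time."""
    base = times_ms[1]
    return {cpus: base / t for cpus, t in times_ms.items()}

# Timings in milliseconds, copied from tables 1 and 2.
TABLE1_J90 = {1: 872, 2: 448}
TABLE1_T90 = {1: 143, 2: 74}
TABLE2_CAUCHY_J90 = {1: 872, 2: 757, 4: 629, 6: 596, 8: 584}
TABLE2_SSYTRF_J90 = {1: 139149, 2: 74601, 4: 43540, 6: 36659, 8: 32431}

if __name__ == "__main__":
    for name, table in [("J90, two tasks", TABLE1_J90),
                        ("T90, two tasks", TABLE1_T90),
                        ("J90, per-step autotasking", TABLE2_CAUCHY_J90),
                        ("J90, SSYTRF", TABLE2_SSYTRF_J90)]:
        print(name, {p: round(s, 2) for p, s in speedup(table).items()})
```

The two-task split stays close to a factor of 2 on both machines, while per-step autotasking on the J90 remains below 1.5 even with 8 CPUs, which is consistent with the fixed overhead argument above.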

6.2 Factoring Hermitian Toeplitz matrices

In section 2.4.1, we showed how a Hermitian Toeplitz matrix may be converted to a real, symmetric Cauchy-like matrix with a displacement rank of 4. If $T$ is a Hermitian Toeplitz matrix of size $n$, then the displacement equation w.r.t. $Z$ has rank 4:

$$Z T - T Z = G \Sigma G^T \tag{86}$$

    Number of CPUs    Cauchy-based algorithm    SSYTRF (LAPACK)
    1                 872                       139149
    2                 757                        74601
    4                 629                        43540
    6                 596                        36659
    8                 584                        32431

Table 2: Time in milliseconds to factor a $4095 \times 4095$ real, symmetric Toeplitz matrix on a J90 while exploiting parallelism at each step of the factorization

If $\hat{S} = D P S^T$, where $D$ is defined in (42) and $P$ is the odd-even sort permutation matrix, then the Cauchy-like matrix $\hat{S}^T T S$ is real and symmetric with a displacement rank of 4. Since we use the symmetric form of the displacement equation, the diagonal entries of the Cauchy-like matrix have to be computed a priori. Computing the diagonal of $S^T T S$ was discussed earlier in section 6. One point to note is that if $T$ is Hermitian, then the imaginary part of $T$ is skew-symmetric. The imaginary part of $T$, therefore, does not contribute anything to the vector $r$ defined in (80). Hence, the diagonal of $S^T T S$ can be computed using only the real part of $T$. Having computed the diagonal of $S^T T S$, the diagonal of the Cauchy-like matrix $\hat{S}^T T S$ is computed trivially. Computing the diagonal of the Cauchy-like matrix requires $O(n \log(n))$ flops. The factorization of the real, symmetric Cauchy-like matrix then proceeds exactly as discussed earlier in this section using the Bunch-Kaufman pivoting scheme. From (82) and (83), substituting $\alpha = 4$ and $m = n$, we see that the complexity of factoring the real, symmetric Cauchy-like matrix is between $10n^2$ and $14.5n^2$ flops. This is more than 2.5 times less than the cost of the algorithm described in section 4, which converts a Hermitian Toeplitz matrix to a Hermitian Cauchy-like matrix using the DFT. The following table lists the highest order terms in the complexity to factor the various classes of Toeplitz matrices.

    Class of Toeplitz matrix                  Cost (flops)
    Non-Hermitian                             $41n^2$
    Hermitian (Hermitian Cauchy)              $27.5n^2$ to $40n^2$
    Hermitian (Symmetric Cauchy)              $10n^2$ to $14.5n^2$
    Non-Symmetric (real)                      $17n^2$
    Symmetric (real)                          $3n^2$ to $4.25n^2$

Table 3: Comparison of cost to factor Toeplitz matrices by converting them to Cauchy-like matrices
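The entries of table 3 follow from the per-step flop counts in (74), (75), (76), (79), (82) and (83) together with the identity $\sum_{k=1}^{m-1} c\,(m-k) = c\,(m^2 - m)/2$. The sketch below redoes that arithmetic exactly with rational numbers; it reproduces only the leading $n^2$ coefficients and ignores the transform and set-up terms. The function names and dictionary labels are hypothetical.

```python
from fractions import Fraction

def n2_coefficient(per_step, systems=1, size_fraction=Fraction(1)):
    """Leading n^2 coefficient of `systems` independent factorizations of
    size m = size_fraction * n, each costing sum_{k=1}^{m-1} per_step*(m-k)."""
    return systems * Fraction(per_step) / 2 * size_fraction ** 2

def table3_coefficients():
    a = 2  # displacement rank: complex Toeplitz, and each real symmetric half
    b = 4  # displacement rank: real unsymmetric, and Hermitian-to-real case
    half = Fraction(1, 2)
    return {
        "non-Hermitian": n2_coefficient(32 * a + 18),                      # (74)
        "Hermitian, Hermitian Cauchy, best": n2_coefficient(16 * a + 9 + 14),   # (76)
        "Hermitian, Hermitian Cauchy, worst": n2_coefficient(24 * a + 18 + 14), # (75)
        "Hermitian, symmetric Cauchy, best": n2_coefficient(4 * b + 4),    # (83)
        "Hermitian, symmetric Cauchy, worst": n2_coefficient(6 * b + 5),   # (82)
        "real non-symmetric": n2_coefficient(8 * b + 2),                   # (79)
        "real symmetric, best": n2_coefficient(4 * a + 4, 2, half),        # (85)
        "real symmetric, worst": n2_coefficient(6 * a + 5, 2, half),       # (84)
    }
```

Using `Fraction` keeps values such as 27.5 and 4.25 exact, so the coefficients can be compared with the table without rounding.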

7 Least squares problems with real Toeplitz matrices

In this section we show how least squares problems for Toeplitz matrices may be solved using the idea of conversion to Cauchy-like matrices. We present two methods and compare them from a numerical and computational standpoint. The first algorithm is based on the normal equations while the second algorithm uses the augmented system of equations.


7.1 Normal equations method

Consider a real Toeplitz least squares problem where the Toeplitz matrix $T$ has size $m \times n$, $m \geq n$:

$$\min_x \|T x - b\| \tag{87}$$

The normal equations method for solving the least-squares problem involves factoring the matrix $T^T T$ in the equation

$$T^T T x = T^T b. \tag{88}$$

If $T$ is ill-conditioned, then the matrix $T^T T$ would be very ill-conditioned and the generalized Schur algorithm [8] may not produce an accurate factorization of $T^T T$. In such cases, it may be beneficial to convert $T^T T$ to a Cauchy-like matrix, to allow diagonal pivoting to improve the accuracy of the factorization. Since the matrix $T^T T$ is semi-definite, diagonal pivoting would be sufficient. It has been shown that if we choose displacement matrices of the form $Z_1$, then the displacement rank of $T^T T$ is 4 and the generalized Schur algorithm produces a factorization in $8n^2$ flops [8]. If we use the displacement matrices $Z_{00}$ or $Z_{11}$ as in section 6, then the displacement rank of $T^T T$ is 8. Considering $Z_{00}$ to be the displacement matrix, the displacement equation would be:

$$Z_{00} T^T T - T^T T Z_{00} = G \Sigma G^T, \tag{89}$$

where $\Sigma$ is a skew-symmetric matrix. Note that one could use $Z_{11}$ in place of $Z_{00}$ for a similar factorization algorithm. Just as in section 6, one could use the Sine-I transform to diagonalize $Z_{00}$. If we denote the Cauchy-like matrix corresponding to $T^T T$ as $C$, then we have $C = S_{00} T^T T S_{00}$ and:

$$\Lambda_{00} C - C \Lambda_{00} = \hat{G} \Sigma \hat{G}^T \tag{90}$$

where $\hat{G} = S_{00} G$. It can be seen that this equation is identical in form to (27). The only difference is that the displacement rank of this equation is 8, while that of (27) was 4. Since the diagonal elements of $C$ cannot be computed using the Lyapunov equation, we have to compute these elements a priori. In section 6, it was shown that the eigenvalues of the minimizer $S$ among all matrices in the vector space $\mathcal{S}$ of all matrices diagonalizable by the Sine-I transform are the diagonal elements of $C$.
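Two of the claims above are easy to check numerically on a small dense example: the 2-norm condition number of $T^T T$ is the square of that of $T$, and the displacement of $T^T T$ with respect to $Z_{00}$ has rank at most 8. The sketch below is a dense illustration only. It assumes $Z_{00}$ is the tridiagonal matrix with unit off-diagonals and zero diagonal (its definition in section 2.3 is not reproduced here), and the helper names are hypothetical.

```python
import numpy as np

def toeplitz(col, row):
    """Dense m x n Toeplitz matrix with first column `col` and first row
    `row`; the (1,1) entry is taken from `col`."""
    col, row = np.asarray(col, float), np.asarray(row, float)
    m, n = len(col), len(row)
    j, k = np.indices((m, n))
    lower = col[np.clip(j - k, 0, m - 1)]  # used where j >= k
    upper = row[np.clip(k - j, 0, n - 1)]  # used where j < k
    return np.where(j >= k, lower, upper)

def z00(n):
    """Assumed form of Z_00: unit off-diagonals, zero diagonal."""
    return np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)

def normal_equations_report(T, tol=1e-10):
    """Return cond(T), cond(T^T T) and the numerical rank of
    Z_00 (T^T T) - (T^T T) Z_00."""
    A = T.T @ T
    Z = z00(T.shape[1])
    s = np.linalg.svd(Z @ A - A @ Z, compute_uv=False)
    rank = int(np.sum(s > tol * s[0]))
    return np.linalg.cond(T), np.linalg.cond(A), rank
```

The squared condition number is the reason the text recommends pivoting when $T$ is ill-conditioned: any loss of accuracy in factoring $T^T T$ is governed by $\kappa(T)^2$ rather than $\kappa(T)$.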
If we can compute the vector $r$ for the matrix $T^T T$ given by (80), then using corollary 1 we can compute the first column of the minimizer $S$. The eigenvalues of the minimizer $S$ are easily computed using (81). We now show how the vector $r$ may be computed in $O(n^2)$ flops. Consider the displacement equation for $T^T T$ w.r.t. the displacement matrix $Z_1$:

$$\nabla(T^T T) = T^T T - Z_1 T^T T Z_1^T = H \Sigma_1 H^T \tag{91}$$

where $H$ is of rank 4 and $\Sigma_1$ is a symmetric matrix. This displacement equation can be computed in $O((m+n) \log(m+n))$ flops. Now from the above displacement equation it follows that

$$T^T T = \sum_{i=0}^{n} Z_1^i\, \nabla(T^T T)\, (Z_1^T)^i \tag{92}$$

The $k$th diagonal of $\nabla(T^T T)$, denoted by $b_k$, can be computed from the displacement equation (91) in $7(n-k+1)$ flops. An additional $(n-k)$ flops are needed to compute the $k$th diagonal of $T^T T$ from $b_k$ using (92). Let us denote this diagonal by $d_k$. Since there are $n$ distinct diagonals in $T^T T$ (by symmetry), we need approximately $4n^2$ flops to compute all diagonals of $T^T T$. The contribution of each diagonal to the vector $r$ is then calculated using the following algorithm.

Algorithm for computing r from T^T T

    for i = 1 : n
        dsum_i = sum_{j=1}^{n-i+1} d_i(j)
        m = floor((n - i)/2)
        k = n - i + 1
        for j = 1 : m
            d_i(j) = d_i(j) + d_i(k - j + 1)
        end for
        if (i = 1)
            r_1 = dsum_1
            for l = 1 : m
                j = i + 2l
                r_j = r_j - b_i(l)
            end for
        else
            r_i = r_i + 2 dsum_i
            if (i
