SIAM J. MATRIX ANAL. APPL. Vol. 23, No. 4, pp. 929–947
© 2002 Society for Industrial and Applied Mathematics
THE MULTISHIFT QR ALGORITHM. PART I: MAINTAINING WELL-FOCUSED SHIFTS AND LEVEL 3 PERFORMANCE∗

KAREN BRAMAN†, RALPH BYERS†, AND ROY MATHIAS‡

Abstract. This paper presents a small-bulge multishift variation of the multishift QR algorithm that avoids the phenomenon of shift blurring, which retards convergence and limits the number of simultaneous shifts. It replaces the large diagonal bulge in the multishift QR sweep with a chain of many small bulges. The small-bulge multishift QR sweep admits nearly any number of simultaneous shifts—even hundreds—without adverse effects on the convergence rate. With enough simultaneous shifts, the small-bulge multishift QR algorithm takes advantage of the level 3 BLAS, which is a special advantage for computers with advanced architectures.

Key words. QR algorithm, implicit shifts, level 3 BLAS, eigenvalues, eigenvectors

AMS subject classifications. 65F15, 15A18

PII. S0895479801384573
1. Introduction. This paper presents a small-bulge multishift variation of the multishift QR algorithm [4] that avoids the phenomenon of shift blurring, which retards convergence and limits the number of simultaneous shifts that can be used effectively. The small-bulge multishift QR algorithm replaces the large diagonal bulge in the multishift QR sweep with a chain of many small bulges. The small-bulge multishift QR sweep admits nearly any number of simultaneous shifts—even hundreds—without adverse effects on the convergence rate. It takes advantage of the level 3 BLAS by organizing nearly all the arithmetic work into matrix-matrix multiplies. This is particularly efficient on most modern computers and especially efficient on computers with advanced architectures.

The QR algorithm is the most prominent member of a large and growing family of bulge-chasing algorithms [11, 14, 15, 17, 28, 30, 51, 39, 40, 44, 57, 59]. It is remarkable that after thirty-five years, the original QR algorithm [25, 26, 35] with few modifications is still the method of choice for calculating all eigenvalues and (optionally) eigenvectors of small, nonsymmetric matrices. Despite being a dense matrix method, it is arguably still the method of choice for computing all eigenvalues of moderately large, nonsymmetric matrices, i.e., at this writing, matrices of order greater than 1,000. Its excellent rounding error properties and convergence behavior are both theoretically and empirically satisfactory [9, 16, 19, 32, 41, 42, 52, 62, 63, 64] despite surprising convergence failures [8, 10, 19, 56, 58].

∗ Received by the editors February 5, 2001; accepted for publication (in revised form) by D. Boley May 15, 2001; published electronically March 27, 2002. This work was partially supported by the National Computational Science Alliance under award DMS990005N and utilized the NCSA SGI/Cray Origin2000.
http://www.siam.org/journals/simax/23-4/38457.html
† Department of Mathematics, University of Kansas, 405 Snow Hall, Lawrence, KS 66045 (
[email protected],
[email protected]). The research of these authors was partially supported by National Science Foundation awards CCR-9732671, DMS-9628626, CCR-9404425, and MRI-9977352 and by the NSF EPSCoR/K*STAR program through the Center for Advanced Scientific Computing.
‡ Department of Mathematics, College of William & Mary, Williamsburg, VA 23187 (mathias@wm.edu). The research of this author was partially supported by NSF grants DMS-9504795 and DMS-9704534.
It has proven difficult to implement the QR algorithm in a way that takes full advantage of the potentially high execution rate of computers with advanced architectures—particularly hierarchical memory architectures. This is so much the case that some high-performance algorithms work around the slow but reliable QR algorithm by using faster but less reliable methods to tear or split off smaller submatrices on which to apply the QR algorithm [2, 5, 6, 7, 23, 33]. An exception is the successful high-performance pipelined Householder QZ algorithm in [18]. Although this paper is not directly concerned with distributed memory computation, it is worth noting that there are distributed memory implementations of the QR algorithm [31, 45, 48, 50].

Readers of this paper need to be familiar with the double implicit shift QR algorithm [25, 26, 35]. See, for example, any of the textbooks [20, 27, 47, 53, 62]. Familiarity with the multishift QR algorithm [4] as implemented in LAPACK version 2 [1] is helpful. We will refer to the iterative step from bulge-introduction to bulge-disappearance as a QR sweep.

1.1. Notation and the BLAS.

1.1.1. The BLAS. The basic linear algebra subprograms (BLAS) are a set of frequently required elementary matrix and vector operations introduced first in [38, 37] and later extensively developed [21, 22]. The level 3 BLAS are a small set of matrix-matrix operations like matrix-matrix multiply [22]. The level 2 BLAS are a set of matrix-vector operations like matrix-vector multiplication. The level 1 BLAS are a set of vector-vector operations like dot-products and scaled vector addition (x ← ax + y). Because they are relatively simple, have regular patterns of memory access, and have a high ratio of arithmetic work to data, the level 3 BLAS can be organized to make near optimal use of hierarchical cache memory [20, section 2.6], execution pipelining, and parallelism. It is only a small exaggeration to say that executing level 3 BLAS is what modern computers do best. Many manufacturers supply hand-tuned, extraordinarily efficient implementations of the BLAS. Automatically tuned versions of the BLAS [61] also perform well. It is the ability to exploit matrix-matrix multiplies that makes spectral splitting methods attractive competitors to the QR algorithm [2, 5, 6, 7]. The small-bulge multishift QR algorithm which we propose attains much of its efficiency through the level 3 BLAS.

1.1.2. Notation. Throughout this paper we use the following notation and definitions.
1. We will use the "colon notation" to denote submatrices: Hi:j,k:l is the submatrix of matrix H in rows i–j and columns k–l inclusively. The notation H:,k:l indicates the submatrix in columns k–l inclusively (and all rows). The notation Hi:j,: indicates the submatrix in rows i–j inclusively (and all columns).
2. A quasi-triangular matrix is a real, block triangular matrix with 1-by-1 and 2-by-2 blocks along the diagonal.
3. A matrix H ∈ Rn×n is in Hessenberg form if hij = 0 whenever i > j + 1. The matrix H is said to be unreduced if, in addition, the subdiagonal entries are nonzero, i.e., hij ≠ 0 whenever i = j + 1.
4. Following [27, p. 19], we define a "flop" as a single floating point operation, i.e., either a floating point addition or a floating point multiplication together with its associated subscripting. The Fortran statement

    C(I, J) = C(I, J) + A(I, K) ∗ B(K, J)
involves two flops. In some of the examples in section 3, we report an automatic hardware count of the number of floating point instructions executed. Note that a trinary multiply-add instruction counts as just one executed instruction in the hardware count even though it executes two flops. Thus, depending on the compiler and optimization level, the above Fortran statement could be executed using either one floating point instruction or two floating point instructions (perhaps along with some integer subscripting calculations) even though it involves two flops.

2. A small-bulge multishift QR algorithm. If there are many simultaneous shifts, then most of the work in a QR sweep may be organized into matrix-vector and matrix-matrix multiplies that take advantage of the higher level BLAS [4]. However, to date, the performance of the large-bulge multishift QR algorithm has been disappointing [24, 55, 56, 58]. Accurate shifts are essential to accelerate the convergence of the QR algorithm. In the large-bulge multishift QR algorithm rounding errors "blur" the shifts and retard convergence [56, 58]. The ill effects of shift blurring grow rapidly with the size of the diagonal bulge, effectively limiting the large-bulge multishift QR algorithm to roughly 10 simultaneous shifts—too few to take full advantage of the level 3 BLAS. See Figure 1.

Watkins explains much of the mechanics of shift blurring with the next theorem [56, 58].

Theorem 2.1 (see [58]). Consider a Hessenberg-with-a-bulge matrix that occurs during a multishift QR sweep using m pairs of simultaneous shifts. Let B̃ be the 2(m + 1)-by-2(m + 1) principal submatrix containing the bulge. Obtain B from B̃ by dropping the first row and the last column. Let N be a (2m + 1)-by-(2m + 1) Jordan block with eigenvalue zero. The shifts are the 2m finite eigenvalues of the bulge pencil
(2.1)        B − λN.
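Theorem 2.1 can be checked numerically. The following minimal sketch is not from the paper; it assumes NumPy and SciPy are available and uses a pseudorandom Hessenberg matrix. It introduces a single bulge carrying 2m shifts at the top of the matrix and recovers the shifts as the finite eigenvalues of the bulge pencil (2.1):

    import numpy as np
    import scipy.linalg as sla

    rng = np.random.default_rng(0)
    n, m = 50, 2                                  # m pairs of shifts, i.e., 2m shifts
    H = np.triu(rng.standard_normal((n, n)), -1)  # unreduced Hessenberg matrix

    # Shifts: eigenvalues of the trailing 2m-by-2m principal submatrix
    # (closed under complex conjugation).
    shifts = np.linalg.eigvals(H[n - 2*m:, n - 2*m:])

    # First column of prod_j (H - sigma_j I); only its leading 2m+1 entries
    # are nonzero because H is Hessenberg.
    x = np.zeros(n, dtype=complex)
    x[0] = 1.0
    for sigma in shifts:
        x = H @ x - sigma * x
    x = x.real                                    # real because the shifts are paired

    # A Householder reflection sending x to a multiple of e1 introduces the bulge.
    v = x.copy()
    v[0] += np.copysign(np.linalg.norm(x), x[0])
    v /= np.linalg.norm(v)
    P = np.eye(n) - 2.0 * np.outer(v, v)
    H1 = P @ H @ P                                # Hessenberg except for the bulge

    Bt = H1[:2*m + 2, :2*m + 2]                   # 2(m+1)-by-2(m+1) block holding the bulge
    B = Bt[1:, :-1]                               # drop first row and last column
    N = np.diag(np.ones(2*m), 1)                  # (2m+1)-by-(2m+1) nilpotent Jordan block

    w = sla.eigvals(B, N)                         # generalized eigenvalues of B - lambda N
    finite = w[np.argsort(np.abs(w))][:2*m]       # the 2m smallest in magnitude are finite
    print(np.sort_complex(finite))                # ...and, up to roundoff, they
    print(np.sort_complex(shifts))                # reproduce the intended shifts

In exact arithmetic the two printed lists agree; in floating point, how closely the computed pencil eigenvalues match the shifts illustrates how well focused the shifts remain.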
The theorem shows that the shift information is transferred through the matrix by the bulge pencils (2.1). Watkins observes empirically that the eigenvalues of B − λN tend to be ill-conditioned and that the ill-conditioning grows rapidly with the size of the bulge. The observation suggests that in order to avoid shift blurring, the bulges must be confined to very small order. This is borne out by many successful small-bulge versions of the QR algorithm including the one described in this paper. See, for example, [17, 24, 31, 29, 34, 36, 54, 55].

In order to apply complex conjugate shifts in real arithmetic, one must allow bulges big enough to transmit at least two shifts. The smallest such bulges occupy full 4-by-4 principal submatrices. We sometimes refer to 4-by-4 bulges as "double implicit shift bulges" because they are used in the double implicit shift QR algorithm [25, 26], [27, Algorithm 7.5.1]. This paper uses double implicit shift bulges exclusively.

The QR algorithm needs to use many simultaneous shifts in order to generate substantial level 3 BLAS operations, but, in order to avoid shift blurring, it is restricted to small bulges. Small bulges transmit only a few simultaneous shifts. One way to resolve these superficially contradictory demands is to use a chain of m tightly packed two-shift bulges as illustrated in Figure 2. This configuration allows many simultaneous shifts while keeping each pair "well focused" inside a small bulge. The process of chasing the chain of bulges along the diagonal is illustrated in Figure 3. Near the diagonal, "3-by-3" Householder reflections are needed to chase the bulges one at a time along the diagonal. The reflections are applied individually in the light shaded region near the diagonal in a sequence of level 1 BLAS operations.
Fig. 1. Effect of varying numbers of simultaneous shifts on the large-bulge multishift QR algorithm as implemented in DHSEQR from LAPACK version 2 and on our experimental implementation of the small-bulge multishift QR algorithm, TTQR. The two programs calculated the Schur decompositions (including both the orthogonal and quasi-triangular factors) of 1,000-by-1,000, 2,000-by-2,000, and 3,000-by-3,000 pseudorandom Hessenberg matrices. (The pseudorandom test matrices, computational environment, and other details are described in section 3.) The first row plots number of simultaneous shifts versus cpu-minutes. The second row plots number of simultaneous shifts versus a hardware count of floating point instructions executed. (A trinary multiply-add instruction counts as one instruction but executes two flops.) The first, second, and third columns report data from matrices of order n = 1,000, n = 2,000, and n = 3,000, respectively.
Periodically, a group of reflections is accumulated into a thickly banded orthogonal matrix. The orthogonal matrix is then used to update the dark shaded region in Figure 3 using matrix-matrix multiplication or other level 3 BLAS operations. We refer to this as a "two-tone" QR sweep. The level 1 BLAS operations form one tone; the level 3 operations form the other.

Using the implicit Q theorem [27, Theorem 7.4.3], it is easy to show that a two-tone small-bulge multishift QR sweep with m pairs of simultaneous shifts, (sj, tj) ∈ C × C, j = 1, 2, 3, . . . , m, is equivalent to m Francis double implicit shift QR sweeps, i.e., a two-tone QR sweep overwrites H by QT HQ where

    ∏_{j=1}^{m} (H − sjI)(H − tjI) = QR

is a QR (orthogonal-triangular) factorization. An examination of the detailed description below shows that, in exact arithmetic, the two-tone QR sweep generates the same bulges and the same sequence of similarity transformations by "3-by-3" Householder reflections as m double implicit shift QR sweeps using the same shifts.
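This equivalence is easy to verify numerically for small matrices. The following sketch is not from the paper; it assumes NumPy is available. It forms p(H) = ∏(H − sjI)(H − tjI) explicitly, QR-factors it, and checks that QT HQ is again Hessenberg, which is the matrix the bulge-chasing sweep produces implicitly without ever forming the ill-conditioned product p(H):

    import numpy as np

    rng = np.random.default_rng(1)
    n = 8
    H = np.triu(rng.standard_normal((n, n)), -1)   # small Hessenberg example

    # m = 2 pairs of shifts: eigenvalues of the trailing 4-by-4 submatrix,
    # closed under complex conjugation.
    shifts = np.linalg.eigvals(H[-4:, -4:])

    # Explicit version of the sweep: form p(H), then QR-factor it.
    M = np.eye(n, dtype=complex)
    for sigma in shifts:
        M = (H - sigma * np.eye(n)) @ M
    Q, R = np.linalg.qr(M.real)                    # conjugate pairing makes p(H) real

    H_next = Q.T @ H @ Q
    # H_next is Hessenberg up to roundoff; the implicit sweep computes the
    # same matrix by chasing bulges instead of forming p(H).
    print(np.max(np.abs(np.tril(H_next, -2))))     # ~ 1e-13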
Fig. 2. A chain of m = 3 tightly packed double implicit shift bulges in rows r = 5 through s = r + 3m = 14.
Modified QR algorithms that chase more than one bulge at a time have been proposed several times [17, 24, 31, 29, 34, 36, 54, 55]. In particular, Henry, Watkins, and Dongarra [31] recently developed a distributed memory parallel QR algorithm that chases many small bulges independently. Lang [36] recently described symmetric QL and QR algorithms that use rotations to simultaneously chase many small bulges. His algorithms also achieve some parallelism and good cache reuse by accumulating the rotations in groups. The algorithm proposed in the next section uses Householder reflections and applies to nonsymmetric matrices.

2.1. Detailed description. This section describes the two-tone small-bulge multishift QR sweep. Let H ∈ Rn×n be an unreduced Hessenberg matrix and let (sj, tj) ∈ C × C, j = 1, 2, 3, . . . , m, be a set of m pairs of shifts. To avoid complex arithmetic, we require each pair to be closed under complex conjugation, i.e., either s̄j = tj or (sj, tj) ∈ R × R.

There are three phases in a two-tone QR algorithm. In the first phase, a tightly packed chain of 4-by-4 double implicit shift bulges is introduced into the upper left-hand corner of a Hessenberg matrix. In the second phase, the chain of bulges is chased down the diagonal from the upper left-hand corner to the lower right-hand corner by a sequence of reflections. In the final phase, the matrix is returned to Hessenberg form by chasing the chain off the bottom row.

2.1.1. Phase 2: The central computation. The bulk of the work is in the second phase. Suppose that H ∈ Rn×n is an upper Hessenberg matrix with a packet of m double implicit shift bulges in rows r through s = r + 3m. See Figure 3. Using a process called "bulge chasing" that has been known at least since the double implicit shift algorithm [35, 25, 26, 27] was introduced in 1961, one may use a sequence of similarity transformations by "3-by-3" Householder reflections to chase the chain of bulges down k rows, one bulge at a time, starting with the lowest bulge.
Fig. 3. Chasing a packet of m 4-by-4 double implicit shift bulges from rows r through s = r + 3m down to rows r + k through s + k.
If Ũ ∈ Rn×n is the product of this sequence of Householder reflections, then Ũ has the form

(2.2)        Ũ = ⎡ I  0  0 ⎤
                 ⎢ 0  U  0 ⎥ ,
                 ⎣ 0  0  I ⎦

where the diagonal blocks have orders r, 3m + k − 1, and n − (s + k) + 1, respectively, and where U = Ũr+1:s+k−1,r+1:s+k−1 is itself an orthogonal matrix. For notational convenience, define Û ∈ R(3m+k+1)×(3m+k+1) by

             Û = ⎡ 1  0  0 ⎤
                 ⎢ 0  U  0 ⎥ ,
                 ⎣ 0  0  1 ⎦

i.e., the r = 1, n = s + k case of (2.2).

In terms of U and Û, the multibulge chase mentioned above consists of two parts. The first part is a similarity transformation on the lightly shaded region in Figure 3, i.e.,

(2.3)        Hr:s+k,r:s+k ← ÛT Hr:s+k,r:s+k Û.

The second part is the matrix multiplication of the dark, horizontal slab by ÛT from the left, i.e.,

(2.4)        Hr:s+k,s+k+1:n ← ÛT Hr:s+k,s+k+1:n,

and the multiplication of the dark, vertical slab by Û from the right, i.e.,

(2.5)        H1:r−1,r:s+k ← H1:r−1,r:s+k Û.

Fig. 4. Left: introducing m bulges into row 1 through row s = 1 + 3m. Right: chasing m bulges off the bottom from row r = n − (1 + 3m) through row n.
The similarity transformation (2.3) has the effect of moving the packet of bulges from rows r through s down to rows r + k through s + k. Using a sequence of "3-by-3" reflections, this small similarity transformation does not take advantage of the level 3 BLAS, but if m ≪ n and k ≪ n, then the work of updating the lightly shaded region in Figure 3 is small compared to the work of updating the two dark slabs. Updating the dark horizontal and vertical slabs in Figure 3 is a matrix multiplication by the orthogonal matrix U in (2.2). In order to make this into a level 3 BLAS matrix-matrix multiplication, it is necessary to form explicitly the orthogonal submatrix U from the "3-by-3" reflections used to chase the bulges. Accumulating U does not take advantage of the level 3 BLAS either, but this is also a small amount of arithmetic work compared to updating the dark slabs. Using a sequence of reflections to update the lightly shaded region in Figure 3 but using matrix-matrix multiplication to update the dark bands results in a high ratio of level 3 BLAS work to level 1 BLAS work [13].

2.1.2. Phases 1 and 3: Getting started and finishing up. The two-tone structure of the algorithm is slightly simpler during the initial bulge-introduction phase and during the final chasing-bulges-off-the-bottom phase than during most of the computation. See Figure 4. A modified version of the Francis double implicit shift algorithm may be used to introduce the m bulges into the upper left-hand corner of the Hessenberg matrix. One at a time, bulges are introduced and chased down to their proper position in the chain. This is equivalent to doing 3m(m − 1)/2 similarity transformations by "3-by-3" reflections on H1:3m+1,1:3m+1, the lightly shaded region on the left of Figure 4. The matrix U has zero structure similar to the (1, 1) block of Figure 5.

Chasing the packet of m bulges off the bottom at the end of the QR sweep is similar. It is equivalent to doing a similarity transformation by a sequence of Householder reflections followed by an orthogonal matrix multiplication on the dark vertical slab on the right of Figure 4. This operation also requires roughly the same number of flops as does the operation of introducing a packet of m bulges.
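To make the update pattern (2.3)–(2.5) of the central phase concrete, here is a minimal NumPy sketch, not from the paper. It uses a random orthogonal matrix as a stand-in for the accumulated product U of the bulge-chasing reflections, and it checks that the small window similarity plus the two slab multiplications reproduce the full similarity by Ũ = diag(I, U, I). Indices in the code are 0-based, while the text's formulas are 1-based:

    import numpy as np

    rng = np.random.default_rng(2)
    n, m, k = 300, 4, 10
    r = 40                     # 0-based; the chain occupies rows r..s = r + 3m
    s = r + 3 * m

    H = np.triu(rng.standard_normal((n, n)), -1)   # stand-in for the matrix being swept

    # Stand-in for the accumulated orthogonal factor U of order 3m + k - 1
    # (in the algorithm it is the product of the "3-by-3" chasing reflections).
    U, _ = np.linalg.qr(rng.standard_normal((3*m + k - 1, 3*m + k - 1)))
    Uhat = np.eye(3*m + k + 1)
    Uhat[1:-1, 1:-1] = U                           # Uhat = diag(1, U, 1)

    H2 = H.copy()
    # (2.3): small similarity transformation on the window (lightly shaded region).
    H2[r:s+k+1, r:s+k+1] = Uhat.T @ H2[r:s+k+1, r:s+k+1] @ Uhat
    # (2.4): horizontal slab times U^T from the left, one large GEMM.
    H2[r+1:s+k, s+k+1:] = U.T @ H2[r+1:s+k, s+k+1:]
    # (2.5): vertical slab times U from the right, another large GEMM.
    H2[:r, r+1:s+k] = H2[:r, r+1:s+k] @ U

    # Together, the three updates equal the full similarity by diag(I, U, I).
    Ufull = np.eye(n)
    Ufull[r+1:s+k, r+1:s+k] = U
    print(np.max(np.abs(H2 - Ufull.T @ H @ Ufull)))   # ~ 1e-15

The Hessenberg zero structure is what makes the restriction to the two slabs exact: the blocks to the lower left of the window that the full similarity would touch are identically zero.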
Fig. 5. The zero structure of the orthogonal submatrix U in (2.2) in the special case m = 8, k = 22.
2.1.3. Level 3 BLAS and exploiting the zero structure of U. The orthogonal matrix U has more structure than a generic, dense orthogonal matrix. The amount of arithmetic work needed to update the dark slabs can, theoretically, be substantially reduced by taking advantage of the zero structure of U. The example in Figure 5 is 38% zeros. However, this adds some complexity to the computation which offsets the reduced flop count. Depending on the computing environment, taking advantage of some or all of the zeros may add more work than it saves. Nevertheless, it is likely to be worth taking advantage of at least some of the zeros either by treating U as a thick band matrix or as a 2-by-2 block matrix.

As illustrated in Figure 5, U has a thick band structure with lower bandwidth kl = 2 min(m, k) and upper bandwidth ku = k + 1. The band is fairly dense, so, theoretically, the thick, banded matrix-matrix multiply is a level 3 operation. However, except for triangular matrix multiplication, banded matrix-matrix multiplication software is usually implemented as a level 2 (matrix-vector) operation. In order to take advantage of both the band structure and the current set of level 3 BLAS, matrix multiplication by U may be broken up into two dense-by-dense matrix multiplications and two triangular-by-dense matrix multiplications, as sketched below. If 3m − 2 ≥ k ≥ m + 2, then one way to do this is to partition U into a 2-by-2 block matrix with two nearly dense, rectangular diagonal blocks and two triangular off-diagonal blocks. See Figure 5. In addition, at least one of the triangular blocks has a thick band of zeros along the diagonal, so the work of multiplying by this triangular matrix can be further reduced.
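The following NumPy sketch, not from the paper, illustrates the blockwise product for a stand-in matrix with the assumed structure: rectangular dense diagonal blocks and square triangular off-diagonal blocks (the triangular orientation is assumed here). NumPy's @ operator does not itself skip the zeros; in a Fortran implementation the two off-diagonal products would be level 3 DTRMM calls and the two diagonal products DGEMM calls:

    import numpy as np

    rng = np.random.default_rng(3)
    p, q1 = 40, 24               # order of U; rows split at q1, columns at q2 = p - q1
    q2 = p - q1                  # this split makes both off-diagonal blocks square

    # Stand-in for U: rectangular dense diagonal blocks, a triangular (1,2)
    # block, and a triangular (2,1) block (compare Figure 5).
    U = rng.standard_normal((p, p))
    U[:q1, q2:] = np.triu(U[:q1, q2:])
    U[q1:, :q2] = np.tril(U[q1:, :q2])

    X = rng.standard_normal((p, 2000))   # horizontal slab, to be overwritten by U^T X

    # Blockwise product: two dense products (DGEMM in Fortran) plus two
    # products by triangular blocks (DTRMM in Fortran).
    Y = np.empty_like(X)
    Y[:q2] = U[:q1, :q2].T @ X[:q1] + U[q1:, :q2].T @ X[q1:]
    Y[q2:] = U[:q1, q2:].T @ X[:q1] + U[q1:, q2:].T @ X[q1:]

    print(np.max(np.abs(Y - U.T @ X)))   # agrees with the full product up to roundoff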
Table 1
The ratio of the number of flops performed by a two-tone small-bulge multishift QR sweep with m pairs of shifts chased k rows at a time to 10mn², the number of flops needed for m double implicit shift QR sweeps. It is assumed that m is small relative to the order n of the Hessenberg matrix.

    Logical U structure   Ratio for k = 3m   Approximately optimal k   Ratio for optimal k
    Banded                1.57               k = 2.8m                  1.57
    2-by-2 block          1.63               k ≈ 2.54m                 1.62
    Dense                 2.40               k = 3m                    2.40
Table 2
Flops used by an n-by-n two-tone multibulge QR sweep chasing m bulges, k rows at a time, assuming k ≪ n and m ≪ n. (These estimates do not include the work for accumulating the orthogonal matrix of Schur vectors Q. Accumulating Q approximately doubles the flop count.)

    Logical U structure                          k                       Flops + O(m²n)
    2-by-2 block                                 k = 3m − 2              ((49m² − 39m + 12)/(3m − 2)) n²
    (m > k or 3m − 2 ≥ k ≥ m + 1)                k = pm, (3 + p)m even   (((2p² + 6p + 13)m² − pm + 2)/(pm)) n²
                                                 k = pm, (3 + p)m odd    (((2p² + 6p + 13)m² + (p − 4)m + 3)/(pm)) n²
    Dense                                        k = 3m                  (2(6m − 1)²/(3m)) n²
                                                 k = pm                  (2((p + 3)m − 1)²/(pm)) n²
For a fixed choice of the number of simultaneous shifts 2m, k ≈ 3m approximately minimizes the number of flops in an n-by-n two-tone multibulge QR sweep regardless of the logical zero structure of U. See Table 1. With k ≈ 3m, the algorithm uses between 1.6 and 2.4 times as many flops as m double implicit shift sweeps, depending on how much of the zero structure of U is exploited.

The version of the large-bulge multishift QR algorithm which most closely resembles the two-tone small-bulge multishift QR algorithm aggregates groups of p transformations in order to take advantage of the level 3 BLAS. A 2m shift sweep of the aggregated large-bulge multishift QR algorithm needs roughly 1.5 + p/(8m) times as many flops as m double implicit shift sweeps [4]. (The version of the large-bulge multishift QR algorithm implemented in DHSEQR from LAPACK version 2.0 [1] does not aggregate groups of transformations and takes advantage of only the level 1 and 2 BLAS. It needs roughly the same number of flops to chase a single 2m shift bulge as m double implicit shift sweeps [4, section 6].)

Detailed algorithms and mathematical flop counts are given in [13] and are summarized here in Tables 1 and 2. In section 3, numerical examples demonstrate that the avoidance of shift blurring by using small bulges and the ability to use level 3 BLAS by using many simultaneous shifts per QR sweep more than overcome the greater flop counts per sweep displayed in Tables 1 and 2.

2.2. Deflation and choice of shifts. It is inexpensive to monitor the subdiagonal entries during the course of a two-tone QR sweep and take advantage of a small-subdiagonal deflation should one occur. Note that this includes taking advantage of any small subdiagonal entries that may appear between the bulges during a two-tone QR sweep. In [55], this is called "vigilant deflation."

Ordinarily, the bulges above a vigilant deflation collapse as they encounter the newly created subdiagonal zero. This may block those shifts so that, for the current QR sweep, the benefit of using many simultaneous shifts may be lost. However, the bulges can be reintroduced in the row in which the new zero subdiagonal appears
using the same methods as are used to introduce bulges at the upper left-hand corner [27, p. 377], [62, p. 530]. In this way, the shift information passes through a zero subdiagonal and the two-tone QR sweep continues with all its shifts.

It is well known that a good choice of shifts in the QR algorithm leads to rapid convergence. Shifts selected to be the eigenvalues of a trailing principal submatrix give local quadratic convergence [62, 60]. In the symmetric case, convergence is cubic [64]. Deferred shifts retard convergence [49]. Following the choice of LAPACK version 2 subroutine DHSEQR, our experimental computer programs select the shifts to be the eigenvalues of a trailing principal submatrix.

3. Numerical examples. We compared the execution time of the large- and small-bulge QR algorithms on ad hoc and pseudorandom Hessenberg matrices of order n = 1,000 to n = 10,000 and on nonrandom matrices of similar order taken from a variety of applications in science and engineering [3]. (The two algorithms delivered roughly the same accuracy as measured by the relative residual ‖AQ̃ − Q̃T̃‖F /‖A‖ and the departure from orthogonality ‖Q̃T Q̃ − I‖F /√n, where Q̃T̃Q̃T is the computed real Schur decomposition of A with computed orthogonal factor Q̃ and computed quasi-triangular factor T̃.)

We call our experimental Fortran implementation of the two-tone small-bulge multishift QR algorithm TTQR. When using m pairs of simultaneous shifts, each TTQR sweep chases each tightly packed chain of m small bulges k = 3m − 2 rows at a time. The implementation exploits the logical 2-by-2 block structure of U including the thick band of zeros along the diagonal of the (1,2) block in Figure 5. Except where noted otherwise, TTQR uses 150 simultaneous shifts, but no particular significance should be attached to this choice. TTQR uses LAPACK subroutine DHSEQR [1], the conventional large-bulge QR algorithm, to reduce diagonal subblocks of order no greater than 1.5 times the number of simultaneous shifts. Following EISPACK [46] and LAPACK [1], a Hessenberg subdiagonal entry hi+1,i is set to zero when |hi+1,i| ≤ ε(|hii| + |hi+1,i+1|) with ε equal to the unit round. (A new deflation procedure which greatly reduces the arithmetic work and execution time that TTQR needs is reported in [12, 13].)

For comparison, we used the implementation of the large-bulge multishift QR algorithm in subroutine DHSEQR from LAPACK version 2 [1], which is widely recognized to be an excellent implementation. DHSEQR uses the double implicit shift QR algorithm to reduce subblocks of size MAXB-by-MAXB or smaller. In the examples reported here, MAXB is chosen to be the greater of 50 and the number of simultaneous shifts. Except where noted otherwise, DHSEQR uses only six simultaneous shifts, which Figure 1 shows to be approximately optimal.

The examples reported here were run on an Origin2000 computer equipped with 400MHz IP27 R12000 processors and 16 gigabytes of memory. Each processor has 32 kilobytes of level 1 instruction cache, 32 kilobytes of level 1 data cache, and 8 megabytes of level 2 combined instruction and data cache. For serial execution, the experimental Fortran implementation of the small-bulge multishift QR algorithm was compiled with version 7.30 of the MIPSpro Fortran 77 compiler with options -64 -TARG:platform=ip27 -Ofast=ip27 -LNO. For comparison purposes the same options were used to compile DHSEQR from LAPACK version 2.
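A minimal NumPy sketch, not from the paper, of the two checks just described: the EISPACK/LAPACK-style small-subdiagonal deflation test and the accuracy measures used to compare the two algorithms (the function names are ours, and the Frobenius norm is assumed for ‖A‖):

    import numpy as np

    def deflate_small_subdiagonals(H):
        """Set H[i+1, i] to zero when |H[i+1, i]| <= eps * (|H[i, i]| + |H[i+1, i+1]|),
        with eps the unit roundoff, as in the deflation criterion quoted above."""
        eps = np.finfo(H.dtype).eps
        n = H.shape[0]
        for i in range(n - 1):
            if abs(H[i + 1, i]) <= eps * (abs(H[i, i]) + abs(H[i + 1, i + 1])):
                H[i + 1, i] = 0.0
        return H

    def schur_accuracy(A, Q, T):
        """Relative residual and departure from orthogonality of a computed
        real Schur decomposition A ~ Q T Q^T (Frobenius norm of A assumed)."""
        n = A.shape[0]
        residual = np.linalg.norm(A @ Q - Q @ T, 'fro') / np.linalg.norm(A, 'fro')
        orthogonality = np.linalg.norm(Q.T @ Q - np.eye(n), 'fro') / np.sqrt(n)
        return residual, orthogonality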
(We observe that this compilation of DHSEQR is usually slightly faster than the compilation distributed in the SGI/Cray Scientific Library version 1.2.0.0.) For parallel execution the -mp and -pfa options were added. The programs called optimized BLAS subroutines from the SGI/Cray Scientific Library version 1.2.0.0. In our computational environment, we observed that the measured serial “cpu time” of any particular program with
its particular data might vary by at most a few percent. We were fortunate to get exclusive use of several processors for the purpose of timing parallel benchmark runs. Except where otherwise mentioned, n-by-n matrices were stored in n-by-n arrays.

We ran the experimental program on both pseudorandomly generated and nonrandom non-Hermitian matrices of order n = 1,000 to n = 10,000. We selected the entries on the diagonal and upper triangle of pseudorandomly generated Hessenberg matrices to be normally distributed pseudorandom numbers with mean zero and variance 1. We set the subdiagonal entry hj+1,j = √χ²n−j, where χ²n−j is selected from a chi-squared distribution with n − j degrees of freedom. Hessenberg matrices with the same distribution of random entries can also be generated by applying the reduction to Hessenberg form algorithm [27, Algorithm 7.4.2] to full n-by-n matrices with pseudorandom entries selected from the normal distribution with mean zero and variance one.

Our source of nonrandom matrices was the Non-Hermitian Eigenvalue Problems (NEP) collection [3]. This is a collection of eigenvalue problems from a variety of applications in science and engineering. We selected 21 real eigenvalue problems of order 1,000-by-1,000 to 8,192-by-8,192.

In some of the examples reported below, we report an automatic hardware count of the number of floating point instructions executed. Note that a trinary multiply-add instruction counts as just one executed instruction in the hardware count even though it executes two flops. We also report the floating point execution rate in millions of floating point instructions per second, or "mega-flops" for short.

For comparison purposes, we measured the floating point execution rate of the level 3 BLAS full matrix-matrix multiply subroutine DGEMM and the triangular matrix-matrix multiply subroutine DTRMM from the SGI/Cray Scientific Library version 1.2.0.0 applied to matrix-matrix products similar to updating the dark bands in Figure 3. In serial execution, in the Origin2000 computational environment described above, DGEMM computes the product of the transpose of a 200-by-200 matrix times a 200-by-10,000 slab embedded in a 10,000-by-10,000 array at roughly 330 mega-flops. It computes the product of a 10,000-by-200 slab times a 200-by-200 matrix at roughly 325 mega-flops. DTRMM computes the product of the transpose of a triangular 200-by-200 matrix times a 200-by-10,000 slab at roughly 305 mega-flops if it is an upper triangular matrix and at roughly 260 mega-flops if it is a lower triangular matrix. DTRMM computes the product of a 10,000-by-200 slab times a 200-by-200 triangular matrix at roughly 275 mega-flops if it is an upper triangular matrix and at roughly 255 mega-flops if it is a lower triangular matrix.
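Returning to the pseudorandom test matrices described above, a minimal NumPy sketch, not from the paper, that generates them (the helper name is ours):

    import numpy as np

    def random_hessenberg(n, seed=None):
        """Pseudorandom Hessenberg test matrix of section 3: N(0, 1) entries on
        and above the diagonal, and subdiagonal entries sqrt(chi-squared) with
        n-1, n-2, ..., 1 degrees of freedom, matching the distribution obtained
        by reducing a full N(0, 1) matrix to Hessenberg form."""
        rng = np.random.default_rng(seed)
        H = np.triu(rng.standard_normal((n, n)))
        df = np.arange(n - 1, 0, -1)           # degrees of freedom n-1, ..., 1
        H[np.arange(1, n), np.arange(n - 1)] = np.sqrt(rng.chisquare(df))
        return H

    H = random_hessenberg(1000, seed=0)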
Example 1. Figure 1 displays the serial execution time of DHSEQR and TTQR using various numbers of simultaneous shifts to calculate the Schur decomposition of the pseudorandom Hessenberg matrices described above. Both the quasi-triangular and orthogonal Schur factors were computed. The figure demonstrates that the convergence rate of TTQR does not suffer from shift blurring even with hundreds of simultaneous shifts. The figure also displays the total number of floating point instructions executed by DHSEQR and TTQR to calculate the Schur decompositions of pseudorandom Hessenberg matrices. It also demonstrates that the number of floating point operations needed by TTQR changes little as the number of simultaneous shifts varies in a range that is small compared to n, the order of the Hessenberg matrix. The number of simultaneous shifts used by TTQR may be chosen to fit the cache size and other machine dependent parameters with little effect on the total amount of arithmetic work. However, in this example, TTQR executes between 1.5 and 2 times as many floating point instructions as DHSEQR with 6 simultaneous shifts.
Fig. 6. Floating point execution rate of DHSEQR, the large-bulge multishift QR algorithm, and TTQR, the small-bulge multishift QR algorithm, using varying numbers of simultaneous shifts to calculate the Schur decompositions of the pseudorandom Hessenberg matrices described in section 3.
A partial explanation for this is that TTQR executes more floating point instructions per shift than DHSEQR. See Table 1. However, in other examples TTQR executes fewer floating point instructions than DHSEQR. See Figures 7 and 10.

Figure 6 displays the floating point execution rates achieved on the Origin2000 computer described above. It demonstrates that TTQR attains substantially higher rates of execution of floating point instructions when using many simultaneous shifts. On the Origin2000 computer described above, we observe that, on pseudorandom Hessenberg matrices of order roughly n > 500, TTQR with forty simultaneous shifts calculates the Schur decomposition of pseudorandom Hessenberg matrices faster than DHSEQR with six simultaneous shifts. (For matrices of order n = 100, TTQR with 40 simultaneous shifts uses 170% of the execution time of DHSEQR with six simultaneous shifts.)

Example 2. Figure 7 shows the execution time, floating point execution rate, and number of floating point instructions executed to calculate the Schur decompositions of ad hoc Hessenberg matrices of the form of
(3.1)        S6 = ⎡ 6      5      4      3      2      1 ⎤
                  ⎢ 0.001  1      0      0      0      0 ⎥
                  ⎢ 0      0.001  2      0      0      0 ⎥
                  ⎢ 0      0      0.001  3      0      0 ⎥ .
                  ⎢ 0      0      0      0.001  4      0 ⎥
                  ⎣ 0      0      0      0      0.001  5 ⎦
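The matrix S6 is the n = 6 member of the family Sn used below. A short NumPy sketch, not from the paper (the helper name is ours), constructs the family:

    import numpy as np

    def ad_hoc_hessenberg(n):
        """The ad hoc test matrix of (3.1): first row n, n-1, ..., 1;
        diagonal entries 1, ..., n-1 in rows 2..n; constant 0.001 subdiagonal."""
        S = np.zeros((n, n))
        S[0, :] = np.arange(n, 0, -1)
        idx = np.arange(1, n)
        S[idx, idx] = idx
        S[idx, idx - 1] = 0.001
        return S

    print(ad_hoc_hessenberg(6))   # reproduces S6 above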
Fig. 7. Execution time, floating point execution rate, and hardware count of floating point instructions executed to calculate the Schur decomposition (including both orthogonal and quasi-triangular factors) of Hessenberg matrices of the form of (3.1).
The eigenvalues of trailing principal submatrices of Sn are extraordinarily close approximations to eigenvalues of the whole matrix Sn [13, 12], so, at least initially, the shifts used by both DHSEQR and TTQR are exceptionally good. Both DHSEQR and TTQR calculate the Schur decompositions of Sn faster than the Schur decomposition of a pseudorandom Hessenberg matrix of the same size. See Figure 8. However, TTQR finishes substantially sooner, maintains a higher floating point execution rate, and executes fewer floating point instructions.

Example 3. Figure 8 compares the serial execution time and floating point instruction execution rate of TTQR and DHSEQR applied to pseudorandom Hessenberg matrices of order n = 1,000 to n = 10,000. In this example, TTQR consistently executes more floating point instructions than DHSEQR but overcomes this handicap by maintaining a high floating point execution rate. Note that in other examples TTQR executes fewer floating point instructions than DHSEQR. See Figures 7 and 10. The advantage of TTQR over DHSEQR remains qualitatively similar even if only eigenvalues are computed or if only one of the orthogonal or quasi-triangular Schur factors is computed.

Example 4. Figure 9 displays the serial execution times of DHSEQR and TTQR applied to 20 real non-Hermitian eigenvalue problems from the NEP collection [3].
Fig. 8. Serial execution times, floating point execution rates, and hardware counts of floating point instructions of DHSEQR, the large-bulge multishift QR algorithm, and TTQR, the small-bulge multishift QR algorithm. The two subroutines calculated Schur decompositions (including both the orthogonal and quasi-triangular factors) of pseudorandom Hessenberg matrices as described in section 3.
To avoid cache conflicts, each n-by-n matrix was stored in an (n + 7)-by-(n + 7) array. Each matrix is identified by an acronym from [3]. The alphabetic part indicates the application from which the matrix comes. The numerical part gives the order of the matrix. Summaries of the applications, descriptions of the matrices, and references can be found in [3]. DHSEQR uses six simultaneous shifts, which Figure 1 shows to be approximately optimal. TTQR uses 60 simultaneous shifts on 1,000-by-1,000 to 1,999-by-1,999 matrices, 116 simultaneous shifts on 2,000-by-2,000 to 2,499-by-2,499 matrices, 150 simultaneous shifts on 2,500-by-2,500 to 3,999-by-3,999 matrices, and 180 simultaneous shifts on 4,000-by-4,000 or larger matrices.

Notice that the superiority of TTQR is usually greater for matrices of larger order. We also mention that DW8192 is not reported in Figure 9 only because the execution times are out of scale with the other reported times. DHSEQR calculated the Schur decomposition of the Hessenberg matrix derived from DW8192 in 544 cpu minutes. TTQR calculated the Schur decomposition in 301 cpu minutes.

Figure 10 displays hardware counts of the number of floating point instructions executed by DHSEQR and TTQR. Note that neither TTQR nor DHSEQR consistently executes more floating point instructions.
Fig. 9. Serial execution times of DHSEQR, the large-bulge multishift QR algorithm, and TTQR, the small-bulge multishift QR algorithm, applied to 20 non-Hermitian eigenvalue problems from [3]. The acronyms indicate the corresponding matrix from [3]. DHSEQR uses six simultaneous shifts, which Figure 1 shows to be approximately optimal. TTQR uses 60 simultaneous shifts on 1,000-by-1,000 to 1,999-by-1,999 matrices, 116 simultaneous shifts on 2,000-by-2,000 to 2,499-by-2,499 matrices, 150 simultaneous shifts on 2,500-by-2,500 to 3,999-by-3,999 matrices, and 180 simultaneous shifts on 4,000-by-4,000 or larger matrices. (The execution time of the reduction to Hessenberg form is not included in the above graphs.)
Example 5. Our experimental implementation of the small-bulge QR algorithm is not designed for parallel computation. However, TTQR makes heavy use of the level 3 BLAS—particularly matrix-matrix multiply. Hence, it is not surprising to observe modest but not insignificant speedups when the experimental version of TTQR is compiled for parallel execution and linked with parallel versions of the BLAS. Figure 11 shows wall clock execution time, parallel speedup, and parallel efficiency of TTQR calculating the Schur decomposition (including both the quasi-triangular and orthogonal factors) of pseudorandom Hessenberg matrices on the Origin2000 computer described above. To avoid cache conflicts, some n-by-n matrices were stored in (n + 1)-by-(n + 1) arrays. (Parallel speedup is the ratio T1/Tp, where T1 is the 1 processor wall clock execution time and Tp is the p-processor wall clock execution time. Parallel efficiency is T1/(pTp).)

A major serial bottleneck can be attributed to computations in the lightly shaded region of Figure 3. There is potential for some small-grain parallel execution even in this region, but there is not enough parallel work to overcome the overhead associated with synchronizing the processors. A planned production version of TTQR is expected
Fig. 10. Hardware count of floating point instructions executed by DHSEQR, the large-bulge multishift QR algorithm, and TTQR, the small-bulge multishift QR algorithm, when applied to 20 non-Hermitian eigenvalue problems from [3]. The acronyms indicate the corresponding matrix from [3]. DHSEQR uses six simultaneous shifts, which Figure 1 shows to be approximately optimal. TTQR uses 60 simultaneous shifts on 1,000-by-1,000 to 1,999-by-1,999 matrices, 116 simultaneous shifts on 2,000-by-2,000 to 2,499-by-2,499 matrices, 150 simultaneous shifts on 2,500-by-2,500 to 3,999-by-3,999 matrices, and 180 simultaneous shifts on 4,000-by-4,000 or larger matrices. (Floating point instructions executed during the reduction to Hessenberg form are not included.)
to exhibit better parallel speedup by overlapping computation in the lightly shaded region with computations in the dark bands.

4. Conclusions. The small-bulge two-tone multishift QR algorithm avoids shift blurring so completely that it admits a nearly unlimited number of simultaneous shifts—even hundreds—without adverse effects on the convergence rate. With enough simultaneous shifts, level 3 BLAS operations dominate, allowing both high serial floating point execution rates and at least modest parallel speedup.

The possibility of using hundreds of simultaneous shifts makes the choice of how many to use more complicated than for the conventional large-bulge multishift QR algorithm. We observe empirically that the amount of arithmetic work is insensitive to this choice as long as the number of shifts is small compared to the order of the matrix. Hence, the number of shifts may be chosen to fit the cache size and other characteristics of a particular computational environment without compromising the amount of arithmetic work.
Fig. 11. Parallel execution time, speedup, and efficiency of TTQR, the small-bulge multishift QR algorithm, on pseudorandom Hessenberg matrices. The pseudorandom test matrices, computational environment, and program parameters are described in section 3. TTQR used 150 simultaneous shifts. Parallel speedup is the ratio T1 /Tp , where T1 is the 1 processor wall clock execution time and Tp is the p-processor wall clock execution time. Parallel efficiency is T1 /(pTp ).
Acknowledgment. The authors would like to thank David Watkins for helpful discussions and for bringing [55] to their attention.

REFERENCES

[1] E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, S. Ostrouchov, and D. Sorensen, LAPACK Users' Guide, SIAM, Philadelphia, 1992.
[2] L. Auslander and A. Tsao, On parallelizable eigensolvers, Adv. in Appl. Math., 13 (1992), pp. 253–261.
[3] Z. Bai, D. Day, J. Demmel, and J. Dongarra, A test matrix collection for non-Hermitian eigenvalue problems, Department of Mathematics, University of Kentucky, Lexington, KY. Also available online from http://math.nist.gov/MatrixMarket.
[4] Z. Bai and J. Demmel, On a block implementation of Hessenberg QR iteration, Intl. J. of High Speed Comput., 1 (1989), pp. 97–112. Also available online as LAPACK Working Note 8 from http://www.netlib.org/lapack/lawns/lawn08.ps and http://www.netlib.org/lapack/lawnspdf/lawn08.pdf.
[5] Z. Bai and J. Demmel, Design of a parallel nonsymmetric eigenroutine toolbox, part I, in Proceedings of the Sixth SIAM Conference on Parallel Processing for Scientific Computing, Vol. 1, SIAM, Philadelphia, 1993, pp. 391–398.
[6] Z. Bai and J. Demmel, Design of a Parallel Nonsymmetric Eigenroutine Toolbox, Part II, Tech. Report 95-11, Department of Mathematics, University of California, Berkeley, 1995.
[7] Z. Bai, J. Demmel, and M. Gu, An inverse free parallel spectral divide and conquer algorithm for nonsymmetric eigenproblems, Numer. Math., 76 (1997), pp. 279–308.
[8] S. Batterson, Convergence of the shifted QR algorithm on 3 × 3 normal matrices, Numer. Math., 58 (1990), pp. 341–352.
[9] S. Batterson, Convergence of the Francis shifted QR algorithm on normal matrices, Linear Algebra Appl., 207 (1994), pp. 181–195.
[10] S. Batterson and D. Day, Linear convergence in the shifted QR algorithm, Math. Comp., 59 (1992), pp. 141–151.
[11] A. Bojanczyk, G. H. Golub, and P. Van Dooren, The periodic Schur decomposition; algorithms and applications, Proc. SPIE 1770, Bellingham, WA, 1992, pp. 31–42.
[12] K. Braman, R. Byers, and R. Mathias, The multishift QR algorithm. Part II: Aggressive early deflation, SIAM J. Matrix Anal. Appl., 23 (2002), pp. 948–973.
[13] K. Braman, R. Byers, and R. Mathias, The Multi-Shift QR-Algorithm: Aggressive Deflation, Maintaining Well Focused Shifts, and Level 3 Performance, Tech. Report 99-05-01, Department of Mathematics, University of Kansas, Lawrence, KS, 1999. Also available online from http://www.math.ukans.edu/~reports/1999.html.
[14] A. Bunse-Gerstner, R. Byers, and V. Mehrmann, A quaternion QR algorithm, Numer. Math., 55 (1989), pp. 83–95.
[15] A. Bunse-Gerstner, R. Byers, and V. Mehrmann, A chart of numerical methods for structured eigenvalue problems, SIAM J. Matrix Anal. Appl., 13 (1992), pp. 419–453.
[16] R. R. Burnside and P. B. Guest, A simple proof of the transposed QR algorithm, SIAM Rev., 38 (1996), pp. 306–308.
[17] R. Byers, A Hamiltonian QR algorithm, SIAM J. Sci. Statist. Comput., 7 (1986), pp. 212–229.
[18] K. Dackland and B. Kågström, Blocked algorithms and software for reduction of a regular matrix pair to generalized Schur form, ACM Trans. Math. Software, 25 (1999), pp. 425–454.
[19] D. Day, How the Shifted QR Algorithm Fails to Converge and How to Fix It, Tech. Report 96-0913, Sandia National Laboratories, Albuquerque, NM, 1996.
[20] J. Demmel, Applied Numerical Linear Algebra, SIAM, Philadelphia, 1997.
[21] J. J. Dongarra, J. Du Croz, S. Hammarling, and R. J. Hanson, An extended set of Fortran basic linear algebra subprograms, ACM Trans. Math. Software, 14 (1988), pp. 1–17.
[22] J. J. Dongarra, J. Du Croz, S. Hammarling, and I. S. Duff, A set of level 3 basic linear algebra subprograms, ACM Trans. Math. Software, 16 (1990), pp. 1–17.
[23] J. J. Dongarra and M. Sidani, A parallel algorithm for the nonsymmetric eigenvalue problem, SIAM J. Sci. Comput., 14 (1993), pp. 542–569.
[24] A. Dubrulle, The Multi-Shift QR Algorithm: Is It Worth the Trouble?, Palo Alto Scientific Center Report G320-3558x, IBM Corp., Palo Alto, CA, 1991.
[25] J. G. F. Francis, The QR transformation: A unitary analogue to the LR transformation. I, Comput. J., 4 (1961/1962), pp. 265–271.
[26] J. G. F. Francis, The QR transformation. II, Comput. J., 4 (1961/1962), pp. 332–345.
[27] G. Golub and C. Van Loan, Matrix Computations, 2nd ed., The Johns Hopkins University Press, Baltimore, MD, 1989.
[28] G. H. Golub and C. Reinsch, Singular value decomposition and least squares solutions, Numer. Math., 14 (1970), pp. 403–420.
[29] D. E. Heller and I. C. F. Ipsen, Systolic networks for orthogonal decompositions, SIAM J. Sci. Statist. Comput., 4 (1983), pp. 261–269.
[30] J. J. Hench and A. J. Laub, Numerical solution of the discrete-time periodic Riccati equation, IEEE Trans. Automat. Control, 39 (1994), pp. 1197–1210.
[31] G. Henry, D. S. Watkins, and J. Dongarra, A Parallel Implementation of the Nonsymmetric QR Algorithm for Distributed Memory Architectures, Tech. Report CS-97-352, Department of Computer Science, University of Tennessee, 1997. Also available online as LAPACK Working Note 121 from http://www.netlib.org/lapack/lawns/lawn121.ps and http://www.netlib.org/lapack/lawnspdf/lawn121.pdf.
[32] W. Hoffmann and B. N. Parlett, A new proof of global convergence for the tridiagonal QL algorithm, SIAM J. Numer. Anal., 15 (1978), pp. 929–937.
[33] E. R. Jessup, A case against a divide and conquer approach to the nonsymmetric eigenvalue problem, Appl. Numer. Math., 12 (1993), pp. 403–420.
[34] L. Kaufman, A parallel QR algorithm for the symmetric tridiagonal eigenvalue problem, J. Parallel and Distrib. Comput., 3 (1994), pp. 429–434.
[35] V. N. Kublanovskaya, On some algorithms for the solution of the complete eigenvalue problem, U.S.S.R. Comput. Math. and Math. Phys., 3 (1961), pp. 637–657.
[36] B. Lang, Using level 3 BLAS in rotation-based algorithms, SIAM J. Sci. Comput., 19 (1998), pp. 626–634.
[37] C. L. Lawson, R. J. Hanson, D. R. Kincaid, and F. T. Krogh, Algorithm 539: Basic linear algebra subprograms for Fortran usage, ACM Trans. Math. Software, 5 (1979), pp. 324–325.
[38] C. L. Lawson, R. J. Hanson, D. R. Kincaid, and F. T. Krogh, Basic linear algebra subprograms for Fortran usage, ACM Trans. Math. Software, 5 (1979), pp. 308–323.
[39] R. Mathias and G. W. Stewart, A block QR algorithm and the singular value decomposition, Linear Algebra Appl., 182 (1993), pp. 91–100.
[40] C. B. Moler and G. W. Stewart, An algorithm for generalized matrix eigenvalue problems, SIAM J. Numer. Anal., 10 (1973), pp. 241–256.
[41] B. N. Parlett, Global convergence of the basic QR algorithm on Hessenberg matrices, Math. Comp., 22 (1968), pp. 803–817.
[42] B. N. Parlett and W. G. Poole, Jr., A geometric theory for the QR, LU, and power iterations, SIAM J. Numer. Anal., 10 (1973), pp. 389–412.
[43] R. V. Patel, A. J. Laub, and P. Van Dooren, eds., Numerical Linear Algebra Techniques for Systems and Control, IEEE Press, New York, 1994.
[44] H. Rutishauser, Solution of eigenvalue problems with the LR transformation, in Nat. Bur. Standards Appl. Math. Ser. 49, 1958, pp. 47–81.
[45] T. Schreiber, P. Otto, and F. Hofmann, A new efficient parallelization strategy for the QR algorithm, Parallel Comput., 20 (1994), pp. 63–75.
[46] B. T. Smith, J. M. Boyle, J. J. Dongarra, B. S. Garbow, Y. Ikebe, V. C. Klema, and C. B. Moler, Matrix Eigensystem Routines—EISPACK Guide, Lecture Notes in Comput. Sci. 6, Springer-Verlag, New York, 1976.
[47] G. W. Stewart, Introduction to Matrix Computations, Academic Press, New York, 1973.
[48] R. A. van de Geijn, Storage schemes for parallel eigenvalue algorithms, in Numerical Linear Algebra Digital Signal Processing and Parallel Algorithms, G. Golub and P. Van Dooren, eds., Springer-Verlag, Berlin, 1988, pp. 639–648.
[49] R. A. van de Geijn, Deferred shifting schemes for parallel QR methods, SIAM J. Matrix Anal. Appl., 14 (1993), pp. 180–194.
[50] R. A. van de Geijn and D. G. Hudson, An efficient parallel implementation of the nonsymmetric QR algorithm, in Proceedings of the Fourth Conference on Hypercube Concurrent Computers and Applications, Monterey, CA, 1989, pp. 697–700.
[51] C. F. Van Loan, A general matrix eigenvalue algorithm, SIAM J. Numer. Anal., 12 (1975), pp. 819–834.
[52] D. S. Watkins, Understanding the QR algorithm, SIAM Rev., 24 (1982), pp. 427–440.
[53] D. S. Watkins, Fundamentals of Matrix Computations, John Wiley, New York, 1991.
[54] D. S. Watkins, Bidirectional chasing algorithms for the eigenvalue problem, SIAM J. Matrix Anal. Appl., 14 (1993), pp. 166–179.
[55] D. S. Watkins, Shifting strategies for the parallel QR algorithm, SIAM J. Sci. Comput., 15 (1994), pp. 953–958.
[56] D. S. Watkins, Forward stability and transmission of shifts in the QR algorithm, SIAM J. Matrix Anal. Appl., 16 (1995), pp. 469–487.
[57] D. S. Watkins, QR-like algorithms—an overview of convergence theory and practice, in The Mathematics of Numerical Analysis, Lectures in Appl. Math. 32, J. Renegar, M. Shub, and S. Smale, eds., AMS, Providence, RI, 1996, pp. 879–893.
[58] D. S. Watkins, The transmission of shifts and shift blurring in the QR algorithm, Linear Algebra Appl., 241/243 (1996), pp. 877–896.
[59] D. S. Watkins and L. Elsner, Chasing algorithms for the eigenvalue problem, SIAM J. Matrix Anal. Appl., 12 (1991), pp. 374–384.
[60] D. S. Watkins and L. Elsner, Convergence of algorithms of decomposition type for the eigenvalue problem, Linear Algebra Appl., 143 (1991), pp. 19–47.
[61] R. C. Whaley and J. J. Dongarra, Automatically tuned linear algebra software, Tech. report, Mathematical Sciences Section, Oak Ridge National Laboratory, Oak Ridge, TN, 1999. Also available online from www.netlib.org/utk/projects/atlas.
[62] J. H. Wilkinson, The Algebraic Eigenvalue Problem, Clarendon Press, Oxford, UK, 1965.
[63] J. H. Wilkinson, Convergence of the LR, QR, and related algorithms, Comput. J., 8 (1965), pp. 77–84.
[64] J. H. Wilkinson, Global convergence of tridiagonal QR algorithm with origin shifts, Linear Algebra Appl., 1 (1968), pp. 409–420.
[65] J. H. Wilkinson and C. Reinsch, eds., Linear Algebra, Handbook for Automatic Computation, Vol. 2, Springer-Verlag, New York, 1971.