Virtual Transmission Method, A New Distributed Algorithm to Solve Sparse Linear System
Fei Wei
Huazhong Yang
Department of Electronic Engineering, Tsinghua University, Beijing, China
Technical Report
Preprint posted at arXiv.org
Abstract. In this paper, we propose a new parallel algorithm which works naturally on parallel computers with an arbitrary number of processors. This algorithm is named the Virtual Transmission Method (VTM). Its physical background is the lossless transmission line and the microwave network. The basic idea of VTM is to insert virtual transmission lines into the linear system to achieve distributed computing. VTM is proved to converge when solving SPD linear systems. A preconditioning method and a performance model are presented. Numerical experiments show that VTM is efficient, accurate and stable. Accompanying VTM, we introduce a new technique to partition a symmetric linear system, named Electric Vertex Splitting (EVS). It is based on Kirchhoff's Current Law from circuit theory. We prove that EVS can partition any SPD linear system.
Key words: Distributed Algorithm, Sparse Linear System, Partitioning, Convergence, Performance Modeling, Preconditioning, Transmission Line, Interconnect, Wire
New words: Virtual Transmission Method (VTM), Electric Vertex Splitting (EVS), Transmission Delay Equations (TDE), Virtual Transmission Line (VTL), Electric Graph, Neighbor-To-Neighbor (N2N), Conformal Splitting Existence Theory, Impedance Matching, Lines Coupling, Wire Tearing.
1. Introduction
The linear system Ax = b is widely encountered in scientific computing. When the coefficient matrix A is symmetric positive definite (SPD), the system is called an SPD system, which is extremely common in engineering applications [1, 2]. For example, most of the linear systems generated by the finite element method are SPD. Therefore, in many scientific disciplines, solving SPD systems is an unavoidable task, and its efficiency is often the dominant cost.
To solve an SPD system, there are two basic approaches: direct methods and iterative methods. The direct methods are mainly based on sparse Cholesky factorization; to compute the dense submatrices inside the sparse matrix efficiently, the supernodal and multifrontal methods are used [3]. The representatives of the iterative methods are the Conjugate Gradient method (CG) and the Multigrid method (MG). CG is based on Krylov subspace projection; if the preconditioner is properly chosen, its convergence is fast. MG is efficient for linear systems generated from elliptic partial differential equations [4].
All the algorithms mentioned above work well on traditional single-processor computers, but they run into trouble on parallel computers [5, 6]. The parallel version of sparse Cholesky factorization suffers from limited concurrency, which depends on the distribution of the nonzero elements in the sparse matrix. For parallel CG, it is difficult to choose a proper preconditioner in a parallel way [4].
Another well-known parallel approach for large sparse linear systems is the Domain Decomposition Method (DDM), a collection of techniques that revolve around the principle of divide and conquer [4]. The Schur Complement method, the Additive Schwarz method and the Dual-Primal Finite Element Tearing and Interconnection (FETI-DP) method are three commonly-adopted parallel methods of DDM [7].
The Schur Complement method makes use of the master-slave model [8]. It first partitions the large linear system into a number of subsystems. These subsystems are simplified and solved by the slave processors in parallel, and the simplified results are then merged into a new linear system, much smaller than the original one, which is finally solved by the master processor. This model suffers from the heavy communication overhead imposed on the master processor, especially when the number of slave processors is large. Consequently, the scalability and concurrency of the Schur Complement method are limited.
The Additive Schwarz method is similar to block Jacobi iteration. For an SPD system, it needs two assumptions to be convergent, and the convergence speed depends on these two assumptions [4].
The FETI-DP method is a scalable method for large problems [7, 9], but it has to solve a coarse problem. This procedure needs global communication of the residual errors, and its concurrency is difficult to exploit; consequently, the parallel efficiency of FETI-DP suffers.
VTM is a new parallel algorithm for large-scale sparse SPD systems. It is inspired by the behavior of transmission lines in electrical engineering. Although VTM is a distributed iterative algorithm, it is guaranteed to converge because of its physical background. VTM adopts the Neighbor-To-Neighbor (N2N) communication model, which requires only local communication between adjacent processors, as shown in Fig. 1. Because of the N2N model, the communication network of the parallel computer can be simple.
Figure 1. Master-slave model vs. N2N model. (A) Master-slave model. (B) N2N model.
The paper is organized as follows. Section 2 introduces the basics of the transmission line. Section 3 defines the electric graph of a symmetric linear system. Section 4 describes the partitioning technique for electric graphs. Section 5 details the algorithm of VTM. Section 6 presents the convergence theory of VTM, with a basic proof given in the appendix. Section 7 focuses on the preconditioning of VTM. Section 8 proposes a performance model. Numerical experiments are shown in Section 9. We conclude this work in Section 10.
2. Transmission Line
The transmission line is a remarkable element in electrical engineering. The circuit diagram of a transmission line is illustrated in Fig. 2. The behavior of the lossless transmission line is described by the Transmission Delay Equations (TDE):

(2.1)
U_1(t) + Z·I_1(t) = U_2(t − τ) − Z·I_2(t − τ)
U_2(t) + Z·I_2(t) = U_1(t − τ) − Z·I_1(t − τ)

where U_1 and I_1 represent the potential and current of Port 1, and U_2 and I_2 represent those of Port 2; t is the time, τ is the propagation delay, and Z is the characteristic impedance of the transmission line [18, 19, 20].
Figure 2. The circuit diagram of the transmission line.
The transmission line is always troublesome for integrated circuit designers, but it is favorable for parallel algorithm researchers, for four reasons:
1. It isolates circuits from each other, so one circuit does not need to know any details of the others. This corresponds exactly to the distributed-memory model.
2. It transfers the interfacial potentials and currents from one circuit to another, which can be viewed as message passing in parallel computing [8].
3. It exists only between adjacent circuits, so communication takes place only between adjacent processors. This is an instance of the N2N communication model.
4. Its existence does not affect the stability of the resistor network. This observation is the physical basis of the convergence theory of VTM.
Consequently, we may ask how to exploit the transmission line to accelerate the parallel solution of sparse linear systems. Obviously, there are no transmission lines in this mathematical problem, so we have to add them artificially. This is how VTM arises: it inserts Virtual Transmission Lines (VTLs) into the linear system to achieve parallel computing.
3. Weighted Graph and Electric Graph
In this section we define the weighted graph of a matrix and the electric graph of a linear system. Assume there is an n-dimensional linear system,

(3.1)
Ax = b

where x = (x_1, x_2, …, x_n)^T, b = (b_1, b_2, …, b_n)^T, and A = (a_ij) is an n×n matrix with a_ij = a_ji for i ≠ j, i.e. A is symmetric.
As a symmetric matrix, A can be represented by an undirected graph G [2, 4]. Each vertex V_i of G is mapped one-to-one to an unknown x_i of the linear system. There is an edge E_ij between V_i and V_j in G iff a_ij ≠ 0, i ≠ j; otherwise, V_i and V_j are not connected.
A weighted graph G_a is an undirected graph equipped with vertex weights and edge weights: a_ii is defined as the weight of V_i, and a_ij, i ≠ j, is defined as the weight of E_ij. A weighted graph is mapped one-to-one to a symmetric matrix. G_a is defined to be SPD if and only if the coefficient matrix A is SPD.
An electric graph G_e is a weighted graph equipped with current sources: b_i is defined as the inflow current source of V_i. We call x_i the potential of V_i, and x the potential vector of G_e. An electric graph is mapped one-to-one to a symmetric linear system. G_e is defined to be SPD if and only if its corresponding G_a is SPD.
Figure 3. (A) The weighted graph of the matrix A. (B) The electric graph of the linear system Ax = b. Example 3.1: The weighted graph of the coefficient matrix of (3.2) is shown in Fig. 3A, and the electric graph of this linear system is shown in Fig. 3B.
(3.2)
[ 6  -1  -2   0   0   0 ] [x1]   [1]
[-1   7   0  -1   0   0 ] [x2]   [2]
[-2   0   8  -2  -1   0 ] [x3] = [3]
[ 0  -1  -2   9   0  -3 ] [x4]   [4]
[ 0   0  -1   0  10  -5 ] [x5]   [5]
[ 0   0   0  -3  -5  11 ] [x6]   [6]
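The mapping from a symmetric linear system to its electric graph is mechanical: diagonal entries become vertex weights, off-diagonal nonzeros become edge weights, and the right-hand side supplies the current sources. A minimal sketch in NumPy (our own illustration, not part of the original toolbox), applied to (3.2):

```python
import numpy as np

# Coefficient matrix and right-hand side of (3.2)
A = np.array([[ 6, -1, -2,  0,  0,  0],
              [-1,  7,  0, -1,  0,  0],
              [-2,  0,  8, -2, -1,  0],
              [ 0, -1, -2,  9,  0, -3],
              [ 0,  0, -1,  0, 10, -5],
              [ 0,  0,  0, -3, -5, 11]], dtype=float)
b = np.array([1, 2, 3, 4, 5, 6], dtype=float)

# Vertex weights are the diagonal entries a_ii; current sources are b_i
vertex_weights  = {i + 1: A[i, i] for i in range(6)}
current_sources = {i + 1: b[i]    for i in range(6)}

# One undirected edge E_ij per off-diagonal nonzero a_ij (i < j)
edges = {(i + 1, j + 1): A[i, j]
         for i in range(6) for j in range(i + 1, 6) if A[i, j] != 0}

print(sorted(edges))  # [(1, 2), (1, 3), (2, 4), (3, 4), (3, 5), (4, 6), (5, 6)]
```

The edge list printed above is exactly the connectivity drawn in Fig. 3.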
4. Electric Vertex Splitting
Before computing the symmetric linear system Ax = b in parallel, we must partition it. In this section, we introduce a new splitting technique to partition the electric graph of a symmetric linear system, called Electric Vertex Splitting (EVS). When partitioning a sparse linear system that comes from a circuit, EVS is also called wire tearing. EVS is based on Kirchhoff's Current Law from electrical engineering [21]. The major difference of EVS from traditional partitioning algorithms is that we bring in some new unknowns, called inflow currents, to the subgraphs.
We may consider the electric graph to be a linear electric network, recognizing each vertex as an electric node and each edge as a branch. An electric network has not only potentials but also currents. When one node is split into two twin vertices, the continuous current inside is cut off and thus disclosed, so it is reasonable to account for these disclosed currents when doing the splitting. This concept is illustrated in Fig. 4.
Figure 4. Illustration of the Electric Vertex Splitting. (A) The original node, with current flowing through it. (B) Splitting this node. (C) The node is split into a pair of twin nodes, and the currents are disclosed. (D) Simplified symbol of the inflow currents.
There are four steps to perform EVS on the electric graph.
Step 1. Set the splitting boundary G_B. V ∈ G_e is called a boundary vertex iff V ∈ G_B; otherwise, V is called an inner vertex.
Step 2. Split each boundary vertex into a pair of vertices, called twin vertices. The original boundary vertex is called the parent vertex.
Step 3. Split the weight and current source of each boundary vertex, and split the weight of each edge along the boundary, i.e. E_ij, if E_ij ∈ G_e and V_i, V_j ∈ G_B.
Step 4. Add inflow currents to the twin vertices. These inflow currents represent the disclosed currents after splitting.
After these four steps, the original electric graph is split into N subgraphs. If there is an inflow current flowing into a vertex, that vertex is called a port. As a result, twin vertices are also ports of the subgraphs.
Figure 5. Electric Vertex Splitting of the electric graph of Ax = b
Example 4.1:
Continuing with Example 3.1, we split the electric graph G_e of the linear system (3.2), previously shown in Fig. 3B. V_3 and V_4 are set to be the boundary G_B, and we split their weights and current sources. Note that the weight of the edge E_34 is also split into two parts, −0.9 and −1.1. We then get 4 ports, P_3a, P_3b, P_4a and P_4b, with currents ω_3a, ω_3b, ω_4a and ω_4b flowing into them, respectively. After that, G_e is split into two subgraphs, and we obtain the two subsystems (4.1) and (4.2). Fig. 5 illustrates the process of EVS.
(4.1)
[ 6   -1   -2     0  ] [x1 ]   [1  ]   [0  ]
[-1    7    0    -1  ] [x2 ] = [2  ] + [0  ]
[-2    0    4.8  -0.9] [x3a]   [1.6]   [ω3a]
[ 0   -1   -0.9   3.5] [x4a]   [1.8]   [ω4a]

or Ã_1 x_1 = b_1 + ω_1

(4.2)
[ 3.2  -1.1  -1    0 ] [x3b]   [1.4]   [ω3b]
[-1.1   5.5   0   -3 ] [x4b] = [2.2] + [ω4b]
[-1     0    10   -5 ] [x5 ]   [5  ]   [0  ]
[ 0    -3    -5   11 ] [x6 ]   [6  ]   [0  ]

or Ã_2 x_2 = b_2 + ω_2
It should be noted that there are 12 unknowns in (4.1) and (4.2), but only 8 equations. Therefore extra equations, also called boundary conditions, must be supplemented in order to construct an iterative relationship. The boundary conditions are described in Theorem 4.1 and Section 5. The split electric graph, which consists of N subgraphs, is denoted G̃_e.
Usually there is more than one way to choose the splitting boundary, and even once the boundary is chosen, there are still many ways to split the weights and current sources. Each of these ways is called a partition scheme of the electric graph.
EVS can also be used to split the weighted graph of a symmetric matrix A. Since there are no current sources in a weighted graph, it is unnecessary to add currents to the twin vertices after splitting. As a result, splitting the weighted graph G_a by EVS takes three steps.
Step 1. Set the splitting boundary G_B, G_B ⊆ G_a.
Step 2. Split each boundary vertex V ∈ G_B into a pair of twin vertices.
Step 3. Split the weight of each boundary vertex, and split the weight of each edge along the boundary, i.e. E_ij, if E_ij ∈ G_a and V_i, V_j ∈ G_B.
Example 4.2:
Continuing with Example 4.1, we split the coefficient matrix A of linear system (3.2), whose weighted graph Ga was previously shown in Fig. 3A.
Figure 6. Electric vertex splitting of the weighted graph of the matrix A
We set V_3 and V_4 to be the boundary and split their weights. The split result is shown in Fig. 6. After that, G_a is split into two subgraphs, which means that the original matrix A is split into two matrices, Ã_1 and Ã_2:

Ã_1 = [ 6   -1   -2     0  ]      Ã_2 = [ 3.2  -1.1  -1    0 ]
      [-1    7    0    -1  ]            [-1.1   5.5   0   -3 ]
      [-2    0    4.8  -0.9]            [-1     0    10   -5 ]
      [ 0   -1   -0.9   3.5]            [ 0    -3    -5   11 ]
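Both halves of this split remain SPD, which is exactly the "conformal" property formalized in Theorem 4.2 below, and the split weights recombine to the original entries of A (4.8 + 3.2 = 8, 3.5 + 5.5 = 9, −0.9 − 1.1 = −2). A quick numerical check (our own sketch, not from the original toolbox):

```python
import numpy as np

A1 = np.array([[ 6.0, -1.0, -2.0,  0.0],
               [-1.0,  7.0,  0.0, -1.0],
               [-2.0,  0.0,  4.8, -0.9],
               [ 0.0, -1.0, -0.9,  3.5]])
A2 = np.array([[ 3.2, -1.1, -1.0,  0.0],
               [-1.1,  5.5,  0.0, -3.0],
               [-1.0,  0.0, 10.0, -5.0],
               [ 0.0, -3.0, -5.0, 11.0]])

# A symmetric matrix is SPD iff its smallest eigenvalue is positive
min_eig_1 = np.linalg.eigvalsh(A1).min()
min_eig_2 = np.linalg.eigvalsh(A2).min()

# The split vertex and edge weights sum back to the original entries of A
recombined = (A1[2, 2] + A2[0, 0],   # a_33 = 8
              A1[3, 3] + A2[1, 1],   # a_44 = 9
              A1[2, 3] + A2[0, 1])   # a_34 = -2
```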
After illustrating an example of EVS, we present its mathematical description. Assume the original graph G_e is partitioned into N separate subgraphs, M_j, j = 1, 2, …, N, following some partition scheme. Hereafter, we use |M_j| to represent the number of vertices in M_j. M_j and M_i are called adjacent subgraphs if each of them has at least one twin vertex born from the same boundary vertex.
Each subgraph can be mapped back into a symmetric linear subsystem with inflow currents. To express this subsystem, we define Γ_{j,port} to be an ordered set of the ports in M_j, and Γ_{j,inner} an ordered set of the inner vertices in M_j. We define u_j to be the potential vector of Γ_{j,port}, and y_j to be the potential vector of Γ_{j,inner}. Then the local linear system of each subgraph can be expressed as:

(4.3)
[ C_j  E_j ] [u_j]   [f_j]   [ω_j]
[ F_j  D_j ] [y_j] = [g_j] + [ 0 ]

where j = 1, 2, …, N. ω_j is the inflow current vector of the ports of M_j; the inflow current of an inner vertex is zero. u_j and ω_j are also called the local boundary variables of M_j. Equation (4.3) can be rewritten compactly as:

(4.4)
Ã_j x_j = b_j + ω_j,    j = 1, 2, …, N.
The splitting technique described above is the level-one splitting technique; the split vertices can be split again and again, which gives the multilevel splitting technique, as illustrated in Fig. 7. To partition a physical problem in 2 or 3 dimensions, the level-two and level-three splitting techniques are indispensable.
Figure 7. Illustration of multilevel Electric Vertex Splitting. (A) The original vertex. (B) Level-one splitting, where one vertex is split into a pair of twin vertices, and there is one inflow current into each twin vertex. (C) Level-two splitting, where one vertex is split into four child vertices, and there are two inflow currents into each child vertex. (D) Level-three splitting, where one vertex is split into eight child vertices, and there are three inflow currents into each child vertex.
Theorem 4.1 (Reversibility): Suppose the electric graph G_e is partitioned into G̃_e by Electric Vertex Splitting (EVS). If the potentials of each pair of twin vertices are set to be the same, and their inflow currents are set to be opposite, then G̃_e is equivalent to G_e, i.e. the potential of each pair of twin vertices in G̃_e is equal to the potential of their parent vertex in G_e, and the potential of each inner vertex in G̃_e is equal to that of its corresponding inner vertex in G_e.
This theorem tells us that EVS is reversible, which is easy to understand from its physical background: if we reverse the process of EVS, i.e. make the inflow currents a continuous current, merge the twin vertices into one vertex and envelop the continuous current inside it, then we recover the original electric graph. A proof of this theorem is given in Appendix 2.
Example 4.3: Continuing with Example 4.1, we set:
(4.5)
x_3a = x_3b = x_3
x_4a = x_4b = x_4
ω_3a + ω_3b = 0
ω_4a + ω_4b = 0

Combining (4.1), (4.2) and (4.5), we recover (3.2) after eliminating x_3a, x_3b, ω_3a, ω_3b, x_4a, x_4b, ω_4a and ω_4b.
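Theorem 4.1 can be checked numerically for this example: assembling (4.1), (4.2) and the boundary conditions (4.5) into one 12-unknown system reproduces the solution of the original system (3.2). A sketch of that check (our own code):

```python
import numpy as np

# Original system (3.2) and its direct solution
A = np.array([[ 6, -1, -2,  0,  0,  0],
              [-1,  7,  0, -1,  0,  0],
              [-2,  0,  8, -2, -1,  0],
              [ 0, -1, -2,  9,  0, -3],
              [ 0,  0, -1,  0, 10, -5],
              [ 0,  0,  0, -3, -5, 11]], dtype=float)
b = np.array([1., 2., 3., 4., 5., 6.])
x_direct = np.linalg.solve(A, b)

# Unknown ordering: x1 x2 x3a x4a | x3b x4b x5 x6 | w3a w4a w3b w4b
M = np.zeros((12, 12)); rhs = np.zeros(12)
A1 = np.array([[6, -1, -2, 0], [-1, 7, 0, -1],
               [-2, 0, 4.8, -0.9], [0, -1, -0.9, 3.5]])
A2 = np.array([[3.2, -1.1, -1, 0], [-1.1, 5.5, 0, -3],
               [-1, 0, 10, -5], [0, -3, -5, 11]])

M[0:4, 0:4] = A1; rhs[0:4] = [1, 2, 1.6, 1.8]   # subsystem (4.1)
M[2, 8] = -1.0; M[3, 9] = -1.0                  # minus inflow currents w3a, w4a
M[4:8, 4:8] = A2; rhs[4:8] = [1.4, 2.2, 5, 6]   # subsystem (4.2)
M[4, 10] = -1.0; M[5, 11] = -1.0                # minus inflow currents w3b, w4b
M[8, 2] = 1;  M[8, 4]  = -1                     # x3a - x3b = 0   (4.5)
M[9, 3] = 1;  M[9, 5]  = -1                     # x4a - x4b = 0
M[10, 8] = 1; M[10, 10] = 1                     # w3a + w3b = 0
M[11, 9] = 1; M[11, 11] = 1                     # w4a + w4b = 0

x_split = np.linalg.solve(M, rhs)
# Twin potentials should equal the parent potentials of (3.2)
```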
Theorem 4.2 (Conformal Splitting Existence): Suppose the weighted graph G_a is SPD. Then, for an arbitrarily-chosen boundary, there is more than one scheme to partition G_a into N subgraphs by EVS, G̃_j, j = 1, 2, …, N, such that all G̃_j are SPD.
This theorem assures that an SPD graph can always be partitioned into an arbitrary number of SPD subgraphs by EVS. A proof is given in Appendix 1. Here we reuse the word "conformal" for the EVS partition schemes which preserve the SPD property of the electric graph. For the electric graph, we have the same conclusion: suppose the electric graph G_e is SPD; then, for an arbitrarily-chosen boundary, there is more than one scheme to partition G_e into N subgraphs by EVS, G̃_j, j = 1, 2, …, N, such that all G̃_j are SPD.
Corollary 4.1: Suppose the electric graph G_e is SPD. Then, for an arbitrarily-chosen boundary, there is more than one scheme to partition G_e into N subgraphs by EVS, G̃_j, j = 1, 2, …, N, which are all symmetric-nonnegative-definite (SNND).
Corollary 4.1 is a weak version of Theorem 4.2. We then broaden the definition of a conformal partition: if a partition scheme satisfies Corollary 4.1, it is conformal, since it preserves the SPD or SNND property of the subgraphs after partitioning.
This paper does not address how to construct a practical conformal partition scheme for EVS; this is simple for any strongly or weakly diagonally-dominant sparse system. For scientific problems, we recommend doing the partitioning at the physical level, before the sparse linear systems are generated.
5. VTM
Assume that the electric graph G_e has been partitioned into N subgraphs. We then add one VTL between each pair of twin vertices, which means that we use the transmission equations as the boundary conditions. A simple example is given below.
Figure 8. The split electric graph with VTLs.
Example 5.1: Continuing with Example 4.1, we add one VTL, T_3, between x_3a and x_3b, whose characteristic impedance Z_3 is set to 1; then we add another line, T_4, between x_4a and x_4b, whose Z_4 is set to 0.5. According to (2.1), the mathematical equation of T_3 is:

(5.1)
x_3a^k + 1.0·ω_3a^k = x_3b^(k−1) − 1.0·ω_3b^(k−1)
x_3b^k + 1.0·ω_3b^k = x_3a^(k−1) − 1.0·ω_3a^(k−1)
where k is the iteration index in VTM. Similarly, the mathematical equation of T_4 is:

(5.2)
x_4a^k + 0.5·ω_4a^k = x_4b^(k−1) − 0.5·ω_4b^(k−1)
x_4b^k + 0.5·ω_4b^k = x_4a^(k−1) − 0.5·ω_4a^(k−1)
Based on (4.1) and the relevant halves of (5.1) and (5.2), the linear system of Subgraph 1 can be expressed as:

(5.3)
[ 6   -1   -2     0  ] [x1 ]   [1  ]   [0  ]
[-1    7    0    -1  ] [x2 ] = [2  ] + [0  ]
[-2    0    4.8  -0.9] [x3a]   [1.6]   [ω3a]
[ 0   -1   -0.9   3.5] [x4a]   [1.8]   [ω4a]

x_3a^k + 1.0·ω_3a^k = x_3b^(k−1) − 1.0·ω_3b^(k−1)
x_4a^k + 0.5·ω_4a^k = x_4b^(k−1) − 0.5·ω_4b^(k−1)

Eliminating ω_3a^k and ω_4a^k from (5.3), we get (5.4):

(5.4)
[ 6   -1   -2     0  ] [x1 ]   [1]
[-1    7    0    -1  ] [x2 ] = [2]
[-2    0    5.8  -0.9] [x3a]   [1.6 + x_3b^(k−1) − ω_3b^(k−1)]
[ 0   -1   -0.9   5.5] [x4a]   [1.8 + 2·x_4b^(k−1) − ω_4b^(k−1)]

ω_3a^k = −x_3a^k + x_3b^(k−1) − ω_3b^(k−1)
ω_4a^k = −2·x_4a^k + 2·x_4b^(k−1) − ω_4b^(k−1)
Similarly, we get (5.5) for Subgraph 2:

(5.5)
[ 4.2  -1.1  -1    0 ] [x3b]   [1.4 + x_3a^(k−1) − ω_3a^(k−1)]
[-1.1   7.5   0   -3 ] [x4b] = [2.2 + 2·x_4a^(k−1) − ω_4a^(k−1)]
[-1     0    10   -5 ] [x5 ]   [5]
[ 0    -3    -5   11 ] [x6 ]   [6]

ω_3b^k = −x_3b^k + x_3a^(k−1) − ω_3a^(k−1)
ω_4b^k = −2·x_4b^k + 2·x_4a^(k−1) − ω_4a^(k−1)

After that, we set the initial values of the boundary variables:

x_3a^0 = x_3b^0 = x_4a^0 = x_4b^0 = 0
ω_3a^0 = ω_3b^0 = ω_4a^0 = ω_4b^0 = 0
At last, we compute this example distributedly on two processors. Subgraph 1 is located on Processor A and Subgraph 2 on Processor B, as illustrated in Fig. 8. The boundary variables are communicated between the two processors by message passing. The computing result is shown in Fig. 9.
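The whole iteration can also be emulated sequentially in a few lines. The sketch below (our own code, with the two "processors" simulated in one loop) iterates (5.4) and (5.5) from the zero initial guess and converges to the solution of the original system (3.2):

```python
import numpy as np

# Local systems (5.4) and (5.5): port rows already augmented with Z^-1
A1 = np.array([[6, -1, -2, 0], [-1, 7, 0, -1],
               [-2, 0, 5.8, -0.9], [0, -1, -0.9, 5.5]])
A2 = np.array([[4.2, -1.1, -1, 0], [-1.1, 7.5, 0, -3],
               [-1, 0, 10, -5], [0, -3, -5, 11]])

# Boundary variables (potentials and inflow currents), zero initial guess
x3a = x4a = x3b = x4b = 0.0
w3a = w4a = w3b = w4b = 0.0

for k in range(300):
    # Processor A: solve Subgraph 1 using remote variables from Processor B
    rhs1 = np.array([1, 2, 1.6 + x3b - w3b, 1.8 + 2*x4b - w4b])
    x1, x2, x3a_new, x4a_new = np.linalg.solve(A1, rhs1)
    w3a_new = -x3a_new + x3b - w3b            # current updates from (5.4)
    w4a_new = -2*x4a_new + 2*x4b - w4b

    # Processor B: solve Subgraph 2 using remote variables from Processor A
    rhs2 = np.array([1.4 + x3a - w3a, 2.2 + 2*x4a - w4a, 5, 6])
    x3b_new, x4b_new, x5, x6 = np.linalg.solve(A2, rhs2)
    w3b_new = -x3b_new + x3a - w3a            # current updates from (5.5)
    w4b_new = -2*x4b_new + 2*x4a - w4a

    # "Message passing": both sides exchange the new boundary variables
    x3a, x4a, w3a, w4a = x3a_new, x4a_new, w3a_new, w4a_new
    x3b, x4b, w3b, w4b = x3b_new, x4b_new, w3b_new, w4b_new

# Compare against a direct solve of the original system (3.2)
A = np.array([[6, -1, -2, 0, 0, 0], [-1, 7, 0, -1, 0, 0],
              [-2, 0, 8, -2, -1, 0], [0, -1, -2, 9, 0, -3],
              [0, 0, -1, 0, 10, -5], [0, 0, 0, -3, -5, 11]], dtype=float)
x_ref = np.linalg.solve(A, np.arange(1.0, 7.0))
x_vtm = np.array([x1, x2, x3a, x4a, x5, x6])
```

At convergence, the twin potentials x_3a and x_3b agree, reproducing the boundary condition (4.5).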
Figure 9. Distributed computing result of VTM on double processors
After illustrating this simple example, we present the mathematical description of VTM. For the subgraph M_j, we have defined Γ_{j,port} as an ordered set of its ports. Further, we define Γ_{j,twin} to be another ordered set of ports, namely the twin vertices of the ports in Γ_{j,port}; the ports in Γ_{j,port} and their corresponding twin vertices in Γ_{j,twin} have the same order. Clearly, the ports of Γ_{j,twin} belong to the adjacent subgraphs of M_j. Define u_{j,twin} as the potential vector of Γ_{j,twin}, and ω_{j,twin} as the current vector of Γ_{j,twin}. Then, for M_j, j = 1, 2, …, N, the transmission equations (2.1) can be expressed in matrix-vector form:

(5.6)
u_j^k + Z_j ω_j^k = u_{j,twin}^(k−1) − Z_j ω_{j,twin}^(k−1)

Here Z_j is a positive diagonal matrix, called the local characteristic impedance matrix of M_j. The diagonal elements of Z_j are the characteristic impedances of the VTLs connected to M_j; Z_j is the local preconditioner of subgraph M_j.
(5.6) is a distributed iterative relation: u_{j,twin}^(k−1) and ω_{j,twin}^(k−1) are the previous computing results passed from the adjacent subgraphs, called the remote boundary variables of M_j. Merging (4.3) and (5.6), we get:

(5.7)
[ C_j  E_j  −I  ] [u_j^k]   [f_j]
[ F_j  D_j   0  ] [y_j^k] = [g_j]
[ I    0    Z_j ] [ω_j^k]   [u_{j,twin}^(k−1) − Z_j ω_{j,twin}^(k−1)]

where I is the identity matrix. Eliminating ω_j^k, we get the following SPD system:

(5.8)
[ C_j + Z_j^(−1)  E_j ] [u_j^k]   [f_j + Z_j^(−1) u_{j,twin}^(k−1) − ω_{j,twin}^(k−1)]
[ F_j             D_j ] [y_j^k] = [g_j]

ω_j^k = −Z_j^(−1) u_j^k + Z_j^(−1) u_{j,twin}^(k−1) − ω_{j,twin}^(k−1)
Figure 10. Illustration of the computing process of VTM. (A) The original electric graph of the sparse linear system. (B) Partition the original graph into N subgraphs by EVS. (C) Add VTLs between adjacent subgraphs. (D) Map each subgraph onto one processor.
(5.8) is called the local subsystem of M_j, and it can be solved by sparse or dense Cholesky, CG, MG, etc. Table 1 gives the full description of VTM, and Fig. 10 illustrates the computing process of the algorithm. Note that there is no broadcasting, only N2N communication.
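Since the coefficient matrix of (5.8) does not change over the iterations, each processor can pay for the Cholesky factorization once and, in every subsequent iteration, re-run only the forward and backward substitutions with the new right-hand side; this is the cost split assumed in the performance model of Section 8. A NumPy-only sketch (our own illustration), using the local matrix of (5.4):

```python
import numpy as np

# Local SPD matrix of (5.4) for Subgraph 1 (C_j + Z_j^-1 on the port block)
A1 = np.array([[ 6.0, -1.0, -2.0,  0.0],
               [-1.0,  7.0,  0.0, -1.0],
               [-2.0,  0.0,  5.8, -0.9],
               [ 0.0, -1.0, -0.9,  5.5]])

L = np.linalg.cholesky(A1)   # factorization cost is paid exactly once

def local_solve(rhs):
    """Per-iteration work: two triangular substitutions only."""
    n = len(rhs)
    z = np.empty(n); x = np.empty(n)
    for i in range(n):                    # forward substitution, L z = rhs
        z[i] = (rhs[i] - L[i, :i] @ z[:i]) / L[i, i]
    for i in reversed(range(n)):          # backward substitution, L^T x = z
        x[i] = (z[i] - L[i+1:, i] @ x[i+1:]) / L[i, i]
    return x

# Successive iterations with new remote boundary data reuse the same factor L
x_a = local_solve(np.array([1.0, 2.0, 1.6, 1.8]))
x_b = local_solve(np.array([1.0, 2.0, 2.1, 2.3]))
```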
Table 1. Algorithm description of VTM
Assume the original electric graph has been partitioned into N subgraphs. Each subgraph is located on one processor, and there are communication links between adjacent subgraphs.
For Subgraph j, j = 1, …, N, do in parallel:
1. Communicate with the adjacent subgraphs to agree on the characteristic impedance of each VTL, so that Z_j is set.
2. Guess the initial local boundary variables u_j^0 and ω_j^0 of each port, which may be arbitrary values.
3. Wait until the new remote boundary variables u_{j,twin}^(k−1) and ω_{j,twin}^(k−1) are received, then do:
4.   Solve the local subsystem (5.8) with the updated remote boundary variables u_{j,twin}^(k−1) and ω_{j,twin}^(k−1), obtaining the new local boundary variables u_j^k and ω_j^k.
5.   Send the new local boundary variables u_j^k and ω_j^k to the adjacent subgraphs.
6.   If convergent, break.
7. EndWait
6. Convergence Theory of VTM
From the description of VTM it is not straightforward to judge whether the algorithm converges. In this section we present the convergence theorem.
Theorem 6.1 (Convergence): Assume the electric graph of an SPD linear system, Ax = b, is partitioned into N symmetric-nonnegative-definite (SNND) subgraphs. Then, for any positive characteristic impedances of the VTLs, VTM converges to the solution of the original system.
This conclusion is valid for both level-one and multilevel EVS; a proof is given in Appendix 3. Theorem 6.1 can also be stated more simply: assume the electric graph of an SPD linear system is partitioned into N subgraphs following a conformal partition scheme; then VTM converges.
7. Preconditioning
As we observed, the choice of the characteristic impedances of the VTLs has a huge impact on the convergence speed of VTM. Consequently, the characteristic impedances, i.e. the characteristic impedance matrix Z_j, can be considered the preconditioner of VTM, and we define the preconditioning of VTM as the process of finding proper characteristic impedance matrices for the VTLs.
7.1. Impedance Matching
Here we propose a simple way, called impedance matching, to choose the characteristic impedances, i.e. to precondition VTM. Before describing this technique, it is necessary to define a port's input impedance, which can be found in any textbook on circuit theory or microwave networks. The theory of VTM can be considered a mix of numerical analysis and microwave network theory; this paper borrows quite a few notions from electrical engineering, such as transmission line, potential, source current, inflow current and characteristic impedance.
Definition 7.1 (Input Impedance of a Port): For the subgraph described by (4.3), first set all the current sources to zero, then set the inflow currents of all ports except P_j to zero and the inflow current ω_j of P_j to 1. Solving this system gives the potential u_j of P_j, and r_in,j = u_j / ω_j = u_j / 1 = u_j is the input impedance of port P_j.
The impedance matching technique states that the characteristic impedance of a VTL should be neither too large nor too small; usually it is set near the input impedance of either port of the VTL. We use the following example to illustrate its effect.
Example 7.1: We continue with Example 5.1. The input impedance of P_3a is the value of x_3a in (7.3.1), and we get r_in,3a = 0.2598.
(7.3.1)
[ 6   -1   -2     0  ] [x1 ]   [0]
[-1    7    0    -1  ] [x2 ] = [0]
[-2    0    4.8  -0.9] [x3a]   [1]
[ 0   -1   -0.9   3.5] [x4a]   [0]
Similarly, we get r_in,4a = 0.3190, r_in,3b = 0.3699 and r_in,4b = 0.2557.
After that, we choose different combinations of Z_3 and Z_4 and redo the computation of Example 5.1. The root-mean-square (RMS) errors after 20 iterations are shown in Fig. 11, from which we see that the computational error of VTM is lowest when Z_3 is set near r_in,3a or r_in,3b, and Z_4 is set near r_in,4a or r_in,4b. This simple example shows that impedance matching is effective in making VTM accurate and fast. We then test VTM on 128 processors; Fig. 12 illustrates the convergence curves of VTM with and without impedance matching, which is also impressive.
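Computing the input impedances of Definition 7.1 costs one extra linear solve per port. The sketch below (our own code) reproduces r_in,3a and r_in,4a for Subgraph 1 of Example 7.1:

```python
import numpy as np

# Subgraph 1 of (4.1), with all current sources set to zero
A1 = np.array([[ 6.0, -1.0, -2.0,  0.0],
               [-1.0,  7.0,  0.0, -1.0],
               [-2.0,  0.0,  4.8, -0.9],
               [ 0.0, -1.0, -0.9,  3.5]])

def input_impedance(A_local, port):
    """Drive unit current into `port` with all other inflow currents zero;
    the resulting potential at `port` is its input impedance."""
    rhs = np.zeros(A_local.shape[0])
    rhs[port] = 1.0
    return np.linalg.solve(A_local, rhs)[port]

r_3a = input_impedance(A1, 2)   # port P3a is the 3rd unknown
r_4a = input_impedance(A1, 3)   # port P4a is the 4th unknown
# r_3a ≈ 0.2598 and r_4a ≈ 0.3190, matching the values quoted above
```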
Figure 11. Computational error of VTM after 20 iterations
Figure 12. Effect of the impedance matching technique on 128 processors
Finally, it should be noted that the computational error of VTM is a continuous function of the characteristic impedances of the VTLs and is not sensitive to small changes in them. This property makes VTM a practical and robust numerical algorithm.
7.2. Coupling
In general, the local preconditioner Z_j in (5.8) need not be a diagonal matrix; it can also be a banded or even a full matrix. In this case there exists coupling among the adjacent VTLs. In the language of microwave networks, if Z_j is a symmetric matrix, the VTLs connected to M_j are symmetrically coupled; if Z_j is unsymmetric, they are unsymmetrically coupled; if Z_j is diagonal, they are uncoupled.
A microwave network with symmetrically coupled transmission lines tends to be more stable than one with uncoupled transmission lines. This suggests that the convergence of VTM with coupled VTLs may be faster than with uncoupled VTLs. If there are coupled VTLs, the convergence theory of VTM is updated as below:
Theorem 7.1: Assume the electric graph of an SPD linear system, Ax = b, is partitioned into N symmetric-nonnegative-definite (SNND) subgraphs. If all the local preconditioners Z_j are SPD, VTM converges to the solution of the original system.
The proof of this theorem is similar to that of Theorem 6.1.
8. Performance Modeling
In this section we build a simple performance model for VTM [1, 22]. First, we make several assumptions: (1) one floating-point operation at top speed (i.e. the speed of matrix multiplication) costs one time unit; (2) we have p processors arranged in a 2D mesh; (3) the communication delays between neighboring processors are identical, and sending a message of l words from one processor to an adjacent processor costs α + β·l time units.
Second, we prepare the linear system for the test. The electric graph of this linear system is a 2D grid of dimension n. By EVS, we partition this graph regularly into p subgraphs; each subgraph is a smaller grid of dimension b = (√(n/p) + 2)·√(n/p), so b ≈ n/p.
Third, we place each subgraph on one processor and use sparse Cholesky factorization to solve the local subsystem. Numerical experiments show that, for this kind of sparse linear system, the computational complexity of sparse Cholesky factorization is O(b^1.5), and that of the forward and backward substitution is O(b).
Fourth, we precondition VTM: the characteristic impedances of the VTLs are optimized by the impedance matching technique, whose computational complexity is O(b).
Fifth, we perform the distributed iterative computation of VTM. Assume it needs K iterations to achieve a computational error of ε. We do the Cholesky factorization once and the forward and backward substitution for the remaining iterations, as explained in Section 7. The total parallel computing time is then:
(8.1)
T_p = b^1.5 + K·(2b + α + β·b^0.5)
    ≈ (n/p)^1.5 + 2K·(n/p) + Kα + Kβ·(n/p)^0.5

Compared to the computing time on a single processor:

(8.2)
T_s = n^1.5 + 2n

The speedup ratio is:

S = T_s / T_p = (n^1.5 + 2n) / ((n/p)^1.5 + 2K·(n/p) + Kα + Kβ·(n/p)^0.5)
  ≈ p^1.5 / (1 + 2K·√(p/n) + Kβ·(p/n))
The key here is the total iteration number K, which can be considered approximately as a function of n and p, i.e. K(n, p). It is difficult to analyze K(n, p) theoretically; however, the numerical experiments in Section 9 show that the convergence speed of VTM is acceptable, and K is a moderate number even for high computational accuracy.
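The model (8.1)–(8.2) is straightforward to evaluate. The helper below (our own sketch; the machine parameters α, β and the iteration count K are assumed inputs) computes the predicted parallel time, serial time and speedup:

```python
def parallel_time(n, p, K, alpha, beta):
    """Predicted VTM time (8.1): one O(b^1.5) factorization plus K rounds of
    O(b) substitution and one boundary message of O(b^0.5) words, b ~= n/p."""
    b = n / p
    return b**1.5 + K * (2.0*b + alpha + beta * b**0.5)

def serial_time(n):
    """Single-processor time (8.2): factorization plus one substitution."""
    return n**1.5 + 2.0*n

def speedup(n, p, K, alpha, beta):
    return serial_time(n) / parallel_time(n, p, K, alpha, beta)

# Example: a million unknowns on 64 processors, assuming K = 50 iterations
S = speedup(1.0e6, 64, 50, alpha=10.0, beta=1.0)
```

With p = 1 and K = 1, the model reduces to the serial time, as it should.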
9. Numerical Experiments
We test VTM with the VTM toolbox, a distributed-computing emulation platform we developed under MATLAB and SIMULINK. Here n is the dimension of the sparse linear system, and p is the number of processors.
We first test a sparse linear system of dimension n = 4225. We partition it into p subgraphs and solve it on p processors. Fig. 13 illustrates the RMS error curves of VTM for p = 4, 8, 16, 32, 64 and 128. From this figure we see that the computational error of VTM keeps decreasing until it is limited by the machine precision of the computer, which is double precision in this case.
Figure 13. Computational errors of VTM when p changes
Figure 14. Illustration of K(p) when ε = 2.0×10⁻¹⁵
If we set ε = 2.0×10⁻¹⁵, we obtain K(p) in Fig. 14, which is based on Fig. 13. This figure indicates that K increases slowly with p. We then solve a number of sparse linear systems on 64 processors. The dimensions of these testbenches are 289, 1089, 2401, 9409 and 14641, respectively. Fig. 15 shows the RMS error curves of VTM as n varies.
Figure 15. Computational errors of VTM when n changes, p = 64
If we set ε = 2.0×10⁻¹⁵, we obtain K(n) in Fig. 16. This figure indicates that K is largely insensitive to changes in n.
Figure 16. Illustration of K(n) when ε = 2.0×10⁻¹⁵
These experiments show that VTM is an efficient and accurate algorithm. The total iteration number K is not sensitive to the dimension n of the sparse system, and K increases slowly with the number of processors p. As a result, if the dimension of the subsystem on each processor is large enough, the speedup of VTM may approach p, as predicted in Section 8. In theory, the dimension of the sparse linear system solved by VTM can be arbitrarily large, and an arbitrary number of processors can be employed. Limited by our hardware, we were not able to test extremely large problems on supercomputers.
10. Conclusion. In this paper, we propose a new parallel algorithm, VTM, to solve sparse SPD linear systems. VTM can be regarded as a new block relaxation method, similar to block Jacobi, or as a new algebraic domain decomposition algorithm, similar to the additive Schwarz method. The partitioning technique for VTM, Electric Vertex Splitting, differs from the traditional decomposition algorithms for sparse linear systems. The preconditioning of VTM is flexible: the characteristic impedance matrix has a strong impact on the convergence speed, and if there is coupling between adjacent VTLs, the preconditioner may be more efficient. VTM can be applied not only to SPD systems, but also to non-SPD, unsymmetric linear systems and to nonlinear systems. For unsymmetric linear systems, the coupling technique would help the algorithm converge.
Acknowledgments. We thank Prof. Hao Zhang, Yi Su, Dr. Chun Xia, Dr. Wei Xue, Pei Yang, Bin Niu. This work was partially sponsored by the Major State Basic Research Development Program of China (973 Program) under contract G1999032903, the National Natural Science Foundation of China Key Program, 90207001, and the National Science Fund for Distinguished Young Scholars of China, 60025101.
Appendix 1. Proof of the Conformal Splitting Existence Theorem. Before proving the Conformal Splitting Existence Theorem (Theorem 4.2), we first prove a simpler version:

Lemma A1.1: Suppose the weighted graph G_a is SPD. Then, for an arbitrarily chosen boundary G_B, there is more than one partition scheme that partitions G_a into two SPD subgraphs.

In order to prove Lemma A1.1, we present three other lemmas.

Lemma A1.2: The symmetric matrix A is mapped one-to-one to the quadratic form P_A(x) = x^T A x.

Lemma A1.3: A is SPD if and only if P_A(x) is positive-definite, i.e. P_A(x) = x^T A x > 0 for all x ∈ R^n, x ≠ 0.

According to Lemmas A1.2 and A1.3, Lemma A1.4 holds.

Lemma A1.4: Partitioning a weighted graph by electric vertex splitting is equivalent to dividing its quadratic form by the variable splitting technique, and vice versa.
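Lemma A1.3 suggests a direct numerical test: a symmetric matrix is SPD exactly when its Cholesky factorization exists, and its quadratic form is then positive on nonzero vectors. A minimal sketch, assuming numpy:

```python
import numpy as np

def is_spd(A):
    """Check symmetric positive-definiteness via Cholesky (Lemma A1.3)."""
    if not np.allclose(A, A.T):
        return False
    try:
        np.linalg.cholesky(A)
        return True
    except np.linalg.LinAlgError:
        return False

A = np.array([[4.0, 1.0], [1.0, 3.0]])   # SPD: both eigenvalues positive
B = np.array([[1.0, 2.0], [2.0, 1.0]])   # indefinite: eigenvalues 3 and -1

# The quadratic form of the SPD matrix is positive on a random nonzero vector.
rng = np.random.default_rng(0)
x = rng.standard_normal(2)
quad = x @ A @ x
```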
The following example illustrates the variable splitting technique for the quadratic form.

Example A1.1: This example is based on Example 4.1 in Section 4. The quadratic form of A is:

P_A(x) = 6x_1^2 + 7x_2^2 + 8x_3^2 + 9x_4^2 + 10x_5^2 + 11x_6^2 − 2x_1x_2 − 4x_1x_3 − 2x_2x_4 − 4x_3x_4 − 2x_3x_5 − 6x_4x_6 − 10x_5x_6    (A1.1)

After the splitting of the weighted graph of A, the quadratic form of A is also split:

P(x) = P(x^1) + P(x^2)    (A1.2)

P(x^1) = 6x_1^2 + 7x_2^2 + 4.8x_{3a}^2 + 3.5x_{4a}^2 − 2x_1x_2 − 4x_1x_{3a} − 2x_2x_{4a} − 1.8x_{3a}x_{4a}
P(x^2) = 3.2x_{3b}^2 + 5.5x_{4b}^2 + 10x_5^2 + 11x_6^2 − 2.2x_{3b}x_{4b} − 2x_{3b}x_5 − 6x_{4b}x_6 − 10x_5x_6
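The splitting of Example A1.1 can be verified numerically: evaluating P(x^1) + P(x^2) with the twin variables merged (x_{3a} = x_{3b} = x_3, x_{4a} = x_{4b} = x_4) recovers P_A(x). A sketch assuming numpy:

```python
import numpy as np

def P_A(x):
    """Original quadratic form, eq. (A1.1)."""
    x1, x2, x3, x4, x5, x6 = x
    return (6*x1**2 + 7*x2**2 + 8*x3**2 + 9*x4**2 + 10*x5**2 + 11*x6**2
            - 2*x1*x2 - 4*x1*x3 - 2*x2*x4 - 4*x3*x4 - 2*x3*x5
            - 6*x4*x6 - 10*x5*x6)

def P1(x1, x2, x3a, x4a):
    """Quadratic form of A_1, eq. (A1.2)."""
    return (6*x1**2 + 7*x2**2 + 4.8*x3a**2 + 3.5*x4a**2
            - 2*x1*x2 - 4*x1*x3a - 2*x2*x4a - 1.8*x3a*x4a)

def P2(x3b, x4b, x5, x6):
    """Quadratic form of A_2, eq. (A1.2)."""
    return (3.2*x3b**2 + 5.5*x4b**2 + 10*x5**2 + 11*x6**2
            - 2.2*x3b*x4b - 2*x3b*x5 - 6*x4b*x6 - 10*x5*x6)

rng = np.random.default_rng(1)
x = rng.standard_normal(6)
x1, x2, x3, x4, x5, x6 = x
# Merging the twin variables (x3a = x3b = x3, x4a = x4b = x4)
# should recover the original quadratic form, as in eq. (A1.3).
merged = P1(x1, x2, x3, x4) + P2(x3, x4, x5, x6)
```

The split weights recombine exactly: 4.8 + 3.2 = 8, 3.5 + 5.5 = 9, and −1.8 − 2.2 = −4.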
Here x_3 is split into x_{3a} and x_{3b}, and x_4 is split likewise. P(x^1) is the quadratic form of A_1, and P(x^2) is the quadratic form of A_2, as given in Example 4.1. If we merge x_{3a} and x_{3b} back to x_3, and x_{4a} and x_{4b} back to x_4,

x_{3a} = x_{3b} = x_3,  x_{4a} = x_{4b} = x_4    (A1.3)

then (A1.2) turns back into (A1.1). This indicates that the variable splitting technique is also reversible.

Having introduced the variable splitting technique, we now prove Lemma A1.1. First, consider the trivial case where all the vertices are on the boundary, i.e. G_B = G_a. Then A can be split into λA and (1 − λ)A, where λ ∈ (0, 1). If A is SPD, then both λA and (1 − λ)A are SPD; as a result, A is split into two SPD subgraphs.

Next, we use induction. Let n denote the dimension of A.

Step 1. When n = 1, there is only one vertex in G_a, and this vertex must be on the boundary. This is the trivial case, so Lemma A1.1 is true when n = 1.

Step 2. When n = 2, P_A(x) = a_{11}x_1^2 + 2a_{12}x_1x_2 + a_{22}x_2^2. To split A into two subgraphs,
there are three and only three ways to choose the boundary vertices.

(1) If x_2 is the boundary, then completing the square gives:

P_A(x) = a_{11}( x_1 + (a_{12}/a_{11}) x_2 )^2 + ( (a_{11}a_{22} − a_{12}^2)/a_{11} ) x_2^2

Since A is SPD, P_A(x) > 0 holds for any x ≠ 0, thus we have a_{11} > 0 and Δ = a_{11}a_{22} − a_{12}^2 > 0. Splitting x_2 into x_{2a} and x_{2b}, we get, for λ ∈ (0, 1):

P̃(x) = a_{11}( x_1 + (a_{12}/a_{11}) x_{2a} )^2 + λ(Δ/a_{11}) x_{2a}^2 + (1 − λ)(Δ/a_{11}) x_{2b}^2
     = P_α(x_1, x_{2a}) + P_β(x_{2b})

It is easy to see that both P_α and P_β are positive-definite. The corresponding matrix of P_α is:

A_α = [ a_{11}, a_{12}; a_{21}, λa_{22} + (1 − λ) a_{12}^2/a_{11} ]

The corresponding matrix of P_β is the 1×1 matrix:

A_β = ( (1 − λ) a_{22} − (1 − λ) a_{12}^2/a_{11} )

So A is split into A_α and A_β, both of which are SPD.
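The n = 2 construction can be checked numerically; the sample matrix and λ below are illustrative:

```python
import numpy as np

def split_2x2(A, lam):
    """Split a 2x2 SPD matrix at boundary vertex x2 (Step 2 of Lemma A1.1)."""
    a11, a12, a22 = A[0, 0], A[0, 1], A[1, 1]
    A_alpha = np.array([[a11, a12],
                        [a12, lam*a22 + (1 - lam)*a12**2/a11]])
    A_beta = (1 - lam)*a22 - (1 - lam)*a12**2/a11   # 1x1 block for x2b
    return A_alpha, A_beta

A = np.array([[4.0, 2.0], [2.0, 3.0]])   # SPD: a11 > 0, det = 8 > 0
A_alpha, A_beta = split_2x2(A, lam=0.5)

# Both pieces are positive-definite, and merging the split weight of x2
# (adding A_beta back onto the (2,2) entry) recovers A.
eigs = np.linalg.eigvalsh(A_alpha)
merged = A_alpha.copy()
merged[1, 1] += A_beta
```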
(2) If x_1 is the boundary, A can also be split into two SPD subgraphs: we may swap x_1 and x_2, and the conclusion for x_2 is also valid for x_1.

(3) If both x_1 and x_2 are on the boundary, this is the trivial case settled before.

As a result, we conclude that Lemma A1.1 is true when n = 2.

Step 3. Assume that Lemma A1.1 is true when n = k − 1.

Step 4. When n = k, we assume that there is at least one vertex which is not on the boundary; otherwise, if all the vertices are on the boundary, this is the trivial case settled before. Without loss of generality, suppose that x_k is not on the boundary.
Consider P_A(x_k) = x_k^T A_k x_k, where the vector x_k = (x_1, ..., x_{k−1}, x_k)^T and x_{k−1} = (x_1, ..., x_{k−1})^T collects its first k−1 components. Expanding along the last row and column of A_k:

P_A(x_k) = x_{k−1}^T A_{k−1} x_{k−1} + 2 Σ_{i=1}^{k−1} a_{ik} x_i x_k + a_{kk} x_k^2
         = P(x_{k−1}) + 2 Σ_{i=1}^{k−1} a_{ik} x_i x_k + a_{kk} x_k^2

where A_{k−1} is the leading (k−1)×(k−1) block of A_k. We set

y_k = x_k + Σ_{i=1}^{k−1} (a_{ik}/a_{kk}) x_i,  so that  x_k = y_k − Σ_{i=1}^{k−1} (a_{ik}/a_{kk}) x_i.

Substituting,

P_A(x_k) = P(x_{k−1}) + 2 Σ_{i=1}^{k−1} a_{ik} x_i ( y_k − Σ_{j=1}^{k−1} (a_{jk}/a_{kk}) x_j ) + a_{kk} ( y_k − Σ_{i=1}^{k−1} (a_{ik}/a_{kk}) x_i )^2
         = P(x_{k−1}) + 2 Σ_{i=1}^{k−1} a_{ik} x_i y_k − 2 Σ_{i=1}^{k−1} Σ_{j=1}^{k−1} (a_{ik} a_{jk}/a_{kk}) x_i x_j + a_{kk} y_k^2 − 2 Σ_{i=1}^{k−1} a_{ik} x_i y_k + Σ_{i=1}^{k−1} Σ_{j=1}^{k−1} (a_{ik} a_{jk}/a_{kk}) x_i x_j
         = P(x_{k−1}) − Σ_{i=1}^{k−1} Σ_{j=1}^{k−1} (a_{ik} a_{jk}/a_{kk}) x_i x_j + a_{kk} y_k^2
         = P̂(x_{k−1}) + a_{kk} y_k^2

where P̂(x_{k−1}) = P(x_{k−1}) − Σ_{i=1}^{k−1} Σ_{j=1}^{k−1} (a_{ik} a_{jk}/a_{kk}) x_i x_j.

Because P_A(x_k) is positive-definite for all x_k ∈ R^k, x_k ≠ 0, we may set the last component to x_k = −Σ_{i=1}^{k−1} (a_{ik}/a_{kk}) x_i, with x_k = (x_{k−1}^T, x_k)^T. Then y_k = 0 and P̂(x_{k−1}) = P_A(x_k) > 0 for all x_{k−1} ∈ R^{k−1}, x_{k−1} ≠ 0. Therefore P̂(x_{k−1}) is positive-definite.

Since P̂(x_{k−1}) is a quadratic form of k−1 dimensions, it can be split into two positive-definite quadratic forms, by the induction hypothesis of Step 3:

P̂(x_{k−1}) = P(x_α) + P(x_β)

This means that the electric graph of P̂(x_{k−1}) is split into two SPD subgraphs, G_α and G_β. The vertex x_k cannot be connected to both G_α and G_β, because x_k is not on the boundary, as assumed at the beginning of Step 4. Without loss of generality, assume that x_k is connected to G_α only, so that a_{ik} = 0 whenever x_i ∈ G_β. Then:

P_A(x_k) = P̂(x_{k−1}) + a_{kk} y_k^2 = P(x_α) + P(x_β) + a_{kk} y_k^2
         = ( P(x_α) + a_{kk} y_k^2 ) + P(x_β)
         = P(x_α) + a_{kk} ( x_k + Σ_{i=1}^{k−1} (a_{ik}/a_{kk}) x_i )^2 + P(x_β)
         = P(x_α) + a_{kk} ( x_k + Σ_{x_i ∈ G_α} (a_{ik}/a_{kk}) x_i )^2 + P(x_β)
         = P(x_α, x_k) + P(x_β)

So P_A(x_k) has been split into P(x_α, x_k) and P(x_β) by variable splitting, and both are positive-definite. According to Lemma A1.4, Lemma A1.1 is true when n = k.

Step 5. By induction, Lemma A1.1 is true for arbitrary n.

Once Lemma A1.1 is proved, the Conformal Splitting Existence Theorem (Theorem 4.2) follows directly, since one SPD graph can be split N−1 times to obtain N SPD subgraphs.
■
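In matrix terms, the reduction from P_A(x_k) to P̂(x_{k−1}) in Step 4 forms the Schur complement of the a_kk entry, which is again SPD. A quick numerical check on a random SPD matrix, assuming numpy:

```python
import numpy as np

rng = np.random.default_rng(2)
B = rng.standard_normal((5, 5))
A = B @ B.T + 5 * np.eye(5)          # a random SPD matrix, k = 5

k = A.shape[0] - 1
A_km1 = A[:k, :k]                    # leading (k-1) x (k-1) block
a_k = A[:k, k]                       # last column entries a_1k, ..., a_(k-1)k
a_kk = A[k, k]

# P_hat(x_{k-1}) = P(x_{k-1}) - sum_ij (a_ik a_jk / a_kk) x_i x_j,
# i.e. the Schur complement of the a_kk entry.
A_hat = A_km1 - np.outer(a_k, a_k) / a_kk

# Since A is SPD, the Schur complement is SPD as well, so the
# induction hypothesis of Step 3 applies to it.
min_eig = np.linalg.eigvalsh(A_hat).min()
```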
Appendix 2. Proof of the Reversibility Theorem. In Section 4 we introduced the Electric Vertex Splitting technique from the viewpoint of a local subgraph; however, this local viewpoint is not suited to proving the reversibility theorem. What we need is a global viewpoint of this splitting technique, which is presented here. The relationship between the global viewpoint and the local viewpoint is also discussed, and then we give a basic proof of the reversibility theorem (Theorem 4.1). All the discussion is limited to the level-one splitting technique.

In Section 4, the electric graph G_e of Ax = b has been partitioned into N separated subgraphs M_j, j = 1, 2, ..., N, and each subgraph can be described by (4.4). We then define:

Ã = diag(A_1, A_2, ..., A_N),  x̃ = (x_1; x_2; ...; x_N),  b̃ = (b_1; b_2; ...; b_N),  ω̃ = (ω_1; ω_2; ...; ω_N)

As a result, the split system can be expressed as:

Ã x̃ = b̃ + ω̃    (A2.1)
Here Ã is called the split matrix of A, and (A2.1) is called the split system of the original system Ax = b. However, (A2.1) is still not well suited to expressing the proof; we need another formulation. Define Γ_boundary to be an ordered set of all the boundary vertices, and Γ_inner an ordered set of all the inner vertices. Further, define u as the voltage vector corresponding to Γ_boundary, and y as the voltage vector of Γ_inner. The original linear system Ax = b can then be reformatted into (A2.2):

[ C, E; F, D ] [ u; y ] = [ f; g ]    (A2.2)
Then, we partition the electric graph of this system using the Electric Vertex Splitting technique, and every boundary vertex is split into a pair of twin vertices, one of which is called the senior vertex, and the other is called the junior vertex.
Hence, we define Γse to be an ordered set of all the senior vertices, and Γ ju an ordered set of all the junior vertices. The orders of Γse , Γ ju and Γboundary are accordant. Then, we define u se to be the corresponding voltage vector of Γse , and u ju the voltage vector of Γ ju . Consequently, (A2.2) is split to (A2.3).
[ C_se   0      E_se ] [ u_se ]   [ f_se ]   [ ω_se ]
[ 0      C_ju   E_ju ] [ u_ju ] = [ f_ju ] + [ ω_ju ]    (A2.3)
[ F_se   F_ju   D    ] [ y    ]   [ g    ]   [ 0    ]

where C_se + C_ju = C, E_se + E_ju = E, and f_se + f_ju = f. These three equations give a straightforward explanation of the splitting of the vertex weights, current sources and edge weights of the boundary vertices in Section 4. ω_se and ω_ju are the inflow currents of Γ_se and Γ_ju, respectively. (A2.3) can be represented by (A2.4) for short:
Ā x̄ = b̄ + ω̄    (A2.4)

Here Ā is symmetric, and (A2.4) is equivalent to (A2.1).

Lemma A2.1: Ā is a reordering of Ã.

Proof: The original electric graphs of Ã and Ā are the same, and the splitting manners that generate them are the same as well; the only difference is the ordering of the unknowns in x̃ and x̄. So Lemma A2.1 holds. ■

We now prove the reversibility theorem, which can be re-expressed as Lemma A2.2:

Lemma A2.2: If u_se = u_ju and ω_se = −ω_ju, then Ā x̄ = b̄ + ω̄ becomes Ax = b.

Proof: Set u_se = u_ju = u and ω_se = −ω_ju = ω. Then:
[ C_se   0      E_se ] [ u ]   [ f_se ]   [ ω  ]
[ 0      C_ju   E_ju ] [ u ] = [ f_ju ] + [ −ω ]
[ F_se   F_ju   D    ] [ y ]   [ g    ]   [ 0  ]

Eliminating ω by adding the first block row to the second:

[ C_se + C_ju   E_se + E_ju ] [ u ]   [ f_se + f_ju ]
[ F_se + F_ju   D           ] [ y ] = [ g           ]
Because C_se + C_ju = C, E_se + E_ju = E and f_se + f_ju = f, we recover (A2.2), which is Ax = b. ■

Finally, we present Lemma A2.3, which will be useful for proving the convergence theorem in Appendix 3.
Lemma A2.3: If there exists a partition scheme which assures that A_j is SPD for j = 1, 2, ..., N, then Ã is SPD, and consequently Ā is SPD.

This conclusion is straightforward and the proof is omitted. The above mathematical description of the Electric Vertex Splitting technique covers only the level-one splitting technique. The cases of the multilevel splitting techniques are more complex and will be given in the next edition of this paper.
Appendix 3. Proof of the Convergence Theorem. Here we give a basic proof of the convergence theorem (Theorem 6.1) of VTM. We only consider the level-one splitting technique. Assume the original graph G_e is partitioned into N separated subgraphs M_j, j = 1, 2, ..., N, following some partition scheme. G_e is SPD, and all the subgraphs are SPD, i.e. A_j is SPD for j = 1, 2, ..., N.

As described in Section 5, we add one VTL between each pair of twin vertices. Based on the global view introduced in Appendix 2, we have:

u_ju^k + Z ω_ju^k = u_se^{k−1} − Z ω_se^{k−1}
u_se^k + Z ω_se^k = u_ju^{k−1} − Z ω_ju^{k−1}    (A3.1)
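One exchange step of (A3.1) can be sketched as follows. Z and the iterate values are illustrative; in VTM the local subsystems would then solve for the new u and ω from these right-hand sides:

```python
import numpy as np

def tde_update(u_se, w_se, u_ju, w_ju, Z):
    """Form the right-hand sides of one VTL exchange, eq. (A3.1).

    Each side receives the other side's previous quantity u - Z*w;
    the local subsystem then determines its new u^k and w^k from it.
    """
    rhs_ju = u_se - Z @ w_se    # drives u_ju^k + Z w_ju^k
    rhs_se = u_ju - Z @ w_ju    # drives u_se^k + Z w_se^k
    return rhs_se, rhs_ju

Z = np.diag([2.0, 3.0])                              # positive diagonal impedance
u_se, w_se = np.array([1.0, 0.5]), np.array([0.2, -0.1])
u_ju, w_ju = np.array([0.9, 0.6]), np.array([-0.2, 0.1])
rhs_se, rhs_ju = tde_update(u_se, w_se, u_ju, w_ju, Z)
```

In a distributed setting, each of these right-hand sides is what a processor sends to its neighbor; this is the neighbor-to-neighbor communication pattern of VTM.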
where Z should be SPD. We call Z the global characteristic impedance matrix of the VTLs. If all the local characteristic impedance matrices Z_j, j = 1, 2, ..., N, are positive diagonal matrices, then Z is a positive diagonal matrix as well.
Define Z̃ = diag(Z_1, Z_2, ..., Z_N) and M = [ Z, 0; 0, Z ]. Then we have:

Lemma A3.1: M is a reordering of Z̃.

Proof: M and Z̃ are different ways to express the characteristic impedances of the VTLs, so they are equivalent. ■

Eliminating the inner voltage y from (A2.3), we get:

[ C_se − E_se D^{−1} F_se    −E_se D^{−1} F_ju       ] [ u_se ]   [ f_se − E_se D^{−1} g ]   [ ω_se ]
[ −E_ju D^{−1} F_se          C_ju − E_ju D^{−1} F_ju ] [ u_ju ] = [ f_ju − E_ju D^{−1} g ] + [ ω_ju ]    (A3.2)

Then simplify (A3.2) into (A3.3):

S û = β + ω̂    (A3.3)

where

S = [ C_se, 0; 0, C_ju ] − [ E_se; E_ju ] D^{−1} [ F_se, F_ju ].
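Forming S by eliminating the inner voltages y is a block Schur complement. A small numerical sketch; the blocks below are illustrative, not taken from a real split system:

```python
import numpy as np

rng = np.random.default_rng(3)
m, q = 3, 4                                   # boundary pairs, inner vertices
# Illustrative blocks standing in for the split system (A2.3).
Cse, Cju = 4.0 * np.eye(m), 5.0 * np.eye(m)
Ese = 0.1 * rng.standard_normal((m, q))
Eju = 0.1 * rng.standard_normal((m, q))
Dm = rng.standard_normal((q, q))
D = Dm @ Dm.T + 4.0 * np.eye(q)               # SPD inner block
Fse, Fju = Ese.T, Eju.T                       # symmetry of the split system

C = np.block([[Cse, np.zeros((m, m))],
              [np.zeros((m, m)), Cju]])
E = np.vstack([Ese, Eju])                     # couples (u_se, u_ju) to y
F = np.hstack([Fse, Fju])                     # couples y to (u_se, u_ju)

S = C - E @ np.linalg.solve(D, F)             # Schur complement, eq. (A3.3)
min_eig = np.linalg.eigvalsh(S).min()         # SPD, consistent with Lemma A3.2
```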
According to Lemma A2.3, if A_j is SPD for j = 1, 2, ..., N, then Ā is SPD. Thereafter, we have Lemma A3.2.

Lemma A3.2: If Ā is SPD, then S is SPD.

Proof: As presented in Appendix 2,

Ā = [ C_se   0      E_se ]
    [ 0      C_ju   E_ju ]
    [ F_se   F_ju   D    ]

If Ā is SPD, then x̄^T Ā x̄ > 0 for all x̄ ≠ 0, which means that:

[ u_se^T  u_ju^T  y^T ] [ C_se   0      E_se ] [ u_se ]
                        [ 0      C_ju   E_ju ] [ u_ju ] > 0
                        [ F_se   F_ju   D    ] [ y    ]

for all (u_se, u_ju, y) ≠ 0. If we set y = −D^{−1} F_se u_se − D^{−1} F_ju u_ju, the terms in y combine, and we get:

[ u_se^T  u_ju^T ] ( [ C_se, 0; 0, C_ju ] − [ E_se; E_ju ] D^{−1} [ F_se, F_ju ] ) [ u_se; u_ju ] > 0    (A3.4)

for all (u_se, u_ju) ≠ 0. Because S = [ C_se, 0; 0, C_ju ] − [ E_se; E_ju ] D^{−1} [ F_se, F_ju ], (A3.4) can be expressed as:

û^T S û > 0 for all û ≠ 0.    (A3.5)

As a result, S is SPD. ■

Reformat (A3.1) into full matrix-vector form:

[ u_se^k; u_ju^k ] + [ Z, 0; 0, Z ] [ ω_se^k; ω_ju^k ] = [ u_ju^{k−1}; u_se^{k−1} ] − [ Z, 0; 0, Z ] [ ω_ju^{k−1}; ω_se^{k−1} ]

Define the row exchange matrix J = [ 0, I; I, 0 ], where I is the identity matrix. Then:

[ u_se^k; u_ju^k ] + [ Z, 0; 0, Z ] [ ω_se^k; ω_ju^k ] = J [ u_se^{k−1}; u_ju^{k−1} ] − J [ Z, 0; 0, Z ] [ ω_se^{k−1}; ω_ju^{k−1} ]    (A3.6)

(A3.6) can be simply expressed as (A3.7):

û^k + M ω̂^k = J ( û^{k−1} − M ω̂^{k−1} )    (A3.7)

Recall that M = [ Z, 0; 0, Z ]. Because Z is SPD, M is SPD.

According to (A3.3), ω̂^k = S û^k − β. Eliminating ω̂^k from (A3.7):

û^k + M ( S û^k − β ) = J û^{k−1} − J M ( S û^{k−1} − β )
û^k = (I + MS)^{−1} J (I − MS) û^{k−1} + (I + MS)^{−1} (JM + M) β

Let P = (I + MS)^{−1} J (I − MS) and γ = (I + MS)^{−1} (JM + M) β. Then:

û^k = P û^{k−1} + γ    (A3.8)
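Before bounding the norm formally, the contraction can be observed numerically: for SPD S and M = diag(Z, Z), the spectral radius of P stays below 1. A small sketch with randomly generated stand-ins for S and Z:

```python
import numpy as np

rng = np.random.default_rng(4)
m = 2                                         # half the number of boundary unknowns
r = 2 * m
B = rng.standard_normal((r, r))
S = B @ B.T + np.eye(r)                       # SPD stand-in for S in (A3.3)
Z = np.diag(rng.uniform(0.5, 2.0, m))         # positive diagonal impedance
M = np.kron(np.eye(2), Z)                     # M = diag(Z, Z), SPD
J = np.kron(np.array([[0.0, 1.0], [1.0, 0.0]]), np.eye(m))
I = np.eye(r)

P = np.linalg.solve(I + M @ S, J @ (I - M @ S))   # iteration matrix of (A3.8)
rho = max(abs(np.linalg.eigvals(P)))              # spectral radius
```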
Lemma A3.3: If M is SPD and S is SPD, then M^{1/2} S M^{1/2} = Q T Q^T, where Q Q^T = I, M^{1/2} M^{1/2} = M, T = diag(t_1, t_2, ..., t_r) with t_i > 0 for i = 1, 2, ..., r, and r is the dimension of S.

Proof: Because M is SPD and S is SPD, M^{1/2} S M^{1/2} is SPD, so there exists a real orthogonal matrix Q such that M^{1/2} S M^{1/2} = Q T Q^T, where T is a positive diagonal matrix. ■
Lemma A3.4: MS = M^{1/2} Q T Q^T M^{−1/2}.

Proof: MS = M^{1/2} ( M^{1/2} S M^{1/2} ) M^{−1/2} = M^{1/2} ( Q T Q^T ) M^{−1/2}. ■

Lemma A3.5: M^{−1/2} J M^{1/2} = J.

Proof: Since M = [ Z, 0; 0, Z ],

M^{−1/2} J M^{1/2} = [ Z^{−1/2}, 0; 0, Z^{−1/2} ] [ 0, I; I, 0 ] [ Z^{1/2}, 0; 0, Z^{1/2} ]
                   = [ 0, Z^{−1/2} Z^{1/2}; Z^{−1/2} Z^{1/2}, 0 ]
                   = [ 0, I; I, 0 ] = J. ■
According to Lemmas A3.4 and A3.5, we write:

P = ( I + M^{1/2} Q T Q^T M^{−1/2} )^{−1} J ( I − M^{1/2} Q T Q^T M^{−1/2} )
  = ( M^{1/2} Q Q^T M^{−1/2} + M^{1/2} Q T Q^T M^{−1/2} )^{−1} J ( M^{1/2} Q Q^T M^{−1/2} − M^{1/2} Q T Q^T M^{−1/2} )
  = M^{1/2} Q (I + T)^{−1} Q^T M^{−1/2} J M^{1/2} Q (I − T) Q^T M^{−1/2}
  = M^{1/2} Q (I + T)^{−1} Q^T J Q (I − T) Q^T M^{−1/2}

where the last step uses Lemma A3.5. Hence:

P^k = ( M^{1/2} Q (I + T)^{−1} Q^T J Q (I − T) Q^T M^{−1/2} )^k
    = M^{1/2} Q (I + T)^{−1} ( Q^T J Q (I − T)(I + T)^{−1} )^{k−1} Q^T J Q (I − T) Q^T M^{−1/2}

Therefore,

‖P^k‖_2 ≤ ‖M^{1/2}‖ · ‖Q‖ · ‖(I + T)^{−1}‖ · ‖Q^T J Q (I − T)(I + T)^{−1}‖^{k−1} · ‖Q^T‖ · ‖J‖ · ‖Q‖ · ‖I − T‖ · ‖Q^T‖ · ‖M^{−1/2}‖
        ≤ ‖M^{1/2}‖ · ‖M^{−1/2}‖ · ‖(I + T)^{−1}‖ · ‖I − T‖ · ‖(I − T)(I + T)^{−1}‖^{k−1}

since Q is orthogonal and ‖J‖ = 1. Moreover,

‖(I − T)(I + T)^{−1}‖ = ‖ diag( (1 − t_1)/(1 + t_1), (1 − t_2)/(1 + t_2), ..., (1 − t_r)/(1 + t_r) ) ‖
                      = max_i |1 − t_i| / (1 + t_i) < 1

because every t_i > 0. When k is large enough, ‖P^k‖_2 < 1, and ‖P^k‖_2 → 0 as k → ∞. Then the iteration (A3.8) converges for any initial û^0. So we conclude that VTM is convergent.

Finally, we prove that the converged result is the solution of the original system Ax = b.
Suppose that lim_{k→∞} û^k = Û and lim_{k→∞} ω̂^k = Ω̂, which is to say:

u_se^k → U_se,  u_ju^k → U_ju,  ω_se^k → Ω_se,  ω_ju^k → Ω_ju  as k → ∞.

Taking the limit of (A3.7), we get:

Û + M Ω̂ = J Û − J M Ω̂    (A3.11)

Multiplying both sides of (A3.11) by J, and using J J = I and J M = M J, we obtain:

J Û + J M Ω̂ = Û − M Ω̂    (A3.12)

Subtracting (A3.12) from (A3.11) gives Û = J Û; adding them gives M Ω̂ = −J M Ω̂, i.e. Ω̂ = −J Ω̂ since M is invertible. As a result,

U_se = U_ju,  Ω_se = −Ω_ju.

According to the reversibility theorem (Theorem 4.1), the converged result is exactly the solution of the original system. So we have proved the convergence theorem of VTM. ■

It should be noted that the above proof does not cover the case where multilevel splitting occurs during graph partitioning. A full proof will be given in the next edition of this paper.
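The whole argument can be exercised end-to-end on small random data: iterating (A3.8) with an SPD stand-in for S and M = diag(Z, Z) converges, and the limit satisfies U_se = U_ju and Ω_se = −Ω_ju. A sketch assuming numpy; the matrices are illustrative, not taken from a real split system:

```python
import numpy as np

rng = np.random.default_rng(5)
m = 3                                        # number of boundary pairs
B = rng.standard_normal((2 * m, 2 * m))
S = B @ B.T + np.eye(2 * m)                  # SPD stand-in for S in (A3.3)
Z = np.diag(rng.uniform(0.5, 2.0, m))        # positive diagonal impedances
M = np.kron(np.eye(2), Z)                    # M = diag(Z, Z)
J = np.kron(np.array([[0.0, 1.0], [1.0, 0.0]]), np.eye(m))
I = np.eye(2 * m)
beta = rng.standard_normal(2 * m)

P = np.linalg.solve(I + M @ S, J @ (I - M @ S))
gamma = np.linalg.solve(I + M @ S, (J @ M + M) @ beta)

u = rng.standard_normal(2 * m)               # arbitrary initial guess
for _ in range(1000):                        # iteration (A3.8)
    u = P @ u + gamma

u_se, u_ju = u[:m], u[m:]
omega = S @ u - beta                         # omega_hat recovered via (A3.3)
```

At the limit the twin vertices agree and the VTL currents cancel, so merging them recovers the solution of the original merged system, in line with Theorem 4.1.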
References
[1]. J. W. Demmel. Lecture notes of Applications of Parallel Computers at University of California, Berkeley. Available at http://www.cs.berkeley.edu/~demmel/cs267/
[2]. A. George and J. W. Liu. Computer Solution of Large Sparse Positive Definite Systems, Prentice Hall, 1981.
[3]. J. J. Dongarra, I. S. Duff, D. C. Sorensen and H. A. van der Vorst. Numerical Linear Algebra on High-Performance Computers, SIAM, 1998.
[4]. Y. Saad. Iterative Methods for Sparse Linear Systems, 2nd edition, SIAM, 2003.
[5]. M. T. Heath. Parallel direct methods for sparse linear systems, in Parallel Numerical Algorithms, D. Keyes, A. Sameh, and V. Venkatakrishnan, eds., Kluwer Academic Publishers, 1997, pp. 55-90.
[6]. R. Barrett, M. Berry, T. Chan, J. Demmel, J. Donato, J. Dongarra, V. Eijkhout, R. Pozo, C. Romine and H. Van der Vorst. Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, 2nd edition, SIAM, 1994.
[7]. A. Toselli and O. Widlund. Domain Decomposition Methods - Algorithms and Theory, Springer, 2005.
[8]. A. Grama, A. Gupta, G. Karypis and V. Kumar. Introduction to Parallel Computing, 2nd edition, Addison-Wesley, 2002.
[9]. M. Bhardwaj, K. Pierson, G. Reese, T. Walsh, D. Day, K. Alvin, J. Peery, C. Farhat and M. Lesoinne. Salinas: a scalable software for high-performance structural and solid mechanics simulations, 15th Annual Supercomputing Conference, 2002.
[10]. TOP500, a list of the 500 most powerful high performance computers. Available at http://www.top500.org
[11]. M. B. Taylor, J. Kim, J. Miller, D. Wentzlaff, F. Ghodrat, B. Greenwald, H. Hoffman, P. Johnson, Lee Jae-Wook, W. Lee, A. Ma, A. Saraf, M. Seneski, N. Shnidman, V. Strumpen, M. Frank, S. Amarasinghe, and A. Agarwal. The RAW microprocessor: a computational fabric for software circuits and general-purpose programs. IEEE Micro, 22, 2 (March 2002), 25-35.
[12]. K. Sankaralingam, R. Nagarajan, R. McDonald, R. Desikan, S. Drolia, M. S. Govindan, P. Gratz, D. Gulati, H. Hanson, C. Kim, H. Liu, N. Ranganathan, S. Sethumadhavan, S. Sharif, P. Shivakumar, S. W. Keckler and D. Burger. Distributed microarchitectural protocols in the TRIPS prototype processor, 39th Annual International Symposium on Microarchitecture, 2006.
[13]. S. Dighe, H. Wilson, J. Tschanz, D. Finan, P. Iyer, A. Singh, T. Jacob, S. Jain, S. Venkataraman, Y. Hoskote and N. Borkar.
An 80-Tile 1.28 TFLOPS Network-on-Chip in 65nm CMOS. In Proceedings of the International Solid-State Circuits Conference, 11-15 Feb. 2007, San Francisco, CA.
[14]. LAPACK, a linear algebra package. Available at http://www.netlib.org/lapack/
[15]. T. A. Davis. Summary of available software for sparse direct methods, 2006. Available at http://www.cise.ufl.edu/research/sparse/codes/codes.pdf
[16]. X. Li. Direct methods for sparse matrices, 2006. Available at http://crd.lbl.gov/~xiaoye/SuperLU/SparseDirectSurvey.pdf
[17]. V. Eijkhout. Overview of iterative linear system solver packages, 1997. Available at http://www.netlib.org/utk/papers/iterative-survey/
[18]. H. J. Pain. The Physics of Vibrations and Waves, Wiley, 1976.
[19]. R. E. Collin. Foundations for Microwave Engineering, 2nd edition, Wiley-IEEE Press, 2000.
[20]. J. H. Gridley. Principles of Electrical Transmission Lines in Power and Communications, Pergamon Press, 1967.
[21]. T. L. Floyd. Principles of Electric Circuits, 6th edition, Prentice Hall, 1999.
[22]. D. Culler, R. Karp, D. Patterson, A. Sahay, K. E. Schauser, E. Santos, R. Subramonian and T. von Eicken. LogP: Towards a realistic model of parallel computation. Proceedings of the Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, May 1993, San Diego, CA.