Parallel Algorithm for Computing Matrix Inverse by Gauss-Jordan Method
C. Vancea*, F. Vancea**
* Department of Electrical Measurements and Usage of Electric Energy, [email protected]
** Department of Computers, [email protected]
University of Oradea, Faculty of Electrical Engineering, 5 University St., Oradea, Romania
Abstract – Multi-core and even multi-processor computers are increasingly common, even for home use. This does not automatically bring better processing performance: when the machine is used for a single task, some measures have to be taken, even in simple cases, in order to take advantage of the full processing power available. We illustrate this by computing the matrix inverse with the Gauss-Jordan method in a sequential way and then introducing a parallel improvement to the calculation.
I. INTRODUCTION
A simple implementation of a computation method that has no intrinsic parallelism consists of a sequence of operations to be performed (usually repeatedly) in order to obtain the solution. Unless the programming environment has special translation abilities, this is transformed into a single flow of processor instructions to be executed sequentially. Common numerical algorithm implementations have little or no interaction with the computing environment apart from data input and result output, which means that the processing itself will be bound by the speed at which the CPU can execute those instructions. When multi-core and/or multi-processor technology is used to execute such a program, this approach will use at most the capabilities of one processing core, leaving the other resources unused. If the machine has other tasks to perform this is acceptable from a global point of view, but it is still suboptimal for the problem solving itself. We will see that relatively minor changes to a program can bring a good improvement in processor usage when the underlying algorithm offers the opportunity for parallel processing.
II. GAUSS-JORDAN METHOD
Let us review briefly the Gauss-Jordan matrix inversion method. The algorithm starts with a square n × n matrix [A]. This matrix is augmented (right-extended) with an identity matrix of the same order into an extended matrix [C] = [A | I]. The explicit representation of the workspace is:

$$[C] = [A \mid I] = \begin{bmatrix}
a_{11} & a_{12} & a_{13} & \cdots & a_{1n} & 1 & 0 & 0 & \cdots & 0 \\
a_{21} & a_{22} & a_{23} & \cdots & a_{2n} & 0 & 1 & 0 & \cdots & 0 \\
a_{31} & a_{32} & a_{33} & \cdots & a_{3n} & 0 & 0 & 1 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
a_{n1} & a_{n2} & a_{n3} & \cdots & a_{nn} & 0 & 0 & 0 & \cdots & 1
\end{bmatrix}$$
Elementary row operations are then used to reduce the left half of [C], initially occupied by [A], to the identity matrix. Each iteration i first reduces the a_ii element to 1 by dividing row i by a_ii:

$$[C] = \begin{bmatrix}
\ddots & & & & & \\
\cdots & 1 & a^{(i-1)}_{i-1,i} & a^{(i-1)}_{i-1,i+1} & a^{(i-1)}_{i-1,i+2} & \cdots \\
\cdots & 0 & 1 & a^{(i-1)}_{i,i+1} / a^{(i-1)}_{i,i} & a^{(i-1)}_{i,i+2} / a^{(i-1)}_{i,i} & \cdots \\
\cdots & 0 & a^{(i-1)}_{i+1,i} & a^{(i-1)}_{i+1,i+1} & a^{(i-1)}_{i+1,i+2} & \cdots \\
 & & \vdots & \vdots & \vdots & \ddots
\end{bmatrix}$$
where a^(k) denotes the value obtained after iteration k. The next operation is to zero all a_ji coefficients except a_ii by replacing each row j with a properly chosen linear combination of row i and row j:

$$[C] = \begin{bmatrix}
\ddots & & & & & \\
\cdots & 1 & 0 & a^{(i)}_{i-1,i+1} & a^{(i)}_{i-1,i+2} & \cdots \\
\cdots & 0 & 1 & a^{(i)}_{i,i+1} & a^{(i)}_{i,i+2} & \cdots \\
\cdots & 0 & 0 & a^{(i-1)}_{i+1,i+1} - a^{(i)}_{i,i+1}\, a^{(i-1)}_{i+1,i} & a^{(i-1)}_{i+1,i+2} - a^{(i)}_{i,i+2}\, a^{(i-1)}_{i+1,i} & \cdots \\
 & & \vdots & \vdots & \vdots & \ddots
\end{bmatrix}$$
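In this notation, the two steps of iteration i can be restated compactly, for every column k of the augmented matrix:

$$a^{(i)}_{i,k} = \frac{a^{(i-1)}_{i,k}}{a^{(i-1)}_{i,i}}, \qquad
a^{(i)}_{j,k} = a^{(i-1)}_{j,k} - a^{(i-1)}_{j,i}\, a^{(i)}_{i,k} \quad (j \neq i).$$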
After all iterations are complete, the right half of [C] contains [A]^{-1}:

$$[C] = \begin{bmatrix}
1 & 0 & \cdots & 0 & a^{\mathrm{inv}}_{1,1} & a^{\mathrm{inv}}_{1,2} & \cdots & a^{\mathrm{inv}}_{1,n} \\
0 & 1 & \cdots & 0 & a^{\mathrm{inv}}_{2,1} & a^{\mathrm{inv}}_{2,2} & \cdots & a^{\mathrm{inv}}_{2,n} \\
\vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & 1 & a^{\mathrm{inv}}_{n,1} & a^{\mathrm{inv}}_{n,2} & \cdots & a^{\mathrm{inv}}_{n,n}
\end{bmatrix}$$
Iterations have to be performed sequentially, i.e. iteration k+1 can only start after iteration k is over. However, the algorithm offers some opportunities for parallel processing. The first step of iteration i, dividing all elements of row i by a_ii, could be executed in parallel, but a simple division is too small an operation compared to the overhead implied by introducing parallelism, and the gains would be rather small. The second step of iteration i is performed over all rows j with j ≠ i, and within each row j we have to perform one multiplication and one addition for each column of rows i and j. Processing one row j is an operation complex enough to allow parallel processing despite the parallelization overhead. Moreover, particular programming techniques allow us to reduce the overhead to one equivalent fork/join operation per iteration i.
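As a baseline for the parallel variant discussed in the next section, a minimal sequential sketch of the method is given below. It assumes no pivoting (i.e. a_ii never becomes zero, as in the description above), stores the augmented workspace row-major in a flat array, and uses illustrative identifiers rather than the original program's names.

```c
#include <stdlib.h>
#include <string.h>

/* Invert the n x n matrix a (row-major) into inv using Gauss-Jordan.
 * The augmented workspace c = [A | I] has n rows of 2n columns each.   */
void gauss_jordan_inverse(const double *a, double *inv, int n)
{
    int i, j, k;
    int w = 2 * n;                                   /* width of [C]          */
    double *c = calloc((size_t)n * w, sizeof *c);

    for (i = 0; i < n; i++) {                        /* build [C] = [A | I]   */
        memcpy(c + i * w, a + i * n, n * sizeof *a);
        c[i * w + n + i] = 1.0;
    }

    for (i = 0; i < n; i++) {                        /* iteration i           */
        double p = c[i * w + i];
        for (k = 0; k < w; k++)                      /* step 1: row_i /= a_ii */
            c[i * w + k] /= p;

        for (j = 0; j < n; j++) {                    /* step 2: zero column i */
            double f;
            if (j == i)
                continue;
            f = c[j * w + i];                        /* a_ji                  */
            for (k = 0; k < w; k++)
                c[j * w + k] -= f * c[i * w + k];    /* row_j -= a_ji * row_i */
        }
    }

    for (i = 0; i < n; i++)                          /* right half is A^-1    */
        memcpy(inv + i * n, c + i * w + n, n * sizeof *inv);
    free(c);
}
```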
III. PARALLEL IMPLEMENTATION
Since the iterations are purely sequential, let us consider the calculus flow within each iteration (Fig. 1).

Fig. 1 Steps in one iteration: (1) reduce a_ii to 1; (2) reduce a_ji to 0 for all rows j ≠ i.

Let us consider a dual-core machine. To prepare for parallelization, the second step is split into two equivalent parts (Fig. 2).

Fig. 2 Step 2 split into even/odd parts: reduce a_ji to 0 for all even rows j ≠ i; reduce a_ji to 0 for all odd rows j ≠ i.
Since the last two steps are completely independent, we can execute them in parallel, as shown in Fig. 3.

Fig. 3 Parallelization of one iteration: after reducing a_ii to 1, the even rows and the odd rows are processed concurrently.

Parallelizing the reduction of a_ji to 0 introduces a fork and a join that are not intrinsically useful to the computation, and this introduces overhead.
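A minimal sketch of the even/odd split of step 2, under the same assumptions as the sequential sketch above (flat row-major workspace c of n rows by 2n columns; names are illustrative): calling the routine once with first = 0 and once with first = 1 covers all rows j ≠ i, and the two calls touch disjoint rows, so they can safely run in parallel.

```c
/* Step 2 of iteration i, restricted to rows first, first + 2, first + 4, ... */
void eliminate_rows(double *c, int n, int i, int first)
{
    int j, k;
    int w = 2 * n;
    for (j = first; j < n; j += 2) {
        double f;
        if (j == i)
            continue;
        f = c[j * w + i];                         /* a_ji before elimination */
        for (k = 0; k < w; k++)
            c[j * w + k] -= f * c[i * w + k];     /* row_j -= a_ji * row_i   */
    }
}
```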
IV. IMPLEMENTATION DETAILS
We implemented several versions of the matrix inversion algorithm in a C program in order to measure the effect of parallelization on performance and resource usage. To obtain meaningful, repeatable results, the program solves the same problem with the different methods in distinct, repeated runs; averaging the results over several runs removes perturbations introduced by other processes randomly running on the same machine. Each method reads the original matrix from a disk file into memory, runs the algorithm, and saves the result in another disk file for verification. Because matrix size has a great impact on the duration of the calculation, each algorithm is intentionally written so that the original matrix is not modified and can be reused for a rerun. This allows us to run the solving function several times and count the number of runs over a larger time interval, eliminating errors due to timer resolution on the target machine and to transient processor usage spikes.

To adjust the duration measurement automatically for both small and large matrices, the timing logic repeats the solving until the overall time exceeds a sufficiently large interval (we arbitrarily used 10 seconds, which was appropriate for the matrix sizes and processor power available), but not for fewer than a predefined number of runs (we considered a minimum of 10 runs to be reasonable). Of course, loading the matrix from the disk file and saving the results of the last run are not within the timing loop. The data used for testing was randomly generated. Since we had to test different matrix sizes, and the sizes should be rather large to warrant the use of parallelization, we also included a matrix generator in the program for convenient testing.
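A sketch of this measurement loop is shown below, assuming a Win32 environment; GetTickCount() has coarse resolution, which is one reason the loop runs for a long interval, and all names and constants are illustrative rather than taken from the original program.

```c
#include <windows.h>
#include <stdio.h>

#define MIN_TIME_MS 10000              /* keep solving for at least ~10 s  */
#define MIN_RUNS    10                 /* ... and for at least 10 runs     */

/* Repeatedly run one solving method on the same (unmodified) input matrix
 * and report its throughput. solve() is the method under test.            */
void measure(void (*solve)(const double *a, double *inv, int n),
             const double *a, double *inv, int n)
{
    DWORD start = GetTickCount();
    DWORD elapsed;
    int runs = 0;

    do {
        solve(a, inv, n);
        runs++;
        elapsed = GetTickCount() - start;
    } while (elapsed < MIN_TIME_MS || runs < MIN_RUNS);

    printf("%d runs in %lu ms (%.1f runs per 10 s)\n",
           runs, (unsigned long)elapsed, runs * 10000.0 / (double)elapsed);
}
```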
At the implementation level we followed the "recipe" described in the previous section closely, in order to evaluate as accurately as possible only the impact of introducing parallelization, without other effects of code tweaking. Thus, even where some code optimizations would have been feasible in one method or another, we chose not to implement them unless they were feasible in all methods, so as to keep all solutions at the same level of optimization.

The program was written in Visual C under Windows, but moderate care was taken to ease a possible port. The purely sequential code can run (possibly with minor adjustments) on any platform, but for the parallel part we chose to use Windows-specific threading features. We expect process forking and synchronization to bring greater overhead than the levels we experienced with the threaded solution. Even with threads, creating new threads within each iteration to perform the even and odd parts of step 2 would have introduced too much overhead. Therefore we created the threads required for the even/odd computations upfront (within each solve run). The threads are created and then block on a non-signaled event, waiting for the main thread to finish step 1. Within each iteration, the main thread performs step 1, unblocks the even and odd threads to perform step 2 in parallel, and then blocks waiting for them to finish. The process is illustrated in Fig. 4.
Fig. 4 Threads in the parallel execution model: the main thread prepares the even/odd threads, performs step 1 of iteration i, then the even and odd threads perform their halves of step 2 in parallel before iteration i+1 begins.
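A condensed, hypothetical sketch of this orchestration using Win32 auto-reset events is given below; the fixed order N, the global workspace and all identifiers are illustrative assumptions, not the authors' actual code.

```c
#include <windows.h>

#define N 400                              /* assumed matrix order            */
static double C[N][2 * N];                 /* augmented workspace [A | I]     */
static int    g_pivot;                     /* current iteration index i       */
static volatile LONG g_quit;               /* tells the workers to exit       */

static HANDLE g_start[2];                  /* signalled by main after step 1  */
static HANDLE g_done[2];                   /* signalled after each half of
                                              step 2                          */

static DWORD WINAPI worker(LPVOID arg)
{
    int first = (int)(INT_PTR)arg;         /* 0 = even rows, 1 = odd rows     */
    int i, j, k;
    for (;;) {
        WaitForSingleObject(g_start[first], INFINITE);  /* wait for step 1    */
        if (g_quit)
            return 0;
        i = g_pivot;
        for (j = first; j < N; j += 2) {   /* step 2 on this thread's rows    */
            double f;
            if (j == i)
                continue;
            f = C[j][i];
            for (k = 0; k < 2 * N; k++)
                C[j][k] -= f * C[i][k];
        }
        SetEvent(g_done[first]);           /* report this half as finished    */
    }
}

static void solve_parallel(void)
{
    HANDLE th[2];
    int i, k, t;

    g_quit = 0;
    for (t = 0; t < 2; t++) {              /* prepare even/odd threads upfront */
        g_start[t] = CreateEvent(NULL, FALSE, FALSE, NULL);  /* auto-reset,
                                                                non-signaled  */
        g_done[t]  = CreateEvent(NULL, FALSE, FALSE, NULL);
        th[t] = CreateThread(NULL, 0, worker, (LPVOID)(INT_PTR)t, 0, NULL);
    }

    for (i = 0; i < N; i++) {
        double p = C[i][i];                /* step 1: normalise row i         */
        for (k = 0; k < 2 * N; k++)
            C[i][k] /= p;
        g_pivot = i;
        SetEvent(g_start[0]);              /* "fork": release both halves     */
        SetEvent(g_start[1]);
        WaitForMultipleObjects(2, g_done, TRUE, INFINITE);   /* "join"        */
    }

    g_quit = 1;                            /* wake the workers so they exit   */
    SetEvent(g_start[0]);
    SetEvent(g_start[1]);
    WaitForMultipleObjects(2, th, TRUE, INFINITE);
    for (t = 0; t < 2; t++) {
        CloseHandle(g_start[t]);
        CloseHandle(g_done[t]);
        CloseHandle(th[t]);
    }
}
```

With auto-reset events there is exactly one fork/join pair per iteration, which matches the overhead estimate made at the end of Section II.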
While waiting, the blocked threads add very little overhead, so the algorithm essentially uses one core for step 1 (which is relatively short) and two cores for step 2, which is more demanding.
V. PRACTICAL RESULTS
We first ran the program on a single-core machine (Intel Mobile, 1.73 GHz) running Windows XP. For a 100 × 100 matrix, sequential solving yielded 671 and 681 solving runs, respectively, in 10 seconds. Parallel (even/odd) solving performed only 648 runs in 10 seconds. This was to be expected: thread creation and synchronization overhead account for the difference, which is nevertheless rather small, under 10%. For a 400 × 400 matrix we achieved 12 runs in 10 seconds for all methods (the differences between methods were lost in the limited resolution of the timing procedure). At the other end of the size range, with a 20 × 20 matrix, the sequential even/odd method achieved 42114 runs in 10 seconds but the parallel even/odd method only 19202 (well under 50%); for small matrices the thread synchronization overhead is overwhelming. During all tests the processor was used at 100%, according to the operating system tools.

The second machine tested was a Hyper-Threading Pentium 4 at 2.8 GHz running Windows XP. This is not a true multi-core machine, but it might still benefit from the approach. For a 100 × 100 matrix, sequential solving yielded 875 and 902 runs, respectively, in 10 seconds; processor load was around 50%, as expected, unevenly distributed among the logical "processors". Parallel (even/odd) solving performed only 418 runs in 10 seconds, with both logical "processors" fully loaded (100%). The explanation for the radical loss of performance is resource contention at the floating-point unit level in the hyper-threaded architecture. Results for 20 × 20 and 400 × 400 were consistent with the 100 × 100 results.

The third test machine was a dual-processor Pentium III at 700 MHz running Windows 2000 Server. For a 100 × 100 matrix, sequential solving yielded 237 and 249 runs, respectively, in 10 seconds; processor load was around 50%, as expected, distributed roughly 33% / 66% between the processors. Parallel (even/odd) solving performed 218 runs in 10 seconds with both processors fully loaded (100%). In the 400 × 400 case, 10 runs of the sequential method took 55 seconds (50% load, processors loaded equally), while the parallel method completed the same 10 runs in 39 seconds at 100% processor load, a significant improvement (about 40% higher throughput, but obviously far from the ideal 100%). The 20 × 20 case yielded 18641 versus 5540 runs, again confirming the parallelization overhead.
VI. CONCLUSIONS
Newer machines have parallel processing abilities that can be exploited by changing algorithm implementations. We tried to implement such a change for the Gauss-Jordan matrix inversion method. Within the limitations of our implementation, we found that parallelization introduces significant overhead, so that real advantages appear only when the size of the problem grows to fairly large values. There are also additional contention issues to consider on hyper-threaded architectures. These results suggest that the additional processing capabilities found in newer architectures can be used only when the problem can be split into parallel slices large enough to minimize the effect of synchronization and context switches. Further work may reveal whether a process-based approach, coupled with processor-assignment techniques, can bring greater gains.
Singiresu S. Rao, “Applied Numerical Methods for Engineers and Scinetists”, University of Miami, Coral Gables, Florida, Pretince Hall 2002. R. Despa, C. Coculescu, “ Metode Numerice”, Editura Universitarã, Bucuresti, 2006. C. Vancea, “Studiu comparativ al eficienþei algoritmilor paraleli pentru rezolvarea sistemelor algebrice liniare”, Analele Universitãþii Oradea, fascicola Calculatoare, iunie 1995 C. Vancea, “Parallel Algorithms for Solving Linear Equation System. Implementation on a distributed computing environment”, Proceeding on SINTES 8, International Symposium on System Theory, Robotics, Computers and Process Informatics, Craiova, iunie 1996. I. Chiorean, “Calcul Paralel. Fundamente.”, Editura Albastrã, Cluj – Napoca, 1995 T. A. Beu, “Calcul numeric în C”, Editura Albastrã, Cluj – Napoca, 2004