Multithreading and Strassen's algorithms in SUNRED field solver

24-26 September 2008, Rome, Italy

László Pohl
[email protected], Tel: +36-1-463-2704, Fax: +36-1-463-2973
Budapest University of Technology and Economics (BME), Department of Electron Devices, 1521 Budapest, Hungary

Abstract - Complex structures can be well modeled by simulation using appropriate field solvers; however, the investigation of detailed models is a time-demanding process even on the latest computers. This article surveys the new developments that reduce the execution time of the thermal and electro-thermal field solver that takes a Finite Differences Method (FDM) based model as input. The tool is based on the vectorized version of the SUNRED algorithm. One part of the developments is architectural optimization: multithreading, memory and cache system optimization; the other part is the implementation of matrix multiplication and inversion acceleration algorithms. The result is a significantly faster field solver program.

I. INTRODUCTION

Field simulation can be an extremely computation-time-demanding examination method, especially in complex cases such as transient analysis. Transient analysis is based on a series of steady-state (DC) simulations, often with hundreds of steps. There are two ways for software engineers to reduce computation time: decrease the element number of the simulated field or, when this is not possible, optimize the software with special algorithms. An example of the first case is our earlier published algorithm [1], which lets SUNRED users simulate not only cuboid fields but any shape they wish. The element number can also be reduced by decreasing the resolution and detail of the field, but this is not always applicable, and the benefit is questionable.

This paper presents the new algorithms introduced in SUNRED that result in a significant speed increase while offering almost the same accuracy when simulating the same problem as with the older version of the program. The new algorithms can be split into two groups:

• Taking advantage of the merits of modern PC architectures:
  o Multithreading on multi-CPU or multi-core systems. The POSIX Threads library [2] is applied for this purpose, which is available for UNIX/Linux and Windows platforms.
  o Other important algorithmic changes concerning memory and cache usage [3,4].
• The implementation of Strassen's special matrix multiplication and inversion algorithms [5], and the fast complex matrix multiplication method [6].

II. MULTITHREADING AND OTHER ARCHITECTURAL OPTIMIZATIONS

The current version of the SUNRED algorithm is capable of steady-state simulation of finite differences equations describing thermal or coupled electro-thermal fields. The simulation model of the structure under investigation is turned into an electrical network, which is treated by the SUccessive Network REDuction algorithm (hence the name SUNRED). The purpose of the computation is to determine the node voltages of the electrical model network under the applied boundary conditions and excitations. The solution is obtained by the successive algorithm shown in Fig. 1: first the electrical network is divided into elementary cells, which are represented by their admittance matrices and inhomogeneous current vectors. Details can be found in our earlier publications, e.g. [1,7]. The elementary cells are merged in successive reduction steps (left to right in the figure). In the final step, when only two cells remain, the voltages of the common nodes are calculated, and then the voltages of the eliminated nodes are determined in backward substitution steps.

In a reduction or substitution step the merger or separation of two cells does not depend on the other cells of the network, so the operations are parallelizable. There is no point in starting a new thread for every reduction, because thread administration consumes time. Theoretically, the optimal case is when the number of threads equals the number of physical or logical (Hyper-Threading [8]) processing units. In practice, in the case of SUNRED, it is recommended to start more threads, because one thread can be slower than another, and more threads can flatten out the differences; see the benchmarking details in section IV. In SUNRED the number of threads can be controlled externally in the problem definition files. Fig. 1 represents the case when the node reduction runs on four threads; each thread deals with one quarter of the cells. After a reduction step the threads join, and in the next step new threads are started (because this required the least modification to the algorithm), as the sketch below illustrates. As shown in Fig. 1, in the last steps there are fewer threads than configured, which means that in these steps the processor cores are not fully utilized. The computation time demand is similar in every step: the fewer the cells, the more nodes per cell. SUNRED solves this problem by parallelizing Strassen's multiplication and inversion, as described in the next section.
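As an illustration of this join-and-respawn scheme, the following sketch distributes the independent cell merges of one reduction level across POSIX threads [2]. The Cell type, the merge_pair routine and all other names are illustrative placeholders, not SUNRED's actual interfaces.

```cpp
#include <pthread.h>
#include <vector>
#include <cstddef>

struct Cell { /* admittance matrix and current vector omitted in this sketch */ };

// Placeholder for the real network reduction that merges two cells:
static Cell* merge_pair(Cell* /*a*/, Cell* /*b*/) { return new Cell; }

struct WorkerArg {
    std::vector<Cell*>* level;   // cells of the current reduction level
    std::vector<Cell*>* next;    // merged cells (pre-sized to level->size()/2)
    std::size_t begin, end;      // this thread's range of pair indices [begin, end)
};

static void* reduce_range(void* p) {
    WorkerArg* w = static_cast<WorkerArg*>(p);
    for (std::size_t i = w->begin; i < w->end; ++i)      // pairs are independent,
        (*w->next)[i] = merge_pair((*w->level)[2 * i],   // so no locking is needed
                                   (*w->level)[2 * i + 1]);
    return nullptr;
}

// One reduction level: split the pair merges among nthreads POSIX threads,
// then join them all before the next level starts (as in Fig. 1).
void reduce_level(std::vector<Cell*>& level, std::vector<Cell*>& next,
                  std::size_t nthreads)
{
    const std::size_t pairs = level.size() / 2;
    std::vector<pthread_t> tid(nthreads);
    std::vector<WorkerArg> arg(nthreads);
    for (std::size_t t = 0; t < nthreads; ++t) {
        arg[t] = { &level, &next, pairs * t / nthreads, pairs * (t + 1) / nthreads };
        pthread_create(&tid[t], nullptr, reduce_range, &arg[t]);
    }
    for (std::size_t t = 0; t < nthreads; ++t)
        pthread_join(tid[t], nullptr);   // barrier: all merges of this level done
}
```

Restarting the threads at every level is slightly wasteful, but, as noted above, it required the least modification to the algorithm.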


Matrix multiplications take 40-70% and matrix inversions take 6-25% of the execution time, depending on resolution and optimizations, so the acceleration of matrix operations is the most important way to obtain a faster running program. The following algorithms serve the efficiency of these matrix operations.

The C = A × B matrix multiplication consists of the dot products c_ij = Σ_k a_ik·b_kj, for example:

⎡c11 c12⎤   ⎡a11 a12 a13 a14⎤   ⎡b11 b12⎤
⎢c21 c22⎥ = ⎢a21 a22 a23 a24⎥ × ⎢b21 b22⎥        (1.a)
⎣c31 c32⎦   ⎣a31 a32 a33 a34⎦   ⎢b31 b32⎥
                                ⎣b41 b42⎦

Accessing b_kj in (1.a) is not optimal for cache utilization, because memory and cache subsystems are optimized for serial access [4]. By transposing B, so that both operands are traversed row by row, the speed of the multiplication can be doubled:

⎡c11 c12⎤   ⎡a11 a12 a13 a14⎤   ⎡b11 b21 b31 b41⎤T
⎢c21 c22⎥ = ⎢a21 a22 a23 a24⎥ × ⎢               ⎥        (1.b)
⎣c31 c32⎦   ⎣a31 a32 a33 a34⎦   ⎣b12 b22 b32 b42⎦

The execution time needed for the transposition is negligible compared to the execution time of the multiplication. This technique has been part of SUNRED since it was vectorized. More speed can be gained by violating this rule a bit and computing sub-blocks:

⎡⎡c11 c12⎤⎤   ⎡⎡a11 a12⎤ ⎡a13 a14⎤⎤
⎢⎣c21 c22⎦⎥ = ⎢⎣a21 a22⎦ ⎣a23 a24⎦⎥ ×        (2.a)
⎣[c31 c32]⎦   ⎣[a31 a32] [a33 a34]⎦

              ⎡⎡b11 b21⎤ ⎡b31 b41⎤⎤
            × ⎢⎣b12 b22⎦ ⎣b32 b42⎦⎥        (2.b)

where, as in (1.b), the blocks of B are stored in transposed form. The product is then computed block by block:

C_IJ = Σ_K A_IK × B_KJ        (2.c)

The elements of the C_IJ sub-blocks are calculated at the same time, as (2.c) shows. This method is called blocking [3]. Why is this faster? Because if A is sized m×n and B is sized n×o, the a_ik and b_kj values are used n times during a multiplication, and if the matrices are too big, these values drop out of the cache and must be reloaded from the slow RAM every time they are needed. SUNRED uses 4×4 sized blocks, and the 4×4 block multiplication loops are unrolled in the program code for faster execution, as in the sketch below.
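The following sketch illustrates the blocking scheme of (2.a)-(2.c) combined with the transposed operand of (1.b). It is a minimal illustration of the technique under the assumption of square matrices with sizes divisible by the block size, not SUNRED's production kernel (which additionally unrolls the 4×4 inner loops); all names are assumed.

```cpp
#include <vector>
#include <cstddef>

// C (n x n) = A (n x n) * B (n x n), where Bt holds the transpose of B
// row-major, so both operands are read serially as in formula (1.b).
// C must be zero-initialized; n is assumed divisible by the block size.
void blocked_multiply(const std::vector<double>& A,
                      const std::vector<double>& Bt,
                      std::vector<double>& C, std::size_t n)
{
    const std::size_t BS = 4;                    // SUNRED-style 4x4 blocks
    for (std::size_t i0 = 0; i0 < n; i0 += BS)
      for (std::size_t j0 = 0; j0 < n; j0 += BS)
        for (std::size_t k0 = 0; k0 < n; k0 += BS)
          // One block product C_IJ += A_IK * B_KJ of formula (2.c);
          // the operand blocks stay in cache while they are reused.
          for (std::size_t i = i0; i < i0 + BS; ++i)
            for (std::size_t j = j0; j < j0 + BS; ++j) {
              double sum = 0.0;
              for (std::size_t k = k0; k < k0 + BS; ++k)
                sum += A[i * n + k] * Bt[j * n + k];
              C[i * n + j] += sum;
            }
}
```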

III. STRASSEN'S ALGORITHMS AND COMPLEX MULTIPLICATION

With conventional algorithms, the process of matrix multiplication and inversion requires operations on the order of magnitude of n^3. In 1969 Volker Strassen presented a method which reduces this to n^(log2 7) ≈ n^2.807 [5]. First let us see the multiplication.

Conventional block multiplication (formulas from [6]):

⎡C11 C12⎤   ⎡A11 A12⎤   ⎡B11 B12⎤
⎢       ⎥ = ⎢       ⎥ × ⎢       ⎥        (3)
⎣C21 C22⎦   ⎣A21 A22⎦   ⎣B21 B22⎦

C11 = A11 × B11 + A12 × B21
C12 = A11 × B12 + A12 × B22        (4)
C21 = A21 × B11 + A22 × B21
C22 = A21 × B12 + A22 × B22

where Aij, Bij and Cij are matrix blocks. Eight multiplications and four additions are required.

Strassen's multiplication:

Q1 = (A11 + A22) × (B11 + B22)
Q2 = (A21 + A22) × B11
Q3 = A11 × (B12 − B22)
Q4 = A22 × (−B11 + B21)        (5)
Q5 = (A11 + A12) × B22
Q6 = (−A11 + A21) × (B11 + B12)
Q7 = (A12 − A22) × (B21 + B22)

C11 = Q1 + Q4 − Q5 + Q7
C21 = Q2 + Q4
C12 = Q3 + Q5        (6)
C22 = Q1 + Q3 − Q2 + Q6

Here the Qi's are temporary matrices. Seven multiplications and eighteen additions or subtractions are required. Because addition and subtraction require only ~n^2 operations, for big matrices Strassen's matrix multiplication method is faster than the conventional one (see details in the next section). The input matrices do not need to be square; the only criterion is that their sizes must be divisible by two.
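The following sketch shows how formulas (5) and (6) translate into a recursive routine. It is a simplified illustration assuming square matrices and a tunable recursion threshold (cf. the n > 500 observation below); the Mat type and all helper names are assumptions, not SUNRED's actual code.

```cpp
#include <vector>
#include <cstddef>

// Minimal dense matrix: square, row-major. Illustrative only.
struct Mat {
    std::size_t n{};
    std::vector<double> a;
    explicit Mat(std::size_t n_) : n(n_), a(n_ * n_, 0.0) {}
    double& at(std::size_t i, std::size_t j)       { return a[i * n + j]; }
    double  at(std::size_t i, std::size_t j) const { return a[i * n + j]; }
};

static Mat add(const Mat& x, const Mat& y, double s = 1.0) {
    Mat r(x.n);                              // r = x + s*y (s = -1: subtraction)
    for (std::size_t i = 0; i < x.a.size(); ++i) r.a[i] = x.a[i] + s * y.a[i];
    return r;
}

static Mat block(const Mat& m, std::size_t bi, std::size_t bj) {
    std::size_t h = m.n / 2;                 // copy out one of the four blocks
    Mat r(h);
    for (std::size_t i = 0; i < h; ++i)
        for (std::size_t j = 0; j < h; ++j)
            r.at(i, j) = m.at(bi * h + i, bj * h + j);
    return r;
}

static void put(Mat& m, const Mat& b, std::size_t bi, std::size_t bj) {
    std::size_t h = m.n / 2;                 // write a block back into place
    for (std::size_t i = 0; i < h; ++i)
        for (std::size_t j = 0; j < h; ++j)
            m.at(bi * h + i, bj * h + j) = b.at(i, j);
}

static Mat mul_naive(const Mat& x, const Mat& y) {
    Mat r(x.n);                              // conventional O(n^3) fallback
    for (std::size_t i = 0; i < x.n; ++i)
        for (std::size_t k = 0; k < x.n; ++k)
            for (std::size_t j = 0; j < x.n; ++j)
                r.at(i, j) += x.at(i, k) * y.at(k, j);
    return r;
}

// Strassen multiplication following formulas (5) and (6); below the
// threshold it falls back to the conventional loop.
Mat strassen(const Mat& A, const Mat& B, std::size_t threshold = 512) {
    if (A.n <= threshold || A.n % 2 != 0) return mul_naive(A, B);
    Mat A11 = block(A,0,0), A12 = block(A,0,1), A21 = block(A,1,0), A22 = block(A,1,1);
    Mat B11 = block(B,0,0), B12 = block(B,0,1), B21 = block(B,1,0), B22 = block(B,1,1);
    Mat Q1 = strassen(add(A11, A22), add(B11, B22), threshold);
    Mat Q2 = strassen(add(A21, A22), B11, threshold);
    Mat Q3 = strassen(A11, add(B12, B22, -1.0), threshold);
    Mat Q4 = strassen(A22, add(B21, B11, -1.0), threshold);
    Mat Q5 = strassen(add(A11, A12), B22, threshold);
    Mat Q6 = strassen(add(A21, A11, -1.0), add(B11, B12), threshold);
    Mat Q7 = strassen(add(A12, A22, -1.0), add(B21, B22), threshold);
    Mat C(A.n);
    put(C, add(add(Q1, Q4), add(Q7, Q5, -1.0)), 0, 0);  // C11 = Q1+Q4-Q5+Q7
    put(C, add(Q3, Q5), 0, 1);                          // C12 = Q3+Q5
    put(C, add(Q2, Q4), 1, 0);                          // C21 = Q2+Q4
    put(C, add(add(Q1, Q3), add(Q6, Q2, -1.0)), 1, 1);  // C22 = Q1+Q3-Q2+Q6
    return C;
}
```

The seven Q products are independent, which is what makes the parallelization discussed below possible.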

Fig. 1: Successive node reduction with multithreading (the Thread1-Thread4 labels of the figure mark which of the four threads merges each pair of cells at each level)


In the literature one can find theoretically better algorithms than Strassen's, see e.g. [9]. These methods are generally not usable in practice because their benefit appears only on extremely huge matrices. Strassen's algorithm is also slower on small matrices than the conventional methods. In SUNRED we found that the algorithm is optimal for matrices with n > 500.

The calculation of the Qi matrices can be parallelized. Full parallelization demands 5.25n^2 additional temporary memory space. Without parallelization this demand is only 1.5n^2. We have chosen a compromise: two multiplications at the same time; in this case 2.25n^2 extra memory is required. When n > 1000, Strassen's multiplication becomes recursive, so the multiplication runs on more than two threads.

A disadvantage of Strassen's method is the degradation of numerical stability [5]. In the case of SUNRED this generally means a decrease of accuracy from 7-10 decimal digits to 5-8 digits. This is not a problem in practice, because the deviation of material parameters, the fitting of components, the uncertainty of the radiated power and other effects would anyway result in a simulation accuracy in the range of percents at best.

Strassen presented a method for inversion as well. The matrix to invert is divided into blocks again:

⎡C11 C12⎤   ⎡A11 A12⎤⁻¹
⎢       ⎥ = ⎢       ⎥        (7)
⎣C21 C22⎦   ⎣A21 A22⎦

Strassen's inversion:

R1 = A11⁻¹
R2 = A21 × R1
R3 = R1 × A12
R4 = A21 × R3
R5 = R4 − A22
R6 = R5⁻¹        (8)
C12 = R3 × R6
C21 = R6 × R2
R7 = R3 × C21
C11 = R1 − R7
C22 = −R6

Two inversions and six multiplications remain, so the number of ~n^3 operations has not changed; however, because the multiplications can be done by Strassen's algorithm and the inversions are recursively decomposable, the n^(log2 7) ≈ n^2.807 theoretical improvement is achievable. In practice the situation is even better: an n×n inversion takes 2.5-3 times longer than an n×n multiplication, and here inversion is traded for multiplication. Parallelization is available for the inversion, but not as much as for the multiplication: only the R2-R3 and C12-C21 pairs can be computed together (C11 cannot run in parallel with C21, since it depends on C21 through R7).
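A matching sketch of formula (8) follows, reusing the Mat, add, block, put and strassen helpers from the multiplication sketch above. The base case is a deliberately naive Gauss-Jordan elimination without pivoting, so this illustrates only the recursion structure, under the assumption that A11 and R5 are invertible with non-zero pivots.

```cpp
// Naive Gauss-Jordan base case, no pivoting: illustration only.
static Mat invert(const Mat& A) {
    std::size_t n = A.n;
    Mat M = A, I(n);
    for (std::size_t i = 0; i < n; ++i) I.at(i, i) = 1.0;
    for (std::size_t c = 0; c < n; ++c) {
        double p = M.at(c, c);                        // assumed non-zero
        for (std::size_t j = 0; j < n; ++j) { M.at(c, j) /= p; I.at(c, j) /= p; }
        for (std::size_t r = 0; r < n; ++r) {
            if (r == c) continue;
            double f = M.at(r, c);
            for (std::size_t j = 0; j < n; ++j) {
                M.at(r, j) -= f * M.at(c, j);
                I.at(r, j) -= f * I.at(c, j);
            }
        }
    }
    return I;
}

// Strassen's inversion following formula (8): the six products go through
// Strassen's multiplication, the two sub-inversions recurse.
Mat strassen_inv(const Mat& A, std::size_t threshold = 512) {
    if (A.n <= threshold || A.n % 2 != 0) return invert(A);
    Mat A11 = block(A,0,0), A12 = block(A,0,1), A21 = block(A,1,0), A22 = block(A,1,1);
    Mat R1  = strassen_inv(A11, threshold);           // R1 = inv(A11)
    Mat R2  = strassen(A21, R1, threshold);           // R2 and R3 can run
    Mat R3  = strassen(R1, A12, threshold);           //   in parallel
    Mat R4  = strassen(A21, R3, threshold);
    Mat R5  = add(R4, A22, -1.0);                     // R5 = R4 - A22
    Mat R6  = strassen_inv(R5, threshold);            // R6 = inv(R5)
    Mat C12 = strassen(R3, R6, threshold);            // C12 and C21 can run
    Mat C21 = strassen(R6, R2, threshold);            //   in parallel
    Mat R7  = strassen(R3, C21, threshold);
    Mat C11 = add(R1, R7, -1.0);                      // C11 = R1 - R7
    Mat C22 = add(Mat(R6.n), R6, -1.0);               // C22 = -R6
    Mat C(A.n);
    put(C, C11, 0, 0); put(C, C12, 0, 1);
    put(C, C21, 1, 0); put(C, C22, 1, 1);
    return C;
}
```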
In SUNRED half of the multiplications and all inversions are performed on symmetrical matrices; special routines were made for these, which are almost twice as fast as the non-symmetrical ones.

Frequency-domain (AC) simulation requires complex number arithmetic. Complex multiplication contains four real multiplications, one addition and one subtraction:

(A + iB)(C + iD) = (AC − BD) + i(AD + BC)        (9)

However, similarly to Strassen's method, the multiplication can be reordered so that the number of multiplications decreases to three, while the additions and subtractions increase to three each:

(A + iB)(C + iD) = (AC − BD) + i[(A + B)(C + D) − AC − BD]        (10)

The multiplications can be done in parallel; temporary matrices of 2-5n^2 size are required, depending on the level of parallelization.
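A sketch of the three-multiplication scheme (10) at the matrix level, with flat row-major matrices and a naive real product standing in for whichever real routine is used (the blocked or Strassen multiplications above would fit); all names are illustrative.

```cpp
#include <vector>
#include <cstddef>

using Matrix = std::vector<double>;   // flat row-major n x n matrix (sketch)

static Matrix madd(const Matrix& x, const Matrix& y, double s = 1.0) {
    Matrix r(x.size());               // r = x + s*y (s = -1: subtraction)
    for (std::size_t i = 0; i < x.size(); ++i) r[i] = x[i] + s * y[i];
    return r;
}

static Matrix real_mul(const Matrix& x, const Matrix& y, std::size_t n) {
    Matrix r(n * n, 0.0);             // any real routine fits here
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n; ++k)
            for (std::size_t j = 0; j < n; ++j)
                r[i * n + j] += x[i * n + k] * y[k * n + j];
    return r;
}

// (A + iB)(C + iD) with three real multiplications, formula (10):
// real part = AC - BD, imaginary part = (A+B)(C+D) - AC - BD.
void complex_mul(const Matrix& A, const Matrix& B,
                 const Matrix& C, const Matrix& D,
                 Matrix& re, Matrix& im, std::size_t n)
{
    Matrix AC = real_mul(A, C, n);    // the three real products are
    Matrix BD = real_mul(B, D, n);    // independent and can run in parallel
    Matrix S  = real_mul(madd(A, B), madd(C, D), n);
    re = madd(AC, BD, -1.0);          // AC - BD
    im = madd(madd(S, AC, -1.0), BD, -1.0);
}
```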

IV. BENCHMARK RESULTS

The speed gain of the algorithms is presented in this section. The following abbreviations are used in the tables:

1024×1 field: a real-life sample; 1024×1024 grid resolution, 1 layer (2D) thermal field, full DC simulation.
64×16 field: a real-life sample; 64×64 grid resolution, 16 layers (3D) thermal field, full DC simulation.
n×n matrix: multiplication of two 3200×3200 sized real (64-bit double precision) matrices, or inversion of one.
m×o matrix: multiplication of a 4800×1920 and a 1920×1600 sized real (64-bit double precision) matrix.

Test computer: Dell Dimension 9200 with Intel Core 2 Duo E6400 (2.13 GHz, 2 MB cache), 4 GB DDR2 667 MHz RAM, Windows Vista Ultimate 32-bit operating system. The solver was compiled with the MS Visual Studio .NET 2003 C++ compiler for the SSE2 instruction set. The run times are calculated as the average of three measurements.

Table I shows the effect of the thread number. The test system contained a dual-core processor, so more than two threads give minimal speed change, but four threads give a bit more speed than two. The 64×16 field gained much more from the thread number increase, because 90% of its runtime was consumed by matrix operations, vs. 50% for the 1024×1 field.

Table II presents the results of matrix blocking multiplication (2) and loop unrolling; 14-26% acceleration was gained in the real-life applications.

TABLE I
Effect of thread number on the speed (dual-core system)

Test case      Threads   Runtime     Speed ratio
1024×1 field   1         30.65 sec   100.0%
               2         19.75 sec   155.2%
               4         19.57 sec   156.6%
               64        19.85 sec   154.4%
               1024      21.34 sec   143.6%
64×16 field    1         16.75 sec   100.0%
               2         9.57 sec    175.1%
               4         9.49 sec    176.5%
               64        9.60 sec    174.5%
               1024      10.52 sec   159.3%

TABLE II
Effect of multiplication blocking on the speed

Test case      Normal runtime   Normal speed ratio   Block runtime   Block speed ratio
n×n matrix     33.95 sec        100.0%               18.92 sec       179.4%
m×o matrix     10.84 sec        100.0%               8.84 sec        122.6%
1024×1 field   22.45 sec        100.0%               19.57 sec       114.7%
64×16 field    11.96 sec        100.0%               9.49 sec        126.0%

Table III presents the effect of Strassen's multiplication algorithm. The big difference in the matrix multiplications is the result of multithreading, but Strassen's multiplication gives more than double speed on the dual-core processor, so the algorithm is efficient. If the minimal block size is decreased to 200 or less, the speed drops back significantly. The effect on the real-life applications is minimal, only 1.6-4.6%; the effect of the two cores cannot be seen in this case, because the reduction already runs on four threads.

TABLE III
Effect of Strassen's multiplication on the speed

Test case      Min. block size   Runtime     Speed ratio
n×n matrix     Traditional (∞)   46.38 sec   100.0%
               1200              20.23 sec   229.3%
               800               18.79 sec   246.8%
               500               18.92 sec   245.1%
               200               19.84 sec   233.8%
               100               31.43 sec   147.6%
m×o matrix     Traditional (∞)   20.93 sec   100.0%
               1200              9.51 sec    220.0%
               800               9.44 sec    221.7%
               500               8.84 sec    236.8%
               200               9.22 sec    226.9%
               100               11.83 sec   176.9%
1024×1 field   Traditional (∞)   19.89 sec   100.0%
               1200              19.82 sec   100.4%
               800               19.74 sec   100.7%
               500               19.57 sec   101.6%
               200               20.10 sec   99.0%
               100               24.20 sec   82.2%
64×16 field    Traditional (∞)   9.92 sec    100.0%
               1200              9.90 sec    100.3%
               800               9.91 sec    100.1%
               500               9.49 sec    104.6%
               200               10.13 sec   98.0%
               100               13.03 sec   76.1%

Strassen's inversion, however, brings considerable acceleration: 8-18% in the real-life applications (Table IV), because inversion gains much more from the new algorithm than multiplication does.

TABLE IV
Effect of Strassen's inversion on the speed

Test case      Min. block size   Runtime      Speed ratio
n×n matrix     Traditional (∞)   129.42 sec   100.0%
               1200              29.15 sec    444.0%
               800               23.12 sec    559.7%
               500               23.08 sec    560.8%
               200               23.10 sec    560.4%
               100               23.04 sec    561.6%
1024×1 field   Traditional (∞)   21.17 sec    100.0%
               1200              21.27 sec    99.5%
               800               19.48 sec    108.6%
               500               19.57 sec    108.2%
               200               19.58 sec    108.1%
               100               19.55 sec    108.3%
64×16 field    Traditional (∞)   11.24 sec    100.0%
               1200              11.14 sec    100.9%
               800               9.67 sec     116.2%
               500               9.49 sec     118.4%
               200               9.53 sec     117.9%
               100               9.58 sec     117.3%

AC simulation is still under construction in Vector SUNRED, but the complex matrix multiplication method is ready. The complex multiplication takes about three times longer than the same-sized real multiplication (Table V).

TABLE V
Real (double precision) and complex matrix multiplication run times

Test case    Real runtime   Real time ratio   Complex runtime   Complex time ratio
n×n matrix   18.92 sec      100.0%            56.15 sec         296.8%
m×o matrix   8.84 sec       100.0%            25.99 sec         294.1%

V. CONCLUSIONS

Summarizing the developments in the SUNRED algorithm, we can state that the major speed increase is the result of multithreading. Nowadays the main development direction of processor manufacturers is raising the core count, so SUNRED will be able to take even better advantage of this trend in the future. In a few years GPUs will be integrated into CPUs (AMD Fusion [10], Intel Larrabee [11]), which poses a new challenge for the future development of the SUNRED algorithm.

Although each of the other algorithmic changes results in only a few percent of gain, the net effect of these small gains is about a 25-60% boost in speed. The more layers in the model, the higher the gain in speed achieved, which is advantageous for complex architectures.

ACKNOWLEDGMENT

This work is based on prof. Dr. Vladimir Székely's original SUNRED field solver algorithm [7]. I would like to thank him for his guidance, recommendations and ideas. Thanks to Dr. András Poppe for his recommendations and his help with the presentation, and to prof. Dr. Márta Rencz for her support in the publication of this article.

REFERENCES

[1] L. Pohl, V. Székely: "A more flexible realization of the SUNRED algorithm", THERMINIC Workshop, 27-29 September 2006, Nice, France, Proceedings, 2006.
[2] D. R. Butenhof: Programming with POSIX Threads, Addison-Wesley, 1997, ISBN 0-201-63392-2.
[3] M. S. Lam, E. E. Rothberg, M. E. Wolf: "The Cache Performance and Optimizations of Blocked Algorithms", ASPLOS IV, 1991.
[4] M. Frigo, C. E. Leiserson, H. Prokop, S. Ramachandran: "Cache-oblivious algorithms", Proceedings of the 40th Annual Symposium on Foundations of Computer Science, pp. 285-297, New York, October 1999.
[5] V. Strassen: "Gaussian elimination is not optimal", Numerische Mathematik, vol. 13, pp. 354-356, 1969.
[6] W. H. Press et al.: Numerical Recipes in C: The Art of Scientific Computing, Cambridge University Press, 1992, pp. 102-104, p. 177, ISBN 0-521-43720-2.
[7] V. Székely: "SUNRED: a new thermal simulator and typical applications", 3rd THERMINIC Workshop, 21-23 September 1997, Cannes, France, pp. 84-90, 1997.
[8] Hyper-Threading Technology, Intel Corporation, http://www.intel.com/technology/platform-technology/hyper-threading/, retrieved 21 August 2008.
[9] D. Coppersmith, S. Winograd: "Matrix multiplication via arithmetic progressions", J. Symbolic Computation, vol. 9, pp. 251-280, 1990.
[10] C. Kowaliski: "AMD's 2007 analyst day: Platforms and the glass half full", http://techreport.com/articles.x/13792/2, retrieved 21 August 2008.
[11] L. Seiler et al.: "Larrabee: A Many-Core x86 Architecture for Visual Computing", ACM Transactions on Graphics, vol. 27, no. 3, Article 18, August 2008.

KEYWORDS

SUNRED, multithreading, Strassen, FDM, field simulation


BIOGRAPHY

László Pohl is a research assistant and PhD student at the Department of Electron Devices, Budapest University of Technology and Economics. He teaches programming in C and C++ to students of informatics, electronic engineering and engineering physics. Research areas: application of the Finite Differences Method; simulation; thermal and electro-thermal investigations; OLEDs; computer architectures.
