Distributed Multi-Core Processors for Accelerating Dense Linear Algebra

Contents
 Motivation and Objective
 Contributions
 Introduction
 Evaluating Linear Algebra Routines on OOO Superscalar Processors
 Exploiting DLP to Accelerate Linear Algebra Routines
 Multi-Threaded-SIMD Implementation/Evaluation of Dense Linear Algebra
 Distributed Multi-Core Intel Processors for Accelerating Dense Linear Algebra
 Conclusion

Motivation and Objective
 Increasingly sophisticated mathematical models are used in many applications in signal processing, data mining, statistics, multimedia, etc.
 These models take long computation times and require large amounts of memory.
 Processing these models with traditional serial computing (a single processor) is inefficient.
 Parallel processing is therefore the key to reducing execution time.
 Moreover, current computational demands can only be satisfied by parallel and distributed architectures such as multiprocessor and multicomputer systems.
 The main objective of this work is thus to exploit all forms of parallelism (ILP, DLP, TLP, and distributed systems) to improve application performance.

Contributions
 The main contributions of this work are as follows:
 Evaluating the execution of linear algebra routines from the BLAS and SVD on the Intel Xeon E5410 processor, which exploits ILP implicitly using pipelining, out-of-order (OOO) execution, and superscalar techniques.
 Exploiting DLP to accelerate linear algebra routines on the Intel Xeon E5410 processor using SIMD instructions.
 Implementing and evaluating a multi-threaded-SIMD version of dense linear algebra routines using both multi-threading and SIMD techniques on the Intel Xeon E5410 processor.
 Accelerating dense linear algebra routines on distributed multi-core Intel Xeon E5410 processors by exploiting all forms of parallelism.

Software Applications
 Scientific and engineering research depends on the development and implementation of efficient parallel algorithms on modern high-performance computers.
 Linear algebra (in particular, the solution of linear systems of equations) lies at the heart of most calculations in scientific computing.
 There are three common levels of linear algebra operations:
 Level-1: vector-vector operations (SAXPY, dot product, vector scaling, ...), which involve (for vectors of length n) O(n) data and O(n) operations.
 Level-2: matrix-vector operations (matrix-vector multiplication and rank-1 update), which involve O(n²) operations on O(n²) data.
 Level-3: matrix-matrix operations (matrix-matrix multiplication), which involve O(n³) operations on O(n²) data.
 These three levels of operations are called the BLAS: Basic Linear Algebra Subprograms. A minimal sketch of one routine from each level is shown below.
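To make the three levels concrete, here is a minimal C sketch of one representative routine per level (flat row-major layout; function names and signatures are illustrative, not taken from any particular BLAS implementation):

#include <stddef.h>

/* Level-1: y = alpha*x + y (O(n) data, O(n) FLOPs) */
void saxpy(size_t n, float alpha, const float *x, float *y) {
    for (size_t i = 0; i < n; i++)
        y[i] += alpha * x[i];
}

/* Level-2: y = A*x + y, A is n-by-n row-major (O(n^2) data, O(n^2) FLOPs) */
void gemv(size_t n, const float *A, const float *x, float *y) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            y[i] += A[i*n + j] * x[j];
}

/* Level-3: C = A*B + C, all n-by-n row-major (O(n^2) data, O(n^3) FLOPs) */
void gemm(size_t n, const float *A, const float *B, float *C) {
    for (size_t i = 0; i < n; i++)
        for (size_t k = 0; k < n; k++)
            for (size_t j = 0; j < n; j++)
                C[i*n + j] += A[i*n + k] * B[k*n + j];
}

The growing ratio of operations to data (O(n) vs O(n²) vs O(n³)) is what makes the higher levels progressively friendlier to caches and parallel hardware.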

 As a representative example of dense matrix factorization, the SVD is considered.

Switching from Sequential to Parallel Processing
Designing for Parallel Processing
[Figure: three decomposition strategies — functional decomposition, producer/consumer decomposition, and data decomposition]

Forms of Parallelism
[Figure: the four forms of parallelism — ILP, DLP, TLP, and distributed systems]

Instruction-Level Parallelism (ILP)
[Figures: ILP techniques — pipelining (simple scalar pipeline), super-pipelining (scalar super-pipeline), superscalar processors, and VLIW]

Thread-Level Parallelism (TLP)
[Figure: multi-core processors — a dual-core processor as an example]
[Figure: the multi-threading technique and its challenges — load balance, scalability, and synchronization (spin-wait, events, mutexes, interlocked functions, and critical sections)]

Data-Level Parallelism (DLP)
[Figures: SIMD processing and the Intel SIMD extensions — MMX, SSE, and AVX]

Distributed System: Message Passing Interface (MPI)
[Figures: a root node and three slave nodes communicating through MPI — point-to-point communication and collective communication]

Execution Environment
 System: cluster of Fujitsu Siemens computers
 Processor: Intel Xeon E5410 CPU running at 2.33 GHz
 Cores: 4 cores per processor running 4 threads (1 thread per core)
 L1 cache: 32 KB data cache and 32 KB instruction cache per core
 L2 cache: 12 MB shared cache
 Memory: 4 GB RAM
 Architecture: Intel 64, compatible with IA-32 software
 OS: Windows 7 Ultimate
 Compiler: MS Visual Studio 2012 Ultimate
 MPI: MPICH2 library

Evaluating Linear Algebra Routines on OOO Superscalar Processors
Level-1 BLAS (notation: FLOPs / (loads, stores)):
1- Apply Givens rotation (AGR): (xᵢ, yᵢ) = (c·xᵢ + s·yᵢ, −s·xᵢ + c·yᵢ), 1 ≤ i ≤ n — 6n / (2n, 2n)
2- Norm-2: ‖x‖₂ = √(Σᵢ xᵢ²) — 2n / (n, 1)
3- Dot product: x·y = Σᵢ xᵢ·yᵢ — 2n / (2n, 1)
4- SAXPY: yᵢ = α·xᵢ + yᵢ, 1 ≤ i ≤ n — 2n / (2n, n)
5- Vec-Scal Mul: yᵢ = α·xᵢ, 1 ≤ i ≤ n — n / (n, n)
[Figure: measured performance of the five routines, ranging from 0.8 to 2.1 GFLOPs/sec]

Evaluating Linear Algebra Routines on OOO Superscalar Processors
Level-2 BLAS:
1- Matrix-vector multiplication: yᵢ = yᵢ + A(i, j)·xⱼ, where 1 ≤ i, j ≤ n — 2n² FLOPs / (2n² + 2n) memory operations
2- Rank-1 update: A(i, j) = A(i, j) + xᵢ·yⱼ, where 1 ≤ i, j ≤ n — 2n² FLOPs / (3n² + n) memory operations
[Figure: measured performance of about 1.6 and 1.1 GFLOPs/sec]
A C sketch of the rank-1 update follows.
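A minimal C sketch of the rank-1 update A = A + x·yᵀ (row-major storage; naming and layout are illustrative):

#include <stddef.h>

/* Rank-1 update: A(i,j) += x[i] * y[j], A is n-by-n row-major.
   2n^2 FLOPs. A is both read and written (2n^2 memory operations)
   and y is re-read for every row (n^2), plus x once (n),
   which accounts for the 3n^2 + n memory operations above. */
void rank1_update(size_t n, float *A, const float *x, const float *y) {
    for (size_t i = 0; i < n; i++) {
        float xi = x[i];              /* reuse x[i] across the whole row */
        for (size_t j = 0; j < n; j++)
            A[i*n + j] += xi * y[j];
    }
}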

Evaluating Linear Algebra Routines on OOO Superscalar Processors
Level-3 BLAS — Matrix-matrix multiplication (MM-Mul):
1- Traditional algorithm of MM-Mul, C = A × B (2n³ FLOPs):

for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
        for (k = 0; k < p; k++)
            C[i][j] = C[i][j] + A[i][k] * B[k][j];

Six variants are obtained by loop exchange: ijk, ikj, jik, jki, kij, and kji.

[Figure: GFLOPs/sec of the six loop-order variants (ijk, ikj, jik, jki, kij, kji) for small and large matrix sizes, with the ikj variant highlighted; values reach about 2.5 GFLOPs/sec]

Evaluating Linear Algebra Routines on OOO Superscalar Processors
Level-3 BLAS — Matrix-matrix multiplication (MM-Mul):
2- Strassen's algorithm of MM-Mul, C = A × B (O(n^2.807) FLOPs):

M1 = (A11 + A22)(B11 + B22)
M2 = (A21 + A22) B11
M3 = A11 (B12 − B22)
M4 = A22 (B21 − B11)
M5 = (A11 + A12) B22
M6 = (A21 − A11)(B11 + B12)
M7 = (A12 − A22)(B21 + B22)

C11 = M1 + M4 − M5 + M7
C12 = M3 + M5
C21 = M2 + M4
C22 = M1 − M2 + M3 + M6

A scalar sanity check of these formulas follows.
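As a sanity check on these formulas, this self-contained C snippet (illustrative, not from the original work) applies one level of Strassen's recursion to 2×2 scalar blocks and compares the result with the direct product:

#include <stdio.h>

int main(void) {
    /* One Strassen level on 2x2 scalar blocks: 7 multiplications instead of 8 */
    double A11 = 1, A12 = 2, A21 = 3, A22 = 4;
    double B11 = 5, B12 = 6, B21 = 7, B22 = 8;

    double M1 = (A11 + A22) * (B11 + B22);
    double M2 = (A21 + A22) * B11;
    double M3 = A11 * (B12 - B22);
    double M4 = A22 * (B21 - B11);
    double M5 = (A11 + A12) * B22;
    double M6 = (A21 - A11) * (B11 + B12);
    double M7 = (A12 - A22) * (B21 + B22);

    double C11 = M1 + M4 - M5 + M7;
    double C12 = M3 + M5;
    double C21 = M2 + M4;
    double C22 = M1 - M2 + M3 + M6;

    /* Direct 2x2 product for comparison: both print 19 22 43 50 */
    printf("Strassen: %g %g %g %g\n", C11, C12, C21, C22);
    printf("Direct:   %g %g %g %g\n",
           A11*B11 + A12*B21, A11*B12 + A12*B22,
           A21*B11 + A22*B21, A21*B12 + A22*B22);
    return 0;
}

Applied recursively to matrix blocks instead of scalars, these 7 multiplications per level yield the O(n^2.807) operation count.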

Evaluating Linear Algebra Routines on OOO Superscalar Processors
Level-3 BLAS — 2- Strassen's algorithm of MM-Mul (O(n^2.807) FLOPs):
[Figure: execution time of the traditional (Trad) and Strassen (Str) algorithms for small and large matrix sizes]

[Figure: performance in GFLOPs/sec for small and large matrix sizes, reaching about 3 GFLOPs/sec]

Evaluating Linear Algebra Routines on OOO Superscalar Processors
Singular Value Decomposition (SVD): Aₙₓₙ = Uₙₓₙ Σₙₓₙ Vᵀₙₓₙ
• Uₙₓₙ and Vₙₓₙ are orthogonal matrices (i.e., UᵀU = Iₙ and VᵀV = Iₙ); U holds the left singular vectors and V the right singular vectors.
• Σₙₓₙ is a diagonal matrix diag(σ₁, σ₂, σ₃, …, σₙ) of singular values.
• A·vᵢ = σᵢ·uᵢ and Aᵀ·uᵢ = σᵢ·vᵢ

Evaluating Linear Algebra Routines on OOO Superscalar Processors
Singular Value Decomposition (SVD): Aₙₓₙ = Uₙₓₙ Σₙₓₙ Vᵀₙₓₙ
• One-sided Jacobi (OSJ): a plane rotation J is applied to a pair of rows (uᵢ, uⱼ):
  vᵢ = c·uᵢ − s·uⱼ and vⱼ = s·uᵢ + c·uⱼ,
  where c and s of J are chosen such that vᵢ·vⱼᵀ = 0 (the rotated rows become orthogonal).
• For an n×n matrix, there are n(n−1)/2 Jacobi transformations in one sweep, which is the number of row-pairs (aᵢ, aⱼ), i ≠ j.
• 6n²(n−1) is the total number of FLOPs needed for one sweep of the OSJ algorithm.
A C sketch of one OSJ rotation follows.
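A compact C sketch of one OSJ rotation on a row pair, using the standard Hestenes/Rutishauser formulas for c and s (the row-major layout and names are illustrative assumptions):

#include <math.h>
#include <stddef.h>

/* Orthogonalize rows p and q of the n-by-n row-major matrix U in place.
   c and s are chosen so that the rotated rows have zero dot product. */
void osj_rotate(size_t n, double *U, size_t p, size_t q) {
    double app = 0.0, aqq = 0.0, apq = 0.0;
    for (size_t k = 0; k < n; k++) {          /* row norms and dot product */
        app += U[p*n + k] * U[p*n + k];
        aqq += U[q*n + k] * U[q*n + k];
        apq += U[p*n + k] * U[q*n + k];
    }
    if (apq == 0.0) return;                   /* rows already orthogonal */

    double zeta = (aqq - app) / (2.0 * apq);  /* stable tangent formula */
    double t = (zeta >= 0.0 ? 1.0 : -1.0) / (fabs(zeta) + sqrt(1.0 + zeta*zeta));
    double c = 1.0 / sqrt(1.0 + t*t);
    double s = c * t;

    for (size_t k = 0; k < n; k++) {          /* apply the rotation to both rows */
        double up = U[p*n + k], uq = U[q*n + k];
        U[p*n + k] = c*up - s*uq;
        U[q*n + k] = s*up + c*uq;
    }
}

The two scans of rows p and q (dot products plus update) cost about 12n FLOPs per rotation, which with n(n−1)/2 rotations gives the 6n²(n−1) FLOPs per sweep quoted above.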

Evaluating Linear Algebra Routines on OOO Superscalar Processors
Singular Value Decomposition (SVD):
[Figure: OSJ performance on the OOO superscalar processor — up to 6.2 GFLOPs/sec]

Exploiting DLP to Accelerate Linear Algebra Routines
Level-1 BLAS:
[Figure: SIMD performance up to 8.1 GFLOPs/sec; speedups of 3.8 and 3.6 against an ideal speedup of 4, i.e., about 96% and 89.6% of ideal]

Exploiting DLP to Accelerate Linear Algebra Routines
Level-2 BLAS:
[Figure: SIMD performance up to 6.3 GFLOPs/sec; speedups of 3.9 and 3.1 against an ideal speedup of 4 (96.5% and 78.6%)]
Blocking technique — reduces the memory operations, as sketched below:
 For MV-Mul: from (2n² + 2n) to (n² + 3n)
 For the rank-1 update: from (3n² + n) to (2n² + 2n)
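A sketch of how blocking cuts the Level-2 memory traffic: processing two rows of A per pass reuses each loaded x[j] across both rows. This is a hypothetical illustration of the idea; the thesis's exact blocking factors are not reproduced here:

#include <stddef.h>

/* Matrix-vector multiplication with 2-way row blocking, n assumed even.
   Each x[j] is loaded once per pair of rows instead of once per row,
   and the two partial sums stay in registers until the rows finish. */
void gemv_blocked(size_t n, const float *A, const float *x, float *y) {
    for (size_t i = 0; i < n; i += 2) {
        float sum0 = y[i], sum1 = y[i + 1];
        for (size_t j = 0; j < n; j++) {
            float xj = x[j];              /* one load of x[j], two uses */
            sum0 += A[i*n + j]     * xj;
            sum1 += A[(i+1)*n + j] * xj;
        }
        y[i]     = sum0;
        y[i + 1] = sum1;
    }
}

Larger row blocks reuse each x[j] more times, moving the traffic toward the n² + 3n bound quoted above (A streamed once, x and y touched O(n) times).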

Exploiting DLP to Accelerate Linear Algebra Routines
Level-3 BLAS: traditional MM-Mul
[Figure: applying the SIMD technique to the traditional MM-Mul]
[Figure: applying the SIMD and matrix-blocking techniques together on the traditional MM-Mul; a hedged intrinsics sketch follows]
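For the ikj variant, the inner loop maps naturally onto SSE: A[i][k] is broadcast and multiplied against four consecutive elements of row k of B. A minimal sketch with SSE intrinsics (assuming n is a multiple of 4 and row-major float matrices; not necessarily the exact kernel of the thesis):

#include <xmmintrin.h>   /* SSE intrinsics, supported by the Xeon E5410 */
#include <stddef.h>

/* C = C + A*B, all n-by-n row-major floats, n a multiple of 4 */
void mm_mul_sse(size_t n, const float *A, const float *B, float *C) {
    for (size_t i = 0; i < n; i++) {
        for (size_t k = 0; k < n; k++) {
            __m128 a = _mm_set1_ps(A[i*n + k]);        /* broadcast A[i][k] */
            for (size_t j = 0; j < n; j += 4) {
                __m128 b = _mm_loadu_ps(&B[k*n + j]);  /* 4 floats of row k of B */
                __m128 c = _mm_loadu_ps(&C[i*n + j]);
                c = _mm_add_ps(c, _mm_mul_ps(a, b));   /* c += a*b, 4 lanes at once */
                _mm_storeu_ps(&C[i*n + j], c);
            }
        }
    }
}

Combining this kernel with cache blocking (multiplying sub-matrices that fit in L1/L2) is what the SIMD_Blocking results below measure.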

Exploiting DLP to Accelerate Linear Algebra Routines
Level-3 BLAS: traditional MM-Mul
[Figures: GFLOPs/sec and speedup over the sequential version (Seq) for SIMD and SIMD+blocking, on small and large matrix sizes — peaks of 10.3 and 7.2 GFLOPs/sec, with speedups of 5.4 and 3.8]

Exploiting DLP to Accelerate Linear Algebra Routines
Level-3 BLAS: Strassen MM-Mul
[Figures: GFLOPs/sec and speedup over the sequential version for SIMD and SIMD+blocking, on small and large matrix sizes — peaks of 13.1 and 8.8 GFLOPs/sec, with speedups of 5.2 and 3.5]

Exploiting DLP to Accelerate Linear Algebra Routines
Singular Value Decomposition (SVD):
[Figures: performance of 23.3 GFLOPs/sec due to using SIMD, and a speedup of 3.7 over the single-threaded OSJ against an ideal speedup of 4]

Multi-Threaded-SIMD Implementation/Evaluation of Dense Linear Algebra
Level-1 BLAS:
[Figure: partitioning the vectors of the apply Givens rotation, as an example of Level-1 BLAS, among four threads — Thread1 to Thread4 each take a contiguous chunk of vectors x and y; a hedged Win32 sketch follows]
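A minimal Win32 sketch of this partitioning (applied to SAXPY for brevity; the thread count and the even division of n are illustrative assumptions):

#include <windows.h>
#include <stddef.h>

#define NUM_THREADS 4

typedef struct {               /* each thread's contiguous chunk */
    const float *x;
    float *y;
    float alpha;
    size_t begin, end;
} ChunkArg;

static DWORD WINAPI saxpy_worker(LPVOID p) {
    ChunkArg *a = (ChunkArg *)p;
    for (size_t i = a->begin; i < a->end; i++)
        a->y[i] += a->alpha * a->x[i];
    return 0;
}

void saxpy_mt(size_t n, float alpha, const float *x, float *y) {
    HANDLE threads[NUM_THREADS];
    ChunkArg args[NUM_THREADS];
    size_t chunk = n / NUM_THREADS;          /* assume NUM_THREADS divides n */

    for (int t = 0; t < NUM_THREADS; t++) {
        args[t].x = x; args[t].y = y; args[t].alpha = alpha;
        args[t].begin = (size_t)t * chunk;
        args[t].end   = (t == NUM_THREADS - 1) ? n : (size_t)(t + 1) * chunk;
        threads[t] = CreateThread(NULL, 0, saxpy_worker, &args[t], 0, NULL);
    }
    /* block until all four workers finish */
    WaitForMultipleObjects(NUM_THREADS, threads, TRUE, INFINITE);
    for (int t = 0; t < NUM_THREADS; t++)
        CloseHandle(threads[t]);
}

Because the chunks are disjoint, no synchronization is needed inside the loop; the only synchronization point is the final wait.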

Multi-Threaded-SIMD Implementation/Evaluation of Dense Linear Algebra
Level-1 BLAS:
[Figure: multi-threaded-SIMD results for the Level-1 BLAS — reported peak values of 4.6, 1.8, and 2.5]

Multi-Threaded-SIMD Implementation/Evaluation of Dense Linear Algebra
Level-2 BLAS:
[Figure: partitioning the matrix-vector multiplication and the rank-1 update among threads]
[Figure: multi-threaded-SIMD Level-2 results — reported peak values of 3.3, 2.0, and 2.0]

Multi-Threaded-SIMD Implementation/Evaluation of Dense Linear Algebra
Level-3 BLAS: traditional MM-Mul
[Figure: the matrices are partitioned among the threads, and each thread uses the SIMD and matrix-blocking techniques]

Multi-Threaded-SIMD Implementation/Evaluation of Dense Linear Algebra
Level-3 BLAS: traditional MM-Mul
[Figures: GFLOPs/sec and speedup over Seq for Thrd, SIMD, Thrd_SIMD, SIMD_Blocking, and Thrd_SIMD_Blocking on small and large matrix sizes — peak of 24.2 GFLOPs/sec and maximum speedup of 12.6; the ideal speedup is 16 (4 cores × 4-wide SIMD)]

Multi-Threaded-SIMD Implementation/Evaluation of Dense Linear Algebra
Level-3 BLAS: Strassen MM-Mul
[Figure: partitioning the tasks of Strassen's algorithm among 4 threads]

Multi-Threaded-SIMD Implementation/Evaluation of Dense Linear Algebra
Level-3 BLAS: Strassen MM-Mul
[Figures: GFLOPs/sec and speedup over Seq for the threaded/SIMD/blocked variants on small and large matrix sizes — peak of 30.0 GFLOPs/sec and maximum speedup of 11.9; the ideal speedup is 16]

Multi-Threaded-SIMD Implementation/Evaluation of Dense Linear Algebra
Singular Value Decomposition (SVD) — Block Jacobi (BJ):
[Figure: the Block Jacobi algorithm on the Intel Xeon quad-core processor; a round-robin scheduling sketch follows]
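Block Jacobi sweeps must visit every pair of blocks; a common way to schedule the pairs in parallel is the round-robin (Brent-Luk) ordering, in which each step pairs all blocks at once. A small C sketch that prints the pairings (illustrative; the thesis's exact ordering may differ):

#include <stdio.h>

#define NB 8   /* number of blocks (even); NB/2 independent pairs per step */

int main(void) {
    int ring[NB];
    for (int i = 0; i < NB; i++) ring[i] = i;

    /* NB-1 steps; every unordered block pair occurs exactly once */
    for (int step = 0; step < NB - 1; step++) {
        printf("step %d:", step);
        for (int k = 0; k < NB / 2; k++)
            printf("  (%d,%d)", ring[k], ring[NB - 1 - k]);
        printf("\n");

        /* rotate all positions except ring[0] */
        int last = ring[NB - 1];
        for (int i = NB - 1; i > 1; i--) ring[i] = ring[i - 1];
        ring[1] = last;
    }
    return 0;
}

The NB/2 pairs within one step touch disjoint blocks, so they can be assigned to the four cores (or four cluster nodes) without synchronization inside the step.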

Multi-Threaded-SIMD Implementation/Evaluation of Dense Linear Algebra
Singular Value Decomposition (SVD) — Block Jacobi (BJ):
[Figures: performance in GFLOPs/sec due to using multi-threading and SIMD (46.7 and 24.2) and speedup over the single-threaded OSJ (7.5 and 3.9)]

Multi-Threaded-SIMD Implementation/Evaluation of Dense Linear Algebra
Singular Value Decomposition (SVD) — Hierarchical Block Jacobi (HBJ):
[Figure: the input matrix is partitioned into blocks of super-rows (SR) B0, B1, B2, …, B2P]

Multi-Threaded-SIMD Implementation/Evaluation of Dense Linear Algebra
Singular Value Decomposition (SVD) — Hierarchical Block Jacobi (HBJ):
[Figure: performance of 73.9 GFLOPs/sec due to using multi-threading and SIMD, a 17.8× speedup over the single-threaded OSJ]

Distributed Multi-Core Processors for Accelerating Dense Linear Algebra
Level-1 BLAS — amount of data sent to / received from each node:

Subroutine                     Sending data     Receiving data
Apply Givens rotation          2 sub-vectors    2 sub-vectors
Dot product                    2 sub-vectors    1 scalar value
SAXPY                          2 sub-vectors    1 sub-vector
Norm-2                         1 sub-vector     1 scalar value
Vector-scalar multiplication   1 sub-vector     1 sub-vector

[Figure: reported values of 1.0 and 0.002, measured on large vector lengths — 2,000,000 for AGR, dot product, and SAXPY, and 3,500,000 for Norm-2 and Vec-Scal]
A hedged MPI sketch of the dot product follows.
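A minimal MPICH-style sketch of the distributed dot product: the root scatters sub-vectors to every node (the 2 sub-vectors sent per node in the table) and reduces the partial sums back to a single scalar (the 1 scalar received). The vector length being divisible by the number of processes is an assumption of the sketch:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n = 2000000;           /* large vector length from the experiments */
    int local_n = n / size;          /* assume size divides n */
    double *x = NULL, *y = NULL;

    if (rank == 0) {                 /* the root owns the full vectors */
        x = malloc(n * sizeof(double));
        y = malloc(n * sizeof(double));
        for (int i = 0; i < n; i++) { x[i] = 1.0; y[i] = 2.0; }
    }
    double *lx = malloc(local_n * sizeof(double));
    double *ly = malloc(local_n * sizeof(double));

    /* send 2 sub-vectors to each node */
    MPI_Scatter(x, local_n, MPI_DOUBLE, lx, local_n, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Scatter(y, local_n, MPI_DOUBLE, ly, local_n, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    double local = 0.0, dot = 0.0;
    for (int i = 0; i < local_n; i++)
        local += lx[i] * ly[i];

    /* receive 1 scalar value at the root */
    MPI_Reduce(&local, &dot, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) { printf("dot = %f\n", dot); free(x); free(y); }
    free(lx); free(ly);

    MPI_Finalize();
    return 0;
}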

Distributed Multi-Core Processors for Accelerating Dense Linear Algebra
Level-1 BLAS:
[Figure: 4.5×10⁻² GFLOPs/sec on the cluster versus a 2.1 GFLOPs/sec maximum on a single node]
[Figure: 1.76×10⁻¹ GFLOPs/sec on the cluster versus an 8.1 GFLOPs/sec maximum on a single node]
[Figure: 9.0×10⁻² GFLOPs/sec on the cluster versus a 3.0 GFLOPs/sec maximum on a single node]
[Figure: 9.8×10⁻² GFLOPs/sec on the cluster versus a 3.6 GFLOPs/sec maximum on a single node]
Communication costs dominate the O(n) computation, so the Level-1 routines slow down on the cluster.

Distributed Multi-Core Processors for Accelerating Dense Linear Algebra
Level-2 BLAS — amount of data sent to / received from each node:

Subroutine                     Sending data                   Receiving data
Matrix-vector multiplication   1 sub-matrix, 1 sub-vector     1 sub-vector
Rank-1 update                  1 sub-matrix, 1 sub-vector     1 sub-matrix

[Figure: reported values of 33.8, 1.0×10⁻¹, 8.7×10⁻², and 16.9 on large matrix size 10000×10000]

Distributed Multi-Core Processors for Accelerating Dense Linear Algebra
Level-2 BLAS:
[Figure: 3.2×10⁻² GFLOPs/sec on the cluster versus a 1.6 GFLOPs/sec maximum on a single node]
[Figure: 1.2×10⁻¹ GFLOPs/sec on the cluster versus a 6.0 GFLOPs/sec maximum on a single node]
[Figure: 4.1×10⁻² GFLOPs/sec on the cluster versus a 2.0 GFLOPs/sec maximum on a single node]
[Figure: 5.3×10⁻² GFLOPs/sec on the cluster versus a 2.5 GFLOPs/sec maximum on a single node]

Distributed Multi-Core Processors for Accelerating Dense Linear Algebra
Level-3 BLAS: traditional MM-Mul
[Figure: the execution of the traditional matrix-matrix multiplication algorithm using MPI on 4 computers; a hedged sketch of this distribution follows]
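A minimal MPI sketch of one such distribution: the root broadcasts B, scatters row-blocks of A, and gathers row-blocks of C. Row-block partitioning, n divisible by the number of nodes, and B pre-allocated on every rank are assumptions of the sketch, not necessarily the thesis's exact scheme:

#include <mpi.h>
#include <stdlib.h>

/* C = A*B distributed by row-blocks of A; all matrices n-by-n row-major.
   A and C are significant only at the root; B must be allocated everywhere. */
void mm_mul_mpi(int n, const float *A, float *B, float *C) {
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int rows = n / size;                      /* assume size divides n */
    float *lA = malloc((size_t)rows * n * sizeof(float));
    float *lC = calloc((size_t)rows * n, sizeof(float));

    /* every node needs all of B; each node gets its own row-block of A */
    MPI_Bcast(B, n * n, MPI_FLOAT, 0, MPI_COMM_WORLD);
    MPI_Scatter(A, rows * n, MPI_FLOAT, lA, rows * n, MPI_FLOAT, 0, MPI_COMM_WORLD);

    for (int i = 0; i < rows; i++)            /* local ikj multiply */
        for (int k = 0; k < n; k++)
            for (int j = 0; j < n; j++)
                lC[i*n + j] += lA[i*n + k] * B[k*n + j];

    /* the root collects the row-blocks of C */
    MPI_Gather(lC, rows * n, MPI_FLOAT, C, rows * n, MPI_FLOAT, 0, MPI_COMM_WORLD);
    free(lA); free(lC);
}

Within each node, the local multiply can in turn use the multi-threaded SIMD-blocked kernel, which is how all forms of parallelism are combined.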

Distributed Multi-Core Processors for Accelerating Dense Linear Algebra
Level-3 BLAS: traditional MM-Mul
[Figures: maximum performance at large matrix size 10000×10000 on 1, 2, 4, 8, and 10 nodes — peak values of 104 and 99.7]

Distributed Multi-Core Processors for Accelerating Dense Linear Algebra
Level-3 BLAS: Strassen MM-Mul
[Figure: the execution of Strassen's algorithm using MPI on 7 computers, matching the seven Strassen products M1 through M7]

Distributed Multi-Core Processors for Accelerating Dense Linear Algebra
Level-3 BLAS: Strassen MM-Mul
[Figure: maximum performance at large matrix size 10000×10000 on 1, 2, 4, and 7 nodes — peak values of 50.7 and 67.7]

Distributed Multi-Core Processors for Accelerating Dense Linear Algebra
Singular Value Decomposition (SVD) — Block Jacobi (BJ):
[Figure: the blocks held on four computers at each step of the parallel Block Jacobi algorithm]
[Figure: sending (S) and receiving (R) of blocks between the four computers in step 1 (S1) and step 2 (S2) of the Block Jacobi algorithm]

Distributed Multi-Core Processors for Accelerating Dense Linear Algebra
Singular Value Decomposition (SVD) — Block Jacobi (BJ):
[Figure: performance of 207 GFLOPs/sec due to using multi-threading and SIMD at large matrix size 10000×10000, a 49.8× speedup over the single-threaded OSJ]

Distributed Multi-Core Processors for Accelerating Dense Linear Algebra
Singular Value Decomposition (SVD) — Hierarchical Block Jacobi (HBJ):
[Figure: performance of 515 GFLOPs/sec due to using multi-threading and SIMD at large matrix size 10000×10000, a 124× speedup over the single-threaded OSJ]

Conclusion
 Dense linear algebra is used to highlight the most important factors that must be considered when designing software applications for a cluster of multi-core Intel processors.
 Exploiting all forms of parallelism is the only way to significantly improve performance.
 On a single computer, the maximum speedups of the BLAS and SVD routines due to using the SIMD and multi-threading techniques are as follows:

Level-1     AGR    Norm-2  DotProd  SAXPY  VecScal
SIMD        3.84   3.71    3.46     3.29   3.66
Thrd        1.43   2.10    1.42     1.65   1.78
ThrdSIMD    1.73   2.52    1.93     2.07   2.28

Level-2     SIMD   SIMDB  Thrd   ThrdSIMD  ThrdSIMDB
MV-Mul      3.67   3.86   1.25   1.54      2.03
Rank-1      3.54   3.78   1.27   1.56      2.10

Level-3       SIMD   Thrd   ThrdSIMD  ThrdSIMDB
Trad MM-Mul   1.46   2.20   4.27      12.00
Str MM-Mul    3.50   3.73   7.25      11.86

SVD    SIMD   Thrd   ThrdSIMD
BJ     3.73   3.88   7.50
HBJ    8.11   9.10   17.78

Conclusion
 On a cluster of multi-core Intel processors:
 the performance of Level-1 BLAS slows down,
 the performance of Level-2 BLAS slows down,
 the performance of Level-3 BLAS speeds up.
 The maximum results due to using all forms of parallelism are:
 traditional MM-Mul: 104, and Strassen MM-Mul: 67.7 (peak values at 10000×10000);
 BJ: a 49.8× speedup on 8 nodes at 10000×10000;
 HBJ: a 124× speedup on 8 nodes at 10000×10000.

Future Work
 Implement and evaluate the performance of dense linear algebra algorithms on advanced architectures such as the Intel Xeon Phi.
 Restructure dense linear algebra algorithms to exploit all forms of parallelism and the memory hierarchy.
 Re-implement and evaluate the BJ and HBJ algorithms after replacing the round-robin ordering with another scheme that further exploits the memory hierarchy.
 Use graphics processing units (GPUs) as another form of parallelism to improve the performance of the parallel algorithms on a cluster of computers.
 Exploit cloud computing technology to execute and evaluate new implementations of the parallel algorithms.

