Contents
- Motivation and Objective
- Contributions
- Introduction
- Evaluating Linear Algebra Routines on OOO Superscalar Processors
- Exploiting DLP to Accelerate Linear Algebra Routines
- Multi-Threaded-SIMD Implementation/Evaluation of Dense Linear Algebra
- Distributed Multi-Core Intel Processors for Accelerating Dense Linear Algebra
- Conclusion
Motivation and Objective
Increasingly sophisticated mathematical models are used in many applications in signal processing, data mining, statistics, multimedia, etc. These models require long computation times and large amounts of memory, so processing them with traditional serial computing (a single processor) is inefficient. Parallel processing is therefore the key to reducing execution time. Moreover, current computational demands can only be satisfied by parallel and distributed architectures such as multiprocessor and multicomputer systems. The main objective of this work is thus to exploit all forms of parallelism (ILP, DLP, TLP, and distributed systems) to improve application performance.
Contributions
The main contributions of this work are as follows:
- Evaluating the execution of linear algebra routines from BLAS and SVD on the Intel Xeon E5410 processor, which exploits ILP implicitly through pipelining, OOO execution, and superscalar techniques.
- Exploiting DLP to accelerate linear algebra routines on the Intel Xeon E5410 processor using SIMD instructions.
- Implementing and evaluating multi-threaded-SIMD versions of dense linear algebra routines, combining multi-threading and SIMD techniques on the Intel Xeon E5410 processor.
- Accelerating dense linear algebra routines on distributed multi-core Intel Xeon E5410 processors by exploiting all forms of parallelism.
Software Applications
Scientific and engineering research depends on the development and implementation of efficient parallel algorithms on modern high-performance computers. Linear algebra (in particular, the solution of linear systems of equations) lies at the heart of most calculations in scientific computing. There are three common levels of linear algebra operations, collectively called BLAS (basic linear algebra subprograms):
- Level 1: vector-vector operations (SAXPY, dot-product, vector scaling, ...), which involve (for vectors of length n) O(n) data and O(n) operations.
- Level 2: matrix-vector operations (matrix-vector multiplication and rank-one update), which involve O(n²) operations on O(n²) data.
- Level 3: matrix-matrix operations (matrix-matrix multiplication), which involve O(n³) operations on O(n²) data.
As a representative example of dense matrix factorization, SVD is considered. Untuned reference loops for the three levels are sketched below.
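To make the three levels concrete, here is a minimal C sketch of one routine per level (reference loops only, not the tuned kernels evaluated later; the function name and the use of C99 variable-length array parameters are illustrative):

    /* Illustrative (untuned) loops for the three BLAS levels. */
    void blas_levels(int n, float a,
                     float x[n], float y[n],
                     float A[n][n], float B[n][n], float C[n][n]) {
        /* Level 1: SAXPY -- O(n) operations on O(n) data */
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];

        /* Level 2: matrix-vector multiplication -- O(n^2) operations on O(n^2) data */
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                y[i] += A[i][j] * x[j];

        /* Level 3: matrix-matrix multiplication -- O(n^3) operations on O(n^2) data */
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                for (int k = 0; k < n; k++)
                    C[i][j] += A[i][k] * B[k][j];
    }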
Switching from Sequential to Parallel Processing
Designing for Parallel Processing
[Figure: decomposition strategies for parallel design — functional decomposition, producer/consumer decomposition, and data decomposition]
Forms of Parallelism
[Figure: the four forms of parallelism exploited in this work — ILP, DLP, TLP, and distributed systems]
Instruction-Level Parallelism (ILP)
[Figures (build-up over several slides): ILP techniques — pipelining (simple scalar pipeline), super-pipelining (scalar super-pipeline), superscalar execution, and VLIW]
Thread-Level Parallelism (TLP)
Multi-Core Processor
[Figure: a dual-core processor as an example of a multi-core processor]
Thread-Level Parallelism (TLP)
Multi-Threading Technique
Challenges: load balance, scalability, and synchronization.
Synchronization mechanisms: spin-wait, events, mutexes, interlocked functions, and critical sections.
A minimal threading sketch follows.
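As a minimal sketch of the multi-threading technique with the Win32 calls used in this work (CreateThread, WaitForMultipleObjects): a SAXPY loop is statically partitioned among four threads for load balance. The worker function, array sizes, and names are illustrative.

    #include <windows.h>

    #define NTHREADS 4
    #define N 1000000

    static float x[N], y[N];
    typedef struct { int lo, hi; } Range;

    DWORD WINAPI saxpy_worker(LPVOID arg) {
        Range *r = (Range *)arg;
        for (int i = r->lo; i < r->hi; i++)
            y[i] = 2.0f * x[i] + y[i];   /* each thread owns a disjoint slice */
        return 0;
    }

    int main(void) {
        HANDLE th[NTHREADS];
        Range rng[NTHREADS];
        for (int t = 0; t < NTHREADS; t++) {   /* static partition for load balance */
            rng[t].lo = t * (N / NTHREADS);
            rng[t].hi = (t + 1) * (N / NTHREADS);
            th[t] = CreateThread(NULL, 0, saxpy_worker, &rng[t], 0, NULL);
        }
        WaitForMultipleObjects(NTHREADS, th, TRUE, INFINITE);  /* join all threads */
        for (int t = 0; t < NTHREADS; t++) CloseHandle(th[t]);
        return 0;
    }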
Data-Level Parallelism (DLP)
SIMD
[Figures (build-up over several slides): Intel SIMD instruction-set extensions — MMX, SSE, and AVX]
A minimal SIMD sketch follows.
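A minimal sketch of the SIMD technique using SSE intrinsics: SAXPY processing four single-precision elements per instruction. This is an illustration of the approach, not necessarily the exact kernel evaluated in this work.

    #include <xmmintrin.h>

    void saxpy_sse(int n, float a, const float *x, float *y) {
        __m128 va = _mm_set1_ps(a);                   /* broadcast the scalar */
        int i;
        for (i = 0; i + 4 <= n; i += 4) {
            __m128 vx = _mm_loadu_ps(x + i);
            __m128 vy = _mm_loadu_ps(y + i);
            vy = _mm_add_ps(_mm_mul_ps(va, vx), vy);  /* y = a*x + y, 4 lanes at once */
            _mm_storeu_ps(y + i, vy);
        }
        for (; i < n; i++)                            /* scalar remainder */
            y[i] = a * x[i] + y[i];
    }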
Distributed System: Message Passing Interface (MPI)
[Figures: an MPI program with one root and three slave nodes (Slave1–Slave3), illustrating point-to-point communication and collective communication]
A minimal MPI sketch follows.
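A minimal MPI sketch (C bindings, compiled against MPICH-2) showing both communication styles from the figures: point-to-point sends from the root to each slave, and a collective broadcast. Buffer contents and names are illustrative.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
        double alpha = 2.0;                 /* scalar known at the root */
        double chunk[4] = {0};
        if (rank == 0) {
            double data[4] = {1, 2, 3, 4};
            for (int p = 1; p < nprocs; p++)     /* point-to-point communication */
                MPI_Send(data, 4, MPI_DOUBLE, p, 0, MPI_COMM_WORLD);
        } else {
            MPI_Recv(chunk, 4, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
        MPI_Bcast(&alpha, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);  /* collective communication */
        printf("rank %d got alpha = %f\n", rank, alpha);
        MPI_Finalize();
        return 0;
    }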
Execution Environment
- System: cluster of Fujitsu Siemens computers
- Processor: Intel Xeon CPU running at 2.33 GHz
- Cores: 4 cores per processor running 4 threads (1 thread per core)
- L1 cache: 32 KB data cache and 32 KB instruction cache per core
- L2 cache: 12 MB shared data cache
- Memory: 4 GB RAM
- Architecture: Intel 64 architecture, compatible with IA-32 software
- OS: Windows 7 Ultimate
- Compiler: MS Visual Studio 2012 Ultimate
- MPI: MPICH-2 library
Evaluating Linear Algebra Routines on OOO Superscalar Processors
Level-1 BLAS (for each routine: FLOPs/(loads, stores)):
1. Apply Givens rotation (AGR): xi = c·xi + s·yi, yi = −s·xi + c·yi, 1 ≤ i ≤ n; 6n/(2n, 2n)
2. Norm-2: ‖x‖ = sqrt(Σ xi²); 2n/(n, 1)
3. Dot product: Σ xi·yi; 2n/(2n, 1)
4. SAXPY: yi = a·xi + yi, 1 ≤ i ≤ n; 2n/(2n, n)
5. Vec-Scal Mul: xi = a·xi, 1 ≤ i ≤ n; n/(n, n)
[Chart: sequential performance of the Level-1 BLAS routines in GFlops/sec; visible peaks of 2.1, 1.8, 1.6, 1.2, and 0.8]
A kernel sketch for AGR follows.
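For example, a plain-C apply-Givens-rotation kernel showing where the 6n FLOPs and the (2n loads, 2n stores) traffic come from (a sketch; the evaluated routine may differ in detail):

    void agr(int n, float c, float s, float *x, float *y) {
        for (int i = 0; i < n; i++) {      /* 2n loads of x and y together */
            float xi = x[i], yi = y[i];
            x[i] = c * xi + s * yi;        /* 2 muls + 1 add */
            y[i] = -s * xi + c * yi;       /* 2 muls + 1 add -> 6 FLOPs per element */
        }                                  /* 2n stores back to x and y */
    }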
Evaluating Linear Algebra Routines on OOO Superscalar Processors
Level-2 BLAS (for each routine: FLOPs/(memory accesses)):
1. Matrix-vector multiplication (MV-Mul): y(i) = y(i) + A(i, j)·x(j), where 1 ≤ i, j ≤ n; 2n²/(2n² + 2n)
2. Rank-1 update: A(i, j) = A(i, j) + x(i)·y(j), where 1 ≤ i, j ≤ n; 2n²/(3n² + n)
[Chart: sequential performance of the Level-2 BLAS routines in GFlops/sec; visible peaks of 1.6 and 1.1]
Straightforward loop sketches for both kernels follow.
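Straightforward C sketches of the two Level-2 kernels (the access counts above assume each operand element moves between memory and the registers once per use, without blocking; names and C99 array parameters are illustrative):

    void mv_mul(int n, float A[n][n], const float x[n], float y[n]) {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                y[i] += A[i][j] * x[j];        /* 2n^2 FLOPs */
    }

    void rank1_update(int n, float A[n][n], const float x[n], const float y[n]) {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                A[i][j] += x[i] * y[j];        /* 2n^2 FLOPs */
    }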
Evaluating Linear Algebra Routines on OOO Superscalar Processors
Level-3 BLAS: Matrix-Matrix multiplication (MM-Mul)
1- Traditional algorithm of MM-Mul (2n³ FLOPs for n×n matrices):

    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            for (k = 0; k < n; k++)
                C[i][j] = C[i][j] + A[i][k] * B[k][j];

Six variants are obtained by loop exchange: ijk, ikj, jik, jki, kij, and kji.
[Figure: access pattern of the ikj variant]
Evaluating Linear Algebra Routines on OOO Superscalar Processors
Level-3 BLAS: Traditional MM-Mul, loop-order variants
[Charts: GFlops/sec of the six loop-order variants (ijk, ikj, jik, jki, kij, kji) versus matrix size, for small and large matrix sizes, with the ikj variant highlighted; visible peaks of 2.5 and 1.9 GFlops/sec]
Evaluating Linear Algebra Routines on OOO Superscalar Processors
Level-3 BLAS: Matrix-Matrix multiplication (MM-Mul)
2- Strassen's algorithm of MM-Mul (O(n^2.807) FLOPs). Partitioning A, B, and C into 2×2 blocks, one recursion step computes seven block products (a code sketch follows):
M1 = (A11 + A22)(B11 + B22)
M2 = (A21 + A22) B11
M3 = A11 (B12 − B22)
M4 = A22 (B21 − B11)
M5 = (A11 + A12) B22
M6 = (A21 − A11)(B11 + B12)
M7 = (A12 − A22)(B21 + B22)
C11 = M1 + M4 − M5 + M7
C12 = M3 + M5
C21 = M2 + M4
C22 = M1 − M2 + M3 + M6
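A sketch of one Strassen recursion level in C, following the seven products above. The helper names and the copy-based quadrant extraction are illustrative; a tuned version would recurse on the block products and avoid the copies.

    #include <stdlib.h>
    #include <string.h>

    static void madd(int h, const double *X, const double *Y, double *Z) {
        for (int i = 0; i < h * h; i++) Z[i] = X[i] + Y[i];
    }
    static void msub(int h, const double *X, const double *Y, double *Z) {
        for (int i = 0; i < h * h; i++) Z[i] = X[i] - Y[i];
    }
    static void mmul(int h, const double *X, const double *Y, double *Z) {
        memset(Z, 0, h * h * sizeof(double));       /* naive base-case multiply */
        for (int i = 0; i < h; i++)
            for (int k = 0; k < h; k++)
                for (int j = 0; j < h; j++)
                    Z[i*h+j] += X[i*h+k] * Y[k*h+j];
    }
    /* copy quadrant (r,c) of the n-by-n matrix M into the h-by-h block Q, h = n/2 */
    static void getq(int n, const double *M, int r, int c, double *Q) {
        int h = n / 2;
        for (int i = 0; i < h; i++)
            memcpy(Q + i*h, M + (r*h + i)*n + c*h, h * sizeof(double));
    }
    static void putq(int n, double *M, int r, int c, const double *Q) {
        int h = n / 2;
        for (int i = 0; i < h; i++)
            memcpy(M + (r*h + i)*n + c*h, Q + i*h, h * sizeof(double));
    }

    /* C = A*B for row-major n-by-n matrices, n even; one Strassen level */
    void strassen_one_level(int n, const double *A, const double *B, double *C) {
        int h = n / 2, sz = h * h;
        double *buf = malloc(19 * sz * sizeof(double));
        double *A11=buf, *A12=buf+sz, *A21=buf+2*sz, *A22=buf+3*sz;
        double *B11=buf+4*sz, *B12=buf+5*sz, *B21=buf+6*sz, *B22=buf+7*sz;
        double *M1=buf+8*sz, *M2=buf+9*sz, *M3=buf+10*sz, *M4=buf+11*sz;
        double *M5=buf+12*sz, *M6=buf+13*sz, *M7=buf+14*sz;
        double *T1=buf+15*sz, *T2=buf+16*sz, *C11=buf+17*sz, *C12=buf+18*sz;
        getq(n,A,0,0,A11); getq(n,A,0,1,A12); getq(n,A,1,0,A21); getq(n,A,1,1,A22);
        getq(n,B,0,0,B11); getq(n,B,0,1,B12); getq(n,B,1,0,B21); getq(n,B,1,1,B22);
        madd(h,A11,A22,T1); madd(h,B11,B22,T2); mmul(h,T1,T2,M1);   /* M1 */
        madd(h,A21,A22,T1); mmul(h,T1,B11,M2);                      /* M2 */
        msub(h,B12,B22,T1); mmul(h,A11,T1,M3);                      /* M3 */
        msub(h,B21,B11,T1); mmul(h,A22,T1,M4);                      /* M4 */
        madd(h,A11,A12,T1); mmul(h,T1,B22,M5);                      /* M5 */
        msub(h,A21,A11,T1); madd(h,B11,B12,T2); mmul(h,T1,T2,M6);   /* M6 */
        msub(h,A12,A22,T1); madd(h,B21,B22,T2); mmul(h,T1,T2,M7);   /* M7 */
        madd(h,M1,M4,T1); msub(h,T1,M5,T1); madd(h,T1,M7,C11);      /* C11 */
        putq(n,C,0,0,C11);
        madd(h,M3,M5,C12); putq(n,C,0,1,C12);                       /* C12 */
        madd(h,M2,M4,T1);  putq(n,C,1,0,T1);                        /* C21 */
        msub(h,M1,M2,T1); madd(h,T1,M3,T1); madd(h,T1,M6,T2);       /* C22 */
        putq(n,C,1,1,T2);
        free(buf);
    }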
Evaluating Linear Algebra Routines on OOO Superscalar Processors
Level-3 BLAS: Strassen's algorithm (O(n^2.807) FLOPs) versus the traditional algorithm
[Charts: execution time and GFlops/sec of the traditional (Trad) versus Strassen (Str) algorithms for small and large matrix sizes]
Evaluating Linear Algebra Routines on OOO Superscalar Processors
Singular Value Decomposition (SVD): A(n×n) = U(n×n) Σ(n×n) V(n×n)^T
- U and V are orthogonal matrices (i.e., U^T U = I_n and V^T V = I_n); the columns of U are the left singular vectors and the columns of V are the right singular vectors.
- Σ is a diagonal matrix diag(σ1, σ2, σ3, ..., σn) of the singular values.
- A·vi = σi·ui and A^T·ui = σi·vi.
Evaluating Linear Algebra Routines on OOO Superscalar Processors
Singular Value Decomposition (SVD): A = U Σ V^T
- One-sided Jacobi (OSJ): a plane rotation J is applied to a row pair (ui, uj), giving vi = c·ui − s·uj and vj = s·ui + c·uj. The parameters c and s of J are chosen such that vi·vj^T = 0, i.e., the rotated rows are orthogonal (a rotation sketch follows).
- For an n×n matrix, there are n(n−1)/2 Jacobi transformations in one sweep, which is the number of row pairs (ai, aj) with i ≠ j.
- 6n²(n−1) is the total number of FLOPs needed for one sweep of the OSJ algorithm.
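A sketch of one OSJ rotation on a row pair, using the standard Hestenes formulas for c and s (the thesis implementation may differ in detail; each pair costs O(n): three dot products plus the rotation):

    #include <math.h>

    void osj_rotate(int n, double *ui, double *uj) {
        double alpha = 0, beta = 0, gamma = 0;
        for (int k = 0; k < n; k++) {         /* the three dot products */
            alpha += ui[k] * ui[k];
            beta  += uj[k] * uj[k];
            gamma += ui[k] * uj[k];
        }
        if (gamma == 0) return;               /* rows already orthogonal */
        double tau = (beta - alpha) / (2.0 * gamma);
        double t = (tau >= 0 ? 1.0 : -1.0) / (fabs(tau) + sqrt(1.0 + tau * tau));
        double c = 1.0 / sqrt(1.0 + t * t), s = c * t;
        for (int k = 0; k < n; k++) {         /* vi = c*ui - s*uj, vj = s*ui + c*uj */
            double x = ui[k], y = uj[k];
            ui[k] = c * x - s * y;
            uj[k] = s * x + c * y;
        }
    }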
Evaluating Linear Algebra Routines on OOO Superscalar Processors
Singular Value Decomposition (SVD)
[Chart: performance of the sequential OSJ algorithm in GFLOPs/sec; visible peak of 6.2]
Exploiting DLP to Accelerate Linear Algebra Routines
Level-1 BLAS
[Charts: performance and speedup of the SIMD Level-1 BLAS routines against the ideal SIMD speedup — visible values of 8.1 GFlops/sec, speedups of 3.8 and 3.6, and percentages of the ideal of 96 and 89.6]
Exploiting DLP to Accelerate Linear Algebra Routines
Level-2 BLAS
[Charts: performance and speedup of the SIMD Level-2 BLAS routines against the ideal SIMD speedup — visible values of 6.3 GFlops/sec, speedups of 3.9 and 3.1, and percentages of the ideal of 96.5 and 78.6]
Blocking technique: reduces memory accesses from (2n² + 2n) to (n² + 3n) for MV-Mul, and from (3n² + n) to (2n² + 2n) for the rank-1 update.
Exploiting DLP to Accelerate Linear Algebra Routines
Level-3 BLAS: Traditional MM-Mul
[Figure: applying the SIMD technique to the traditional MM-Mul]
A sketch of the SIMD inner loop follows.
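A sketch of SIMD applied to the ikj variant with SSE intrinsics: A[i][k] is broadcast into a vector register and multiplied against four elements of row k of B at a time. It assumes row-major storage, n a multiple of 4, and C zero-initialized by the caller; the exact kernel in this work may differ.

    #include <xmmintrin.h>

    void mm_mul_sse(int n, const float *A, const float *B, float *C) {
        for (int i = 0; i < n; i++)
            for (int k = 0; k < n; k++) {
                __m128 a = _mm_set1_ps(A[i * n + k]);   /* broadcast A[i][k] */
                for (int j = 0; j < n; j += 4) {
                    __m128 b = _mm_loadu_ps(&B[k * n + j]);
                    __m128 c = _mm_loadu_ps(&C[i * n + j]);
                    _mm_storeu_ps(&C[i * n + j],
                                  _mm_add_ps(c, _mm_mul_ps(a, b)));
                }
            }
    }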
Exploiting DLP to Accelerate Linear Algebra Routines
Level-3 BLAS: Traditional MM-Mul
[Figure: applying the SIMD and matrix blocking techniques together on the traditional MM-Mul]
A sketch of the blocking loop structure follows.
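A sketch of the matrix blocking technique: the loops are tiled so that BS×BS blocks of B and C stay in cache across the inner loops (BS is a tunable cache-block size; in the combined version the innermost loop would be the SSE loop from the previous sketch):

    #define BS 64   /* illustrative block size; tuned to the cache in practice */

    void mm_mul_blocked(int n, const float *A, const float *B, float *C) {
        for (int ii = 0; ii < n; ii += BS)
            for (int kk = 0; kk < n; kk += BS)
                for (int jj = 0; jj < n; jj += BS)
                    for (int i = ii; i < ii + BS && i < n; i++)
                        for (int k = kk; k < kk + BS && k < n; k++) {
                            float a = A[i * n + k];
                            for (int j = jj; j < jj + BS && j < n; j++)
                                C[i * n + j] += a * B[k * n + j];
                        }
    }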
Exploiting DLP to Accelerate Linear Algebra Routines
Level-3 BLAS: Traditional MM-Mul
[Charts: GFlops/sec and speedup over the sequential version (Seq) for the SIMD and SIMD_Blocking versions on small and large matrix sizes — visible peaks of 10.3 and 7.2 GFlops/sec and speedups of 5.4 and 3.8]
Exploiting DLP to Accelerate Linear Algebra Routines
Level-3 BLAS: Strassen MM-Mul
[Charts: GFlops/sec and speedup over the sequential version (Seq) for the SIMD and SIMD_Blocking versions on small and large matrix sizes — visible peaks of 13.1 and 8.8 GFlops/sec and speedups of 5.2 and 3.5]
Exploiting DLP to Accelerate Linear Algebra Routines
Singular Value Decomposition (SVD)
[Charts: performance in GFLOPs/sec due to using SIMD (visible peak of 23.3) and speedup over the single-threaded OSJ (visible value of 3.7, against the ideal SIMD speedup)]
Multi-Threaded-SIMD Implementation/Evaluation of Dense Linear Algebra
Level-1 BLAS
[Figure: partitioning the vectors x and y of the apply-Givens-rotation routine, as an example of Level-1 BLAS, among four threads (Thread1–Thread4)]
A sketch of the per-thread worker follows.
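A sketch of what each thread executes in the multi-threaded-SIMD version: the SSE Givens-rotation kernel over its own slice of x and y (struct and names are illustrative; thread creation follows the earlier CreateThread sketch):

    #include <windows.h>
    #include <xmmintrin.h>

    typedef struct { float c, s, *x, *y; int lo, hi; } AgrWork;

    DWORD WINAPI agr_simd_worker(LPVOID arg) {
        AgrWork *w = (AgrWork *)arg;
        __m128 vc = _mm_set1_ps(w->c), vs = _mm_set1_ps(w->s);
        int i = w->lo;
        for (; i + 4 <= w->hi; i += 4) {      /* 4 elements per iteration */
            __m128 vx = _mm_loadu_ps(w->x + i), vy = _mm_loadu_ps(w->y + i);
            _mm_storeu_ps(w->x + i,
                          _mm_add_ps(_mm_mul_ps(vc, vx), _mm_mul_ps(vs, vy)));
            _mm_storeu_ps(w->y + i,
                          _mm_sub_ps(_mm_mul_ps(vc, vy), _mm_mul_ps(vs, vx)));
        }
        for (; i < w->hi; i++) {              /* scalar remainder */
            float xi = w->x[i], yi = w->y[i];
            w->x[i] = w->c * xi + w->s * yi;
            w->y[i] = -w->s * xi + w->c * yi;
        }
        return 0;
    }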
Multi-Threaded-SIMD Implementation/Evaluation of Dense Linear Algebra
Level-1 BLAS
[Charts: multi-threaded-SIMD Level-1 BLAS results — visible values of 4.6 GFlops/sec and speedups of 1.8 and 2.5]
Multi-Threaded-SIMD Implementation/Evaluation of Dense Linear Algebra
Level-2 BLAS
[Figures: partitioning the matrix-vector multiplication and the rank-1 update among threads]
Multi-Threaded-SIMD Implementation/Evaluation of Dense Linear Algebra
Level-2 BLAS
[Charts: multi-threaded-SIMD Level-2 BLAS results — visible values of 3.3 GFlops/sec and speedups of 2.0 and 2.0]
Multi-Threaded-SIMD Implementation/Evaluation of Dense Linear Algebra
Level-3 BLAS: Traditional MM-Mul
- Partition the matrices among the threads.
- Each thread uses the SIMD and matrix blocking techniques.
Multi-Threaded-SIMD Implementation/Evaluation of Dense Linear Algebra
Level-3 BLAS: Traditional MM-Mul
[Charts: GFlops/sec and speedup over sequential for the six versions (Seq, Thrd, SIMD, SIMD_Blocking, Thrd_SIMD, Thrd_SIMD_Blocking) on small and large matrix sizes — visible peaks of 24.2 GFlops/sec and a 12.6× speedup; the ideal speedup is 16 (4 threads × 4-wide SIMD)]
Multi-Threaded-SIMD Implementation/Evaluation of Dense Linear Algebra
Level-3 BLAS: Strassen MM-Mul
[Figure: partitioning the tasks of Strassen's algorithm among four threads]
Multi-Threaded-SIMD Implementation/Evaluation of Dense Linear Algebra
Level-3 BLAS: Strassen MM-Mul
[Charts: GFlops/sec and speedup over sequential for the six versions on small and large matrix sizes — visible peaks of 30.0 GFlops/sec and an 11.9× speedup; the ideal speedup is 16]
Multi-Threaded-SIMD Implementation/Evaluation of Dense Linear Algebra
Singular Value Decomposition (SVD): Block Jacobi (BJ)
[Figure: the Block Jacobi algorithm on the Intel Xeon quad-core processor]
Multi-Threaded-SIMD Implementation/Evaluation of Dense Linear Algebra
Singular Value Decomposition (SVD): Block Jacobi (BJ)
[Charts: performance in GFLOPs/sec due to using multi-threading and SIMD (visible values of 46.7 and 24.2) and speedup over the single-threaded OSJ (visible values of 7.5 and 3.9)]
Multi-Threaded-SIMD Implementation/Evaluation of Dense Linear Algebra
Singular Value Decomposition (SVD): Hierarchical Block Jacobi (HBJ)
[Figure: the input matrix partitioned into blocks of super-rows (B0, B1, B2, ..., B2P)]
Multi-Threaded-SIMD Implementation/Evaluation of Dense Linear Algebra
Singular Value Decomposition (SVD): Hierarchical Block Jacobi (HBJ)
[Charts: performance in GFLOPs/sec due to using multi-threading and SIMD (visible peak of 73.9) and speedup over the single-threaded OSJ (visible peak of 17.8)]
Distributed Multi-Core Processors for Accelerating Dense Linear Algebra
Level-1 BLAS
Amount of data sent to / received from each node for the Level-1 BLAS:

Subroutine                      Sending data     Receiving data
Apply Givens rotation           2 sub-vectors    2 sub-vectors
Dot-product                     2 sub-vectors    1 scalar value
SAXPY                           2 sub-vectors    1 sub-vector
Norm-2                          1 sub-vector     1 scalar value
Vector-scalar multiplication    1 sub-vector     1 sub-vector

[Chart: visible values of 1.0 and 0.002, on large vector lengths of 2,000,000 for AGR, dot-product, and SAXPY, and 3,500,000 for Norm-2 and Vec-Scal]
Distributed Multi-Core Processors for Accelerating Dense Linear Algebra
Level-1 BLAS
[Charts (four slides): cluster performance of the Level-1 BLAS routines versus the maximum performance on a single node — visible pairs of 4.5×10⁻² vs 2.1, 1.76×10⁻¹ vs 8.1, 9.0×10⁻² vs 3.0, and 9.8×10⁻² vs 3.6 GFlops/sec; the distributed versions run far below a single node]
Distributed Multi-Core Processors for Accelerating Dense Linear Algebra
Level-2 BLAS
Amount of data sent to / received from each node for the Level-2 BLAS:

Subroutine                      Sending data                  Receiving data
Matrix-vector multiplication    1 sub-matrix, 1 sub-vector    1 sub-vector
Rank-1 update                   1 sub-matrix, 1 sub-vector    1 sub-matrix

[Chart: visible values of 33.8 and 16.9, and 1.0×10⁻¹ and 8.7×10⁻², on large matrix size 10000×10000]
Distributed Multi-Core Processors for Accelerating Dense Linear Algebra
Level-2 BLAS
[Charts (four slides): cluster performance of the Level-2 BLAS routines versus the maximum performance on a single node — visible pairs of 3.2×10⁻² vs 1.6, 1.2×10⁻¹ vs 6.0, 4.1×10⁻² vs 2.0, and 5.3×10⁻² vs 2.5 GFlops/sec]
Distributed Multi-Core Processors for Accelerating Dense Linear Algebra
Level-3 BLAS: Traditional MM-Mul
[Figure: the execution of the traditional matrix-matrix multiplication algorithm using MPI on 4 computers]
A sketch of a row-block distribution follows.
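A sketch of this distribution in MPI (the function and buffer names are illustrative, and the actual thesis implementation may differ): the root scatters row blocks of A, broadcasts all of B, each node multiplies locally, and the root gathers the C blocks. A, B, and C need only be valid at the root, and n is assumed divisible by the number of processes.

    #include <mpi.h>
    #include <stdlib.h>
    #include <string.h>

    void mpi_mm_mul(int n, const double *A, const double *B, double *C) {
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
        int rows = n / nprocs;                        /* row block per node */
        double *Ablk = malloc(rows * n * sizeof(double));
        double *Bful = malloc(n * n * sizeof(double));
        double *Cblk = calloc(rows * n, sizeof(double));
        MPI_Scatter((void *)A, rows * n, MPI_DOUBLE,  /* distribute rows of A */
                    Ablk, rows * n, MPI_DOUBLE, 0, MPI_COMM_WORLD);
        if (rank == 0) memcpy(Bful, B, n * n * sizeof(double));
        MPI_Bcast(Bful, n * n, MPI_DOUBLE, 0, MPI_COMM_WORLD);  /* all of B */
        for (int i = 0; i < rows; i++)                /* local ikj multiply */
            for (int k = 0; k < n; k++)
                for (int j = 0; j < n; j++)
                    Cblk[i * n + j] += Ablk[i * n + k] * Bful[k * n + j];
        MPI_Gather(Cblk, rows * n, MPI_DOUBLE,        /* collect C at the root */
                   C, rows * n, MPI_DOUBLE, 0, MPI_COMM_WORLD);
        free(Ablk); free(Bful); free(Cblk);
    }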
Distributed Multi-Core Processors for Accelerating Dense Linear Algebra
Level-3 BLAS: Traditional MM-Mul
[Chart: maximum performance at large matrix size 10000×10000 on 1, 2, 4, 8, and 10 nodes — visible peaks of 104 and 99.7 GFlops/sec]
Distributed Multi-Core Processors for Accelerating Dense Linear Algebra
Level-3 BLAS: Strassen MM-Mul
[Figure: the execution of Strassen's algorithm for matrix-matrix multiplication using MPI on 7 computers]
Distributed Multi-Core Processors for Accelerating Dense Linear Algebra
Level-3 BLAS: Strassen MM-Mul
[Chart: maximum performance at large matrix size 10000×10000 on 1, 2, 4, and 7 nodes — visible values of 50.7 and 67.7 GFlops/sec]
Distributed Multi-Core Processors for Accelerating Dense Linear Algebra
Singular Value Decomposition (SVD): Block Jacobi (BJ)
[Figures: the blocks held on four computers in each step of the parallel Block Jacobi algorithm, and the blocks sent (S) and received (R) between the four computers in step 1 (S1) and step 2 (S2)]
Distributed Multi-Core Processors for Accelerating Dense Linear Algebra
Singular Value Decomposition (SVD): Block Jacobi (BJ)
[Charts: at large matrix size 10000×10000, performance in GFLOPs/sec due to using multi-threading and SIMD (visible peak of 207) and speedup over the single-threaded OSJ (visible peak of 49.8)]
Distributed Multi-Core Processors for Accelerating Dense Linear Algebra
Singular Value Decomposition (SVD): Hierarchical Block Jacobi (HBJ)
[Charts: at large matrix size 10000×10000, performance in GFLOPs/sec due to using multi-threading and SIMD (visible peak of 515) and speedup over the single-threaded OSJ (visible peak of 124)]
Conclusion
Dense linear algebra is used to highlight the most important factors that must be considered when designing software applications for a cluster of multi-core Intel processors. Exploiting all forms of parallelism is the only way to significantly improve performance. On a single computer, the maximum speedups of the BLAS and SVD routines due to using the SIMD and multi-threading techniques are as follows:

Level-1     AGR     Norm-2   DotProd  SAXPY   VecScal
SIMD        3.84    3.71     3.46     3.29    3.66
Thrd        1.43    2.10     1.42     1.65    1.78
ThrdSIMD    1.73    2.52     1.93     2.07    2.28

Level-2     SIMD    SIMDB    Thrd    ThrdSIMD   ThrdSIMDB
MV-Mul      3.67    3.86     1.25    1.54       2.03
Rank-1      3.54    3.78     1.27    1.56       2.10

Level-3        SIMD    Thrd    ThrdSIMD   ThrdSIMDB
Trad MM-Mul    1.46    2.20    4.27       12.00
Str MM-Mul     3.50    3.73    7.25       11.86

SVD     SIMD    Thrd    ThrdSIMD
BJ      3.73    3.88    7.50
HBJ     8.11    9.10    17.78
Conclusion
On a cluster of multi-core Intel processors:
- the performance of Level-1 BLAS is slowed down,
- the performance of Level-2 BLAS is slowed down,
- the performance of Level-3 BLAS is sped up.
The maximum speedups due to using all forms of parallelism are 104 for the traditional MM-Mul, 67.7 for the Strassen MM-Mul, 49.8 for BJ (on 8 nodes at 10000×10000), and 124 for HBJ (on 8 nodes at 10000×10000).
Future Work
- Implement and evaluate dense linear algebra algorithms on advanced architectures such as the Intel Xeon Phi.
- Restructure dense linear algebra algorithms to exploit all forms of parallelism and the memory hierarchy.
- Re-implement and evaluate the BJ and HBJ algorithms after replacing the round-robin ordering with one that better exploits the memory hierarchy.
- Use graphics processing units (GPUs) as another form of parallel hardware to improve the performance of the parallel algorithms on a cluster of computers.
- Exploit cloud computing technology to execute and evaluate new implementations of the parallel algorithms.