A Modern Framework for Portable High Performance Numerical Linear Algebra
A Thesis
Submitted to the Graduate School of the University of Notre Dame in Partial Fulfillment of the Requirements for the Degree of
Master of Science in Computer Science and Engineering
by
Jeremy G. Siek, B.S.
Andrew Lumsdaine, Director
Department of Computer Science and Engineering
Notre Dame, Indiana
April 1999
A Modern Framework for Portable High Performance Numerical Linear Algebra
Abstract

by Jeremy G. Siek

This thesis describes a generic programming methodology for expressing data structures, algorithms, and optimizations for numerical linear algebra. A high-performance implementation of this approach, the Matrix Template Library (MTL), is also described. The goal of the MTL is to facilitate the development of higher-level libraries and applications for scientific computing. In addition, the programming techniques developed in this thesis are widely applicable and can be used to reduce development costs, improve readability, and improve the performance of many kinds of software. Portable high performance is a particular focus of the MTL: flexible kernels were constructed that provide an automated tool for cross-architecture performance portability.
This is for all the code warriors in scientific computing.
Contents

Tables
Figures
Acknowledgements

Chapter 1: Introduction

Chapter 2: Generic Programming
  2.1 Generic Programming and the Standard Template Library
  2.2 Generic Programming for Linear Algebra

Chapter 3: Related Work
  3.1 Traditional Basic Linear Algebra Libraries
  3.2 Automatically Tuned Dense Linear Algebra
    3.2.1 The Optimizing Compiler Approach
    3.2.2 The Library Approach
  3.3 C++ Libraries for Linear Algebra
  3.4 Generic Programming and Software Engineering

Chapter 4: MTL Algorithms
  4.1 Pointwise LU Factorization Example
  4.2 Blocked LU Factorization Example

Chapter 5: MTL Components
  5.1 Domain Analysis of Matrix Storage Formats
    5.1.1 Matrix Element Type
    5.1.2 Matrix Shape
    5.1.3 Matrix Storage
    5.1.4 OneD Storage
  5.2 Component Selection and Generation
    5.2.1 Template Metaprogramming
  5.3 MTL Concepts
    5.3.1 Matrix
    5.3.2 Vector
    5.3.3 IndexedIterator
    5.3.4 Indexer
    5.3.5 OneDIndexer
    5.3.6 Offset
  5.4 MTL Object Memory Model
  5.5 The MTL Component Architecture
  5.6 TwoD Storage Classes
    5.6.1 dense2D
    5.6.2 compressed2D
    5.6.3 array2D
    5.6.4 envelope2D
  5.7 OneD Containers/Vectors
    5.7.1 dense1D
    5.7.2 compressed1D
  5.8 Adaptors
    5.8.1 sparse1D
    5.8.2 Scaling Adaptors
    5.8.3 Striding Adaptors

Chapter 6: High Performance
  6.1 Mayfly Components
  6.2 High Performance Iterators
  6.3 High Performance & Template Metaprogramming
  6.4 Fixed Algorithm Size Template (FAST) Library
  6.5 Basic Linear Algebra Instruction Set (BLAIS)
  6.6 BLAIS in a General Matrix-Matrix Product

Chapter 7: Iterative Template Library (ITL)
  7.1 Generic Interface
  7.2 Ease of Implementation
  7.3 ITL Performance

Chapter 8: Performance Experiments
  8.1 Dense Matrix-Matrix Multiplication
  8.2 Dense and Sparse Matrix-Vector Multiplication
  8.3 Performance Analysis of Matrix-Matrix Multiplication

Chapter 9: Testing

Chapter 10: Future Work and Conclusion
  10.1 Future Work
    10.1.1 MTL User Interface
    10.1.2 MTL Functionality
    10.1.3 Higher Level Libraries
    10.1.4 New Language
    10.1.5 MTL for Advanced Parallel Architectures
  10.2 Conclusion

Bibliography

Appendix A: Containers
  A.1 Concepts
    A.1.1 Matrix
    A.1.2 RowMatrix
    A.1.3 ColumnMatrix
    A.1.4 DiagonalMatrix
    A.1.5 Vector
    A.1.6 TwoDStorage
  A.2 Container type generators
    A.2.1 matrix< T, Shape = rectangle, Storage = dense, Orientation = row_major >
    A.2.2 band_view
    A.2.3 block_view
    A.2.4 symmetric_view
    A.2.5 triangle_view
  A.3 Container type selectors
    A.3.1 rectangle
    A.3.2 symmetric
    A.3.3 hermitian
    A.3.4 banded
    A.3.5 triangle
    A.3.6 diagonal
    A.3.7 array
    A.3.8 dense
    A.3.9 compressed
    A.3.10 packed
    A.3.11 banded_view
    A.3.12 envelope
    A.3.13 linked_list
    A.3.14 sparse_pair
    A.3.15 tree
  A.4 Container classes
    A.4.1 dense1D
    A.4.2 compressed1D
    A.4.3 external_vec
    A.4.4 generic_dense2D
    A.4.5 dense2D
    A.4.6 external2D
    A.4.7 generic_comp2D
    A.4.8 compressed2D
    A.4.9 ext_comp2D
    A.4.10 array2D
  A.5 Container adaptors
    A.5.1 linalg_vec
    A.5.2 scaled1D
    A.5.3 sparse1D
    A.5.4 strided1D
    A.5.5 scaled2D
    A.5.6 block2D
  A.6 Container functions
    A.6.1 scaled
    A.6.2 strided
    A.6.3 rows
    A.6.4 columns
    A.6.5 trans
    A.6.6 blocked
    A.6.7 blocked
  A.7 Container tags
    A.7.1 banded_tag
    A.7.2 column_matrix_traits
    A.7.3 column_tag
    A.7.4 dense_tag
    A.7.5 diagonal_matrix_traits
    A.7.6 diagonal_tag
    A.7.7 external_tag
    A.7.8 hermitian_tag
    A.7.9 internal_tag
    A.7.10 linalg_traits
    A.7.11 matrix_traits
    A.7.12 not_strideable
    A.7.13 oned_tag
    A.7.14 rectangle_tag
    A.7.15 row_matrix_traits
    A.7.16 row_tag
    A.7.17 sparse_tag
    A.7.18 strideable
    A.7.19 symmetric_tag
    A.7.20 triangle_tag
    A.7.21 twod_tag

Appendix B: Iterators
  B.1 Concepts
    B.1.1 IndexedIterator
  B.2 Iterator functions
    B.2.1 trans_iter
  B.3 Iterator adaptors
    B.3.1 dense_iterator

Appendix C: Algorithms
  C.0.6 sum
  C.0.7 set
  C.0.8 scale
  C.0.9 set_diagonal
  C.0.10 two_norm
  C.0.11 one_norm
  C.0.12 infinity_norm
  C.0.13 max_index
  C.0.14 max
  C.0.15 min
  C.0.16 transpose
  C.0.17 transpose
  C.0.18 mult
  C.0.19 mult
  C.0.20 mult
  C.0.21 tri_solve
  C.0.22 tri_solve
  C.0.23 rank_one_update
  C.0.24 rank_two_update
  C.0.25 copy
  C.0.26 add
  C.0.27 add
  C.0.28 add
  C.0.29 ele_mult
  C.0.30 ele_mult
  C.0.31 ele_div
  C.0.32 swap
  C.0.33 dot
  C.0.34 dot
  C.0.35 dot_conj
  C.0.36 dot_conj
  C.0.37 lu_factorize

Appendix D: Function Objects
  D.0.38 givens_rotation
  D.0.39 givens_rotation
  D.0.40 modified_givens

Appendix E: Iterative Template Library
  E.1 Concepts
    E.1.1 Iteration
    E.1.2 Preconditioner
  E.2 Algorithms
    E.2.1 cg
    E.2.2 cgs
    E.2.3 bicg
    E.2.4 gmres
    E.2.5 bicgstab
    E.2.6 qmr
    E.2.7 tfqmr
    E.2.8 gcr
    E.2.9 cheby
    E.2.10 richardson
  E.3 Preconditioners
    E.3.1 ILU
    E.3.2 ILUT
    E.3.3 SSOR
    E.3.4 cholesky< Matrix >

Appendix F: Fixed Algorithm Size Template (FAST) Library
  F.0.5 copy
  F.0.6 transform
  F.0.7 transform
  F.0.8 fill
  F.0.9 swap_ranges
  F.0.10 accumulate
  F.0.11 accumulate
  F.0.12 inner_product
  F.0.13 inner_product
  F.0.14 count

Appendix G: Basic Linear Algebra Instruction Set (BLAIS) Library
  G.0.15 add
  G.0.16 copy
  G.0.17 copy
  G.0.18 dot
  G.0.19 mult
  G.0.20 mult
  G.0.21 rank_one
  G.0.22 set
  G.0.23 set

Appendix H: MTL to LAPACK Interface
  H.0.24 lapack_matrix
  H.1 Functions
    H.1.1 gecon
    H.1.2 geev
    H.1.3 geqpf
    H.1.4 geqrf
    H.1.5 gesv
    H.1.6 getrf
    H.1.7 getrs
    H.1.8 geequ
    H.1.9 gelqf
    H.1.10 orglq
    H.1.11 orgqr

Appendix I: Utilities
  I.1 Concepts
    I.1.1 Indexer
    I.1.2 Offset
  I.2 Functions
    I.2.1 read_dense_matlab
    I.2.2 write_dense_matlab
    I.2.3 read_sparse_matlab
    I.2.4 write_sparse_matlab
  I.3 Classes
    I.3.1 dimension
    I.3.2 harwell_boeing_stream
    I.3.3 matrix_market_stream
Tables

1.1 Breakdown of personal accomplishments vs. others' related work and work used in this thesis.
2.1 Excerpt from the STL random-access iterator requirements.
3.1 Summary of C++ Libraries for Linear Algebra.
4.1 MTL generic linear algebra algorithms.
4.2 MTL adaptor classes and helper functions for creating algorithm permutations.
4.3 Permutations of the add() operation made possible with the use of the scaled() adaptor helper function.
5.1 Matrix associated types, in addition to those of Container.
5.2 Matrix method requirements, in addition to those of Container.
5.3 Vector requirements, in addition to those of Container.
5.4 IndexedIterator requirements.
5.5 Indexer requirements.
5.6 OneDIndexer requirements.
5.7 Offset requirements.
6.1 The effect of iterator and comparison operator choice on performance (in Mflops) for dot product on Sun C, IBM XLC, and SGI C compilers.
9.1 MTL test suite results summary.
Figures

2.1 Separation of containers and algorithms using iterators.
2.2 The TwoD iterator concept.
2.3 Simplified example of a generic matrix-vector product, with a comparison to the traditional approach to writing dense and sparse matrix-vector products.
4.1 LU factorization pseudo-code.
4.2 Diagram for LU factorization.
4.3 Complete MTL version of pointwise LU factorization.
4.4 Pointwise step in block LU factorization.
4.5 Update steps in block LU factorization.
4.6 MTL version of block LU factorization.
5.1 Feature diagram for common matrix formats.
5.2 The MTL Matrix configuration grammar.
5.3 Example of a banded matrix with bandwidth (1,2).
5.4 Example of a symmetric matrix with bandwidth (2,2).
5.5 Example of the dense matrix storage format.
5.6 Example of the banded matrix storage format.
5.7 Example of the packed matrix storage format.
5.8 Example of the compressed column matrix storage format.
5.9 Example of the array matrix storage format with dense and with sparse_pair OneD storage types.
5.10 Example of the envelope matrix storage format.
5.11 Non-intrusive reference counting pointer implementation.
5.12 The MTL implementation layer components.
6.1 A recursive matrix-matrix product algorithm.
7.1 An ITL method interface example.
7.2 Example use of the ITL QMR iterative method.
7.3 Comparison of an algorithm for the preconditioned conjugate gradient method and the corresponding ITL code.
7.4 Comparison of ITL and IML++ performance over six matrices.
8.1 Performance comparison of generic dense matrix-matrix product with other libraries on Sun UltraSPARC (upper) and IBM RS6000 (lower).
8.2 Performance of generic matrix-vector product applied to column-oriented dense (upper) and row-oriented sparse (lower) data structures compared with other libraries on Sun UltraSPARC.
Acknowledgements
Thanks go to Andrew Lumsdaine, my advisor and originator of the MTL vision. I thank my father, Richard Siek, for instilling in me a love of good software engineering practices and object-oriented programming. I thank my mother, Elisabeth Siek, for always encouraging creativity and imagination. I thank Alexander Stepanov and David Musser for their inspirational Standard Template Library, and also Bjarne Stroustrup for designing a language with enough flexibility and expressiveness to make MTL not only possible but also efficient. I would like to thank Brian McCandless for his contributions to the first version of MTL. Thanks also go to all of my wonderful colleagues in the Laboratory for Scientific Computing who helped with this thesis in innumerable ways. In addition, I would like to thank the Philosophy and Humanities professors I studied under at Notre Dame, who gave me an appreciation of how concepts can be made to fit together to form coherent structures.
Chapter 1

Introduction

Software construction for scientific computing is a difficult task. Scientific codes are often large and complex, requiring vast amounts of domain knowledge for their construction. They also process large data sets, so there is an additional requirement for efficiency and high performance. Considerable knowledge of modern computer architectures and compilers is required to make the necessary optimizations, which is a time-intensive task and further complicates the code.

The last decade has seen significant advances in the area of software engineering. New techniques have been created for managing software complexity and building abstractions. Underneath the layers of new terminology (object-oriented, generic [51], aspect-oriented [40], generative [17], metaprogramming [55]) there is a core of solid work that points the way for constructing better software for scientific computing: software that is portable, maintainable, and achieves high performance at a lower development cost.

One important key to better software is better abstractions. With the right abstractions each aspect of the software (domain specific, performance optimization, parallel communication, data structures, etc.) can be cleanly separated, then handled on an individual basis. The proper abstractions reduce code complexity and help to achieve high-quality and high-performance software.

The first generation of abstractions for scientific computing came in the form of subroutine libraries such as the Basic Linear Algebra Subroutines (BLAS) [22, 23, 36], LINPACK [21], EISPACK [50], and LAPACK [2]. This was a good first step, but the first-generation libraries were inflexible and difficult to use, which reduced their applicability. Moreover, the construction of such libraries was a complex and expensive task. Many software engineering techniques (then in their infancy) could not be applied to scientific computing because of their interference with performance.

In the last few years significant improvements have been made in the tools used for expressing abstractions, primarily in the maturation of the C++ language and its compilers. The old enmity between abstraction and performance can now be put aside. In fact, abstractions can be used to aid performance portability by making the necessary optimizations easier to apply. With the intelligent use of modern software engineering techniques it is now possible to create extremely flexible scientific libraries that are portable, easy to use, highly efficient, and which can be constructed in far fewer lines of code than has previously been possible. This thesis describes such a library, the Matrix Template Library (MTL), a package for high-performance numerical linear algebra.

There are four main contributions in this thesis. The first is a breakthrough in software construction that enables the heavy use of abstraction without inhibiting high performance. The second contribution is the development of software designs that allow additive programming effort to produce multiplicative amounts of functionality. This produced an order of magnitude reduction in the code length for MTL compared to the Netlib BLAS implementation, a software library of comparable functionality. The third contribution is the construction of flexible kernels that simplify the automatic generation of portable optimized linear algebra routines.
The fourth contribution is the analysis and classification of the numerical linear algebra problem domain, which is formalized in the concepts that define the interfaces of the MTL components and algorithms.

The work in this thesis builds on work by many other people, and parts of others' work are described in this thesis. Table 1.1 is provided in order to clarify what work was done by others, and what work I did as part of this thesis. The related work listed there is only the work that was very closely related to MTL, or that was used heavily in MTL. Chapter 3 describes the work related to MTL in more detail.

Personal Accomplishments | Others' Related Work
Implementation of all the MTL software | BLAS [22, 23, 36] and LAPACK [2]
Idea to use adaptors to solve the “fat” interface problem | Generic Programming [43]
Use of aspect objects to handle indexing for matrices | Aspect Oriented Programming [40], idea of a separation of orientation and 2D containers [37, 38], idea to use iterators for linear algebra [37, 38]
Idea to use template metaprogramming to perform register blocking in linear algebra kernels | Complete unrolling for operations on small arrays [55], matrix constructor interface [16, 18], compile-time prime number calculations [54]
Tuned MTL algorithms for high performance | Tiling and blocking techniques [10, 11, 12, 14, 32, 34, 35, 39, 60, 61], automatically tuned libraries [7, 59]
Proved that iterators can be used in high performance arenas | Optimizing compilers [33, 41], lightweight object optimization, inlining
Created the Mayfly pattern | Andrew Lumsdaine thought of the name
Designed the ITL interface | ITL implementation by Andrew Lumsdaine and Rich Lee

Table 1.1. Breakdown of personal accomplishments vs. others' related work and work used in this thesis.

The following is a road map for the rest of this thesis. Chapter 2 gives an introduction to generic programming and describes how to extend generic programming to linear algebra. Chapter 3 gives an overview of prior work by others that is related to MTL.
Chapters 4 and 5 address the design and implementation of the MTL algorithms and components. Chapter 6 discusses performance issues such as the ability of modern C++ compilers to optimize abstractions and how template metaprogramming techniques can be used to express loop optimizations. Chapter 7 describes an iterative methods library, the Iterative Template Library (ITL), that is constructed using MTL. The ultimate purpose of the work in this thesis is to aid the construction of higher-level scientific libraries and applications in several respects: to reduce development costs, to improve software quality from a software engineering standpoint, and to make high performance easier to achieve. The Iterative Template Library is an example of how higher-level libraries can be constructed using MTL. Chapter 8 gives the real proof that our generic programming approach is viable for scientific computing: the performance results. The performance of MTL is compared to vendor BLAS libraries for several dense and sparse matrix computations on several different architectures. Chapter 9 summarizes the verification and testing of the MTL software. Chapter 10 discusses some future directions of MTL and concludes the thesis.
Chapter 2 Generic Programming 2.1 Generic Programming and the Standard Template Library This chapter gives a short description of generic programming with a few examples from the Standard Template Library (STL). For a more complete description of STL refer to [52]. For an introduction to STL and generic programming refer to [3, 42, 53]. In the chapters following this one it will be assumed the reader has a basic knowledge of STL and generic programming. Generic programming has recently entered the spotlight with the introduction of the Standard Template Library (STL) into the C++ standard [27]. The principal idea behind generic programming is that many algorithms can be abstracted away from the particular data structures on which they operate. Algorithms typically need the functionality of traversing through a data structure and accessing its elements. If data structures provide a standard interface for these operations, generic algorithms can be freely mixed and matched with data structures (called containers in STL). The main facilitator in the separation of algorithms and containers in STL is the iterator (sometimes called a “generalized pointer”). Iterators provide a mechanism for traversing containers and accessing their elements. The interface between an algorithm
and a container is specified by the types of iterators exported by the container. Generic algorithms are written solely in terms of iterators and never rely upon specifics of a particular container. Iterators are classified into broad categories, some of which are: InputIterator, ForwardIterator, and RandomAccessIterator. Figure 2.1 depicts the relationship between containers, algorithms, and iterators.

Figure 2.1. Separation of containers and algorithms using iterators.

The STL defines a set of requirements for each class of iterators. The requirements are in the form of which operations (functions) are defined for each iterator, and what the meaning of the operation is. As an example of how these requirements are defined, an excerpt from the requirements for the STL RandomAccessIterator is listed in Table 2.1. In the table, X is the iterator type, T is the element type pointed to by the iterator, a, b, r, and s are iterator objects, and n is an integral type.

expression   return type        note
a == b       bool               *a == *b
a < b        bool               b - a > 0
*a           T&                 dereference a
a->m         U&                 (*a).m
++r          X&                 r == s implies ++r == ++s
--r          X&                 r == s implies --r == --s
r += n       X&                 same as n applications of ++r
a + n        X                  { X tmp = a; return tmp += n; }
b - a        Distance           number of increments to get from a to b
a[n]         convertible to T   *(a + n)

Table 2.1. Excerpt from the STL RandomAccessIterator requirements.

Containers export the types of iterators that can be used to traverse them as nested type definitions. For example, a list class defines its iterator type roughly as follows (simplified sketch):

template <class T>
class list {
public:
  class iterator {
  public:
    iterator& operator++() { node = node->next; return *this; }
    ...
  };
  ...
};
When dealing with some container class, one can access the correct type of iterator using the double-colon scope operator, as is demonstrated below in the function foo().

template <class ContainerX, class ContainerY>
void foo(ContainerX& x, ContainerY& y) {
  typename ContainerX::iterator xi;
  typename ContainerY::iterator yi;
  ...
}
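STL's accumulate() is the canonical example of a generic numerical algorithm written in this style: it uses only InputIterator operations (inequality, increment, dereference), so the same function template sums a vector, a list, or any other container. Below is a simplified version, illustrative rather than the standard library's actual implementation:

```cpp
#include <cassert>
#include <list>
#include <vector>

// Simplified illustration of std::accumulate: written only in terms of
// InputIterator operations, so it works with any STL-style container.
template <class InputIterator, class T>
T my_accumulate(InputIterator first, InputIterator last, T init) {
  for (; first != last; ++first)
    init = init + *first;
  return init;
}
```

The same template instantiates against vector iterators and list iterators alike; the container-specific traversal logic lives entirely in the iterator.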
2.2 Generic Programming for Linear Algebra

The advantages of generic programming coincide with the library construction problems of numerical linear algebra. The traditional approach for developing numerical linear algebra libraries is combinatorial in the required development effort. Individual subroutines must be written to support every desired combination of algorithm, basic numerical type, and matrix storage format. For a library to provide a rich set of functions and data types, one would need to code hundreds of versions of the same routine. As an example, to provide basic functionality for selected sparse matrix types, the NIST implementation of the Sparse BLAS contains over 10,000 routines and a custom code generation system [48].

The combinatorial explosion in implementation effort arises because, with most programming languages, algorithms and data structures are more tightly coupled than is conceptually necessary. That is, one cannot express an algorithm as a subroutine independently from the type of data that is being operated on. As a result, providing a comprehensive linear algebra library — much less one that also offers high performance — would seem to be an overwhelming task. Fortunately, certain modern programming languages, such as Ada and C++, support generic programming by providing mechanisms for expressing algorithms independent of the specific data structure to which they are applied. A single function can then work with many different data structures, drastically reducing the size of the code. In rough terms, given M algorithms and N data structures, the amount of code goes from O(M × N) to just O(M + N). As a result, development, maintenance, testing, and optimization become much easier.

If generic algorithms are to be created for numerical linear algebra, there must be a common interface: a common way to access and traverse the vectors and matrices of different types. The STL has already provided a model for traversing through vectors
and other one-dimensional containers by using iterators. In addition, the STL defines several numerical algorithms such as the accumulate() algorithm presented in the last section. Thus creating generic algorithms to encompass the rest of the Level-1 BLAS functionality [36] is relatively straightforward.

Matrix operations are slightly more complex, since the elements are arranged in a two-dimensional format. The MTL algorithms process matrices as if they are containers of containers (the matrices are not necessarily implemented this way). The matrix algorithms are coded in terms of iterators and two-dimensional iterators, as depicted in Figure 2.2. An algorithm can choose which row or column of a matrix to process using the two-dimensional iterator. The iterator can then be dereferenced to produce the row or column vector, which is a first-class STL-style container. The one-dimensional iterators of the row vector can then be used to traverse along the row and access individual elements.

Figure 2.2. The TwoD iterator concept.

The code for an iterator-based generic matrix-vector product is listed in Figure 2.3. The matvec_mult function is templated on the matrix type and the iterator types (which give the starting points of two vectors). The first two lines of the algorithm declare variables for the TwoD and OneD iterators that will be used to traverse the matrix.
The Matrix::const_iterator expression extracts the iterator type from the Matrix type to declare the variable i. Similarly, the Matrix::OneD::const_iterator expression extracts the iterator type from the OneD container defined inside the matrix to declare the variable j. The iterator i is set to the beginning of the matrix with i = A.begin(). The outer loop repeats until i reaches A.end(). For the inner loop, the iterator j is set to the beginning of the row pointed to by i with j = i->begin(). The inner loop repeats until j reaches the end of the row, i->end(). The computation for the matrix-vector product consists of multiplying an element from the matrix, *j, by the appropriate element in vector x. The index into x is the current column, which is the position of iterator j, given by j.column(). The result is accumulated into the element of vector y selected by the current row index, j.row(). The generic matrix-vector algorithm in Figure 2.3 is extremely flexible, and can be used with a wide variety of dense, sparse, and banded matrix types. For purposes of comparison, the traditional approach for coding matrix-vector products for sparse and dense matrices is also listed. Note how the indexing in the MTL routine has been abstracted away. The traversal across a row goes from begin() to end(), instead of using explicit indices. Also, the indices used to access the x and y vectors are abstracted through the use of the row() and column() methods of the iterator. The row() and column() methods provide a uniform way to access index information regardless of whether the matrix is dense, sparse, or banded.
// generic matrix-vector multiply
template <class Matrix, class IterX, class IterY>
void matvec_mult(Matrix A, IterX x, IterY y) {
  typename Matrix::const_iterator i;
  typename Matrix::OneD::const_iterator j;
  for (i = A.begin(); i != A.end(); ++i)
    for (j = i->begin(); j != i->end(); ++j)
      y[j.row()] += *j * x[j.column()];
}

// BLAS-style dense matrix-vector multiply
for (int i = 0; i < m; ++i)
  for (int j = 0; j < n; ++j)
    y[i] += a[i*lda+j] * x[j];

// SPARSPAK-style sparse matrix-vector multiply
for (int i = 0; i < n; ++i)
  for (int k = ia[i]; k < ia[i+1]; ++k)
    y[i] += a[k] * x[ja[k]];
Figure 2.3. Simplified example of a generic matrix-vector product, with a comparison to the traditional approach to writing dense and sparse matrix-vector products.
Chapter 3 Related Work

The Matrix Template Library draws on previous research in the following fields:

- Linear algebra libraries
- Optimizing compilers
- Generic programming and software engineering
The Matrix Template Library is unique in the way it combines the advances made in these three fields to produce a linear algebra library. The MTL contains new insights and advances which were born as a result of merging ideas from these fields.
3.1 Traditional Basic Linear Algebra Libraries

The Basic Linear Algebra Subprograms (BLAS) [22, 23, 36] currently define an informal standard in C and Fortran for dense linear algebra. There are many different implementations of the BLAS. The Netlib BLAS [1] (Fortran) provides the reference implementation, and each hardware vendor supplies tuned versions of the BLAS. For sparse matrices, there is the NIST Sparse BLAS [48] (C) and SPARSKIT [49] (Fortran). There is also an effort underway to define a new BLAS standard, which will add operations on sparse matrices, mixed-precision numbers, and intervals. A reference implementation of the new BLAS will be
constructed on top of MTL. The matrix formats and algorithms of the BLAS were used as a basis for the construction of the MTL.
3.2 Automatically Tuned Dense Linear Algebra

There are currently two approaches used for generating portable dense linear algebra routines. The first approach is that of optimizing compilers, which transform the loops found in application programs. The second approach is for the automated code generation and optimization to be built into a library, which is then called from an application program.
3.2.1 The Optimizing Compiler Approach

A large amount of research has been done in the compiler community with regard to loop optimizations, especially for dense linear algebra operations [34, 10, 61, 35, 60, 12, 11, 39, 14, 32]. The results of this research have proved extremely valuable in selecting the performance optimizations used in MTL. Many of the performance optimizations used in MTL would not be necessary if commercial compilers implemented more transformations in a consistent and reliable fashion. The difficulty in applying optimizations within a general-purpose compiler is that it is hard for the compiler to determine where to apply the transformations, and exactly which transformations to apply. An alternative to relying on a general-purpose compiler is to build the optimizations into a library, such as MTL. This has the advantage that the library designer knows where the optimizations are needed, and which transformations to apply.
3.2.2 The Library Approach There are currently two libraries (both written in C) that provide portable high performance, PHiPAC [7] and ATLAS [59]. There are two basic parts to each of these libraries.
The first is a script which determines optimal blocking sizes and other parameters for high performance loops for a given architecture. Both PHiPAC and ATLAS use a brute force search to find these parameters. Some work in the compiler community points to a more efficient technique for determining the optimization parameters [10, 61]. The second part of these libraries is a code generation script which generates the optimized code based on these parameters. One of the contributions of the Matrix Template Library is in providing a more elegant and concise method of code generation based on the C++ template system. This is described in section 6.3. We currently use brute force searches to find the appropriate optimization parameters.
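To give the flavor of template-based code generation (the MTL machinery described in Section 6.3 is more elaborate; the class below is a hypothetical illustration, not MTL code), a fixed-size dot product can be completely unrolled by the compiler through recursive template instantiation:

```cpp
#include <cassert>

// Illustrative sketch: the recursion is expanded at compile time, so
// unrolled_dot<4>::apply() compiles down to four multiply-adds, no loop.
template <int N>
struct unrolled_dot {
  static double apply(const double* x, const double* y) {
    return x[0] * y[0] + unrolled_dot<N - 1>::apply(x + 1, y + 1);
  }
};

// Base case terminates the recursion.
template <>
struct unrolled_dot<0> {
  static double apply(const double*, const double*) { return 0.0; }
};
```

Because the unrolling factor is a template parameter, the same source can generate different amounts of code for different architectures once the optimal parameters are known.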
3.3 C++ Libraries for Linear Algebra There are many existing linear algebra libraries written in C++. Most of these can be categorized as object-oriented but not generic. They provide matrix and vector objects, but the algorithms are not formulated to allow one algorithm to work with many matrix formats. In addition, the other C++ libraries do not address highly optimized performance and do not claim “vendor-tuned” performance as MTL does. Table 3.1 categorizes the C++ libraries as object-oriented (OO), using expression templates [56] and/or operator overloading (ET/Op), and whether they use generative methods to create component combinations (Generative). Expression templates are a mechanism in C++ for improving the cache behavior when operator overloading is used. Again, none of these libraries use generic algorithms in the sense of the STL. Though not a linear algebra library, Blitz++ [58] is related to MTL in that it uses template metaprogramming [55] techniques to achieve high performance. In addition, the use of expression templates was pioneered in Blitz++.
Library                                              OO   ET/Op   Generative
Template Numeric Toolkit (TNT) [47]                  X    X
C++ Scientific Library (SL++) [20]                   X    X
Generative Matrix Computation Library (GMCL) [16]    X    X       X
LAPACK++ [25]                                        X
Sparslib++ [26]                                      X
GNU Scientific Software Library (GNUSSL) [46]        X
Newmat [19]                                          X    X

Table 3.1. Summary of C++ libraries for linear algebra.
3.4 Generic Programming and Software Engineering

As already mentioned, the design of the MTL draws heavily from the Standard Template Library [52, 3], which popularized the notion of generic programming [43] in the C++ community. The mix-and-match component design used in MTL is related to several other works, including the Generative Matrix Computation Library (GMCL) [18, 16], GenVoca [6], and the aspect-oriented programming methodology [40] from Xerox PARC. Barton and Nackman [5] introduce several important design techniques for using C++ in scientific computing. The MTL uses several of their techniques.
Chapter 4 MTL Algorithms

The Matrix Template Library provides a rich set of basic linear algebra operations, roughly equivalent to the Level-1, Level-2 and Level-3 BLAS, though the MTL operates over a much wider set of datatypes. The Matrix Template Library is unique among linear algebra libraries because each algorithm (for the most part) is implemented with just one template function. From a software maintenance standpoint, the reuse of code gives MTL a significant advantage over the BLAS [22, 23, 36] or even other object-oriented libraries like TNT [47] (which still has different subroutines for different matrix formats). Because of the code reuse provided by generic programming, MTL has an order of magnitude fewer lines of code than the Netlib Fortran BLAS [1], while providing much greater functionality and achieving significantly better performance. The MTL implementation is 8,284 words (according to the Unix wc utility) for algorithms and 6,900 words for dense containers. The Netlib BLAS total 154,495 words, and high-performance versions of the BLAS (with which MTL is competitive) are even more verbose. In addition, the MTL has been designed to be easier to use than the BLAS. Data encapsulation has been applied to the matrix and vector information, which makes the MTL interface simpler because input and output are in terms of matrix and vector objects, instead of integers, floating point numbers, and pointers. It also provides the right level of abstraction to the user — operations are in terms of linear algebra objects instead of
low-level programming constructs such as pointers.

Table 4.1 lists the principal operations implemented in the MTL. One would expect to see many more variations on the operations to take into account transpose and scaling permutations of the argument matrices and vectors — or at least one would expect a "fat" interface that contains extra parameters to specify such combinations. The MTL introduces a new approach to creating such permutations. Instead of using extra parameters, the MTL provides matrix and vector adaptor classes. An adaptor object wraps up the argument and modifies the behavior of the object in the algorithm. Table 4.2 gives a list of the MTL adaptor classes and their helper functions. A helper function provides a convenient way to create adapted objects. For instance, the scaled() helper function wraps a vector in a scaled1D adaptor. The adaptor causes the elements of the vector to be multiplied by a scalar inside of the MTL algorithm. There are two other helper functions in MTL, strided() and trans(). The strided() function adapts a vector so that its iterators move a constant stride with each call to operator++. The trans() function switches the orientation of a matrix (this happens at compile time) so that the algorithm "sees" the transpose of the matrix. Table 4.3 shows how one can create all the permutations of scaling for a daxpy()-like operation. Example code for a basic implementation of the BLAS daxpy() is listed below.

void daxpy(int n, double a, double* x, int incx, double* y, int incy) {
  int i, ix, iy;
  for (i = 0, ix = 0, iy = 0; i < n; ++i, ix += incx, iy += incy)
    y[iy] += a * x[ix];
}
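To make the adaptor technique concrete, here is a stripped-down sketch of how a scaled1D-style adaptor can defer scaling to element-access time. The class and the generic add() below are simplifications invented for illustration (MTL's real adaptors wrap iterators and are generic over value types; Section 5.8 covers the actual implementation):

```cpp
#include <cassert>
#include <cstddef>

// Sketch of a scaled vector adaptor: it wraps existing data and applies
// the scale factor lazily, on each element access, so no scaled temporary
// vector is ever materialized.
class scaled1D {
public:
  scaled1D(const double* data, std::size_t n, double alpha)
    : data_(data), n_(n), alpha_(alpha) { }
  std::size_t size() const { return n_; }
  double operator[](std::size_t i) const { return alpha_ * data_[i]; }
private:
  const double* data_;
  std::size_t n_;
  double alpha_;
};

// Helper function in the style of MTL's scaled().
inline scaled1D scaled(const double* x, std::size_t n, double alpha) {
  return scaled1D(x, n, alpha);
}

// A single generic add() then serves both plain and scaled arguments;
// no separate "daxpy" variant with an alpha parameter is needed.
template <class VecX>
void add(const VecX& x, double* y) {
  for (std::size_t i = 0; i < x.size(); ++i)
    y[i] += x[i];
}
```

Calling add(scaled(x, n, alpha), y) computes y <- alpha x + y with the scaling fused into the single traversal, which is the point of the adaptor approach.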
The example below shows how the matrix-vector multiply algorithm (generically written to compute y <- A x) can also compute y <- alpha A^T x with the use of adaptors to transpose A and to scale x by alpha. Note that the adaptors cause the appropriate changes to occur within the algorithm; they are not evaluated before the call to mtl::mult() (which would hurt performance). The adaptor technique drastically reduces the amount of code that must be written for each algorithm. Section 5.8 discusses the details of how these adaptor classes are implemented.

// y <- alpha * A^T x
mult(trans(A), scaled(x, alpha), y);

Function Name                        Operation
Vector Algorithms
set(x,alpha)                         x_i <- alpha, for all i
scale(x,alpha)                       x <- alpha x
s = sum(x)                           s <- sum_i x_i
s = one_norm(x)                      s <- sum_i |x_i|
s = two_norm(x)                      s <- (sum_i x_i^2)^(1/2)
s = infinity_norm(x)                 s <- max_i |x_i|
i = max_index(x)                     i <- index of max_i |x_i|
s = max(x)                           s <- max_i x_i
s = min(x)                           s <- min_i x_i
Vector-Vector Algorithms
copy(x,y)                            y <- x
swap(x,y)                            y <-> x
ele_mult(x,y,z)                      z <- y * x   (element-wise)
ele_div(x,y,z)                       z <- y / x   (element-wise)
add(x,y)                             y <- x + y
s = dot(x,y)                         s <- x^T y
s = dot_conj(x,y)                    s <- x^T conj(y)
Matrix Algorithms
set(A,alpha)                         A <- alpha
scale(A,alpha)                       A <- alpha A
set_diagonal(A,alpha)                A_ii <- alpha
s = one_norm(A)                      s <- max_j (sum_i |a_ij|)
s = infinity_norm(A)                 s <- max_i (sum_j |a_ij|)
transpose(A)                         A <- A^T
Matrix-Vector Algorithms
mult(A,x,y)                          y <- A x
mult(A,x,y,z)                        z <- A x + y
tri_solve(T,x)                       x <- T^(-1) x
rank_one_update(A,x,y)               A <- A + x y^T
rank_two_update(A,x,y)               A <- A + x y^T + y x^T
Matrix-Matrix Algorithms
copy(A,B)                            B <- A
swap(A,B)                            B <-> A
add(A,C)                             C <- A + C
ele_mult(A,B,C)                      C <- B * A   (element-wise)
mult(A,B,C)                          C <- A B
mult(A,B,C,E)                        E <- A B + C
tri_solve(T,B)                       B <- T^(-1) B

Table 4.1. MTL generic linear algebra algorithms.

Adaptor Class                        Helper Function
scaled1D                             scaled(x)
scaled2D                             scaled(A)
strided1D                            strided(x)
row/column orien                     trans(A)
row orien and strided offset         rows(A)
column orien and strided offset      columns(A)

Table 4.2. MTL adaptor classes and helper functions for creating algorithm permutations.

Function Invocation                            Operation
add(x,y)                                       y <- x + y
add(scaled(x,alpha),y)                         y <- alpha x + y
add(x,scaled(y,beta))                          y <- x + beta y
add(scaled(x,alpha),scaled(y,beta))            y <- alpha x + beta y

Table 4.3. Permutations of the add() operation made possible with the use of the scaled() adaptor helper function.

CHAPTER 5. MTL COMPONENTS

// usage:
typedef matrix<double, rectangle, dense, row_major>::type myMatrix;

// definition:
template < class T, class Shape = rectangle,
           class Storage = dense, class Orien = row_major >
struct matrix {
  typedef typename IF< EQUAL< Shape::id, RECT >::RET,
            typename gen_rect<T, Storage, Orien>::RET,
            typename IF< EQUAL< Shape::id, DIAG >::RET,
              typename gen_diag<T, Storage, Orien>::RET,
              generator_error
            >::RET
          >::RET type;
};
The generator class consists of logic that is evaluated at compile time. The IF construct is itself a simple generator class that selects between two types depending on a condition (which must be a value known at compile time). The following code shows how the IF class can be implemented.

template <bool Condition, class A, class B>
struct IF { typedef error_type RET; };

template <class A, class B>
struct IF<true, A, B> { typedef A RET; };

template <class A, class B>
struct IF<false, A, B> { typedef B RET; };
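The selection performed by IF can be verified with typeid; the snippet below restates a self-contained version of IF (with error_type as a placeholder struct) so that it compiles on its own:

```cpp
#include <cassert>
#include <typeinfo>

struct error_type {};

// Compile-time type selection via partial specialization: the bool
// condition picks which specialization, and hence which RET, is used.
template <bool Condition, class A, class B>
struct IF { typedef error_type RET; };

template <class A, class B>
struct IF<true, A, B> { typedef A RET; };

template <class A, class B>
struct IF<false, A, B> { typedef B RET; };
```

Because the condition is a compile-time constant, the unselected branch contributes no code to the generated program.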
5.3 MTL Concepts

In the context of generic programming, the term concept is used to describe the collection of requirements that a template argument must meet for the template function or templated class to compile and operate properly. In many respects a concept is similar to an interface description, in that a concept specifies which methods a class must implement. In addition, a concept can require that a class make certain internal type definitions, and a concept can place constraints on the behavior of the methods, such as complexity guarantees. If a class fulfills the requirements of a concept, the class is said to model the concept. A concept can extend another concept, which is called refinement. This terminology was adopted from the SGI STL documentation [3], and most of the concepts from the STL are used in MTL. The bold sans serif font is used for all concepts.
5.3.1 Matrix

The central concept in the Matrix Template Library is of course the Matrix. An MTL Matrix can be thought of as a Container of Containers (referred to as a 2D Container). As an STL Container, an MTL Matrix has begin() and end() methods which return iterators for traversing over the 2D Container. These iterators dereference to give a 1D Container, which also has begin() and end() functions. The MTL implements a large variety of matrix formats, all with very different data representations and implementation details. However, the same MTL Matrix interface is provided for all of the matrices. Table 5.1 and Table 5.2 list the requirements of the Matrix concept. In addition, Matrix is a refinement of Container, so the requirements listed here are in addition to those of the Container concept. In the tables, X refers to the type that models Matrix and A refers to an object of type X. The symbols m, n, i, j, row_start, and column_start are all of unsigned integral type. split_rows and split_columns are Containers containing integral values.

type definition       description
X::shape              Tag to describe the shape of the matrix
X::orientation        Either row_tag or column_tag
X::sparsity           Either dense_tag or sparse_tag
X::OneD               The type of the inner containers (rows, columns, or diagonals)
X::OneDRef            The reference type for OneD
X::submatrix_type     The type for a submatrix of X
X::partition_type     The type of a partitioned X
X::value_type         The element type stored in X

Table 5.1. Matrix associated types, in addition to those of Container.
5.3.2 Vector

The MTL Vector concept is a Container in which every element has a corresponding index. The elements do not have to be sorted by their index, and the indices do not necessarily have to start at 0. Also, the indices do not have to form a contiguous range. The iterator type must be a model of IndexedIterator, which provides access methods to the indices. Vector is not a refinement of RandomAccessContainer (even though Vector defines operator[]) because Vector does not guarantee amortized constant time for that operation (to allow for sparse vectors). Note also that the invariant a[n] == *advance(a.begin(), n) that applies to RandomAccessContainer does not apply to Vector, since a[n] is defined for Vector to return the element with index n. So a[n] == *i if and only if i.index() == n. Table 5.3 lists the associated types and required methods of Vector.
5.3.3 IndexedIterator

IndexedIterator is the iterator concept for iterators of Vectors and Matrices. An IndexedIterator provides access to the indices, as well as the elements, of a Vector or Matrix. For instance, given an iterator i of a Vector, i.index() gives the index corresponding to the element stored at *i. For IndexedIterators inside of Matrices, the row and column indices corresponding to the current position of the iterator can be accessed through the i.row() and i.column() methods. In this way the IndexedIterator concept hides the differences between matrix types such as banded and sparse, allowing the MTL algorithms to be more generic. Table 5.4 gives the requirements for IndexedIterator.
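A minimal model of IndexedIterator for a sparse vector might look as follows. This is an illustrative sketch with hypothetical names, not MTL code; MTL's actual iterators carry more machinery:

```cpp
#include <cassert>
#include <cstddef>

// Sketch: an iterator over a sparse vector stored as parallel
// (values, indices) arrays. Dereferencing yields the stored value;
// index() reports the logical position of that value in the vector.
class sparse_iterator {
public:
  sparse_iterator(const double* v, const std::size_t* idx, std::size_t pos)
    : v_(v), idx_(idx), pos_(pos) { }
  double operator*() const { return v_[pos_]; }
  std::size_t index() const { return idx_[pos_]; }
  sparse_iterator& operator++() { ++pos_; return *this; }
  bool operator!=(const sparse_iterator& o) const { return pos_ != o.pos_; }
private:
  const double* v_;
  const std::size_t* idx_;
  std::size_t pos_;
};
```

An algorithm written against this interface never needs to know whether the indices it sees came from a dense, sparse, or banded layout.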
5.3.4 Indexer

This concept is special in that it defines a matrix aspect. An aspect is a cross-cutting attribute that affects the implementation of a component. By packaging the functionality into a separate class, one can swap in different aspects without changing the main code of the component. The Indexer concept is in charge of mapping indices from normal Matrix coordinates into the TwoD coordinate system. Here is an example of such a mapping for a banded matrix.

         [ 1 2 3     ]
Matrix = [   4 5 6   ]
         [     7 8   ]

The element whose value is 4 is at (1,1) in Matrix coordinates. In TwoD coordinates the 4 is at (1,0). The TwoD mapping of this matrix would look as follows.

       [ 1 2 3 ]
TwoD = [ 4 5 6 ]
       [ 7 8   ]

The TwoD mapping for a diagonal matrix would look as follows.

       [ 3 6   ]
TwoD = [ 2 5 8 ]
       [ 1 4 7 ]
There are three models of the Indexer concept, and each one provides a different mapping. There is the rect_indexer for rectangular matrices, the banded_indexer for banded matrices, and the diagonal_indexer for diagonal matrices. Table 5.5 lists the requirements for Indexer.
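The banded mapping above can be stated as a small index function. The sketch below is an illustration, not MTL's actual indexer code, and it assumes the band is described by its number of sub-diagonals (as bw.first() does in Section 5.3.6):

```cpp
#include <algorithm>
#include <cassert>

// Sketch: map banded Matrix coordinates (i, j) to the TwoD column.
// With 'sub' sub-diagonals, the first stored column of row i is
// max(0, i - sub), so the TwoD column is j shifted left by that amount.
inline int twod_column(int i, int j, int sub) {
  return j - std::max(0, i - sub);
}
```

For the banded example above (no sub-diagonals, two super-diagonals), the element 4 at Matrix coordinates (1,1) lands at TwoD coordinates (1,0), matching the mapping shown.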
5.3.5 OneDIndexer

Models of this concept are used to implement the row() and column() methods of Matrix iterators. Table 5.6 gives the methods required by OneDIndexer.
5.3.6 Offset

Offset is also an aspect concept. This concept is used in the dense2D class to allow for variability in the way matrix elements are mapped to linear memory. There are several models of the Offset concept that allow for different storage types, such as the packed and banded matrix formats used in the BLAS. The elt(i,j) method gives the offset of the (i, j) element. The following code gives the implementations of elt(i,j) for rect_offset, strided_offset, and banded_offset. The implementation for banded_offset is much different since it must take into account the bandwidth of the matrix. ld is the leading dimension of the matrix (the distance from the start of one row to the start of the next), bw.first() gives the number of sub-diagonals, and ndiag is the number of diagonals (bandwidth) of the matrix.

// dense rectangular
size_type elt(size_type i, size_type j) const {
  return i * ld + j;
}

// strided
size_type elt(size_type i, size_type j) const {
  return j * ld + i;
}

// banded
size_type elt(size_type i, size_type j) const {
  return i * ndiag + max(0, bw.first() - i) + j;
}
The implementation for packed_offset uses the formula for an arithmetic series to calculate the offset. Table 5.7 lists the requirements for the Offset concept.
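As an illustration of the arithmetic-series formula, a packed elt(i,j) for a lower-triangular, row-major packed layout might be written as follows. This is a sketch under those assumptions, not MTL's actual packed_offset:

```cpp
#include <cassert>
#include <cstddef>

// Sketch: packed lower-triangular, row-major storage. Rows 0..i-1
// contribute 1 + 2 + ... + i = i*(i+1)/2 stored elements (the
// arithmetic series), then j steps into row i. Valid for j <= i.
inline std::size_t packed_elt(std::size_t i, std::size_t j) {
  return i * (i + 1) / 2 + j;
}
```

The same series-based idea extends to upper-triangular and column-major variants by changing which triangle is summed.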
5.4 MTL Object Memory Model The MTL object memory model is handle based. This differs from the Standard Template Library. Object copies and assignment are shallow. This means that when one vector
object is assigned to another vector object, it becomes a second handle to the same vector. The same applies to matrices. The example below demonstrates this. The vector y reflects the change made to vector x, since they are both handles to the same vector. Vector z, however, does not reflect the change in x, since it is a different vector.

mtl::dense1D<double> x(5, 1.0);
print_vector(x);
> 1 1 1 1 1
mtl::dense1D<double> y = x;
mtl::dense1D<double> z(N);
mtl::copy(x, z);
x[2] = 3;
print_vector(y);
> 1 1 3 1 1
print_vector(z);
> 1 1 1 1 1
The main reason that the handle-based object model was used for MTL was that MTL makes heavy use of adaptor helper functions to modify the arguments to MTL algorithms. The adaptor functions return temporary objects that are typically passed as arguments directly to MTL algorithms. If the MTL algorithms used pass-by-reference for all their arguments, then the C++ compiler would emit a warning, since a temporary should not be bound to a non-const reference. Using const pass-by-reference solves the problem for in parameters but not for out parameters. MTL solves this problem by instead passing the matrix and vector arguments by value, and by making the matrix and vector objects handles to the underlying data structures.

The MTL containers make considerable use of STL containers (though we use our own higher-performance version of the vector container). The underlying STL objects within the MTL OneD containers are reference counted to make the memory management easier for the user. This is especially helpful when the MTL matrix and vector objects are used in the construction of larger object-oriented software systems.

In C++ the reference counting can be non-intrusive to the classes that must be reference counted. This is especially important if one wishes to use containers supplied by other libraries (such as STL). The implementation of reference counting smart pointers in Figure 5.11 derives from the Handle class in [53].
5.5 The MTL Component Architecture

The MTL has a layered architecture that maximizes internal reuse, allowing an additive number of classes to be combined to generate a multiplicative number of concrete components. The implementation components coordinate to implement the particular model of Matrix requested by the user. Figure 5.12 gives an overview of all of the components that go into the MTL matrices and vectors. The notation used is a variation on UML [31, 45]. The solid boxes are MTL classes, and the boxes with dotted lines are template arguments and the corresponding concepts. The models relationship is depicted by placing the class box within the concept box. To construct an MTL Matrix type, components are chosen from the models of 2D Storage, Orienter, Indexer, and Offset, and plugged in to the matrix implementation class, which provides the glue to implement the Matrix interface. The following sections discuss the role of each of the component concepts, and discuss some of the concrete components that are in the MTL.
5.6 TwoD Storage Classes

The matrix formats of the MTL are expressed as 2D Containers. The 2D Containers are neither row-major nor column-major. Instead they are orientation neutral, and the Orienter aspect classes (row_orien and column_orien) are responsible for mapping the matrix coordinates to 2D coordinates. In this way each matrix format can be implemented just once but still provide row-major and column-major versions of the matrix format for the
CHAPTER 5. MTL COMPONENTS
template class refcnt_ptr { typedef refcnt_ptr self; public: refcnt_ptr() : object(0) { } refcnt_ptr(Object* c) : object(c), count(new int(1)) { } refcnt_ptr(const self& x) : object(x.object), count(x.count) { this->inc(); } refcnt_ptr() { this->dec(); } self& operator=(Object* c) { if (object) this->dec(); object = c; count = new int(1); return *this; } self& operator=(const self& x) { if (object) this->dec(); object = x.object; count = x.count; this->inc(); return *this; } Object& operator*() { return *object; } const Object& operator*() const { return *object; } Object* operator->() { return object; } const Object* operator->() const { return object; } void inc() { (*count)++; } void dec() { (*count)--; if (*count ::type Matrix; const Matrix::size_type N = 3; Matrix::size_type large; double dA[] = { 1, 3, 2, 1.5, 2.5, 3.5, 4.5, 9.5, 5.5 }; Matrix A(dA, N, N); // Find the largest element in column 1. large = mtl::max_index(A[0]); // Swap the first row with the row containing the largest // element in column 1. mtl::swap( rows(A)[0] , rows(A)[large]);
5.6.2 compressed2D

This storage class implements the compressed row or column matrix format described in Section 5.1.3. The following example shows two ways in which one can construct compressed matrices. The user can provide external data pointers to the matrix constructor, as for matrix A, or the user can create the matrix from scratch as for B, in which case the matrix manages its own memory.

const int m = 3, n = 3, nnz = 5;
double values[] = { 1, 2, 3, 4, 5 };
int indices[] = { 1, 3, 2, 2, 3 };  /* the stored indices are in     */
int row_ptr[] = { 1, 3, 4, 6 };     /* Fortran style for compatibility */

// Create from pre-existing arrays
typedef matrix< double, rectangle,
                compressed<int, external, index_from_one>,
                row_major >::type MatA;
MatA A(m, n, nnz, values, row_ptr, indices);

// Create from scratch
typedef matrix< double, rectangle, compressed, row_major >::type MatB;
MatB B(m, n);
B(0,0) = 1; B(0,2) = 2;
B(1,1) = 3;
B(2,1) = 4; B(2,2) = 5;
5.6.3 array2D

This class implements the array matrix storage format. array2D is actually implemented with a Container of Containers (whereas the other 2D storage types merely act like Containers of Containers). In this way the array2D storage type is the most flexible, since many different types of 1-D Containers can be used in conjunction with this class. The 1-D Containers include linked list (implemented with sparse1D), tree (sparse1D), dense (std::vector), sparse pair (sparse1D), and compressed (mtl::compressed1D). One special feature of the array2D is that one can swap and assign the 1D Containers inside the array in constant time. The following example shows how one can create matrices of array storage with several different types of OneD Containers. In addition, the example demonstrates how the OneD Containers in an array matrix can be swapped in constant time.

typedef matrix< double, rectangle,
                array< dense >, row_major >::type MatA;
typedef matrix< double, rectangle,
                array< compressed >, row_major >::type MatB;
typedef matrix< double, rectangle,
                array< sparse_pair >, row_major >::type MatC;

MatA A(M, N);
MatB B(M, N);
MatC C(M, N);
// Fill A ...
mtl::copy(A, B);
MatB::Row tmp = B[2];
B[2] = B[3];
B[3] = tmp;
mtl::copy(B, C);
5.6.4 envelope2D

This is the 2D storage type that implements the envelope matrix storage format. Two arrays are used to represent the matrix: the VAL array holds the values in the sparse matrix, and the PTR array points to the diagonal elements in VAL. The 1D segments of VAL are actually dense in the sense that zeros are stored. Each 1D segment starts at the first non-zero element in the row. The index mapping for envelope storage is A(i, j) = VAL(PTR(i) - i + j).
5.7 OneD Containers/Vectors

5.7.1 dense1D

This is the primary MTL class for representing Vectors. This class uses std::vector for its implementation. A dense1D object serves as a handle to the underlying std::vector. The iterators of the std::vector are adapted with dense_iterator so that they model the IndexedIterator concept.
5.7.2 compressed1D

The compressed1D Vector is a sparse vector implemented with a pair of arrays. One array is for the element values, and the other array is for the indices of the elements. The elements are ordered by their index as they are inserted into the compressed1D. compressed1Ds can be used to build matrices with the array storage format, and they can also be used stand-alone as a Vector.

A Sparse Vector:  [ (1.2, 3), (4.6, 5), (1.0, 10), (3.7, 32) ]
Value Array:      [ 1.2, 4.6, 1.0, 3.7 ]
Index Array:      [ 3, 5, 10, 32 ]
The compressed1D::iterator dereferences (*i) to return the element value. One can access the index of that element with the i.index() function, and the array of indices through the nz_struct() method. One particularly useful fact is that one can perform scatters and gathers of sparse elements by using the mtl::copy(x, y) function with a sparse and a dense vector.
5.8 Adaptors

An adaptor class is one that modifies the interface and/or behavior of some base class. Adaptors can be used for a wide range of purposes. They can be used to modify the interface of a class to fit the interface expected by some client code [28], or to restrict the interface of a class. The STL std::stack adaptor is a good example of the latter: it wraps up a container and restricts the allowed operations to push(), pop(), and top(). A good example of an adaptor that modifies the behavior of a class without changing its interface is the reverse iterator of the STL.

An adaptor class can be implemented with inheritance or aggregation (containment). The use of inheritance is attractive since it can reduce the amount of code that must be written for the adaptor, though there are circumstances in which it is not appropriate. One such situation is where the type to be adapted could be a built-in type (a pointer, for example). This is one of the reasons why the reverse iterator of the STL is not implemented with inheritance.
5.8.1 sparse1D

The sparse1D class adapts an STL-style container into a sparse vector. The adapted container must have elements that are index-value pairs. The std::vector, std::list, and std::set STL containers all work well with the sparse1D adaptor. The time and space complexity of the various operations on a sparse1D container depend on the adapted container. For instance, random insertion into a set is O(log n) while it is O(n) for a vector. The main purpose of the sparse1D container in the MTL is to allow for flexible construction of many sparse matrix types through composing different types of sparse vectors.

The code below is a short example of creating a sparse1D with a set, inserting a few elements, and then accessing the values and indices of the resulting vector. The normal iterators of the sparse1D, returned by begin(), give access to the values of the elements (through dereference) and also the indices (through the index() method on the iterator). If one wishes to view the indices only (the non-zero structure of the vector), one can use the nz_struct() method to obtain a container consisting of the indices of the elements in the sparse vector.

typedef mtl::sparse1D< std::set< mtl::entry1<double> > > SparseVec;
SparseVec x;
for (int i = 0; i < 5; ++i)
  x[i*2] = i;
std::ostream_iterator<double> couti(std::cout);
std::copy(x.begin(), x.end(), couti);
SparseVec::IndexArray ix = x.nz_struct();
std::copy(ix.begin(), ix.end(), couti);

> 01234
> 02468
The value type of the underlying containers must be mtl::entry1, which is just an index-value pair. The elements are ordered by their index as they are inserted.
5.8.2 Scaling Adaptors

For performance reasons basic linear algebra routines often need to incorporate secondary operations into the main algorithm, such as scaling a vector while adding to another vector. The daxpy() BLAS routine (y <- alpha * x + y) is a typical example, with its alpha parameter.

void daxpy(int n, double alpha, double* dx, int incx,
           double* dy, int incy);

The problem with the BLAS approach is that it becomes necessary to provide many different versions of the algorithm within the daxpy() routine, to handle special cases with respect to the value of alpha. If alpha is 1 then it is not necessary to perform the multiplication. If alpha is 0 then the daxpy() can return immediately. When there are two or more arguments in the interface that can be scaled, the permutations of cases result in a large amount of code.

The MTL uses a family of iterator, vector, and TwoD container adaptors to solve this problem. In the MTL, only one version of each algorithm is written, and it is written without regard to scaling. There are no scalar arguments in MTL routines. Instead, the vectors and matrices can optionally be modified with adaptors so that they are transparently (from the point of view of the algorithm) scaled as their elements are accessed. If the scalar value is set at compile time, one can rely on the compiler to optimize and create the appropriate specialized code. The scaling adaptors should also be coded to handle the case where the specialization needs to happen at run time, though this is more complicated and is a current area of research. Note that the BLAS algorithms do all specialization at run time, which causes some unnecessary dispatch overhead in situations where the specialization could have been performed at compile time.
scale_iterator

The first member of the scaling adaptor family is the scale_iterator. This is an iterator adaptor that multiplies each element by a scalar as the iterator dereferences. Below is an example of how the scale_iterator could be used in conjunction with the STL transform algorithm. The transform algorithm is similar to the daxpy() routine; the vectors x and y are added and the result goes into vector z. The elements from x are scaled by alpha. The operation is z <- alpha * x + y. Note that this example is included to demonstrate a concept; it does not show how one would typically scale a vector using MTL. The MTL algorithms take vector objects as arguments, instead of iterators, and there is a scaled1D adaptor that will cause a vector to be scaled.

double alpha;
std::vector<double> x, y, z;
// set the lengths of x, y, and z
// fill x, y and set alpha
typedef std::vector<double>::iterator iter;
scale_iterator<iter> start(x.begin(), alpha);
scale_iterator<iter> finish(x.end());
// z = alpha * x + y
std::transform(start, finish, y.begin(), z.begin(),
               std::plus<double>());
The following example is an excerpt from the scale_iterator implementation. The base iterator is stored as a data member (instead of using inheritance) since the base iterator could possibly be just a basic pointer type, and not a class type. The scalar value is passed into the iterator's constructor and also stored as a data member. The scalar is then used in the dereference operator*() to multiply the element from the vector. The increment operator++() merely calls the underlying iterator's method.

template <class Iterator>
class scale_iterator {
public:
  typedef typename std::iterator_traits<Iterator>::value_type value_type;
  scale_iterator(const Iterator& x, const value_type& a)
    : current(x), alpha(a) { }
  value_type operator*() const { return alpha * *current; }
  scale_iterator& operator++() { ++current; return *this; }
  // ...
protected:
  Iterator current;
  value_type alpha;
};
At first glance one may think that the scale_iterator introduces overhead that would have significant performance implications for inner loops. In fact, modern compilers will inline operator*(), and propagate the scalar's value to where it is used if it is a constant (known at compile time). This results in code with no extra overhead.
scaled1D

The scaled1D class wraps up any OneD class, and it uses the scale_iterator to wrap up the OneD iterators in its begin() and end() methods. The main job of this class is merely to pass along the scalar value (alpha) to the scale_iterator. An excerpt from the scaled1D class is given below.

template <class Vector>
class scaled1D {
public:
  typedef typename Vector::value_type value_type;
  typedef scale_iterator<typename Vector::const_iterator> const_iterator;
  scaled1D(const Vector& r, value_type a) : rep(r), alpha(a) { }
  const_iterator begin() const { return const_iterator(rep.begin(), alpha); }
  const_iterator end() const { return const_iterator(rep.end(), alpha); }
  value_type operator[](int n) const { return *(begin() + n); }
protected:
  Vector rep;
  value_type alpha;
};
The helper template function scaled() is provided to make the creation of scaled vectors easier.

template <class Vector, class T>
scaled1D<Vector> scaled(const Vector& v, const T& a) {
  return scaled1D<Vector>(v, a);
}
The example below shows how one could perform the scaling step in a Gaussian elimination. In this example the second row of matrix A is scaled by 2, using a combination of the MTL vecvec::copy algorithm and the scaled1D adaptor.

// A = [[5.0, 5.5, 6.0],
//      [2.5, 3.0, 3.5],
//      [1.0, 1.5, 2.0]]
double scalar = A(0,0) / A(1,0);
Matrix::RowVectorRef row = A.row_vector(1);
mtl::copy(scaled(row, scalar), row);
// A = [[5.0, 5.5, 6.0],
//      [5.0, 6.0, 7.0],
//      [1.0, 1.5, 2.0]]
As one might expect, there is also a scaling adaptor for matrices in the MTL, the scaled2D class. It builds on top of the scaled1D in a similar fashion to the way the scaled1D is built on the scale_iterator.
5.8.3 Striding Adaptors

A problem similar to that of scaling is striding. A user often wishes to operate on a vector that is not contiguous in memory but laid out at a constant stride, such as a row in a column-oriented matrix. Again, a library writer would otherwise need to add an argument to each linear algebra routine to specify the striding factor, and the algorithms would need to handle a stride of 1 differently from other striding factors for performance reasons. Again this causes the amount of code to balloon, as is the case for the BLAS. This problem can also be handled with adaptors. The MTL has a strided iterator and a strided vector adaptor. The implementation of the strided adaptors is very similar to that of the scaled adaptors.
expression                           return type       note
A(m, n)                              X                 Normal constructor
A(m, n, sub, super)                  X                 Constructor for banded matrices
A(B)                                 X                 Copy constructor
A(data, m, n)                        X                 Constructor for external matrices
A(data, m, n, ld)                    X                 Different leading dimension (ld)
A(data, m, n, ld, sub, super)        X                 External banded matrices
A(data)                              X                 Static sized dimension constructor
A(m, n, nnz, val, ptrs, inds)        X                 Sparse external matrix constructor
A(B, sub, super)                     X                 Banded view constructor
A(stream)                            X                 Matrix Market stream constructor
A(stream)                            X                 Harwell-Boeing stream constructor
A(stream, sub, super)                X                 Matrix Market stream (banded)
A(stream, sub, super)                X                 Harwell-Boeing stream (banded)
A(i, j)                              reference         Element access
A(i, j)                              const_reference   Const element access
A[i]                                 OneDRef           OneD access
A[i]                                 ConstOneDRef      Const OneD access
A.sub_matrix(row_start, row_finish,
             col_start, col_finish)  submatrix_type    Submatrix access
A.partition(split_rows, split_cols)  partition_type    Partitioned matrix creation
A.subdivide(row_split, column_split) partition_type    Subdivide (partition into 4)
A.nrows()                            size_type         Number of rows
A.ncols()                            size_type         Number of columns
A.nnz()                              size_type         Number of non-zero elements
A.sub()                              difference_type   Sub part of bandwidth
A.super()                            difference_type   Super part of bandwidth
A.is_unit()                          bool              Main diagonal all ones?
A.is_upper()                         bool              Matrix shape is upper triangular?
A.is_lower()                         bool              Matrix shape is lower triangular?

Table 5.2. Matrix method requirements, in addition to those of Container
type definition        description
X::iterator            Models IndexedIterator
X::sparsity            Either dense_tag or sparse_tag
X::IndexArray          Container containing the indices of the elements
X::subrange            A Vector of a subrange
X::partition_type      A Vector of Vectors

expression             return type       description
a[n]                   reference         Element access
a[n]                   const_reference   Const element access
a.partition(splits)    partition_type    Partition the vector
a.subdivide(split)     partition_type    Divide in 2
a(s, f)                subrange          Subrange access
a.nnz()                size_type         Number of non-zero elements

Table 5.3. Vector requirements, in addition to those of Container
expression    note
i.row()       Row index access
i.column()    Column index access
i.index()     Vector index access

Table 5.4. IndexedIterator requirements
expression          return type       note
X::dim_type                           Pair type for matrix dimension
x.deref()           OneDIndexer
x.at(p)             dim_type          Map point p from Matrix coordinates to TwoD coordinates
X::twod_dim(dim)    size_type         Calculate the dimensions for the TwoD container
x.nrows()           size_type         Number of rows
x.ncols()           size_type         Number of columns
x.sub()             difference_type   Number of sub-diagonals
x.super()           difference_type   Number of super-diagonals

Table 5.5. Indexer requirements
expression    return type    note
x.row(i)      size_type      Calculate the row index of the given iterator
x.column(i)   size_type      Calculate the column index of the given iterator
x.at(i)       size_type      Calculate the offset to the element with index i

Table 5.6. OneDIndexer requirements
expression                  return type    note
x.elt(i, j)                 size_type      Maps the (i,j) point from TwoD coords to a linear offset
x.oned_offset(i)            size_type      The offset to the start of the ith OneD part of the TwoD
x.oned_length(i)            size_type      The length of the ith OneD part
x.twod_length()             size_type      The size of the outer Container of the TwoD
x.stride()                  size_type      The distance from one element to the next, usually 1
X::size(m, n, sub, super)   size_type      The total size of the memory allocated to the TwoD
x.major()                   size_type      The major dimension size
x.minor()                   size_type      The minor dimension size

Table 5.7. Offset requirements
Chapter 6

High Performance

We have presented many levels of abstraction, and a comprehensive set of algorithms for a variety of matrices, but this elegance matters little if high performance cannot be achieved. In this chapter we will discuss recent advances in languages and compilers that allow abstractions to be used with little or no performance penalty. Furthermore, we will present a set of abstractions, the Basic Linear Algebra Instruction Set (BLAIS) and Fixed Algorithm Size Template (FAST) sub-libraries, that have the specific purpose of generating optimized code. The generation of optimized code is made possible with template meta-programming [55].

There is a common perception that the use of abstraction hurts performance. This is due to a particular set of language features that are used to create abstractions, and to how those language features are implemented. The language features used to create abstractions are listed below.
Procedures   The basic building block of abstraction is the procedure call. For each abstraction level, one needs a set of functions: an interface for the abstraction level. Traditionally a procedure call has incurred a significant overhead (copying parameters to the stack, etc.). Many compilers are now able to inline procedure calls to remove this performance penalty.
Classes   The main tool for data encapsulation is the class language construct. It allows arbitrarily complex sets of data to be grouped together and hidden within an abstraction. Classes (and structures in general) interfere with the register allocation algorithms of many compilers. Optimizing compilers map the local variables of a function to machine registers. This can drastically reduce the number of loads and stores necessary, since the registers are being used to cache data from memory. Many compilers do not recognize that this optimization can also be applied to objects on the stack. They emit a load or store for each access to a data item within the object, which kills performance for codes like STL and MTL that use iterators: objects that have to be accessed over and over again within the inner loops of the code.

A typical example of the problem with small objects comes up in the use of complex numbers in C++. We illustrate the problem with the code below, which calculates the sum of a complex vector. We have written out the pseudo-assembly code for the loop. A load or store results from each access to a complex number, so each loop iteration includes 4 loads and 2 stores. We also show the pseudo-assembly code for the loop after register allocation has been applied, which maps the complex number a to registers (in this case R3 and R5). This version does only 2 loads and 0 stores inside the loop. In modern processors memory access is more expensive than ALU operations, so the reduction in the number of loads and stores has a large impact on overall performance.

complex<double> a, b[N];
// take the sum of vector b
for (int i = 0; i < N; ++i)
  a += b[i];
// pseudo-assembly code (unoptimized)
looptop:
  CMP i N
  BRZ loopend
  LOAD a.real R3
  LOAD b[i].real R4
  ADD R3 R4 R5
  STORE R5 a.real
  LOAD a.imag R3
  LOAD b[i].imag R4
  ADD R3 R4 R5
  STORE R5 a.imag
  JMP looptop
loopend:

// pseudo-assembly code (optimized)
  LOAD a.real R3
  LOAD a.imag R5
looptop:
  CMP i N
  BRZ loopend
  LOAD b[i].real R4
  ADD R3 R4 R3
  LOAD b[i].imag R6
  ADD R5 R6 R5
  JMP looptop
loopend:
  STORE R3 a.real
  STORE R5 a.imag
Polymorphism   The language feature that enables generic programming is polymorphism. It enables an algorithm to work with many data types and data structures instead of just one in particular. The first language feature in C++ that enabled polymorphism was the virtual function call, which allows a function call to dispatch to a specialized version based on the object type. With virtual functions, the object type is not known until run time, so the dispatch happens during program execution. This is called dynamic polymorphism. A disadvantage of the run-time dispatch is that it adds some overhead to the cost of a normal function call. Even worse, virtual function calls interfere with inlining. Virtual function calls cannot be inlined because the object type is not known at compile time, when the inlining optimization is applied, so the compiler cannot decide which specialized version to dispatch to.

Two important advances have been introduced into C++ compilers that remove the performance penalties associated with abstraction. The first is a language feature and the second is a compiler optimization.
Static Polymorphism   The addition of templates to the C++ language creates a way for functions to be selected at compile time based on object type. With static polymorphism the object type is known at compile time, which enables the compiler to hard-code the dispatch decision. As a result, template functions can be inlined in the same way as regular functions. In addition, many C++ compilers have improved their ability to inline functions in general, to the point where one can know with a relatively high degree of confidence that a function labeled inline is really being inlined. Even extremely complicated layers of functions can be completely flattened out. We demonstrate this with the performance achieved by the BLAIS and FAST libraries.
Lightweight Object Optimization   The performance penalty associated with the use of classes and structures can be removed with a relatively straightforward optimization (though the implementation is difficult). Each object is removed from the code and replaced with its individual parts. This happens in a recursive fashion until only basic data types (integers, floats, etc.) remain. Then each reference through an object to one of its parts is replaced with a direct reference to the part. Note that this is only applied to objects on the stack (local variables). The Kuck and Associates C++ compiler [33] performs this optimization, which is also known as scalar replacement of aggregates [41]. The end result is that the data items within the objects can then be mapped appropriately to machine registers by the normal register allocation algorithms.

With the use of template functions, and with lightweight object optimization, it is now possible to introduce abstractions with no performance penalty. This allowed us to design MTL in a generic fashion, composing containers and allowing a mix-and-match model with the algorithms. At this time, the compilers that we know perform both of these optimizations are the Kuck and Associates, Inc. C++ compiler and the SGI C++ compiler.1
The egcs compiler (though it has great language support) is missing the lightweight object optimization as well as several other optimizations. We will be conducting a complete survey of current compilers' optimization levels in the near future.

Even after the "abstraction" barrier has been removed, there are yet more optimizations that must be applied to achieve high performance. As alluded to in the introduction, achieving high performance on modern microprocessors is a difficult task, requiring many complex optimizations. Today's compilers can aid somewhat in this area, though to achieve "ideal" performance one must still hand-optimize the code.
Compiler Optimizations: Unrolling and Instruction Scheduling   Modern compilers can do a great job of unrolling simple loops and scheduling instructions, but typically only for specific (recognizable) cases. There are many ways, especially in C and C++, to interfere with the optimization process. The MTL containers and algorithms are designed to result in code that is easy for the compiler to optimize. Furthermore, the iterator abstraction makes inter-compiler portability possible, since it encapsulates how looping is performed. This is discussed below in Sections 6.1 and 6.2.
1 We use the C++ compiler from Kuck and Associates, Inc. (KAI) for development. We have found that the easiest way to check that a function is inlined is to inspect the intermediate C code that KAI's compiler generates. With the proper use of compiler flags we have found that the KAI compiler does a reliable job. Hopefully, in the future, compilers will give more direct feedback on the optimizations that they are performing.
Algorithmic Blocking   To obtain high performance on a modern microprocessor, an algorithm must properly exploit the associated memory hierarchy and pipeline architecture. Today's compilers are not able to apply all of these transformations, so the programmer must apply some optimizations by hand. To make matters worse, the transformations are somewhat machine dependent: the number of registers, the size of the cache, and other machine characteristics affect the blocking sizes. This makes it difficult to express high performance algorithms in a portable fashion. Our solution to this problem is discussed in Section 6.3.
6.1 Mayfly Components

A specific coding style must be used in order for the compiler inlining and lightweight object optimization to occur. As mentioned above, using template functions is a step in the right direction. In order for the lightweight object optimization to occur, one must code the interface objects (iterators, vectors, matrices) in a particular style, which we have named the Mayfly pattern (cataloging design patterns was popularized by the "Gang of Four" [28]), after the insect well known for its short life span. Mayfly components are objects that live on the stack (and often only in registers), and whose sole purpose is to provide a generic interface to some data structure. The iterators of the Standard Template Library are good examples of mayflies.

Mayflies also play a prominent role in the Matrix Template Library. The MTL supports a wide variety of matrix formats including compressed row/column, diagonal, banded, packed, and envelope (these formats appear in popular linear algebra libraries such as LAPACK, SPARSKIT, etc.). It is the role of the mayflies in MTL to make all of these matrix formats appear to have the same interface, while introducing zero overhead. The MTL matrix interface corresponds to the container of containers abstraction,
i.e., from the user's perspective all MTL matrices behave like an STL container such as vector< vector<double> >. Of course the commonly used matrix formats do not have this "container of containers" structure, at least not in a concrete sense. For example, the most common matrix format is a single contiguous array, where sections of the array are considered to represent rows or columns of the matrix. When one wants to operate on a particular row, some pointer arithmetic is performed to move a pointer to the beginning of the row, and the pointer can then be incremented to traverse the row (or one can index into the array with an offset).
Mayflies in MTL   In order to provide a "container of containers" interface to this dense matrix format, the MTL creates row objects on the fly. That is, when the user requests a particular row, the MTL matrix creates a row object. Suppose one wishes to add the elements of row 3 to row 5 of an MTL matrix. One would write the following expression, where A is some MTL matrix.

add(A[3], A[5]);

The operator[](n) creates a vector object. The object is not allocated on the heap (with new); instead it is just returned by value and copied into the add() routine. This keeps the object on the stack, from which many compilers can further optimize so that the object lives solely in registers. In addition, most functions with which the mayfly is involved become inlined, and therefore any overhead of passing by value is removed. These are the characteristics of mayflies: they are passed by value, live on the stack or in registers, are small enough to induce little overhead, and provide an abstract interface to some lower-level data structure. To tie this into the example above, the row vector object did not exist before the call to operator[](n), and it is gone once the function call to add() is over, hence the name mayfly.
MTL matrices export iterators in the same way that an STL vector< vector<double> > does. There are 2-D iterators that traverse the outer Container, and there are 1-D iterators that traverse down a particular row (or column). The following example revisits the generic algorithm for matrix-vector multiplication with a focus on the mayfly components. There are several mayfly objects involved in this algorithm. Of course the iterators i and j are mayflies; they are local to this function. In addition, the vector objects that result from dereferencing i, the two expressions (*i), are mayflies.

template <class Matrix, class VecX, class VecY>
void mult(const Matrix& A, const VecX& x, VecY y)
{
  typename Matrix::const_iterator i;
  typename Matrix::OneD::const_iterator j;
  for (i = A.begin(); i != A.end(); ++i)
    for (j = (*i).begin(); j != (*i).end(); ++j)
      y[j.row()] += *j * x[j.column()];
}
Mayflies are high performance   In many cases the overhead associated with mayflies can be reduced to zero, resulting in an optimally lightweight interface. Several things help make this happen. First, by design, the functions in which the mayflies are involved will typically be inlined. Second, the data elements within a mayfly can be mapped to registers by a good optimizing compiler (lightweight object optimization [33, 41]). In this way, any overhead that might have been incurred in passing the object in a function call, or in the extra loads and stores that would have occurred due to the presence of a structure, is removed by the compiler. The resulting assembly code ends up being the same as it would have been if one had written the code in a non-generic fashion, resulting in the same high performance.
6.2 High Performance Iterators

Iterators control how looping is performed, and therefore their design can make a large difference to the performance of a particular code. The biggest concern here is identifying what kind of loops the underlying backend compiler will optimize (perform unrolling, instruction scheduling, etc.). The design space includes whether to increment a pointer or an integer offset, and whether to use the less-than or not-equal operator for the loop termination condition. One would think modern compilers should produce equally good code for all of these cases, but this is not so. There can be a factor of 2 or more difference in performance depending on the type of loop used. The four variations on the traversal method can be seen in the following example loop, which computes a dot product of two vectors.

int i;
double *x, *y, *xp, *yp;

// integer, != operator
for (i = 0; i != N; ++i)
  tmp += x[i] * y[i];

// integer, < operator
for (i = 0; i < N; ++i)
  tmp += x[i] * y[i];

// pointer, != operator
yp = y;
for (xp = x; xp != x + N; ++xp, ++yp)
  tmp += *xp * *yp;

// pointer, < operator
yp = y;
for (xp = x; xp < x + N; ++xp, ++yp)
  tmp += *xp * *yp;
Table 6.1 shows the variations in performance on a loop (dot product) for three different computer architectures/compilers. The dot product computation was chosen because there are no aliasing issues and it includes the typical add/multiply floating point operation. The native C compiler was used for each machine with maximum optimization flags turned on.
                                    comparison type
  machine          iterator type     <          !=
  ----------------------------------------------------
  UltraSPARC 30    integer         180.085     44.6187
                   pointer         180.319     44.4693
  RS6000 590       integer          47.7829    47.78
                   pointer          47.7595    20.8623
  R10000           integer         102.512    106.101
                   pointer          81.2678    72.648

Table 6.1. The effect of iterator and comparison operator choice on performance (in Mflops) for dot product with the Sun C, IBM XLC, and SGI C compilers.
From Table 6.1 we can surmise that by choosing to increment an integer offset, and by using the less-than comparison operator, we can achieve top performance with all of the compilers tested. This result differs from the findings of PHiPAC [7], which suggest that one should always use the not-equal operator because it is more efficiently implemented on some architectures. Our experience has shown that the architecture implementation is not as important as whether the compiler knows how to optimize a loop that uses a not-equal comparison operator. Of course, this test did not include all C compilers, but it does warn us that we ought to write our code in such a way as to make it easy to change the operator used. In C++ this is easy to do with an extra layer of abstraction. Instead of using a particular operator in each loop, we call the not_at() template function, which then invokes the proper comparison operator. Now there is only one line that needs to change if the compiler or architecture has a particular operator preference. Another reason for using the not_at() approach is that if one wants to make algorithms generic, the less-than operator should not be used, since most iterator types do not
support less-than. This is why the not-equal operator is used for most loops in the STL. The not_at() function solves this problem by allowing the less-than operator to be used for random access iterators and the not-equal operator for all others. The code below shows the implementation of the not_at() family of functions. The last not_at() function dispatches to either the first or second version depending on the iterator's category type. The dispatch happens at compile time, and all the functions are easily inlined by a modern C++ compiler. This technique has been referred to as external polymorphism.

  template <class RandomIter1, class RandomIter2>
  inline bool not_at(const RandomIter1& a, const RandomIter2& b,
                     random_access_iterator_tag) {
    return a < b;
  }

  template <class Iter1, class Iter2, class AnyTag>
  inline bool not_at(const Iter1& a, const Iter2& b, AnyTag) {
    return a != b;
  }

  template <class Iter1, class Iter2>
  inline bool not_at(const Iter1& a, const Iter2& b) {
    typedef typename iterator_traits<Iter1>::iterator_category Cat;
    return not_at(a, b, Cat());
  }
Now that we have a way of picking the proper operator, and know from the experiment that an integer offset should be incremented instead of a pointer, we are ready to implement the MTL iterators. An excerpt from the dense iterator class is shown below. This is the iterator used in the dense1D container. We use the integer pos to keep track of the iterator position, and then use it as an offset in the operator* method. This implementation results in an iterator that produces loops that are easier to optimize for the compilers we have tested. The typical implementation of an iterator for a contiguous-memory container such as a vector is just a pointer, but as the discussion above points out, this is not always the best choice. One nice thing about the iterator
abstraction is that we are not forced into a particular implementation; we can change the implementation at any time without affecting the code that uses the iterator.

  template <class Iter>
  class dense_iterator {
  public:
    dense_iterator(Iter s, size_type i) : start(s), pos(i) { }
    size_type index() const { return pos; }
    reference operator*() const { return *(start + pos); }
    self& operator++() { ++pos; return *this; }
    bool operator<(const dense_iterator& x) const { return pos < x.pos; }
  protected:
    Iter start;
    size_type pos;
  };
6.3 High Performance & Template Metaprogramming

The bane of portable high performance numerical linear algebra is the need to tailor key routines to specific execution environments. To obtain high performance on a modern microprocessor, an algorithm must make efficient use of the cache, registers, and the pipeline architecture (typically through careful loop blocking and structuring). Ideally, one would like to be able to express high performance algorithms in a portable fashion, but there is not enough expressiveness in languages such as C or Fortran to do so. This is because the blocking done at the lowest level, for registers and the pipeline, affects the number and type of instructions that must be in the inner loop, as is shown in the example below. The variation of the number of operations in the loop cannot be expressed directly in C or Fortran. Recent efforts (PHiPAC [7], ATLAS [59]) have resorted to going outside the language, i.e., to custom code generation systems, to gain the kind of flexibility needed to generate the inner loop in a portable fashion. The BLAIS and FAST libraries use the template metaprogramming features of C++ to express flexible unrolling and blocking factors for inner loops.
  // need to unroll by two for machine X
  for (int i = 0; i < N; i += 2) {
    y[i]   += a * x[i];
    y[i+1] += a * x[i+1];
  }

  // need to unroll by three for machine Y
  for (int i = 0; i < N; i += 3) {
    y[i]   += a * x[i];
    y[i+1] += a * x[i+1];
    y[i+2] += a * x[i+2];
  }
In this section we describe a collection of high performance kernels for basic linear algebra, called the Basic Linear Algebra Instruction Set (BLAIS), together with the Fixed Algorithm Size Template (FAST) Library. The kernels encapsulate small fixed-size computations to provide building blocks for numerical libraries in C++. The sizes are template parameters of the kernels, so they can easily be configured to a specific architecture for portability. In this way the BLAIS delivers the power of code generation systems such as PHiPAC [7] and ATLAS [59], but with a simple and elegant interface, so that one can write flexible-sized block algorithms without the complications of a code generation system. The BLAIS specification contains fixed-size algorithms with functionality equivalent to that of the Level-1, Level-2, and Level-3 BLAS [36, 23, 22]. The BLAIS routines are themselves implemented using the FAST library, which contains general purpose fixed-size algorithms equivalent in functionality to the generic numerical algorithms of the Standard Template Library (STL) [52]. In the following sections, we describe the implementation of the FAST algorithms and then show how the BLAIS are constructed from them. Next, we demonstrate how the BLAIS can be used as high-level instructions (kernels) to construct a dense matrix-matrix product. Finally, experimental results show that the performance obtained by our approach can equal and even exceed that of vendor-tuned libraries.
6.4 Fixed Algorithm Size Template (FAST) Library

The FAST Library includes generic algorithms such as transform(), for_each(), inner_product(), and accumulate(), mirroring those found in the STL. The interface closely follows that of the STL: all input is in the form of iterators (generalized pointers). The only difference is that the loop-end iterator is replaced by a count template object. The example below demonstrates the use of both the STL and FAST versions of transform() to realize an AXPY-like operation (y <- x + y).
The first1 and last1 parameters are iterators for the first input container (indicating the beginning and end of the container, respectively). The first2 parameter is an iterator indicating the beginning of the second input container. The result parameter is an iterator indicating the start of the output container. The binary_op parameter is a function object that combines the elements from the first and second input containers into the result container.

  int x[4] = {1,1,1,1}, y[4] = {2,2,2,2};

  // STL
  template <class InIter1, class InIter2, class OutIter, class BinaryOp>
  OutIter transform(InIter1 first1, InIter1 last1, InIter2 first2,
                    OutIter result, BinaryOp binary_op);

  std::transform(x, x + 4, y, y, plus<int>());

  // FAST
  template <int N, class InIter1, class InIter2, class OutIter, class BinOp>
  OutIter transform(InIter1 first1, cnt<N>, InIter2 first2,
                    OutIter result, BinOp binary_op);

  fast::transform(x, cnt<4>(), y, y, plus<int>());
The difference between the STL and FAST algorithms is that STL accommodates containers of arbitrary size, with the size specified at run time. FAST also works with containers of arbitrary size, but the size is fixed at compile time. In the example below we show how the FAST transform() routine is implemented. We use a tail-recursive algorithm to achieve complete unrolling; there is no actual loop in the FAST transform(). The template-recursive calls are inlined, resulting in a sequence of N copies of the inner loop statement. This technique, called template metaprogramming, has been used to great effect in the Blitz++ Library and is explained in [16, 55].

  // The general case
  template <int N, class InIter1, class InIter2, class OutIter, class BinOp>
  inline OutIter
  fast::transform(InIter1 first1, cnt<N>, InIter2 first2,
                  OutIter result, BinOp binary_op) {
    *result = binary_op(*first1, *first2);
    return transform(++first1, cnt<N-1>(), ++first2, ++result, binary_op);
  }

  // The N = 0 case, to stop the template recursion
  template <class InIter1, class InIter2, class OutIter, class BinOp>
  inline OutIter
  fast::transform(InIter1 first1, cnt<0>, InIter2 first2,
                  OutIter result, BinOp binary_op) {
    return result;
  }
6.5 Basic Linear Algebra Instruction Set (BLAIS)

The BLAIS library is implemented directly on top of the FAST Library, as a thin layer that maps generic FAST algorithms into fixed-size mathematical operations. There is no added overhead in the layering because all the function calls are inlined. Using the FAST library allows the BLAIS routines to be expressed in a very simple and elegant fashion.
The following discussion looks at one example from each level of the BLAIS library: vector-vector, matrix-vector, and matrix-matrix.
Vector-Vector Operations

The add() routine is typical of the BLAIS vector-vector operations. It is implemented in terms of a FAST algorithm, in this case transform(). The implementation of the BLAIS add() is listed below. The add() function is implemented with a class and its constructor to provide a simple syntax for its use.

  template <int N>
  struct add {
    template <class Iter1, class Iter2>
    inline add(Iter1 x, Iter2 y) {
      typedef typename iterator_traits<Iter2>::value_type T;
      fast::transform(x, cnt<N>(), y, y, plus<T>());
    }
  };
The example below shows how the add() routine can be used. The comment shows the resulting code after the call to add() is inlined. Note that only one add() routine is required to provide any combination of scaling or striding. This is made possible through the use of the scaling and striding adaptors, as discussed in Section 5.8. Any resulting overhead is removed by inlining and lightweight object optimizations [33]. The scl(x, a) call below automatically creates the proper scale iterator out of x and a.

  double x[4], y[4];
  fill(x, x+4, 1);  fill(y, y+4, 5);
  double a = 2;

  vecvec::add<4>(scl(x, a), y);
  // the compiler expands add() to:
  //   y[0] += a * x[0];
  //   y[1] += a * x[1];
  //   y[2] += a * x[2];
  //   y[3] += a * x[3];
Matrix-Vector Operations

To illustrate the BLAIS matrix-vector operations we look at the BLAIS matrix-vector multiply. The algorithm simply carries out a vector add operation for each column of the matrix. Using the same technique as in the FAST library, we write a fixed-depth recursive algorithm which becomes inlined by the compiler. The implementation is listed below.

  // General case
  template <int M, int N>
  struct mult {
    template <class AColIter, class IterX, class IterY>
    inline mult(AColIter A_2Diter, IterX x, IterY y) {
      vecvec::add<M>(scl((*A_2Diter).begin(), *x), y);
      mult<M, N-1>(++A_2Diter, ++x, y);
    }
  };

  // N = 0 case
  template <int M>
  struct mult<M, 0> {
    template <class AColIter, class IterX, class IterY>
    inline mult(AColIter A_2Diter, IterX x, IterY y) {
      // do nothing
    }
  };
Matrix-Matrix Operations

The most important of the BLAIS matrix-matrix operations is the matrix-matrix multiply. This algorithm builds on the BLAIS matrix-vector multiply. The code looks very similar to that of the matrix-vector multiply, except that there are three integer template arguments (M, N, and K), and the inner "loop" contains a call to matvec::mult() instead of vecvec::add(). Remember that the BLAIS matrix-matrix operation is intended to be used in the inner loop (literally) of algorithms, and that the BLAIS operation is completely inlined and expanded. It is therefore perfectly fine to build the matrix-matrix multiply out of the matrix-vector operation, since at this low level cache blocking issues are not a factor.
6.6 BLAIS in a General Matrix-Matrix Product

A typical use of the BLAIS kernels is to construct linear algebra subroutines for arbitrarily sized objects. The fixed-size nature of the BLAIS routines makes them well-suited to performing register-level blocking within a hierarchically blocked matrix-matrix multiplication. Blocking (or tiling) is a well-known optimization that increases the reuse of data while it is in cache and in registers, thereby reducing the memory bandwidth requirements and increasing performance. The code below shows the innermost set of blocking loops for a matrix-matrix multiply. The constants BFM, BFN, and BFK are blocking factors chosen so that c can fit into the registers. The blocking factors describe the size and shape of the submatrices, or blocks, that the matrix is divided into.²

  for (jj = 0; jj < N; jj += BFN)
    for (ii = 0; ii < M; ii += BFM) {
      copy_block c(C + ii*N + jj, BFM, BFN, N);
      for (kk = 0; kk < K; kk += BFK) {
        light2D a(A + ii*K + kk, BFM, BFK);
        light2D b(B + kk*N + jj, BFK, BFN);
        matmat::mult(a, b, c);
      }
    }
A Configurable Recursive Matrix-Matrix Multiply

To obtain the highest performance in a matrix-matrix multiply code, algorithmic blocking must be done at each level of the memory hierarchy. A natural way to formulate this is to write the algorithm in a recursive fashion, with a level of recursion for each level of blocking and memory hierarchy. We take this approach in the MTL algorithm. The sizes and shapes of the blocks at each level are determined by blocking adaptors. Each adaptor contains the information for
² Excessive code bloat is not a problem in MTL because the complete unrolling is only done for very small block sizes.
the next level of blocking. In this way the recursive algorithm is determined by a recursive template data structure (which is set up at compile time). The setup code for the matrix-matrix multiply is shown below. This example blocks for just one level of cache, with 64 x 64 sized blocks; the small 4 x 2 blocks fit into registers. Note that these numbers would normally be constants set in a header file and derived from the results of an automated parameter search facility.

  template <class MatA, class MatB, class MatC>
  void matmat::mult(MatA& A, MatB& B, MatC& C) {
    typename MatA::RegisterBlock A_L0;  typename MatA::Block A_L1;
    typename MatB::RegisterBlock B_L0;  typename MatB::Block B_L1;
    typename MatC::CopyBlock     C_L0;  typename MatC::Block C_L1;

    matmat::__mult(block(block(A, A_L0), A_L1),
                   block(block(B, B_L0), B_L1),
                   block(block(C, C_L0), C_L1));
  }
The recursive algorithm is listed in Figure 6.1. There is a TwoD iterator for each matrix (A_k, B_k, and C_i), as well as a OneD iterator for each (A_ki, B_kj, and C_ij). The matrices have been wrapped with blocked matrix adaptors, so that dereferencing a OneD iterator yields a submatrix. The recursive call is then made on the submatrices A_block, *B_kj, and *C_ij. The bottom-most level of recursion is implemented with a separate function that makes the calls to the BLAIS matrix-matrix multiply and "cleans up" the leftover edge pieces. Since the recursion depth is fixed at compile time, the whole algorithm can be inlined by the compiler.
Optimizing Cache Conflict Misses

Besides blocking, there is another important optimization that can be applied to matrix-matrix multiply code. Typically, utilization of the level-1 cache is much lower than one might expect, due to cache conflict misses. This is especially apparent for large matrices in direct-mapped and low-associativity caches.
  template <class MatA, class MatB, class MatC>
  void mult(MatA& A, MatB& B, MatC& C) {
    A_k = A.begin();  B_k = B.begin();
    while (A_k != A.end()) {
      C_i = C.begin();
      A_ki = (*A_k).begin();
      while (C_i != C.end()) {
        B_kj = (*B_k).begin();
        C_ij = (*C_i).begin();
        typename MatA::Block A_block = *A_ki;
        while (B_kj != (*B_k).end()) {
          mult(A_block, *B_kj, *C_ij);
          ++B_kj; ++C_ij;
        }
        ++C_i; ++A_ki;
      }
      ++A_k; ++B_k;
    }
  }
Figure 6.1. A recursive matrix-matrix product algorithm.

The way to minimize this problem is to copy the block of matrix A being accessed into a contiguous section of memory [34]. This allows the code to use blocking sizes closer to the size of the level-1 cache without inducing as many cache conflict misses. It turns out that this optimization is straightforward to implement in our recursive matrix-matrix multiply. We already have block objects (the submatrices A_block, *B_kj, and *C_ij). We modify the constructors of these objects to copy the block to a contiguous part of memory, and the destructors to copy the block back to the original matrix. This is especially nice since the optimization does not clutter the algorithm code; instead the change is encapsulated in the copy_block matrix class.
Chapter 7

Iterative Template Library (ITL)

The Iterative Template Library (ITL) is a collection of sophisticated iterative methods written in C++ (similar to the IML++ library [24]). It contains methods for solving both symmetric and non-symmetric linear systems of equations, many of which are described in [4]. The ITL methods are constructed in a generic style, allowing for maximum flexibility and separation of concerns among matrix data structures, performance optimizations, and algorithms. Presently, ITL contains routines for conjugate gradient (CG), conjugate gradient squared (CGS), biconjugate gradient (BiCG), biconjugate gradient stabilized (BiCGStab), generalized minimal residual (GMRES), quasi-minimal residual (QMR) without look-ahead, transpose-free QMR, and the Chebyshev and Richardson iterations. In addition, ITL provides the following preconditioners: SSOR, incomplete Cholesky, incomplete LU(n), and incomplete LU with thresholding.
7.1 Generic Interface

The generic construction of the ITL revolves around four major interface components.

Vector: An array class with an STL-like iterator interface.

Matrix: Either an MTL matrix, a multiplier (for matrix-free methods), or a custom matrix
with a specialized mtl::mult(A,x,y) function defined.

Preconditioner: An object with solve(x,z) and trans_solve(x,z) methods defined.

Iteration: An object that defines the test for convergence and the maximum number of iterations.

Figure 7.1 shows the generic interface for the QMR iterative method. The function is templated on each interface component, which makes it possible to mix and match concrete components. This particular method uses a split preconditioner, hence the M1 and M2 arguments.

  template <class Matrix, class Vector, class VectorB,
            class Precond1, class Precond2, class Iteration>
  int qmr(const Matrix& A, Vector& x, const VectorB& b,
          const Precond1& M1, const Precond2& M2, Iteration& iter);
Figure 7.1. An ITL method interface example.

Figure 7.2 gives an example of how one might call the qmr() routine. The interface presented to the user of ITL has been made as simple as possible. This example uses the compressed2D matrix type from MTL and the SSOR preconditioner. As listed above, there are certain requirements for each interface component, but there is a significant amount of flexibility in the concrete implementation of a particular component. For instance, any of the ITL-supplied preconditioners (Cholesky, ILU, ILUT, SSOR) can be used with any method (though some are for symmetric matrices only), and custom preconditioners can be added with little extra effort. Likewise, control over the test for convergence is encapsulated in the Iteration interface, so that variations in this regard can be made independently of the main algorithms.
  typedef matrix<double, rectangle<>, compressed<>, row_major>::type Matrix;
  int max_iter = 50;
  Matrix A(5, 5);
  dense1D<double> x(A.nrows(), 0.0);
  dense1D<double> b(A.ncols(), 1.0);
  // fill A ...
  SSOR<Matrix> precond(A);
  basic_iteration<double> iter(b, max_iter, 1e-6);
  qmr(A, x, b, precond.left(), precond.right(), iter);
Figure 7.2. Example use of the ITL QMR iterative method.

Similarly, the Matrix and Vector interfaces allow for flexibility in the matrix storage implementation, and even in how the matrix-vector multiplication is carried out. Several levels of flexibility are available. The ITL uses the MTL interface for its basic linear algebra operations. Since the MTL linear algebra operations are generic algorithms, a wide range of matrix types can be used, including all of the dense, sparse, and banded types provided in the MTL. Additionally, the MTL generic algorithms can work with custom matrix types, provided the matrix exports the required interface. A second level of flexibility is that the user may specialize the mtl::mult(A,x,y) function for a custom matrix type and bypass the MTL generic algorithms entirely. One use of this is to perform matrix-free computations [9, 8]. Another is to use a distributed matrix with parallel versions of the linear algebra operations.
7.2 Ease of Implementation

The most significant benefit of layering the ITL on top of the MTL interface is the ease of implementation. The ITL algorithms can be expressed in a concise fashion, very close to the pseudo-code found in a textbook. Figure 7.3 compares the preconditioned conjugate
gradient algorithm from [4] with the code from the ITL.

  Initial r(0) = b - A*x(0)
  for i = 1, 2, ...
      solve M*z(i-1) = r(i-1)
      rho(i-1) = r(i-1)' * z(i-1)
      if i = 1
          p(1) = z(0)
      else
          beta(i-1) = rho(i-1) / rho(i-2)
          p(i) = z(i-1) + beta(i-1) * p(i-1)
      endif
      q(i) = A * p(i)
      alpha(i) = rho(i-1) / (p(i)' * q(i))
      x(i) = x(i-1) + alpha(i) * p(i)
      r(i) = r(i-1) - alpha(i) * q(i)
      check convergence
  end
  while (! iter.finished(r)) {
    M.solve(r, z);
    rho = mtl::dot_conj(r, z);
    if (iter.first())
      mtl::copy(z, p);
    else {
      beta = rho / rho_1;
      mtl::add(z, scaled(p, beta), p);
    }
    mtl::mult(A, p, q);
    alpha = rho / vecvec::dot_conj(p, q);
    mtl::add(x, scaled(p, alpha), x);
    mtl::add(r, scaled(q, -alpha), r);
    rho_1 = rho;
    ++iter;
  }
Figure 7.3. Comparison of an algorithm for the preconditioned conjugate gradient method and the corresponding ITL code.
The generic component construction of the ITL also aids in testing and verification of the software, since it enables an incremental approach. The basic linear algebra operations are tested thoroughly in the MTL test suite, and the preconditioners are tested individually before they are used in conjunction with the iterative methods. In addition, there is an identity preconditioner, so that the iterative methods can be tested in isolation (without a real preconditioner). The abstraction level over the linear algebra also makes performance optimization much easier. Since the ITL iterative methods do not enforce the use of a particular matrix type, or of a particular matrix-vector multiply, optimizations at these levels can happen with no change to the iterative method code. This was a significant factor in our ability to implement and optimize such a large group of iterative methods in a short time (four man-months).
7.3 ITL Performance

This section compares the performance of the ITL with IML++ [24], one of the few other comprehensive iterative method packages (which uses SparseLib++ [26] for the sparse basic linear algebra). Six matrices from the Harwell-Boeing collection are used. Computation time (in seconds) per iteration is plotted in Figure 7.4 for each of the following methods: CGS, BiCG, BiCGStab, QMR, and TFQMR (which exists only in ITL). The ILU (without fill-in) preconditioner was used in all of the experiments. All timings were run on a Sun UltraSPARC 30. The ITL methods were roughly twice as fast as the corresponding IML++ methods.

[Figure 7.4 plots time per iteration for the ITL and IML++ versions of each method over the matrices MCCA, FS7603, PORES2, ZENIOS, SHERMAN5, and SAYLR4.]
Figure 7.4. Comparison of ITL and IML++ performance over six matrices.
Chapter 8

Performance Experiments

This chapter presents a set of experiments comparing the performance of MTL with other available libraries (both public domain and vendor supplied). The algorithms timed were the dense matrix-matrix multiplication, the dense matrix-vector multiplication, and the sparse matrix-vector multiplication.
8.1 Dense Matrix-Matrix Multiplication

Figure 8.1 shows the dense matrix-matrix product performance for MTL, the Fortran BLAS, the Sun Performance Library, TNT [47], and ATLAS [59], all obtained on a Sun UltraSPARC 170E. The experiment shows that the MTL can compete with vendor-tuned libraries (on an algorithm that tends to get extra attention due to benchmarking). The MTL and TNT executables were compiled using Kuck and Associates C++ (KCC) [33], in conjunction with the Solaris C compiler. ATLAS was compiled with the Solaris C compiler, and the Fortran BLAS (obtained from Netlib) were compiled with the Solaris Fortran 77 compiler. All possible compiler optimization flags were used in all cases. The cache was cleared between each trial of the experiment. To demonstrate portability across different architectures and compilers, Figure 8.1 also compares the performance of MTL with ESSL [30] on an IBM RS/6000 590. In this case, the MTL executable was compiled with the KCC and IBM xlc compilers.
8.2 Dense and Sparse Matrix-Vector Multiplication

To demonstrate genericity across different data structures and data types, Figure 8.2 shows performance results obtained using the same generic matrix-vector multiplication algorithm for dense and for sparse matrices, and compares the performance to that obtained with non-generic libraries. The dense matrix-vector performance of MTL is compared to the Netlib BLAS (Fortran), the Sun Performance Library, and TNT [47]. The sparse matrix-vector performance of MTL is compared to SPARSKIT [49] (Fortran), NIST [48] (C), and TNT (C++). The sparse matrices used were from the MatrixMarket [44] collection. The cache was not cleared between matrix-vector timing trials. The focus of this experiment is on the pipeline behavior of the algorithm. If the cache is
cleared, the bottleneck becomes memory bandwidth, and differences in pipeline behavior cannot be seen. Blocking for cache is not as important for matrix-vector multiplication because there is no reuse of the matrix data.
8.3 Performance Analysis of Matrix-Matrix Multiplication

The presence (and absence) of different optimization techniques in the various implementations of the matrix-matrix multiplication can readily be seen in Figure 8.1 (the UltraSPARC comparison) and manifests itself quite strongly as a function of matrix size. In the region from N = 10 to N = 256, performance is dominated by register usage and pipeline performance. "Unroll and jam" techniques [10, 61] are used to attain high levels of performance in this region. In the region from 256 to approximately 1024, performance is dominated by data locality. Loop blocking for cache is used to attain high levels of performance here. Finally, for matrix sizes larger than approximately N = 1024, performance can be affected by conflict misses in the cache; the results for ATLAS and the Fortran BLAS fall precipitously at this point. To attain good performance in the face of conflict misses (in low-associativity caches), block-copy techniques as described in [34] are used. Note that performance effects are cumulative. For instance, the Netlib BLAS do not use any of the techniques listed above for performance enhancement. As a result, their performance is poor initially and continues to degrade as different effects come into play.
[Figure 8.1 plots Mflops versus matrix size for MTL, the Sun Performance Library, ATLAS, the Fortran BLAS, and TNT on the Sun UltraSPARC (upper), and for MTL, ESSL, ATLAS, and the Netlib BLAS on the IBM RS6000 (lower).]
Figure 8.1. Performance comparison of generic dense matrix-matrix product with other libraries on Sun UltraSPARC (upper) and IBM RS6000 (lower).
[Figure 8.2 plots Mflops versus N for MTL, the Fortran BLAS, the Sun Performance Library, and TNT in the dense case (upper), and Mflops versus average non-zeroes per row for MTL, SPARSKIT, NIST, and TNT in the sparse case (lower).]
Figure 8.2. Performance of generic matrix-vector product applied to column-oriented dense (upper) and row-oriented sparse (lower) data structures compared with other libraries on Sun UltraSPARC.
Chapter 9

Testing

The Matrix Template Library posed a special challenge for testing due to the combinatorial nature of its components and algorithms. The amount of functionality to be tested was extremely large, so the same generic programming techniques used in the MTL to manage code size were applied to the test suite. The test suite consists of three parts: the matrix tests, the algorithm tests, and a script to drive the test compilation and execution. The matrix tests exercise all of the functionality described in the Matrix concept (Section 5.3.1). They are written in a generic style, similar to the generic algorithms of the MTL. Most of the matrix tests can be applied to any MTL matrix; however, a few tests must be specialized. For instance, the test that exercises the A(i,j) operator must be specialized for banded matrices. The total number of specializations is not very large, since even the specializations are written in a generic style and cover a large number of matrix types. In addition to testing each matrix format, the matrix tests check each matrix adaptor, such as the scaling and transpose adaptors, with every matrix, to verify that the adapted matrix can pass all of the tests. The code below shows one of the matrix tests. This test focuses on the const iterators of the matrix. The test reads the values stored in the matrix and checks that those values are correct. In this case the matrix has been filled with the series
of numbers 0, 1, 2, ....

template <class Matrix>
bool const_iterator_test(const Matrix& A, string test_name)
{
  typedef typename mtl::matrix_traits<Matrix>::value_type T;
  typename Matrix::const_iterator i;
  typename Matrix::OneD::const_iterator j;
  T c = T(0);
  for (i = A.begin(); i != A.end(); ++i)
    for (j = (*i).begin(); j != (*i).end(); ++j) {
      c = c + T(1);
      if (*j != c) {
        cerr << test_name << " failed" << endl;
        return false;
      }
    }
  return true;
}

// MTL-style generic matrix-vector multiply
template <class Matrix, class VecX, class VecY>
void matvec_mult(const Matrix& A, VecX x, VecY y)
{
  typename Matrix::const_iterator i;
  typename Matrix::OneD::const_iterator j;
  for (i = A.begin(); i != A.end(); ++i)
    for (j = (*i).begin(); j != (*i).end(); ++j)
APPENDIX A. CONTAINERS
y[j.row()] += *j * x[j.column()]; }
There are two iterators used in the above algorithm, i and j. We refer to i as a 2D iterator since it iterates through the 2D container, and we refer to j as the 1D iterator. *i gives a 1D container, and the (*i).begin() and (*i).end() expressions define the iteration through the 1D part of the Matrix, which could be a Row, Column, or Diagonal depending on the Matrix type. MTL matrices can also be sparse or dense, so the traversal behaviour of the 1D iterator j varies accordingly: j iterates over only the non-zero elements in the sparse case. In addition, the row() and column() functions on the 1D iterator return the row and column index corresponding to the position of the iterator. This hides the differences in indexing between a large class of sparse and dense matrices.

Compare the MTL-style code to that of the BLAS-style and the SPARSKIT matrix-vector products. The BLAS and SPARSKIT algorithms include many details that are specific to the matrix format being used. For instance, the sparse matrix algorithm must explicitly access the index and pointer arrays ia and ja.

// SPARSKIT-style sparse matrix-vector multiply
void matvec_mult(double* a, int n, int* ia, int* ja,
                 double* y, double* x)
{
  for (int i = 0; i < n; ++i)
    for (int k = ia[i]; k < ia[i+1]; ++k)
      y[i] += a[k] * x[ja[k]];
}
The dense matrix algorithm must use the leading dimension lda of the matrix to map to the appropriate place in memory.

// BLAS-style dense matrix-vector multiply
void matvec_mult(double* a, int m, int n, int lda,
                 double* y, double* x)
{
  for (int i = 0; i < m; ++i)
    for (int j = 0; j < n; ++j)
      y[i] += a[i*lda+j] * x[j];
}
The MTL-style algorithm has none of these format-specific expressions because the implementation details are hidden beneath the MTL Matrix interface. If one uses a dense MTL matrix with the generic algorithm, the resulting assembly code looks similar to the BLAS-style algorithm; if one uses a sparse MTL matrix, the resulting assembly code looks similar to the SPARSKIT-style algorithm.
operator()(i,j)  One may ask, what about the A(i,j) operator? MTL matrices do have an A(i,j) operator, but using it is typically not the most efficient or "generic" way to traverse and access the matrix, especially if the matrix is sparse. Furthermore, if the matrix is banded, one would have to consider which indices of the matrix are valid to access. With the MTL matrix iterators, one does not have to worry about such considerations: traversing with the iterators gives access to all matrix elements that are stored in any given storage format.
operator[](i)  This operator gives access to the OneD containers within a matrix. For instance, if one has a column-major matrix A and wishes to find the index of the maximum element in the first column, one would do the following:

Matrix::Column first_column = A[0];
max_elt_index = max_index(first_column);
Accessing Rows vs. Columns  Most matrix types provide access to either rows or columns, but not both. However, if one has a dense rectangular matrix, then the matrix can
be easily converted back and forth from row-major to column-major using the rows and columns helper functions. The following finds the maximum element in the first row of the matrix.

max_elt_index = max_index(rows(A)[0]);
Submatrices  The sub_matrix() function returns a new matrix object that is a view into a particular portion of the matrix. The indexing within the new matrix is reset so that the first element is at (0,0). Here is an example of creating submatrices:

A = [  1  2  3  4 ]
    [  5  6  7  8 ]
    [  9 10 11 12 ]
    [ 13 14 15 16 ]

A_00 = [ 1 2 ]      A_01 = [ 3 4 ]
       [ 5 6 ]             [ 7 8 ]

A_10 = [  9 10 ]    A_11 = [ 11 12 ]
       [ 13 14 ]           [ 15 16 ]

A_00 = A.sub_matrix(0,2,0,2);
A_01 = A.sub_matrix(0,2,2,4);
A_10 = A.sub_matrix(2,4,0,2);
A_11 = A.sub_matrix(2,4,2,4);
If one wants to create submatrices covering the whole matrix, as above, MTL provides a shortcut in the partition function. The function returns a matrix of submatrices. The input is the row and column numbers at which to split the matrix. The following code gives an example.

int splitrows[] = { 2 };
int splitcols[] = { 2 };
Matrix::partitioned Ap = A.partition(array_to_vec(splitrows),
                                     array_to_vec(splitcols));
Now Ap(0,0) is equivalent to A_00, Ap(0,1) is equivalent to A_01, and so on.
A.1.1 Matrix

Matrix Type Selection  To create an MTL matrix, one uses the matrix type generator to choose the particular storage format, element type, and so on.

A few notes on the operations below: copy construction performs a shallow copy, since MTL objects are really just reference-counted handles to the actual data. For partition, the IndexList is some Container holding the list of row and column numbers at which to split the matrix, and the result is a matrix of submatrices; subdivide partitions the matrix into four submatrices.

Refinement of  Container

Associated types

Concept  Type               Description
Tag      X::shape           See matrix_traits
Tag      X::orientation     Either row_tag or column_tag
Tag      X::sparsity        Either dense_tag or sparse_tag
Tag      X::dimension       Either oned_tag or twod_tag
Matrix   X::transpose_type  Used by the trans helper function
Matrix   X::strided_type    Used by the rows and columns helper functions
Matrix   X::scaled_type     Used by the scaled helper function
Vector   X::OneD            The type for a OneD slice of the Matrix
Vector   X::OneDRef         The type for a reference to a OneD slice of the Matrix
Matrix   X::submatrix_type  The type for a submatrix of this Matrix
Matrix   X::partition_type  The type for a partitioned version of this Matrix
Concept                      Type                       Description
TrivialConcept               X::value_type              The element type
TrivialConcept&              X::reference               Reference to the element type
const TrivialConcept&        X::const_reference         Const reference to the element type
TrivialConcept*              X::pointer                 Pointer to the element type
BidirectionalIterator        X::iterator                Iterator type; dereference gives a OneD
const BidirectionalIterator  X::const_iterator          Const iterator type
BidirectionalIterator        X::reverse_iterator        Reverse iterator type
const BidirectionalIterator  X::const_reverse_iterator  Const reverse iterator type
NonNegativeIntegral          X::size_type               Size type
Integral                     X::difference_type         Difference type
size_type                    X::M, X::N                 Dimensions, for static-sized matrices
Notations
X       The type of a model of the Matrix concept
A, B    Objects of type X
stream  A matrix stream for file I/O

Expression semantics

Expression                      Description
X A(m, n);                      Normal constructor
X A(m, n, sub, super);          Constructor for banded matrices
X A(B);                         Copy constructor
Expression                            Description
X A(data, m, n);                      Constructor for external matrices
X A(data, m, n, ld);                  Constructor for external matrices, with a different leading dimension (ld)
X A(data, m, n, ld, sub, super);      Constructor for external banded matrices
X A(data);                            Static-sized dimension constructor
X A(m, n, nnz, val, ptrs, inds);      Sparse external matrix constructor
X A(B, sub, super);                   Banded view constructor
X A(stream);                          Matrix market / Harwell-Boeing stream constructors
X A(stream, sub, super);              Matrix market / Harwell-Boeing stream banded constructors
A.begin(), A.end()                    Iterate over OneD slices (const and non-const)
A.rbegin(), A.rend()                  Reverse iteration over OneD slices (const and non-const)
A(i,j)                                Element access (const and non-const)
A[i]                                  OneD access (const and non-const)
A.sub_matrix(row_start, row_finish, col_start, col_finish)   Submatrix access
A.partition(split_rows, split_cols)   Partitioned matrix creation
A.subdivide(row_split, column_split)  Subdivide (partition into 4)
A.nrows()                             Number of rows
A.ncols()                             Number of columns
A.nnz()                               Number of non-zero elements (really the number of stored elements)
A.sub()                               Sub part of the bandwidth
A.super()                             Super part of the bandwidth
A.is_unit()                           Whether the main diagonal is all ones, and hence not stored
A.is_upper()                          Whether the matrix has elements stored only in the upper triangle
A.is_lower()                          Whether the matrix has elements stored only in the lower triangle

Function specification

Prototype                                                             Description                                Complexity
X(size_type m, size_type n)                                           Normal constructor
X(size_type m, size_type n, difference_type sub,
  difference_type super)                                              Constructor for banded matrices
X(const X& x)                                                         Copy constructor
X(pointer data, size_type m, size_type n)                             Constructor for external matrices
X(pointer data, size_type m, size_type n, size_type ld)               Constructor for external matrices, with a different leading dimension (ld)
X(pointer data, size_type m, size_type n,
  difference_type sub, difference_type super)                         Constructor for external banded matrices
X(pointer data)                                                       Static-sized dimension constructor
X(size_type m, size_type n, size_type nnz, pointer val,
  size_type* ptrs, size_type* inds)                                   Sparse external matrix constructor
template <class Matrix> X(const Matrix& x,
  difference_type sub, difference_type super)                         Banded view constructor
X(matrix_market_stream& m_in)                                         Matrix market stream constructor
X(harwell_boeing_stream& m_in)                                        Harwell-Boeing stream constructor
X(matrix_market_stream& m_in, difference_type sub,
  difference_type super)                                              Matrix market stream banded constructor
X(harwell_boeing_stream& m_in, difference_type sub,
  difference_type super)                                              Harwell-Boeing stream banded constructor
iterator begin()                                                      Iterate over OneD slices                   constant time
const_iterator begin()                                                Const iterator begin                       constant time
iterator end()                                                        Iterator end                               constant time
const_iterator end()                                                  Const iterator end                         constant time
reverse_iterator rbegin()                                             Reverse iterator begin                     constant time
const_reverse_iterator rbegin()                                       Const reverse iterator begin               constant time
reverse_iterator rend()                                               Reverse iterator end                       constant time
const_reverse_iterator rend()                                         Const reverse iterator end                 constant time
reference operator()(size_type i, size_type j)                        Element access                             constant for dense, linear for sparse
const_reference operator()(size_type i, size_type j)                  Const element access                       constant for dense, linear for sparse
OneDRef operator[](size_type i)                                       OneD access                                constant
const OneDRef operator[](size_type i)                                 Const OneD access                          constant
submatrix_type sub_matrix(size_type row_start, size_type row_finish,
  size_type col_start, size_type col_finish)                          Submatrix access
partitioned partition(IndexList rows, IndexList columns)              Partitioned matrix creation
partitioned subdivide(size_type row_split, size_type column_split)    Subdivide (partition into 4)
size_type nrows()                                                     Number of rows                             constant time
size_type ncols()                                                     Number of columns                          constant time
size_type nnz()                                                       Number of non-zero elements (really the number of stored elements)   constant time
difference_type sub()                                                 Sub part of the bandwidth
difference_type super()                                               Super part of the bandwidth
bool is_unit()                                                        Whether the main diagonal is all ones, and hence not stored
bool is_upper()                                                       Whether the matrix has elements stored only in the upper triangle
bool is_lower()                                                       Whether the matrix has elements stored only in the lower triangle
Models
row_matrix
column_matrix
diagonal_matrix
triangle_matrix
symmetric_matrix
A.1.2 RowMatrix

Description  A row-oriented, or row-major, Matrix. The iterators for this matrix type traverse along the rows of the matrix, and operator[](n) returns a row of the matrix. Additionally, there is a Row typedef which refers to the OneD type of the matrix.

Associated types

Concept  Type    Description
Vector   X::Row  Row type, same as Matrix::OneD
Models
row matrix
A.1.3 ColumnMatrix

Description  A column-oriented, or column-major, Matrix. The iterators for this matrix type traverse along the columns of the matrix, and operator[](n) returns a column of the matrix. Additionally, there is a Column typedef which refers to the OneD type of the matrix.

Associated types

Concept  Type       Description
Vector   X::Column  Column type, same as Matrix::OneD
Models
column matrix
A.1.4 DiagonalMatrix

Description  A diagonal matrix is quite different from the normal MTL Matrix. Instead of the OneD parts of the matrix being rows or columns, the OneD parts are diagonals of the matrix. The example below shows a piece of code that uses the iterators of a diagonal matrix to fill the matrix with incrementing numbers; the matrix depicted after the code is the result. The iterators traverse along the diagonals of the matrix instead of the rows or columns.

int c = 0;
typename Matrix::iterator i;
typename Matrix::OneD::iterator j;
for (i = A.begin(); i != A.end(); ++i) {
  for (j = (*i).begin(); j != (*i).end(); ++j) {
    c = c + 1;
    *j = c;
  }
}

[ 4  8       ]
[ 1  5  9    ]
[    2  6 10 ]
[       3  7 ]

Associated types

Concept  Type         Description
Vector   X::Diagonal  Diagonal type, same as Matrix::OneD
Models
diagonal matrix
A.1.5 Vector

Description  Not to be confused with the std::vector class. The MTL Vector concept is a Container in which every element has a corresponding index. The elements do not have to be sorted by their index, the indices do not necessarily have to start at 0, and the indices do not have to form a contiguous range. The iterator type must be a model of IndexedIterator. Vector is not a refinement of RandomAccessContainer (even though Vector defines operator[]) because Vector does not guarantee amortized constant time for that operation (to allow for sparse vectors). Note also that the invariant a[n] == *advance(a.begin(), n) that applies to RandomAccessContainer does not apply to Vector, since a[n] is defined for Vector to return the element with index n. So a[n] == *i if and only if i.index() == n.

Associated types

Concept                Type               Description
IndexedIterator        X::iterator
const IndexedIterator  X::const_iterator
Tag                    X::dimension       Marks this as 1-D
Tag                    X::sparsity        dense_tag or sparse_tag
const Vector           X::scaled_type     The vector scaled by a constant
size_type              X::N               The static size, 0 if dynamic
Concept          Type               Description
const Container  X::IndexArray      An array containing the indices of the elements in the vector
Vector           X::subrange        The sub-vector type
const Vector     X::const_subrange  The const sub-vector type
Notations
X       The type of a model of Vector
a       An object of type X
splits  A container of integral objects

Expression semantics

Expression           Description
a[n]                 Element access (const and non-const)
a.partition(splits)  Partition vector
a.subdivide(split)   Subdivide vector (partition into 2)
a(s,f)               Subrange access (not yet implemented)
a.nnz()              Number of non-zero elements (actually the number of stored elements)

Function specification

Prototype                                Description           Complexity
reference operator[](size_type n)        Element access        linear time
const_reference operator[](size_type n)  Const element access  linear time
Prototype                                               Description                            Complexity
partitioned partition(const Container& splits)          Partition vector                       linear in number of splits
partitioned subdivide(size_type split)                  Subdivide vector (partition into 2)    constant
subrange operator()(size_type start, size_type finish)  Subrange access (not yet implemented)  linear
size_type nnz()                                         Number of non-zero elements (actually the number of stored elements)   constant time
IndexArray nz_struct()                                  The non-zero structure of the Vector (an array of indices corresponding to the stored elements)
Invariants
a.nnz()

A.2 Matrix type generators

A.2.1 matrix

Description  Matrices that occur in real engineering and scientific applications often have special structure, especially in terms of how many zeros are in the matrix and where the non-zeros are located. This means that space and time savings can be achieved by using various types of compressed storage. There are a multitude of matrix storage formats in use today, and the MTL tries to support many of the more common formats. The following discussion describes how the user of MTL can select the type of matrix he or she wishes to use.

To create an MTL matrix, one first needs to construct the appropriate matrix type. This is done using the matrix type generation class, which is easier to think of as a function. It takes as input the characteristics of the matrix type that you want and then returns the
appropriate MTL matrix. The matrix type generator "function" has defaults defined, so in order to create a normal rectangular matrix type, one merely does the following:

typedef matrix< double >::type MyMatrix;
MyMatrix A(M, N);
The matrix type generator can take up to four arguments: the element type, the matrix shape, the storage format, and the orientation. The following is the "prototype" for the matrix type generator.

matrix< EltType, Shape, Storage, Orientation >::type
This type of "generative" interface technique was developed by Krzysztof Czarnecki and Ulrich Eisenecker in their work on the Generative Matrix Computation Library. Storage can be made external by specifying such in the storage parameter, e.g. dense<external> or packed<external>.

Definition  matrix.h

Template Parameters

Parameter  Default  Description
EltType             Valid choices for this argument include double, complex<double>, and bool. In essence, any builtin or user-defined type can be used for the EltType; however, if one uses the matrix with a particular algorithm, the EltType must support the operations required by the algorithm. For MTL algorithms these typically include the usual numerical operators such as addition and multiplication. The std::complex class is a good example of what is required in a numerical type. The documentation for each algorithm includes the requirements on the element type.
Parameter    Default  Description
Shape                 This argument specifies the general positioning of the non-zero elements in the matrix, but does not specify the actual storage format. In addition it specifies certain properties such as symmetry. The choices for this argument include rectangle, banded, diagonal, triangle, and symmetric. Hermitian is not yet implemented.
Storage               This argument specifies the storage scheme used to lay out the matrix elements (and sometimes the element indices) in memory. The storage formats include dense, banded, packed, banded_view, compressed, envelope, and array.
Orientation           The storage order for an MTL matrix can be either row_major or column_major.
Members

Declaration  Description
type         The generated type
A.2.2 band_view

Description  A helper class for creating a banded view into an existing dense matrix.

Example  In banded_view_test.cc:

template <class Matrix>
void print_banded_views(Matrix& A)
{
  using namespace mtl;
  typename band_view<Matrix>::type B(A, 2, 1);
}

int main(int argc, char* argv[])
{
  using namespace mtl;
  const int M = atoi(argv[1]), N = atoi(argv[2]);
  typedef matrix<double>::type Matrix;
  Matrix A(M, N);
  print_banded_views(A);
  return 0;
}
Definition  matrix.h

Template Parameters

Parameter  Default  Description
Matrix              The type of the Matrix to be viewed, must be dense

Members

Declaration  Description
type         The generated type
A.2.3 block_view

Description

block_view<Matrix>::type bA = blocked(A, 16, 16);

or

block_view<Matrix, 16, 16>::type bA = blocked(A);

Note: currently not supported for egcs (internal compiler error).

Example  In blocked_matrix.cc:

const int M = 4;
const int N = 4;
typedef matrix<double>::type Matrix;
Matrix A(M, N);
for (int i = 0; i < M; ++i)
  for (int j = 0; j < N; ++j)
    A(i, j) = i * N + j;
print_all_matrix(A);
block_view<Matrix>::type bA = blocked(A, blk());
print_partitioned_matrix(bA);
block_view<Matrix>::type cA = blocked(A, 2, 2);
print_partitioned_by_column(cA);
Definition  matrix.h

Template Parameters

Parameter  Default             Description
Matrix                         The type of the Matrix to be blocked, must be dense
BM         0 for dynamic size  The blocking factor for the rows (M dimension)
BN         0 for dynamic size  The blocking factor for the columns (N dimension)

Members

Declaration  Description
type         The generated type
A.2.4 symmetric_view

Description  A helper class for creating a symmetric view into an existing dense or sparse matrix. For sparse matrices, the matrix must already have elements in the appropriate lower/upper portion of the matrix. The view just provides the proper symmetric matrix interface.

Definition  matrix.h

Template Parameters

Parameter  Default  Description
Matrix              The type of the Matrix to be viewed
Uplo                Whether to view the upper or lower triangle of the matrix

Members

Declaration  Description
type         The generated type
A.2.5 triangle_view

Description  A helper class for creating a triangle view into an existing dense or sparse matrix. For sparse matrices, the matrix must already have elements in the appropriate triangular portion of the matrix. The view just provides the proper triangular matrix interface.

Example  In banded_view_test.cc:

template <class Matrix>
void print_banded_views(Matrix& A)
{
  using namespace mtl;
  typename band_view<Matrix>::type B(A, 2, 1);
}

int main(int argc, char* argv[])
{
  using namespace mtl;
  const int M = atoi(argv[1]), N = atoi(argv[2]);
  typedef matrix<double>::type Matrix;
  Matrix A(M, N);
  print_banded_views(A);
  return 0;
}

Definition  matrix.h

Template Parameters

Parameter  Default  Description
Matrix              The type of the Matrix to be viewed
Uplo                Whether to view the upper or lower triangle of the matrix

Members

Declaration  Description
type         The generated type
A.3 Container type selectors

A.3.1 rectangle

Description  An MTL rectangular matrix is one in which elements can appear in any position in the matrix, i.e., there can be any element A(i,j) where 0 <= i < M and 0 <= j < N. For example:

typedef matrix< double, rectangle,
                array< sparse_pair >,
                row_major >::type SparseArrayMat;

Definition  matrix.h

Template Parameters

Parameter  Default     Description
MM         not static  The number of rows of the matrix, if the matrix has static size (known at compile time)
NN         not static  The number of columns of the matrix, if the matrix has static size (known at compile time)

Members

Declaration
enum { M = MM, N = NN, id = RECT, uplo }
A.3.2 symmetric

Description  Symmetric matrices are similar to banded matrices in that there is only access to a particular band of the matrix. The difference is that in an MTL symmetric matrix, A(i,j) and A(j,i) refer to the same element. The following is an example of a symmetric matrix:

the full symmetric matrix
[ 1  2  3  4  5 ]
[ 2  6  7  8  9 ]
[ 3  7 10 11 12 ]
[ 4  8 11 13 14 ]
[ 5  9 12 14 15 ]

the symmetric matrix in packed storage
[ 1             ]
[ 2  6          ]
[ 3  7 10       ]
[ 4  8 11 13    ]
[ 5  9 12 14 15 ]

Similar to the triangle shape, the user must provide an Uplo argument which specifies which part of the matrix is actually stored. The valid choices are upper and lower for symmetric matrices.

typedef matrix < double, symmetric<lower>,
                 packed, row_major >::type SymmMatrix;
Example  In symm_packed_vec_prod.cc:

double da[16];
typedef matrix< double,
                symmetric<lower>,
                packed,
                column_major >::type Matrix;
const Matrix::size_type matrix_size = 5;
Matrix A(da, matrix_size, matrix_size);
typedef dense1D<double> Vec;
Vec y(matrix_size, 1), x(matrix_size), Ax(matrix_size);
double alpha = 1, beta = 0;
//     1  2  3  4  5        1        1000
//     2  6  7  8  9        2        2000
// A = 3  7 10 11 12    x = 3    y = 3000
//     4  8 11 13 14        4        4000
//     5  9 12 14 15        5        5000
// make A
for (int i = 0; i < 15; ++i)
  da[i] = i + 1;
// make x, y
for (int i = 0; i < matrix_size; ++i) {
  x[i] = i + 1;
  y[i] = (i + 1) * 1000;
}

The following examples create BLAS-style packed and banded matrix types:

typedef matrix < double, banded,
                 packed, column_major >::type BLAS_Packed;
typedef matrix < double, banded,
                 banded, column_major >::type BLAS_Banded;

Storage Type Selectors  banded is also the type selector for the banded storage format. This storage format is equivalent to the banded storage used in the BLAS and LAPACK. Similar to the dense storage format, a single contiguous chunk of memory is allocated. The banded storage format maps the bands of the matrix to a twod-array of dimension (sub + super + 1) by min(M, N + sub). In MTL the 2D array can be row- or column-major (for the BLAS it is always column-major). The twod-array is then in turn mapped to the linear memory space of the single chunk of memory. The following is an example banded matrix with the mapping to the row-major and column-major 2D arrays. The x's represent memory locations that are not used.

[ 1  2  3  0  0  0 ]
[ 4  5  6  7  0  0 ]
[ 0  8  9 10 11  0 ]
[ 0  0 12 13 14 15 ]
[ 0  0  0 16 17 18 ]
[ 0  0  0  0 19 20 ]

row-major
[  1  2  3  x ]
[  4  5  6  7 ]
[  8  9 10 11 ]
[ 12 13 14 15 ]
[  x 16 17 18 ]
[  x  x 19 20 ]

column-major
[ x  x  3  7 11 15 ]
[ x  2  6 10 14 18 ]
[ 1  5  9 13 17 20 ]
[ 4  8 12 16 19  x ]
Definition  matrix.h

Template Parameters

Parameter  Default   Description
External   internal  Specify whether the memory used is "owned" by the matrix or was provided to the matrix from some external source (with a pointer to some data)

Members

Declaration
size_type
enum { id = BAND, oned_id, uplo, ext = External, M = 0, N = 0, issparse = 0, index }
A.3.5 triangle

Description  The triangular shape is a special case of the banded shape. There are four kinds of triangular matrices in MTL, based on the Uplo argument:

Uplo type   Sub    Super
upper       0      N - 1
unit_upper  -1     N - 1
lower       M - 1  0
unit_lower  M - 1  -1

The following is an example of a triangle-shaped matrix:

[ 1  2  3  4  5 ]
[ 0  6  7  8  9 ]
[ 0  0 10 11 12 ]
[ 0  0  0 13 14 ]
[ 0  0  0  0 15 ]
The next example is of a unit lower triangle matrix. The main diagonal is not stored, since it consists of all ones. The MTL algorithms recognize when a matrix is "unit" and perform a slightly different operation to take this into account. The ones will not show up in an iteration over the matrix, and access to the A(i,i) element of a unit lower/upper matrix is an error.

[ 1  0  0  0  0 ]
[ 1  1  0  0  0 ]
[ 2  3  1  0  0 ]
[ 4  5  6  1  0 ]
[ 7  8  9 10  1 ]
Here are a couple of examples of creating triangular matrix types:

typedef matrix < double, triangle<upper>,
                 banded, column_major >::type UpperTriangle;
typedef matrix < double, triangle<unit_lower>,
                 packed, row_major >::type UnitLowerTriangle;

Definition  matrix.h

Template Parameters

Parameter  Default  Description
Uplo                The type of triangular matrix: either upper, lower, unit_upper, or unit_lower

Members

Declaration
enum { id = TRI, uplo = Uplo, M = 0, N = 0 }
A.3.6 diagonal

Description  The diagonal matrix shape is similar to the banded matrix in that there is a bandwidth that describes the area of the matrix in which non-zero elements can reside. The difference from the banded matrix shape lies in how the MTL iterators traverse the matrix, which is explained in DiagonalMatrix. The MTL storage types that can be used are banded, packed, banded_view, and array. To get the traditional tridiagonal matrix format, one just has to specify the bandwidth to be (1,1) and use the array< dense > storage format.

Definition  matrix.h

Template Parameters

Parameter  Default   Description
External   internal  Specify whether the memory used is "owned" by the matrix or was provided to the matrix from some external source (with a pointer to some data)

Members

Declaration
enum { uplo, id = DIAG, ext = External, M = 0, N = 0 }
A.3.7 array

Description  This storage type gives an "array of pointers" style implementation of a matrix. Each row or column of the matrix is allocated separately. The type of vector used for the rows or columns is very flexible, and one can choose from any of the OneD storage types, which include dense, compressed, sparse_pair, tree, and linked_list.

matrix < double, rectangle, array< dense >, row_major >::type

[ ] -> [  1  0  0  4  0 ]
[ ] -> [  0  7  8  0  0 ]
[ ] -> [ 11  0 13 14  0 ]
[ ] -> [ 16  0 18  0 20 ]
[ ] -> [  0 22  0 24  0 ]

matrix < double, rectangle, array< sparse_pair >, row_major >::type

[ ] -> [ (1,0) (4,3) ]
[ ] -> [ (7,1) (8,2) ]
[ ] -> [ (11,0) (13,2) (14,3) ]
[ ] -> [ (16,0) (18,2) (20,4) ]
[ ] -> [ (22,1) (24,3) ]

One advantage of this type of storage is that rows can be swapped in constant time. For instance, one could swap rows 3 and 4 of a matrix in the following way.
Matrix::OneD tmp = A[3];
A[3] = A[4];
A[4] = tmp;
The rows are individually reference counted so that the user does not have to worry about deallocating them.

Definition  matrix.h

Template Parameters

Parameter  Default   Description
OneD       dense     The storage type used for each row/column of the matrix
External   internal  Specify whether the memory used is "owned" by the matrix or was provided to the matrix from some external source (with a pointer to some data)

Model of  TwoDStorage

Members

Declaration
size_type
enum { id = ARRAY, oned_id = OneD::id, ext = External, issparse = OneD::issparse, index = index_from_zero }
A.3.8 dense

Description

TwoD Storage Type Selector  This is the most common way of storing matrices: one contiguous piece of memory that is divided up into rows or columns of equal length. The following example shows how a matrix can be mapped to linear memory in either a row-major or column-major fashion.

[ 1 2 3 ]
[ 4 5 6 ]
[ 7 8 9 ]

row major:    [ 1 2 3 4 5 6 7 8 9 ]
column major: [ 1 4 7 2 5 8 3 6 9 ]

OneD Storage Type Selector  This specifies a normal dense vector to be used as the OneD part of a matrix with array storage.

Example  In swap_rows.cc:

typedef matrix< double, rectangle,
                dense, column_major >::type Matrix;
const Matrix::size_type N = 3;
Matrix::size_type large;
double dA[] = { 1, 3, 2, 1.5, 2.5, 3.5, 4.5, 9.5, 5.5 };
Matrix A(dA, N, N);
// Find the largest element in column 1.
large = max_index(A[0]);
// Swap the first row with the row containing the largest
// element in column 1.
swap( rows(A)[0], rows(A)[large] );
More examples can be found in general_matvec_mult.cc.
Definition  matrix.h

Template Parameters

Parameter  Default   Description
External   internal  Specify whether the memory used is "owned" by the matrix or was provided to the matrix from some external source (with a pointer to some data)

Members

Declaration
size_type
enum { id = DENSE, oned_id, ext = External, issparse = 0, index }
A.3.9 compressed

Description

TwoD Storage Type Selector  This storage type is the traditional compressed row or compressed column format. The storage consists of three arrays: one array for all of the elements, one array consisting of the row or column indices (row for column-major and column for row-major matrices), and one array consisting of pointers to the start of each row/column. The following is an example sparse matrix in compressed row format, with the stored indices specified as index_from_one. Note that the MTL interface is still indexed from zero whether or not the underlying stored indices are from one.

[ 1  0  2  3  0 ]
[ 0  4  0  0  5 ]
[ 6  0  7  8  0 ]
[ 9  0  0 10  0 ]
[ 0  0 11  0 12 ]

row pointer array
[ 1 4 6 9 11 13 ]

element value array
[ 1 2 3 4 5 6 7 8 9 10 11 12 ]

element column index array
[ 1 3 4 2 5 1 3 4 1 4 3 5 ]

Of course, the user of the MTL sparse matrix does not need to concern him or herself with the implementation details of this storage format. The interface to an MTL compressed row matrix is the same as that of any MTL matrix, as described in Matrix.

OneD Storage Type Selector  This is a OneD type used to construct array matrices. The compressed OneD format uses two arrays: one to hold the elements of the vector, and the other to hold the index corresponding to each element (either its row or column number).

Example  In sparse_matrix.cc:

// [1,0,2]
// [0,3,0]
// [0,4,5]
const int m = 3, n = 3, nnz = 5;
double values[] = { 1, 2, 3, 4, 5 };
int indices[] = { 1, 3, 2, 2, 3 };
int row_ptr[] = { 1, 3, 4, 6 };

// Create from pre-existing arrays
typedef matrix< double, rectangle,
                compressed<int, external, index_from_one>,
                row_major >::type MatA;
MatA A(m, n, nnz, values, row_ptr, indices);

// Create from scratch
typedef matrix< double, rectangle,
                compressed<>, row_major >::type MatB;
MatB B(m, n);
B(0,0) = 1; B(0,2) = 2;
B(1,1) = 3;
B(2,1) = 4; B(2,2) = 5;
Definition
  matrix.h

Template Parameters
  Parameter    Default           Description
  SizeType     int               The type used in the index and pointer arrays
  External     internal          Specify whether the memory used is "owned" by the
                                 matrix or provided from some external source (via
                                 a pointer to existing data)
  IndexStyle   index_from_zero   Specify whether the underlying index array stores
                                 indices starting from one (Fortran style) or from
                                 zero (C style)

Members
  size_type
  enum { id = COMPRESSED, oned_id, ext = External, issparse = 1, index = IndexStyle }
A.3.10 packed

Description
This storage type is equivalent to the BLAS/LAPACK packed storage format. The
packed storage format is similar to the banded format, except that the storage
for each row/column of the band is variable, so there is no wasted space. This
makes it well suited to storing triangular matrices efficiently.

  [ 1  2  3  4  5 ]      [  1  2  3  4  5 ]
  [ 0  6  7  8  9 ]      [  6  7  8  9 ]
  [ 0  0 10 11 12 ]  ->  [ 10 11 12 ]
  [ 0  0  0 13 14 ]      [ 13 14 ]
  [ 0  0  0  0 15 ]      [ 15 ]

mapped to linear memory with row-major order:

  [ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 ]

Example
In tri_pack_vect.cc:

  // (template arguments reconstructed; the original text lost them)
  typedef matrix< double, triangle<lower>, packed<>, column_major >::type Matrix;
  typedef dense1D<double> Vector;
  //     [ 1     ]       [ 3 ]
  // A = [ 2 4   ]   x = [ 2 ]
  //     [ 3 5 6 ]       [ 1 ]
  const Matrix::size_type N = 3;
  double dA[] = { 1, 2, 3, 4, 5, 6 };
  Matrix A(dA, N, N);
  Vector x(N), Ax(N);
  for (Matrix::size_type i = 0; i < N; ++i)
    x[i] = 3 - i;
  mult(A, x, Ax);
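The row-major packed mapping above follows a simple closed form: rows 0 through i-1 of an n-by-n upper-triangular matrix contribute i*n - i*(i-1)/2 stored elements, so element (i, j) of the triangle lands at offset i*n - i*(i-1)/2 + (j - i). A small sketch (a hypothetical helper, not part of the MTL interface):

```cpp
#include <cassert>
#include <cstddef>

// Offset of element (i, j), with j >= i, in the row-major packed storage
// of an n-by-n upper-triangular matrix: rows above row i hold n, n-1, ...
// elements, summing to i*n - i*(i-1)/2.
std::size_t packed_offset(std::size_t i, std::size_t j, std::size_t n) {
    return i * n - i * (i - 1) / 2 + (j - i);
}
```

For the 5-by-5 example above, `packed_offset(1, 1, 5)` is 5, which indeed holds the diagonal value 6 in the linear array.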
Template Parameters
  Parameter   Default    Description
  External    internal   Specify whether the memory used is "owned" by the matrix
                         or provided from some external source (via a pointer to
                         existing data)

Members
  size_type
  enum { id = PACKED, oned_id, ext = External, issparse = 0, index }
A.3.11 banded_view

Description
This storage type is used for creating matrices that are "views" into existing
full dense matrices. For instance, one could create a triangular view of a
full matrix.

Definition
  matrix.h

Template Parameters
  Parameter   Default    Description
  External    internal   Specify whether the memory used is "owned" by the matrix
                         or provided from some external source (via a pointer to
                         existing data)

Members
  size_type
  enum { id = BAND_VIEW, oned_id, ext = External, issparse = 0, index }
A.3.12 envelope

Description
This storage scheme is for sparse symmetric matrices where most of the
non-zero elements fall near the main diagonal. The format is useful in certain
factorizations, since the fill-in falls in areas that are already allocated.
This scheme differs from most sparse formats in that the row containers are
actually dense, similar to a banded matrix: each row is stored densely from
its first non-zero up to the diagonal.

  [  1  2  4       ]
  [  2  3        8 ]
  [  4  0  5  6    ]
  [        6  7  9 ]
  [     8  0  9 10 ]

  Diagonals pointer array  [ 0 2 5 7 11 ]
  Element values array     [ 1 2 3 4 0 5 6 7 8 0 9 10 ]

Each entry of the diagonals pointer array gives the position of that row's
diagonal element in the values array.

Definition
  matrix.h

Template Parameters
  Parameter   Default    Description
  External    internal   Specify whether the memory used is "owned" by the matrix
                         or provided from some external source (via a pointer to
                         existing data)

Members
  size_type
  enum { id = ENVELOPE, oned_id, ext = External, issparse = 0, index }
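Given the diagonals pointer array, an element lookup is direct arithmetic: row i occupies the values between the previous row's diagonal and its own, and its first stored column follows from that length. The sketch below (a hypothetical helper, not MTL's implementation) illustrates the mapping with the example arrays above:

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Look up element (i, j) of a symmetric envelope-stored matrix.
// dptr[i] is the position of row i's diagonal in the values array;
// row i is stored densely from its first non-zero to the diagonal.
double envelope_at(int i, int j,
                   const std::vector<double>& values,
                   const std::vector<int>& dptr) {
    if (j > i) std::swap(i, j);             // symmetric: use the lower triangle
    int row_start = (i > 0 ? dptr[i - 1] + 1 : 0);
    int first_col = i - (dptr[i] - row_start);
    if (j < first_col) return 0.0;          // outside the envelope
    return values[dptr[i] - (i - j)];       // dense offset back from the diagonal
}
```

With the example arrays, `envelope_at(2, 0, ...)` returns 4, and `envelope_at(3, 0, ...)` returns 0 because column 0 lies outside row 3's envelope.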
A.3.13 linked_list

Description
This is a OneD type for constructing array matrices. The implementation is a
std::list of index-value pairs.

Definition
  matrix.h

Members
  size_type
  enum { id = LINKED_LIST, issparse = 1, index = index_from_zero }
A.3.14 sparse_pair

Description
This is a OneD type for constructing array matrices. The implementation is a
std::vector of index-value pairs.

Definition
  matrix.h

Members
  size_type
  enum { id = SPARSE_PAIR, issparse = 1, index = index_from_zero }
A.3.15 tree

Description
This is a OneD type for constructing array matrices. The implementation is a
std::set of index-value pairs.

Definition
  matrix.h

Members
  size_type
  enum { id = TREE, issparse = 1, index = index_from_zero }
A.4 Container classes

A.4.1 dense1D

Description
This is the primary class to use as a Vector. This class uses the STL vector
for its implementation; the dense1D class serves as a handle to that vector.
The MTL algorithms assume that the vector and matrix arguments are handles,
and will not work with the STL-style containers directly. For
interoperability, one can create a dense1D from pre-existing memory. In this
case, the MTL reference counting does not delete the memory.

Definition
  dense1D.h

Template Parameters
  Parameter   Default   Description
  RepType               the underlying representation

Model of
  Vector

Members (the concept where each is defined, where recoverable, in parentheses)
  enum { N = NN }
  dimension
  sparsity                 The sparsity tag
  scaled_type              The scaled type of this container (Scalable)
  value_type               The value type (Container)
  reference                The reference type (Container)
  const_reference          The const reference type (Container)
  pointer                  The pointer type (Container)
  size_type                The type for dimensions and indices (Container)
  difference_type          The type for differences between iterators (Container)
  iterator                 The iterator type (Container)
  const_iterator           The const iterator type (Container)
  reverse_iterator         The reverse iterator type (Reversible Container)
  const_reverse_iterator   The const reverse iterator type (Reversible Container)
  subrange                 The subrange vector type
  IndexArray               The type for the index array
  IndexArrayRef            The reference type for the index array
  dense1D()                                Default Constructor (Container)
  dense1D(int n)                           Non-initializing constructor, not very standard :( (Sequence)
  dense1D(int n, const value_type& init)   Initializing constructor (Sequence)
  dense1D(const self& x)                   Copy constructor (shallow copy) (ContainerRef)
  ~dense1D()                               The destructor (Container)
  self& operator=(const self& x)           Assignment operator (shallow copy) (AssignableRef)
  iterator begin()                      Return an iterator pointing to the beginning of the vector (Container)
  iterator end()                        Return an iterator pointing past the end of the vector (Container)
  const_iterator begin() const          Return a const iterator pointing to the beginning of the vector (Container)
  const_iterator end() const            Return a const iterator pointing past the end of the vector (Container)
  reverse_iterator rbegin()             Return a reverse iterator pointing to the last element of the vector (Reversible Container)
  reverse_iterator rend()               Return a reverse iterator pointing past the end of the vector (Reversible Container)
  const_reverse_iterator rbegin() const Return a const reverse iterator pointing to the last element of the vector (Reversible Container)
  const_reverse_iterator rend() const   Return a const reverse iterator pointing past the end of the vector (Reversible Container)
  reference operator[](int i)               Return a reference to the element with index i (Random Access Container)
  const_reference operator[](int i) const   Return a const reference to the element with index i (Random Access Container)
  subrange operator()(size_type s, size_type f)
  size_type size() const                Return the size of the vector (Container)
  size_type nnz() const                 Return the number of non-zeroes
  void resize(size_type n)
  void resize(size_type n, const T& x)
  int capacity() const
  void reserve(int n)
  const value_type* data() const        Raw memory access
  pointer data()                        Raw memory access
  iterator insert(iterator position, const value_type& x = value_type())
  void insert(iterator position, size_type n, const value_type& x = value_type())
  IndexArrayRef nz_struct() const
A.4.2 compressed1D

Description
The compressed1D Vector is a sparse vector implemented with a pair of parallel
arrays. One array holds the element values, and the other holds the indices of
those elements. The elements are kept ordered by their index as they are
inserted into the compressed1D. compressed1D vectors can be used to build
matrices with the array storage format, and they can also be used on their
own.

  [ (1.2, 3), (4.6, 5), (1.0, 10), (3.7, 32) ]   A Sparse Vector

  [ 1.2, 4.6, 1.0, 3.7 ]   Value Array
  [ 3, 5, 10, 32 ]         Index Array

The compressed1D::iterator dereferences (*i) to the element value; the index
of that element is available through the i.index() function. One can also
access the array of indices through the nz_struct() method. One particularly
useful fact is that one can perform scatters and gathers of sparse elements by
using the mtl::copy(x,y) function with a sparse and a dense vector.

Example
In gather_scatter.cc:

  void do_gather_scatter(DenseVec& d, SparseVec& c) {
    using namespace mtl;
    c[2] = 0; c[5] = 0; c[7] = 0;
    copy(d, c);      // gather elements 2, 5, and 7 of d into c
    scale(c, 2.0);
    copy(c, d);      // scatter them back into d
  }

  typedef dense1D<double> denseVec;
  typedef compressed1D<double> compVec;
  denseVec d(9, 2);
  compVec c;
  do_gather_scatter(d, c);

More examples can be found in array2D.cc and sparse_copy.cc.

Definition
  compressed1D.h

Template Parameters
  Parameter    Default           Description
  T                              the element type
  SizeType     int               the type for the stored indices
  IND_OFFSET   index_from_zero   to handle indexing from 0 or 1
Model of
  Vector

Members
  enum { N = 0 }
  sparsity                 This is a sparse vector
  dimension                This is a 1D container
  scaled_type              Scaled type of this vector
  value_type               Element type
  pointer                  A pointer to the element type
  size_type                Unsigned integral type for dimensions and indices
  difference_type          Integral type for differences in iterators
  reference                Reference to the value type
  const_reference          Const reference to the value type
  iterator                 Iterator type
  const_iterator           Const iterator type
  reverse_iterator         Reverse iterator type
  const_reverse_iterator   The const reverse iterator type
  IndexArrayRef            Reference to the index array
  IndexArray               The type for the index array
  subrange                 The type for the subrange vector
  class insert_iterator
  insert_iterator inserter()
  compressed1D()                     Default Constructor
  compressed1D(size_type n)          Length N Constructor
  compressed1D(const self& x)        Copy Constructor
  template <...> compressed1D(const IndexArray& x)   Index Array Constructor
  compressed1D(const light1D& x)
  self& operator=(const self& x)     Assignment Operator
  iterator begin()                      Return an iterator pointing to the beginning of the vector (Container)
  iterator end()                        Return an iterator pointing past the end of the vector (Container)
  const_iterator begin() const          Return a const iterator pointing to the beginning of the vector (Container)
  const_iterator end() const            Return a const iterator pointing past the end of the vector (Container)
  reverse_iterator rbegin()             Return a reverse iterator pointing to the last element of the vector (Reversible Container)
  reverse_iterator rend()               Return a reverse iterator pointing past the end of the vector (Reversible Container)
  const_reverse_iterator rbegin() const Return a const reverse iterator pointing to the last element of the vector (Reversible Container)
  const_reverse_iterator rend() const   Return a const reverse iterator pointing past the end of the vector (Reversible Container)
  reference operator[](size_type i)               Access the element with index i
  const_reference operator[](size_type i) const   Access the element with index i
  iterator insert(size_type i, const T& val)      Insert val into the vector at index i
  void clear()                       Erase the vector
  size_type size() const             The size of the vector
  size_type nnz() const              The number of non-zero elements (the number stored)
  void resize(size_type n)           Resize the vector to size n
  void reserve(size_type n)          Reserve storage for n elements
  IndexArrayRef nz_struct() const    Returns the array of indices
  IndexArrayRef nz_struct()          Returns the array of indices
A.4.3 external_vec

Description
This is similar to dense1D, except that the memory is provided by the user.
This allows for interoperability with other array packages and even with
Fortran.

Example
In dot_prod.cc:

  const int N = 3;
  double dx[] = { 1, 2, 3 };
  double dy[] = { 3, 0, -1 };
  typedef external_vec<double> Vec;
  Vec x(dx, N), y(dy, N);
  print_vector(x);
  print_vector(y);
  double dotprd = dot(x, y);
  if (dotprd == 0)
    cout << ...;

array2D

Example

  typedef matrix< ..., row_major >::type MatC;
  MatA A(M, N);
  MatB B(M, N);
  MatC C(M, N);
  // Fill A ...
  mtl::copy(A, B);
  MatB::Row tmp = B[2];
  B[2] = B[3];
  B[3] = tmp;
  mtl::copy(B, C);
Definition
  array2D.h

Template Parameters
  Parameter   Default   Description
  OneD                  the one-dimensional container the array is composed of

Model of
  TwoDStorage

Members
  template <...> struct partitioned
  transpose_type
  submatrix_type
  banded_view_type
  enum { M = 0, N = 0 }
  OneD                     The 1D container type
  OneDRef
  ConstOneDRef
  storage_loc
  sparsity
  strideability
  value_type
  reference                A reference to the value type
  const_reference          A const reference to the value type
  size_type                The integral type for dimensions and indices
  iterator                 The iterator type
  const_iterator           The const iterator type
  reverse_iterator         The reverse iterator type
  const_reverse_iterator   The const reverse iterator type
  dim_type                 A pair type for the dimension
  band_type                A pair type for the bandwidth
  array2D()                                                        Default Constructor
  array2D(dim_type d, size_type start_index = 0)                   Normal Constructor
  array2D(dim_type d, band_type band, size_type start_index = 0)   Banded Constructor
  template <...> array2D(const TwoD& x, band_type)                 Sparse banded view constructor
  template <...> array2D(MatrixStream& s, Orien)                   Matrix Stream Constructor
  template <...> array2D(MatrixStream& s, Orien, band_type bw)     Banded Matrix Stream Constructor
  array2D(const self& x)                Copy Constructor (shallow)
  iterator begin()                      Return an iterator pointing to the first 1D container
  iterator end()                        Return an iterator pointing past the end of the 2D container
  const_iterator begin() const          Return a const iterator pointing to the first 1D container
  const_iterator end() const            Return a const iterator pointing past the end of the 2D container
  reverse_iterator rbegin()             Return a reverse iterator pointing to the last 1D container
  reverse_iterator rend()               Return a reverse iterator pointing past the start of the 2D container
  const_reverse_iterator rbegin() const Return a const reverse iterator pointing to the last 1D container
  const_reverse_iterator rend() const   Return a const reverse iterator pointing past the start of the 2D container
  OneD reference operator()(size_type i, size_type j)               Return a reference to the (i,j) element, where (i,j) is in the 2D coordinate system
  OneD const_reference operator()(size_type i, size_type j) const   Return a const reference to the (i,j) element, where (i,j) is in the 2D coordinate system
  OneDRef operator[](size_type i)              Return a reference to the ith 1D container
  ConstOneDRef operator[](size_type i) const   Return a const reference to the ith 1D container
  size_type major() const               The dimension of the 2D container
  size_type minor() const               The dimension of the 1D containers
  size_type nnz() const                 The number of non-zeros
  void print() const
  size_type first_index() const
  template <...> void fast_copy(const Matrix& x)   A faster specialization for copying
A.5 Container adaptors

A.5.1 linalg_vec

Description
This captures the main functionality of a dense MTL vector. The dense1D and
external1D classes derive from this class, and specialize it to use either
internal or external storage.

Definition
  linalg_vector.h

Template Parameters
  Parameter   Default   Description
  RepType               the underlying representation
Model of
  Linalg Vector

Members
  self
  rep_type
  rep_ptr
  enum { N = NN }
  dimension
  sparsity                 The sparsity tag
  scaled_type              The scaled type of this container
  value_type               The value type
  reference                The reference type
  const_reference          The const reference type
  pointer                  The pointer (to the value type) type
  size_type                The size type (non-negative)
  difference_type          The difference type (an integral type)
  iterator                 The iterator type
  const_iterator           The const iterator type
  reverse_iterator         The reverse iterator type
  const_reverse_iterator   The const reverse iterator type
  Vec
  IndexArray
  IndexArrayRef            The type for an array of the indices of the elements in the vector
  subrange                 The type for a subrange vector-view of the original vector
  linalg_vec()                                   Default Constructor (allocates the container) (Container)
  linalg_vec(rep_ptr x, size_type start_index)   Normal Constructor
  linalg_vec(const self& x)                      Copy Constructor (shallow copy) (ContainerRef)
  ~linalg_vec()                                  The destructor (Container)
  self& operator=(const self& x)                 Assignment Operator (shallow copy) (AssignableRef)
  iterator begin()                      Return an iterator pointing to the beginning of the vector (Container)
  iterator end()                        Return an iterator pointing past the end of the vector (Container)
  const_iterator begin() const          Return a const iterator pointing to the beginning of the vector (Container)
  const_iterator end() const            Return a const iterator pointing past the end of the vector (Container)
  reverse_iterator rbegin()             Return a reverse iterator pointing to the last element of the vector (Reversible Container)
  reverse_iterator rend()               Return a reverse iterator pointing past the end of the vector (Reversible Container)
  const_reverse_iterator rbegin() const Return a const reverse iterator pointing to the last element of the vector (Reversible Container)
  const_reverse_iterator rend() const   Return a const reverse iterator pointing past the end of the vector (Reversible Container)
  reference operator[](size_type i)               Return a reference to the element with the ith index (Vector)
  const_reference operator[](size_type i) const   Return a const reference to the element with the ith index (Vector)
  size_type size() const                The size of the vector (Container)
  size_type nnz() const                 The number of non-zeroes in the vector
  void resize(size_type n)              Resize the vector to n (Container)
  void resize(size_type n, const value_type& x)   Resize the vector to n, and assign x to the new positions
  size_type capacity() const            Return the total capacity of the vector
  void reserve(size_type n)             Reserve more space in the vector
  const value_type* data() const        Raw memory access
  value_type* data()                    Raw memory access
  iterator insert(iterator position, const value_type& x = value_type())   Insert x at the indicated position in the vector (Container)
  IndexArrayRef nz_struct() const
A.5.2 scaled1D

Description
This class is not meant to be used directly. Instead, it is created
automatically when the scaled(x,alpha) function is invoked. See the
documentation for "Shortcut for Creating a Scaled Vector". This vector type is
READ ONLY, so only const versions of the members are provided; i.e., there is
no iterator typedef, just const_iterator.

Definition
  scaled1D.h

Template Parameters
  Parameter                  Default   Description
  RandomAccessContainerRef             The type of the underlying container

Model of
  RandomAccessContainerRef
Members
  enum { N = RandomAccessContainerRef::N }   Static size, 0 if dynamic size
  value_type        The value type
  size_type         The unsigned integral type for dimensions and indices
  dimension         The dimension, should be 1D
  iterator          The iterator type (do not use this)
  const_iterator    The const iterator type
  const_reverse_iterator   The const reverse iterator type
  pointer           The pointer to the value type
  reference         The reference type
  const_reference   The const reference type
  difference_type   The difference type
  scaled_type       The scaled type
  sparsity          The sparsity tag (dense_tag or sparse_tag)
  subrange
  IndexArray        The type for the index array
  IndexArrayRef     The reference type to the index array
  scaled1D()                                                  Default constructor
  scaled1D(const Vector& r, value_type scale_)                Normal constructor
  scaled1D(const Vector& r, value_type scale_, do_scaled s)
  scaled1D(const self& x)          Copy constructor
  self& operator=(const self& x)   Assignment operator
  ~scaled1D()                      Destructor
  operator Vector&()               Access the base container
  begin() const, end() const, rbegin() const, rend() const   Const iterator access to the elements, as in Container and Reversible Container
  const_reference operator[](int i) const   Return a const reference to the element with index i (Container)
  size_type size() const    Return the size of the vector (Container)
  size_type nnz() const     Return the number of non-zeroes (Vector)
  void adjust_index(size_type i)
A.5.3 sparse1D

Description
This is a sparse vector implementation that can use several different
underlying containers, including std::vector, std::list, and std::set. This
adaptor is used in the implementation of the linked_list, tree, and
sparse_pair OneD storage types (used with the array matrix storage type). It
can also be used as a stand-alone Vector. The value type of the underlying
container must be entry1, which is just an index-value pair. The elements are
kept ordered by their index as they are inserted.

Example
In gather_scatter.cc:

  void do_gather_scatter(DenseVec& d, SparseVec& c) {
    using namespace mtl;
    c[2] = 0; c[5] = 0; c[7] = 0;
    copy(d, c);
    scale(c, 2.0);
    copy(c, d);
  }

  typedef dense1D<double> denseVec;
  typedef compressed1D<double> compVec;
  denseVec d(9, 2);
  compVec c;
  do_gather_scatter(d, c);

Definition
  sparse1D.h

Template Parameters
  Parameter   Default   Description
  RepType               The container type used to store the index-value pairs

Model of
  ContainerRef?

Type requirements
  The value type of RepType must be of type entry1
Members
  enum { N = 0 }
  sparsity          This is a sparse vector
  entry_type        The index-value pair type
  PR                The value type
  dimension         This is a 1D container
  scaled_type       The scaled type
  value_type        The value type
  pointer           The type for pointers to the value type
  size_type         The unsigned integral type for dimensions and indices
  difference_type   The type for differences between iterators
  reference         The type for references to the value type
  const_reference   The type for const references to the value type
  iterator          The iterator type
  const_iterator    The const iterator type
  reverse_iterator  The reverse iterator type
  const_reverse_iterator   The const reverse iterator type
  IndexArray        The type for the index array
  IndexArrayRef     The reference type for the index array
  subrange          The type for subrange vectors
  sparse1D()                 Default Constructor
  sparse1D(int n)            Length N Constructor
  sparse1D(const self& x)    Copy Constructor
  template <...> sparse1D(const IndexArray& x)   Construct from index array
  self& operator=(const self& x)   Assignment Operator
  begin(), end(), rbegin(), rend() (const and non-const)   Iterator access to the elements, as in Container and Reversible Container
  const_reference operator[](int i) const   Element access: return the element with index i
  reference operator[](int i)               Element access: return the element with index i
  iterator insert(int i, const PR& value)   Insert the value at index i of the vector
  int size() const           Returns the size of the vector (Container)
  int nnz() const            Number of non-zero (stored) elements
  template <...> void resize_imp(int n, R*)
  void resize(int n)         Resizes the vector to size n
  rep_type& get_rep()
  void print() const
  IndexArrayRef nz_struct() const   Return an array of indices describing the non-zero structure
A.5.4 strided1D

Description
This class is not meant to be used directly. Instead, it is created
automatically when the strided(x,inc) function is invoked. See the
documentation for "Shortcut for Creating a Strided Vector".

Definition
  strided1D.h

Template Parameters
  Parameter                  Default   Description
  RandomAccessContainerRef             base container type

Model of
  RandomAccessContainerRef
Members
  enum { N = RandomAccessContainerRef::N }
  value_type        The value type (Container)
  reference         The type for references to the value type (Container)
  const_reference   The type for const references to the value type (Container)
  iterator          The iterator type (Container)
  const_iterator    The const iterator type (Container)
  reverse_iterator  The reverse iterator type (Reversible Container)
  const_reverse_iterator   The const reverse iterator type (Reversible Container)
  scaled_type       The scaled vector type (Scalable)
  sparsity          Whether the vector is sparse or dense
  IndexArrayRef     The type for references to the index array
  IndexArray        The type for the index array
  dimension         This is a 1D container
  size_type         The unsigned integral type for dimensions and indices (Container)
  difference_type   The integral type for differences between iterators (Container)
  pointer           The type for pointers to the value type
  subrange          The subrange vector type
  strided1D(const Vector& r, int stride_)   Normal Constructor
  strided1D(const self& x)                  Copy Constructor
  operator Vector&()
  begin(), end(), rbegin(), rend() (const and non-const)   Iterator access to the elements, as in Container and Reversible Container
  reference operator[](int i)               Return a reference to the element with index i (Random Access Container)
  const_reference operator[](int i) const   Return a const reference to the element with index i (Random Access Container)
  int size() const          Return the size of the vector (Container)
  size_type nnz() const     Return the number of non-zeroes (Vector)
  void reindex(int i)
  subrange operator()(size_type s, size_type f)   Return a subrange vector containing the elements from index s to f
  void adjust_index(size_type i)
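The adaptor's mapping is simple: element i of the strided view is element i*stride of the base storage, which is how a column of a row-major matrix can be treated as a vector. A minimal sketch of such a view (hypothetical, not the MTL class):

```cpp
#include <cassert>

// A bare-bones strided view over existing storage: element i of the
// view maps to base[i * stride]. No ownership, just a window.
struct StridedView {
    double* base;   // first element of the view
    int stride;     // distance between consecutive logical elements
    int n;          // number of logical elements
    double& operator[](int i) { return base[i * stride]; }
    int size() const { return n; }
};
```

For a 2-by-3 row-major matrix stored in a flat array, a view with stride 3 over the first column walks entries 0 and 3 of the array.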
A.5.5 scaled2D

Description
This class is not meant to be used directly. Instead, use the scaled()
function to create a scaled matrix to pass into an MTL algorithm.

Members
  template <...> struct partitioned
  enum { M = TwoD::M, N = TwoD::N }
  size_type                The unsigned integral type for dimensions and indices
  value_type               The 1D container type
  reference                The type for references to value_type
  const_reference          The type for const references to value_type
  iterator                 The iterator type (not used)
  const_iterator           The const iterator type
  reverse_iterator         The reverse iterator type (not used)
  const_reverse_iterator   The const reverse iterator type
  sparsity                 Either sparse_tag or dense_tag
  strideability            Whether the underlying 2D container is strideable
  storage_loc              Either internal or external storage
  transpose_type           The transpose type
  scaled2D()                            Default Constructor
  scaled2D(const TwoD& x, const T& a)   Normal Constructor
  const_iterator begin() const          Return a const iterator pointing to the first 1D container
  const_iterator end() const            Return a const iterator pointing past the end of the 2D container
  const_reverse_iterator rbegin() const Return a const reverse iterator pointing to the last 1D container
  const_reverse_iterator rend() const   Return a const reverse iterator pointing past the start of the 2D container
  reference operator[](int i) const     Return a const reference to the ith 1D container
  T operator()(int i, int j) const      Return the (i,j) element, where (i,j) is in the 2D coordinate system
  int major() const         The dimension of the 2D container
  int minor() const         The dimension of the 1D containers
  size_type nnz() const     The number of non-zeros
A.5.6 block2D

Description
For use in blocked algorithms with rectangular dense matrices. The blocks all
have the same size (versus the variable sizes in a partitioned matrix). The
matrix objects for each block are not stored; they are generated on the fly as
they are requested, and they are lightweight objects on the stack, so no
overhead is incurred.

The blocking size must divide evenly into the original matrix size. One good
way to ensure this is to partition the original matrix into a main region that
divides evenly into the blocks, plus three edge regions that do not get
blocked. Use the block_view type constructor and the blocked function to
create matrices of this type.

Example
In blocked_matrix.cc:

  const int M = 4;
  const int N = 4;
  // (template arguments reconstructed; the original text lost them)
  typedef matrix< double, rectangle<>, dense<>, row_major >::type Matrix;
  Matrix A(M, N);
  for (int i = 0; i < M; ++i)
    for (int j = 0; j < N; ++j)
      A(i, j) = i * N + j;
  print_all_matrix(A);
  block_view<Matrix, 2, 2>::type bA = blocked(A, blk<2,2>());
  print_partitioned_matrix(bA);
  block_view<Matrix>::type cA = blocked(A, 2, 2);
  print_partitioned_by_column(cA);
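Because the block size divides the matrix evenly, locating an element within the blocked view is pure index arithmetic, which is why the blocks can be generated on the fly. A sketch of that mapping (a hypothetical helper, not the MTL interface):

```cpp
#include <cassert>
#include <utility>

// For a matrix blocked into BM x BN tiles that divide it evenly,
// element (i, j) lies in block (i / BM, j / BN) at local position
// (i % BM, j % BN). Returns {block coords, local coords}.
std::pair<std::pair<int, int>, std::pair<int, int>>
block_coords(int i, int j, int BM, int BN) {
    return {{i / BM, j / BN}, {i % BM, j % BN}};
}
```

In the 4-by-4 example with 2-by-2 blocks, element (3, 1) falls in block (1, 0) at local position (1, 1).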
Definition block2D.h Template Parameters Parameter Block
OffsetGen
Default
Description The submatrix block, a dense external matrix. The Offset generator.
APPENDIX A. CONTAINERS
182
Members
    Declaration                                      Description
    enum { M = 0, N = 0 }
    size_type
    difference_type                                  The type for differences between iterators
    sparsity                                         This is a dense 2D container
    storage_loc                                      This has external storage
    strideability                                    This is strideable
    template struct partitioned
    class block_vector                               The 1D container type
    value_type
    reference                                        A reference to the value type
    pointer                                          The type for pointers to the value type
    class iterator                                   The iterator type
    class const_iterator                             The const iterator type
    dyn_dim bdt
    block2D(TwoD& x, dyn_dim b)                      Constructor from underlying 2D container
    block2D(const block2D& x)                        Copy Constructor
    const block2D& operator=(const block2D& x)       Assignment operator
    block2D()                                        Default Constructor
    ~block2D()                                       Destructor
    iterator begin()                                 Return an iterator pointing to the first 1D container
    iterator end()                                   Return an iterator pointing past the end of the 2D container
    const_iterator begin() const                     Return a const iterator pointing to the first 1D container
    const_iterator end() const                       Return a const iterator pointing past the end of the 2D container
    block_vector operator[](size_type i)             Return a reference to the ith 1D container
    Declaration                                              Description
    Block operator()(size_type i, size_type j)               Return a reference to the (i,j) element, where (i,j) is in the 2D coordinate system
    const Block operator()(size_type i, size_type j) const   Return a const reference to the (i,j) element, where (i,j) is in the 2D coordinate system
    size_type ld() const                                     The leading dimension
A.6 Container functions

A.6.1 scaled
Prototype

    template <class Scalable, class T>
    Scalable::scaled_type scaled(const Scalable& A, const T& alpha);
Description This function can be used to scale arguments in MTL functions. For example, to perform the vector operation A[1] <- SCALE * A[0] + A[1]:

    typedef matrix::type Matrix;
    Matrix A(3,3);
    double SCALE = - A(2,1) / A(1,1);
    add(scaled(A[0], SCALE), A[1], A[1]);
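The idea behind scaled can be sketched in plain C++ as a lightweight view that multiplies elements by alpha only as they are read, so no temporary vector is ever built. The names below (scaled_view, add) are illustrative, not MTL's actual types:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A minimal "scaled view": wraps a container and multiplies each
// element by alpha on access. Nothing is copied or modified.
template <class Cont, class T>
class scaled_view {
public:
    scaled_view(const Cont& c, T alpha) : c_(c), alpha_(alpha) {}
    T operator[](std::size_t i) const { return alpha_ * c_[i]; }
    std::size_t size() const { return c_.size(); }
private:
    const Cont& c_;
    T alpha_;
};

template <class Cont, class T>
scaled_view<Cont, T> scaled(const Cont& c, T alpha) {
    return scaled_view<Cont, T>(c, alpha);
}

// y <- x + y, where x may be a plain container or a scaled view.
template <class X, class Y>
void add(const X& x, Y& y) {
    for (std::size_t i = 0; i < y.size(); ++i) y[i] += x[i];
}
```

Because the scaling happens inside the same loop that performs the addition, the compiler can fuse the two operations, which is the point of the adaptor.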
A.6.2 strided
Prototype

    template <class RandomAccessContainerRef, class Distance>
    strided1D strided(RandomAccessContainerRef& v, Distance stride);
Description The helper function for creating a strided vector adaptor. Definition strided1D.h Requirements on types
Distance must be compatible with RandomAccessContainerRef's Distance
Complexity compile time
A.6.3 rows
Prototype

    template <class Matrix>
    rows_type::type rows(const Matrix& A);
Description Returns a row-oriented view of matrix A: for the returned matrix R = rows(A), R[i] gives you the ith row and R.begin() gives you an iterator over the rows.
Definition matrix_implementation.h
Example In swap_rows.cc:

    typedef matrix< double, rectangle, dense, column_major>::type Matrix;
    const Matrix::size_type N = 3;
    Matrix::size_type large;
    double dA[] = { 1, 3, 2, 1.5, 2.5, 3.5, 4.5, 9.5, 5.5 };
    Matrix A(dA, N, N);
    // Find the largest element in column 1.
    large = max_index(A[0]);
    // Swap the first row with the row containing the largest
    // element in column 1.
    swap( rows(A)[0], rows(A)[large] );
A.6.4 columns
Prototype

    template <class Matrix>
    columns_type::type columns(const Matrix& A);

Description Returns a column-oriented view of matrix A: for the returned matrix R = columns(A), R[i] gives you the ith column and R.begin() gives you an iterator over the columns. See rows for an example.
Definition matrix_implementation.h
A.6.5 trans
Prototype

    template <class Matrix>
    Matrix::transpose_type trans(const Matrix& A);

Description Swap the orientation of a matrix (i.e., from row-major to column-major). In essence this transposes the matrix. This operation occurs at compile time.
Definition matrix_implementation.h
Example In trans_mult.cc:

    typedef matrix< double, rectangle, dense, row_major >::type EMatrix;
    typedef dense1D Vector;
    const EMatrix::size_type n = 5;
    Vector y(n,1), Ay(n);
    double da[n*n];
    EMatrix A(da,n,n);
    mult(trans(A),y,Ay);
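The essence of a compile-time transpose can be sketched as an index-swapping view, a simplification of MTL's orientation swap. The names (TransposeView, trans_view, Dense2x2) are illustrative, not MTL's:

```cpp
#include <cassert>
#include <cstddef>

// Toy row-major matrix type used only for this sketch.
struct Dense2x2 {
    double d[4];
    double& operator()(std::size_t i, std::size_t j) { return d[i * 2 + j]; }
};

// A transpose "view": the index swap happens in inlined calls,
// so there is no run-time data movement.
template <class Matrix>
struct TransposeView {
    Matrix& a;
    double& operator()(std::size_t i, std::size_t j) const { return a(j, i); }
};

template <class Matrix>
TransposeView<Matrix> trans_view(Matrix& a) { return TransposeView<Matrix>{a}; }
```

Algorithms written against operator() work unchanged on the view, and the compiler sees only swapped index arithmetic.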
A.6.6 blocked
Prototype

    template <class Matrix>
    block_view::type blocked(Matrix& A, int bm, int bn);

Description Create a blocked view of matrix A, with blocks of size bm by bn:

    block_view bA = blocked(A, 16, 16);
Note: currently not supported for egcs (internal compiler error).
Definition matrix.h
Example In blocked_matrix.cc:

    const int M = 4;
    const int N = 4;
    typedef matrix::type Matrix;
    Matrix A(M,N);
    for (int i = 0; i < M; ++i)
      for (int j = 0; j < N; ++j)
        A(i, j) = i * N + j;
    print_all_matrix(A);
    block_view::type bA = blocked(A, blk());
    print_partitioned_matrix(bA);
    block_view::type cA = blocked(A, 2, 2);
    print_partitioned_by_column(cA);
A.6.7 blocked
Prototype

    template <class Matrix>
    block_view::type blocked(Matrix& A, blk);

Description This version of the blocked matrix generator is for statically sized blocks.

    block_view bA = blocked(A);

Note: currently not supported for egcs (internal compiler error).
Definition matrix.h
Example In blocked_matrix.cc:

    const int M = 4;
    const int N = 4;
    typedef matrix::type Matrix;
    Matrix A(M,N);
    for (int i = 0; i < M; ++i)
      for (int j = 0; j < N; ++j)
        A(i, j) = i * N + j;
    print_all_matrix(A);
    block_view::type
    bA = blocked(A, blk());
    print_partitioned_matrix(bA);
    block_view::type cA = blocked(A, 2, 2);
    print_partitioned_by_column(cA);
A.7 Container tags

A.7.1 banded tag

A.7.2 column matrix traits
Members
    Column

A.7.3 column tag

A.7.4 dense tag

A.7.5 diagonal matrix traits
Members
    Diagonal
A.7.6 diagonal tag

A.7.7 external tag

A.7.8 hermitian tag

A.7.9 internal tag

A.7.10 linalg traits
Members
    Declaration       Description
    dimension         Whether the object is a 1D or 2D container
    value_type        The element type within the container
    sparsity          Either sparse or dense
    magnitude_type    The return type for abs(value_type)
A.7.11 matrix traits
Members
    Declaration       Description
    shape             The shape of the matrix, either rectangle_tag, banded_tag, diagonal_tag, triangle_tag, or symmetric_tag
    orientation       The orientation, either row_tag or column_tag
    sparsity          The sparsity, either dense_tag or sparse_tag
    transpose_type    Used by the trans helper function
    strided_type      Used by the rows and columns helper functions
    strideability     Whether the rows and columns functions can be used with this Matrix
    scaled_type       The Matrix type resulting from wrapping a scaled adaptor around this Matrix
    storage_loc       Whether the Matrix owns its data, either external_tag or internal_tag
    OneD              A OneD part of a Matrix; this could be a Row, a Column, or a Diagonal depending on the type of Matrix
    value_type        The element type of the matrix
    reference
    const_reference
    pointer
    size_type         A NonNegativeIntegral type
    difference_type
A.7.12 not strideable

A.7.13 oned tag

A.7.14 rectangle tag

A.7.15 row matrix traits
Members
    Row

A.7.16 row tag

A.7.17 sparse tag

A.7.18 strideable

A.7.19 symmetric tag

A.7.20 triangle tag

A.7.21 twod tag
Appendix B Iterators

B.1 Concepts

B.1.1 IndexedIterator
Description The iterator concept for iterators of Vectors. An IndexedIterator provides access to the indices, as well as the elements, of a Vector. For instance, i.row() gives the row index corresponding to the element *i.
Refinement of BidirectionalIterator
Notations
    X    A type that is a model of IndexedIterator
    i    Object of type X
    V    A type that is a model of Vector
    a    An object of type V
    n    An object of integral type
    r    A row in some Matrix
Expression semantics
    Expression    Description
    i.row()       Row index access
    i.column()    Column index access
    i.index()     Index access

Function specification
    Prototype              Description           Complexity
    size_type row()        Row index access      constant time
    size_type column()     Column index access   constant time
    size_type index()      Index access          constant time
Models
    dense_iterator
    sparse_iterator
    scale_iterator
    stride_iterator
    compressed_iter
B.2 Iterator functions

B.2.1 trans iter
Prototype

    template <class Iterator, class UnaryFunction>
    transform_iterator<Iterator, UnaryFunction> trans_iter(Iterator i, UnaryFunction op);
Description The helper function for creating a transform_iterator from an iterator and a unary function.
Definition transform_iterator.h
B.3 Iterator adaptors

B.3.1 dense iterator
Description An iterator for dense contiguous containers that keeps track of the index.
Definition dense_iterator.h
Template Parameters
    Parameter               Default    Description
    RandomAccessIterator               The base iterator
Model of RandomAccessIterator
Members
    Declaration          Description
    value_type           The value type
    iterator_category    This is a random access iterator
    difference_type      The type for differences between iterators
    pointer              The type for pointers to the value type
    reference                                        The type for references to the value type
    Distance
    RandomAccessIterator start
    int pos
    int start_index
    int index() const                                Return the index of the current element (defined in IndexedIterator)
    dense_iterator()                                 Default Constructor
    dense_iterator(RandomAccessIterator s, int i, int first_index = 0)    Constructor from underlying iterator
    dense_iterator(const self& x)                    Copy Constructor
    template <class SELF> dense_iterator(const SELF& x)    Copy Constructor from a compatible iterator
    self& operator=(const self& x)                   Assignment operator
    ~dense_iterator()                                Destructor
    RandomAccessIterator base() const                Access the underlying iterator
    operator RandomAccessIterator() const            Convert to the underlying iterator
    reference operator*() const                      Dereference operator
    pointer operator->() const                       Member access operator
    self& operator++()                               Pre-increment operator
    self operator++(int)                             Post-increment operator
    self& operator--()                               Pre-decrement operator
    self operator--(int)                             Post-decrement operator
    self operator+(Distance n) const                 Add iterator and distance n
    self& operator+=(Distance n)                     Add distance n to this iterator
    self operator-(Distance n) const                 Subtract iterator and distance n
    difference_type operator-(const self& x) const   Return the difference between two iterators
    self& operator-=(Distance n)                     Subtract distance n from this iterator
    Declaration                             Description
    bool operator!=(const self& x) const    Return whether this iterator is not equal to iterator x
    bool operator<(const self& x) const     Return whether this iterator is less than iterator x
    bool operator>(const self& x) const     Return whether this iterator is greater than iterator x
    bool operator==(const self& x) const    Return whether this iterator is equal to iterator x
    bool operator<=(const self& x) const    Return whether this iterator is less than or equal to iterator x
    bool operator>=(const self& x) const    Return whether this iterator is greater than or equal to iterator x
    reference operator[](Distance n) const  Equivalent to *(i + n)
B.3.2 scale iterator
Description The scale_iterator is an adaptor which multiplies the values of the underlying elements by some scalar as they are accessed (through the dereference operator). Scale iterators are somewhat different from most in that they are always considered to be constant iterators, whether or not the underlying elements are mutable. Typically users will not need to use scale_iterator directly; it is really just an implementation detail of the scaled1D container.
Definition scale_iterator.h
Template Parameters
    Parameter               Default    Description
    RandomAccessIterator               The underlying iterator
    T                                  The type of the scalar to multiply by
Model of RandomAccessIterator
Type requirements
T must be convertible to RandomAccessIterator's value type
RandomAccessIterator's value type must be a model of Ring
Members
    Declaration                                      Description                       Where Defined
    difference_type                                  The difference type
    value_type                                       The value type
    iterator_category                                The iterator category
    pointer                                          The pointer type
    Distance
    iterator_type
    reference                                        The reference type
    const_reference
    scale_iterator()                                 The default constructor           Trivial Iterator
    scale_iterator(const RandomAccessIterator& x)    Normal constructor                scale iterator
    scale_iterator(const RandomAccessIterator& x, const value_type& a)    Normal constructor    scale iterator
    scale_iterator(const self& x)                    Copy constructor                  Trivial Iterator
    int index() const                                MTL index method                  Indexible Iterator
    operator RandomAccessIterator()                  Convert to base iterator          scale iterator
    RandomAccessIterator base() const                Access base iterator              scale iterator
    value_type operator*() const                     Dereference (and scale)           Trivial Iterator
    self& operator++()                               Pre-increment                     Forward Iterator
    self operator++(int)                             Post-increment                    Forward Iterator
    self& operator--()                               Pre-decrement                     Bidirectional Iterator
    self operator--(int)                             Post-decrement                    Bidirectional Iterator
    self operator+(Distance n) const                 Iterator addition                 Random Access Iterator
    self& operator+=(Distance n)                     Advance a distance                Random Access Iterator
    self operator-(Distance n) const                 Subtract a distance               Random Access Iterator
    self& operator-=(Distance n)                     Retreat a distance                Random Access Iterator
    difference_type operator-(const self& x) const   Difference between two iterators
    value_type operator[](Distance n) const          Access at an offset
    bool operator==(const self& x) const             Equality                          Trivial Iterator
    bool operator!=(const self& x) const             Inequality                        Trivial Iterator
    bool operator<(const self& x) const              Less than                         Random Access Iterator
B.3.3 sparse iterator
Description This iterator is used to implement the sparse1D adaptor. The base iterator returns an entry1 (an index-value pair); this iterator makes it look like we are dealing only with the value on dereference, while the index() method returns the index.
Template Parameters
    Parameter    Default    Description
    Iterator                the underlying iterator type
    T                       the value type
Members
    Declaration                                      Description
    PR
    iterator_category                                The iterator category
    value_type                                       The value type
    difference_type                                  The type for differences between iterators
    reference                                        The type for references to value type
    pointer                                          The type for pointers to value type
    sparse_iterator()                                Default Constructor
    sparse_iterator(const sparse_iterator& x)        Copy Constructor
    sparse_iterator(const Iterator& iter, int p = 0)         Constructor from underlying iterator
    sparse_iterator(const Iterator& start, const Iterator& finish)    Constructor from a range of underlying iterators
    self& operator=(const self& x)                   Assignment Operator
    operator Iterator()                              Convert to the underlying iterator
    int index() const                                Return the index of the element pointed to by this iterator (defined in IndexedIterator)
    bool operator!=(const self& x) const             Return whether this iterator is not equal to iterator x
    bool operator<(const self& x) const              Return whether this iterator is less than iterator x
    reference operator*() const                      Dereference, return the element pointed to by this iterator
    self& operator++()                               Pre-increment operator
    self operator++(int)                             Post-increment operator
    self& operator--()                               Pre-decrement operator
    self& operator+=(int n)                          Add distance n to this iterator
    self& operator-=(int n)                          Subtract distance n from this iterator
    self operator+(int n) const                      Add this iterator and distance n
    self operator-(int n) const                      Subtract this iterator and distance n
    int operator-(const self& x) const               Return the difference between this iterator and iterator x
    Iterator base() const                            Access the underlying iterator
    Iterator iter
    int pos
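The index-value-pair trick can be sketched in plain C++: the base sequence stores (index, value) pairs, and the adaptor dereferences to the value alone while index() exposes the index. The names below are illustrative, not MTL's:

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Sketch of the sparse1D idea: dereference yields the stored value,
// index() yields the position that value occupies in the full vector.
template <class BaseIter>
class sparse_iter {
public:
    explicit sparse_iter(BaseIter i) : i_(i) {}
    double operator*() const { return i_->second; }
    int index() const { return i_->first; }
    sparse_iter& operator++() { ++i_; return *this; }
    bool operator!=(const sparse_iter& x) const { return i_ != x.i_; }
private:
    BaseIter i_;
};
```

Algorithms written against IndexedIterator then work identically on dense and sparse vectors; only index() behaves differently.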
B.3.4 strided iterator
Description This iterator moves a constant stride for each increment or decrement operation. The strided iterator is used to implement row views of column-oriented matrices, or column views of row-oriented matrices.
Model of RandomAccessIterator
Members
    Declaration                                      Description
    difference_type                                  The type for the difference between two iterators
    value_type                                       The value type pointed to by this iterator
    reference                                        The type for references to the value type
    iterator_category                                The iterator category for this iterator
    pointer                                          The type for pointers to the value type
    Distance
    iterator_type                                    The underlying iterator type
    strided_iterator()                               Default Constructor
    strided_iterator(const RandomAccessIterator& x, int s)    Construct from the underlying iterator
    strided_iterator(const self& x)                  Copy Constructor
    self& operator=(const self& x)                   Assignment Operator
    int index() const                                Return the index of the element this iterator points to (defined in IndexedIterator)
    operator RandomAccessIterator() const            Convert to the underlying iterator
    RandomAccessIterator base() const                Access the underlying iterator
    Declaration                                      Description
    reference operator*() const                      Dereference, return the element currently pointed to
    self& operator++()                               Pre-increment operator
    self operator++(int)                             Post-increment operator
    self& operator--()                               Pre-decrement operator
    self operator--(int)                             Post-decrement operator
    self operator+(Distance n) const                 Add this iterator and n
    self& operator+=(Distance n)                     Add distance n to this iterator
    self operator-(Distance n) const                 Subtract this iterator and distance n
    self& operator-=(Distance n)                     Subtract distance n from this iterator
    self operator+(const self& x) const              Add this iterator and iterator x
    Distance operator-(const self& x) const          Return the distance between this iterator and iterator x
    reference operator[](Distance n) const           Return *(i + n)
    bool operator==(const self& x) const             Return whether this iterator is equal to iterator x
    bool operator!=(const self& x) const             Return whether this iterator is not equal to iterator x
    bool operator<(const self& x) const              Return whether this iterator is less than iterator x
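The mechanics are simple enough to sketch in a few lines of standard C++; the class name is illustrative, not MTL's. Each ++ advances the base iterator by a fixed stride, which turns a column-major array into a row view:

```cpp
#include <cassert>
#include <vector>

// Sketch of a strided iterator: stepping by `stride` through the
// underlying flat storage walks one row of a column-major matrix.
template <class Iter>
class strided_iter {
public:
    strided_iter(Iter i, int stride) : i_(i), s_(stride) {}
    double operator*() const { return *i_; }
    strided_iter& operator++() { i_ += s_; return *this; }
    bool operator!=(const strided_iter& x) const { return i_ != x.i_; }
private:
    Iter i_;
    int s_;
};
```

For a 3x3 column-major matrix the stride for a row view is 3 (the leading dimension), so row 0 is elements 0, 3, 6 of the flat storage.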
B.3.5 transform iterator
Description This iterator adaptor applies some function during the dereference.
Template Parameters
    Parameter        Default    Description
    Iterator                    The underlying iterator type
    UnaryFunction               A function that takes one argument of value_type
Members
    Declaration                                      Description
    value_type                                       The value type
    difference_type                                  The difference type
    iterator_category                                The iterator category
    pointer                                          The pointer type
    reference                                        The reference type
    transform_iterator(Iterator i, UnaryFunction op)          Normal Constructor
    transform_iterator(const transform_iterator& x)           Copy Constructor
    transform_iterator& operator=(const transform_iterator& x)    Assignment Operator
    value_type operator*()                           Dereference Operator (applies the function here)
Appendix C Algorithms

C.0.6 sum
Prototype

    template <class Vector>
    linalg_traits::value_type sum(const Vector& x);
Description The sum of all of the elements in the container. Definition mtl.h Requirements on types
The addition operator must be defined for Vector::value type.
Complexity linear
Example In vec_sum.cc:

    mtl::dense1D< double > x(10, 2.0);
    double s = vec::sum(x);

The related scale function multiplies each element of the container in place:

    mtl::dense1D< double > x(10, 2.0);
    vec::scale(x, 2.0);
    mtl::print_vector(x);
C.0.9 set diagonal
Prototype

    template <class Matrix, class T>
    void set_diagonal(Matrix& A, const T& alpha);
Description Set the value of the elements on the main diagonal of A to alpha. Definition mtl.h Requirements on types
T must be convertible to Matrix::value type.
Complexity O(min(m,n)) for dense matrices, O(nnz) for sparse matrices (except envelope, which is O(m))
Example In tri_pack_sol.cc:

    const int N = 4;
    Matrix A(N, N);
    set_diagonal(A, 1);
    //
    // A = 1.0
    //     2.0  1.0
    //     3.0  5.0  1.0
    //     4.0  6.0  7.0  1.0
    //
    // b = 8.0  25.0  79.0  167.0
    //
    A(1,0) = 2; A(2,1) = 5; A(3,2) = 7;
    A(2,0) = 3; A(3,1) = 6;
    A(3,0) = 4;
C.0.10 two norm
Prototype

    template <class Vector>
    linalg_traits::magnitude_type two_norm(const Vector& x);
Description The square root of the sum of the squares of the elements of the container. Definition mtl.h Requirements on types
Vector must have an associated magnitude type that is the type of the absolute value of Vector::value type.
There must be std::abs() defined for Vector::value type.
The addition must be defined for magnitude type.
sqrt() must be defined for magnitude type.
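The requirements above amount to saying that the norm reduces to an inner product of x with itself followed by a square root. A standard-C++ equivalent (not MTL's implementation) makes this concrete:

```cpp
#include <cassert>
#include <cmath>
#include <numeric>
#include <vector>

// two_norm(x) == sqrt(sum of squares of the elements of x).
double two_norm_sketch(const std::vector<double>& x) {
    double ss = std::inner_product(x.begin(), x.end(), x.begin(), 0.0);
    return std::sqrt(ss);
}
```

For ten elements all equal to 2.0 this yields sqrt(40), matching the vec_two_norm.cc example below.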
Complexity O(n)
Example In vec_two_norm.cc:

    dense1D< double > x(10, 2.0);
    double s = two_norm(x);
    cout << s << endl;

F.0.7 transform
Prototype

    template <int N, class InIter1, class InIter2, class OutIter, class BinOp>
    OutIter transform(InIter1 first1, count<N>, InIter2 first2, OutIter result, BinOp binary_op);
Description Takes input from two iterators, applies a binary operator, and outputs the result into a third iterator. Definition fast.h
F.0.8 fill
Prototype

    template <int N, class OutputIterator, class T>
    OutputIterator fill(OutputIterator first, count<N>, const T& value);

Description Assign the value to the N elements starting at first. Definition fast.h
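The FAST technique behind these prototypes can be sketched with template recursion: the element count is a template parameter, so the "loop" is fully unrolled at compile time. The names below (static_fill) are illustrative, not the library's:

```cpp
#include <cassert>

// Compile-time unrolled fill: each recursion level assigns one element,
// and the specialization for 0 terminates the recursion.
template <int N>
struct static_fill {
    template <class Iter, class T>
    static void apply(Iter first, const T& value) {
        *first = value;
        static_fill<N - 1>::apply(first + 1, value);
    }
};

template <>
struct static_fill<0> {
    template <class Iter, class T>
    static void apply(Iter, const T&) {}  // base case: nothing left to fill
};
```

After inlining, static_fill<4>::apply(a, v) compiles to four straight-line assignments with no loop counter or branch, which is the point of fixing the size at compile time.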
APPENDIX F. FIXED ALGORITHM SIZE TEMPLATE (FAST) LIBRARY
F.0.9 swap ranges
Prototype

    template <int N, class ForwardIterator1, class ForwardIterator2>
    ForwardIterator2 swap_ranges(ForwardIterator1 first1, count<N>, ForwardIterator2 first2);

Description Swap N elements between first1 and first2. Definition fast.h
F.0.10 accumulate
Prototype

    template <int N, class InputIterator, class T>
    T accumulate(InputIterator first, count<N>, T init);

Description Sum the N elements starting at first, beginning with init. Definition fast.h
F.0.11 accumulate
Prototype

    template <int N, class InputIterator, class T, class BinaryOperation>
    T accumulate(InputIterator first, count<N>, T init, BinaryOperation binary_op);
Description Accumulate the result of the binary operator applied to the N elements of first and init. Definition fast.h
F.0.12 inner product
Prototype

    template <int N, class InIter1, class InIter2, class T, class BinOp1, class BinOp2>
    T inner_product(InIter1 first1, count<N>, InIter2 first2, T init, BinOp1 binary_op1, BinOp2 binary_op2);

Description A fixed size inner product. Definition fast.h
F.0.13 inner product
Prototype

    template <int N, class InIter1, class InIter2, class T>
    T inner_product(InIter1 first1, count<N>, InIter2 first2, T init);

Description A fixed size inner product using addition and multiplication operators. Definition fast.h
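The fixed-size inner product is the canonical example of the unrolling technique: recursion on the template parameter replaces the run-time loop. A sketch under illustrative names (inner_prod_static), not the library's own:

```cpp
#include <cassert>

// Compile-time unrolled inner product: each level contributes one
// x[i]*y[i] term; the <0> specialization returns the accumulated sum.
template <int N>
struct inner_prod_static {
    template <class It1, class It2, class T>
    static T apply(It1 x, It2 y, T init) {
        return inner_prod_static<N - 1>::apply(x + 1, y + 1, init + *x * *y);
    }
};

template <>
struct inner_prod_static<0> {
    template <class It1, class It2, class T>
    static T apply(It1, It2, T init) { return init; }
};
```

With N known at compile time the compiler emits N multiply-adds in a row, which is what lets these kernels compete with hand-unrolled Fortran.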
F.0.14 count
Description A class for representing numbers at compile time.
Members
    enum { N = NN }
Appendix G Basic Linear Algebra Instruction Set (BLAIS) Library

G.0.15 add
Description This adds vector x into vector y.
Example In blais_add.cc:

    template <class VecX, class VecY>
    inline void do_add(VecX& x, VecY& y)
    {
      for (int i = 0; i < N; ++i) {
        x[i] = i;
        y[i] = i + 1;
      }
      blais_vv::add(x.begin(), y.begin());
    }

    int ix[N], iy[N];
    external_vec x1(ix, N);
    external_vec y1(iy, N);
    do_add(x1, y1);

    dense1D x2(N);
    dense1D y2(N);
    do_add(x2, y2, N);
Template Parameters
    Parameter    Default    Description
    N                       the length of the vectors
Members
    template <class Vec1, class Vec2> add(Vec1 x, Vec2 y)
G.0.16 copy
Description Copies vector x into vector y.
Template Parameters
    Parameter    Default    Description
    N                       the length of the vectors
Members
    template <class Vec1, class Vec2> copy(Vec1 x, Vec2 y)
G.0.17 copy
Description Copies matrix A into matrix B.
Template Parameters
    Parameter    Default    Description
    M                       Number of rows in A
    N                       Number of columns in A
Members
    template <class MatrixA, class MatrixB> copy(MatrixA& A, MatrixB& B)
G.0.18 dot
Template Parameters
    Parameter    Default    Description
    N                       the length of the vectors
Members
    template <class Vec1, class Vec2, class T> dot(Vec1 x, Vec2 y, T& prod)
G.0.19 mult
Template Parameters
    Parameter    Default    Description
    M                       Number of rows in A
    N                       Number of columns in A
Members
    template <class Matrix, class VecX, class VecY> mult(const Matrix& A, VecX x, VecY y)
G.0.20 mult
Template Parameters
    Parameter    Default    Description
    M                       Number of rows in A and C
    N                       Number of columns in B and C
    K                       Number of columns in A and rows in B
Members
    template <class MatrixA, class MatrixB, class MatrixC> mult(MatrixA& A, MatrixB& B, MatrixC& C)
G.0.21 rank one
Description Perform a rank-one update (outer product) on the M x N statically sized matrix.
Template Parameters
    Parameter    Default    Description
    M                       Number of rows in A
    N                       Number of columns in A
Members
    template <class Matrix, class VecX, class VecY> rank_one(Matrix& A, VecX x, VecY y)
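The operation itself is A(i,j) += x[i] * y[j] over all i, j; with M and N as template parameters both loops can be unrolled. A sketch under illustrative names (rank_one_sketch, Mat2), not BLAIS's own code:

```cpp
#include <cassert>

// Toy 2x2 matrix type used only to demonstrate the update.
struct Mat2 {
    double d[4];
    double& operator()(int i, int j) { return d[i * 2 + j]; }
};

// Rank-one update A += x * y^T with static sizes M and N; fixing the
// bounds at compile time lets the compiler unroll both loops.
template <int M, int N, class Matrix, class VecX, class VecY>
void rank_one_sketch(Matrix& A, const VecX& x, const VecY& y) {
    for (int i = 0; i < M; ++i)
        for (int j = 0; j < N; ++j)
            A(i, j) += x[i] * y[j];
}
```

For x = (1, 2) and y = (3, 4) starting from A = 0, the result is the outer product [[3, 4], [6, 8]].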
G.0.22 set
Description Set the elements of the statically sized vector x (length N) to alpha.
Template Parameters
    Parameter    Default    Description
    N                       static length of x
Members
    template <class Vector, class T> set(Vector x, const T& alpha)
G.0.23 set
Description Set the elements of the statically sized M x N matrix to alpha.
Members
    template <class Matrix, class T> set(Matrix A, const T& alpha)
Appendix H MTL to LAPACK Interface

H.0.24 lapack matrix
Description Use this matrix type constructor to create the type of matrix to use in conjunction with the mtl2lapack functions. The vector type you use with mtl2lapack functions must be contiguous in memory, have a function data() defined which returns a pointer to that memory, and a function size() which gives the length.
Example In getrf.cc:

    double da[] = { 1, 2, 2, 2, 1, 2, 2, 2, 1 };
    lapack_matrix::type A(da, M, N);
    lapack_matrix::type B(M*NRHS, NRHS);
    mtl::set(B, 15.0);
    dense1D pivot(N, 0);
    int info = getrf(A, pivot);
    if (info == 0) {
      info = getrs('N', A, pivot, B);
      if (info == 0) {
        cout ...

row_cond (OUT - Real number) If INFO = 0 or INFO > M, ROWCND contains the ratio of the smallest R(i) to the largest R(i). If ROWCND >= 0.1 and AMAX is neither too large nor too small, it is not worth scaling by R.
col cond (OUT - Real number) If INFO = 0, COLCND contains the ratio of the smallest C(i) to the largest C(i). If COLCND >= 0.1, it is not worth scaling by C.
amax (OUT - Real number) Absolute value of largest matrix element. If AMAX is very close to overflow or very close to underflow, the matrix should be scaled.
Definition mtl2lapack.h
H.1.9 gelqf
Prototype

    template <class LapackMatA, class VectorT>
    int gelqf(LapackMatA& a, VectorT& tau);
Description Compute an LQ factorization of an M-by-N matrix A.

a (IN/OUT - matrix(M,N)) On entry, the M-by-N matrix A. On exit, the elements on and below the diagonal of the array contain the m-by-min(m,n) lower trapezoidal matrix L (L is lower triangular if m <= n).
I.2.1 read dense matlab
Description The matrix type for this function is the following:

    typedef matrix::type matlab_dense;

Definition matlabio.h
I.2.2 write dense matlab
Prototype void write_dense_matlab(matlab_dense& A, char* matrix_name, const char* file);
Description The matrix type for this function is the following:

    typedef matrix::type matlab_dense;
Definition matlabio.h
I.2.3 read sparse matlab
Prototype void read_sparse_matlab(matlab_sparse& A, char* matrix_name, const char* file);
APPENDIX I. UTILITIES
Description The matrix type for this function is the following:

    typedef matrix< double, rectangle, array< compressed >, column_major >::type matlab_sparse;
Definition matlabio.h
I.2.4 write sparse matlab
Prototype void write_sparse_matlab(matlab_sparse& A, char* matrix_name, const char* file);
Description The matrix type for this function is the following typedef matrix< double, rectangle, array< compressed >, column_major >::type matlab_sparse;
Definition matlabio.h
I.3 Classes I.3.1 dimension Description This is similar to the std::pair class except that it can have static parameters, and only deals with size types. The purpose of this class is to transparently hide whether the dimensions
of a matrix are specified statically or dynamically.
Members
    transpose_type
    size_type
    enum { M = MM, N = NN }
    dimension()
    dimension(const dimension& x)
    dimension(const std::pair& x) : m(x.first)
    dimension(const Dim& x)
    dimension(size_type m, size_type n)
    dimension& operator=(const dimension& x)
    size_type first() const
    size_type second() const
    bool is_static() const
    transpose_type transpose() const
    size_type m, n
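The static-or-dynamic idea can be sketched in a few lines: when the template arguments are nonzero the sizes are compile-time constants, otherwise the run-time members are used, and callers see the same first()/second() interface either way. The names below (dim) are illustrative, not MTL's:

```cpp
#include <cassert>
#include <cstddef>

// Sketch of a pair-like dimension class: MM and NN default to 0,
// meaning "decide at run time"; nonzero values fix the sizes statically.
template <std::size_t MM = 0, std::size_t NN = 0>
struct dim {
    dim(std::size_t m = MM, std::size_t n = NN) : m_(m), n_(n) {}
    std::size_t first() const { return MM ? MM : m_; }
    std::size_t second() const { return NN ? NN : n_; }
    bool is_static() const { return MM != 0 && NN != 0; }
    std::size_t m_, n_;  // only consulted in the dynamic case
};
```

Code templated on the dimension type works unchanged in both cases, while in the static case the compiler can fold first() and second() to constants.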
I.3.2 harwell boeing stream
Description This class simplifies the job of creating matrices from files stored in the Harwell-Boeing format. All matrix types have a constructor that takes a harwell_boeing_stream object. One can also access the elements from a matrix stream using operator>>(). The stream handles both real and complex numbers. Usage:

    harwell_boeing_stream mms( filename );
    Matrix A(mms);
Template Parameters
    Parameter    Default    Description
    T                       the matrix element type (double or complex)
Members
    Declaration                              Description
    harwell_boeing_stream(char* filename)    Construct from file name
    ~harwell_boeing_stream()                 Destructor
    int nrows() const                        Number of rows in matrix
    int ncols() const                        Number of columns in matrix
    int nnz() const                          Number of non-zeroes in matrix
    bool eof()                               At the end of the file?
    bool is_complex()                        Whether the matrix is complex
I.3.3 matrix market stream
Description This class simplifies the job of creating matrices from files stored in the Matrix Market format. All matrix types have a constructor that takes a matrix_market_stream object. One can also access the elements (of type entry2) from a matrix stream using the stream operator. The stream handles both real and complex numbers. Usage:

    matrix_market_stream mms( filename );
    Matrix A(mms);
Template Parameters
    Parameter    Default    Description
    T                       the matrix element type (double or complex)
Members
    Declaration                             Description
    matrix_market_stream(char* filename)    Construct from filename
    ~matrix_market_stream()                 Destructor, closes the file
    bool eof() const                        At the end of the file yet?
    int nrows() const                       Number of rows in matrix
    int ncols() const                       Number of columns in matrix
    int nnz() const                         Number of non-zeroes in matrix
    bool is_symmetric() const               Whether the matrix is symmetric
    bool is_complex() const                 Whether the matrix is complex
    bool is_hermitian() const               Whether the matrix is hermitian