The Parallel Mathematical Libraries Project (PMLP): A Next-Generation Scalable, Sparse, Object-Oriented Mathematical Library Suite*

Lubomir Birov, Yuri Bartenev†, Anatoly Vargin†, Avijit Purkayastha‡, Anthony Skjellum, Yoginder Dandass, Purushotham Bangalore

Mississippi State University, High Performance Computing Laboratory, Department of Computer Science, NSF Engineering Research Center for Computational Field Simulation, PO Box 9627, Mississippi State, MS 39762

† Russian Federal Nuclear Center (VNIIEF), Mira 37, Sarov, Nizhny Novgorod region, Russia

‡ Additional support is also acknowledged from the University of Puerto Rico, Mayaguez, for sabbatical leave.

* Support from the United States Industry Coalition (USIC) through DOE under subcontracts #B319811, #B329138, and #B342021 from LLNL is gratefully acknowledged. This work was also supported in part under contract from LLNL, funded by the Initiatives for Proliferation Prevention (IPP) Program of the U.S. DoE Office of Arms Control and Non-Proliferation. Additional support from the NSF Career Program, Grant #ASC-9501917, is acknowledged.
Abstract

The Parallel Mathematical Libraries Project (PMLP), a joint effort of Intel, Lawrence Livermore National Laboratory, the Russian Federal Nuclear Laboratory (VNIIEF), and Mississippi State University (MSU), constitutes a concerted effort to create a supportable, comprehensive "Sparse Object-Oriented Mathematical Library Suite." With overall design and software validation work at MSU, and most software development and testing at VNIIEF, this international collaboration brings object-oriented programming techniques and C++ to the task of providing linear and nonlinear algebraic algorithms for scientists and engineers. Language bindings for C, Fortran-77, and C++ are provided, offering the widest possible applicability. PMLP differs from other major library efforts in its systematic use of software engineering and design, including efforts to provide high performance, portability, and usability. We also highlight important contributions of this effort in the form of design principles, such as storage-format independence and data-distribution independence, which contribute to performance, ease of use, application interoperability, and portability. Finally, we provide an initial set of benchmarked results.
1 Introduction
The Parallel Mathematical Libraries Project (PMLP), a joint collaborative research effort of Intel Corporation, Lawrence Livermore National Laboratory, the Russian Federal Nuclear Laboratory (VNIIEF), and Mississippi State University (High Performance Computing Laboratory), constitutes a concerted effort to create a supportable "Sparse Object-Oriented Mathematical Library Suite." This work builds on over a decade of flexible, portable, and performance-oriented scalable library efforts that commenced with research on the Multicomputer Toolbox [14]. Unlike the fixed-distribution approach used in libraries such as ScaLAPACK [6], PMLP and the Multicomputer Toolbox emphasize flexible, application-relevant data layouts. The scope of the project includes libraries for sequential sparse basic linear algebra, parallel sparse matrix-vector multiplication, sequential and parallel iterative and direct solvers, a suite for mapping, conversion, and redistribution of sparse data objects, and a parallel matrix reordering library. In addition, the project includes a parallel preconditioner library for iterative solvers, a parallel preconditioner library based on sparse LU and ILU, a Newton library, and Jacobian technology [12]. As further goals, parallel random number generation and multigrid solver libraries are under consideration. The library closest in functionality to PMLP is PETSc [2, 1], although PMLP is projected to also contain direct solvers in a future release. The two libraries are, however, quite different in their design and technological features. In PMLP, abstract data structures are emphasized more than in traditional sparse libraries, and object orientation is emphasized more than in recent recasts thereof. Oddities such as the "reverse calling mechanism" are avoided by the modern design.

The remainder of this paper is organized as follows. Section 2 encompasses the main technical contributions of the PMLP effort; five key design features and several minor features are described there. Section 3 continues with the object-oriented design of PMLP, and Section 4 describes an initial set of results. In Section 5 we describe the current state of the implementation, and we then conclude with a summary of the main design goals and achievements thus far.
2 Important Features of PMLP
The Parallel Mathematical Libraries Project is a third-generation scalable library effort. By combining features such as object-oriented design, sequential and parallel modes, and regular (dense) and irregular (sparse) computational kernels, PMLP achieves ease of use while also being amenable to runtime optimization, and thus offers an efficient tool for parallel scientific and engineering applications. The remainder of this section highlights the important characteristics of the library, generalizing the framework presented in [13].
2.1 Poly-algorithmic Approach
The performance of a parallel algorithm depends on the architecture of the machine used, the global and local data distributions, the problem size, the number of processes, and so on. Clearly, a single algorithm cannot always be the best performer given the diversity of architectures and physical constraints. Even if an algorithm is the best one in the general (average) case, it may not be the best under a specific configuration, such as a particular storage format or a particular distribution of data. Providing poly-algorithmic features therefore becomes a requirement for the implementation of a powerful and efficient parallel software library [9]. The ability to combine different algorithms operating under various data representation schemes is one of the main features of PMLP. The object-oriented design, with consistent user interfaces, supports encompassing multiple related algorithms within "families," as sketched below.
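As a minimal illustration of the idea (all names below are hypothetical, not PMLP's actual interface), a single user-visible operation can dispatch among several tuned variants of one algorithm family based on runtime attributes such as the data distribution and the problem size:

#include <cstddef>

enum class Distribution { Linear, Scatter };

// Three members of one algorithm "family"; bodies elided for brevity.
void matvec_linear_dist(std::size_t n)  { (void)n; /* kernel tuned for block rows  */ }
void matvec_scatter_dist(std::size_t n) { (void)n; /* kernel tuned for cyclic rows */ }
void matvec_small(std::size_t n)        { (void)n; /* low-overhead variant, tiny n */ }

// One user-visible entry point; the variant is chosen internally.
void matvec(std::size_t n, Distribution d) {
    if (n < 1024)
        matvec_small(n);
    else if (d == Distribution::Linear)
        matvec_linear_dist(n);
    else
        matvec_scatter_dist(n);
}

The user interface stays fixed while new variants can be added to the dispatcher as further architectures or distributions warrant them.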
2.2 Storage Format Independence
PMLP provides functionality independent of the internal data representation of irregular sparse objects. In the case of sparse matrices, different storage formats of the matrix elements are supported. The current version of PMLP includes coordinate, compressed sparse column, compressed sparse row, sparse diagonal, dense, Ellpack/Itpack, and skyline matrix formats [12]. Using parameterized classes (templates), new storage formats can be added easily, as the sketch below suggests. The same functionality is provided for dense objects as well as for sparse ones, making the system usable in different environments. An important contribution of this work is to treat storage format as a mechanism for achieving performance, rather than as a requirement for how users must format their data in order to use the library. Other internal formats unknown (opaque) to users may consequently be supported as well, enhancing cache performance. By way of contrast to traditional libraries, explicit formats are revealed only to support legacy users and Fortran-77.
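The following sketch suggests how templates make this possible; the class and format names are illustrative stand-ins, not PMLP's actual classes:

#include <cstddef>
#include <vector>

struct COO {                       // coordinate format: (i, j, value) triples
    std::vector<std::size_t> i, j;
    std::vector<double> v;
    std::size_t nnz() const { return v.size(); }
};

struct CSR {                       // compressed sparse row format
    std::vector<std::size_t> row_ptr, col;
    std::vector<double> v;
    std::size_t nnz() const { return v.size(); }
};

// Generic matrix wrapper parameterized on its storage format.
template <typename Format>
class SparseMatrix {
    Format storage_;
    std::size_t rows_, cols_;
public:
    SparseMatrix(std::size_t r, std::size_t c) : rows_(r), cols_(c) {}
    std::size_t nnz() const { return storage_.nnz(); }  // forwarded, format-independent
    // Kernels such as matvec are overloaded per format for efficiency,
    // but user code always calls the same entry points.
};

// The same user-level code compiles unchanged for either representation:
//   SparseMatrix<COO> a(5000, 5000);
//   SparseMatrix<CSR> b(5000, 5000);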
2.3 Mathematical Entity Type Independence
PMLP supports the different types of mathematical entities that arise in the scientific application domain. Matrix functionality covers general, banded, symmetric, banded symmetric, skew-symmetric, Hermitian, skew-Hermitian, and lower and upper triangular matrices. Any combination of matrix type and storage format is enabled. The functions are overloaded for the different types of mathematical entity, giving the library an efficient implementation for both concrete-type and type-independent uses, as in the sketch below.
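A small sketch of this kind of specialization (hypothetical names; coordinate-style storage is assumed for brevity): the symmetric overload stores only one triangle yet presents the same calling convention, so callers are unaffected by the specialization.

#include <cstddef>
#include <vector>

struct GeneralMatrix {             // all nonzeros stored
    std::vector<std::size_t> i, j; std::vector<double> v;
};
struct SymmetricMatrix {           // only the lower triangle stored (i >= j)
    std::vector<std::size_t> i, j; std::vector<double> v;
};

// Overload resolution picks the kernel matching the entity type.
// y must be zero-initialized and sized to the number of rows.
void matvec(const GeneralMatrix& A, const std::vector<double>& x, std::vector<double>& y) {
    for (std::size_t k = 0; k < A.v.size(); ++k)
        y[A.i[k]] += A.v[k] * x[A.j[k]];
}
void matvec(const SymmetricMatrix& A, const std::vector<double>& x, std::vector<double>& y) {
    for (std::size_t k = 0; k < A.v.size(); ++k) {
        y[A.i[k]] += A.v[k] * x[A.j[k]];
        if (A.i[k] != A.j[k])                  // mirror the off-diagonal entry
            y[A.j[k]] += A.v[k] * x[A.i[k]];
    }
}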
2.4 Data Distribution Independence
As a poly-algorithm library, an important part of PMLP is the implementation of additional data-distribution-independent (DDI) algorithms. The DDI algorithms used in PMLP are based on the Multicomputer Toolbox [11] implementation of a concurrent BLAS library, which in turn arose from concepts due to van de Velde [17]. Because fixed-data-distribution algorithms may force data redistribution, the DDI approach used in PMLP is more efficient in many cases. The structure of the library allows fixed-data-distribution and DDI algorithms to be combined, offering further potential for runtime optimization and enhancing performance portability.
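As a sketch of the DDI style (assuming MPI, with a hypothetical function name), an inner product can be written against only the locally owned elements, so the same algorithm is correct under any one-dimensional mapping, linear or cyclic, with no redistribution:

#include <mpi.h>
#include <cstddef>
#include <vector>

// Data-distribution-independent dot product: each process supplies whatever
// elements its mapping assigned to it; the combination step is the same
// regardless of which mapping produced the local pieces.
double ddi_dot(const std::vector<double>& x_local,
               const std::vector<double>& y_local, MPI_Comm comm) {
    double partial = 0.0;
    for (std::size_t i = 0; i < x_local.size(); ++i)
        partial += x_local[i] * y_local[i];    // local contribution only
    double global = 0.0;
    MPI_Allreduce(&partial, &global, 1, MPI_DOUBLE, MPI_SUM, comm);
    return global;
}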
2.5 Persistent Calling Interface
The PMLP library provides a persistent function-calling interface. When a persistent function is initiated, an opaque handle encapsulates the information associated with the function and enables optimized repeated execution of the same function. A program that exhibits temporal locality in operation execution can thus declare operation choices in advance, and exploit poly-algorithmic choices to optimize single operations and small combinations of operations. The fixed setup cost is amortized over N uses of the faster kernel. When one or a few kernels have dominant complexity, this type of optimization may be extremely beneficial to overall performance: it removes error checking, as well as dynamic memory allocation, from inner loops, while supporting the deployment of poly-algorithms.
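The following sketch, loosely modeled on persistent requests in MPI and using hypothetical names rather than PMLP's actual API, shows the intended usage pattern: a one-time "init" performs validation, poly-algorithm selection, and workspace setup; the returned opaque handle then drives many cheap executions.

#include <cstddef>
#include <vector>

struct Matrix { std::size_t n; std::vector<double> a; };  // dense, row-major, illustrative

struct MatvecPlan {                                       // contents opaque to the user
    const Matrix* A; const std::vector<double>* x; std::vector<double>* y;
};

MatvecPlan* matvec_init(const Matrix& A, const std::vector<double>& x,
                        std::vector<double>& y) {
    // One-time work: argument validation, algorithm selection, buffer setup.
    return new MatvecPlan{&A, &x, &y};
}

void matvec_execute(MatvecPlan* p) {                      // inner loop: no checks, no allocation
    for (std::size_t i = 0; i < p->A->n; ++i) {
        double s = 0.0;
        for (std::size_t j = 0; j < p->A->n; ++j)
            s += p->A->a[i * p->A->n + j] * (*p->x)[j];
        (*p->y)[i] = s;
    }
}

void matvec_free(MatvecPlan* p) { delete p; }

// Usage: the fixed cost of matvec_init is amortized over N executions.
//   MatvecPlan* plan = matvec_init(A, x, y);
//   for (int k = 0; k < N; ++k) matvec_execute(plan);
//   matvec_free(plan);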
2.6 Portability, C and Fortran Bindings
In addition to the features mentioned above, PMLP is a thread-safe, portable library that provides C and Fortran bindings. Inter-process communication is based on MPI 1.1 [15], which assures de facto portability to all parallel platforms that support MPI. The C and Fortran bindings strive to make the system interoperable with most existing engineering applications. Notably, the sequential interface is sufficiently abstract to hide the use of SMP or vector parallelism, if present in an implementation. Support for multiple types of "independence," together with poly-algorithms and persistent calling interfaces, provides an extremely powerful environment into which to cast algorithms, and for users to exploit such algorithms in realistic applications.
3 Object-Oriented Design of PMLP
The main motivation behind the use of object-oriented (OO) design methodology in the project has been to enable an implementation of a flexible, robust, and maintainable library that also provides maximum efficiency. Along with stipulating performance comparable to that of computational kernels written in low-level languages, the loosest coupling and strongest cohesion of the entities has been chosen. An OO approach is used for high-level and user-level management of objects. Efficiency is maximized by selecting the most efficient implementation of the required functionality for each particular data representation scheme of a PMLP object. Particular attention is paid to minimizing looping over the same data and to finding efficient means of iterating through the data. Object orientation provides both data encapsulation and polymorphism [5]. That is, the internal data storage is not accessible to the users, assuring information hiding. Static compile-time polymorphism is strongly supported in the implementation. Function overloading for the different entity types and storage formats makes the specialization transparent to the user, while providing an efficient implementation. Unified design and functionality of all entity (matrix, graph) types is provided for the user. In addition, the OO techniques used minimize the copying of entities and the number of temporary objects. The emerging standard modeling language, the Unified Modeling Language (UML) [7], is used in the representation of the design. The following sections detail techniques and technologies used to realize PMLP.
3.1 Handles
Efficient management of large objects is a required part of most scientific and engineering applications. Avoiding memory copies of an entity is often a critical factor in the performance of such applications. One of the most useful and efficient techniques in the PMLP design is handles, which manage the allocation, copying, and deallocation of memory. By using handles, users are not concerned with the efficient management of memory resources. Moreover, the handles in the library provide a safe, easy, and transparent way of sharing objects. A typical implementation of a handle as a "smart pointer" contains a pointer to the handled (represented) object [16, 18]. The represented class contains a reference-count member and an interface that manages that count. By overloading the assignment operator, handles provide a shallow copy of the representation, incrementing its reference count by one. A uniform, reusable implementation of the reference-counting functionality in PMLP is provided by the RefCount class (Figure 1). All handled classes are derived from this class.
In addition to the efficient "smart pointer" functionality, handles in PMLP provide a flexible, maintainable, and user-friendly system, explained below. An important feature of the reference counting is that it is transparent to the users. Users also need not worry about handle assignment and copying. The handles have value semantics rather than pointer semantics ([16], p. 294). After the assignment operation Handle1 = Handle2, the two handles are logically distinct, and changes to one of them do not affect the other. Although Handle1 and Handle2 point to the same storage, they provide copy-on-write semantics: if one of the handles is changed such that the change affects the storage to which the handle points, the represented object is copied whenever other handles pointing to this storage also exist. Handles provide an interface identical to that of the class they handle. Handles also provide independence between the user classes and the hierarchy of the handled classes, which are kept internal to the implementation. However, since performance is a key goal in the project, not all of the classes in the library are handle-based. Thus, unnecessary dereferencing operations are avoided and only the large objects are managed. Also, there is no single, generic handle class for all handled objects. Different handle classes are used, and each of them manages similar kinds of entities (matrix storage formats, for example). In this way, the specific properties of the "handled" classes are exploited, and transparency of the handle is assured by providing the same functionality and construction as that of the representation. Handles are implemented as parameterized classes. The representation of the templatized parameter is referenced by pointer in the handle. Unlike typical "smart pointer" implementations where handles are allocated with the new operation, the handles in PMLP act in the same way as the represented classes. The allocation and deallocation of memory is performed automatically and internally by the handle classes. This avoids mixed usage of handles and pointers to the same object, and the problem of distinguishing between static and dynamic allocation of a handle. A minimal sketch of this reference-counting and copy-on-write scheme follows.
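The sketch below uses simplified stand-in classes (the real PMLP RefCount and handle classes are richer): assignment shares storage and bumps the count, while a mutating access clones the representation first when it is shared, so logically distinct handles never observe each other's changes.

#include <cstddef>

class RefCount {                        // base class of all handled classes
    std::size_t count_ = 1;
public:
    RefCount() = default;
    RefCount(const RefCount&) : count_(1) {}                 // a clone starts unshared
    RefCount& operator=(const RefCount&) { return *this; }   // count is never copied
    void add_ref() { ++count_; }
    std::size_t release() { return --count_; }
    std::size_t refs() const { return count_; }
};

struct Rep : RefCount { double value = 0.0; };  // stand-in for a large entity

class Handle {
    Rep* rep_;
public:
    Handle() : rep_(new Rep) {}                              // allocation is internal
    Handle(const Handle& h) : rep_(h.rep_) { rep_->add_ref(); }  // shallow copy
    Handle& operator=(const Handle& h) {
        h.rep_->add_ref();                       // also handles self-assignment
        if (rep_->release() == 0) delete rep_;
        rep_ = h.rep_;
        return *this;
    }
    ~Handle() { if (rep_->release() == 0) delete rep_; }
    double read() const { return rep_->value; }
    void write(double v) {                       // copy-on-write
        if (rep_->refs() > 1) {                  // storage is shared: detach first
            rep_->release();                     // drop our share of the old rep
            rep_ = new Rep(*rep_);               // private copy, count starts at 1
        }
        rep_->value = v;
    }
};

// Usage: after h2 = h1 the handles share storage, but h2.write(...) copies
// first, so h1's value is unaffected (value semantics).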
3.2 Class Hierarchy
A generic structure of the mathematical entities used in PMLP is presented in this section. The term "entity" covers the mathematical objects (matrices, graphs, etc.) used in the application domain. PMLP supplies two views of a mathematical entity: global (distributed) and local (sequential). The global view represents the complete entity distributed over several processes, whereas a local view of an entity represents the part of the global entity present in a particular process. A process denotes a task that can be executed independently; the term is used instead of "processor" or "node." The complete graphical notation of the local view of a mathematical entity is shown in Figure 1. The RefCount base class provides the functionality needed for reference counting and handling of the object, as explained in the handles section above. Different types of entity elements are enabled by the template parameter used for specifying precision. In the case of a matrix object, the precision types initially allowed are float, double, and the corresponding complex types. The EntityBaseClass is a base class containing the common data (size in each dimension) for the different storage formats of an entity. Each particular storage format class, derived from EntityBaseClass, contains the data specific to the concrete format and specializes the common interface based on the particular data representation. The LocalEntityStorageFormatHandle is a generic handle class for all entity storage formats. Different entity types possessing specific properties (such as symmetry) are specified. Each entity type uses a different storage format for the internal representation of data.
Fig. 1. Generic, sequential structure of mathematical entities. (UML class diagram: RefCount and EntityBaseClass at the top; Entity Storage Format classes 1..N, of which one is selected {or}; the LocalEntityStorageFormatHandle<PR, ESF> handle; Local Entity Type classes 1..N; and LocalMathEntity<PR, ESF, LET>. Template parameters: PR = Precision, ESF = Entity Storage Format, LET = Local Entity Type. User classes are distinguished from implementation classes.)
The LocalEntityStorageFormatHandle class is a base class of all entity types. It defines the general functionality of the entity types. Entity types implement behavior specific to their unique properties and provide different views of the storage formats. For example, a symmetric object exploits its symmetry and stores only half of the object in the chosen storage format. The implementation of the library is based on the compile-time polymorphism provided by C++ templates, which assures maximum performance. The parameterized LocalMathEntity class takes one of its derived entity type classes as a template parameter and forwards all operations to that class at compile time (a technique due to Furnish [8]).

The distributed entity representation is shown in Figure 2. Different logical architectures are represented by the classes Architecture1...ArchitectureN. All of them are reference counted via a handle. The mapping classes (Mapping1, Mapping2, ...) represent the mapping of the distributed object onto the processes. A mapping function describes the data layout by defining the conversion between the local and global indexes of data elements. When N elements are distributed over P processes, the data-distribution mapping function gives the process that contains element I, and the index i of the element in the local process representation. The inverse data-distribution function converts the local index of an element to its global index; a worked sketch of two such mappings follows below. Each mapping class covers a specific one-dimensional distribution (linear, scatter (cyclic), etc.). The mapping classes are also reference counted, by a handle class MappingHandle. The Distribution classes consist of an architecture handle and mapping handles (the number depends on the number of dimensions in which the entity is distributed), which fully describe the data layout. The GlobalEntityWithStorageFormat class consists of a local component that belongs to a particular process (LocalEntityType), an instance of Distribution, and information about global parameters of the entity, such as its global size in each dimension. For the representation of a global entity, different entity types specifying particular properties of the entity are also specified. Static, compile-time polymorphism is the basic approach used. The class hierarchy strongly supports the poly-algorithm approach. New algorithms using different data distributions, storage formats, entity types, or architectures can be added without changing the class structure.

Fig. 2. Global (distributed) generic structure of mathematical entities.
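For concreteness, here is a worked sketch (hypothetical function names) of the linear (block) and scatter (cyclic) one-dimensional mapping functions and their inverses, assuming a block size of ceil(N/P) for the linear case:

#include <cstddef>

struct Placement { std::size_t process, local_index; };

// Linear (block) distribution with block size b = ceil(N / P):
// global element I lives on process I / b at local index I mod b.
Placement linear_map(std::size_t I, std::size_t N, std::size_t P) {
    std::size_t b = (N + P - 1) / P;
    return {I / b, I % b};
}
std::size_t linear_inverse(std::size_t p, std::size_t i,
                           std::size_t N, std::size_t P) {
    std::size_t b = (N + P - 1) / P;
    return p * b + i;                 // recovers the global index I
}

// Scatter (cyclic) distribution: element I goes to process I mod P,
// at local index I / P.
Placement scatter_map(std::size_t I, std::size_t P) {
    return {I % P, I / P};
}
std::size_t scatter_inverse(std::size_t p, std::size_t i, std::size_t P) {
    return i * P + p;
}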
3.3 Templates
Templates are used in PMLP to provide compile-time polymorphism while preserving the efficiency of the code. With parameterized types, the implementation of 7 storage-format classes, 9 matrix-type classes, and 4 precision types yields 7 x 9 x 4 = 252 combinations. Templates produce large object files and require long compilation times, but maximizing efficiency is more important in scientific applications (except, notably, in embedded settings). In addition, templates enable the implementation of an easily modifiable system. With the help of parameterized types, the class hierarchy discussed above allows the addition of new storage formats and matrix types, as well as precision types. New architecture and distribution classes can also be added and used in the case of distributed objects. All of this makes the system flexible and provides an efficient poly-algorithmic basis. Two types of static polymorphism provided by templates are included in the PMLP design. The first is used in the implementation of a parameterized handle class that encapsulates template classes with the same functionality and forwards operations to the template representation (the LocalEntityStorageFormatHandle class). The second approach to compile-time polymorphism is used for classes that differ in their functionality; the recursive template pattern [8] is used in such cases (the LocalMathEntity<PR, ESF, LET> and GlobalMathEntity classes). The motivation for these polymorphic options is given above.
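To make the recursive template pattern concrete, here is a minimal, self-contained sketch (the class names below are simplified stand-ins, not the actual PMLP classes): the base class receives the derived class as a template parameter and forwards calls to it, so dispatch is resolved entirely at compile time, with no virtual-function overhead.

#include <cstdio>

template <typename EntityType>
class LocalEntityBase {
public:
    void print_kind() {                        // forwarded at compile time
        static_cast<EntityType*>(this)->print_kind_impl();
    }
};

class SymmetricEntity : public LocalEntityBase<SymmetricEntity> {
public:
    void print_kind_impl() { std::puts("symmetric entity"); }
};

class GeneralEntity : public LocalEntityBase<GeneralEntity> {
public:
    void print_kind_impl() { std::puts("general entity"); }
};

int main() {
    SymmetricEntity s;
    GeneralEntity g;
    s.print_kind();                            // resolves without vtables
    g.print_kind();
}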
3.4 Iterators
Iterators in PMLP provide a convenient means for users to iterate over the elements of vectors and matrices, regardless of their internal data storage format. They also provide a storage-format-independent means of writing functions that access the elements of objects using disparate storage formats. Since iterators are not an efficient mechanism for accessing elements of sparse matrices, much of the core functionality in PMLP is written using data-access mechanisms specific to particular storage formats. The resulting combinatorial explosion in the number of efficient internal functions is hidden from users via function overloading. PMLP follows the Standard Template Library (STL) conventions for iterators [10]; however, the iterators in the current release are not STL compliant. They will be in a future release. The sketch below shows the intended generic style of use.
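As an illustration, here is a small, hedged sketch of the STL-style generic access that such iterators enable (the names are hypothetical; a std::vector stands in for a sparse object's nonzero values): the same generic function works for any container whose iterator dereferences to an element value, whatever the underlying format.

#include <cstdio>
#include <vector>

// Format-independent accumulation: needs only the iterator contract.
template <typename Iterator>
double sum_nonzeros(Iterator first, Iterator last) {
    double s = 0.0;
    for (Iterator it = first; it != last; ++it)
        s += *it;                  // works regardless of underlying storage
    return s;
}

int main() {
    std::vector<double> coo_values = {1.0, 2.5, -0.5};  // stand-in nonzeros
    std::printf("%g\n", sum_nonzeros(coo_values.begin(), coo_values.end()));
}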
4 Initial Results
We present an initial set of benchmarking results, based on the first alpha release [4]. It must be noted that this release targeted breadth of functionality rather than performance, which is the target for the next release. However, a few of the functions and class methods have been optimized for performance, and these benchmarks represent some of those functions. Specifically, in Figure 3, we ran two of the matrix-vector functions, matvec_product_herm and matvec_product_trans, in performance mode on a 4-processor 200 MHz Pentium Pro system with MPIPro(TM)¹ for Windows NT. Observing the numbers closely, we see that for the matvec_product_herm case the best performance is obtained for 750,000 nonzeros in a complex matrix of order 5000, stored in COO (coordinate) format under a linear distribution on a 4x1 grid, where the efficiency achieved is almost 80%. For the real matrix, the best performance is slightly under 60%, although the elapsed time for matrices with a similar number of nonzeros as the complex case is almost a third, or better.
5 State of the Implementation
At this time, the core functionality of the project, including sparse sequential and parallel basic linear algebra and iterative solvers, is in the integration-testing stage. The sequential sparse linear algebra includes BLAS 1, 2, and 3 functionality. The sparse parallel linear algebra library covers a vector-vector and matrix-vector multiplication suite. The implementation also includes the persistent mode, the C and Fortran bindings, and part of the functionality for converting sequential and distributed objects. The iterative solvers include implementations of the Conjugate Gradient, Generalized Minimal Residual, Transpose-Free Quasi-Minimal Residual, Bi-Conjugate Gradient Stabilized, and Jacobi iterative methods [3]. In addition, preconditioners such as variants of ILU and block Jacobi, among others, are included. All methods are implemented in both sequential and parallel modes. An alpha release of the PMLP library containing all of the above functionality will be publicly released in the first quarter of 1999 [4].
¹ Trademark of MPI Software Technology, Inc.
Fig. 3. Parallel matvec results: (top) Hermitian case, matvec_product_herm; (bottom) transpose case, matvec_product_trans. Each plot shows elapsed time [s] versus number of processors P (1, 2, 4). The cases include general 5000x5000 matrices with 500,000 or 750,000 complex_double elements, general 5000x5000 matrices with 750,000 double elements, and general 10000x10000 matrices with 705,033 float elements, all in COO format, under linear, linear_block, and scatter distributions on Px1 and 1xP grids.
6 Conclusions
The Parallel Mathematical Libraries Project (PMLP) is developing parallel mathematical libraries of numerical methods that are required by a wide range of scientific and engineering simulation codes. The advantage of PMLP is that it anticipates current and near-term future parallel application architectures. Applications of the future will have to cope with significant software complexity resulting from application, problem, and memory-hierarchy management. Additional complexity stems from the size of the software itself and the need to use different languages, reusable code, and standard components. The PMLP philosophy is to develop a portable, high-performance library using OO design and C++, while supporting API bindings for C and Fortran-77, which are widely used in the scientific parallel environment. The incorporation of a significant front-end software engineering process, together with key design principles (poly-algorithms, data-distribution independence, storage-format independence, mathematical-entity-type independence, and persistent operation techniques), provides this library with the ability to exploit high performance while providing portable interfaces. The opportunity to utilize underlying optimized kernels is enhanced through the support of persistent operations, which support poly-algorithmic selection, eliminate repetitive error checking, and reduce dynamic memory management. These techniques couple effectively with storage-format independence to provide greater control over resources and operational policies, while providing seamless results to the user.
7 Acknowledgments
We gratefully acknowledge the implementation efforts and detailed design feedback from the rest of the VNIIEF team headed by Yuri Bartenev and Anatoly Vargin, as well as feedback from Bruce Greer, Ken Pocek (Intel), and Dale Nielsen (LLNL).
References

[1] S. Balay, W. D. Gropp, L. C. McInnes, and B. F. Smith, Efficient management of parallelism in object-oriented numerical software libraries, in Modern Software Tools in Scientific Computing, E. Arge, A. M. Bruaset, and H. P. Langtangen, eds., Birkhauser Press, 1997, pp. 163-202.
[2] S. Balay, W. D. Gropp, L. C. McInnes, and B. F. Smith, PETSc home page. http://www.mcs.anl.gov/petsc, 1998.
[3] R. Barrett et al., Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, SIAM Press, Philadelphia, 1994.
[4] L. Birov, A. Purkayastha, A. Skjellum, Y. Dandass, and P. V. Bangalore, PMLP home page. http://www.erc.msstate.edu/labs/hpcl/pmlp, 1998.
[5] G. Booch, Object-Oriented Analysis and Design with Applications, The Benjamin/Cummings Publishing Company, Inc., 1993.
[6] J. Choi, J. J. Dongarra, R. Pozo, and D. W. Walker, ScaLAPACK: A scalable linear algebra library for distributed memory concurrent computers, in Proceedings of the Fourth Symposium on the Frontiers of Massively Parallel Computation, IEEE Computer Society Press, 1992, pp. 120-127.
[7] M. Fowler and K. Scott, UML Distilled, Addison-Wesley, 1997.
[8] G. Furnish, Disambiguated glommable expression templates, Computers in Physics, (1997).
[9] J. Li, A. Skjellum, and R. Falgout, A poly-algorithm for parallel dense matrix multiplication on two-dimensional process grid topologies, Concurrency: Practice and Experience, 9 (1997), pp. 345-389.
[10] D. R. Musser and A. Saini, STL Tutorial and Reference Guide: C++ Programming with the Standard Template Library, Addison-Wesley, 1996.
[11] A. Skjellum and C. H. Baldwin, The Multicomputer Toolbox: Scalable Parallel Libraries for Large-Scale Concurrent Applications, Tech. Rep. UCRL-JC-109251, Lawrence Livermore National Laboratory, December 1991.
[12] A. Skjellum, P. Bangalore, A. Choudhury, J. Keasler, J. Li, and L. Sheng, Software Requirements Specification and Design Documents, February 1997. Version 1.0c.
[13] A. Skjellum and P. V. Bangalore, Driving issues in scalable parallel libraries: poly-algorithms, data distribution independence, redistribution, local storage scheme, in Proceedings of the Seventh SIAM Conference on Parallel Processing for Scientific Computing, February 1995, pp. 734-737.
[14] A. Skjellum, A. P. Leung, S. G. Smith, R. D. Falgout, C. H. Still, and C. H. Baldwin, The Multicomputer Toolbox: first-generation scalable libraries, in Proceedings of HICSS-27, IEEE Computer Society Press, 1994, pp. 644-654. HICSS-27 Minitrack on Tools and Languages for Transportable Parallel Applications.
[15] M. Snir, S. Otto, et al., MPI: The Complete Reference, MIT Press, 1996.
[16] B. Stroustrup, The C++ Programming Language, Addison-Wesley, third edition, 1997.
[17] E. F. van de Velde, Data redistribution and concurrency, Parallel Computing, 16 (1990), pp. 125-138. Also: Caltech Applied Mathematics Report, May 1988.
[18] G. B. Wise, Getting the handle of handles, Crossroads, (1995).