TECHNIQUES FOR HIGH-PERFORMANCE DISTRIBUTED COMPUTING IN COMPUTATIONAL FLUID MECHANICS

by Lisandro Daniel Dalcín

Dissertation submitted to the Postgraduate Department of the FACULTAD DE INGENIERÍA Y CIENCIAS HÍDRICAS of the UNIVERSIDAD NACIONAL DEL LITORAL in partial fulfillment of the requirements for the degree of Doctor en Ingeniería - Mención Mecánica Computacional
2008
To my parents Elda and Daniel, to my sister Marianela, to my girlfriend Julieta, and to the memory of my aunt Araceli and my grandmother Lucrecia.
Author Legal Declaration

This dissertation has been submitted to the Postgraduate Department of the Facultad de Ingeniería y Ciencias Hídricas in partial fulfillment of the requirements for the degree of Doctor in Engineering - Field of Computational Mechanics of the Universidad Nacional del Litoral. A copy of this document will be available at the University Library and it will be subject to the Library's legal regulations. Some parts of the work presented in this thesis have been (or are going to be) published in the following journals: Computer Methods in Applied Mechanics and Engineering, Journal of Parallel and Distributed Computing, and Advances in Engineering Software.
Lisandro Daniel Dalcín
© Copyright by Lisandro Daniel Dalcín, 2008
Contents

Preface

1 Scientific Computing with Python
  1.1 The Python Programming Language
  1.2 Tools for Scientific Computing
    1.2.1 Numerical Python
    1.2.2 Scientific Tools for Python
    1.2.3 Fortran to Python Interface Generator
    1.2.4 Simplified Wrapper and Interface Generator

2 MPI for Python
  2.1 An Overview of MPI
    2.1.1 History
    2.1.2 Main Features of MPI
  2.2 Related work on MPI and Python
  2.3 Design and Implementation
    2.3.1 Accessing MPI Functionalities
    2.3.2 Communicating Python Objects
  2.4 Using MPI for Python
    2.4.1 Classical Message-Passing Communication
    2.4.2 Dynamic Process Management
    2.4.3 One-sided Operations
    2.4.4 Parallel Input/Output Operations
  2.5 Efficiency Tests
    2.5.1 Measuring Overhead in Message Passing Operations
    2.5.2 Comparing Wall-Clock Timings for Collective Communication Operations

3 PETSc for Python
  3.1 An Overview of PETSc
    3.1.1 Main Features of PETSc
  3.2 Design and Implementation
  3.3 Using PETSc for Python
    3.3.1 Working with Vectors
    3.3.2 Working with Matrices
    3.3.3 Using Linear Solvers
    3.3.4 Using Nonlinear Solvers
  3.4 Efficiency Tests
    3.4.1 The Poisson Problem
    3.4.2 A Matrix-Free Approach for the Linear Problem
    3.4.3 Some Selected Krylov-Based Iterative Methods
    3.4.4 Measuring Overhead

4 Electrokinetic Flow in Microfluidic Chips
  4.1 Background
  4.2 Theoretical Modeling
    4.2.1 Governing Equations
    4.2.2 Electrokinetic Phenomena
  4.3 Numerical Simulations
  4.4 Classical Domain Decomposition Methods
    4.4.1 A Model Problem
    4.4.2 Additive Schwarz Preconditioning

5 Final Remarks
  5.1 Impact of this work
  5.2 Publications
List of Figures

2.1 Access to MPI_COMM_RANK from Python.
2.2 Sending and Receiving general Python objects.
2.3 Nonblocking Communication of Array Data.
2.4 Broadcasting general Python objects.
2.5 Distributed Dense Matrix-Vector Product.
2.6 Computing π with a Master/Worker Model in Python.
2.7 Computing π with a Master/Worker Model in C++.
2.8 Permutation of Block-Distributed 1D Arrays (slow version).
2.9 Permutation of Block-Distributed 1D Arrays (fast version).
2.10 Input/Output of Block-Distributed 2D Arrays.
2.11 Python code for timing a blocking Send and Receive.
2.12 Python code for timing a bidirectional Send/Receive.
2.13 Python code for timing All-To-All.
2.14 Throughput and overhead in blocking Send and Receive.
2.15 Throughput and overhead in bidirectional Send/Receive.
2.16 Throughput and overhead in All-To-All.
2.17 Timing in Broadcast.
2.18 Timing in Scatter.
2.19 Timing in Gather.
2.20 Timing in Gather to All.
2.21 Timing in All to All Scatter/Gather.
3.1 Basic Implementation of Conjugate Gradient Method.
3.2 Assembling a Sparse Matrix in Parallel.
3.3 Solving a Linear Problem in Parallel.
3.4 Nonlinear Residual Function for the Bratu Problem.
3.5 Solving a Nonlinear Problem with Matrix-Free Jacobians.
3.6 Defining a Matrix-Free Operator for the Poisson Problem.
3.7 Solving a Matrix-Free Linear Problem with PETSc for Python.
3.8 Defining a Matrix-Free Operator, C implementation.
3.9 Solving a Matrix-Free Linear Problem, C implementation.
3.10 Comparing Overhead Results for CG and GMRES(30).
3.11 PETSc for Python Overhead using CG.
3.12 Residual History using CG.
3.13 PETSc for Python Overhead using MINRES.
3.14 Residual History using MINRES.
3.15 PETSc for Python Overhead using BiCGStab.
3.16 Residual History using BiCGStab.
3.17 PETSc for Python Overhead using GMRES(30).
3.18 Residual History using GMRES(30).
4.1 Microfluidic Chips.
4.2 The Diffuse Double Layer and the Debye Length.
4.3 Electroosmotic Flow.
4.4 Geometry of the Microchannel Network.
4.5 Initial Na+ and K+ Ion Concentrations (mol/m³).
4.6 Injection Stage.
4.7 Separation Stage.
4.8 Model Problem.
4.9 Additive Schwarz Preconditioning (Mesh #1).
4.10 Additive Schwarz Preconditioning (Mesh #2).
4.11 Additive Schwarz Preconditioning (Mesh #3).
4.12 Additive Schwarz Preconditioning (Mesh #3).
4.13 Additive Schwarz Preconditioning (32 processors).
Preface

Parallel Computing and Message Passing

Among many parallel computational models, message-passing has proven to be effective. This paradigm is especially suited for (but not limited to) distributed memory architectures. Although there are many variations, the basic concept of processes communicating through messages has been well understood for a long time.

Portable message-passing parallel programming used to be a nightmare in the past. Developers of parallel applications were faced with many proprietary, incompatible, and architecture-dependent message-passing libraries. Code portability was hampered by the differences between them. Fortunately, this situation definitely changed after the Message Passing Interface (MPI) standard specification appeared and rapidly gained acceptance. Since its release, the MPI specification has become the leading standard for message-passing libraries in the world of parallel computers. Nowadays, MPI is widely used in the most demanding scientific and engineering applications related to modeling, simulation, design, and signal processing.

Over the last years, high performance computing has finally become an affordable resource to everyone in need of increased computing power. The conjunction of commodity hardware and high quality open source operating systems and software packages strongly influenced the now widespread popularity of Beowulf [1] class clusters and clusters of workstations.

An important subset of scientific and engineering applications deals with problems modeled by partial differential equations on two-dimensional and three-dimensional domains. In those kinds of applications, numerical methods are the only practical way to attack complex problems. Those methods necessarily involve a discretization of the governing equations at the continuum level. From this discretization process, systems of linear and nonlinear equations arise. When those systems of equations are very large, parallel processing is mandatory in order to solve them in reasonable time frames.

The popularity and availability of parallel computing resources on distributed memory architectures, together with the high degree of portability offered by the MPI specification, strongly motivated the development of general purpose, multi-platform software components tailored to efficiently solve large-scale linear and nonlinear problems. Currently, PETSc [2] and Trilinos [3] are the most complete and advanced general purpose libraries available for supporting large-scale simulations in science and engineering. PETSc [2, 4], the Portable, Extensible Toolkit for Scientific Computation, is a suite of state of the art algorithms and data structures for the solution of problems arising in scientific and engineering applications. It is being developed at Argonne National Laboratory, USA. PETSc is especially suited for problems modeled by partial differential equations, of large-scale nature, and targeted for parallel, distributed-memory computing environments [5].
High-Level Languages for Scientific Computing

In parallel to the aforementioned trends, the popularity of some high-level, general purpose scientific computing environments, such as MATLAB and IDL on the commercial side or Octave and Scilab on the open source side, has increased considerably. Users simply feel much more productive in such interactive environments, which provide tight integration of simulation and visualization. They are relieved of the low-level details associated with the compilation and linking steps, memory management, and input/output of the more traditional scientific programming languages like Fortran, C, and even C++.

Recently, the Python programming language [6, 7] has attracted the attention of many end-users and developers in the scientific community. Python offers a clean and simple syntax, is a very powerful language, and allows skilled users to build their own computing environment, tailored to their specific needs and based on their favorite high-performance Fortran, C, or C++ codes. Sophisticated but easy to use and well integrated packages are available for interactive command-line work, efficient multi-dimensional array processing, 2D and 3D visualization, and other scientific computing tasks.
About This Thesis

Although a lot of progress has been made in theory as well as practice, the true costs of accessing parallel environments are still largely dominated by software. The number of end-user parallelized applications is still very small, as is the number of people devoted to their development. Engineers and scientists not specialized in programming or numerical computing, and even small and medium size software companies, hardly ever consider developing their own parallelized codes.

High performance computing is traditionally associated with software development using compiled languages. However, in typical application programs, only a small part of the code is time-critical enough to require the efficiency of compiled languages. The rest of the code is generally related to memory management, error handling, input/output, and user interaction, and those are usually the most error-prone and time-consuming lines of code to write and debug in the whole development process. Interpreted high-level languages can be really advantageous for these kinds of tasks.

This thesis reports the attempts to facilitate the access to high-performance parallel computing resources within a Python programming environment. The target audience is the scientific and engineering community using Python on a regular basis as the supporting environment for developing applications and performing numerical simulations. The target computing platforms range from multiple-processor and/or multiple-core desktop computers, clusters of workstations or dedicated computing nodes with either standard or special network interconnects, to high-performance shared memory machines. The net result of this effort is two open source and public domain packages, MPI for Python (known in short as mpi4py) and PETSc for Python (known in short as petsc4py).

MPI for Python [8, 9, 10] is an open-source, public-domain software project that provides bindings of the Message Passing Interface (MPI) standard for the Python programming language. MPI for Python is a general-purpose and full-featured package targeting the development of parallel application codes in Python. Its facilities allow parallel Python programs to easily exploit multiple processors. MPI for Python employs a back-end MPI implementation, thus being immediately available on any parallel environment providing access to any MPI library.

PETSc for Python [11] is an open-source, public-domain software project that provides access to the Portable, Extensible Toolkit for Scientific Computation (PETSc) libraries within the Python programming language. PETSc for Python is a general-purpose and full-featured package. Its facilities allow sequential and parallel Python applications to exploit state of the art algorithms and data structures readily available in PETSc.

The MPI for Python and PETSc for Python packages are fully integrated with PETSc-FEM [12], an MPI- and PETSc-based parallel, multiphysics, finite element code. Within a parallel Python programming environment, this software infrastructure supported research activities related to the simulation of electrophoretic processes in microfluidic chips. This work is part of a multidisciplinary effort aimed at designing and developing these devices in order to improve current techniques in clinical analysis and early diagnosis of cancer.
Chapter 1

Scientific Computing with Python

This chapter is introductory. Section 1.1 provides a general overview of the Python programming language. Section 1.2 comments on some fundamental packages and development tools commonly used in the scientific community, which take advantage of both the high-level features of Python and the execution performance of traditional compiled languages like C, C++ and Fortran.
1.1 The Python Programming Language

Python [6] is a modern, easy to learn, powerful programming language. It has efficient high-level data structures and a simple but effective approach to object-oriented programming. Python's elegant syntax, together with its interpreted nature, make it an ideal language for scripting and rapid application development. It supports modules and packages, which encourages program modularity and code reuse. Additionally, it is easily extended with new functions and data types implemented in C, C++, and Fortran. The Python interpreter and the extensive standard library are freely available in source or binary form for all major platforms, and can be freely distributed.
1.2 Tools for Scientific Computing

1.2.1 Numerical Python
NumPy [13] is an open source project providing the fundamental library needed for serious scientific computing with Python. NumPy provides a powerful multi-dimensional array object with advanced and efficient array slicing operations to select array elements and convenient array reshaping methods. Additionally, NumPy contains three sub-libraries with numerical routines providing basic linear algebra operations, basic Fourier transforms and sophisticated capabilities for random number generation.
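The following minimal sketch (an illustration, not taken from the thesis) hints at these capabilities; the array shapes and values are arbitrary.

    import numpy as np

    # create a 4x3 array of double precision numbers
    a = np.arange(12, dtype='d').reshape(4, 3)

    # slicing selects subarrays without copying data
    first_column = a[:, 0]
    interior = a[1:-1, 1:-1]

    # the numerical sub-libraries: linear algebra,
    # Fourier transforms, and random number generation
    b = np.dot(a.T, a)              # matrix-matrix product
    w = np.linalg.eigvalsh(b)       # eigenvalues of a symmetric matrix
    spectrum = np.fft.fft(a[:, 0])  # discrete Fourier transform
    samples = np.random.rand(5)     # uniform random samples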
1.2.2 Scientific Tools for Python
SciPy [14] is an open source library of scientific tools for Python. It depends on the NumPy library, and it gathers a variety of high level science and engineering modules together as a single package. SciPy provides modules for statistics, optimization, numerical integration, linear algebra, Fourier transforms, signal and image processing, genetic algorithms, special functions, and many more.
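A small sketch (an illustration, not from the thesis) of a few of these modules:

    import numpy as np
    from scipy import optimize, integrate, linalg

    # find a root of a scalar nonlinear equation
    root = optimize.brentq(lambda x: x**3 - 2.0, 0.0, 2.0)

    # numerically integrate a function over [0, 1]
    value, error = integrate.quad(lambda x: np.exp(-x**2), 0.0, 1.0)

    # solve a small dense linear system
    A = np.array([[3.0, 1.0], [1.0, 2.0]])
    b = np.array([9.0, 8.0])
    x = linalg.solve(A, b)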
1.2.3 Fortran to Python Interface Generator

F2PY [15], the Fortran to Python Interface Generator, provides a connection between the Python and Fortran programming languages. F2PY is a development tool for creating Python extension modules from special signature files or directly from annotated Fortran source files. The signature files, or the Fortran source files with additional annotations included as comments, contain all the information (function names, arguments and their types, etc.) that is needed to construct convenient Python bindings to Fortran functions. The F2PY-generated Python extension modules enable Python codes to call those Fortran 77/90/95 routines. In addition, F2PY provides the required support for transparently accessing Fortran 77 common blocks or Fortran 90/95 module data.
Fortran (and especially Fortran 90 and above) is a convenient compiled language for efficiently implementing lengthy computations involving multidimensional arrays. Although NumPy provides similar and higher-level capabilities, there are situations where selected, numerically intensive parts of Python applications still require the efficiency of a compiled language for processing huge amounts of data in deeply-nested loops. Additionally, state of the art implementations of many commonly used algorithms are readily available and implemented in Fortran. In a Python programming environment, F2PY is then the tool of choice for taking advantage of the speed-up of compiled Fortran code and integrating existing Fortran libraries.
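As a hedged sketch of this workflow (the routine, the module name flib, and the build step below are illustrative assumptions, not part of the thesis), a small annotated Fortran 77 routine can be wrapped and called from Python as follows; a Fortran compiler must be available, and the f2py command-line tool can be used instead of numpy.f2py.compile().

    import numpy as np
    import numpy.f2py

    # annotated Fortran 77 source: the 'Cf2py' comment lines are
    # F2PY directives declaring the intent of each argument
    source = """
          subroutine daxpy(n, a, x, y)
          integer n
          double precision a, x(n), y(n)
    Cf2py intent(in) :: a, x
    Cf2py intent(inout) :: y
          integer i
          do 10 i = 1, n
             y(i) = y(i) + a*x(i)
     10   continue
          end
    """

    # build the extension module 'flib' (a Fortran compiler is required)
    numpy.f2py.compile(source, modulename='flib', verbose=False)

    import flib
    x = np.ones(5)
    y = np.zeros(5)
    # the dimension 'n' is inferred by F2PY from the size of 'x'
    flib.daxpy(2.0, x, y)   # now y == [2., 2., 2., 2., 2.]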
1.2.4 Simplified Wrapper and Interface Generator

SWIG [16], the Simplified Wrapper and Interface Generator, is an interface compiler that connects programs written in C and C++ with a variety of scripting languages. SWIG works by taking the declarations found in C/C++ header files and using them to generate the wrapper code that scripting languages need to access the underlying C/C++ code. In addition, SWIG provides a variety of customization features that let developers tailor the wrapping process to suit specific application needs.

Originally developed in 1995, SWIG was first used by scientists (in the Theoretical Physics Division at Los Alamos National Laboratory, USA) for building user interfaces to molecular dynamics simulation codes running on the Connection Machine 5 supercomputer. In this environment, scientists needed to work with huge amounts of simulation data, complex hardware, and a constantly changing code base. The use of a Python scripting language interface provided a simple yet highly flexible foundation for solving these types of problems [17]. This software infrastructure nowadays supports the largest-scale molecular dynamics simulations in the world [18]. Although SWIG was originally developed for scientific applications, it has since evolved into a general purpose tool that is used in a wide variety of applications; in fact, almost anything where C/C++ programming is involved.
Chapter 2

MPI for Python

This chapter is devoted to describing MPI for Python, an open-source, public-domain software project that provides bindings of the Message Passing Interface (MPI) standard for the Python programming language.

MPI for Python is a general-purpose and full-featured package targeting the development of parallel application codes in Python. It provides core facilities that allow parallel Python programs to exploit multiple processors. Sequential Python applications can also take advantage of MPI for Python by communicating through the MPI layer with external, independent parallel modules, possibly written in other languages like C, C++, or Fortran.

MPI for Python employs a back-end MPI implementation, thus being immediately available on any parallel environment providing access to any MPI library. Those environments range from multiple-processor and/or multiple-core desktop computers, clusters of workstations or dedicated computing nodes with standard or special network interconnects, to high-performance shared memory machines.

Section 2.1 presents a general description of MPI and the main concepts contained in the MPI-1 and MPI-2 specifications. Section 2.2 reviews some previous works related to MPI and Python; these works provided invaluable guidance for designing and implementing MPI for Python. Section 2.3 describes the general design and implementation of MPI for Python through a mixed-language, C-Python approach. Additionally, two mechanisms for inter-process data communication at the Python level are discussed. Section 2.4 presents a general overview of the many MPI concepts and functionalities accessible through MPI for Python. Additionally, a series of short, self-contained example codes with their corresponding discussions is provided. These examples show how to use MPI for Python for implementing parallel Python codes with the help of MPI. Finally, section 2.5 presents some efficiency tests and discusses their results. Those tests focus on measuring and comparing wall-clock timings of selected communication operations implemented both in C and Python.
2.1 An Overview of MPI

Among many parallel computational models, message-passing has proven to be effective. This paradigm is especially suited for (but not limited to) distributed memory architectures and is used in today's most demanding scientific and engineering applications related to modeling, simulation, design, and signal processing.

MPI, the Message Passing Interface, is a standardized, portable message-passing system designed to function on a wide variety of parallel computers. The standard defines the syntax and semantics of library routines (MPI is not a programming language extension) and allows users to write portable programs in the main scientific programming languages (Fortran, C, and C++).

MPI defines a high-level abstraction for fast and portable inter-process communication [19, 20]. Applications can run on clusters of (possibly heterogeneous) workstations or dedicated nodes, (symmetric) multiprocessor machines, or even a mixture of both. MPI hides all the low-level details, like networking or shared memory management, simplifying development and maintaining portability, without sacrificing performance.
2.1.1 History

Portable message-passing parallel programming used to be a nightmare in the past because of the many incompatible options developers were faced with. Proprietary message-passing libraries were available on several parallel computer systems, and were used to develop significant parallel applications. However, the code portability of those applications was hampered by the huge differences between these communication libraries. At the same time, several public-domain libraries were available. They had demonstrated that portable message-passing systems could be implemented without sacrificing performance.

In 1992, the Message Passing Interface (MPI) Forum [21] was born, teaming up a group of researchers from academia and industry involving over 80 people from 40 organizations. This group undertook the effort of defining the syntax and semantics of a standard core of library routines that would be useful for a wide range of users and efficiently implementable on a wide range of parallel computing systems and environments.

The first MPI standard specification [22], also known as MPI-1, appeared in 1994 and immediately gained widespread acceptance. After two years, a second version of the standard [23] was released. Although completely backwards compatible, MPI-2 introduced some clarifications of features already available in MPI-1 but also many extensions and new functionalities.

The MPI specification is nowadays the leading standard for message-passing libraries in the world of parallel computers. Implementations are available from vendors of high-performance computers and from well known open source projects like MPICH [24, 25] and Open MPI [26, 27].

The MPI Forum has been dormant for nearly a decade. However, in late 2006 it reactivated for the purpose of clarifying current MPI issues, renewing membership and interest, exploring future opportunities, and possibly defining a new standard level. At the time of this writing, clarifications to MPI-2 are being actively discussed and new working groups are being established for generating a future MPI-3 specification.
2.1.2 Main Features of MPI

Communication Domains and Process Groups

MPI communication operations occur within a specific communication domain through an abstraction called a communicator. Communicators are built from groups of participating processes and provide a communication context for the members of those groups. Process groups enable parallel applications to assign processing resources to sets of cooperating processes in order to perform independent work.

Communicators provide a safe isolation mechanism for implementing independent parallel library routines and mixing them with user code; message-passing operations within different communication domains are guaranteed not to conflict. Processes within a group can communicate with each other (including themselves) through an intracommunicator; they can also communicate with processes within another group through an intercommunicator.

Intracommunicators are intended for communication between processes that are members of the same group. They have one fixed attribute: their process group. Additionally, they can have an optional, predefined attribute: a virtual topology (either Cartesian or a general graph) describing the logical layout of the processes in the group. This extra, optional topology attribute is useful in many ways: it can help the underlying MPI runtime system to map processes onto hardware, and it simplifies the implementation of common algorithmic concepts.

Intercommunicators are intended for performing communication operations between processes that are members of two disjoint groups. They provide a natural way of enabling communication between independent modules in complex, multidisciplinary applications.
Point-to-Point Communication

Point-to-point communication is a fundamental capability of message-passing systems. This mechanism enables the transmittal of data between a pair of processes, one side sending, the other receiving.
MPI provides a set of send and receive functions allowing the communication of typed data with an associated tag. The type information enables the conversion of data representation from one architecture to another in the case of heterogeneous computing environments; additionally, it allows the representation of non-contiguous data layouts and user-defined datatypes, thus avoiding the overhead of (otherwise unavoidable) packing/unpacking operations. The tag information allows selectivity of messages at the receiving end.

MPI provides basic send and receive functions that are blocking. These functions block the caller until the data buffers involved in the communication can be safely reused by the application program. MPI also provides nonblocking send and receive functions. They allow the possible overlap of communication and computation. Nonblocking communication always comes in two parts: posting functions, which begin the requested operation, and test-for-completion functions, which allow discovering whether the requested operation has completed.

Collective Communication

Collective communications allow the transmittal of data between multiple processes of a group simultaneously. The syntax and semantics of collective functions are consistent with point-to-point communication. Collective functions communicate typed data, but messages are not paired with an associated tag; selectivity of messages is implied by the calling order. Additionally, collective functions come in blocking versions only. The most commonly used collective communication operations are the following.

• Barrier synchronization across all group members.
• Global communication functions:
  – Broadcast data from one member to all members of a group.
  – Gather data from all members to one member of a group.
  – Scatter data from one member to all members of a group.
• Global reduction operations such as sum, maximum, minimum, etc.
Dynamic Process Management

In the context of the MPI-1 specification, a parallel application is static; that is, no processes can be added to or deleted from a running application after it has been started. Fortunately, this limitation was addressed in MPI-2. The new specification added a process management model providing a basic interface between an application and external resources and process managers.

This MPI-2 extension can be really useful, especially for sequential applications built on top of parallel modules, or parallel applications with a client/server model. The MPI-2 process model provides a mechanism to create new processes and establish communication between them and the existing MPI application. It also provides mechanisms to establish communication between two existing MPI applications, even when one did not "start" the other.
One-Sided Operations

One-sided communication (also called Remote Memory Access, RMA) supplements the traditional two-sided, send/receive based MPI communication model with a one-sided, put/get based interface. One-sided communication can take advantage of the capabilities of highly specialized network hardware. Additionally, this extension lowers latency and software overhead in applications written using a shared-memory-like paradigm.

The MPI specification revolves around the use of objects called windows; they intuitively specify regions of a process's memory that have been made available for remote read and write operations. The published memory blocks can be accessed through three functions for put (remote write), get (remote read), and accumulate (remote update or reduction) of data items. A much larger number of functions support different synchronization styles; the semantics of these synchronization operations are fairly complex.
Parallel Input/Output

The POSIX [28] standard provides a model of a widely portable file system. However, the optimization needed for parallel input/output cannot be achieved with this generic interface. In order to ensure efficiency and scalability, the underlying parallel input/output system must provide a high-level interface supporting partitioning of file data among processes and a collective interface supporting complete transfers of global data structures between process memories and files. Additionally, further efficiencies can be gained via support for asynchronous input/output, strided accesses to data, and control over physical file layout on storage devices. This scenario motivated the inclusion in the MPI-2 standard of a custom interface supporting more elaborate parallel input/output operations.

The MPI specification for parallel input/output revolves around the use of objects called files. As defined by MPI, files are not just contiguous byte streams. Instead, they are regarded as ordered collections of typed data items. MPI supports sequential or random access to any integral set of these items. Furthermore, files are opened collectively by a group of processes.

The common patterns for accessing a shared file (broadcast, scatter, gather, reduction) are expressed by using user-defined datatypes. Compared to the communication patterns of point-to-point and collective communications, this approach has the advantage of added flexibility and expressiveness. Data access operations (read and write) are defined for different kinds of positioning (using explicit offsets, individual file pointers, and shared file pointers), coordination (non-collective and collective), and synchronism (blocking, nonblocking, and split collective with begin/end phases).
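To make the explicit-offset access mode concrete, the following sketch (an illustration using MPI for Python's file interface, described later in section 2.4.4; the file name is arbitrary) has each process write its own block of a single shared binary file.

    from mpi4py import MPI
    import numpy

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    # each process writes ten integers at its own offset
    # of a single shared file
    data = numpy.empty(10, dtype='i')
    data.fill(rank)

    amode = MPI.MODE_WRONLY | MPI.MODE_CREATE
    fh = MPI.File.Open(comm, 'datafile.bin', amode)
    offset = rank * data.nbytes
    fh.Write_at(offset, [data, MPI.INT])  # explicit-offset write
    fh.Close()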
2.2 Related work on MPI and Python

As MPI for Python started and evolved, many ideas were borrowed from other well known open source projects related to MPI and Python.

OOMPI [29, 30] is an excellent C++ class library specification layered on top of the C bindings, encapsulating MPI into a functional class hierarchy. This library provides a flexible and intuitive interface by adding some abstractions, like Ports and Messages, which enrich and simplify the syntax.

pyMPI [31] rebuilds the Python interpreter and adds a built-in module for message passing. It permits interactive parallel runs, which are useful for learning and debugging, and provides an environment suitable for basic parallel programming. There is limited support for defining new communicators and process topologies; support for intercommunicators is absent. General Python objects can be messaged between processors; there is some support for direct communication of numeric arrays.

Pypar [32] is a rather minimal Python interface to MPI. There is no support for constructing new communicators or defining process topologies. It does not require the Python interpreter to be modified or recompiled. General Python objects of any type can be communicated. There is also good support for communicating numeric arrays, and practically full MPI bandwidth can be achieved.

Scientific Python [33] provides a collection of Python modules that are useful for scientific computing. Among them, there is an interface to MPI. This interface is incomplete and does not resemble the MPI specification. However, there is good support for efficiently communicating numeric arrays.
2.3 Design and Implementation

Python has enough networking capabilities to develop an implementation of MPI in "pure Python", i.e., without using compiled languages or depending on the availability of a third-party MPI library. The main advantage of such an implementation is surely portability (at least as much as Python provides); there is no need to rely on any foreign language or library. However, such an approach would have many severe limitations, to the point of being considered nonsense. Vendor-provided MPI implementations take advantage of special features of target platforms otherwise unavailable. Additionally, there are many useful and high-quality MPI-based parallel libraries; almost all of them are written in compiled languages. The development of an MPI package based on calls to any available MPI implementation noticeably eases the integration of other parallel tools in Python. Finally, Python is really easy to extend and connect with external software components developed in compiled languages; it is expected that "wrapping" any existing MPI library would require far less development effort than reimplementing the full MPI specification from scratch.

In section 2.2 some previous attempts at integrating MPI and Python were mentioned. However, all of them lack completeness and interface conformance with the standard specification. MPI for Python provides an interface designed with a focus on translating MPI syntax and semantics from the standard MPI-2 C++ bindings to Python. As syntax translation from C++ to Python is generally straightforward, any user with some knowledge of those C++ bindings should be able to use this package without the need of learning a new interface specification. Of course, accessing MPI functionalities from Python necessarily requires some adjustments and enhancements in order to follow common language idioms and take better advantage of such a high-level environment.
2.3.1 Accessing MPI Functionalities

MPI for Python provides access to almost all MPI features through a two-layer, mixed-language approach.

In the low-level layer, a set of extension modules written in C provide access to all functions and predefined constants in the MPI specification. Additionally, this C code implements some basic machinery for converting any MPI object between its Python representation (i.e. an instance of a specific Python class) and its C representation (i.e. an opaque MPI handle). All this conversion machinery is carefully designed for interoperability; any MPI object created and managed through MPI for Python can be easily recovered at the C level and then reused for any purpose (e.g. it can be used for calling a routine in any MPI-based library accessible through a C, C++, or Fortran interface).

In the high-level layer, a module written in Python defines all class hierarchies, class methods and functions. This Python code is supported by the low-level C extension modules commented above. The final user interface closely resembles the standard MPI-2 bindings for C++.

The mixed-language approach for implementing the high-level Python interface to MPI is exemplified in figure 2.1. In figure 2.1a, a fragment of C code shows the necessary steps on the C side: parse arguments passed from Python to C, extract the underlying MPI communicator handle from the containing Python object, make the actual call to an MPI function, and finally return back the result as a Python object. In figure 2.1b, a fragment of Python code shows how the previous low-level function written in C is employed to define the method Get_rank() of the Comm class, providing a higher-level Python interface to MPI communicators.

(a) C side:

    #include <Python.h>
    #include <mpi.h>
    /* ... */
    PyObject *comm_rank(PyObject *self, PyObject *args)
    {
      PyObject *pycomm;
      MPI_Comm comm;
      int rank;
      PyArg_ParseTuple(args, "O", &pycomm);
      comm = PyMPIComm_AsComm(pycomm);
      MPI_Comm_rank(comm, &rank);
      return PyInt_FromLong(rank);
    }
    /* ... */

(b) Python side:

    from mpi4py import _mpi
    # ...
    class Comm(_mpi.Comm):
        """Communicator class"""
        # ...
        def Get_rank(self):
            """Rank of calling process"""
            return _mpi.comm_rank(self)
        # ...
    # ...

Figure 2.1: Access to MPI_COMM_RANK from Python.
2.3.2 Communicating Python Objects

Object Serialization

The Python standard library supports different mechanisms for data persistence. Many of them rely on disk storage, but pickling and marshaling can also work with memory buffers.

The pickle (slower, written in pure Python) and cPickle (faster, written in C) modules provide user-extensible facilities to serialize general Python objects using ASCII or binary formats. The marshal module provides facilities to serialize built-in Python objects using a binary format specific to Python, but independent of machine architecture issues.

MPI for Python can communicate any general or built-in Python object taking advantage of the features provided by the cPickle and marshal modules. Their functionalities are wrapped in two classes, Pickle and Marshal, defining dump() and load() methods. These are simple extensions, completely unobtrusive for user-defined classes (they actually use the standard pickle protocol), but carefully optimized for the serialization of Python objects on memory streams.

This approach is also fully extensible; that is, users are allowed to define new, custom serializers implementing the generic dump()/load() interface. Any provided or user-defined serializer can be attached to communicator instances. They will be routinely used to build binary representations of objects to communicate (at sending processes), and to restore them back (at receiving processes).
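The in-memory round trip performed by these serializers can be sketched with the standard modules themselves (an illustration, not taken from the thesis):

    import cPickle as pickle   # the 'pickle' module in Python 3

    obj = {'key1': [7, 2.72, 2+3j], 'key2': ('abc', 'xyz')}

    # serialize the object to an in-memory binary buffer ...
    data = pickle.dumps(obj, 2)   # protocol 2: compact binary format
    # ... and restore an equivalent object from it
    clone = pickle.loads(data)
    assert clone == obj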
Memory Buffers

Although simple and general, the serialization approach (i.e. pickling and unpickling) previously discussed imposes important overheads in memory as well as processor usage, especially in the scenario of objects with large memory footprints being communicated. The reasons for this are simple. Pickling general Python objects, ranging from primitive or container built-in types to user-defined classes, necessarily requires computer resources. Processing is needed for dispatching the appropriate serialization method (which depends on the type of the object) and doing the actual packing. Additional memory is always needed, and if its total amount is not known a priori, many reallocations can occur. Indeed, in the case of large numeric arrays, this is certainly unacceptable and precludes communication of objects occupying half or more of the available memory resources.

MPI for Python supports direct communication of any object exporting the single-segment buffer interface. This interface is a standard Python mechanism provided by some types (e.g. strings and numeric arrays), allowing access on the C side to a contiguous memory buffer (i.e. address and length) containing the relevant data. This feature, in conjunction with the capability of constructing user-defined MPI datatypes describing complicated memory layouts, enables the implementation of many algorithms involving multidimensional numeric arrays (e.g. image processing, fast Fourier transforms, finite difference schemes on structured Cartesian grids) directly in Python, with negligible overhead, and almost as fast as compiled Fortran, C, or C++ codes.
2.4 Using MPI for Python

This section presents a general overview and some examples of the many MPI concepts and functionalities readily available in MPI for Python. Discussed features range from classical MPI-1 message-passing communication operations to more advanced MPI-2 operations like dynamic process management, one-sided communication, and parallel input/output.
2.4.1 Classical Message-Passing Communication

Communicators

In MPI for Python, Comm is the base class of communicators. Communicator size and calling process rank can be respectively obtained with the methods Get_size() and Get_rank(). The Intracomm and Intercomm classes are derived from the Comm class. The Is_inter() method (and Is_intra(), provided for convenience although it is not part of the MPI specification) is defined for communicator objects and can be used to determine the particular communicator class.

Two predefined intracommunicator instances are available: COMM_WORLD and COMM_SELF (or WORLD and SELF, which are just aliases provided for convenience). From them, new communicators can be created as needed. New communicator instances can be obtained with the Clone() method of Comm objects, the Dup() and Split() methods of Intracomm and Intercomm objects, and the methods Create_intercomm() and Merge() of Intracomm and Intercomm objects respectively.
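A minimal sketch of these calls (an illustration, not one of the thesis figures):

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    size = comm.Get_size()
    rank = comm.Get_rank()

    # duplicate the world communicator for private use,
    # e.g. inside a library routine
    private = comm.Clone()

    # split the processes into two halves, each half
    # getting its own intracommunicator
    color = 0 if rank < size // 2 else 1
    half = comm.Split(color, key=rank)

    private.Free()
    half.Free()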
Virtual topologies (Cartcomm and Graphcomm classes, both specializations of the Intracomm class) are fully supported. New instances can be obtained from intracommunicator instances with the factory methods Create_cart() and Create_graph() of the Intracomm class.

The associated process group can be retrieved from a communicator by calling the Get_group() method, which returns an instance of the Group class. Set operations with Group objects like Union(), Intersect() and Difference() are fully supported, as well as the creation of new communicators from these groups.

Blocking Point-to-Point Communications

The Send(), Recv() and Sendrecv() methods of communicator objects provide support for blocking point-to-point communications within Intracomm and Intercomm instances. These methods can communicate either general Python objects or raw memory buffers.

Figure 2.2 shows an example of high-level communication of Python objects. Process zero creates and then sends a Python dictionary to all other processes; the other processes just issue a receive call to get the sent object. MPI for Python automatically serializes (at the sending process) and deserializes (at the receiving processes) Python objects as needed.

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    size = comm.Get_size()
    rank = comm.Get_rank()

    if rank == 0:
        # create a Python 'dict' object
        data = {'key1' : [7, 2.72, 2+3j],
                'key2' : ('abc', 'xyz')}
        # send the object to all other processes
        for i in range(1, size):
            comm.Send(data, dest=i, tag=3)
    else:
        # receive a Python object from process zero
        data = comm.Recv(None, source=0, tag=3)
        # the received object should be a 'dict'
        assert type(data) is dict

Figure 2.2: Sending and Receiving general Python objects.

Additional examples of blocking point-to-point communication operations
can be found in section 2.5. Those examples show how MPI for Python can efficiently communicate NumPy arrays by directly using their exposed memory buffers, thus avoiding the overhead of serialization and deserialization steps.
Nonblocking Point-to-Point Communications

On many systems, performance can be significantly increased by overlapping communication and computation. This is particularly true on systems where communication can be executed autonomously by an intelligent, dedicated communication controller. Nonblocking communication is a mechanism provided by MPI in order to support such overlap.

The inherently asynchronous nature of nonblocking communications currently imposes some restrictions on what can be communicated through MPI for Python. Communication of memory buffers, as described in section 2.3.2, is fully supported. However, communication of general Python objects using serialization, as described in section 2.3.2, is possible but not transparent, since objects must be explicitly serialized at sending processes, while receiving processes must first provide a memory buffer large enough to hold the incoming message and next recover the original object.

The Isend() and Irecv() methods of the Comm class initiate a send and a receive operation, respectively. These methods return a Request instance, uniquely identifying the started operation. Its completion can be managed using the Test(), Wait(), and Cancel() methods of the Request class. The management of Request objects and associated memory buffers involved in communication requires a careful, rather low-level coordination. Users must ensure that objects exposing their memory buffers are not accessed at the Python level while they are involved in nonblocking message-passing operations.

Often a communication with the same argument list is repeatedly executed within an inner loop. In such cases, communication can be further optimized by using persistent communication, a particular case of nonblocking communication allowing the reduction of the overhead between processes and communication controllers. Furthermore, this kind of optimization can also
alleviate the extra call overheads associated with interpreted, dynamic languages like Python. The Send_init() and Recv_init() methods of the Comm class create persistent requests for a send and a receive operation, respectively. These methods return an instance of the Prequest class, a subclass of the Request class. The actual communication can be effectively started using the Start() method, and its completion can be managed as previously described.

Figure 2.3 shows a mixture of blocking and nonblocking point-to-point communication involving three processes. Processes zero and one send data to process two using standard, blocking send calls; the messages have the same length but they are tagged with different values. Process two issues two nonblocking receive calls specifying a wildcard value for the source process, but explicitly selecting messages by their tag values; the data is received in a two-dimensional array with two rows and enough columns to hold each message. The nonblocking receive calls at process two return request objects, which are next waited upon for completion. While messages are in transit (between the post-receive calls and the call waiting for completion), process two can use its computing resources for any other local task, thus effectively overlapping computation with communication. The outcome of this message interchange is the following: process two receives the message sent from process zero in the second row of the local data array, while the message sent from process one is received in the first row of the local data array.
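Returning to persistent communication, a minimal sketch (an illustration assuming at least two processes, not one of the thesis figures) of the Send_init()/Recv_init() interface described above might be:

    from mpi4py import MPI
    import numpy

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    # a fixed communication pattern between process zero and
    # process one, set up once and restarted many times
    buf = numpy.zeros(1000, dtype='d')
    if rank == 0:
        req = comm.Send_init([buf, MPI.DOUBLE], dest=1, tag=7)
    elif rank == 1:
        req = comm.Recv_init([buf, MPI.DOUBLE], source=0, tag=7)

    if rank < 2:
        for step in range(10):
            if rank == 0:
                buf.fill(step)  # refresh the outgoing data in place
            req.Start()         # start the persistent operation
            req.Wait()          # complete it; the request is reusable
        req.Free()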
    from mpi4py import MPI
    import numpy

    comm = MPI.COMM_WORLD
    size = comm.Get_size()
    rank = comm.Get_rank()

    assert size == 3, 'run me in three processes'

    if rank == 0:
        # send a thousand integers to process two
        data = numpy.ones(1000, dtype='i')
        comm.Send([data, MPI.INT], dest=2, tag=35)
    elif rank == 1:
        # send a thousand integers to process two
        data = numpy.arange(1000, dtype='i')
        comm.Send([data, MPI.INT], dest=2, tag=46)
    else:
        # create an empty integer 2d array with two rows and
        # a thousand columns to hold the received data
        data = numpy.empty([2, 1000], dtype='i')
        # post a receive for 1000 integers with message tag 46
        # from any source and store them in the first row
        req1 = comm.Irecv([data[0, :], MPI.INT],
                          source=MPI.ANY_SOURCE, tag=46)
        # post a receive for 1000 integers with message tag 35
        # from any source and store them in the second row
        req2 = comm.Irecv([data[1, :], MPI.INT],
                          source=MPI.ANY_SOURCE, tag=35)
        # >> you could do other useful computations
        # >> here while the messages are in transit !!!
        MPI.Request.Waitall([req1, req2])
        # >> now you can safely use the received data;
        # >> for example, the first five columns of
        # >> the data array can be printed to 'stdout'
        print data[:, 0:5]

Figure 2.3: Nonblocking Communication of Array Data.

Collective Communications

The Bcast(), Scatter(), Gather(), Allgather() and Alltoall() methods of Intracomm instances provide support for collective communications. Those methods can communicate either general Python objects or raw memory buffers. The vector variants (which can communicate different amounts of data at each process) Scatterv(), Gatherv(), Allgatherv() and Alltoallv() are also supported; they can only communicate objects exposing raw memory buffers.

Global reduction operations are accessible through the Reduce(), Allreduce(), Scan() and Exscan() methods. All the predefined (i.e., SUM,
PROD, MAX, etc.) and even user-defined reduction operations can be applied to general Python objects (however, the actual required computations are performed sequentially at some process). Reduction operations on memory buffers are supported, but in this case only the predefined MPI operations can be used.

Figure 2.4 shows an example of high-level communication of Python objects. A Python dictionary created at process zero is collectively broadcast to all other processes within a communicator.

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    # create a Python 'dict' object,
    # but only at process zero
    if rank == 0:
        data = {'key1' : [7, 2.72, 2+3j],
                'key2' : ('abc', 'xyz')}
    else:
        data = None

    # broadcast the Python object created at
    # process zero to all other processes
    data = comm.Bcast(data, root=0)

    # now all processes should have a 'dict'
    assert type(data) is dict

Figure 2.4: Broadcasting general Python objects.

An additional example of collective communication is shown in figure 2.5. In this case, NumPy arrays are communicated by using their exposed memory buffers, thus avoiding the overhead of serialization/deserialization steps. This example implements a parallel dense matrix-vector product y = Ax. For the sake of simplicity, the input global matrix A is assumed to be square and block-distributed by rows within a group of p processes, each process owning m consecutive rows from a total of mp global rows. The input vector x and output vector y also have block-distributed entries in compatibility with the row distribution of matrix A. The final implementation is straightforward. The global concatenation of the input vector x is obtained at all processes through a gather-to-all collective operation, a matrix-vector product with the local portion of A is performed, and the readily distributed output vector y is finally obtained.

    from mpi4py import MPI
    import numpy

    def matvec(comm, A, x):
        "A x -> y"
        m = len(x)
        p = comm.Get_size()
        xg = numpy.zeros(m*p, dtype='d')
        comm.Allgather([x, MPI.DOUBLE],
                       [xg, MPI.DOUBLE])
        y = numpy.dot(A, xg)
        return y

Figure 2.5: Distributed Dense Matrix-Vector Product.
2.4.2 Dynamic Process Management
In MPI for Python, new independent process groups can be created by calling the Spawn() method within an intracommunicator (i.e., an Intracomm instance). This call returns a new intercommunicator (i.e., an Intercomm instance) at the parent process group. The child process group can retrieve the matching intercommunicator by calling the Get_parent() method defined in the Comm class. At each side, the new intercommunicator can be used to perform point-to-point and collective communications between the parent and child groups of processes.

Alternatively, disjoint groups of processes can establish communication using a client/server approach. Any server application must first call the Open_port() function to open a "port" and the Publish_name() function to publish a provided "service", and next call the Accept() method within an Intracomm instance. Any client application can first find a published "service" by calling the Lookup_name() function, which returns the "port" where a server can be contacted, and next call the Connect() method within an Intracomm instance. Both the Accept() and Connect() methods return an Intercomm instance. When the connection between client/server processes is no longer needed, all of them must cooperatively call the Disconnect() method of the Comm class. Additionally, server applications should release resources by calling the Unpublish_name() and Close_port() functions.

As an example, figures 2.6 and 2.7 show a Python and a C++ implementation of a master/worker approach for approximately computing the number π in parallel through a simple numerical quadrature applied to the definite integral $\int_0^1 4(1+x^2)^{-1}\,dx$.

The master codes (figures 2.6a and 2.7a) implement sequential applications. These master applications create a new group of independent processes and communicate with them by sending (through a broadcast operation) and receiving (through a reduce operation) data. The worker codes (figures 2.6b and 2.7b) implement parallel applications. These worker applications are in charge of receiving input data from the master (through a matching broadcast operation), making the actual computations,
2.4. USING MPI FOR PYTHON
23
and sending back the results (through a matching reduce operation). A careful look at figures 2.6a and 2.7a reveals that, for each implementation language, the sequential master application spawns the worker application implemented in the matching language. However, this setup can be easily changed: the master application written in Python can stead spawn the worker application written in C++; the master application written in C++ can instead spawn the worker application written in Python. Thus MPI for Python and its support for dynamic process management automatically provides full interoperability with other codes using a master/worker (or client/server) model, regardless of their specific implementation languages being C, C++, or Fortran. #! /usr/local/bin/python # file: master.py from mpi4py import MPI from numpy import array N = array(100, ’i’) PI = array(0.0, ’d’) cmd = ’worker.py’ args = [] master = MPI.COMM_SELF worker = master.Spawn(cmd, args, 5) worker.Bcast([N,MPI.INT], root=MPI.ROOT) sbuf = None rbuf = [PI, MPI.DOUBLE] worker.Reduce(sbuf, rbuf, op=MPI.SUM, root=MPI.ROOT) worker.Disconnect() print PI
(a) Master Python code
#! /usr/local/bin/python
# file: worker.py
from mpi4py import MPI
from numpy import array

N  = array(0, 'i')
PI = array(0, 'd')

master = MPI.Comm.Get_parent()
np = master.Get_size()
ip = master.Get_rank()

master.Bcast([N, MPI.INT], root=0)

h = 1.0 / N
s = 0.0
for i in xrange(ip, N, np):
    x = h * (i + 0.5)
    s += 4.0 / (1.0 + x**2)
PI[...] = s * h

sbuf = [PI, MPI.DOUBLE]
rbuf = None
master.Reduce(sbuf, rbuf,
              op=MPI.SUM, root=0)
master.Disconnect()
(b) Worker Python code
Figure 2.6: Computing π with a Master/Worker Model in Python.
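As a complement to the spawn-based example of figure 2.6, the following minimal sketch illustrates the client/server connection sequence described earlier in this section. It is an illustration only, not part of the thesis examples: the file server_port.txt used to exchange the port name is hypothetical and assumes both applications share a working directory (the Publish_name()/Lookup_name() pair could be used instead, as explained above), and error handling is omitted.

#! /usr/local/bin/python
# file: server.py (illustrative sketch)
from mpi4py import MPI

port = MPI.Open_port()                    # open a port for incoming connections
f = open('server_port.txt', 'w')          # hand the port name to the client
f.write(port); f.close()                  # out of band (hypothetical file)
client = MPI.COMM_SELF.Accept(port)       # wait for a client; returns an Intercomm
msg = client.recv(source=0, tag=0)        # serve a single pickled request
client.send(msg.upper(), dest=0, tag=0)   # send back a reply
client.Disconnect()
MPI.Close_port(port)

#! /usr/local/bin/python
# file: client.py (illustrative sketch)
from mpi4py import MPI

port = open('server_port.txt').read()     # obtain the port name written by the server
server = MPI.COMM_SELF.Connect(port)      # connect to the server; returns an Intercomm
server.send('hello', dest=0, tag=0)       # send a pickled request
print server.recv(source=0, tag=0)        # receive and print the reply
server.Disconnect()

Both sides obtain an intercommunicator, so point-to-point and collective calls can be used across the connection in exactly the same way as with spawned processes.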
2.4.3 One-sided Operations
In MPI for Python, one-sided operations are available by using instances of the Win class. New window objects are created by calling the Create() method at all processes within a communicator and specifying a memory buffer (i.e.,
// file: master.cxx
// make: mpicxx master.cxx -o master
#include <mpi.h>
#include <iostream>

int main()
{
  MPI::Init();

  int    N  = 100;
  double PI = 0.0;

  const char  cmd[]  = "worker";
  const char* args[] = { 0 };
  MPI::Intracomm master = MPI::COMM_SELF;
  MPI::Intercomm worker =
    master.Spawn(cmd, args, 5,
                 MPI_INFO_NULL, 0,
                 MPI_ERRCODES_IGNORE);

  worker.Bcast(&N, 1, MPI_INT, MPI_ROOT);
  worker.Reduce(MPI_BOTTOM, &PI, 1, MPI_DOUBLE,
                MPI_SUM, MPI_ROOT);
  worker.Disconnect();

  std::cout << PI << std::endl;

  MPI::Finalize();
  return 0;
}

  u  => x(2:m-1, 2:n-1) ! center
  uN => x(2:m-1, 1:n-2) ! north
  uS => x(2:m-1, 3:n  ) ! south
  uW => x(1:m-2, 2:n-1) ! west
  uE => x(3:m,   2:n-1) ! east
  ! compute nonlinear function
  hx = 1.0/(m-1) ! x grid spacing
  hy = 1.0/(n-1) ! y grid spacing
  f(:,:) = x
  f(2:m-1, 2:n-1) = &
       (2*u - uE - uW) * (hy/hx) &
     + (2*u - uN - uS) * (hx/hy) &
     - alpha * exp(u) * (hx*hy)
end subroutine bratu2d
(b) Fortran 90 version
Figure 3.4: Nonlinear Residual Function for the Bratu Problem.

from petsc4py import PETSc
from bratu2dnpy import bratu2d

# this user class is an application
# context for the nonlinear problem
# at hand; it contains some parameters
# and knows how to compute residuals
class Bratu2D:

    def __init__(self, nx, ny, alpha):
        self.nx = nx # x grid size
        self.ny = ny # y grid size
        self.alpha = alpha
        self.compute = bratu2d

    def evalFunction(self, snes, X, F):
        nx, ny = self.nx, self.ny
        alpha = self.alpha
        x = X[...].reshape(nx, ny)
        f = F[...].reshape(nx, ny)
        self.compute(alpha, x, f)

# create application context
# and nonlinear solver
nx, ny = 32, 32 # grid sizes
alpha = 6.8
appd = Bratu2D(nx, ny, alpha)
snes = PETSc.SNES().create()

# register the function in charge of
# computing the nonlinear residual
f = PETSc.Vec().createSeq(nx*ny)
snes.setFunction(appd.evalFunction, f)

# configure the nonlinear solver
# to use a matrix-free Jacobian
snes.setUseMF(True)
snes.getKSP().setType('cg')
snes.setFromOptions()

# solve the nonlinear problem
b, x = None, f.duplicate()
x.set(0) # zero initial guess
snes.solve(b, x)
Figure 3.5: Solving a Nonlinear Problem with Matrix-Free Jacobians.
3.4 Efficiency Tests
In the context of scientific computing, Python is commonly used as a glue language for interconnecting different pieces of code written in compiled languages like C, C++, and Fortran. With this approach, complex scientific applications can take advantage of the best of both worlds: the convenient, high-level programming environment of Python and the efficiency of compiled languages for numerically intensive computations.

This section presents some efficiency tests aimed at measuring the overhead of accessing PETSc functionalities through PETSc for Python. The overhead is determined by comparing the wall-clock running time of equivalent driving codes implemented in Python and C. Both codes employ PETSc iterative linear solvers and an auxiliary Fortran routine in charge of performing application-specific computations. The actual application deals with the numerical solution of a model linear partial differential equation using Krylov-based iterative linear solvers in combination with matrix-free techniques.
3.4.1 The Poisson Problem
Consider the following Poisson problem in three dimensions, equipped with homogeneous boundary conditions:

$$-\Delta\phi = 1 \quad \text{on } \Omega, \qquad \phi = 0 \quad \text{at } \Gamma;$$

where $\Omega$ is the unit box $(0,1)^3$, $\Gamma$ is the entire box boundary, $\Delta$ is the three-dimensional Laplace operator, and $\phi$ is a scalar field defined on $\Omega$.

From the many discretization methods suitable for the above problem, finite differences is the chosen one; this method can be easily and efficiently implemented in Fortran 90 with a few lines of code. Thus, the spatial discretization is performed with finite differences using the standard 7-point stencil on a structured, regularly spaced grid. For the sake of simplicity, the discrete grid is assumed to have the same number of nodes in each of its three directions. For a given discrete grid having $n + 2$ points in each direction, a system of linear equations with $n^3$ equations and $n^3$ unknowns is obtained. The associated discrete linear operator is symmetric and positive definite.
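To make the discrete operator explicit, note that with the uniform grid spacing $h = 1/(n+1)$ implied above, the standard 7-point approximation of $-\Delta\phi = 1$ at an interior node $(i,j,k)$ reads (a textbook formula, stated here only for reference):

$$\frac{6\,\phi_{i,j,k} - \phi_{i-1,j,k} - \phi_{i+1,j,k} - \phi_{i,j-1,k} - \phi_{i,j+1,k} - \phi_{i,j,k-1} - \phi_{i,j,k+1}}{h^2} = 1,$$

with $\phi$ set to zero at boundary nodes. Collecting these equations over all $n^3$ interior nodes yields the symmetric, positive definite system referred to above.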
3.4.2 A Matrix-Free Approach for the Linear Problem
The system of linear equations arising from the spatial discretization can be solved approximately by using Krylov-based iterative methods. These methods are well suited to matrix-free representations of linear operators: the entries of the system matrix are never explicitly stored; instead, the linear operator is implicitly defined by its action on a given input vector.

Figure 3.6 shows the complete implementation of a matrix-free linear operator for the problem at hand. In figure 3.6b, the auxiliary Fortran code implements a routine in charge of computing the action of the (negative) discrete Laplacian. This implementation takes advantage of multi-dimensional array processing; a careful look reveals that an auxiliary input array is employed in order to simplify the handling of boundary conditions. This routine is easily made available to Python by using F2Py. In figure 3.6a, a Python class implements some selected methods of the generic interface for user-defined linear operators. The mult() method receives the input and output vectors and calls the previously discussed Fortran routine in order to actually compute the action of the discrete Laplace operator. Two additional methods are also implemented: multTranspose() and getDiagonal().

Figure 3.7 shows how the previous codes are combined in order to actually solve the linear system of equations. A shell PETSc matrix is created with appropriate row and column sizes and associated with an instance of the user-defined matrix class shown in figure 3.6. From the matrix object, appropriately sized vectors for storing the solution and the right-hand side are obtained; all right-hand side entries are set to one. Next, a PETSc linear solver is created and configured to use conjugate gradients with no preconditioner. Finally, the linear system of equations is solved and the solution vector is scaled in order to account for the grid spacing.

For the sake of completeness, figures 3.8 and 3.9 show a complete C im-
# file: del2mat.py
! file: del2lib.f90
from numpy import zeros
from del2lib import del2apply
! to build a Python module, use this:
! $ f2py -m del2lib -c del2lib.f90
class Del2Mat:
subroutine del2apply (n, F, x, y)
    def __init__(self, n):
        self.N = (n, n, n)
        self.F = zeros([n+2]*3, order='f')

    def mult(self, x, y):
        "y