Embedded Python: bridging the gap between domain scientists and HPC
David A. Ham (1,2)
1. Department of Earth Science and Engineering
2. Grantham Institute for Climate Change
Imperial College London
1 Abstract
This paper advocates and demonstrates the advantages for domain scientists of embedding Python in high performance simulation software. Simulation software is a powerful tool but, being implemented in compiled languages on performance grounds, it tends towards inflexibility and difficulty of use. In contrast, interpreted systems such as Python and Matlab are relatively easy for domain scientists to use but have limited applicability in high performance, batch processed contexts. The solution advocated here is to embed the Python language in the simulation software so that the public, configurable interface of the software benefits from the usability of that language while the computational back end remains unchanged. The embedded Python interface in the Fluidity high performance fluids package is presented as an illustration of the successful application of this approach.
2 The simulation configuration problem
Typically, simulation software embodies some mathematical model of a real-world system. That model is then driven with input data and the consequences in the model system are calculated. The development of modelling software, particularly software developed in academia, typically focusses on this calculation phase, which may be exceptionally computationally expensive and is frequently also mathematically complex. The calculation part of the model, which we will call the model core, is typically written in a compiled language such as C, C++ or Fortran and in many cases will be parallelised using MPI and/or OpenMP. The model core is typically written by domain scientists with computational science interests or by computational scientists in collaboration with domain scientists.
However, the users of simulation software are typically domain scientists without particular computational science expertise but with complex requirements of the model core. Every scenario to be simulated will have different input data and different required outputs. Users frequently wish to extend the model by taking computed outputs and feeding them back into the model at runtime. This is a requirement which is hard to meet with a conventional model core.
For example, in a fluid flow simulation, the initial fluid velocity might be known as a mathematical function or might be available as a set of experimental or observational data. However, the model core needs to be provided with the initial velocity vector at each discrete node in the domain. For a complex simulation, this preprocessing may need to be carried out by the user for many fields and for boundary as well as initial conditions. If the desired outputs are not within the scope of the model, the situation is still worse, as the user may need to edit the model source code to ensure the correct data is output. This may require programming skills beyond those available to the user and also carries an increased risk of the user introducing bugs which invalidate the results of the simulation.
The challenge, therefore, is to provide a mechanism for supplying inputs and specifying outputs which matches the information and skills available to the user with the requirements of the model core.
3 Why Python?
Python is an interpreted language with a clean and clear syntax which is increasingly popular among scientists. The SciPy scientific packages (Jones et al., 2001–) provide similar facilities to Matlab and the language has begun to be used in textbooks on scientific computing (Langtangen, 2009, for example). Python also has a very straightforward C interface for embedding and tools to generate glue layers for C/C++ (Beazley, 1996) and Fortran (Peterson, 2009) code. Embedding Python in an application is as simple as linking against the Python library and calling suitable initialisation routines. Python is installed by default on almost all Unix machines and is available for Windows and Mac. Also of critical importance is that Python is open source. This distinguishes it in particular from Matlab, which has complex licensing requirements. HPC applications can therefore embed Python while maintaining portability to sites and supercomputers which may not have licenses for particular proprietary packages.
4 Python in Fluidity
To illustrate the application and benefits of the approach advocated here, we employ the example of Fluidity, a parallel adaptive flow solver developed at Imperial College. Fluidity has been applied to a wide variety of single and multi-fluid problems, from industrial applications such as flow inside nuclear reactors to environmental problems such as ocean flow or street canyon pollution (for example Piggott et al., 2008; Gomes et al., 2008). User interaction with Fluidity is assisted by a validating, automatically generated graphical interface (Ham et al., 2009). This interface now enables users to specify initial and boundary conditions as well as diagnostic quantities and even model extensions as pieces of Python code which are executed by the Python interpreter embedded in Fluidity.
4.1 Initial and boundary conditions
The most straightforward demand which users make is to specify the initial and boundary conditions for all the flow fields (velocity, temperature, salinity and so forth). The user typically either knows these as mathematical expressions of space and time or has data which can be interpolated to yield the same. The embedded Python interface to Fluidity enables the user to provide a Python function of the form val(X,t), which Fluidity evaluates on the fly to provide an initial or boundary condition as appropriate. Figure 1 illustrates a user-specified initial condition for the advected tracer in a dye tracking simulation.
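The pattern described here can be sketched as follows. This is a minimal illustration rather than Fluidity's own code: the blob centre echoes the (-0.5, 0) offset that appears in Figure 1, but the radius of 0.25, the returned values and the helper evaluate_on_nodes (a stand-in for the loop over mesh nodes that the host model would run) are all assumptions introduced for the example.

```python
import numpy as np


def val(X, t):
    """User-supplied initial condition: tracer value at position X and time t."""
    # Distance from an assumed blob centre at (-0.5, 0).
    dx = np.asarray(X, dtype=float) - np.array((-0.5, 0.0))
    r = np.linalg.norm(dx)
    # Assumed radius of 0.25: the tracer is 1 inside the blob and 0 outside.
    return 1.0 if r < 0.25 else 0.0


def evaluate_on_nodes(func, coordinates, t):
    """Hypothetical stand-in for the host model: call func at every mesh node."""
    return np.array([func(X, t) for X in coordinates])


# Four illustrative node positions in a two-dimensional domain.
nodes = np.array([(-0.5, 0.0), (-0.4, 0.1), (0.0, 0.0), (0.5, 0.5)])
tracer = evaluate_on_nodes(val, nodes, t=0.0)
```

The user only writes val; the node loop belongs to the model, so the user never has to produce one value per discrete node by hand.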
4.2 Diagnostic quantities and extending the model
A much more challenging demand which users make is to be able to calculate diagnostic quantities and, in some cases, to feed these back to the model. For instance, ocean biology can be represented as a number of scalar quantities representing, for example, phytoplankton and zooplankton. The Fluidity model core is responsible for the advection and diffusion of these quantities, but the ocean biologist user needs to specify the interactions between these variables as a function of each other and possibly of others such as the light level. These variables (more properly fields, as they vary spatially over the domain) are available internally to the Fluidity core. By passing pointers to the relevant arrays and constructing wrapper classes, it is possible to expose the entire system state in Python for access and even modification by user code. Importantly, this is achieved without making a copy of the state data. Figure 2 illustrates the use of embedded Python to create a simple ocean biology model within Fluidity. The Python classes representing solution fields and the container state reflect the internal data storage of Fluidity but present it in a simple and usable way to the domain scientist user.
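A sketch of how such wrapper classes might look to user code is given below. The class names ScalarField and State, the attributes val, node_count and scalar_fields, the field names and the rate constants are all hypothetical choices made for illustration; they are not claimed to match Fluidity's actual interface or any published biology model. NumPy arrays stand in for the arrays owned by the model core.

```python
import numpy as np


class ScalarField:
    """Hypothetical wrapper exposing one model field as a NumPy array."""

    def __init__(self, name, values):
        self.name = name
        # np.asarray does not copy an existing float64 array, so the
        # wrapper shares storage with the array it was given.
        self.val = np.asarray(values, dtype=float)
        self.node_count = self.val.shape[0]


class State:
    """Hypothetical container mapping field names to wrapped fields."""

    def __init__(self, fields):
        self.scalar_fields = {f.name: f for f in fields}


def biology_step(state, dt, growth=1.0, grazing=0.5, mortality=0.1):
    """Advance a toy phytoplankton-zooplankton interaction by one Euler step.

    The rate constants are illustrative assumptions only.
    """
    P = state.scalar_fields["Phytoplankton"].val
    Z = state.scalar_fields["Zooplankton"].val
    light = state.scalar_fields["Light"].val
    dP = growth * light * P - grazing * P * Z
    dZ = grazing * P * Z - mortality * Z
    # In-place updates write straight into the shared arrays.
    P += dt * dP
    Z += dt * dZ


# Arrays standing in for storage owned by the model core (two nodes).
core_P = np.array([1.0, 2.0])
core_Z = np.array([0.5, 0.5])
core_light = np.array([1.0, 0.0])
state = State([
    ScalarField("Phytoplankton", core_P),
    ScalarField("Zooplankton", core_Z),
    ScalarField("Light", core_light),
])
biology_step(state, dt=0.1)
```

Because the update is performed in place on shared storage, the model core sees the modified fields without any data being copied back.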
4.3 Breaking out to a Python shell
One of the most user-friendly features of Python is that, as an interpreted language, it can be used interactively. This is not a usual feature of compiled languages, and the nearest alternative, running under a debugger, once again requires quite advanced software development skills. However, the interactive interfaces to Python can be invoked from an embedded Python instance. This enables users to insert a simple function call into their embedded Python code and have an interactive Python shell appear at that point in model execution, allowing them to interact with the data at runtime. This is an excellent debugging tool for user input and potentially even allows for computational steering by the user.
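One way to obtain this behaviour with the standard library alone is code.interact, sketched below. The source does not say which mechanism Fluidity uses, so this is an assumption; the names open_shell and diagnostic, and the use of a plain dictionary for the state, are hypothetical.

```python
import code


def open_shell(namespace, readfunc=None):
    """Start an interactive interpreter with the given names in scope.

    readfunc is passed through to code.interact. Leaving it as None reads
    from the terminal; supplying a callable lets the session be scripted.
    """
    banner = "Embedded shell: send EOF to resume the model."
    code.interact(banner=banner, readfunc=readfunc, local=namespace)


def diagnostic(state, debug=False):
    """Hypothetical user routine that optionally pauses for inspection."""
    if debug:
        # Execution stops here until the user leaves the shell; any changes
        # made to state in the shell persist afterwards.
        open_shell({"state": state})
    return sum(state["tracer"])
```

Calling diagnostic(state, debug=True) from user code would pause the run at that point and hand the user a prompt with state in scope; execution resumes once the shell is closed.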
5 Performance considerations
The reason that HPC applications are typically written in compiled languages is performance. It is therefore germane to ask whether runtime evaluation of Python code will have a deleterious effect on the execution time or memory footprint of the model. The simple answer is that, in the Fluidity case at least, this has not been observed. In the case of the memory footprint, this is easy to see, as no copies of significantly sized data are made. Instead, Python directly accesses the computational system state via pointers wrapped in Python objects. For the computational cost, it is significant that typical uses of the Python interface perform node-wise calculations on solution fields. The computationally intensive parts of Fluidity, however, perform element integrals with expensive change of coordinate operations and large sparse matrix solves. For this reason, the typical uses to which Python is put do not form a significant part of the computational cost of the model, so even if they are performed slowly, this does not result in a noticeable performance decrease.
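The zero-copy claim can be illustrated with a small sketch. A ctypes array stands in for memory allocated by the compiled core, and numpy.ctypeslib.as_array wraps that memory as a NumPy array. This shows the general mechanism only; it is an assumption, not a description of how Fluidity constructs its wrappers.

```python
import ctypes

import numpy as np

# A C array of doubles standing in for storage allocated by the compiled core.
n_nodes = 5
core_buffer = (ctypes.c_double * n_nodes)(0.0, 1.0, 2.0, 3.0, 4.0)

# Wrap the existing memory as a NumPy array; no data is copied.
field = np.ctypeslib.as_array(core_buffer)

# A node-wise operation performed in place through the Python view...
field *= 2.0

# ...is visible through the original C array, because both names refer to
# the same allocation.
doubled = list(core_buffer)
```

The memory cost of the Python view is a small, fixed-size object regardless of how many nodes the field has.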
References
Beazley, D., 1996. SWIG: An easy to use tool for integrating scripting languages with C and C++. In: Proceedings of the 4th conference on USENIX Tcl/Tk Workshop, 1996-Volume 4. USENIX Association, p. 15.
Gomes, J., Pain, C., Eaton, M., Goddard, A., Piggott, M., Ziver, A., de Oliveira, C., Yamane, Y., 2008. Investigation of nuclear criticality within a powder using coupled neutronics and thermofluids. Annals of Nuclear Energy 35 (11), 2073–2092. URL http://www.sciencedirect.com/science/article/B6V1R-4T0NGJ2-1/2/7338faf2aff9ea7ea5c8d0bfcde4c64e
Ham, D., Farrell, P., Gorman, G., Maddison, J., Wilson, C., Kramer, S., Shipton, J., Collins, G., Cotter, C., Piggott, M., 2009. Spud 1.0: generalising and automating the user interfaces of scientific computer models. Geoscientific Model Development 2, 33–42.
Jones, E., Oliphant, T., Peterson, P., et al., 2001–. SciPy: Open source scientific tools for Python. URL http://www.scipy.org/
Langtangen, H., 2009. A primer on scientific programming with Python. Springer Verlag.
Peterson, P., 2009. F2PY: a tool for connecting Fortran and Python programs. International Journal of Computational Science and Engineering 4 (4), 296–305.
Piggott, M., Gorman, G., Pain, C., Allison, P., Candy, A., Martin, B., Wells, M., 2008. A new computational framework for multi-scale ocean modelling based on adapting unstructured meshes. International Journal For Numerical Methods In Fluids 56 (8), 1003.
def val(X,t):
    from numpy import array
    from numpy.linalg import norm
    from math import sqrt
    dx = array(X) - array((-0.5,0))
    r = norm(dx)
    if (r