Optimised External Computation for the Euro50 MATLAB based integrated model

Michael Browne(a), Anita Enmark(b), Torben Andersen(b) and Andy Shearer(a)

(a) Dept. of Information Technology, National University of Ireland, Galway, Ireland.
(b) Lund Observatory, Box 43, SE-221 00 Lund, Sweden.

ABSTRACT

In previous work we have countered computational demands faced in integrated modelling by developing and using a parallel toolkit for MATLAB. However, the use of an increasingly realistic model makes the computational requirements of the model much larger, particularly in wavefront sensing, reaching a point where simulations of several real-time seconds were no longer practical, taking up to 3 weeks per second. In response to this problem we have developed optimised C code to which MATLAB offloads computation. This code has numerous advantages over native MATLAB computation. It is portable, scalable using OpenMP directives and can run remotely using Remote Procedure Calls (RPCs). This opened up the possibility of exploiting high-end Itanium and Opteron based shared memory systems, optimised 3rd party libraries and aggressive compiler optimisation. These factors combined with hand-tuning give a performance increase of the order of 100 times. The interface to the rest of the model remains the same, so the overall structure is unchanged. In addition we have developed a similar system based on Message Passing Interface version 2 (MPI-2), which allows us to exploit clusters. Here we present an analysis of the techniques used and the gains obtained, along with a brief discussion of future work.

Keywords: Euro50, Simulation, MATLAB, RPC, OpenMP, Parallel, MPI

1. INTRODUCTION

Integrated modelling is an approach to computational simulation which emphasises the importance of the interactions between different aspects of a complex system. Traditionally these aspects are studied in isolation in dedicated simulation environments. The aim is to bring greater realism to results by providing a platform that supports interaction between sub-simulations in a way that reflects reality. Realising such a platform in software frequently yields two distinct problems: how to orchestrate the high-level issues of having many concurrent and potentially distributed sub-simulations and their interconnections, and how to achieve and sustain high rates of computation within the low-level components of the simulation. It should be noted that in isolation these are generic challenges in many fields of computational endeavour; here we address the latter.

Large parts of the model were initially developed and tested using Simulink, a companion product of MATLAB commonly used in systems modelling and other scientific disciplines. Due to the excessive computation times and memory requirements of end-to-end simulation of the telescope's behaviour, a Linux based cluster with 11 processors each running MATLAB was adopted [1]. There have been a number of attempts to produce toolkits to allow MATLAB to be used in a parallel fashion using a variety of techniques. A number of these toolkits were investigated and ultimately a custom written toolkit was developed for the project. The model was parallelised primarily on a functional basis. Whilst this effort allowed an increased memory footprint, it failed to deliver runtime improvements. One second of simulated time was taking approximately 3 weeks of real-time computation, an impractical situation. To address this we have developed optimised C code to which MATLAB offloads computation. This code has numerous advantages over native MATLAB computation. It is portable, scalable within reason and can run remotely from the main model using RPCs. This opens up the possibility of exploiting high-end Itanium and Opteron based multiprocessor systems, optimised 3rd party libraries and aggressive compiler optimisation.

Send correspondence to M. Browne. E-mail: brownemi - it.nuigalway.ie, Telephone: 00 353 91 493180

Whilst this technique has proved very effective it relies on having access to a large multiprocessor system and so is particularly suited to tightly coupled aspects of the simulation which use large amounts of shared memory. To complement this, a further technique has been developed that allows the MATLAB based model to communicate with persistent distributed MPI based programs which can be deployed on clusters in addition to shared memory systems. Both techniques are transparent to the MATLAB end user once the optimised code has been developed and allow performance of some sections of the model to be greatly enhanced without sacrificing portability. Results are presented which illustrate the performance profiles of both methods.

2. PARALLEL MATLAB TECHNIQUES

2.1. Parallel Toolkits

Cleve Moler of MathWorks has clearly stated [2] that a parallel MATLAB is not viable; however, that was in 1995 and the high performance computing (HPC) environment has changed considerably. The economic advantage and performance of clusters has risen and multi-core processors are becoming commonplace. In light of these factors MathWorks introduced a distributed computing toolbox in late 2005. It is intended only for use on clusters with problems that are loosely coupled. Message passing and task sharing paradigms are employed. It can interoperate with some 3rd party schedulers, has the advantage of being fully integrated with the MATLAB environment and is professionally supported. It is reasonable to suggest that it will now become the most common method for dealing with large-grain parallelism within MATLAB.

Prior to the release of the distributed computing toolbox MATLAB did not include any parallel functionality; however, there are external interfaces to the TCP/IP network stack, C and Fortran, see appendix A. By building upon these interfaces there have been of the order of 30 attempts to produce toolkits to allow MATLAB to be used in a parallel fashion. They have used a variety of techniques and approaches. A catalogue and brief summary of these is maintained by Choy et al. [3] Communication between the nodes is generally handled in either of two ways: via a shared file system, e.g. NFS, or via a TCP/IP connection. Some toolkits require a MATLAB engine, and hence a license, on each node; others do not. The commands called from within MATLAB to distribute the computation vary. Several of the toolkits take a low-level approach, using commands that are syntactically and functionally very similar to those of the popular MPI libraries, see §3.4. Others use a more high-level approach, choosing to use simpler commands that resemble more closely those of MATLAB.

2.2. MatlabWS

Ultimately a custom written toolkit was developed for the project. MatlabWS is a parallel MATLAB toolkit developed by D. Moraru of the Lund Observatory Telescope Group specifically for the Euro50 modelling effort. It is general purpose in design and extends the notion of workspaces within MATLAB to include those of other MATLAB engines, which can be running on separate machines. Remote MATLAB engines can be started and controlled from a master engine. Variables can be readily moved to and from remote engines. The computational resources of the remote engine can thus be used to work on transferred variables and the results can be returned.

The initial version of MatlabWS was conceptually simple with a straightforward syntax. However, it suffered latency problems similar to those of other toolkits as it too used MATLAB's own communication routines. To avoid this problem, additional communication and connection management routines were written in C++ using the Trolltech Qt libraries. They are built as mex files (see appendix A) and called directly by MATLAB. The master MATLAB engine runs as a multithreaded application. The MatlabWS event loop runs as a separate thread within the context of the MATLAB interpreter process. This results in very fast access to the local workspace and interpreter, as no interprocess communication is required. Standard techniques in multithreaded programming such as semaphores and mutexes are used to control access to shared variables.

Using MatlabWS the model was parallelised on a primarily functional basis, see figure 1(b) and [1]. While this effort allowed an increased memory footprint, it failed to deliver material runtime or scalability improvements which were not rapidly overshadowed by increasing sophistication in the model. To overcome this, other approaches were considered which we term hybrid distribution techniques.

[Figure 1(b) diagram: a master node connected to nodes 01-15. Slow system: telescope structure, primary mirror segment control, secondary mirror unit control, segment rigid body motion, segment actuator servos. Fast system: exit pupil, wavefront sensor, reconstructor, deformable mirror.]

Figure 1. (a) Rendering of Euro50 beside Lund's Domkyrka & (b) functionally parallel model configuration.

3. HYBRID DISTRIBUTION

3.1. Introduction

The advent of a 64-bit version of MATLAB for Linux on the AMD Opteron processor, and of commodity Opteron systems capable of supporting 16GB of RAM or more, permitted an alternate approach, as parallelism is no longer needed to spread the memory footprint. If the MATLAB based model could run as one instance on a leading edge Opteron system with sufficient memory to accommodate the entire model, then parts of the model could be written in highly optimised C or Fortran which could be run on the same CPU as the MATLAB engine, on alternate processor(s) in the same system, or via a high performance mechanism on physically separate system(s). In addition, in the realm of C and Fortran, tried and tested techniques for achieving high performance parallelism exist. If these can be implemented in a transparent manner then the core of the model can remain unchanged, with no evident parallelism or deviations from MATLAB's apparent serial execution.

3.2. OpenMP

OpenMP (Open Multi Processing, http://www.openmp.org) specifies a combination of library functions, compiler directives and environment variables designed to ease development of parallel programs on shared memory systems. Programming with C, C++ or Fortran is supported and both coarse-grained and fine-grained parallelism can be achieved. OpenMP is regarded as being much easier to work with than message passing systems such as MPI (§3.4). The primary reason for this is OpenMP's use of compiler directives rather than explicit function calls. Thus one can gradually transform a serial code into a parallel one by adding directives on a block-by-block basis. The serial version of the code remains 'intact' so to speak, as one can simply compile without enabling the use of OpenMP directives, in which case they have no effect. This alone greatly aids debugging. Unless one is intent on executing code on a distributed system, OpenMP should be considered over MPI, as the development and execution overhead involved in message passing can be high and is rarely required on contemporary shared memory systems, which almost always feature globally accessible cache coherent memory, though its access may well be non-uniform. For clarity it should be emphasised that OpenMP does not support distributed execution, though it works well in hybrid scenarios where interprocess communication is handled by some other means. Without such means, scalability is limited to at best the size of affordable multiprocessor systems.

Using OpenMP within MATLAB mex files leads to conflicting compiler and linker requirements and a tool chain which is not supported by the vendor. Doing pre-computation and managing persistent storage are made more difficult as there is no longer a persistent server process with responsibility for this, and so it must be managed at the core of the model. Using RPCs allows us to cleanly separate MATLAB and optimised parallel code, see §3.3. The server-side computation code can be ported to systems that do not support MATLAB.
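As a minimal illustration of the directive-based style described above (a sketch only, not taken from the Euro50 sources; the function name and the element-wise operation are hypothetical), a single directive is enough to share an independent loop among threads, and the same source compiles and runs serially if OpenMP is not enabled:

/* Minimal sketch of block-by-block OpenMP parallelisation (illustrative only).
   Compiled without OpenMP support the pragma is ignored and the loop runs serially. */
#include <math.h>

void scale_and_exp(const double *in, double *out, int n, double gain)
{
    int i;
    /* Each iteration is independent, so the loop can be divided among threads. */
    #pragma omp parallel for
    for (i = 0; i < n; i++)
        out[i] = exp(gain * in[i]);
}

This incremental style is what makes the serial and parallel versions of a kernel coexist in one source file, which in turn eases the debugging noted above.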

3.3. Remote Procedure Call Technique

A large number of mechanisms exist for invoking computation remotely, whether in an alternate process on the same system or on separate systems connected via a network. One of the aims of many such systems is to hide many of the intricacies involved in using a network for transport and in adapting to the computational conditions on the remote system, be it a differing endianness or a different operating system. RPCs [4] build on the omnipresent notion of function invocation as a high-level network application development model. RPCs were an early development in computing that predates the widespread use of networks for reasons other than access to central resources. In 1985 Sun Microsystems released Sun RPC, later renamed Open Network Computing Remote Procedure Call (ONC RPC) software [5]; it is still commonly referred to as Sun RPC and is defined in RFCs 923, 1050, 1014, 1831, 1832 and 1833. Of the numerous RPC systems developed at that time it is the most commonly used today.

[Figure 2(a) diagram: on the client side, the client application makes local function calls into the client stub, and callrpc() sends the request through the RPC libraries; on the server side, the requested procedure is selected via the server stub, run by the server application, and the completed reply is returned through the stubs while the client waits, execution then continuing. (b) diagram: MATLAB, the MEX interface and the RPC client interface share a single process and memory space on the system running MATLAB; across an optional network and system architecture boundary, a large multi-processor system runs the RPC interface, server stub and multi-threaded hand-optimised C code and libraries performing the computationally intensive calculations, with the sequential task remaining on the MATLAB side.]

Figure 2. (a) RPC execution flow & (b) RPC & MATLAB architecture.

Using an ONC RPC (hereafter RPC) is intended to resemble making a function call, in that arguments are passed from the caller, which then waits for results to be returned. Figure 2(a) shows the straightforward pattern of execution in a standard RPC. A client invokes an RPC; in so doing, arguments are passed to a server and the client thread generally blocks. At the server side the requested procedure is executed. When it completes, results are returned to the client, which then continues execution. Clearly the client/server model can be said to apply. It should be noted that prior to the steps shown in figure 2(a) an exchange between the client and the server system's RPC service, the Portmapper, must take place. Figure 2(b) gives a high-level overview of the order in which the components are inter-connected and on which systems they might reside relative to one another. We stress that all of the components may reside on a single system, for instance if one required access to common files or if the MATLAB host system was itself a high performance system. From this figure we can also see that the high performance system need not run any code which the user has not developed, with the exception of the very widely ported RPC libraries, which could be deemed part of the operating system for most UNIX variants. The major benefit of this is that one is no longer dependent upon MATLAB's availability for a given platform.
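The client-side calling pattern can be sketched as follows. This is an illustrative sketch only: the program identifiers WFS_PROG and WFS_VERS, the types wfs_args and wfs_result, the stub wfs_compute_1() and the header wfs_rpc.h stand in for whatever rpcgen would generate from the project's own interface definition, and are not taken from the Euro50 sources.

/* Hedged sketch of an ONC RPC client call as it might appear inside a mex file. */
#include <rpc/rpc.h>
#include "wfs_rpc.h"   /* hypothetical rpcgen-generated header */

int call_wfs(const char *host, wfs_args *args, wfs_result **out)
{
    /* Contact the remote Portmapper and bind to the requested program/version. */
    CLIENT *clnt = clnt_create(host, WFS_PROG, WFS_VERS, "tcp");
    if (clnt == NULL) {
        clnt_pcreateerror(host);   /* could not reach the portmapper or server */
        return -1;
    }
    /* Blocking call: arguments are marshalled, the remote procedure runs,
       and the reply is unmarshalled before this returns. */
    *out = wfs_compute_1(args, clnt);
    clnt_destroy(clnt);
    return (*out == NULL) ? -1 : 0;
}

The blocking call mirrors the execution flow of figure 2(a): from MATLAB's point of view the mex file simply appears to take a long time to return.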

3.4. Message Passing Interface Technique

The growth in the popularity of cluster based HPC systems has led to increased interest in and adoption of programming techniques for distributed memory systems. Message passing is a natural paradigm for distributed systems and so MPI has become commonly used. The MPI 1.0 standard was released in June 1994 by the MPI Forum (http://www.mpi-forum.org). In 1997 two standards were released: 1.2 was mainly a series of clarifications to previous releases; more importantly, 2.0 was a set of extensions to the 1.2 standard bringing new functionality. MPI's design aimed to make use of the best features of previous message passing systems such as Parallel Virtual Machine [6], Chameleon [7] and others. One of the goals of the design was to allow implementers flexibility to more fully exploit current and future architectures. While today MPI's natural home is the cluster, Non-Uniform Memory Architecture (NUMA) systems can also make use of it. They frequently have vendor customised MPI implementations that use shared memory as the communications mechanism. This gives both low latency and high bandwidth and is generally more reliable than a physical network. A decided advantage of code developed with MPI is the fact that it can run on NUMA systems, Symmetric Multi-Processor (SMP) systems and clusters with just a recompilation. For the developer it is also advantageous to be able to run MPI multiprocess code on a single CPU system.

3.4.1. MPI-2 Dynamic process management

Features introduced in MPI-2 include parallel I/O, remote memory operations, dynamic process management, language bindings and thread awareness. In our application the MPI-2 dynamic process management features are particularly important. An MPI-2 process can establish communications with other MPI-2 processes which are already running and may have been started separately. It can also create a new MPI process. All dynamic process management calls are blocking, for simplicity, because of the number of host system issues involved in doing otherwise. They can however be timed out. These features give us capabilities not provided by the RPC technique.
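A minimal sketch of the connection sequence is given below: a persistent server publishes a named port, and a separately started client (for example, the MATLAB-side mex interface) looks the name up and connects. The service name "euro50_wfs" is purely illustrative, error handling is omitted, and MPI_Init is assumed to have been called by both sides.

/* Hedged sketch of MPI-2 dynamic process management (not the Euro50 code itself). */
#include <mpi.h>

/* Server side: run once at start-up, then accept a client connection. */
void server_accept(MPI_Comm *client)
{
    char port[MPI_MAX_PORT_NAME];
    MPI_Open_port(MPI_INFO_NULL, port);                  /* system chooses a port    */
    MPI_Publish_name("euro50_wfs", MPI_INFO_NULL, port); /* register with nameserver */
    MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, client); /* blocks       */
}

/* Client side: look the published service up and connect to it. */
void client_connect(MPI_Comm *server)
{
    char port[MPI_MAX_PORT_NAME];
    MPI_Lookup_name("euro50_wfs", MPI_INFO_NULL, port);
    MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, server);
    /* Subsequent MPI_Send/MPI_Recv on *server exchange inputs and results. */
}

The inter-communicator returned by MPI_Comm_accept/MPI_Comm_connect is what gives the one-to-one client-server semantics on the MATLAB side, while the server remains free to use its own intra-communicator for the distributed calculation.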

[Figure 3 diagram: MATLAB and the MEX/MPI interface run as a single process and memory space on the system running MATLAB; across an optional network and system architecture boundary, N processes on a cluster or shared memory system present MPI interfaces and perform the computationally intensive calculations in potentially multi-threaded hand-optimised C code and libraries, with the sequential task remaining on the MATLAB side.]

Figure 3. MPI-2 & MATLAB architecture.

The dynamic process management interface allows us to build a system that behaves in a similar client-server fashion and is conceptually simple, yet allows us to use clusters and have in effect a persistent standalone distributed server. The MATLAB interface maintains the semantics of a one-to-one client-server connection, while the server is free to use its native MPI semantics to conduct conventional message passing transactions on diverse topologies which may or may not be distributed. Figure 3 shows the interconnection of components when using this technique and figure 4 shows the sequence of MPI-2 calls employed.

The price of abstracting the MATLAB environment from the MPI computation is an additional data copy for both inputs and outputs. However, this is often a copy between processes on a single system and so is not limited by the wire speed of a given network. One might then argue that it would be beneficial to mandate that both processes be on the same system and use shared memory to effect the 'copy'. Unfortunately this is not without problems. The semantics of shared memory calls generally dictate that they allocate the block of memory to be shared and thus determine its address. This conflicts with MATLAB's memory management system, which dynamically allocates memory for data without direction from the end user, though one can determine its location after the fact using calls such as mxGetPr. It should be noted that shared memory calls can be forced to share a given area in memory by invoking the associated mmap call with the MAP_FIXED argument; however, this method is strongly discouraged in the function's documentation. Furthermore, as one of the principal motives for adopting the technique in the first place is to exploit high-performance architectures such as multiprocessor Itanium systems which MATLAB does not support, separation is crucial. Likewise one might envisage a scenario whereby a cluster is available which does not provide interactive graphical access to allow the use of MATLAB. In this case a connected system can host MATLAB. In practice one may be more concerned about delays experienced between nodes in the MPI world than at the MATLAB - MPI interface.

[Figure 4 diagram: on the server side the root node calls MPI_Init(), MPI_Open_port(), MPI_Publish() and MPI_Comm_accept(), while N additional nodes call MPI_Init(); on the client side the first call performs MPI_Init(), MPI_Lookup_name() on the published name and MPI_Comm_connect(). Input data is sent with MPI_Send and received with MPI_Recv, the calculation proceeds with miscellaneous MPI calls among the server nodes, and results are returned with MPI_Send/MPI_Recv, after which client execution continues. On client application termination the client calls MPI_Comm_disconnect() and MPI_Finalize(); the root node does likewise and the additional nodes call MPI_Finalize().]

Figure 4. MPI-2 & MATLAB execution flow.

4. APPLICATION OF PERFORMANCE ENHANCEMENT TECHNIQUES

4.1. Introduction

The model has three execution phases: initialisation, execution and post-processing; these are more fully described in [8], [1] & [9]. The model as a whole is made up of many sub-models corresponding to the physical subsystems of the telescope, many of which are listed in figure 1(b). This figure also shows the grouping of these subsystems into fast and slow systems. This distinction reflects the physical characteristics of the systems. Those with slower dynamics are modelled at a slower rate, thus reducing computation times. The fast subsystems periodically synchronise with the slow subsystems' longer integration interval. When the ordinary differential equation solver acts on the fast subsystems it takes interpolated values for the slow systems into account. It was hoped that the problem was sufficiently loosely coupled that it could be parallelised on a functional basis, whereby different nodes represent and simulate different physical parts of the telescope. This proved not to be the case, as discussed in §2.2, necessitating a change in the structure of the model.

4.2. Current Model Structure

The suitability of MATLAB for model development, visualisation and general scientific and engineering problem solving is beyond question. The model itself and the development skills that produced it represent a huge investment in the platform which could not be easily replicated, and so despite an urgent need for greater performance there was great reluctance to consider abandoning MATLAB with no clear guarantees from any alternatives; hence the development of the previously outlined techniques.

[Figure 5 diagram: an incident wavefront passes through 7-layer propagating atmospheric screens (precomputed in the preprocessing phase), deformable mirrors 1 and 2, the pupil and the structure/segments to the wavefront sensor, whose output feeds image reconstruction & actuator control; the sequential task and the tightly coupled parallelism run on a single machine, while the post-processing point spread function calculations form loosely coupled parallelism potentially on several machines.]

Figure 5. Current model structure showing parallel wavefront sensor.

The notion of a system which is partitioned for parallel execution within the MATLAB environment has been abandoned. This is possible because of advances in commodity hardware supported by MATLAB, allowing much larger memory capacities. As a result the system hosting MATLAB can now affordably be provisioned with 16GB of RAM or more, enough to accommodate all of the subsystems without resorting to swap files. With the subsystems consolidated, communications overhead is eliminated. Where previously we were forced to parallelise for the sake of both memory demands and computational speed within the MATLAB environment, now we need only concern ourselves with computation. Figure 5 shows the current structure of the model. The details of how this is achieved are presented in §3 and the performance gains made in §5.

This refocusing of the optimisation effort means that bottlenecks, regardless of their location within the model, can be addressed. The transparency of the techniques frees the end user from the error-prone step of variable distribution and leads to cleaner high-level model code. Currently the wavefront sensor code has been addressed using the RPC and multithreaded technique. The post-processing blocks have been optimised and are still under review. The architecture allows a piecemeal approach to be taken as and when resources permit or performance demands it, with minimal impact on the model itself.

4.3. Future Model Structure

Figure 6 shows what may be the next evolutionary step for the model relative to the current state shown in figure 5. The number of guide stars, and consequently wavefront sensors, has been increased. It is envisaged that this number would be gradually increased. One of the goals of the model is to help study the performance of multiple laser guide stars. For Single Conjugate Adaptive Optics (SCAO) potentially 13 are required and for Dual Conjugate Adaptive Optics (DCAO) potentially 37 are required, see [10]. The current wavefront sensor simulation technique has proved to be reliable and fast and can also be used to deal with multiple wavefront sensors. Doing so incurs problems of asynchrony with RPCs; to overcome this, a multithreaded version of the current model client-side code has been successfully implemented for several wavefront sensors. It remains to be determined if ultimately a solution based on the technique described in §3.4.1 will be more suited to working with a large number of wavefront sensors in a production environment. The availability of both techniques means that this decision can be based on the nature of available hardware, a pragmatic concern given the scale of possible future simulations.

[Figure 6 diagram: an incident wavefront feeds up to 37 instances of 7-layer propagating atmospheric screens (precomputed in the preprocessing phase as loosely coupled parallel tasks on several machines), each with its own pupil and wavefront sensor, together with deformable mirrors 1 and 2 and the structure/segments; the wavefront sensors feed a common image reconstruction & actuator control block, with the tightly coupled parallelism ideally on a single machine and the post-processing point spread function calculations as loosely coupled parallel tasks.]

Figure 6. Possible future model configuration showing multiple wavefront sensors.

The preprocessing or initialisation stage may be converted to an optimised parallel format. This is a low priority task as initialisation is executed infrequently; however, it becomes increasingly easy to do as a body of optimised code is developed. It is considered likely that the image reconstruction and actuator control block will become a bottleneck as the number of wavefront sensors increases, and so it too is a candidate for conversion to tightly coupled optimised C code. It is important to note that figure 6 implies only desirable hardware capabilities and not a hardware topology, as was the case in early attempts at achieving a parallel architecture such as that shown in figure 1. In fact MATLAB is no longer aware of what systems underlie the externally executed code. Similarly the user is less concerned with managing the problems of distributed computing as they can now be encapsulated at a lower level which is built upon interchangeable, interoperable and proven HPC technologies (§3.3, §3.4.1 & §3.2) and optimised math libraries. Similarly, new technologies such as math co-processing boards could be introduced without affecting the MATLAB domain, making validation a straightforward process.

5. CONCLUSIONS

While initial development has prioritised the primary computational bottleneck in the system, it is envisaged that other parts could benefit from the same optimisation techniques and could run effectively in either a cluster or shared memory environment. To date, improvements have meant that two simulations can now be run in one day rather than one in three weeks, a huge increase in productivity compared to the approximately 3 weeks of computation per second of simulated time required prior to this effort. Further improvements will mean that simulation time can be further reduced or the resolution of the model increased.

5.1. RPC & Multithreaded Code Results

[Figure 7 bar chart: time per operation on a logarithmic scale (0.01 s to 1000 s) for Shift, FFT, Absolute value, Inverse FFT, Exponential, Interpolation, Miscellaneous, Initialisation and Total, comparing MATLAB (Intel 3GHz PIV), C (Intel 3GHz PIV) and multithreaded C (16x2.2GHz AMD Opterons).]

Figure 7. Multithreaded wavefront sensor optimisation results, per operation & in total.

Figure 7 shows the improvements that have been obtained in the current model. Initialisation overhead has been eliminated. Interpolation time has been greatly reduced, though not yet parallelised. Calculation times have been reduced; the total time is reduced by a factor of 85. Figure 8 shows how the parallel code scales on two shared memory hardware platforms. It shows that in absolute terms there is little time to be gained once we advance beyond 8 processors. In this region the gain grows increasingly sublinear and improvements are based on an already greatly reduced time. At this stage the processors are no longer fully utilised as the bandwidth of the machine's memory system is overwhelmed; also, the per-thread workload effectively becomes smaller as it is spread more widely and more time must be spent on marshalling threads.

[Figure 8 plot: time (s) against number of processors (1-16) for the 16x2.2GHz AMD Opteron system and an SGI Altix 350 with 24x1.5GHz Intel Itanium II processors.]

Figure 8. Scalability of multithreaded wavefront sensor code.

5.2. MPI Implementation Results

Growth in the model means that the 16 processor NUMA Opteron system currently used will be inadequate if current simulation times are to be maintained. Thus it is prudent to consider the performance of a message passing implementation of the critical wavefront sensor code, especially its scalability. To assess this, the existing shared memory code was refactored to use MPI libraries in line with the techniques outlined in §3.4.1; as one would expect, the resulting code can be used on both shared memory systems and clusters. Figures 9 and 10 show the performance of the code on a 16x2GHz Opteron NUMA system (as used in figures 7 and 8) and an nx2.4GHz Opteron cluster with a gigabit ethernet interconnect respectively. In addition to the overall runtime the graphs aim to show the important components of that time as we vary the number of processors. The total times are shown as wide bars on a log scale and the component times of the respective total times are shown as narrow bars.

Figure 9 shows some interesting features. Overall we get performance similar to that seen in figure 8 for the same hardware. However there are two significant differences. As the code is MPI (MPICH2) based we have a communication overhead not present in the shared memory case. It is minor and grows modestly. By definition memory is not shared, so we must either pass the results of the interpolation phase between processes prior to the calculation phase or do the interpolation in each process. The latter is significantly faster and allows greatest reuse of the multithreaded code. However, we see that the interpolation time grows with the number of processors, offsetting gains made in the calculation phase. This happens because of contention in the memory sub-system of the system; interpolation is a memory intensive process. An obvious improvement would be for each processor to interpolate only the region for which it is performing calculations. This applies to the case of a cluster also. We see in figure 10 that the interpolation time does not grow, as there is no contention in the distributed memory configuration. However, with a large number of processors it becomes an increasingly significant time factor and so should still be addressed as suggested. On the cluster we see that the communication time increases dramatically for 128 processors. This indicates that for this combination of interconnect and MPI library we should consider 64-96 processors to be our maximum useful range. Based on these numbers, with selective interpolation it is reasonable to estimate that we could achieve iteration times of 0.3s on 64 2.4GHz Opteron processors, a factor of 355 faster than the original MATLAB run time on a 3GHz PIV. Of course, given large numbers of processors, it is more likely that one would be interested in increasing the number of sensors simulated, which is quite feasible.

[Figure 9 bar chart: total, calculation, interpolation and communication times on a logarithmic scale (0.01 s to 10 s) against number of processors (1-16).]

Figure 9. MPI wavefront sensor code scalability on a 16x2GHz Opteron NUMA system.

[Figure 10 bar chart: total, calculation, interpolation and communication times on a logarithmic scale (0.01 s to 10 s) against number of processors (1-128).]

Figure 10. MPI wavefront sensor code scalability on a nx2.4GHz Opteron cluster with a gigabit ethernet interconnect.

In conclusion we must add that we still believe MATLAB is almost always the correct tool in this field if one values one's time more than computer time. As an anecdotal point, we were required to write approximately 2000 lines of C and use a further 1000 lines generated by the RPC stub generation tool to supplant just 150 lines of MATLAB code, much of which was commenting. However, multi-core processors are now the default, parallelism using OpenMP is relatively easy to achieve and clusters are increasingly common, a situation that lends itself to the solutions we have proffered, for select cases.

APPENDIX A. MATLAB EXTERNAL INTERFACES

MATLAB's so-called external interfaces are a powerful set of features; they allow one to execute code written in certain other languages and to exchange data with other environments and physical peripherals. Of primary interest is MATLAB's interface to external programs written in C and/or Fortran. C and Fortran functions can be called from within MATLAB as if they were built-in MATLAB functions. Such C and Fortran functions are called MEX-files. They are dynamically linked functions that MATLAB loads and executes automatically. They are generally used to permit reuse of large blocks of code that would otherwise have to be rewritten, to recode bottleneck sections of code in order to improve on MATLAB's native performance, or to communicate with hardware via a custom written driver. MEX-files are called in the same way one would call a standard M-file. They have a platform specific extension beginning with .mex, except under Microsoft Windows where they are dynamically linked library (.dll) files. They are used by simply referring to their name, without any extension, as a function call within MATLAB. The MATLAB path is then searched for a matching file.
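As a minimal sketch of the MEX mechanism described above (illustrative only; the file name and the element-doubling operation are hypothetical stand-ins for a real computation), a C MEX-file consists of a gateway function mexFunction that receives its inputs and outputs as arrays of mxArray pointers:

/* Minimal C MEX-file sketch: doubles every element of its input.
   Illustrative only; built with the mex command, e.g. "mex double_it.c",
   and then callable from MATLAB as y = double_it(x). */
#include "mex.h"

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    if (nrhs != 1 || !mxIsDouble(prhs[0]))
        mexErrMsgTxt("Expected one double-precision input array.");

    size_t n = mxGetNumberOfElements(prhs[0]);
    const double *in = mxGetPr(prhs[0]);

    /* Create the output array with the same shape as the input. */
    plhs[0] = mxCreateDoubleMatrix(mxGetM(prhs[0]), mxGetN(prhs[0]), mxREAL);
    double *out = mxGetPr(plhs[0]);

    size_t i;
    for (i = 0; i < n; i++)
        out[i] = 2.0 * in[i];
}

In the techniques described in the body of the paper, a gateway of this form is where the data is handed to the RPC or MPI client code rather than processed locally.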

ACKNOWLEDGMENTS The authors wish to acknowledge the support of Science Foundation Ireland and the assistance of the Irish Centre of High-end Computing.

REFERENCES

1. M. Browne, T. Andersen, A. Enmark, D. Moraru, and A. Shearer, "Parallelization of MATLAB for Euro50 integrated modeling," in Modeling and Systems Engineering for Astronomy, Proc. SPIE 5497, pp. 604-610, 2004.
2. C. Moler, "Why there isn't a parallel MATLAB," Cleve's Corner, MathWorks Newsletter, 1995.
3. R. Choy and A. Edelman, "Parallel MATLAB survey," http://supertech.lcs.mit.edu/~cly/survey.html, 2006.
4. J. Bloomer, Power Programming with RPC, O'Reilly and Associates, 1992.
5. SunSoft, Network Interfaces Programmer's Guide, Sun Microsystems, 1994.
6. V. Sunderam, G. Geist, J. Dongarra, and R. Manchek, "The PVM concurrent computing system: Evolution, experiences and trends," Parallel Computing 20(4), pp. 531-545, 1994.
7. W. Gropp and B. Smith, "Chameleon parallel programming tools users manual," Technical Report ANL-93/23, Argonne National Laboratory, 1993.
8. T. Andersen, M. Browne, A. Enmark, D. Moraru, M. Owner-Petersen, and H. Riewaldt, "Integrated modelling of the Euro50," Proceedings of the 2nd Bäckaskog Workshop on Extremely Large Telescopes, 2003.
9. T. Andersen, A. Enmark, D. Moraru, C. Fan, M. Owner-Petersen, H. Riewaldt, M. Browne, and A. Shearer, "A parallel integrated model of the Euro50," in Optimizing Scientific Return for Astronomy through Information Technologies, Proc. SPIE 5497, pp. 251-265, 2004.
10. T. Andersen, V. Lukin, M. Owner-Petersen, and A. Goncharov, "Laser guide stars for Euro50," Lund Telescope Group Technical Note, 2002.
