
Advances in Engineering Software 104 (2017) 38–50


Parallelisation of an interactive lattice-Boltzmann method on an Android-powered mobile device

Adrian R.G. Harwood (Research Associate)* and Alistair J. Revell (Senior Lecturer)

School of Mechanical, Aerospace and Civil Engineering, The University of Manchester, Sackville Street, M1 3BB, United Kingdom

*Corresponding author: [email protected]

Article history: Received 16 July 2016; Revised 18 November 2016; Accepted 22 November 2016; Available online 6 December 2016.

Keywords: Android; Mobile computing; Interactive simulation; Lattice Boltzmann method; Java concurrency.

Abstract

Engineering simulation is essential to modern engineering design, although it is often a computationally demanding activity which can require powerful computer systems to conduct a study. Traditionally the remit of large desktop workstations or off-site computational facilities, potential is now emerging for mobile computation, whereby the unique characteristics of portable devices are harnessed to provide a novel means of engineering simulation. Possible use cases include emergency service assistance, teaching environments, augmented reality or indeed any case where large computational resources are unavailable and a system prediction is needed. This is particularly relevant if the required accuracy of a calculation is relatively low, such as cases where only an intuitive result is required. In such cases the computational resources offered by modern mobile devices may already be adequate. This paper discusses the possibilities that modern mobile devices offer to engineering simulation and describes some initial developments in this direction. We focus on the development of an interactive fluid flow solver employing the lattice Boltzmann method, and investigate both task-based and thread-based parallel implementations. The latter is more traditional for high-performance computing across many cores, while the former, native to Android, is simpler to implement and returns slightly higher performance. The performance of both saturates when the number of threads/tasks equals three on a quad-core device. Execution time is improved by a further 20% by implementing the kernel in C++ and cross-compiling using the Android NDK.

1. Introduction

Modelling and simulation is an integral part of modern engineering as it allows the user to improve their understanding of physical scenarios and complex systems. Depending on the context, this knowledge may be used in a variety of ways; e.g., to inform a design decision, to aid in education of key concepts, or to identify the level of risk for a given scenario. Due, in part, to both improved understanding and perpetually increasing computational power, we have become accustomed to a regular increase in the accuracy of these simulations. The calculations themselves are generally conducted on high-end computer facilities either housed locally or accessed via a high-bandwidth interconnect to a high performance computing (HPC) facility. Due to the nature of the software and the skills required to manage and process the data, there are well defined processes in place to assure the quality of the simulation results. For these reasons, and others, the running of computer simulations tends to fall under the remit of an experienced engineer and is typically orchestrated from a desk-based computer. However, in the era of big data and pervasive computing, it is no longer impractical to envisage the coordination, and indeed the running, of simulations via or on-board a mobile device. There is no question that mobile devices, be they tablet computers or mobile phones, are lighter, more portable and often cheaper than the laptops, desktops and servers currently being used for engineering simulations. Having simulation results presented directly to an individual using this platform can allow qualitative analysis to be performed in situations where such information has previously been unavailable. For example, in emergency scenario analysis, the mobile device may be used to capture surroundings using the built-in camera and contaminant sources using the touch screen. A local simulation is then used to provide the user with an immediate safe route of navigation. Alternatively, an interactive wind tunnel can be effectively given to a class of students to enhance education and learning. Mobile devices may also be given to physicians and used in combination with patient-derived imagery to provide improved diagnostic information at the point of care [1].


Over recent years, the prevalence of desktop computing has reduced, and the use of laptop and tablet devices has grown to fill this gap. Leveraging the many-core graphics processing units (GPUs) typically available on mobile devices can deliver a significant boost in processing power. However, it is unlikely that a single device will reach the power available in current HPC clusters in the near future. Instead of matching the accuracy of "conventional" modelling and simulation methods, which is likely to always require significant computing power, there is arguably a role for a faster simulation tool that trades accuracy for speed in order to attain a level of human-interactivity. This is particularly true where simulation is used to complement and enhance human decision making, or even to provide fast approximations to an automated decision-making system as one of many environmental data streams available to it. In order to assess the suitability of mobile platforms for performing local, interactive engineering simulation, this article reports the development of two different parallel design patterns for performing interactive, grid-based fluid dynamics simulations on an Android-powered mobile device. The Android operating system is selected as the development platform due to the availability and affordability of suitable hardware and the fact that it currently has the largest market share for mobile devices [2]. Although mobile devices often feature both a GPU and a CPU, the present study explores only the use of the CPU; the development of custom software for mobile GPUs is not yet widely supported and is left for a future publication. Our simulations use the lattice-Boltzmann method (LBM), introduced in the next section, to simulate the flow physics [3]. The primary aim of the present study is to propose different approaches to interactive flow simulation using LBM on an Android device. This includes the implementation and cross-comparison of several candidate frameworks in order to assess the potential for mobile devices, either used alone or for multi-device parallelism. The completion of these aims provides baseline data and a design template on which other types of interactive engineering simulation on a range of devices may be built.

2. Use of mobile devices for engineering simulation

A survey of device ownership in the US in 2015 [2] revealed that 68% of the US population owned a smartphone and 45% owned a tablet. Globally, smartphone ownership hit 1000 million in 2012 and is set to exceed 2500 million by 2020. Ownership is spread across all continents, with South Korea leading the way, where 88% of the population own a smartphone. Ownership in so-called advanced economies, including the US and much of Europe, is approximately 68% on average [4]. Modern mobile devices are designed to include a multi-core CPU and a GPU, providing similar versatility to a desktop computer. The computational power of these chips has increased approximately ten-fold since 2009 [5] and available RAM has also increased, with more than 3 GB typical of current high-end Samsung Android devices (Fig. 1). Current engineering workstations may have 16 cores and 64 GB RAM with which to perform a local simulation: a 4x increase in CPU cores and a 16x increase in memory over such a device. Furthermore, due to active cooling mechanisms on desktop computers, clock speeds are often much higher, increasing computing power further. Halpern et al. [5] show that power consumption for mobile chips has increased, although further increases in power consumption are yielding less of a gain in performance, causing a power saturation at about 1.5 W. Instead, the most recent smartphone designs have increased the number of cores rather than increasing the power per core. One may conclude that mobile hardware development is hence limited by its power-conserving motivation.


However, the widespread ownership of mobile devices, combined with the clear, albeit restricted, increases in power and capability, makes the platform a potential candidate for running smaller-scale engineering simulations on site without reliance on external resources or connectivity. Although an individual device may not be able to offer the power of an HPC facility, simulations could be performed on a network of devices connected by a local network implemented via Bluetooth or Wi-Fi Direct. High-end HPC facilities at present typically offer O(10^5) cores and O(10^3) GB of RAM according to the TOP500 list. However, in practice, these resources are shared amongst many users, with individual jobs using much smaller allocations. Access to such facilities is also generally restricted. Theoretically, if all 2 x 10^9 smartphones globally are assumed to be quad-core with 2 GB RAM (c.2013), a global P2P smartphone computer could offer 8 x 10^9 cores with 4 x 10^9 GB of memory. This is purely hypothetical but illustrates the compute potential of even the mobile devices in a single office block or city. In light of hardware limitations for individual mobile devices, it is expected that in order to run a simulation locally, there will be a trade-off between the level of model complexity (and hence simulation accuracy) and the speed with which a result can be obtained. However, mobile platforms have the potential to provide sufficient computing power for rapid simulation to an acceptable and situation-appropriate degree of accuracy. This can only be realised with the development of a suitable framework for engineering simulation in this context.

2.1. Integration with existing infrastructure

In our increasingly connected world, mobile devices also have the option to off-load tasks with a high resource demand to more suitable systems [6]. In the case of engineering simulation, mobile devices may provide input data (such as local wind speed and direction, measured structural loads, or geometry and materials, all recorded locally on the device) to a remote HPC facility which performs a potentially demanding calculation using these system data. A data-reduced result may then be returned for the user to inspect. It may also be possible to perform some part of the simulation locally as a coarse approximation to the problem physics while simultaneously performing a more detailed analysis remotely, which may be viewed or incorporated into the local platform at a later time (Fig. 2). However, at present HPC facilities are expensive to build and maintain and access is typically restricted. Furthermore, network connectivity is not available in every location and may also suffer from reduced bandwidth or unreliability. A common interconnect between HPC facilities and external terminals is of the order of 1 Gb/s, which gives a maximum theoretical throughput of 125 MB/s. The fastest mobile data connections in the UK at present use the LTE-A (4G+) standard and will theoretically support such a transfer rate [7]. However, this service is, at present, only available in select areas and at a premium subscription cost to the user. Typically, connection speeds may be as low as 12 MB/s depending on the infrastructure available. A sensible alternative may therefore be to develop approaches to performing the calculation locally on one or more available devices [8].

3. The lattice Boltzmann method

\[
\left( \frac{\partial}{\partial t} + \mathbf{c} \cdot \nabla \right) f = \Omega
\tag{1}
\]


Fig. 1. Illustration of the increase in CPU cores and memory on smartphone devices since 2009 [5].

Fig. 2. Illustration of the concept of interactive engineering simulation and the possible roles of mobile devices. Simulation may be performed locally on a single device, on a network of devices or off-loaded to High Performance Computers (HPC).

Unlike conventional CFD techniques, which aim to solve the Navier–Stokes equations, the lattice-Boltzmann method solves the Boltzmann equation, Eq. (1), in order to obtain the statistical behaviour of the fluid. Physical space is modelled as a series of discrete nodes linked by a finite number of lattice links. At each lattice node, fluid is represented neither as finite volumes nor as microscopic particles but as groups (or distributions) of particles f. The lattice links represent a discrete velocity vector c along which particles within a given group are permitted to move. As the simulation evolves, microscopic particle collisions are modelled by the application of a collision operator Ω. A commonly-used collision operator, known as the BGK approximation [9], relaxes the momenta of the particles at a particular lattice site towards a local equilibrium as

\[
f_{\mathrm{new}} = f_{\mathrm{old}} - \frac{1}{\tau}\left( f_{\mathrm{old}} - f^{\mathrm{eq}} \right)
\tag{2}
\]

where τ is the rate of relaxation and f^eq is a function of the local macroscopic quantities. The redistributed momenta are then convected to neighbouring grid sites along the lattice links. At the end of each time step, the macroscopic quantities are updated by computing the statistical moments of the distribution functions. The main steps in the LBM are illustrated graphically in Fig. 3. No-slip boundaries are implemented using a bounce-back technique which simply reflects distributions with a component oriented normal to the wall back along suitable lattice links.
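The text refers to these statistical moments without stating them; for reference, the standard zeroth and first moments used to recover the macroscopic density and velocity are (these are the usual LBM definitions, e.g. [10], and are not reproduced from the original text):

\[
\rho = \sum_i f_i, \qquad \rho\,\mathbf{u} = \sum_i \mathbf{c}_i f_i
\]

where the index i runs over the discrete lattice directions.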


Fig. 3. Graphical illustration of a time step using LBM.

The lattice-Boltzmann equation is known to recover the Navier–Stokes equations with second-order accuracy in both space and time [10]. LBM is an increasingly widely-used method for accelerated fluid mechanics simulation due to its suitability for parallelisation, particularly on GPUs [11]. Massively parallel execution is possible because typical collision operators operate on site-local data. Although the convection process requires propagation of information to neighbouring lattice sites, it is possible to order the operations such that data read-write is atomic [12]. A single instruction may therefore be carried out by many threads on a GPU in parallel. The memory requirements of an LBM application will inevitably increase with resolution, as more lattice sites will need allocated storage. The link-wise artificial compressibility method (LW-ACM) [13] is one potential solution to the increasing memory requirements as it reduces the data stored per lattice site. Alternatively, multiple devices may be used in parallel, as discussed later. A minimal sketch of a BGK collision-and-stream update is given below.
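The paper does not list its kernel code; the following is a minimal Java sketch of a D2Q9 BGK collide-and-stream step of the kind described above. The array layout, class name and periodic wrapping are illustrative assumptions, not the paper's actual implementation.

    // Minimal D2Q9 BGK lattice-Boltzmann step: collide then stream.
    public class LbmKernelSketch {

        // D2Q9 lattice velocities and weights (standard values).
        static final int[] CX = { 0, 1, 0, -1, 0, 1, -1, -1, 1 };
        static final int[] CY = { 0, 0, 1, 0, -1, 1, 1, -1, -1 };
        static final double[] W = { 4.0 / 9,
            1.0 / 9, 1.0 / 9, 1.0 / 9, 1.0 / 9,
            1.0 / 36, 1.0 / 36, 1.0 / 36, 1.0 / 36 };

        final int nx, ny;
        final double tau;          // BGK relaxation time
        double[][][] f, fNew;      // distributions f[x][y][i]

        LbmKernelSketch(int nx, int ny, double tau) {
            this.nx = nx; this.ny = ny; this.tau = tau;
            f = new double[nx][ny][9];
            fNew = new double[nx][ny][9];
            // Initialise at rest: f_i = w_i (rho = 1, u = 0).
            for (int x = 0; x < nx; x++)
                for (int y = 0; y < ny; y++)
                    System.arraycopy(W, 0, f[x][y], 0, 9);
        }

        /** One complete LBM time step (Eq. (2) plus streaming). */
        void step() {
            for (int x = 0; x < nx; x++) {
                for (int y = 0; y < ny; y++) {
                    // Macroscopic moments at this site.
                    double rho = 0, ux = 0, uy = 0;
                    for (int i = 0; i < 9; i++) {
                        rho += f[x][y][i];
                        ux  += CX[i] * f[x][y][i];
                        uy  += CY[i] * f[x][y][i];
                    }
                    ux /= rho; uy /= rho;

                    double uSq = ux * ux + uy * uy;
                    for (int i = 0; i < 9; i++) {
                        // Second-order equilibrium distribution.
                        double cu = CX[i] * ux + CY[i] * uy;
                        double feq = W[i] * rho
                            * (1 + 3 * cu + 4.5 * cu * cu - 1.5 * uSq);
                        // BGK collision (Eq. (2)) followed by streaming to the
                        // neighbour along link i (periodic wrap for brevity).
                        int xs = (x + CX[i] + nx) % nx;
                        int ys = (y + CY[i] + ny) % ny;
                        fNew[xs][ys][i] = f[x][y][i] - (f[x][y][i] - feq) / tau;
                    }
                }
            }
            // Swap buffers ready for the next step.
            double[][][] tmp = f; f = fNew; fNew = tmp;
        }
    }

In a real implementation the bounce-back boundary treatment described above would replace the periodic wrap at solid walls.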

designed such that iterations become very rapid (> 20 frames per second). Evolution of the simulation thus appears smooth, with the user unable to discern the drawing of individual frames at each time-step. As this study focusses on software design strategies, validation of the LBM is not performed. However, the accuracy of even coarsely-resolved LBM for low Reynolds number, laminar flows is remarkably good [26] and the resolution used here is more than sufficient for the Reynolds number being examined. The development of the boundary layer from the solid walls applied at the top and bottom of the domain was visible, with the flow driven by a uniform inlet boundary condition at the left-hand edge of the domain. A screenshot of the flow as evolved on the device is shown in Fig. 9. In reality, Reynolds numbers of interest will be much higher than simulated here. In order to maintain both stability and accuracy, it is expected that the resolution of the lattice will need to be increased. Additional modelling may also be required to capture turbulence or to increase the stability of the simulation. There will inevitably be a trade-off between accuracy and performance. However, the motivation for interactive CFD in the short-term is not to target the level of accuracy already offered by conventional CFD, but simply to target a sufficient level of accuracy for the application.


Fig. 8. Sequence diagram illustrating a Thread-based, distributed memory design pattern for an interactive flow simulation application.

Fig. 9. Screenshot of converged 2D channel flow simulated using the lattice-Boltzmann method on an Android tablet.

Both designs were implemented as separate applications and included timers to time the kernel function responsible for completing an LBM time step. In addition, an extra timer was added to the thread-based design to measure the performance of the process responsible for the preparation, passing and dissemination of messages. As illustrated in Figs. 7 and 8, the main differences in the designs are found in the execution of the LBM kernel rather than in gesture processing or view drawing. Hence, although interaction is possible in both implementations, tests were conducted without touching the display to ensure only the kernel performance is measured. Timing data was recorded at each iteration of the LBM kernel and an average value dynamically updated; a minimal sketch of such a timing loop is given at the end of this section. After 1000 time steps, the simulation was stopped and the data recorded. This number of time steps was sufficient to ensure that updates to the average time were less than 1 ms per 100 time steps, i.e., typically less than a 1% change. The problem size is kept fixed, with a grid of 192 x 93 used in each case, and the number of threads/tasks increased from 1 to 6.

Battery usage for any mobile application is an important consideration. Although it is likely that the device would in practice be used to run many short simulations rather than a single long calculation, as only intuition and qualitative results are required, we detail the power consumption of the application in any case. It should be noted that the details of power consumption will vary from device to device, as hardware such as CPUs, memory and screens all have different power requirements across devices. Nevertheless, the battery usage was noted from the internal Android battery monitor application, which was reset when the LBM application was loaded. The simulation was run for 60 min without interaction and the data from the battery monitor recorded every 10 min.
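Before moving to the results, the per-iteration timing described above might be structured as in the following minimal sketch; the class and method names are illustrative assumptions, not the paper's own code.

    // Sketch of per-iteration kernel timing with a running average.
    public class KernelTimer {
        private long totalNanos = 0;
        private int iterations = 0;

        /** Times one LBM step and updates the running average. */
        public double timeStep(Runnable lbmKernel) {
            long start = System.nanoTime();
            lbmKernel.run();                       // one complete LBM time step
            totalNanos += System.nanoTime() - start;
            iterations++;
            return averageMillis();
        }

        /** Average time per step in milliseconds so far. */
        public double averageMillis() {
            return (totalNanos / 1e6) / iterations;
        }
    }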


Fig. 10. Timing data for Thread-based (Multi-thread) and Task-based (AsyncTask) implementations. Ideal strong scaling normalised to the serial data is indicated by the square and circle markers.

8. Results

Fig. 10 shows the timing results obtained during the test. The horizontal axis indicates the number of tasks/threads used to parallelise an LBM time step. The vertical axis shows the measured time in ms. The yellow bars indicate the time taken for the thread-based, distributed memory implementation and include a green region which indicates the time taken to complete message-passing. The blue bars illustrate the time taken for the task-based, shared memory approach. The ideal strong scaling (Amdahl's Law) is represented by the red markers in the figure. This scaling is normalised to the serial case in Fig. 10; since the problem size does not change, the idealised scaling is computed by simply dividing the serial time by the number of threads/tasks used. The two implementations are capable of performing 0.33 / 0.40 million lattice updates per second (MLUPS) in serial, increasing to a maximum of 0.62 / 0.69 MLUPS for the thread-based, distributed memory solution and the task-based, shared memory solution, respectively (a short worked check of how MLUPS follows from grid size and loop time is given below). These translate to performance increases by factors of 1.88 and 1.73. There is a difference in serial execution time. This is expected, as the kernel classes themselves are slightly different to facilitate the different memory structures of each design. The additional managerial software in the thread-based design, including barrier synchronisation and manual control of the worker threads, adds load to the LBM kernel, which is reflected in an increase in execution time. However, the effect of this is less pronounced during parallel execution, with the LBM execution time similar in both implementations. In line with expectations, the message-passing cost associated with the thread-based design increases with the number of threads used. This is due to an increased CPU load associated with message routing and the use of a finite set of resources for executing the handlers. Both implementations exhibit a clear performance saturation. This point occurs when the number of requested threads/tasks is equal to three.
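As a side check, MLUPS is simply grid sites times steps divided by elapsed time; the sketch below reproduces the 0.48 MLUPS serial Java figure of Table 3 from the 192 x 93 grid and the 37 ms loop time reported there (the class name is illustrative).

    // Sanity check of the quoted MLUPS figures:
    // lattice updates per second = (grid sites x steps) / elapsed time.
    public class MlupsCheck {
        public static void main(String[] args) {
            int sites = 192 * 93;              // 17,856 lattice sites
            double loopTimeMs = 37.0;          // serial Java loop time per step (Table 3)
            double mlups = sites / (loopTimeMs * 1e-3) / 1e6;
            System.out.printf("%.2f MLUPS%n", mlups);  // prints 0.48 MLUPS
        }
    }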

Table 2. Peak memory usage for each of the cases. The bottom row indicates the percentage increase in memory usage for the thread-based design.

Tasks/Threads                  1      2      3      4      5      6
Task-based (MB)             5.85   5.87   5.87   5.88   5.89   5.89
Thread-based (MB)           5.90   5.96   6.01   6.09   6.25   6.41
Thread-based (% increase)    0.0    1.1    1.9    3.2    5.9    8.6

Considering that the device contains a quad-core CPU, this point represents the application having spawned the same number of threads as there are additional cores beyond one. The scheduling of threads across available cores is handled by the OS. Each thread is given a priority. Background threads are always set to a lower priority than user interface threads to preserve responsiveness, as discussed in Section 5; they are therefore allocated fewer resources, including CPU time. Android also takes into consideration how many background threads exist with work to run and assigns them a thread group. The resource requirements of each thread group are also controlled by the thread scheduler to ensure that each thread group makes progress, i.e., runs concurrently with other thread groups. Given that each thread in our implementations will have a similar workload, the thread scheduler is expected to run all threads concurrently using a similar set of limited resources across the available CPUs. As the number of background threads increases beyond three, the workload per thread falls with the local grid size on each thread, and the scheduler will simply run these threads less often in a limited resource pool. The execution time, therefore, stays roughly the same.

8.1. Memory usage

The memory usage of each case is recorded in Table 2. First, the allocation of memory for the application is relatively low given that the device has 4096 MB of memory in total, with perhaps 75% of that usable. Therefore, given the execution times in Fig. 10, the current implementations are limited by the speed with which an iteration of LBM can be computed.


Fig. 11. Line (right-hand axis) indicates the battery percentage for the first 60 min of running the simulation. Bars (left-hand axis) indicate how the power consumption is apportioned to individual applications as recorded by the Android battery monitor.

However, if in future the GPU is used, the problem may become limited by either memory capacity or potentially memory bandwidth, given that mobile GPUs share memory with the CPU and would be required to read and write large amounts of data in parallel. As expected, the shared memory model ensures an almost constant amount of memory allocated in each case, with a small increase observed due to the overhead associated with the creation of multiple tasks. Given that the LBM grid data is by far the largest allocation of memory, this observation is consistent. For the thread-based approach, there is a larger memory overhead associated with the creation of more threads. This is due to the addition of the halo regions required by the distributed memory approach, where each additional thread carries the cost of two halo regions. The domain is 192 x 93 = 17,856 grid sites in size and is divided into vertical strips, one per thread. Each halo is thus 93 sites, so the addition of one more thread costs 186 sites, approximately 1.1% of the total grid size. The percentage increases in memory usage for each of the thread-based cases are also given in Table 2. The increase in memory is initially lower than this 1.1%. This is because the grid, although a large proportion of the allocated memory, is not its only component; there is also an application overhead. Hence a 1.1% increase in the grid memory is seen as a smaller increase overall. As the number of threads increases further, the actual increase begins to exceed 1.1% as thread-management memory and grid memory together begin to represent a larger and larger proportion of the total memory allocated.

8.2. Battery usage

During the first 10 min, the battery level reduced from 100% to 95%, a drop of 5%. For the remainder of the 60 min test, the battery drain over each 10 min period never exceeded this value, with the drain approximately linear. As might be expected, the majority of the energy used during the test is spent keeping the screen illuminated for the duration of the simulation (cf. Fig. 11). The proportions of power consumed over each 10 min interval remain approximately constant with, on average, the simulation consuming 38% of the battery drain and the screen consuming roughly 1.5 times that amount. Projecting the available data linearly, a full battery would therefore be exhausted after running the simulation for approximately 3.4 h. If a long-time-averaged result is required, rather than a dynamic window into the flow behaviour, the screen could be turned off for the duration of the simulation; in these circumstances, battery life would be extended to 8.4 h.

Fig. 12. Pseudo-UML sequence diagram to illustrate practical two-task coupling of the non-local and LBM local operations when using a shared memory model and asynchronous tasks.

8.3. Complexity

The previous discussion elicits the differences in execution time and memory usage of the two designs. There are, however, additional practical considerations with regard to the complexity of the implementation. The Handler framework is efficient at passing messages between the objects in distributed memory. However, the programmer is required to design and instantiate a suitable container for the grid data, use countdown triggers (i.e., an AtomicInteger) to track the completion of background tasks, and notify the thread manager to continue at appropriate times in the algorithm. These elements are necessary for a distributed memory, thread-safe update of regions of the LBM grid common to more than one thread, although at the cost of added complexity. These parts of the framework amount to approximately 20% of the application software, which requires significant additional effort to implement compared with a sequential version. However, the thread-based system is easy to synchronise using CyclicBarrier, as this higher-level construct hides its underlying complexity. In contrast, the shared memory model used by the task-based design presents a different challenge when it comes to synchronisation and thread-safety. The task-based approach actually required the kernel to be broken into two steps. The local collision and non-local stream operations are performed by one set of tasks. Once complete, the thread manager then concatenates the non-local results by copying the post-stream grid quantities onto the pre-stream grid. Once the grid is up-to-date, another task is launched to complete the LBM kernel by updating the macroscopic quantities concurrently (cf. Fig. 12). There is added complexity due to this implementation, but less so than managing halo data when using distributed memory. There is also an increase in task instantiation due to the creation of tasks twice per execution of the LBM kernel rather than once. AsyncTask is naturally asynchronous and does not provide barrier-type synchronisation, hence synchronisation must be enforced explicitly using other constructs, as sketched below.
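The paper does not reproduce this synchronisation code; the following is a minimal sketch of how an atomic countdown of the kind described (an AtomicInteger decremented by each task) might signal completion to a waiting thread manager. The class and method names are illustrative assumptions.

    import java.util.concurrent.atomic.AtomicInteger;

    // Each concurrent task decrements a shared counter when it finishes its
    // portion of the LBM grid; the last one notifies the waiting manager.
    public class CompletionLatch {
        private final AtomicInteger remaining;
        private final Object monitor = new Object();

        public CompletionLatch(int taskCount) {
            remaining = new AtomicInteger(taskCount);
        }

        /** Called by each background task when its grid region is done. */
        public void taskFinished() {
            if (remaining.decrementAndGet() == 0) {
                synchronized (monitor) {
                    monitor.notifyAll();   // wake the thread manager
                }
            }
        }

        /** Called by the thread manager to wait for all tasks. */
        public void awaitAll() throws InterruptedException {
            synchronized (monitor) {
                while (remaining.get() > 0) {
                    monitor.wait();
                }
            }
        }
    }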


Fig. 13. Schematic illustrating the hybrid Java-C++ implementation where the kernel and the methods called by the kernel are all implemented natively in C++ and precompiled.

The use of atomic flags to allow each concurrent task to indicate its completion to the thread manager is a simple and effective thread-safe solution. In summary, the task-based design performs better and is easier to implement than the thread-based design, although the latter offers a familiarity to scientific programmers by sharing concepts with OpenMP and MPI.

9. Native implementation using the Android NDK

Having investigated the effects of two different strategies of parallel design in Java, the Android native development kit (NDK) was used to improve the performance of the LBM kernel by implementing it as a pre-compiled C++ library. The channel flow simulation was performed using a serial execution of the task-based Java design and compared to a hybrid Java-C++ design. The task-based design was chosen for modification due to its simplicity, although similar modifications could easily be made to the thread-based design. The LBM kernel module of the application was previously written as a Java class which was instantiated with all the data corresponding to the lattice and its properties. The LBM kernel is a method of the class which performs a complete LBM time step. This kernel is repeatedly called in the previous two designs, either by a looping runnable object (thread-based design) or by continuous posting of asynchronous tasks to the background threads. In this modified case, the tasks are posted to a single background thread within a loop when the simulation is started, as per Fig. 7. The hybrid application uses this same design but a portion of the LBM class is implemented as a C++ library. How much of the application to implement in C++ and how much in Java is a design decision, but these proportions are generally chosen such that highly reused or computationally demanding portions of the application are implemented

natively to improve performance. A key component of the NDK is the Java Native Interface (JNI), an environment in which software can be developed to provide a bridge between the Java side and the C++ side of the application. The NDK provides C++ support for Android capabilities and it is possible to write an entire application in native software with no Java implementation at all. However, as Java already offers a clear, well-supported framework for thread management, which is used extensively in the designs presented above, the simplicity of the implementation was preserved by reusing this part of the application. The LBM object is instantiated from the Java class definition but the methods (including the LBM kernel) are implemented in C++, with the native kernel being launched from a single AsyncTask. At run time, the JNI allows the C++ kernel implementation to obtain handles to the data arrays created in the Java run-time environment and to release these arrays after kernel execution finishes. This arrangement is depicted schematically in Fig. 13. In order for native methods to access the Java arrays, the interface software must search for the Java class and its fields. These references are then used to obtain pointers to the fields themselves. This process is achieved using JNI API calls. Repeated calls to the JNI API can be very slow [27] if the calls are inside loops. To increase performance, it is advisable to perform as many of the required calls as possible in a native initialisation method when the C++ library is loaded. Fig. 13 illustrates this initialisation method, which performs the searches and caches field references in variables, publicly visible on the C++ side, for quick retrieval by native methods. A minimal sketch of the Java side of such a binding is given below.
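The class, method and library names below are illustrative assumptions; the paper's actual class layout is shown only schematically in Fig. 13.

    // Sketch of the Java side of a hybrid Java-C++ LBM class.
    public class LbmKernel {

        static {
            // Loads liblbmkernel.so, built with the NDK. Any native
            // initialisation (e.g. caching JNI field references) can be
            // performed when the library is loaded.
            System.loadLibrary("lbmkernel");
        }

        // Grid data lives in the Java runtime; the C++ side obtains
        // handles to these arrays through the JNI at run time.
        private final float[] f;       // distribution functions
        private final int nx, ny;

        public LbmKernel(int nx, int ny) {
            this.nx = nx;
            this.ny = ny;
            this.f = new float[nx * ny * 9];   // D2Q9 lattice
        }

        /** Declared in Java, implemented in the pre-compiled C++ library;
         *  performs one complete LBM time step on the grid arrays. */
        public native void stepNative();
    }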


Fig. 14. Illustration of the behaviour of a multi-device implementation of an interactive LBM simulation.

9.1. Test results

The same channel simulation is again run for 1000 time steps. The resulting speed-up in run-time performance from loading the LBM kernel from a pre-compiled C++ implementation is approximately 20% with respect to the original Java implementation of the LBM kernel (cf. Table 3).

Table 3. Loop time and speed-up associated with the hybrid design versus a full Java implementation.

Implementation    Loop time (ms)    MLUPS    Speed-up (%)
Java                          37     0.48             0.0
Java JNI C++                  29     0.62            21.6

If this performance were to scale by the same factor as the full Java implementation, the theoretical performance of the pre-compiled C++ version would be 1.1 MLUPS, a vast improvement on the 0.4 MLUPS of a serial Java implementation. This performance increase may be viewed as being specific to a given implementation or, more specifically, a given choice of JNI boundary placement. However, the benefits of using pre-compiled native implementations as a replacement for potentially slower, just-in-time-compiled Java implementations are really only significant when computationally intensive software is optimised in this way; there is little benefit in re-implementing native Android features that already require little CPU effort to execute. In Fig. 13, the boundary is chosen such that the part of the application which represents >90% of the CPU effort (as determined using trace profiling) is allocated to the C++ side of the application. Adjustment of this boundary further toward the Java side would therefore yield little additional gain.

10. Future work

The Project Tango device features an nVidia Tegra GPU. Theoretically, the use of the NDK should allow an application to be written

in CUDA C/C++ and managed through the JNI. This would require cross-compilation of a mixture of Java, standard C/C++ and CUDA C/C++ API calls. This is achieved on other platforms using a specialist compiler supplied by nVidia. Although nVidia provide some development support [28], documentation is limited as to how to deploy software written using the CUDA API on Android platforms. Examples at present are limited to the deployment of native Android applications that only link to a pre-compiled CUDA library. The necessary cross-compilation appears to be difficult to achieve within the CodeWorks environment without the need for custom build profiles. Nevertheless, it is expected that documentation will be written in due course to simplify the process and enable programmers to leverage the power of mobile GPUs directly.

10.1. Multi-device framework

Parallel design patterns, such as those described in this article, for engineering simulations on mobile devices will only guarantee performance gains up to a point. Beyond this, the limited memory capacity and the number and clock speed of CPU cores will impose restrictions on problem size and computational throughput. One way to circumvent the limits on memory and the number of computing cores is to parallelise the work across more than one device. This is akin to a multi-node configuration in conventional HPC. Coupling grids distributed across multiple devices is precisely the purpose of the distributed memory model used by the thread-based design in this article. Halo data on device edges is in this case passed to a unique device in a group of devices contributing to the same calculation. Results may then be shared throughout the many-device collection or simply collected into a master device for output to the screen; a minimal sketch of such a device-to-device link is given below.
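The following sketch anticipates the Socket/ServerSocket discussion that follows: a blocking halo exchange between two paired devices. The port handling, method names and float[] halo format are illustrative assumptions, not the paper's tested implementation.

    import java.io.DataInputStream;
    import java.io.DataOutputStream;
    import java.net.ServerSocket;
    import java.net.Socket;

    // Blocking halo exchange between two paired devices over TCP sockets.
    public class HaloLink {
        private final Socket socket;

        // One device in each pair listens...
        public static HaloLink asServer(int port) throws Exception {
            try (ServerSocket server = new ServerSocket(port)) {
                return new HaloLink(server.accept());  // blocks until the peer connects
            }
        }

        // ...and the other connects to it.
        public static HaloLink asClient(String host, int port) throws Exception {
            return new HaloLink(new Socket(host, port));
        }

        private HaloLink(Socket socket) {
            this.socket = socket;
        }

        /** Sends the local halo and receives the neighbour's; the blocking
         *  reads provide the implicit synchronisation noted in the text. */
        public float[] exchange(float[] localHalo) throws Exception {
            DataOutputStream out = new DataOutputStream(socket.getOutputStream());
            DataInputStream in = new DataInputStream(socket.getInputStream());
            out.writeInt(localHalo.length);
            for (float v : localHalo) out.writeFloat(v);
            out.flush();
            float[] remote = new float[in.readInt()];
            for (int i = 0; i < remote.length; i++) remote[i] = in.readFloat();
            return remote;
        }
    }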


Interactive elements, too, may be added to this model by either sharing input across the collection or accepting input only from a master device. Java offers the tools for the implementation of this approach in the Android API. The Socket class allows the construction of buffered, serialised data streams from one machine to another. Mobile devices invariably have a Wi-Fi adapter for connectivity, and peer-to-peer connections using Wi-Fi Direct are supported by Android 4+ (API level 14). Wi-Fi has greater range, bandwidth and speed than Bluetooth and is therefore the best available choice for direct wireless communication between Android mobile devices. The existing parallel frameworks may be modified to incorporate a second P2P communication step between a group of devices. One device in each pair will construct a ServerSocket and the other a client Socket, over which two-way communication can take place. This behaviour is illustrated in Fig. 14. Such socket protocols make blocking method calls, so that execution on one device will not continue until the communicating device has acknowledged the connection request. This provides an implicit means of synchronisation across the collection of devices. Performance increases offered by a multi-device arrangement will be offset by the additional overhead required to communicate between the distributed grids, with each device required to pass halo data to a fixed number of adjacent devices. The design of this message-passing protocol will be crucial in maintaining the scalability of this approach as more and more devices are linked, and concurrent message passing will be essential. This implementation is not tested here but is left for a future publication.

11. Conclusions

In this work, two parallel design patterns for performing grid-based flow simulation on an Android-powered mobile device have been presented. Implementations of the two designs in Java have been compared in terms of the update performance of a lattice-Boltzmann grid for a varying number of threads/tasks. The task-based design is simpler to implement and to synchronise using atomic data types. Furthermore, this design performs better than the thread-based implementation due to its shared memory configuration, where message packing, passing and unpacking is not required. It has also been demonstrated that the performance of the LBM kernel can be improved through use of the Android NDK; when implementing the LBM kernel in C++, application performance improved by approximately 20% when compared with the initial Java implementations. Although tests were limited to a specific Android device, general trends and conclusions are expected to hold considering the similar requirements of all mobile devices and operating systems.

Acknowledgements

This work was supported by the Engineering and Physical Sciences Research Council Impact Accelerator Account (grant number: EP/K503782/1).

References

[1] Ventola CL. Mobile devices and apps for health care professionals: uses and benefits. Pharm Ther 2014;39(5):356–64.

[2] Anderson M. Technology Device Ownership: 2015. Technical report. Pew Research Centre; 2015.
[3] Succi S. The lattice Boltzmann equation: for fluid dynamics and beyond. New York: Oxford University Press; 2001.
[4] Poushter J. Smartphone Ownership and Internet Usage Continues to Climb in Emerging Economies. Technical report. Pew Research Centre; 2015.
[5] Halpern M, Zhu Y, Reddi VJ. Mobile CPU's rise to power: quantifying the impact of generational mobile CPU design trends on performance, energy, and user satisfaction. In: Proceedings of the 2016 IEEE international symposium on high performance computer architecture (HPCA); 2016. p. 64–76.
[6] Iida Y, Hirabayashi M, Azumi T, Nishio N, Kato S. Connected smartphones and high-performance servers for remote object detection. In: Proceedings of the 2014 IEEE international conference on cyber-physical systems, networks, and applications (CPSNA); 2014. p. 71–6.
[7] Wang CX, Haider F, Gao X, You XH, Yang Y, Yuan D, et al. Cellular architecture and key technologies for 5G wireless communication networks. IEEE Commun Mag 2014;52(2):122–30.
[8] Patera AT, Urban K. High performance computing on smartphones. Snapshots Mod Math (MFO) 2016(6). doi:10.14760/SNAP-2016-006-EN.
[9] Bhatnagar PL, Gross EP, Krook M. A model for collision processes in gases. I. Small amplitude processes in charged and neutral one-component systems. Phys Rev 1954;94:511–25.
[10] Chen S, Doolen GD. Lattice Boltzmann method for fluid flows. Annu Rev Fluid Mech 1998;30(1):329–64.
[11] Schönherr M, Kucher K, Geier M, Stiebler M, Freudiger S, Krafczyk M. Multi-thread implementations of the lattice Boltzmann method on non-uniform grids for CPUs and GPUs. Comput Math Appl 2011;61(12):3730–43.
[12] Mawson MJ, Revell AJ. Memory transfer optimization for a lattice Boltzmann solver on Kepler architecture nVidia GPUs. Comput Phys Commun 2014;185(10):2566–74.
[13] Asinari P, Ohwada T, Chiavazzo E, Rienzo AFD. Link-wise artificial compressibility method. J Comput Phys 2012;231(15):5109–43.
[14] Oracle. Multithreaded Programming Guide. http://docs.oracle.com/cd/E19455-01/806-5257/index.html; 2016 [accessed 08.09.16].
[15] Google. Best Practices for Performance. Android Developers. https://developer.android.com/training/best-performance.html; 2016 [accessed 01.07.16].
[16] Gao C, Gutierrez A, Dreslinski RG, Mudge T, Flautner K, Blake G. A study of thread level parallelism on mobile devices. In: Proceedings of the 2014 IEEE international symposium on performance analysis of systems and software (ISPASS); 2014. p. 126–7.
[17] Gao C, Gutierrez A, Rajan M, Dreslinski RG, Mudge T, Wu CJ. A study of mobile device utilization. In: Proceedings of the 2015 IEEE international symposium on performance analysis of systems and software (ISPASS); 2015. p. 225–34.
[18] Gropp W, Lusk E, Doss N, Skjellum A. A high-performance, portable implementation of the MPI message passing interface standard. Parallel Comput 1996;22(6):789–828.
[19] Dagum L, Menon R. OpenMP: an industry-standard API for shared-memory programming. IEEE Comput Sci Eng 1998;5(1):46–55.
[20] Wenisch P, van Treeck C, Borrmann A, Rank E, Wenisch O. Computational steering on distributed systems: indoor comfort simulations as a case study of interactive CFD on supercomputers. Int J Parallel Emergent Distrib Syst 2007;22(4):275–91.
[21] Linxweiler J, Krafczyk M, Tölke J. Highly interactive computational steering for coupled 3D flow problems utilizing multiple GPUs. Comput Vis Sci 2010;13(7):299–314.
[22] Hassan H. An interactive fluid dynamics game on the iPhone. [Master's thesis]. Technische Universität München; 2009.
[23] Mawson M. Interactive fluid-structure interaction with many-core accelerators. [Ph.D. thesis]. School of Mechanical, Aerospace & Civil Engineering, The University of Manchester; 2013.
[24] Koliha N, Janßen CF, Rung T. Towards online visualization and interactive monitoring of real-time CFD simulations on commodity hardware. Computation 2015;3(3):444.
[25] Google. Project Tango. Google Developers. https://developers.google.com/tango/; 2016 [accessed 01.07.16].
[26] Rohde M, Kandhai D, Derksen JJ, van den Akker HEA. A generic, mass conservative local grid refinement technique for lattice-Boltzmann schemes. Int J Numer Methods Fluids 2006;51:439–68. doi:10.1002/fld.1140.
[27] Dawson M, Johnson G, Low A. Best practices for using the Java Native Interface. Technical report. IBM developerWorks; 2009. https://www.ibm.com/developerworks/library/j-jni/ [accessed 08.09.16].
[28] NVIDIA. NVIDIA CodeWorks for Android. https://developer.nvidia.com/codeworks-android [accessed 08.09.16].
