Dynamic self-assembling petaflop-scale clusters: Your mobile device may find a cure for cancer
Mohammad Samarah
Rabeet Fatmi
Health Informatics and Big Data Lab, Florida Polytechnic University, Lakeland, Florida, +1 (863) 874-8586
Health Informatics and Big Data Lab, Florida Polytechnic University, Lakeland, Florida, +1 (954) 496-5962
[email protected]
[email protected]
ABSTRACT
High Performance Computing (HPC) has been studied and used in the scientific community for decades; the Message Passing Interface (MPI) was first introduced in 1992 [1]. Similarly, commercial businesses have relied on High Throughput Computing (HTC) for the past two decades [2]. MapReduce platforms became popular with the advent of Very Large Databases (VLDBs) and Big Data. We are now seeing a convergence between HPC and HTC that provides faster and cheaper parallel computation. The emergence of MPI as a scalable and viable parallel platform, along with the acceptance of MapReduce for tackling large data sets, opens the door to a host of new applications, particularly in biomedical, public health, scientific, and health informatics research. This convergence is making it possible for every device to be a parallel node. In this paper we explore this convergence and a method for creating dynamic self-assembling clusters using commodity hardware and mobile devices.
Keywords
HPC; HTC; MPI; Self-assembling clusters.
1. INTRODUCTION
We live in an era of data overload. Digital data is expected to double every three years [3], resulting in challenges that require massive amounts of data processing at ultra-high speeds. Big Data, a term which started percolating in the mid-1990s [4], refers to data that needs specialized tools and frameworks to be processed, where traditional processing applications fall short. It is therefore safe to say that we have reached a stage where a single CPU is no longer capable of solving such large-scale and complex problems. This has fueled the need for systems that can tackle these computational challenges effectively, which is often achieved through the use of High Performance Computing (HPC) systems containing thousands to millions of processors, commonly called supercomputers. These dedicated computing facilities use high-speed interconnects to reliably exchange information between processors and maintain data integrity. However, there is a subset of problems that can be solved in parallel by recognizing that their limiting factor is throughput rather than processing power: the so-called pleasingly parallel problems [5]. For example, if a scientific experiment needs to be performed repeatedly with Monte Carlo data to achieve statistical significance, then raw computational power matters less than the number of computational resources that can be brought to bear. In such cases the job is decomposed into completely independent sub-jobs which can run in any arbitrary order (a minimal sketch of such a decomposition appears at the end of this section). Systems that enable such workloads are called High Throughput Computing (HTC) systems; examples include HTCondor [6] and BOINC [7]. HTC jobs are often run on
non-dedicated or shared resources, possibly at a low priority, when the dedicated task is not being performed by the computer. This achieves parallelism by exploiting resources purchased for another purpose at little to no cost. The downsides of such a system are that the work may not be performed as rapidly as on dedicated resources and that job makespan (the time elapsed from job submission to final results becoming available) increases, as jobs can be interrupted. A well-known application of HTC is the Folding@home project [8] [9], which takes advantage of the unused computing power of personal devices to perform complicated calculations. An example of volunteer computing, Folding@home performs the work on computers recruited from people on the internet, as opposed to machines under the control of the project [10]. The idea is to simulate protein folding and other biomolecular phenomena to discover new patterns in the molecular dynamics of the human body and potentially find cures for related diseases such as cancer or Alzheimer's disease. This distributed architecture allows it to operate in the range of 101 PetaFLOPS sustained [11], providing far more computing power than could be achieved locally given cost, storage, and electrical/cooling constraints. Similarly, SETI@home is another HTC project, whose goal is to detect signs of extraterrestrial technology by listening for and analyzing narrow-bandwidth radio signals from space [12]. Increasing the computing power allows the project to cover greater frequency ranges in hopes of finding intelligent life outside Earth. Participants are asked to download the provided application, which performs the related computations independently and sends the results back to the SETI@home servers. In this study, we introduce the reader to HPC and HTC clusters in sections 2 and 3 respectively and review the supported platforms. In section 4 we analyze performance metrics and statistics from the two aforementioned HTC projects and run simulations showing how a similar cluster can be built using mobile devices as well as commodity hardware. We conclude the paper with a view on the future of parallelism and what can be expected from the HPC/HTC community within the next 5-10 years.
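As noted above, a pleasingly parallel job can be decomposed into sub-jobs that share nothing. The sketch below is our own illustration (not code from the cited projects): each Monte Carlo sub-job depends only on a task id that seeds a private random stream, so sub-jobs can run in any order on any machine and their partial counts can simply be summed afterwards.

```c
/* Illustrative sketch (our own, not code from the cited projects): a
 * "pleasingly parallel" Monte Carlo job split into independent sub-jobs. */
#include <stdio.h>

#define TRIALS_PER_SUBJOB 1000000L

/* Small private linear congruential generator: no shared state between sub-jobs. */
static double next_uniform(unsigned long long *state)
{
    *state = *state * 6364136223846793005ULL + 1442695040888963407ULL;
    return (double)(*state >> 11) / (double)(1ULL << 53);
}

/* One independent sub-job: count random points that fall inside the unit circle. */
static long run_subjob(int task_id)
{
    unsigned long long rng = 0x9E3779B97F4A7C15ULL * (unsigned long long)(task_id + 1);
    long hits = 0;
    for (long i = 0; i < TRIALS_PER_SUBJOB; i++) {
        double x = next_uniform(&rng);
        double y = next_uniform(&rng);
        if (x * x + y * y <= 1.0)
            hits++;
    }
    return hits;
}

int main(void)
{
    const int subjobs = 8;  /* in an HTC system these would be scattered across volunteer nodes */
    long total_hits = 0;

    /* Run (or farm out) the sub-jobs in any order; only the sum of results matters. */
    for (int t = subjobs - 1; t >= 0; t--)
        total_hits += run_subjob(t);

    double pi = 4.0 * (double)total_hits / (double)(subjobs * TRIALS_PER_SUBJOB);
    printf("pi estimate from %d independent sub-jobs: %f\n", subjobs, pi);
    return 0;
}
```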
2. THE STATE OF HPC
HPC jobs can be defined as tightly coupled parallel jobs that need constant communication to attain maximum performance. The environment is built around dedicated computing nodes with high-performance processors and memory connected through a low-latency network. HPC has proven its value in practically every field, including science and engineering, business, education, and mission-critical applications. Although we are now seeing a paradigm shift from HPC solutions to HTC solutions,
HPC still holds its place for problems that require massive parallelism.
2.1 Emergent parallel platforms
This section covers some of the emergent platforms and programming models designed specifically for HPC needs. There have been many advancements in the HPC space, from shared-memory solutions such as OpenMP to robust distributed-memory solutions such as MPI. We discuss the pure parallel programming models that dominate the HPC space. Generally, parallel platforms are divided into three broad categories: threads, shared memory, and message passing [13].
2.1.1 Threads
A thread is an independent sub-process spawned within the lifecycle of an existing process, having its own program counter and execution stack [14]. Threads differ from processes in that they do not require the allocation of dedicated memory and resources, as threads are expected to perform computations on shared memory. Threads generally rely on a management component, called a scheduler or arbiter, to maintain their lifecycle and manage their creation, deletion, and recycling. This low-level design leaves much of the decision making to the programmer, so it is possible to achieve highly specialized parallelism using threads. The approach, however, has limitations. A multithreaded application is expected to handle synchronization effectively, which includes proper handling of race conditions and deadlocks. Moreover, writing a fully parallel program using threads is often cumbersome and therefore requires a high level of proficiency.
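As a minimal sketch of the low-level control and synchronization burden described above (our own illustration; the counter and thread counts are arbitrary), the POSIX threads example below protects a shared counter with a mutex so that concurrent increments do not race.

```c
/* Minimal POSIX threads sketch: a shared counter protected by a mutex.
 * Compile with: gcc -pthread counter.c */
#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4
#define INCREMENTS  1000000

static long counter = 0;                                  /* shared state */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;  /* guards counter */

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < INCREMENTS; i++) {
        pthread_mutex_lock(&lock);    /* without this, updates race and are lost */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t threads[NUM_THREADS];
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_create(&threads[i], NULL, worker, NULL);
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);
    printf("counter = %ld (expected %d)\n", counter, NUM_THREADS * INCREMENTS);
    return 0;
}
```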
2.1.2 Shared Memory (OpenMP)
Although shared memory is also a multithreaded model, it works at a higher level of abstraction than raw threads. In a shared-memory system every processor has access to the memory of every other processor, which allows developers to build parallelized software. The architecture also allows parts of the memory to be set as private to a processor to achieve pure parallelism. This approach differs from distributed memory in that inter-process communication does not require message passing; nevertheless, attempts have been made to combine shared memory and message passing to create fully parallel multi-core computer clusters [15]. OpenMP is an industry-standard application programming interface for shared-memory programming [16]. It allows programmers to parallelize C, C++, or Fortran code through compiler directives and runtime library calls that manage the underlying thread pool. This highly structured multithreading architecture is designed specifically to support the HPC space.
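A minimal OpenMP sketch (our own illustration, not tied to any cited system) is shown below: a single directive parallelizes a loop over shared arrays, and a reduction clause combines each thread's private partial sum.

```c
/* Minimal OpenMP sketch: a parallel loop over shared arrays with a reduction.
 * Compile with: gcc -fopenmp dot.c */
#include <omp.h>
#include <stdio.h>

#define N 1000000

static double a[N], b[N];   /* shared memory visible to every thread */

int main(void)
{
    double dot = 0.0;

    for (long i = 0; i < N; i++) {   /* initialize the shared arrays */
        a[i] = 0.5;
        b[i] = 2.0;
    }

    /* Threads share a[] and b[]; "dot" is a reduction variable, so each thread
     * accumulates privately and OpenMP combines the results at the end. */
    #pragma omp parallel for reduction(+:dot)
    for (long i = 0; i < N; i++)
        dot += a[i] * b[i];

    printf("dot = %.1f computed with up to %d threads\n", dot, omp_get_max_threads());
    return 0;
}
```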
2.1.3 Message Passing
Message passing is a parallel programming paradigm in which communication between processors is done by exchanging messages. Multiple processors in a computer cluster therefore use message passing to coordinate with each other. Message passing is needed where sharing variables is undesirable, or impossible, as in a distributed-memory architecture. Message-passing operations are supported by systems such as MPI [17] and PVM [18]; over time, however, MPI has risen to become the de facto standard for message-passing architectures. MPI is a standardized interface that enables message passing between processors. It was defined by a group of about 60 people from 40 different organizations in the US and Europe [17]. Many implementations of MPI exist, but MPI is not an implementation by itself; it is a standard-by-consensus for message-passing libraries. The fundamentals of MPI, according to [29], are as follows:
MPI is a library, not a language. It specifies the names, calling sequences, and results of subroutines to be called from Fortran programs, the functions to be called from C programs, and the classes and methods that make up the MPI C++ library. MPI is a specification, not a particular implementation. MPI addresses the message-passing model.
The MPICH library [19] is one of the most popular and freely available implementations of MPI. MPICH provides support for commodity clusters and high-speed networks as well as high-end supercomputers, and it can be extended to support more specialized MPI-derived frameworks. Implementations of MPI have also been ported to Bluetooth-enabled devices. MMPI [20] is one such example, allowing communication between nodes (smartphones) over a Bluetooth network. This enables developers to write parallel code that is specifically designed to run on mobile phones.
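The sketch below (our own minimal example, which can be compiled against any MPI implementation such as MPICH) shows the message-passing style in practice: each rank computes a partial sum over its own slice of the data, and the only coordination is an explicit collective message exchange.

```c
/* Minimal MPI sketch: partial sums combined by explicit message passing.
 * With MPICH, for example:
 *   mpicc partial_sum.c -o partial_sum && mpiexec -n 4 ./partial_sum */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank = 0, size = 1;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's id */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */

    /* Each rank sums a disjoint subset of 0..999999: no shared variables. */
    long long local = 0, global = 0;
    for (long long i = rank; i < 1000000; i += size)
        local += i;

    /* The only coordination: partial sums are sent to rank 0 and combined. */
    MPI_Reduce(&local, &global, 1, MPI_LONG_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum of 0..999999 = %lld computed by %d ranks\n", global, size);

    MPI_Finalize();
    return 0;
}
```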
2.2 What's next
Global competition among industrialized nations to build bigger and faster HPC clusters will continue. The US, EU nations, and China will lead this race; however, the most important development on the near horizon is the ability of small and medium organizations to build their own HPC clusters using commodity hardware and consumer-grade devices. The advances made by big government and large research universities will translate into better, faster, and cheaper components for smaller players. In addition, we envision small businesses and families being able to harness the power of a personal HPC cluster to perform analytics in personal finance, fitness, and health, and to donate their cluster's computation power to their community.
3. THE STATE OF HTC
Strictly speaking, HTC is a specialized classification of High Performance Computing; in an HTC system, however, related tasks are run on networked processors without synchronization. This technique is of particular importance when computing resources are required over a long period of time and constant communication between processors is infeasible and/or expensive. It is therefore very prevalent in the scientific and engineering research community, where experiments or simulations run over long timeframes. Businesses have also recently started to harness the power of HTC by exploiting the 'idle' or unused resources of their existing clusters, allowing a much larger pool of nodes than typical HPC systems [21].
3.1 What's next
We will continue to see more uses for HTC, particularly in long-running research projects in medicine, public health, and human behavior. We expect the industry to converge on a standard protocol by which mobile devices and consumer-grade machines can perform computations and send results back to an HTC cluster in a uniform way. This will pave the way to ever-increasing computational power and sustained performance that rivals the fastest supercomputers available. This development will indeed make every device capable of being a parallel node.
4. PERFORMANCE SIMULATIONS
We use the iPhone 4 and the Samsung Galaxy S4 as reference devices, given that in practice the computing performance of mobile devices in use globally varies widely. This provides a fair yet representative sampling of aggregate mobile-device computational power. The iPhone 4's reference performance is measured at 1.6 GFLOPS, while the Samsung Galaxy S4's reference performance is measured at 3.086 GFLOPS [27].
In this section, we look at CPU performance data from the Folding@home [9] and SETI@home [22] projects and then design four simulations to determine whether a petaflops-scale computation platform is possible using consumer-grade mobile devices and commodity Linux hardware. We examined the SETI@home project dataset and analyzed participants' CPU performance data, shown in Table 1. There are 994 distinct Intel processors in use, ranging from the Intel Xeon CPU E5-2699 v3 at 2.30 GHz with 64 cores at 2.85 GFLOPS each, for a total of 183.39 GFLOPS, down to the Intel Pentium 4 CPU with a single core at 0.76 GFLOPS. In contrast, there are 300 different AMD processors, ranging from the AMD FX-9590 A6 processor with 8 cores at 3.49 GFLOPS per core, for a total of 27.92 GFLOPS, down to the AMD Sempron mobile processor 2100+ with a single core at 1.02 GFLOPS. In addition, participants with ARM-based devices used 9 distinct ARM processors, ranging from the ARMv7 processor rev 1 with 4 cores at 1.59 GFLOPS per core, for a total of 6.49 GFLOPS per device, down to the ARMv7 processor rev 10 with two cores at 0.84 GFLOPS per core, for a total of 1.64 GFLOPS per device. Table 1 shows, for each processor family, the number of CPU types, along with the average number of participants per CPU type, number of cores per CPU, GFLOPS rating per CPU core, and total GFLOPS rating per machine.
Table 1. SETI@home CPU Dataset

Processor Family   CPU Types   Participants/CPU type   Cores/CPU   GFLOPS/CPU core   GFLOPS/Machine
Intel              994         100                     5.29        2.83              16.41
AMD                300         60                      3.4         2.42              8.57
ARM                9           484                     1.03        3.6               3.82
Total              1303        96                      4.78        2.7               14.23

Looking at the progress made by the Folding@home project, it is able to take advantage of Linux, Windows, Mac OS, Android, and Sony PlayStation 3 (PS3) consoles to create a cluster achieving 101 PetaFLOPS. Folding@home has more than 15 million PS3 users donating more than 100 million computational hours. The project expects to attain performance on the 20 gigaflops scale per computer; with about 50,000 machines, it is then able to achieve performance on the petaflops scale [23]. Sony and other companies have supported the Folding@home project and built software into their systems and applications to support it natively: PS3 system version 1.6 or later has a Folding@home icon in the PS3 menu [23], and Google's Chrome browser has built a plug-in for Folding@home that uses Google's Native Client, a sandbox for running compiled C and C++ code in the browser [24] [25]. We conducted four simulations to show that a petaflops-scale supercomputer cluster is achievable using both pure mobile devices and a blend of commodity Linux hardware and mobile devices. Simulation metrics and setup are shown in Table 2. There are two main mobile device platforms, namely Android with more than 85% of the market share and Apple's iPhone platform with 10% of the market share, while other platforms occupy less than 1% of the market, based on data from IDC for Q4 2016 [26].

Table 2. Simulation Metrics and Setup

Metric                                              Value
iPhone market share                                 0.117
Android market share                                0.876
iPhone 4 device class GFLOPS rating                 1.6
Samsung S4 device class GFLOPS rating               3.086
Intel x86 market share                              0.828
AMD market share                                    0.171
Intel Xeon CPU E5-1650 v2 12 cores GFLOPS rating    54.72
AMD FX-9590 A6 processor 8 cores GFLOPS rating      27.92
Intel Core i3-4370 CPU 3.80GHz GFLOPS rating        16.96
AMD Phenom II X4 970 processor GFLOPS rating        13.64
4.1 Simulation 1: 500 TFLOPS cluster using iPhone and Android devices
In this simulation, we combine iPhone 4 class devices and Samsung Galaxy S4 class devices to create a distributed cluster, given an 11.70% and 87.60% distribution [24] for iPhone 4 and Samsung Galaxy S4 devices respectively. The cluster is capable of achieving 3 TFLOPS with only 1,000 devices. At 10,000 devices it achieves about 30 TFLOPS, and at 100,000 devices close to 300 TFLOPS. Finally, the cluster is able to achieve more than 500 TFLOPS with only 200,000 devices. Results: 200,000 mobile devices achieve 1/2 petaflops performance. Given the performance boost of newer mobile devices, we would expect the number of devices needed to reach this performance mark to be much lower. As Table 3 shows, the drive for more GFLOPS and the race among processor manufacturers to produce faster CPUs and GPUs are enabling 50 GFLOPS smartphones and 1 TFLOPS gaming consoles. By 2020, most mobile devices are expected to be capable of a 20 GFLOPS or higher rating using CPU and GPU combinations, which would allow petaflop-scale performance with only 50,000 devices.
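The aggregation arithmetic behind this simulation can be reproduced in a few lines; the sketch below (our own illustration) uses the market-share and per-device GFLOPS figures from Table 2.

```c
/* Sketch of the Simulation 1 arithmetic using the Table 2 figures:
 * an 11.7%/87.6% mix of iPhone 4 class (1.6 GFLOPS) and Samsung Galaxy S4
 * class (3.086 GFLOPS) devices aggregated into cluster-level TFLOPS. */
#include <stdio.h>

int main(void)
{
    const double iphone_share  = 0.117, iphone_gflops  = 1.6;
    const double android_share = 0.876, android_gflops = 3.086;

    /* Expected contribution of one device drawn from this mix (~2.89 GFLOPS). */
    double gflops_per_device =
        iphone_share * iphone_gflops + android_share * android_gflops;

    int sizes[] = { 1000, 10000, 100000, 200000 };
    for (int i = 0; i < 4; i++) {
        double tflops = sizes[i] * gflops_per_device / 1000.0;
        printf("%7d devices -> about %6.0f TFLOPS\n", sizes[i], tflops);
    }
    return 0;
}
```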
4.2 Simulation 2: 500 TFLOPS cluster using commodity Linux hardware
In this simulation, we use commodity Linux hardware with Intel and AMD processors, given a market distribution of more than 80% for Intel processors and about 17% for AMD processors [28]. We used both low-powered and high-powered Linux virtual machines (VMs) to reach the target performance measure. Table 3 shows a sampling of commodity processors and their usage. For high-powered VMs we used the Intel Xeon CPU E5-1650 v2 with 12 cores at 4.56 GFLOPS per core and the AMD FX-9590 A6 processor with 8 cores at 3.49 GFLOPS per core. For low-powered VMs, we used the Intel Core i3-4370 CPU at 3.80 GHz with 4 cores, each capable of 4.24 GFLOPS, and the AMD Phenom II X4 970 CPU with 4 cores at 3.41 GFLOPS per core. The cluster with low-powered VMs is able to achieve 20 TFLOPS with 1,000 VMs and more than 150 TFLOPS with 10,000 VMs; at 30,000 VMs it is able to break the 500 TFLOPS performance measure. Using high-powered VMs, 1,000 VMs achieve more than 50 TFLOPS, 5,000 VMs are capable of 1/4 petaflops performance, and the cluster achieves 1/2 petaflops performance with 10,000 VMs. Results: 30,000 low-powered VMs or 10,000 high-powered VMs achieve 1/2 petaflops performance. Using a mix of high-powered CPUs along with GPUs can bring the number of VMs needed much lower than 10,000; we would expect a gain of 30% to 50% in performance, bringing the total VMs needed to 5,000.
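As with Simulation 1, the arithmetic can be checked directly; the sketch below (our own illustration) weights the per-VM GFLOPS of the processors above by the Intel/AMD market shares from Table 2 and derives the number of VMs needed to reach 500 TFLOPS.

```c
/* Sketch of the Simulation 2 arithmetic: low- and high-powered Linux VMs
 * weighted by the Intel/AMD market shares from Table 2. */
#include <stdio.h>

int main(void)
{
    const double intel_share = 0.828, amd_share = 0.171;

    /* Per-VM GFLOPS = cores x GFLOPS/core for the processors named above. */
    double low_vm  = intel_share * (4 * 4.24)  + amd_share * (4 * 3.41);  /* i3-4370 vs Phenom II X4 970 */
    double high_vm = intel_share * (12 * 4.56) + amd_share * (8 * 3.49);  /* Xeon E5-1650 v2 vs FX-9590  */

    const double target_tflops = 500.0;
    printf("low-powered VM  ~ %4.1f GFLOPS -> ~%.0f VMs for %.0f TFLOPS\n",
           low_vm,  target_tflops * 1000.0 / low_vm,  target_tflops);
    printf("high-powered VM ~ %4.1f GFLOPS -> ~%.0f VMs for %.0f TFLOPS\n",
           high_vm, target_tflops * 1000.0 / high_vm, target_tflops);
    return 0;
}
```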
Table 3. Seeking More GFLOPS

Device                      GFLOPS                 CPU Speed
1985 Cray-2 Supercomputer   1.9 GFLOPS             244 MHz
iPhone 4                    1.6 GFLOPS             800 MHz
PlayStation 2               6.2 GFLOPS (GPU)       0.3 GHz single-core
Apple Watch                 3 GFLOPS               1 GHz dual-core
Samsung Galaxy S6           34.8 GFLOPS            2.1 GHz quad-core
Samsung Galaxy S4           3.086 GFLOPS           1.9 GHz quad-core
iPhone 7                    46.7 GFLOPS            2.33 GHz single-core
iPhone 7                    87.4 GFLOPS            2.33 GHz quad-core
PC                          1.2 TFLOPS             Varies
Samsung Galaxy S7           407 GFLOPS             2.3 GHz octa-core
Xbox One                    1.31 TFLOPS            1.75 GHz octa-core
PlayStation 4               1.84 TFLOPS (GPU)      1.6 GHz octa-core
Tianhe-2 Supercomputer      33.86 PFLOPS           2.2 GHz 12-core, 32,000 Intel Xeon (3.12M cores)
4.3 Simulation 3: Combine a cluster of iPhone/Android phones and commodity Linux hardware to achieve a 500 TFLOPS cluster
In this simulation, we combine iPhone and Android devices with commodity Linux hardware, given the large market share enjoyed by Android devices (more than 85%) and Intel's large share of desktop and server processors (more than 80%). We simulate a cluster with four types of devices and VMs: iPhone 4 iOS class devices, Samsung Galaxy S4 Android class devices, Intel x86 multi-core VMs, and AMD multi-core VMs. 1,000 iPhone and Android devices are capable of 1 TFLOPS performance; 10,000 and 100,000 devices achieve about 30 and 300 TFLOPS respectively, and at 200,000 devices the cluster achieves 500 TFLOPS. A small number of commodity Linux machines is needed to manage the cluster: 25 Linux VMs provide more than 1 TFLOPS of performance, while 100 VMs provide up to 5 TFLOPS. This allows 0.006% to 0.025% of the computation power to be used to manage each device. We model the management of the cluster with two functions, Cl(d) and Ch(d), as shown in equations (1) and (2), where n is the total number of devices in the cluster, tfp is the total flops performance, and r is the 0.1 reserved computation coefficient. Cl(d) is the computation power used per device under a low threshold, while Ch(d) models the computation power used per device under a high threshold. The low threshold accounts for 6 megaflops per device, while the high threshold accounts for as much as 25 megaflops per device; 10% of the computational power is reserved.

Cl(d) = 0.006 * (n - r*tfp)     (1)
Ch(d) = 0.025 * (n - r*tfp)     (2)
Results: A small cluster of commodity Linux servers with 25 to 100 VMs is able to manage and control 200,000 mobile devices to achieve 1/2 petaflops performance. With newer, higher-powered mobile devices, the total number of devices needed to reach the same performance mark is expected to be much lower. By 2020, most mobile devices are expected to achieve a 20 GFLOPS rating, which would bring the number of devices needed down to roughly one quarter of the number used in this simulation.
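The management overhead described above can be checked with a short calculation; the sketch below (our own illustration) sums the stated 6 to 25 megaflops of per-device management cost over the 200,000-device cluster and compares it against the 25 to 100 control VMs.

```c
/* Sketch of the Simulation 3 management arithmetic: 6 to 25 MFLOPS of
 * management overhead per device over a 200,000-device cluster. */
#include <stdio.h>

int main(void)
{
    const double devices = 200000;
    const double low_mflops_per_device  = 6.0;    /* low management threshold  */
    const double high_mflops_per_device = 25.0;   /* high management threshold */

    /* Total management budget, converted from MFLOPS to TFLOPS. */
    double low_tflops  = devices * low_mflops_per_device  / 1.0e6;
    double high_tflops = devices * high_mflops_per_device / 1.0e6;

    printf("management budget: %.1f to %.1f TFLOPS\n", low_tflops, high_tflops);
    printf("consistent with 25 VMs (>1 TFLOPS) up to 100 VMs (~5 TFLOPS)\n");
    return 0;
}
```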
4.4 Simulation 4: Combine a cluster of iPhone/Android phones and commodity Linux hardware to achieve a 2 PFLOPS cluster
In this simulation, we use a large cluster of iPhone and Android devices, with a small number of commodity Linux machines used to manage the cluster. 100,000 and 400,000 devices achieve 1/4 and over 1 petaflops performance respectively; at 500,000 and 600,000 devices the cluster produces 1.5 and 1.7 petaflops, and 700,000 devices break the 2 petaflops performance mark. Similar to Simulation 3, a small number of Linux VMs is capable of managing the cluster, with 6 to 25 megaflops used to manage each device. Results: 700,000 mobile devices can achieve two petaflops performance with 100 commodity Linux VMs used for management and control. As in the previous simulation, newer high-powered mobile devices are expected to reach the 2 PFLOPS mark with far fewer devices in the cluster. The iPhone 7, with its 2.34 GHz A10 Fusion (2 processors), achieves a single-core SGEMM benchmark of 46.7 GFLOPS and an SFFT of 6.85 GFLOPS; for multi-core, the A10 achieves 87.4 GFLOPS SGEMM and 12.6 GFLOPS SFFT [26]. At a 20 GFLOPS rating, only 100,000 devices are needed to achieve the 2 petaflops performance mark, as shown in Figure 1.
Figure 1. Petaflops achieved using commodity and iPhone 7 devices
5. CONCLUSION
The future of parallelism is bright given the sustained progress in CPU architectures, wireless network performance, and parallel programming models. We expect to see 20 gigaflops class mobile devices available as a reference platform by 2020. For tablets, laptops, and desktop machines we expect a 10x multiple, with reference low-end machines delivering 200 gigaflops of performance within the same period. The increased performance of mobile devices and consumer-grade machines allows the creation of dynamic self-assembling clusters capable of one petaflops performance with only 50,000 devices; similarly, using hybrid mobile machines, the same cluster can be built with only 5,000 machines. This also gives rise to personal clusters, where a family is able to aggregate the computing power of all of its devices to perform certain computations or donate it to other projects in medicine and public health informatics. As we have seen with Stanford's Folding@home project, mobile devices are already assisting in finding a cure for cancer and other diseases. With a converged, standard protocol for HTC clients and the continued increase in device performance, we expect every mobile device to be capable of being a parallel node in the near future.
6. REFERENCES
[1] Message Passing Interface Forum. "MPI: a message-passing interface standard version 1.3". MPI-Forum, 2008. Available at http://www.mpi-forum.org.
[2] S. Chaudhry, P. Caprioli, S. Yip and M. Tremblay. "High-performance throughput computing," IEEE Micro, vol. 25, no. 3, pp. 32-45, May-June 2005. doi: 10.1109/MM.2005.49.
[3] J. Gantz and D. Reinsel. "The digital universe in 2020: Big data, bigger digital shadows, and biggest growth in the far east." IDC iView: IDC Analyze the Future 2007 (2012): 1-16.
[4] F. Diebold. "A Personal Perspective on the Origin(s) and Development of 'Big Data': The Phenomenon, the Term, and the Discipline", Second Version. 2012.
[5] A. S. McGough and M. Forshaw. "Energy-aware simulation of workflow execution in High Throughput Computing systems," IEEE/ACM 19th International Symposium on Distributed Simulation and Real Time Applications (DS-RT), Oct 2015. doi: 10.1109/DSRT.2015.31.
[6] T. Tannenbaum, D. Wright, K. Miller, and M. Livny, "Condor: a distributed job scheduler," in Beowulf Cluster Computing with Linux. MIT Press, 2001, pp. 307-350.
[7] D. P. Anderson, "BOINC: A system for public-resource computing and storage," in Grid Computing, 2004. IEEE, 2004, pp. 4-10.
[8] M. R. Shirts and V. S. Pande, "Screensavers of the world unite!", Science 290:1903-1904 (2000).
[9] S. M. Larson, C. Snow and V. S. Pande, "Folding@home and Genome@Home: Using distributed computing to tackle previously intractable problems in computational biology," Modern Methods in Computational Biology, R. Grant, ed., Horizon Press (2003).
[10] J. Diaz, C. Muñoz-Caro and A. Niño, "A Survey of Parallel Programming Models and Tools in the Multi and Many-Core Era," IEEE Transactions on Parallel and Distributed Systems, vol. 23, no. 8, pp. 1369-1386, Aug. 2012. doi: 10.1109/TPDS.2011.308.
[11] Folding@home, "Folding@home Team & Stats," [Online]. Available: https://folding.stanford.edu.
[12] B. Javadi, D. Kondo, J. M. Vincent and D. P. Anderson, "Discovering Statistical Models of Availability in Large Distributed Systems: An Empirical Study of SETI@home," IEEE Transactions on Parallel and Distributed Systems, vol. 22, no. 11, pp. 1896-1903, Nov. 2011. doi: 10.1109/TPDS.2011.50.
[13] J. Diaz, C. Muñoz-Caro and A. Niño, "A Survey of Parallel Programming Models and Tools in the Multi and Many-Core Era," IEEE Transactions on Parallel and Distributed Systems, vol. 23, no. 8, pp. 1369-1386, Aug. 2012. doi: 10.1109/TPDS.2011.308.
[14] G. R. Andrews. "Foundations of Multithreaded, Parallel, and Distributed Programming." Addison Wesley, 1999.
[15] A. Basumallik, S. J. Min and R. Eigenmann, "Programming Distributed Memory Systems Using OpenMP," 2007 IEEE International Parallel and Distributed Processing Symposium, Long Beach, CA, 2007, pp. 1-8. doi: 10.1109/IPDPS.2007.370397.
[16] L. Dagum and R. Menon, "OpenMP: an industry standard API for shared-memory programming," IEEE Computational Science and Engineering, vol. 5, no. 1, pp. 46-55, Jan-Mar 1998. doi: 10.1109/99.660313.
[17] S. J. Hussain and G. Ahmed. "A comparative study and analysis of PVM and MPI for parallel and distributed systems". IEEE, 2005, pp. 183-187.
[18] G. M. Fadlallah, M. Lavoie and L.-A. Dessaint. "Parallel computing environments and methods", IEEE, 2000, pp. 2-7.
[19] MPICH, "MPICH," [Online]. Available: http://www.mpich.org/.
[20] F. Fu, S. Sun, X. Hu, J. Song, J. Wang and M. Yu, "MMPI: A flexible and efficient multiprocessor message passing interface for NoC-based MPSoC," 23rd IEEE International SOC Conference, Las Vegas, NV, 2010, pp. 359-362. doi: 10.1109/SOCC.2010.5784695.
[21] C. Hamilton, "Enterprise-Wide Clustering: High Throughput Computing For Business."
[22] B. Javadi, D. Kondo, J. M. Vincent and D. P. Anderson, "Discovering Statistical Models of Availability in Large Distributed Systems: An Empirical Study of SETI@home," IEEE Transactions on Parallel and Distributed Systems, vol. 22, no. 11, pp. 1896-1903, Nov. 2011. doi: 10.1109/TPDS.2011.50.
[23] J. Dongarra, I. Foster, G. Fox, W. Gropp, K. Kennedy, L. Torczon, and A. White. "The Sourcebook of Parallel Computing." Morgan Kaufmann Publishers, 2003.
[24] Folding@home, "Folding@chrome – folding with just your browser," [Online]. Available: http://folding.stanford.edu/home/foldingchromefoldinhome-using-google-portable-native-client-technology.
[25] Chrome, "Google Chrome Native Client," [Online]. Available: https://developer.chrome.com/native-client.
[26] IDC, "IDC Smartphone OS Market Share," [Online]. Available: http://idc.com/prodserv/smartphone-os-market-share.jsp.
[27] Primate Labs, "Geekbench 4.0.1 iOS AArch64," [Online]. Available: http://browser.primatelabs.com/v4/cpu/387526.
[28] CPU Benchmark, "AMD vs. Intel Market Share," [Online]. Available: http://cpubenchmark.net/market_share.html.
[29] W. Gropp, E. Lusk and A. Skjellum. "Using MPI: Portable Parallel Programming with the Message-Passing Interface". London: MIT Press, 1997.
S. M. Larson, C. Snow, V. S. Pande, “Folding@home and Genome@Home: Using distributed computing to tackle previously intractable problems in computational biology,” Modern Methods in Computational Biology, R. Grant, ed, Horizon Press (2003). J. Diaz, C. Muñoz-Caro and A. Niño, "A Survey of Parallel Programming Models and Tools in the Multi and Many-Core Era," in IEEE Transactions on Parallel and Distributed Systems, vol. 23, no. 8, pp. 1369-1386, Aug. 2012. doi: 10.1109/TPDS.2011.308A. Folding@home, "Folding@home Team & Stats," [Online]. Available: https://folding.stanford.edu. B. Javadi, D. Kondo, J. M. Vincent and D. P. Anderson, "Discovering Statistical Models of Availability in Large Distributed Systems: An Empirical Study of SETI@home," in IEEE Transactions on Parallel and Distributed Systems, vol. 22, no. 11, pp. 1896-1903, Nov. 2011. doi: 10.1109/TPDS.2011.50. J. Diaz, C. Muñoz-Caro and A. Niño, "A Survey of Parallel Programming Models and Tools in the Multi and Many-Core Era," in IEEE Transactions on Parallel and Distributed Systems, vol. 23, no. 8, pp. 1369-1386, Aug. 2012. doi: 10.1109/TPDS.2011.308A. G.R. Andrews. “Foundations of Multithread, Parallel, and Distribtued Programming.” Addison Wesley, 1999. A. Basumallik, S. J. Min and R. Eigenmann, "Programming Distributed Memory Sytems Using OpenMP," 2007 IEEE International Parallel and Distributed Processing Symposium, Long Beach, CA, 2007, pp. 1-8. doi: 10.1109/IPDPS.2007.370397. L. Dagum and R. Menon, "OpenMP: an industry standard API for shared-memory programming," in IEEE Computational Science and Engineering, vol. 5, no. 1, pp. 46-55, Jan-Mar 1998.doi: 10.1109/99.660313. S. J. Hussain and G. Ahmed. “A comparative study and analysis of PVM and MPI for parallel and distributed systems”. IEEE, 2005, p. 183-187. Fadlallah, G.M. Lavoie and L-A.Dessaint. “Parallel computing environments and methods”, IEEE, 2000. p. 2-7. MPICH, “MPICH,” [Online]. Available: http://www.mpich.org/. F. Fu, S. Sun, X. Hu, J. Song, J. Wang and M. Yu, "MMPI: A flexible and efficient multiprocessor message passing interface for NoC-based MPSoC," 23rd IEEE International SOC Conference, Las Vegas, NV, 2010, pp. 359-362. doi: 10.1109/SOCC.2010.5784695. C. Hamilton, “Enterprise-Wide Clustering: High Throughput Computing For Business.” B. Javadi, D. Kondo, J. M. Vincent and D. P. Anderson, "Discovering Statistical Models of Availability in Large Distributed Systems: An Empirical Study of SETI@home," in IEEE Transactions on Parallel and Distributed Systems, vol. 22, no. 11, pp. 1896-1903, Nov. 2011. doi: 10.1109/TPDS.2011.50. J. Dongarra, I. Foster, G. Fox, W. Gropp, K. Kennedy, L. Torczon, and A. White. “The sourcebook of Parallel Computing.” Morgan Kaufmann Publishers, 2003. Folding@home, “Folding@chrome – folding with just your browser,” [Online]. Available: http://folding.stanford.edu/home/foldingchromefoldinhome-using-google-portable-native-client-technology. Chrome, “Google Chrome Native Client,” [Online]. Available: https://developer.chrome.com/native-client. IDC, “IDC Smartphone OS Market Share,” [Online]. Available: http://idc.com/prodserv/smartphone-os-market-share.jsp. Primate Labs, “Geekbench 4.0.1 iOS AArch64,” [Online]. Available: http://browser.primatelabs.com/v4/cpu/387526. CPU Benchmark, “AMD vs. Intel Market Share,” [Online]. Available: http:// cpubenchmark.net/market_share.html. W. Gropp, E. Lusk and A. Skjellum. “Using MPI portable parallel programming with the message-passing interface”. London: MIT Press. 1997.