ACAT2014, Journal of Physics: Conference Series 608 (2015) 012001, doi:10.1088/1742-6596/608/1/012001
Massive affordable computing using ARM processors in high energy physics

J W Smith, A Hamilton
Department of Physics, University of Cape Town, Rondebosch, South Africa
E-mail: joshua.wyatt.smith@cern.ch

Abstract. High Performance Computing is relevant in many applications around the world, particularly high energy physics. Experiments such as ATLAS, CMS, ALICE and LHCb generate huge amounts of data which need to be stored and analyzed at server farms located on site at CERN and around the world. Apart from the initial cost of setting up an effective server farm, the cost of power consumption and cooling is significant. The proposed solution to reduce costs without losing performance is to make use of the ARM processors found in nearly all smartphones and tablet computers. Their low power consumption, low cost and respectable processing speed make them an interesting choice for future large-scale parallel data processing centers. Benchmarks on the Cortex-A series of ARM processors, including the HPL and PMBW suites, are presented, together with preliminary results from the PROOF benchmark in the context of high energy physics.
1. Introduction
High energy physics (HEP) at the Large Hadron Collider (LHC) [1] creates an enormous amount of data that needs to be stored for later analysis. Dedicated server farms have been built at CERN and in Hungary (Tier 0) as well as around the world (Tier 1s and Tier 2s) and are connected through the Worldwide LHC Computing Grid (WLCG). The initial setup costs of these server farms can be immense and the costs to maintain them can be even larger. Cooling and the power required to keep such servers at peak performance make server farms an expensive venture. In early 2013, CERN's Tier 0 had a power capacity of 3.5 MW. The proposed solution is to make use of ARM processors, from here on referred to as ARM. These are found in smartphones and tablet computers, where the combination of low power consumption and high performance is the top priority. Significant savings might be achieved if ARM processors are able to cope with the huge amount of data processed. This letter serves to explore the capabilities of ARM processors through parallel processing benchmarks.

2. ARM Processors
The ARM processor was designed to perform only a few simple instructions at a time. This reduces the amount of hardware, such as transistors, and thus minimizes power consumption. This letter addresses three system on chip (SoC) setups in the Cortex-A range, namely the A7 MPCore, A9 MPCore and A15 MPCore [2, 3, 4], from here on referred to as the A7, A9 and A15 respectively. Table 1 summarizes the different processors used.
An obvious advantage of the A9 setup is its number of cores; however, this is irrelevant here because the number of cores is an intrinsic property of the SoC. The hardware Floating-Point Unit (FPU) determines the speed at which multiplication and addition operations are carried out; v3/v4 refers to the respective FPU version. A traditional Intel computer (Hep405) is also used, to provide a point of reference for the reader. Power measurements for the A7, A9 and A15 were taken using a Fluke 289 Digital Multimeter. Measurements on Hep405 were taken using the Intel Power Gadget. However, if the power usage characteristics and temperatures differ from those used in Intel's calibration, there will be discrepancies between estimated and actual power usage. For this reason, the reader should keep in mind that Hep405's power figures serve more as an estimate.

Table 1. The different setups with key features.
Setup          | Processor                     | Cores         | RAM         | Cache                          | FPU   | OS
Cubietruck     | AllWinner A20, 1.2 GHz        | A7 dual core  | 2 GiB DDR3  | 512 KiB L2                     | VFPv4 | Archlinux, hard float
Wandboard-Quad | Freescale i.MX6 Quad, 996 MHz | A9 quad core  | 2 GiB DDR3  | 32 KiB L1, 1 MiB L2            | VFPv3 | Archlinux, hard float
ArndaleBoard-K | Samsung Exynos 5250, 1.7 GHz  | A15 dual core | 2 GiB DDR3  | 32 KiB L1, 1 MiB L2            | VFPv4 | Fedora 19, hard float
Hep405         | Intel Core i7-2600, 3.4 GHz   | quad core     | 16 GiB DDR3 | 256 KiB L1, 1 MiB L2, 8 MiB L3 | -     | Scientific Linux 6
3. Benchmarks

3.1. HPL
The High-Performance LINPACK (HPL) benchmark is the parallel version of the LINPACK benchmark [5], which is synthetic and CPU intensive. The floating-point performance (flops) of a system is calculated by splitting a large matrix into blocks that are then solved on different cores. Figure 1 shows the relationship between matrix size, block size, Gflops, power consumption and Gflops/watt for the HPL benchmark. The grey area is an "envelope function", which splits the data into windows along the time axis and then finds the average power value for each window. It gives a representation of the average power consumption, which is indicated by the blue line. A trend we see for the A7 and A9 is that as the size of the matrix block passed to the separate cores increases, the average power consumption decreases. This is because the computation proceeds more slowly, so less work in the form of communication has to happen between the processors. The average power consumption remains fairly constant on the A15 because of its faster clock speed of 1.7 GHz. The Gflops/watt is calculated using the average power consumption over the time window of each HPL measurement. We can see that the A9 is the most efficient, provided the block size is relatively large. The A7 does not have a fast enough processor, and the A15 uses too much power for the output it delivers. The A9's Gflops/watt is comparable to that of Hep405.
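To make the Gflops-per-watt procedure concrete, the sketch below shows one way a windowed average of logged power samples could be combined with an HPL result. This is an illustration only: the sample values, window boundaries and the 1.40 Gflops figure are placeholders, not the actual measurement pipeline used with the Fluke multimeter.

// Illustrative only: windowed average power and Gflops/watt from logged samples.
// The sample data, window and Gflops value below are hypothetical placeholders.
#include <iostream>
#include <vector>

struct PowerSample { double t_hours; double watts; };

// Mean power of the samples falling inside the window [t0, t1].
double averagePower(const std::vector<PowerSample>& log, double t0, double t1) {
    double sum = 0.0;
    int n = 0;
    for (const PowerSample& s : log) {
        if (s.t_hours >= t0 && s.t_hours <= t1) { sum += s.watts; ++n; }
    }
    return n > 0 ? sum / n : 0.0;
}

int main() {
    // Hypothetical multimeter log: (time in hours, instantaneous power in watts).
    std::vector<PowerSample> log = {{0.00, 3.1}, {0.05, 3.4}, {0.10, 3.6},
                                    {0.15, 3.5}, {0.20, 3.2}};
    double gflops = 1.40;                       // HPL result for this run (placeholder)
    double pAvg = averagePower(log, 0.0, 0.2);  // window covering the run
    std::cout << pAvg << " W average, " << gflops / pAvg << " Gflops/W" << std::endl;
    return 0;
}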
Figure 1. The HPL benchmark applied to the 3 different ARM processors and 1 Intel processor.

3.2. PMBW
The Parallel Memory Bandwidth Benchmark (PMBW) [6] is a suite that measures the bandwidth capabilities of a multi-core computer. This is an important test because more cores increase the floating-point performance in a roughly linear fashion; however, if the memory bandwidth cannot supply the data fast enough, those processors will stall. Unlike floating-point performance, memory bandwidth does not scale with the number of cores running in parallel. The code is written in assembly language, so compiler flags for the SoC become unimportant and there are no compiler optimizations. The code uses two general synthetic access patterns, namely sequential scanning and pure random access (a simplified sequential-scan illustration is given after Table 2). The benchmark outputs an enormous amount of data, so only the results for multiple threads and read/write scan operations, with each clock cycle transferring 32 bits (ARM) and 256 bits (Hep405), are shown in Table 2. It is accurate to say that none of these ARM processors perform very well when it comes to memory bandwidth.

Table 2. Summary of the important results for the HPL and PMBW benchmarks.

                        | A7    | A9     | A15    | Intel i7
HPL: Average Gflops/W   | 0.109 | 0.408  | 0.323  | 0.421
PMBW: Max read (GiB/s)  | 9.505 | 24.611 | 21.472 | 298.264
PMBW: Max write (GiB/s) | 9.035 | 21.428 | 21.546 | 163.123
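To illustrate what a sequential-scan bandwidth measurement of this kind does, the sketch below repeatedly reads a large buffer and divides the bytes moved by the elapsed time. PMBW itself implements such loops in hand-written assembly over many transfer widths, array sizes and a random-access pattern; the buffer size and repetition count here are arbitrary illustrative choices, not PMBW parameters.

// Illustrative sequential-read bandwidth measurement (not the PMBW code itself).
#include <chrono>
#include <cstdint>
#include <iostream>
#include <vector>

int main() {
    const std::size_t words = std::size_t(1) << 26;   // 64 Mi 64-bit words = 512 MiB
    const int repeats = 8;
    std::vector<std::uint64_t> buf(words, 1);

    volatile std::uint64_t sink = 0;                  // keeps the reads from being optimized away
    auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < repeats; ++r) {
        std::uint64_t sum = 0;
        for (std::size_t i = 0; i < words; ++i) sum += buf[i];   // sequential scan
        sink = sink + sum;
    }
    auto stop = std::chrono::steady_clock::now();

    double seconds = std::chrono::duration<double>(stop - start).count();
    double gib = double(words) * sizeof(std::uint64_t) * repeats / (1024.0 * 1024.0 * 1024.0);
    std::cout << "Sequential read: " << gib / seconds
              << " GiB/s (checksum " << sink << ")" << std::endl;
    return 0;
}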
3.3. Dedicated PROOF Cluster
The Parallel ROOT Facility, PROOF [7], is an extension of the well known framework ROOT [8]. PROOF strives to introduce massive parallelism by splitting up files among workers. The dedicated PROOF cluster at the University of the Witwatersrand (Wits) in Johannesburg, South Africa, uses 7 A9 boards sharing an NFS directory hosted by a traditional server. The PROOF benchmark suite [9] was used to determine performance. It is a further add-on designed to test the topology and scalability of clusters by running cycle-driven (CPU) and data-driven (I/O) processes. Results for vfpv3-d16, neon and default are shown. These correspond to compile flags, specific to the A9, used in the compilation of ROOT.

3.3.1. CPU benchmark
The default TSelHist selector was used, with the output shown in Figure 2. The unit of measurement is the number of events per unit time. We can see that the cluster scales linearly for CPU-intensive tasks. This is to be expected, as the only significant constraint is the overhead produced when the master has to pool the results and create the histograms.
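For reference, a scan over worker counts such as the one in Figures 2 and 3 can be driven with ROOT's TProofBench class from the PROOF benchmark suite [9]. The macro below is only a sketch: the master URL, event count and worker range are placeholders, and the exact constructor and method signatures should be checked against the installed ROOT release.

// Sketch of driving the PROOF benchmark suite from a ROOT macro.
// The PROOF master URL and the scan parameters are placeholders.
void runProofBench()
{
    // TProofBench opens a session on the given master and wraps the
    // cycle-driven (CPU) and data-driven (I/O) tests.
    TProofBench pb("proof://arm-master.example.org");

    // CPU benchmark: fill histograms with generated events while scanning
    // the number of active workers (here from 1 to 28 in steps of 1).
    pb.RunCPU(100000, 1, 28, 1);

    // I/O benchmark: generate a benchmark dataset on the workers once,
    // then read it back while scanning the number of active workers.
    pb.MakeDataSet();
    pb.RunDataSet();
}

Running the macro with root -l runProofBench.C on a machine that can reach the master then produces the per-worker event-rate and MB/s figures that the suite reports.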
Figure 2. PROOF CPU benchmark for ARM cluster.
Figure 3. PROOF I/O benchmark for ARM cluster.
3.3.2. I/O benchmark
The default TSelEvent selector was used, with the output shown in Figure 3. This benchmark shows the bottleneck created by the NFS server. Scalability saturates at around 12-13 workers, which equates to reading events at a total of around 23 MB/sec. It is clear that a shared NFS directory removes much of the advantage that one might gain from a PROOF cluster.

4. Conclusions
ARM processors use significantly less power than traditional processors. However, they are also significantly slower, so parallel computing for the right type of problem must be exploited. HEP data analysis provides tasks that are easily parallelizable and, with the correct frameworks in place, can be successful. PROOF provides one such framework; however, communication bottlenecks have to be overcome. The performance per watt of the A7 and A15 is poor when compared to the A9 and the Intel machine. The A9's advantage over the other ARM processors lies in its quad-core SoC. This processor is still slow compared to traditional computers, due in part to the age of the technology. The newer A15 quad-core processors should show a dramatic improvement in overall performance and specifically in performance per watt. Furthermore, the newly developed 64-bit architecture, along with its respective SoCs, will have a large impact on the CPU performance and memory bandwidth of these ARM processors.

Acknowledgments
The authors would like to thank the National Research Foundation of South Africa and the University of Cape Town for funding, as well as the Massive Affordable Computing (MAC) team at the University of the Witwatersrand for building and providing access to the A9 ARM cluster.

References
[1] 2008 Journal of Instrumentation vol 3 (IOP and SISSA)
[2] ARM Ltd 2013 Cortex-A7 MPCore Technical Reference Manual r0p5 ed
[3] ARM Ltd 2012 Cortex-A9 MPCore Technical Reference Manual r4p1 ed
[4] ARM Ltd 2013 Cortex-A15 MPCore Processor Technical Reference Manual r4p0 ed
[5] Dongarra J Linpack benchmark Computer Science Technical Report CS-89-85, University of Tennessee
[6] Bingmann T 2013 pmbw - Parallel Memory Bandwidth Benchmark / Measurement website
[7] Ballintijn M, Brun R, Rademakers F and Roland G 2003 CHEP03 Intl. Conf.
[8] Brun R and Rademakers F An object oriented analysis framework URL http://root.cern.ch
[9] Ryu S and Ganis G 2012 J. Phys.: Conf. Ser. 368 012020