OpenMP as an Efficient Method to Parallelize Code With Dense Synchronization

Rafał Bocian, Dominika Pawłowska, Krzysztof Stencel, Piotr Wiśniewski
Faculty of Mathematics and Computer Science
Nicolaus Copernicus University
Toruń, Poland
{rafalb,dpawlowska,stencel,pikonrad}@mat.umk.pl
Abstract. In recent years, adding new cores and new threads has been the main method of increasing computational power. In line with this approach, in this paper we analyze the efficiency of the shared-memory parallel computation model when dense synchronization is required. As our experimental evaluation shows, contemporary CPUs assisted with the OpenMP library perform well on such tasks. We also present evidence that OpenMP is easy to learn and use.
1 Introduction
In 2004 Intel released the third generation of the Intel Pentium 4, codenamed Prescott, with a 3.8 GHz clock. Even 14 years later, today's processors run at a similar clock frequency. Recently, new computational power has been added using two major methods. First, the number of clock ticks needed to perform a task is minimized. Second, the number of cores and the number of threads per core are increased. This article is inspired by the second approach. Let us take a look at the growth of the number of cores in the Intel Xeon family in recent years:

  2Q2012   E5         Intel Xeon E5-4650          8 cores
  1Q2014   E7 v2      Intel Xeon E7-8890 v2      15 cores
  2Q2015   E7 v3      Intel Xeon E7-8890 v3      18 cores
  2Q2016   E7 v4      Intel Xeon E7-8890 v4      24 cores
  3Q2017   Platinum   Intel Xeon Platinum 8180   28 cores
This growth of the number of cores has given a new insight into the problem of parallel programming. Nowadays, parallel computation is no longer restricted to big computation centres; it is available even on desktop machines. This trend changes the approach to parallel computation. When a motherboard has 8 processors with 28 cores each, the computation is in fact performed by 224 computational cores using the same shared memory installed on the motherboard. Thus, it is possible to implement parallel computation models based on shared memory. The OpenMP library is one of the major tools to design and realize parallel computation with shared memory, as the minimal sketch below illustrates.
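The fragment below is an illustration of our own (it is not the implementation evaluated later in this paper): a loop is parallelized with a single pragma, and each such worksharing loop ends with an implicit barrier, i.e. a synchronization point of the kind discussed throughout this article.

#include <stdio.h>
#include <omp.h>

/* Minimal OpenMP illustration (ours; not the measured implementation).
 * Each "parallel for" distributes the iterations among threads and
 * ends with an implicit barrier, i.e. one synchronization per loop. */
int main(void) {
    enum { N = 1000000 };
    static double a[N], b[N];
    double sum = 0.0;

    #pragma omp parallel for
    for (int i = 0; i < N; i++) {   /* fill both vectors in parallel */
        a[i] = 1.0;
        b[i] = 2.0;
    }                               /* implicit barrier here */

    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)     /* parallel dot product */
        sum += a[i] * b[i];         /* implicit barrier after the loop */

    printf("sum = %f, max threads = %d\n", sum, omp_get_max_threads());
    return 0;
}

With GCC such a program is compiled with the -fopenmp flag, and the number of threads can be controlled with the standard OMP_NUM_THREADS environment variable.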
If the task at hand can be split into a set of subtasks, the profits from parallelization are significant [1]. However, numerous authors [2-4] are convinced that if the subtasks require frequent synchronization, efficient parallelization is a notably non-trivial challenge. In this article, we report on experience collected while solving a special class of linear equations driven by Toeplitz matrices. We solved such equations during analyses of seismic data [5]. We have found that dense synchronization (even a thousand times per second) does not preclude efficient parallel computation. We have prepared efficient implementations using OpenMP for a problem that requires such dense synchronization.

The main contribution of this article is a demonstration that contemporary tools parallelize computations exceptionally well, even for tasks requiring frequent synchronization. Moreover, the largest gains are obtained when the number of computing threads does not exceed the number of CPU cores. When the number of threads is between the number of cores and the number of hardware threads the CPU serves, parallelization is still efficient. However, when the number of CPU threads is exceeded, performance degrades dramatically.

The article is organized as follows. Section 2 presents the problem and proposes an actual implementation. Section 3 reports on the experimental evaluation of this implementation. Section 4 concludes.
2 Levinson recursion
The main goal of this article is to demonstrate that efficient parallel computation is possible even in cases when dense synchronization is required. We use the Levinson algorithm as a running example to show possible parallelization gains. This algorithm solves particular systems of linear equations called Toeplitz systems. Such systems are used to compute the coefficients of a Wiener filter, i.e. one of the most popular filters employed to process seismic data. There are algorithms, e.g. Superfast [6], that solve this problem with a better sequential time complexity than the Levinson algorithm. Unfortunately, they are notably more complex to implement (even in the sequential version), since they require transforming the input matrix upfront and adjusting the output vector afterwards to make it correspond to the original matrix. Moreover, the Levinson algorithm is more numerically stable than Superfast and other algorithms. Therefore, we have chosen the Levinson algorithm for our experiments.

Assume $n \in \mathbb{N} \setminus \{0\}$ and $a_{1-n}, a_{2-n}, \ldots, a_{-1}, a_0, a_1, \ldots, a_{n-1} \in \mathbb{R}$. We define $T = (t_{i,j})_{i,j=1,2,\ldots,n}$, where $t_{i,j} = a_{i-j}$ for all $i, j = 1, 2, \ldots, n$. Such a $T$ is called a Toeplitz matrix: every diagonal carries a single value. For example, for $n = 3$,

$$T = \begin{pmatrix} a_0 & a_{-1} & a_{-2} \\ a_1 & a_0 & a_{-1} \\ a_2 & a_1 & a_0 \end{pmatrix}.$$

A matrix equation of the form $Tx = b$ is called a Toeplitz system. Levinson recursion is an algorithm that recursively computes the solution of a Toeplitz system. The parallel implementation of this recursion is presented in the following listing; the sequential implementation is obtained by removing the pragma directives. The number n is the dimension of the input linear system. The array matrix contains the $n \times n$ square matrix of the Toeplitz system and vector is its n-length vector of constant terms. Moreover, ef, eb, ex, w, wf, wb, wx are floating-point numbers initialized with zero. The variables x, xs, f, fs, b, bs are n-length arrays of floating-point numbers initialized with zeros. After the computation, the array xs stores the solution of our Toeplitz system. The time complexity of the sequential implementation is $O(n^2)$.
xs[0] = vector[0] / matrix[n - 1];
fs[0] = 1 / matrix[n - 1];
bs[0] = 1 / matrix[n - 1];
for (i = 1; i < n; i++) {
    ...
}
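As a self-contained reading of this computation, the sketch below follows the textbook Levinson recursion with the variable names introduced above. It is our reconstruction under explicit assumptions, not the authors' exact listing: we assume that matrix is stored compactly as the $2n-1$ values $a_{1-n}, \ldots, a_{n-1}$ with $a_0$ at index $n-1$ (which matches the division by matrix[n - 1] above), the helper function levinson is hypothetical, and the placement of the OpenMP pragmas is our choice.

#include <stdlib.h>

/* Sketch reconstruction of the Levinson recursion (not the authors'
 * exact listing).  Assumed storage: matrix[k] holds a_{k-(n-1)}, so
 * matrix[n-1] is a_0 and matrix has 2n-1 entries.  On return, xs
 * holds the solution of T x = vector. */
void levinson(int n, const double *matrix, const double *vector, double *xs)
{
    double *f  = calloc(n, sizeof *f),  *fs = calloc(n, sizeof *fs);
    double *b  = calloc(n, sizeof *b),  *bs = calloc(n, sizeof *bs);
    double *x  = calloc(n, sizeof *x);
    double ef, eb, ex, w, wf, wb, wx;
    int i, j;

    xs[0] = vector[0] / matrix[n - 1];   /* x = b_1 / a_0             */
    fs[0] = 1 / matrix[n - 1];           /* forward vector of size 1  */
    bs[0] = 1 / matrix[n - 1];           /* backward vector of size 1 */
    for (i = 1; i < n; i++) {
        ef = eb = ex = 0;
        /* Inner products with the new row and column; the reduction
         * ends with an implicit barrier -- one synchronization point
         * per step of the recursion. */
        #pragma omp parallel for reduction(+:ef,eb,ex)
        for (j = 0; j < i; j++) {
            ef += matrix[n - 1 + i - j] * fs[j];  /* a_{i-j} f_j    */
            eb += matrix[n - 2 - j]     * bs[j];  /* a_{-(j+1)} b_j */
            ex += matrix[n - 1 + i - j] * xs[j];  /* a_{i-j} x_j    */
        }
        w  = 1 / (1 - ef * eb);
        wf = -ef * w;
        wb = -eb * w;
        wx = vector[i] - ex;
        /* Combine the zero-extended forward and backward vectors and
         * extend the partial solution; another barrier per step. */
        #pragma omp parallel for
        for (j = 0; j <= i; j++) {
            double fj = (j < i) ? fs[j] : 0;      /* [f ; 0] */
            double bj = (j > 0) ? bs[j - 1] : 0;  /* [0 ; b] */
            f[j] = w  * fj + wf * bj;
            b[j] = wb * fj + w  * bj;
            x[j] = ((j < i) ? xs[j] : 0) + wx * b[j];
        }
        /* Commit this step's vectors; a third barrier per step. */
        #pragma omp parallel for
        for (j = 0; j <= i; j++) {
            fs[j] = f[j]; bs[j] = b[j]; xs[j] = x[j];
        }
    }
    free(f); free(fs); free(b); free(bs); free(x);
}

Under this reading, every pass through the outer loop crosses a small constant number of implicit barriers, so the threads synchronize Θ(n) times in total while each individual parallel region performs only O(n) arithmetic. This is precisely the dense-synchronization pattern whose cost is measured in Section 3.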