Article

Algorithms for Optimization of Processor and Memory Affinity for Remote Core Locking Synchronization in Multithreaded Applications

Alexey Paznikov and Yulia Shichkina *

Department of Computer Science and Engineering, Saint Petersburg Electrotechnical University "LETI", Saint Petersburg 197022, Russia; [email protected]
* Correspondence: [email protected]; Tel.: +8-981-963-6645

Received: 11 December 2017; Accepted: 16 January 2018; Published: 18 January 2018
Abstract: This paper proposes algorithms for optimization of the Remote Core Locking (RCL) synchronization method in multithreaded programs. We propose an algorithm for the initialization of RCL-locks and an algorithm for thread affinity optimization. The algorithms consider the structures of hierarchical computer systems and non-uniform memory access (NUMA) in order to minimize the execution time of multithreaded programs with RCL. The experimental results on multi-core computer systems show the reduction of execution time for programs with RCL.

Keywords: remote core locking; RCL; synchronization; critical sections; scalability
1. Introduction

Currently available computer systems (CS) [1] are large-scale and combine multiple architectures. These systems are composed of shared-memory multi-core compute nodes (SMP and NUMA systems) equipped with universal processors, as well as specialized accelerators (graphics processors, massively parallel multi-core processors). The number of processor cores exceeds 10^2–10^3. Effective execution of parallel programs on these systems is a significant challenge: system software must take into account the large scale, the variety of architectures and the hierarchical structure.

Parallel programs for multi-core CS with shared memory are, in most cases, multithreaded, and software tools must ensure close-to-linear speedup with a large number of parallel threads. Thread synchronization during access to shared data structures is one of the most significant problems in multithreaded programming. The existing approaches to thread synchronization include locks, lock-free algorithms and concurrent data structures [2], and software transactional memory [3].

The main drawback of lock-free algorithms and concurrent data structures, despite their good scalability, is their limited application scope and the severe complexity of developing parallel programs with them [2,4,5]. Furthermore, the development of lock-free algorithms and data structures runs into problems connected with memory reclamation (the ABA problem) [6,7], poor performance, and the restricted nature of atomic operations. Moreover, the throughput of lock-free concurrent data structures is often similar to that of their lock-based counterparts.

Software transactional memory currently has a variety of issues connected with large overheads for the execution of transactions and with frequent transaction aborts. Moreover, existing transactional memory implementations restrict the set of operations allowed inside a transactional section. Consequently, transactional memory does not ensure sufficient performance of multithreaded programs and is, for now, not commonly applied in real applications.

The conventional approach of using lock-based critical sections for synchronization in multithreaded programs is still the most widespread in software development. Locks are simple to use (compared with lock-free algorithms and data structures) and, in most cases, ensure acceptable performance.
Furthermore, most existing multithreaded programs utilize a lock-based approach. Thereby, the development of scalable algorithms and software tools for lock-based synchronization remains an urgent task.

Lock scalability depends on the contention of threads accessing shared memory areas and on the locality of references. Access contention arises when multiple threads simultaneously access a critical section protected by one synchronization primitive. In terms of hardware, this leads to a huge load on the data bus and to inefficient use of cache memory. Cache locality matters when a thread inside a critical section accesses shared data previously used on another processor core; this case leads to cache misses and a significant increase in the time required for critical section execution. The main approaches to scalable lock implementation are CAS spinlocks [8], MCS-locks [9], Flat Combining [10], CC-Synch [11], DSM-Synch [11] and the Oyama lock [12].

For the analysis of existing approaches, we have to consider the critical section's execution time. The time t of critical section execution comprises the time t1 of executing the critical section's instructions and the time t2 of transferring lock ownership: t = t1 + t2. In existing lock algorithms, the time required for the transfer of lock ownership is determined by global flag access (CAS spinlocks), by context switches and awakening of the thread executing the critical section (PThread mutex, MCS-locks, etc.), or by global lock capture (Flat Combining). The time for executing the critical section's instructions depends substantially on the time t3 of accessing global variables. Most existing locking algorithms do not localize access to shared memory areas.

Existing research includes methods for localizing access to cache memory [10,13–15]. The works [14,15] are devoted to concurrent data structures (linked lists and hash tables) built on the execution of critical sections on dedicated processor cores. The paper [13] proposes a universal hardware solution: a set of processor instructions for transferring lock ownership to a dedicated processor core. Flat Combining [10] is a software approach in which critical sections are executed by a server thread (all threads become the server in turn). However, the transfer of lock server ownership between threads executed on different processor cores leads to a performance decrease even at insignificant access contention. In addition, none of these algorithms support thread blocking inside critical sections, including active waiting and blocking in the operating system kernel. Ownership transfer involves overheads due to context switches, loading of global variables from RAM into cache, and activation of the thread executing the critical section. Critical section execution time depends heavily on the reference locality of global variables: common mutual exclusion algorithms assume frequent context switches, which lead to the eviction of shared variables from cache memory.
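To make the contention overhead concrete, the fragment below sketches a minimal test-and-test-and-set spinlock in C11 atomics (our illustration, not code from the RCL library): every acquisition ends with an atomic write to one shared flag, so under contention the cache line holding the flag migrates between the waiting cores, which is exactly the lock-ownership transfer cost t2 described above.

/* Minimal TTAS spinlock sketch (illustration only). The single shared flag
 * is the contention hot spot: each atomic_exchange pulls its cache line
 * in exclusive state to the acquiring core. */
#include <stdatomic.h>
#include <stdbool.h>

typedef struct { atomic_bool locked; } spinlock_t;

static void spin_lock(spinlock_t *l) {
    for (;;) {
        /* Spin on a plain load first to limit bus traffic ... */
        while (atomic_load_explicit(&l->locked, memory_order_relaxed))
            ;
        /* ... then try to take the flag with an atomic exchange. */
        if (!atomic_exchange_explicit(&l->locked, true, memory_order_acquire))
            return;
    }
}

static void spin_unlock(spinlock_t *l) {
    atomic_store_explicit(&l->locked, false, memory_order_release);
}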
This paper considers the Remote Core Locking (RCL) method [16,17], in which critical sections are executed by dedicated processor cores. RCL minimizes the execution time of existing programs thanks to critical path reduction. The technique replaces high-contention critical sections of existing multithreaded applications with remote functions whose execution is requested from dedicated processor cores. All critical sections protected by one lock are then executed by a server thread running on the dedicated processor core. Because all of these critical sections are executed by that particular thread, the overheads for the transfer of lock ownership are negligible. RCL also reduces the execution time of the critical section's instructions: the data accessed within a critical section stay localized in the cache memory of the RCL-server's core, which minimizes cache misses. Each working thread (client thread) uses a dedicated cache line for critical section data and for active waiting, which reduces access contention.
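The usage pattern can be sketched as follows. The type and function names below are hypothetical placeholders (the paper does not list the RCL library interface); the point is only that a formerly lock-protected block becomes a function which the client thread asks the server core to execute.

#include <stddef.h>

/* Hypothetical handles standing in for the RCL library interface. */
typedef struct rcl_lock rcl_lock_t;
extern void rcl_lock_init(rcl_lock_t *lock, int server_core);
extern void rcl_execute(rcl_lock_t *lock, void (*cs)(void *), void *arg);

struct counter { long value; };

/* Body of the former critical section, now a remote function. It always runs
 * on the dedicated server core, so "value" stays hot in that core's cache. */
static void increment_cs(void *arg) {
    struct counter *c = arg;
    c->value += 1;
}

/* Client thread: instead of mutex lock/unlock around the increment, it posts
 * the request into its dedicated cache line and waits for the server. */
static void client_update(rcl_lock_t *lock, struct counter *c) {
    rcl_execute(lock, increment_cs, c);
}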
The current implementation of RCL has several drawbacks. First, there is no memory affinity in NUMA systems. Computer systems with non-uniform memory access (NUMA) (Figure 1) are currently widespread. These systems are compositions of multi-core processors, each of which is directly attached to a local segment of the global memory; a processor (or subset of processors) together with its local memory constitutes a NUMA-node. Interconnection between processors and local memory is performed through buses (AMD HyperTransport, Intel QuickPath Interconnect). Access to local memory is direct, whereas access to remote NUMA-nodes is more costly because it requires the use of an interprocessor bus. In multithreaded programs, the latency of the RCL-server's accesses to shared memory areas is essential for the execution time of a program. Memory allocation on NUMA-nodes which are not local to the RCL-server leads to severe overheads when the RCL-server accesses variables allocated on the remote NUMA-nodes.
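On Linux, such placement can be controlled explicitly with libnuma. The fragment below only illustrates the locality issue (it assumes libnuma is available and is not part of the RCL implementation): data protected by an RCL-lock is allocated on the NUMA-node of the server core, so that the server's accesses stay local.

#include <stddef.h>
#include <numa.h>   /* link with -lnuma */

/* Allocate the shared data on the NUMA-node of the RCL-server core. */
void *alloc_on_server_node(size_t size, int server_node) {
    if (numa_available() < 0)
        return NULL;                         /* no NUMA support on this system */
    return numa_alloc_onnode(size, server_node);   /* release with numa_free() */
}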
Figure 1. Structure of NUMA-systems.
RCL also has no mechanism for automatic selection of processor cores for the server thread and the working threads that considers the hierarchical structure of the computer system and existing affinities. Processor affinity greatly affects the overheads caused by localization of access to global variables; therefore, user threads should be executed on processor cores located "closer" to the processor core of the RCL-server. In the existing RCL implementation, users have to choose the affinity for all threads manually and, in doing so, take into account the existing affinity of the RCL-server and the working threads. Thus, the development of tools automating this procedure constitutes a current problem.

This work proposes an algorithm for RCL-lock initialization that establishes memory affinity to NUMA-nodes and RCL-server affinity to processor cores, as well as an algorithm for the sub-optimal affinity of working threads to processor cores. These algorithms take into account the hierarchical structure of multi-core CS and non-uniform memory access in NUMA-systems in order to minimize the execution time of critical sections.

2. Multi-Core Hierarchical Computer System Model

Let there be a multi-core CS with shared memory, including N processor cores: p = {1, 2, ..., N}. The computer system has a hierarchical structure, which can be described as a tree comprising L levels (Figure 2). Each level of the system is represented by an individual type of structural element of the CS (NUMA-nodes, processor cores and multilevel cache memory). We introduce the following notation: c_lk is the number of processor cores possessed by the children of an element k ∈ {1, 2, ..., n_l} of level l ∈ {1, 2, ..., L}; r = p(l, k) is the first direct parent element r ∈ {1, 2, ..., n_{l−1}} of an element k located on level l; m is the number of NUMA-nodes of the multi-core CS; j = m(i) is the number j ∈ {1, 2, ..., m} of the NUMA-node containing a processor core i; q_i is the set of processor cores belonging to NUMA-node i.
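One possible in-memory representation of this model is sketched below; the structures and field names are our own illustration, not types from the RCL library.

#include <stddef.h>

struct cs_element {               /* element k of some level of the tree      */
    struct cs_element *parent;    /* p(l, k): first direct parent element     */
    int level;                    /* l in {1, ..., L}                         */
    int num_child_cores;          /* c_lk: processor cores under the element  */
};

struct numa_node {
    int id;                       /* j in {1, ..., m}                         */
    int *cores;                   /* q_j: processor cores of the node         */
    size_t num_cores;
};

struct machine {
    int N;                        /* total number of processor cores          */
    int L;                        /* number of levels of the hierarchy        */
    int m;                        /* number of NUMA-nodes                     */
    int *core_to_node;            /* m(i): NUMA-node of processor core i      */
    struct numa_node *nodes;
};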
Figure 2. An example of the hierarchical structure of a multi-core CS, N = 8, L = 5, m = 2, c_23 = 2, p(3; 4) = 2, m(3) = 1.
3. RCL Optimization Algorithms

For memory affinity optimization and RCL-server processor affinity optimization, we propose the algorithm RCLLockInitNUMA for the initialization of RCL-locks (Algorithm 1). The algorithm takes into account non-uniform memory access and is performed during the initialization of the RCL-lock.

Algorithm 1. RCLLockInitNUMA.
1: /* Compute the number of free cores on CS and nodes. */
2: node_usage[1, ..., m] = 0
3: nb_free_cores = 0
4: for i = 1 to N do
5:   if ISRCLSERVER(i) then
6:     node_usage[m(i)] = node_usage[m(i)] + 1
7:   else
8:     nb_free_cores = nb_free_cores + 1
9:   end if
10: end for
11: /* Try to set memory affinity to the NUMA-node. */
12: nb_busy_nodes = 0
13: for i = 0 to m do
14:   if node_usage[i] > 0 then
15:     nb_busy_nodes = nb_busy_nodes + 1
16:     node = i
17:   end if
18: end for
19: if nb_busy_nodes = 1 then
20:   SETMEMBIND(node)
21: end if
22: /* Set the affinity of RCL-server. */
23: if nb_free_cores = 1 then
24:   core = GETNEXTCORERR()
25: else
26:   n = GETMOSTBUSYNODE(node_usage)
27:   for i = 1 to qn do
28:     if not ISRCLSERVER(i) then
29:       core = i
30:       break
31:     end if
32:   end for
33: end if
34: RCLLOCKINITDEFAULT(core)
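As a reference point, the SETMEMBIND step on line 20 can be expressed with libnuma on Linux; the helper below is our assumption of how such a call could look, not the library's actual implementation.

#include <numa.h>   /* link with -lnuma */

/* Bind future memory allocations of the calling thread (the RCL-server)
 * to one NUMA-node, as SETMEMBIND(node) on line 20 requires. */
static void set_mem_bind(int node) {
    struct bitmask *nodes = numa_allocate_nodemask();
    numa_bitmask_setbit(nodes, (unsigned)node);
    numa_set_membind(nodes);       /* allocations now come from "node" */
    numa_free_nodemask(nodes);
}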
In the first stage of the algorithm (lines 2–10), we compute the number of processor cores that are not occupied by RCL-servers and the number of free processor cores on each NUMA-node. Subsequently (lines 12–18), we compute the total number of NUMA-nodes on which RCL-servers are running. If there is only one such NUMA-node, we set the memory affinity to this node (lines 19–21).

The second stage of the algorithm (lines 23–34) searches for a sub-optimal processor core and binds the RCL-server to it. If there is only one processor core in the system that is not occupied by an RCL-server, we set the affinity of the RCL-server to the next (occupied) processor core in round-robin order (lines 23–24); one core is always kept free on which to run the working threads. If there is more than one free processor core in the system, we search for the least busy NUMA-node (line 26) and set the affinity of the RCL-server to the first free core in this node (lines 27–32). The algorithm concludes with a call to the default RCL-lock initialization function with the obtained processor affinity (line 34).

For the optimization of working thread affinity, we propose the heuristic algorithm RCLHierarchicalAffinity (Algorithm 2). The algorithm takes into account the hierarchical structure of multi-core CS in order to minimize the execution time of multithreaded programs with RCL. It is executed each time a parallel thread is created.

Algorithm 2. RCLHierarchicalAffinity.
1: if ISREGULARTHREAD(thr_attr) then
2:   core_usage[1, ..., N] = 0
3:   for i = 1 to N do
4:     if ISRCLSERVER(i) then
5:       nthr_per_core = 0
6:       l = L
7:       k = i
8:       core = 0
9:       /* Search for the nearest processor core. */
10:      do
11:        /* Find the first covering parent element. */
12:        do
13:          c_lk_prev = c_lk
14:          k = p(l, k)
15:          l = l − 1
16:        while c_lk = c_lk_prev or l = 1
17:        /* When the root is reached, increase the minimal count
18:           of threads per one core. */
19:        if l = 1 then
20:          nthr_per_core = nthr_per_core + 1
21:          obj = i
22:        else
23:          /* Find the first least busy processor core. */
24:          for j = 1 to c_lk do
25:            if core_usage[j] ≤ nthr_per_core then
26:              core = j
27:              break
28:            end if
29:          end for
30:        end if
31:      while core = 0
32:      SETAFFINITY(core, thr_attr)
33:      core_usage[core] = core_usage[core] + 1
34:      return
35:    end if
36:  end for
37: end if
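SETAFFINITY(core, thr_attr) on line 32 maps naturally to the GNU pthread attribute affinity call; the helper below is a sketch under that assumption (glibc on Linux), not the library's actual wrapper.

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin the thread that will be created with thr_attr to one processor core. */
static void set_affinity(int core, pthread_attr_t *thr_attr) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_attr_setaffinity_np(thr_attr, sizeof(set), &set);
}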
In the first stage of the algorithm (line 1), we check whether the thread is an RCL-server. For each common (working) thread, we search for all of the RCL-servers being executed in the system (lines 3–4). When the first processor core with an RCL-server is found, this core becomes the current element (lines 6–7), and for this element we search for the nearest free processor core to which the created thread can be affined (lines 4–35). At the beginning of the algorithm, a processor core with no affined threads is considered a free core (line 5). In the first stage of the core search, we find the first covering element of the hierarchical structure; the covering element contains the current element and at least one other processor core (lines 12–16). When the uppermost element of the hierarchical structure is reached, we increment the minimal number of threads per core (a free core is now a core with a greater number of threads running on it) (lines 19–21). When the covering element is found, we search for the first free processor core in it (lines 24–29) and set the affinity of the created thread to this core (line 32), whereby the number of threads executed on this core is increased (line 33). After the affinity of the thread is set, the algorithm finishes (line 34).

We implemented the RCLLockInitNUMA and RCLHierarchicalAffinity algorithms in a library that utilizes the RCL method. We expect that the use of the modified RCL library will reduce the time required for critical section execution and increase critical section throughput for existing multithreaded programs that use RCL. The algorithms take into account non-uniform memory access, the hierarchical structure of the system (caching overheads) and access contention, so the effect of using the algorithms depends on the type of system (NUMA or SMP) on which the multithreaded programs are executed, the structure of the system (number of hierarchical levels, existence of shared caches), the number of working threads, the existing affinities of the threads (the RCL-server thread, the working threads and the thread which allocates memory), and the patterns of access to the memory of the data structures. Thus, the efficiency of the algorithms has to be evaluated with respect to these factors.

4. Experimental Results

For the evaluation of the algorithms, it is necessary to use shared memory systems with uniform and non-uniform memory access (SMP and NUMA). While RCLLockInitNUMA is focused on NUMA systems, RCLHierarchicalAffinity can be used with both system types. Benchmarks should consider different patterns of access to the memory of the data structures, and the evaluation should be done for different numbers of working threads (taking into account the number of processor cores in the system).

The experiments were conducted on the multi-core nodes of the computer clusters Oak and Jet in the multicluster computer system at the Center of Parallel Computational Technologies of the Siberian State University of Telecommunications and Information Sciences. The node of the Oak cluster (a NUMA system) includes two quad-core Intel Xeon E5620 processors (2.4 GHz, with cache sizes of 32 KiB, 256 KiB and 12 MiB for levels 1, 2 and 3, respectively) and 24 GiB of RAM (Figure 3a). The ratio of the rates of access to local and remote NUMA-nodes is 21 to 10.
The node of the Jet cluster (an SMP system) is equipped with a quad-core Intel Xeon E5420 processor (2.5 GHz, with cache sizes of 32 KiB and 12 MiB for levels 1 and 2, respectively) and 8 GiB of RAM (Figure 3b).
Figure 3. Hierarchical structure of the computing nodes of the clusters Oak and Jet. (a) Computer node of cluster Oak; (b) computer node of cluster Jet.
The operating systems GNU/Linux CentOS 6 (Oak) and Fedora 21 (Jet) are installed on the computing nodes. The compiler GCC 5.3.0 was used.

Using the described systems, we aim to study how non-uniform memory access (Oak node) affects the efficiency of the algorithms. Additionally, we will determine how efficiency is affected by the shared cache (L3 in the Oak node and L2 in the Jet node). We expect the algorithms to be more efficient in the NUMA system with the deeper hierarchy.

We developed a benchmark for the evaluation of the algorithms. The benchmark performs iterative access to the elements of an integer array of length b = 5 × 10^8 elements inside a critical section organized with RCL. The number of operations is n = 10^8/p. As the operation in the critical section, we used an increment of an integer variable by 1 (this operation is executed at each iteration). We used three memory access patterns:

• sequential access: on each new iteration, choose the element that follows the previous one;
• strided access (interval-based access): on each new iteration, choose the element whose index exceeds the previous one by s = 20;
• random access: on each iteration, choose an element of the array at random.

These access patterns are among the most common in multithreaded programs. We study how the memory access pattern affects the efficiency of the proposed algorithms and, based on the experimental results, provide recommendations for multithreaded programming practice.

The number p of parallel threads was varied between 2 and 7 in the first experiment (7 is the number of processor cores of a computing node that are not occupied by the RCL-server) and between 2 and 100 in the second one. Here, we aim to evaluate the efficiency of the algorithms depending on the number of working threads (and how this number relates to the number of processor cores). We expect that the efficiency of RCLLockInitNUMA does not depend on the number of threads, while RCLHierarchicalAffinity will be more effective when all the threads (including the RCL-server) are bound to the processor cores of one NUMA node.
The throughput b = n/t of the critical section was used as the indicator of efficiency (here, n is the number of operations and t is the benchmark execution time). Note that throughput is the most common efficiency indicator in studies on synchronization techniques.

We compared the efficiency of the two algorithms for the initialization of the RCL-lock: RCLLockInitDefault (the current RCL-lock initialization function) and RCLLockInitNUMA. Additionally, we compared the thread affinity obtained by the algorithm RCLHierarchicalAffinity with other, arbitrary affinities.

In the experiments for the evaluation of the RCL-lock initialization algorithms, we used the thread affinity depicted in Figure 4. Additionally, we conducted an experiment without fixing the affinity of the working threads in order to study how fixed affinity affects the efficiency of the algorithms for lock initialization.
Figure 4. Thread affinity to the processor cores in the experiments, p = 2, ..., 7.
Figure 5 depicts the dependence of the critical section's throughput b on the number p of working threads. We can see that the algorithm RCLLockInitNUMA increases the throughput of the critical section for random access and strided access to the elements of the test array by 10–20%. We explain this by the fact that, with these access patterns, data is not cached in the local cache of the processor core on which the RCL-server is running; the RCL-server therefore addresses the RAM directly, and the access rate depends on whether the data is located in the local or a remote NUMA-node. The effect is perceptible both when the number of threads is close to the number of processor cores (Figure 5a,b) and when it exceeds it (Figure 5c), and it does not change significantly as the number of threads grows. A fixed affinity of threads to processor cores (Figure 5a) does not significantly affect the results.

Figures 6 and 7 present the experimental results obtained for different affinities in the benchmark. Here, the number p of working threads does not exceed the number of cores, because affinity only makes sense under this condition. The results show that the algorithm RCLHierarchicalAffinity significantly increases critical section throughput. The effect of the algorithm depends on the number of threads (up to 2.4 times at p = 2, up to 2.2 times at p = 3, up to 1.3 times at p = 4, up to 1.2 times at p = 5) and on the access pattern (up to 1.5 times for random access, up to 2.4 times for sequential access and up to 2.1 times for strided access). In the case of a small number of cores, the RCL-server and the working threads do not share cache memory or the memory of one NUMA-node; therefore, the overheads due to cache misses or to memory access via the NUMA bus are significant.

As we expected, the efficiency of RCLHierarchicalAffinity decreased when the number of threads was greater than the number of processor cores of the NUMA node. This is due to the large overheads when the RCL-server uses the NUMA bus to access memory that was previously accessed by a thread affined to another NUMA-node.
Figure 5. Efficiency of the algorithms for RCL-lock initialization on the Oak cluster. (a) p = 2, ..., 7, thread affinity corresponds to Figure 4; (b) p = 2, ..., 7, no thread affinity; (c) p = 2, ..., 100. 1—RCLLockInitNUMA, sequential access; 2—RCLLockInitDefault, sequential access; 3—RCLLockInitNUMA, strided access; 4—RCLLockInitDefault, strided access; 5—RCLLockInitNUMA, random access; 6—RCLLockInitDefault, random access.

Figure 6. Thread affinity efficiency comparison, Oak cluster (NUMA system). (a) p = 2; (b) p = 3; (c) p = 4; (d) p = 5. The series compare RCLLockInitDefault and RCLLockInitNUMA under random, sequential and strided access; the markers denote the working threads, the RCL-server and the thread allocating the memory.

Figure 7. Thread affinity efficiency comparison, Jet cluster (SMP system). (a) p = 2; (b) p = 3; (c) p = 4; (d) p = 5. Series and markers as in Figure 6.
Critical section throughput for benchmark execution on the node of the Jet cluster (Figure 7) varies insignificantly for different thread affinities. This is explained by the lack of a cache shared by all the processor cores of one processor, and by the uniform access to memory (SMP system).

5. Conclusions

Algorithms for optimizing the execution of multithreaded programs based on Remote Core Locking (RCL) were developed. We proposed the algorithm RCLLockInitNUMA for the initialization of RCL-locks, which takes into account non-uniform memory access in multi-core NUMA-systems, and the algorithm RCLHierarchicalAffinity for sub-optimal thread affinity in hierarchical multi-core computer systems.

The algorithm RCLLockInitNUMA increases the throughput of the critical sections of multithreaded programs with random access and strided access to the elements of arrays on NUMA systems by an average of 10–20%. The optimization is achieved by minimizing the number of accesses to remote NUMA memory segments. The algorithm RCLHierarchicalAffinity increases the throughput of the critical section by up to 1.2–2.4 times for all access patterns on NUMA computer systems. The algorithms set the affinity while taking into account all of the hierarchical levels of multi-core computer systems.

The developed algorithms are realized as a library and can be used to optimize existing multithreaded programs based on RCL.

Acknowledgments: The paper has been prepared within the scope of the state project "Initiative scientific project" of the main part of the state plan of the Ministry of Education and Science of the Russian Federation (task No. 2.6553.2017/8.9 BCH Basic Part).

Author Contributions: A.P. and Y.S. conceived and designed the experiments; A.P. performed the experiments; A.P. and Y.S. analyzed the data; A.P. and Y.S. wrote the paper.

Conflicts of Interest: The authors declare no conflict of interest.
References

1. Khoroshevsky, V.G. Distributed programmable structure computer systems. Vestnik SibGUTI 2010, 2, 3–41.
2. Herlihy, M.; Shavit, N. The Art of Multiprocessor Programming, Revised Reprint; Elsevier: Amsterdam, The Netherlands, 2012; p. 528.
3. Herlihy, M.; Moss, J.E.B. Transactional memory: Architectural support for lock-free data structures. In Proceedings of the 20th Annual International Symposium on Computer Architecture, San Diego, CA, USA, 16–19 May 1993; Volume 21, pp. 289–300.
4. Shavit, N. Data Structures in the Multicore Age. Commun. ACM 2011, 54, 76–84.
5. Shavit, N.; Moir, M. Concurrent Data Structures. In Handbook of Data Structures and Applications; Metha, D., Sahni, S., Eds.; Chapman and Hall/CRC Press: Boca Raton, FL, USA, 2004; Chapter 47, pp. 47-1–47-30.
6. Dechev, D.; Pirkelbauer, P.; Stroustrup, B. Understanding and effectively preventing the ABA problem in descriptor-based lock-free designs. In Proceedings of the 2010 13th IEEE International Symposium on Object/Component/Service-Oriented Real-Time Distributed Computing (ISORC), Carmona, Spain, 4–7 May 2010; pp. 185–192.
7. Michael, M.M.; Scott, M.L. Simple, fast, and practical non-blocking and blocking concurrent queue algorithms. In Proceedings of the Fifteenth Annual ACM Symposium on Principles of Distributed Computing, Philadelphia, PA, USA, 23–26 May 1996; pp. 267–275.
8. Anderson, T.E. The performance of spin lock alternatives for shared-memory multiprocessors. IEEE Trans. Parallel Distrib. Syst. 1990, 1, 6–16. [CrossRef]
9. Mellor-Crummey, J.M.; Scott, M.L. Algorithms for scalable synchronization on shared-memory multiprocessors. ACM Trans. Comput. Syst. (TOCS) 1991, 9, 21–65. [CrossRef]
10. Hendler, D.; Incze, I.; Shavit, N.; Tzafrir, M. Flat combining and the synchronization-parallelism tradeoff. In Proceedings of the Twenty-Second Annual ACM Symposium on Parallelism in Algorithms and Architectures, Santorini, Greece, 13–15 June 2010; pp. 355–364.
11. Fatourou, P.; Kallimanis, N.D. Revisiting the combining synchronization technique. ACM SIGPLAN Not. 2012, 47, 257–266.
12. Oyama, Y.; Taura, K.; Yonezawa, A. Executing parallel programs with synchronization bottlenecks efficiently. In Proceedings of the International Workshop on PDSIA '99, Sendai, Japan, 5–7 July 1999; pp. 1–24.
13. Suleman, M.A.; Mutlu, O.; Qureshi, M.K.; Patt, Y.N. Accelerating critical section execution with asymmetric multi-core architectures. ACM SIGARCH Comput. Archit. News 2009, 37, 253–264.
14. Metreveli, Z.; Zeldovich, N.; Kaashoek, M.F. CPHash: A cache-partitioned hash table. ACM SIGPLAN Not. 2012, 47, 319–320.
15. Calciu, I.; Gottschlich, J.E.; Herlihy, M. Using elimination and delegation to implement a scalable NUMA-friendly stack. In Proceedings of the USENIX Workshop on Hot Topics in Parallelism (HotPar), San Jose, CA, USA, 24–25 June 2013; p. 17.
16. Lozi, J.P.; David, F.; Thomas, G.; Lawall, J.; Muller, G. Remote Core Locking: Migrating Critical-Section Execution to Improve the Performance of Multithreaded Applications. In Proceedings of the USENIX Annual Technical Conference, Boston, MA, USA, 13–15 June 2012; pp. 65–76.
17. Lozi, J.P.; Thomas, G.; Lawall, J.L.; Muller, G. Efficient Locking for Multicore Architectures; Research Report RR-7779; INRIA: Paris, France, 2011; pp. 1–30.

© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Hendler, D.; Incze, I.; Shavit, N.; Tzafrir, M. Flat combining and the synchronization-parallelism tradeo. In Proceedings of the Twenty-Second Annual ACM Symposium on Parallelism in Algorithms and Architectures, Santorini, Greece, 13–15 June 2010; pp. 355–364. Fatourou, P.; Kallimanis, N.D. Revisiting the combining synchronization technique. In ACM SIGPLAN Notices; ACM: New York, NY, USA, 2012; Volume 47, pp. 257–266. Oyama, Y.; Taura, K.; Yonezawa, A. Executing parallel programs with synchronization bottlenecks efficiently. In Proceedings of the International Workshop on PDSIA 99, Sendai, Japan, 5–7 July 1999; pp. 1–24. Suleman, M.A.; Mutlu, O.; Qureshi, M.K.; Patt, Y.N. Accelerating critical section execution with asymmetric multi-core architectures. In ACM SIGARCH Computer Architecture News; ACM: New York, NY, USA, 2009; Volume 37, pp. 253–264. Metreveli, Z.; Zeldovich, N.; Kaashoek, M.F. Cphash: A cache-partitioned hash table. In ACM SIGPLAN Notices; ACM: New York, NY, USA, 2012; Volume 47, pp. 319–320. Calciu, I.; Gottschlich, J.E.; Herlihy, M. Using elimination and delegation to implement a scalable NUMA-friendly stack. In Proceedings of the Usenix Workshop on Hot Topics in Parallelism (HotPar), San Jose, CA, USA, 24–25 June 2013; p. 17. Lozi, J.P.; David, F.; Tomas, G.; Lawall, J.; Muller, G. Remote Core Locking: Migrating Critical-Section Execution to Improve the Performance of Multithreaded Applications. In Proceedings of the USENIX Annual Technical Conference, Boston, MA, USA, 13–15 June 2012; p. 6576. Lozi, J.P.; Thomas, G.; Lawall, J.L.; Muller, G. Efficient Locking for Multicore Architectures; Research Report RR-7779; INRIA: Paris, France, 2011; pp. 1–30. © 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).