
Behavior Aware Data Placement for Improving Cache Line Level Locality in Cloud Computing

Jianjun Wang1, Gangyong Jia2, Aohan Li3, Guangjie Han4, Lei Shu5
1 Department of Computer Science and Technology, Inner Mongolia Transportation Vocational and Technological College, China
2 Department of Computer Science and Technology, Hangzhou Dianzi University, China
3 Department of Information and Communication Engineering, Heilongjiang University, China
4 Department of Communication and Information System, Hohai University, China
5 Guangdong Petrochemical Equipment Fault Diagnosis Key Laboratory, Guangdong University of Petrochemical Technology, China
[email protected], [email protected], {liaohan1989, hanguangjie}@gmail.com, [email protected]
*Corresponding author: Guangjie Han; E-mail: [email protected]
DOI: 10.6138/JIT.2015.16.4.20150511

Abstract
Due to contention among virtual machines (VMs) for shared computing resources, especially shared caches, in datacenters, the cloud computing paradigm inevitably imposes noticeable VM performance overhead on customers. Exploiting both spatial and temporal locality to use the cache efficiently therefore plays an important role in bridging the performance gap between processor cores and main memory. This paper is motivated by two key observations: (1) access behavior is highly non-uniform and dynamic at the cache line level; (2) neither current spatial nor temporal cache management schemes utilize cache capacity efficiently, because they focus exclusively on inter-cache-line behavior and ignore optimization within a cache line. We therefore propose a novel adaptive scheme, called BADP, which uses task behavior to place data so that locality is improved at the cache line level. In the proposed scheme, a cache-line-level monitor captures the access behavior of individual variables and judiciously places variables with similar behavior together, preventing underutilized variables in a cache line from occupying valuable cache capacity. The controller decides on the best placement for all variables. Furthermore, BADP can cooperate with current state-of-the-art cache management schemes.
Keywords: Cache management, Task behavior, Data locality, Cache line, Cloud computing.

1 Introduction
With the ability to scale computing resources on demand and a simple pay-as-you-go business model for customers, cloud computing is emerging as an economical computing paradigm and has gained much popularity in industry [1]. Currently, a number of big companies such as Netflix and Foursquare [2] have successfully moved their business services from dedicated computing infrastructure to Amazon Elastic Compute Cloud (EC2) [3].

Undoubtedly, more customers and enterprises will leverage the cloud to maintain or scale up their business while cutting their budgets; the International Data Corporation (IDC) has reported that the business revenue brought by cloud computing will reach $1.1 trillion by 2015 [4]. Unfortunately, the running performance of virtual machines (VMs) can decrease seriously for many reasons, especially due to contention among VMs for shared resources such as the shared cache. As the gap between memory and processor speeds continues to widen, cache efficiency is an increasingly important component of processor performance. The success of caches has been explained using the concept of locality (either temporal or spatial) [5]. Temporal locality implies that, once a location is referenced, there is a high probability that it will be referenced again soon, and this probability decreases as time passes; spatial locality implies that when a datum is accessed, nearby data are very likely to be accessed soon. Since a cache stores recently used segments of information, the property of locality implies that needed information is likely to be found in the cache. Computer architects have proposed smart cache control mechanisms and novel cache architectures that detect program access patterns and fine-tune cache policies to improve overall cache utilization and data locality. Examples include cache bypassing [6], victim caches [7], stream buffers [7], optimizations of the default least-recently-used (LRU) cache replacement policy [8-10], which include insertion policies [11-14] and promotion policies [15], partitioning of cache resources among applications running in parallel on multi-core processors [16-19], which includes static partitioning [20-21] and dynamic partitioning [22-23], and co-optimization of locality and utility [24-27], which spatially partitions the cache ways among threads and temporally manages the allocated capacity of individual threads in an interactive way, so that the high utility provided by the replacement policies can be exploited.


Although the aforementioned proposals have demonstrated desirable performance improvements, they all presume that the cache line is the most fine-grained unit of optimization. Specifically, they ignore the different access behavior of the variables within one cache line, which holds 64 or 128 bytes of memory in modern CPUs. The phenomenon of variables with different access behavior residing in one cache line is therefore widespread. All current cache management schemes avoid replacing a cache line that contains one frequently accessed variable, because it is likely to be accessed again in the near future; but in the meantime, all the other infrequently accessed variables residing in the same cache line waste valuable cache capacity by occupying it for as long as the frequently accessed variable. To prevent this waste of cache capacity on infrequently accessed data, in this paper we take the different access behavior of data within a cache line into consideration. Based on the variables' access behavior, we rearrange variables in memory, and consequently within cache lines. In this way, variables in the same cache line have nearly the same access behavior, which reduces the cache capacity wasted on holding hardly accessed variables. By combining task behavior, we capture the access behavior of individual variables and judiciously place variables with similar behavior together, preventing underutilized variables in a cache line from occupying valuable cache capacity. Our goal is to optimize cache efficiency at the fine-grained intra-cache-line level through data placement. This paper proposes a behavior aware data placement (BADP) approach that improves cache efficiency in both single-threaded and multi-threaded scenarios. A cache-line-level monitor captures the access behavior of individual variables, mainly their access statistics: occurrences and appearance order in the single-threaded scenario, with sharing information added in the multi-threaded scenario. Based on this behavior, the cache-line-level monitor also judiciously places variables with similar behavior together, mainly by rearranging the variables' memory locations to balance access frequency among the variables in a cache line, thereby reducing the cache capacity wasted on hardly accessed variables. Through the cache-line-level monitor, we obtain a BADP executable program, which we run in the simulator to evaluate our BADP approach against the default approach. Experimental results show that the proposed BADP improves performance by 14.2% on average for single-threaded applications and by 17.5% on average for multi-threaded applications. We also combine BADP with state-of-the-art schemes to evaluate the extended approaches against those recent proposals.
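For illustration only (this example is not part of the BADP implementation, and the field names are hypothetical), consider a record in which one hot counter shares a 64-byte cache line with several rarely read fields; grouping the hot data on its own line keeps every resident byte of that line useful:

#include <cstdint>

// Mixed-behavior layout: the frequently updated `hits` counter shares a
// 64-byte cache line with three rarely read fields, so the cold fields stay
// resident for as long as the hot one and waste most of the line's capacity.
struct MixedBehavior {
    uint64_t hits;            // touched on every request (hot)
    uint64_t created_ts;      // read once at startup (cold)
    uint64_t last_error_code; // read only on failures (cold)
    uint64_t debug_flags;     // read only when tracing (cold)
};

// Behavior-aware placement: hot and cold data are separated onto different
// lines, so evicting the cold line does not disturb the hot one, and the hot
// line no longer carries hardly accessed bytes.
struct alignas(64) HotPart  { uint64_t hits; };
struct alignas(64) ColdPart { uint64_t created_ts, last_error_code, debug_flags; };

BADP aims to obtain this kind of regrouping automatically at compile time from the measured access behavior, rather than relying on manual annotations.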

The main contributions of this work are: (1) Current cache optimizations assume that the cache line is the most fine-grained unit; our key observations show that the distinct access behavior of variables within one cache line hurts cache efficiency, because infrequently accessed variables waste valuable cache capacity. (2) A novel approach called BADP captures the access behavior of individual variables and judiciously places variables with similar behavior together, preventing underutilized variables from occupying valuable cache capacity. (3) BADP is orthogonal to currently proposed cache management schemes and can further optimize cache performance on top of them.
The rest of this paper is organized as follows. Section 2 elaborates on essential background and research motivation and presents typical task behavior, covering both single-threaded and multi-threaded application scenarios. Section 3 introduces the design and implementation of our behavior aware data placement approach in detail. Section 4 introduces our experimental testbed and the implementation details of BADP, together with evaluation results. Related work is discussed in Section 5, and the paper concludes in Section 6.

2 Background & Motivation
Although the least-recently-used (LRU) policy and some other lightweight LRU approximations behave well for exclusive caches, they do not perform well enough for shared caches. Therefore, various strategies have been proposed to make the best use of shared cache capacity. We first introduce data locality, which is the basis of all cache management policies. We then briefly describe the working principles of existing schemes here, and place our research in the broader context of related work in Section 5. Meanwhile, we analyze the weaknesses of the state-of-the-art schemes and provide quantitative evidence supporting our conclusions.
2.1 Data Locality
Caches work on the assumption that data accessed once will usually be accessed again soon. This kind of behavior is well known as data locality. Two kinds of locality are commonly distinguished. Temporal locality means that the program reuses the same data that it recently used, which is therefore likely to be in the cache. Spatial locality means that the program uses data close to recently accessed locations.


Since the processor loads a chunk of memory around an accessed location into the cache, locations close to recently accessed locations are also likely to be in the cache. A program can exhibit both types of locality at the same time. It is common knowledge that good data locality results in satisfying application performance, whereas applications with poor data locality reduce the effectiveness of the cache, causing long stall times while waiting for memory accesses. Improving an application's data locality therefore benefits cache effectiveness. In this paper, we first analyze task behavior in detail and then use that behavior to rearrange data in memory to improve data locality.
2.2 Shared Cache Management Proposals
Current shared cache management proposals can be categorized into alternative replacement policies for locality management and capacity partitioning schemes for utility optimization, as follows.
Locality-Oriented Alternative Replacement Policies: Because the LRU replacement policy favors cache access recency (or locality) only, it can result in thrashing when the working set of a workload is larger than the cache capacity and the access pattern is locality-unfriendly (e.g., a large cyclic working set [12]). Alternative replacement policies, such as TADIP [11], SDBP [28] and NU-cache [29], optimize locality by temporally assigning and adjusting lifetimes for cached blocks.
Utility-Oriented Capacity Partitioning Schemes: The utility of a thread represents its ability to reduce misses with a given amount of allocated SLLC capacity [23]. Although threads may vary greatly in their utility, an LRU-managed SLLC is oblivious to such differences when threads are co-scheduled and their cache accesses are mixed. In response to this shortcoming, several previous studies, such as UCP [23] and PIPP [8], spatially partition the SLLC among competing threads based on utility information captured by per-thread LRU-stack profilers, notably improving performance over the baseline LRU replacement policy.
2.3 Our New Perspective and Its Supporting Experimental Evidence
Both alternative replacement policies for locality management and capacity partitioning schemes for utility optimization are based on the assumption that the cache line is the basic unit. However, a cache line typically holds 64 or even 128 bytes in modern CPUs, which is enough for quite a number of variables.


For instance, a variable of int type occupies only 4 bytes, which means one cache line can hold as many as 16 int variables. All current cache management policies treat an access to any variable in a cache line as an access to the whole line. It may happen that some variables in a cache line are accessed frequently while the others are hardly accessed, yet the hardly accessed variables occupy the cache the whole time, because the cache line is never replaced while it contains frequently accessed variables. Useless variables thus occupy valuable cache space, which leads to a serious waste of resources and low efficiency. By analyzing tasks' memory access behavior, we find that adjacent variables occupying the same cache line are highly non-uniform in access behavior. This non-uniform access behavior within one cache line seriously wastes cache capacity, because hardly accessed variables occupy the cache for as long as the frequently accessed ones. Figure 1 illustrates one such circumstance of low cache effectiveness, in which much of the cache capacity is occupied by hardly accessed variables. In the figure, we assume one cache line contains 4 variables (a cache line can contain more). All 4 variables are fetched into the cache line as soon as one of them is accessed. We record the access count of each variable in a cache line. Under the LRU policy, the whole cache line is moved to the MRU position whenever any of its variables is accessed. Therefore, some hardly accessed variables in a cache line occupy cache capacity for as long as the frequently accessed one, which wastes valuable capacity. For best cache effectiveness, hardly accessed variables should be evicted as soon as possible and frequently accessed ones kept in the cache as long as possible. Figure 1 shows that variables 1, 2 and 3 in cache line A are never accessed, yet they occupy the cache for a long time because of the frequently accessed variable 4; similar cases appear in cache lines B and D. The cache effectiveness therefore needs to be improved.

Figure 1 Hardly Accessed Variables Occupy the Cache Line

Figure 2 shows the access distributions of the benchmarks mcf and gzip.


The x-axis presents the number of accesses to a cache line's content before it is replaced, and the y-axis denotes the corresponding percentage of cache capacity. mcf and gzip represent two different cache access behaviors: one is cache-bound and the other is CPU-bound. Beyond mcf and gzip, we have also analyzed the access distributions of many other benchmarks with different cache access behaviors. Across all of these distributions, we observe that a significant percentage of the cache capacity is occupied by hardly accessed, or even never accessed, data. Moreover, this phenomenon results from hardly accessed data occupying the same cache line as frequently accessed data: the hardly accessed data resides in the cache for as long as the frequently accessed data, because all current cache management policies use the cache line as the basic unit of replacement.
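This kind of distribution analysis can be approximated with a simple trace scan. The following minimal sketch (illustrative only; the toy trace, the 64-byte line size and the 4-byte word granularity are assumptions, not our actual tooling) counts accesses to each 4-byte word within every touched 64-byte line and reports how many resident words are hardly accessed:

#include <array>
#include <cstdint>
#include <iostream>
#include <unordered_map>
#include <vector>

int main() {
    // A toy byte-address trace; in practice this would come from a simulator
    // or from binary instrumentation.
    std::vector<uint64_t> trace = {0x1000, 0x1004, 0x1000, 0x1040, 0x1000, 0x2000};

    // For each 64-byte line, count accesses to each of its 16 4-byte words.
    std::unordered_map<uint64_t, std::array<int, 16>> lines;
    for (uint64_t addr : trace)
        lines[addr / 64][(addr % 64) / 4]++;

    // A word that is brought into the cache but touched at most once wastes
    // the capacity it occupies while its line stays resident.
    int total_words = 0, cold_words = 0;
    for (const auto& entry : lines)
        for (int c : entry.second) {
            ++total_words;
            if (c <= 1) ++cold_words;
        }
    std::cout << "resident words that are hardly accessed: "
              << 100.0 * cold_words / total_words << "%\n";
}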

In order to evict hardly accessed variables as soon as possible and keep frequently accessed variables in the cache as long as possible, we can place variables according to their access behavior, so that all variables in one cache line have similar access behavior. Hardly accessed variables are then replaced quickly, because their cache lines are seldom accessed, while frequently accessed variables remain in the cache, because their cache lines are accessed frequently; this improves cache effectiveness.

Figure 3 Only Frequently Accessed Variables Occupy the Cache Line

(a) Access Distribution of the Benchmark mcf

(b) Access Distribution of the Benchmark gzip
Figure 2 Access Distribution of the Two Different Benchmarks

Figure 3 illustrates the higher cache efficiency after rearranging variables according to their access behavior: only frequently accessed variables occupy cache capacity for a long time, as in cache lines A and B, while hardly accessed variables are quickly replaced. The scenario in which hardly accessed variables stay in the cache only because other variables in the same cache line are frequently accessed now seldom happens. Combining task behavior to guide data placement is therefore a meaningful way to improve cache effectiveness at the cache line level.

3 Behavior Aware Data Placement (BADP)
Although data placement is important for cache efficiency, we normally do not take it into consideration when programming. This neglect mainly stems from two reasons: (1) we cannot predict the access behavior of all data when we start programming; (2) even if we obtained the access behavior of all data in advance, deciding how to place the data to best utilize the cache is too complex, because there is too much data. In order to optimize cache efficiency, this paper proposes behavior aware data placement (BADP), which places data according to its access behavior. Data with similar behavior is placed together, and data with different access behavior is separated. Therefore, data in one cache line has a similar access probability, which prevents hardly accessed data from occupying valuable cache capacity. Moreover, locality is improved, because hardly accessed data is evicted earlier and frequently accessed data is retained longer.
The most important prerequisite for realizing BADP is the access behavior of every data item. The best representation of access behavior would be an access record from runtime, but we cannot use runtime records, for three reasons.


(1) We need to place data before the program runs. (2) The runtime record differs from run to run because of different branch outcomes, different input data, and so on. (3) The running environment differs as well: the occupied cache capacity and memory size, the instruction window, and so on. All of these factors can affect the access behavior of the data. Therefore, the proposed BADP is based on static analysis to represent the access behavior of all data. Although static analysis is not as accurate as a runtime record, it is feasible and has low overhead. Moreover, the static analysis is transparent to the programmer: the whole BADP process happens during compilation. After analyzing the access behavior of every data item, we rearrange the data in memory, placing data with similar access behavior together and separating data with different behavior so that they are not inserted into the same cache line.
BADP adds three main steps to the default compilation process: (1) record the access counts and times of all data after the compiler front end finishes; (2) analyze the access behavior of every data item according to its access counts and times; (3) rearrange the data placement in memory according to the access behavior of every data item.
Figure 4 presents the framework of our BADP approach. We add a cache line monitor to the compilation process of each application to record each variable. Table 1 describes all the recorded information. The variable ID is the key used to access different variables through distinct table entries. The occurrences column records how many times each variable appears in the program; whenever a variable appears, its occurrence count is incremented by 1. The total interval time of appearance accumulates the number of appearances of other variables since the last appearance of this variable. The last appearing time records the position, in the global appearance sequence, of the variable's most recent appearance.

Share is a flag column indicating whether the variable is shared by two or more threads. In particular, occurrences represents how many times a variable is used; total interval time of appearance divided by occurrences represents how frequently a variable is used; and the last appearing time represents when a variable was last used. For example, consider a random execution state in which variable 1's four recorded values are 3, 14, 18 and 0, respectively. They mean that variable 1 has appeared 3 times, that other variables appeared 14 times between its first and third (most recent) appearances, and that all variables had appeared 18 times by the time variable 1 appeared for the third time. Finally, the share flag 0 means that variable 1 is not shared between threads, i.e., it is exclusively accessed by a single thread. Together, this information depicts how frequently variable 1 occurs and when it appears.
3.1 Access Behavior Analysis for Single-Threaded Application
In a single-threaded application no variables are shared, so we only record the occurrences, the total interval time of appearance and the last appearing time of the variables in the program. When counting occurrences, we mainly distinguish two situations. The first is straight-line (sequence) code: counting is easy, we simply add 1 to the corresponding field in Table 1 whenever the variable appears. The other is a loop: we must estimate statically how many times the variable is used, so during compilation we look for loop instructions and use the loop count to add the corresponding number of occurrences to Table 1.
Table 1 All Recorded Information of Each Variable

Variable ID | Occurrences | Total interval time of appearance | The last appearing time | Share?
Variable 1  | 3           | 14                                | 18                      | 0

Figure 4 BADP Approach Framework
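A possible in-memory form of one Table 1 entry is sketched below as a C++ struct; the monitor is built into the compilation process, so this struct and its field names are illustrative only:

#include <cstdint>

// One monitor entry per variable, mirroring the columns of Table 1.
struct VariableRecord {
    uint32_t variable_id = 0;          // key used to look the variable up
    uint64_t occurrences = 0;          // how many times the variable appears
    uint64_t total_interval_time = 0;  // appearances of other variables summed
                                       // over the gaps between its appearances
    uint64_t last_appearing_time = 0;  // global appearance count at its last use
    bool     shared = false;           // set once a second thread touches it
};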

When counting the total interval time of appearance and the last appearing time, we also distinguish the sequence and loop scenarios. We define a parameter current_val that records how many times all variables have appeared; every time any variable appears, current_val is incremented by one. In the sequence scenario, when a variable appears, its total interval time of appearance is increased by current_val minus its last appearing time, and its last appearing time is then set to current_val.


In the loop scenario, we advance current_val by the loop count, but we compute the total interval time of appearance and the last appearing time only with respect to the last iteration of the loop. If the variable does not appear again later, its last appearing time is set to the current current_val; if it appears again later, its last appearing time is updated to the newer current_val. Table 2 shows the counting algorithm.
Table 2 Counting Algorithm for Single-Threaded Application

begin
1: a variable appears;
2: determine where the variable appears;
3: if it appears in sequence code then
4:   current_val += 1;
5:   appearance number += 1;
6:   total interval time += (current_val - last appearing time);
7: else (it appears in a loop)
8:   current_val += loop number;
9:   appearance number += loop number;
10:  total interval time += (current_val(final) - last appearing time(final));
11: last appearing time = current_val;
end
3.2 Additional Access Behavior Analysis for Multi-Threaded Application
For multi-threaded applications, most of the counting process is the same as for single-threaded applications; we only additionally need to determine whether the variables are shared by several threads belonging to one application. No variable is marked as shared at the beginning; once another thread also uses the same variable, it is recorded as shared and the share field in Table 1 is set to 1. From then on, the variable can be accessed by multiple threads. In order to maintain data consistency, we tentatively employ a mature lock-based mechanism. The lock acquisition and release procedure is as follows: (1) Before accessing the variable, a thread detects whether the variable is locked. (2) If it is, the thread's ID is pushed into a queue and the thread waits until the current transaction finishes; otherwise, the thread can access the variable immediately. (3) When the in-progress access ends, the thread at the head of the queue is popped and granted the variable. (4) If there are no more pending requests in the queue, the lock is released.
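A compilable sketch of the Table 2 counting procedure for the single-threaded case is given below. It is illustrative only and makes simplifying assumptions: a single global appearance counter, statically known loop trip counts, and a record type mirroring Table 1 (see the struct sketch above).

#include <cstdint>
#include <unordered_map>

struct Record {                        // mirrors the Table 1 entry
    uint64_t occurrences = 0;
    uint64_t total_interval_time = 0;
    uint64_t last_appearing_time = 0;
};

std::unordered_map<uint32_t, Record> table;  // variable ID -> record
uint64_t current_val = 0;                    // appearances of all variables so far

// The variable `id` appears in straight-line (sequence) code.
void on_sequence_appearance(uint32_t id) {
    Record& r = table[id];
    current_val += 1;
    r.occurrences += 1;
    r.total_interval_time += current_val - r.last_appearing_time;
    r.last_appearing_time = current_val;
}

// The variable `id` appears in a loop with a statically known trip count.
// Following Table 2, all iterations are charged at once and the interval is
// measured against the final iteration only.
void on_loop_appearance(uint32_t id, uint64_t loop_count) {
    Record& r = table[id];
    current_val += loop_count;
    r.occurrences += loop_count;
    r.total_interval_time += current_val - r.last_appearing_time;
    r.last_appearing_time = current_val;
}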

3.3 Rearrange Data Placement
After the application's data access statistics have been collected, we rearrange the data according to these statistics.
Definition 1: Fre = total interval time of appearance / occurrences. Fre characterizes each variable's access frequency, i.e., how often it is accessed within a fixed period.
Definition 2: Seq = occurrences / Fre. Seq is used to decide whether two data items should be placed together.
We first calculate every variable's Seq according to Definitions 1 and 2, and then rearrange all data according to each variable's Seq.
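A minimal sketch of this rearrangement step is given below, assuming the Table 1 statistics are available: it computes Fre and Seq from Definitions 1 and 2, sorts variables by Seq so that similarly behaving variables become neighbors, and packs them greedily into 64-byte groups. The packing heuristic, the 64-byte line size and the field names are illustrative assumptions rather than the exact BADP algorithm.

#include <algorithm>
#include <cstdint>
#include <vector>

struct VarStat {
    uint32_t id;
    uint64_t occurrences;
    uint64_t total_interval_time;
    uint32_t size_bytes;   // assumed known from the variable's type
};

// Returns groups of variable IDs; each group is intended to share one
// 64-byte cache line, so neighbors have similar access behavior.
std::vector<std::vector<uint32_t>> rearrange(std::vector<VarStat> vars) {
    auto seq = [](const VarStat& v) -> double {
        if (v.occurrences == 0 || v.total_interval_time == 0) return 0.0;
        // Definition 1: Fre = total interval time of appearance / occurrences
        double fre = static_cast<double>(v.total_interval_time) / v.occurrences;
        // Definition 2: Seq = occurrences / Fre
        return static_cast<double>(v.occurrences) / fre;
    };
    std::sort(vars.begin(), vars.end(),
              [&](const VarStat& a, const VarStat& b) { return seq(a) > seq(b); });

    std::vector<std::vector<uint32_t>> lines;
    uint32_t used = 64;                        // force the first group to open
    for (const VarStat& v : vars) {
        if (used + v.size_bytes > 64) {        // start a new 64-byte group
            lines.emplace_back();
            used = 0;
        }
        lines.back().push_back(v.id);
        used += v.size_bytes;
    }
    return lines;
}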

4 Evaluation
4.1 Simulation Environment
We use M5 [30] as our simulation environment and have implemented the relevant caching schemes in it. The PIPP [8] approach mainly adjusts the replacement policy to obtain better cache performance. The UBCP [23] approach (utility-based cache partitioning) partitions the shared cache to reduce interference and improve performance. Our BADP approach can cooperate with current state-of-the-art cache management schemes, so BADP-PIPP and BADP-UBCP denote the combination of BADP with PIPP and UBCP, respectively. In order to verify that BADP works for both single-threaded and multi-threaded applications, we simulate both single-core and 4-core processors; their main parameters are listed in Table 3 and Table 4, respectively.
Table 3 Single Core Processor Parameters

Parameters | Value
L1 D Cache Size | 16 KB 2-way 64-byte line
L1 I Cache Size | 16 KB 2-way 64-byte line
L1 Access Time | 1 cycle
L2 Cache Size | 512 KB 8-way
L2 Access Time | 10 cycles
Memory Access Time | 200 cycles

Table 4 4-Core Processor Parameters

Parameters | Value
L1 D Cache Size | 16 KB 2-way 64-byte line
L1 I Cache Size | 16 KB 2-way 64-byte line
L1 Access Time | 1 cycle
L2 Cache Size | 1 MB 16-way, shared by 4 cores
L2 Access Time | 10 cycles normal, 12 cycles drowsy
Memory Access Time | 200 cycles


In order to evaluate our BADP approach, we run benchmarks from SPEC2006 for the single-threaded scenario and from sysbench [31] for the multi-threaded scenario. In Table 5, the notation number-appname denotes a sysbench workload appname run with the given number of threads; SPEC2006 workloads are listed by appname only.
Table 5 Benchmarks for Both Scenarios

Mixes | Sysbench, SPEC2006
mix1 | gzip, perlbmk, crafty, eon
mix2 | gcc, mgrid, fma3d, swim
mix3 | mcf, mesa, equake, perlbmk
mix4 | gzip, perlbmk, crafty, eon
mix5 | gcc, mgrid, fma3d, swim
mix6 | mcf, mesa, equake, perlbmk
mix7 | 4-sysbench cpu
mix8 | 4-sysbench memory
mix9 | 2-sysbench cpu, gzip, gcc
mix10 | 2-sysbench memory, mcf, gcc

4.2 Cache Efficiency Improvement Analysis
Figure 5 presents the IPC speedup of five cache management policies over the default when running mix1, mix2 and mix3 on a single core. BADP, PIPP-BADP and UBCP-BADP perform considerably better than PIPP, UBCP and the default policy. PIPP and UBCP were proposed to reduce interference in the shared cache, so in a single-core environment they behave much like the default LRU policy. In particular, UBCP has the same performance as LRU, because there is no partitioning on a single core, and UBCP-BADP therefore has the same performance as BADP alone. PIPP is slightly better than LRU because it evicts less-accessed cache lines more quickly. The good performance of BADP, PIPP-BADP and UBCP-BADP comes mainly from more efficient cache utilization. Figure 6 shows the decrease over the default in L1 cache miss rate: BADP, PIPP-BADP and UBCP-BADP decrease the L1 cache miss rate noticeably. Figure 7 shows the decrease over the default in L2 cache references, which BADP, PIPP-BADP and UBCP-BADP also reduce noticeably. Figure 8 shows the increase over the default in L2 cache miss rate. Considering Figures 5 through 8 together, we find that BADP, PIPP-BADP and UBCP-BADP mainly optimize the L1 cache, clearly decreasing the L1 cache miss rate and the number of L2 cache references, while L2 cache utilization drops noticeably.

Figure 5 Five Cache Management Policies Speedup over Default in IPC with Single Core

Figure 6 Five Cache Management Policies Decrease over Default in L1 Cache Miss Rate

Figure 7 Five Cache Management Policies Decrease over Default in L2 Cache References


Figure 8 Five Cache Management Policies Increase over Default in L2 Cache Miss Rate

The decrease in L2 cache references indicates that L1 cache efficiency is improved: most useful data can be obtained from the L1 cache. We also measured each mix's stack distance and found that the average stack distance is reduced by 31.4% on average over the five mixes, which indicates that BADP improves application locality. Figure 9 shows, for the SPEC2006 benchmark mcf with BADP, the percentage of cache capacity occupied by content with each access count before replacement.

Figure 9 Occupation Percentage of the Cache for Each Access Count before Replacement

Obviously, the problem that a large fraction of the cache is occupied by hardly accessed data has been almost solved. These results show that BADP improves cache line efficiency. We therefore conclude that BADP's cache efficiency gains derive from optimizing both cache line efficiency and application locality. Moreover, BADP still behaves well when combined orthogonally with PIPP and UBCP, as Figures 5 through 9 show. The reason BADP can be orthogonal to currently proposed cache management schemes is their different optimization goals: existing schemes mainly aim to optimize the hit ratio, whereas BADP optimizes cache line efficiency and application locality.
To demonstrate that BADP is also useful in a multi-core environment, we evaluate BADP on a 4-core platform with both single-threaded and multi-threaded applications. Figure 10 shows the IPC speedup of the five cache management policies over the default when running mixed benchmarks of single-threaded and multi-threaded applications on 4 cores. BADP performs well on 4 cores with mixed single-threaded and multi-threaded applications. In our experiments, UBCP is not as good as the other approaches because of the limited number of ways available in the shared L2 cache. The combined policies, PIPP-BADP and UBCP-BADP, show a clear performance advantage in the figure, which proves that BADP can cooperate well with other cache policies.
4.3 Locality Improvement Analysis
Stack distance [32] is a modest and reasonable metric for characterizing task behavior. Each task has a distinctive stack distance profile, which reflects the task's unique memory access behavior. Figure 11(a) illustrates a typical LRU replacement procedure in a 4-way cache; assume that the four cache lines A, B, C and D map to the same cache set. The stack distance of an access is the number of distinct cache lines accessed since the previous access to the same line.

Figure 10 Five Cache Management Policies Speedup over Default in IPC When Running on 4-Core


(a) A Process of LRU Cache Replacement
(b) Stack Distance Distribution
Figure 11 Stack Distance Profiles

Figure 12 Average Stack Distance Decreased

An access to a cache line that is not in the cache set is a miss, corresponding to a stack distance greater than 4. Figure 11(b) illustrates the stack distance profile: the x-axis gives the stack distance of each access and the y-axis the corresponding percentage of accesses. In the figure, accesses with stack distances of 1, 2, 3 and 4 hit in the cache, while accesses with stack distance greater than 4 miss. If most of a task's stack distances exceed the cache associativity, the task exhibits poor locality; if most of its stack distances are smaller than the associativity, the task exhibits moderate locality, which yields higher efficiency than poor locality. The average stack distance of a thread reflects the thread's locality: the smaller the average stack distance, the better the locality. Figure 12 shows the decrease in average stack distance when using BADP compared to the default method. From the figure, we can see that BADP effectively decreases the benchmarks' stack distances, which means that BADP effectively improves thread locality and thus enhances cache efficiency.
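A stack distance profile such as Figure 11(b) can be computed with a standard LRU-stack simulation over a trace of cache line addresses. The sketch below is illustrative only and uses the common convention that a stack distance of 1 means immediate reuse, so an access hits in a 4-way set when its stack distance is at most 4.

#include <cstdint>
#include <iostream>
#include <list>
#include <map>
#include <vector>

int main() {
    std::vector<uint64_t> trace = {1, 2, 3, 1, 4, 2, 5, 1};  // toy line-address trace
    std::list<uint64_t> stack;          // LRU stack, most recently used at the front
    std::map<int, int> histogram;       // stack distance -> number of accesses
    int cold = 0;                       // first-time (cold) accesses

    for (uint64_t line : trace) {
        int depth = 0;
        auto it = stack.begin();
        for (; it != stack.end(); ++it, ++depth)
            if (*it == line) break;
        if (it == stack.end()) {
            ++cold;                     // never seen before: no stack distance
        } else {
            ++histogram[depth + 1];     // distance 1 = reuse of the MRU line
            stack.erase(it);
        }
        stack.push_front(line);         // the accessed line becomes MRU
    }

    for (const auto& entry : histogram)
        std::cout << "stack distance " << entry.first << ": "
                  << entry.second << " accesses\n";
    std::cout << "cold accesses: " << cold << "\n";
    // In a 4-way set, accesses with stack distance <= 4 would be hits.
}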

5 Related Work
Cache-based approaches are considered among the most efficient methods for maximizing system performance. In order to improve cache efficiency, many cache replacement and partitioning policies have been proposed.

Generally, there are three categories of optimization approaches.
Cache replacement policy, comprising insertion and promotion policies: The insertion policy determines where in the eviction priority order (e.g., the LRU stack) a line should initially be installed [11-12]. The promotion policy determines how the eviction priority should change on a cache hit [15]. For example, promotion/insertion pseudo-partitioning (PIPP) [8] assigns each partition a different insertion position in the LRU chain and promotes lines slowly on hits (e.g., promoting by about one position per hit instead of moving the hit line to the head of the LRU chain). With an additional mechanism that restricts the cache pollution of thrashing applications, PIPP approximately attains the desired partition sizes. Qureshi et al. proposed the Dynamic Insertion Policy (DIP) cache management scheme [11-12]. They pointed out that when a thread's working set is larger than the available cache size, plain LRU replacement is inefficient, since LRU inserts the new cache line at the MRU position of the recency stack, which causes cache misses in this situation. DIP inserts the new cache line at the LRU position of the recency stack, which guarantees that lines in the other positions are retained in the cache; DIP is therefore better than LRU when a thread's working set exceeds the available cache size. However, DIP leaves the promotion policy unaddressed.
Cache partitioning policy, either static or dynamic: Cache partitioning allocates a cache between concurrently executing processes in order to counteract the effects of inter-process conflicts. Stone et al. [20] investigated optimal (static) partitioning of cache memory between two or more applications when information about the change in misses for varying cache sizes is available for each competing application. However, such information is hard to obtain statically for all applications, as it may depend on the application's input set.


Sanchez and Kozyrakis [21] proposed a scalable and efficient fine-grained cache partitioning approach named Vantage. Vantage works by matching the insertion (churn) and demotion rates of each partition, thus keeping partition sizes approximately constant; it partitions most of the cache and uses an unmanaged region to eliminate inter-partition interference and keep the implementation simple. Vantage is derived from analytical models, which allows it to provide different degrees of isolation by varying the size of the unmanaged region. Deayton and Chung [19] created an additional partition that can be shared among all processors; underutilized areas of the cache are identified by a monitoring circuit and used for the shared partition, where underutilization is detected from the number of unique set accesses for a given allocated way.
Cache reservation policy: Wang et al. [33] showed that the threads of a single multithreaded application can exhibit poorly balanced performance, and proposed a dynamic cache reservation scheme that redistributes the reserved cache space to the critical thread to speed it up while the application runs. NU-cache [29] logically partitions the ways of a cache set into MainWays and DeliWays. While all lines have access to the MainWays, only lines brought in by a subset of delinquent PCs, chosen by a PC selection mechanism, are allowed to enter the DeliWays; the PC selection mechanism is a cost-benefit analysis that uses Next-Use information to select the set of PCs that maximizes the hits experienced in the DeliWays. ZCache [34] is a cache design that allows much higher associativity than the number of physical ways. Sandberg et al. [35] describe an application classification framework that predicts how applications affect each other when running on a multicore, together with a method for finding non-temporal memory accesses.
In summary, all of these studies optimize cache efficiency with the cache line as the basic unit and consider only aggregate access statistics, ignoring task behavior inside cache lines. In this paper, we take the task behavior within cache lines into account in the behavior analysis, in order to construct a sound framework covering both inter-cache-line and intra-cache-line behavior.

6 Conclusion
In this paper, we propose a behavior aware data placement (BADP) approach that improves cache efficiency by combining data access behavior with data placement to improve locality at the cache line level. In BADP, a cache-line-level monitor captures the access behavior of individual variables.

This behavior mainly covers the variables' access statistics: occurrences and appearance order in the single-threaded scenario, with sharing information added in the multi-threaded scenario. Based on this behavior, the cache-line-level monitor judiciously places variables with similar behavior together, mainly by rearranging the variables' memory locations to balance access frequency among the variables in a cache line, thereby reducing the cache capacity wasted on hardly accessed variables residing in a cache line. Experimental results and analysis demonstrate that the proposed BADP approach improves cache efficiency by promoting data locality. Furthermore, BADP can be integrated with state-of-the-art cache replacement policies. In the future, we will explore whether BADP can also be effective in reducing cache power and energy consumption.

Acknowledgements
This work was supported by the Qing Lan Project, the Natural Science Foundation of Jiangsu Province of China (No. BK20140248), the 2013 Special Fund of Guangdong Higher School Talent Recruitment, the Educational Commission of Guangdong Province, China (Project No. 2013KJCX0131), the Guangdong High-Tech Development Fund (No. 2013B010401035), the 2013 Top Level Talents Project in the "Sailing Plan" of Guangdong Province, the National Natural Science Foundation of China (Grant No. 61401107, 61401147), the Zhejiang Provincial Natural Science Foundation (No. LQ14F020011), and the 2014 Guangdong Province Outstanding Young Professor Project.

References
[1] Fei Xu, Fangming Liu, Hai Jin and Athanasios V. Vasilakos, Managing Performance Overhead of Virtual Machines in Cloud Computing: A Survey, State of the Art, and Future Directions, Proceedings of the IEEE, Vol.102, No.1, 2014, pp.11-31.
[2] Amazon, Customer Success, Powered by the AWS Cloud, http://aws.amazon.com/solutions/case-studies/
[3] Amazon, Amazon Elastic Compute Cloud (Amazon EC2), http://aws.amazon.com/ec2/
[4] John F. Gantz, Stephen Minton and Anna Toncheva, Cloud Computing's Role in Job Creation, 2012, http://www.microsoft.com/en-us/news/features/2012/mar12/03-05cloudcomputingjobs.aspx
[5] Alan Jay Smith, Cache Memories, ACM Computing Surveys, Vol.14, No.3, 1982, pp.473-530.
[6] Teresa L. Johnson and Wen-mei W. Hwu, Run-Time Adaptive Cache Hierarchy Management via Reference Analysis, Proc. of the 24th International Symposium on Computer Architecture, Denver, CO, June, 1997, pp.315-326.
[7] Norman P. Jouppi, Improving Direct-Mapped Cache Performance by the Addition of a Small Fully Associative Cache and Prefetch Buffers, Proc. of the 17th ISCA, Seattle, WA, May, 1990, pp.364-373.
[8] Yuejian Xie and Gabriel H. Loh, PIPP: Promotion/Insertion Pseudo-Partitioning of Multi-core Shared Caches, Proc. of ISCA 2009, Austin, TX, June, 2009, pp.174-183.
[9] Aamer Jaleel, Kevin B. Theobald, Simon C. Steely and Joel Emer, High Performance Cache Replacement Using Re-reference Interval Prediction (RRIP), Proc. of ISCA '10, Saint-Malo, France, June, 2010, pp.60-71.
[10] Mainak Chaudhuri, Pseudo-LIFO: The Foundation of a New Family of Replacement Policies for Last-Level Caches, Proc. of MICRO 42, New York, December, 2009, pp.401-412.
[11] Aamer Jaleel, William Hasenplaugh, Moinuddin Qureshi, Julien Sebot, Simon Steely, Jr. and Joel Emer, Adaptive Insertion Policies for Managing Shared Caches, Proc. of the 17th International Conference on Parallel Architectures and Compilation Techniques, Toronto, Canada, October, 2007, pp.208-219.
[12] Moinuddin K. Qureshi, Aamer Jaleel, Yale N. Patt, Simon C. Steely and Joel Emer, Adaptive Insertion Policies for High-Performance Caching, Proc. of the 34th International Symposium on Computer Architecture, San Diego, CA, June, 2007, pp.381-391.
[13] Samira Khan and Daniel A. Jimenez, Insertion Policy Selection Using Decision Tree Analysis, Proc. of ICCD, Amsterdam, The Netherlands, October, 2010, pp.106-111.
[14] Carole-Jean Wu and Margaret Martonosi, Characterization and Dynamic Mitigation of Intra-application Cache Interference, Proc. of the 2011 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Austin, TX, April, 2011, pp.2-11.
[15] Jonathan D. Kron, Brooks Prumo and Gabriel H. Loh, Double-DIP: Augmenting DIP with Adaptive Promotion Policies to Manage Shared L2 Caches, Proc. of the Workshop on Chip Multiprocessor Memory Systems and Interconnects, Beijing, China, June, 2008, pp.1-9.
[16] Jichuan Chang and Gurindar S. Sohi, Cooperative Cache Partitioning for Chip Multiprocessors, Proc. of the 21st International Conference on Supercomputing, Seattle, WA, June, 2007, pp.242-252.
[17] Jiang Lin, Qingda Lu, Xiaoning Ding, Zhao Zhang and P. Sadayappan, Gaining Insights into Multicore Cache Partitioning: Bridging the Gap between Simulation and Real Systems, Proc. of the 14th International Symposium on High Performance Computer Architecture, Salt Lake City, UT, February, 2008, pp.367-378.
[18] Zhibin Huang, MingFa Zhu and Limin Xiao, Analysis of Allocation Deviation in Multi-core Shared Cache Pseudo-Partition, Proc. of EECM 2011, Beijing, China, December, 2012, pp.463-470.
[19] Peter Deayton and Chung-Ping Chung, Set Utilization Based Dynamic Shared Cache Partitioning, Proc. of the IEEE 17th International Conference on Parallel and Distributed Systems (ICPADS), Tainan, Taiwan, December, 2011, pp.284-291.
[20] Harold S. Stone, John Turek and Joel L. Wolf, Optimal Partitioning of Cache Memory, IEEE Transactions on Computers, Vol.41, No.9, 1992, pp.1054-1068.
[21] Daniel Sanchez and Christos Kozyrakis, Vantage: Scalable and Efficient Fine-Grain Cache Partitioning, Proc. of the 38th Annual International Symposium on Computer Architecture, San Jose, CA, June, 2011, pp.57-68.
[22] G. E. Suh, L. Rudolph and S. Devadas, Dynamic Partitioning of Shared Cache Memory, The Journal of Supercomputing, Vol.28, No.1, 2004, pp.7-26.
[23] Moinuddin K. Qureshi and Yale N. Patt, Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches, Proc. of the 39th Annual International Symposium on Microarchitecture, Orlando, FL, December, 2006, pp.423-432.
[24] Dongyuan Zhan, Hong Jiang and Sharad C. Seth, Locality & Utility Co-optimization for Practical Capacity Management of Shared Last Level Caches, Proc. of the 26th International Conference on Supercomputing, Venice, Italy, June, 2012, pp.279-290.
[25] Dongyuan Zhan, Hong Jiang and Sharad C. Seth, STEM: Spatiotemporal Management of Capacity for Intra-core Last Level Caches, Proc. of the 43rd Annual IEEE/ACM International Symposium on Microarchitecture, Atlanta, GA, December, 2010, pp.163-174.
[26] Dongyuan Zhan, Hong Jiang and Sharad C. Seth, Exploiting Set-Level Non-uniformity of Capacity Demand to Enhance CMP Cooperative Caching, Proc. of the 24th International Parallel and Distributed Processing Symposium, Atlanta, GA, April, 2010, pp.1-10.
[27] Dongyuan Zhan, Hong Jiang and Sharad C. Seth, CLU: Co-optimizing Locality and Utility in Thread-Aware Capacity Management for Shared Last Level Caches, IEEE Transactions on Computers, Vol.63, No.7, 2013, pp.1656-1667.
[28] Samira M. Khan, Daniel A. Jimenez, Doug Burger and Babak Falsafi, Using Dead Blocks as a Virtual Victim Cache, Proc. of the 19th International Conference on Parallel Architectures and Compilation Techniques, Vienna, Austria, September, 2010, pp.489-500.
[29] R. Manikantan, Kaushik Rajan and R. Govindarajan, NUcache: An Efficient Multicore Cache Organization Based on Next-Use Distance, Proc. of the IEEE 17th International Symposium on High Performance Computer Architecture, San Antonio, TX, February, 2011, pp.243-253.
[30] Nathan L. Binkert, Erik G. Hallnor and Steven K. Reinhardt, Network-Oriented Full-System Simulation Using M5, Proc. of the Sixth Workshop on Computer Architecture Evaluation Using Commercial Workloads, Anaheim, CA, February, 2003, pp.1-9.
[31] Alexey Kopytov, SysBench: A System Performance Benchmark, 2004, http://sysbench.sourceforge.net/index.html
[32] Dhruba Chandra, Fei Guo, Seongbeom Kim and Yan Solihin, Predicting Inter-thread Cache Contention on a Chip Multi-processor Architecture, Proc. of the 11th International Symposium on High Performance Computer Architecture, San Francisco, CA, February, 2005, pp.340-351.
[33] Qing Wang, Zhenzhou Ji, Tao Liu and Suxia Zhu, Dynamic Cache Reservation to Maximize Efficiency in Shared Cache Multicores, Proc. of the First International Conference on Instrumentation, Measurement, Computer, Communication and Control, Beijing, China, October, 2011, pp.208-211.
[34] Daniel Sanchez and Christos Kozyrakis, The ZCache: Decoupling Ways and Associativity, Proc. of the 43rd Annual IEEE/ACM International Symposium on Microarchitecture, Atlanta, GA, December, 2010, pp.187-198.
[35] Andreas Sandberg, David Eklov and Erik Hagersten, Reducing Cache Pollution through Detection and Elimination of Non-temporal Memory Accesses, Proc. of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, New Orleans, LA, November, 2010, pp.1-11.

Biographies
Jianjun Wang is a vice-president of the teaching department of Inner Mongolia Transportation Vocational and Technological College, China. Currently, he is also an associate professor at Inner Mongolia Transportation Vocational and Technological College. His current research interests are cloud computing and computer networks.
Gangyong Jia is currently an Assistant Professor in the Department of Computer Science at Hangzhou Dianzi University, China. He received his PhD degree from the Department of Computer Science, University of Science and Technology of China, Hefei, China, in 2013. His current research interests are operating systems, cache optimization and memory management.
Aohan Li is currently pursuing an MS degree in signal and information processing at Heilongjiang University, China. She received her BS degree in electronic information engineering from Heilongjiang University, China, in 2012. Her current research interests are wireless sensor networks and cognitive radio networks.
Guangjie Han is currently a Professor with the Department of Information and Communication System, Hohai University, Changzhou, China. He received the PhD degree from Northeastern University, Shenyang, China, in 2004. His current research interests include sensor networks, computer communications, mobile cloud computing, and multimedia communication and security.
Lei Shu received his PhD degree from the Digital Enterprise Research Institute, National University of Ireland, Galway, in 2010. In October 2012, he joined Guangdong University of Petrochemical Technology, China, as a full professor. His research interests include wireless sensor networks, multimedia communication, middleware, and security.