YACO: A User Conducted Visualization Tool for Supporting Cache Optimization

Boris Quaing, Jie Tao, and Wolfgang Karl
Institut für Technische Informatik, Universität Karlsruhe (TH)
76128 Karlsruhe, Germany
E-mail: {boris,tao,karl}@ira.uka.de
Abstract. To enhance the overall performance of an application it is necessary to improve its cache access behavior. For this, a cache visualizer is usually needed to fully understand the runtime cache activities and the access pattern of the application. However, merely visualizing what happened does not suffice. More importantly, a visualizer has to provide users with knowledge about the reasons for cache misses and to illustrate how the cache behaves at runtime. This is the goal of YACO (Yet Another Cache-visualizer for Optimization). Unlike existing tools, YACO uses a top-down approach to direct the user step by step to the problem and its solution.
1 Motivation

On modern computer systems there is a large gap between processor and memory speed, and programs with large data sets suffer considerably from the long access latency of main memory. The cache memory serves as a bridge; however, caches are usually not used efficiently. For tackling cache misses a variety of approaches have been proposed, including hardware-based cache reconfiguration [6] and user-level or compiler-based code optimization [7, 2, 3]. The latter is more common due to its straightforward manner. This kind of optimization, nevertheless, requires knowledge about the access pattern of the application and about the runtime cache access behavior. As a direct analysis of the source code is rather tedious and often does not enable a comprehensive optimization, users usually rely on visualization tools to acquire this knowledge. As a consequence, a set of cache visualizers has been developed [9, 4, 12, 1]. However, these tools show only what happened and do not suffice for an exact understanding of the runtime cache and program activities. We therefore developed YACO with the goal of showing how the application and the cache behave, rather than only presenting statistical data. For this, YACO not only visualizes the changing cache contents but also, with more emphasis, depicts all necessary information at a high level, enabling the detection of the reasons for cache misses and of suitable optimization schemes. The resulting visualizer is optimization oriented and user conducted. This means, as illustrated in Figure 1, that users first acquire an overview of the cache access behavior shown by the chosen program.
Fig. 1. Procedure of code optimization with YACO: presentation of cache access behavior → optimization necessary? (if no, stop) → location of hot spots → code analysis / detection of miss reason → conduction of optimization → back to the overview
Based on this overview, the user can determine whether an optimization is worthwhile. In the next step, the access hot spots, which are responsible for the poor cache performance, are located. After that, the reasons for the misses and the interrelations between memory references can be detected using YACO. This information also allows the user to select an appropriate optimization scheme and its parameters, or to design novel algorithms to eliminate the detected cache problem. The impact of the optimization can be observed with YACO after running the optimized code. This process can be repeated until an acceptable cache performance is achieved.

The goal and main contributions of YACO lie in:

– providing the possibility of a detailed analysis of program execution behavior with respect to different cache hierarchies
– enabling the investigation of the reasons for poor cache performance
– supporting compiler developers in the task of implementing automatic optimization mechanisms
– helping programmers to directly apply adequate optimization techniques to the source code
– aiding hardware designers in evaluating the influence of various cache organizations on the execution behavior of applications

The rest of this paper is organized as follows. Section 2 gives a short introduction to related work in this research area. This is followed by a description of the visualization infrastructure and the source of performance data in Section 3. Section 4 then demonstrates how YACO supports the optimization process with various graphical representations. The paper concludes in Section 5 with a short summary.
2 Related Work

For supporting code optimization and helping users to understand the runtime cache access behavior, several visualization tools have been developed. Examples include CVT [9], CACHEVIZ [12], and VTune [4].
Cache Visualization Tool (CVT): CVT [9] is a graphical visualization program. It is composed of a main window, which shows the current content of the whole cache memory, and a second window presenting statistics on cache misses in terms of arrays. It relies on cache profilers to acquire cache performance data. We examine now the appropriateness of CVT for investigating the poor cache performance of an application. CVT provides information about the operations in the cache during a program run and supplies different statistics. However, both the cache content and the statistics address only the current status. For a user it is more important to understand the regular structure of cache misses, and for this not only the current but also the past status of the cache has to be presented in a single view; CVT shows only the former. A further shortcoming of CVT is that it targets only a single cache. Although each cache level can be observed separately, the interaction between the different caches in the cache hierarchy is lost. In addition, CVT shows the whole cache, making it difficult to focus on a specific region where an especially critical problem exists. This difficulty increases with the size of the observed cache.

CACHE behavior VIsualiZer (CACHEVIZ): CACHEVIZ [12] intends to show the cache access behavior of a complete program in a single picture. In total it provides three views: the density view, the reuse distance view, and the histogram view. Within the first two views, each memory reference is presented as a pixel, and horizontally adjacent pixels indicate consecutive memory accesses. The character of a reference, i.e. a hit or a kind of miss, is distinguished by color. Similarly, within the reuse distance view accesses with a long reuse distance are highlighted with a specific color. The third view aims at enabling an analysis of access regularity. For this it shows the different values of a chosen metric, e.g. the reuse distance, in a single diagram. The reuse distance histogram, for example, lists the various reuse distances on the x-axis and, for each reuse distance, shows on the y-axis the number of accesses with this reuse distance.

VTune Performance Analyzer: The Intel VTune Performance Analyzer [4] is regarded as a useful tool for performance analysis. It provides several views, like "Sampling" and "Call Graph", to help identify bottlenecks. For cache performance, however, it shows only the number of cache misses. Even though this information can be viewed by process, thread, module, function, or instruction address, it does not suffice for optimizations with respect to cache behavior.

In summary, existing cache visualization tools show limited information about the cache operation, and this information is mostly highly aggregated, allowing only an overview of the complete access behavior or, at best, the detection of access hot spots. For an efficient, comprehensive optimization of cache locality, a more capable cache visualizer is needed. Therefore, we implemented this user conducted and optimization oriented visualization system in order to show the various aspects of the runtime cache behavior. Using a top-down visualization procedure, it directs the user step by step to the concrete object and technique for optimization.
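For reference, the reuse distance used by CACHEVIZ's views is the number of distinct data blocks touched between two consecutive accesses to the same block, and its histogram can be computed with a simple LRU stack. The following is a minimal sketch of that computation (our own illustration, not CACHEVIZ's code); the block identifiers are assumed to have been extracted from a reference trace beforehand:

```python
from collections import defaultdict

def reuse_distance_histogram(trace):
    """Map each reuse distance to the number of accesses having it.
    First-time accesses have no previous use and are counted under inf."""
    stack = []                       # LRU stack, most recently used block last
    hist = defaultdict(int)
    for block in trace:
        if block in stack:
            # distinct blocks touched since the last access to this block
            distance = len(stack) - 1 - stack.index(block)
            hist[distance] += 1
            stack.remove(block)
        else:
            hist[float("inf")] += 1
        stack.append(block)
    return dict(hist)

# A trace that revisits each of four blocks after touching the other three:
print(reuse_distance_histogram([0, 1, 2, 3, 0, 1, 2, 3, 0]))
# -> {inf: 4, 3: 5}
```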
3 Performance Data and YACO's Software Infrastructure

Performance visualizers usually rely heavily on data acquisition systems which deliver the performance data for visualization. For instance, due to the limited data hardware counters can provide, VTune only presents the number of cache misses. Besides hardware counters, profilers and simulation systems are also often deployed for acquiring performance data. However, existing systems [5, 11] are restricted to statistical overview data, and detailed information, especially that capable of showing the miss reason, is missing. We therefore implemented CacheIn [8], a monitoring and simulation framework, to acquire comprehensive cache performance data.

At the top level, CacheIn is comprised of a cache simulator, a cache monitor, and a software infrastructure. The simulator takes memory references as input and models the complete process of data searching in the whole memory hierarchy with multilevel caches. According to the search result, it generates events for the cache monitor. The events describe both the access type and the cache activity, varying between hit, cold miss, conflict miss, capacity miss, load, and replacement. The cache monitor defines a generic monitoring facility and provides mechanisms to filter and analyze the events. More specifically, it offers a set of working modes allowing the acquisition of different kinds of cache performance data, e.g. access series and cache operation sequences. This data is further delivered to a software infrastructure which produces higher-level information in statistical form, like histograms and total numbers. This forms the basis for YACO to exhibit the cache access behavior in its different aspects.

For visualization, YACO deploys a client-server architecture to coordinate the interaction between its graphical user interface, the target system, the data source, the server, the visualization component, and the user. To use YACO, the user first initiates a simulation run of the examined program via the graphical user interface. After the simulated execution, the cache performance data collected by CacheIn is delivered to the server and processed. Also through the graphical user interface, requests for specific views of interest can be specified and issued to the server. As the core of this infrastructure, the server is responsible for establishing communication, storing and processing performance data, and controlling the visualization. For communication it establishes pipelines connecting both the cache simulator and the graphical user interface: while the former is used to acquire performance data, the latter transfers the user's visualization requests. For control of the visualization it initiates the corresponding function for the selected view and provides the required data.
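To make the event types described above concrete, the following sketch models a single set-associative LRU cache and labels every reference as a hit, cold miss, capacity miss, or conflict miss, using the common convention that a non-cold miss which would also occur in a fully associative cache of equal capacity counts as a capacity miss, and otherwise as a conflict miss. It is only an illustration of the kind of event stream CacheIn delivers, not CacheIn's implementation; the line size, set count, and associativity are assumed values.

```python
from collections import OrderedDict

LINE_SIZE = 64   # bytes per cache line (assumed)
NUM_SETS  = 16   # number of sets (assumed)
ASSOC     = 4    # ways per set (assumed)

class ToyCache:
    """Set-associative LRU cache that labels each access with an event type."""

    def __init__(self, num_sets=NUM_SETS, assoc=ASSOC):
        self.num_sets, self.assoc = num_sets, assoc
        self.sets = [OrderedDict() for _ in range(num_sets)]  # per-set LRU order
        self.fully = OrderedDict()      # fully associative cache of equal capacity
        self.capacity = num_sets * assoc
        self.seen = set()               # blocks that have been referenced before

    def access(self, address):
        block = address // LINE_SIZE
        ways = self.sets[block % self.num_sets]

        # update the fully associative reference cache (for the 3C classification)
        fa_hit = block in self.fully
        self.fully[block] = None
        self.fully.move_to_end(block)
        if len(self.fully) > self.capacity:
            self.fully.popitem(last=False)

        if block in ways:               # hit
            ways.move_to_end(block)
            return "hit"

        if block not in self.seen:
            kind = "cold-miss"          # never referenced before
        elif not fa_hit:
            kind = "capacity-miss"      # would miss even fully associatively
        else:
            kind = "conflict-miss"      # misses only because of the set mapping
        self.seen.add(block)

        ways[block] = None              # load the block
        if len(ways) > self.assoc:
            ways.popitem(last=False)    # replacement of the LRU way
        return kind

# Count the events of a simple reference trace, similar in spirit to the
# statistics from which YACO's overview is built.
cache, counts = ToyCache(), {}
for addr in (list(range(0, 8192, 8)) * 2):
    kind = cache.access(addr)
    counts[kind] = counts.get(kind, 0) + 1
print(counts)
```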
4 Visualization to Support Optimization

In Figure 1 we outlined an optimization process towards efficient cache performance. This section describes how the steps shown in Figure 1 are supported by YACO using a variety of graphical representations.

4.1 Performance Overview

First, the cache access behavior of the examined program must be observed in a highly aggregated form, allowing the user to estimate the potential performance gain of an optimization. For this, YACO provides an overview diagram, which is illustrated on the left side of Figure 2.
Fig. 2. YACO graphical view: Total Overview (left) and Variable Miss Overview (right)
As can be observed, this view contains two kinds of diagrams for each cache location in the whole cache hierarchy: a column diagram and a circle diagram. The column diagram presents statistics on cache hits and misses. Besides the total number, the cache misses are further broken down into cold, conflict, and capacity misses. This means the height of the first column (total accesses) is the sum of the following two columns (total hits and total misses), and the sum of the last three columns forms the total misses (the third column). Using colored columns, this diagram allows the user to examine the overall cache performance exactly and to decide whether an optimization has to be performed. Similar to the column diagram, the circles also aim at showing a contrast between various metrics concerning memory references. While the left circle shows how the cache behaves overall, with only total hits and misses presented, the right circle shows the proportion of each miss type, enabling a deeper insight into the cache misses and highlighting the most critical miss reason.

4.2 Access Hot Spots

In the next step of the optimization, the access hot spots have to be located. These are individual data structures, functions, loops, or specific code regions with related data structures, which are responsible for the poor cache performance. For this, YACO provides a set of views.

Variable Miss Overview: First, the Variable Miss Overview allows the user to detect the data structures which cause a significant amount of cache misses. As illustrated on the right side of Figure 2, this view presents the miss behavior of all data structures delivered by the cache simulator. The sample view in Figure 2 depicts the Fast Fourier Transformation (FFT) code, an application in the SPLASH-2 benchmark suite [10]. For each data structure, four columns show the statistics on the total misses and on each miss category. The absolute numbers of the misses are depicted in the bottom left diagram, while the meaning of each column is explained in the bottom right diagram. Overall, the Variable Miss Overview allows the user to detect the data structures that have to be optimized, e.g. x and trans in the FFT code.

Data Structure Information: With the help of the Variable Miss Overview, data structures with large numbers of cache misses are identified. However, users still need knowledge
Fig. 3. YACO graphical view: Data Structure Information (left) and its Zoom-in (right)
about the access behavior of, e.g., elements and data blocks within a data structure. This helps to detect regular access structures and also to make the hot spots more concrete. For this goal the view "Data Structure Information" is implemented.

As shown in Figure 3, Data Structure Information contains two diagrams. While the top one covers the whole memory system and presents hits, the bottom one targets only the caches and shows the miss behavior. Figure 3 shows array x in the FFT code. Within the upper diagram (left picture in Figure 3), the x-axis shows all data blocks (of cache line size) in the selected data structure, with their numbers labeled at regular intervals. For each data block, the y-axis gives the number of access hits at a location, which can be the L1 cache, the L2 cache, or the main memory. The sum of these hits over all locations forms the total number of runtime accesses to the corresponding data block. The locations are identified with different colors, enabling an easy detection of regions where most accesses go to the main memory. The lower diagram presents the information about misses in a similar way. The concrete view in Figure 3 uses separate diagrams for each cache location: L1 at the bottom and L2 at the top. This allows the misses in each individual cache to be observed more clearly. For large data structures it is difficult to detect the hot spots if all data blocks are presented in the same diagram. Hence, YACO implements a zoom function for this view, where a range of blocks can be specified via the user interface and only the selected blocks are illustrated. Figure 3 (right) gives a sample view showing the misses for blocks 43 to 67 of array x.

3D Phase Information: The views described above cover the whole program. For data structures that are used only once, for example within a single subroutine, the code region containing the accesses to this data is clear. However, often the same data is accessed in different program stages (initialization, computation, etc.) and functions or code regions (we call them phases), and the access behavior usually changes in the different phases of an application. In order to allow the access hot spots to be further narrowed down to a code region, YACO provides the view "3D Phase Information".

The sample view in Figure 4 shows the combined misses (all caches) in the different program phases with respect to the selected data structure, in this case array x in the FFT program. Each phase in this example corresponds to a single function in the program.
Fig. 4. YACO graphical view: 3D Phase Information (left) and its 2D zoom-in (right)
The phases are distributed along the z-axis, each with a 2D diagram showing the miss distribution at the granularity of blocks in the data structure, where the blocks are illustrated on the x-axis and the number of misses to the corresponding block on the y-axis. In addition, the phases are identified using different colors. As in the previous view, the visualization can be focused on a few blocks, enabling a better observation of the chosen data areas. In addition, this 3D diagram can be rotated in any direction, allowing an observation from all sides. Besides, YACO enables a zoom-in on a single phase in order to better observe a critical code region. Figure 4 (right) demonstrates the fifth phase using the same color as in the 3D view.

4.3 Access Pattern and Miss Reason

With the knowledge about the data structures, and more specifically the data blocks, that are responsible for the poor cache performance, in combination with the code regions shown by the phase information, the optimization process, as illustrated in Figure 1, enters its third step: detecting the access pattern of the application and the miss reason. For this, YACO provides another two graphical views.

Variable Trace: The first view is the Variable Trace, which shows the references to the data blocks/elements of a data structure in the order in which they are accessed. Figure 5 (left) demonstrates this for array x in the FFT code. As can be seen, twenty fields are used to show the references to a data structure. These fields are initially empty and are filled after the visualization starts. The first reference is initially presented in the first field and is then moved to the next field after the second reference to the same data structure occurs. For each field the Variable Trace shows the accessed block/element of the data structure and the type of the access, which can be a load operation or a replacement. While the name of the array and the block/element are explicitly printed in the fields, the operation type is indicated by the color of the field. As the references are filled into the diagram in the order in which they are accessed, this view allows the user to detect, e.g., the access stride between array elements and to choose an appropriate prefetching strategy to reduce the load operations and thereby the cold misses.
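The stride reading that the Variable Trace supports can also be expressed as a small computation: given the sequence of accessed element (or block) indices, the dominant difference between consecutive references suggests a prefetch stride. The snippet below is a hedged sketch of this idea, not a feature of YACO, and the example index sequence is hypothetical.

```python
from collections import Counter

def dominant_stride(indices):
    """Most frequent distance between consecutive references to the same
    data structure, plus the fraction of references that follow it."""
    diffs = [b - a for a, b in zip(indices, indices[1:])]
    if not diffs:
        return None
    stride, count = Counter(diffs).most_common(1)[0]
    return stride, count / len(diffs)

# e.g. a column-wise sweep over a row-major array with 100 columns
print(dominant_stride([0, 100, 200, 300, 1, 101, 201, 301]))  # -> (100, 0.857...)
```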
Fig. 5. YACO graphical view: Variable Trace (left) and Cache Set (right)
Cache Set: The second view aims to exhibit the runtime activities of the cache memory. A sample view is shown in Figure 5 (right), where set 12 of the L2 cache is taken as an example. In this example, L2 is configured as a 4-way set-associative cache. Within the diagram, all lines of the selected set are shown in the horizontal direction, each recording the most recent operations, from left to right, on the corresponding cache line. For each operation, the operation type, which can be a load, a replacement, or a hit, and the operation target, i.e. the variable or the block/element of an array, are presented. Operations are inserted into the diagram in the order of their occurrence. The most recently occurring operation is marked with a colored frame, in this case umain2 (12/48). In addition, statistics on replacements are shown at the bottom left, allowing the user to find data blocks that frequently enter and leave the cache.

4.4 Sample Optimization

To demonstrate the feasibility of YACO, we simulated a small code performing the initialization of a two-dimensional array. In order to highlight the problem in the L1 cache, we use a small cache with only four cache lines. The Overview of YACO shows a 100% L1 miss rate for this code. We then examine the Variable Trace view of this code. As shown in the left diagram of Figure 6, every access to the array is a cache miss, either a load or a replacement, even the accesses to elements in the same data block, such as the fourth and seventh accesses to elements 33 and 34, which are both stored in block 4. The reason is, as depicted by the Cache Set view in the right diagram of this figure, that all data blocks needed for the current computation, in this case blocks 4, 8, and 12, are mapped to the same cache line, although the other three cache lines are available. For the optimization we insert padding the size of one cache line into the second dimension of this array, in order to change the mapping of the data into the cache. Due to this padding, most accesses now hit in the cache, as shown in the left diagram of Figure 7. This results in a 47% improvement in the cache hit ratio, as shown on the right side. This figure is a combination of two Overview views of YACO, one for the
Fig. 6. Variable Trace (left) and Cache Set (right) with the sample code
original code and one for the optimized version. It can be clearly observed that the optimization reduces the misses by nearly half (third column).
Fig. 7. Variable Trace after optimization (left) and performance comparison (right)
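The effect of the padding can also be illustrated with a few lines of arithmetic. The paper does not give the exact cache and array parameters, so the sketch below assumes a direct-mapped L1 cache with four 64-byte lines and a row-major array of 8-byte elements whose row spans exactly four cache lines; under these assumptions the start of every row maps to the same line, and padding the second dimension by one cache line spreads the rows over all four lines.

```python
LINE_BYTES = 64      # assumed cache line size
NUM_LINES  = 4       # the toy L1 cache from the example has four lines
ELEM_BYTES = 8       # assumed element size
COLS       = 32      # 32 elements * 8 bytes = 4 cache lines per row

def cache_line_of(row, col, cols):
    """Cache line that element [row][col] of a row-major array maps to
    in a direct-mapped cache with NUM_LINES lines."""
    byte_offset = (row * cols + col) * ELEM_BYTES
    return (byte_offset // LINE_BYTES) % NUM_LINES

pad = LINE_BYTES // ELEM_BYTES   # pad the second dimension by one cache line

for row in range(4):
    print(f"row {row}: line {cache_line_of(row, 0, COLS)} without padding, "
          f"line {cache_line_of(row, 0, COLS + pad)} with padding")
# without padding all rows start in line 0 (conflicts as in Fig. 6);
# with padding rows 0..3 start in lines 0, 1, 2, 3
```

This mirrors the conflict exposed by the Cache Set view, where blocks 4, 8, and 12 all competed for the same cache line.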
5 Conclusion

The performance of applications depends significantly on the efficient use of the cache memory. This efficiency can be enhanced through various optimizations, and for this a visualization tool is often needed to show the runtime cache access behavior. However, existing cache visualization systems show only what happened, with either statistics on cache misses or more detailed information about the changing cache content. This does not suffice for an efficient, comprehensive optimization, for which more information, especially about the reasons for cache misses, is required.

We therefore introduced YACO, an optimization oriented and user conducted cache visualizer. Its goal is to guide the user through the whole optimization process: from deciding whether an optimization is necessary, to finding the exact data structures and code regions to optimize, up to detecting the miss reasons, the access pattern, and further the optimization strategies and their parameters. For this, YACO provides a variety of views showing the overall performance, the access behavior of specific data structures and code regions, the access traces of variables, and the runtime cache activities. The example showing how to use YACO for cache optimization has demonstrated its feasibility.
References

1. R. Bosch, C. Stolte, D. Tang, J. Gerth, M. Rosenblum, and P. Hanrahan. Rivet: A Flexible Environment for Computer Systems Visualization. Computer Graphics, 34(1), February 2000.
2. S. Ghosh, M. Martonosi, and S. Malik. Precise Miss Analysis for Program Transformations with Caches of Arbitrary Associativity. ACM SIGPLAN Notices, 33(11):228–239, November 1998.
3. S. Ghosh, M. Martonosi, and S. Malik. Automated Cache Optimizations using CME Driven Diagnosis. In Proceedings of the 2000 International Conference on Supercomputing, pages 316–326, May 2000.
4. Intel Corporation. Intel VTune Performance Analyzer. Available at http://www.cts.com.au/vt.html.
5. M. Martonosi, A. Gupta, and T. Anderson. Tuning Memory Performance of Sequential and Parallel Programs. Computer, 28(4):32–40, April 1995.
6. P. Ranganathan, S. Adve, and N. P. Jouppi. Reconfigurable Caches and their Application to Media Processing. In Proceedings of the 27th Annual International Symposium on Computer Architecture, pages 214–224, June 2000.
7. G. Rivera and C. W. Tseng. Data Transformations for Eliminating Conflict Misses. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 38–49, Montreal, Canada, June 1998.
8. J. Tao and W. Karl. CacheIn: A Toolset for Comprehensive Cache Inspection. In Proceedings of ICCS 2005, volume 3515 of Lecture Notes in Computer Science, pages 182–190, May 2005.
9. E. van der Deijl, G. Kanbier, O. Temam, and E. D. Granston. A Cache Visualization Tool. IEEE Computer, 30(7):71–78, July 1997.
10. S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta. The SPLASH-2 Programs: Characterization and Methodological Considerations. In Proceedings of the 22nd Annual International Symposium on Computer Architecture, pages 24–36, June 1995.
11. WWW. Cachegrind: a cache-miss profiler. Available at http://developer.kde.org/~sewardj/docs-2.2.0/cg_main.html#cg-top.
12. Y. Yu, K. Beyls, and E. H. D'Hollander. Visualizing the Impact of the Cache on Program Execution. In Proceedings of the 5th International Conference on Information Visualization (IV'01), pages 336–341, July 2001.