Exploiting Program Phases

Priya Nagpurkar
priya@cs.ucsb.edu

May 17, 2004

1 Introduction

As ubiquitous computing continues to evolve, we are faced with the challenge of catering to the varied needs of an ever-increasing pool of diverse devices. Though the constraints and requirements for different devices may vary, performance and software quality are of paramount importance. Understanding program behavior is vital to both optimization and program testing. Recently, there has been increasing interest in exploiting the runtime behavior of programs in dynamic, program-aware optimizations in architectural and runtime systems [34, 2, 6, 4, 28, 33]. A lot of research has gone into characterizing the time varying behavior of programs to better understand and exploit it [30, 33, 10, 13]. Phase analysis of programs is one such characterization. Phase analysis attempts to capture time-varying behavior and repeating patterns in a program's execution by grouping periods of execution that are similar into a single phase. In this paper, we first present a survey on the time varying behavior of programs, its characterization, prediction, and use. Our aim is to exploit the phase behavior of programs in novel and useful ways to enable high performance and software quality. To this end, we identify low-overhead program monitoring as an area of interest, and present a brief survey of extant low-overhead profiling techniques. Finally, we present our work on Phase-aware Remote Profiling and on Visualization and Analysis of Phase Behavior in Java Programs, and end with future directions.

2 Time Varying Behavior of Programs

Program behavior has commonly been abstracted in the form of profiles gathered over the program's execution. More recently, there has been a lot of interest in studying program behavior during different parts of execution. In this section we present a survey on the time varying behavior of programs, its characterization, prediction, and use.

Many researchers have observed, through detailed simulations and time-based profiles, that programs exhibit widely varying behavior during different parts of execution [32, 30, 13]. Program behavior, however, is not entirely random and often shows significant structure. The authors of [30] and [13] periodically gathered various hardware metrics, such as IPC, cache misses, and branch misprediction rate, with the aim of studying low-level program behavior over time and finding any possible correlation between the metrics. Their findings indicate that not only does program behavior change, but it also has periods of stable execution interspersed with transitions. During periods of stable execution, the architectural metrics measured are relatively stable. More interestingly, the metrics transition in unison, though the nature of the transition might differ (that is, the instruction cache miss rate might go up while the IPC goes down). Recognizing the importance of automatically characterizing this behavior in order to exploit it for various purposes (such as reducing simulation time and aiding prediction), various techniques have been developed. We survey these in the next section.

2.1 Characterizing Time Varying Program Behavior

Basic block vectors, instruction working set signatures, and hardware performance counter data have been used to capture and track program behavior. We next describe each technique in brief.

2.1.1 Basic Block Distribution Analysis (BBDA)

Sherwood et al [30, 31] observe that program behavior is directly related to the code that is being executed. They therefore use profiles of a program's code structure, basic block frequencies, to capture phases, and show that the periodicity captured by basic block profiles is shared by low-level architectural metrics. Basic block profiles record how often individual basic blocks are executed (a basic block is a straight-line piece of code with a single entry and a single exit point). Basic block profiles thus present an architecture-independent and relatively easy to generate method of capturing program behavior.

To extract time varying behavior, Sherwood et al collect basic block footprints over fixed intervals, measured in terms of the number of instructions executed, and compare the footprints of different intervals. The footprint is a single n-dimensional vector holding the counts of the individual basic block frequencies, where the dimensionality n is equal to the number of static basic blocks in the program. To account for the fact that larger basic blocks make up larger parts of the execution, the frequencies are weighted by the number of instructions in the basic block. Further, since only the proportions matter, not the absolute values, each value in the vector is normalized by dividing it by the sum of all elements in the vector. To catch changes and repeating patterns, basic block vectors are compared using a vector distance as the similarity measure; specifically, the Manhattan distance is used. The Manhattan distance is calculated as the sum of the absolute values of the element-wise differences, and represents the distance between the two compared vectors if the only paths one can take are parallel to the axes. For normalized vectors, the calculation yields a value between 0 and 2, where 0 indicates complete similarity and 2 indicates complete dissimilarity. The authors use clustering to group similar intervals together and thereby break the program's execution into phases. A phase thus consists of intervals of similar behavior, irrespective of temporal adjacency. Note that the interval size controls the granularity at which phases can be detected. Sherwood et al [30] also introduce similarity matrices as a means of graphically representing phase behavior as captured by basic block distribution analysis. A similarity matrix is an upper triangular N × N matrix, where a point (x, y) represents the similarity between intervals x and y, and N is the number of fixed-duration intervals. Basic block distribution analysis thus presents an architecture-independent and effective way of capturing phase behavior. It is intended to be used offline and was first introduced to find simulation points.
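The comparison is easy to state concretely. Below is a minimal sketch (ours, not code from [30]; the block counts, sizes, and the 0.2 threshold are illustrative) of building a normalized, size-weighted basic block vector and computing the Manhattan distance between two intervals:

```python
def normalize_bbv(block_counts, block_sizes):
    """Weight each basic block's execution count by its size
    (number of instructions), then normalize so elements sum to 1."""
    weighted = [c * s for c, s in zip(block_counts, block_sizes)]
    total = sum(weighted)
    return [w / total for w in weighted] if total else weighted

def manhattan_distance(bbv_a, bbv_b):
    """Sum of absolute element-wise differences; ranges from 0
    (identical distributions) to 2 (completely disjoint)."""
    return sum(abs(a - b) for a, b in zip(bbv_a, bbv_b))

# Two intervals that spend time in the same blocks in similar
# proportions yield a small distance and fall into the same phase.
v1 = normalize_bbv([100, 50, 0], [4, 8, 2])
v2 = normalize_bbv([90, 60, 5], [4, 8, 2])
same_phase = manhattan_distance(v1, v2) < 0.2  # illustrative threshold
```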

2.1.2 Working Set Signatures

Working sets were first introduced in the context of virtual memory pages [9] and represent the collection of pages that a process is working with. Dhodapkar et al [10] state that phase changes are manifestations of changes in working sets. They keep track of the executing program using instruction working set signatures defined over a fixed window size. The working set signature is a lossy compressed representation of the working set, and consists of an n-bit vector formed by mapping working set elements into buckets using a random hash function. To detect a phase change, working set signatures defined over non-overlapping windows are compared using the relative signature distance, where a value of 0 indicates identical working sets and 1 indicates completely disjoint ones. A phase change is said to be detected if the distance exceeds a certain threshold. A phase is then defined as the maximum interval over which the working set remains constant (within the defined threshold). The size of the window controls the granularity at which phases can be detected.

The authors introduced working set analysis to enable adaptive dynamic configuration of microarchitectural features. They also state that, beyond tracking phase changes, the working set size can sometimes be used to choose optimal configurations, and that repeating working sets can be identified and used to reduce re-optimization overhead. To this end, they enhanced an existing dynamic reconfiguration algorithm [28] with working set analysis. The algorithm computes the relative signature distance with respect to the previous signature at the end of each window. On detecting a phase change, it waits until the phase stabilizes before re-tuning and installing an optimal configuration. Finding an optimal configuration is easy if the working set signature has been seen before, since a table of such signatures and their optimal configurations is maintained.
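A small sketch of the signature mechanism, assuming a simple bit-vector representation (the hash, signature width, and 0.5 threshold are illustrative, not the values from [10]):

```python
def signature(working_set, n_bits=1024):
    """Lossy-compress a working set (e.g., instruction block addresses)
    into an n-bit vector by hashing each element to one bit position."""
    sig = 0
    for element in working_set:
        sig |= 1 << (hash(element) % n_bits)
    return sig

def relative_distance(sig_a, sig_b):
    """|A xor B| / |A or B|: 0 means identical working sets,
    1 means completely disjoint."""
    union = sig_a | sig_b
    if union == 0:
        return 0.0
    return bin(sig_a ^ sig_b).count("1") / bin(union).count("1")

# A phase change is flagged when the distance between signatures of
# consecutive non-overlapping windows exceeds a threshold.
THRESHOLD = 0.5  # illustrative
changed = relative_distance(signature({0x400, 0x404}),
                            signature({0x800})) > THRESHOLD
```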

2.1.3 Hardware Performance Counters

Duesterwald et al [13] observe that most adaptive systems are reactive, and stress the importance of prediction in such systems. With this aim, they characterize time varying program behavior using architectural metrics, and exploit it in the design of online predictors. As mentioned earlier, Duesterwald et al [13] periodically gathered various architectural metrics using hardware performance counters on the IBM Power3 and Power4 microarchitectures, using a timer-based interrupt to record the metrics every 10ms. The resulting characterization captures the low-level behavior of the program as seen by the hardware performance counters. Variability over time is measured as the average absolute distance between metric values at adjacent points in the time series. The authors make the following three observations:

• Program behavior varies significantly over time.
• Program behavior is periodic.
• The periodicity of metric behavior is shared across all metrics.

As seen before, Sherwood et al arrive at the same conclusions through detailed simulations and design an architecture-independent way of characterizing program behavior. Duesterwald et al, on the other hand, use low-level metrics to characterize behavior, but exploit the fact that the periodicity is shared across metrics to design cross-metric predictors. The idea behind cross-metric predictors is that the history of a single metric can be used to predict one or more other metrics.
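A sketch of a table-based cross-metric predictor in this spirit (the quantization, history depth, and last-value update policy are our own illustrative choices, not the design from [13]):

```python
class CrossMetricPredictor:
    """A quantized history of one metric (e.g., IPC) indexes a table
    whose entries hold the value of another metric (e.g., cache miss
    rate) last seen after that pattern."""

    def __init__(self, history_len=2, buckets=16):
        self.history_len = history_len
        self.buckets = buckets
        self.history = []   # quantized recent values of the source metric
        self.table = {}     # history pattern -> predicted target value

    def predict(self):
        return self.table.get(tuple(self.history))  # None if unseen

    def update(self, source_value, target_value):
        # Learn: the current pattern led to this target value ...
        if len(self.history) == self.history_len:
            self.table[tuple(self.history)] = target_value
        # ... then shift the quantized source value into the history.
        self.history.append(int(source_value * self.buckets))
        self.history = self.history[-self.history_len:]
```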

The authors design and evaluate both statistical and table-based history predictors based on the accuracy with which metric values are predicted. Prediction accuracy is measured in terms of the mean absolute prediction error, that is, the mean absolute distance between the predicted and actual values. The conclusion drawn is that table-based predictors outperform statistical predictors.

2.1.4 Online Phase Tracking and Prediction

Following earlier work on the time varying behavior of programs [32] and its characterization using basic block distribution analysis [30], Sherwood et al presented an efficient run-time phase tracking and prediction mechanism [33]. Since tracking and comparing basic block vectors imposes a high overhead, an approximation of basic block vectors is used to track and detect changes in the proportions of code being executed. To approximate basic block vectors, branch PCs and the number of instructions executed between branches are captured. (Note that this can be done entirely in hardware and does not need any compiler support.) As in BBDA, the program's execution is broken up into intervals of fixed size. During every interval, each branch PC is hashed and the corresponding counter is incremented by the number of instructions since the last branch. At the end of an interval, the table of counters is compressed to form a smaller footprint, which then represents program execution during the last interval. This footprint is compared with previously generated footprints and stored only if it is found to be unique (that is, if this interval is similar to a previous one, it belongs to that phase and its footprint need not be stored). Thus only one footprint per phase is stored.

To exploit repeating patterns in the program's behavior, the authors present a prediction scheme that predicts what phase the next interval will belong to. They observe that the set of phases seen recently and the durations of those phases are important indicators of the next phase. The authors therefore use a run-length encoding Markov predictor: it uses a run-length encoded version of the phase history to index into a prediction table, and achieves a misprediction rate of 14% on average. Both tracking and prediction can be implemented completely in hardware and require less than 500 bytes of memory.
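The two mechanisms can be sketched in software as follows (a rough approximation: the counter-table size, hash, and history depth are illustrative, and the real design is a hardware structure):

```python
NUM_COUNTERS = 32  # illustrative; the real design uses a small hardware table

def footprint(branch_trace, num_counters=NUM_COUNTERS):
    """Approximate a basic block vector for one interval: hash each
    branch PC to a counter and add the instructions executed since the
    previous branch."""
    counters = [0] * num_counters
    for branch_pc, instrs_since_last_branch in branch_trace:
        counters[branch_pc % num_counters] += instrs_since_last_branch
    return counters

def rle_markov_predict(table, phase_history, depth=2):
    """Run-length encode the phase history into (phase id, run length)
    pairs and use the most recent `depth` runs to index a prediction
    table that maps patterns to the likely next phase."""
    runs = []
    for phase in phase_history:
        if runs and runs[-1][0] == phase:
            runs[-1] = (phase, runs[-1][1] + 1)
        else:
            runs.append((phase, 1))
    return table.get(tuple(runs[-depth:]))  # None if the pattern is unseen
```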

2.1.5 Comparing Different Phase Detection Techniques

Dhodapkar and Smith [11] compare three different phase detection techniques, namely BBDA, instruction working sets, and conditional branch counts. The first two have been described in previous sections. The following comparison metrics are used:

Sensitivity and False Positives: The authors define sensitivity as the fraction of intervals with significant performance changes that were also flagged as phase changes by the phase detection technique. Conversely, if the phase detection technique detects a phase change where there was no significant performance change, it is counted as a false positive (a sketch of both computations appears below). The authors define a 2% change in the metric of interest as a significant change.

Stability and Average Phase Length: Stability is defined as the fraction of intervals that belong to stable phases, whereas average phase length is defined as the number of intervals that are part of stable phases divided by the total number of stable phases. The authors state that stability and phase length are important metrics because they directly affect the effectiveness of reconfiguration algorithms.

Performance Variance: This metric refers to the performance variance within a phase. Ideally, variance within a phase should be very low, since phases represent intervals with similar behavior.

The authors use Receiver Operating Characteristic (ROC) analysis to arrive at distance thresholds for the three techniques (the distance threshold decides whether the two entities being compared are similar enough to belong to the same phase). They then plot an ROC curve, which is the plot of sensitivity vs. false positives for different values of the distance threshold. Stability, phase length, and performance variation are computed at various distance thresholds. The authors conclude that BBDA provides better sensitivity and less variation within phases than the other techniques. Comparing BBDA to working sets, basic block vectors contain more information than instruction working sets: instruction working sets only capture which instructions were touched, whereas BBDA captures the proportions in which different parts of the code execute.
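Given per-interval indicators, the sensitivity and false-positive metrics defined above can be computed as sketched here (the normalization of false positives is one plausible reading, not necessarily the exact definition in [11]):

```python
def sensitivity_and_false_positives(perf_change, detected):
    """perf_change[i]: interval i's metric moved by more than 2%;
    detected[i]: the technique flagged a phase change there. Returns
    the fraction of real changes caught and the fraction of quiet
    intervals falsely flagged."""
    hits = sum(1 for p, d in zip(perf_change, detected) if p and d)
    sensitivity = hits / max(1, sum(perf_change))
    false_alarms = sum(1 for p, d in zip(perf_change, detected) if d and not p)
    fp_rate = false_alarms / max(1, sum(1 for p in perf_change if not p))
    return sensitivity, fp_rate
```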

2.2 Exploiting Phase Behavior of Programs

As mentioned in previous sections, the motivation behind characterizing time varying program behavior has primarily been to enhance adaptive reconfiguration algorithms [10, 33, 28] and to reduce simulation time and effort by using phases to find simulation points [30]. On the software side, some researchers have alluded to the importance of phase behavior in feedback-directed, dynamic optimization [7, 34], while others have used phase shifts as triggers for either re-optimization [19] or flushing data structures, such as the code cache [4]. Explicit phase tracking and prediction, however, has so far only been used in the architectural domain. Hind et al. [16] investigate the fundamental problem of phase shift detection and the various parameters that define phase behavior, with empirical data for Java programs.

3 Profiling Techniques

In the last section, we presented a survey on the time varying behavior of programs, its characterization, prediction, and use. Our intention is to exploit program phases in novel and useful ways, and one of the areas we have identified is low-overhead profiling. Before presenting our work on exploiting phases to design a low-overhead profiling scheme, we survey other low-overhead profiling techniques.

Programs are often optimized based on profiles obtained at runtime. Profiles are simply program characteristics recorded as the program runs; examples include basic block frequency profiles, method hotness profiles (time spent in methods), and path profiles. (Profiles of hardware metrics are also important for understanding and analyzing low-level program behavior.) To obtain profile data, an application must either be instrumented or have its runtime state recorded (such as the call stack or hardware performance counter metrics). The overhead imposed by profiling is often unacceptable, especially in the context of dynamic optimization systems and systems that continuously monitor programs after deployment. Efficient profiling techniques that generate accurate profiles yet impose low overhead are thus of great interest. We next describe some such techniques.

3.1 Arnold's Sampling Technique

In [3], the authors present an online, software-only mechanism for sampling executing code. They use code duplication (methods exist both with and without instrumentation) and counter-based sampling to switch between instrumented and non-instrumented code. The idea is to reduce the time spent in instrumented code while still generating meaningful samples. Under this scheme, a pre-initialized counter is checked and decremented at every method entry and backward branch. Only when the counter reaches zero does execution switch to instrumented code, which samples one acyclic portion of code (the counter is re-initialized at this point). Control is transferred back to uninstrumented code (also called checking code) on the next backward branch or method invocation; thus a single intraprocedural acyclic path is sampled. Counter-overflow-based sampling of events generates samples that are proportional to the frequency of occurrence of those events. The sampling frequency can be altered by setting the counter value appropriately, either to gather more samples (that is, more accurate profiles) or to reduce profiling overhead. The framework can incorporate any type of profile and does not require any hardware or operating system support. The authors implemented the framework in JikesRVM [18], a research Java virtual machine, and their evaluation shows that for sampling method call-pairs and field accesses, high accuracy can be achieved (90-93%) while imposing a low overhead (about 6%).
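A simplified software sketch of this sampling discipline (the structure and counter value are illustrative; the actual system generates duplicated, patched code inside JikesRVM):

```python
class ArnoldRyderSampler:
    """Counter-based switching between checking and instrumented code,
    after the scheme in [3]."""

    def __init__(self, sample_interval=1000):
        self.sample_interval = sample_interval
        self.counter = sample_interval
        self.sampling = False      # are we in the instrumented duplicate?
        self.samples = []

    def check(self, location):
        """Executed at every method entry and backward branch in the
        checking (uninstrumented) code."""
        self.counter -= 1
        if self.counter <= 0:
            self.counter = self.sample_interval  # re-initialize
            self.sampling = True                 # jump to instrumented code

    def instrumented(self, event):
        """Instrumented code records one acyclic path's events; control
        then returns to checking code at the next method entry or
        backward branch."""
        if self.sampling:
            self.samples.append(event)
            self.sampling = False
```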

3.1.1 Improving Arnold's Technique: Bursty Tracing

Hirzel and Chilimbi [17] extend Arnold's scheme to sample longer bursts that can span procedure boundaries and cover multiple iterations of a loop. Instead of returning to the checking code after executing one acyclic burst of instrumented code, they add an additional counter that decides how long execution stays in instrumented code. The authors further reduce overhead by eliminating redundant checks on procedure entries by analyzing recursive cycles, and by removing checks from uninteresting tight inner loops. They evaluated their profiling scheme on x86 binaries using the Vulcan tool for binary instrumentation.

3.2 Ephemeral Instrumentation for Lightweight Profiling

[37] presents another, somewhat similar, profiling scheme that enables hooking and unhooking of profiling code at branch sites. The aim is to sample a small fraction of branch executions to generate an accurate branch-bias profile. The authors state that the predictability of branch behavior obviates extensive sampling, but they do not explicitly make use of this predictability in choosing samples. Under this scheme, instrumentation code is periodically inserted (hooked) to capture a small number of a branch's executions. The two parameters that control the sampling frequency and the number of samples per branch are called periodicity and persistence, respectively. To account for the fact that branch bias might change as the program executes, each static branch site is sampled multiple times. The authors use ephemeral instrumentation to gather branch biases and state that a traditional edge profile can be generated by post-processing this data; a concrete algorithm is, however, not presented. They evaluate their technique by measuring the scheduling accuracy of a superblock scheduler driven by ephemeral profiles versus perfect profiles.
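A sketch of the hook/unhook discipline driven by the two parameters (the bookkeeping here is illustrative; the actual system patches branch sites in the binary):

```python
class EphemeralBranchProfiler:
    """Each static branch site is 'hooked' periodically and unhooked
    again after recording a fixed number of executions. 'periodicity'
    and 'persistence' are the parameters named in [37]; the mechanics
    here are our own simplification."""

    def __init__(self, periodicity=10000, persistence=50):
        self.periodicity = periodicity  # executions between hook-ups
        self.persistence = persistence  # samples taken per hook-up
        self.state = {}                 # site -> (countdown, remaining)
        self.taken = {}                 # site -> [taken?, ...]

    def on_branch(self, site, taken):
        countdown, remaining = self.state.get(site, (self.periodicity, 0))
        if remaining > 0:                       # currently hooked
            self.taken.setdefault(site, []).append(taken)
            remaining -= 1
        else:
            countdown -= 1
            if countdown <= 0:                  # re-hook this site
                countdown, remaining = self.periodicity, self.persistence
        self.state[site] = (countdown, remaining)
```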

3.3 Hardware Support for Profiling

Using hardware support for profiling tasks can significantly reduce the overhead associated with profiling, especially if it reduces the amount of work to be done in software. Researchers have developed profiling schemes that use existing simple hardware features, like hardware performance counters or other debugging features, as well as schemes that depend on specialized, dedicated hardware.

3.3.1 Hardware Performance Counters

Compaq's Continuous Profiling Infrastructure [1] employs sampling-based profiling to continuously monitor production code. The data collection sub-system periodically samples program counters and stores them in a database. The information in the database is later used by a suite of analysis tools to locate time-critical sections of the code and analyze delays incurred by individual instructions. The periodicity with which samples are collected is governed by the processor's performance counter hardware, which counts specified events and generates a high-priority interrupt when the count exceeds a certain value. When such an interrupt is generated, the instruction that caused it is saved along with its context. The overflow value can be varied to vary the sampling frequency. Many such samples are generated as the program executes, enabling the analysis tools to correlate events with instructions.

3.3.2 Profiling on Embedded Devices

[27] presents a lightweight approach that profiles only user-selected parts of an application using the range-breakpoint feature available on many embedded processors. The range-breakpoint feature is provided to enable debugging, and allows the user to specify a start and end address in two programmable registers. Whenever execution goes beyond the specified region, an interrupt is raised. The authors propose to exploit this feature by programming the two registers with a pre-decided set of regions and modifying the interrupt handler to collect hardware performance counter data for each of the user-specified regions.

3.3.3 Dedicated Hardware

Zilles et al [41] propose a programmable co-processor that attempts to overcome much of the overhead of hardware profiling. Such a co-processor is specialized for gathering and filtering profile information, and as such, it passes only relevant information, in a compact format, to the main processor. Using the co-processor speeds up sampling without sacrificing accuracy, since the main processor does not need to be interrupted frequently.

[29] proposes a hardware-software approach aimed at reducing the overhead of profiling by using programmable hardware to capture and compress profile information before passing it to software for analysis and exploitation. Dedicated hardware performs lossy compression on the event stream that forms the profile, using hardware-based low-cost sampling mechanisms. This approach is somewhat similar to the programmable co-processor described above, in that it uses dedicated hardware to reduce the overhead of profiling by pre-processing profile information before passing it on to the software. The authors report a low overhead of 4.5% and an error of 3%, based on analytical models.
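The counter-overflow primitive underlying Section 3.3.1 can be simulated in software as follows (a sketch with an illustrative overflow value; real systems implement this in hardware with an interrupt handler):

```python
def overflow_sampling(pc_stream, overflow=100_000):
    """A counter counts events (here, retired instructions) and
    'interrupts' on overflow, recording the interrupted PC; sample
    counts end up proportional to time spent at each PC."""
    samples, count = {}, 0
    for pc in pc_stream:
        count += 1
        if count >= overflow:                 # interrupt fires
            samples[pc] = samples.get(pc, 0) + 1
            count = 0                         # counter re-armed
    return samples
```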

3.4 Monitoring Program Behavior after Deployment

There has been growing interest in monitoring applications after deployment to gather large amounts of profile data that represent actual use of the software, both for software evolution and for bug isolation and testing. As a step in this direction, a lot of research has recently gone into enabling efficient and effective distributed monitoring of remote software for the purposes of error reporting, bug isolation, and code coverage. Using these techniques, information about an executing program is collected at a user site (the point of execution) and communicated to a centralized location for analysis, debugging, and further application development.

Two extant techniques that perform such application monitoring are residual testing [26] and expectation-driven event monitoring (EDEM) [15]. Residual testing is the process of continually monitoring a program for the fulfillment of test obligations not satisfied prior to deployment; it does not address the issue of reducing the instrumentation required to monitor the residue. EDEM uses software agents deployed over the Internet to collect application-usage data. This approach addresses the problem of monitoring deployed software, but is limited in that it collects the same information from every execution at every site visited and only gathers information about certain events; it cannot collect general profile information.

Another framework that performs remote post-deployment program monitoring and attempts to reduce profiling overhead is software tomography in the GAMMA system [25, 5]. Software tomography is the process of dividing an application into subtasks and then assigning instrumentation across subtasks in an effort to reduce the overhead imposed on any single task. The instrumented tasks are then distributed amongst a large number of users. The technique significantly reduces the overhead per user, since only certain tasks are monitored. The process of assigning instrumented tasks to users is iterative, to enable testing the application for code coverage. The authors show in simulation that there is potential for reducing the monitoring overhead for branch coverage.

The authors of [21] describe a similar sampling infrastructure for bug isolation (as opposed to code coverage) that distributes the overhead of remote profiling across many users of a single application. The system collects samples in a single program instance based on a geometric distribution, which enables a statistically fair random sample and ensures that all events (including rare ones) are accurately represented. The authors show that they are able to introduce a small amount of sampling overhead (instrumented computation and profile communication) per user, yet gather enough information to aid in bug isolation.
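A sketch of that sampling discipline (geometric gaps drawn via the exponential distribution; the rate is illustrative, and the real system compiles this countdown into fast-path/slow-path code):

```python
import random

class GeometricSampler:
    """The gap until the next sampled event is drawn from a geometric
    distribution, which makes every event (however rare) equally
    likely to be chosen, in the spirit of [21]."""

    def __init__(self, rate=1 / 1000):
        self.rate = rate
        self.countdown = self._next_gap()

    def _next_gap(self):
        # Geometric(rate) approximated via the exponential distribution.
        return int(random.expovariate(self.rate)) + 1

    def should_sample(self):
        self.countdown -= 1
        if self.countdown == 0:
            self.countdown = self._next_gap()
            return True
        return False
```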

4 Our Work

This section includes the relevant work we have done so far. [24] presents a profiling scheme that exploits program phases to reduce the overhead of profiling while still generating accurate profiles. [23] couples existing techniques used primarily in the architecture community into a set of unifying tools to study phase behavior in Java programs. Both papers are included in the appendix.

5 Summary and Future Directions

This paper aimed to provide enough background to enable us to think of ways in which program phases can be exploited innovatively and usefully. To this end, we first presented a survey of the time-varying behavior of programs, its characterization, prediction, and use. Identifying low-overhead profiling as an area of interest, we then presented a brief survey of efficient profiling techniques. Finally, we presented our recent work on phase-aware remote profiling and on phase behavior in Java programs. Our work in both areas, remote profiling and Java runtimes, is fairly preliminary but shows promise. The following paragraphs discuss future work in both areas.

Phase-aware Remote Profiling

As a step towards our vision of low-overhead remote, distributed profiling, we presented a phase-aware profiling scheme capable of generating accurate profiles with relatively low overhead. We used basic block frequency profiles to evaluate our technique. Several questions remain to be answered. Among them are:

• Does phase-aware profiling work equally well for all profile types?
• Where else could it be applied?
• Can phase profiles be used for bug isolation and test coverage?
• Can phase profiling be used in a distributed profiling infrastructure?

Exploiting Phases in Java Programs

Our preliminary investigation indicated that Java programs also exhibit phase behavior. The lack of a simulation environment and of a lightweight phase tracking or phase shift detection algorithm that can be implemented within a Java virtual machine has hampered further efforts to exploit phase behavior in Java programs. To begin with, it would be useful to find optimizations (compiler or runtime) that would benefit from knowledge about phase behavior. If we were to find such optimizations, an effort to develop lightweight phase tracking schemes would be justified.

References

[1] J. Anderson, W. Weihl, L. Berc, J. Dean, S. Ghemawat, M. Henzinger, S. Leung, R. Sites, M. Vandevoorde, and C. Waldspurger. Continuous Profiling: Where Have All the Cycles Gone? ACM Transactions on Computer Systems (TOCS), 15(4):357-390, 1997.

[2] M. Arnold, S. Fink, D. Grove, M. Hind, and P. Sweeney. Adaptive optimization in the Jalapeño JVM. In ACM SIGPLAN Conference on Object-Oriented Programming Systems, Languages, and Applications (OOPSLA), October 2000.

[3] Matthew Arnold and Barbara G. Ryder. A framework for reducing the cost of instrumented code. In SIGPLAN Conference on Programming Language Design and Implementation, pages 168-179, 2001.

[4] Vasanth Bala, Evelyn Duesterwald, and Sanjeev Banerjia. Dynamo: a transparent dynamic optimization system. ACM SIGPLAN Notices, 35(5):1-12, 2000.

[5] J. Bowring, A. Orso, and M. Harrold. Monitoring Deployed Software Using Software Tomography. In Proceedings of the ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering, pages 2-9, 2002.

[6] M. Cierniak, G. Lueh, and J. Stichnoth. Practicing JUDO: Java Under Dynamic Optimizations. In SIGPLAN Conference on Programming Language Design and Implementation, pages 13-26, June 2000.

[7] S. Clarke, E. Feigin, W. Yuan, and M. Smith. Phased behavior and its impact on program optimization.

[8] J. Dean, J. Hicks, C. Waldspurger, W. Weihl, and G. Chrysos. ProfileMe: Hardware support for instruction-level profiling on out-of-order processors. In International Symposium on Microarchitecture, pages 292-302, 1997.

[9] Peter Denning. Working sets past and present. IEEE Transactions on Software Engineering, 1980.

[10] A. Dhodapkar and J. Smith. Managing multi-configuration hardware via dynamic working set analysis. In 29th Annual International Symposium on Computer Architecture, May 2002.

[11] A. Dhodapkar and J. Smith. Comparing program phase detection techniques. In 36th Annual International Symposium on Microarchitecture, December 2003.

[12] E. Duesterwald and V. Bala. Software Profiling for Hot Path Prediction: Less is More. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2000.

[13] E. Duesterwald, C. Cascaval, and S. Dwarkadas. Characterizing and predicting program behavior and its variability. In International Conference on Parallel Architectures and Compilation Techniques, September 2003.

[14] S. Fink and F. Qian. Design, Implementation and Evaluation of Adaptive Recompilation with On-Stack Replacement. In International Symposium on Code Generation and Optimization (CGO), March 2003.

[15] D. Hilbert and D. Redmiles. Extracting usability information from user interface events. ACM Computing Surveys, 32(4):384-421, 2000. http://www.ics.uci.edu/~dhilbert/papers/.

[16] M. Hind, V. Rajan, and P. Sweeney. The Phase Shift Detection Problem is Non-Monotonic. Technical Report RC23058, IBM Research, 2003.

[17] M. Hirzel and T. Chilimbi. Bursty tracing: A framework for low-overhead temporal profiling. In Fourth ACM Workshop on Feedback-Directed and Dynamic Optimization (FDDO-4), 2001.

[18] IBM Jikes Research Virtual Machine (RVM). http://www-124.ibm.com/developerworks/oss/jikesrvm.

[19] T. Kistler and M. Franz. Continuous program optimization: A case study. ACM Transactions on Programming Languages and Systems, 25(4):500-548, 2003.

[20] C. Lee, M. Potkonjak, and W. Mangione-Smith. Mediabench: A tool for evaluating and synthesizing multimedia and communications systems. In International Symposium on Microarchitecture (Micro-30), pages 330-335, 1997.

[21] B. Liblit, A. Aiken, A. Zheng, and M. Jordan. Bug Isolation via Remote Program Sampling. In SIGPLAN Conference on Programming Language Design and Implementation (PLDI), 2003.

[22] A. Madison and A. Bates. Characteristics of program localities. Communications of the ACM, 19(5):285-294, 1976.

[23] Priya Nagpurkar and Chandra Krintz. Visualization and analysis of phased behavior in Java programs. In ACM International Conference on the Principles and Practice of Programming in Java (PPPJ), 2004.

[24] Priya Nagpurkar, Chandra Krintz, and Timothy Sherwood. Phase-aware remote profiling. Submitted to: Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2004.

[25] A. Orso, D. Liang, M. Harrold, and R. Lipton. GAMMA System: Continuous Evolution for Software After Deployment. In Proceedings of the International Symposium on Software Testing and Analysis, pages 65-69, 2002.

[26] C. Pavlopoulou and M. Young. Residual Test Coverage Monitoring. In Proceedings of the International Conference on Software Engineering, pages 277-284, 1999.

[27] Ramesh V. Peri, Sanjay Jinturkar, and Lincoln Fajardo. A novel technique for profiling programs in embedded systems. In Second ACM Workshop on Feedback-Directed and Dynamic Optimization (FDDO-2), 1999.

[28] R. Balasubramonian, D. Albonesi, A. Buyuktosunoglu, and S. Dwarkadas. Memory Hierarchy Reconfiguration for Energy and Performance in General-Purpose Processor Architectures. In 33rd International Symposium on Microarchitecture, December 2000.

[29] S. Sastry, R. Bodík, and J. Smith. Rapid Profiling via Stratified Sampling. In ISCA, pages 278-289, July 2001.

[30] T. Sherwood, E. Perelman, and B. Calder. Basic block distribution analysis to find periodic behavior and simulation points in applications. In International Conference on Parallel Architectures and Compilation Techniques, September 2001.

[31] T. Sherwood, E. Perelman, G. Hamerly, and B. Calder. Automatically characterizing large scale program behavior. In 10th International Conference on Architectural Support for Programming Languages, October 2002.

[32] Timothy Sherwood and Brad Calder. Time Varying Behavior of Programs. Technical Report, UC San Diego, August 1999.

[33] Timothy Sherwood, Suleyman Sair, and Brad Calder. Phase tracking and prediction. In Proceedings of the 30th International Symposium on Computer Architecture (ISCA'03), 2003.

[34] M. Smith. Overcoming the challenges to feedback-directed optimization. In ACM SIGPLAN Workshop on Dynamic and Adaptive Compilation and Optimization, pages 1-11, 2000.

[35] SpecJVM'98 Benchmarks. http://www.spec.org/osg/jvm98.

[36] M. Tikir and J. Hollingsworth. Efficient Instrumentation for Code Coverage Testing. In International Symposium on Software Testing and Analysis, 2002.

[37] O. Traub, S. Schecter, and M. Smith. Ephemeral Instrumentation for Lightweight Program Profiling. Technical Report, Department of Electrical Engineering and Computer Science, Harvard University, Cambridge, Massachusetts, June 2000.

[38] J. Whaley. A Portable Sampling-Based Profiler for Java Virtual Machines. In Proceedings of the ACM JavaGrande Conference, pages 78-87, 2000.

[39] John Whaley. Partial Method Compilation using Dynamic Profile Information. In ACM SIGPLAN Conference on Object-Oriented Programming Systems, Languages and Applications (OOPSLA), pages 166-179. ACM Press, October 2001.

[40] L. Zhang and C. Krintz. Code Unloading. Technical Report 2003-14, University of California, Santa Barbara, 2003.

[41] Craig Zilles and Gurinder Sohi. A Programmable Co-processor for Profiling. In Proceedings of the 7th International Symposium on High Performance Computer Architecture (HPCA'01), 2001.

Appendix A

Phase-aware Remote Profiling

The explosive growth in Internet bandwidth and availability has precipitated a significant change in the way that software is bought, sold, used, and maintained. Users are no longer a set of disconnected individuals that passively execute software from a shrink-wrapped box; they are often far more involved in the process of software evolution. Users already demand bug fixes, patches, upgrades, forward compatibility, and security updates served to them over the ever-present network. This ubiquity of access offers a new opportunity for software engineers: users can now participate in the software development and evolution process. Specifically, users can dynamically transmit error reports upon program failure and participate in coverage testing and bug isolation while they use their software. Indeed, several systems deployed in modern operating systems already track bugs uncovered by users and report them back to the developers, and researchers have recently proposed even more sophisticated and informative techniques. And while these techniques have proven to be very proficient at uncovering bugs, there has been little work addressing how multi-user environments can be exploited to optimize programs for performance or power consumption. The end goal of our research is to design a distributed optimization system for connected users.

Hand-held battery-powered devices such as personal digital assistants and web-enabled mobile phones have emerged as new access points to the world's digital infrastructure. These systems have limited battery life and increasingly computationally intensive workloads. Designing an aggressive run-time optimization system on such a power-limited platform is very difficult, since there is typically neither the extra space for compilers and optimizers nor the resources to run them on-the-fly. Instead, we are currently building a system that will gather information about a running program, transmit that information to an optimization center for analysis and re-compilation, and then update the code on the end-user system when the opportunity or need arises. This task is further complicated in that a successful system must do all of the above while remaining unobtrusive, consuming the minimum amount of resources, e.g., battery, communication, or processing.

Building a distributed optimization system poses significant challenges due to the sheer number of users, the complexity of their requirements, the heterogeneity of their devices and resource constraints, and the variability of their connectivity. This is especially true in the battery-powered cellular and hand-held software market that we target, where resources are at a premium. The first requirement of such a system is that it be capable of transparently, unobtrusively, and effectively gathering profiles from applications in the wild, once the software has been deployed to users. It is with this topic that this paper is concerned.

While at first glance there may not seem to be a significant challenge in performing such analysis, remote profiling can consume significant resources (computational and communication). Profiles are commonly collected by executing instrumented versions of the software. In a remote environment, we must also communicate this information back to the software development site or optimization center for analysis. This problem of overhead introduction is exacerbated for mobile devices with limited resources, as the performance degradation can translate into significant battery drain.

The key to an efficient implementation of remote profiling is the exploitation of program phases. Using program phase behavior, we can summarize a software system as a minimal but diverse set of program behaviors, in a manner that is distributed, dynamic, efficient, and that accurately reflects overall program behavior. Moreover, while past work has shown that program phase behavior can be effective at reducing simulation time, applying the idea of phases to on-line software profiling creates a whole new set of issues. In this paper, we describe our analysis of the different ways in which remote sampling can be performed. We propose the use of a mixed hardware-software method for general-purpose software profiling that exploits phase behavior to improve efficiency and efficacy. Specifically, this paper presents the following contributions:

• The development of novel, online sampling schemes that achieve very high accuracy, and a comparison of these schemes with other sample-based approaches,

• The empirical evaluation of these techniques for all of the overheads associated with remote profiling on resource-restricted devices (communication, computation, and power),

• An evaluation of various profiling policies for the selection of phase identifiers, and

• An analysis and visualization of cross-input phase analysis issues and an evaluation of its potential for feedback-directed remote profiling.

A.1 Remote Profiling

Profiling an application that is under the control of the developer is commonplace today. In a typical software design cycle, developers test and optimize their application using some set of inputs. In this setting, the overhead introduced by profiling is not of great concern, since the program is being executed solely for the purpose of testing and identification of optimization opportunities. One limitation of this methodology, however, is that the inputs chosen by the developer commonly exercise only a small portion of the program. That is, routinely testing and optimizing for all possible hardware and software configurations and use scenarios is not feasible. These problems are exacerbated by user requirements, preferences, and devices that change over time. To address these issues, we propose to gather profiles from systems that are actually being used by people and that are connected via the network. These profiles will help developers understand how their code is being exercised in the wild, aiding in the creation of user models, assisting in the classification of users into groups that exercise the program in a similar manner, and possibly even enabling the creation of distributed optimization centers.

A.2 Sample-Based Remote Profiling

Upon starting this research, we assumed that the major challenges would lie in the analysis of profile data and its use for optimization. We quickly found that even the initial step of gathering profile data from our target platforms, wireless iPAQ devices, was a significant hurdle. It is this aspect of our system that we examine in this paper.

At first glance, it might appear that we could apply extant techniques from the areas of sampling and architectural performance analysis [12, 38, 3, 18, 6] to our problem of remote profiling. On the surface the problems seem similar, since both are concerned with examining a small subset of a program's execution and using that information to estimate and evaluate the performance of the running application. In fact, these two problems, and the solutions that address them, differ in several significant ways.

The first important difference is that in an online profiling environment, a decision has to be made at each point in time whether or not to profile. Most of the past work requires multiple passes over the data, at least in the worst case. For example, in the SimPoint framework [31], the first pass over the data analyzes the program at a high level by finding regions of execution that are similar to one another; the next step then examines all of the points and picks a small subset of the program's execution for sampling. Statistical sampling techniques suffer from a similar problem in that they can require multiple passes over the data until the statistics of the results stabilize. This is not directly applicable to our problem: once a profile is not taken for some period of execution, there is no going back to re-take it. Furthermore, the decision as to whether a profile should be taken at time t must be made from partial knowledge of the program's execution from time 1 to t. This entails that any technique we develop must not require knowledge of future behavior.

The second difference is that we would like to develop a profile gathering technique that is general-purpose enough to be used for a variety of profiling applications. This creates several complications. It means that we cannot rely solely on hardware performance counters to provide all of the information we need; instead we need more widely applicable software instrumentation. The use of software-based profiling means that, if we are not careful, we can affect that which we are trying to measure. Any system we develop must keep performance overhead to a minimum, both to reduce interference and to avoid degrading the user's perception of program or device performance.

The final key difference is the severely limited resources of the end-user devices that we must employ. It is acceptable for an offline technique to generate gigabytes of trace information and to spend hours analyzing it, but that is not possible on a cell phone or a PDA. Since we must impose very little computation and communication overhead on the users to maintain transparency, we must extract profile data very efficiently. The two major resource issues on a mobile device are consumption of CPU time and battery life. We can address CPU time by carefully selecting when to sample. Conserving battery life is more difficult, since both the computation and the communication required for remote profiling can consume significant power.

[Figure 1 plot: µJ/instruction over instructions executed (billions).]

Figure 1: The figure shows the run-time power usage of the full execution of the program mpeg encode. The program goes through a couple of different phases, marked by periods of high and low power. A random or periodic sampling method (the white triangles) will continue to take samples over the full execution of the program. A more intelligent sampling technique based on phase information (shown as black triangles) can achieve the same error by taking only a couple of key samples from each phase.

A.3 Phase-aware Sampling

The key to enabling efficient, post-deployment, remote collection of accurate profile information is the exploitation of program phase information. Phases can be used to create an intelligent profiling scheme that carefully chooses profiling points online. The way a program's execution changes over time is not random, but is often structured into repeating behaviors, i.e., phases. More specifically, a phase is a set of intervals of dynamic instructions within a program's execution that have similar behavior, regardless of their temporal adjacency. Prior work [30, 31, 33] has shown that it is possible to accurately identify, predict, and create meaningful classifications of phases in program behavior. Phase behavior has been used in the past to reduce the overhead of architectural simulation [31] and to guide online optimizations [10, 13, 33]. In this paper, we employ program phase behavior in a novel way: to enable efficient collection of accurate profile information from remote users.

The advantage we gain by using phase information is that we need only gather information about part of a phase and can then use that information to approximate overall profile behavior. Since the sampled interval will be similar to all other intervals in the phase, it can serve as a representative of those intervals. As such, only representative intervals of the program's phases need be collected (instrumented, communicated, and analyzed) to capture the behavior of the entire program. This makes more efficient use of the limited resources available on mobile devices. Furthermore, these low-overhead profiles will be highly accurate (very similar to exhaustive profiles of the same program).

Figure 1 exemplifies our approach using actual energy data gathered from the execution of the mpeg encoding utility. The execution of mpeg exhibits a couple of distinct phases, and clearly the same behavior repeats multiple times. A random or periodic sampling method will continue to take samples over the full execution of the program regardless of any repeating behavior. In Figure 1, the white triangles show where samples would be taken if sampling is done periodically to achieve an error of 5%. This has the unfortunate drawback that most of the samples provide no new information, because they are so similar to samples seen in the past. A more intelligent sampling technique based on phase information (shown as black triangles) can achieve the same error rate with significantly fewer samples, by taking only a couple of key samples from each phase.
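A minimal sketch of the resulting sampling decision, assuming the phase id of each interval is available from a phase tracker (the per-phase sample budget of two is illustrative):

```python
def phase_aware_samples(interval_phases, samples_per_phase=2):
    """Given the phase id of each fixed-length interval, profile an
    interval only until its phase has already contributed enough
    representative samples."""
    taken = {}
    decisions = []
    for phase in interval_phases:
        sample = taken.get(phase, 0) < samples_per_phase
        if sample:
            taken[phase] = taken.get(phase, 0) + 1
        decisions.append(sample)
    return decisions

# e.g. phases A A A B B A A C -> sample A,A,B,B,C and skip the repeats
print(phase_aware_samples(list("AAABBAAC")))
```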

A.4 Selection of Phase Representatives

The phase tracking hardware that we employ in our system classifies intervals as belonging to a particular phase based on how similar their execution characteristics are. As such, it groups several intervals, within a similarity threshold, into a single phase. The similarity threshold thus governs how many phases are generated and also how similar the intervals within a phase are to each other. A higher threshold value will generate fewer phases with more intervals in each phase; moreover, the intervals within a phase will show more variance. Figure 2 illustrates this fact.

Figure 2: Example of phase selection and identification of effective phase representatives (intervals grouped under thresholds T1 and T2, with T2 < T1).

Assuming that we use a single representative profile for each phase, it is important that we choose the representative well, especially when only a small part of the execution is to be sampled. More specifically, we want to ensure that the profile reflects the steady state of the phase, and that it is a good representative for as many of the other intervals belonging to the phase as possible. In this work, we consider four different alternatives: the first interval (first), the third interval (third), a randomly selected interval (1rand), and the interval closest to the centroid of the phase (best). We also investigate the effect of ignoring small phases by not profiling them at all (thirdi). first and third are simple heuristics that can easily be implemented in any online or offline phase collection system. The intuition behind third is that there is some startup time before a phase reaches steady state. As such, first may not be a good representative for the phase as a whole, since it may not be most similar to this steady state. third addresses this problem by delaying the selection of the representative interval so that it is more likely that a steady state has been reached. thirdi is a modification of third in which phases with fewer than three intervals are ignored. The remaining two techniques require complete knowledge of the intervals that make up a phase; as such, they provide insight into how effective our simple techniques, first and third, are. Using 1rand, we select an interval randomly from the set of intervals that make up a phase and use it as the phase representative. We select best by computing the centroid, given the n-dimensional basic block vectors of each interval, and then choosing the interval nearest to the centroid (where n is the number of static basic blocks).
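A sketch of the best policy (Manhattan distance to the centroid, consistent with the vector comparison used throughout; the data layout is illustrative):

```python
def best_representative(interval_bbvs):
    """Pick the interval whose basic block vector is nearest (Manhattan
    distance) to the phase centroid -- the 'best' policy. By contrast,
    'first' and 'third' simply pick intervals 0 and 2."""
    n = len(interval_bbvs)
    dims = len(interval_bbvs[0])
    centroid = [sum(v[d] for v in interval_bbvs) / n for d in range(dims)]

    def dist(v):
        return sum(abs(a - b) for a, b in zip(v, centroid))

    return min(range(n), key=lambda i: dist(interval_bbvs[i]))
```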

A.5 Feedback-Directed, Adaptive Phase Profiling

Given a large connected user base, we can further reduce the overhead of phase-based remote profiling by providing feedback to users about phase discovery. If a user executes a program using the same input as that used by another user for whom phase data has already been collected, the second execution provides no new information, wasting resources needlessly. As part of our phase-based remote profiling system, we employ phase identifiers as a feedback mechanism to users so that they may avoid unnecessary profiling. A phase identifier is simply a value that uniquely identifies a phase; the hardware PhaseTracker we employ generates these identifiers as the program executes. Moreover, it is these phase identifiers that we communicate as part of the phase profile to indicate which intervals belong to which phases.

To provide dynamic feedback to users, we also communicate these phase identifiers periodically in the reverse direction. The remote profiling system on the user's device then, when a new phase is identified by the PhaseTracker, checks the feedback data to determine whether the phase has already been profiled by some other user. If it has, it avoids sampling and simply communicates the phase identifier. This technique can reduce the overhead of phase-aware remote sampling when users execute the same program with the same inputs. In addition, it may be possible to use this technique when the same phase occurs across inputs.

A more realistic scenario is one in which software is deployed and users execute it with a wide range of diverse inputs. Each input will cause the program to exhibit phased behavior; for some programs, some phases may be the same. To evaluate the potential of feedback-directed remote profiling using phase behavior, we analyzed many different inputs for one of our benchmarks, gsmdecode. Figure 3 shows the similarity matrix for this benchmark. A similarity matrix is a 2-dimensional array over all of the intervals in a program; each entry is the similarity value between two intervals, encoded as a grayscale value with dark values identifying similar intervals (in the same phase), e.g., the points on the diagonal are black since an interval is exactly the same as itself. The x-axis and y-axis of the figure are increasing interval ids. We omit the data in the lower triangle for clarity, since it is symmetric with the upper triangle. We read the figure by selecting a point on the diagonal; each point on the diagonal is an interval. By then traversing the row, we can visualize how similar the row interval is to all others that follow it during execution. Commonly, similarity matrices are used to analyze the execution of a benchmark running a single input [23, 33]. However, we use it here to visualize execution over five different inputs. The dark regions indicate that even across inputs there are many intervals that are similar, i.e., there are phases that span inputs.

Figure 3: Similarity matrix (left) for the gsmencode MediaBench benchmark across 5 different executions using different inputs. On the right is a graph that shows the number of phases identified if we profile this benchmark with each input separately (left bar, broken down by input). The right bar in the graph shows the number of unique phases. There are 10 phases that overlap; as such, there is potential for a dynamic feedback approach in which we communicate phase identifiers to users so that they may avoid profiling and communicating intervals in phases that have been discovered by other users.

The graph on the right in the figure shows the total number of phases across all five executions of gsmdecode. The total height of the left bar is 28, the number of different phases identified if we execute the program with each input individually; the bar is broken up into pieces indicating the number of phases found for each input. The right bar shows the number of intervals that we must sample, across all five inputs, to gather all of the unique phase behavior: 18 phases. The data shows that there are 10 phases (36%) that overlap across the inputs. This indicates that there is potential for reducing the overhead of phase-driven remote profiling further using feedback-directed profile collection. For this benchmark, we can communicate the phase identifiers to the user base as each is discovered by individual users. For programs that execute a phase that has already been identified, we avoid collection and communication of the profile.

A.6 Empirical Evaluation

Having discussed the issues involved in remote profiling, we now present our experimental findings. We begin by describing our experimental methodology, and then evaluate the accuracy and overhead encountered under different performance analysis policies.

A.7 Experimental Methodology Because we are targeting mobile devices such as cell phones and PDAs, we focus our attention on a set of 6 media benchmarks. The benchmarks we used the encoding and decoding programs for mpeg (movie), jpeg (picture), and gsm (voice) from the MediaBench benchmarks [20]; a benchmark suite designed for the empirical evaluation of media applications. We show the basic statistics on

Instr Type               Joules/s   Avg Instr/s (Millions)   Joules/Instr (x 10^-6)
IREG                     0.865      204.790                  0.004
IMEM-R                   0.973      19.462                   0.168
IMEM-Rcache              0.000      137.510                  0.008
IMEM-W                   1.340      11.625                   0.231
FPREG                    0.965      0.439                    2.200
Wireless Card Transmit   Specification: 5V * 0.285A (1.425 W max); Bandwidth: 11 Mb/s; Joules/Byte transferred (x 10^-6): 0.988

Figure 5: Empirical data used to compute energy consumption.

The second column of Figure 4 is the number of static branches in each program, which correlates with the size of the branch profiles generated. The next five columns show dynamic statistics: the number of branches executed (in millions), the number of instructions executed (in millions), the cache miss rate assuming 64K, 4-way set associative instruction and data caches, the energy consumed by executing the program (in Joules), and the execution time (in seconds). Because the inputs provided with MediaBench are very short, and because these applications are typically used in a streaming fashion, it was necessary to find more substantial inputs to analyze the realistic long-term effects of profiling. We will make these inputs available via our web page.

To evaluate our remote profiling framework in simulation, we use SimpleScalar to emulate a StrongARM processor extended with a branch tracking mechanism. We modified the simulator to emulate the capture of phase information as described in [33], with an interval size of 10 million instructions. We then apply the different profiling strategies and examine the difference between a perfect trace and our sampled trace.

To compute energy consumption and execution time, we used a model generated from our actual hardware system. We compute the energy consumed per instruction (including events such as cache misses), the energy consumed per byte transmitted, and instructions per second. The values used are summarized in Figure 5. These values were generated by the authors using an actual HP iPAQ H3835, a wireless card, and hand-coded benchmarks. Using Familiar Linux v0.6.1 on the iPAQ and a logging program that periodically (every 6 seconds) reads the battery voltage and current from external monitors, we calibrated our model and then validated it against a variety of benchmarks. We report the average Joules/s consumed by each of our single-instruction programs (IREG: integer register operations, IMEM-R: load operations, and IMEM-W: store operations). We further distinguish load operations that hit in the cache from those that miss, since the power consumption of each is quite different (as shown in the table). Since stores use a write-through policy on the iPAQ we studied, we did not measure any significant difference between stores that hit the cache and those that miss; as such, we assume that all stores miss the cache. We use the cache miss rate of the executing program to determine which load instructions hit the data cache (and thus are represented by IMEM-Rcache). We compute instructions per second for each benchmark in a similar fashion, using the instructions-per-second measurement of each constituent instruction type. To compute the power consumption for transfer, we computed the number of Joules per byte transferred using the specifications of our wireless card, a Lucent/Orinoco Gold card. These values and the Joules per byte transferred are shown in the final row of the table.
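To illustrate how these measurements combine, the following sketch (ours; the constants are the calibrated Joules-per-instruction values from Figure 5, and all names are hypothetical) estimates the energy consumed by a run from its instruction mix and the bytes it transmits:

    // Sketch of the per-instruction energy model.
    class EnergyModel {
        static final double IREG        = 0.004e-6; // integer register operation
        static final double IMEM_R      = 0.168e-6; // load that misses the cache
        static final double IMEM_RCACHE = 0.008e-6; // load that hits the cache
        static final double IMEM_W      = 0.231e-6; // store (all treated as misses)
        static final double FPREG       = 2.200e-6; // floating-point operation
        static final double PER_BYTE    = 0.988e-6; // wireless transmit, per byte

        // Estimated Joules, given the instruction mix, the load miss rate,
        // and the number of bytes transmitted for profile data.
        static double joules(long iregs, long loads, double missRate,
                             long stores, long fpOps, long bytesSent) {
            return iregs * IREG
                 + loads * missRate * IMEM_R
                 + loads * (1 - missRate) * IMEM_RCACHE
                 + stores * IMEM_W
                 + fpOps * FPREG
                 + bytesSent * PER_BYTE;
        }
    }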

A.8 Results

To evaluate the impact of remote profiling on both power and timing, and to examine the accuracy-versus-overhead tradeoffs, we implemented four different profile collection policies:

- Exhaustive: gather a profile for each interval. This serves as the baseline against which we compare the accuracy of the other policies.
- Periodic Sampling: gather a profile every Nth interval, for N in [3,100].
- Random Sampling: gather a profile for interval i with probability 1/N, for some N.
- Phase-based: gather a profile for every interval that is dissimilar from all previously gathered intervals, given some threshold of similarity.

For the periodic and random techniques, we gather data for different sampling frequencies. The number of intervals profiled, and therefore the percentage of total execution profiled, depends on the sampling period N; we performed experiments for a range of sampling frequencies, which correspond to a range of overheads and accuracies. Because a truly random technique is at the whim of chance as to whether or not it performs well, we characterize two aspects of random profiling for each of the different percent-sampled values: 1) the average error across 10 runs (avg random), and 2) the maximum error seen across 10 runs (max random). To get a range of accuracies and overheads, we adjust the parameter N and examine the effect.
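To make the policies concrete, the following sketch (our formulation; the real phase test is performed by the PhaseTracker hardware, which we approximate here with a Manhattan-distance check) shows the per-interval profiling decision for each sampled policy:

    import java.util.List;
    import java.util.Random;

    // Sketch of the per-interval decision for the three sampled policies.
    class SamplingPolicies {
        // Periodic: profile every Nth interval.
        static boolean periodic(int interval, int n) {
            return interval % n == 0;
        }

        // Random: profile an interval with probability 1/N.
        static boolean random(Random rng, int n) {
            return rng.nextInt(n) == 0;
        }

        // Phase-based: profile only intervals that are dissimilar (distance at
        // or above the threshold) from every previously profiled interval.
        static boolean phaseBased(double[] bbv, List<double[]> profiled,
                                  double threshold) {
            for (double[] rep : profiled) {
                if (manhattan(bbv, rep) < threshold) return false; // known phase
            }
            return true; // new phase: profile this interval
        }

        static double manhattan(double[] a, double[] b) {
            double d = 0;
            for (int i = 0; i < a.length; i++) d += Math.abs(a[i] - b[i]);
            return d;
        }
    }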

For phase-based profiling, we begin with an implementation of the phase prediction system developed by Sherwood et al. [33]. We then pick one interval from each phase to act as the representative of that phase; we implemented various policies, described below, that identify different phase representatives, and we profile the chosen representative from each phase. As we will demonstrate, the policy used to pick this representative can have a large impact on the results. Unlike the random and periodic sampling approaches, there is no sampling frequency variable that can be varied to obtain different tradeoff points between accuracy and overhead. Instead, we can get a similar effect by tuning the similarity threshold. The similarity threshold determines the cutoff point at which two intervals are said to be similar to one another, and hence part of the same phase. As we lower the threshold, more phases are detected, each containing fewer intervals. As this happens, more samples are taken and the accuracy should rise.

Profile Accuracy

The first question to be answered is "given that I can afford to profile X% of the program, how good a profile can I get?" To answer this question, we evaluate profile accuracy as the degree to which a sampled profile is similar to the exhaustive profile. We report accuracy in terms of the percentage error in basic block counts, which we compute as the element-wise difference between a sampled profile and the exhaustive profile; we then divide this value by the total counts in the exhaustive profile to get a percentage error.

Figure 6 shows the percentage error for the different representative selection policies. The x-axis is the percentage of the program that was executed by each sampling technique. Again, we vary the percentage profiled by changing the similarity threshold that the PhaseTracker uses. We show only sample percentages between 0 and 5%, since representative selection matters most when there are few phases and many intervals within each phase (each quite dissimilar to the other intervals in the phase, due to the high similarity threshold). The five policies for representative selection that we studied are: (a) 1random: select the representative at random from all intervals in the phase (we report the performance of this policy as the average over 5 selections); (b) best: select the centroid of the intervals in the phase as the representative; (c) first: select the first interval as the representative; (d) third: select the third interval as the representative, or the first interval if there are fewer than 3 intervals; and (e) thirdi: the same as third, except that if there are fewer than 3 intervals we do not sample any of them (we disregard the phase). As expected, the centroid method, best, performs the best: its error remains low even when we sample very little of the program. First and 1random perform the worst. This happens because the first interval is not representative of the steady state (the phase is just warming up), and because selecting randomly can yield a representative that is "far away", i.e., dissimilar to all others.
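The accuracy metric plotted in Figures 6 and 7 can be stated compactly. The sketch below is our formulation; we assume the sampled profile is first scaled to the exhaustive total, a normalization the description above leaves implicit:

    // Sketch of the profile accuracy (error) computation.
    class ProfileError {
        static double percentError(long[] sampled, long[] exhaustive) {
            double sTotal = 0, eTotal = 0;
            for (long c : sampled) sTotal += c;
            for (long c : exhaustive) eTotal += c;
            double scale = (sTotal == 0) ? 0 : eTotal / sTotal;
            double diff = 0; // sum of element-wise absolute differences
            for (int i = 0; i < exhaustive.length; i++) {
                diff += Math.abs(sampled[i] * scale - exhaustive[i]);
            }
            return 100.0 * diff / eTotal; // percent of total exhaustive counts
        }
    }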

Figure 6: Evaluation of representative selection policies: percentage error in basic block counts versus percent of the program sampled (0-5%) for the 1random, best, first, third, and thirdi policies.

Figure 7: Average error: percentage error in basic block counts versus percent of the program sampled (0-25%) for avg random, max random, periodic, and phase-aware sampling.

Third and thirdi enable accuracy between that of best and first/1random. That is, third is able to select an interval that is more representative of the steady state of the phase than first and 1random. Moreover, third is simple and can be implemented without additional overhead. Thirdi does not perform significantly better than third for the smallest percent sampled. As such, we use third for the rest of the results in the paper.

To evaluate the efficacy of the different sampling techniques that we considered, we calculate error for different percentages of program execution sampled; the higher the percentage, the more accurate the profile should be. The best performing technique is the one that produces the least error for the least percentage of the program sampled. Figure 7 shows the average error across benchmarks, using the third interval of each phase as the phase representative. The graph compares the accuracy of the different sampling techniques: avg random, max random, periodic, and phase aware. The y-axis is error and the x-axis is the percent of the program that was sampled for a given parameterization of each technique. The graph shows that, on average, phase-aware profiling results in significantly lower error for a very small percent sampled. The error for periodic sampling approaches that of phase-aware sampling for larger percent-sampled values. However, we can achieve very high accuracy using phases, e.g., less than 5% error, by sampling a very small amount of the program's execution (4%). To achieve the same accuracy, periodic sampling requires that 11% be sampled, average random sampling requires that 20% be sampled, and max random is never able to achieve an error of less than 5%.

All of the results presented up to this point are averaged across the benchmarks and do not show how performance compares across individual benchmarks. Figure 8 shows the impact of the error on each individual benchmark.

We consider the performance of each technique when we constrain the profiling overhead to less than 1% and less than 5% of execution. In the bottom graph (5%), we can see that under this constraint the basic block error for most programs is quite low (under 10%) for the periodic and random techniques. The only exceptions are jpeg-encode and mpeg-encode, both of which have errors in the range of 30%. The phase-aware technique performs far better across the board; the only benchmark with significant error is mpeg-encode, which has an error of 9%. If we reduce the amount of profiling that can be done by another factor of 5, we get the top graph, in which only 1% of the program can be sampled. At these lean profiling rates, all of the programs have error rates at or above 20% for both periodic and random sampling. For some programs, such as mpeg-encode, almost no useful information can be gathered (the error rate is above 50%). Phase-aware sampling minimizes this effect across all of the benchmarks.

Hot-Block Identification

We next investigate the usefulness of the accuracy that we achieve with phase-aware profiling. One way in which profiles are commonly used is to identify program "hot spots" on which to focus optimization effort (automated or human); identifying hot spots that are not actually hot wastes resources. As such, we evaluate the percentage error in hot-block identification for the different sampling techniques. We accumulated the profiles from each of the sampling techniques into a single basic block vector and sorted it in decreasing order. We then identified the top 10% most frequently executed basic blocks ("hot blocks") in each profile and compared them to the top 10% hottest blocks in the exhaustive profile. We counted the number of blocks that differ from the exhaustive set, normalized the value by the number of blocks in the top 10%, and multiplied by 100. Figure 9 shows the results.
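The hot-block comparison can be sketched as follows (our reconstruction of the procedure just described; the names are ours):

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    // Sketch of the hot-block identification error.
    class HotBlocks {
        static double hotBlockError(long[] sampled, long[] exhaustive) {
            int hot = Math.max(1, exhaustive.length / 10); // top 10% of blocks
            Set<Integer> sampledHot = top(sampled, hot);
            Set<Integer> exhaustiveHot = top(exhaustive, hot);
            int missed = 0;
            for (int b : exhaustiveHot) {
                if (!sampledHot.contains(b)) missed++; // misidentified hot block
            }
            return 100.0 * missed / hot;
        }

        // Ids of the k most frequently executed basic blocks.
        static Set<Integer> top(long[] counts, int k) {
            Integer[] ids = new Integer[counts.length];
            for (int i = 0; i < counts.length; i++) ids[i] = i;
            Arrays.sort(ids, (a, b) -> Long.compare(counts[b], counts[a]));
            return new HashSet<>(Arrays.asList(ids).subList(0, k));
        }
    }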

Figure 8: Error rates (percentage error in basic block counts, per benchmark) for the four different techniques (avg random, max random, periodic, and phase aware) when the profiling overhead is limited to 1% (top graph) or 5% (bottom graph).


Figure 9: Percentage error in hot block identification using the different sampling techniques (error versus percent of the program sampled).

The results are similar to those for the accuracy error: phase-aware profiling is able to identify hot blocks to a much greater degree than the other sampling techniques.

Impact on Power

We next evaluate the impact of phase-based remote profiling on device resources. In particular, since we are interested in making remote profiling of hand-helds feasible, we study the impact on power consumption. We assume a maximum accuracy error of 5%. As mentioned above, to achieve this level of accuracy, phase-aware sampling requires that on average 4% of the program be sampled, periodic sampling requires that 11% be sampled, average random sampling requires that 20% be sampled, and max random is never able to achieve an error of less than 5%.

As such, we focus on the first three techniques. We show how a 5% error translates into energy, computation, and communication overhead in the three sections of Figure 10. In each section, columns 2 and 3 show the average overhead for each metric across benchmarks for periodic and average random sampling; columns 4 and 5 show the percent reduction enabled by phase-based profiling over each of these techniques, respectively. Each section contains two rows of data for the two different communication protocols that we studied. For "At End", we combine the basic block vectors of each profiled interval into a single vector; upon program termination, we compress the vector and transmit it. Using this protocol, phase-aware profiling reduces energy consumption by 75% over random sampling, and reduces computation overhead by requiring 72% fewer instructions for instrumentation than random sampling. With the "At End" approach, the communication cost is the same across profiling techniques. However, we investigated another protocol, in which we compress and transmit the basic block vector after each interval. This protocol reduces the amount of device storage required (which may be highly constrained for real devices); as such, it is a realistic alternative that we should consider. Using this "Interleaved" protocol, phase-based remote profiling can also reduce communication overhead. These results are shown in the second row of each section. The reductions in overhead for energy and computation are similar to those for the "At End" protocol. In addition, phase-based profiling requires that less than a quarter of the number of bytes be transmitted to communicate the same information as the random approach.

Sampling overhead for an accuracy error of 5% (% sampled: Periodic 11%, AvgRand 20%, Phase 4%)

Energy (Joules)
Protocol       Periodic   AvgRandom   Reduction vs Periodic (%)   Reduction vs ARand (%)
At End         7.75       15.02       60.03                       79.38
Interleaved    8.24       16.03       60.13                       79.50

Computation Overhead: Instructions Executed (Millions)
Protocol       Periodic   AvgRandom   Reduction vs Periodic (%)   Reduction vs ARand (%)
At End         265.61     514.20      59.89                       79.28
Interleaved    280.48     545.46      59.06                       78.95

Communication Overhead: Bytes Transferred (Compressed)
Protocol       Periodic   AvgRandom   Reduction vs Periodic (%)   Reduction vs ARand (%)
At End         2217.83    2217.83     0.00                        0.00
Interleaved    27095.50   51683.97    55.38                       76.61

Figure 10: Performance of the sampling methods for a 5% error.
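The two protocols can be sketched as follows (our formulation; send() stands in for the device's wireless interface, and the use of gzip is an assumption on our part, since the text specifies only that profiles are compressed):

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.util.zip.GZIPOutputStream;

    // Sketch of the "At End" and "Interleaved" transmission protocols.
    class ProfileTransport {
        private long[] accumulated; // merged basic block vector ("At End" only)

        // "At End": merge each profiled interval; compress and send once at exit.
        void onProfiledIntervalAtEnd(long[] bbv) {
            if (accumulated == null) accumulated = bbv.clone();
            else for (int i = 0; i < bbv.length; i++) accumulated[i] += bbv[i];
        }
        void onProgramExitAtEnd() throws IOException {
            send(compress(accumulated));
        }

        // "Interleaved": compress and send each profiled interval immediately,
        // trading communication cost for minimal device storage.
        void onProfiledIntervalInterleaved(long[] bbv) throws IOException {
            send(compress(bbv));
        }

        private byte[] compress(long[] v) throws IOException {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
                for (long c : v)
                    for (int s = 56; s >= 0; s -= 8) gz.write((int) (c >>> s));
            }
            return out.toByteArray();
        }
        private void send(byte[] payload) { /* hand off to the wireless card */ }
    }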

Figure 11: Percent energy overhead versus error for the different sampling techniques (avg random, max random, periodic, and phase aware).

Finally, to summarize our results, Figure 11 presents the percent energy overhead imposed on the program versus the resulting error; we used the "At End" protocol for this graph.

A.9 Related Work

Our work builds upon and extends a body of related research on program phase behavior [30, 31, 33, 10, 13, 11, 23, 16]. Our work is novel in that it is the first, to our knowledge, to investigate the efficacy of remote performance profiling. Moreover, we use program phase behavior to significantly improve the efficiency of remote profiling; as such, we make it feasible to gather performance characteristics of software for mobile devices post-deployment.

The other area of research related to our work is that of sample-based profiling. Many other researchers have identified that an entire program need not be profiled to extract accurate execution behavior information from it.

Instead, many sample-based approaches have been proposed [12, 38, 3, ?, 6]. Sample-based profiling is used to gather performance statistics about a program for use on the same device (as opposed to remotely), in compiler and runtime optimization. Extant sample-based profiling techniques that couple hardware support with performance profiling include those that employ hardware performance counters [1, 8] and others that use special-purpose hardware to guide sampling [41, 29]. The work in [29] is somewhat related to the research herein in that it describes a performance profiling approach that couples hardware and software in an attempt to reduce profiling overhead: it uses programmable hardware to capture and compress profile information before passing it on to software for analysis and exploitation. The generated profile takes the form of one or more event streams. Dedicated hardware performs lossy compression on each stream using low-cost, hardware-based sampling mechanisms, thereby reducing the amount of information that the software profiler has to process. This approach is orthogonal to ours in that it uses specialized hardware to capture and pre-process profile information as dictated by the software profiler; we are interested in using specialized hardware to drive our profiling policies.

There are many sample-based, software-only performance profiling techniques, e.g., [12, 38, 3, 2, 6]. These approaches are intended for use within extant dynamic optimization systems. Duesterwald et al. [12] present online path profiling to enable hot path prediction in dynamic optimization systems, and [38] examines several sample-based techniques for gathering profiles within a Java Virtual Machine to enable feedback-directed dynamic optimization. In [3], the authors present an online, software-only mechanism for sampling executing code. They use code duplication (methods both with and without instrumentation) and transfer execution between the two versions based on method invocation and taken-backward-branch (back-edge) counts. They show that their technique exhibits very low overhead for sampling method call-pair frequencies and field accesses. In our work, we focus on basic block profiles, as they have been shown to accurately reflect how the executing program uses the underlying hardware resources. As part of future work, we plan to evaluate hardware-supported phase-aware profiling for other profile types; in addition, we plan to investigate the degree to which phase-awareness can be used to guide offline and online optimization.

Sample-based program monitoring has also been used to collect code coverage information [36, 25, 5] and to identify errors in deployed programs [21]. The focus of our work is on performance profiling; however, as part of future work, we plan to investigate the efficacy of using phase-awareness to gather profile information for code

coverage and bug isolation.

A.10 Conclusion

In this paper, we couple hardware and software techniques to enable efficient collection of remote profiles from resource-restricted devices. The key to our approach is the exploitation of program phase behavior. We show that by using special phase-tracking hardware to guide sample-based profiling, we can generate highly accurate profiles at relatively low overhead. Our simulation results indicate that phase-based profiling enables a 50-75% reduction in overhead (communication, computation, and battery power) over periodic and random sampling.

B Visualization and Analysis of Phased Behavior in Java Programs

Recent research in the area of feedback-directed, hardware-based optimization has identified potential optimization opportunities in the repeating patterns in the time-varying behavior, i.e., phases, of programs [33, 10, 34]. Phases are intervals of program execution that are similar to one another (that repeat). Hence, phase information can be used to reduce analysis and profiling overhead (the analysis of one interval in a phase holds for all intervals in the phase) and to identify repeating behavior that can be exploited with code specialization. Phased behavior, if present in Java programs, has the potential to enable significant performance improvements in both JVM and program execution. Phases can be used to reduce the overhead of online instrumentation and analysis and to guide optimization, adaptation, and specialization [14, 39, 40]. However, to date, phased behavior in Java programs has not been thoroughly researched. Moreover, there are many open questions about the various components of the methodology of phase behavior collection. For example, how many instructions, i.e., what granularity, should we consider as the minimum phase size? Is this size application specific? In addition, how do we measure the similarity between program behaviors so that we can distinguish different behaviors (phases)? Studies have shown that the answers to these questions significantly impact the detection of phase boundaries [16], and thus the degree to which phases can be exploited.

To facilitate research into these questions and into the phased behavior of Java programs, we have developed a toolkit and JVM extensions for the collection and visualization of dynamic phase data. In addition, our framework enables researchers to experiment with the various parameters associated with phase detection and analysis, e.g.,

granularity and similarity. Our framework incorporates phase collection techniques used by the binary optimization and architecture communities into a freely available JVM. Our toolset is intended for offline use, to gather phase data and to simplify and facilitate phase analysis as part of the design and implementation process of high-performance Java programs and JVM optimizations. We first describe the system in detail and then show how it can be used to visualize and analyze phased behavior in a set of commonly used Java benchmarks.

B.1 JVM Phase Framework To better understand the potential benefits from exploitation of phased behavior in Java programs, we developed a framework for Java Virtual Machines that allows application developers, as well as architects of dynamic optimization systems, to visualize, investigate, and experiment with phase behavior data in Java programs. The framework consists of a data generator and a set of tools for the extraction and analysis of phased behavior. In this section, we describe the implementation of each and discuss how each component can be parameterized to facilitate research and investigation into the customization of phase collection and analysis. We implemented the phase framework as an extension to the Jikes Research Virtual Machine (JikesRVM) from IBM T.J. Watson Research [2]. JikesRVM is an adaptive optimization system that uses online instrumentation to apply optimization dynamically to methods of a Java program that account for a significant portion of the execution time. We extended this JVM with instrumentation for basic blocks that causes basic block frequencies to be collected into an array, i.e., a basic block vector, during program execution. In addition, we apply the highest level of optimization to all application and library methods. To capture the time-varying nature of program behavior, we decompose program execution into intervals, each representing the execution of a fixed number of dynamic instructions. We can specify interval size as a commandline parameter; we use a value of 5 million in this study. At the end of each interval, we take a snapshot of the basic block vector and store it in an Interval Queue; a background thread periodically empties the queue to disk. Upon program termination, a file on disk contains a trace of per-interval basic block vectors for the program. We next describe the second component of our phase framework: A set of tools for the extraction and analysis of phased behavior from the profile information produced by the data generator. The toolkit consists of four tools: an image generator and visualizer, a phase finder, a phase analyzer, and a code extractor. The tools are written in Perl and Java and are easily extensible.

Figure 12: Phase Visualizer on Compress100

B.2 Phase Visualizer

The phase visualizer consumes the phase trace and generates a portable graymap image from it. This image can be viewed using any image viewer; however, we developed our own Java-based viewer that enables users to point (using the mouse) to a pixel in the image and view the interval coordinates. These coordinates allow the user to identify intervals within a visualized phase. An image produced by the phase visualizer is shown in Figure 12. The data in the figure was taken from the phase trace of the SpecJVM98 benchmark 201_compress using input size 100. Each image is a similarity matrix [31]; the x-axis and y-axis are increasing interval identifiers (ids). An interval is a period of program execution (specified during trace collection); we assign interval ids in increasing order starting from 0. The visualizer omits data in the lower triangle, since it is symmetric with the upper triangle. Each point, with coordinates x and y, denotes how similar interval y is to interval x. Dark pixels indicate high similarity; white indicates no similarity. A user reads the figure by selecting a point on the diagonal; by then traversing the row, she can visualize the degree to which the intervals that follow are similar. The grayscale depiction of similarity enables identification of phases, phase boundaries, and repeating phases over time. We discuss how we compute interval similarity in the next section.
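As an illustration, a minimal sketch (ours) of emitting the similarity matrix as a plain portable graymap; dist[x][y] holds the Manhattan distances, in [0, 2], defined in the next section:

    import java.io.IOException;
    import java.io.PrintWriter;

    // Sketch of portable graymap (PGM) generation for the similarity matrix.
    class GraymapWriter {
        static void write(double[][] dist, String path) throws IOException {
            int n = dist.length;
            try (PrintWriter out = new PrintWriter(path)) {
                out.printf("P2%n%d %d%n65535%n", n, n); // plain-PGM header
                for (int y = 0; y < n; y++) {
                    StringBuilder row = new StringBuilder();
                    for (int x = 0; x < n; x++) {
                        // distance 0 (similar) -> black; 2 (dissimilar) -> white;
                        // the lower triangle (x < y) is masked to white.
                        int gray = (x < y) ? 65535
                                 : (int) Math.round(dist[x][y] / 2.0 * 65535);
                        row.append(gray).append(x == n - 1 ? "" : " ");
                    }
                    out.println(row);
                }
            }
        }
    }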

B.3 Phase Finder

To determine which intervals are similar (and thus, the pixel color displayed by the visualizer), the second tool that we developed is the Phase Finder. The tool consists of two components: one that computes the similarity between intervals and another that clusters intervals into phases.

Two intervals are similar if the execution behavior that each represents is correlated. As such, we use the vector distance between the basic block vectors representing the two intervals. Specifically, we compute similarity as the Manhattan distance between the two vectors: the sum of the element-wise absolute differences between the two normalized vectors. The Manhattan distance is a value between 0 and 2: a distance of 0 implies that the two vectors are entirely similar, and 2 indicates complete dissimilarity. Other similarity metrics, e.g., vector angle [19], can be plugged into this component to enable evaluation of the efficacy of different techniques. Once we compute interval similarity, we map the similarity value to one of 65536 different grayscale values to generate the portable graymap image.

The clustering component uses interval similarity to determine which intervals should be included in a phase. This component is also pluggable, i.e., any clustering algorithm can be inserted, experimented with, and evaluated in terms of its efficacy for phase discovery. We currently implement a simple threshold-based mechanism that identifies similar intervals: the user provides a Manhattan distance threshold (between 0 and 2), and the component groups intervals whose Manhattan distances fall below the specified threshold. This simple mechanism enables users to adjust and experiment with the threshold value.
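A sketch of the two components follows (our reconstruction; as noted above, both the similarity metric and the clustering algorithm are pluggable in the actual framework):

    import java.util.ArrayList;
    import java.util.List;

    // Sketch of the similarity and clustering components of the phase finder.
    class PhaseFinderSketch {
        // Manhattan distance between two normalized basic block vectors; in [0, 2].
        static double manhattan(double[] a, double[] b) {
            double d = 0;
            for (int i = 0; i < a.length; i++) d += Math.abs(a[i] - b[i]);
            return d;
        }

        // Threshold-based clustering: assign each interval to the first phase
        // whose representative is within the threshold, else start a new phase.
        static int[] cluster(double[][] bbvs, double threshold) {
            List<double[]> reps = new ArrayList<>();
            int[] phase = new int[bbvs.length];
            for (int i = 0; i < bbvs.length; i++) {
                phase[i] = -1;
                for (int p = 0; p < reps.size(); p++) {
                    if (manhattan(bbvs[i], reps.get(p)) < threshold) {
                        phase[i] = p;
                        break;
                    }
                }
                if (phase[i] < 0) { reps.add(bbvs[i]); phase[i] = reps.size() - 1; }
            }
            return phase;
        }
    }

A heavier-weight algorithm, e.g., the k-means clustering used in [31], can be substituted for this threshold test without changing the rest of the pipeline.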

B.4 Phase Analyzer and Code Extractor

As part of our toolset, we also developed two tools that enable users to extract statistics, as well as code, from each phase or interval: the Phase Analyzer and the Code Extractor. The phase analyzer generates and filters data to aid in the analysis of phases and individual intervals. It lists the intervals in each phase, as well as how often and for what durations the phase occurs over the execution of the program. The phase analyzer extracts details about the behavior of individual intervals or entire phases. For example, it reports the number of phases found, the number of instructions in each phase (over time), and how many instructions occur in dissimilar intervals that interrupt the different phases. Moreover, it lists sorted basic block and method frequencies. This information is useful for identifying "hot spots" in the execution. For all of the data reported by the phase analyzer, we include a number of filters that significantly simplify analysis of the possibly vast amounts of data generated for a program. For example, a user can specify a threshold count below which data is not reported; this enables users to analyze only the most frequent data. In addition, data can be combined into cumulative counts or into a number of categories, e.g., instructions, basic blocks, methods, and types of instructions.

Figure 13: Similarity graph for Mtrt10.

Figure 14: Phases for Mtrt10 (similarity threshold: 0.8).

Finally, to analyze the program code that makes up a phase, we developed a code extraction tool. By inputting intervals identified by the visualizer and statistics generated by the phase analyzer, users can use the code extractor to dump code blocks of interest. The granularity of the dump can be specified as a single basic block, a series of basic blocks, or an entire method. We show how the tools in the toolkit can be employed in the next section.

B.5 Phase Behavior in Java Programs

The data we present in this section was gathered by executing Java benchmarks on a 1.13GHz x86-based single-processor Pentium III machine running RedHat Linux v2.4.5. We used JikesRVM version 2.2.2. The programs we examined are from the SpecJVM98 benchmark suite [35]. Each of the similarity graphs that follow can be read, as described previously, as an N x N similarity matrix, where N is the number of intervals in the program's execution. An entry in the matrix at position (x,y) is a pixel colored to represent the similarity between interval x and interval y. The diagonal is black, since an interval is entirely similar to itself. To see how an interval relates to the remaining program execution, we locate the interval of interest, say x, on the diagonal and move right along the row. Dark areas in the row identify intervals that are similar in behavior to interval x.

As an example, consider the similarity matrix for the multi-threaded ray tracing benchmark, Mtrt, when we execute it with input size 10 (Figure 13). Mtrt executes 65 intervals of 5 million instructions each. We start at the top left corner of the matrix and move right along the x-axis. As we move right, we encounter dark pixels until we reach interval 15. That is, the initial phase, phase-1, begins at interval 0 and continues until interval 15. Interval 15 is entirely dissimilar and therefore belongs to another phase, which we call phase-2. After interval 15, the intervals alternate between phase-1 and phase-2 until we reach interval 44. From interval 44 until the end of the execution, the

intervals are completely dissimilar to phase-1. Up to this point, we have visually discerned two phases, and we have concluded that the intervals in phase-2, and intervals 44 through 64, are completely different from the intervals in phase-1. Now we must investigate how intervals 44 through 64 relate to each other. We do this by locating interval 44 on the diagonal and evaluating its row in the same way; we can observe two different phases in this row. It is important to note that the dark intervals we encounter in row 44 are in no way related to the dark intervals in phase-1, even though the color may be the same. That is, a row in a similarity matrix identifies the similarity between the row interval and all future intervals only.

Figure 14 shows the phases found by our phase finder using a similarity threshold of 0.8 for Mtrt. We use a figure to depict the output of the phase finder. Each pattern indicates a different phase; 3 phases were detected. Using the phase analyzer and code extractor, we can further evaluate each phase by analyzing the commonly executed methods. The most frequently executed methods in phase-1 are ReadPoly in class Scene and init in class PolyTypeObj. In phase-2, the most frequently executed methods are CreateChildren and CreateFaces in class OctNode. Phase-3 then renders the scene by frequently executing Intersect from class OctNode, Combine from class Point, and RenderScene from class Scene.

Figure 15 shows the interval similarity matrices for all of the SpecJVM98 benchmarks; we showed the omitted benchmark, Compress, previously. Visual analysis of the benchmark data provides insight into the phase behavior of the programs and also enables us to target our efforts for further analysis using the other tools, as we did above for Mtrt. We can see that each of the benchmark programs exhibits very different patterns. In DB, Javac, Jess, and Mtrt there is a clear startup phase. Existing adaptive systems have shown that it can be profitable to consider startup behavior separately from the remaining execution [40, 39]. This phase in these four benchmarks is noticeably different from the rest of the execution. This is particularly evident in Javac: analyzing the number of instructions executed by different methods (using the phase analyzer), we find that the most popular methods are read, scanIdentifier, and xscan during the first 90 intervals.

Figure 15: Similarity graphs for the SpecJVM98 benchmarks (DB, Jack, Javac, Jess, Mpegaudio, and Mtrt) with input size 100. The interval length used is 5 million instructions. The number of intervals, n, varies per benchmark, and each graph is an n x n matrix with the x and y axes representing increasing interval identifiers. The lower triangle is a mirror image of the upper one and is masked for clarity. Each point on a graph indicates the similarity between the intervals represented by that point: dark implies similar and light implies dissimilar. The diagonal is dark, since every interval is entirely similar to itself.

They are again the most popular methods during intervals 202-212, depicted by the dark vertical bar to the right of the startup phase. These methods are rarely executed in all other phases. Other programs exhibit other interesting phase features. Mpegaudio shows no apparent phased behavior; each interval is very similar to every other. Mtrt also shows very similar (dark) intervals; however, there is a perceptible pattern that traverses its matrix. Jack exhibits a very regular pattern: 16 rows of almost perfect squares. Output from our phase analyzer for Jack reveals that the code does repeat itself 16 times: this benchmark is a parser generator that generates the same parser 16 times. Our framework correctly identifies this repeating phase behavior.

B.6

Related Work

Runtime phased behavior of programs has been previously studied and successfully exploited, primarily in the domains of architecture and operating systems [22]. The basis of our framework is to combine, within an adaptive JVM context, existing techniques that have proven successful in these domains. [30] and [31] are two such techniques: the authors of these works propose basic block distribution analysis to capture phases in a program's execution, and use phase information to reduce architectural simulation time by selecting small representative portions of the program's execution for extensive simulation. Basic block vectors are used to characterize program behavior across multiple intervals of fixed duration and are classified into phases using the Manhattan distance and k-means clustering. [33] presents an online version of this phase characterization, along with a phase prediction scheme; the authors of this work also describe additional applications to configurable hardware. Our framework employs this methodology within the data gathering component and the phase finding tool to enable collection and analysis of phase data in Java programs.

The authors of [10] and [13] stress the importance of exploiting phased behavior to tune configurable hardware components. In the former, the authors compare working set signatures across intervals, using a similarity measure called the relative working set distance, to detect phase changes and identify repeating phases. In [13], the authors use hardware counters to study the time-varying behavior of programs and use it in the design of online predictors for two different microarchitectures. In other work [11], the authors compare three different metrics that characterize phased behavior: basic block vectors, branch counters, and instruction working sets. Hind et al. [16] examine the fundamental problem of phase shift detection and analyze its dependence on the two parameters that define phased behavior, granularity and similarity. They demonstrate that for the SpecJVM98

benchmark suite, the observed phase behavior depends on the choice of parameter values. Our framework allows the user to specify and experiment with both of these (possibly application-specific) parameters.

B.7 Conclusion

To enable the study of time-varying, i.e., phased, behavior in Java programs, we developed an offline phase visualization and analysis framework within the JikesRVM Java Virtual Machine. The framework couples existing techniques from other research domains (architecture and binary optimization) into a unified set of tools for the collection, processing, and analysis of dynamic phased behavior in Java programs. The framework enables program and optimization developers to significantly reduce analysis time and to target parts of the code that recur with sufficient regularity.