Scalability of Visualization and Tracing Tools

John von Neumann Institute for Computing

Scalability of Visualization and Tracing Tools
J. Labarta, J. Giménez, E. Martínez, P. González, H. Servat, G. Llort, X. Aguilar

published in

Parallel Computing: Current & Future Issues of High-End Computing, Proceedings of the International Conference ParCo 2005, G.R. Joubert, W.E. Nagel, F.J. Peters, O. Plata, P. Tirado, E. Zapata (Editors), John von Neumann Institute for Computing, Jülich, NIC Series, Vol. 33, ISBN 3-00-017352-8, pp. 869-876, 2006.

© 2006 by John von Neumann Institute for Computing

Permission to make digital or hard copies of portions of this work for personal or classroom use is granted provided that the copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise requires prior specific permission by the publisher mentioned above.

http://www.fz-juelich.de/nic-series/volume33


Scalability of Visualization and Tracing Tools

J. Labarta, J. Giménez, E. Martínez, P. González, H. Servat, G. Llort, X. Aguilar

Barcelona Supercomputing Center, Jordi Girona 1-3, 08034 Barcelona, Spain

Extending the capability of performance tools to deal with the larger and larger machines being deployed is necessary in order to understand their actual behavior and to identify how to achieve performance expectations in the frequent case where they are not met on the first try. Trace based tools such as Paraver provide extremely powerful and flexible analysis capabilities to identify performance problems that profile based tools cannot detect. Scaling up the usability of trace based tools requires new techniques in both the acquisition and visualization phases. The CEPBA-tools approach distributes the functionalities required to tackle large systems across three different levels. Different acquisition techniques are used in the instrumentation package to control the data captured and maximize the ratio of information to file size. An intermediate level of tools summarizes the generated Paraver traces into smaller traces with the same format, but where some of the information has been condensed. Examples of filter functionalities at this level include the summarization of certain events into periodic software counters and the selection of specific time intervals or events. At the final level, different rendering techniques have been introduced in Paraver to visualize traces of many processes while still conveying to the analyst the information needed to identify problems at a very coarse level, together with the capability to dig down to very detailed levels. The paper describes in detail the techniques used along those lines in the CEPBA-tools environment to support the analysis of applications run on large systems.

1. Scalability issues in tracing tools

Understanding the behavior of large scale parallel computers is a real need, especially in the large shared infrastructures offered by HPC centers.
With more and more users willing to access such resources it is important to optimize their use, both from the point of view of the individual user, who aims at obtaining results faster, and from the point of view of the operator of the resource, who is interested in maximizing the productivity of all the components of the infrastructure.

The detailed analysis of the behavior of a system requires the ability to capture a large amount of data and process it to deliver information to the analyst. The performance analysis data space can be seen as a three dimensional space where individual points represent the occurrence of an event relevant to the performance analysis. The three dimensions are time, space (or processors) and event type. Each of them can be very large in itself, and the three dimensional space they define can be very densely populated.

Profiling tools [6] essentially summarize all such data at run time, within the acquisition process itself, and only emit global statistics at the end of the run. Trace based approaches [9][10][8] store the raw data in files that can be analyzed off-line. This has several advantages, as the detail is not lost in a summarization process. An iterative analysis loop can be performed where hypotheses are made and validated. Rather than relying on a predefined set of metrics averaged over a whole run, the focus of the analysis can be dynamically directed to specific metrics and time ranges as it progresses. An intermediate approach [1] obtains traces, but then precomputes a large set of profile metrics that can be navigated with a graphic interface.
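As a toy illustration of this three dimensional view (the record layout and event names below are hypothetical, not an actual trace format), compare what a profile retains with what a trace still allows:

```python
from collections import defaultdict

# Hypothetical event records: (time, process, event_type, value).
events = [
    (0.1, 0, "mpi_send", 1),
    (0.2, 1, "mpi_recv", 1),
    (0.9, 0, "cache_miss", 420),
    (1.4, 1, "mpi_send", 1),
    (1.5, 0, "mpi_recv", 1),
]

def profile(events):
    """Profiling collapses the time and space dimensions:
    only per-type totals survive the run."""
    totals = defaultdict(int)
    for _, _, etype, value in events:
        totals[etype] += value
    return dict(totals)

def trace_query(events, t0, t1, procs):
    """A trace keeps every point, so the analysis can later focus
    on any time interval and processor subset."""
    return [e for e in events if t0 <= e[0] < t1 and e[1] in procs]

summary = profile(events)                      # global statistics only
detail = trace_query(events, 0.0, 1.0, {0})    # drill-down still possible
```

The profile answers "how many in total?" while the trace can still answer "where and when?", which is exactly the flexibility the iterative analysis loop exploits.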

In the context of highly parallel platforms, being able to use trace based tools to study the behavior of parallel programs is a challenge with several aspects. First, just storing and handling the data may be a problem. It is easy to run into restrictions in storage capacity or in the capability of the analysis tool to process such data with the responsiveness required for the interactive analysis loop. A second issue is how to present such data to the analyst, especially in typical timeline displays where one line is used to show, for each process, the evolution of a given performance index.

How to handle and display large amounts of data is an important aspect of the general scalability problem. We nevertheless consider it even more important to be able to handle the large dynamic range along the three dimensions presented earlier. A tool should be able to present a very high level view of a whole run and then drill down to a small time interval and a subset of processors where a given microscopic phenomenon may have a significant global impact.

The question of performance tool scalability has long been around. The Paradyn project [2][4] was motivated by this concern; it opened the direction of dynamic instrumentation and iterative automatic search during the program run. In general there is a perception that trace based approaches face an enormous problem when scaling up to large numbers of processors. Approaches for structuring the trace data have been proposed [8]. Other work focuses on trace compression mechanisms [6]. These proposals address, from the data structure point of view, the issue of how to reduce the trace size and at the same time speed up the process of manipulating traces, either for display or to compute some statistic. Parallelizing the implementation of the tools themselves is an alternative to speed up their execution or to let them use more resources (e.g. file descriptors, memory) than are available in a single node. Direct implementations [13] or implementations built on general infrastructures [12] are possible.

Our objective is to investigate how far we can go in the use of trace based approaches to analyze the performance of large scale parallel computers. Our position is that even for large numbers of processors, trace based approaches can still be applied, offering the advantages of supporting a detailed analysis and flexible search capabilities. Certainly, blind tracing of a large system is unmanageable. Even if parallelizing the tools or improving the internal data structures will certainly help, it is necessary to explore the direction of intelligent selection of the traced information.

In this paper we describe the techniques that have been implemented in the CEPBA-tools environment, centered on the Paraver [8] trace visualization tool. In our environment, the analysis process goes through three phases. The first one is the MPI + OpenMP instrumentation package OMPItrace [11]. Current practice yields traces on the order of some GBs after an instrumented run, while on a standard laptop configuration Paraver can visualize traces of up to 100 MB. To bridge the gap, we have implemented a filtering step where different techniques are used to select or summarize the information. Both the input and output traces are in the same Paraver format.

The structure of the paper is as follows. In section 2 we describe the techniques used in the instrumentation phase. The functionality of the filtering step is described in section 3, and the techniques to display traces of thousands of processors are presented in section 4. Section 5 concludes the paper.

2. Scalability of instrumentation

2.1. Limiting the trace file size

The basic approach supported in many tracing packages is to manually modify the code, inserting calls to a tracing library to start and stop tracing.
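The start/stop style can be sketched as follows (a Python illustration; the function names are hypothetical and do not correspond to the actual API of any of the packages discussed):

```python
# Minimal sketch of manual start/stop instrumentation: probes fire
# everywhere, but events are recorded only while tracing is enabled.
trace_events = []
tracing_on = False

def trace_start():
    """Hypothetical call the user inserts before the region of interest."""
    global tracing_on
    tracing_on = True

def trace_stop():
    """Hypothetical call the user inserts after the region of interest."""
    global tracing_on
    tracing_on = False

def probe(time, etype):
    """A probe inserted by the instrumentation package."""
    if tracing_on:
        trace_events.append((time, etype))

probe(0.0, "init")          # dropped: tracing not yet started
trace_start()
probe(1.0, "solver_entry")  # recorded
trace_stop()
probe(2.0, "io")            # dropped
```

Only the framed region reaches the trace, which is what keeps the file size under the analyst's control.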
Although this requires access to the source code and some understanding of the application structure, the mechanism is easy to use either by the code developer or by an analyst without very deep knowledge of the code. This mechanism also results in framed trace data, where the start of a phase in the algorithm and its periodicity are easily identified.

In situations where the source code is not available and the structure of the application is not known, simple approaches such as tracing from the beginning of the application until a given trace file size is reached can be quite effective. This mechanism is very easy to use, provides direct control of the amount of data captured and can be tuned to what the post processing or visualization steps can handle. A first drawback appears in applications with a lot of traced activity during the initialization steps. In situations where the behavior of the application varies over time, it may not be possible to reach the actual objective of the analysis.

A desirable functionality is to let the analyst specify what to trace when launching the instrumented run. In our environment this is specified through an environment variable indicating two pairs (function, instance number). Tracing starts when the specified instance of the first function is invoked. After that, tracing stops when exiting the specified instance of the second function.

For some applications, the analyst may be interested in tracing a relatively small interval halfway through the execution of a very large run, even without access to the source code. The approach implemented in OMPItrace to support this need is based on a circular buffer. The tracing probes are activated at the beginning of the program and keep storing traced data in the buffer without dumping it to disk. New events overwrite old ones, but the buffer always contains the most recent events that fit in it. By sending a signal to all the processes the user forces the dump to disk of the events in the buffer. Correlating the events dumped by different processes is nevertheless tricky.
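The circular buffer idea can be sketched in a few lines of Python (the capacity and record layout are illustrative, not those of the OMPItrace implementation):

```python
from collections import deque

# Sketch of a circular tracing buffer: the newest events overwrite the
# oldest once capacity is reached, so a dump keeps only the tail of the run.
BUFFER_CAPACITY = 4  # illustrative; a real buffer holds MBs of records
buffer = deque(maxlen=BUFFER_CAPACITY)

for t in range(10):            # ten events arrive; only the last four fit
    buffer.append((t, "event"))

def dump(buffer):
    """What a signal-triggered dump would write to disk for this process."""
    return list(buffer)
```

After the ten appends, `dump(buffer)` contains only the four most recent events, mirroring how the mechanism always preserves the end of a long run.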
Different paths through the application code and delays in the propagation of the dumping signal may result in some of the events in the file generated by one process not having their counterpart in another. To be able to match events in the dumps of different processes, some type of logical synchronization data must be emitted along with the events. In our approach, we rely on collective operations on MPI_COMM_WORLD. The tracing library emits, for each such operation, its sequence number. The trace merging process searches for these collectives. The first MPI_COMM_WORLD collective appearing in all processes is used as the reference for matching events across processes. Records before that one are discarded. Approaching the end of the local dumps, events in some processes may lack a counterpart and are also discarded. Although the mechanism has limitations (e.g. applications with no global collective calls, codes with high variation in the density of MPI calls across processes, point to point communications crossing collectives, slow propagation of signals), the approach is valid for a large set of codes with very long term behavior variations.

The most desirable situation would be one where the tracing package automatically detects how much data to capture. An approach for OpenMP programs described in [3] tries to detect the periodic structure of the application and emits to the tracefile only a few periods of that pattern. The approach has been ported to OMPItrace. As the detection is local to each process, the method has the limitation that we must assume an SPMD structure for the program. The approach can be combined with the matching mechanism described for the circular buffer.

An orthogonal direction to restrict the amount of data in the trace is to limit the set of events captured.
A typical analysis of an MPI program registers entry/exit events for user functions and MPI calls, as well as hardware counter information at those points. By not emitting some of those data to the file it is possible to restrict the trace size. Depending on the type of analysis, there is no harm in doing so. For example, if we know that for a given application the point to point calls are not the actual bottleneck and we are more concerned about the collectives, it is possible to emit only the latter to the tracefile. If we are interested in the evolution of the load balance over a long run, we may restrict the emitted events to just those on entry and exit to the user functions that perform the core computation. Even if we do not trace individual MPI calls, the analysis of the code sections encapsulating the communication can still provide a lot of information about communication behavior. Furthermore, even in this case, hardware counter information can unveil great levels of detailed understanding. An example is the Blue Gene/L, where the PAPI library used by MPItrace provides hardware counters on the number of bytes going out through each individual link of the 3D torus. Hardware counts at the start and end of communication phases can give detailed information on the actual amount of data transferred or the level of contention at each link.

2.2. Scalable trace merge

The instrumented run generates one file per process. It is then necessary to merge such dumps into a single trace file, matching sends with their corresponding receives and correlating the timestamps. The merge process is done off-line as a batch job. Even if it is desirable for this process not to take very long, the performance requirements for this step are not very high in typical analysis practice. It must nevertheless match a large number of files and may use a lot of memory. Our previous sequential implementation ran out of file descriptors when tracing above 1000 processes. We have implemented a parallel version that hierarchically merges individual dumps into partially matched files that are then merged again following a tree structure. Some speedup can be obtained by this parallel merge, but the really important effect is that it enables the merge of traces from large numbers of processors.

3. Trace post processing

For runs with a large number of processors, the above mechanisms can still generate very large tracefiles. To bridge the gap between the GBs generated and the tens of MBs that Paraver can visualize, it is necessary to implement filtering mechanisms that select or summarize the data.
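Before detailing the individual mechanisms, the flavor of such a filtering step can be sketched in Python (the record layouts are illustrative, not the Paraver trace format): a time interval and a process subset are selected, while communication records touching a selected process are kept even when the partner is not selected.

```python
# Illustrative records: states are (time, process, label);
# communications are (time, src, dst).
states = [(0.0, 0, "compute"), (0.5, 1, "compute"), (1.2, 2, "compute")]
comms = [(0.3, 0, 1), (0.6, 1, 2), (1.1, 2, 3)]

def select(states, comms, t0, t1, procs):
    """Keep records in [t0, t1) for the selected processes; retain
    communications that involve at least one selected process."""
    kept_states = [s for s in states
                   if t0 <= s[0] < t1 and s[1] in procs]
    kept_comms = [c for c in comms
                  if t0 <= c[0] < t1 and (c[1] in procs or c[2] in procs)]
    return kept_states, kept_comms

s, c = select(states, comms, 0.0, 1.0, {0, 1})
# the communication between processes 1 and 2 survives even though
# process 2 is not selected, so the interaction is not lost
```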
We have implemented several selection and aggregation mechanisms, described in the following subsections. We should emphasize that the output of every filtering step is again a Paraver trace. This allows the post processing functionalities to be pipelined. A typical analysis will probably go through several iterations of filtering steps until the final trace showing a specific behavior is obtained. The intermediate output of some of those steps will be visualized, and the indications this provides will drive the following filtering steps.

3.1. Selection mechanisms

We have implemented a selection mechanism that extracts from an original trace a subset of the events, states and communication records, emitting them without modification to the output trace while eliminating all other records. The first basic mechanism selects a time interval by specifying a start and stop time. This can be complemented in the spatial dimension by selecting a subset of the processors. In this case, in order not to lose the information about their interactions with other processes, the communication records that involve one of the selected processes are kept in the output even if the communication counterpart is not selected. Finally, in the states and events dimension it is possible to select a subset of the event types, or even a given range of values for those types.

3.2. Software counters

A new mechanism has been implemented in the filter to reduce the potentially huge number of events of a certain type in a trace. The idea draws from current processors, where very fine grain events like instruction completions or cache misses are counted and then made available to monitoring tools at a coarser level of granularity. The software counters mechanism substitutes the individual event instances in a raw trace by a summarized event emitted at the end of a summarization interval.

Two forms of summarization are possible. For events with categorical values indicating, for example, entry/exit to subroutines (e.g. MPI point to point calls), a count of the number of entries is kept. At the end of the summarization interval an event is emitted indicating the number of invocations during the interval. For events whose associated value is itself a count (e.g. hardware counter events), the individual values are accumulated and a single event of that type is emitted at the end of the interval with the aggregated value.

Two alternatives are possible to determine when to emit the summarized events. A first approach is to do it periodically. This sampling approach is not correlated with the application structure. The relationship between the sampling frequency and the natural frequencies of the application determines how well its behavior is represented by the software counts. The lower the sampling frequency, the smaller the output trace, but if the sampling frequency is too low everything is averaged out and information is lost. For situations with no a priori knowledge of the application behavior it may be necessary to experiment with several sampling periods. The second alternative is to emit software counter events at points where coarser grain events happen. A typical example is to accumulate, at the exit of user functions, counts of how many MPI calls occurred since the function was entered.

3.3. Aggregated communication

A similar idea can be applied to the communication data, somewhat preserving the communication structure of the application even while performing an aggregation to reduce the number of events. In its current implementation, the mechanism coalesces several communications between each pair of processors into a single record, with a message size equal to the sum of the individual message sizes.
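The periodic variant of the software-counter summarization of section 3.2 can be sketched as follows (a Python illustration with invented event names; not the filter's actual implementation). Counting categorical events and accumulating value-carrying events are handled uniformly by summing, since a categorical occurrence contributes a value of 1:

```python
from collections import defaultdict

# Illustrative fine-grain events: (time, type, value). Categorical events
# (e.g. an MPI call entry) carry value 1; counter events carry a count.
events = [(0.1, "mpi_send", 1), (0.4, "mpi_send", 1),
          (0.7, "flops", 500), (1.2, "mpi_send", 1), (1.8, "flops", 300)]

def software_counters(events, period):
    """Replace individual events by one aggregated event per type
    at the end of each summarization interval."""
    out = []
    acc = defaultdict(int)
    boundary = period
    for t, etype, value in sorted(events):
        while t >= boundary:  # flush accumulated counts at each boundary
            out.extend((boundary, k, v) for k, v in sorted(acc.items()))
            acc.clear()
            boundary += period
        acc[etype] += value
    out.extend((boundary, k, v) for k, v in sorted(acc.items()))
    return out

summarized = software_counters(events, 1.0)
```

Five raw events collapse to four summarized ones here; on a real trace with millions of fine-grain events per interval the reduction is drastic.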
For the equivalent message, the send time is that of the first accumulated message and the receive time is the last receive time of all accumulated messages. One approach to determine which messages to group is to use fixed sampling intervals: all messages going out from a source processor during an interval are considered for coalescing. Another approach is to group messages whose send times are closer than a specified bound. This trace compression mechanism introduces an important difference compared to all the others described until now. In the previous mechanisms, data was either extracted or accumulated, but the generated data was accurate. In this case, the generated data is no longer accurate. Even so, it has proven an interesting alternative in the analysis of large runs.

4. Scalability of visualization

4.1. Rendering

Once a trace is available, it is necessary to present the data it contains to the analyst in a way that conveys the maximum possible information about the application behavior [14]. Timelines are a typical way to display such behavior. In this approach, the two dimensional display derives from the three dimensional raw data space by projecting along the event dimension. The events are transformed into some performance metric, a function of time that is displayed following some color encoding scheme.

Given the limited number of pixels of a display device, the issue arises of which value to represent in each pixel. A first approach is to aggregate the information for each basic object (thread) according to the logical structure of the application. This can be done if the aggregated metric actually makes sense and represents some property of the set of objects aggregated. The result is that fewer lines have to be represented. For example, in an MPI + OpenMP application it may be possible to display the aggregated MFLOPS for each process. It does not make sense to represent the average identifier of the routine each thread is in, but it may make sense to select the user routine executed by the master thread as representative of the process.

The above approach deals with mechanisms to build a given metric for the entities in which the program is structured. In Paraver this is part of the semantic module, which computes a function of time for each thread, process or the whole application. The approaches described in the following paragraphs are implemented in the Paraver visualization module and are purely related to rendering. The important difference between the two lies in their respective quantitative and qualitative nature. The time varying metric computed by the semantic module is accurate and can be used to compute quantitative statistics (counts, averages, histograms). The objective of the display module is to render this accurate metric onto a small display area. It need not be accurate and should focus on conveying to the analyst a general perception of the relevant aspects of the metric.

A first approach is to display for each pixel the last of the set of values that the sequential display computation assigns to it. This actually corresponds to a periodic sample of the time/space dimension and is thus not correlated with the actual metric to represent. If the sampling frequency is below the natural frequencies in the data, the result may be quite misleading. An interesting approach in this situation is to use a random selection among the possible values that fall into the pixel area. This is also not correlated with the actual data, but redrawing the display window will report different values and give the analyst the opportunity to visually inspect the structure of the represented data. Different redraws resulting in structurally similar displays will actually show such structure.
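The per-pixel reductions discussed in this section can be illustrated with a small Python sketch (the mode names are ours, not Paraver's); many metric values map to one pixel and must be reduced to a single displayed value:

```python
import random

def render_pixel(values, mode, rng=random.Random(0)):
    """Reduce the metric values falling into one pixel to one value.
    'last' and 'random' are uncorrelated with the data; 'average',
    'max' and 'min' are correlated reductions."""
    if mode == "last":
        return values[-1]
    if mode == "random":
        return rng.choice(values)
    if mode == "average":
        return sum(values) / len(values)
    if mode == "max":
        return max(values)
    if mode == "min":
        return min(values)
    raise ValueError(mode)

pixel_values = [10, 10, 10, 95, 10]  # one outlier process in this pixel
highlighted = render_pixel(pixel_values, "max")   # outlier stands out
masked = render_pixel(pixel_values, "average")    # outlier washed out
```

The example shows why the non linear reductions are so useful: the maximum exposes the single anomalous process, while the average blends it into the background.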
Even if different redraws result in very different displays, this also conveys to the analyst useful information about the nature of the application behavior.

It is also useful to offer rendering approaches correlated with the data to be represented. A natural one is to average the values that fall into the pixel. This may be useful for quantitative metrics where addition has a physical meaning. It may be less interesting than might initially be thought, as averaging tends to mask structure rather than highlight it, and because it does something the eye would do by itself. Also, while it may make sense to add or average metrics between, for example, neighboring processors in the actual application problem space, neighboring processors in that space may not be neighbors in the linear process space of the programming model.

Two simple and very useful rendering methods are to display the maximum or the minimum of all values mapping to the pixel. This non linear mechanism does highlight individual processes where some desirable or undesirable behavior appears. We consider that relying on non linear data transformations is extremely useful. In Paraver this is supported by the semantic module building the time varying metric, where it is possible to zero out regions not corresponding to the actual target of the analysis. So if a metric represents, for example, the MFLOPS inside user function foo and zero elsewhere, it may make sense to display the minimum (non zero) value to get a perception of the regions where that routine is performing poorly.

4.2. Focusing on a subset of processes

The global view supported by the above mechanisms is generally used in a first step of the analysis, but it is then necessary to focus on a reduced set of processes where a given phenomenon shows up. A typical proposal is to provide a scroll bar mechanism for the processor dimension. This was initially implemented in Paraver but proved of little use.
Losing the global picture and having to scroll up and down to search for a given behavior drops the context information the analyst had and makes the search difficult. A two dimensional zooming mechanism was then implemented where it is possible to select a region, both in time and in processes, on which the view focuses. Features that support fairly good navigation through the large timeline representation are the undo/redo capabilities and the possibility to copy both the time scale and the selected processes to different windows. In this way, some metrics may highlight a given problem on a global view, and the analyst can then focus on the specific time/space region where it appears, looking at it with different views and metrics.

It is often the case that the processes involved in a given behavior are not contiguous. When scaling up the system they may not even be near one another in the linear numbering of processes. It is necessary to support the selection of an arbitrary subset of processes to display. The GUI can let the analyst tick the desired processes. Although this provides the basic mechanism, it is not very scalable from the usability point of view. Even if the selection mechanism is available, the actual problem is to identify which processes to select. The next section describes how this is addressed in Paraver.

4.3. Analysis power and scalability of visualization

When getting deep into an analysis, questions arise such as whether there are many processes sending to or receiving from process X, how stable the TLB miss ratio is within a given routine, or what message sizes appear in the run. The analysis module in Paraver can be used to compute statistics or histograms that give a numerical answer to those questions. The typical situation is then that the analyst wants to look at the timeline for those invocations that incur a certain range of TLB misses, or MPI calls with message sizes in a given range, or just the processors that communicate with processor X. To satisfy this need, we implemented a mechanism by which selecting a region of the histogram view automatically generates a display window where only the time intervals where the metric falls in the selected range are displayed, while all other time intervals are zeroed.
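A minimal Python sketch of this histogram-driven zeroing (the data layout is illustrative, not Paraver's internal representation):

```python
# Per-interval metric values for each process (illustrative numbers).
metric = {
    0: [5, 40, 7],
    1: [3, 4, 2],
    2: [60, 55, 50],
}

def select_range(metric, lo, hi):
    """Keep only intervals whose value falls in the range picked on the
    histogram; zero out everything else."""
    return {p: [v if lo <= v <= hi else 0 for v in vals]
            for p, vals in metric.items()}

view = select_range(metric, 40, 100)
# process 1 has no interval in range; its row is entirely zero, so a
# display could drop it and show only the processes of interest
```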
Furthermore, only processes with non-null entries in that region of the histogram appear in the generated view. The result is a very precise quantitative selection technique that lets the analyst focus directly on the regions of time and space where a given property is expressed. This mechanism relies heavily on the semantic module of Paraver to build functions of time from the raw events, apply non linear compositions and combine them to derive elaborate metrics.

5. Conclusions

This paper describes the techniques used in the CEPBA-tools environment to scale the applicability of the Paraver trace visualization and analysis tool to systems with up to several thousands of processors. Our objective is to support, within the same analysis, a very large dynamic range of granularities in the effects observed. The techniques have been applied at three different phases: instrumentation, post processing and visualization/analysis. Even if in the initial implementation each phase implemented its own techniques, experience shows that there is a common underlying set of concepts shared by the three of them (e.g. selection of events or timescales, summarization mechanisms). Based on this, future work will define a common configuration language. We expect this will greatly increase the flexibility and power of the whole environment by separating the specification of how to process the data from how that processing is actually carried out.

Regarding data acquisition and handling, our approach is based on providing flexible mechanisms to select and summarize the raw data in order to obtain traces that still contain the relevant information (detail, structure, variance) with much less data. In particular, we have described the software counters technique, which has proven very useful for analyzing the structure of large runs with many processes. Once those mechanisms are available, it is possible to apply intelligence in their use to drastically reduce the data and focus on what is relevant for understanding a given behavior. Such intelligence may be provided by the user/analyst, but may also be applied automatically. A very relevant direction of ongoing work is the analysis of large traces in order to automatically generate the filtered traces for the analyst to look at. Example objectives of such analysis are the identification of regions with significant OS perturbation, or the detection of the periodic pattern in the application and the selection of an appropriate time interval.

Different rendering techniques have been presented. Contrary to a general concern on the issue, we have observed that the limited size of a typical display is not a real problem for analyzing traces of thousands of processors. In particular, non linear rendering has proven very useful. Two other important conclusions regarding the visualization tool are the importance of tightly coupling the quantitative analysis mechanisms and the display selection mechanisms, and the need to further increase the capabilities of the semantic module and quantitative analysis mechanisms.

This work is supported by the Ministry of Science and Technology of Spain (under contract TIN2004-07739-C02-01), the European Union (HPC-Europa project, contract no. RII3-CT-2003-506079) and by BSC (Barcelona Supercomputing Center).

References

[1] B. Mohr, F. Wolf: "KOJAK - A Tool Set for Automatic Performance Analysis of Parallel Programs". Proc. of the International Conference on Parallel and Distributed Computing (Euro-Par 2003), Klagenfurt, Austria, August 2003. Lecture Notes in Computer Science 2790, pp. 1301-1304.
[2] B.P. Miller, M. Callaghan, J. Cargille, J.K. Hollingsworth, R. Irvin, K. Karavanic, K. Kunchithapadam, T. Newhall: "The Paradyn Parallel Performance Measurement Tool". IEEE Computer, November 1995.
[3] F. Freitag, J. Caubet, J. Labarta: "On the Scalability of Tracing Mechanisms". Euro-Par 2002, pp. 97-104, Paderborn, August 2002.
[4] J.K. Hollingsworth, B.P. Miller, J. Cargille: "Dynamic Program Instrumentation for Scalable Performance Tools". Proc. of the Scalable High Performance Computing Conference (SHPCC), Knoxville, TN, May 1994.
[5] A. Knüpfer, W.E. Nagel: "New Algorithms for Performance Trace Analysis Based on Compressed Complete Call Graphs". V.S. Sunderam et al. (Eds.), ICCS 2005, LNCS 3515, pp. 116-123, 2005.
[6] mpiP. http://www.llnl.gov/CASC/mpip/
[7] A. Chan, W. Gropp, E. Lusk: "Scalable Log Files for Parallel Program Trace Data — DRAFT". Argonne National Laboratory, Argonne, IL 60439, 2000.
[8] Paraver. http://www.cepba.upc.es/paraver
[9] VAMPIR User's Guide, Pallas GmbH. http://www.pallas.de
[10] O. Zaki, E. Lusk, W. Gropp, D. Swider: "Toward Scalable Performance Visualization with Jumpshot". International Journal of High Performance Computing Applications, 13(2), pp. 277-288, 1999.
[11] OMPItrace. http://www.cepba.upc.es/paraver/manual_i.htm
[12] P.C. Roth, D.C. Arnold, B.P. Miller: "MRNet: A Software-Based Multicast/Reduction Network for Scalable Tools". SC2003, Phoenix, AZ, November 2003.
[13] VampirNG. http://www.vampir-ng.de/
[14] S.T. Hackstadt, A.D. Malony, B. Mohr: "Scalable Performance Visualization for Data-Parallel Programs". Proc. of the Scalable High Performance Computing Conference (SHPCC), Knoxville, TN, May 1994.
