Extreme Performance Scalable Operating Systems

Progress Report (Year 1: May 1, 2008 – April 30, 2009)

Allen D. Malony
Department of Computer and Information Science
University of Oregon
Eugene, Oregon 97403

Sameer Shende
Performance Research Lab
University of Oregon
Eugene, OR 97403

1 Introduction

In the first year of the FastOS (Phase 2) project, the University of Oregon (UO) has made excellent progress on all three areas of work: scalable parallel monitoring, kernel-level performance measurement, and parallel I/O system measurement. The results we achieved build a solid foundation for the efforts in Year 2 directed towards whole-system performance integration and “path-level” tracking of performance effects. UO has also interacted with Argonne colleagues during the year on the deployment of TAU and KTAU within the ZeptoOS system, and on the performance measurement of the PVFS-2 I/O infrastructure. This progress report describes the UO FastOS efforts and details our accomplishments. These are captured in several papers and presentations listed in the references for further information. We are largely on track with respect to the initial Year 1 milestones, with some rearrangement of Year 2 targets to better coincide with the ZeptoOS release schedule. We end the report with an outlook on Year 2 goals.

2 Online Parallel Performance Monitoring at Scale

Extreme-scale systems motivate a concern for global performance awareness, since local system- and application-level decisions can lead to global effects and consequences. Online performance monitoring tracks performance characteristics during execution by querying producers of performance data and making that information available to consumers at run time. Sources of data can be from the application- or system-level, and online access can be for application/system introspection or external performance clients. Extreme-scale systems pose challenges for run-time monitoring of performance behavior. Advanced methods for collection and reduction of system-wide data are necessary to achieve proper scaling.

2.1 Prior Work

We have looked at the problem of collecting global performance measurements at run time both from the perspective of the application and for external performance data consolidation. We built the prototype TAUg library to experiment with MPI-based techniques for application access to global TAU profile data. We have also investigated the use and extension of Supermon for collecting performance measurements via a systems monitoring infrastructure. TAU previously supported run-time performance analysis using a global shared filesystem for transmission of performance data. This approach does not scale because of file system contention among thousands of nodes, especially for I/O meta-data operations. The TAUg library offered the application fast-path access to TAU global profiles without a file system intermediary. By building a monitoring API on top of MPI, global profile access could be integrated easily with application execution semantics. Run-time offloading of TAU performance data is also intended for use by monitoring clients. The TAUoverSupermon (ToS) prototype demonstrated how the filesystem bottleneck could be dramatically reduced, while providing additional support for system performance data integration. Besides scalable collection of data, other middle-ware such as MRNet provides the ability to perform reduction operations during the collection phase, drastically reducing the monitoring overhead. For example, calculating a run-time statistical sample or other aggregate information can often replace the transport of raw performance data from all the nodes in the system. We next describe our current efforts and the progress made in building a monitoring system based on TAU and MRNet that achieves greater tool scalability by reducing the amount of communication and the resulting data.

2.2 The TAUoverMRNet System

Performance monitoring of HPC applications offers opportunities for adaptive optimization based on dynamic performance behavior, unavailable in purely post-mortem performance views. However, a parallel performance monitoring system must have low overhead and high efficiency to make these opportunities tangible. Over the past year we have built a scalable parallel monitor, through the integration of TAU and MRNet, and have evaluated its function and performance. We have leveraged MRNet’s distributed programming capabilities to efficiently perform useful reductions on the performance data on its path from application to monitor. We reported our experiences in building and evaluating the scalable parallel monitor at the STHEC workshop [5] and in the CC&PE journal [1].

2.2.1 Scalable Architecture

Scalable, online monitoring of parallel performance decomposes naturally into measurement and transport concerns. In TAU, an extensible plugin-based architecture allows composition of the underlying measurement system with multiple transports for performance data offloading. The ToM work explores the Tree-Based Overlay Network (TBON) model provided by MRNet with an emphasis on programmability of the transport for the purpose of distributed performance analysis and reduction. The main components and the data/control paths of the system are shown in Figure 1 (left). The ToM back-end (BE) resides within the instrumented parallel application. The generic profile data offload routine in TAU is overridden by an MRNet adapter that uses two streams (data and control). The data stream is used to send packetized profiles from the application back-ends to the monitor. MRNet provides the capability to perform transformations on the data as it travels through intermediate nodes in the transport topology. ToM uses this capability to i) distribute statistical analyses traditionally performed at the sink and ii) reduce the amount of performance data that reaches the monitor. Lastly, the ToM front-end (FE), which is the root of the transport tree, invokes the MRNet API to instantiate the network and the streams. In the simplest case, the data from the application is transported as-is, without transformations, and is unpacked and written to disk by the FE. More sophisticated FEs (that are in turn associated with special ToM filters) accept and interpret statistical data, including histograms and functionally-classified profiles. These capabilities are described next.

Figure 1: The TAUoverMRNet System: Architecture (left) and Distributed Analyses (right). Key: FE = front end, BE = back end, USF = upstream filter, DSF = downstream filter.

2.2.2 Distributed Analyses

Ideally, one would want to retrieve and store as much performance detail as the measurement system can provide. But the perturbation caused to the measured application, and the transport and storage costs associated with the performance data, require that we trade off measurement data granularity (in terms of events, time intervals and application ranks) against those costs. One method to vary the level of performance detail is through performance data reduction as the data flows through the transport. This is feasible by distributing performance analyses, traditionally performed at the front-end, out to the intermediate transport nodes. ToM implements three such filtering schemes, namely the Statistics filter, the Histogramming filter and the Functional Classification filter. Each scheme builds upon and extends the previous one. Figure 1 (right) describes the data paths used by the distributed analyses and reductions we examine next.

Figure 2: Reductions on Ranger's 4096 cores: Statistics (left), Classified Histograms (right)

Statistical Filter: The StatsFilter, the simplest ToM filter, is an Upstream Filter (USF) that calculates global summary statistics across the ranks, including mean, standard deviation, maximum and minimum, for every event in the profile. Performance data is assumed to arrive in rounds (i.e., a round is one profile offload from every rank). The summary statistics are calculated recursively by an intermediate node over the data from all its children until the root of the tree is reached. This corresponds to Phase A of the data path in Figure 1 (right). We have demonstrated this capability on 4096 cores of the TACC Ranger system running the Sod 2D problem of the FLASH application. Figure 2 (left) shows a screen-shot from the ParaProf visualizer of the reduced data retrieved online using this filter.
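
To make the recursive reduction concrete, the sketch below shows one way an intermediate node could fold the partial summaries arriving from its children into a single summary per event. It is an illustrative, stand-alone C program written for this report; the struct layout and function names are ours and not the actual MRNet StatsFilter code.

/* stats_merge.c -- illustrative sketch of merging per-child event summaries
 * as a ToM-like upstream filter might do (not the actual StatsFilter code). */
#include <math.h>
#include <stdio.h>

typedef struct {
    long   count;   /* number of rank samples folded in so far    */
    double sum;     /* sum of the metric (e.g., exclusive time)   */
    double sumsq;   /* sum of squares, for the standard deviation */
    double min, max;
} Summary;

/* Fold a child's partial summary into the parent's running summary. */
static void merge(Summary *parent, const Summary *child) {
    if (parent->count == 0) { *parent = *child; return; }
    parent->count += child->count;
    parent->sum   += child->sum;
    parent->sumsq += child->sumsq;
    if (child->min < parent->min) parent->min = child->min;
    if (child->max > parent->max) parent->max = child->max;
}

static double mean(const Summary *s)   { return s->sum / s->count; }
static double stddev(const Summary *s) {
    double m = mean(s);
    return sqrt(s->sumsq / s->count - m * m);
}

int main(void) {
    /* Partial summaries for one event (e.g., Allreduce) from three children. */
    Summary children[3] = {
        { 4, 40.0, 420.0, 8.0, 12.0 },
        { 4, 36.0, 330.0, 7.0, 11.0 },
        { 4, 44.0, 500.0, 9.0, 13.0 },
    };
    Summary total = { 0, 0.0, 0.0, 0.0, 0.0 };
    for (int i = 0; i < 3; i++) merge(&total, &children[i]);
    printf("ranks=%ld mean=%.2f stddev=%.2f min=%.1f max=%.1f\n",
           total.count, mean(&total), stddev(&total), total.min, total.max);
    return 0;
}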

Histogramming Filter: The StatsFilter, while providing global summary statistics for every event, loses considerable spatial distribution information. A histogram is one method of reducing data while still maintaining a level of distribution information. The HistFilter extends the StatsFilter and provides histograms for each event in addition to the summary statistics. Given a number of bins, to perform histogramming accurately the global min/max must be known (so that the ranges of the bins may be calculated). Below the root of the ToM tree, global information is not available. To be able to distribute the histogramming function across the intermediate nodes in the ToM tree, the global min/max first needs to be ascertained. This is achieved through the use of a 3-phase procedure, as shown in Figure 1 (right), where the histograms are recursively generated after determining the min/max. Figure 3 shows the savings in performance obtained through reduction. An order of magnitude difference is seen between the ToM (unreduced) and ToM Reduce (reduced) curves. This evaluation was also performed on the Ranger system using a simple BSP-style benchmark.

Figure 3: ToM Histogram-Filter Reduction Performance

Functional Classification Filter: It is often useful to reduce performance data into classes or categories based on the specific roles that each rank performs. This allows the examination of a small representative set of data instead of thousands of individual profiles. As an example, the spatial unevenness observed across the ranks in collective events, such as Allreduce, may be attributable to network performance issues, load-imbalance issues or the existence of different classes of application ranks performing varying roles. In the latter case it may be important to distinguish between imbalance within the classes versus across them. The ClassFilter groups the ranks into classes using a purely functional definition of a class. It uses a hash of the concatenated event names to generate a class-id and assumes that ranks with identical class-ids belong to the same functional class. Distributed histogramming is then performed within each class. Figure 2 (right) shows a ParaProf visualizer screen-shot of the classified histograms of the Allreduce event. The profiles from 4096 Ranger cores have been reduced to 22 classes with 5 histogram bins each, making examination of unevenness substantially easier.
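
The toy program below illustrates the classification idea: hash the concatenation of a rank's event names to form a class-id, then histogram a chosen metric within one class. The hash function, data layout and sample values are ours, for illustration only, and do not reproduce the actual ClassFilter implementation.

/* class_hist.c -- toy illustration of functional classification: ranks whose
 * event-name sets hash identically form one class, and a per-class histogram
 * of a metric (e.g., Allreduce time) is then built. Illustrative only. */
#include <stdio.h>
#include <string.h>

#define NBINS 5

/* djb2-style hash over the concatenated event names of one rank. */
static unsigned long class_id(const char **events, int nevents) {
    unsigned long h = 5381;
    for (int i = 0; i < nevents; i++)
        for (const char *p = events[i]; *p; p++)
            h = h * 33 + (unsigned char)*p;
    return h;
}

static void histogram(const double *v, int n, double lo, double hi,
                      int bins[NBINS]) {
    memset(bins, 0, NBINS * sizeof(int));
    for (int i = 0; i < n; i++) {
        int b = (int)((v[i] - lo) / (hi - lo) * NBINS);
        if (b >= NBINS) b = NBINS - 1;   /* clamp the max value into the last bin */
        bins[b]++;
    }
}

int main(void) {
    const char *io_ranks[]   = { "MPI_Allreduce", "MPI_File_write" };
    const char *comp_ranks[] = { "MPI_Allreduce", "compute_flux" };
    printf("I/O class id:     %lu\n", class_id(io_ranks, 2));
    printf("compute class id: %lu\n", class_id(comp_ranks, 2));

    /* Allreduce times (seconds) for the ranks of one class. */
    double t[] = { 1.1, 1.3, 1.2, 4.8, 1.4, 1.2 };
    int bins[NBINS];
    histogram(t, 6, 1.1, 4.8, bins);
    for (int b = 0; b < NBINS; b++) printf("bin %d: %d ranks\n", b, bins[b]);
    return 0;
}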

2.3 Benchmarking Online Monitoring

Having built the scalable parallel monitor, we turned next to the problem of using the system in such a way as to carefully match application monitoring requirements to monitor infrastructure configuration and cost. Parallel performance monitoring extends parallel measurement systems with infrastructure and interfaces for online performance data access, communication, and analysis. At the same time, it raises concerns for the impact on application execution from monitor overhead. The application monitoring scheme, parameterized by the performance events to monitor, the access frequency and the type of data analysis operation, defines a set of monitoring requirements. The monitoring infrastructure presents its own choices, particularly the amount and configuration of resources devoted explicitly to monitoring. The key to scalable, low-overhead parallel performance monitoring is to match the application monitoring demands to the effective operating range of the monitoring system (or vice versa). A poor match can result in over-provisioning (wasted resources) or in under-provisioning (lack of scalability, high overheads and poor quality of performance data).

Figure 4: Bottleneck Offload Interval (BOI) Estimation

We have defined metrics, developed a methodology and built an evaluation framework to determine the sweet-spots for performance monitoring using TAU and MRNet. These efforts were reported at the IEEE Cluster 2008 conference [4]. We described the Bottleneck Offload Interval (BOI) metric as an estimator of the operating capacity of the monitoring infrastructure. We then implemented, within ToM, an estimation method for the BOI (Figure 4). This included APIs for the application to discover the limits of the monitoring system's capability. Using these facilities we next characterized various ToM configurations. Given an application size and the performance data reductions to be performed, the characterizations helped discover the choices to be made with regard to ToM fan-outs, monitoring offload intervals and the number of profile events to sample.
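
As an illustration of how a bottleneck offload interval might be estimated, the sketch below runs a simple search over candidate offload intervals, shrinking the interval while the observed one-way delay (OWD) stays near its resting value and backing off when the delay grows. This is our own simplified rendering of the idea; the real ToM estimator, its thresholds and its measurement of one-way delay differ in detail, and measure_owd_ms() here is only an analytic stand-in.

/* boi_search.c -- simplified sketch of searching for the smallest offload
 * interval the monitoring transport can sustain (a bottleneck offload
 * interval, BOI). measure_owd_ms() stands in for observing the one-way
 * delay of offloads at a given interval; here it is modeled analytically. */
#include <stdio.h>

/* Toy model: below a true capacity of 52 ms the transport backs up and the
 * one-way delay grows sharply; above it the delay stays near a baseline. */
static double measure_owd_ms(double interval_ms) {
    const double true_boi = 52.0, baseline = 20.0;
    if (interval_ms >= true_boi) return baseline;
    return baseline + (true_boi - interval_ms) * 40.0;
}

int main(void) {
    double lo = 40.0, hi = 65.0;                 /* candidate interval range (ms) */
    const double rest_owd  = measure_owd_ms(hi); /* delay when clearly unloaded   */
    const double threshold = rest_owd * 1.5;     /* "OWD has grown" criterion     */

    for (int step = 0; step < 10; step++) {
        double mid = 0.5 * (lo + hi);
        double owd = measure_owd_ms(mid);
        if (owd > threshold)
            lo = mid;      /* interval too small: transport is backing up */
        else
            hi = mid;      /* sustainable: try offloading more often */
        printf("step %d: interval=%.2f ms  owd=%.1f ms\n", step, mid, owd);
    }
    printf("estimated BOI ~ %.2f ms\n", hi);
    return 0;
}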

2.4 TAU Profile Snapshots

TAU’s Profile Snapshot feature targets the issue of providing low-cost temporal performance analysis. While this was developed independently of the ToM work, it also applies to the monitoring problem. In the evolution of parallel performance tools, the two general methods of measurement and analysis – profiling and tracing – have each found their strong advocates and critics, yet both continue to be dominant in performance engineering practice. Profiling methods compute performance statistics at runtime based on measurements of events made either through sampling or direct code instrumentation. Tracing, on the other hand, merely records the measurement information for each event in a trace for future analysis. Whereas the statistics calculated in profiling can be computed by trace analysis, certain performance results can only be generated from traces. These results are best characterized as time-dependent. If the potential for temporal analysis of performance is tracing's forte, its weakness is the generation of large, sometimes prohibitively large, traces. In general, trace sizes are proportional to both the number of processes in a parallel application and the total execution time. Applications of a thousand processes running several hours can easily produce traces in the hundreds of gigabytes. While the use of profiling methods does not suffer such drastic size concerns, unfortunately, profiling simply loses track of time. An interesting question is whether profiling methods could be enhanced to introduce a time reference, in some way, into the performance data to allow time-oriented analysis. TAU's profile snapshots, as reported at the Euro-Par 2008 conference [6], attempt to do just that.

2.4.1 Snapshot Design

A parallel profile snapshot is a recording of the current values of parallel profile measurements during program execution. It is intended that multiple profile snapshots are made, with each snapshot marked by a time value indicating when it took place. In this manner, by analyzing the profile snapshot series, temporal performance dynamics are revealed. Figure 5 shows a high-level view of the performance profile snapshot work-flow. For any snapshot, the profile data for each thread of execution is recorded. Depending on the type of parallel system, thread-level (process-level) profiles may be stored in different physical locations during execution. However, analysis of temporal performance dynamics requires all parallel profile snapshots to be merged beforehand.

Figure 5: Profile Snapshot Architecture

A snapshot trigger determines when a profile snapshot is taken. Triggers are defined with respect to actions that occur during execution, either externally, within the performance measurement system, or at the program level. Timer-based triggers initiate profile snapshots at regular fixed time intervals. These intervals can be changed during the execution. Triggers can be conditional, determined by performance or application state. The key issue is where trigger control is located. User-level trigger control allows a profile snapshot to be taken at any point in the program. The performance measurement system can invoke triggers at points where it has gained execution control.

Because the profile snapshots being taken are from parallel executions, triggers are also executed in parallel. There may be different triggers for different threads of execution and they may be based on different conditions. Thus, it is possible for profile snapshots to be made at different times for different threads for different reasons. Profile snapshots can also record any portion of the parallel profile state by selecting which performance events are of interest and what performance data (e.g., time and counters) should be stored. A series of parallel profile snapshots is a time-sequence representation of the changing performance measurements of a program's execution. Flexibility in trigger control and profile snapshot selection is important to allow the desired views of temporal performance dynamics to be obtained. For instance, timer-based triggers allow performance frequency and rates to be calculated. However, interpreting the relationship between profile snapshots and between different threads of execution for the same 'logical' snapshot can be tricky, especially when the per-thread snapshots are recorded at different time points.
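
To illustrate the timer-based trigger idea, the sketch below checks elapsed wall-clock time at instrumentation points and, when the snapshot interval has passed, records the current cumulative event values together with a timestamp. It is a minimal stand-alone illustration, not TAU's snapshot implementation; the data structures and names are ours.

/* snapshot_trigger.c -- minimal illustration of a timer-based profile
 * snapshot trigger: at instrumentation points, if the snapshot interval has
 * elapsed, record the current cumulative event values with a timestamp.
 * This is not TAU's implementation; names and layout are illustrative. */
#include <stdio.h>
#include <sys/time.h>

#define NEVENTS 2
static const char *event_names[NEVENTS] = { "compute_flux", "MPI_Allreduce" };
static double cumulative_time[NEVENTS];      /* running profile state          */
static double snapshot_interval = 0.01;      /* seconds between snapshots      */
static double last_snapshot;

static double now_sec(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

/* Called from instrumentation points; takes a snapshot if one is due. */
static void maybe_take_snapshot(void) {
    double t = now_sec();
    if (t - last_snapshot < snapshot_interval) return;
    last_snapshot = t;
    printf("snapshot @ %.3f s:", t);
    for (int e = 0; e < NEVENTS; e++)
        printf("  %s=%.4f", event_names[e], cumulative_time[e]);
    printf("\n");
}

int main(void) {
    last_snapshot = now_sec();
    for (long iter = 0; iter < 5000000; iter++) {
        /* stand-in for measured work being accumulated into the profile */
        cumulative_time[0] += 1e-6;
        cumulative_time[1] += 5e-7;
        maybe_take_snapshot();       /* trigger check at an event boundary */
    }
    return 0;
}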

2.4.2 Snapshot Performance

We have tested the scalability of this format with hundreds of processors and hundreds of events on long-running applications. Depending on the application, creating snapshots at coarse intervals still provides a wealth of useful information. Our profile snapshot format typically requires on the order of 20 bytes per event recorded. In the execution of a 6 minute simulation of FLASH across 128 processors, we generated 320MB of uncompressed data (58MB compressed) with 6.3% overhead above regular profiling (which had a 4.6% overhead). By comparison, with tracing, the same run generated 397GB of trace data and imposed a 130.3% overhead (13m31s).

2.4.3 Snapshot Application and Visualization

ParaProf has been significantly enhanced to support profile snapshots. Each chart and table can display both individual and cumulative data from snapshots, all linked through a separate window with a slider to control the snapshot position. Because the snapshots are timestamped, we present them as a time-line and allow automatic playback of the execution.

Figure 6: ParaProf charts of Profile Snapshots from FLASH: (a) differential exclusive time, stacked; (b) differential number of calls, line chart; (c) differential exclusive time per call; (d) cumulative exclusive percent

We have applied our profile snapshot technique to the FLASH application from the ASC Flash Center at the University of Chicago. Figure 6 shows a variety of ParaProf analysis displays showing performance data from each of the profile snapshots. The data shown here is for MPI rank 0 from a 6 minute, 128 processor run on LLNL's Atlas machine. Figure 6(a) shows the time taken for each of the top 20 events in each snapshot, across a time-line, as a stacked bar chart. The data here is differential, meaning that the snapshots are viewed as the difference between the timing information at the start of the snapshot vs. the end of the snapshot. Alternatively, we can view each chart in cumulative mode. Figure 6(b) shows a line chart of the number of calls for the top 20 routines. The AMR grid used in FLASH is refined as the simulation proceeds, so we expect to see more calls and more time spent in later time-steps. This is verified by the data we see in the profile snapshots. Figure 6(c) shows the differential per-call exclusive time. Because the per-call value can decrease between snapshots, the values here can be negative. Here we see that the MPI_Alltoall() call has spikes of large per-call values during certain iterations. While the regular snapshot view would show us that more time is spent in this routine, which could simply be due to more MPI_Alltoall() calls being made, with this view we are able to determine that the duration of each call has actually increased. Figure 6(d) shows the cumulative exclusive percent of time spent in each of the top 20 routines. As expected, the cumulative percentage of time spent in each routine stabilizes as the simulation proceeds.

2.5 Future Work

We envision a robust monitoring framework being used to address specific requirements for application- and system-level global performance data collection and analysis. A core infrastructure should support efficient access and reduction services, while monitoring interfaces enable customization of performance awareness, query isolation/virtualization, and analysis operations. Learning from our experiences with Supermon and MRNet, we will create a monitoring framework where MPI is more directly integrated and leveraged in the monitoring infrastructure. The TAU monitoring framework, TAUmon, will enable us to use the highly optimized, vendor-supplied communication libraries for performance data movement and to take advantage of the communication abstractions provided by MPI. Furthermore, computational collectives (e.g., MPI_Reduce) can be used to combine communication with computation performed on monitoring data. In addition, we will continue to enhance the ToS and ToM implementations.
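
As a simple illustration of using MPI collectives to aggregate monitoring data in place, the sketch below reduces per-rank, per-event exclusive times to their global sum and maximum at rank 0 with MPI_Reduce. It is our own minimal example of the idea, not TAUmon code, and the event values are stand-ins.

/* monitor_reduce.c -- minimal illustration of aggregating per-rank profile
 * data with MPI collectives (the idea behind a TAUmon-style transport).
 * Build/run (typical MPI installs): mpicc monitor_reduce.c && mpirun -n 4 ./a.out */
#include <mpi.h>
#include <stdio.h>

#define NEVENTS 3

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Stand-in for this rank's exclusive times for three monitored events. */
    double local[NEVENTS] = { 1.0 + rank, 0.5 * rank, 2.0 };
    double sum[NEVENTS], max[NEVENTS];

    /* Aggregate across all ranks; only rank 0 (the "monitor") sees results. */
    MPI_Reduce(local, sum, NEVENTS, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    MPI_Reduce(local, max, NEVENTS, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);

    if (rank == 0)
        for (int e = 0; e < NEVENTS; e++)
            printf("event %d: mean=%.3f max=%.3f\n", e, sum[e] / size, max[e]);

    MPI_Finalize();
    return 0;
}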

3 OS Performance Measurement and Analysis

The success of the ZeptoOS project’s exploratory research in operating and run-time systems depends greatly on the ability to observe the performance effects of OS/R components and their interplay, and to evaluate, compare, and track the performance of different OS/R families as the software evolves. System-level performance information is also crucial to application developers to pinpoint problems and discern whether the bottlenecks are in the application or the OS/R stack or are due to interactions between them. In particular, as platforms scale up, applications become more sensitive to system-level influences, including interrupt processing, daemon activity, and memory subsystem-related activities. Understanding the influences of these factors, as well as performance of I/O and other subsystems, depends on the integration of measurement and analysis in system-level components. The systems (OS/R and middle-ware) community needs to develop and evaluate richer, more featureful middle-ware and OS/R components that can support petascale parallel applications without the associated overheads and perturbations. Performance technology that provides the ability to measure and understand noise, synchronization, and I/O performance is of key importance. KTAU (Kernel Tuning and Analysis Utilities) is a system-level performance measurement methodology and tool infrastructure we developed to bridge the gap between the application and the OS/R components. KTAU couples kernel-level performance measurements with application-level events to provide an integrated parallel performance view from both kernel-wide and process-centric perspectives. The principal goal is to allow all program-OS interactions to be observed so that the influences of various system performance factors can be identified and properly attributed. Attribution is a key requirement in determining where to tune an application with respect to OS interactions (e.g., periodic checkpoint I/O).


3.1 Prior Work

In the initial phase of the ZeptoOS project, the KTAU architecture (Figure 7, left) was defined and techniques for instrumenting and measuring performance of the Linux kernel were developed. KTAU was ported to three architectures: Linux x86, 32-bit PPC, and x86_64. KTAU was also integrated into the I/O-node BG/L ZeptoOS distribution.

Figure 7: KTAU: Architecture (left) and Integrated OS Counter Support via per-process virtualized OS counters in a user-level double-buffered container (right)

In addition, we extended KTAU to provide a new level of access to kernel performance data. The objective was to allow kernel performance data to be queried efficiently by the TAU measurement system running at the application level. The key idea was to make a portion of application process memory (called a metric container) available for use by KTAU to write certain kernel performance metrics (Figure 7, right).
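
The sketch below illustrates how an application-level measurement layer could use such in-memory OS counters: read a counter at routine entry and exit and charge the difference to the routine, with no daemon or system call on the query path. The read_os_counter() accessor and the counter names are hypothetical placeholders standing in for whatever the metric container actually exposes; this is not KTAU code.

/* os_counter_attribution.c -- illustrative use of per-process OS counters
 * exposed in application memory: sample at routine entry/exit and charge the
 * difference to the routine. read_os_counter() is a hypothetical accessor
 * standing in for reads from a KTAU-style metric container. */
#include <stdio.h>

enum { OS_SCHED_NS, OS_IRQ_NS, N_OS_COUNTERS };

/* Hypothetical: in a real system this would read a value the kernel keeps
 * up to date in a shared, per-process memory region. */
static unsigned long long fake_counters[N_OS_COUNTERS];
static unsigned long long read_os_counter(int which) {
    return fake_counters[which];
}

/* Per-routine accumulation of kernel activity observed inside the routine. */
static unsigned long long routine_sched_ns;

static void routine_enter(unsigned long long *entry_snapshot) {
    *entry_snapshot = read_os_counter(OS_SCHED_NS);
}
static void routine_exit(unsigned long long entry_snapshot) {
    routine_sched_ns += read_os_counter(OS_SCHED_NS) - entry_snapshot;
}

int main(void) {
    unsigned long long snap;
    routine_enter(&snap);
    fake_counters[OS_SCHED_NS] += 80000;   /* pretend 80 us of scheduling hit us */
    routine_exit(snap);
    printf("kernel scheduling charged to routine: %llu ns\n", routine_sched_ns);
    return 0;
}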

Figure 8: Noise accumulation, 128-processor run of Sweep3D: noise components timer_interrupt (red), smp_apic_timer_interrupt (green), and schedule (blue). Overall noise is also shown (purple).

OS operation, when not performed directly on behalf of the application, introduces performance artifacts, referred to as noise. The light-weight access to kernel performance state from within the context of the application provides the capability to identify noise sources and correlate that information with the application. We developed trace-based parallel noise analysis algorithms that allowed us to measure the effect of individual noise components on the performance of a specific parallel application. Figure 8, a sample of our results from experiments on an x86_64 Woodcrest cluster at SDSC, plots the noise accumulation (the total amount of time lost by the parallel application due to noise) for overall noise and for individual noise sources.

Integrated performance observation allows a complex performance phenomenon, such as OS noise, to be investigated. The intimate relationship of noise and communication operations, especially with respect to how noise influences performance scaling, requires event-specific noise measurement and analysis to fully capture noise dynamics. Over the course of this multi-year project we had planned to port our performance measurement and analysis tools to emerging large-scale platforms and to apply these techniques at scale to study ZeptoOS noise characteristics and the effect of noise mitigation mechanisms. Next, we describe the progress made in that direction during Year 1.

3.2 TAU / KTAU on ZeptoOS

Significant progress has been made in porting both the TAU and KTAU performance tool frameworks to the ZeptoOS V2.0 distribution for the IBM BG/P. Both are currently publicly available for download and use on the Surveyor system at Argonne National Laboratory. Preliminary experimentation has also been carried out on the Surveyor system. We describe these efforts in further detail next.

3.2.1 TAU

TAU is now ported to ZeptoOS V2.0, in addition to the existing support for the BG/P under the default IBM OS suite. Predefined configuration options (e.g., -zeptodir) make building for ZeptoOS quick and easy. Numerous TAU features [3], including support for MPI application profiling, multiple-counter support, access to the fast cycle-accurate timer and the automatic instrumentation utilities, are functional on the platform. In parallel, TAU has also been extended with BG/P-specific features under the IBM OS suite, such as meta-data collection (e.g., torus mapping information) using the BG{L/P} personality.h APIs. We hope to make these new features available under ZeptoOS soon as well.

3.2.2 KTAU

The port of KTAU to the compute nodes of the IBM BG/P running ZeptoOS V2.0 involved changes to the architecture-dependent portions of the tool framework. In addition, a KTAU patch for the specific kernel version (2.6.19.2) used in the compute nodes was created. New instrumentation points in BG/P-specific code areas (e.g., bgpnet_tree_interrupt()) were added. Several modifications were also required to work with ZCB-enabled (Zepto compute binary) applications. The memory mapping provided by ZeptoOS for such applications largely eliminates TLB misses, improving memory access performance. This is provided through the use of a special memory mapping, termed the big memory region, which is covered by large pages. The implementation of the light-weight integrated (TAU/KTAU) OS performance support, described earlier, required modification as it relied on the default Linux memory mappings. The KTAU port now accommodates integrated profiling for ZCB-style binaries as well as standard applications. The entire patch/configure/build process of KTAU within the kernel was integrated into the build system of ZeptoOS. Detailed documentation [7] on acquiring, configuring, building and using TAU/KTAU on ZeptoOS has been released. Lastly, KTAU has also been successfully ported to the SiCortex MIPS, another recent Linux-based platform (although not a ZeptoOS-supported one). This provides the opportunity to use KTAU to compare performance characteristics (such as noise) on two emerging, specialized Linux-based MPPs.


3.2.3 Demonstration

Preliminary experimentation with TAU/KTAU under the BG/P ZeptoOS V2.0 has been performed with the MPI versions of the NAS Parallel Benchmarks on an entire rack (1024 nodes) of the ANL Surveyor system. These results were demonstrated at the SC'08 (Supercomputing 2008) conference in the DOE NNSA ASC and Argonne National Laboratory booths. We plan to extend experimentation soon to the entire Intrepid machine using an INCITE application. Figures 9, 10 and 11 show a sampling of the Surveyor results as screen-shots from the ParaProf visualizer, collected from the NPB FT application.


Figure 9: Surveyor Full Rack (1024 nodes) - NPB FT: TAU User-level Profiles

Figure 9 shows the user-level-only profile as available from TAU. The figure on the left shows the mean of the GET_TIME_OF_DAY (wall-clock) metric over the 1024 ranks. Only the top 20 instrumentation points are shown. On the right-side figure, a 3D view of all the ranks' user-level performance data is shown. The routines, MPI rank and exclusive time are shown on the x, y and z axes respectively.

Figure 10: Kernel Profile of Node-0

When configured with KTAU options and run under a KTAU-patched OS, TAU automatically saves kernel profile information alongside user-level data. Figure 10 shows a kernel-only view of all processes active on the compute node running rank-0 of the FT application. The functions on the x-axis are those instrumented kernel routines (e.g., sys-calls) which were invoked while the job was running. The actual MPI job (in this case rank-0) is currently always spawned as pid-112 (due to the stripped-down nature of the compute-node ZeptoOS).

In the figure, process pid:112 (the nearest to the observer) can be seen running only a few instrumented kernel routines. The largest (red bar) represents sys_gettimeofday(), since TAU was using that for timing (if configured to use Linux timers/rdtsc(), this routine should not be seen either). The green bar represents pre-emptive scheduling time. Given the total runtime of the process (66 seconds), the 80 milliseconds of pre-emptive scheduling is negligible (less than 0.15%). There are two other processes that showed some non-negligible activity during the time the MPI job executed on this compute node: the control daemon (/sbin/control) and FUSE (/sbin/zoid-fuse). The node is shown to be largely quiet except for our MPI process (pid:112), and even in that case only very little OS activity was seen in this trial run. It should be noted that at the time this trial run was performed, not all instrumentation points in the kernel were yet enabled, as ZeptoOS was still under development; they are now all enabled.

Figure 11: Kernel Profile of All MPI Ranks: Kernel-Only (left), Integrated schedule() (right)

The view in Figure 11 (left) isolates the processes of the MPI job and shows the kernel activity of those processes across all of the MPI ranks. In other words, the kernel operations of the processes with PID:112 on all the compute nodes are being shown. This is also a kernel-only view, as it only shows the kernel activities. The large red bar is sys_gettimeofday (due to the timers). The other cyan/green bar is schedule() (pre-emptive scheduling). It shows how the MPI job on Rank-0 suffered relatively more (still small in an absolute sense) scheduling than the other ranks. This could be due to a longer initial setup phase (given it is rank-0), but this kernel-only view is unable to shed further light.

Figure 11 (right) shows an integrated user+kernel performance view. Each instrumented application routine also records how much of certain chosen types of kernel activity occurs within the context of the application routine. The figure shows the pre-emptive scheduling that occurred during the execution of the various user-level routines (including MPI routines). The longer schedule() we saw for rank-0 in the earlier kernel-only view is now clearly seen to have occurred in the SETUP phase, as suspected earlier. There is no limitation regarding which instrumented kernel routines can be tracked in this manner (be they sys-calls, scheduling, interrupts, or exceptions/faults).

4 Parallel I/O Performance

I/O is an important concern in emerging peta-scale architectures like the IBM BG/P and Cray XT3/4, which have multiple node types with specific responsibilities. In these environments, an I/O request that originates as a library call (e.g., stdlib or MPI-IO) on a compute node running a parallel application will traverse a series of components (both hardware and software) before reaching

disk. We term this the long-path of I/O. We are developing an integrated methodology that has access to all these components (at the library level, the application user-space level and the kernel level) to correlate performance between them and isolate bottlenecks. Our objective is to apply the developed methodology, based on the TAU/KTAU tool infrastructure, to optimizing I/O performance over this long-path.

4.1 Current Accomplishments

Figure 12: I/O 'long-path': Components (left), Instrumentation and Context-Tracking (right)

Progress has been made both in the instrumentation of the various components involved in I/O and in the context-tracking of I/O flows through these components (Figure 12).

4.1.1 Long-path Instrumentation

We use a variety of different strategies to perform the instrumentation. Figure 12 (left) shows the path an MPI I/O request from a compute node takes under the ZeptoOS/ZOID BG/P OS suite. With ZOID replacing CIOD and ZeptoOS replacing CNK on the BG/P, most long-path components have become open source and hence instrumentable. Below are listed the various strategies, along with the components targeted by each.

• Automatic PDT-based Source Instrumentation: Source-level instrumentation is both portable and allows a direct association between language- and program-level semantics

and performance measurements. Manually placing probes, while extremely flexible, can be tedious and error-prone in large code bases. To address these issues, TAU uses a powerful automatic source instrumentation tool, tau_instrumentor, for C, C++, and Fortran, based on our program database toolkit (PDT). The TAU source instrumentor can place probes at routine and class method entry/exit, on basic block boundaries, in outer loops, and as part of component proxies. PDT's robust parsing and source analysis capabilities enable the TAU instrumentor to work with very large source files and insert probes at all possible points. TAU has been ported to the BG/P platform running ZeptoOS and can be used along with PDT to auto-instrument. Any parallel application (the first component in the long-path), written in C, C++ or Fortran, can be quickly and easily instrumented using TAU's automatic source instrumentor. In addition, we have also been able to utilize this approach, with limited success, in the instrumentation of the PVFS server (the final component).

• Static and Dynamic Library Interposition: In library wrapping, the original library routines are replaced by instrumented versions which in turn call the original routines. The problem is how to avoid modifying the library calling interface. Some libraries provide support for interposition, where an alternative name-shifted interface to the native library is provided and weak bindings are used for application code linking. Here, library routines can be accessed through both the name-shifted interface and the native interface. The advantage of this approach is that library-level instrumentation can be implemented by defining a wrapper interposition library layer that inserts instrumentation calls before and after calls to the native routines. It is also possible through interposition to access arguments passed to the native library. This is termed static interposition, as it is done at link time and requires relinking to enable instrumentation. Where libraries are built as shared objects (and dynamically linked), the need for name-shifted interfaces is avoided. Instead, the wrapper library uses the same interface but then loads the original library internally. The instrumentation library can be interposed using the pre-loading facility (a minimal sketch of this style of wrapper appears after this list). Like other tools, TAU uses MPI's support for interposition (PMPI) for performance instrumentation purposes. Recently, TAU's support for MPI library instrumentation has been extended to MPI I/O routines by defining appropriate TAU wrappers. TAU now supports tracking of volume and bandwidth information for MPI-IO, and tracking of MPI-IO is enabled by default.

• Automatic Wrapper Generation: Utilities are also helpful in reducing the application re-programming required just to get instrumentation enabled, such as when needing to instrument an external library without modifying its source. This may be necessary for libraries where the source is not available or where the library is cumbersome to re-build. TAU's new wrapper generator, tau_wrap, is intended for such cases. In its current form, this PDT-based tool reads the header files of the original library and generates a new library wrapper header file with preprocessor DEFINE macros to change routine names to TAU routine names. A new wrapped library is created with these instrumented TAU routines, which then call the original library. tau_wrap can be used for any library. We plan to further extend this tool to generate dynamic library loading (dlopen/dlclose) code.
In the case where target libraries are built as shared objects, there will be no need to change either the calling code or the called libraries. Instead, dynamic library interposition of the tau_wrap-generated wrappers can be performed completely transparently using LD_PRELOAD.

This approach can be used for many of the internal library components of the long-path, such as the ZOIDFS components on the compute and I/O nodes.

• Specialized PVFS Instrumentation API: As mentioned before, automatic source instrumentation of the PVFS server can be performed. But in that case the automatic approach has not been completely useful, due to the varying number and types of parameters that need to be recorded along with the performance data. The automatic instrumentor is not aware of which parameters to record with each event without direction from the programmer. Further, the design of the PVFS server utilizes a significant number of simultaneous, transient worker threads that are created and destroyed on demand. This provides for increased concurrency and latency hiding of I/O delays in the server. But in the context of performance tracking, this creates a problem of increased contention if a simplistic single-buffer trace solution is used (since all the threads simultaneously try to write to the shared trace buffer). Alternatively, having a separate file for each worker thread is infeasible, as this can create thousands of trace files within a matter of minutes, depending on the workload. These issues prompted us to provide a specialized instrumentation API (currently termed pvfs_tau_ttf, for TAU Trace File). The API allows event types with a variable number of arguments to be registered first. These events can then be triggered from within the code by the programmer. The underlying implementation is optimized to reduce contention, while at the same time keeping the number of trace/profile buffers (and files) to a minimum. We have worked with the PVFS core team to integrate this API into their code base and to perform preliminary trial runs to collect sample performance data.

• Manual Source Instrumentation: Lastly, the existing, standard TAU instrumentation API can be used in any of the long-path components, in addition to any other instrumentation technique described above. This may be desirable, for instance, when extra semantic information needs to be associated with performance data. TAU provides many facilities to annotate performance data with high-level application semantics (such as mapping and phases) [3].
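
To illustrate the dynamic interposition style referenced in the interposition bullet above, here is a minimal LD_PRELOAD wrapper around the C library's write() that counts calls and measures the time spent in each call. It is a generic sketch of the technique, not TAU's wrapper code.

/* write_wrap.c -- minimal LD_PRELOAD interposition sketch (not TAU code):
 * wrap write() to accumulate call counts and time spent in the native call.
 * Build: gcc -shared -fPIC -o libwritewrap.so write_wrap.c -ldl
 * Use:   LD_PRELOAD=./libwritewrap.so ./your_io_program                  */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <sys/time.h>
#include <unistd.h>

static ssize_t (*real_write)(int, const void *, size_t);
static unsigned long call_count;
static double total_usec;

static double now_usec(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1e6 + tv.tv_usec;
}

ssize_t write(int fd, const void *buf, size_t count) {
    if (!real_write)                       /* resolve the native routine once */
        real_write = (ssize_t (*)(int, const void *, size_t))
                     dlsym(RTLD_NEXT, "write");
    double t0 = now_usec();
    ssize_t ret = real_write(fd, buf, count);
    total_usec += now_usec() - t0;
    call_count++;
    return ret;
}

__attribute__((destructor))
static void report(void) {
    fprintf(stderr, "write(): %lu calls, %.1f us total\n",
            call_count, total_usec);
}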

4.1.2 Context Tracking

To be able to delineate end-to-end I/O flows, some context needs to be available to chain performance data from the different long-path components. Figure 12 (right) describes how the context, after being created at the source (the parallel application), is passed between components. For interactions that occur between components within the same thread/process boundary, the context hand-off occurs transparently. This is achieved through the use of thread-local storage (TLS). The first time that TAU is made aware of the context, it is saved in TLS as the current active context and is hence available to TAU instrumentation called from other components within the same memory address space. As seen in the figure, this occurs within the compute-node components and also within the I/O-node components. No changes to component interfaces and no special instrumentation code are needed to track the context in such cases, which is why this hand-off is termed transparent. When component interactions cross process boundaries, there needs to be an explicit context hand-off. This may require changes to the component interfaces and message formats, and the addition of extra code to perform the hand-off in an out-of-band fashion. For instance, this occurs when the compute-node side libzoid_cn writes to the tree network to communicate with the I/O node-side ZOID. An extra API allows get/set of the context. The wire format of the messages over the tree network also needs to accommodate the extra context information.
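
A minimal sketch of the transparent, TLS-based hand-off described above: the top-level instrumentation stores a context identifier in thread-local storage, and instrumentation in lower-level components of the same process reads it back without any interface changes. The names and the context encoding are ours, for illustration only.

/* tls_context.c -- sketch of transparent context hand-off via thread-local
 * storage: upper-layer instrumentation sets the active context, lower-layer
 * instrumentation in the same process picks it up. Illustrative names only. */
#include <stdio.h>

/* One active performance context per thread (e.g., rank + request id). */
static __thread unsigned long active_context;

static void set_active_context(unsigned long ctx) { active_context = ctx; }
static unsigned long get_active_context(void)     { return active_context; }

/* Lower-level component instrumentation: no context parameter is passed in,
 * yet the measurement can still be tagged with the originating context. */
static void instrumented_io_call(size_t nbytes) {
    printf("I/O of %zu bytes attributed to context %lu\n",
           nbytes, get_active_context());
}

/* Application-level instrumentation creates the context at the source. */
static void app_level_write(unsigned long rank, unsigned long request_id,
                            size_t nbytes) {
    set_active_context((rank << 20) | request_id);   /* toy encoding */
    instrumented_io_call(nbytes);                    /* context travels via TLS */
}

int main(void) {
    app_level_write(3, 42, 65536);
    return 0;
}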

Similarly, explicit context hand-off is again required when crossing the I/O-node PVFS client / PVFS server boundary. This is achieved by using the existing hint facility in PVFS2. Hints can be arbitrary key/value pairs that annotate PVFS2 requests. On the client side, TAU encodes the context information as hints, which are then available to the measurement code within the PVFS2 server.

4.2 Future Work

We describe several remaining challenges to resolve and then conclude the I/O section with a description of the outstanding tasks we intend to focus on.

4.2.1 Challenges

The key challenges are related to dealing with the issues of work-sharing, asynchrony and the differing lifetimes of clients versus the service. Work-sharing occurs when a server code segment executes on behalf of more than one context simultaneously (but in the same thread). For instance, two requests for adjacent data arriving close together in time may be serviced by a single large read request in the underlying implementation. Two separate performance contexts becoming merged raises questions of how to report performance for each, whether to maintain inclusive/exclusive metric relationships, and how to do so while still being conservative. A request is termed asynchronous when it is handed off to another executing thread/process/host without the requester needing to wait for the entire time the request takes to service. This raises the issue of how to represent performance data occurring across asynchronous interactions. Given that the notion of asynchronous requests is not well defined in TAU, new constructs may need to be introduced to deal with such cases. The differing lifetimes of clients and the service complicate a profile-based scheme (as opposed to a tracing scheme). The PVFS service is expected to outlive its clients, and the various clients may have nested/overlapped lifetimes with relation to each other. How then can we keep the performance of different flows separate? Would clients wanting to track performance be expected to open a performance context with the server and then issue all requests within that context? Such a scheme will also have to deal with potential resource leaks from misbehaving/crashed clients. We expect to examine in detail and successfully resolve these issues in the coming year.

4.2.2 Outstanding Tasks

• Context-aware measurement: The measurement data on each of the long-path components needs to be collected in a manner that allows correlation back to the performance problem (even if that bottleneck originates in a remote component). This requires context-aware performance measurement. We have described context-tracking above, which allows the context to be accessible to the measurement system within each component. The TAU measurement system needs to be able to use the context in such a way that the resulting measurement data can be chained. If the intended measurement mechanism is tracing, then this may be trivially achieved by adding an extra context field to the trace record (see the sketch after this list); the traces can then be merged post-mortem. In the case of profiling, more sophisticated techniques are required to perform chaining as, unlike trace data, profile data is aggregated over time.

• Performance data reduction: Another issue to investigate is the use of online data-reduction techniques on long-path performance data. Measurement of all the components along the data path and tracking of all I/O, if not done carefully, can lead to a large amount of performance data. Reducing the amount of performance state collected without compromising diagnosis will be crucial.

• Long-Path Performance Analysis and Visualization: Once collected and chained, the performance data needs to be interpreted, analyzed and visualized. We are investigating the extension of ParaProf's call-path visualization facility to the long-path performance data. In the case of traces, we are investigating the feasibility of using existing trace visualizers such as VAMPIR.
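
As referenced in the context-aware measurement item above, a context field on every trace record is enough to chain events from different components post-mortem. The record layout below is a hypothetical illustration of that idea, not TAU's trace format.

/* ctx_trace.c -- illustrative trace record carrying a context id so that
 * events emitted by different long-path components can be chained later.
 * This layout is hypothetical, not TAU's trace format. */
#include <stdio.h>
#include <stdint.h>

typedef struct {
    uint64_t timestamp_ns;   /* when the event occurred                   */
    uint32_t component;      /* e.g., 0=app, 1=libzoid_cn, 2=ZOID, 3=PVFS */
    uint32_t event_id;       /* which instrumented routine                */
    uint64_t context_id;     /* end-to-end I/O flow this event belongs to */
} TraceRecord;

/* Post-mortem chaining: print all events belonging to one I/O flow. */
static void dump_flow(const TraceRecord *recs, int n, uint64_t ctx) {
    for (int i = 0; i < n; i++)
        if (recs[i].context_id == ctx)
            printf("t=%llu ns component=%u event=%u\n",
                   (unsigned long long)recs[i].timestamp_ns,
                   (unsigned)recs[i].component, (unsigned)recs[i].event_id);
}

int main(void) {
    TraceRecord merged[] = {             /* records merged from all components */
        { 100, 0, 7, 42 }, { 180, 1, 3, 42 }, { 200, 0, 9, 43 },
        { 260, 2, 5, 42 }, { 400, 3, 1, 42 },
    };
    dump_flow(merged, 5, 42);            /* follow I/O flow with context id 42 */
    return 0;
}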

5 Meetings

We participated in several conference meetings throughout Year 1 of the project where we presented the FastOS research and development accomplishments reported here. The meetings included:

• ASCR 2008 Computer Science Research PI Meeting
  – Poster in conjunction with Argonne partners: “Extreme-Performance Scalable Operating System,” P. Beckman, R. Ross, K. Yoshii, K. Iskra (Argonne National Laboratory), A. Malony, S. Shende, A. Nataraj, A. Morris (University of Oregon)
  – Talk in Performance Tools session: “Parallel Performance Tools at the Petascale Event Horizon,” A. Malony et al.

• International Workshop on Scalable Tools for High-End Computing (STHEC 2008)
  – Held with the International Conference on Supercomputing (ICS 2008).
  – Presentation of accepted STHEC paper: “TAUoverMRNet (ToM): A Framework for Scalable Parallel Performance Monitoring,” A. Nataraj, A. Malony, A. Morris, D. Arnold, B. Miller.

• IEEE International Conference on Cluster Computing (Cluster 2008)
  – Presentation of accepted Cluster paper: “In Search of Sweet-Spots in Parallel Performance Monitoring,” A. Nataraj, A. Malony, A. Morris, D. Arnold, B. Miller.

• International Euro-Par Conference on Parallel Processing (EuroPar 2008)
  – Presentation of accepted EuroPar paper: “Observing Performance Dynamics using Parallel Profile Snapshots,” A. Morris, W. Spear, A. Malony, S. Shende.

• SC (Supercomputing) 2008
  – TAU/KTAU demonstrations in the DOE NNSA ASC booth.
  – Talk in the Argonne National Laboratory booth: “TAU / KTAU on the ZeptoOS BG/P,” A. Nataraj, A. Malony, S. Shende, A. Morris.


6 Year 2 Goals

With the release of ZeptoOS at the end of Year 1, we will be ready to integrate the KTAU measurement infrastructure. This will be completed during the first quarter, and evaluation studies will then commence with Argonne colleagues in the second quarter. The goal is to characterize ZeptoOS operation with KTAU running on both the compute and I/O nodes. In particular, we are interested in running noise evaluation benchmarks. With TAU and KTAU running with ZeptoOS, we will also be in a position to demonstrate integrated application/OS performance views at large scale on Intrepid. The main objective for the UO work on I/O is to demonstrate the I/O long-path methodology and infrastructure in the context of PVFS-2. This work is an important component of Aroon Nataraj's Ph.D. thesis research. He hopes to complete his Ph.D. degree in Year 2. The experience gained with the ToM monitoring work in Year 1 will be applied in the design and development of the TAUmon system for use with ZeptoOS on Intrepid. Our hope is to demonstrate real-time monitoring of a large-scale INCITE application on Intrepid by the end of the year.

7 Conclusions

Significant progress has been made by the University of Oregon during the first year of the FastOS (Phase 2) project. This Year 1 report discusses the extent of the work; please refer to the referenced papers for more detail. With the pending availability of ZeptoOS on the Intrepid machine, we are confident in meeting Year 2 goals in all project areas.


References

1. A. Nataraj, A. Malony, A. Morris, D. Arnold, and B. Miller. TAUoverMRNet (ToM): A Framework for Scalable Parallel Performance Monitoring. Concurrency and Computation: Practice and Experience (special issue on Scalable Tools for High-End Computing), 2008, under submission, to be published.

2. A. Nataraj, A. Malony, S. Shende, and A. Morris. Integrated Parallel Performance Views. IEEE Cluster Computing Journal, 11(1):57–73, Mar. 2008.

3. A. Malony, S. Shende, A. Morris, S. Biersdorff, W. Spear, K. Huck, and A. Nataraj. Evolution of a Parallel Performance System. In 2nd International Workshop on Tools for High Performance Computing, M. Resch, R. Keller, V. Himmler, B. Krammer, and A. Schulz, Eds., Stuttgart, Germany, Springer, pp. 169–190, July 2008.

4. A. Nataraj, A. Malony, A. Morris, D. Arnold, and B. Miller. In Search of Sweet-Spots in Parallel Performance Monitoring. In IEEE International Conference on Cluster Computing (Cluster 2008), Sept. 2008.

5. A. Nataraj, A. Malony, A. Morris, D. Arnold, and B. Miller. TAUoverMRNet (ToM): A Framework for Scalable Parallel Performance Monitoring. In International Workshop on Scalable Tools for High-End Computing (STHEC '08), 2008.

6. A. Morris, W. Spear, A. Malony, and S. Shende. Observing Performance Dynamics using Parallel Profile Snapshots. In 14th International Euro-Par Conference on Parallel Processing (Euro-Par 2008), Aug. 2008.

7. ZeptoOS V2.0 Documentation – Using (K)TAU on ZeptoOS, http://wiki.mcs.anl.gov/zeptoos/index.php/(K)TAU
