The Association of System Performance Professionals
The Computer Measurement Group, commonly called CMG, is a not-for-profit, worldwide organization of data processing professionals committed to the measurement and management of computer systems. CMG members are primarily concerned with performance evaluation of existing systems to maximize performance (e.g., response time, throughput) and with capacity management, where planned enhancements to existing systems or the design of new systems are evaluated to find the resources required to provide adequate performance at a reasonable cost. This paper was originally published in the Proceedings of the Computer Measurement Group’s 2003 International Conference.
For more information on CMG please visit www.cmg.org
Copyright © 2003 by The Computer Measurement Group, Inc. All Rights Reserved
Disk Subsystem Capacity Management, Based on Business Drivers, I/O Performance Metrics and MASF

Igor Trubin, Ph.D. and Linwood Merritt
Capital One Services, Inc.

This paper discusses one site's experience of using business drivers and I/O performance data from SAS/IT Resource Management (formerly IT Service Vision) and BMC performance databases to produce web-based disk subsystem capacity usage reports for a large, multi-platform environment. The home-grown system captures global and application level I/O performance data and automatically publishes the following information on an Intranet web site:

1. Hourly disk I/O rates with trend and 6-month forecast charts correlated with business driver forecasts.
2. Disk I/O channel capacity estimation, shown as a threshold on the I/O rate charts.
3. Global and application level I/O exceptions displayed as Statistical Process Control (SPC) charts based on the Multivariate Adaptive Statistical Filtering (MASF) technique.
4. Tabulated data on disk I/O subsystem performance status, automatically colored to show possible performance or capacity issues regarding disk subsystems.

1. Introduction

The Capacity Management service in the authors’ company has to deal with more than 1000 servers on different platforms such as UNIX, NT/W2K, Tandem, Unisys, and MVS. To provide daily web-based reports of capacity and performance issues, several large SAS-based applications run nightly on ServerP, a relatively small 4-way UNIX server. This Capacity Management process was discussed at past conferences [M01, M03] and was recently faced with the following capacity problem. The system is supposed to prepare all data, HTML tables and charts by 8 am every day, but because of growth in the number of servers this SLA was broken, and the Capacity Planning web site was not ready until after 9 am. This presented an interesting situation: the Capacity Management System needed to resolve its own capacity problem!
The first natural analysis is to look at the CPU utilization chart (Figure 1). Before a “recent” upgrade (in May 2003) this metric had reached only 80%, and based on simple trend analysis, no capacity problem would occur for several months. The next thought follows: the SAS job is an I/O intensive workload and, as shown in Figure 2, the Disk I/O metric had been growing as well.

Figure 1 – CPU utilization trend

Figure 2 – Disk I/O rate trend

This metric, however, does not have a threshold, and based on this chart it is very hard to say that this is a Disk subsystem capacity issue. Both charts show that an upgrade has happened and, as a result, both metrics have dropped. A more important fact is that after the upgrade, all data and reports are again finished by 8 am.
The smart reader may have already guessed which subsystem was upgraded. The main source of performance data in this case is HP MeasureWare data, and fortunately it has one Disk metric with an obvious threshold. It is “Busiest Disk Utilization,” which is (based on its documentation) “the percentage of time during the interval that the busiest disk device had I/O in progress from the point of view of the Operating System.” Looking at this metric (Figure 3), it is clear which subsystem was upgraded.
Figure 3 – Disk Busy Chart (Busiest Disk Utilization, %, by hour, before and after the upgrade)
Indeed, older disk devices were replaced with faster RAID ones. This case study shows: 1. how effective Disk subsystem capacity analysis can be; and 2. the importance of having thresholds for Disk I/O metrics. This paper presents an overview of the Disk Subsystem metrics used for Capacity Management of the authors’ large multi-platform server farm, as well as discussions of how to use them to produce meaningful forecasts, simple modeling, and statistical analysis.

2. Disk Subsystem Metrics Overview

The most popular metric is File System Utilization. Everyone wants to know how much space is left on the disk. The problem is that the number of file systems on each server might be very large (hundreds) and can be huge across the company (hundreds of thousands). The authors’ centralized Capacity Management environment definitely does not have the capacity to monitor and report capacity problems for each of them. Is there any metric that shows the worst file system in terms of space utilization? Yes, HP MeasureWare has the GLB_FS_SPACE_UTIL_PEAK UNIX performance metric, which is “the percentage of occupied disk space to total disk space for the fullest file system found during the interval.” This is similar to the “Busiest Disk Utilization” (or “Disk Busy”) metric that was mentioned in the introduction. However, this metric shows 99% utilization almost all the time for every production server, because it usually reports the static and always “almost full” file system that holds OS or other UNIX system files, which is normal. By a similar argument, even a high level of “Busiest Disk Utilization” can be acceptable if it reports on non-critical parts of the disk subsystem.

Some of the authors’ servers, mostly LAN servers, are under the Concord eHealth performance monitoring system, which has a very interesting way to report on all main server subsystem performance problems, including the Disk Subsystem. It is the “System Health Index,” which is “a grade on the performance of the server” and is based on measurements of several Health Index variables. Figure 4 shows an example of a health index report generated by a SAS program using the data from a Concord Performance Database.
Figure 4 – Server Health Index

The Overall Health Index is the sum of five components (variables). Each of them might have a value in the range from 0 (excellent condition) to 8 or more (poor condition) and is an indication of the following problems:

• SYSTEM, which reports a CPU imbalance problem;
• MEMORY, which reports exceeding some memory utilization threshold or reflects paging and/or swapping problems;
• CPU, which reports exceeding some utilization threshold;
• COMM., which reports network errors or exceeding some network volume thresholds; and
• STORAGE, which might be a combination of:
  a. exceeding a user partition utilization threshold;
  b. exceeding a system partition utilization threshold;
  c. file cache miss rate, allocation failures and disk I/O fault problems that can add additional points to this Health Index component.
In Figure 4, the STORAGE component makes the biggest contribution and shows an unfavorable trend. In addition to this chart, the system generates classic performance charts for each subsystem. Figure 5 gives more explanation as to why the Health Index is so high.
Figure 5 – Disk Partition Utilization

Indeed, two partitions (#1 and #2) were highly utilized and caused the Health Index increase. This tool is a good mechanism to monitor the Disk capacity usage of LAN servers, because Concord is network oriented and uses SNMP agents to gather the performance data. For UNIX application and database servers with a large number of file systems to monitor for utilization, the site utilizes the BMC Perform and Predict tool. Currently the site monitors space capacity only for a limited (“most essential”) set of file systems using the BMC Perform and Predict environment. At the same time, the System Administration groups are responsible for monitoring all of them to ensure availability by using different tools such as BMC Patrol or ITO. The following “most interesting” disk space metrics are reported:

• Percent of the file system that is full;
• Measure of inodes used in the file system;
• Size of the file system in megabytes;
• Number of inodes in the file system;
• Amount of free space in the file system;
• Number of free inodes in the file system;
• Amount of file system space available that is allocated for general use.

Only the combination of utilization-type metrics and the actual size of the file systems is sufficient, because it can show how many absolute MB are used or free. Indeed, 1% free space of a 100 GB disk is equal to 10% free space of a 10 GB disk. The most interesting Disk subsystem metric in terms of system performance monitoring is the Disk I/O rate. For UNIX systems this metric means the number of physical I/Os per second during the interval. This data can be summarized by hour, as shown in Figure 6, and can be applied to application level data based on standard workload characterizations. Another useful view of this metric is a 24-hour profile grouped by hourly averages over the most recent month, also shown in Figure 6.
Figure 6 – 24-hour profile of Disk I/O rate

This picture can help to optimize I/O usage during the day and to show where workload could be spread out to use the Disk subsystem more efficiently.
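A 24-hour profile such as the one in Figure 6 can be built from interval-level data with a simple group-by. The following is a minimal sketch in Python/pandas (the site's actual reporting is SAS-based); the column names and sample data are assumptions for illustration only, not the real PDB layout.

```python
import pandas as pd

# Hypothetical extract of hourly physical I/O rates; real data would
# come from a SAS/ITRM or BMC performance database.
df = pd.DataFrame({
    "timestamp": pd.date_range("2003-06-01", periods=24 * 30, freq="H"),
    "io_rate": 100.0,  # placeholder physical I/Os per second
})

# Group the last month's measurements by hour of day to build a
# 24-hour profile: one average I/O rate per hour slot (00:00..23:00).
profile = (
    df.assign(hour=df["timestamp"].dt.hour)
      .groupby("hour")["io_rate"]
      .mean()
)
print(profile)
```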
3. Disk Metric Trend Analysis and Forecast

The site's Capacity Management system produces several types of trending charts against Disk performance data. First of all, the Disk I/O rate trend for off-hours and work hours is shown in Figure 2. Using the standard SAS “forecast” procedure, which is based on one of the “time series” algorithms, the charts are produced for almost all UNIX servers and include a future trend as an extrapolation of the historical data.
It might work well where the history is consistent, but often, due to upgrades, workload shifts or consolidations, the historical data consists of phases with different patterns, and the SAS scripts should be adjustable to take into consideration only the last phase with a consistent pattern. For instance, if the history shown in Figure 2 began in October instead of July, the future trend would be more realistic, as shown by the dashed lines on the future side of the chart. The same approach is applied to publish the Health Index trend, as shown in Figure 7, which corresponds to the same situation as shown in Figures 4 and 5.
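As a rough illustration of restricting the trend to the last consistent phase, the sketch below fits a straight line only to the data after a chosen phase-start date and extrapolates it six months forward. It is a simplified stand-in for the SAS FORECAST procedure the site actually uses; the linear model, variable names, and sample values are assumptions.

```python
import numpy as np
import pandas as pd

def phase_trend(daily: pd.Series, phase_start: str, horizon_days: int = 180):
    """Fit a linear trend to the portion of `daily` after `phase_start`
    (e.g. the period after an upgrade) and extrapolate it forward."""
    recent = daily[daily.index >= pd.Timestamp(phase_start)]
    x = np.arange(len(recent))
    slope, intercept = np.polyfit(x, recent.values, 1)
    future_x = np.arange(len(recent), len(recent) + horizon_days)
    future_index = pd.date_range(recent.index[-1] + pd.Timedelta(days=1),
                                 periods=horizon_days, freq="D")
    return pd.Series(slope * future_x + intercept, index=future_index)

# Example with synthetic daily disk I/O rates (placeholder values only).
history = pd.Series(
    np.linspace(800, 1200, 365),
    index=pd.date_range("2002-10-01", periods=365, freq="D"),
)
forecast = phase_trend(history, phase_start="2003-05-01")
```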
Figure 7 – Server Health Index Trend

The disadvantages of this approach are:

• The Disk subsystem is only indirectly represented here.
• The future trend tries to predict future problems of different subsystems, which sounds very suspicious as an “apples to oranges” comparison.
However, the first argument might be an advantage in terms of overall Capacity Planning, and the most important advantage is a real threshold: if the Health Index equals 8, it means at least one component (subsystem) was in very poor condition. We used the SAS program not only to build this chart but also to calculate when in the future the trend will intersect some threshold (yellow or red zone) and to automatically estimate the capacity status of the server.

The authors’ site’s Capacity Planning process includes publishing the capacity status of all servers on an Intranet web site by regularly updating color-coded tables with essential server capacity information (configuration data and capacity/performance charts). This method of Capacity Planning has already been discussed at past conferences [M03]. Usually, a decision about a server’s color requires the manual intervention of capacity planners, but the use of the data shown in Figure 7 opens the possibility of automating it.
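The automation step reduces to computing when a fitted trend will cross a yellow or red threshold. Below is a minimal sketch of that calculation, assuming a linear trend in days; the threshold values and the 30/90-day rules are illustrative assumptions, not the site's actual policy.

```python
def days_until_threshold(slope: float, intercept: float, threshold: float) -> float:
    """Days from 'now' (x = 0) until a linear trend slope*x + intercept
    reaches the threshold; inf if the trend never reaches it."""
    if intercept >= threshold:
        return 0.0                      # already at or above the threshold
    if slope <= 0:
        return float("inf")             # flat or falling trend never crosses it
    return (threshold - intercept) / slope

def capacity_color(slope: float, intercept: float,
                   yellow: float = 6.0, red: float = 8.0) -> str:
    """Illustrative status rule: red if the Health Index trend reaches 8
    within 30 days, yellow if it reaches 6 within 90 days, else green."""
    if days_until_threshold(slope, intercept, red) <= 30:
        return "red"
    if days_until_threshold(slope, intercept, yellow) <= 90:
        return "yellow"
    return "green"
```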
The future trend analysis, strictly based on the server performance data history, assumes that future patterns of server usage will remain the same. Often the future server usage depends on business drivers such as “number of accounts in the database of customer transactions,” which vary with the market situation. There is a way to produce a potentially more accurate forecast: a performance data vs. business driver correlation analysis. The authors’ site takes monthly business driver data (historical and projected) that is supplied by business units within the company, maps each server to one or more business drivers, and performs SAS multivariate regressions against such server resources as CPU utilization and disk I/O [M03]. This view of the projections accounts for business plans and can be very valuable for “what-if” analyses of various business scenarios. The authors’ experience is that this approach appears to be more accurate for a single resource such as CPU. When applied to multiple file systems’ disk I/O, results may vary greatly. The authors’ implementation displays the projections backward into the historical data (see Figure 8), to allow a visual inspection of how close the regression fits the data. Where the regression doesn’t fit, we can use the trended forecast.
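For the business-driver correlation, the site uses SAS multivariate regressions; the sketch below shows the same idea in Python with ordinary least squares. The single-driver form, the driver name, and all sample numbers are assumptions for illustration only.

```python
import numpy as np

# Hypothetical monthly history: a business driver (e.g. number of
# accounts, in millions) and the measured average disk I/O rate.
driver = np.array([10.0, 10.8, 11.5, 12.3, 13.0, 13.9])     # assumed values
io_rate = np.array([900., 960., 1020., 1090., 1150., 1230.])

# Fit io_rate ~ a * driver + b with ordinary least squares.
A = np.column_stack([driver, np.ones_like(driver)])
(a, b), *_ = np.linalg.lstsq(A, io_rate, rcond=None)

# Apply the fitted model to the projected driver values supplied by the
# business unit, and also backward onto the historical drivers for the
# visual fit check described above.
projected_driver = np.array([14.5, 15.2, 16.0])              # assumed plan
forecast = a * projected_driver + b
backcast = a * driver + b                                     # compare to io_rate
```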
Figure 8 – Business Driver Based Forecast

4. Overall Disk I/O Operations Estimation

In the introduction we discussed how a disk subsystem bottleneck was discovered, but looking at the disk I/O trend chart in Figure 2, even if it is trending up, it is very difficult to say that the Disk I/O subsystem is running out of capacity. Why? Because this chart does not show any thresholds. It would be very beneficial to have one. Based on HP MeasureWare DISK level data, it is possible to estimate overall disk subsystem I/O capacity. This was discovered during an ad hoc study of an exceptionally long-term Disk I/O hourly trend on ServerE, as shown in Figure 9. The disk level data was extracted from the performance log file and was used to produce the data in Tables 1 and 4. For each sample interval (5 min), a disk had utilization equal to BYDSK_UTIL, during which its I/O rate was BYDSK_PHYS_IO_RATE. BYDSK_UTIL is the percentage of time spent busy servicing I/O requests for the particular disk device, and BYDSK_PHYS_IO_RATE is the average number of I/Os per second for the particular disk device during the interval. Table 1 has an example for two disk devices (the actual number of disks is much larger), where the last column was calculated by the formula in Table 2 and is the maximum I/O rate that could be executed if the disk were 100% busy. It is a very simple linear model and does not take into consideration the disk queue or controller cache usage, but it at least gives some estimation of overall Disk I/O capacity.
Table 1 – Two Disk I/O Observations of MeasureWare Data and Disk Capacity Estimation

  Time   DEVICE_NAME         BYDSK_UTIL (%)   BYDSK_PHYS_IO_RATE (IO/sec)   DISK Capacity (IO/sec)
  11:35  7/0/0.8.0.3.1.4.6   63.0             30.9                          49.05
  11:35  6/0/0.8.0.3.1.0.6   60.9             41.7                          68.50
  11:35  …                   …                …                             …

Table 2 – I/O Capacity Estimation Linear Formula

  DISK Capacity (IO/sec) = BYDSK_PHYS_IO_RATE (IO/sec) * 100 / BYDSK_UTIL (%)

To find the most accurate estimation for ServerE, peak time data was analyzed. The result of the Disk I/O capacity calculation is shown in Table 3. The actual measured I/O rate is in Table 4. Finally, we can calculate the ServerE disk I/O capacity utilization as

  (Actual IO/hour) * 100 / (Max capacity IO/hour) = 6.62%

The result should be doubled because, based on HP recommendations, 50% utilization of a particular (single) disk is the threshold. It is a very simple approach, and Table 3 shows that the capacity estimation varies a bit across samples, but it has a relatively stable outcome. To make an estimation “from the top” (worst case), the interval with the minimum I/O capacity (25,957.33 IO/sec) may be taken, and the I/O capacity usage then calculated as 9.48%. However, for some intervals the minimum might be very low and the accuracy might be low as well.
Table 3 – I/O Capacity: Available

  Time   Sum of ALL Disks I/O Capacity (IO/sec)
  11:35  33,402.75
  ...    ...
  12:30  28,739.75
  12:35  25,957.33

  Max capacity IO/sec:   37,172.06
  Max capacity IO/hour:  133,819,406.86

Table 4 – I/O Capacity: Used

  Time   Sum of ServerE BYDSK_PHYS_IO_RATE (IO/sec)
  11:35  1728.799992
  ...    ...
  12:30  1605.799994
  12:35  2101.399988

  Actual IO/sec:   2,460.20
  Actual IO/hour:  8,856,719.97

Figure 9 – Disk I/O Rate Capacity Chart

Based on this estimation, a second axis with the capacity percentage was added to Figure 9. As a result of this particular ServerE ad hoc analysis, the conclusion was that the server's Disk Subsystem had enough capacity, which was proven in practice, because the server was not upgraded until one year after this study.
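The linear capacity model in Table 2 and the roll-up in Tables 3 and 4 can be reproduced with a few lines of code. The sketch below is a Python illustration of that arithmetic (the site's actual implementation is SAS against MeasureWare data); the sample records are the two rows of Table 1, and the final doubling applies the 50% single-disk threshold described above.

```python
# Per-disk samples as in Table 1: (device, BYDSK_UTIL %, BYDSK_PHYS_IO_RATE IO/sec)
samples_1135 = [
    ("7/0/0.8.0.3.1.4.6", 63.0, 30.9),
    ("6/0/0.8.0.3.1.0.6", 60.9, 41.7),
    # ... remaining ServerE disks omitted
]

def disk_capacity(util_pct: float, io_rate: float) -> float:
    """Table 2: estimated maximum IO/sec if this disk were 100% busy."""
    return io_rate * 100.0 / util_pct

# Table 3 / Table 4 roll-up for one 5-minute interval: sum the estimated
# capacity and the measured I/O rate over all disks of the server.
capacity_sum = sum(disk_capacity(u, r) for _, u, r in samples_1135)
actual_sum = sum(r for _, u, r in samples_1135)

# Overall disk I/O capacity utilization, doubled because HP recommends
# treating 50% busy as the practical limit for a single disk.
utilization_pct = 2 * actual_sum * 100.0 / capacity_sum
```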
5. Statistical Analysis of Disk Performance Data

There is one more way to build a threshold for the Disk I/O rate metric, and this way is much more accurate! This is a dynamic threshold based on the Statistical Process Control (SPC) concept, which was discussed in several CMG papers [T01, T02]. This approach was developed and successfully implemented as a Statistical Exception Detection System (SEDS) by this author as an extension of the Multivariate Adaptive Statistical Filtering (MASF) technique. SEDS is used for automatically scanning through large volumes of performance data and identifying measurements of global metrics that differ significantly from their expected values. Extending the MASF method, the authors’ site acted on a suggestion to use a new derived metric such as “amount of exceptions per day” and to keep the history of exceptions in a separate exception database in order to produce advanced capacity planning analyses. SEDS is the subsystem that uses inputs from the SAS/ITRM Performance Database (PDB). The structure is presented in Figure 10 and consists of the following main parts:

• exception detectors for the most important metrics, including Busiest Disk Utilization and Disk I/O Rate;
• a SEDS Database with the history of exceptions;
• a statistical process control daily profile chart generator;
• an exception server name list generator;
• a Leader/Outsider servers detector and a detector of runaway processes; and
• a Leaders/Outsiders bar charts generator.
Figure 10 – SEDS structure

Within MASF, the exception detector (a SAS program) scans the six-month history of hourly performance data for each server every day. The full “7 days x 24 hours” adaptive filtering policy is applied to calculate the average and the upper and lower statistical limits of a particular metric for each weekday over the past six months. To give detailed information about a server’s behavior for the previous day, the system publishes an SPC chart on the Intranet web site for each exception, as shown in Figure 11, where Saturday is the example.
Figure 11 – Disk I/O Statistical Process Control Chart
The Upper and Lower Limits are calculated as 3 standard deviations from the average. A quick analysis of the chart allows the analyst to immediately identify the part of the day where the limits were exceeded.
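A minimal sketch of the “7 days x 24 hours” reference-set calculation is shown below in Python/pandas (the site's SEDS implementation is a SAS program). The DataFrame layout and column names are assumptions; the logic is the one described above: group six months of hourly data by weekday and hour, and take mean ± 3 standard deviations as the control limits.

```python
import pandas as pd

def masf_limits(hourly: pd.DataFrame) -> pd.DataFrame:
    """hourly: DataFrame with a DatetimeIndex and a 'value' column holding
    six months of hourly measurements of one metric (e.g. disk I/O rate).
    Returns a 7x24 table of mean, upper and lower control limits."""
    grouped = hourly.groupby(
        [hourly.index.dayofweek, hourly.index.hour]
    )["value"]
    stats = grouped.agg(["mean", "std"])
    stats["upper"] = stats["mean"] + 3 * stats["std"]
    stats["lower"] = stats["mean"] - 3 * stats["std"]
    stats.index.names = ["weekday", "hour"]
    return stats[["mean", "upper", "lower"]]

def exceptions(last_day: pd.DataFrame, limits: pd.DataFrame) -> pd.DataFrame:
    """Flag the hours of the previous day whose value falls outside the limits."""
    keys = list(zip(last_day.index.dayofweek, last_day.index.hour))
    lim = limits.loc[keys].set_index(last_day.index)
    out = last_day.copy()
    out["above"] = out["value"] > lim["upper"]
    out["below"] = out["value"] < lim["lower"]
    return out
```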
Another example shows that SEDS captured a Disk I/O rate exception at about 4:00 PM on ServerB, and the Application detector found that the workload “Appl2” had an exception as well. This situation is illustrated in Figure 12.
Figure 12 – Application Level Disk I/O Usage Exception

To have a numeric estimation of the exception magnitude, a derived system performance metric was added to the SEDS database. Rather than simply counting the number of exceptions, it calculates the area between the limit curve and the actual data curve (see Figure 11) for the periods when the exceptions occurred. In the case of exceeding the upper historical limit, the area is positive (call it UpperIOs, or S+ in Figure 11); it is negative if the lower historical limit is crossed (call it LowerIOs, or S- in Figure 11). The best metric to record is the sum of those values:

  ExtraIOs = UpperIOs + LowerIOs

This metric is an integrative characteristic of the Disk subsystem exceptions that happened during the day, and it has a simple physical meaning: it is the number of I/O operations that the server performed beyond its statistical limits.
This approach is applied to all metrics in SEDS. This type of metric, “ExtraVolume,” was discussed in [T02]. For example, if the parent metric is the CPU run queue, ExtraVolume is the daily extra queue length versus the usual queue; for CPU utilization it is extra CPU time; and so on. Like LowerIOs and UpperIOs, the ExtraIOs metric might be less than or greater than zero. If a server shows a positive value for the last day, it means more Disk capacity was used on the server than in the past. In the same way, if a server shows a negative ExtraIOs value, less capacity was used than usual. These metrics can be summarized by day, week, or month, which provides a quantitative estimation of disk subsystem behavior for a certain period. Based on this method, the system automatically produces this calculation for the last day and records it in the SEDS database using the S+ and S- fields.
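The ExtraIOs magnitude can be computed directly from the previous day's hourly values and the MASF limits. The sketch below is a simplified Python illustration of the area calculation described above (UpperIOs above the upper limit, LowerIOs below the lower limit, summed into ExtraIOs); treating each hourly point as one full hour of excess I/O rate is an assumption about how the areas are integrated.

```python
def extra_ios(values, upper, lower, interval_seconds=3600):
    """values, upper, lower: equal-length sequences of hourly I/O rates
    (IO/sec) and their statistical limits for one day.
    Returns (UpperIOs, LowerIOs, ExtraIOs) expressed in I/O operations."""
    upper_ios = 0.0   # S+ : area above the upper limit (positive)
    lower_ios = 0.0   # S- : area below the lower limit (negative)
    for v, u, l in zip(values, upper, lower):
        if v > u:
            upper_ios += (v - u) * interval_seconds
        elif v < l:
            lower_ios -= (l - v) * interval_seconds
    return upper_ios, lower_ios, upper_ios + lower_ios
```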
This data is used for generating Leaders/Outsiders charts for the last day, last week, and last month, and for publishing the bar charts as shown in Figure 13.
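Building the “Top 10” chart data then reduces to summarizing ExtraIOs by server over the chosen period and ranking. A small pandas sketch, with assumed column names, could look like this:

```python
import pandas as pd

def top_unusual_io(seds: pd.DataFrame, days: int = 7, n: int = 10) -> pd.Series:
    """seds: SEDS exception records with 'date', 'server' and 'extra_ios'
    columns (assumed names). Returns the n servers with the largest
    summed ExtraIOs over the last `days` days."""
    recent = seds[seds["date"] >= seds["date"].max() - pd.Timedelta(days=days)]
    return recent.groupby("server")["extra_ios"].sum().nlargest(n)
```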
Figure 13 – Top 10 Servers with Unusual I/O Usage

A similar chart can be generated to show the opposite end of the server list and to demonstrate the top 10 servers with I/Os below their statistical limits (ExtraIOs < 0). The SEDS database also keeps application (workload) level data fields from which similar charts can be generated. However, it makes sense to do this only for the top 5 exceptional workloads on each particular exceptional server.

To identify the business area, server configuration, and relative size of each server, SEDS can produce an overall company-wide picture of all servers that had Disk I/O exceptions. The best way to make this type of presentation is a colored “Treemap,” or “heat chart.” This type of chart has already been used to publish an overall capacity status [M03, S98]. Applying this method to ExtraIOs, SEDS produces the chart shown in Figure 14, where ServerB is presented as a fairly large red box inside “M Department,” because its unusual I/O usage was bigger than 40,000,000.

In combination with a “Top 10” server chart, these two charts are excellent tools for deciding where to spend analyst resources to address possible performance issues on the most critical servers.