Int. J. Operational Research, Vol. 15, No. 1, 2012
A new control chart based on the loess smooth applied to information system quality performance

Zhi Tao and Fengkun Liu
Department of Management and Information Systems, Graduate School of Management, College of Business Administration, Kent State University, Kent, OH 44242, USA

Fanglin Shen
Department of Finance, Graduate School of Management, College of Business Administration, Kent State University, Kent, OH 44242, USA

Michael Suh and David Booth*
Department of Management and Information Systems, Graduate School of Management, College of Business Administration, Kent State University, Kent, OH 44242, USA
*Corresponding author

Abstract: To measure the performance of an information system, unusually long response times of database operations sometimes need to be identified. As an alternative to Lloyd’s (1995) statistical process control chart, we propose an outlier-resistant control chart based on the robust locally weighted scatterplot smooth (loess). The proposed chart effectively identifies all the out-of-control points of operation in Lloyd’s (1995) data. Our findings provide guidance to management in information system quality control and, more generally, to all control chart users in cases where outliers occur in the charted data.

Keywords: loess smooth; statistical process control; outliers; information system; control chart; operational research.

Reference to this paper should be made as follows: Tao, Z., Liu, F., Shen, F., Suh, M. and Booth, D. (2012) ‘A new control chart based on the loess smooth applied to information system quality performance’, Int. J. Operational Research, Vol. 15, No. 1, pp.74–93.

Biographical notes: Zhi Tao is currently a PhD Candidate majoring in Operations Management, with a minor in Applied Statistics, in the Department of Management and Information Systems in the College of Business Administration at Kent State University, Kent, OH. She holds a Master of Science degree in Accounting and Information Systems from the University of Delaware. Before beginning the Doctoral Program at Kent State University, she worked for Philips Medical Systems as an Accountant. Her research interests include green supply chain management, operation efficiency and outliers in statistical process control. Her research has been presented at numerous conferences and published in the Proceedings of the Decision Sciences Institute, the Midwest Decision Sciences Institute and the 1st International Symposium on Green Supply Chains.

Fengkun Liu is currently a PhD Candidate in Information Systems in the Department of Management and Information Systems, Kent State University. He earned his BS at Qingdao University, Qingdao, China and his MS at the Catholic University of Korea, Bucheon, Korea, both in Management Information Systems. His research interests include social networks, mobile application adoption, recommender systems and e-commerce.

Fanglin Shen is currently a PhD Candidate in the Department of Finance, Kent State University. She earned her Bachelor’s degree from Beihang University, China and her Master’s degree from Southern Illinois University at Edwardsville. Her research interests include investment, international finance and behavioural finance. She has presented at several finance conferences.

Michael Suh is currently an Information Systems Consultant. He served as an Assistant Professor in the Management and Information Systems Department at Kent State University, where he earned his PhD. Previously, he has published in the Int. J. Computer Integrated Manufacturing and Industrial Mathematics. His research interests focus on information systems.

David Booth is currently a Professor in the Department of Management and Information Systems at Kent State University. He earned his PhD at the University of North Carolina at Chapel Hill. Previously, he has published in the Int. J. Operational Research, Journal of Quality Technology and Decision Sciences, among others. His research interests include applied statistics and quality control.

Copyright © 2012 Inderscience Enterprises Ltd.
1 Introduction
It is well known that outliers have a major deleterious effect on statistical methods based on ordinary least squares (OLS). Robust techniques (Hauser and Booth, forthcoming a,b; Stigler, 2010) have been developed to combat these effects. Time series methods are related to regression algorithms and thus these outliers can produce the same type of problem in this case as well. We propose an outlier resistant robust locally weighted scatterplot smoothing (loess) method to be used in statistical process control (SPC) as the foundation for a new type of quality control chart. A control chart is a plot of a quality response variable vs. the time sequence of the observations that forms a time series (Booth, 1984). Many smoothing and regression type
algorithms applied to time series data are strongly affected by outlying observations (Chang et al., 1988; Chen and Liu, 1993a,b; Chen and Tiao, 1990; Chen et al., 1992), as are standard $\bar{x}$ control charts (Booth, 1984; Mahaney et al., 2007a,b). The effect is to pull the $\bar{x}$ values towards the outliers and thus cause erroneous interpretation of the quality control data. We argue in this paper that this is exactly the problem with Lloyd’s (1995) proposed control chart. We further argue that a robust, outlier-resistant control chart can be developed based on a loess smooth (Cleveland, 1979; Cleveland and Loader, 1995) and outlier-resistant summary statistics, e.g. medians and median absolute deviations. We then show that such a loess-based control chart works better in Lloyd’s (1995) information system application than does the chart he proposed. The SPC application is to identify abnormally long response times (outliers) in database management processes. Lloyd (1995) developed a control chart that uses the mean measurement as the central line and the mean + 2.66 × mean moving range as an upper control limit (UCL) to identify these abnormally long response times. His control chart identified some outliers in the data set but also missed some (for details, see Section 4). Lloyd’s chart missed those outliers because the outliers themselves pulled the UCL too far towards them. We developed a robust (outlier-resistant) loess-based control chart. In our chart, UCL = median(observations) + 2.66 × median absolute deviation(loess residuals). The median and median absolute deviation are defined and discussed in Booth (1986a,b). Our loess-based control chart identified all the outliers because the loess process, together with the median and the median absolute deviation, gives outliers lower weight, so the smooth is not pulled towards them; the resulting residuals are then used to compute the UCL.
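For reference, a limit of the kind attributed to Lloyd (1995) above can be written out as follows. This is a sketch that assumes the standard individuals and moving-range form of the chart; the symbols $x_i$, $n$ and $\overline{MR}$ are notation introduced here for illustration and are not taken from Lloyd's paper.

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
\overline{MR} = \frac{1}{n-1}\sum_{i=2}^{n} \lvert x_i - x_{i-1} \rvert, \qquad
\mathrm{UCL} = \bar{x} + 2.66\,\overline{MR}
```

Both $\bar{x}$ and $\overline{MR}$ are averages, so a single very long response time raises each of them and therefore raises the limit, which is the pulling effect described above. The constant 2.66 is the usual three-sigma factor $3/d_2$, with $d_2 = 1.128$ for moving ranges of two observations.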
2 Literature review
The effect of outliers on maximum likelihood and least squares estimates has been known and studied since the first comparison of the properties of means and medians (Bianco and Martinez, 2009; Portnoy, 1980; Stigler, 2010). Because means and variances are often used in SPC, especially in control charts, this effect is well known and researchers have used it in various ways to try to improve SPC methods (Booth and Booth, 2009; Booth et al., 2005; Mahaney et al., 2007a,b; Shah and Booth, forthcoming). All of these new methods try to improve SPC either by using outlier-resistant (robust) methods of estimation or by taking advantage of the outlier effect of pulling maximum likelihood or least squares estimators towards the outliers (Booth, 1986a; Croux et al., 2002; Hauser and Booth, forthcoming a,b; Kimmel et al., 2010). This paper takes advantage of properties of the median, the median absolute deviation (Booth, 1986a) and the loess smooth (Cleveland, 1979; Cleveland and Loader, 1995; Sawitzki, 2009). Loess began as locally weighted scatterplot smoothing, or lowess (Cleveland, 1979). Essentially, that method fits a robust (i.e. outlier-resistant) regression locally (Booth, 1986a) with a polynomial model of degree one or two. A second proposal, loess, was put forward later; it added a nearest neighbour component and led to the double weighting scheme detailed in Section 3 (Cleveland and Loader, 1995; Kimmel et al., 2010) and used in SAS proc loess. The robust weighting scheme in loess works in the same manner as that in lowess. The previously mentioned papers in SPC take advantage of the outlier effect in several ways. We first recall three facts. Firstly, the effect itself is that outliers pull
maximum likelihood estimates and least squares estimates towards themselves (Bianco and Martinez, 2009; Booth, 1986a). Secondly, a robust (i.e. outlier-resistant) method resists the outliers’ pull. Finally, out of control (OTC) points in SPC are outliers (Booth, 1984). Booth et al. (2005) used a robust time series method (joint estimation) that not only identifies outliers but also assigns them time series types, in order to identify losses from nuclear material inventories. Mahaney et al. (2007a,b) used joint estimation to deal with SPC problems in metallurgical industries as well as SPC problems in business management. An advantage of the approaches that use joint estimation is that they take into account any autocorrelation that may be in the data (Wright et al., 2001), as does loess (Cleveland, 1979). Joint estimation is again based on countering the outliers’ pulling effect. Another approach to the use of the outlier effect in SPC uses the outliers’ pulling effect itself to signal the existence of an outlier. Here, Booth and Booth (2009) and Shah and Booth (forthcoming) used the fact that fractal dimension is computed with least squares regression. Thus, if a point that is an outlier is added to a data set, the outlier effect causes a jump in the regression coefficients, and the resulting jump in the fractal dimension signals the presence of the outlier. This approach has been used both in nuclear material safeguards (Booth and Booth, 2009) and in chemical process control (Shah and Booth, forthcoming). Overall, the use of the outlier effect has had a productive history in SPC (Grant and Leavenworth, 1980) and in statistics in general (Stigler, 2010).
3 Methodology
3.1 Loess smooth
We now consider the methods used to construct our proposed control chart based on the loess smooth. OLS regression works well if the error term follows a normal distribution, but OLS poses problems if there are OTC points (outliers). As Kimmel et al. (2010) pointed out, in OLS, outliers cause the prediction equation to shift or rotate, which changes the residuals, i.e. the distances from the regression equation to the data points. This can cause misclassification: some of the distances become shorter, resulting in type II errors when OTC points are overlooked, and some become longer, resulting in type I errors when points are classified as OTC when, in fact, they are not. We propose using loess (a robust locally weighted scatterplot smoothing method), which neither discards outliers nor is unduly influenced by them, and we build loess-based control charts on it so that outliers are easily identified. Cleveland (1979), Cleveland and Loader (1995) and Kimmel et al. (2010) discussed the loess algorithm. Loess is a combination of local fitting of polynomials and iteratively weighted least squares. At each point in the data set, a linear or quadratic polynomial is fitted using weighted least squares, giving more weight to points near the point whose response is being estimated and less weight to points further away. The value of the regression function for the point is then obtained by evaluating the local polynomial at that data point. The loess fit is complete after regression function values have been computed for each of the data points. Note also that an iterative process to downweight outliers is applied (Cleveland, 1979).
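The procedure just described can be sketched in a few lines of Python. This is an illustrative sketch only, not the SAS proc loess implementation used in this paper: the function names, the rule that turns the smoothing parameter into a neighbour count, and the test data are assumptions made here. The tricube neighbourhood weights and the bisquare robustness weights scaled by six times the median absolute residual follow the lowess proposal of Cleveland (1979).

```python
import numpy as np


def tricube(u):
    """Neighbourhood weight (1 - |u|^3)^3, zero for |u| >= 1."""
    u = np.clip(np.abs(u), 0.0, 1.0)
    return (1.0 - u ** 3) ** 3


def bisquare(u):
    """Robustness weight (1 - u^2)^2, zero for |u| >= 1."""
    u = np.clip(np.abs(u), 0.0, 1.0)
    return (1.0 - u ** 2) ** 2


def robust_loess(x, y, span=0.35, degree=1, robust_iterations=5):
    """Return fitted values of a robust local polynomial smooth at each x.

    Assumes the x values are distinct (true for case numbers 1, 2, ..., n).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    # Assumption: the span is turned into a neighbour count by rounding up.
    m = min(n, max(int(np.ceil(span * n)), degree + 2))
    robust_w = np.ones(n)
    fitted = np.empty(n)
    for _ in range(robust_iterations + 1):   # one plain pass, then robust refits
        for i in range(n):
            dist = np.abs(x - x[i])
            h = np.sort(dist)[m - 1]          # distance to the m-th nearest point
            w = tricube(dist / h) * robust_w  # distance weight times robustness weight
            design = np.vander(x - x[i], degree + 1, increasing=True)
            root_w = np.sqrt(w)
            coef, *_ = np.linalg.lstsq(design * root_w[:, None], y * root_w, rcond=None)
            fitted[i] = coef[0]               # local polynomial evaluated at x[i]
        residuals = y - fitted
        scale = np.median(np.abs(residuals))
        if scale == 0.0:
            break
        robust_w = bisquare(residuals / (6.0 * scale))
    return fitted


# Illustrative data (not from the paper): a gentle trend with one long response.
x = np.arange(1, 31)
y = 2.0 + 0.1 * x + 0.1 * np.sin(1.7 * x)
y[14] += 10.0
fit = robust_loess(x, y, span=0.5, degree=1, robust_iterations=5)
print(float(y[14] - fit[14]))  # residual at the outlying case stays large
```

Because the outlying point receives a robustness weight of zero after the first pass, the smooth at that case is determined by its neighbours, and the outlier shows up as a large positive residual rather than being absorbed into the fit.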
3.2 Test data
We first consider the test data of Lloyd (1995), which Lloyd used to test his outlier-sensitive method and which we use to test our proposed outlier-resistant method. In his discussion of these data, Lloyd (1995) defines response time, one of the measurements used to ascertain the performance of the system or subsystem, as the time between a user request and the answer to that request. Read sequential is one type of file operation request to the system. Lloyd (1995) used two sets of read sequential response time data to fit his models and make his charts. To test how our loess-based chart performs, i.e. to see whether our chart finds the outliers that Lloyd indicated but his chart failed to find, we use the same data as Lloyd (reported in Suh et al., 2000a): the response times to user requests of read sequential on day 1 (35 transaction cases) and on day 2 (89 transaction cases). The difference between the two days is that, for day 1, 46 records are added to the database before the read sequential operation occurs, while for day 2, no additional records are added before it occurs. We fit a loess smooth to each of these two data sets, make control charts based on the loess results to identify any unusually long response times (outliers) for day 1 and day 2, and compare our results with Lloyd’s to see whether our control charts give a consistent improvement.
3.3 Model and implementation
Our loess regression model is $\hat{y} = f(x)$, where $y$ is the response time ($\hat{y}$ its fitted value), $x$ is the observed transaction case number and $f$ is the function fitted by loess. We implemented the model with SAS proc loess. We set parameters k, d and q in SAS. The parameter k, the smoothing parameter, determines the amount of smoothing: the larger k is, the smoother the resulting curve. Therefore, k needs to be chosen on the basis of the properties of the data and the amount of smoothing desired. Parameter d is the degree of the polynomial used in the locally weighted regression and is restricted to being a non-negative integer. As d is increased, the computational complexity of the regression quickly rises; however, the flexibility to reproduce patterns in the data is enhanced. If d is set at d = 0, then local constancy is assumed. Since the data may or may not exhibit autocorrelation, d should be set at d > 0. We set d = 1 or d = 2 because these values provide a good balance between computational ease and flexibility. The parameter q is the number of robustness iterations. For the portion of the algorithm that downweights outliers, experimentation with loess has shown that loess tends to converge rather quickly. In our case, q = 5 is a good value to meet the convergence criterion. We define the residuals from the loess fit as $\text{residual}_i = y_i - \hat{y}_i$, where $y_i$ is the observed response time and $\hat{y}_i$ is the corresponding fitted value. If $\text{residual}_i > 0$, then $y_i > \hat{y}_i$, i.e. the observed response time exceeds the fitted response time. The larger the positive residual, the more likely the observation is an outlier and hence screened as an OTC point in the information system application. To predict the acceptable performance of the system and examine unusually long response times, which are not acceptable and worth investigating, a control chart is needed. We are only interested in long response times and hence only UCLs were considered.
Lloyd’s control chart used the mean response measurement as the central line and that value + 2.66 × mean moving range as his UCL. His control chart identified outliers in the data sets of day 1 and day 2, but also missed some outliers (for details, see Section 4). Lloyd’s chart failed to identify these outliers because the outliers themselves pulled the UCL too far towards them. We propose a robust (outlier-resistant) analogue to the standard control chart limits (Grant and Leavenworth, 1980, p.78ff). To make these limits robust, they were calculated as UCL = median(response) + k × median absolute deviation(loess residuals). The median and median absolute deviation as robust estimators are defined and discussed in Booth (1986a). We use k = 2.66 to make our loess control chart comparable with that of Lloyd (1995); this multiplier k is distinct from the smoothing parameter k above. Our loess control chart identifies the outliers successfully and hence efficiently yields the desired process control. The loess control chart was implemented in R. We note that such methods have a long history in operations research. The first such charts were proposed by Walter Shewhart (see e.g. Shewhart, 1931). The name of W. Edwards Deming is also well known in the use of such methods.
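Written out, the robust limit described in this subsection takes the following form. This is a sketch under two assumptions that the text does not settle: the median absolute deviation is taken about the median of the residuals, and it is left unscaled (no normal-consistency constant is applied). The symbols $y_i$, $\hat{y}_i$, $r_i$ and $n$ are notation introduced here.

```latex
r_i = y_i - \hat{y}_i, \qquad
\mathrm{mad}(r) = \operatorname*{median}_{1 \le i \le n}
  \Bigl\lvert\, r_i - \operatorname*{median}_{1 \le j \le n} r_j \Bigr\rvert, \qquad
\mathrm{UCL} = \operatorname*{median}_{1 \le i \le n} y_i + 2.66\,\mathrm{mad}(r)
```

Under this reading, case $i$ is flagged as OTC when $y_i > \mathrm{UCL}$. Neither the median of the responses nor the median absolute deviation of the residuals moves much when a few responses are very long, so the limit stays close to the bulk of the data.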
4 Results
For the analysis, we used two methods and compared their results: a linear loess fit and a quadratic loess fit. Each is discussed below.
4.1 Linear loess fits for day 1 and day 2 data
For day 1, we tested the smoothing parameter k with the following values: 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5 and 0.6, each with 5 robust iterations. After comparison based on graphical considerations, we chose k = 0.35 as our final smoothing parameter. Figure 1 and Table 1 show the day 1 plot and residual results from SAS proc loess with k = 0.35. From Figure 1, we see that the chosen linear loess fit with smoothing parameter k = 0.35 provides a reasonable fit to the data. The residual data values, rank of the residual data values and loess predicted values are shown in Table 1. In Lloyd’s (1995) original study, he had good reason to believe that the case 1 observation in the day 1 data was OTC, as well as the obvious case 18. His moving range-based control chart (Lloyd, 1995, p.72), however, failed to show case 1 as OTC (Lloyd, 1995, p.80). Our chart, Figure 2, based on median(responses) + 2.66 × mad(loess residuals), where mad refers to the median absolute deviation, indicated that case 1 is OTC as well as case 18. Note that we use Lloyd’s value 2.66 for comparison purposes. It is well known (Booth, 1982, 1984, 1986a,b; Chernik et al., 1982) that large robust residuals correspond to large outliers. Our result is confirmed from this point of view, since it is consistent with the rank order of the loess positive residuals in day 1 (Table 1). All of the loess OTC measures are consistent with each other as well as with Lloyd’s non-chart result. Lloyd’s chart could not show case 1 as an OTC point because the outliers themselves pulled the UCL too far towards them. Our loess control chart, Figure 2, avoids this effect of outliers by using the median and mad (Booth, 1986a), hence making the OTC points stand out. We further note that Lloyd mislabelled case 1 as case 2. We discovered this error using our chart.
Figure 1 Smoothing parameter: k = 0.35, day 1 data, linear robust fit (axes: resp against case)
Table 1 Residuals for day 1 data with smoothing parameter 0.35

Case  Response time  Loess predicted  Residuals  Rank of residual  Loess outlier
1     130.23         129.8297         0.40035    2                 Yes
2     129.13         129.7466         -0.61662   35                No
3     129.68         129.6656         0.01441    17                No
4     129.41         129.5857         -0.17574   29                No
5     129.56         129.5059         0.05411    16                No
6     129.41         129.4579         -0.04786   18                No
7     129.5          129.4098         0.09016    14                No
8     129.24         129.3996         -0.15963   28                No
9     129.07         129.3894         -0.31941   33                No
10    129.67         129.3851         0.28487    5                 No
11    129.24         129.3808         -0.14084   26                No
12    129.68         129.3882         0.29184    4                 No
13    129.24         129.3912         -0.15118   27                No
14    129.5          129.3942         0.1058     12                No
15    129.25         129.3746         -0.12459   23                No
16    129.52         129.355          0.16503    10                No
17    129.29         129.3528         -0.06277   20                No
18    132.94         129.3506         3.58944    1                 Yes
19    129.14         129.3514         -0.2114    31                No
20    129.46         129.3523         0.10775    11                No
21    129.3          129.354          -0.05398   19                No
22    129.58         129.3701         0.20989    8                 No
23    129.25         129.3862         -0.13624   25                No
24    129.47         129.3956         0.07441    15                No
25    129.2          129.4049         -0.20494   30                No
26    129.79         129.4082         0.38183    3                 No
27    129.09         129.4114         -0.32141   34                No
28    129.69         129.416          0.27397    6                 No
29    129.29         129.4207         -0.13066   24                No
30    129.51         129.4193         0.0907     13                No
31    129.3          129.418          -0.11795   22                No
32    129.58         129.4122         0.16785    9                 No
33    129.19         129.4063         -0.21634   32                No
34    129.63         129.4004         0.22956    7                 No
35    129.3          129.3945         -0.09454   21                No
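As a worked check of the day 1 comparison, the sketch below recomputes both kinds of limit from the response times and loess predicted values transcribed from Table 1. It rests on assumptions made here rather than on the authors' R code: the median absolute deviation is taken about the median and left unscaled, the Lloyd-type limit is assumed to be the mean plus 2.66 times the mean moving range, and the function names are hypothetical.

```python
import numpy as np

# Day 1 response times and linear loess predicted values, transcribed from Table 1.
response = np.array([
    130.23, 129.13, 129.68, 129.41, 129.56, 129.41, 129.50,
    129.24, 129.07, 129.67, 129.24, 129.68, 129.24, 129.50,
    129.25, 129.52, 129.29, 132.94, 129.14, 129.46, 129.30,
    129.58, 129.25, 129.47, 129.20, 129.79, 129.09, 129.69,
    129.29, 129.51, 129.30, 129.58, 129.19, 129.63, 129.30,
])
predicted = np.array([
    129.8297, 129.7466, 129.6656, 129.5857, 129.5059, 129.4579, 129.4098,
    129.3996, 129.3894, 129.3851, 129.3808, 129.3882, 129.3912, 129.3942,
    129.3746, 129.3550, 129.3528, 129.3506, 129.3514, 129.3523, 129.3540,
    129.3701, 129.3862, 129.3956, 129.4049, 129.4082, 129.4114, 129.4160,
    129.4207, 129.4193, 129.4180, 129.4122, 129.4063, 129.4004, 129.3945,
])


def mad(values):
    """Unscaled median absolute deviation about the median (assumed definition)."""
    values = np.asarray(values, dtype=float)
    return np.median(np.abs(values - np.median(values)))


def robust_ucl(responses, residuals, multiplier=2.66):
    """median(responses) + multiplier * mad(loess residuals)."""
    return np.median(responses) + multiplier * mad(residuals)


def moving_range_ucl(responses, multiplier=2.66):
    """Assumed Lloyd-type limit: mean + multiplier * mean moving range."""
    moving_range = np.abs(np.diff(responses))
    return np.mean(responses) + multiplier * np.mean(moving_range)


cases = np.arange(1, len(response) + 1)
residuals = response - predicted

ucl_robust = robust_ucl(response, residuals)
ucl_mr = moving_range_ucl(response)
flagged_robust = cases[response > ucl_robust].tolist()
flagged_mr = cases[response > ucl_mr].tolist()

print("robust UCL:", ucl_robust, "flags cases", flagged_robust)
print("moving-range UCL:", ucl_mr, "flags cases", flagged_mr)
```

With these transcribed values the robust limit comes out near 129.82 and flags cases 1 and 18, while the moving-range limit comes out near 131.0 and flags only case 18. This matches the behaviour described in the text: case 18 inflates both the mean and the mean moving range, which lifts the conventional limit above the response time of case 1.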
Figure 2 Loess control chart for day 1 linear fit (axes: DepVar against case)
For day 2, we tested the smoothing parameter k with values of 0.1, 0.15, 0.2, 0.25, 0.3 and 0.4, each with 5 robust iterations, and we chose k = 0.15 as our optimal smoothing parameter based on comparison. Figure 3 and Table 2 show the day 2 plot and residual results from SAS proc loess. From Figure 3, we see that the chosen linear loess fit with smoothing parameter 0.15 provides a reasonable fit to the data. The residual data values, rank of the residual data values and loess predicted values are shown in Table 2. For the day 2 data, Lloyd (1995) believed that the observations for cases 6, 22, 39, 56, 72 and 89 were OTC points, as well as the obvious case 36 (Lloyd, 1995; Suh et al., 2000a,b). Lloyd’s control chart (Lloyd, 1995, p.72), however, failed to show any case other than case 36 as OTC (Lloyd, 1995, p.80). Our chart, Figure 4, based on median(responses) + 2.66 × mad(loess residuals), indicated that cases 6, 22, 39, 56, 72 and 89, as well as case 36, are OTC points. Again we use Lloyd’s value 2.66 for comparison purposes. Our result is confirmed since it is consistent with the rank order of the loess positive residuals in day 2 (Table 2). Again, our loess control chart (Figure 4) avoids the effect of outliers by using the median and mad, hence making the OTC points stand out.
Figure 3 Smoothing parameter: 0.15, day 2 data, linear robust fit (axes: resp against case)
Table 2 Residuals of day 2 data with smoothing parameter 0.15

Case  Response time  Loess predicted  Residuals  Rank of residual  Loess outlier
1     127.38         127.4499         -0.06991   53                No
2     127.7          127.4463         0.25373    17                No
3     127.42         127.4453         -0.02527   44                No
4     127.43         127.4468         -0.01679   43                No
5     127.1          127.4483         -0.34832   89                No
6     128.2          127.4477         0.7523     7                 Yes
7     127.27         127.4467         -0.17668   76                No
8     127.64         127.4457         0.19434    25                No
9     127.37         127.455          -0.08501   56                No
10    127.81         127.4583         0.3517     8                 No
11    127.32         127.4616         -0.1416    67                No
12    127.37         127.4615         -0.09153   58                No
13    127.32         127.4563         -0.13628   65                No
14    127.76         127.451          0.30897    12                No
15    127.2          127.4498         -0.24984   86                No
16    127.76         127.4531         0.30694    13                No
17    127.27         127.4563         -0.18628   77                No
18    127.6          127.4463         0.15371    29                No
19    127.31         127.4198         -0.10976   60                No
20    127.53         127.3932         0.13677    30                No
21    127.32         127.3711         -0.05107   50                No
22    128.25         127.3695         0.88049    5                 Yes
23    127.26         127.368          -0.10796   59                No
24    127.11         127.3653         -0.25534   87                No
25    127.49         127.3627         0.12728    33                No
26    127.44         127.3649         0.07507    38                No
27    127.64         127.3768         0.2632     15                No
28    127.22         127.3887         -0.16867   74                No
29    127.33         127.3937         -0.06373   51                No
30    127.27         127.3872         -0.11723   61                No
31    127.6          127.3807         0.21927    23                No
32    127.22         127.3856         -0.16556   73                No
33    127.6          127.3721         0.22786    19                No
34    127.21         127.3587         -0.14871   68                No
35    127.54         127.3173         0.22273    22                No
36    131.21         127.2758         3.93418    1                 Yes
37    127.21         127.2481         -0.03809   49                No
38    127.11         127.2286         -0.11862   63                No
39    128.21         127.2092         1.00085    3                 Yes
40    126.98         127.203          -0.223     84                No
41    127.42         127.2192         0.20079    24                No
42    127.06         127.2354         -0.17541   75                No
43    127.59         127.2626         0.3274     10                No
44    127.21         127.2799         -0.06987   52                No
45    127.27         127.2972         -0.02715   45                No
46    127.22         127.293          -0.0734    54                No
47    127.59         127.29           0.30034    14                No
48    127.06         127.28           -0.2197    83                No
49    127.44         127.274          0.16576    28                No
50    127.11         127.269          -0.1588    71                No
51    127.38         127.256          0.12425    34                No
52    127.05         127.243          -0.193     78                No
53    127.49         127.23           0.25984    16                No
54    127.11         127.228          -0.1182    62                No
55    127.21         127.238          -0.0276    46                No
56    128.15         127.247          0.9029     4                 Yes
57    127.05         127.253          -0.203     80                No
58    127.45         127.259          0.19101    26                No
59    127.18         127.269          -0.0889    57                No
60    127.6          127.281          0.3193     11                No
61    127.06         127.293          -0.2325    85                No
62    127.34         127.293          0.04747    39                No
63    127.16         127.281          -0.1214    64                No
64    127.6          127.27           0.32964    9                 No
65    127.11         127.264          -0.154     70                No
66    127.34         127.265          0.07528    37                No
67    127.05         127.265          -0.2154    82                No
68    127.44         127.263          0.17661    27                No
69    127.1          127.261          -0.1614    72                No
70    127.38         127.267          0.11256    35                No
71    127.28         127.282          -0.0018    41                No
72    128.15         127.296          0.85383    6                 Yes
73    127.05         127.306          -0.2557    88                No
74    127.44         127.307          0.13341    31                No
75    127.34         127.308          0.03253    40                No
76    127.39         127.31           0.08055    36                No
77    127.28         127.312          -0.0323    47                No
78    127.28         127.315          -0.0351    48                No
79    127.16         127.312          -0.1521    69                No
80    127.54         127.309          0.23091    18                No
81    127.11         127.31           -0.2001    79                No
82    127.54         127.314          0.22607    20                No
83    127.11         127.318          -0.2077    81                No
84    127.54         127.315          0.22456    21                No
85    127.17         127.311          -0.141     66                No
86    127.29         127.307          -0.0166    42                No
87    127.43         127.302          0.12854    32                No
88    127.22         127.296          -0.0762    55                No
89    128.32         127.291          1.02905    2                 Yes
Figure 4 Loess control chart for day 2 linear loess fit (axes: DepVar against case)
4.2 Quadratic loess fits for day 1 and day 2 data
For ease of comparison, the quadratic loess fits followed the same procedure and settings as the linear loess fits. For day 1, we tested k = 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5 and 0.6; again we used 5 robust iterations, since more iterations produce little improvement while consuming more computing time. After comparison based on graphical considerations, we chose k = 0.3 as the optimal smoothing parameter. Figure 5 shows the day 1 plot and Table 3 shows the residual results from SAS proc loess with k = 0.3. From Figure 5, we found that the chosen quadratic loess fit with k = 0.3 provides a reasonable fit to the data. The residual data values, rank of the residual data values and loess predicted values are shown in Table 3. Our chart with the quadratic loess fit, shown in Figure 6 and based on median(responses) + 2.66 × mad(loess residuals), indicated that case 1 (erroneously identified in Lloyd as case 2) is OTC as well as case 18, which is consistent with the linear robust fit result.

For the day 2 quadratic fit, we tested k = 0.1, 0.15, 0.2, 0.25, 0.3 and 0.4, each with 5 robust iterations. We finally chose k = 0.25 as the optimal value. The day 2 plot is shown in Figure 7 and the residual results are shown in Table 4. By observing Figure 7, we conclude that the chosen quadratic loess fit with smoothing parameter k = 0.25 provides a reasonable fit to the data. Table 4 contains the residual data values, rank of the residual data values and loess predicted values. Figure 8 is our chart for the day 2 data, which is based on median(responses) + 2.66 × mad(loess residuals). The chart indicates that cases 6, 22, 39, 56, 72, 89 and 36 are OTC. Our results in Figures 2, 4, 6 and 8 suggest that using a robust fit is more important to the OTC analysis than whether a linear or quadratic polynomial is used, since all the loess-based charts find all of the OTC points. Because that result may be specific to this application, we suggest that practitioners fit both linear and quadratic polynomials and choose the better robust fit for their own data.

Figure 5 Smoothing parameter: 0.30, day 1 data, quadratic robust fit (axes: resp against case)
Table 3 Residuals for day 1 data with smoothing parameter 0.30

Case  Response time  Loess predicted  Residuals  Rank of residual  Loess outlier
1     130.23         130.14245        0.08755    14                No
2     129.13         129.88154        -0.75154   35                No
3     129.68         129.68537        -0.00537   17                No
4     129.41         129.58547        -0.17547   28                No
5     129.56         129.48558        0.07442    16                No
6     129.41         129.42382        -0.01382   18                No
7     129.5          129.36207        0.13793    10                No
8     129.24         129.33206        -0.09206   21                No
9     129.07         129.30205        -0.23205   33                No
10    129.67         129.36421        0.30579    3                 No
11    129.24         129.42637        -0.18637   30                No
12    129.68         129.45371        0.22629    6                 No
13    129.24         129.42529        -0.18529   29                No
14    129.5          129.39688        0.10312    12                No
15    129.25         129.37959        -0.12959   24                No
16    129.52         129.36231        0.15769    9                 No
17    129.29         129.32138        -0.03138   19                No
18    132.94         129.28044        3.65956    1                 Yes
19    129.14         129.30471        -0.16471   27                No
20    129.46         129.32898        0.13102    11                No
21    129.3          129.39265        -0.09265   22                No
22    129.58         129.39228        0.18772    7                 No
23    129.25         129.39191        -0.14191   26                No
24    129.47         129.3894         0.0806     15                No
25    129.2          129.3869         -0.1869    31                No
26    129.79         129.41728        0.37272    2                 No
27    129.09         129.44766        -0.35766   34                No
28    129.69         129.43849        0.25151    4                 No
29    129.29         129.42933        -0.13933   25                No
30    129.51         129.41893        0.09107    13                No
31    129.3          129.40854        -0.10854   23                No
32    129.58         129.40908        0.17092    8                 No
33    129.19         129.40961        -0.21961   32                No
34    129.63         129.39521        0.23479    5                 No
35    129.3          129.3808         -0.0808    20                No

Figure 6 Loess control chart for day 1 quadratic robust fit (axes: DepVar against case)
Figure 7 Smoothing parameter: 0.25, day 2 data, quadratic robust fit (axes: resp against case)
Table 4 Case
Residuals for day 2 data with smoothing parameter 0.25 Response time
Loess predicted
Residuals
Rank of loess predicted Loess outlier
1
127.38
127.45885
0.07885
54
No
2
127.7
127.45589
0.24411
17
No
3
127.42
127.45292
0.03292
45
No No
4
127.43
127.45186
0.02186
42
5
127.1
127.4508
0.3508
89
No
6
128.2
127.44974
0.75026
7
Yes
No
7
127.27
127.45028
0.18028
76
8
127.64
127.45082
0.18918
25
No
9
127.37
127.45136
0.08136
55
No
10
127.81
127.45789
0.35211
8
No
11
127.32
127.46442
0.14442
66
No
12
127.37
127.47095
0.10095
57
No
13
127.32
127.4683
0.1483
67
No
14
127.76
127.46564
0.29436
13
No
15
127.2
127.46299
0.26299
88
No
16
127.76
127.45375
0.30625
12
No
17
127.27
127.4445
75
No
18
127.6
127.43525
0.16475
28
No
19
127.31
127.42139
0.11139
58
No
20
127.53
127.40753
0.12247
32
No
0.1745
88 Table 4 Case 21
Z. Tao et al. Residuals for day 2 data with smoothing parameter 0.25 (continued) Response time 127.32
Loess predicted
Residuals
127.39367
0.07367
Rank of loess predicted Loess outlier 52
No
22
128.25
127.38317
0.86683
5
Yes
23
127.26
127.37268
0.11268
59
No
24
127.11
127.37055
0.26055
87
No
25
127.49
127.36841
0.12159
33
No
26
127.44
127.36628
0.07372
38
No
27
127.64
127.37655
0.26345
15
No
28
127.22
127.38683
0.16683
72
No
29
127.33
127.3971
0.0671
50
No
30
127.27
127.39564
0.12564
61
No
31
127.6
127.39418
0.20582
23
No
32
127.22
127.39271
0.17271
74
No
33
127.6
127.36687
0.23313
18
No
34
127.21
127.34102
0.13102
63
No
35
127.54
127.31484
0.22516
21
No
36
131.21
127.28866
3.92134
1
Yes
37
127.21
127.26248
0.05248
49
No
38
127.11
127.24667
0.13667
65
No
39
128.21
127.23086
0.97914
2
Yes
40
126.98
127.21505
0.23505
84
No
41
127.42
127.23032
0.18968
24
No
42
127.06
127.24559
0.18559
77
No
43
127.59
127.26086
0.32914
9
No
44
127.21
127.27945
0.06945
51
No
45
127.27
127.29805
0.02805
43
No
46
127.22
127.29716
0.07716
53
No
47
127.59
127.29626
0.29374
14
No
48
127.06
127.29537
0.23537
85
No
49
127.44
127.2804
0.1596
29
No
50
127.11
127.26543
0.15543
69
No
51
127.38
127.25047
0.12953
31
No
52
127.05
127.24558
0.19558
78
No
53
127.49
127.24069
0.24931
16
No
54
127.11
127.23581
0.12581
62
No
55
127.21
127.24095
0.03095
44
No
56
128.15
127.24609
0.90391
4
Yes
57    127.05         127.25686        0.20686    81                       No
58    127.45         127.26764        0.18236    26                       No
59    127.18         127.27841        0.09841    56                       No
60    127.6          127.28181        0.31819    11                       No
61    127.06         127.28522        0.22522    83                       No
62    127.34         127.28862        0.05138    39                       No
63    127.16         127.28203        0.12203    60                       No
64    127.6          127.27543        0.32457    10                       No
65    127.11         127.26884        0.15884    71                       No
66    127.34         127.2639         0.0761     37                       No
67    127.05         127.25896        0.20896    82                       No
68    127.44         127.26454        0.17546    27                       No
69    127.1          127.27012        0.17012    73                       No
70    127.38         127.2757         0.1043     34                       No
71    127.28         127.28553        0.00553    41                       No
72    128.15         127.29537        0.85463    6                        Yes
73    127.05         127.3052         0.2552     86                       No
74    127.44         127.30791        0.13209    30                       No
75    127.34         127.31063        0.02937    40                       No
76    127.39         127.31334        0.07666    36                       No
77    127.28         127.31401        0.03401    46                       No
78    127.28         127.31468        0.03468    47                       No
79    127.16         127.31346        0.15346    68                       No
80    127.54         127.31225        0.22775    19                       No
81    127.11         127.31103        0.20103    79                       No
82    127.54         127.31344        0.22656    20                       No
83    127.11         127.31585        0.20585    80                       No
84    127.54         127.31825        0.22175    22                       No
85    127.17         127.32602        0.15602    70                       No
86    127.29         127.33378        0.04378    48                       No
87    127.43         127.34154        0.08846    35                       No
88    127.22         127.35358        0.13358    64                       No
89    128.32         127.36562        0.95438    3                        Yes
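As a check on how the Residuals column of Table 4 can be read, the following minimal sketch recomputes a few entries. It assumes that each residual is the response time minus the loess predicted value and that the table reports its absolute value; the function name `signed_residual` is illustrative and not taken from the paper. The tabulated ranks appear consistent with ordering the signed residuals from largest to smallest, which would explain why small negative residuals carry high ranks.

```python
import math

# (case, response time, loess predicted, tabulated residual) copied from Table 4
rows = [
    (21, 127.32, 127.39367, 0.07367),
    (22, 128.25, 127.38317, 0.86683),
    (36, 131.21, 127.28866, 3.92134),
    (40, 126.98, 127.21505, 0.23505),
]


def signed_residual(response, predicted):
    """Observed response time minus the loess fitted value (assumed definition)."""
    return response - predicted


for case, response, predicted, tabulated in rows:
    r = signed_residual(response, predicted)
    # The Residuals column matches the absolute value of the signed residual.
    assert math.isclose(abs(r), tabulated, abs_tol=1e-9)
    print(case, round(r, 5), round(abs(r), 5))
```

Cases 21 and 40 have negative signed residuals (the response time lies below the fitted curve), while cases 22 and 36 lie above it.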
Figure 8  Loess control chart for day 2 quadratic robust fit (vertical axis: DepVar, 127 to 131; horizontal axis: case, 0 to 80)
5 Managerial implication
A database management system (DBMS) consists of related files that contain useful information, and it must be robust enough to return that information in response to different queries (Lloyd, 1995). The successful use of a database system depends largely on the stability of its operations, and monitoring that stability requires efficient and accurate identification of outliers in database response times. Traditionally, database administrators have monitored performance only by examining descriptive statistics of database operations, such as the mean and standard deviation. The drawback of this approach is that it describes historical performance only and gives no predictive information about DBMS operation (Lloyd, 1995). For this reason, Lloyd (1995) developed SPC methods to monitor DBMS performance and to predict possible failure. Using robust loess fitting, we identified all the outliers in the data sets for both day 1 and day 2. We conclude that our method is easier to implement and more efficient than the previous SPC method, and that it offers information technology managers a more cost-effective way to monitor DBMS performance.
6 Conclusion
In this research, we used the same two data sets as Lloyd (1995) and fitted a loess smooth to each. Our loess-based control charts consistently identified all unusually long response times (outliers) for both day 1 and day 2, whereas Lloyd’s charts missed some outliers on both days. We conclude that, for database management data of this type, a loess control chart is a better SPC method than Lloyd’s control chart and gives chart users more valuable guidance in information quality management. Compared with other control charts, a loess control chart has five benefits:
1 the underlying functions are approximated rather than interpolated in the area of outliers
2 gross error effects are limited
3 the convergence rate is improved because gross errors are eliminated, so the fit does not have to unlearn the effects of outliers
4 it is simple to implement and, further, using the median and MAD makes the statistical properties of the loess control chart approximate those of an x-bar chart (Booth, 1984)
5 it downweights the effect of outliers and hence makes the out-of-control points stand out.
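To illustrate how the median and MAD can turn loess residuals into control limits, here is a minimal sketch using cases 21 to 40 of Table 4. The limit rule (median plus or minus three times the MAD scaled by 1.4826) is an assumption made for illustration; this section does not state the exact limits used for the charts, and the name `mad_limits` is hypothetical.

```python
from statistics import median

# (case, response time, loess predicted) for cases 21-40 of Table 4
data = [
    (21, 127.32, 127.39367), (22, 128.25, 127.38317), (23, 127.26, 127.37268),
    (24, 127.11, 127.37055), (25, 127.49, 127.36841), (26, 127.44, 127.36628),
    (27, 127.64, 127.37655), (28, 127.22, 127.38683), (29, 127.33, 127.3971),
    (30, 127.27, 127.39564), (31, 127.6, 127.39418), (32, 127.22, 127.39271),
    (33, 127.6, 127.36687), (34, 127.21, 127.34102), (35, 127.54, 127.31484),
    (36, 131.21, 127.28866), (37, 127.21, 127.26248), (38, 127.11, 127.24667),
    (39, 128.21, 127.23086), (40, 126.98, 127.21505),
]


def mad_limits(residuals, k=3.0):
    """Return (lower, upper) limits: median +/- k * 1.4826 * MAD (assumed rule)."""
    centre = median(residuals)
    mad = median(abs(r - centre) for r in residuals)
    half_width = k * 1.4826 * mad  # 1.4826 scales the MAD to sigma under normality
    return centre - half_width, centre + half_width


# Signed residual: response time minus loess predicted value
residuals = {case: resp - pred for case, resp, pred in data}
lower, upper = mad_limits(list(residuals.values()))
flagged = [case for case, r in residuals.items() if r < lower or r > upper]
print(flagged)  # → [22, 36, 39]
```

Under this assumed rule, the flagged cases coincide with the rows marked Yes in the Loess outlier column for this range. Because the median and MAD are barely moved by the three large residuals, the limits stay narrow enough for those points to stand out.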
One limitation of this research is that the optimal smoothing parameter for each case was determined manually. We fitted curves over a range of smoothing parameter values, increasing the value incrementally, compared the fitted curves by their graphical smoothness, and chose the value that gave the best curve on that basis. In future research, this comparison could be automated, so that a computer program compares the fitted curves and determines the one with optimal smoothness. In addition, simulation could be used in the manner of Wright et al. (2001) to further characterise the statistical properties of the loess-based control chart. We hope to attempt this research in the near future.
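One possible way to automate the choice of smoothing parameter is sketched below. It is not the procedure used in this paper: it swaps the graphical comparison for leave-one-out cross-validation with a median absolute error criterion, and it uses a plain locally weighted linear fit with tricube weights rather than the quadratic robust loess fit behind the charts. The function names, the candidate spans, and the use of cases 21 to 40 only are all assumptions made for illustration.

```python
import math
from statistics import median


def tricube(u):
    """Tricube kernel weight for a scaled distance u >= 0."""
    return (1 - u ** 3) ** 3 if u < 1 else 0.0


def local_linear(xs, ys, x0, span, skip=None):
    """Locally weighted linear fit at x0 using the nearest span * n points.

    `skip` optionally leaves one index out (used for cross-validation).
    """
    pts = [(x - x0, y) for i, (x, y) in enumerate(zip(xs, ys)) if i != skip]
    pts.sort(key=lambda p: abs(p[0]))
    k = min(len(pts), max(3, math.ceil(span * len(pts))))
    # Bandwidth just beyond the k-th nearest neighbour, so every weight is positive
    h = abs(pts[k - 1][0]) * 1.001 or 1.0
    sw = swd = swy = swdd = swdy = 0.0
    for d, y in pts[:k]:
        w = tricube(abs(d) / h)
        sw += w
        swd += w * d
        swy += w * y
        swdd += w * d * d
        swdy += w * d * y
    denom = sw * swdd - swd * swd
    if abs(denom) < 1e-12:  # degenerate neighbourhood: fall back to weighted mean
        return swy / sw
    slope = (sw * swdy - swd * swy) / denom
    return (swy - slope * swd) / sw  # intercept, i.e. the fitted value at x0


def loo_score(xs, ys, span):
    """Median absolute leave-one-out prediction error for one span."""
    errors = [abs(ys[i] - local_linear(xs, ys, xs[i], span, skip=i))
              for i in range(len(xs))]
    return median(errors)


def choose_span(xs, ys, candidates):
    """Return the candidate span with the smallest leave-one-out score."""
    return min(candidates, key=lambda s: loo_score(xs, ys, s))


# Response times for cases 21-40 of Table 4 (day 2)
xs = list(range(21, 41))
ys = [127.32, 128.25, 127.26, 127.11, 127.49, 127.44, 127.64, 127.22, 127.33,
      127.27, 127.6, 127.22, 127.6, 127.21, 127.54, 131.21, 127.21, 127.11,
      128.21, 126.98]
candidates = [0.25, 0.35, 0.5, 0.75]  # hypothetical grid of smoothing parameters
best = choose_span(xs, ys, candidates)
print(best, {s: round(loo_score(xs, ys, s), 4) for s in candidates})
```

The median is used in the score so that a few very long response times do not dominate the comparison between spans. A criterion based on a roughness measure of the fitted curve would be closer in spirit to the graphical comparison described above, but it would need a separate rule to trade smoothness against fit.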
Acknowledgements
We thank the editor, reviewers and Karoly Bozan for their very helpful suggestions.
References
Bianco, A. and Martinez, E. (2009) ‘Robust testing in the logistic regression model’, Computational Statistics and Data Analysis, Vol. 53, pp.4095–4105.
Booth, D. (1984) ‘Some applications of robust statistical methods to analytical chemistry’, PhD Thesis, Chapel Hill: The University of North Carolina at Chapel Hill.
Booth, D.E. (1982) ‘The analysis of outlying data points using robust regression: a multivariate problem bank identification model’, Decision Sciences, Vol. 13, pp.71–81.
Booth, D.E. (1986a) Regression Methods and Problem Banks, a Unit to Demonstrate an Application of Data Analysis. Lexington, MA: COMAP, Inc., Module No. 626, Reprinted in UMAP Modules Tools for Teaching, 1985, Arlington, MA: COMAP, Inc.
Booth, D.E. (1986b) ‘A method for the early identification of loss from a nuclear materials inventory’, The Journal of Chemical Information and Computer Science, Vol. 26, pp.23–28.
Booth, D.E. and Booth, S.E. (2009) ‘A new nuclear material safeguards method using fractal dimension’, Current Analytical Chemistry, Vol. 5, No. 1, pp.26–28.
Booth, D.E., Zhu, D.X., Baker, D.L. and Hamburg, J.H. (2005) ‘Recent chemometric approaches to the detection of nuclear material losses’, Current Analytical Chemistry, Vol. 1, No. 2, pp.181–186.
Chang, I., Tiao, G.C. and Chen, C. (1988) ‘Estimation of time series parameters in the presence of outliers’, Technometrics, Vol. 30, pp.193–204.
Chen, C. and Liu, L-M. (1993a) ‘Joint estimation of model parameters and outliers effects in time series’, Journal of the American Statistical Association, Vol. 88, No. 421, pp.284–297.
Chen, C. and Liu, L-M. (1993b) ‘Forecasting time series with outliers’, Journal of Forecasting, Vol. 12, pp.13–35.
Chen, C., Liu, L-M. and Hudak, G.B. (1992) ‘Modeling and forecasting time series in the presence of outliers and missing data’, Working Paper No. 128, Scientific Computing Associations Corp., Oak Brook, IL.
Chen, C. and Tiao, G.C. (1990) ‘Random level shift time series models, ARIMA approximation, and level shift detection’, Journal of Business and Economic Statistics, Vol. 8, pp.170–186.
Chernik, M.R., Downing, D.J. and Pike, D.H. (1982) ‘Detecting outliers in time series data’, Journal of the American Statistical Association, Vol. 77, No. 380, pp.743–747.
Cleveland, W.S. (1979) ‘Robust locally weighted regression and smoothing scatterplots’, Journal of the American Statistical Association, Vol. 74, pp.829–836.
Cleveland, W.S. and Loader, C. (1995) ‘Smoothing by local regression: principles and methods’, Technical Report, Murray Hill, NJ: AT&T Bell Laboratories.
Croux, C., Flandre, C. and Haesbroeck, G. (2002) ‘The breakdown behavior of the maximum likelihood estimator in the logistic regression model’, Statistics & Probability Letters, Vol. 60, No. 4, pp.377–386.
Grant, E. and Leavenworth, R. (1980) Statistical Quality Control (5th ed.). New York: McGraw-Hill.
Hauser, R.P. and Booth, D. (forthcoming a) ‘CEO bonuses as studied by robust logistic regression’, Journal of Data Science.
Hauser, R.P. and Booth, D. (forthcoming b) ‘Predicting bankruptcy with robust logistic regression’, Journal of Data Science.
Kimmel, R., Booth, D. and Booth, S. (2010) ‘The analysis of outlying data points by robust loess: a model for the identification of problem banks’, Int. J. Operational Research, Vol. 7, No. 1, pp.1–15.
Portnoy, S. (1980) Personal Communication.
Lloyd, S.J. (1995) ‘An expert system approach to choosing the least cost physical file structure in different types of databases’, Unpublished PhD Dissertation, Kent State University, Kent, OH.
Mahaney, J., Baker, D., Hamburg, J. and Booth, D. (2007a) ‘Time series analysis of process data’, Int. J. Operational Research, Vol. 3, No. 2, pp.231–253.
Mahaney, J., Goeke, R. and Booth, D. (2007b) ‘Out of control (outlier) detection in business data using the ARMA(1, 1) model’, Int. J. Operational Research, Vol. 2, No. 2, pp.115–134.
Sawitzki, G. (2009) Computational Statistics. Boca Raton: CRC Press.
Shah, V. and Booth, D. (forthcoming) ‘A fractal dimension-based method for statistical process control’, Int. J. Operational Research.
Shewhart, W. (1931) Economic Control of Quality of Manufactured Product. New York: D. Van Nostrand Company.
Stigler, S.M. (2010) ‘Changing history of robustness’, The American Statistician, Vol. 64, No. 4, pp.277–281.
Suh, M., Booth, D., Grznar, J., Prasad, S., Lloyd, S. and Hamburg, J. (2000a) ‘Fuzzy and robust neural networks and information systems process control’, Industrial Mathematics, Vol. 50, No. 1, pp.5–31.
Suh, M., Booth, D., Grznar, J., Prasad, S., Lloyd, S. and Hamburg, J. (2000b) ‘An investigation of neural network algorithms for database management applications’, Unpublished PhD Dissertation, Kent State University, Kent, OH.
Wright, C.M., Booth, D.E. and Hu, M.Y. (2001) ‘Joint estimation: SPC method for short run auto correlated data’, Journal of Quality Technology, Vol. 33, No. 3, pp.365–377.