A new control chart based on the loess smooth


Int. J. Operational Research, Vol. 15, No. 1, 2012

A new control chart based on the loess smooth applied to information system quality performance Zhi Tao and Fengkun Liu Department of Management and Information Systems, Graduate School of Management, College of Business Administration, Kent State University, Kent, OH 44242, USA E-mail: [email protected] E-mail: [email protected]

Fanglin Shen Department of Finance, Graduate School of Management, College of Business Administration, Kent State University, Kent, OH 44242, USA E-mail: [email protected]

Michael Suh and David Booth* Department of Management and Information Systems, Graduate School of Management, College of Business Administration, Kent State University, Kent, OH 44242, USA E-mail: [email protected] E-mail: [email protected] *Corresponding author

Abstract: To measure the performance of an information system, it is sometimes necessary to identify unusually long response times of database operations. As an alternative to Lloyd’s (1995) statistical process control chart, we propose an outlier-resistant, robust locally weighted scatterplot smooth (loess) based control chart that effectively identifies all the out-of-control points in Lloyd’s (1995) data. Our findings provide guidance to management in information system quality control and, more generally, to all control chart users in cases where outliers occur in the charted data.

Keywords: loess smooth; statistical process control; outliers; information system; control chart; operational research.

Reference to this paper should be made as follows: Tao, Z., Liu, F., Shen, F., Suh, M. and Booth, D. (2012) ‘A new control chart based on the loess smooth applied to information system quality performance’, Int. J. Operational Research, Vol. 15, No. 1, pp.74–93.

Copyright © 2012 Inderscience Enterprises Ltd.

Biographical notes: Zhi Tao is currently a PhD Candidate with a major in Operations Management and a minor in Applied Statistics in the Department of Management and Information Systems in the College of Business Administration at Kent State University, Kent, OH. She holds a Master of Science degree in Accounting and Information Systems from the University of Delaware. Before beginning the Doctoral Program at Kent State University, she worked for Philips Medical Systems as an Accountant. Her research interests include green supply chain management, operation efficiency and outliers in statistical process control. Her research has been presented at numerous conferences and published in the Proceedings of the Decision Sciences Institute, the Midwest Decision Sciences Institute and the 1st International Symposium on Green Supply Chains.

Fengkun Liu is currently a PhD Candidate in Information Systems in the Department of Management and Information Systems, Kent State University. He earned his BS at Qingdao University, Qingdao, China and his MS at the Catholic University of Korea, Bucheon, Korea, both in Management Information Systems. His research interests include social networks, mobile application adoption, recommender systems and e-commerce.

Fanglin Shen is currently a PhD Candidate in the Department of Finance, Kent State University. She earned her Bachelor’s degree from Beihang University, China and her Master’s degree from Southern Illinois University at Edwardsville. Her research interests include investment, international finance and behavioural finance. She has presented at several finance conferences.

Michael Suh is currently an Information Systems Consultant. He served as an Assistant Professor in the Management and Information Systems Department at Kent State University, where he earned his PhD. Previously, he has published in the Int. J. Computer Integrated Manufacturing and Industrial Mathematics. His research interests focus on information systems.

David Booth is currently a Professor in the Department of Management and Information Systems at Kent State University. He earned his PhD at the University of North Carolina at Chapel Hill. Previously, he has published in the Int. J. Operational Research, Journal of Quality Technology and Decision Sciences, among others. His research interests include applied statistics and quality control.

1 Introduction

It is well known that outliers have a major deleterious effect on statistical methods based on ordinary least squares (OLS). Robust techniques (Hauser and Booth, forthcoming a,b; Stigler, 2010) have been developed to combat these effects. Time series methods are related to regression algorithms, so outliers can cause the same type of problem there as well. We propose an outlier-resistant, robust locally weighted scatterplot smoothing (loess) method for use in statistical process control (SPC) as the foundation for a new type of quality control chart.

A control chart is a plot of a quality response variable against the time sequence of the observations, which forms a time series (Booth, 1984). Many smoothing and regression-type algorithms applied to time series data are strongly affected by outlying observations (Chang et al., 1988; Chen and Liu, 1993a,b; Chen and Tiao, 1990; Chen et al., 1992), as are standard x̄ control charts (Booth, 1984; Mahaney et al., 2007a,b). The effect is to pull the x̄ values towards the outliers and thus cause erroneous interpretation of the quality control data. We argue in this paper that this is exactly the problem with Lloyd’s (1995) proposed control chart. We further argue that a robust, outlier-resistant control chart can be developed from a loess smooth (Cleveland, 1979; Cleveland and Loader, 1995) and outlier-resistant summary statistics, e.g. medians and median absolute deviations. We then show that such a loess-based control chart works better in Lloyd’s (1995) information system application than the chart he proposed.

The SPC application is to identify abnormally long response times (outliers) in database management processes. Lloyd (1995) developed a control chart that uses the mean measurement as the central line and the mean + 2.66 × mean moving average as an upper control limit (UCL) to identify these abnormally long response times. His control chart identified some outliers in the data set but missed others (for details, see Section 4). Lloyd’s chart missed those outliers because the outliers themselves pulled the UCL too far towards them.

We developed a robust (outlier-resistant) loess-based control chart. In our chart, UCL = median (observations) + 2.66 × (median absolute deviation (loess residuals)). The median and median absolute deviation are defined and discussed in Booth (1986a,b). Our loess-based control chart identified all the outliers because the loess procedure, together with the median and the median absolute deviation, gives outliers lower weight, so the smooth is not pulled towards them; the resulting residuals are then used to compute the UCL.

2 Literature review

The effect of outliers on maximum likelihood and least squares estimates has been known and studied since the first comparison of the properties of means and medians (Bianco and Martinez, 2009; Portnoy, 1980; Stigler, 2010). Because means and variances are often used in SPC, especially in control charts, this effect is well known, and researchers have used it in various ways to try to improve SPC methods (Booth and Booth, 2009; Booth et al., 2005; Mahaney et al., 2007a,b; Shah and Booth, forthcoming). All of these new methods try to improve SPC either by using outlier-resistant (robust) methods of estimation or by taking advantage of the way outliers pull maximum likelihood or least squares estimators towards themselves (Booth, 1986a; Croux et al., 2002; Hauser and Booth, forthcoming a,b; Kimmel et al., 2010). This paper tries to take advantage of the properties of the median, the median absolute deviation (Booth, 1986a) and the loess smooth (Cleveland, 1979; Cleveland and Loader, 1995; Sawitzki, 2009).

Loess began as locally weighted scatterplot smoothing, or lowess (Cleveland, 1979). Essentially, that method fits locally a robust (i.e. outlier-resistant) regression (Booth, 1986a) with a polynomial model of degree one or two. A second proposal, loess, was put forward later; it added a nearest neighbour component and led to the double weighting scheme detailed in Section 3 (Cleveland and Loader, 1995; Kimmel et al., 2010) and used in SAS proc loess. The robust weighting scheme in loess works in the same manner as that in lowess.

The previously mentioned papers in SPC take advantage of the outlier effect in several ways. We first recall three facts. Firstly, the effect itself is that outliers pull maximum likelihood estimates and least squares estimates towards themselves (Bianco and Martinez, 2009; Booth, 1986a). Secondly, a robust (i.e. outlier-resistant) method resists the outliers’ pull. Finally, out-of-control (OTC) points in SPC are outliers (Booth, 1984).

Booth et al. (2005) used a robust time series method (joint estimation), which not only identifies outliers but also assigns them time series types, to identify losses from nuclear material inventories. Mahaney et al. (2007a,b) used joint estimation to deal with SPC problems in metallurgical industries as well as in business management. An advantage of the approaches that use joint estimation is that they take into account any autocorrelation that may be in the data (Wright et al., 2001), as does loess (Cleveland, 1979). Joint estimation is again based on countering the outliers’ pulling effect.

Another approach uses the outliers’ pulling effect to signal the existence of the outlier. Here, Booth and Booth (2009) and Shah and Booth (forthcoming) used the fact that fractal dimension is computed with least squares regression. If a point that is an outlier is added to a data set, the outlier effect causes a jump in the regression coefficients and hence in the fractal dimension, which signals the presence of the outlier. This approach has been used both in nuclear material safeguards (Booth and Booth, 2009) and in chemical process control (Shah and Booth, forthcoming). Overall, the use of the outlier effect has had a productive history in SPC (Grant and Leavenworth, 1980) and in statistics in general (Stigler, 2010).
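The pulling effect recalled above can be illustrated with a small sketch. The data are hypothetical (a flat series with one out-of-control point appended), and the helper names `ols_line` and `median` are introduced here for illustration only; the sketch is not taken from any of the cited papers.

```python
def ols_line(xs, ys):
    """Return (intercept, slope) of the ordinary least squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def median(values):
    """Median of a non-empty list of numbers."""
    s = sorted(values)
    mid = len(s) // 2
    return s[mid] if len(s) % 2 else (s[mid - 1] + s[mid]) / 2


# Hypothetical in-control series: ten observations, all equal to 10.0.
xs = list(range(1, 11))
clean = [10.0] * 10

# The same series with one out-of-control observation appended.
xs_out = xs + [11]
contaminated = clean + [20.0]

_, slope_clean = ols_line(xs, clean)
_, slope_out = ols_line(xs_out, contaminated)
mean_out = sum(contaminated) / len(contaminated)
median_out = median(contaminated)

print(slope_clean, slope_out)   # the least squares slope jumps away from zero
print(mean_out, median_out)     # the mean moves, the median stays at 10.0
```

In this toy series the single added point moves both the least squares slope and the mean, while the median is unchanged, which is the contrast between pulled and resistant estimators that the reviewed methods rely on.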

3 Methodology

3.1 Loess smooth

We now consider the methods used to construct our proposed control chart based on the loess smooth. OLS regression works well if the error term follows a normal distribution, but OLS poses problems if there are OTC points (outliers). As Kimmel et al. (2010) pointed out, in OLS, outliers cause the prediction equation to shift or rotate, which changes the residuals, i.e. the distances from the regression equation to the data points. Points can then be misclassified: some distances become shorter, resulting in type II errors when OTC points are overlooked, and some become longer, resulting in type I errors when points are classified as OTC although they are not. We propose using loess (the robust locally weighted scatterplot smoothing method), which neither discards outliers nor is unduly influenced by them, and we build loess-based control charts on it so that outliers are easily identified.

Cleveland (1979), Cleveland and Loader (1995) and Kimmel et al. (2010) discussed the loess algorithm. Loess is a combination of local fitting of polynomials and iteratively weighted least squares. At each point in the data set, a linear or quadratic polynomial is fitted using weighted least squares, giving more weight to points near the point whose response is being estimated and less weight to points further away. The value of the regression function at the point is then obtained by evaluating the local polynomial at that point. The loess fit is complete after regression function values have been computed for each of the data points. Note also that an iterative process to downweight outliers is applied (Cleveland, 1979).
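The local fitting and downweighting steps described above can be sketched as follows. This is a minimal illustration, not the SAS proc loess implementation: it assumes local linear fits only, tricube neighbourhood weights and bisquare robustness weights scaled by six times the median absolute residual, as in Cleveland (1979). The names `robust_loess`, `span` (the smoothing parameter) and `iterations` are chosen here for illustration, and the data are hypothetical.

```python
def median(values):
    """Median of a non-empty list of numbers."""
    s = sorted(values)
    mid = len(s) // 2
    return s[mid] if len(s) % 2 else (s[mid - 1] + s[mid]) / 2


def local_line_value(xs, ys, ws, x0):
    """Value at x0 of the weighted least squares line; None if all weights are 0."""
    sw = sum(ws)
    if sw <= 0.0:
        return None
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    if sxx <= 0.0:                        # only one x carries weight: local mean
        return my
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    return my + (sxy / sxx) * (x0 - mx)


def robust_loess(xs, ys, span=0.5, iterations=5):
    """Local linear loess with tricube weights and bisquare robustness passes."""
    n = len(xs)
    r = min(n, max(2, round(span * n)))   # neighbours used in each local fit
    robust_w = [1.0] * n
    fitted = list(ys)
    for _ in range(iterations + 1):       # one plain pass, then the robust passes
        for i, x0 in enumerate(xs):
            dists = [abs(x - x0) for x in xs]
            h = sorted(dists)[r - 1]      # distance to the r-th nearest point
            ws = [rw * (1 - (d / h) ** 3) ** 3 if d < h else 0.0
                  for d, rw in zip(dists, robust_w)]
            value = local_line_value(xs, ys, ws, x0)
            if value is not None:         # keep the previous value otherwise
                fitted[i] = value
        residuals = [y - f for y, f in zip(ys, fitted)]
        s = median([abs(e) for e in residuals])
        if s == 0.0:
            break
        robust_w = [(1 - (e / (6 * s)) ** 2) ** 2 if abs(e) < 6 * s else 0.0
                    for e in residuals]
    return fitted


# Hypothetical series: small alternating noise around 5.0, one outlier at case 10.
xs = list(range(1, 21))
ys = [5.0 + (0.1 if i % 2 == 0 else -0.1) for i in range(20)]
ys[9] = 15.0

plain = robust_loess(xs, ys, span=0.5, iterations=0)    # no downweighting
robust = robust_loess(xs, ys, span=0.5, iterations=5)   # with downweighting
print(plain[9], robust[9])
```

Comparing `plain[9]` with `robust[9]` shows the point of the robustness passes: without them the fitted value at the outlying case is drawn towards the outlier, and with them it stays near the level of the neighbouring observations, so the residual at that case remains large.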


3.2 Test data

We first consider the test data of Lloyd (1995), which Lloyd used to test his outlier-sensitive method and which we use to test our proposed outlier-resistant method. In his discussion of these data, Lloyd (1995) defines response time, one of the measurements used to ascertain the performance of a system or subsystem, as the time between a user request and the answer to that request. Read sequential is one type of file operation request to the system. Lloyd (1995) used two sets of read sequential response time data to fit his models and make his charts. To test how our loess-based chart performs, i.e. to see whether it identifies the outliers that Lloyd indicated but that his chart failed to find, we use the same data as Lloyd (reported in Suh et al., 2000a): the response times to read sequential user requests on day 1 (35 transaction cases) and on day 2 (89 transaction cases). The two days differ in that, on day 1, 46 records are added to the database before the read sequential operation occurs, whereas on day 2 no additional records are added beforehand. We fit a loess smooth to each data set, construct control charts from the loess results to identify any unusually long response times (outliers) on each day, and compare our day 1 and day 2 results with Lloyd’s to assess whether our control charts improve on his consistently.

3.3 Model and implementation

Our loess regression model is ŷ = f(x), where y is the response time, x is the observed transaction case number and f is the function fitted by loess. We implemented the model with SAS proc loess and set the parameters k, d and q in SAS.

The parameter k, the smoothing parameter, determines the amount of smoothing. The larger k is, the smoother the resulting curve. Therefore, k needs to be chosen on the basis of the properties of the data and the amount of smoothing desired. The parameter d is the degree of the polynomial used in the locally weighted regression and is restricted to being a non-negative integer. As d is increased, the computational complexity of the regression rises quickly; however, the flexibility to reproduce patterns in the data is enhanced. If d = 0, local constancy is assumed. Since the data may or may not exhibit autocorrelation, d should be set at d > 0. We set d = 1 or d = 2 because these values provide a good balance between computational ease and flexibility. The parameter q is the number of iterations of the portion of the algorithm that downweights outliers. Experimentation with loess has shown that it tends to converge rather quickly; in our case, q = 5 is a good value to meet the convergence criterion.

We define the residuals from the loess fit as residual_i = Y_i − ŷ_i, where Y_i is the observed response time and ŷ_i is the corresponding fitted value. If residual_i > 0, then Y_i > ŷ_i, i.e. the observed response time exceeds the fitted response time. The larger the positive residual, the more likely the observation is an outlier and hence screened as an OTC point in the information system application. To predict the acceptable performance of the system and to examine unusually long response times, which are not acceptable and are worth investigating, a control chart is needed. We are interested only in long response times, so only UCLs were considered.
Lloyd’s control chart used the mean response measurement as the central line and that value + 2.66 × mean moving average as the UCL. His control chart identified outliers in the day 1 and day 2 data sets but also missed some outliers (for details, see Section 4). Lloyd’s chart failed to identify those outliers because the outliers themselves pulled the UCL too far towards them. We propose a robust (outlier-resistant) analogue to the standard control chart limits (Grant and Leavenworth, 1980, p.78ff). To make these limits robust, they were calculated as UCL = median (response) + k × (median absolute deviation (loess residuals)), where k here denotes the control limit multiplier rather than the loess smoothing parameter. The median and median absolute deviation as robust estimators are defined and discussed in Booth (1986a). We use k = 2.66 to make our loess control chart comparable with that of Lloyd (1995). Our loess control chart identifies the outliers successfully and hence efficiently yields the desired process control. The loess control chart was implemented in R.

We note that control chart methods have a long history in operations research. The first such charts were proposed by Walter Shewhart (see e.g. Shewhart, 1931). The name of W. Edwards Deming is also well known in connection with such methods.
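The robust limit described above can be sketched in a few lines. Two assumptions are made loudly here: the median absolute deviation is taken unscaled (no normal-consistency constant), and the comparison limit is assumed to have the usual individuals-chart form, mean plus 2.66 times the mean moving range, as a reading of the description of Lloyd’s chart. The function names and the data are hypothetical and are not taken from the R implementation mentioned in the text.

```python
def median(values):
    """Median of a non-empty list of numbers."""
    s = sorted(values)
    mid = len(s) // 2
    return s[mid] if len(s) % 2 else (s[mid - 1] + s[mid]) / 2


def mad(values):
    """Unscaled median absolute deviation about the median (an assumption)."""
    centre = median(values)
    return median([abs(v - centre) for v in values])


def robust_ucl(responses, residuals, k=2.66):
    """UCL = median(responses) + k * MAD(loess residuals)."""
    return median(responses) + k * mad(residuals)


def mean_based_ucl(responses, k=2.66):
    """Comparison limit: mean + k * mean moving range (assumed form)."""
    ranges = [abs(b - a) for a, b in zip(responses, responses[1:])]
    return sum(responses) / len(responses) + k * sum(ranges) / len(ranges)


def flag(responses, ucl):
    """1-based case numbers whose response exceeds the UCL."""
    return [i + 1 for i, y in enumerate(responses) if y > ucl]


# Hypothetical response times, with residuals from a flat smooth at 10.0.
responses = [10.0, 10.2, 9.9, 10.1, 14.0, 10.0, 9.8]
residuals = [0.0, 0.2, -0.1, 0.1, 4.0, 0.0, -0.2]

ucl_robust = robust_ucl(responses, residuals)
ucl_mean = mean_based_ucl(responses)
flagged_robust = flag(responses, ucl_robust)
flagged_mean = flag(responses, ucl_mean)
print(ucl_robust, flagged_robust)
print(ucl_mean, flagged_mean)
```

In this toy example the long response at case 5 inflates both the mean and the moving ranges, so the mean-based limit rises above it and nothing is flagged, whereas the median and MAD barely move and the robust limit flags case 5.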

4 Results

For analysis, we used two different methods and compared their results. The first method we used is linear loess fit and the second method we used is quadratic loess fit. Discussions are as follows.

4.1 Linear loess fits for day 1 and day 2 data

For day 1, we tested the smoothing parameter k at the values 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5 and 0.6, each with five robust iterations. After comparison based on graphical considerations, we chose k = 0.35 as our final smoothing parameter. Figure 1 and Table 1 show the day 1 plot and residual results from SAS proc loess with k = 0.35. From Figure 1, we see that the chosen linear loess fit with smoothing parameter k = 0.35 provides a reasonable fit to the data. The residual data values, the rank of the residual data values and the loess predicted values are shown in Table 1.

In Lloyd’s (1995) original study, he had good reason to believe that the case 1 observation in the day 1 data was OTC, as well as the obvious case 18. His moving range-based control chart (Lloyd, 1995, p.72), however, failed to show case 1 as OTC (Lloyd, 1995, p.80). Our chart, Figure 2, based on median (responses) + 2.66 × (mad (loess residuals)), where mad refers to the median absolute deviation, indicated that case 1 is OTC as well as case 18. Note that we use Lloyd’s value 2.66 for comparison purposes. It is well known (Booth, 1982, 1984, 1986a,b; Chernik et al., 1982) that large robust residuals correspond to large outliers. Our result is confirmed from this point of view, since it is consistent with the rank order of the positive loess residuals for day 1 (Table 1). All of the loess OTC measures are consistent with each other and with Lloyd’s non-chart result.

Lloyd’s chart could not show case 1 as an OTC point because the outliers themselves pulled the UCL too far towards them. Our loess control chart, Figure 2, avoids this effect of outliers by using the median and mad (Booth, 1986a), hence making the OTC points stand out. We further note that Lloyd mislabelled case 1 as case 2. We discovered this error using our chart.
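The rank-order check mentioned above can be sketched as follows. The five rows are a subset of Table 1 (cases 1, 2, 12, 18 and 26), chosen here for illustration; the residual is recomputed as response time minus loess predicted value, and rank 1 is assigned to the largest residual. The variable names are illustrative.

```python
# (case, response time, loess predicted) for five day 1 cases from Table 1.
rows = [
    (1, 130.23, 129.8297),
    (2, 129.13, 129.7466),
    (12, 129.68, 129.3882),
    (18, 132.94, 129.3506),
    (26, 129.79, 129.4082),
]

# Residual = observed response time minus loess fitted value.
residuals = {case: y - fit for case, y, fit in rows}

# Order the cases from the largest residual to the smallest and rank them.
order = sorted(residuals, key=residuals.get, reverse=True)
ranks = {case: position + 1 for position, case in enumerate(order)}

for case in order:
    print(case, round(residuals[case], 5), ranks[case])
```

Within this subset, cases 18, 1, 26 and 12 take ranks 1 to 4, as they do in Table 1; case 2 has a negative residual and falls last (rank 5 here, rank 35 among all 35 cases in Table 1).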

Figure 1  Smoothing parameter: k = 0.35, day 1 data, linear robust fit (vertical axis: resp; horizontal axis: case)

Table 1  Residuals for day 1 data with smoothing parameter 0.35

Case  Response time  Loess predicted  Residual  Rank  Loess outlier
1     130.23         129.8297         0.40035   2     Yes
2     129.13         129.7466         -0.61662  35    No
3     129.68         129.6656         0.01441   17    No
4     129.41         129.5857         -0.17574  29    No
5     129.56         129.5059         0.05411   16    No
6     129.41         129.4579         -0.04786  18    No
7     129.5          129.4098         0.09016   14    No
8     129.24         129.3996         -0.15963  28    No
9     129.07         129.3894         -0.31941  33    No
10    129.67         129.3851         0.28487   5     No
11    129.24         129.3808         -0.14084  26    No
12    129.68         129.3882         0.29184   4     No
13    129.24         129.3912         -0.15118  27    No
14    129.5          129.3942         0.1058    12    No
15    129.25         129.3746         -0.12459  23    No
16    129.52         129.355          0.16503   10    No
17    129.29         129.3528         -0.06277  20    No
18    132.94         129.3506         3.58944   1     Yes
19    129.14         129.3514         -0.2114   31    No
20    129.46         129.3523         0.10775   11    No
21    129.3          129.354          -0.05398  19    No
22    129.58         129.3701         0.20989   8     No
23    129.25         129.3862         -0.13624  25    No
24    129.47         129.3956         0.07441   15    No
25    129.2          129.4049         -0.20494  30    No
26    129.79         129.4082         0.38183   3     No
27    129.09         129.4114         -0.32141  34    No
28    129.69         129.416          0.27397   6     No
29    129.29         129.4207         -0.13066  24    No
30    129.51         129.4193         0.0907    13    No
31    129.3          129.418          -0.11795  22    No
32    129.58         129.4122         0.16785   9     No
33    129.19         129.4063         -0.21634  32    No
34    129.63         129.4004         0.22956   7     No
35    129.3          129.3945         -0.09454  21    No

Figure 2  Loess control chart for day 1 linear fit (vertical axis: DepVar; horizontal axis: case)

For day 2, we tested the smoothing parameter k at the values 0.1, 0.15, 0.2, 0.25, 0.3 and 0.4, each with five robust iterations, and on the basis of comparison we chose k = 0.15 as our optimal smoothing parameter. Figure 3 and Table 2 show the day 2 plot and residual results from SAS proc loess. From Figure 3, we see that the chosen linear loess fit with smoothing parameter 0.15 provides a reasonable fit to the data. The residual data values, the rank of the residual data values and the loess predicted values are shown in Table 2.

For the day 2 data, Lloyd (1995) believed that the observations for cases 6, 22, 39, 56, 72 and 89 were OTC points, as well as the obvious case 36 (Lloyd, 1995; Suh et al., 2000a,b). Lloyd’s control chart (Lloyd, 1995, p.72), however, failed to show any case other than case 36 as OTC (Lloyd, 1995, p.80). Our chart, Figure 4, based on median (responses) + 2.66 × (mad (loess residuals)), indicated that cases 6, 22, 39, 56, 72 and 89, as well as case 36, are OTC points. Again, we use Lloyd’s value 2.66 for comparison purposes. Our result is confirmed since it is consistent with the rank order of the positive loess residuals for day 2 (Table 2). Again, our loess control chart (Figure 4) avoids the effect of outliers by using the median and mad, hence making the OTC points stand out.

Figure 3  Smoothing parameter: 0.15, day 2 data, linear robust fit (vertical axis: resp; horizontal axis: case)

Table 2  Residuals of day 2 data with smoothing parameter 0.15

Case  Response time  Loess predicted  Residual  Rank  Loess outlier
1     127.38         127.4499         -0.06991  53    No
2     127.7          127.4463         0.25373   17    No
3     127.42         127.4453         -0.02527  44    No
4     127.43         127.4468         -0.01679  43    No
5     127.1          127.4483         -0.34832  89    No
6     128.2          127.4477         0.7523    7     Yes
7     127.27         127.4467         -0.17668  76    No
8     127.64         127.4457         0.19434   25    No
9     127.37         127.455          -0.08501  56    No
10    127.81         127.4583         0.3517    8     No
11    127.32         127.4616         -0.1416   67    No
12    127.37         127.4615         -0.09153  58    No
13    127.32         127.4563         -0.13628  65    No
14    127.76         127.451          0.30897   12    No
15    127.2          127.4498         -0.24984  86    No
16    127.76         127.4531         0.30694   13    No
17    127.27         127.4563         -0.18628  77    No
18    127.6          127.4463         0.15371   29    No
19    127.31         127.4198         -0.10976  60    No
20    127.53         127.3932         0.13677   30    No
21    127.32         127.3711         -0.05107  50    No
22    128.25         127.3695         0.88049   5     Yes
23    127.26         127.368          -0.10796  59    No
24    127.11         127.3653         -0.25534  87    No
25    127.49         127.3627         0.12728   33    No
26    127.44         127.3649         0.07507   38    No
27    127.64         127.3768         0.2632    15    No
28    127.22         127.3887         -0.16867  74    No
29    127.33         127.3937         -0.06373  51    No
30    127.27         127.3872         -0.11723  61    No
31    127.6          127.3807         0.21927   23    No
32    127.22         127.3856         -0.16556  73    No
33    127.6          127.3721         0.22786   19    No
34    127.21         127.3587         -0.14871  68    No
35    127.54         127.3173         0.22273   22    No
36    131.21         127.2758         3.93418   1     Yes
37    127.21         127.2481         -0.03809  49    No
38    127.11         127.2286         -0.11862  63    No
39    128.21         127.2092         1.00085   3     Yes
40    126.98         127.203          -0.223    84    No
41    127.42         127.2192         0.20079   24    No
42    127.06         127.2354         -0.17541  75    No
43    127.59         127.2626         0.3274    10    No
44    127.21         127.2799         -0.06987  52    No
45    127.27         127.2972         -0.02715  45    No
46    127.22         127.293          -0.0734   54    No
47    127.59         127.29           0.30034   14    No
48    127.06         127.28           -0.2197   83    No
49    127.44         127.274          0.16576   28    No
50    127.11         127.269          -0.1588   71    No
51    127.38         127.256          0.12425   34    No
52    127.05         127.243          -0.193    78    No
53    127.49         127.23           0.25984   16    No
54    127.11         127.228          -0.1182   62    No
55    127.21         127.238          -0.0276   46    No
56    128.15         127.247          0.9029    4     Yes
57    127.05         127.253          -0.203    80    No
58    127.45         127.259          0.19101   26    No
59    127.18         127.269          -0.0889   57    No
60    127.6          127.281          0.3193    11    No
61    127.06         127.293          -0.2325   85    No
62    127.34         127.293          0.04747   39    No
63    127.16         127.281          -0.1214   64    No
64    127.6          127.27           0.32964   9     No
65    127.11         127.264          -0.154    70    No
66    127.34         127.265          0.07528   37    No
67    127.05         127.265          -0.2154   82    No
68    127.44         127.263          0.17661   27    No
69    127.1          127.261          -0.1614   72    No
70    127.38         127.267          0.11256   35    No
71    127.28         127.282          -0.0018   41    No
72    128.15         127.296          0.85383   6     Yes
73    127.05         127.306          -0.2557   88    No
74    127.44         127.307          0.13341   31    No
75    127.34         127.308          0.03253   40    No
76    127.39         127.31           0.08055   36    No
77    127.28         127.312          -0.0323   47    No
78    127.28         127.315          -0.0351   48    No
79    127.16         127.312          -0.1521   69    No
80    127.54         127.309          0.23091   18    No
81    127.11         127.31           -0.2001   79    No
82    127.54         127.314          0.22607   20    No
83    127.11         127.318          -0.2077   81    No
84    127.54         127.315          0.22456   21    No
85    127.17         127.311          -0.141    66    No
86    127.29         127.307          -0.0166   42    No
87    127.43         127.302          0.12854   32    No
88    127.22         127.296          -0.0762   55    No
89    128.32         127.291          1.02905   2     Yes

Figure 4  Loess control chart for day 2 linear loess fit (vertical axis: DepVar; horizontal axis: case)

4.2 Quadratic loess fits for day 1 and day 2 data

For ease of comparison, our quadratic loess fit followed a procedure and settings similar to those of the linear loess fit. For day 1, we tested k = 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5 and 0.6. Again we used five robust iterations, since more iterations produce little improvement while consuming more computing time. After comparison based on graphical considerations, we chose k = 0.3 as the optimal smoothing parameter. Figure 5 shows the day 1 plot and Table 3 shows the residual results from SAS proc loess with k = 0.3. From Figure 5, we found that the chosen quadratic loess fit with k = 0.3 provides a reasonable fit to the data. The residual data values, the rank of the residual data values and the loess predicted values are shown in Table 3. Our chart with quadratic loess, shown in Figure 6 and based on median (responses) + 2.66 × (mad (loess residuals)), indicated that case 1 (erroneously identified in Lloyd as case 2) is OTC as well as case 18, which is consistent with the linear robust fit result.

For the day 2 quadratic fit, we tested k = 0.1, 0.15, 0.2, 0.25, 0.3 and 0.4, each with five robust iterations. We finally chose k = 0.25 as the optimal value. The day 2 plot is shown in Figure 7 and the residual results are shown in Table 4. From Figure 7, we conclude that the chosen quadratic loess fit with smoothing parameter k = 0.25 provides a reasonable fit to the data. Table 4 contains the residual data values, the rank of the residual data values and the loess predicted values. Figure 8 is our chart for the day 2 data, which is based on median (responses) + 2.66 × (mad (loess residuals)). The chart indicates that cases 6, 22, 39, 56, 72, 89 and 36 are OTC.

Our results in Figures 2, 4, 6 and 8 suggest that using a robust fit is more important to the OTC analysis than whether a linear or a quadratic polynomial is used, since all the loess-based charts find all of the OTC points. Because that result may be specific to this application, we suggest that practitioners try both linear and quadratic polynomials and choose the better robust fit for their own data.

Figure 5  Smoothing parameter: 0.30, day 1 data, quadratic robust fit (vertical axis: resp; horizontal axis: case)

Table 3  Residuals for day 1 data with smoothing parameter 0.30

Case  Response time  Loess predicted  Residual  Rank  Loess outlier
1     130.23         130.14245        0.08755   14    No
2     129.13         129.88154        -0.75154  35    No
3     129.68         129.68537        -0.00537  17    No
4     129.41         129.58547        -0.17547  28    No
5     129.56         129.48558        0.07442   16    No
6     129.41         129.42382        -0.01382  18    No
7     129.5          129.36207        0.13793   10    No
8     129.24         129.33206        -0.09206  21    No
9     129.07         129.30205        -0.23205  33    No
10    129.67         129.36421        0.30579   3     No
11    129.24         129.42637        -0.18637  30    No
12    129.68         129.45371        0.22629   6     No
13    129.24         129.42529        -0.18529  29    No
14    129.5          129.39688        0.10312   12    No
15    129.25         129.37959        -0.12959  24    No
16    129.52         129.36231        0.15769   9     No
17    129.29         129.32138        -0.03138  19    No
18    132.94         129.28044        3.65956   1     Yes
19    129.14         129.30471        -0.16471  27    No
20    129.46         129.32898        0.13102   11    No
21    129.3          129.39265        -0.09265  22    No
22    129.58         129.39228        0.18772   7     No
23    129.25         129.39191        -0.14191  26    No
24    129.47         129.3894         0.0806    15    No
25    129.2          129.3869         -0.1869   31    No
26    129.79         129.41728        0.37272   2     No
27    129.09         129.44766        -0.35766  34    No
28    129.69         129.43849        0.25151   4     No
29    129.29         129.42933        -0.13933  25    No
30    129.51         129.41893        0.09107   13    No
31    129.3          129.40854        -0.10854  23    No
32    129.58         129.40908        0.17092   8     No
33    129.19         129.40961        -0.21961  32    No
34    129.63         129.39521        0.23479   5     No
35    129.3          129.3808         -0.0808   20    No

Figure 6  Loess control chart for day 1 quadratic robust fit (vertical axis: DepVar; horizontal axis: case)

Figure 7  Smoothing parameter: 0.25, day 2 data, quadratic robust fit (vertical axis: resp; horizontal axis: case)

Table 4  Residuals for day 2 data with smoothing parameter 0.25

Case  Response time  Loess predicted  Residual  Rank  Loess outlier
1     127.38         127.45885        -0.07885  54    No
2     127.7          127.45589        0.24411   17    No
3     127.42         127.45292        -0.03292  45    No
4     127.43         127.45186        -0.02186  42    No
5     127.1          127.4508         -0.3508   89    No
6     128.2          127.44974        0.75026   7     Yes
7     127.27         127.45028        -0.18028  76    No
8     127.64         127.45082        0.18918   25    No
9     127.37         127.45136        -0.08136  55    No
10    127.81         127.45789        0.35211   8     No
11    127.32         127.46442        -0.14442  66    No
12    127.37         127.47095        -0.10095  57    No
13    127.32         127.4683         -0.1483   67    No
14    127.76         127.46564        0.29436   13    No
15    127.2          127.46299        -0.26299  88    No
16    127.76         127.45375        0.30625   12    No
17    127.27         127.4445         -0.1745   75    No
18    127.6          127.43525        0.16475   28    No
19    127.31         127.42139        -0.11139  58    No
20    127.53         127.40753        0.12247   32    No
21    127.32         127.39367        -0.07367  52    No
22    128.25         127.38317        0.86683   5     Yes
23    127.26         127.37268        -0.11268  59    No
24    127.11         127.37055        -0.26055  87    No
25    127.49         127.36841        0.12159   33    No
26    127.44         127.36628        0.07372   38    No
27    127.64         127.37655        0.26345   15    No
28    127.22         127.38683        -0.16683  72    No
29    127.33         127.3971         -0.0671   50    No
30    127.27         127.39564        -0.12564  61    No
31    127.6          127.39418        0.20582   23    No
32    127.22         127.39271        -0.17271  74    No
33    127.6          127.36687        0.23313   18    No
34    127.21         127.34102        -0.13102  63    No
35    127.54         127.31484        0.22516   21    No
36    131.21         127.28866        3.92134   1     Yes
37    127.21         127.26248        -0.05248  49    No
38    127.11         127.24667        -0.13667  65    No
39    128.21         127.23086        0.97914   2     Yes
40    126.98         127.21505        -0.23505  84    No
41    127.42         127.23032        0.18968   24    No
42    127.06         127.24559        -0.18559  77    No
43    127.59         127.26086        0.32914   9     No
44    127.21         127.27945        -0.06945  51    No
45    127.27         127.29805        -0.02805  43    No
46    127.22         127.29716        -0.07716  53    No
47    127.59         127.29626        0.29374   14    No
48    127.06         127.29537        -0.23537  85    No
49    127.44         127.2804         0.1596    29    No
50    127.11         127.26543        -0.15543  69    No
51    127.38         127.25047        0.12953   31    No
52    127.05         127.24558        -0.19558  78    No
53    127.49         127.24069        0.24931   16    No
54    127.11         127.23581        -0.12581  62    No
55    127.21         127.24095        -0.03095  44    No
56    128.15         127.24609        0.90391   4     Yes
57    127.05         127.25686        -0.20686  81    No
58    127.45         127.26764        0.18236   26    No
59    127.18         127.27841        -0.09841  56    No
60    127.6          127.28181        0.31819   11    No
61    127.06         127.28522        -0.22522  83    No
62    127.34         127.28862        0.05138   39    No
63    127.16         127.28203        -0.12203  60    No

64

127.6

127.27543

0.32457

10

No

65

127.11

127.26884

0.15884

71

No

66

127.34

127.2639

0.0761

37

No

67

127.05

127.25896

0.20896

82

No

68

127.44

127.26454

0.17546

27

No No No

69

127.1

127.27012

0.17012

73

70

127.38

127.2757

0.1043

34

71

127.28

127.28553

0.00553

41

No

72

128.15

127.29537

0.85463

6

Yes

73

127.05

127.3052

86

No

74

127.44

127.30791

0.13209

30

No

75

127.34

127.31063

0.02937

40

No

76

127.39

127.31334

0.07666

36

No

77

127.28

127.31401

0.03401

46

No

78

127.28

127.31468

0.03468

47

No

79

127.16

127.31346

0.15346

68

No

80

127.54

127.31225

0.22775

19

No

81

127.11

127.31103

0.20103

79

No

82

127.54

127.31344

0.22656

20

No

83

127.11

127.31585

0.20585

80

No

84

127.54

127.31825

0.22175

22

No

0.2552

85

127.17

127.32602

0.15602

70

No

86

127.29

127.33378

0.04378

48

No

87

127.43

127.34154

0.08846

35

No

88

127.22

127.35358

0.13358

64

No

89

128.32

127.36562

0.95438

3

Yes

Figure 8 Loess control chart for day 2, quadratic robust fit (x-axis: case; y-axis: DepVar)

5 Managerial implication

A database management system (DBMS) consists of related files that contain useful information. Database systems must be robust enough to provide useful information in response to different queries (Lloyd, 1995). The successful use of a database system depends largely on the stability of its operations, which requires efficient and accurate detection of outliers in database response times. Traditionally, database administrators could monitor the performance of a database system only by examining descriptive statistics of its operation, such as the mean and standard deviation. The drawback of this approach is that it describes only historical performance and therefore provides no predictive information about DBMS operation (Lloyd, 1995). For this reason, Lloyd (1995) developed SPC methods to monitor DBMS performance and predict possible failures. Using robust loess fitting, we identified all the outliers in both the day 1 and day 2 data sets. We conclude that our method is easier to implement and more efficient than the previous SPC method. Based on these results, the loess control chart offers information technology managers a more cost-effective way to monitor DBMS performance.
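As a rough illustration of how unusually long response times can be flagged from loess residuals, the following Python sketch sets robust control limits from the median and the median absolute deviation (MAD). It is not the procedure used in this paper: the limit form median ± k × 1.4826 × MAD, the multiplier k = 3 and the function name `robust_limits` are assumptions made for illustration. The residuals are response time minus loess prediction for day 2, cases 33 to 48, computed from Table 4.

```python
from statistics import median


def robust_limits(residuals, k=3.0):
    """Return (lower, upper) limits: median -/+ k * 1.4826 * MAD (assumed rule)."""
    centre = median(residuals)
    mad = median(abs(r - centre) for r in residuals)
    sigma = 1.4826 * mad  # rescales the MAD to estimate a normal standard deviation
    return centre - k * sigma, centre + k * sigma


# Signed residuals (response time minus loess prediction), day 2, cases 33 to 48
cases = list(range(33, 49))
residuals = [0.23313, -0.13102, 0.22516, 3.92134, -0.05248, -0.13667,
             0.97914, -0.23505, 0.18968, -0.18559, 0.32914, -0.06945,
             -0.02805, -0.07716, 0.29374, -0.23537]

lower, upper = robust_limits(residuals)
flagged = [c for c, r in zip(cases, residuals) if r < lower or r > upper]
print(flagged)  # [36, 39]
```

On this subset the sketch flags cases 36 and 39, which agrees with the loess outlier column of Table 4 for these cases. Other subsets or multipliers may behave differently.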

6 Conclusion

In this research, we fitted a loess smooth to the same two data sets used by Lloyd (1995). Our loess-based control charts consistently identified all unusually long response times (outliers) for both day 1 and day 2, whereas Lloyd’s charts missed some outliers on both days. We conclude that, for database management of this type, a loess control chart is a better SPC method than Lloyd’s control chart and provides more valuable guidance to chart users in information quality management. Compared with other control charts, a loess control chart has five benefits:

1 the underlying functions are approximated rather than interpolated in the area of outliers

2 gross error effects are limited

3 the convergence rate is improved because gross errors are eliminated, so the method does not have to unlearn the effects of outliers

4 it is simple to implement and, furthermore, using the median and MAD makes the statistical properties of the loess control chart approximate those of an x-bar chart (Booth, 1984)

5 it downweights the effect of outliers and hence makes the OTC points stand out.
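The downweighting in benefit 5 can be made concrete with the robustness weights of Cleveland (1979), in which each residual e receives the bisquare weight (1 − u²)² with u = |e|/(6s), where s is the median absolute residual, and zero weight when u ≥ 1. The sketch below is illustrative only: the function name `bisquare_weights` is hypothetical, and the residuals (response time minus loess prediction for day 2, cases 33 to 40, from Table 4) are final residuals used here simply as example input.

```python
from statistics import median


def bisquare_weights(residuals, c=6.0):
    """Bisquare robustness weights (1 - u^2)^2, u = |e| / (c * s), s = median |e|."""
    s = median(abs(e) for e in residuals)
    weights = []
    for e in residuals:
        if s > 0:
            u = abs(e) / (c * s)
        else:
            # Degenerate case: more than half of the residuals are exactly zero
            u = 0.0 if e == 0 else 1.0
        weights.append((1 - u * u) ** 2 if u < 1 else 0.0)
    return weights


# Signed residuals (response time minus loess prediction), day 2, cases 33 to 40
cases = list(range(33, 41))
residuals = [0.23313, -0.13102, 0.22516, 3.92134,
             -0.05248, -0.13667, 0.97914, -0.23505]

weights = bisquare_weights(residuals)
for case, w in zip(cases, weights):
    print(case, round(w, 3))
```

In this example the gross outlier at case 36 receives weight zero and case 39 keeps roughly a quarter of its weight, while small residuals such as case 37 keep weights close to one. A refit with these weights is therefore barely influenced by the outlying points.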

One limitation of this research is that the optimal smoothing parameter for each case was determined manually. We ran a series of incrementally different smoothing parameter values to obtain fitted curves, compared the curves by their graphical smoothness and chose the optimal value on that basis. In future research, the efficiency of this step could be improved by automating the comparison, that is, by using computer programmes to compare the fitted curves and determine which has the optimal smoothness. In addition, simulation could be used in the manner of Wright et al. (2001) to further characterise the statistical properties of the loess-based control chart. We hope to attempt this research in the near future.
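One possible way to automate the choice of smoothing parameter is leave-one-out cross-validation over a grid of candidate values. The Python sketch below is an assumption-laden illustration, not the graphical procedure used in this paper: it uses a plain tricube-weighted local linear fit (no robustness iterations and no quadratic term), scores each candidate by mean absolute prediction error so that outliers do not dominate the criterion, and all function names and the candidate grid are hypothetical. The example data are the day 2 response times for cases 1 to 20 from Table 4.

```python
def tricube(u):
    """Tricube kernel used for loess neighbourhood weights."""
    u = abs(u)
    return (1 - u ** 3) ** 3 if u < 1 else 0.0


def local_linear(x0, xs, ys, span, skip=None):
    """Tricube-weighted local linear estimate at x0, optionally omitting index skip."""
    idx = [i for i in range(len(xs)) if i != skip]
    q = min(len(idx), max(3, round(span * len(idx))))  # neighbourhood size
    nearest = sorted(idx, key=lambda i: abs(xs[i] - x0))[:q]
    h = abs(xs[nearest[-1]] - x0)  # bandwidth: distance to the q-th neighbour
    sw = swd = swy = swdd = swdy = 0.0
    for i in nearest:
        d = xs[i] - x0
        w = tricube(d / h) if h > 0 else 1.0
        sw += w
        swd += w * d
        swy += w * ys[i]
        swdd += w * d * d
        swdy += w * d * ys[i]
    if sw == 0:
        return sum(ys[i] for i in nearest) / len(nearest)
    denom = sw * swdd - swd * swd
    if abs(denom) < 1e-12:
        return swy / sw  # degenerate neighbourhood: fall back to a weighted mean
    return (swdd * swy - swd * swdy) / denom  # intercept = fitted value at x0


def loo_error(xs, ys, span):
    """Mean absolute leave-one-out prediction error for one span."""
    errors = [abs(ys[i] - local_linear(xs[i], xs, ys, span, skip=i))
              for i in range(len(xs))]
    return sum(errors) / len(errors)


def choose_span(xs, ys, candidates):
    """Return (best_span, {span: error}) over the candidate spans."""
    scores = {s: loo_error(xs, ys, s) for s in candidates}
    return min(scores, key=scores.get), scores


# Day 2 response times for cases 1 to 20 (Table 4)
xs = list(range(1, 21))
ys = [127.38, 127.7, 127.42, 127.43, 127.1, 128.2, 127.27, 127.64, 127.37,
      127.81, 127.32, 127.37, 127.32, 127.76, 127.2, 127.76, 127.27, 127.6,
      127.31, 127.53]

best, scores = choose_span(xs, ys, [0.25, 0.5, 0.75, 1.0])
print(best, scores)
```

The cross-validated choice need not coincide with the value chosen by eye, since prediction error and graphical smoothness are different criteria. It would serve at most as a starting point for the automated comparison described above.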

Acknowledgements

We thank the editor, reviewers and Karoly Bozan for their very helpful suggestions.

References

Bianco, A. and Martinez, E. (2009) ‘Robust testing in the logistic regression model’, Computational Statistics and Data Analysis, Vol. 53, pp.4095–4105.
Booth, D. (1984) ‘Some applications of robust statistical methods to analytical chemistry’, PhD Thesis, Chapel Hill: The University of North Carolina at Chapel Hill.
Booth, D.E. (1982) ‘The analysis of outlying data points using robust regression: a multivariate problem bank identification model’, Decision Sciences, Vol. 13, pp.71–81.
Booth, D.E. (1986a) Regression Methods and Problem Banks, a Unit to Demonstrate an Application of Data Analysis. Lexington, MA: COMAP, Inc., Module No. 626, Reprinted in UMAP Modules Tools for Teaching, 1985, Arlington, MA: COMAP, Inc.
Booth, D.E. (1986b) ‘A method for the early identification of loss from a nuclear materials inventory’, The Journal of Chemical Information and Computer Science, Vol. 26, pp.23–28.
Booth, D.E. and Booth, S.E. (2009) ‘A new nuclear material safeguards method using fractal dimension’, Current Analytical Chemistry, Vol. 5, No. 1, pp.26–28.
Booth, D.E., Zhu, D.X., Baker, D.L. and Hamburg, J.H. (2005) ‘Recent chemometric approaches to the detection of nuclear material losses’, Current Analytical Chemistry, Vol. 1, No. 2, pp.181–186.
Chang, I., Tiao, G.C. and Chen, C. (1988) ‘Estimation of time series parameters in the presence of outliers’, Technometrics, Vol. 30, pp.193–204.
Chen, C. and Liu, L-M. (1993a) ‘Joint estimation of model parameters and outliers effects in time series’, Journal of the American Statistical Association, Vol. 88, No. 421, pp.284–297.
Chen, C. and Liu, L-M. (1993b) ‘Forecasting time series with outliers’, Journal of Forecasting, Vol. 12, pp.13–35.
Chen, C., Liu, L-M. and Hudak, G.B. (1992) ‘Modeling and forecasting time series in the presence of outliers and missing data’, Working Paper No. 128, Scientific Computing Associations Corp., Oak Brook, IL.
Chen, C. and Tiao, G.C. (1990) ‘Random level shift time series models, ARIMA approximation, and level shift detection’, Journal of Business and Economic Statistics, Vol. 8, pp.170–186.
Chernik, M.R., Downing, D.J. and Pike, D.H. (1982) ‘Detecting outliers in time series data’, Journal of the American Statistical Association, Vol. 77, No. 380, pp.743–747.
Cleveland, W.S. (1979) ‘Robust locally weighted regression and smoothing scatterplots’, Journal of the American Statistical Association, Vol. 74, pp.829–836.
Cleveland, W.S. and Loader, C. (1995) ‘Smoothing by local regression: principles and methods’, Technical Report, Murray Hill, NJ: AT&T Bell Laboratories.
Croux, C., Flandre, C. and Haesbroeck, G. (2002) ‘The breakdown behavior of the maximum likelihood estimator in the logistic regression model’, Statistics & Probability Letters, Vol. 60, No. 4, pp.377–386.
Grant, E. and Leavenworth, R. (1980) Statistical Quality Control (5th ed.). New York: McGraw-Hill.
Hauser, R.P. and Booth, D. (forthcoming a) ‘CEO bonuses as studied by robust logistic regression’, Journal of Data Science.
Hauser, R.P. and Booth, D. (forthcoming b) ‘Predicting bankruptcy with robust logistic regression’, Journal of Data Science.
Kimmel, R., Booth, D. and Booth, S. (2010) ‘The analysis of outlying data points by robust loess: a model for the identification of problem banks’, Int. J. Operational Research, Vol. 7, No. 1, pp.1–15.
Portnoy, S. (1980) Personal Communication.
Lloyd, S.J. (1995) ‘An expert system approach to choosing the least cost physical file structure in different types of databases’, Unpublished PhD Dissertation, Kent State University, Kent, OH.
Mahaney, J., Baker, D., Hamburg, J. and Booth, D. (2007a) ‘Time series analysis of process data’, Int. J. Operational Research, Vol. 3, No. 2, pp.231–253.
Mahaney, J., Goeke, R. and Booth, D. (2007b) ‘Out of control (outlier) detection in business data using the ARMA(1, 1) model’, Int. J. Operational Research, Vol. 2, No. 2, pp.115–134.
Sawitzki, G. (2009) Computational Statistics. Boca Raton: CRC Press.
Shah, V. and Booth, D. (forthcoming) ‘A fractal dimension-based method for statistical process control’, Int. J. Operational Research.
Shewhart, W. (1931) Economic Control of Quality of Manufactured Product. New York: D. Van Nostrand Company.
Stigler, S.M. (2010) ‘Changing history of robustness’, The American Statistician, Vol. 64, No. 4, pp.277–281.
Suh, M., Booth, D., Grznar, J., Prasad, S., Lloyd, S. and Hamburg, J. (2000a) ‘Fuzzy and robust neural networks and information systems process control’, Industrial Mathematics, Vol. 50, No. 1, pp.5–31.
Suh, M., Booth, D., Grznar, J., Prasad, S., Lloyd, S. and Hamburg, J. (2000b) ‘An investigation of neural network algorithms for database management applications’, Unpublished PhD Dissertation, Kent State University, Kent, OH.
Wright, C.M., Booth, D.E. and Hu, M.Y. (2001) ‘Joint estimation: SPC method for short run auto correlated data’, Journal of Quality Technology, Vol. 33, No. 3, pp.365–377.
