Replacing Outliers and Missing Values from Activated Sludge Data ...

Replacing Outliers and Missing Values from Activated Sludge Data Using Kohonen Self-Organizing Map Rabee Rustum1 and Adebayo J. Adeloye2 Abstract: Modeling the activated sludge wastewater treatment plant plays an important role in improving its performance. However, there are many limitations of the available data for model identification, calibration, and verification, such as the presence of missing values and outliers. Because available data are generally short, these gaps and outliers in data cannot be discarded but must be replaced by more reasonable estimates. The aim of this study is to use the Kohonen self-organizing map 共KSOM兲, unsupervised neural networks, to predict the missing values and replace outliers in time series data for an activated sludge wastewater treatment plant in Edinburgh, U.K. The method is simple, computationally efficient and highly accurate. The results demonstrated that the KSOM is an excellent tool for replacing outliers and missing values from a high-dimensional data set. A comparison of the KSOM with multiple regression analysis and back-propagation artificial neural networks showed that the KSOM is superior in performance to either of the two latter approaches. DOI: 10.1061/共ASCE兲0733-9372共2007兲133:9共909兲 CE Database subject headings: Wastewater management; Mathematical models; Neural networks; Activated sludge; Water treatment plants.

Introduction Operation and control of activated sludge wastewater treatment plants 共ASWWTP兲 are becoming increasingly important as many now are used not only for the removal of carbon 共for which limited operational effort is often sufficient兲 but also for meeting increasingly stringent standards for nutrients, heavy metals, organic micropollutants, and pathogenic organisms 共Vanrolleghem 2001兲. Any effective control should aim to maintain the performance of the plant process, avoid the discharge of excessive biomass and other pollutants in the effluent, protect receiving water body quality, and minimize operation and maintenance costs 共Spellman 2003兲. Modeling the ASWWTP plays an important role in implementing efficient control actions for better process performance. For example, when a large variation of influent flow rate and influent substrate concentration takes place, process operating parameters such as the recycle flow rate and/or wastage flow rate must be regulated properly to avoid process deterioration and possible gross failure 共Nokyoo 2002兲. The operator will certainly not wait until a significant change or deterioration in the effluent water quality has occurred before taking any corrective action. To 1

Ph.D. Research Student, School of the Built Environment, Heriot-Watt Univ., Riccarton, Edinburgh EH14 4AS, U.K. E-mail: [email protected] 2 Senior Lecturer, School of the Built Environment, Heriot-Watt Univ., Riccarton, Edinburgh EH14 4AS, U.K. E-mail: A.J.Adeloye@ hw.ac.uk Note. Discussion open until February 1, 2008. Separate discussions must be submitted for individual papers. To extend the closing date by one month, a written request must be filed with the ASCE Managing Editor. The manuscript for this paper was submitted for review and possible publication on April 12, 2006; approved on March 16, 2007. This paper is part of the Journal of Environmental Engineering, Vol. 133, No. 9, September 1, 2007. ©ASCE, ISSN 0733-9372/2007/9-909–916/ $25.00.

achieve this, the operator will need to rely on models to determine proper, proactive control actions. However, there are many problems encountered in modeling the ASWWTP, one of which is the quality of the data. Although, many treatment plants are equipped with properly designed data collection systems, there is often little or no attention paid to the quality of the data 共Rosen 1998兲. Thus opportunities abound for data corruption, such as excessive disturbances, equipment malfunction, and human errors. The resultant effects of these are discontinuity or gaps in the data record and outliers, both of which create severe handicaps in modeling and identification of the process. One obvious solution to the problem is to just remove records containing the missing values and outliers; however, given the shortness of the available data and the time and expense for their collection, such a luxury cannot be afforded. So a considerable preprocessing of the data is required, both to fill the missing gaps and to replace the identified outliers with more plausible values. This study used the Kohonen self-organizing map 共KSOM兲, unsupervised neural networks, for predicting the missing values and for replacing outliers of the time series data and compares the results with those of linear regression and back propagation artificial neural networks. This task is the first step in modeling the ASWWTP using intelligent techniques such as fuzzy logic and artificial neural networks.

Literature Review Outliers An outlier, a sample value that differs notably from the mean of the measurement series, can be caused by many factors such as electromagnetic interference, hostile measurement environment, defective installation, insufficient maintenance or erroneous handling of the measurement system and intentional cover-up for lapses of the technician. A problem in detecting outliers is to

JOURNAL OF ENVIRONMENTAL ENGINEERING © ASCE / SEPTEMBER 2007 / 909

decide whether they represent a true value or whether they are false due to disturbances in the measurement system. Detection of outliers can be accomplished by using redundant sensors or digital filtering 共Barnett and Lewis 1994兲. In the redundant sensors case, at least two sensors 共or measurements兲 are used and an outlier is indicated when the sensors or the analyzed samples do not deliver the same value 共within a reasonable margin兲. However, this is an expensive procedure because it requires a large number of sensors or samples. In the activated sludge process, in which the wastewater is normally treated in parallel lanes, it is possible to use measurements from another lane to make such a validity check. However, even where this is the case, it will be necessary to combine such evidence with a more formal one for detecting outliers such as the statistical Z-score and modified Z-score. In the Z-score test, assuming that the data have a normal distribution, the mean and standard deviation of the entire data set are used to obtain a Z-score for each data point as in zi =

共xi − ¯x兲 S

共1兲

where ¯x = sample mean and S = sample standard deviation. In the modified Z-score test, the median of absolute deviation about the mean 共MAD兲 takes the place of the standard deviation in Eq. 共1兲. A test heuristic is that an observation with an absolute Z-score value greater than 3 should be labeled as an outlier or an observation with an absolute modified Z-score value greater than 3.5 should be labelled as an outlier 共Fallon and Spada 2006; McBean and Rovers 1998兲. However, the above-mentioned statistical techniques are not robust enough for labeling outliers since they depend on a number of assumptions, notably that the samples are normally distributed. Also, the test statistic depends on parameters, which can be significantly affected by the outliers such as the mean, median, and standard deviation. So, a straightforward and practical method for the off-line detection of outliers is to manually label the outliers by examining the time series plot. In time series, the human eye has a remarkable ability to pick out outliers with good result and by a careful investigation of the time series, manual detection of outliers can be as good as any more formalized method. Indeed, manual inspection is to be recommended when preparing data for model identification because it gives the model builder a sense for the data and also what can be expected from the model 共Rosen 1998兲. Consequently, the manual labeling approach was applied in the study, although comparisons with the results of the Z-score and the modified Z-score approaches will also be presented. Missing Values A missing value is caused by a sensor that does not deliver a measurement value, or by a fault in the measurement tools, or even by human mistakes. Depending on the measurement equipment, missing values can appear in the record as, for instance, blanks, zeros, or negative values for entities limited to positive values. Therefore, missing values are often simple to detect in data a record. Missing values are a serious problem as they distort the dynamic properties of the signal. In order to perform a dynamic analysis, the missing values must be estimated, especially when multivariate analysis is considered, as failure to estimate them makes the complete sample difficult to be used. They may lead to severe problems in model identification process, particularly

when tools such as artificial neural networks and fuzzy logic models are used. As these tools require long periods of good and reliable data, it is important that the number of missing values is kept at a minimum. Replacing Outliers and Missing Values If there are relatively few missing points, there are some models, which can be used to estimate values to complete the series, such as replacing missing values with the mean or median of the data. Simple linear regression can also be used to estimate missing values 共MacDonald and Zucchini 1997; Harvey 1989兲. Regression will work best if the number of water quality parameters having missing values in their records is small; otherwise developing different predictive regression equations for a large number of water quality parameters will be time wasting. Backpropagation artificial neural networks 共ANNs兲 modeling 共Maier and Dandy 1996兲 offers a solution for multivariable prediction but the performance of ANNs tends to decrease rapidly as the number of output variables increases, particularly when the output variables are not highly correlated 共Adeloye and De Munari 2006兲. Outliers can sometimes be accommodated in the data through the use of trimmed means, other scale estimators apart from standard deviation 共e.g., MAD兲 and Winsorization 共McBean and Rovers 1998兲. In calculations of a trimmed mean, a fixed percentage of data is dropped from each end of an ordered data, thus eliminating the outliers. The mean is then calculated using the remaining data. Windsorization involves accommodating an outlier by replacing it with the next highest or next smallest value as appropriate. However, using these types of models to predict missing values or outliers in a long time series is difficult and often unreliable, particularly if the number of values to be in-filled is relatively high in comparison with the total record length. The accuracy of the estimate depends on how good and representative the model is and how long the period of missing values extends 共Rosen and Lennox 2001兲. The activated sludge treatment plant is a dynamic process, so any variable is dependent, not just on the historical time series of the same variable but also on several other variables or parameters of the process. In other words, the problem is an exercise in multivariate analysis rather than the univariate approach of most of the traditional methods of estimating missing values and outliers. A multivariate model will therefore be more representative than the univariate one for predicting missing values. The KSOM offers a simple and robust multivariate model for data analysis, thus providing good possibilities to estimate missing values, taking into account its relationship or correlation with other pertinent variables in the data record. In comparison to other data driven modeling paradigms such as multilayer perceptron ANNs and classical multiple regression analysis, the KSOM is not negatively affected by missing values. Moreover, time sequencing of data is not a problem when compared to classical time series analysis 共Vesanto et al. 2000兲. In the next section, brief outline of the KSOM will be given. The results of applying the methodology for replacing missing values and outliers in a large array of data collected at an activated sludge treatment works will then be presented. Finally, the KSOM results are compared to regression analysis and ANN for predicting specific water quality variables.

910 / JOURNAL OF ENVIRONMENTAL ENGINEERING © ASCE / SEPTEMBER 2007

Training the KSOM The multidimensional input data are first standardized by deducting the mean and then dividing the result by the standard deviation. A standardized input vector is chosen at random and presented to each of the individual neurons for comparison with their code vectors in order to identify the code vector most similar to the presented input vector. The identification uses the Euclidian distance, which is defined as

冑兺 n

Di =

共x j − wij兲2 ;

i = 1,2, . . . ,M

共4兲

j=1

Fig. 1. Illustrating the winning node and its neighborhood in the Kohonen self-organizing map

Kohonen Self-Organizing Map Basics of Kohonen Self-Organizing Map The KSOM 共also called feature map or Kohonen map兲 is one of the most widely used artificial neural networks algorithms 共Kohonen et al. 1996兲. It is usually presented as a dimensional grid or map whose units 共nodes or neurons兲 become tuned to different input data patterns. Its algorithms are based on unsupervised competitive learning, which means that training is entirely data driven and the neurons or nodes on the map compete with each other 共Alhoniemi et al. 1997, 1999兲. The principal goal of the KSOM is to transform an incoming signal pattern of arbitrary dimension into a two-dimensional discrete map. It involves clustering the input patterns in such a way that similar patterns are represented by the same output neurons, or by one of its neighbors 共Back et al. 1998兲. In this way, the KSOM can be viewed as a tool for reducing the amount of data by clustering, thus converting complex, nonlinear statistical relationship between high dimensional data into simple relationship on low dimensional display 共Kohonen et al. 1996兲. This mapping roughly preserves the most important topological and metric relationship of the original data elements, implying that not much information is lost during the mapping. The KSOM consists of two layers: The multidimensional input layer and the competitive or output layer, both of these layers are fully interconnected as illustrated in Fig. 1. The output layer consists of M neurons arranged in a two-dimensional grid of nodes. Each node or neuron i 共i = 1 , 2 , . . . , M兲 is represented by an n-dimensional weight or reference vector Wi = 关wi1 , . . . . , win兴, where n = dimension of the input vector, i.e., the number of variables in the input vector. Garcia and Gonzalez 共2004兲 offer guidance on determining the optimum number of neurons, which is M = 5冑N

共2兲

where N = total number of data samples. Once M is known, the number of rows and columns in the KSOM can be determined. A guideline by Garcia and Gonzalez 共2004兲, which also seems to work well is l1 = l2

冑

where Di = Euclidian distance between the input vector and the weight vector i; x j = jth element of the current input vector; wij = jth element of the weight vector i; and n = dimensional of the input vector. The neuron whose vector most closely matches the input data vector 共i.e., for which Di is minimum兲 is chosen as a winning node or the best matching unit 共BMU兲 as indicated in Fig. 1. The weight vectors of this winning node and those of its adjacent neurons are then adjusted to match the input data using the following equation, thus bringing the weight vectors further into agreement with the input vector 共Vesanto et al. 2000兲: wi共t + 1兲 = wi共t兲 + ␣共t兲hci共t兲关x共t兲 − wi共t兲兴

共5兲

where t denotes time; ␣共t兲 = learning rate at t; hci共t兲 = neighbourhood function centered in the winner unit c at time t; and all the other variables are as defined previously. In this manner each node in the map internally develops the ability to recognize input vectors similar to itself. This characteristic is referred to as self-organizing, because no external information is supplied to lead to a classification 共Penn 2005兲. The process of comparison and adjustment continues until the optimal number of iterations is reached or the specified errors are attained. Both the learning rate and the neighbourhood function affect the learning effectiveness of the KSOM and must be chosen carefully. In particular, the learning rate decreases monotonically with increased number of iterations as in ␣共t兲 = ␣0共0.005/␣0兲t/T

共6兲

where ␣0 = initial learning rate and T = training length 共Vesanto et al. 2000兲, thus forcing the weight vector to converge. The neighborhood function is normally chosen to be Gaussian centered in the winner unit c, such that: hci共t兲 = exp兵− 储rc − ri储2/关2␴2共t兲兴其

共7兲

where rc and ri = positions of nodes c and i on the KSOM grid and ␴共t兲 = neighborhood radius. Like the learning rate ␣共t兲, ␴共t兲 also decreases monotonically as the number of iterations increases. The quality of the trained KSOM is measured by the total average quantization error and total topographic error. The quantization error is N

e1 e2

共3兲

where l1 and l2 = number of rows and columns respectively; e1 = biggest eigenvalue of the training data set; and e2 = second biggest eigenvalue.

qe =

1 储Xi − Wc储 N i=1

兺

共8兲

where qe = quantization error; Xi = ith data sample or vector; Wc = prototype vector of the best matching unit for Xi; and 储·储 denotes the Euclidian distance 关Eq. 共4兲兴. The topographic error is


Application The KSOM was used to predict the missing values and identified outliers treated as missing values in data records obtained from a large wastewater treatment plant in Edinburgh, U.K. Details of the data records are presented in the next section followed by the results and discussion. Experimental Data

Fig. 2. Prediction of missing components of the input vector using the Kohonen self-organizing map 共BMU= best matching unit兲

N

1 te = u共Xi兲 N i=1

兺

共9兲

where ui = binary integer such that it is equal to 1 if the first and second best matching units for Xi are not adjacent units; otherwise it is zero. The KSOM can be used for many practical tasks, such as the reduction of the amount of training data for model identification, nonlinear interpolation and extrapolation, generalization and compression of information for easy transmission 共Kohonen et al. 1996; Kangas and Simula 1995; Tananaki et al. 2007; Badekas and Papamarkos 2006兲. The application of the KSOM for prediction purposes is illustrated in Fig. 2 共Obu-Cann et al. 2001兲. First, the model is trained using the available data set. Then the depleted vector is presented to the KSOM to identify its BMU. The values for the missing variables are then obtained as their corresponding values in the BMU.

Daily records from the operation of the Seafield wastewater treatment plant in Edinburgh, U.K., during a period of about 3 years with a total of N = 1,066 data samples 共or records兲 were obtained from Thames Water, the operators of the plant. The data relate to the secondary treatment stage and each data sample comprises 13 variables. Summary statistics of the measured process variables are shown in Table 1. The Seafield treatment plant is based on the activated sludge secondary treatment system and comprises six circular sedimentation tanks, six rectangular nonnitrifying aeration lanes, and six circular final settlement tanks. The main treatment is preceded by six screens and four grit removal units. The works is located in the eastern part of Edinburgh city and discharges its effluent directly into the sea. The prevalence of bathing and other water contact activities in the city beach close to the discharge point means that routine ultraviolet 共UV兲 disinfection of the effluent is carried out in the summer months as required in the EU Bathing Water 共EEC 2006兲 and Urban Wastewater Treatment Directives 共EEC 1991兲. As shown in Table 1, there are large numbers of missing values which cannot be thrown away. In addition, the missing values occur randomly within the data array. Thus, although the maximum number of missing values is 310 共for the stirred sludge volume index兲, the number of potentially discardable incomplete daily records or vectors in the data is much more than this number given the nonsynchronization of the missing values. In the Seafield data, the total number of data records discardable as a result of nonsynchronization was 496.

Table 1. Summary Statistics of the Measured Variables at the Seafield Treatment Plant Measurements

Variables

Unit

Average

Minimum

Number of outliers

Maximum

Number of missing values

Visual inspection

Z-score

Modified Z-score

259,427 171,367 466,486 19 23 18 54 Flow to ASP m3 / day 65 15 180 105 1 5 22 Influent BOD5 g / m3 68 3 268 87 7 22 41 Influent SS g / m3 3822 802 6016 146 15 9 20 WAS rate m3 / day 2240 1126 4180 246 16 5 25 Measured MLSS g / m3 4984 1748 1014 303 15 7 28 RAS MLSS g / m3 SSVI mL/g 92 31 165 310 7 4 14 Sludge age Days 5 1 32 225 13 11 30 Actual F/M 0.15 0.015 0.43 292 23 4 12 250,174 650,00 461,926 1 17 16 58 Final effluent flow m3 / day 28 3 190 14 24 24 29 Final effluent SS g / m3 50 15 173 15 48 18 42 Final effluent COD g / m3 9 2 351 8 19 5 12 Final effluent BOD5 g / m3 Note: ASP= activated sludge process; BOD5 = biochemical oxygen demand; SS= suspended solids; WAS= waste activated sludge; MLSS= mixed liquor suspended solids; RAS MLSS= return activated sludge mixed liquor suspended solids; SSVI= stirred sludge volume index; and F / M = food to microorganisms ratio. 912 / JOURNAL OF ENVIRONMENTAL ENGINEERING © ASCE / SEPTEMBER 2007

Fig. 3. Component planes of the KSOM

Table 1 also contains the number of identified outliers using the three methods described previously. In general, the modified Z-score tends to identify more outliers than either the visual inspection or the Z-score method. Indeed, the Z-score method produced the least number of outliers 共148兲 as against 387 by the modified Z-score and 228 based on visual inspection. Because of the restrictive assumptions underpinning the Z-score and the modified Z-score approaches, however, the identified outliers using the visual inspection method were taken as the outliers for the subsequent analysis. These outliers were also removed and treated as missing values to be estimated so as to preserve the true dynamic history of the process as exemplified in the data. The data in Table 1 relate to the secondary treatment stage of the treatment plant. The decision to focus on the biological stage of treatment is because the secondary stage is often the terminal treatment offered at treatment plants discharging to inland rivers or coastal environments. The secondary treatment process helps to remove a substantial proportion of the suspended solids 共SS兲 and biochemical oxygen demand 共BOD5兲 in the wastewater; it therefore plays a significant role in meeting the quality objectives set for such receiving systems. An analysis such as the KSOM to estimate the missing values should therefore provide complete data for analyzing and modeling the biological activated sludge process in wastewater treatment plant. SOM Analysis The SOM Toolbox for MATLAB 5 was used in the study. The toolbox was developed by the SOM team at the Helsinki Univer-

sity of technology, Finland 共http://www.cis.hut.fi兲. Initial preparation of the data provided by Thames Water was carried out using Microsoft EXCEL. The computation for training and searching for the BMUs was done starting with the default values for the learning rate 共␣0 = 0.5兲 and neighborhood radius 关␴0 = max共l1 , l2兲 / 4兴 in the SOM Toolbox, where l1 and l2 = dimensions of the map as presented in Eq. 共3兲. In computing the size 共and dimensions兲 of the map, the Toolbox uses Eq. 共2兲 but adjusts the final map units M such that it is equal to the product of l1 and l2 exactly. In making this final adjustment, the estimated number of map units may be slightly different from that obtainable with Eq. 共2兲. The analysis of the Seafield data led to M = 168 map units, which is slightly different from the M = 164 obtainable using Eq. 共2兲 with N = 1 , 066, and sides l1 and l2 of 14 and 12, respectively. The final quantization error qe = 1.801 and the final topographic error te = 0.066.

Results and Discussion The component planes for each of the 13 variables are shown in Fig. 3. Each component plane can be thought as a “sliced” version of the KSOM, because it consists of the values of single vector variable in all map units. In other words, the component planes show the value of the variables in each map unit 共Vesanto et al. 2000兲. These planes are built using gray color to show the value of a given feature of each KSOM unit in the two-dimensional lattice, such that the lighter the gray color, the higher the relative


Table 2. Correlation Matrix for Variables in the Code Vectors

Flow Influent BOD Influent SS WAS rate MLSS RAS-MLSS SSVI Sludge age F/M Effluent flow Effluent BOD Effluent COD Effluent SS

Flow

Influent BOD

Influent SS

WAS rate

MLSS

RAS

SSVI

Sludge age

F:M

Effluent flow

Effluent BOD

Effluent COD

Effluent SS

1 −0.43 −0.44 −0.16 −0.25 −0.18 −0.62 0.09 −0.16 1.00 0.03 0.03 0.09

1 0.88 0.15 0.63 0.35 0.3 0.42 0.84 −0.42 0.24 0.12 0.1

1 0.13 0.56 0.45 0.29 0.21 0.74 −0.43 0.19 0.10 0.09

1 −0.21 −0.30 0.20 −0.46 0.25 −0.18 −0.13 −0.09 −0.13

1 0.77 −0.19 0.74 0.22 −0.24 −0.10 −0.21 −0.21

1 −0.31 0.42 0.02 −0.17 −0.34 −0.44 −0.43

1 −0.29 0.35 −0.62 0.31 0.38 0.30

1 0.13 0.10 −0.05 −0.16 −0.15

1 −0.15 0.45 0.34 0.33

1 0.06 0.05 0.11

1 0.96 0.97

1 0.98

1

component value of the corresponding weight vector. These component planes help to illustrate visually the relationship between the various parameters or characteristics of the wastewater treatment plant. For example, by looking at the upper left hand of the component planes, we can see that high sludge age is associated with a low waste activated sludge rate. This is to be expected given the relationship between the sludge age and the wastage sludge rate VL ␪c = Q wL W

shown in Figs. 4共a and b兲, from which it can be seen that their trend is in conformity with the overall trend of the observed data series. A further analysis was carried out to test whether the sample skewness coefficients of the residuals are statistically zero. This is required to ensure that the residuals have a normal distribution. The sample skew for a variable x can be estimated using N

兺

共10兲

where ␪c = sludge age; V = volume of the reactor; L = MLSS 共mixed liquor suspended solids兲 in the aerator; Lw = MLSS in the waste activated sludge; and Qw = waste activated sludge rate. It is therefore to be expected that a combination of low Lw, low Qw, and high L will produce a high ␪c. Other notable relationships visible from the component planes is the low effluent SS, BOD5, and chemical oxygen demand 共COD兲 concentrations associated with low hydraulic loading rate to the aerator, which is a natural result of the higher retention time caused by a low hydraulic loading. The complete correlation matrix for all 13 variables of the prototype vectors is shown in Table 2 and although this is a simple tool for examining the linear relationship between various variable, its results seem to agree with the indications of the cross correlation provided by the much more complex KSOM analysis that resulted in the component planes. The performance of the KSOM in predicting the various characteristics is demonstrated in Table 3 which shows both the correlation coefficient between the observed and predicted values as well as the mean square error of prediction. For most of the effluent characteristics, the correlation coefficient is generally above 0.90. The model has also particularly done well in predicting the sludge age and F / M ratio, two of the most commonly used parameters for controlling the activated sludge process. This offers some promise for the real-time control of the activated sludge treatment process using these parameters. Figs. 4共a and b兲 compare the time series plots of the estimated and measured values for the effluent BOD5 and SS concentrations, respectively. Time series plots for all the other variables are also available but have not been shown here for reason of lack of space and help to illustrate how well the KSOM outputs have matched the observed data temporally. In general, the KSOM outputs have correctly reproduced the peaks and troughs in the observed time series data. The predicted missing values are also

冉冊

xi − ¯x N skew = 共N − 1兲共N − 2兲 i=1 S

3

where N = sample size, ¯x = sample mean; and S = sample standard deviation. Based on the null hypothesis that the skew is zero, the skew coefficient will have a normal distribution with a mean of zero and variance of 6 / N. Therefore, The 95% confidence interval for a zero skew is 关−1.96冑6 / N , + 1.96冑6 / N兴. If the estimated sample skew coefficient lies within this interval, then the null hypothesis cannot be rejected at the 5% level. The results of the hypothesis testing for all 13 variables are shown in Table 4 from which it is clear that the residuals associated with most of the characteristics are distributed as normal. The only exceptions are the influent SS, F / M and the effluent COD whose test statistics

Table 3. Performance Indices of the KSOM during Training

Variables Flow to ASP Influent BOD5 to ASP Influent SS to ASP WAS MLSS RAS SSVI Sludge age Actual F/M Final effluent flow Effluent BOD Effluent COD Effluent SS


Correlation coefficient 共R兲

Mean square error 共MSE兲

0.945 0.943 0.898 0.879 0.912 0.933 0.905 0.950 0.934 0.946 0.932 0.914 0.950

49,934,653 16.64 64.06 108,953 16,529 48,600 28.18 0.153 0.048 55,880,839 1.82 33.59 15.10

Table 5. Comparing KSOM, Regression and ANN for Predicting Effluent BOD5 Modeling method Regression BP-ANN KSOM

Data set

Correlation coefficient

Mean square error 共MSE兲

Average absolute error 共AAE兲

Training Testing Training Testing Training Testing

0.75 0.76 0.94 0.88 0.96 0.95

20 11.3 5.0 5.4 5.5 6.0

2.5 2.6 1.6 1.5 1.2 0.9

fall marginally outside the 95% confidence interval for a zero skew.

Comparison of the KSOM with Regression and ANN As remarked previously, the performance of the KSOM was compared with the use of simple linear regression and back propagation ANN for predicting the effluent BOD5. On the basis of the correlation matrix shown in Table 3, the independent 共i.e., input兲 variables for both the regression and ANN were chosen as the effluent COD 共R = 0.96兲, effluent SS 共R = 0.97兲 and the F / M ratio 共R = 0.45兲. As complete records are required for these two techniques, only the 770 data records with no missing values in these four parameters were used. Of these, 500 data records were used for model calibration and 270 were used for model testing. The final regression model was BOD = − 1.01 + 11.59F/M + 0.075COD + 0.146SS

共R = 0.75兲共12兲

Fig. 4. Comparing the observed and KSOM predicted time series plots for 共a兲 effluent BOD; 共b兲 effluent SS

Table 4. Result of Approximate Normality Test for the Residuals

Variables Flow to ASP Influent BOD Influent SS WAS MLSS RAS SSVI Sludge age F/M Effluent flow Effluent BOD Effluent COD Effluent SS

N 1002 978 982 887 788 703 327 815 726 1032 ,020 1011 1005

95% confidence limit for the skew 关−0.152, 关−0.154, 关−0.153, 关−0.161, 关−0.171, 关−0.181, 关−0.265, 关−0.168, 关−0.178, 关−0.149, 关−0.150, 关−0.151, 关−0.151,

0.152兴 0.154兴 0.153兴 0.161兴 0.171兴 0.181兴 0.265兴 0.168兴 0.178兴 0.149兴 0.150兴 0.151兴 0.151兴

Sample skew coefficient

Normal 共Y/N兲

0.0204 0.1539 0.1692 −0.0929 0.0259 0.0549 0.0133 0.1518 −0.1926 −0.0821 0.1425 0.1628 0.1439

Y Y N Y Y Y Y Y N Y Y N Y

The backpropagation ANN had a single hidden layer and the optimum number of neurons in this hidden layer was found to be 25. Table 5 compares the performances of the regression and ANN with the KSOM. In general, the linear regression model had the least performance of the three approaches. The backpropagation ANN was a much more improved approach than the regression, particularly during training, but its performance is still inferior to that of the KSOM. 共It should be noted that the KSOM statistics quoted in Table 5 relate to the reduced 770 sample size unlike those in Table 3, which relate to the entire 1,066 data record.兲 A further advantage of the KSOM is that the same map can be used for predicting any missing value in any variable, whereas if the missing variable were to change from the BOD5, new regression and ANN models would have to be developed. Additionally, the KSOM is not affected by missing values, implying that it is unnecessary to carry out any preprocessing for identifying complete records before the method can be applied. Both the regression and ANN approaches require complete records and hence extensive preprocessing of the data is required before they can be applied.

Conclusion In this paper, raw data from the secondary treatment stage from the Seafield activated sludge wastewater treatment plant Edinburgh, U.K., during a period of about three years have been modelled to replace outliers and missing values using the Kohonen self-organizing map. Each sample comprises 13 water quality and process variables.


The results demonstrated that the KSOM is an excellent tool for replacing outliers and missing values in high dimensional data sets. Although the KSOM is a multivariate tool enabling the simultaneous predicting of multiple variables, the KSOM developed in this work also outperformed univariate prediction models based on linear regression and backpropagation ANN. Analysis of the residuals showed that these are random and can be described with the normal distribution.

Acknowledgments The writers are grateful to Thames Water for their cooperation, especially Mr. Duncan Taylor, by providing the plant operation data used in this study. They would also like to thank the two anonymous reviewers whose comments have helped us in improving the manuscript.

References Adeloye, A. J., and De Munari, A. 共2006兲. “Artificial neural network based generalized storage-yield-reliability models using LevenbergMarquardt algorithm.” J. Hydrol., 326共1–4兲, 215–230. Alhoniemi, E., Hollmén, J., Simula, O., and Vesanto, J. 共1999兲. “Process monitoring and modeling using the self-organizing map.” Integrated Computer-Aided Engineering, 6共1兲, 3–14. Alhoniemi, E., Simula, O., and Vesanto, J. 共1997兲. “Analysis of complex systems using the self-organizing map.” Rep., Helsinki Univ. of Technology, Laboratory of Computer and Information Science, Helsinki, Finland, 具http://citeseer.ist.psu.edu/cache/papers/cs/18500/ http:zSzzSzwww.cis.hut.fizSzprojectszSzidezSzpublicationszSzpapers zSziconip97.pdf/analysis-of-complex-systems.pdf典共Dec. 6, 2006兲. Back, B., Sere, K., and Hanna, V. 共1998兲. “Managing complexity in large database using self organising map.” Accounting Management & Information Technologies, 8, 191–210. Badekas, E., and Papamarkos, N. 共2006兲. “Optimal combination of document binarization techniques using a self-organising map neural network.” Eng. Applic. Artif. Intell., 20共1兲, 11–24. Barnett, V., and Lewis, T. 共1994兲. Outliers in statistical data, 3rd Ed., Wiley, Chichester, U.K. EEC. 共1991兲. “Directive concerning urban wastewater treatment 共91/271/ EEC兲.” Official Journal, L135/40. EEC. 共2006兲. “Directive concerning the management of bathing water quality.” Official Journal of the European Union, L64/37. Fallon, A., and Spada, C. 共2006兲. “Detecting and accommodation of outliers in normally distributed data sets.” 具http://ewr.cee.vt.edu/ environmental/teach/smprimer/outlier/outlier.html典共Dec. 6 2006兲. Garcia, H., and Gonzalez, L. 共2004兲. “Self-organizing map and clustering

for wastewater treatment monitoring.” Eng. Applic. Artif. Intell., 17共3兲, 215–225. Harvey, A. C. 共1989兲. Forecasting structural time series models and Kalman filter, Cambridge University Press, Cambridge, U.K. Kangas, J., and Simula, O. 共1995兲. “Process monitoring and visualisation using self organising map.” Neural networks for chemical engineers, Chap. 14, A. B. Bulsari, ed., Elsevier Science, Dordrecht, The Netherlands. Kohonen, T., Simula, O., and Visa, A. 共1996兲. “Engineering applications of the self-organizing map.” Proc. IEEE, 84共10兲, 1358–1384. MacDonald, I. L., and Zucchini, W. 共1997兲. Hidden Markov and other models for discrete valued time series, Monographs on Statistics and Applied Probability, Vol. 70, Chapman and Hall, London. Maier, H. R., and Dandy, G. C. 共1996兲. “The use of artificial neural networks for prediction of water quality parameters.” Water Resour. Res., 32共4兲, 1013–1022. McBean, E. A., and Rovers, F. A. 共1998兲. Statistical procedures for analysis of environmental monitoring data and risk assessment, Prentice-Hall, Englewood Cliffs, N.J. Nokyoo, C. 共2002兲. “The use of artificial neural networks in real time forecasting of wastewater treatment plant performance.” Ph.D. thesis, Univ. of Newcastle Upon Tyne, Newcastle upon Tyne, U.K. Obu-Cann, K., Morita, Y., Fujimura, K., Tokutaka, H., Ohkita, M., and Inui, M. 共2001兲. “Data mining of power transformer database using self-organizing maps.” Proc., 6th Int. Conf. on Soft Computing, IZUKA 2000, Iizuka, Fukuoka, Japan, 201–206, 具http:// ieeexplore.ieee.org/iel5/7719/21162/00983717.pdf典共Dec. 6, 2006兲. Penn, B. S. 共2005兲. “Using self organising maps to visualize high dimensional data.” Comput. Geosci., 31共5兲, 531–544. Rosen, C. 共1998兲. “Monitoring wastewater treatment system.” Ph.D. thesis, Dept. of Industrial Electrical Engineering and Automation, Lund Institute of Technology, Lund Univ., Lund, Sweden. Rosen, C., and Lennox, J. A. 共2001兲. “Multivariate and multi-scale monitoring of wastewater treatment operation.” Water Res., 35共14兲, 3402– 3410. Spellman, F. R. 共2003兲. Handbook of water and wastewater treatment plant operation, Lewis, Boca Raton, Fla. Tananaki, C., Thrasyvoulou, A., Giraudel, J. L., and Montury, M. 共2007兲. “Determination of volatile characteristics of Greek and Turkish pine honey samples and their classification by using Kohonen self organising maps.” Food Chemistry, 101共4兲, 1687–1693. Vanrolleghem, P. A. 共2001兲. “The usefulness of models in wastewater engineering.” Dept. of Applied Mathematics, Biometrics, and Process Control, Univ. of Gent, lecture held at the Institute for Urban Water Management, Dresden University of Technology, Dresden, Germany. Vesanto, J., Himberge, J., Alhoniemi, E., and Parhankangas, J. 共2000兲. “Self-organizing map 共SOM兲 toolbox for Matlab 5.” Rep. No. A57, Helsinki Univ. of Technology, Laboratory of Computer and Information Science, Helsinki, Finland.


Replacing Outliers and Missing Values from Activated Sludge Data ...

Replacing Outliers and Missing Values from Activated Sludge Data ...

Suggest Documents

Activated Sludge

Combined expression data with missing values and

Handling outliers and missing data in brain tumor ... - Semantic Scholar

ACTIVATED SLUDGE MODELLING AND ...

Analysis of longitudinal data from animals with missing values using ...

Activated sludge modelling

IN ACTIVATED SLUDGE

Handling Missing Values in Data Mining

Handling Missing Values in Data Mining

Outliers and data descriptions

Isolation and Identification of Bacteria from Activated Sludge ... - ipcbee

Page 1 missing values regarding missing values for other lications ...

Missing Values In Excel

3 Missing values techniques

Missing values - GitHub

Polyhydroxyalkanoates in waste activated sludge

(Mycolata) in activated sludge foams

Dealing with Zeros and Missing Values in Compositional Data Sets ...

Simultaneous nitrification and denitrification in activated sludge ...

Modeling Integrated Fixed-Film Activated Sludge and

activated sludge bulking episodes and dominant filamentous

CATIONS AND ACTIVATED SLUDGE FLOC ... - VTechWorks

Data on metagenomic profiles of activated sludge ...

Pattern Matching Techniques for Replacing Missing ...