Research Article
A real-time monitoring method using random projection and k-nearest neighbor rule for batch process
International Journal of Advanced Robotic Systems November-December 2017: 1–6 © The Author(s) 2017 DOI: 10.1177/1729881417739431 journals.sagepub.com/home/arx
Lan Wu1, Chenglin Wen1,2, Mei Zhou2 and Haipeng Ren3
Abstract
As an important production method, the batch process is complex and flexible. As a result, the time complexity of modeling and the space complexity of storing the monitoring models are high, and monitoring an actual batch process is difficult. To address this problem, this article proposes a fault detection method based on random projection, K-means clustering, and the k-nearest neighbor algorithm. First, a multiperiod division method is put forward based on random projection and the K-means clustering algorithm. This reduces the computational complexity while preserving the fault detection performance of the algorithm. Second, a real-time monitoring model is established from the data of each subperiod using the k-nearest neighbor method to realize online monitoring of the batch production process. While keeping the fault detection performance approximately equal, the proposed method reduces the complexity and computation of the model and meets the real-time requirement of fault detection.
Keywords
Batch process, random projection, K-means clustering, k-nearest neighbor, fault detection
Date received: 8 May 2017; accepted: 19 September 2017 Topic: Special Issue – Intelligent Control Methods in Advanced Robotics and Automation Topic Editor: Andrey V Savkin Associate Editor: Junzhi Yu
Introduction
The batch process, as a significant production mode, has been widely used in the dye, food, pharmaceutical, and other fields. Compared with the continuous production process, the batch process involves more complex reactions and is more flexible, which makes monitoring an actual batch process more difficult.1 In recent years, process monitoring methods based on multivariate statistics, such as principal component analysis (PCA)2–4 and partial least squares,5,6 have been widely used to monitor batch production processes. The PCA-based process monitoring method assumes that the operating conditions are stable, the relations between variables are linear, and the data follow a Gaussian distribution. However, the batch process usually does not run in a single steady-state condition, and the nonlinearity of the variables and the non-Gaussian features of the data are significant, making it difficult to monitor the batch process effectively using the traditional PCA-based monitoring method. In addition, for nonlinear, non-Gaussian, and multimodal problems, there are
1 College of Electrical Engineering, Henan University of Technology, Zhengzhou, China
2 Institute of Information and Control, Hangzhou Dianzi University, Hangzhou, China
3 Shaanxi Key Laboratory of Complex System Control and Intelligent Information Processing, Xi’an University of Science and Technology, Xi’an, China
Corresponding author: Lan Wu, College of Electrical Engineering, Henan University of Technology, Zhengzhou, Henan Province 450001, China. Email:
[email protected]
Creative Commons CC BY: This article is distributed under the terms of the Creative Commons Attribution 4.0 License (http://www.creativecommons.org/licenses/by/4.0/) which permits any use, reproduction and distribution of the work without further permission provided the original work is attributed as specified on the SAGE and Open Access pages (https://us.sagepub.com/en-us/nam/open-access-at-sage).
corresponding improvements, such as kernel PCA7 for nonlinear processes, independent component analysis (ICA)8 for non-Gaussian data, and multimodal PCA for multimode processes.9 However, when these problems coexist, these methods are not satisfactory. To circumvent these difficulties, He and Wang10 proposed a method termed fault detection using the k-nearest neighbor rule (FD-kNN). This method uses the distance relations between local samples to detect anomalies. It places no restriction on the data distribution; therefore, it can be applied to industrial processes with the abovementioned characteristics. However, FD-kNN is an off-line monitoring method; that is, it can determine whether a batch is abnormal only after the batch has finished. Moreover, when it is used to monitor an ongoing process in real time, the measurement data from the next sampling time to the end of the batch must be estimated, and the fault detection performance is affected by the accuracy of this estimate.11 Alternatively, a monitoring model for each sampling time or period of time can be established using PCA.12 The model for the corresponding time is compared with the sampling data obtained online to realize real-time fault detection and the implementation of the corresponding control measures. Although this approach reduces the amount of estimated data to a certain extent, it retains the nonlinear and non-Gaussian problems of the batch process. In another approach, the batch process is divided into several operation subperiods, and a subperiod monitoring model is established to replace the previous single-time model. This reveals the multiperiod characteristics of the process and improves the real-time performance. For example, several mark points are selected at equal intervals over one operation cycle;13 the span from the beginning of production to each mark point represents one suboperation period. Models are established for each time period to reduce the number of estimated samples.
This time division method is relatively rough and does not take into account the process itself. An improved division method has been proposed.14 First, PCA is used to extract the principal components and the variable variation of each time slice of data. Using the principle of minimization of similarity, the time slice matrices are clustered to realize a fine division of the process. Finally, a PCA monitoring model is established for each period. These existing subperiod monitoring methods are based primarily on the PCA model and are difficult to apply to nonlinear, non-Gaussian batch production processes. In this article, the subperiods are divided based on the historical data, and the k-nearest neighbor (kNN) method is used to establish the monitoring model. To reduce the complexity of single-time models for real-time batch process monitoring while revealing the multiperiod characteristics of the process, a real-time monitoring method based on time division is proposed that combines the K-means clustering algorithm and the kNN method. First, a time division method based on random projection and K-means clustering is proposed. This method uses the K-means clustering algorithm to divide the whole cycle time of the batch process into multiple subperiods. Then, the subperiod monitoring model is established using the kNN method to monitor the batch process in real time. Finally, a numerical experiment is used to illustrate the feasibility of the algorithms proposed in this article.
Related information

Random projection

Random projection is a widely used dimension reduction method. Its basis is to project data from a higher dimensional space randomly into a lower dimensional space while keeping the distances between samples approximately unchanged by the projection. This distance-preserving capability rests on the Johnson–Lindenstrauss (JL) lemma.

Lemma 1: For any $0 < \varepsilon < 1$ and any points $x_1, x_2, \ldots, x_p \in \mathbb{R}^m$, there exists a mapping $f: \mathbb{R}^m \to \mathbb{R}^d$ such that15

$$(1-\varepsilon)\,\|x_i - x_j\|_2^2 \;\le\; \|f(x_i) - f(x_j)\|_2^2 \;\le\; (1+\varepsilon)\,\|x_i - x_j\|_2^2 \tag{1}$$

where $i, j = 1, 2, \ldots, p$, $i \ne j$, and $d = O(\varepsilon^{-2}\log p)$.

The JL lemma states that any set of p points in (high) m-dimensional Euclidean space can be mapped into an $O(\varepsilon^{-2}\log p)$-dimensional Euclidean space such that the squared distance between any two points is distorted by at most a factor of $1 \pm \varepsilon$ ($0 < \varepsilon < 1$). Assuming that $X \in \mathbb{R}^{n \times m}$ represents a data set of n samples of m-dimensional variables, two methods based on simple distributions are designed to generate the projection matrix16 $A \in \mathbb{R}^{d \times m}$.

Theorem 2: For an arbitrary set $X \in \mathbb{R}^{n \times m}$, given $\varepsilon, \beta > 0$, let

$$d_0 = \frac{4 + 2\beta}{\varepsilon^2/2 - \varepsilon^3/3}\,\log n \tag{2}$$

For integer $d \ge d_0$, assume a projection matrix $A \in \mathbb{R}^{d \times m}$ with $A(i, j) = a_{ij}$, where $\{a_{ij}\}$ are independent random variables from either of the following two distributions

$$a_{ij} = \begin{cases} +1 & \text{with probability } 1/2 \\ -1 & \text{with probability } 1/2 \end{cases} \tag{3}$$

$$a_{ij} = \sqrt{3} \times \begin{cases} +1 & \text{with probability } 1/6 \\ \phantom{+}0 & \text{with probability } 2/3 \\ -1 & \text{with probability } 1/6 \end{cases} \tag{4}$$

Let

$$\tilde{X} = \frac{1}{\sqrt{d}}\, X A^{T} \tag{5}$$

and let $f: \mathbb{R}^m \to \mathbb{R}^d$ map the ith row of $X$ to the ith row of $\tilde{X}$, where $\tilde{X}$ represents the matrix after projection. Then, with probability at least $1 - n^{-\beta}$, f satisfies inequality (1) for every pair of rows of $X$.16
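The projection in equation (5) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function names `achlioptas_matrix` and `random_project` and the `sparse` and `rng` arguments are hypothetical, and the factor $\sqrt{3}$ in distribution (4) follows the database-friendly construction of reference 16.

```python
import numpy as np

def achlioptas_matrix(d, m, sparse=True, rng=None):
    """Draw a d x m projection matrix A from distribution (4) (sparse) or (3)."""
    rng = np.random.default_rng(rng)
    if sparse:
        # Distribution (4): sqrt(3) * {+1, 0, -1} with probabilities 1/6, 2/3, 1/6
        return np.sqrt(3.0) * rng.choice([1.0, 0.0, -1.0], size=(d, m),
                                         p=[1 / 6, 2 / 3, 1 / 6])
    # Distribution (3): +1 or -1, each with probability 1/2
    return rng.choice([1.0, -1.0], size=(d, m))

def random_project(X, d, sparse=True, rng=None):
    """Equation (5): project the rows of X (n x m) into d dimensions."""
    A = achlioptas_matrix(d, X.shape[1], sparse=sparse, rng=rng)
    return X @ A.T / np.sqrt(d)

# Usage: 50 samples of 200 variables reduced to 30 dimensions
X = np.random.default_rng(1).normal(size=(50, 200))
X_tilde = random_project(X, d=30, rng=2)
```

Both distributions have zero mean and unit variance, so the scaling by $1/\sqrt{d}$ preserves squared distances in expectation; the sparse variant leaves about two thirds of the entries at zero, which reduces the cost of the matrix product.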
K-means clustering

K-means is a commonly used clustering method. The original data set is divided into K clusters, and the assignment is updated iteratively until the sum of the errors between each sample in the training data set and the centroid of its cluster is minimized. In general, the sum of the squared error (SSE) is used as the evaluation index of cluster quality and has the following form

$$\mathrm{SSE} = \sum_{i=1}^{K} \sum_{x \in C_i} \|x - c_i\|_2^2 \tag{6}$$

$$c_i = \frac{1}{N_i} \sum_{x \in C_i} x \tag{7}$$

where $C_i$ denotes the ith cluster, $i = 1, 2, \ldots, K$; $c_i$ denotes the centroid of $C_i$; and $N_i$ denotes the number of samples in $C_i$. For a sample set of n points $X \in \mathbb{R}^{n \times m}$, the clustering algorithm is
1. K sample points $\{c_1, c_2, \ldots, c_K\}$ are selected randomly and independently from $X$ as the initial centroids of the clusters;
2. the similarity between samples is measured by the Euclidean distance: the distance between each sample and each centroid is calculated, and the sample is assigned to the nearest cluster, forming K clusters $\{C_1, C_2, \ldots, C_K\}$;
3. the centroid of each cluster is recalculated using equation (7), and the SSE of the clusters is calculated by equation (6);
4. for all $x \in X$, the assignment and update steps are repeated until the SSE no longer decreases and the clusters no longer change.
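The four steps above can be sketched as follows. This is a minimal NumPy version for illustration, not a tuned implementation: the function name `kmeans`, the `max_iter` cap, and the choice to keep the old centroid when a cluster becomes empty are assumptions that the text does not specify. Like any K-means run, it stops at a local minimum of the SSE that depends on the random initial centroids.

```python
import numpy as np

def kmeans(X, K, max_iter=100, rng=None):
    """Plain K-means on the rows of X (n x m); returns labels, centroids, SSE."""
    rng = np.random.default_rng(rng)
    # Step 1: pick K distinct samples as the initial centroids
    centroids = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each sample to the nearest centroid (squared Euclidean distance)
        dist2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dist2.argmin(axis=1)
        # Step 4: stop once the assignment no longer changes
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: recompute each centroid with equation (7)
        for i in range(K):
            members = X[labels == i]
            if len(members) > 0:  # assumption: an empty cluster keeps its old centroid
                centroids[i] = members.mean(axis=0)
    # Equation (6): sum of squared distances to the assigned centroids
    sse = float(((X - centroids[labels]) ** 2).sum())
    return labels, centroids, sse
```

A typical call is `labels, centroids, sse = kmeans(X, K=3, rng=0)`, where `X` holds one sample per row.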
Process monitoring based on kNN

The basic idea of FD-kNN is that a normal sample is similar to the samples in the training set, as all the training samples are obtained from normal batch processes, whereas a faulty sample is different from the training samples.6 In FD-kNN, this is implemented by evaluating the kNN distance, which is defined as the cumulative squared distance between a sample and its k nearest neighbors in the training samples. The kNN distance of a faulty sample is therefore expected to be greater than that of a normal sample. The details of this method are as follows.

Model building

For each sample $x_i$, $i = 1, 2, \ldots, n$, in the training data set $X = [x_1, x_2, \ldots, x_n]^T \in \mathbb{R}^{n \times d}$,
1. find its kNNs by computing the distance between it and all other samples $x_t$ in $X$

$$d_{it}^2 = \|x_i - x_t\|_2^2, \quad i, t = 1, 2, \ldots, n,\; i \ne t \tag{8}$$

2. calculate the kNN distance

$$D_i^2 = \sum_{j=1}^{k} d_{ij}^2 \tag{9}$$

where $d_{ij}^2$ is the squared distance from $x_i$ to its jth nearest neighbor;
3. determine the threshold $D_\alpha^2$ for fault detection.

The common way to set $D_\alpha^2$ is by calibration or utilization6 of the training samples. Specifically, the threshold can be estimated as the $(1-\alpha)$-empirical quantile of $D_{(i)}^2$ as10

$$D_\alpha^2 = D_{(\lfloor n(1-\alpha) \rfloor)}^2 \tag{10}$$

where $D_{(i)}^2$ is the rearrangement of $D_i^2$ in ascending order, so that the $\lfloor n(1-\alpha) \rfloor$th value is the $(1-\alpha)$ quantile; $\lfloor n(1-\alpha) \rfloor$ is the integer part of $n(1-\alpha)$; and $\alpha$ is the significance level (confidence level $1-\alpha$).

Fault detection

For a new sample $y \in \mathbb{R}^d$,
1. find its kNNs in $X$ using equation (8);
2. calculate its kNN distance $D_y^2$ via equation (9); and
3. compare $D_y^2$ against the above threshold $D_\alpha^2$.

If $D_y^2 \le D_\alpha^2$, y is a normal sample; otherwise, y is determined to be a faulty sample.
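The model-building and detection steps can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the reference FD-kNN code: the function names are hypothetical, the kNN distances are sorted in ascending order so that rank $\lfloor n(1-\alpha) \rfloor$ gives the $(1-\alpha)$ empirical quantile, and each training sample is excluded from its own neighbor search, matching $i \ne t$ in equation (8).

```python
import numpy as np

def knn_sq_distance(Q, X, k, exclude_self=False):
    """Equation (9): summed squared distance from each row of Q to its k nearest rows of X."""
    d2 = (Q ** 2).sum(axis=1)[:, None] + (X ** 2).sum(axis=1)[None, :] - 2.0 * (Q @ X.T)
    d2 = np.maximum(d2, 0.0)      # guard against tiny negative round-off
    if exclude_self:              # only meaningful when Q is X itself (i != t)
        np.fill_diagonal(d2, np.inf)
    return np.sort(d2, axis=1)[:, :k].sum(axis=1)

def fit_threshold(X, k=3, alpha=0.01):
    """Equation (10): (1 - alpha) empirical quantile of the training kNN distances."""
    D2 = knn_sq_distance(X, X, k, exclude_self=True)
    rank = max(int(np.floor(len(X) * (1 - alpha))), 1)   # 1-based rank
    return np.sort(D2)[rank - 1]

def detect(y, X, k, threshold):
    """Return (is_fault, D2_y) for one new sample y of length d."""
    D2_y = knn_sq_distance(np.atleast_2d(y), X, k)[0]
    return D2_y > threshold, D2_y
```

With normal training data `X`, `threshold = fit_threshold(X, k=3, alpha=0.01)` is computed once off-line, and `detect(y, X, 3, threshold)` is then called for each new sample.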
Real-time monitoring method of batch process based on time division

In this section, a real-time monitoring method for the batch process based on time division is proposed by combining K-means clustering and the kNN algorithm. First, according to the historical data of the process, a complete batch production cycle is divided into multiple suboperation stages. Then, the subperiod monitoring models are established off-line. Finally, the process data collected online are monitored in real time using these models. The proposed method thus includes subperiod division, modeling, and real-time monitoring, each of which is described in detail in the following.
Subperiod division of batch process

Based on random projection and K-means clustering, a subperiod division method is proposed to realize the time division of the batch process. The batch process has a multiperiod characteristic: the process variables do not simply change with time but follow the operation and mechanism characteristics of the process. Correspondingly, the historical measurement data of the batch process can be grouped into multiple clusters, revealing different process characteristics for each period. It should be noted that the subperiod division in this article is based on the data of a single batch. Sufficient batch data are difficult to obtain in the short term, and processing multibatch data is more complex, whereas processing single-batch data is relatively simple.
Suppose there is a batch of data for the batch process, denoted by $X = [x_1, x_2, \ldots, x_M] \in \mathbb{R}^{J \times M}$, where J is the number of variables and M is the number of independent sampling points in an operating cycle. The subperiod division algorithm is as follows:
1. dimension reduction: based on random projection, the original data set is projected into the lower dimensional space according to equation (5) to obtain the projected matrix $\tilde{X} = [\tilde{x}_1, \tilde{x}_2, \ldots, \tilde{x}_M] \in \mathbb{R}^{d \times M}$, where d denotes the dimension of the low dimensional space;17
2. cutting: the two-dimensional data $\tilde{X}$ are cut along the time axis into M data points $\tilde{x}_i$, $i = 1, 2, \ldots, M$;
3. subperiod division: the M data points $\tilde{x}_i$ are taken as the input samples of the clustering algorithm, and the process is divided into K stages by minimizing the SSE;
4. adjustment: the input of the clustering algorithm is arranged in chronological order, so the clustering result divides the process according to the consistency of the process variables. However, owing to the presence of noise or measurement errors, highly similar samples may exist in different periods. Correspondingly, the subperiods may appear discontinuous in time, which is indicated by “jump points.” These jump points are then merged into temporally adjacent subperiods in conjunction with the clustering results and the process operation time. Finally, the K subperiods of the batch process are obtained.
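The four steps above can be sketched end to end as follows. This is an illustration under several assumptions that the text does not fix: all function names are hypothetical, the projection uses the ±1 entries of distribution (3), the K-means initial centroids are chosen by a deterministic farthest-first rule instead of at random, and the jump-point adjustment is implemented as one simple rule (any run of fewer than `min_len` consecutive points takes the label of the run before it).

```python
import numpy as np

def project(X, d, rng):
    """Equation (5) with the +/-1 entries of distribution (3); X is n x m."""
    A = rng.choice([1.0, -1.0], size=(d, X.shape[1]))
    return X @ A.T / np.sqrt(d)

def cluster_time_points(Z, K, max_iter=100):
    """K-means on the rows of Z with farthest-first initial centroids (an assumption)."""
    centroids = [Z[0]]
    for _ in range(K - 1):
        gap = np.min([((Z - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        centroids.append(Z[gap.argmax()])
    centroids = np.array(centroids, dtype=float)
    labels = np.full(len(Z), -1)
    for _ in range(max_iter):
        d2 = ((Z[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        for i in range(K):
            if np.any(labels == i):
                centroids[i] = Z[labels == i].mean(axis=0)
    return labels

def merge_jump_points(labels, min_len=5):
    """Hypothetical adjustment rule: absorb runs shorter than min_len into a neighbor."""
    labels = np.asarray(labels).copy()
    starts = np.flatnonzero(np.r_[True, labels[1:] != labels[:-1]])
    ends = np.r_[starts[1:], len(labels)]
    for s, e in zip(starts, ends):
        if e - s < min_len:
            if s > 0:
                labels[s:e] = labels[s - 1]   # take the label of the preceding run
            elif e < len(labels):
                labels[s:e] = labels[e]       # a short run at the start takes the next label
    return labels

def divide_subperiods(X, K=3, d=6, min_len=5, seed=0):
    """X is J x M (variables x time points); returns one subperiod label per time point."""
    rng = np.random.default_rng(seed)
    Z = project(X.T, d, rng)                   # steps 1 and 2: M projected time slices
    labels = cluster_time_points(Z, K)         # step 3
    return merge_jump_points(labels, min_len)  # step 4
```

The time slices are passed as rows (`X.T`) because equation (5) treats each row as one sample. The farthest-first initialization is sensitive to outliers, so random restarts may be preferable on noisy data.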
kNN modeling and real-time monitoring based on subperiod

Through the analysis in the previous section, the subperiod division of the batch process is realized so that sampling points with similar characteristics are assigned to one subperiod, and successive points with large differences are assigned to different subperiods. In this section, the historical data of the batch process are divided into K subperiods, the monitoring model of each subperiod is established by the kNN method, and online monitoring of the batch production process is realized based on these models. Assume that I operation cycles are selected and that their data are divided into K subperiods $X_i \in \mathbb{R}^{I \times J \times a}$, $i = 1, 2, \ldots, K$, where a is the number of independent sampling points in the subperiod. Then, the modeling and online monitoring process based on the kNN method is described as follows:
1. the three-dimensional matrix of each subperiod $X_i \in \mathbb{R}^{I \times J \times a}$ is expanded by batch into a two-dimensional matrix $X_i = [x_1, x_2, \ldots, x_{Ia}]^T \in \mathbb{R}^{Ia \times J}$, $i = 1, 2, \ldots, K$;
2. for each subperiod data set $X_i$, the kNNs of each sample are searched based on the Euclidean distance, and the k-distance is calculated from equation (11)

$$D_{j,i}^2 = \sum_{t=1}^{k} d_{jt}^2 \tag{11}$$

where $D_{j,i}^2$ denotes the k-distance of the sample $x_j$ in the ith subperiod;
3. the control limit $D_{\alpha,i}^2$ is determined as the $(1-\alpha)$ empirical quantile

$$D_{\alpha,i}^2 = D_{(\lfloor Ia(1-\alpha) \rfloor)}^2 \tag{12}$$

where, for the ith subperiod, $D_{(\lfloor Ia(1-\alpha) \rfloor)}^2$ is the $\lfloor Ia(1-\alpha) \rfloor$th value after the k-distances of all the samples are sorted in ascending order.

For an ongoing batch production process, the measurement data for the lth sample are acquired online, $y_l \in \mathbb{R}^J$, $l = 1, 2, \ldots, M$;
4. assuming that the current time belongs to the ith subperiod, the batch process at the current time is monitored by the model of the ith subperiod;
5. in the ith subperiod data $X_i$, the kNNs of $y_l$ are searched, and the k-distance $D_{y,l}^2$ of $y_l$ is calculated;
6. the control limit $D_{\alpha,i}^2$ of the ith subperiod is obtained and compared with $D_{y,l}^2$. If $D_{y,l}^2 \le D_{\alpha,i}^2$, $y_l$ is a normal sample; otherwise, it is a fault sample.
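Steps 1 to 6 can be sketched as follows. This is a minimal NumPy illustration, not the authors' code: the function names, the dictionary layout of each model, and the representation of subperiods as `(start, end)` time-index pairs (0-based, end exclusive) are assumptions, and the control limit is taken from the k-distances sorted in ascending order.

```python
import numpy as np

def knn_sq_distance(Q, X, k, exclude_self=False):
    """Equation (11): summed squared distance from each row of Q to its k nearest rows of X."""
    d2 = (Q ** 2).sum(axis=1)[:, None] + (X ** 2).sum(axis=1)[None, :] - 2.0 * (Q @ X.T)
    d2 = np.maximum(d2, 0.0)      # guard against tiny negative round-off
    if exclude_self:              # a training sample is not its own neighbor
        np.fill_diagonal(d2, np.inf)
    return np.sort(d2, axis=1)[:, :k].sum(axis=1)

def fit_subperiod_models(batches, bounds, k=3, alpha=0.01):
    """batches: I x J x M array of normal data; bounds: list of (start, end) index pairs."""
    J = batches.shape[1]
    models = []
    for start, end in bounds:
        # Step 1: unfold I x J x a into (I*a) x J, one row per batch and time point
        Xi = batches[:, :, start:end].transpose(0, 2, 1).reshape(-1, J)
        # Step 2: k-distance of every training sample in this subperiod
        D2 = knn_sq_distance(Xi, Xi, k, exclude_self=True)
        # Step 3: control limit from equation (12), using a 1-based rank
        rank = max(int(np.floor(len(Xi) * (1 - alpha))), 1)
        models.append({"start": start, "end": end, "X": Xi,
                       "limit": np.sort(D2)[rank - 1]})
    return models

def monitor_sample(y, l, models, k=3):
    """Steps 4 to 6: check sample y (length J) taken at time index l; returns (is_fault, D2)."""
    for mdl in models:
        if mdl["start"] <= l < mdl["end"]:
            D2_y = knn_sq_distance(np.atleast_2d(y), mdl["X"], k)[0]
            return D2_y > mdl["limit"], D2_y
    raise ValueError("time index lies outside every subperiod")
```

Only K models and K control limits are stored, one per subperiod, so the on-line cost per sample is a single neighbor search within the matching subperiod.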
Simulation

Data generation

In the numerical simulation, the batch process contains two steady-state phases and a transition phase, with 5 latent variables and 20 process variables. The operation time is 200 h, of which the two steady-state phases last 120 and 50 h, respectively, and the transition phase lasts 30 h. The sampling interval is 1 h, giving 200 sample points per batch. The modeling data comprise 40 batches and are denoted by $X \in \mathbb{R}^{I \times J \times M}$, with I = 40, J = 20, and M = 200. To test the performance of the algorithm, additive faults are introduced on variable 8, and different fault values are set at different stages: 5 in steady-state phases 1 and 2, and 25 in the transition phase. All faults are introduced from the 51st sampling point until production is completed, so a total of 150 faulty samples are set.
Result analysis

This section divides the batch process into subperiods using the proposed method and then performs modeling and online monitoring based on the kNN method. Taking the process characteristics into account, the following parameters are set: number of expected clusters K = 3, projection error parameters $\varepsilon$ = 0.8 and $\beta$ = 0.4, projected spatial dimension d = 6, number of neighbors k = 3, and significance level $\alpha$ = 1%.
Figure 1. The time division of K-means clustering.

Figure 2. The time division of this new method.

Table 1. Time complexity of the two methods.
Method                CPU time (s)
K-means clustering    $4.19 \times 10^{-2}$
Proposed method       $1.29 \times 10^{-2}$
K-means clustering and the method proposed in this article are both used to divide the entire batch process. The results of these two methods are shown in Figures 1 and 2. It can be seen that both methods divide the entire batch process into three subperiods and that both produce jump points. In Figure 1, the 16th sample point is assigned to the second subperiod. However, the points before and after this point are assigned to the first subperiod; thus, this sample point is considered to be a jump point. After the jump points are merged into their temporally adjacent subperiods, the time division result of the K-means clustering method is first subperiod 1–121, second subperiod 122–150, and third subperiod 151–200. Similarly, the time division result of the new method is first subperiod 1–119, second subperiod 120–150, and third subperiod 151–200. Comparison of the results of these two methods shows that both can divide the batch process accurately; however, the method proposed in this article significantly reduces the computation time of the subperiod division, as shown in Table 1.
Figure 3. Real-time monitoring results of subperiod models.
The measurement data of 40 operating cycles were divided according to this method, eventually forming three subperiod data sets: $X_1 \in \mathbb{R}^{40 \times 20 \times 119}$, $X_2 \in \mathbb{R}^{40 \times 20 \times 31}$, and $X_3 \in \mathbb{R}^{40 \times 20 \times 50}$. A unified monitoring model is established for each subperiod using the kNN method, and the batch process is monitored based on these models. From Figure 3, it can be seen that the kNN method has a poor fault detection effect in the second subperiod. Given the characteristics of the batch process itself, the second subperiod corresponds to the transition phase. Compared with the steady-state phases, the transition phase is more dynamic and is easily disturbed by instability. Therefore, the difference in the measurement data at the same time for different batches, or at different times in the same batch, is relatively large in the transition phase, and the calculated k-distance is also relatively large. Moreover, owing to the instability of the transition phase, the fault samples may be similar to normal samples at other times. Accordingly, their k-distance does not exceed the control limit, and such faults are not detected. For a single-time monitoring model, 200 monitoring models are required for 200 sample points. When there are numerous sample points, the time complexity of modeling and the space complexity of storing the models are substantial. In this study, the batch process is divided into subperiods, and a unified monitoring model is established for each subperiod. The proposed method divides the above batch process into three subperiods and establishes three subperiod monitoring models instead of 200, which greatly reduces the model complexity.
Conclusions
This article presents a real-time monitoring method for the batch process based on time division. First, a time division method based on random projection and K-means clustering is proposed. Based on the measurement data after random projection, the K-means clustering algorithm is used to divide the whole cycle time of the batch process into multiple subperiods. Then, the kNN method establishes the subperiod monitoring model to monitor the batch process in real time. The proposed method is simple and suitable for production processes with large variable correlations. It can reveal the multiperiod characteristics of the batch process more objectively, while reducing the complexity of a single-time model. Further, combining the prior knowledge of the process will help to improve the accuracy of the subperiod division and the performance of process anomaly monitoring, which is an important direction for future research.

Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the National Natural Science Foundation of China, Nos. U1504616, 61503123, 61673160, 61573137; sponsored by Program for Science and Technology Innovation Talents in Universities of Henan Province under Grant 17HASTIT021; supported by Shaanxi Key Laboratory project of Complex System Control and Intelligent Information Processing under Grant 2017cp05.
References
1. Zhao CH, Wang FL, Yao Y, et al. Phase-based statistical modeling, online monitoring and quality prediction for batch processes. Acta Autom Sinica 2010; 36(3): 366–374.
2. Wang YJ, Sun FM, and Jia MX. Online monitoring method for multiple operating batch processes based on local collection standardization and multi-model dynamic PCA. Can J Chem Eng 2016; 94(10): 1965–1976.
3. Wen CL, Lv FY, Bao ZJ, et al. A review of data driven-based incipient fault diagnosis. Acta Autom Sinica 2016; 42(9): 1285–1298.
4. Hong JJ, Zhang J, and Morris JL. Fault localization in batch processes through progressive principal component analysis modeling. Ind Eng Chem Res 2011; 50: 8153–8162.
5. Wang G and Yin S. Quality-related fault detection approach based on orthogonal signal correction and modified PLS. IEEE Trans Ind Inform 2015; 11(2): 398–405.
6. Stubbs S, Zhang J, and Morris J. Multiway interval partial least squares for batch process performance monitoring. Ind Eng Chem Res 2013; 52(35): 12399–12407.
7. Choi SW, Lee C, Lee JM, et al. Fault detection and identification of nonlinear processes based on kernel PCA. Chemom Intell Lab Syst 2005; 75(1): 55–67.
8. Kano M, Tanaka S, Hasebe S, et al. Monitoring independent components for fault detection. AIChE J 2003; 49(4): 969–976.
9. Zhao SJ, Zhang J, and Xu YM. Monitoring of processes with multiple operating modes through multiple principle component analysis models. Ind Eng Chem Res 2004; 43(22): 7025–7035.
10. He QP and Wang J. Fault detection using kNN rule for semiconductor manufacturing processes. IEEE Trans Semicond Manuf 2007; 20(4): 345–354.
11. Lu NY, Wang FL, Gao FR, et al. Statistical modeling and online monitoring for batch processes. Acta Autom Sinica 2006; 32(3): 400–410.
12. Deng F, Guan SP, Yue XH, et al. Energy-based sound source localization with low power consumption in wireless sensor networks. Transactions on Industrial Electronics 2017; 64(6): 4894–4902.
13. Louwerse DJ and Smilde AK. Multivariate statistical process control of batch processes based on three-way models. Chem Eng Sci 2000; 55(7): 1225–1235.
14. Chang YQ, Wang Z, Tan S, et al. Research on multistage-based MPCA modeling and monitoring method for batch processes. Acta Autom Sinica 2010; 36(9): 1312–1319.
15. Johnson WB and Lindenstrauss J. Extensions of Lipschitz mappings into a Hilbert space. Contemp Math 1984; 26: 189–206.
16. Achlioptas D. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. J Comput Syst Sci 2003; 66(4): 671–687.
17. Deng F, Guo S, Zhou R, et al. Sensor multifault diagnosis with improved support vector machines. Transactions on Automation Science and Engineering 2017; 14(2): 1053–1063.