Dynamic Programming for Bayesian Logistic Regression Learning under Concept Drift

Pavel Turkov, Olga Krasotkina^1, and Vadim Mottl^2

1 Tula State University, Tula, Russia
[email protected], [email protected]
2 Computing Center of the Russian Academy of Sciences, Moscow, Russia
[email protected]
Abstract. A data stream is an ordered sequence of training instances arriving at a rate that does not permit storing them permanently in memory, which necessitates online learning methods when trying to predict some hidden target variable. In addition, concept drift often occurs, which means that the statistical properties of the target variable may change over time. In this paper, we present a framework for solving the online pattern recognition problem in data streams under concept drift. The framework is based on the application of the Bayesian approach to the probabilistic pattern recognition model in terms of logistic regression, hidden Markov model and dynamic programming.

Keywords: online learning, concept drift, logistic regression, hidden Markov model
1 Introduction
A data stream is an ordered sequence of instances that arrive at a rate that does not permit storing them permanently in memory [1]. Unfortunately, the standard approach to pattern recognition learning rests on the tacit assumption that the entire available information is to be processed at once as a single data chunk. In the case of a data stream, such a traditional data mining concept would require infinite storage and running time. Data streams therefore call for online learning methods, which update an existing classifier or pool of classifiers. The main goal of online learning is to incorporate the knowledge that is intrinsically present in previously observed chunks of data into the existing system [2].

Pioneering research concentrated on algorithms implementing online learning for different types of classifiers. The majority of these works considered a single particular classifier and confined themselves to a direct tuning of its parameters [3], [5, 6]. Many authors proposed one-pass versions of batch learning algorithms. Later, ensembles of classifiers were also applied to problems that had originally been assumed to be solved by single-classifier incremental learning algorithms [2], [7–9].

In some applications the target concept may change while being analyzed; in the literature this is known as the concept drift problem. There are some methods for pattern recognition under concept drift [10] that use a sliding
window to choose a group of new instances to train a model. Such methods include a single classifier, as in [11, 12], or exploit an ensemble of classifiers [13]. However, it should be noted that an accurate mathematical statement of concept drift does not exist. The known algorithms are somewhat heuristic, and each specific heuristic is determined by the specificity of the particular practical task.

In this paper, we propose a probabilistic Bayesian approach to the online pattern recognition problem based on treating the concept drift as a hidden Markov process and exploiting the general principle of dynamic programming.
2 Background
Let each real-world object belong, in a hidden way, to one of two classes $y = \pm 1$ and be accessible to direct observation only through its real-valued feature vector $\mathbf{x} \in \mathbb{R}^n$. The classical linear approach to the pattern recognition problem rests on the observer's concept that there exists a linear discriminant function $\mathbf{a}^T\mathbf{x} + b \gtrless 0$, namely, a discriminant hyperplane determined by its direction vector in the feature space $\mathbf{a} \in \mathbb{R}^n$ and threshold $b \in \mathbb{R}$, such that primarily $\mathbf{a}^T\mathbf{x} + b > 0$ if $y = 1$ and $\mathbf{a}^T\mathbf{x} + b < 0$ if $y = -1$. It is required to infer the hyperplane's parameters $(\hat{\mathbf{a}}, \hat{b})$ from a finite unordered training set $(\mathbf{x}_j, y_j)$, $j = 1, \dots, N$, that contains information on both the feature vectors and the class memberships of objects.

The problem of batch learning under concept drift differs from the classical statement of the pattern recognition problem in two aspects. First, the training set is no longer treated as completely unordered. Instead, a data stream of training batches of individual sizes is considered, $(\mathbf{x}_{j,t}, y_{j,t})$, $j = 1, \dots, N_t$, which arrive sequentially in time $t = 1, 2, 3, \dots$ but are unordered within themselves. Second, the observer's concept tolerates a relatively slow drift of the unknown discriminant hyperplane, i.e., its parameters $(\mathbf{a}_t, b_t)$ may change in time.

We use the logistic regression approach [4], so the probabilities of the two possible class memberships $y_{j,t} = \pm 1$ of an instance can be expressed as a logistic function of its feature vector $\mathbf{x}_{j,t}$:

$$f(y_{j,t} \mid \mathbf{x}_{j,t}, \mathbf{a}_t, b_t) = \frac{1}{1 + \exp\!\left(-(1/\sigma^2)\, y_{j,t}\,(\mathbf{a}_t^T \mathbf{x}_{j,t} + b_t)\right)}, \tag{1}$$

so that $f(1 \mid \mathbf{x}_{j,t}, \mathbf{a}_t, b_t) + f(-1 \mid \mathbf{x}_{j,t}, \mathbf{a}_t, b_t) = 1$. For the entire batch of training instances $X_t = \{\mathbf{x}_{j,t},\ j = 1, \dots, N_t\}$ and their class labels $Y_t = \{y_{j,t},\ j = 1, \dots, N_t\}$ the joint probability function is the product

$$\Phi(Y_t \mid X_t, \mathbf{a}_t, b_t) = \prod_{j=1}^{N_t} f(y_{j,t} \mid \mathbf{x}_{j,t}, \mathbf{a}_t, b_t).$$
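As an illustration of model (1) and of the batch likelihood, the following is a minimal sketch in plain Python. The function names are hypothetical and not part of the paper; the logarithm is computed through `log1p` only to avoid overflow for large margins.

```python
import math

def log_class_probability(y, x, a, b, sigma):
    """ln f(y | x, a, b) for a label y in {-1, +1}, following model (1)."""
    margin = y * (sum(ai * xi for ai, xi in zip(a, x)) + b) / sigma ** 2
    # ln(1 / (1 + exp(-margin))), written so that exp() never overflows.
    return -(max(-margin, 0.0) + math.log1p(math.exp(-abs(margin))))

def class_probability(y, x, a, b, sigma):
    """f(y | x, a, b): probability of label y given the hyperplane (a, b)."""
    return math.exp(log_class_probability(y, x, a, b, sigma))

def batch_log_likelihood(X, Y, a, b, sigma):
    """ln Phi(Y_t | X_t, a_t, b_t): sum of log-probabilities over one batch."""
    return sum(log_class_probability(y, x, a, b, sigma) for x, y in zip(X, Y))
```

A point lying exactly on the hyperplane receives probability 0.5 for either class, and the two class probabilities of any point sum to one, as stated after (1).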
The key element of our Bayesian approach to the concept drift problem is treating the time-varying parameters of the hyperplane $(\mathbf{a}_t, b_t)$ as hidden random processes possessing the Markov property:

$$\begin{aligned}
\mathbf{a}_t &= q\,\mathbf{a}_{t-1} + \boldsymbol{\xi}_t, & M(\boldsymbol{\xi}_t) &= \mathbf{0}, & M(\boldsymbol{\xi}_t \boldsymbol{\xi}_t^T) &= d\,\mathbf{I}, \\
b_t &= b_{t-1} + \nu_t, & M(\nu_t) &= 0, & M(\nu_t^2) &= d', \\
q &= \sqrt{1 - d}, & 0 \le q &< 1,
\end{aligned} \tag{2}$$

where the variances $d$ and $d'$ determine the assumed hidden dynamics of the concept, and $\boldsymbol{\xi}_t$ and $\nu_t$ are independent white noises with zero mathematical expectations.
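The drift model (2) can be simulated directly. The sketch below assumes Gaussian noises, which (2) itself does not require (it fixes only the first two moments); the function name and the `seed` parameter are hypothetical.

```python
import math
import random

def simulate_drift(a1, b1, d, d_prime, steps, seed=0):
    """Sample a trajectory (a_t, b_t), t = 1..steps, of the Markov model (2).

    Assumption: xi_t and nu_t are drawn as Gaussian white noise.
    """
    rng = random.Random(seed)
    q = math.sqrt(1.0 - d)
    a, b = list(a1), b1
    path = [(list(a), b)]
    for _ in range(steps - 1):
        # a_t = q * a_{t-1} + xi_t, each component of xi_t has variance d.
        a = [q * ai + rng.gauss(0.0, math.sqrt(d)) for ai in a]
        # b_t = b_{t-1} + nu_t, nu_t has variance d'.
        b = b + rng.gauss(0.0, math.sqrt(d_prime))
        path.append((list(a), b))
    return path
```

With $q = \sqrt{1-d}$ the stationary variance $v$ of each component of $\mathbf{a}_t$ satisfies $v = q^2 v + d$, i.e. $v = 1$, so $d$ controls how fast the direction vector moves without changing its overall scale.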
The online learning protocol suggests that when the current training batch $(X_t, Y_t)$ arrives, the observer retains only the last estimate of the discriminant hyperplane $(\hat{\mathbf{a}}_{t-1}, \hat{b}_{t-1})$ inferred from the preceding part of the data stream $(X_s, Y_s)_{s=1}^{t-1}$, which is no longer stored in memory, and the task is to immediately recompute the current estimate with respect only to the new information: $(\hat{\mathbf{a}}_t, \hat{b}_t) = F\big((\hat{\mathbf{a}}_{t-1}, \hat{b}_{t-1}), (X_t, Y_t)\big)$. In order to retain as much as possible of the statistical advantages of off-line learning $(\hat{\mathbf{a}}_t, \hat{b}_t) = F\big((X_s, Y_s)_{s=1}^{t}\big)$, let us temporarily assume that the entire prehistory of the data stream is still available.

By (2), the a priori distribution density of the hidden sequence of hyperplane parameters has the form

$$\Psi\big((\mathbf{a}_s, b_s)_{s=2}^{t} \mid \mathbf{a}_1, b_1\big) = \prod_{s=2}^{t} \psi_s(\mathbf{a}_s, b_s \mid \mathbf{a}_{s-1}, b_{s-1}),$$

$$\begin{aligned}
\psi_s(\mathbf{a}_s, b_s \mid \mathbf{a}_{s-1}, b_{s-1}) &\propto N\big(\mathbf{a}_s \mid \sqrt{1-d}\,\mathbf{a}_{s-1},\, d\,\mathbf{I}\big)\, N(b_s \mid b_{s-1}, d') \\
&= \frac{1}{d^{n/2}(2\pi)^{n/2}} \exp\!\Big(-\frac{1}{2d}\big(\mathbf{a}_s - \sqrt{1-d}\,\mathbf{a}_{s-1}\big)^T\big(\mathbf{a}_s - \sqrt{1-d}\,\mathbf{a}_{s-1}\big)\Big) \\
&\quad\times \frac{1}{(2\pi d')^{1/2}} \exp\!\Big(-\frac{1}{2d'}(b_s - b_{s-1})^2\Big).
\end{aligned}$$

If we assume that there is no a priori information on the first value of the parameter vector $(\mathbf{a}_1, b_1)$, then the a posteriori distribution density of the entire hidden sequence of hyperplane parameters is proportional to the product

$$P\big((\mathbf{a}_s, b_s)_{s=1}^{t} \mid (X_s, Y_s)_{s=1}^{t}\big) \propto \Phi\big((Y_s)_{s=1}^{t} \mid (X_s, \mathbf{a}_s, b_s)_{s=1}^{t}\big)\, \Psi\big((\mathbf{a}_s, b_s)_{s=1}^{t}\big).$$

The sought-for estimate of the time-varying parameters $(\hat{\mathbf{a}}_s, \hat{b}_s)_{s=1}^{t}$ is the maximum point of the joint distribution of the parameters and the training set:

$$(\hat{\mathbf{a}}_s, \hat{b}_s)_{s=1}^{t} = \arg\max_{\mathbf{a}_s, b_s} P\big((\mathbf{a}_s, b_s)_{s=1}^{t} \mid (X_s, Y_s)_{s=1}^{t}\big) = \arg\min_{\mathbf{z}_s} J_t\big((\mathbf{z}_s)_{s=1}^{t}\big), \tag{3}$$
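$$\begin{aligned}
J_t\big((\mathbf{z}_s)_{s=1}^{t}\big) &= \sum_{s=1}^{t} \underbrace{\sum_{j=1}^{N_s} \ln\!\big(1 + \exp(-C\,\mathbf{g}_{j,s}^T \mathbf{z}_s)\big)}_{\eta_s(\mathbf{z}_s)} + \sum_{s=2}^{t} \underbrace{(\mathbf{z}_s - \mathbf{A}\mathbf{z}_{s-1})^T \mathbf{U} (\mathbf{z}_s - \mathbf{A}\mathbf{z}_{s-1})}_{\gamma_s(\mathbf{z}_{s-1}, \mathbf{z}_s)} \\
&= \sum_{s=1}^{t} \eta_s(\mathbf{z}_s) + \sum_{s=2}^{t} \gamma_s(\mathbf{z}_{s-1}, \mathbf{z}_s) \to \min_{\mathbf{z}_s},
\end{aligned} \tag{4}$$

where

$$\mathbf{z}_s = \begin{pmatrix} \mathbf{a}_s \\ b_s \end{pmatrix} \in \mathbb{R}^{n+1}, \qquad \mathbf{g}_{j,s} = \begin{pmatrix} y_{j,s}\,\mathbf{x}_{j,s} \\ y_{j,s} \end{pmatrix} \in \mathbb{R}^{n+1}, \qquad C = \frac{2}{\sigma^2},$$

$$\mathbf{U} = \begin{pmatrix}
\frac{1}{d} & 0 & \cdots & 0 & 0 \\
0 & \frac{1}{d} & \cdots & 0 & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & \frac{1}{d} & 0 \\
0 & 0 & \cdots & 0 & \frac{1}{d'}
\end{pmatrix}, \qquad
\mathbf{A} = \begin{pmatrix}
\sqrt{1-d} & 0 & \cdots & 0 & 0 \\
0 & \sqrt{1-d} & \cdots & 0 & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & \sqrt{1-d} & 0 \\
0 & 0 & \cdots & 0 & \sqrt{1-d'}
\end{pmatrix},$$

both of size $(n+1) \times (n+1)$.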
The criterion (4) is pair-wise separable, i.e., has the structure of a sum of functions each of which is associated with one time point (s) or two immediately adjacent time points in their increasing order (s−1, s). This computational problem is, generally speaking, that of dynamic programming.
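The pair-wise separable structure of criterion (4) can be made concrete with a short sketch. The function names are hypothetical; the diagonal matrices A and U are passed as lists of their diagonal entries (`a_diag`, `u_diag`), and the constant `C` is a plain argument, so the code does not depend on particular parameter values.

```python
import math

def make_g(x, y):
    """Augmented vector g = (y * x, y) for one labelled instance."""
    return [y * xi for xi in x] + [y]

def softplus(u):
    """Numerically stable ln(1 + exp(u))."""
    return max(u, 0.0) + math.log1p(math.exp(-abs(u)))

def eta(z, G, C):
    """Node term eta_s(z_s): logistic loss summed over one batch G of g-vectors."""
    return sum(softplus(-C * sum(gi * zi for gi, zi in zip(g, z))) for g in G)

def gamma(z_prev, z, a_diag, u_diag):
    """Edge term (z - A z_prev)^T U (z - A z_prev) for diagonal A and U."""
    return sum(u * (zi - a * zp) ** 2
               for zi, zp, a, u in zip(z, z_prev, a_diag, u_diag))

def criterion(Z, batches, C, a_diag, u_diag):
    """J_t of (4): one eta per time point plus one gamma per adjacent pair."""
    total = sum(eta(z, G, C) for z, G in zip(Z, batches))
    total += sum(gamma(Z[s - 1], Z[s], a_diag, u_diag) for s in range(1, len(Z)))
    return total
```

Each `eta` call touches a single time point and each `gamma` call touches two adjacent ones, which is exactly the chain structure that dynamic programming exploits.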
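In the online learning mode, only the last element $\hat{\mathbf{z}}_t = (\hat{\mathbf{a}}_t, \hat{b}_t)$ of the estimated sequence $(\hat{\mathbf{z}}_s)_{s=1}^{t}$ is of interest at each current time point $t$. The main notion of dynamic programming is that of the sequence of Bellman functions $\tilde{J}_t(\mathbf{z}_t)$ whose minimum points yield just the required online estimates:

$$\tilde{J}_t(\mathbf{z}_t) = \min_{\mathbf{z}_1, \dots, \mathbf{z}_{t-1}} J_t\big((\mathbf{z}_s)_{s=1}^{t}\big), \qquad \hat{\mathbf{z}}_t = \arg\min_{\mathbf{z}_t} \tilde{J}_t(\mathbf{z}_t). \tag{5}$$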
The fundamental property of Bellman functions is the almost evident equality

$$\tilde{J}_t(\mathbf{z}_t) = \eta_t(\mathbf{z}_t) + \min_{\mathbf{z}_{t-1}} \big[\gamma_t(\mathbf{z}_{t-1}, \mathbf{z}_t) + \tilde{J}_{t-1}(\mathbf{z}_{t-1})\big], \tag{6}$$

which would immediately suggest an algorithm for their sequential computation if only there existed a simple way of solving the optimization problem in (6) with real-valued vectors $\mathbf{z}_{t-1}$.

If all the items of the pair-wise separable learning criterion (4) were quadratic, each of the Bellman functions (5)-(6) would also be quadratic,

$$\tilde{J}_t(\mathbf{z}_t) = (\mathbf{z}_t - \tilde{\mathbf{z}}_t)^T \tilde{\mathbf{Q}}_t (\mathbf{z}_t - \tilde{\mathbf{z}}_t) + \tilde{c}_t, \tag{7}$$

and easily computable in terms of their parameters $\tilde{\mathbf{z}}_t \in \mathbb{R}^{n+1}$, $\tilde{\mathbf{Q}}_t$ of size $(n+1) \times (n+1)$, and $\tilde{c}_t \in \mathbb{R}$ by the classical Kalman-Bucy filter [14, 15]. But only the second sum in (4) is quadratic, whereas the first one is formed by logarithmic items. Therefore, an approximation of Bellman functions is required before the exceptionally effective Kalman-Bucy filter can be applied.
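To see why the quadratic case is easy, the inner minimization in (6) can be carried out in closed form. The following is a standard completion-of-squares computation, given as a sketch under the assumption that $\mathbf{U}$ and $\tilde{\mathbf{Q}}_{t-1}$ are symmetric and positive definite; it is not quoted from [14, 15]. The inner minimum is denoted here by $F_t(\mathbf{z}_t)$.

```latex
% Assumption: \tilde J_{t-1} has the quadratic form (7) with parameters
% \tilde z_{t-1}, \tilde Q_{t-1}, \tilde c_{t-1}.
\begin{align*}
F_t(\mathbf{z}_t)
  &= \min_{\mathbf{z}_{t-1}} \Big[
       (\mathbf{z}_t - \mathbf{A}\mathbf{z}_{t-1})^T \mathbf{U}
       (\mathbf{z}_t - \mathbf{A}\mathbf{z}_{t-1})
     + (\mathbf{z}_{t-1} - \tilde{\mathbf{z}}_{t-1})^T \tilde{\mathbf{Q}}_{t-1}
       (\mathbf{z}_{t-1} - \tilde{\mathbf{z}}_{t-1}) \Big] + \tilde{c}_{t-1}, \\
% Setting the gradient with respect to z_{t-1} to zero:
\big(\mathbf{A}^T \mathbf{U} \mathbf{A} + \tilde{\mathbf{Q}}_{t-1}\big)\,
  \mathbf{z}_{t-1}^{*}
  &= \mathbf{A}^T \mathbf{U}\, \mathbf{z}_t
   + \tilde{\mathbf{Q}}_{t-1} \tilde{\mathbf{z}}_{t-1}, \\
% Substituting z_{t-1}^* back and applying the matrix inversion lemma:
F_t(\mathbf{z}_t)
  &= (\mathbf{z}_t - \mathbf{A}\tilde{\mathbf{z}}_{t-1})^T
     \big(\mathbf{U}^{-1} + \mathbf{A}\, \tilde{\mathbf{Q}}_{t-1}^{-1} \mathbf{A}^T\big)^{-1}
     (\mathbf{z}_t - \mathbf{A}\tilde{\mathbf{z}}_{t-1}) + \tilde{c}_{t-1}.
\end{align*}
```

Under these assumptions the inner minimum is again quadratic in $\mathbf{z}_t$, with minimum point $\mathbf{A}\tilde{\mathbf{z}}_{t-1}$, so a quadratic $\eta_t$ would keep $\tilde{J}_t$ in the form (7) at every step.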
3 Approximate dynamic programming for online estimating the time-varying discriminant hyperplane
A method of overcoming the obstacle of non-quadratic Bellman functions (5)-(6) is proposed in [16]. However, that method is based on the assumption that the size of the training set $t$ is much greater than the number of features $n$, and that both classes are equally represented in the training set. The former of these assumptions is easily satisfied in the case of a data stream, but the latter can fail in very many cases.

We suggest here a more universal approach to the quadratic approximation of the logarithmic summands in (4). Our procedure is based on the assumption that there exists an approximate compact representation of Bellman functions, which permits storing them in memory. The previous Bellman function $\tilde{J}_{t-1}(\mathbf{z}_{t-1})$ in (6) is non-quadratic because of the logarithmic term in (4). The idea of the approximate implementation of the dynamic programming procedure consists in replacing the function

$$F_t(\mathbf{z}_t) = \min_{\mathbf{z}_{t-1}} \big[\gamma_t(\mathbf{z}_{t-1}, \mathbf{z}_t) + \tilde{J}_{t-1}(\mathbf{z}_{t-1})\big] \tag{8}$$

by an appropriate quadratic function

$$\bar{F}_t(\mathbf{z}_t) = (\mathbf{z}_t - \bar{\mathbf{z}}_t)^T \bar{\mathbf{Q}}_t (\mathbf{z}_t - \bar{\mathbf{z}}_t); \tag{9}$$

then the subsequent approximations of the next Bellman function $\tilde{J}_t(\mathbf{z}_t)$ will be quadratic too. It remains to choose appropriate values of the parameters $(\bar{\mathbf{z}}_t, \bar{\mathbf{Q}}_t)$ of the quadratic function $\bar{F}_t(\mathbf{z}_t)$ that preserve the main aspects of the, generally speaking, non-quadratic original Bellman function. We propose to retain the minimum point of the function, $\bar{\mathbf{z}}_t = \arg\min F_t(\mathbf{z}_t)$, and the Hessian at the minimum point, $\bar{\mathbf{Q}}_t = \nabla^2 F_t(\bar{\mathbf{z}}_t)$.
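Theorem 1. Let function (8) have the form

$$F_t(\mathbf{z}_t) = \min_{\mathbf{z}_{t-1}} \Big[(\mathbf{z}_t - \mathbf{A}\mathbf{z}_{t-1})^T \mathbf{U} (\mathbf{z}_t - \mathbf{A}\mathbf{z}_{t-1}) + \sum_{j=1}^{N_{t-1}} \ln\!\big(1 + \exp(-C\,\mathbf{g}_j^T \mathbf{z}_{t-1})\big)\Big].$$

Then the parameters $(\bar{\mathbf{z}}_t, \bar{\mathbf{Q}}_t)$ of approximation (9) are determined as

$$\bar{\mathbf{Q}}_t = \tilde{\mathbf{Q}}_{t-1}\big(\mathbf{A}\mathbf{U}\mathbf{A} + \tilde{\mathbf{Q}}_{t-1}\big)^{-1}\mathbf{U}, \qquad \bar{\mathbf{z}}_t = \bar{\mathbf{Q}}_t^{-1}\big(\mathbf{A}\mathbf{U}\mathbf{A} + \tilde{\mathbf{Q}}_{t-1}\big)^{-1}\mathbf{A}\mathbf{U}\tilde{\mathbf{Q}}_{t-1}\tilde{\mathbf{z}}_{t-1}.$$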
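One possible reading of the whole procedure (fit the current batch against a quadratic summary of the past, then push that summary one step forward) is sketched below in plain Python. Several choices are assumptions rather than the paper's prescription: the minimum of each approximate Bellman function is found by Newton's method, the stored matrix is half the Hessian so that the quadratic form reproduces the curvature, the propagation step uses the symmetric expression $(\mathbf{U}^{-1} + \mathbf{A}\tilde{\mathbf{Q}}^{-1}\mathbf{A})^{-1}$ for diagonal $\mathbf{A}$ instead of the algebraic form of Theorem 1, and a small `ridge` stands in for the missing prior on the first time point. All function names are hypothetical.

```python
import math

def inv(M):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(M)
    aug = [list(map(float, row)) + [float(i == k) for k in range(n)]
           for i, row in enumerate(M)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        pivot = aug[c][c]
        aug[c] = [v / pivot for v in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def sigmoid(u):
    """Numerically stable logistic function."""
    if u >= 0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)

def grad_hess(z, G, C, z_bar, Q_bar):
    """Gradient and Hessian of eta_t(z) + (z - z_bar)^T Q_bar (z - z_bar)."""
    m = len(z)
    grad = [2.0 * v for v in matvec(Q_bar, [zi - zb for zi, zb in zip(z, z_bar)])]
    hess = [[2.0 * q for q in row] for row in Q_bar]
    for g in G:
        s = sigmoid(-C * sum(gi * zi for gi, zi in zip(g, z)))
        for i in range(m):
            grad[i] -= C * s * g[i]
            for k in range(m):
                hess[i][k] += C * C * s * (1.0 - s) * g[i] * g[k]
    return grad, hess

def fit_batch(G, C, z_bar, Q_bar, iters=30):
    """Newton iterations; returns the minimum point and half the Hessian there."""
    z = list(z_bar)
    for _ in range(iters):
        grad, hess = grad_hess(z, G, C, z_bar, Q_bar)
        z = [zi - si for zi, si in zip(z, matvec(inv(hess), grad))]
    _, hess = grad_hess(z, G, C, z_bar, Q_bar)
    return z, [[0.5 * h for h in row] for row in hess]

def propagate(z_tilde, Q_tilde, a_diag, u_diag):
    """Quadratic summary for the next step: z_bar = A z_tilde,
    Q_bar = (U^-1 + A Q_tilde^-1 A)^-1, with diagonal A and U."""
    m = len(z_tilde)
    Qi = inv(Q_tilde)
    S = [[a_diag[i] * Qi[i][k] * a_diag[k] + (1.0 / u_diag[i] if i == k else 0.0)
          for k in range(m)] for i in range(m)]
    return [a * z for a, z in zip(a_diag, z_tilde)], inv(S)

def online_filter(batches, C, a_diag, u_diag, ridge=1e-3):
    """Process batches of g-vectors one at a time; only (z_bar, Q_bar) is kept."""
    m = len(a_diag)
    z_bar = [0.0] * m
    Q_bar = [[ridge * (i == k) for k in range(m)] for i in range(m)]  # assumed weak prior
    estimates = []
    for G in batches:
        z_tilde, Q_tilde = fit_batch(G, C, z_bar, Q_bar)
        estimates.append(z_tilde)
        z_bar, Q_bar = propagate(z_tilde, Q_tilde, a_diag, u_diag)
    return estimates
```

Only the pair `(z_bar, Q_bar)` survives between batches, so memory use does not grow with the length of the stream. Plain Newton steps without a line search are adequate for small, well-conditioned batches; a damped variant would be a safer choice in general.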
4 Experimental results
In order to test our algorithm, we used the data set proposed by Street and Kim [17]. The data consist of 50,000 random points generated in a three-dimensional feature space. The three features take values in the range [0; 10), and only the first two features are relevant. The points were divided into 4 blocks with different concepts. In each block, a data point belongs to class 1 if $f_1 + f_2 \le \varphi$, where $f_1$ and $f_2$ are the first two features and $\varphi$ is a threshold value separating the two classes. Threshold values of 8, 9, 7, and 9.5 were used for the four data blocks. Class noise of 10% was then introduced into each block by randomly changing the class value of 10% of the instances.

To carry out the experiments, we selected 20,000 instances from this database. The test set comprised the other 30,000 instances. The parameters $\sigma$, $d$, $d'$ were chosen by minimum error after trial tests with different values: $\sigma = 3$, $d = 0.1$, $d' = 0.1$. Table 1 shows the results of the experiments for different batch sizes.
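A stream with the properties described above can be generated along the following lines. This is a sketch, not the original generator of [17]: it assumes uniformly distributed features, four blocks of equal length, labels coded as ±1, and noise applied by flipping each label independently with probability 0.1 (so the noisy share is 10% only on average). The function name and arguments are hypothetical.

```python
import random

def make_sea_stream(n_points=50000, thresholds=(8.0, 9.0, 7.0, 9.5),
                    noise=0.10, seed=0):
    """Generate (features, label) pairs with an abrupt concept change per block."""
    rng = random.Random(seed)
    block = n_points // len(thresholds)  # assumption: equally sized blocks
    stream = []
    for i in range(n_points):
        phi = thresholds[min(i // block, len(thresholds) - 1)]
        f = [rng.uniform(0.0, 10.0) for _ in range(3)]  # third feature is irrelevant
        y = 1 if f[0] + f[1] <= phi else -1
        if rng.random() < noise:  # class noise: flip the label
            y = -y
        stream.append((f, y))
    return stream
```

Splitting the returned list into consecutive chunks of 500 or 1000 instances gives batches of the sizes listed in Table 1.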
5 Conclusions
In this paper we presented a mechanism for pattern recognition in data streams under concept drift. The mechanism is based on the Bayesian approach to logistic regression, with the drift treated as a hidden Markov process and the estimates computed by approximate dynamic programming. The experiments indicate that the proposed method is applicable to the pattern recognition problem under concept drift.
References

1. Brzezinski, D.: Mining Data Streams with Concept Drift. Poznan University of Technology, 2010.
2. Polikar, R., Udpa, L., Udpa, S.S., Honavar, V.: Learn++: an incremental learning algorithm for supervised neural networks. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, Vol. 31, No. 4, 2001, pp. 497-508.
3. Mongillo, G., Deneve, S.: Online learning with hidden Markov models. Neural Computation, 20, 2008, pp. 1706-1716.
4. Bishop, C.M.: Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.

Table 1. The experimental results: SEA Database

Batch size                 500      1000
Classification error, %    21.17    21.12

5. Mizuno, J., Watanabe, T., Ueki, K., Amano, K., Takimoto, E., Maruoka, A.: On-line estimation of hidden Markov model parameters. Proceedings of the Third International Conference, Discovery Science. Lecture Notes in Computer Science, Vol. 1967, Springer, 2000, pp. 155-169.
6. Florez-Larrahondo, G., Bridges, S., Hansen, E.A.: Incremental estimation of discrete hidden Markov models based on a new backward procedure. Proceedings of the 20th National Conference on Artificial Intelligence, 2005, Vol. 2, pp. 758-763.
7. Yu-Shu, C., Yi-Ming, C.: Combining incremental hidden Markov model and AdaBoost algorithm for anomaly intrusion detection. Proceedings of the ACM SIGKDD Workshop on CyberSecurity and Intelligence Informatics, New York, USA, 2009, pp. 3-9.
8. Ulas, A., Semerci, M., Yildiz, O.T., Alpaydin, E.: Incremental construction of classifier and discriminant ensembles. Information Sciences, 179, 2009, pp. 1298-1318.
9. Kapp, M., Sabourin, R.R., Maupin, P.: Adaptive incremental learning with an ensemble of support vector machines. Proceedings of the 20th International Conference on Pattern Recognition, Istanbul, Turkey, 2010, pp. 4048-4051.
10. Elwell, R., Polikar, R.: Incremental Learning of Concept Drift in Nonstationary Environments. IEEE Transactions on Neural Networks, Vol. 22, Issue 10, 2011, pp. 1517-1531.
11. Widmer, G., Kubat, M.: Learning in the presence of concept drift and hidden contexts. Machine Learning, Vol. 23, 1996, pp. 69-101.
12. Bifet, A., Gavalda, R.: Learning from time-changing data with adaptive windowing. SIAM International Conference on Data Mining, 2007.
13. Wang, H., Fan, W., Yu, P.S., Han, J.: Mining concept-drifting data streams using ensemble classifiers. Proceedings of the Ninth ACM SIGKDD International Conference KDD'03. ACM Press, 2003, pp. 226-235.
14. Markov, M., Krasotkina, O., Mottl, V., Muchnik, I.: Time-varying regression model with unknown time-volatility for nonstationary signal analysis. Proceedings of the 8th IASTED International Conference on Signal and Image Processing. Honolulu, Hawaii, USA, August 14-16, 2006, paper 534-196.
15. Grewal, M.S., Andrews, A.P.: Kalman Filtering: Theory and Practice Using MATLAB. Wiley, 2008.
16. Turkov, P., Krasotkina, O., Mottl, V.: Bayesian Approach to the Concept Drift in the Pattern Recognition Problems. Proceedings of 8th International Conference on Machine Learning and Data Mining (MLDM 2012). Lecture Notes in Artificial Intelligence LNAI, Vol. 7376, Springer Verlag, 2012.
17. Street, W., Kim, Y.: A streaming ensemble algorithm (SEA) for large-scale classification. In: Proc. 7th Int. Conf. on Knowledge Discovery and Data Mining KDD-2001, 2001, pp. 377-382.