Ann Oper Res DOI 10.1007/s10479-008-0494-z

A framework of irregularity enlightenment for data pre-processing in data mining Siu-Tong Au · Rong Duan · Siamak G. Hesar · Wei Jiang

© Springer Science+Business Media, LLC 2008

Abstract Irregularities are widespread in large databases and often lead to erroneous conclusions with respect to data mining and statistical analysis. For example, many parameter estimation procedures produce considerable bias when significant irregularities are not properly handled. Most data cleaning tools assume one known type of irregularity. This paper proposes a generic Irregularity Enlightenment (IE) framework for dealing with the situation when multiple irregularities are hidden in large volumes of data in general and cross sectional time series in particular. It develops an automatic data mining platform to capture key irregularities and classify them based on their importance in a database. By decomposing time series data into basic components, we propose to optimize a penalized least square loss function to aid the selection of key irregularities in consecutive steps and cluster time series into different groups until an acceptable level of variation reduction is achieved. Finally, visualization tools are developed to help analysts interpret and understand the nature of the data better and faster before further data modeling and analysis.

Keywords Activity monitoring · Change point · Feature selection · LASSO · Outliers · Regression models

1 Introduction

Modern computing technologies allow recording all business activities, explicitly or implicitly, in enterprise data warehouses. For decades, service providers have invested extensively in

This work is supported by National Science Foundation Grant #IIS-0542881. S.-T. Au · R. Duan AT&T Research Labs, Florham Park, NJ, USA S.-T. Au e-mail: [email protected] S.G. Hesar · W. Jiang () Stevens Institute of Technology, Hoboken, NJ, USA e-mail: [email protected]


developing automatic activity monitoring and detection modules that apply operational intelligence and application integration technologies to alert businesses to activities and changes in the enterprise that may require action (Fawcett and Provost 1999). By analyzing business activities in real time, companies can make better decisions, more quickly address problem areas, and reposition the organization to take full advantage of emerging opportunities. Unfortunately, business rules change over time, and customer activities are obscure and hard to discover without an appropriate and robust analytical tool. As Jiang et al. (2007) pointed out, without properly handling these activities, it is difficult to develop a unified activity modeling and monitoring framework to track all customer activities such as churns and frauds.

On the other hand, besides the important business activities, automatic data collection systems often generate irregular data due to various technical and managerial reasons. For example, errors due to data entry mistakes, faulty sensor readings, or more malicious activities produce erroneous data sets that propagate errors into each successive generation of data. The widespread irregular observations hidden in databases, herein referred to as irregularities, often interfere with correct interpretation of the gathered data and lead to erroneous conclusions with respect to business analysis and decisions. Many statistical analysis and data mining procedures produce considerable bias when abnormal observations are not properly handled. The irregularity issue becomes more and more severe as the size of the database and the dimension of the problem increase. As an estimate, the cost associated with making decisions based on poor-quality data can be as much as 8 to 12% of the revenue of a typical organization, and more informally is speculated to be 40 to 60% of a service organization's expense (Redman 1998).

In order to exploit customer activities and evaluate customer value and risk, it is crucial for companies and organizations to establish a practical tool set that can both clean large business databases and spot irregular observations, whether they are due to poor quality or to some business related activity. Irregularity enlightenment (IE) will free business analysts from ignorance, prejudice, or superstition about irregular observations by extracting concealed knowledge of the irregularities and ultimately help them better understand the real business. More importantly, beyond highlighting individual irregular observations, it is of considerable interest for a company to investigate the individual and composite effects of different types of irregularities on strategic and tactical decisions. By identifying and grouping similar irregularities together, it is possible to understand the distribution of irregularities as a whole so that their corresponding root causes can be discovered and treated appropriately. The irregularity enlightenment modules will help evaluate the importance of major activity patterns and quantify their business impacts on the company's health so that appropriate business programs can be built to reduce the risk of wrong decisions and improve the company's profitability.
For example, if the quality of data is the major concern in business analysis, data quality management systems should be launched to improve the visibility of business data; if churns are identified as a significant business driver, effective customer loyalty programs should be built in short order. Unfortunately, most data modeling and cleaning tools assume one known type of irregularity. For example, outliers, which are observations that lie outside the overall pattern of a distribution, have been extensively studied in statistical analysis such as regression (Rousseeuw and Leroy 1987). However, most of the literature assumes that outliers come from a common distribution. Similarly, change points refer to points in a time series after which the structure of the time series changes. To reduce variation in manufacturing processes, many statistical process control (SPC) tools have been developed to detect process changes (Montgomery 2004). Nonetheless, most statistical quality control methods


assume clean process data before change point detection. The following section provides a brief review of relevant techniques related to IE.

The objective of this paper is to propose a generic IE framework to extract and describe multiple, different types of irregularities from large amounts of data related to historical business activities. In particular, we develop an automatic data mining platform to capture key irregularities in large volumes of cross sectional time series one by one based on their importance. The platform can serve as a basis for other online monitoring and forecasting methods and improve the quality of these business intelligence modules. By decomposing historical time series data into primitive components, we propose to optimize a penalized least square loss function to aid the selection of key irregularities in consecutive steps and cluster time series into different groups until an acceptable level of variation reduction is achieved. Visualization tools are developed to help analysts interpret and understand the nature of the data better and faster before further data modeling and analysis. Time series irregularities considered here include ramps, steps, and impulses, which represent the most interesting events in business applications (Wilson and Keating 2001). Galati and Simaan (2006) have proposed a decomposition algorithm—ALESDA—to extract these irregularities from a single time series. As illustrated later in this paper, the ALESDA algorithm is not efficient when the number of features is not small. This limitation is very significant when a large amount of customer time series data needs to be mined for interesting events. It motivates us to use a more efficient feature selection algorithm—the Least Absolute Shrinkage and Selection Operator (LASSO)—to extract important features from multiple data streams. Although neither time series decomposition nor the LASSO algorithm is new, their marriage for multiple data stream analysis has not been investigated. More importantly, the marriage enables efficient classification of multiple data streams according to the extracted features, which is very attractive to business management for activity monitoring (Jiang et al. 2007).

The remainder of the paper is organized as follows. Section 2 reviews major anomaly detection methods related to irregularity enlightenment in time series. Section 3 discusses a generic IE framework based on a data-decomposition approach. Section 4 introduces the LASSO method for variable selection and Sect. 5 discusses a step-by-step procedure to implement the IE framework. Section 6 investigates the performance of the proposed framework with some simulation experiments and an industrial example. Concluding remarks are presented in Sect. 7. The Appendix provides a brief summary of the LASSO computation algorithm.

2 Related work

In general, implementing irregularity enlightenment depends on the types of irregularities of interest. To identify different irregularities, much related research has been developed in both theory and practice. This section provides a brief review of outlier detection and change point detection methods, as they are widely used for anomaly detection and closely related to irregularity enlightenment.

2.1 Outlier detection methods

Bad data are often outliers in a database. Effective outlier detection requires the construction of a model that accurately represents the data. Statistical model-based outlier detection methods, where the data is assumed to follow a parametric (typically univariate) distribution, generally do not work well in even moderately high-dimensional (multivariate) spaces


(Barnett and Lewis 1994). In fact, choosing an appropriate working model is often a difficult task in its own right. A non-parametric approach for discovering outliers uses distance metrics and defines a point to be a distance outlier if at least a user-defined fraction of the points in the data set are further away than some user-defined minimum distance from that point (Knorr and Ng 1998, 1999; Knorr et al. 2000; Bay and Schwabacher 2003). A critical issue related to such distance-based methods is the arbitrariness of the many user-supplied quantities, which often require extensive human interaction and several iterations to determine an outlier. Breunig et al. (2000) develop another non-parametric approach based on density. A local outlier factor (LOF) is computed for each point as the ratio of the local density of the area around the point and the local densities of its neighbors. The size of the neighborhood of a point is determined by the area containing a user-supplied minimum number of points (MinPts). Since it is difficult to choose values for MinPts, Papadimitriou et al. (2003) propose the local correlation integral (LOCI), which uses statistical values derived from the data itself. However, both the LOF- and LOCI-based approaches do not scale well with a large number of attributes and data points. Most importantly, the above outlier detection algorithms do not deal with outliers in time series.

For time series, the majority of outlier detection methods are concerned with the location (e.g., the mean) and the scatter (e.g., variance/covariance) in the presence of outliers (Rousseeuw and Leroy 1987). When significant autocorrelation exists in a customer transaction database, maximum likelihood based outlier detection methods have been widely investigated assuming data generating models (e.g., Bianco et al. 1996; Bianco et al. 2001; Chen and Liu 1993; Tsay 1988, 1996). While most of the outlier detection methods are offline, Liu et al. (2004) propose a sliding-window algorithm for online data cleaning using a modified Kalman filter based on autoregressive (AR) models. Again, the critical prerequisite is a statistical model that can describe the majority of time series in customer databases.

2.2 Change point detection methods

Change point detection methods such as Shewhart charts, Cumulative Sum (CUSUM) charts, and Exponentially Weighted Moving Average (EWMA) charts assume that the data are temporally uncorrelated, are stationary, and follow a normal distribution. Statistical theory has proven that these classical charts are most efficient at detecting an anomaly pattern of a certain type (Moustakides 1986). While these methods are generally used for online monitoring, serial correlation is common in business databases. To accommodate autocorrelation in change point detection, forecast-based methods have been proposed that fit time series models (Alwan and Roberts 1988; Harris and Ross 1991; Montgomery and Mastrangelo 1991; Wardell et al. 1994; Jiang et al. 2002). However, the forecast-based methods have been found inefficient depending on the serial correlations and fault properties. Assuming an underlying stochastic model, the generalized likelihood ratio test (GLRT) has been investigated recently by Vander Wiel (1996), Apley and Shi (1999), and Jiang (2004). The GLRT-based methods not only rely on the underlying model but also depend on the fault signature, and may thus suffer when the actual fault does not match the pre-defined signature.
Motivated by simulation research, Runger and Willemain (1995, 1996) use a weighted batch mean (WBM) and a unified batch mean (UBM) to monitor autocorrelated data. The WBM method weighs the mean of observations, defines batch size so that autocorrelation among batches is reduced to zero and requires knowledge of the underlying process model. The UBM method determines batch size so that autocorrelation remains below a certain level and is model free.
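As a minimal illustration of the classical charts reviewed at the start of this subsection, the sketch below implements a one-sided tabular CUSUM for an upward mean shift. It is not taken from the paper; the reference value k and decision interval h are illustrative choices, and the chart assumes independent observations, which is exactly the assumption that the batch-means ideas above try to restore for autocorrelated data.

```python
# Minimal one-sided tabular CUSUM sketch for an upward mean shift.
# k (reference value) and h (decision interval) are illustrative, not from the paper.
import numpy as np

def cusum_upper(x, mu0=0.0, k=0.5, h=5.0):
    """Return the first index where the upper CUSUM statistic exceeds h, or None."""
    c_plus = 0.0
    for t, xt in enumerate(x):
        c_plus = max(0.0, c_plus + (xt - mu0) - k)
        if c_plus > h:
            return t
    return None

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0, 1, 50), rng.normal(1.5, 1, 50)])  # mean shifts up at t = 50
print("alarm at index:", cusum_upper(x))
```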


In order to detect multiple changes, likelihood approaches include the sequential methods (Basseville and Nikiforov 1993) and the global methods based on BIC and penalized contrasts (Lavielle 1999, 2005). While the optimization of the likelihood is often computationally intensive, fitting segmented line regression models has been extensively discussed. Lerman (1980) examines a grid search method for possible locations of the change points when fitting a segmented regression curve. Hawkins (2001) proposes dynamic programming algorithms to fit segmented curves. Besides the computational load, these methods are not general enough to accommodate patterns of unknown nature, especially when outliers may contaminate the series.

2.3 Anomaly detection methods in time series with both change points and outliers

The aim of this paper is to discover irregularities, including both change points and outliers, in time series databases. To detect aberrations of unknown nature involving outliers, discrete wavelet transformations (DWT) can be used to decompose data streams into multiple scales and then monitor the different scales for aberrations (Bakshi 1999). Nounou and Bakshi (1999) propose a data filtering method using wavelet thresholding and a finite impulse response (FIR) median hybrid (FMH) filter to reconcile outliers and process shifts. Such methods are advantageous because they are suitable for series that exhibit autocorrelation, and even nonstationarity. However, downsampling, boundary extrapolations, and multiple testing have to be addressed. Most importantly, the wavelet coefficients do not have physical interpretations of business activities that are transparent to business users, and thresholding the coefficients depends on knowledge of particular customer records. A general review of anomaly detection algorithms for time series can be found in Cooper et al. (2003), which includes many methods for detecting a single type of irregularity as discussed above. A workshop devoted to anomaly detection was also held in 2005, where anomaly detection methods for time series received considerable interest (Mahoney and Chan 2005; Shmueli 2005). These methods often involve a training data set that is used to learn a baseline model of normal behavior and to identify significant deviations from the baseline model when new data becomes available in an online fashion (Chan and Mahoney 2005). When the training data is not labeled and the historical data itself contains irregularities, it is hard to derive a baseline model that efficiently captures the normal process. Keogh et al. (2004) and Wei et al. (2005) introduce an assumption-free anomaly detection method based on bitmaps of time series—a symbolic representation of numerical variables. However, discretization of continuous values may result in the loss of important information and undermines the effectiveness of the bitmap model. Recently, Galati and Simaan (2006) propose to decompose a time series into impulse, step, and ramp signals buried in noise. The three types of signals represent outliers and different types of change points and carry meaningful process information. They adopt an exhaustive search approach and develop a sliding-window algorithm for long time series data. Unfortunately, the proposed method is very time consuming even for moderate lengths of time series. More importantly, when millions of data streams are analyzed simultaneously, their method is prohibitively expensive for extracting features that can be used to cluster streams with similar behaviors.
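To make the DWT-based idea above concrete, the short sketch below denoises a series by soft-thresholding its wavelet detail coefficients. PyWavelets is an assumed stand-in library, and the wavelet, decomposition level, and threshold are illustrative choices rather than values from the cited work.

```python
# Hedged sketch of wavelet thresholding for separating noise from structure, assuming PyWavelets.
# The wavelet, level, and threshold are illustrative, not those used in the cited papers.
import numpy as np
import pywt

rng = np.random.default_rng(0)
t = np.arange(128)
signal = np.where(t >= 64, 2.0, 0.0) + rng.normal(0, 0.3, 128)   # a step change buried in noise

coeffs = pywt.wavedec(signal, "db4", level=4)
threshold = 0.5
denoised_coeffs = [coeffs[0]] + [pywt.threshold(c, threshold, mode="soft") for c in coeffs[1:]]
denoised = pywt.waverec(denoised_coeffs, "db4")

print(denoised.shape)   # reconstruction aligned (up to boundary handling) with the input
```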
By developing a data-decomposition framework for IE, this paper proposes the use of LASSO algorithm in particular to identify significant components. Although one of the purposes of this paper is to group similar time series, our clustering approach is fundamentally different from many previously published time series clustering methods. Oates et al. (1999) introduce hidden Markov models to group multiple time series. Guha et al. (2000) develop a streaming algorithm and use a k-median algorithm for


clustering time series. However, as Keogh et al. (2003) commented on these time series clustering methods, clustering of data streams is "completely meaningless" since "clusters extracted from streaming time series are forced to obey a certain constraint that is pathologically unlikely to be satisfied by any dataset, and because of this, the clusters extracted by any clustering algorithm are essentially random." In contrast, our clustering method is essentially supervised when grouping time series together. The objective of our time series clustering is to evaluate the variation that can be explained by each group of time series.

3 A data-decomposition framework for irregularity enlightenment

One of the fundamental issues for data mining and business analysis tasks is to quantitatively describe the underlying relationships in a database. A basic assumption for any statistical database is that the data is generated by certain systematic patterns plus random noise. Based on this assumption, a broad range of statistical and mathematical models have been developed in order to capture the systematic portion of the available data, while the noise portion is simply the leftover after all the systematic components are accounted for. For example, feature extraction is a very popular method that aims to describe the observed data with a set of dominant components or features. It is essentially a pre-processing data mining step to identify relevant features that lead to better, faster, and easier understanding of data (Bishop 2006). The terms "components" and "features" have been broadly interpreted in the literature and are used interchangeably in this paper. It is very popular in business analysis to decompose a time series into trend, seasonality, and random errors, i.e., Data = Trend + Seasonality + Random error, where trend is defined as a long term movement in a time series that usually dominates the underlying direction and rate of change, while the seasonal component describes time series variations that depend on the time with a cyclical pattern (Wilson and Keating 2001). For the customer profiles of the telecommunication industry considered in this paper, the basic systematic pattern can be characterized by linear or nonlinear trends. Although only linear models are considered for time series profiles here, Qian et al. (2006) investigate nonlinear profile models based on splines. The current framework can be easily extended to nonlinear profiles as well. Note that, because the time series are short, seasonality is omitted from our feature extraction model for simplicity. Nonetheless, feature extraction methods such as Fourier transforms are suitable to describe cyclical patterns, and it is straightforward to include seasonal patterns if the time series data is not too short.

Although the "Systematic components + Random errors" decomposition is very successful in exploring dominant patterns for many business applications, as discussed in Sect. 1, business data contains a lot of valuable information that cannot be categorized and modeled as a systematic pattern. In some cases these missing components may imply a tremendous loss of revenue or business opportunity. This fact raises the concern that the structure of the data decomposition model needs to be revised to reflect major unexpected changes that are not essentially random errors. Motivated by data irregularities, we propose the following generic decomposition to represent data structures,

Data = Systematic components + Irregularity components + Random error,     (1)

where an irregularity component refers to the new concept of abnormal data: an unexpected element that is not a main feature of the dataset but carries a significant information load that needs to be modeled and analyzed. Specifically, if simple linear relationships are assumed for all systematic and irregular features, (1) can be represented as

Data = α0 + α1 S1 + · · · + αr Sr + β1 I1 + · · · + βk Ik + ε,     (2)

where the Si's represent systematic components such as trend in a time series and the Ij's represent irregular components such as outliers. ε is the random error, which represents the data variation that cannot be captured by either systematic or irregular components. Although the above linear simplification is not perfect, it is quite accurate in practice for describing most business datasets. To take into account both systematic components such as trend and irregular components such as outliers and change points in time series, Galati and Simaan (2006) introduce the following three basic functions for time series decomposition:

Ramp:     f_r(t) = { 0, t < 0;   t, t ≥ 0 },
Step:     f_s(t) = { 0, t < 0;   1, t ≥ 0 },
Impulse:  f_i(t) = { 1, t = 0;   0, t ≠ 0 },

where t represents time, along which a total of N consecutive data points of the time series are recorded. These primitives can be extended to include a time delay τ, so that the general representation f_k(t − τ) (τ = 1, 2, . . . , N, k ∈ K = {r, s, i}) covers the three types of primitives. As an illustration, Fig. 1 depicts the three primitive functions with time delay τ. It is easy to see that the impulse function represents outliers, the ramp function represents trend, and both the ramp and step functions represent change points. Specifically, a trend can be characterized by S1 = f_r(t). An outlier that happens at time t_o can be characterized by an impulse component with delay τ = t_o, i.e., I1(t_o) = f_i(t − t_o). Change points represent changes of the time series structure, which may be a sudden change in the mean of the sequence and/or a sudden change in the slope. Suppose the change time is t_c; the former can be represented by a step component with delay τ = t_c, i.e., I2(t_c) = f_s(t − t_c), while the latter can be represented by a ramp component, i.e., I3(t_c) = f_r(t − t_c). If both the mean and the slope of the time series change at the same time, a linear combination of I2(t_c) and I3(t_c) is sufficient for characterization. Based on these representations, following (2), it is easy to see that any

Fig. 1 Basic component functions with delay τ


time series Y(t) with length N can be decomposed as

Y(t) = α0 + α1 f_r(t) + Σ_{τ=1}^{N} β_{1τ} f_i(t − τ) + Σ_{τ=1}^{N−2} β_{2τ} f_s(t − τ) + Σ_{τ=1}^{N−2} β_{3τ} f_r(t − τ) + ε(t).     (3)

This decomposition gives us a lot of flexibility in terms of data modeling and also provides more interpretable and tangible results in terms of business analysis. It is important to note that there are in total 3N − 3 different components, one systematic component and 3N − 4 irregular components, which are strongly linearly correlated with each other. In order to limit the number of components included in the regression model (3), Galati and Simaan (2006) suggest an exhaustive search approach called ALESDA to select the most significant features given an acceptable percentage of variation that the model can capture. However, a major problem is that the computational order of the problem increases exponentially as the length of the time series increases. When dealing with large amounts of customer time series, it is important to have an algorithm that identifies significant irregularities effectively, while it may be acceptable to slightly sacrifice efficiency. The next section proposes an efficient feature selection algorithm based on LASSO regression for the purpose of implementing the irregularity enlightenment framework.
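To make the decomposition in (3) concrete, the sketch below assembles the N × (3N − 3) design matrix of trend, impulse, step, and ramp primitives with the delay ranges given in (3). It is an illustration of the construction described above rather than code from the paper, and the helper names are ours.

```python
# Sketch: assemble the primitive design matrix implied by decomposition (3).
# Column 0 is the systematic trend f_r(t); the remaining 3N - 4 columns are the
# delayed impulse, step, and ramp irregular components.
import numpy as np

def ramp(s):
    return np.where(s >= 0, s, 0.0)

def step(s):
    return np.where(s >= 0, 1.0, 0.0)

def impulse(s):
    return np.where(s == 0, 1.0, 0.0)

def primitive_matrix(N):
    t = np.arange(1, N + 1)                                 # time index t = 1, ..., N
    cols = [ramp(t)]                                        # systematic trend
    cols += [impulse(t - tau) for tau in range(1, N + 1)]   # N impulse components
    cols += [step(t - tau) for tau in range(1, N - 1)]      # N - 2 step components
    cols += [ramp(t - tau) for tau in range(1, N - 1)]      # N - 2 delayed ramps
    return np.column_stack(cols)                            # shape (N, 3N - 3)

X = primitive_matrix(24)
print(X.shape)   # (24, 69), i.e. 3 * 24 - 3 columns for 24-month profiles
```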

4 Feature selection for irregularity identification

In order to efficiently solve the linear regression problem in (3) and automatically pick the most significant features that best fit the data without over-fitting, we propose to utilize the LASSO regression method for feature selection. Feature selection, also called subset selection or variable selection, has been studied extensively in various fields of pattern recognition, statistical learning, and data mining. It is necessary when there are many predictive variables that could be included (or would be efficient to include) in the model. It is also necessary when only a limited number of observations (but a large number of features) are available. The latter problem is related to the so-called curse of dimensionality (Bellman 1961), which refers to the fact that the number of data samples required to estimate some arbitrary multivariate probability distribution increases exponentially as the number of dimensions in the data increases linearly. Feature selection in general will significantly improve the comprehensibility of the resulting models and can even produce a model that generalizes better to unseen points (Kim et al. 2003). The main idea of feature selection is to choose a subset of input variables and eliminate variables (features) with little or no predictive information. Several methods have been developed in various disciplines to perform the feature selection task. Hastie et al. (2003) and Fan and Li (2006) provide an extensive review and comparison of different feature selection methods. For example, LASSO, first introduced by Tibshirani (1996), is a member of the family of penalized least squares (PLS) estimators. Efron et al. (2004) introduce an efficient and fast feature selection method called Least Angle Regression (LARS) and show that it can be easily modified to include LASSO and forward stagewise linear regression models. Consider a linear regression model as follows,

Y = Xβ + ε,     (4)


where Y is the response vector, ε is the random error with mean 0 and variance σ², X_{n×p} is the predictor matrix, and β = (β1, . . . , βp) is the coefficient vector. In order to remove the intercept element, normalization can be conducted. The general penalized least squares (PLS) problem can be formulated as

min_β ||Y − Xβ||² + L(β),     (5)

where L(·) is called the penalty function (or loss function in some of the literature). Several modifications of PLS have been developed by introducing different penalty functions. The choice of the L1 penalty (absolute value function) results in the LASSO method with the following formulation:

β^LASSO = arg min_β { ||Y − Xβ||² + λ Σ_{i=1}^{p} |βi| },     (6)

where β^LASSO is the LASSO coefficient vector and λ is the tuning parameter. Tibshirani (1996) originally formulated the LASSO regression as the following quadratic programming problem:

β^LASSO = arg min_β ||Y − Xβ||²   s.t.   Σ_{i=1}^{p} |βi| ≤ t,     (7)

where he called t ≥ 0 the tuning parameter. Problems (6) and (7) are equivalent, and there is a one-to-one relationship between λ and t. For example, it is apparent that λ → +∞ corresponds to t → 0 and λ = 0 corresponds to t → ∞. Tibshirani (1996) showed that LASSO is a very efficient variable selection method which tends to produce some coefficients that are exactly 0 and hence results in an interpretable solution. With an appropriate choice of tuning parameter, LASSO simultaneously provides an accurate and sparse model, which makes it a favorable variable selection method (Zou et al. 2006).

Consider a database Y_{N×m} containing m different customer profiles, where N is the number of observations or the length of the transaction time series for each customer. Y_{·j}, the j-th column of this matrix, represents the j-th customer with N observations. For the case of time series, the database represents a set of m cross sectional time series of length N. Denote the feature matrix by X_{N×p}. For example, in the case of modeling time series, it can contain the set of all possible basic components of ramps, steps, and impulses along the N data points. In this case every column of X is an individual component and p = 3N − 3. Then we can write the LASSO model as follows:

β_{·j}^LASSO = arg min_{β_{·j}, α0, α1} ||Y_{·j} − α0 − α1 X_{·1} − Σ_{i=2}^{3N−3} X_{·i} β_{ij}||²   s.t.   Σ_{i=2}^{3N−3} |β_{ij}| ≤ t,   j = 1, . . . , m.     (8)

The solution of this model results in the coefficient matrix β_{p×m}, which is sparse if the tuning parameter is chosen properly. Note that we purposely leave the coefficient of the first feature, the trend (a systematic component), unconstrained and only limit the number of irregular components in our model, since we believe that a trend is inherent in all business transaction databases. To implement LASSO in an automatic fashion and speed up the process, we


propose to utilize the modified LARS algorithm, which is briefly discussed in the Appendix. With a pre-specified number of predictors k, LARS is a stepwise variable selection method and, with a minor modification, solves the entire LASSO solution path in the same order of computations as a single least squares regression (Efron et al. 2004).

We now illustrate the implementation of this method for time series decomposition using the numerical examples of customer profiles discussed in Jiang et al. (2007). Before fitting the linear regression model (3), the customer profiles are first re-scaled to [−1, 1] for standardization. We utilize the LARS-LASSO program code developed by Sjöstrand (2005) in Matlab to select the most significant irregularities in the customer time series profiles. Note that the LASSO algorithm is only used for identifying irregular components, since the LASSO estimator is actually a shrinkage estimator. The unbiased coefficient estimate β̂_{kτ} is obtained by fitting an ordinary least squares (OLS) regression using the irregular components identified by the LASSO algorithm. Although the time series has been standardized before modeling, the significance of the estimated coefficients depends on the volatility that cannot be explained by the systematic and irregular components. In order to demonstrate the significance of irregularities, the following significance index is defined for each coefficient β̂_{kτ} and plotted against the types and time delays,

λ_{kτ} = sign(β̂_{kτ})(1 − p_{β̂_{kτ}}),

where p_{β̂_{kτ}} is the p-value of the coefficient estimate β̂_{kτ}. The significance index has the same sign as β̂_{kτ} to reflect the direction of the impact of the irregularity. It is analogous to the p-value in the linear regression model and loosely measures the confidence that the identified irregularity component is significant. Practically, the value of λ_{kτ} can be plotted against a pre-defined threshold, e.g., 90%, to gauge the significance/importance of the irregularities. However, the interpretation of λ for a particular irregularity component has to be cautious in the presence of other identified irregularities due to the colinearity among the predictors. Moreover, the significance index is also influenced by over-fitting, which is commonly encountered in problems where the number of predictors is much larger than the number of observations. Therefore, the significance index plots are not used for measuring the absolute significance of identified irregularities but rather serve as a reference for visualization.

Example 4.1 Figure 2 presents 9 different customer profiles with the most important irregular component. Each panel presents the scaled data along with the OLS fit in the top row, followed by three plots of the significance indices of all irregularity components in (3) with respect to impulses, steps, and ramps. It is interesting to note from the top panel that the LASSO algorithm correctly identifies the outlier and change point irregularities as the most significant ones in the 2nd and 3rd examples, respectively. These observations conform with the previous statement made in Jiang et al. (2007). Note that, even though the previous authors missed an outlier in the first example, it is identified by the LASSO algorithm as the most significant irregularity since the number of irregularities is assumed to be 1.
Nonetheless, the significance index is only λ = 0.87, which is not large enough to be deemed significant relative to the random noise; i.e., the outlier may not be very important from a business standpoint, depending on the user's preference. The middle and bottom panels show that all the remaining examples are mainly disturbed by step changes.

Example 4.2 Similarly, Fig. 3 presents the same examples when k = 3 irregularity components are allowed in the model. Besides the most significant irregularities found in Fig. 2, more irregularities are discovered, while some of them may not be significant enough, such

Fig. 2 Real data example for using LASSO to extract the first most significant irregularity

as the least significant impulse in the first example and the insignificant change point in the second example. The pre-defined threshold may help eliminate these insignificant irregularities. It is important to note that, with a larger value of k and more irregularity components, the significance of previously identified irregularities generally becomes stronger since the

Fig. 3 Real data example for using LASSO to extract the first three significant irregularities

remaining variation following the linear regression fits gets smaller. This fact may magnify the significance of newly discovered irregularities as well and results in too many significant


irregularities, i.e., all significance indices become inflated when k gets large. This shows that using significance indices alone may lead to overestimating the number of irregular components in practice. As illustrated in these examples, increasing the number of significant irregularities may be coupled with excessive variation reduction. In general, a trade-off is necessary for determining the appropriate number of significant components. Nevertheless, the LARS-LASSO algorithm allows the computation of the entire variable selection path in a single iteration, so the trade-off among variable selections can be evaluated quickly in practice. In the following IE implementation, we propose to use a goodness-of-fit criterion to determine the appropriate number of irregularities.
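Before turning to the full IE procedure, the sketch below strings the pieces of this section together for a single profile: a LASSO step to identify irregular components, an OLS refit for unbiased coefficients, and the significance index computed from the refit. scikit-learn and statsmodels are assumed stand-ins for the Matlab LARS-LASSO code cited above, the penalty weight is illustrative, and regressing the trend out before the LASSO step is a simplification rather than the authors' exact treatment of the unpenalized trend.

```python
# Sketch: LASSO-based irregularity identification for one profile, followed by an OLS
# refit and the significance index lambda = sign(beta_hat) * (1 - p_value).
# The penalty alpha and the trend handling are illustrative assumptions.
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import Lasso

def primitive_matrix(N):
    """Trend plus delayed impulse/step/ramp columns, as in decomposition (3)."""
    t = np.arange(1, N + 1)
    ramp = lambda s: np.where(s >= 0, s, 0.0)
    cols = [ramp(t)]
    cols += [(t == tau).astype(float) for tau in range(1, N + 1)]
    cols += [(t >= tau).astype(float) for tau in range(1, N - 1)]
    cols += [ramp(t - tau) for tau in range(1, N - 1)]
    return np.column_stack(cols)

def fit_profile(y, alpha=0.05):
    N = len(y)
    X = primitive_matrix(N)
    trend = sm.add_constant(X[:, 0])            # intercept + systematic trend
    irregular = X[:, 1:]                        # 3N - 4 candidate irregular components

    # Identification step: LASSO on the trend-adjusted profile (trend left unpenalized
    # by regressing it out first -- a simplification).
    resid = y - trend @ np.linalg.lstsq(trend, y, rcond=None)[0]
    lasso = Lasso(alpha=alpha, fit_intercept=False, max_iter=10000).fit(irregular, resid)
    selected = np.flatnonzero(lasso.coef_)

    # Estimation step: unbiased OLS refit on the trend plus the identified components.
    design = np.column_stack([trend, irregular[:, selected]])
    refit = sm.OLS(y, design).fit()
    significance = np.sign(refit.params) * (1.0 - refit.pvalues)
    return selected, refit.params, significance

rng = np.random.default_rng(2)
y = 0.05 * np.arange(1, 25) + rng.normal(0, 0.1, 24)
y[11:] += 1.0                                   # a step change starting at t = 12
print(fit_profile(y)[0])
```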

5 Implementation of the IE framework

In this section, we propose an IE algorithm to automatically select major irregularities from large amounts of data and group similar irregularities together, so that the distribution of irregularities inside a database can be well understood. Before presenting the algorithm we first introduce some notation.

– k is the number of allowed irregular components for each customer. It also indicates the current step of the algorithm.
– Ŷ_ik is the fitted model for the i-th customer profile with k significant components.
– T_l is a pre-determined local threshold that controls the unexplained variation for each customer's profile, while T_g is a global threshold for controlling the total unexplained error.
– Q_ik = (# of outliers, # of change points) is a tuple of irregularity indices consisting of two elements that give the number of each type of significant irregularity for customer i in step k. Furthermore, A_k = ∪_i {Q_ik} is the set of possible tuples at step k.
– L_a is the class of tuple a, where a ∈ A_k ∪ {∅}. L_∅ is the class of customers with no significant irregularities.
– SST_i = ||Y_i − Ȳ_i||² is the sum of squares of treatment for the i-th customer, where Ȳ_i is the mean of time series Y_i. SST = Σ_{i=1}^{m} SST_i is the global sum of squares of treatment.
– SSE_ik = ||Y_i − Ŷ_ik||² is the unexplained variation of the i-th customer at step k (sum of squares due to error).
– SSE_k = Σ_{a∈A_k} Σ_{Q_ik=a} SSE_ik + Σ_{Q_ik=∅} SST_i is the global unexplained variation in the k-th step after classification. It is partitioned based on the irregularity classes.
– ξ_{1a} = (SST_a / SST)% is the percentage of the sum of squares of treatment within the class of tuple a over the global sum of squares of treatment, where SST_a = Σ_{Q_ik=a} SST_i. This statistic describes the percentage of the total variation that is classified into irregularity type a.
– ξ_{2a} = ((SST_a − SSE_ak) / SST)% is the percentage of the total variation explained by the class of tuple a, where SSE_ak = Σ_{Q_ik=a} SSE_ik is the amount of unexplained variation within the class of tuple a in the k-th step. ξ_{2a} is in fact the goodness-of-fit statistic.

The local and global thresholds basically represent the acceptable level of variation left unexplained by the systematic and irregular components in the individual and overall senses, respectively. The choice of the global threshold T_g generally depends on how closely practitioners would like to understand the database, e.g., 0.1 ≤ T_g ≤ 0.3. The local threshold T_l, on the other hand, controls the speed of the breakdown when classifying customer profiles. The smaller T_l is, the slower the breakdown of all profiles and the convergence of the classification process, while a more detailed classification may be achieved. In practice, it is

Ann Oper Res

customary to set T_l = T_g, since if the unexplained variation for every customer is less than T_l, the overall unexplained variation will consequently be smaller than T_g.

Example 5.1 We now use the numerical example presented in Example 4.2 to illustrate the classification of profiles. The first time series in this example is recognized to have three outliers and no change points. With the above notation we would write Q_13 = (3, 0) (note that k = 3). The second series, however, has two outliers and one step change, so Q_23 = (2, 1). Similarly, the other recognized sets are Q_33 = (0, 3), Q_43 = (0, 3), Q_53 = (1, 2), Q_63 = (0, 3), Q_73 = (0, 3), Q_83 = (1, 2), and Q_93 = (0, 3). As there are often several profiles with the same tuple, the set of distinct types is A_3 = {(3, 0), (2, 1), (1, 2), (0, 3)}. This indicates that there are 4 different irregularity types inside the dataset. Finally, the class label of type (3, 0), for instance, is represented as L_(3,0).

The IE implementation algorithm can be presented as follows:

1. Set the local and global thresholds, T_l and T_g;
2. Calculate SST_i for all customers and the overall SST;
3. Start from k = 0 (i.e., assume that the database only contains systematic patterns and no irregularity);
4. Fit the linear regression model (3) using the LARS-LASSO algorithm with k predictors for each customer; calculate Ŷ_ik;
5. Identify the irregularity types and numbers and determine the irregularity index tuple Q_ik for each customer; sort {Q_ik} to obtain the whole class set A_k;
6. Calculate SSE_ik with respect to the fitted model Ŷ_ik for each customer;
7. If SSE_ik / SST_i < T_l, then assign the i-th customer to the class L_Q_ik; otherwise assign it to the class L_∅;
8. Repeat step 7 for i = 1, . . . , m;
9. Calculate SSE_k;
10. If SSE_k / SST < T_g, then stop: the algorithm has reached an acceptable classification. Otherwise set k = k + 1 and repeat the process from step 4.
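A compact sketch of the loop above is given below. The fit_profile(y, k) callable is assumed to return the fitted values and the (# of outliers, # of change points) tuple for the k most significant irregularities; a LASSO-based fitter like the one sketched at the end of Sect. 4 can be adapted to play that role. The thresholds mirror T_l and T_g, and the code is an illustration rather than the authors' implementation.

```python
# Sketch of the step-by-step IE procedure of Sect. 5.
# fit_profile(y, k) is assumed to return (y_hat, q) with q = (#outliers, #change points)
# among the k selected irregularities; any LASSO-based fitter can play that role.
import numpy as np
from collections import defaultdict

def ie_classify(profiles, fit_profile, T_l=0.1, T_g=0.1, k_max=10):
    sst = np.array([np.sum((y - y.mean()) ** 2) for y in profiles])   # SST_i
    for k in range(k_max + 1):
        classes = defaultdict(list)              # tuple Q_ik -> customer indices
        sse_k = 0.0
        for i, y in enumerate(profiles):
            y_hat, q = fit_profile(y, k)
            sse = np.sum((y - y_hat) ** 2)       # SSE_ik
            if sse / sst[i] < T_l:               # local threshold satisfied
                classes[q].append(i)
                sse_k += sse
            else:                                # unexplained class L_empty
                classes[None].append(i)
                sse_k += sst[i]
        if sse_k / sst.sum() < T_g:              # global threshold reached: stop
            return k, dict(classes)
    return k_max, dict(classes)
```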

6 Performance analysis

In this section, we use numerical examples to illustrate the implementation of the proposed IE framework for business data analysis. In order to validate the framework, we first investigate the performance of the proposed LASSO algorithm for irregularity identification using a set of simulated data. We then use an industrial example to classify time series profiles into different categories and explain the variations inside the customer database.

6.1 Simulation examples

The success of the IE framework critically depends on the scalability and efficiency of the LASSO algorithm proposed for variable selection. In the following simulation experiment, we compare the mean square error (MSE) and computational performance of the proposed LASSO algorithm with the ALESDA algorithm proposed by Galati and Simaan (2006). To make a sound comparison, the ALESDA algorithm is implemented in Matlab without the sliding window, since the time series is not too long. To find a specified number of important irregularities k in a time series, the ALESDA algorithm uses an exhaustive

Table 1 Comparison of LASSO and ALESDA algorithms (the numbers shown in parentheses indicate the MSE and CPU time in seconds)

              n = 1                              n = 2                              n = 3
        LASSO           ALESDA             LASSO           ALESDA             LASSO           ALESDA
k = 1   (0.165, 13.5)   (0.158, 14.2)      (1.323, 13.4)   (1.171, 15.1)      (5.628, 14.2)   (4.184, 15.0)
k = 2   (0.115, 14.1)   (0.110, 2.4 × 10⁴) (0.196, 14.1)   (0.141, 2.0 × 10⁴) (1.913, 14.2)   (1.671, 2.1 × 10⁴)
k = 3   (0.102, 14.0)   –                  (0.120, 13.4)   –                  (0.287, 14.1)   –

search of all possible combinations and finds the smallest combination that gives the smallest MSE. The LASSO algorithm, on the other hand, takes almost the same amount of time no matter how many irregular components are specified. Table 1 presents the MSE and CPU computation time (in seconds) for 1000 time series with N = 24, to resemble the data in the telecommunication example mentioned earlier. The time series were generated following a normal distribution with mean 0 and standard deviation 0.1, and n irregular components were then randomly selected out of the 3N − 4 = 68 irregularities. For the above business example, n = 1, 2, 3 are considered, since it is very rare to have more than 3 irregularities in a single time series with 24 months of observations. Consequently, k = 1, 2, 3 represent the effects of different numbers of specified irregularities—under-fit, fit, and over-fit. (A sketch of this simulation set-up is given at the end of this section.) It is easy to see that the LASSO algorithm performs similarly to the ALESDA algorithm. As n gets bigger, the performance of the LASSO algorithm slightly deteriorates, since the complexity of the problem increases. However, as the specified number of irregular components increases, the time taken by the ALESDA algorithm to find the minimum solution increases exponentially. When k ≥ 3, it is almost impossible to find the minimum solution in a reasonable time, while the LASSO algorithm takes the same amount of time to converge regardless of the value of k. This illustrates that the LASSO algorithm is practically efficient for handling multiple irregularities. In the following we demonstrate the IE framework on a telecommunication customer database.

6.2 An industrial example

We now consider the experimental example discussed in Jiang et al. (2007). The database records 3745 customers' transactions for 24 months. As discussed in Jiang et al. (2007), the activity monitoring module relies on a robust estimate of the baseline model for each customer given historical activities. Figures 2 and 3 show several examples of such activities, represented by business trends, billing errors, business structure breaks, etc. In this study, we implement the IE framework to classify the different types of irregularities in this database. For illustration, we set both T_l and T_g to 0.1, i.e., we accept models that are able to capture more than 90% of the variation for each single customer and for the whole database, respectively. Figure 4 presents the tree structure of the experimental results, which arrived at the final acceptable set of models in three steps.

At the beginning of the experiment, k is set to zero and we are interested to see whether the database contains only systematic patterns at the acceptable level. Figure 4 shows that the systematic patterns can model 52% of the customers using a simple linear regression model. In fact, these are the customers that fulfill the local threshold criterion using only simple linear regression. However, the simple linear regression model is capable of describing only 61% of the variation in the database, which is not satisfactory. Therefore we are interested in


Fig. 4 Breakdown of Irregularities in the Telecommunication Database

increasing the value of k by one in order to find the most significant irregularity. When k = 1, the algorithm recognizes three different classes based on the number and type of significant irregularities. The first class has only one change point, the second includes only one outlier, and the third consists of the data with no significant irregularities. After checking the local threshold criterion for each time series that would potentially fall in each category, the algorithm assigns the customers to the corresponding classes. It is found that the first class contains 60% of the data. This class constitutes 73.3% of the total variation and is able to explain 71.5% of the total variation. This suggests that change points are responsible for a major portion of the variation in the database. Similarly, the second class contains 16% of the data and encompasses 7% of the variation; it can describe 6.8% of the total variation. Overall, this set of models describes 78.3% of the total variation, which is not yet acceptable. When k = 2, the algorithm recognizes four different classes and can explain 89.4% of the total variation, which is very close to the acceptable level of 90%. When k = 3, we end up with 5 classes which can explain 95.3% of the total variation.

This experiment shows that the IE algorithm can explicitly break down a database into different classes at specified levels of acceptable thresholds. This procedure can provide business analysts with insights into the types of significant irregularities that are hidden in a database, explaining each type's contribution to the overall variation. To better understand the sensitivity of the tuning parameters and their relationship with the step k, Fig. 5 shows a lift chart of the unexplained variation for different T_l values. It is found that the stopping value of k is larger than 3 when T_l = 0.1, which suggests setting T_g = T_l in general.
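As noted in Sect. 6.1, the sketch below outlines how one simulated profile of Table 1 could be generated: Gaussian noise with standard deviation 0.1 plus n randomly placed irregular primitives of length N = 24. The component amplitude is an assumption, since the paper does not report it, and the code is illustrative rather than the authors' simulation script.

```python
# Sketch of the Sect. 6.1 simulation set-up: N(0, 0.1) noise plus n randomly chosen
# irregular primitives. The amplitude of 1.0 is an assumption not stated in the paper.
import numpy as np

def simulate_profile(n_irregular, N=24, amplitude=1.0, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    t = np.arange(1, N + 1)
    y = rng.normal(0.0, 0.1, N)
    for kind in rng.choice(["impulse", "step", "ramp"], size=n_irregular):
        if kind == "impulse":
            tau = rng.integers(1, N + 1)            # delays 1..N, as in (3)
            y += amplitude * (t == tau)
        elif kind == "step":
            tau = rng.integers(1, N - 1)            # delays 1..N-2
            y += amplitude * (t >= tau)
        else:                                       # delayed ramp, delays 1..N-2
            tau = rng.integers(1, N - 1)
            y += amplitude * np.maximum(t - tau, 0)
    return y

# One replicate per value of n used in Table 1.
series = {n: simulate_profile(n, rng=np.random.default_rng(n)) for n in (1, 2, 3)}
print({n: s.shape for n, s in series.items()})
```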

7 Discussions and concluding remarks

In most KDD and data mining practice, data irregularities have profound impacts on the quality of business intelligence and analytics. Enlightening irregularities in a large database is a critical step in data pre-processing. This paper presents a generic framework for modeling

Fig. 5 Lift chart of the specified number of irregularities with different T_l (T_l = 0.1—solid line with circle, T_l = 0.2—dashed line with triangle, T_l = 0.05—dashed line with cross)

irregularities in large databases, in particular cross sectional time series data, due to their popularity in business applications. Based on a data-decomposition approach, we propose to use feature extraction methods to identify and locate major data irregularities according to their importance. In order to automate the irregularity enlightenment framework, a step-by-step procedure is developed, with visualization tools, to capture irregularity components that are not main features of the underlying data but carry significant business information. The numerical examples and simulations show the effectiveness of the proposed framework in modeling large amounts of time series data. To speed up the irregularity identification process, we reframe the IE problem as a linear regression problem with the different irregularities as features and propose using the LASSO algorithm to help select important features from the large amount of customer time series data. The LASSO algorithm performs similarly to the ALESDA algorithm in terms of MSE but is much more efficient in terms of computation time. Moreover, unlike other anomaly detection methods such as DWT, the irregularities identified by the LASSO algorithm carry business information that is transparent to business users.

The proposed IE framework can serve as a basis for further research on feature selection and activity monitoring dealing with large numbers of data streams simultaneously. It helps data miners and business analysts better understand their data before feeding the data into mining algorithms. It also highlights the underlying properties of irregularities and facilitates the evaluation of their impacts on strategic and tactical decisions. The data irregularity distribution will help business executives efficiently allocate resources to improve the quality of data and develop activity monitoring modules to track important customer activities. An interesting potential application of the proposed IE framework is robust profiling. As Jiang et al. (2007) pointed out, many data mining procedures will fail without a robust data profiling technique. The IE platform is expected to bring the benefits of cleaned data to more efficient and robust business analysis. Moreover, although exogenous feature extraction methods are investigated in the current data-decomposition approach, endogenous feature extraction methods can generally be considered when domain knowledge is unavailable. However, caution should be exercised with endogenous feature extraction methods, since irregular observations can bias the transformations themselves. This will be pursued in future research.

Acknowledgements The authors would like to thank the editors and two anonymous referees for their valuable suggestions, which have significantly improved the quality and presentation of this paper.

Appendix: LARS algorithm for LASSO computation

Let Ŷ_c = Xβ_c be the current estimate of the actual data Y (for the moment, assume that Y is a vector of observed data), where β_c has nonzero values for the variables that have been selected in the current model. The correlation vector of the predictors with the current model is defined as η_c = X'(Y − Ŷ_c). Let A be the set of indices of the currently selected variables. We can then define the equiangular vector of the currently selected variables as

ν_c = X_c (X_c' X_c)^{-1} 1_c / (1_c' (X_c' X_c)^{-1} 1_c)^{1/2},

where X_c = (· · · δ_j X_j · · ·), j ∈ A, δ_j = sign(η_j), and 1_c is the vector of 1's with length equal to the number of selected variables. The LARS algorithm then starts from Ŷ_0 = 0 and iteratively builds up the model by selecting the variable that has the highest correlation with the current equiangular vector. It builds this model in the direction of the equiangular vector of all the selected variables. In other words, the LARS model at step c is Ŷ_c = Ŷ_{c−1} + ρ_c ν_c. At each step, ρ_c is selected such that the new model has equal correlation with all the selected variables, including the newly selected one. The simple modification of the LARS algorithm towards solving the LASSO is to make sure that the sign of the coefficient value of the newly selected variable agrees with the sign of its correlation with the current model (see Efron et al. 2004 for more details).

Holding the LASSO condition, it can be shown that the LARS-LASSO coefficient vector is a piecewise linear function of the LASSO parameter t. This is important since it describes the relationship between the number of allowed components k in the model and the LASSO tuning parameter t. Let β_{·j}^{t_{k−1}} be the LASSO coefficient estimate for the j-th customer having k − 1 nonzero elements and let t_{k−1} = Σ_{i=1}^{p} |β_{ij}^{t_{k−1}}|. Then for t_k > t ≥ t_{k−1} we have

lim_{t→t_k} (β_{·j}^{t_{k−1}} + ω(t − t_{k−1})) = β_{·j}^{t_k},

where ω is a constant vector of size p within that interval. Therefore, choosing k, the number of allowed components in the model, in the LARS-LASSO method can be directly translated into the choice of the tuning parameter in LASSO. In other words, the proposed IE framework can utilize the iterative nature of the LARS-LASSO method while keeping the favorable characteristics of the LASSO method.
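For readers who want to inspect the piecewise-linear path described above without the Matlab code cited in Sect. 4, scikit-learn's lars_path is assumed here as a stand-in; the synthetic design matrix and coefficients are illustrative only.

```python
# Sketch: trace the LASSO solution path with LARS and read off the order in which
# components enter the model (i.e., the k = 1, 2, ... most significant irregularities).
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(4)
X = rng.normal(size=(24, 68))                 # stand-in for the 3N - 4 irregular columns
beta_true = np.zeros(68)
beta_true[[5, 40]] = [2.0, -1.5]              # two "true" irregular components
y = X @ beta_true + rng.normal(0, 0.1, 24)

alphas, active, coefs = lars_path(X, y, method="lasso")
# `active` lists column indices in the order they enter; `coefs` holds the piecewise-
# linear coefficient path, one column per breakpoint of the tuning parameter.
print("first components to enter:", active[:3])
```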

References

Alwan, L. C., & Roberts, H. V. (1988). Time-series modeling for statistical process control. Journal of Business and Economic Statistics, 6, 87–95. Apley, D. W., & Shi, J. (1999). The GLRT for statistical process control of autocorrelated processes. IIE Transactions, 31, 1123–1134. Bakshi, B. R. (1999). Multiscale analysis and modeling using wavelets. Journal of Chemometrics, 13, 415–434. Barnett, V., & Lewis, T. (1994). Outliers in statistical data (3rd ed.). New York: Wiley. Basseville, M., & Nikiforov, I. (1993). Detection of abrupt changes. Theory and application. Prentice Hall Information and system sciences series. New York: Prentice Hall. Bay, S., & Schwabacher, M. (2003). Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proceedings of the ninth ACM SIGKDD international conference on knowledge discovery and data mining, Washington, DC, 24–27 August 2003. Bellman, R. (1961). Adaptive control processes: a guided tour. Princeton: Princeton University Press. Bianco, A. M., Garcia, B. M., Martinez, E. J., & Yohai, V. J. (1996). Robust procedures for regression models with ARIMA errors. In COMPSTAT 96, proceedings in computational statistics, part A (pp. 27–38). Berlin: Physica.

Ann Oper Res Bianco, A. M., Garcia, B. M. G., Martinez, E. J., & Yohai, V. J. (2001). Outlier detection in regression models with ARIMA errors using robust estimations. Journal of Forecasting, 20, 565–579. Bishop, C. M. (2006). Pattern recognition and machine learning (1st ed.). Berlin: Springer. Breunig, M., Kriegel, H., Ng, R., & Sander, J. (2000). LOF: Identifying density-based local outliers. In Proceedings of ACM SIGMOD, May 2000 (pp. 93–104). Chan, P., & Mahoney, M. (2005). Modeling multiple time series for anomaly detection. In Proceedings of IEEE international conference on data mining (pp. 90–97). Chen, C., & Liu, L. M. (1993). Forecasting time series with outliers. Journal of Forecast, 12, 13–35. Cooper, G., Hogan, B., Moore, A., Sabhnani, R., Tsui, R., Wagner, M., & Wong, W. K. (2003). Detection algorithms for biosurveillance: a tutorial. http://www.autonlab.org/tutorials/biosurv01.pdf. Efron, B., Hastie, T., Johnstone, I., & Tibshirani, R. (2004). Least angle regression. Annals of Statistics, 32, 407–499. Fan, J., & Li, R. (2006). Statistical challenges with high dimensionality: Feature selection in knowledge discovery. arXiv:math/0602133. Fawcett, T., & Provost, F. (1999). Activity monitoring: noticing interesting changes in behavior. In Proceedings of the fifth international conference on knowledge discovery and data mining (KDD-99). Galati, D., & Simaan, M. (2006). Automatic decomposition of time series into step, ramp, and impulse primitives. Pattern Recognition, 39, 2166–2174. Guha, S., Mishra, N., Motwani, R., & O’Callaghan, L. (2000). Clustering data streams. In Proceedings of the 41st annual symposium on foundations of computer science. Redondo Beach, CA, Nov 12–14 (pp. 359–366). Harris, T. J., & Ross, W. M. (1991). Statistical process control for correlated observations. The Canadian Journal of Chemical Engineering, 69, 48–57. Hastie, T., Tibshirani, R., & Friedman, J. H. (2003). The elements of statistical learning. Berlin: Springer. Hawkins, D. (2001). Fitting multiple change-point models to data. Computational Statistics and Data Analysis, 37, 323–341. Jiang, W. (2004). Multivariate control charts for monitoring autocorrelated processes. Journal of Quality Technology, 36, 367–379. Jiang, W., Wu, H., Tsung, F., Nair, V. N., & Tsui, K.-L. (2002). PID charts for process monitoring. Technometrics, 44, 205–214. Jiang, W., Au, T., & Tsui, K. (2007). A statistical process control approach to business activity monitoring. IIE Transactions, 39, 235–249. Keogh, E., Lin, J., & Truppel, W. (2003). Clustering of time series subsequences is meaningless: implications for past and future research. In Proceedings of the 3rd IEEE int’l conference on data mining, Melbourne, FL, Nov 19–22 (pp. 115–122). Keogh, E., Lonardi, S., & Ratanamahatana, C. (2004). Towards parameter-free data mining. In Proceedings of the tenth ACM SIGKDD international conference on knowledge discovery and data mining, Seattle, WA, 22–25 Aug 2004. Kim, Y., Street, W. N., & Menczer, F. (2003). Feature selection in data mining. In Data mining: opportunities and challenges (pp. 80–105). Knorr, E., & Ng, R. (1998). Algorithms for mining distance-based outliers in large datasets. In Proceedings of the VLDB conference. New York, September 1998 (pp. 392–403). Knorr, E., & Ng, R. (1999). Finding intentional knowledge of distance-based outliers. In Proceedings of 25th international conference on very large databases, September 1999 (pp. 211–222). Knorr, E., Ng, R., & Tucakov, V. (2000). 
Distance-based outliers: algorithms and applications. The International Journal on Very Large Data Bases, 8(3–4), 237–253. Lavielle, M. (1999). Detection of multiple changes in a sequence of dependent variables. Stochastics Processes and Applications, 83, 79–102. Lavielle, M. (2005). Using penalized contrasts for the change-point problem. Signal Processing, 85(8), 1501– 1510. Lerman, P. M. (1980). Fitting segmented regression models by grid search. Applied Statistics, 29, 77–84. Liu, H., Shah, S., & Jiang, W. (2004). On-line outlier detection and data cleaning. Computers and Chemical Engineering, 28, 1635–1647. Mahoney, M., & Chan, P. (2005). Trajectory boundary modeling of time series for anomaly detection. In KDD-2005 workshop on data mining methods for anomaly detection. Montgomery, D. C. (2004). Introduction to statistical quality control (5th ed.) New York: Wiley. Montgomery, D. C., & Mastrangelo, C. M. (1991). Some statistical process control methods for autocorrelated data. Journal of Quality Technology, 23, 179–204. Moustakides, G. V. (1986). Optimal stopping times for detecting changes in distributions. Annals of Statistics, 14, 1379–1387.

Ann Oper Res Nounou, M. N., & Bakshi, B. R. (1999). On-line multiscale filtering of random and gross errors without process models. AIChE Journal, 5(45), 1041–1058. Oates, T., Firoiu, L., & Cohen, P. (1999). Clustering time series with hidden markov models and dynamic time warping. In Proceedings of the IJCAI-99 workshop on neural, symbolic and reinforcement learning methods for sequence learning (pp. 17–21). Papadimitriou, S., Kitagawa, H., Gibbons, P. B., & Faloutsos, C. (2003). LOCI: fast outlier detection using the local correlation integral. In Proceedings 19th international conference on data engineering, March 2003 (pp. 315–326). Qian, Z., Jiang, W., & Tsui, K. (2006). Churn detection via customer profile modelling. International Journal of Production Research, 44, 2913–2933. Redman, T. C. (1998). The impact of poor data quality on the typical enterprise. Communications of the ACM, 41(2), 79–82. Rousseeuw, P. J., & Leroy, A. M. (1987). Robust regression and outlier detection. New York: Wiley. Runger, G. C., & Willemain, T. R. (1995). Model-based and model-free control of autocorrelated processes. Journal of Quality Technology, 27, 283–292. Runger, G. C., & Willemain, T. R. (1996). Batch means charts for autocorrelated data. IIE Transactions on Quality and Reliability, 28, 483–487. Shmueli (2005). Current and potential statistical methods for anomaly detection. In KDD-2005 workshop on data mining methods for anomaly detection. Sjöstrand, K. (2005). Matlab implementation of lasso, lars, the elastic net and spca (Technical Report). June 2005. Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B, 58, 267–288. Tsay, R. S. (1988). Outliers, level shifts, and variance changes in time series. Journal of Forecasting, 7, 1–20. Tsay, R. S. (1996). Time series model specification in the presence of outliers. Journal of the American Statistical Association, 81, 132–141. Vander Wiel, S. A. (1996). Monitoring processes that wander using integrated moving average models. Technometrics, 38, 139–151. Wardell, D. G., Moskowitz, H., & Plante, R. D. (1994). Run-length distributions of special-cause control charts for correlated observations. Technometrics, 36, 3–17. Wei, L., Keogh, E., Van Herle, H., & Mafra-Neto, A. (2005). Atomic wedgie: efficient query filtering for streaming time series. In Proc. of the 5th IEEE international conference on data mining (ICDM 2005), Houston, TX, 27–30 Nov 2005 (pp. 490–497). Wilson, J. H., & Keating, B. (2001). Business forecasting (4th ed.). New York: McGraw-Hill. Zou, H., Hastie, T., & Tibshirani, R. (2006). Sparse principal component analysis. Journal of Computational and Graphical Statistics, 15, 265–286.
