Chairman of MSDM 2013 Pr. Mohamed Limam, Dhofar University, Oman - University of Tunis, Tunisia

Program committee
Amira Dridi, TBS, University of Tunis
Amor Messaoud, TBS, University of Tunis
Chiheb Ben Ncir, ISG, University of Tunis
Ghazi Bel Mufti, ESSECT, University of Tunis
Mohamed El Ghourabi, ESSECT, University of Tunis
Mourad Landoulsi, ESSECT, University of Tunis
Riadh Khanchel, FSEG-Nabeul, University of Carthage

Scientific committee
Abdelwahed Trabelsi, University of Tunis, Tunisia
Adel Karaa, University of Tunis, Tunisia
Changliang Zou, Nankai University, China
Christian Francq, University of Lille, France
Claus Weihs, Dortmund University of Technology, Germany
Dhafer Maalouch, University of Carthage, Tunisia
Giovanni Porzio, University of Cassino, Italy
Kamal Smimou, University of Ontario, Canada
Mhamed-Ali El Aroui, University of Carthage, Tunisia
Mohamed El Ayadi, University of Tunis, Tunisia
Nizar Touzi, Ecole Polytechnique, France
Paolo Giudici, University of Pavia, Italy
Philippe Castagliola, University of Nantes, France
Rainer Göb, University of Würzburg, Germany
Stéphane Lallich, Université Lumière Lyon 2, France

Conference Partners


Preface

The fourth Meeting on Statistics and Data Mining (MSDM'2013) is held from 14th to 15th March 2013 in Hammamet, Tunisia. It is organized by the Tunisian Association of Statistics and Its Applications (TASA) to bring together academic researchers and industry practitioners carrying out research and development in statistical analysis, knowledge discovery and information management. The objective of the meeting is also to serve as a forum for current and future development work as well as to exchange research ideas. The main topics of the conference are:

• Statistical Analysis: Statistical Inference, Multivariate Analysis, Nonparametric Statistics, Statistical Learning, Simulation and Computational Statistics, Bayesian Statistics, Exploratory Data Analysis.
• Data Mining: Data Pre-processing, Feature Extraction and Selection, Machine Learning, Clustering and Classification, Ensemble Methods, Post-processing and Knowledge Validation.
• Applications: Bioinformatics and Bio-statistics, Mining Spatial and Temporal Data, Quality Control, Remote Sensing, Supply Chain Management, Quantitative Finance.

This volume contains the papers presented at the MSDM 2013 meeting in Hammamet, Tunisia. All papers were reviewed. I would like to express my gratitude to all authors in the present volume for their contributions, diligence and timely production of the final versions of their papers. Furthermore, I thank all the reviewers of the scientific committee for their careful reviews of the originally submitted papers and, in this way, for their support in improving the quality of the papers. Special thanks go to all those in the organizing committee for their precious input and constant determination to make this event a success.

We hope that MSDM'2013 brings interesting opportunities for all colleagues and graduate students and fosters follow-up discussions leading to further cooperation and joint projects. We would like to extend our gratitude to all those who have contributed financially and morally to organize, implement and take part in MSDM'2013. Without their wholehearted commitment and encouragement this event would not have been possible. Until we meet again for our fifth MSDM meeting, I wish all our participants all the best.

Tunis, March 2013 Mohamed Limam Chairman of MSDM 2013


Contents

Worst-Case Scenarios Optimization as a Financial Stress Testing Tool for Risk Management. Amira Dridi, Mourad Landolsi, Mohamed El Ghourabi and Imed Gamoudi (p. 1)
Rank Aggregation for Filter Feature Selection in Credit Scoring. Waad Bouaguel, Ghazi Bel Mufti and Mohamed Limam (p. 7)
Multiclass classification of banks using Bayesian models: Risk and product diversification. Asma Feki and Saber Feki (p. 15)
Customer lifetime value prediction using Conway-Maxwell-Poisson distribution. Mohamed Ben Mzoughia and Mohamed Limam (p. 22)
Comparison of parameter optimization techniques for a music tone onset detection algorithm. Nadja Bauer, Julia Schiffner and Claus Weihs (p. 28)
A new approach of One Class Support Vector Machines for Detecting Abnormal Wafers in Semiconductor. Ali Hajj Hassan, Sophie Lambert-Lacroix and Francois Pasqualini (p. 35)
Robust ensemble feature selection based on reliability assessment. Afef Ben Brahim and Mohamed Limam (p. 42)
Prior class probability for hyperspectral data classification. Saoussen Bahria and Mohamed Limam (p. 48)
Overlapping Clustering of Sequential Textual Documents. Chiheb-Eddine Ben N'Cir and Nadia Essoussi (p. 57)
Robust Principal Component Analysis for Incremental Face Recognition. Haifa Nakouri and Mohamed Limam (p. 64)
GMM estimation of the first order periodic GARCH processes. Ines Lescheb (p. 72)
Testing instantaneous causality in presence of non constant unconditional covariance structure. Quentin Giai Gianetto and Hamdi Raissi (p. 76)
Consistency and Asymptotic Normality of the Generalized QML Estimator of a General Volatility Model. Christian Francq, Fedya Telmoudi and Mohamed Limam (p. 83)
Testing for mean and volatility contagion of the subprime crisis: Evidence from Asian and Latin American stock markets. Leila Jedidi, Wajih Khallouli and Mouldi Jlassi (p. 87)
On the covariance structure of a bilinear stochastic differential equation. Abdelouahab Bibi (p. 100)
Monitoring simple linear profiles using the K-chart. Walid Gani and Mohamed Limam (p. 103)
Monitoring the Coefficient of Variation using a Variable Sampling Interval Control Chart. Philippe Castagliola, Ali Achouri and Hassen Taleb (p. 110)
A new time-adjusting control limit with Fast Initial Response for Dynamic Weighted Majority based control chart. Dhouha Mejri, Mohamed Limam and Claus Weihs (p. 117)
On economic design of np control charts using variable sampling interval. Kooli Imen and Mohamed Limam (p. 124)
Average run length for MEWMA and MCUSUM charts in the case of Skewed distributions. Sihem Ben Zakour and Hassen Taleb (p. 139)
The multi-SOM algorithm based on SOM method. Imen Khanchouch, Khaddouja Boujenfa, Amor Messaoud and Mohamed Limam (p. 145)


Worst-Case Scenarios Optimization as a Financial Stress Testing Tool for Risk Management

Amira Dridi (1), Mourad Landolsi (2), Mohamed El Ghourabi (3), Imed Gamoudi (4)

(1,2,3) ISG Tunis, LARODEC; (4) ISG Gabes

ABSTRACT: Financial stress testing (FST) is a key technique for quantifying financial vulnerabilities and an important risk management tool. FST should ask which scenarios lead to big losses at a given level of plausibility. However, traditional FSTs are criticized, firstly, for the unclear plausibility of the chosen stress scenarios and, secondly, for being conducted outside the context of an econometric risk model. Hence the probability of a severe scenario outcome is unknown and many plausible scenarios are ignored. The aim of this paper is to propose a new FST framework for analyzing stress scenarios for financial economic stability. Based on worst-case scenario optimization, our approach is able, first, to identify the stressful periods with transparent plausibility and, second, to develop a methodology for conducting FST in the context of any financial-economic risk model. Applied to Tunisian economic system data, the proposed framework identifies more harmful scenarios that are equally plausible, leading to stress periods not detected by classical methods.
KEYWORDS: Worst-Case Scenarios, Financial stress testing, Risk management

1. Introduction

Stress is the product of a vulnerable structure and some exogenous shock. Financial fragility describes weaknesses in financial conditions and in the structure of the financial system. The size of the shock and the fragility of the financial system together determine the level of stress. That is why researchers and risk managers are increasingly interested in better understanding the financial vulnerabilities of banks. One key tool for measuring those fragilities is financial stress testing (FST). The Committee on the Global Financial System defines FST as "a generic term describing various techniques used by financial firms to gauge their potential vulnerability to exceptional but plausible events". FST can be conducted by simulating historical stress episodes or by constructing hypothetical events built by stressing one or a group of risk factors. FST should ask which scenarios lead to big losses at a given level of plausibility. However, traditional FSTs are criticized, firstly, for the unclear plausibility of the chosen stress scenarios and, secondly, for being conducted outside the context of an econometric risk model. Hence the probability of a severe scenario outcome is unknown and many plausible scenarios are ignored. Many FSTs also fail to incorporate the characteristics that markets are known to exhibit in crisis periods, namely increased probability of further large movements, increased co-movement between markets, greater implied volatility and reduced liquidity. The aim of this paper is to propose a new FST framework for analyzing stress scenarios for financial economic stability. For extreme scenarios, given that the distribution of risk factors is not elliptical, we introduce extreme value theory (EVT): as noted by [BRE 10], a given extreme scenario should be more plausible if the risk factor has fatter tails. However, for fat-tailed distributions whose first two moments may not exist, the Mahalanobis distance is not applicable as a plausibility measure [BRE 10]. To overcome those pitfalls, we use an approach based on copula EVT to estimate the upper bound of the value at risk (VaRMax), considered by several authors as a coherent risk measure, used to estimate the maximum loss (ML) value and hence to find the worst-case scenario (WCS), namely the Copula EVT-WCS. The main advantages of this approach are, first, the use of copula EVT, which focuses on the extreme scenarios found in the tails of the distributions, and second, that for a fixed level of plausibility p an explicit formula for the WCS based on an estimate of VaRMax is proposed.
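To fix ideas, the classical worst-case-scenario search that the paper contrasts with its Copula EVT-WCS approach maximizes a loss over an elliptical plausibility domain (a Mahalanobis ball), for which the worst point has a closed form when the loss is linear in the risk factors. The sketch below is only an illustration of that classical setting, not the paper's method; the covariance matrix, exposures and radius k are hypothetical.

```python
import numpy as np

# Hypothetical factor covariance (3 risk factors) and portfolio exposures.
Sigma = np.array([[0.04, 0.01, 0.00],
                  [0.01, 0.09, 0.02],
                  [0.00, 0.02, 0.16]])
w = np.array([1.0, 0.5, 0.25])          # portfolio sensitivities to the factors
k = 3.0                                  # plausibility radius (Mahalanobis distance)

# For a linear loss  loss(r) = -w' r  and the plausibility domain
# {r : r' Sigma^{-1} r <= k^2}, the maximiser has the closed form
#   r* = -k * Sigma w / sqrt(w' Sigma w),  with maximal loss  k * sqrt(w' Sigma w).
scale = np.sqrt(w @ Sigma @ w)
r_worst = -k * (Sigma @ w) / scale
max_loss = k * scale

print("worst-case factor move:", r_worst)
print("worst-case loss       :", max_loss)
# sanity check: the worst-case point lies on the boundary of the ellipsoid
assert np.isclose(r_worst @ np.linalg.solve(Sigma, r_worst), k**2)
```

This closed form breaks down exactly in the fat-tailed, non-elliptical situations motivating the copula EVT approach, where the Mahalanobis distance is no longer a usable plausibility measure.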


The remainder of this paper is organized as follows. Section 2 presents an overview of credit risk models. Section 3 focuses on VaR based on copula EVT. Section 4 develops our proposed work on WCS optimization. The empirical study is conducted in Section 5 and Section 6 provides the conclusion.

2. Credit risk models

The loan portfolio is the most significant source of risk; credit risk is defined as the loss associated with unexpected changes in loan quality. The largest source of credit risk is loans, which involve the risk of default of the counterpart. Measuring credit risk requires estimating a number of different parameters, namely the probability of default (PD), the loss given default (LGD), which may involve estimating the value of collateral, and the exposure at default (EAD). Some authors link credit risk to macroeconomic variables using econometric models. [PES 05] presents an econometric study of macroeconomic determinants of credit risk and other sources of banking fragility and distress in Finland. For Austria, [BOS 02] provides estimates of the relationship between macroeconomic variables and credit risk. For Norway, the Norges Bank has single-equation models for household debt and house prices, and a model of corporate bankruptcies based on annual accounts for all Norwegian enterprises [EKL 01]. For Hong Kong, [GER 04] proposed an FST credit risk model based on a panel using bank-by-bank data. For the Czech Republic, [BAB 05] estimates a vector autoregression model with non-performing loans (NPL) and a set of macroeconomic variables. Similar models are also common in Financial Sector Assessment Program (FSAP) missions. For example, the technical note from the Spain FSAP includes an estimate of a regression explaining NPL at an aggregate level with financial sector indicators and a set of macroeconomic indicators. Several shortcomings need to be considered when interpreting macroeconomic models of FST credit risk. In particular, the literature is dominated by linear statistical models. The linear approximation is reasonable when shocks are small, but for large shocks nonlinearity is likely to be important. Based on the copula concept, we apply the method of worst-case search over risk factor domains of a given plausibility to the analysis of portfolio credit risk.
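The single-equation macroeconomic credit-risk models cited above typically regress an NPL ratio on a small set of macro indicators. The following is a hedged sketch of that kind of model on simulated data; the variable names, coefficients and the stress scenario are hypothetical and are not taken from any of the cited studies.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Hypothetical quarterly series: GDP growth, interest rate and unemployment
# as macro drivers of an aggregate non-performing-loan (NPL) ratio.
n = 60
gdp_growth   = rng.normal(0.5, 1.0, n)
interest     = rng.normal(4.0, 0.8, n)
unemployment = rng.normal(9.0, 1.5, n)
npl_ratio = 2.0 - 0.4 * gdp_growth + 0.3 * interest + 0.2 * unemployment \
            + rng.normal(0.0, 0.3, n)          # simulated response

X = sm.add_constant(np.column_stack([gdp_growth, interest, unemployment]))
model = sm.OLS(npl_ratio, X).fit()
print(model.params)        # estimated sensitivities of NPL to the macro variables

# A stress scenario is then read off the fitted equation, e.g. a 2-point GDP drop:
stressed_npl = model.params[0] + model.params[1] * (gdp_growth.mean() - 2.0) \
             + model.params[2] * interest.mean() + model.params[3] * unemployment.mean()
print("stressed NPL ratio:", stressed_npl)
```

The linearity of such a regression is precisely the limitation raised above for large shocks.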

3. Upper VaR based on copula concept

The aim of this section is to introduce the semi-parametric methodology developed by [MES 05], based on the copula concept, in order to propose, in the following section, an explicit formula for the WCS for credit portfolios of possibly dependent financial sectors' loans. The idea applies when the marginal return distributions are in the domain of attraction of the generalized extreme value distribution and the dependence structure between financial assets remains unknown. The concept of a copula was first introduced by [SKL 59] as a function which couples a joint distribution function with its univariate margins; it is a tool for modelling the dependence structure through the joint cumulative distribution function of the observations. In this section, we give the basic concepts about the problem of VaRMax for functions of dependent risks. The VaR at probability level p for a random variable X with distribution function F is defined as

$\mathrm{VaR}_p = F^{-1}(p) \qquad (1)$
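Equation (1) is simply a quantile of the distribution; a minimal empirical version (the loss-sign convention is an assumption of this sketch) is:

```python
import numpy as np

rng = np.random.default_rng(1)
returns = rng.standard_t(df=4, size=10_000) * 0.01   # simulated heavy-tailed returns

p = 0.99
# Equation (1): VaR_p = F^{-1}(p), here applied to the loss distribution (-returns).
var_p = np.quantile(-returns, p)
print(f"Empirical VaR at level {p}: {var_p:.4f}")
```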

In order to evaluate the risk level of a portfolio of possibly dependent financial assets, we use copulas. Let the multivariate distribution function of a random vector X = (X_1, ..., X_d) be defined as H(x_1, ..., x_d) = P(X_1 ≤ x_1, ..., X_d ≤ x_d).

$L(\lambda, \nu, \mu \mid T, t_x, x_1, x_2, \ldots, x_x) = L(\lambda, \nu \mid T, t_x, x_1, x_2, \ldots, x_x, \tau > T)\, P(\tau > T \mid \mu) + \int_{t_x}^{T} L(\lambda, \nu \mid T, t_x, x_1, x_2, \ldots, x_x, \text{inactive at } \tau \in (t_x, T])\, f(\tau \mid \mu)\, d\tau \qquad (5)$

$L(\lambda, \nu \mid T, t_x, x_1, x_2, \ldots, x_x, \tau > T) = \frac{\lambda^{x_1}}{(x_1!)^{\nu}}\frac{1}{Z(\lambda,\nu)} \cdot \frac{\lambda^{x_2}}{(x_2!)^{\nu}}\frac{1}{Z(\lambda,\nu)} \cdots \frac{\lambda^{x_x}}{(x_x!)^{\nu}}\frac{1}{Z(\lambda,\nu)} = \frac{\lambda^{x}}{(\pi_x)^{\nu}\, Z(\lambda,\nu)^{T}} \qquad (6)$

where $\pi_x = \prod_i (x_i!)$. Compared to the Pareto/NBD model, our model requires the additional variable $\pi_x$, which is equal to the product of the factorials of the numbers of transactions $x_i$ occurring during each time period $t_i$. Using equations (5) and (6), the likelihood function is represented as:

$L(\lambda, \nu, \mu \mid T, t_x, x, \pi_x) = \frac{\lambda^{x} e^{-\mu T}}{(\pi_x)^{\nu} Z(\lambda,\nu)^{T}} + \int_{t_x}^{T} \frac{\lambda^{x}\, \mu\, e^{-\mu \tau}}{(\pi_x)^{\nu} Z(\lambda,\nu)^{\tau}}\, d\tau = \frac{\lambda^{x} e^{-\mu T}}{(\pi_x)^{\nu} Z(\lambda,\nu)^{T}} + \frac{\lambda^{x} \mu\, e^{-\mu t_x}}{(\pi_x)^{\nu} Z(\lambda,\nu)^{t_x}\,(\mu + \ln Z(\lambda,\nu))} - \frac{\lambda^{x} \mu\, e^{-\mu T}}{(\pi_x)^{\nu} Z(\lambda,\nu)^{T}\,(\mu + \ln Z(\lambda,\nu))} \qquad (7)$

We remove the conditioning on λ and µ by taking the expectation of $L(\lambda, \nu, \mu \mid T, t_x, x, \pi_x)$ over the distributions of λ and µ to have:

$L(\nu, r, \alpha, s, \beta \mid T, t_x, x, \pi_x) = \int_{0}^{\infty}\!\!\int_{0}^{\infty} L(\lambda, \nu, \mu \mid T, t_x, x, \pi_x)\, g(\mu \mid s, \beta)\, g(\lambda \mid r, \alpha)\, d\mu\, d\lambda \qquad (8)$

The parameters ν, r, α, s and β can then be estimated by maximum likelihood, although this requires iterations and is computationally intensive.
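As a complement, the sketch below evaluates the COM-Poisson normalizing constant Z(λ, ν) by truncating its series and plugs the customer summaries x, t_x and π_x into the individual-level likelihood (7). The helper names, the unit-length observation periods and the truncation point are assumptions of this illustration, not part of the paper.

```python
import numpy as np
from math import lgamma, log, exp

def log_Z(lam, nu, j_max=200):
    """log of the COM-Poisson normalising constant Z(lam, nu) = sum_j lam^j / (j!)^nu,
    approximated by truncating the series (j_max assumed large enough)."""
    j = np.arange(j_max + 1)
    terms = j * log(lam) - nu * np.array([lgamma(k + 1) for k in j])
    m = terms.max()
    return m + log(np.exp(terms - m).sum())

def individual_likelihood(lam, nu, mu, counts, T):
    """Likelihood (7) for one customer observed over T unit periods.

    counts: transactions per period (zeros included); t_x is the last active period,
    x the total count and pi_x the product of factorials of the per-period counts."""
    counts = np.asarray(counts)
    x = counts.sum()
    t_x = int(np.max(np.nonzero(counts)[0]) + 1) if x > 0 else 0
    log_pi_x = sum(lgamma(c + 1) for c in counts)        # log(pi_x)
    lZ = log_Z(lam, nu)
    denom = mu + lZ                                       # mu + ln Z(lam, nu)
    term_alive = exp(x * log(lam) - mu * T - nu * log_pi_x - T * lZ)
    term_drop = exp(x * log(lam) - mu * t_x - nu * log_pi_x - t_x * lZ) * mu / denom \
              - exp(x * log(lam) - mu * T - nu * log_pi_x - T * lZ) * mu / denom
    return term_alive + term_drop

# toy usage: 12 weekly counts for one customer
print(individual_likelihood(lam=1.3, nu=0.9, mu=0.05,
                            counts=[2, 0, 1, 0, 0, 3, 0, 0, 1, 0, 0, 0], T=12))
```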


2.3. Prediction of number of transactions

Given that we don't know if a customer is alive at T, the expected number of purchases in the period (T, T + t] with purchase history x, t_x, π_x and T is measured as:

$E(X(t) \mid \lambda, \mu, \nu, T, t_x, x, \pi_x) = E(X(t) \mid \lambda, \mu, \nu, \text{alive at } T)\, P(\tau > T \mid \lambda, \mu, \nu, T, t_x, x, \pi_x) \qquad (9)$

The expected number of transactions while the customer is "alive" at T is calculated as:

$E(X(t) \mid \lambda, \mu, \nu, \text{alive at } T) = \bar{x}\, P(\tau > T + t \mid \mu, \tau > T) + \int_{T}^{T+t} \bar{x}\, \tau\, f(\tau \mid \mu, \tau > T)\, d\tau = \frac{\bar{x}}{\mu} - \frac{\bar{x}}{\mu}\, e^{-\mu t} \qquad (10)$

with

$\bar{x} = \sum_{j=0}^{\infty} \frac{j\, \lambda^{j}}{(j!)^{\nu}\, Z(\lambda, \nu)} \qquad (11)$

As the parameters λ and µ are unobserved, we compute the expected number of transactions by taking the expectation in (9) over the joint posterior distribution of λ and µ:

$E(X(t) \mid \nu, r, s, \alpha, \beta, T, t_x, x, \pi_x) = \int_{0}^{\infty}\!\!\int_{0}^{\infty} E(X(t) \mid \lambda, \mu, \nu, \text{alive at } T)\, P(\tau > T \mid \lambda, \mu, \nu, T, t_x, x, \pi_x)\, g(\lambda, \mu \mid r, \alpha, s, \beta, T, t_x, x, \pi_x)\, d\lambda\, d\mu \qquad (12)$

$= \int_{0}^{\infty}\!\!\int_{0}^{\infty} \left(\frac{\bar{x}}{\mu} - \frac{\bar{x}}{\mu}\, e^{-\mu t}\right) \frac{L(\lambda, \nu \mid T, t_x, x, \pi_x, \tau > T)\, P(\tau > T \mid \mu)}{L(\lambda, \nu, \mu \mid T, t_x, x, \pi_x)}\, g(\mu \mid s, \beta)\, g(\lambda \mid r, \alpha)\, d\lambda\, d\mu \qquad (13)$

$= \int_{0}^{\infty}\!\!\int_{0}^{\infty} \left(\frac{1}{\mu} - \frac{1}{\mu}\, e^{-\mu t}\right) \left(\sum_{j=0}^{\infty} \frac{j\, \lambda^{j}}{(j!)^{\nu} Z(\lambda,\nu)}\right) \frac{\lambda^{x} e^{-\mu T}}{(\pi_x)^{\nu} Z(\lambda,\nu)^{T}\, L(\lambda, \nu, \mu \mid T, t_x, x, \pi_x)}\, g(\mu \mid s, \beta)\, g(\lambda \mid r, \alpha)\, d\lambda\, d\mu \qquad (14)$

$= \frac{\beta^{s} \alpha^{r} \left((T + \beta)^{2-s} - (T + t + \beta)^{2-s}\right)}{(\pi_x)^{\nu} (s - 1)\, L(\lambda, \nu, \mu \mid T, t_x, x, \pi_x)} \int_{0}^{\infty} \frac{\lambda^{x+r-1} e^{-\lambda \alpha}}{Z(\lambda,\nu)^{T+1}} \left(\sum_{j=0}^{\infty} \frac{j\, \lambda^{j}}{(j!)^{\nu} Z(\lambda,\nu)}\right) d\lambda \qquad (15)$

Using the parameters estimated by maximum likelihood and the customer's past purchase history x, t_x, π_x and T, the expected number of purchases can be computed for each customer for any future time period (T, T + t]. Using the proposed model, the future number of transactions of a customer can be predicted, and the CLV is then computed as the discounted product of this number and the expected profit per transaction.
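The posterior expectation in (12) can also be approximated by direct numerical integration over the gamma priors. The sketch below shows only the generic posterior-weighted average by quadrature: likelihood() and h() are placeholders to be replaced by the paper's expressions (7) and (10), and the prior parameter values and integration bounds are arbitrary choices of this illustration.

```python
import numpy as np
from scipy import integrate
from scipy.stats import gamma

# Gamma priors g(lam | r, alpha) and g(mu | s, beta), as in the Pareto/NBD family.
r, alpha, s, beta = 1.3, 1.2, 0.3, 20.0

def likelihood(lam, mu):
    # Placeholder for the individual likelihood L(lam, nu, mu | T, t_x, x, pi_x).
    return np.exp(-0.5 * (lam - 1.0) ** 2 - 10.0 * mu)

def h(lam, mu, t=52):
    # Placeholder for E(X(t) | lam, mu, nu, alive at T) * P(alive at T | ...).
    return lam * t * np.exp(-mu * t)

def make_integrand(use_h):
    def f(mu, lam):
        w = likelihood(lam, mu) * gamma.pdf(lam, r, scale=1 / alpha) \
                                * gamma.pdf(mu, s, scale=1 / beta)
        return (h(lam, mu) if use_h else 1.0) * w
    return f

num, _ = integrate.dblquad(make_integrand(True), 0, 20, 0, 2)   # lam in (0,20), mu in (0,2)
den, _ = integrate.dblquad(make_integrand(False), 0, 20, 0, 2)
print("posterior expected transactions in (T, T+t]:", num / den)
```

In practice the closed form (15) avoids the double quadrature, which is why it matters for scoring a whole customer base.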

3. Empirical analysis

We explore the performance of our proposed model using data provided by an important North African retail bank. The dataset focuses on a single cohort of customers who made their first purchase in the first week of 2011. It contains the customers' card transaction data from January 2011 to December 2012. The total number of transactions is 4,850, made by 99 customers. The first 52 weeks of data (year 2011) are used to estimate the model parameters. The next 52 weeks of data (year 2012) are used both to validate our model and to compare it with other models. The proposed model is compared to the Pareto/NBD and BG/NBD models. The parameters are obtained via MLE and are reported in Table 1.


Parameter   Pareto/NBD   BG/NBD   Proposed Model
ν           -            -        0.96
r           1.25         0.17     1.28
α           2.08         0.08     1.25
s           1.07         -        0.03
β           11.02        -        23
a           -            8.69     -
b           -            42.32    -

Table 1. Model comparison - predicting future number of transactions

As a generalization of the Poisson distribution, the additional parameter of the COM-Poisson distribution, ν = 0.96, approaches 1, the value at which the COM-Poisson distribution reduces to the standard Poisson distribution. This means that our sample of transactions is close to, but not exactly, Poisson distributed. In Fig. 1 we compare the cumulative number of transactions predicted by the three models. It appears that our model fits the real data better.
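A quick numerical check of the statement that ν = 1 recovers the standard Poisson distribution (the truncation of the series at 50 terms is an assumption of this check):

```python
import numpy as np
from math import factorial, exp

lam, nu = 2.5, 1.0
j = np.arange(50)
weights = np.array([lam**k / factorial(k)**nu for k in j])
com_pmf = weights / weights.sum()                    # COM-Poisson pmf with nu = 1
poisson_pmf = np.array([exp(-lam) * lam**k / factorial(k) for k in j])
print(np.allclose(com_pmf, poisson_pmf))             # True: nu = 1 is the Poisson case
```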

Figure 2. Expected number of transactions

Given the objectives of this research, this analysis demonstrates the high degree of accuracy of the proposed model compared to the BG/NBD and the Pareto/NBD models, particularly for the purposes of forecasting a customer’s future purchasing.

4. Conclusion

Customer lifetime value is a key metric for any business activity. The difficulty encountered by companies when computing the CLV is the choice of an appropriate model giving a satisfactory prediction of the CLV for each customer. The Pareto/NBD model is considered a powerful technique to predict the future activity of a customer in a non-contractual relationship, using the Poisson distribution to model the number of transactions. However, the reliance on the single-parameter Poisson distribution limits its flexibility in many applications, where overdispersion or underdispersion of the data is a recurrent problem. We propose a model based on the COM-Poisson distribution, offering more flexibility to predict future customer transactions over time. Our proposed model fits the real data better but presents two principal shortcomings. Firstly, it requires the additional data π_x to predict customer transactions. Secondly, it presents similar computational challenges for parameter estimation. Our model offers more accuracy for the calculation of the CLV, and the user is responsible for choosing the appropriate model depending on the desired accuracy and the implementation difficulty.



Comparison of parameter optimization techniques for a music tone onset detection algorithm

Nadja Bauer (1), Julia Schiffner (1), Claus Weihs (1)

(1) Chair of Computational Statistics, Department of Statistics, TU Dortmund

ABSTRACT. Design of experiments is an established approach to parameter optimization for industrial processes. In many computer applications, however, it is usual to optimize the parameters via genetic algorithms or, recently, via sequential parameter optimization techniques. The main idea of this work is to analyse and compare parameter optimization approaches which are usually applied in industry with those applied for computer optimization tasks, using the example of a tone onset detection algorithm. The optimal algorithm parameter setting is sought in order to get the best onset detection accuracy. We vary essential options of the parameter optimization strategies, like the size and constitution of the initial designs, in order to assess their influence on the evaluation results. Furthermore, we test how the instrumentation and the tempo of music pieces affect the optimal parameter setting of the onset detection algorithm.

KEYWORDS: Sequential parameter optimization, Design of experiments, Tone onset detection

1. Introduction

Parameter optimization is an important issue in almost every industrial process or computer application. It is remarkable that the parameter optimization strategies applied in industry and in computer applications differ significantly. In industry, strong assumptions regarding the relationship between the target variable and the influential parameters are often made, and experimental designs are then used which fulfill special criteria (like A- or D-optimality). Many computer optimization approaches, in contrast, aim to cover the parameter space uniformly by heuristically generated designs (like Latin Hypercube Sampling designs). Furthermore, when planning trial series in industry many aspects are considered (like improving internal and external validity, identifying and controlling disturbing factors, or modeling interactions between the influential factors) which are often neglected when planning computer optimizations. The number of trials (or function evaluations) is also very different: while in industry often at most 100 trials are allowed, the number of function evaluations in computer optimization frequently exceeds ten thousand. The main idea of this work is to combine and compare industrial and computer-simulation-based parameter optimization techniques for the optimization of a music signal analysis algorithm. The tone onset detection algorithm, which we aim to optimize here, is presented in Section 2. Two important factors that can influence the optimization results are the optimization strategy and the music data set under consideration. The optimization strategy determines how trial points, where the function is evaluated, are selected. In Sections 3 and 4 we present a parameter optimization approach and define characteristics of industrial and computer-based parameter optimization which we will compare systematically. Also, we systematically vary characteristics of the music data in order to assess their influence on the evaluation results. Section 5 gives the procedure of the music data set generation. Section 6 presents the simulation results. Finally, Section 7 summarizes our work and provides points for future research.


2. Onset detection algorithm

A tone onset is the time point of the beginning of a musical note or other sound. Onset detection is an important step for music transcription and other applications like timbre or meter analysis. The algorithm we will use here is based on two simple approaches: in the first approach the amplitude slope and in the second approach the change of the spectral structure of an audio signal are considered as indicators for tone onsets. The ongoing audio signal is split up into windows of length L samples with an overlap of O per cent. In each window (starting with the second) two features are evaluated: the difference between amplitude maxima (F1) and the correlation coefficient between the spectra (F2) of the current and the previous window, respectively. Each of the vectors F1 and F2 is then rescaled into the interval [0, 1]. For each window a combined feature CombF is calculated as CombF = W · F1 + (1 − W) · F2, where the weight W ∈ [0, 1] is a further parameter, which specifies the influence of each feature on the sum. In [BAU 12b] we investigated further feature combination approaches, where this approach provided the best results. In order to assess, based on CombF, whether a window contains a tone onset, a threshold is required. We will use here a Q%-quantile of the CombF-vector as such a threshold, where Q is the fourth algorithm parameter. If the CombF-value for the current window, but neither for the preceding nor for the succeeding window, exceeds the threshold, an onset is detected in this window. If the threshold is exceeded in multiple, consecutive windows, we assume that there is only one onset, located in the window with the maximal CombF-value in this sequence. For each window with a detected onset its beginning and ending time points are calculated and the onset time is then estimated by the centre of this time interval. In this work we assume a tone onset to be correctly detected if the absolute difference between the true and the estimated onset time is less than 50 ms. As quality criterion for the goodness of the onset detection the so-called F-value is used here: $F = \frac{2c}{2c + f^{+} + f^{-}}$, where c is the number of correctly detected onsets, f^+ is the number of false detections and f^- denotes the number of undetected onsets. Note that the F-value always lies between 0 and 1. The optimal F-value is 1. The studied ranges of possible settings for the onset detection algorithm parameters are: L (window length in samples): 512, 1024 and 2048; O (overlap in per cent): 0-50 with step size 5; W (weight of the features): 0-1 with step size 0.05; Q (%-quantile): 1-30 with step size 1.
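For illustration, a minimal sketch of the feature combination CombF = W·F1 + (1 − W)·F2, the quantile threshold and the F-value; the toy feature values are invented and the helper names are not from the paper.

```python
import numpy as np

def combined_feature(f1, f2, w):
    """Rescale each feature vector to [0, 1] and form CombF = w*F1 + (1-w)*F2."""
    def rescale(v):
        v = np.asarray(v, dtype=float)
        return (v - v.min()) / (v.max() - v.min())
    return w * rescale(f1) + (1.0 - w) * rescale(f2)

def f_value(n_correct, n_false, n_missed):
    """F = 2c / (2c + f+ + f-), the evaluation measure used for onset detection."""
    return 2 * n_correct / (2 * n_correct + n_false + n_missed)

# toy usage: amplitude-difference and spectral-correlation features for 8 windows
comb = combined_feature([0.1, 0.9, 0.2, 0.3, 0.8, 0.1, 0.2, 0.7],
                        [0.2, 0.8, 0.1, 0.2, 0.9, 0.3, 0.1, 0.6], w=0.5)
threshold = np.quantile(comb, 0.25)   # Q%-quantile of CombF used as threshold (Q = 25 here)
print(comb > threshold)               # candidate onset windows
print(f_value(n_correct=10, n_false=2, n_missed=3))
```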

3. Sequential parameter optimization

An experimental design is a scheme that prescribes in which order which trial points are evaluated. One of our aims here is to compare classical parameter optimization, where all trial points are fixed in advance, with sequential parameter optimization, where a relatively small initial design is given and the next trial points are chosen according to the results of previous experiments. We consider a non-linear, multimodal black-box function f : R^k → R, x ↦ f(x) of k parameters. We aim to minimize f with respect to x. Let V ⊂ R^k denote the feasible parameter space. The following procedure of (sequential) parameter optimization is used; a code sketch follows the procedure.

1. Let D ⊆ V denote the initial experimental design with N_initial trial points and let Y = f(D) be the set of function values of points in D.
2. Repeat the following sequential step until the termination criterion is fulfilled:
   2.1 Generate a random number s from the distribution: P(s = 0) = p_0, P(s = 1) = 1 − p_0, 0 ≤ p_0 ≤ 1.
   2.1a If s = 0, fit a model M which models the relationship between D and the response Y = f(D). Find the next trial point d_next ∈ V which minimizes the model prediction.
   2.1b If s = 1, let D_sample ⊆ V denote a design with N_sample trial points from the parameter space V. For each point in D_sample calculate the Euclidean distance to all points in D and sum them up. The next trial point d_next is the point in D_sample which has the maximal sum of Euclidean distances.
   2.2 Evaluate y_next = f(d_next) and update D ← D ∪ d_next, Y ← Y ∪ y_next.
3. Return the optimal value y_best of the target variable Y and the associated parameter setting d_best.
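A compact sketch of steps 1-3 on a toy objective; the random-forest surrogate, the crude LHS construction and the budget values (25 initial points plus 57 sequential steps) are illustrative stand-ins for the models and designs discussed in the following sections.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)

def f(x):                        # toy black-box objective to minimise
    return np.sum((x - 0.3) ** 2, axis=-1)

def lhs(n, k):                   # crude Latin-Hypercube-style design on [0, 1]^k
    return np.stack([rng.permutation(n) + rng.random(n) for _ in range(k)], axis=1) / n

k, p0 = 4, 0.9
D = lhs(25, k); Y = f(D)                                        # step 1: initial design
for _ in range(57):                                             # step 2: sequential steps
    if rng.random() < p0:                                       # 2.1a: model-based proposal
        model = RandomForestRegressor(n_estimators=100).fit(D, Y)
        cand = lhs(2000, k)
        d_next = cand[np.argmin(model.predict(cand))]
    else:                                                       # 2.1b: distance-based exploration
        cand = lhs(500, k)
        dist = np.linalg.norm(cand[:, None, :] - D[None, :, :], axis=-1).sum(axis=1)
        d_next = cand[np.argmax(dist)]
    D = np.vstack([D, d_next]); Y = np.append(Y, f(d_next))     # 2.2: evaluate and update
print("best value:", Y.min(), "at", D[np.argmin(Y)])            # step 3
```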


The challenge in sequential design of experiments is to find the appropriate next trial point to evaluate. The major differences between the existing algorithms for sequential parameter optimization lie in step 2.1. A popular approach here is to just use step 2.1a : Fitting a user-chosen model M and calculating its prediction for a sequential design Dstep ⊆ V of size Nstep  Ninitial . The next trial point then is the point in Dstep with the best predicted value ([BAR 05]). However, choosing the next trial points in this way may lead to convergence to a local optimum of f . A suitable approach here might be to take into account not only the model prediction for each point in Dstep but also the distances of these points to already evaluated trial points. Such a methodology is already used in the case of Kriging models (expected improvement criterion, [JON 98]), which is unfortunately suitable exclusively for these models. Nevertheless, in order to consider the above mentioned distances of new points to already evaluated points we implement here a simple approach for the exploration of the parameter space : In step 2.1 the next trial point is chosen according to the model prediction (step 2.1a) with a user-defined probability p0 and according to the distance to already evaluated trial points (step 2.1b) with probability 1 − p0 . Our settings for the parameter optimization approach presented above are : Dstep is a Latin Hypercube Sampling (LHS) design (a design which covers the parameter space uniformly, [STE 87]) with Nstep = 20.000 points and Dsample is an LHS design with Nsample = 500. Note that, as described in step 2.1b, for each point in Dsample we calculate distances to the points in D, therefore Nsample should not be chosen too large. The termination criterion in step 2 is defined by the total number of evaluations (Ntotal ) of the function f . The probability p0 is set to 0.9. Details regarding the model M are discussed in Section 4. Further important issues here are the initial design and the number of sequential steps. We propose different settings for the initial designs using an experimental scheme, in which the size of the sequential design and its type are considered as control variables. For construction of initial designs information about the experimental parameters is required. According to Section 2 there are four parameters to be optimized : L, O, W and Q. As classical parameter optimization strategy we use here a full factorial design with 3 levels for each parameter (81 trial points). After evaluating f in these 81 points a so called verification step is conducted : We identify the next trial point (in this case just with step 2.1a) and evaluate f at this point. The total number of evaluations, Ntotal = 82, should not be exceeded by all further parameter optimization strategies in order to facilitate comparability. Other settings for the size of the initial designs are approx. one half and approx. one third of the evaluation budget (Ntotal ). Here we aim to investigate, whether and (if so) how the size of initial designs influences the optimization results. We consider two different types of initial designs : “textbook” designs, which fulfill special criteria, and LHS designs. LHS initial designs are commonly used in (sequential) parameter optimization of computer applications, while the “textbook” designs are often applied to optimization of industrial processes. We employ both in order to assess which leads to better results. 
Table 1 presents our parameter optimization strategies. We decided to implement two widely used “textbook”designs (in addition to the full factorial design mentioned above) : A central composite design with inner star ([WEI 99]) with 25 trial points and an orthogonal design with 48 trial points 1 . The size of the central composite design (k 2 + 1 + 2 · k) depends on the number of parameters k, which is here 4. The size of the orthogonal design depends on the number of parameters and the number of their levels. For the generation of the orthogonal design we use the R-package DoE.base ([GRO 11]) where for our number of parameters and number of levels (see Table 1) only a design with 48 trial points was possible. The disadvantage of most “textbook”-designs in comparison with LHS-designs is their inflexibility regarding the design size.

4. Model combination In step 2.1a in the sequential parameter optimization procedure in Section 3 both a single model and a combined model can be used. In this work we will use four model combination strategies that were introduced and investigated in [BAU 12a]. In the following we will briefly review the main ideas. Let us assume that m models M1 , M2 , . . . , Mm are given with response Y and design D which includes the settings of the influential parameters. For each model we first compute a model prediction accuracy criterion (10-fold cross-validated mean squared error) and then calculate model predictions for each point dj , j = 1, . . . , Nstep , of the sequential design Dstep . 1. We do not present the trial schemes for the initial designs.

30

Proceedings of MSDM 2013

strategy   initial design                                                          Ninitial   Nseq_step   Ntotal
Classic    3^4 full factorial design                                               81         1           82
Orth       orthogonal design with 3 (for L) or 4 (for O, W and Q) factor levels    48         34          82
Centr      central composite design with inner star                                25         57          82
LHS81      LHS design                                                              81         1           82
LHS48      LHS design                                                              48         34          82
LHS25      LHS design                                                              25         57          82

Table 1. Strategies of (sequential) parameter optimization where Nseq_step is the number of sequential steps

As first model combination method we will use the weighted average approach (WeightAver) : For each point dj the weighted sum of the m model predictions is calculated, where the model weights are defined by the associated values of the prediction accuracy criterion. In each sequential step the next evaluation is done at that point dj which has the best weighted sum of predictions. In the second combination approach (BestModel) we will just choose the best model according to the model prediction accuracy criterion. Then the function f is evaluated at that point dj which has the best model prediction value. The third combination method (Best2Models) is similar to the second method but in each step we evaluate two points according to the predictions of the two best models. We take care that we do not carry out more function evaluations than allowed (see the termination criterion in Section 3). In the last model combination approach we determine for each model the ten trial points with the best predicted values (Best10). Here for each point dj we do not only consider the model predictions with associated accuracy criteria but also the number of models for which this point has one of the ten best predicted values. The core idea is to prefer points which belong to the best predictions of many models at the same time. For more details see [BAU 12a].
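A sketch of the WeightAver idea on toy numbers; the paper does not spell out the weight formula, so inverse-MSE weights are one plausible convention assumed here.

```python
import numpy as np

# Hypothetical cross-validated MSEs and predictions of m = 3 surrogate models on
# three candidate points of D_step; lower MSE receives a larger weight.
cv_mse = np.array([0.8, 0.5, 1.2])
preds  = np.array([[0.31, 0.28, 0.40],     # model 1 predictions
                   [0.30, 0.25, 0.38],     # model 2 predictions
                   [0.35, 0.30, 0.41]])    # model 3 predictions

weights = (1.0 / cv_mse) / (1.0 / cv_mse).sum()   # accuracy-based weights
weighted = weights @ preds                        # WeightAver prediction per candidate
next_point = np.argmin(weighted)                  # evaluate f at the best weighted sum
print(weights, weighted, next_point)
```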

5. Data base Since one of the aims of this work is to determine the influence of the music signal characteristics on the optimal parameter settings of the onset detection algorithm, we designed a special music data set. There are many characteristics which describe a music signal like tempo, genre, instrumentation or sound volume. We consider only the instrumentation and the tempo as control variables when designing the data set. The special characteristic of this data set is that the same tone sequences are recorded by different music instruments with different tempo settings, so that we can explicitly measure the influence of these two control variables on the optimal parameter settings of the onset detection algorithm. As we need the information about the true onset times and in order to vary the tempo and instrumentation of tone sequences we will work with MIDI-files 2 . However, the MIDI-files are not converted to WAVE-files 3 using synthetic tones (which is the case for most free and commercial converter programs), but using a specially developed program, which employs recordings of real tones for the WAVE-file generation 4 . The challenge here is finding such music pieces, which can be played by all music instruments under consideration. We consider in our work six music instruments (guitar, piano, flute, clarinet, trumpet and violin) with a common pitch range of 17 tones (from C4 to E5 or, in MIDI-Coding, from 60 to 76). We found two German folk songs, which fulfill the tone range condition : S1 5 with 122 tone onsets from the tone interval [60, 76] and S2 6 with 138 tone onsets from the tone interval [65, 74]. The tempo of a music piece can be measured by Beats Per Minute 2. 3. 4. 5. 6.

http ://www.midiworld.com/basics/, date 01.06.2012. http ://www.sonicspot.com/guide/wavefiles.html, date 01.06.2012. This program is introduced in detail by [BAU 12b]. http ://www.ingeb.org/Lieder/haidschi.mid, date 01.06.2012. http ://www.ingeb.org/Lieder/esgetein.mid, date 01.06.2012.

31

Proceedings of MSDM 2013

(BPM). We will set the tempo for each piece to 90 BPM (classical tempo marking : andante) and 200 BPM (classical tempo marking : presto). The sampling rate of the recordings is set to 44100 Hz. The names of the music signals follow the pattern : S1tempo _instrument (for example : S190 _piano ). The total number of music pieces in the data set is 24 (2 tempi, 2 music pieces and 6 instruments).

6. Results In order to compare different (sequential) parameter optimization approaches (see Section 3) we generate an experimental scheme with the following three meta-parameters : model type, model combination type and initial design 7 . The meta-parameter model type determines the model which describes the relationship between the onset detection algorithm parameters (L, O, W and Q) and the target variable (F -value, see Section 2). We employ six model types : A full second order model (FSOM, R-package rsm), Kriging (KM, R-package DiceKriging), random forests (RF, R-package randomForest), support vector machines (SVM, R-package kernlab), neural networks (NN, R-package nnet) and the combination of these five models (COMB). The second meta-parameter – model combination type – is just meaningful for the sixth model type and has four options (see Section 4) : weighted average (WeightAver), best model (BestModel), best two models (Best2Models) and best ten points (Best10). The last meta-parameter – initial design – is related to the initial designs of the optimization strategies and has six levels : Three “textbook”-designs with different numbers of trial points and three associated Latin Hypercube Sampling designs (see Table 1). Note that for the initial designs Classic and LHS81 the model combination approach Best2Models is not possible, because in these cases the number of function evaluations (83) would exceed the experimental budged (82 evaluations). For each optimization strategy and for each music piece the evaluation is carried out ten times. This is done in order to average out the influence of chance on the outcome. We actually have a maximization problem here, the sign of F will be reversed hence to get a minimization problem (see Section 3). As we aim to know for each music piece and for each optimization strategy, how close the estimated optima and the true optima are to each other, we find the true optima using a time-consuming grid search. The full factorial design Grid consists – according to the ranges of the possible parameter settings (see the last paragraph of Section 2) – of 20790 trial points. For each of 24 music pieces a vector of function values for the Grid -design is computed. In order to assess the goodness of the optimization strategies the following procedure is conducted : – Let i denote the index of a music song, i = 1, ..., 24 : i i – let qi be the 99%-quantile of the vector YGrid , where YGrid is the vector of function values for the Grid design for the i-th song, – determine the number of replications (nri ) of the current optimization strategy, in which an F -value that exceeds the qi -value was found, – compute the relative frequency of these “successful” replications by freq i = nri /10, P24 1 – compute the goodness of the optimization strategy by 24 · i=1 freq i . Further we look at the number of function evaluations which are required by each optimization strategy for finding an F -value close to the true optimum. For this purpose we conduct the following procedure for all optimization strategies with initial designs of size 25 and 48 : – Let i denote the index of a music song, i = 1, . . . , 24 : i – let qi be the 99%-quantile of the vector YGrid , – determine for each replication j, j = 1, . . . , 10, of the current optimization strategy the number of function evaluations which were sufficient to find an F -value that exceeds the qi -value. Collect these numbers into the vector NRi = (nr1 , nr2 , . . . 
, nr10 )0 , P10 1 · j=1 nrj , – compute the mean of vector NRi : NRi = 10 – calculate the mean and the standard deviation (sd) of the vector (NR 1 , NR 2 , . . . , NR 24 )0 7. We use the programming language R (version 2.15.0, [R C 12]) for calculation.
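The goodness measure described above can be computed as follows; the array layout and function name are assumptions of this sketch.

```python
import numpy as np

def strategy_goodness(found_f, grid_f, n_songs=24, q=0.99):
    """found_f[i, j]: best F-value of replication j on song i for one strategy;
    grid_f[i]: all grid-search F-values for song i (used for the 99% quantile q_i)."""
    freq = np.empty(n_songs)
    for i in range(n_songs):
        q_i = np.quantile(grid_f[i], q)
        freq[i] = np.mean(found_f[i] > q_i)   # share of "successful" replications
    return freq.mean()                         # goodness: average over the 24 songs
```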

32

Proceedings of MSDM 2013

model type FSOM KM RF SVM NN FSOM KM RF SVM NN FSOM KM RF SVM NN COMB COMB COMB COMB COMB COMB COMB COMB COMB COMB COMB

model combination type WeightAver BestModel Best10 WeightAver BestModel Best2Models Best10 WeightAver BestModel Best2Models Best10

initial design

[ID] goodness

mean (sd)

Classic Classic Classic Classic Classic Centr Centr Centr Centr Centr Orth Orth Orth Orth Orth Classic Classic Classic Centr Centr Centr Centr Orth Orth Orth Orth

[1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] [16] [17] [18] [19] [20] [21] [22] [23] [24] [25] [26]

43.65 (28.62) 37.34 (14.55) 39.61 (15.79) 41.41 (15.33) 42.11 (18.40) 29.07 (19.96) 45.12 (22.99) 40.49 (22.00) 44.03 (22.39) 44.25 (22.66) 45.50 (17.09) 37.17 (12.35) 36.25 (11.62) 37.61 (12.28 48.93 (25.33) 46.17 (22.84) 44.17 (21.68) 43.77 (21.56)

0.3875 0.3958 0.3958 0.3917 0.3833 0.2875 0.9125 0.7167 0.7458 0.6542 0.4667 0.8250 0.6333 0.7000 0.7208 0.3958 0.3917 0.3833 0.6750 0.9375 0.9167 0.9625* 0.7083 0.8500 0.8625 0.8792

initial design LHS81 LHS81 LHS81 LHS81 LHS81 LHS25 LHS25 LHS25 LHS25 LHS25 LHS48 LHS48 LHS48 LHS48 LHS48 LHS81 LHS81 LHS81 LHS25 LHS25 LHS25 LHS25 LHS48 LHS48 LHS48 LHS48

[ID] goodness

mean (sd)

[27] [28] [29] [30] [31] [32] [33] [34] [35] [36] [37] [38] [39] [40] [41] [42] [43] [44] [45] [46] [47] [48] [49] [50] [51] [52]

30.78 (8.22) 29.71 (5.03) 30.19 (7.52) 34.76 (9.05) 34.34 (7.57) 34.85 (8.98) 41.16 (6.06) 38.31 (7.09) 39.72 (9.12) 39.37 (9.38) 31.70 (5.62) 30.94 (5.17) 30.58 (4.07) 30.91 (4.74) 39.14 (7.11) 40.52 (6.68) 42.38 (5.94) 41.23 (6.23)

0.6292 0.7042 0.6542 0.6375 0.6583 0.6292 0.9625* 0.7125 0.7958 0.8375 0.6500 0.9875* 0.7708 0.8125 0.8083 0.6417 0.6708 0.6792 0.8458 0.9375 0.9501* 0.9585* 0.8625 0.9458 0.9625* 0.9667*

Table 2. Frequency of finding an F -value close to the optimum by the parameter optimization strategies under consideration (the strategies whose goodness-values exceed 0.95 are marked with an asterisk)

Table 2 shows the goodness values as well means and standard deviations mentioned above for the parameter optimization strategies. The strategies whose goodness-values exceed 0.95 are marked with an asterisk. One of the most important findings when considering Table 2 is that the strategies with LHS-initial designs (in almost all cases) achieve better goodness-measures than the associated strategies with “textbook”-initial designs. The best result is achieved by the strategy with ID 38, a single Kriging model with initial design LHS48 , and the second best result is given by the strategy with ID 52, a combined model (Best10) with initial design LHS48 . When considering only the “textbook”-initial designs we observe in contrast to LHS-initial designs that firstly a model combination approach (Best10, ID 22) is better than the best single model (Kriging, ID 7), and secondly that initial designs of size 25 seem to be better than designs of size 48. For “textbook” as well LHS-initial designs it is obvious that the classical optimization strategies (where 81 of 82 design points are fixed in advance) are considerably worse than the sequential optimization strategies. Further, according to Table 2 all LHS-initial design strategies (with exception of the strategy with model type FSOM and initial design Orth) are faster than the associated “textbook”-design strategies (by 5.8 evaluations on average). Frequently the mean number of function evaluations does not exceed the size of the associated initial design. This is caused by the fact that in many cases a sufficiently large F -value has been already achieved in the initial design. The standard deviations (of the number of steps) of the LHS-initial

33

Proceedings of MSDM 2013

designs is 1.6 to 3.7 times smaller than the standard deviations of the associated “textbook”-designs. This seems to be a further advantage of LHS-initial designs. We abstain here from analysis of the optimal parameter settings of the onset detection algorithm.

7. Conclusion In the following we will summarize our work. Different strategies for sequential parameter optimization were compared on the basis of an algorithm for tone onset detection. We systematically tested the influence of initial design characteristics and model types on different goodness-measures of the optimization strategies. The LHSinitial designs yield the better results both by achieving a sufficiently good value of the target variable and by their “speed” in comparison with the “textbook”-designs. Furthermore we noticed for the LHS-initial design strategies that by using initial designs with 48 trial points (approx. one half of the evaluation budget) slightly better goodnessvalues (according to Table 2) could be achieved in comparison with the initial designs of smaller size (25 trial points, approx. one third of the evaluation budget) but that the number of necessary function evaluations for finding a sufficiently large F -value rises by about 5 evaluations. Regarding the meta-parameter model we can see that the best model is Kriging (KN), but that the model combination approaches Best2Models and Best10 also perform well. Nevertheless, these model combination approaches are more time-consuming than Kriging since the model prediction accuracies have to be calculated in each sequential step. Please note that it might not be unproblematic to generalize the above results. This is because we used a very specific data base, which in fact increases the internal validity of our study (regarding the analysis of best parameter settings) but reduces the external validity. For our further research it is important, on the one hand, to apply the different parameter optimization strategies defined here on other real or artificial optimization problems. On the other hand, we have to investigate the properties of the proposed onset detection algorithm by applying it to a wider range of data sets. For this reason it is important firstly to define interesting music characteristics (e.g. monophony/polyphony, instrumentation, classic/modern, slow/fast) and then find appropriate music pieces for each combination of these characteristics. Furthermore, a parameter optimization of a more complex music signal analysis algorithm like an algorithm for music transcription is planned. In this case, however, a multi-objective parameter optimization will be required.

8. References [BAR 05] BARTZ -B EIELSTEIN T., L ASARCZYK C., P REUSS M., “Sequential Parameter Optimization”, M C K AY B., Ed., Proceedings 2005 Congress on Evolutionary Computation (CEC’05), vol. 1, Piscataway NJ : IEEE Press, Edinburgh, 2005, p. 773-780. [BAU 12a] BAUER N., S CHIFFNER J., W EIHS C., “Comparison of classical and sequential design of experiments in note onset detection”, Studies in Classification, Data Analysis, and Knowledge Organization, Berlin Heidelberg, 2012, Springer, Accepted. [BAU 12b] BAUER N., S CHIFFNER J., W EIHS C., “Einfluss der Musikinstrumente auf die Güte der Einsatzzeiterkennung”, Discussion Paper num. 10/2012, 2012, SFB 823, TU Dortmund. [GRO 11] G ROEMPING U., “Relative projection frequency tables for orthogonal arrays”, report , 2011, Reports in Mathematics, Physics and Chemistry 1/2011. [JON 98] J ONES D., S CHONLAU M., W ELCH W., “Efficient global optimization of expensive black-box functions”, J. Global Optimization, vol. 13, 1998, p. 455–492. [R C 12] R C ORE T EAM, “R : A Language and Environment for Statistical Computing”, R Foundation for Statistical Computing, Vienna, Austria, 2012, ISBN 3-900051-07-0. [STE 87] S TEIN M., “Large Sample Properties of Simulations Using Latin Hypercube Sampling.”, Technometrics, vol. 29, 1987, p. 143–151. [WEI 99] W EIHS C., J ESSENBERGER J., Statistische Methoden zur Qualitätssicherung und -optimierung in der Industrie, Wiley-VCH, Weinheim, 1999.

34

Proceedings of MSDM 2013

A new approach of One Class Support Vector Machines for Detecting Abnormal Wafers in Semi-conductor Ali Hajj Hassan1,2 , Sophie Lambert-Lacroix2 , Francois Pasqualini1 1

STMicroelectronics, Crolles, France UJF-Grenoble 1 / CNRS / UPMF / TIMC-IMAG UMR 5525, Grenoble, F-38041, France 2

ABSTRACT. In this paper we propose a new approach for fault detection in the semiconductor domain. This approach is based on One Class Support Vector Machines (OC-SVM). OC-SVM is a well known tool of one class classification method used for outlier detection. We used OC-SVM in a real world scenario where no labels are available. In addition we propose a new filter method for feature selection in order to improve the performance of OC-SVM model. The efficacy is demonstrated by applying this approach on industrial real-time Semiconductor data set. Further, we compared this approach to Robust Principal Component Analysis (ROBPCA), an alternative method for fault detection. We showed that the proposed algorithm outperforms ROBPCA, and can provide an effective and efficient way for fault detection. KEYWORDS:

Fault detection, Wafers, Outlier detection, OC-SVM, feature selection, ROBPCA.

1. Introduction In semiconductor manufacturing, wafer fault detection is a key step to guarantee to customers the required quality level. The process flow of manufacturing is a very long and complex, requiring a high quality control. At the end of the process, the Parametric Test (PT) step is performed. This allows detection, in the shortest possible time, of abnormal wafers using electrical measurements. Actually, the used approach consists on evaluating each parameter individually with respect to its specification limits. This method is a univariate approach which gives a high false positive rate while the real proportion of rejection is much lower. The aim of our study is to develop a multivariate approach for fault detection based on statistical learning in order to obtain high detection rate of abnormal wafers and decrease the false alarms rate. In the following text, abnormal wafers will be referred as outliers and normal wafers as targets. The problem of detection of abnormal wafers was already reported in [MAH 09] and [MNA 08]. OC-SVM and Principal Component Analysis (PCA) were respectively used as multivariate outlier detection approaches. In both studies, the model was fitted on a training set containing only normal wafers. Also, PCA model was unable to capture non-linearity in the case where data were not linearly separable. To overcome those restrictions, we tried to use OC-SVM in a real world scenario, where information about targets and outliers are unavailable. We learned the model using available wafers (both targets and outliers). We then checked if wafers, classified by the model as outliers, were targets or outliers in reality. OC-SVM could be used to handle non-linear cases with the help of kernel functions. We improved the performance of OC-SVM model by introducing a new filter method for feature selection. In Section 2 of the paper, we introduce the OC-SVM classifier, then we present the description and motivation for developping the filter method for feature selection in Section 3. We briefly talk about the ROBPCA method [HUB 05] for outlier detection in Section 4. Finally, Section 5 investigates the performance and robustness of our strategy through industrial real-time Semiconductor data set and compare it to ROBPCA. 35

Proceedings of MSDM 2013

2. One-class Support Vector Machines Support Vector Machines (SVM) were introduced by Vapnik [VAP 95] as an effective learning algorithm that operates by finding an optimal hyperplane to separate the two classes of training data. The basic idea behind SVM is to non-linearly map the input space of the training data into a higher dimensional feature space corresponding to an appropriate kernel function. SVM then finds a linear separating hyperplane in the feature space by maximizing the distance, or margin, to the separating hyperplane of the closest training data from the two classes (see Fig 1(a)).

Figure 1. Geometry interpretation of SVM-based classifiers : (a) Two-class SVM classifier. (b) One-class SVM classifier.

One-class classification is a less well-known extension of classification, used to address outlier detection problems. An extension of SVM (OC-SVM) was subsequently proposed by [SCH 01] to handle one-class classification by estimating the support of a high-dimensional distribution. The strategy is to separate the training data (positive examples) from the origin (considered as negative examples) with maximum margin in the feature space. Under this strategy, a bounding hypersphere is computed around the training data in a way that captures most of these data while minimizing its volume (see Fig 1(b)). Scholkopf et al. [SCH 01] formulated the one-class SVM approach as follows. Consider a training set {xi}, i = 1, . . . , n, xi ∈ R^p, and suppose it is distributed according to some unknown underlying probability distribution P. We want to know if a test example x is distributed according to P or not. This can be done by determining a region R of the input space X such that the probability that a test point drawn from P lies outside of R is bounded by some a priori specified value ν ∈ (0, 1). This problem is solved by estimating a decision function f which is positive on R and negative elsewhere. A non-linear function φ : X → F maps a vector x from the input vector space X, endowed with an inner product, to a Hilbert space F termed the feature space. In this new space, the training vectors follow an underlying distribution P′, and the problem is to determine a region R′ of F that captures most of this probability mass. In other words, the region R′ corresponds to the part of the feature space where most of the data vectors lie. To separate as many as possible of the mapped vectors from the origin in the feature space F, we construct a hyperplane H(w, ρ) in feature space defined by

H(w, ρ) = ⟨w, φ(x)⟩ − ρ,    (1)

where w is the weight vector and ρ the offset. The maximum margin from the origin is found by solving the following quadratic optimization problem:

min_{w,ρ,ξ1,...,ξn}  (1/2)‖w‖² + (1/(νn)) Σ_{i=1}^{n} ξi − ρ
subject to  ⟨w, φ(xi)⟩ ≥ ρ − ξi,  ξi ≥ 0,    (2)

where ρ/‖w‖ specifies the distance from the decision hyperplane to the origin; the ξi are so-called slack variables that penalize the objective function but allow some of the points to be on the wrong side of the hyperplane, i.e. located between the origin and H(w, ρ) as depicted in Fig. 1. ν is a parameter that controls the tradeoff between maximizing the distance from the origin and containing most of the data in the region created by the hyperplane. It was proved in [SCH 01] that ν is an upper bound on the fraction of outliers and also a lower bound on the fraction of support vectors. Lagrange multipliers αi, βi ≥ 0 are introduced and the Lagrangian is formed:

L(w, ξ, ρ, α, β) = (1/2)‖w‖² + (1/(νn)) Σ_{i=1}^{n} ξi − ρ − Σ_{i=1}^{n} αi (⟨w, φ(xi)⟩ − ρ + ξi) − Σ_{i=1}^{n} βi ξi.    (3)

Setting the partial derivatives of the Lagrangian to zero gives

∂L/∂w = 0  →  w = Σ_{i=1}^{n} αi φ(xi),
∂L/∂ξi = 0  →  αi = 1/(νn) − βi ≤ 1/(νn),
∂L/∂ρ = 0  →  Σ_{i=1}^{n} αi = 1.    (4)

By replacing (4) in (3), the solution to the problem is equivalent to the solution of the Wolfe dual problem:

min_{α1,...,αn}  (1/2) Σ_{i,j=1}^{n} αi αj ⟨φ(xi), φ(xj)⟩
subject to  0 ≤ αi ≤ 1/(νn),  Σ_i αi = 1,    (5)

and the corresponding decision function is

f(x) = sgn(⟨w, φ(x)⟩ − ρ) = sgn( Σ_{i=1}^{n} αi ⟨φ(xi), φ(x)⟩ − ρ ).    (6)

All training data vectors xi for which f(xi) ≤ 0 are called support vectors; these are the only vectors for which αi ≠ 0. Notice that in (5) only inner products between data are considered; for certain particular maps φ, there is no need to actually compute ⟨φ(xi), φ(xj)⟩. The inner product can be derived directly from xi and xj by means of the so-called 'kernel trick'. The main property of functions satisfying Mercer's conditions is that they implicitly define a mapping from X to a Hilbert space F such that K(xi, xj) = ⟨φ(xi), φ(xj)⟩, and thus they can be used in algorithms using inner products. Accordingly, the hyperplane (1) in feature space F becomes a non-linear function in the input space X:

f(x) = sgn( Σ_{i=1}^{n} αi k(xi, x) − ρ ),    (7)

where

ρ = ⟨w, φ(xi)⟩ = Σ_{j=1}^{n} αj k(xj, xi)  for any i such that 0 < αi < 1/(νn).
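In practice, the formulation above is what off-the-shelf one-class SVM implementations solve. The following minimal sketch (not the authors' exact pipeline) shows how such a model with an RBF kernel can be fitted on all wafers at once, as done in this study, and how the sign of the decision function flags outliers; the array X and the values of ν and γ are illustrative assumptions.

```python
# Minimal sketch: fit a one-class SVM on the full, unlabeled lot and flag outliers.
# `X` stands for the wafer measurements (wafers x parameters); values are placeholders.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))                 # placeholder for the parametric-test data

model = OneClassSVM(kernel="rbf", nu=0.1, gamma="scale")   # nu bounds the outlier fraction
model.fit(X)                                   # trained on targets and outliers mixed

pred = model.predict(X)                        # +1 = target (normal wafer), -1 = outlier
print(f"{np.sum(pred == -1)} wafers flagged as abnormal")
```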

To summarize, our proposed approach for fault detection is a two-step methodology :
– Feature selection using the MADe method.
– Applying OC-SVM in a real world scenario on the selected feature subset and then detecting abnormal wafers.
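The MADe filter relies on the robust scale estimate MADe = 1.483 × MAD discussed in [BUR 01]. The sketch below is only one plausible reading of the filter, assuming each parameter is scored by the fraction of wafers falling outside median ± 2·MADe and kept when that fraction exceeds the threshold θ; it is not the exact criterion of Eqs. (12)-(13), and all names are illustrative.

```python
# Hypothetical MADe-style univariate filter: score each parameter by its 2*MADe outlier
# rate and keep the parameters whose rate exceeds theta. This is an assumed reading of
# the filter, not the paper's exact Eqs. (12)-(13).
import numpy as np

def made_filter(X, theta=0.01):
    med = np.median(X, axis=0)
    made = 1.483 * np.median(np.abs(X - med), axis=0)        # robust scale per parameter
    made = np.where(made == 0, np.finfo(float).eps, made)    # guard against zero spread
    rate = (np.abs(X - med) > 2 * made).mean(axis=0)         # fraction of flagged wafers
    return np.where(rate > theta)[0]                          # indices of kept parameters

X = np.random.default_rng(2).normal(size=(100, 10))
print(made_filter(X, theta=0.01))
```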

4. Robust Principal Component Analysis
Principal Component Analysis is a popular statistical method which tries to explain the covariance structure of data by means of a small number of components. Because PCA is concerned with data reduction, it is widely used for the analysis of high-dimensional data, which are frequently encountered in the semiconductor domain. Unfortunately, data reduction based on PCA becomes unreliable if outliers are included. The goal of robust PCA is to obtain principal components that are not influenced much by outliers. For a detailed description of ROBPCA, the interested reader is referred to [HUB 05]. Fault detection methods based on PCA have received particular attention and have been widely used for industrial process monitoring. The principle of this approach is to use PCA to model the behavior of the normal wafers. Abnormal wafers are then detected by comparing the observed behavior with that given by the PCA model. Since our data set contains outliers, we train the model using ROBPCA. The data are projected onto this model and the distance metric is calculated. If the metric exceeds the threshold based on a predefined confidence level (1 − α), the sample is categorized as an outlier; otherwise it is normal. The T² statistic is used as the metric for fault detection. For further details on PCA and ROBPCA fault detection, readers are referred to [NOM 94] and [HUB 05].
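Implementations of ROBPCA are available in MATLAB/R robust-statistics toolboxes; the short sketch below only illustrates the monitoring step itself, substituting classical PCA to show how the T² statistic and a (1 − α) control limit are applied to the projected data. The data, the number of retained components and the chi-square limit are illustrative assumptions, not part of the ROBPCA algorithm.

```python
# Illustrative T^2 monitoring on PCA scores (classical PCA used as a stand-in for ROBPCA).
import numpy as np
from scipy.stats import chi2
from sklearn.decomposition import PCA

X = np.random.default_rng(1).normal(size=(500, 20))   # placeholder wafer data

k = 5                                                  # retained components (user-chosen)
pca = PCA(n_components=k).fit(X)
scores = pca.transform(X)
t2 = np.sum(scores**2 / pca.explained_variance_, axis=1)   # Hotelling T^2 of each sample

alpha = 0.01
limit = chi2.ppf(1 - alpha, df=k)                      # simple (1 - alpha) control limit
print(f"{np.sum(t2 > limit)} samples exceed the T^2 limit")
```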

5. Industrial data case study
5.1. Experimentation and Performance Measures
Our experimental goal was to assess the ability of OC-SVM to detect the outliers among wafers. It is also important to minimize the false alarm rate, as false alarms cause unwarranted interruptions in plant operation. In order to evaluate the robustness of OC-SVM, we illustrate this method on a real industrial semiconductor data set. We also compare the results of OC-SVM with the ROBPCA fault detection method. The data set consists of 1616 wafers; each wafer is described by 84 electrical parameters and each parameter is measured on 9 sites of the wafer. Therefore the dimension of the variable space is equal to 9 × 84 = 756. Amongst the 1616 wafers, 23 are faulty (i.e. considered as outliers at PT). As mentioned earlier, we do not have any prior information about the wafers (target or outlier), so we built the OC-SVM and ROBPCA models using all wafers. For OC-SVM we use an RBF kernel (9) and experimented with different values of the ν and γ parameters. The threshold θ of the feature selection method was set to 1%, as proposed in (12). In future work, we will evaluate in detail the influence of the γ and θ parameters on the performance of our proposed approach. In order to compare our approach with ROBPCA, we selected the optimal parameters giving the best results for each of the two methods. To evaluate and compare the results obtained by the two methods, we used two performance criteria: detection rate and false alarm rate. The detection rate (sensitivity) is the percentage of outliers detected by our model, while the false alarm rate (100 − specificity) is the percentage of targets considered as outliers by our model.
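The two criteria can be computed directly from the predictions and the wafers known to be faulty at PT; a small sketch, assuming y_true marks outliers with 1 and targets with 0 and y_pred is the model output in the same coding:

```python
# Detection rate (sensitivity) and false alarm rate (100 - specificity), in percent.
import numpy as np

def detection_and_false_alarm_rates(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    outliers, targets = y_true == 1, y_true == 0
    detection_rate = 100.0 * np.mean(y_pred[outliers] == 1)
    false_alarm_rate = 100.0 * np.mean(y_pred[targets] == 1)
    return detection_rate, false_alarm_rate

print(detection_and_false_alarm_rates([1, 1, 0, 0, 0], [1, 0, 0, 1, 0]))  # (50.0, 33.33...)
```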

5.2. Results
Our main goal is to prove the efficiency of the OC-SVM approach for fault detection and its superiority over the ROBPCA approach. Ideally, we want high sensitivity (to detect most of the abnormal wafers) and a low false alarm rate (to avoid mistakenly classifying normal wafers as abnormal). Firstly, we present the best results obtained by training OC-SVM classifiers using different parameter configurations. We can see clearly in Tab 1 that, as the value of γ decreases, the Detection Rate increases and the False Alarms Rate is reduced. Hence, we chose to retain γ/4 as the kernel parameter value.

ν      Kernel parameter   Detection Rate %   False Alarms Rate %
0.13   γ/4                78.26              12.05
0.13   γ/2                73.91              12.30
0.13   γ                  65.22              12.18
0.16   γ/4                86.96              15.00
0.16   γ/2                82.61              15.06
0.16   γ                  73.91              15.44
0.18   γ/4                95.65              16.95
0.18   γ/2                95.65              16.95
0.18   γ                  78.26              16.57

Table 1. Performance of OC-SVM for different parameter settings using an RBF Gaussian kernel

Secondly, we present the comparison of all three methods in the form of ROC curves. A ROC curve shows the trade-off between Detection Rate and False Alarms Rate when varying a free parameter of
the method. We also give the best performance results for each method in Table 2. Fig 2 and Tab 2 show that OC-SVM outperforms ROBPCA, even without feature selection. The performance of OC-SVM is further improved by applying feature selection (OC-SVM.FS). As illustrated in Tab 2, OC-SVM.FS leads to significant improvements in the two performance criteria.


Figure 2. ROC curves for OC-SVM and OC-SVM with feature selection varying ν. Also plotted is the ROC curve for ROBPCA varying α.

Method       Detection Rate (%)   False Alarms Rate (%)
ROBPCA       60.87                14.19
ROBPCA       73.91                17.33
ROBPCA       78.26                19.21
OC-SVM       78.26                12.05
OC-SVM       86.96                15.00
OC-SVM       95.65                16.95
OC-SVM.FS    91.30                8.66
OC-SVM.FS    95.65                12.74
OC-SVM.FS    100                  33.96

Table 2. Best performances of each of the three methods.

6. Conclusion
We propose a new approach for fault detection based on OC-SVM and a univariate filter method for feature selection. We tested and validated this approach on an industrial real world data set. Our results show that our technique can detect most of the abnormal wafers with a considerable reduction in the false alarm rate and an increased fault detection rate compared to ROBPCA. Models obtained from OC-SVM depend on the kernel parameter γ and the parameter ν, which can be set to predefined values. The fault detection results are not very sensitive to the kernel parameter, hence fine-tuning of this parameter is not required. ν controls the false alarm rate, so one can set it so as not to exceed a specific rate of false alarms. In contrast, most projection techniques need the number of principal components to be pre-specified to build the model. Moreover, in ROBPCA we have to define the values of five parameters. Overall, we believe that our approach based on OC-SVM is promising for fault detection.

7. References
[BUR 01] BURKE S., “Missing values, outliers, robust statistics and non-parametric methods”, Statistics and Data Analysis, vol. 2002, 2001.
[CHI 11] CHANG C.-C., LIN C.-J., “LIBSVM : a library for support vector machines”, 2011.
[HUB 05] HUBERT M., ROUSSEEUW P., VANDEN BRANDEN K., “ROBPCA : a new approach to robust principal component analysis”, Technometrics, vol. 47, 2005, p. 64–79.
[MAH 09] MAHADEVAN S., SHAH S., “Fault detection and diagnosis in process data using one-class support vector machines”, Journal of Process Control, vol. 19, 2009, p. 1627–1639.
[MAL 11] MALDONADO S., WEBER R., BASAK J., “Simultaneous feature selection and classification using kernel-penalized support vector machines”, Information Sciences, vol. 181, 2011, p. 115–128.
[MNA 08] MNASSRI B., ANANOU B., EL ADEL E., OULADSINE M., GASNIER F., “Détection et localisation de défauts des wafers par des approches statistiques multivariées et calcul des contributions”, Conférence Internationale Francophone d’Automatique (CIFA), 2008.
[NOM 94] NOMIKOS P., MACGREGOR J., “Monitoring batch processes using multiway principal component analysis”, AIChE Journal, vol. 40, 1994, p. 1361–1375.
[SCH 01] SCHOLKOPF B., PLATT J., SHAWE-TAYLOR J., SMOLA A. J., WILLIAMSON R. C., “Estimating the support of a high-dimensional distribution”, Neural Computation, vol. 13, 2001, p. 1443–1471.
[VAP 95] VAPNIK V., The Nature of Statistical Learning Theory, Springer, New York, 1995.


Robust ensemble feature selection based on reliability assessment
Afef Ben Brahim (1), Mohamed Limam (1,2)
(1) LARODEC, ISG, University of Tunis
(2) Dhofar University, Oman

ABSTRACT. Feature selection is an important and frequently used technique in data preprocessing for performing data mining on large scale data sets. However, for this type of data, feature selection can sometimes give very different results, especially if the number of samples is very small. Recently, the concept of ensemble feature selection has been introduced to help resolve this problem. Multiple feature selections are combined in order to produce more stable feature lists and better classification results. However, one of the most critical decisions when performing ensemble feature selection is the aggregation technique used to combine the feature lists resulting from the multiple algorithms into a single decision for each feature. In this paper, we propose a robust feature aggregation technique to combine the results of three different filter methods. Our aggregation technique is based on measuring each feature selection algorithm's confidence and conflict with the other algorithms in order to assign a reliability factor guiding the final feature selection. Experiments on a high dimensional data set show that the proposed approach outperforms the single feature selection algorithms in terms of classification performance.

KEYWORDS: Feature selection, high dimensional data, classification, robust aggregation, reliability.

1. Introduction
In many real world situations, we are increasingly faced with problems characterized by a large number of features, not all of which are relevant to the problem at hand. Feeding learning algorithms with all the features may cause serious problems to many machine learning algorithms with respect to scalability and learning performance. Therefore, feature selection is considered to be one of the current challenges in statistical machine learning for high-dimensional data. Feature selection reduces the dimensionality of the feature space, mitigating the well known curse of dimensionality and the related sensitivity of a method to variations in the training set [KAL 07]. Feature selection is a process consisting of choosing a subset of original features so that the feature space is optimally reduced according to a certain evaluation criterion. The recent increase of dimensionality of data poses a severe challenge to many existing feature selection methods with respect to efficiency and effectiveness. Feature selection methods can be divided into three categories : filters, wrappers, and embedded methods [GUY 02] [SAE 07]. Filter methods are directly applied to datasets and generally assign relevance scores to features by looking only at the intrinsic properties of the data. High scoring features are presented as input to the classification algorithm. Filter methods ignore feature dependencies, and this is their main shortcoming. Wrapper and embedded methods, on the other hand, generally use a specific learning algorithm to evaluate a specific subset of features. Wrapper approaches use a performance measure of a learning algorithm to guide the feature subset search and have the ability to take into account feature dependencies; however, they are computationally intensive, especially if building the learning algorithm has a high computational cost [LAN 94]. Embedded methods use the internal parameters of some learning algorithms to evaluate features. The search for an optimal subset of features is built into the classifier construction, which is why they are less computationally intensive than wrapper methods. When the number of features becomes very large, the filter model is usually chosen as it is computationally efficient, fast and independent of the classification algorithm.


To combine the advantages of both models, filters and wrappers, we propose in this paper a general feature selection approach to deal with high dimensional data. In this approach, first, an ensemble of different filter methods is applied to the data set in order to choose the best subsets for a given cardinality. Given that each of the filters uses a specific feature evaluation criterion, we may not say that a resulting subset is better than the others, but rather that all the obtained subsets are among the best subsets of the whole feature space. For this reason we naturally turned to ensemble learning [DIE 00] as a way to combine independent feature subsets in order to get a hopefully more robust feature subset. After this step, an SVM classifier is trained on each of the projections of the resulting feature subsets on the training data. Cross validation is used to obtain the classification performance of each setting. This classification performance is used together with the SVM feature weights of each feature subset to measure the reliability of the selected features. The initial feature weights obtained from the base filter algorithms are adjusted based on the features' corresponding reliability. To get a final best subset, a robust aggregation technique is used to select the best features from the different individual subsets based on their adjusted weights. Thus, the simplicity and speed of filters is employed to select the best feature subsets from the whole feature space, then the ability of wrappers to take into account feature dependencies and provide an associated classification performance is exploited to guide the choice of a final robust subset among the initial feature subsets.

2. Feature selection by ensemble learning
In ensemble learning, a collection of single classification or regression models is trained, and the output of the ensemble is obtained by aggregating the outputs of the single models, e.g. by majority voting in the case of classification, or averaging in the case of regression. [DIE 00] shows that the ensemble might outperform the single models when weak (unstable) models are combined, mainly for three reasons : a) several different but equally optimal hypotheses can exist and the ensemble reduces the risk of choosing a wrong hypothesis, b) learning algorithms may end up in different local optima, and the ensemble may give a better approximation of the true function, and c) the true function may not be representable by any of the hypotheses in the hypothesis space of the learner, and by aggregating the outputs of the single models the hypothesis space may be expanded. Ensemble feature selection techniques use an idea similar to ensemble learning for classification [DIE 00]. Instead of choosing one particular feature selection method and accepting its outcome as the final subset, different models can be combined using ensemble feature selection approaches. Based on the evidence that there is often no single universally optimal feature selection technique, and due to the possible existence of more than one subset of features that discriminates the data equally well, model combination approaches such as boosting [FRE 97] have been adapted to improve the robustness and stability of the final, discriminative methods. Similar to the case of supervised learning, ensemble feature selection consists of two steps. In a first step, a set of different feature selectors is used, each providing its output, and in a final phase the results of these separate selectors are aggregated and returned as the final (ensemble) output. As in supervised learning, the generation of a set of diverse component learners is one of the keys to the success of ensemble learning. Variation in the feature selectors can be achieved by various methods, the most used being data perturbation and function perturbation. Data perturbation runs the component learners with different sample subsets; examples of this method are found in [BRE 96] and [FRE 97]. Function perturbation refers to those ensemble feature selection methods in which the component learners are different from each other. The basic idea is to capitalize on the strengths of different algorithms to obtain robust feature subsets. Existing ensemble feature selection methods in this category differ mainly in the aggregation procedure; examples can be found in [NET 09] [TAN 09]. Aggregating the different feature selection results can be done by weighted voting, e.g. in the case of deriving a consensus feature ranking, or by counting the most frequently selected features in the case of deriving a consensus feature subset.


3. Ensemble feature selection based on reliability assessment
In this paper, we focus on function perturbation to construct our ensemble of feature selectors. Rather than sampling the data and using the same feature selection method on each sample, we use different feature selection algorithms and apply them to the original data. To combine their results, we propose a robust aggregation technique. This two step process is detailed in the following.

3.1. Ensemble selectors creation
Consider a dataset D = (x1, . . . , xM), xi = (x1i, . . . , xNi), with M instances and N features. An ensemble of feature selection algorithms (H1, . . . , HK) is applied to D, resulting in K feature subsets (F1, . . . , FK), each one containing J selected features, Fk = (fk,1, . . . , fk,J). To create the selectors ensemble, we apply three different filter selection algorithms to the training samples: Correlation-based Feature Selection (CFS) [HAL 00], Information Gain (IG) [QUI 93] and Relief [KIR 92]. These algorithms are available in the Weka machine learning package. Information Gain (IG) evaluates features by computing their information gain with respect to the class. Relief is an instance-based method that evaluates each feature by its ability to distinguish neighboring instances. It randomly samples the instances and checks the instances of the same and different classes that are near to each other. An exponential function governs how rapidly the weights degrade with the distance. Correlation-based Feature Selection (CFS) searches feature subsets according to the degree of redundancy among the features. The evaluator aims to find the subsets of features that are individually highly correlated with the class but have low inter-correlation. The subset evaluator uses a numeric measure, such as conditional entropy, to guide the search iteratively and add features that have the highest correlation with the class. After the application of the three feature selectors, we selected the 40 best features from each of the three resulting feature ranking lists.
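The three filters are run with their Weka implementations. Purely to illustrate the "keep the 40 best-ranked features" step in code, the sketch below ranks features with mutual information (an information-gain-style criterion available in scikit-learn) on synthetic data; it does not reproduce the Weka CFS or Relief implementations, and the data set is an assumption.

```python
# Rank features with a mutual-information filter and keep the 40 best-ranked ones.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

X, y = make_classification(n_samples=60, n_features=500, n_informative=20, random_state=0)

scores = mutual_info_classif(X, y, random_state=0)
top40 = np.argsort(scores)[::-1][:40]          # indices of the 40 best-ranked features
print(top40[:10])
```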

3.2. Robust aggregation method
One of the most critical decisions when performing ensemble feature selection is the aggregation technique used to combine the feature lists resulting from the multiple runs or multiple algorithms into a single decision for each feature. We propose a robust feature aggregation technique to combine the results of the three different filter methods described above. Our aggregation technique is based on measuring each feature selection algorithm's confidence and conflict with the other algorithms in order to assign a reliability factor guiding the final feature selection. The opinions given by the ensemble of feature selection algorithms are represented as weights given to each selected feature. To enhance the robustness of the final selection, these opinions are associated with a confidence level representing the belief in the feature selection decision. The robust approach determines the conflict level of each feature selection algorithm by measuring the similarity between its opinion and confidence and those of the other algorithms in the ensemble. Based on those conflict levels, a reliability rate is associated with each algorithm, such that a reliable algorithm is one which is confident and non-conflicting at the same time. The final decision is obtained by multiplying these reliability factors by the original selection algorithm opinions. Our robust aggregation technique involves two steps. The first one is the feature weights adjustment, where a confidence level is measured for each selected feature. The second one is the reliability assessment and decision making.
3.2.1. Feature weights adjustment
We remind the reader that the trained feature selectors ensemble resulted in three feature subsets, with the 40 best features selected from each one. An SVM classifier is trained on the projection of each of the selected subsets on the training data. We record the classification error together with the SVM feature weights obtained for each setting. Our motivation for the choice of SVM is that, as for any linear classifier, the absolute values of the obtained weights directly reflect the importance of a feature in discriminating the classes. The three individual feature subsets are then merged into a single feature set containing all selected features. Redundant features are removed; in total, 88 features remain after ignoring redundancies. Let FS = (f1, . . . , fS) be the resulting merged feature set. opk,s denotes the opinion of the kth feature selection algorithm Hk about the selected feature fs. This opinion is the weight assigned by Hk to feature fs, and it is equal to zero if feature fs is not selected by Hk. wk,s denotes the absolute value of the weight assigned by the kth SVM classifier to feature fs. This weight is equal to zero if feature fs is not included in the feature subset Fk on which the kth SVM is trained. A confidence level confk,s is assigned to each selection algorithm Hk for each opinion opk,s it expresses. The confidence is a weight calculated as follows:

confk,s = (opk,s + wk,s) · log(1/βk),    [1]

where βk is the normalized error of the kth SVM classifier. Confidences are then normalized.
3.2.2. Reliability assessment and decision making
Given Ops = {opk,s, k = 1 . . . K}, the opinions of the K feature selection algorithms about the selection of a feature fs, and given Confs = {confk,s, k = 1 . . . K}, the confidences associated with those opinions, the conflict of each selection algorithm is formulated by first measuring the similarity between its opinions and those of the other algorithms in the ensemble as follows:

Simk(Ops) = 1 − (1/(K − 1)) Σ_{t=1, t≠k}^{K} | opk,s − opt,s |.    [2]

Then, the similarity of the algorithm's confidence with the rest of the confidences, Simk(Confs), is calculated in the same way as in Eq. (2). Based on these calculations, the conflict raised by an algorithm is defined as

Conflictk,s = Simk(Confs) [1 − Simk(Ops)].    [3]

Conflicting selectors are those with confidences similar to those of the agreeing selectors but with completely different opinions. The conflict measure affects the selection algorithm's reliability, which is calculated as follows:

relk,s = confk,s (1 − Conflictk,s).    [4]

Finally, the original opinions about the features are adjusted by multiplying them by the associated reliability factors after normalization. The selected features are the best ranked ones according to their adjusted opinions.
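A compact sketch of Eqs. [1]-[4], assuming op holds the opinions op_{k,s} of the K selectors on the S merged features, w the absolute SVM weights w_{k,s} (zero when a feature is not in subset Fk), and beta the normalized errors of the K per-subset SVM classifiers. The normalization choices and the final per-feature score (summing the reliability-adjusted opinions over the selectors) are illustrative assumptions rather than the exact procedure.

```python
import numpy as np

def robust_aggregate(op, w, beta):
    K, S = op.shape
    conf = (op + w) * np.log(1.0 / beta)[:, None]        # Eq. [1]
    conf = conf / conf.sum()                             # normalize confidences

    def similarity(M):                                   # Eq. [2], per selector and feature
        sim = np.empty_like(M)
        for k in range(K):
            others = np.delete(M, k, axis=0)
            sim[k] = 1.0 - np.abs(M[k] - others).mean(axis=0)
        return sim

    conflict = similarity(conf) * (1.0 - similarity(op)) # Eq. [3]
    reliability = conf * (1.0 - conflict)                # Eq. [4]
    adjusted = op * (reliability / reliability.sum())    # reliability-adjusted opinions
    return adjusted.sum(axis=0)                          # one aggregated score per feature

op = np.random.default_rng(3).random((3, 88))            # placeholder opinions
w = np.random.default_rng(4).random((3, 88))             # placeholder SVM weights
beta = np.array([0.20, 0.30, 0.25])                      # placeholder normalized errors
print(np.argsort(robust_aggregate(op, w, beta))[::-1][:10])   # ten best-ranked features
```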

4. Experiments
The experiments were conducted on a large data set concerned with the prediction of central nervous system embryonal tumour outcome based on gene expression. This data set includes 60 samples, comprising 39 medulloblastoma survivors and 21 treatment failures. These samples are described by 7129 genes [POM 02]. The experiments are conducted using Weka's implementation of the base feature selection algorithms. Matlab is used to train the SVM classifier on the projection of the selection results of the base algorithms on the dataset and to implement the robust aggregation method.


We run the three feature selection algorithms, Relief, CFS and IG, respectively, and then apply an SVM classifier on each newly obtained data set containing only the features selected by each feature selector, recording the overall accuracy by 10-fold cross-validation. The classification results of the robust ensemble feature selection (REFS), in addition to the individual selectors' classification results, are shown in Table 1.

Classifier   Relief   CFS     IG      REFS
SVM          68.33    76.67   61.67   80
KNN          65       63.33   68.33   65

Table 1. Classification performance on the individual feature selectors' output and on the ensemble feature selection output.
Table 1 shows the learning accuracy of the SVM and KNN classifiers on the different feature sets. It is clear that the classification accuracy achieved by the SVM classifier when coupled with REFS is the best result compared to those achieved when working with the base feature selection algorithms individually. The minimum classification improvement is about 3.3% compared to the individual selectors, and it reaches 18.3%. For the KNN classifier, the classification accuracy achieved when using REFS is not the best compared to the individual selection algorithms, but it is close to the base results. The above experimental results suggest that REFS is efficient for feature selection for the classification of high dimensional data. It can improve classification accuracy, especially when coupled with an SVM classifier.

5. Conclusion
In this paper we proposed a robust ensemble method for feature selection. We employed the simplicity and speed of filters to create the selectors ensemble and obtain the best feature subsets from the whole feature space. We proposed a robust aggregation technique to combine the ensemble output. The advantages of the SVM algorithm as a wrapper were exploited in order to adjust the importance of the selected features, taking into account their dependencies and the associated classification performance. The robust aggregation technique based on feature reliability assessment was used to select a final robust subset among the initial feature subsets. Our proposed approach improved classification performance for a large feature/small sample size data set. We plan to test the proposed approach on more data sets, including feature selection stability as another evaluation criterion in addition to classification performance. Another direction consists of deploying domain experts' knowledge about some features as a reliability evaluation criterion, instead of or in addition to the application of a learning algorithm.

6. References
[BRE 96] BREIMAN L., “Bagging predictors”, Machine Learning, vol. 24, num. 2, 1996, p. 123-140, Kluwer Academic Publishers.
[DIE 00] DIETTERICH T. G., “Ensemble methods in machine learning”, Proceedings of the First International Workshop on Multiple Classifier Systems, London, UK, 2000, Springer-Verlag, p. 1-15.
[FRE 97] FREUND Y., SCHAPIRE R. E., “A decision-theoretic generalization of on-line learning and an application to boosting”, J. Comput. Syst. Sci., vol. 55, num. 1, 1997, p. 119-139, Academic Press, Inc.
[GUY 02] GUYON I., WESTON J., BARNHILL S., VAPNIK V., “Gene selection for cancer classification using support vector machines”, Machine Learning, vol. 46, 2002, p. 389-422.
[HAL 00] HALL M. A., “Correlation-based feature selection for discrete and numeric class machine learning”, Proceedings of the Seventeenth International Conference on Machine Learning, Morgan Kaufmann, 2000, p. 359-366.
[KAL 07] KALOUSIS A., PRADOS J., HILARIO M., “Stability of feature selection algorithms : a study on high-dimensional spaces”, Knowl. Inf. Syst., vol. 12, num. 1, 2007, p. 95-116.
[KIR 92] KIRA K., RENDELL L., “A practical approach to feature selection”, SLEEMAN D., EDWARDS P., Eds., International Conference on Machine Learning, 1992, p. 368-377.


[LAN 94] LANGLEY P., “Selection of relevant features in machine learning”, Proceedings of the AAAI Fall Symposium on Relevance, AAAI Press, 1994, p. 140-144.
[NET 09] NETZER M., MILLONIG G., OSL M., PFEIFER B., PRAUN S., VILLINGER J., VOGEL W., BAUMGARTNER C., “A new ensemble-based algorithm for identifying breath gas marker candidates in liver disease using ion molecule reaction mass spectrometry”, Bioinformatics, vol. 25, num. 7, 2009, p. 941-947.
[POM 02] POMEROY S. L., TAMAYO P., GAASENBEEK M., STURLA L. M., ANGELO M., MCLAUGHLIN M. E., KIM J. Y. H., GOUMNEROVA L. C., BLACK P. M., LAU C., ALLEN J. C., ZAGZAG D., OLSON J. M., CURRAN T., WETMORE C., BIEGEL J. A., POGGIO T., MUKHERJEE S., RIFKIN R., CALIFANO A., STOLOVITZKY G., LOUIS D. N., MESIROV J. P., LANDER E. S., GOLUB T. R., “Prediction of central nervous system embryonal tumour outcome based on gene expression”, Nature, vol. 415, num. 6870, 2002, p. 436-442.
[QUI 93] QUINLAN J. R., C4.5 : Programs for Machine Learning, Morgan Kaufmann Publishers Inc., 1993.
[SAE 07] SAEYS Y., INZA I., LARRANAGA P., “A review of feature selection techniques in bioinformatics”, Bioinformatics, vol. 23, 2007, p. 2507-2517.
[TAN 09] TAN N. C., FISHER W. G., ROSENBLATT K. P., GARNER H. R., “Application of multiple statistical tests to enhance mass spectrometry-based biomarker discovery”, BMC Bioinformatics, vol. 10, 2009.


Prior class probability for hyperspectral data classification
Saoussen Bahria (a) and Mohamed Limam (b)
(a) LARODEC Laboratory, ISG, University of Tunis, Tunisia
(b) Dhofar University, Oman

ABSTRACT:

Hyperspectral data classification based on prior knowledge is a useful task in the remote sensing field. Moreover, the classification of hyperspectral data when knowledge about the classes is not complete is a challenging problem. In this letter, we propose a method based on the incorporation of prior class probabilities in support vector machines (SVM) for hyperspectral data classification. Prior knowledge is incorporated as a constraint in the SVM optimization problem. Based on four hyperspectral data sets, the proposed method is compared to the standard SVM approach and to three previous approaches. The classification performance is tested in terms of the overall accuracy, individual class accuracies, and kappa coefficient. Empirical results show the positive effect of incorporating prior class probability for hyperspectral data classification. However, SVM performs better than previous methods both with and without prior class probability.
KEYWORDS: Prior class probability, hyperspectral data classification, support vector machines.

1. Introduction
Hyperspectral imagery with detailed information about spectral signatures has led to the fast development of new algorithms capable of handling the high dimensionality of the data. In the hyperspectral data field, the acquisition of labeled training data is costly and time consuming. As a consequence, the researcher usually faces the problem of an insufficient set of labeled pixels to develop the classifier. Thus, the small size of the labeled pixel set is a real challenge in the classification problem. Therefore, an active area of research deals with learning high dimensional densities from a limited number of training samples. This problem is known as the Hughes phenomenon and it is among the major problems in hyperspectral data analysis. To overcome this problem, a large number of studies have been conducted on the improvement of remote sensing image classification based on the incorporation of prior data knowledge in the classification process. Curieses et al. (2002) deal with the problem of classifying multispectral images when a priori knowledge about classes is not complete: the true number of classes is not known, or it is not possible to obtain ground truth data for all classes in the image. Foody and Mather (2004) show that an accurate SVM image classification can be performed using a small
number of training samples. Mantero et al. (2005) propose a method for the estimation of probability density functions and a recursive procedure to generate prior probability estimates for known and unknown classes in remote sensing data. Mingguo et al. (2009) test the effect of prior probabilities on individual classes in maximum likelihood classification. Li et al. (2010) introduce a new semi-supervised hyperspectral image segmentation approach in which the segmentation is inferred from a posterior probability distribution taking into account the spectral information through a multinomial logistic regression (MLR) and the spatial information through a multinomial logistic Markov random field. They use an Expectation Maximization (EM) algorithm to learn the multinomial logistic regressors, where the class labels of the unlabeled samples are considered as unobserved random variables. The maximum a posteriori (MAP) estimate of the multinomial regressors is computed in the maximization step of the EM algorithm. Li et al. (2011) discuss limited training samples as an ill-posed remote sensing classification problem and suggest that the incorporation of added information could improve the classification accuracy. On the other hand, incorporating prior knowledge into SVM in order to improve the accuracy of the method is of wide interest. For example, Scholkopf (2002) attempts to incorporate known invariances of the problem by first training a system and then creating new data by distorting the resulting support vectors. Wang et al. (2004) incorporate prior knowledge into SVM to solve the problem of the scarcity of labeled samples in image retrieval. Tao et al. (2005) present a posterior probability support vector machines method for unbalanced data. In this context, we focus on the classification of hyperspectral remote sensing data based on the incorporation of prior knowledge in SVM. In this letter, we propose to incorporate prior class probabilities in SVM for the classification of hyperspectral data. The effect of using SVM with prior knowledge rather than standard SVM is tested.

2. Methodology
For incorporating prior information in SVM, we first estimate the prior class probabilities. Then, this prior is introduced as a constraint in the SVM optimization problem. The methodology is described as follows.

Step 1: Estimation of prior class probabilities
Let N be the number of pixels in the image and let Ni, i = 1, . . . , M, denote the number of pixels belonging to each class, such that there are N = Σi Ni pixels for an M-class classification problem. For a pixel x, its prior probability with respect to class i can be calculated as (Strahler 1980):

P(ωi) = Ni / N.    (1)

Step 2: Modifying SVM based on the incorporation of the estimated prior class probabilities.

2.1 SVM problem before incorporating prior probability
Let the training data of two separate classes with M samples be represented by {(x1, y1), . . . , (xM, yM)}, where xi ∈ R^n is an n-dimensional feature vector and yi ∈ {+1, −1} is the
class label. Suppose the two classes can be separated by two hyperplanes parallel to the optimal hyperplane:

yi (w · xi + b) ≥ 1,  i = 1, . . . , M,    (2)

where w = (w1, . . . , wn) is a vector of n elements and b is the bias parameter. The optimal separating hyperplane (OSH) is the one that separates the data with the maximum margin. This hyperplane can be found by minimizing the norm of w, i.e. the following function:

Φ(w) = (1/2) ‖w‖²  subject to  yi (w · xi + b) ≥ 1.    (3)

The saddle point of the following Lagrangian gives the solution to the above optimization problem:

L(w, b, α) = (1/2) ‖w‖² − Σ_{i=1}^{M} αi [ yi (w · xi + b) − 1 ],    (4)

where the αi ≥ 0 are Lagrange multipliers. The decision rule is then applied to classify the dataset into the two classes +1 and −1. In the simple case where the data are linearly separable, the decision function is defined by

f(x) = sgn( Σ_{i=1}^{Ns} αi yi (xi · x) + b ),    (5)

where Ns denotes the resulting number of support vectors and the αi are Lagrange multipliers. To deal with the non-separable case, note that an important assumption behind the above solution is that the data are separable in the feature space. It is easy to check that there is no optimal solution if the data cannot be separated without error. To resolve this problem, a penalty value C for misclassification errors and positive slack variables ξi are introduced. The objective function then becomes

Φ(w, ξ) = (1/2) ‖w‖² + C ( Σ_{i=1}^{M} ξi )^k,  subject to  yi (w · xi + b) ≥ 1 − ξi,  ξi ≥ 0,    (6)

where C is a preset penalty value for misclassification errors. If k = 1, the solution to this optimization problem is similar to that of the separable case.

2.2 SVM after incorporating prior knowledge
Incorporating prior knowledge in SVM can be tackled as follows. Given a labeled data set D = {(xi, yi)}, xi ∈ R^n, a prior knowledge set P (where PC means the prior class), and an SVM learning machine, the task is to find a decision function which takes into account both D and P. Let

f* = f( · ; w*, b*),    (7)

where the optimal parameters are obtained from both sources. This leads to

{w*, b*} = arg min { λD ED(w, b; D) + λP EP(w, b; P) },    (8)

where ED and EP denote the errors measured on D and P, and λD and λP are the weights of D and P respectively. Based on equation (3), the target function for SVM with prior knowledge is given by

Φ(w, ξD, ξP) = (1/2) ‖w‖² + CD Σi ξD,i + CP Σi ξP,i,
subject to  yD,i (w · xD,i + b) ≥ 1 − ξD,i,  yP,i (w · xP,i + b) ≥ 1 − ξP,i,  ξD,i ≥ 0,  ξP,i ≥ 0,    (9)

where CD and CP are the penalty parameters for D and P respectively, D is the labeled data set, P is the prior data set, and ξD,i and ξP,i are positive slack variables for D and P respectively.
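Step 1 (Eq. (1)) simply amounts to the class proportions in the labeled image. The sketch below computes them and, only as a common stand-in for the constraint-based formulation of Eqs. (7)-(9), injects them as per-class weights in a scikit-learn SVC with the polynomial kernel of degree 2 used later in the experiments; the labels, shapes and weighting scheme are illustrative and this is not the paper's modified optimization problem.

```python
# Compute prior class probabilities N_i / N and use them as per-class SVM weights
# (a simple stand-in for the constrained SVM formulation described above).
import numpy as np
from sklearn.svm import SVC

y_train = np.array([0] * 300 + [1] * 80 + [2] * 20)       # placeholder class labels
X_train = np.random.default_rng(5).normal(size=(400, 10)) # placeholder pixel features

classes, counts = np.unique(y_train, return_counts=True)
priors = counts / counts.sum()                            # Eq. (1): P(class i) = N_i / N
class_weight = {int(c): float(p) for c, p in zip(classes, priors)}

clf = SVC(kernel="poly", degree=2, class_weight=class_weight,
          decision_function_shape="ovo").fit(X_train, y_train)
print(dict(zip(classes.tolist(), np.round(priors, 3))))
```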

3. Data sets
Four benchmark hyperspectral data sets were considered: the AVIRIS Indian Pines data set, the DAISEX data set, the SALINAS data set, and the DAIS 7915 data set.
3.1 The AVIRIS Indian Pines-1992 data set
We consider the AVIRIS Indian Pines-1992 data set described by Bahria et al. (2011). The Indian Pines test site in north-western Indiana, USA, is a popular benchmark hyperspectral dataset collected by the AVIRIS sensor. Each image comprises 145 × 145 samples of an agricultural area. It uses 224 spectral reflectance bands within a wavelength range of 0.4 to 2.5 μm, with a nominal spectral resolution of 10 nm, a 16 bit radiometric resolution and a 20 m spatial resolution. The number of spectral bands was further reduced to 220 because four spectral bands contain no data. The available ground truth is composed of 16 classes, namely Alfalfa, Corn-notill, Corn-min, Corn, Grass-pasture, Grass-trees, Grass-pasture-mowed, Hay-windrowed, Oats, Soybean-notill, Soybean-mintill, Soybean-clean, Wheat, Woods, Buildings-Grass-Trees-Drives and Stone-Steel-Towers. The number of training and testing samples is given in Table 1.


Class number   Class name                 Training samples   Testing samples
1              Alfalfa                    54                 10
2              Corn-no till               1423               286
3              Corn-minimum till          834                166
4              Corn                       234                46
5              Grass/pasture              497                99
6              Grass/Trees                747                149
7              Grass/pasture-mowed        26                 5
8              Hay-windrowed              489                97
9              Oats                       20                 4
10             Soybeans-no till           797                159
11             Soybeans-minimum till      2468               493
12             Soybeans-clean till        614                122
13             Wheat                      212                42
14             Woods                      1294               258
15             Building-grass-trees       380                76
16             Stone-steel-towers         95                 19
               Total samples              10184              2031

Table 1. Number of training and testing samples in the AVIRIS Indian Pines data set

3.2 The DAISEX 1999 - Barrax data set
We use the DAISEX (Digital Airborne Imaging Spectrometer Experiment) dataset corresponding to the Barrax site, situated in the west of the province of Albacete, Spain. The landscape in this area is flat, with no change of elevation higher than 2 m over the whole area. Under the Barrax area several aquiferous geological formations exist. These formations seem to be connected and form a regional ground water body. The dominant cultivation in the 10,000 ha area is approximately 65% dry land (of which 67% are winter cereals and 33% fallow land) and 35% irrigated land (corn 75%, barley/sunflower 15%, alfalfa 5%, onions 2.9%, vegetables 2.1%). The database is available online at http://io.uv.es/projects/daisex/. Following Camps-Valls et al. (2004), six different classes were identified in the designated area (corn, sugar beets, barley, wheat, alfalfa and soil). The samples were chosen to have good spatial coverage, so that the natural variability of the vegetation could be ensured. Three types of units were chosen for sampling, based on the type of variability they represent: fully covered fields (alfalfa, wheat, and barley), sparsely vegetated fields (corn and small sugar beets), and bare soil fields.


3.3 The SALINAS data set
This dataset was acquired by the 224-band AVIRIS sensor over Salinas Valley in Southern California, USA, at low altitude, resulting in an improved pixel resolution of 3.7 meters per pixel. Each image is made up of 512 lines of 217 samples. 20 spectral bands were removed due to water absorption and noise, resulting in a corrected image containing 204 spectral bands over the range of 0.4 to 2.5 μm. A sample band and the corresponding ground truth data are shown in Figure 2(a) and 2(b) respectively. The Salinas scene consists of 7 ground truth classes, namely Broccoli-green-weeds, Stubble, Celery, Grapes-untrained, Soil-Vineyard-develop, Vineyard-untrained and Vineyard-vertical-trellis. The number of pixels in each class is given in Table 3.

Table 3. Class description of the SALINAS data set

Class number   Class name                    Number of pixels
1              Broccoli-green-weeds          3726
2              Stubble                       3959
3              Celery                        3579
4              Grapes-untrained              11271
5              Soil-vineyard-develop         6203
6              Corn-senesced-green-weeds     3278
7              Vineyard-untrained            7268

3.4 The DAIS 7915 data set
DAIS 7915 hyperspectral data were acquired on 04 August 2002 by the German Space Agency (DLR) in the frame of the HySens PL02_05 project. This instrument is a 79-channel imaging spectrometer operating in the wavelength range 0.4-12.5 µm with 15 bit radiometric resolution. After preprocessing, the resulting pixel size was 3 meters. The selection of sample areas was based on minimal variation between airborne and field spectra. We are interested in nine land classes, namely water, trees, asphalt, parking lot, bitumen, brick roofs, meadows, soil, and shadows. The number of training and testing samples is given in Table 3.

Table 3. Number of training and testing samples in the DAIS 7915 data set

Class          Training samples   Test samples
Water          114                4176
Trees          101                2444
Asphalt        85                 1614
Parking Lot    59                 229
Bitumen        65                 620
Brick roofs    106                2132
Meadows        62                 1183
Soil           74                 1401
Shadows        52                 181
Total          718                13979

4. Results

For classification purposes, training and testing sets were built, consisting of 20% of the samples for training and the rest for testing. We used a polynomial kernel of degree d = 2 within a one-against-one multiclass SVM process. The efficiency of SVM with and without prior class probability (PCP) was tested. As shown in Table 4, the incorporation of prior class probability improves SVM hyperspectral data classification for all data sets.

Table 4. Overall accuracy and kappa coefficient for SVM and SVM-PCP on the four data sets

Data set    Classifier   Overall accuracy (%)   Kappa coefficient (%)
AVIRIS      SVM          87.50                  86
AVIRIS      SVM-PCP      90.10                  89
DAISEX      SVM          89.40                  88
DAISEX      SVM-PCP      92.85                  91
SALINAS     SVM          84.06                  83
SALINAS     SVM-PCP      87.99                  86
DAIS        SVM          89.13                  88
DAIS        SVM-PCP      92.88                  91

The proposed method performs better than the standard SVM method, with the highest values of overall accuracy and kappa coefficient; for example, on the AVIRIS data set it gives an overall accuracy of 90.1% versus 87.5% and a kappa coefficient of 89% versus 86%. The proposed method is also compared to three previous methods, namely the maximum likelihood classification (MLC) proposed by Mingguo et al. (2009), the multinomial logistic regression (MLR) proposed by Li et al. (2010), and the maximum a posteriori with the expectation maximization algorithm (MAP-EM) proposed by Li et al. (2010). Table 5 gives the overall accuracies as a function of the number of labeled samples for all methods. The experimental study is made on the AVIRIS data set since it is used by all previous works. The detailed results show better performance of the proposed method compared to the previous approaches. This result confirms the superiority of the non-parametric SVM classifier over the parametric classifiers when the number of classes is not completely known.

Table 5. Overall accuracy (%) as a function of the number of labeled samples per class for all compared methods applied on the AVIRIS data set

Labeled samples per class   MLC     MLR     MAP-EM   SVM-PCP (proposed)
5                           88.4    77.99   84.23    90.1
10                          88.6    83.97   90.09    93.07
15                          89.2    88.85   95.07    94.77
20                          89.5    88.91   94.88    95.30
25                          90.2    90.42   96.24    97.10
30                          90.5    90.42   96.34    97.78
100                         92.05   94.91   98.87    99.4
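For reference, the two criteria reported in Tables 4 and 5 can be computed as below, assuming y_true and y_pred hold the reference and predicted labels of the test pixels (the values shown are illustrative).

```python
# Overall accuracy and kappa coefficient, both expressed in percent.
from sklearn.metrics import accuracy_score, cohen_kappa_score

y_true = [0, 0, 1, 1, 2, 2, 2, 1]
y_pred = [0, 0, 1, 2, 2, 2, 1, 1]

print(f"OA = {100 * accuracy_score(y_true, y_pred):.1f}%, "
      f"kappa = {100 * cohen_kappa_score(y_true, y_pred):.1f}%")
```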

5. Conclusion and discussion
In this work, the incorporation of prior class probability in SVM for improving hyperspectral data classification is investigated. An improvement in the classification performance of SVM after incorporating prior class probability is illustrated. Experimental results on the four benchmark hyperspectral data sets show the superiority of the proposed method in terms of overall accuracy, producer's and user's accuracy, and kappa coefficient. Empirical results also show the better performance of SVM compared to previous approaches for hyperspectral data classification with prior class probability.

References
[1] BAHRIA, S., ESSOUSSI, N. and LIMAM, M. (2011). Hyperspectral data classification using geostatistics and support vector machines. Remote Sensing Letters, 2: 99-106.
[2] CURIESES, A.G., BIASIOTTO, A., SERPICO, S.B., and MOSER, G. (2002). Supervised classification of remote sensing images with unknown classes. Geoscience and Remote Sensing Symposium.
[3] FOODY, G.M. and MATHER, A. (2004). Toward intelligent training of supervised image classifications: directing training data acquisition of SVM classification. Remote Sensing of Environment, 93: 107-117.
[4] LI, S., ZHANG, B., CHEN, D., GAO, L. and MAN, P. (2011). Adaptive support vector machine and Markov random field model for classifying hyperspectral imagery. Journal of Applied Remote Sensing, 5: 053538.
[5] LI, J., BIOUCAS-DIAS, M.J. and PLAZA, A. (2010). Exploiting spatial information in semi-supervised hyperspectral image segmentation. IEEE.
[6] MANTERO, P., MOSER, G., and SERPICO, S.B. (2005). Partially supervised classification of remote sensing images through SVM-based probability density estimation. IEEE Transactions on Geoscience and Remote Sensing, 43: 559-570.
[7] MINGGUO, Z., QIANGUO, C. and MINGZHOU, Q. (2009). The effect of prior probabilities in the maximum likelihood classification on individual classes: a theoretical reasoning and empirical testing. Photogrammetric Engineering and Remote Sensing, 75: 1109-1117.
[8] MARCONCINI, M., CAMPS-VALLS, G. and BRUZZONE, L. (2009). A composite semisupervised SVM for classification of hyperspectral images. IEEE Geoscience and Remote Sensing Letters, 6: 234-238.
[9] SCHOLKOPF, B. and SMOLA, A.J. (2002). Learning with kernels. The MIT Press, Massachusetts.
[10] STRAHLER, A.H. (1980). The use of prior probabilities in maximum likelihood classification of remotely sensed data. Remote Sensing of Environment, 10: 135-163.
[11] TAO, Q., WU, G.W., WANG, Y.F., and WANG, J. (2005). Posterior probability support vector machines for unbalanced data. IEEE Transactions on Neural Networks, 16: 1561-1573.
[12] WANG, L., XUE, P. and CHAN, L.Q. (2004). Incorporating prior knowledge into SVM for image retrieval. Proceedings of the 17th International Conference on Pattern Recognition (ICPR'04).


Overlapping Clustering of Sequential Textual Documents
Chiheb-Eddine Ben N’Cir (1), Nadia Essoussi (2)
(1) LARODEC, ISG Tunis, University of Tunis ; [email protected]
(2) LARODEC, ISG Tunis, University of Tunis ; [email protected]

ABSTRACT. Grouping text documents based on the words they contain is an important application of clustering, referred to as text clustering. This application has the issue that a document can discuss several themes and must then belong to several groups. The learning algorithm must be able to produce non-disjoint clusters and to assign documents to several clusters. Another issue in text clustering is data representation. Textual data are often represented as a bag of features such as terms, phrases or concepts. This representation of text ignores the correlation between terms and does not give importance to the order of words in the text. To deal with these issues, we propose an unsupervised learning method able to detect non-disjoint groups in text documents by considering text as a sequence of words. The proposed method looks for optimal overlapping groups and uses the Word Sequence Kernel as similarity measure. The experiments show that the proposed method outperforms existing overlapping methods using the bag of words representation in terms of clustering accuracy and detects more relevant groups in textual documents.

KEYWORDS: Overlapping clustering, Word Sequence Kernel, Kernel Overlapping k-means

1. Introduction

Text clustering is an important application within the Information Retrieval (IR) field. It aims to group similar documents in the same cluster, while dissimilar documents must belong to different clusters, without using any predefined categories. This definition raises a crucial issue in many real life applications of text clustering where a document needs to be assigned to more than one group. This issue arises naturally because a document can discuss several topics and can belong to several themes. For example, a newspaper article concerning the participation of a soccer player in the release of an action film can be grouped with both of the categories Sports and Movies. Many clustering methods have been proposed to solve the problem of detecting non-disjoint groups in data. This kind of application is referred to as overlapping clustering [DID 84], [FEL 11]. Our work concerns the detection of overlapping groups based on the k-means algorithm. Existing overlapping methods, when applied to text document clustering [CLE 08], use the Vector Space Model (VSM) representation for the set of documents. This representation is based on the assumption that the relative positions of tokens are irrelevant, leading to the loss of correlation with adjacent words and to the loss of information regarding word positions. The loss of information and the loss of correlation between adjacent words influence the quality of the obtained clusters. We propose in this paper to use a structured model for text representation when detecting overlapping groups, based on sequences of words. This representation takes into account information regarding word positions and has the advantage of being more language independent. We propose an overlapping method able to detect relevant groups in textual data based on WSK (Word Sequence Kernel) as a similarity measure. This paper is organized as follows : Sect. 2 presents the WSK and existing overlapping clustering methods, then Sect. 3 presents the KOKM based WSK method that we propose. Experiments on different data sets are described and discussed in Sect. 4. Finally, Sect. 5 presents conclusions and future work.


2. Background
2.1. Existing overlapping clustering methods
Existing k-means based methods extend fuzzy and possibilistic clustering methods to produce overlapping clusters. Examples of these methods are the fuzzy c-means [BEZ 81] and the possibilistic c-means [KRI 93] methods. These methods need a post-processing step to generate hard overlapping clusters by thresholding cluster memberships. More recent methods are based on the adaptation of k-means algorithms to look for optimal covers [CLE 08], [CLE 09], [BEN 10]. As opposed to fuzzy and possibilistic c-means, these methods produce hard overlapping clusters and do not need any post-processing step. The criteria optimized by these methods look for optimal overlapping groups. A recently proposed method referred to as Kernel Overlapping k-means (KOKMϕ) [BEN 12] extends kernel k-means to detect non-disjoint and non-linearly separable clusters. By an implicit mapping of the data from the input space to a higher, possibly infinite, dimensional feature space, KOKMϕ looks for a separation in feature space and solves the problem of overlapping clusters with non-linear and non-spherical separations. Given the set of observations X = {xi}, i = 1, . . . , N, with xi ∈ R^d and N the number of observations, let C be the number of covers and ϕ(xi) the representation of the observation xi in a high dimensional space F obtained with a non-linear transformation ϕ : xi ↦ ϕ(xi) ∈ F. The KOKMϕ method introduces the overlapping constraint (an observation can belong to more than one cluster) in the objective function, which minimizes a local error on each observation defined by the distance between the observation and its image in the feature space:

J({πc}) = Σ_{xi∈X} ‖ϕ(xi) − im(ϕ(xi))‖²,    [1]

where im(ϕ(xi)) is the image of the observation xi, defined by the gravity center of the prototypes of the clusters to which xi belongs. If observation xi is assigned to only one cluster, the image is equivalent to the prototype of that cluster. The objective function is computed without explicitly performing the non-linear mapping ϕ, using the kernel function Kij = ϕ(xi) · ϕ(xj), which evaluates the dot product in feature space between xi and xj:

J({πc}) = Σ_{xi∈X} [ Kii − (2/Li) Σ_{c=1}^{C} Pic · K_{i mc} + (1/Li)² Σ_{c=1}^{C} Σ_{l=1}^{C} Pic Pil · K_{mc ml} ],    [2]

where Pic is a binary variable indicating the membership of observation xi in cluster c and Li = Σ_{c=1}^{C} Pic is the number of clusters to which xi belongs.

2.2. Text representation and Word Sequence Kernel
In the Vector Space Model (VSM), each text document is represented by a vector of tokens (words) whose size is determined by the number of different tokens in the whole collection of documents D. Each document dj is transformed into a vector dj = (w1j, w2j, ..., w|T|j), where T = (t1, ..., t|T|) is the whole set of terms (or tokens) which appear at least once in the corpus (|T| is the size of the vocabulary), and wkj represents the weight (frequency or importance) of the term tk in the document dj. Documents whose vectors are close to each other based on token frequencies are considered to be similar in content. This representation is based on the assumption that the relative positions of tokens have little importance, leading to the loss of correlation with adjacent words and to the loss of information regarding word positions. The "n-grams" representation of text solves the problem of the loss of information regarding word positions by considering a text document as sequences of n consecutive characters (syllables or words). The whole set of n-grams is obtained by extracting all possible ordered subsequences of n consecutive characters (syllables or words) along the text. This representation of text leads to a high dimensional space of subsequence features representing the text document. This problem is solved in information retrieval tasks by using kernel machines over sequences of text.


Many kernels known as string kernels, such as the n-grams Kernel [LES 02], the String Subsequence Kernel [LOD 01] and the Word Sequence Kernel [CAN 03], have been proposed in the literature. These kernels return the inner product between documents mapped into a high dimensional feature space. This inner product is computed without explicitly computing the feature vectors. The objective of this representation is to keep information regarding word positions and to keep the linguistic meaning of the terms by using ordered words as atomic units. For example, the term "son-in-law" has a special meaning that can be lost if it is broken up. In addition, the number of features per document is reduced because sequences of words are used rather than sequences of characters.

3. Proposed method: KOKM based WSK
To detect non-disjoint groups from sequential text documents, we propose an overlapping method, referred to as "KOKM based WSK", which uses WSK as a similarity measure between structured documents. Given a set of documents D = {d_q}_{q=1}^{|D|}, each document d_q is defined in the feature space by the coordinates φ_u(d_q), which measure the number of occurrences of the subsequence u in the document d_q, weighted according to its length. Let Σ be the alphabet consisting of the words occurring in all documents, let S = s_1 s_2 s_3 ... s_{|S|} be a sequence of words with |S| the length of S, and let u = s[i] be a subsequence of S with s[i] = s_{i_1} ... s_{i_j} ... s_{i_n}, whose words are not necessarily contiguous in S. The feature mapping φ for a sentence s is given by defining, for each u ∈ Σ^n,

φ_u(s) = ∑_{i: u=s[i]} λ^{l(i)},   [3]

where l(i) = i_n − i_1 + 1 is the length (span) of the subsequence s[i] in S and λ is the decay factor used to penalize non-contiguous subsequences. These features count the occurrences of the subsequence u in the sentence s, weighted according to their lengths. Given two strings s_1 and s_2, the inner product of their feature vectors is obtained by summing over all common subsequences:

K_n(s_1, s_2) = ∑_{u∈Σ^n} φ_u(s_1) φ_u(s_2) = ∑_{u∈Σ^n} ∑_{i: u=s_1[i]} ∑_{j: u=s_2[j]} λ^{l(i)+l(j)}.   [4]

The proposed method minimizes a local error on each document, where the local error is defined by the kernel distance between the document and its image. The objective function of KOKM based WSK is

J({π_c}_{c=1}^{C}) = ∑_{d_q∈D} ∥φ(d_q) − im(φ(d_q))∥²,   [5]

where C is the number of overlapping clusters and im(φ(d_q)) is the image of document d_q, defined as the gravity center of the prototypes of the clusters to which document d_q belongs:

im(φ(d_q)) = ( ∑_{c=1}^{C} P_qc · φ(d_{m_c}) ) / ( ∑_{c=1}^{C} P_qc ),   [6]

where P_qc is a binary variable indicating the membership of document d_q in cluster c and φ(d_{m_c}) is the prototype of cluster c in the feature space. Using the WSK kernel defined in equation [4] and the kernel trick, the objective function is expressed as follows:


J({π_c}_{c=1}^{C}) = ∑_{d_q∈D} [ ∑_{u∈Σ^n} φ_u(d_q)φ_u(d_q) − (2/L_q) ∑_{c=1}^{C} P_qc ∑_{u∈Σ^n} φ_u(d_q)φ_u(d_{m_c}) + (1/L_q)² ∑_{c=1}^{C} ∑_{l=1}^{C} P_qc P_ql ∑_{u∈Σ^n} φ_u(d_{m_c})φ_u(d_{m_l}) ]

           = ∑_{d_q∈D} [ K_n(d_q, d_q) − (2/L_q) ∑_{c=1}^{C} P_qc · K_n(d_q, d_{m_c}) + (1/L_q)² ∑_{c=1}^{C} ∑_{l=1}^{C} P_qc P_ql · K_n(d_{m_c}, d_{m_l}) ],   [7]

where L_q = ∑_{c=1}^{C} P_qc, Σ^n is the set of all possible ordered word sequences of length n, and d_{m_c} is the prototype of cluster c.
The minimization of the objective function is performed locally by iterating two principal steps. The first step is the computation of the cluster prototypes, where each prototype is defined by the document that minimizes the sum of distances to the other documents belonging to the same group. This computation is performed in the feature space, where the document d_{m_c} representing the prototype of cluster c is computed as follows:

d_{m_c} = arg min_{q∈π_c} ( ∑_{j∈π_c, j≠q} w_j ∑_{u∈Σ^n} ∥φ_u(d_q) − φ_u(d_j)∥² ) / ( ∑_{j∈π_c, j≠q} w_j )

        = arg min_{q∈π_c} ( ∑_{j∈π_c, j≠q} w_j [ K_n(d_q, d_q) − 2·K_n(d_q, d_j) + K_n(d_j, d_j) ] ) / ( ∑_{j∈π_c, j≠q} w_j ),   [8]

where w_j is a weight applied to the kernel distance between document d_q and document d_j, depending on the number of groups to which document d_j belongs. This weight takes into account that an overlapping document d_j should have a smaller influence in determining the cluster prototype as its number of assignments increases.
The second step is the multi-assignment of documents to one or several clusters. This step is performed using a heuristic that explores the combinatorial set of possible assignments. For each document, the heuristic sorts the clusters from the closest to the farthest, then assigns the document to clusters in that order as long as each assignment decreases the local error defined in the objective function, and hence the whole objective function (a sketch is given after this paragraph). The stopping rule of the KOKM based WSK algorithm combines two criteria: a maximum number of iterations and a minimum improvement of the objective function between two iterations.
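A minimal sketch of one plausible reading of this greedy multi-assignment step, assuming the kernel values K_n(d_q, d_{m_c}) (vector k_q_proto), K_n(d_{m_c}, d_{m_l}) (matrix K_proto) and K_n(d_q, d_q) (scalar K_qq) are precomputed; names are illustrative, not the authors' code:

import numpy as np

def local_error(clusters, k_q_proto, K_proto, K_qq):
    """Kernel distance between a document and the gravity centre of the
    prototypes of `clusters` (the bracketed term of equation [7])."""
    L = len(clusters)
    c = np.asarray(clusters)
    return (K_qq
            - (2.0 / L) * k_q_proto[c].sum()
            + (1.0 / L ** 2) * K_proto[np.ix_(c, c)].sum())

def multi_assign(k_q_proto, K_proto, K_qq):
    """Greedy multi-assignment of one document: clusters are visited from the
    closest prototype to the farthest and kept while the local error decreases."""
    dist = K_qq - 2.0 * k_q_proto + np.diag(K_proto)   # kernel distances to prototypes
    order = np.argsort(dist)
    assigned = [int(order[0])]
    err = local_error(assigned, k_q_proto, K_proto, K_qq)
    for c in order[1:]:
        new_err = local_error(assigned + [int(c)], k_q_proto, K_proto, K_qq)
        if new_err < err:
            assigned.append(int(c))
            err = new_err
        else:
            break
    return assigned, err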

4. Experiments and Discussions
Experiments were performed on a computer with 4 GB of RAM and a 2.1 GHz Intel Core 2 Duo processor. The data are preprocessed by removing stop words. The Vector Space Model representation of each data set is built using the WEKA text preprocessing module, where the frequencies of occurrence of words are computed using the TF-IDF weighting. For the implementation of the Word Sequence Kernel, we use the recursive definition of


Table 1. Comparison of OKM and KOKMϕ based on the VSM representation with KOKM based WSK on the Reuters dataset.

                        With stemming                              Without stemming
Methods                 Precision    Recall       F-measure        Precision    Recall       F-measure
OKM                     0,275±0,01   0,968±0,03   0,429±0,01       0,275±0,01   0,968±0,03   0,429±0,01
KOKMϕ (Linear)          0,275±0,01   0,955±0,04   0,427±0,02       0,275±0,01   0,958±0,04   0,428±0,01
KOKMϕ (Polynomial)      0,275±0,01   0,955±0,04   0,427±0,01       0,275±0,01   0,958±0,04   0,428±0,01
KOKMϕ (RBF σ=10)        0,274±0,01   0,965±0,03   0,426±0,02       0,274±0,01   0,968±0,03   0,427±0,02
KOKMϕ (RBF σ=10^8)      0,275±0,01   0,955±0,05   0,427±0,02       0,275±0,01   0,958±0,04   0,428±0,01
KOKM based WSK          0,499±0,04   0,670±0,12   0,569±0,04       0,458±0,05   0,698±0,02   0,553±0,04

Table 2. Comparison of OKM and KOKMϕ based on the VSM representation with KOKM based WSK on the Ohsumed dataset.

                        With stemming                              Without stemming
Methods                 Precision    Recall       F-measure        Precision    Recall       F-measure
OKM                     0,274±0,03   0,799±0,36   0,396±0,04       0,262±0,04   0,761±0,32   0,378±0,02
KOKMϕ (Linear)          0,297±0,10   0,798±0,36   0,417±0,11       0,297±0,10   0,795±0,36   0,416±0,11
KOKMϕ (Polynomial)      0,297±0,10   0,798±0,35   0,417±0,10       0,297±0,10   0,795±0,36   0,417±0,11
KOKMϕ (RBF σ=10)        0,248±0,01   0,980±0,02   0,396±0,01       0,248±0,01   0,983±0,02   0,396±0,01
KOKMϕ (RBF σ=10^8)      0,262±0,04   0,835±0,40   0,385±0,03       0,260±0,04   0,840±0,40   0,383±0,03
KOKM based WSK          0,308±0,03   0,696±0,08   0,421±0,02       0,312±0,02   0,641±0,06   0,420±0,03

WSK given by [CAN 03], based on the dynamic programming technique. The advantage of this implementation is that it reduces the time complexity and evaluates WSK without explicitly extracting the word sequences. The time complexity of computing the WSK kernel between two documents d_1 and d_2 is reduced to O(n|d_1||d_2|), where n is the length of the word subsequences and |d_i| is the number of words in document d_i. The computational complexity of the KOKM based WSK method is O(N · C² · N_c), where N is the number of documents, C is the number of clusters and N_c is the maximal number of documents in a cluster.
Experiments were conducted on two overlapping textual data sets, the Reuters (1) and Ohsumed (2) datasets. We used a subset of Reuters composed of 76 documents and a subset of Ohsumed composed of 83 documents. Each document is labeled with one or more labels from a set of 5 categories, where each category contains 20 documents. Results are compared according to three external validation measures: Precision, Recall and F-measure. These measures estimate whether the predicted categories are correct with respect to the underlying true categories in the data. For each data set, the number of groups is set to the number of underlying categories.
Table 1 and Table 2 report the average scores and standard deviations of Precision, Recall and F-measure over ten runs using the OKM and KOKMϕ methods based on the VSM representation, compared to the proposed KOKM based WSK method (we fix the length of word sequences to n = 2 and the decay factor to λ = 0.9). Results are compared with and without stemming. For each run, all methods are computed with the same initialization of seeds to guarantee identical experimental conditions. Values in bold correspond to the best obtained scores.
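A minimal sketch of K_n from equation [4], written as a plain dynamic program following the recursion of [LOD 01]/[CAN 03]; this simple version costs O(n·|d_1|·|d_2|²), since it omits the auxiliary recursion that reaches the O(n|d_1||d_2|) cost quoted above. Names are illustrative.

import numpy as np

def word_sequence_kernel(s, t, n=2, lam=0.9):
    """Word Sequence Kernel K_n(s, t) between word lists s and t (equation [4])."""
    S, T = len(s), len(t)
    # kp[i, a, b] = K'_i(s[:a], t[:b]) of the standard recursion; K'_0 = 1 everywhere
    kp = np.zeros((n, S + 1, T + 1))
    kp[0] = 1.0
    for i in range(1, n):
        for a in range(i, S + 1):
            x = s[a - 1]
            for b in range(i, T + 1):
                val = lam * kp[i, a - 1, b]
                for j in range(1, b + 1):
                    if t[j - 1] == x:
                        val += kp[i - 1, a - 1, j - 1] * lam ** (b - j + 2)
                kp[i, a, b] = val
    # accumulate K_n over prefixes of s
    k = 0.0
    for a in range(n, S + 1):
        x = s[a - 1]
        for j in range(1, T + 1):
            if t[j - 1] == x:
                k += kp[n - 1, a - 1, j - 1] * lam ** 2
    return k

# e.g. word_sequence_kernel("a c b".split(), "a b".split(), n=2, lam=0.9) -> 0.9**5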

1. cf. http://kdd.ics.uci.edu/databases/reuters-transcribed/reuters-transcribed.html
2. cf. http://disi.unitn.it/moschitti/corpora/ohsumed-first-20000-docs.tar.gz


Table 3. The impact of varying the decay factor on KOKM based WSK.

Data set            λ      Precision       Recall          F-measure
Reuters data set    0,1    0,458±0,045     0,698±0,023     0,553±0,037
                    0,3    0,443±0,027     0,664±0,096     0,531±0,050
                    0,5    0,434±0,060     0,705±0,093     0,536±0,058
                    0,7    0,431±0,042     0,708±0,103     0,535±0,063
                    0,9    0,440±0,065     0,705±0,054     0,540±0,031
Ohsumed data set    0,1    0,254±0,021     0,590±0,070     0,354±0,011
                    0,3    0,253±0,022     0,608±0,042     0,357±0,018
                    0,5    0,262±0,021     0,595±0,035     0,363±0,017
                    0,7    0,264±0,014     0,640±0,080     0,373±0,028
                    0,9    0,258±0,014     0,700±0,050     0,377±0,014

The F-measure obtained with KOKM based WSK is high compared to the overlapping methods based on the VSM representation, and this improvement is driven by the improvement of precision. For example, on the Reuters data set the precision obtained using KOKM based WSK is 0.458, while the maximum precision obtained using the KOKMϕ and OKM methods is 0.275. The improvement of precision is observed both with and without stemming. The recalls obtained with the KOKMϕ and OKM methods are very high (the recall obtained with OKM on the Reuters data set is 0.968). These high recall values are explained by the fact that OKM and KOKMϕ tend to assign observations to all clusters because of the diagonal dominance problem. For example, on the Reuters data set, where the VSM matrix is very high-dimensional and sparse (1482 words), OKM and KOKMϕ suffer from diagonal dominance and therefore assign each observation to all clusters.
We also study the sensitivity of the KOKM based WSK method to the parameter λ (the decay factor). Table 3 shows the results obtained using different values of λ on the Reuters and Ohsumed datasets. The λ parameter penalizes non-contiguous subsequences: as λ approaches 0, non-contiguous word subsequences are considered more dissimilar and gaps are penalized more heavily. The proposed KOKM based WSK method is not very sensitive to the λ parameter: the values of Precision, Recall and F-measure are not strongly affected by its variation.
The obtained results support the theoretical finding that considering text as a sequence of words improves clustering accuracy compared to the VSM representation [CAN 03, LOD 01]. The meaning of natural language depends on word sequences, and frequent word sequences can provide compact and valuable information about document structure. In fact, the kernel-based methods (Bag of Words Kernel or Word Sequence Kernel) outperform the non-kernel method. These results show that looking for separations between clusters in a feature space is better than looking for separations in the input space: the separability between documents can be improved when documents are mapped to a feature space.

5. Conclusion
We proposed in this paper the KOKM based WSK method, which is able to detect non-disjoint groups in sequential text documents using the Word Sequence Kernel as a similarity measure. Detecting overlapping groups while considering text as a sequence of words improves the quality of the obtained groups compared to the VSM representation of text. Preliminary results obtained on the Reuters and Ohsumed data sets show the efficiency of the proposed method compared to overlapping methods using the VSM representation.
The proposed method can be applied in many other application domains where textual data need to be assigned to more than one cluster. We plan to conduct experiments on other real overlapping data sets where


sequences of text are more relevant than in the Reuters and Ohsumed datasets, such as the detection of groups in textual biological data.

6. References
[BEN 10] Ben N'Cir C., Essoussi N., Bertrand P., "Kernel overlapping k-means for clustering in feature space", International Conference on Knowledge Discovery and Information Retrieval KDIR, Valencia, Spain, 2010, SciTePress Digital Library, p. 250-256.
[BEN 12] Eddine Ben N'Cir C., Essoussi N., "Overlapping Patterns Recognition with Linear and Non-Linear Separations using Positive Definite Kernels", International Journal of Computer Applications, vol. 56, num. 9, 2012, p. 1-8.
[BEZ 81] Bezdek J. C., "Pattern Recognition with Fuzzy Objective Function Algorithms", Plenum Press, vol. 4(2), 1981, p. 67-76.
[CAN 03] Cancedda N., Gaussier E., Goutte C., Renders J., "Word-Sequence Kernels", Journal of Machine Learning Research, vol. 3, 2003, p. 1059-1082.
[CLE 08] Cleuziou G., "An extended version of the k-means method for overlapping clustering", International Conference on Pattern Recognition ICPR, Florida, USA, 2008, IEEE, p. 1-4.
[CLE 09] Cleuziou G., "OKMED et WOKM : deux variantes de OKM pour la classification recouvrante", Revue des Nouvelles Technologies de l'Information, Cépaduès-Éditions, vol. 1, 2009, p. 31-42.
[DID 84] Diday E., "Orders and overlapping clusters by pyramids", report num. 730, 1984, INRIA, France.
[FEL 11] Fellows M. R., Guo J., Komusiewicz C., Niedermeier R., Uhlmann J., "Graph-based data clustering with overlaps", Discrete Optimization, vol. 8, num. 1, 2011, p. 2-17.
[KRI 93] Krishnapuram R., Keller J. M., "A possibilistic approach to clustering", IEEE Transactions on Fuzzy Systems, vol. 1, 1993, p. 98-110.
[LES 02] Leslie C. S., Eskin E., Noble W. S., "The Spectrum Kernel: A String Kernel for SVM Protein Classification", Pacific Symposium on Biocomputing, 2002, p. 566-575.
[LOD 01] Lodhi H., Cristianini N., Shawe-Taylor J., Watkins C., "Text classification using string kernels", The Journal of Machine Learning Research, vol. 2, 2001, p. 419-444, JMLR.org.


Robust Principal Component Analysis for Incremental Face Recognition
Haïfa Nakouri, Mohamed Limam
Institut Supérieur de Gestion, LARODEC Laboratory, University of Tunis, Tunisia
ABSTRACT. Face recognition performance is highly affected by image corruption, shadowing and various face expressions.

In this paper, an efficient incremental face recognition algorithm, robust to image occlusion, is proposed. This algorithm is based on robust alignment by sparse and low-rank decomposition for linearly correlated images, extended to be incrementally applied for large face data sets. Based on the latter, incremental robust principal component analysis (PCA) is used to recover the intrinsic data of a sequence of images of one subject. A new similarity metric is defined for face recognition and classification. Experiments on five databases, based on four different criteria, illustrate the efficiency of the proposed method. We show that our method outperforms other existing incremental PCA approaches such as incremental singular value decomposition, add block singular value decomposition and candid covariance-free incremental PCA in terms of recognition rate under occlusions, facial expressions and image perspectives. KEYWORDS:

Image alignment, Robust Principal Component Analysis, Incremental RPCA.

1. Introduction
In the last two decades, face recognition has been an active research area within the computer vision and pattern recognition communities. Since the original input image space has a very high dimension, dimensionality reduction techniques are usually performed before classification. Principal Component Analysis (PCA) is one of the most popular representation methods for computer vision applications, mainly face recognition. Usually, PCA is performed in batch mode, where all training data are used to calculate the PCA projection matrix. Once the training data have been fully processed, the learning process stops. If we want to incorporate additional data into an existing PCA projection matrix, the matrix has to be retrained with all training data; such a system is therefore hard to scale up. An incremental version of PCA is a straightforward solution to overcome this limitation. Incremental PCA (IPCA) has been studied for more than two decades, yielding many methods which are especially useful when not all observations are simultaneously available. The aim of the IPCA approach is to avoid considering the available observations more than once, even when new data arrive later: new data can be used to incrementally update a previous computation. Such an approach reduces storage requirements, and large problems become computationally feasible. The performance of IPCA methods is evaluated on standard face recognition databases [HAL 00, WEN 03, HUA 09, HAL 02]. However, one of their major drawbacks is that they cannot simultaneously handle large illumination variations, image corruptions and partial occlusions that often occur in real face data (e.g., self shadowing, hats, sunglasses, scarves, incomplete face data, etc.), hence inducing significant appearance variation. These image variations can be considered as outliers or errors with respect to the original face image of a subject. Although classical PCA is effective against the presence of small Gaussian noise in the data, it is highly sensitive to even sparse errors of very high magnitude. On the other hand, it is known that well-aligned face images of a person, under varying illumination, lie very close to a low-dimensional linear subspace. In practice, however, images deviate from this situation due to self


shadowing, different angles and occlusions. Thus, we have a set of coherent images corrupted by essentially sparse errors. In order to efficiently extract low-rank face images from corrupted and distorted ones, we should first model those corruption factors and seek efficient ways to eliminate them. Robust PCA (RPCA) [WRI 09] is a powerful tool to get rid of such errors and retrieve cleaner images, potentially better suited for computer vision applications, namely face recognition. In this paper, we propose an incremental method for robust face recognition under various conditions based on RPCA. The proposed method handles both misalignment and occlusion problems on face images. In order to improve the recognition process, and based on RPCA [WRI 09], we eliminate corruptions and occlusions in the original face images. Besides, the incremental aspect of our face recognition method handles the memory constraints and computational cost of a large data set. To measure the similarity between a query image and a sequence of images of one person, we define a new similarity metric. To evaluate the performance of our method, experiments on the AR [MAR 98], ORL [SAM 94], PIE [SIM 02], YALE [BEL 97] and FERET [PHI 98] databases and a comparison with other incremental PCA methods, namely incremental singular value decomposition (SVD) [HAL 02], add block SVD [BRA 06] and candid covariance-free incremental PCA [WEN 03], are conducted. We also compare our method to a face recognition method based on batch robust PCA, denoted face recognition RPCA (FRPCA) [WAN 10]. This paper is organized as follows. In Section 2, we introduce the RPCA method, incremental RPCA (IRPCA) and our face recognition method, denoted new incremental RPCA (NIRPCA). Finally, in Section 3, we present our experimental results.

2. Face Recognition Based on IRPCA
2.1. Incremental Robust Principal Component Analysis (IRPCA)
Peng et al. [PEN 10] proposed robust alignment by sparse and low-rank decomposition for linearly correlated images (RASL). It is a scalable optimization technique for batch alignment of linearly correlated images. One of its objectives is to robustly align a dataset of human faces based on the fact that if faces are well-aligned, they exhibit a low-rank structure up to some sparse corruptions. Even perfectly aligned images may not be identical, but at least they lie near a low-dimensional subspace [BAS 03]. To the best of our knowledge, RASL is the first method that uses a trade-off between rank minimization and alignment of image data. Hence, the idea is to search for a set of transformations τ such that the rank of the transformed images becomes as small as possible while the sparse errors are compensated. Generally, the applied transformation is the 2D affine transform, where we implicitly assume that the face of a person is approximately on a plane in 3D space. The RPCA algorithm aims to recover the low-rank matrix A from the corrupted observations D = A + E, where the corrupted entries E are unknown and the errors can be arbitrarily large but are assumed to be sparse. More specifically, in face recognition, E is a sparse matrix because only a small fraction of image pixels are assumed to be corrupted by large errors (e.g., occlusions). Hence, being able to correctly identify and recover the low-rank structure A is very useful for many computer vision applications, namely face recognition. Figure 1 illustrates the recovered original images A and error images E of two different subjects using RPCA. In our case, RPCA is used to recover the original (A) and error (E) images of the training set. As for the test set, we use our NIRPCA method defined in Algorithm 2. We assume that we have m subjects and each one has n face images. Although RASL can give a very accurate alignment for faces [PEN 10], it is not applicable when the total number of images m × n, denoted by l, is very large. Wu et al. [WU 11] proposed an extension of RASL from l to L images, where L >> l, by reformulating the problem using a "one-by-one" alignment approach. This incremental alignment can be summarized in three steps. First, l frames are selected and aligned with the batch RASL method, producing a low-rank summary A*. In the second step, the (l + 1)-th image is aligned with A*, which contains the information of the previously aligned l images. Finally, the third step repeats this for the remaining images, regardless of the size of the data set.
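As an illustration of the decomposition D = A + E used here, a minimal sketch of principal component pursuit solved by a simple inexact augmented-Lagrangian iteration, in the spirit of [WRI 09, CAN 11]; the parameter choices are common heuristics and this is an assumption-laden sketch, not the authors' implementation (which relies on RASL):

import numpy as np

def robust_pca(D, lam=None, mu=None, tol=1e-7, max_iter=500):
    """Decompose D = A + E with A low-rank and E sparse (principal component pursuit)."""
    m, n = D.shape
    lam = lam if lam is not None else 1.0 / np.sqrt(max(m, n))
    spectral = np.linalg.norm(D, 2)
    mu = mu if mu is not None else 1.25 / spectral        # common heuristic
    Y = D / max(spectral, np.abs(D).max() / lam)          # dual-variable start-up
    A = np.zeros_like(D)
    E = np.zeros_like(D)
    normD = np.linalg.norm(D, "fro")
    for _ in range(max_iter):
        # low-rank update: singular value thresholding
        U, s, Vt = np.linalg.svd(D - E + Y / mu, full_matrices=False)
        A = (U * np.maximum(s - 1.0 / mu, 0.0)) @ Vt
        # sparse update: entry-wise soft thresholding
        T = D - A + Y / mu
        E = np.sign(T) * np.maximum(np.abs(T) - lam / mu, 0.0)
        R = D - A - E
        Y = Y + mu * R
        if np.linalg.norm(R, "fro") <= tol * normD:
            break
    return A, E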


Figure 1. Images recovered by Robust PCA (RPCA). (a)-(f) are six images of two different subjects; D are the original (observed) images, A and E are the recovered low-rank and error faces. Sunglasses, scarves and facial expressions are successfully removed. These images correspond to the training face images of the corresponding subjects.

We denote by I_i^j, A_i^j, E_i^j the corrupted (observed) face image, the original face image and the error of the j-th image of the i-th subject, respectively. Then we have I_i^j = A_i^j + E_i^j, where i denotes the subject and j its corresponding image, with i = 1, ..., m, j = 1, ..., n. Let

vec : R^{w×h} → R^{(w×h)×1},   (1)

be the function which transforms a w × h image matrix into a (w × h) × 1 vector by stacking its columns, so that vec(I_i^j) = vec(A_i^j) + vec(E_i^j). Assuming that we have m subjects and each one has n images, we define for the i-th subject

D_i := [vec(I_i^1) | ... | vec(I_i^n)] = A_i + E_i,   (2)

where

A_i := [vec(A_i^1) | ... | vec(A_i^n)] ∈ R^{(w×h)×n}   (3)

and

E_i := [vec(E_i^1) | ... | vec(E_i^n)] ∈ R^{(w×h)×n},   (4)

with i = 1, ..., m. D_i is formed by stacking the n image vectors of the i-th subject, and A_i and E_i are the corresponding original-image matrix and error matrix, respectively. Since all images of the same person are approximately linearly correlated, A_i is regarded as a low-rank matrix and E_i as a large but sparse matrix. As proved in [PEN 10], the original face A_i can be efficiently recovered from the corrupted face image D_i. It is well known that if images are well-aligned, they should present a low-rank structure up to some sparse errors (e.g., occlusions) [PEN 10]. Therefore, we search for a set of transformations τ = τ_1, ..., τ_n such that the rank of the transformed images becomes as small as possible and, simultaneously, the images become as well-aligned as possible. Many works [PEN 10, CAN 11] show that practical misalignment can be modeled as a certain transformation τ^{-1} ∈ G acting on the two-dimensional domain of an image I. G is assumed to be a finite dimensional group that has a parametric representation, such as the similarity group SE(2) × G+ or the 2D affine group Aff(2). In this paper we assume that G is the affine group. Consider that after performing a batch image alignment using the RASL method, we obtain the set of transformations τ, the low-rank matrix A and the error matrix E. The IRASL algorithm is given in Algorithm 1.

2.2. The Proposed Face Recognition Algorithm Based on IRPCA
In this section, we introduce a new face recognition algorithm called NIRPCA, based on the one-by-one RASL method discussed in Section 2.1. We also define a new similarity metric, used for measuring the similarity between a query image and a sequence of images, and later used for our main face recognition application. Given m different subjects, each with n training images I_i^j, i = 1, ..., m; j = 1, ..., n, we need to classify a query image I_{n+1}. The basic idea of the proposed algorithm is to recover the sparse error E_{n+1} of the


Algorithm 1. Incremental robust alignment by sparse and low-rank decomposition (IRASL).
– INPUT: image I_{n+1} ∈ R^{w×h}, initial transformation τ_{n+1} in a certain parametric group G, weight µ > 0
– WHILE not converged DO
  step 1: compute the Jacobian matrix w.r.t. the transformation τ_{n+1}:
    J ← ∂/∂ζ [ vec(I_{n+1} ∘ ζ) / ∥vec(I_{n+1} ∘ ζ)∥_2 ] |_{ζ=τ_{n+1}};
  step 2: warp and normalize the image:
    I_{n+1} ∘ τ_{n+1} ← vec(I_{n+1} ∘ τ_{n+1}) / ∥vec(I_{n+1} ∘ τ_{n+1})∥_2;
  step 3: solve the linearized convex optimization:
    (x*, Δτ*_{n+1}) ← arg min_{x, Δτ_{n+1}} (1/2) ∥ I_{n+1} ∘ τ_{n+1} + J Δτ_{n+1} − Ã x ∥_2² + µ ∥x∥_1;
  step 4: update the transformation:
    τ_{n+1} ← τ_{n+1} + Δτ*_{n+1};
– END WHILE
– OUTPUT: solution x, τ_{n+1}.

test image I_{n+1} using the approximated A_{n+1}. Once E_{n+1} is recovered, we use it to compute a similarity metric for our face recognition algorithm NIRPCA. Let D be the observation matrix of each subject in the training set, A its recovered low-rank matrix and E the error matrix. For the i-th subject in the training set, i = 1, ..., m, we have D_i, A_i and E_i as given in Equations (2), (3) and (4). Let I_{n+1} be a new occluded face image that we need to classify. According to the RASL method [PEN 10, CAN 11], this new observation can be decomposed as

I_{n+1} ∘ τ_{n+1} = A_{n+1} + E_{n+1},   (5)

where τ_{n+1} is the transformation applied to the corrupted image I_{n+1} to resolve the image misalignment, A_{n+1} is the occlusion-free image and E_{n+1} is the error image representing the occlusion. Our objective is to estimate τ_{n+1}, A_{n+1} and E_{n+1}. To solve Equation (5), we propose to use the low-rank matrix A* generated by the RASL method. As indicated in Algorithm 1, the one-by-one alignment approach proposed by [WU 11] computes τ_{n+1} and x given the low-rank matrix A* and the corrupted face image I_{n+1}. Since A* is a low-rank matrix, let Ã denote the summary of low-rank data resulting from batch RASL, such that Ã ∈ R^{l×rank(A*)}, whose columns are the rank(A*) independent columns of A*. The vector x of dimension rank(A*) approximates the coefficients of the linear combination of Ã giving A_{n+1}. Hence, an approximation of A_{n+1} is obtained by

A_{n+1} = Ã x.   (6)

Once we estimate A_{n+1}, and by using Equation (5), we estimate the error vector E_{n+1} (standing for the occlusion on the test image I_{n+1}) and use it to compare the similarity between the test image and the stored images. Let the similarity metric be

M_i = ∥E_i − E_{n+1}∥_2,  i = 1, ..., m,   (7)

where M_i measures the similarity between the corrupted input test image I_{n+1} and the class of images of the i-th subject; M_i is the Euclidean distance between the input image I_{n+1} and the class of images belonging to the i-th subject. If the query image I_{n+1} belongs to the i-th subject, D_i contains images of the same subject, so the assumption that A_i is linearly correlated and the low-rank condition are satisfied. In this case, the parameters


of E_i are small, and then M_i should be small; otherwise, the value of M_i is relatively large. For face recognition, the test image I_{n+1} is recognized as the subject with the smallest value of M_i. If the smallest similarity M_i is greater than a given threshold α, the face image I_{n+1} is not recognized: it is considered as a new subject and a new class subject A_{n+1} is created and added to the summary Ã. Otherwise, M_i ≤ α, and I_{n+1} is recognized as belonging to the i-th subject, and we simply update the class of the i-th subject in the Ã summary with A_{n+1}. The classification criterion α is set to 0.5. We also store at most 15 face images per subject in the Ã summary in order to keep a balance between the training faces. The NIRPCA algorithm is summarized in Algorithm 2.

Algorithm 2. Face recognition based on IRASL (NIRPCA).
– INPUT: image I_{n+1} ∈ R^{w×h}, initial transformation τ^0_{n+1} in a certain parametric group G, summary Ã, weight µ > 0
– 1. (τ_{n+1}, x) = IRASL(Ã, τ^0_{n+1})  (Algorithm 1)
– 2. A_{n+1} = Ã x
– 3. E_{n+1} = I_{n+1} ∘ τ_{n+1} − A_{n+1}  (Equation (5))
– 4. FOR each subject I_i:
       M_i = ∥E_i − E_{n+1}∥_2, i = 1, ..., m  (Equation (7))
   END FOR
– 5. M_min ← min_i(M_i), i = 1, ..., m
– 6. IF M_min > α THEN
       Ã ← (Ã | A_{n+1}); m ← m + 1; n ← n + 1
   ELSE
       Ã ← (Ã | A_{n+1}); n ← n + 1
   END IF
– OUTPUT: subject class of face image I_{n+1}.
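A minimal sketch of the decision part of Algorithm 2 (steps 4-6), assuming the recovered error E_{n+1} of the query and one representative vectorized error E_i per stored subject are already available (a simplification of the per-class error matrices); names are illustrative, not the authors' code:

import numpy as np

def classify_query(E_subjects, E_query, alpha=0.5):
    """Nearest subject by the error-based metric of Eq. (7), or 'new subject'
    when even the best match exceeds the threshold alpha."""
    M = np.array([np.linalg.norm(E_i - E_query) for E_i in E_subjects])  # M_i, Eq. (7)
    i_min = int(np.argmin(M))
    if M[i_min] > alpha:
        return None, M      # not recognized: a new class should be created in the summary
    return i_min, M         # recognized as subject i_min; its class in the summary is updated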

3. Experiment Results and Discussion
In this section, we evaluate the performance of the proposed face recognition algorithm NIRPCA on the AR [MAR 98], ORL [SAM 94], PIE [SIM 02], YALE [BEL 97] and FERET [PHI 98] databases. All testing images are grayscale and normalized. We use the canonical size of the images rather than the original one, based on the eye corner locations. Table 1 provides information about these databases.

Database                 AR        ORL       PIE        YALE       FERET
Original size            64×64     92×112    640×486    320×243    80×80
Canonical size           50×50     90×90     110×130    120×140    80×80
Number of subjects       70        40        68         15         47
Number of total images   420       200       680        162        465

Table 1. Face databases used for experiments.


3.1. Comparison with Standard Incremental Face Recognition Methods
In this section, our method is tested on five different databases as shown in Table 2. For each database, two thirds of the images are used for the training set and the remaining third of the images are randomly selected for the test set. In these databases, face images are captured under varied conditions such as illumination and shadowing levels, facial expression, different face perspectives and with/without occlusion (sunglasses, scarves, etc.). Our method is compared to other well-known incremental face recognition methods, i.e., ISVD [HAL 02], ABSVD [BRA 06] and CCIPCA [WEN 03]. In this section, occluded images in the AR database (with sunglasses or scarves) are omitted from the evaluation; performance on images with occlusions is considered in Section 3.3. The accuracy rates are given in Table 2.

(%)        IPCA     ABSVD    CCIPCA   NIRPCA
AR         79.71    73.44    77.43    86.42
ORL        72.68    72.23    76.82    77.39
PIE        72.17    75.89    79.64    88.72
YALE       78.12    80.27    85.47    90
FERET      76.09    80.63    85.44    89.58

Table 2. Accuracy rate of different incremental face recognition algorithms.

These results indicate that our method achieves the best performance, specifically on images

under different facial expressions, head positions and shadow levels. This can be explained by the application of IRPCA, which recovers the original face and removes shape and expression variations better than the other algorithms. In fact, various viewing angles of a face image or different head positions of a subject can be reduced to an image distortion problem. Our algorithm can solve this problem since IRPCA approximates the original image matrix A_{n+1}, the sparse error matrix E_{n+1} and a transformation τ_{n+1} at the same time for each input image I_{n+1}. Thus, applying the recovered affine transformation τ_{n+1} to the input image generates a well-aligned and distortion-free face image.

3.2. Incremental Face Recognition With Different Numbers of Training Images
In this section, we evaluate the performance when the number of training images per subject varies. Based on the AR database, K images are randomly selected as the training set and the remaining ones constitute the test set, with K = 1, ..., 10. The left panel of Figure 2 shows the recognition rates of different face recognition algorithms and of NIRPCA. This experiment shows that beyond 3 images per subject, NIRPCA achieves the best performance. In fact, with fewer than 4 images per subject, the D matrix is itself a low-rank matrix, hence it cannot generate an accurate error matrix E. Accordingly, the approximated E_{n+1} and the similarity measure are both inaccurate, which explains the slow growth of NIRPCA's accuracy rate with few images per subject. Wang and Xie [WAN 10] presented a face recognition method, FRPCA, based on batch RPCA [PEN 10]. The right panel of Figure 2 compares FRPCA and NIRPCA for various numbers of training images. The results show that the accuracy of our method with different numbers of training face images is very close to that of FRPCA. This experiment indicates a high accuracy of the recovered low-rank face image A_{n+1} approximated by Equation (6).

3.3. Performance on Images with Occlusions
In practical face recognition applications, occlusions (e.g., sunglasses or scarves on faces) cannot be avoided. Robust and efficient face recognition algorithms should achieve good performance when faces are partially occluded. We use the occluded face images in the AR database to test the performance of the different algorithms. In this experiment, 70 subjects are selected for the dataset, with 5 occluded images and 6 non-occluded ones for each subject.


Figure 2. The left panel shows the accuracy rates of ISVD, BlockISVD, CCIPCA and NIRPCA for different numbers of training images per subject (1 to 10). The right panel shows the accuracy rates of FRPCA and NIRPCA for different numbers of training images per subject.

These results show that occluded images considerably reduce the performance of IPCA, ABSVD and CCIPCA, whose best recognition rate is only about 20%, whereas a high recognition rate is still maintained by our algorithm NIRPCA, as shown in Table 3. This is due to the robustness of our method to occlusions, while the other incremental algorithms cannot efficiently remove the disruption caused by corruption of the images.

(%)        IPCA     ABSVD    CCIPCA   NIRPCA
AR         28.89    24.36    22.27    58.67

Table 3. Accuracy rate of different incremental face recognition algorithms with occluded test faces.

3.4. Runtime Comparison
In this section, we compare the runtime of NIRPCA with other IPCA-based face recognition methods, namely ISVD, ABSVD and CCIPCA. This experiment is carried out on 70 different subjects, with 13 images per subject from the AR database. Table 4 summarizes the runtime results.

               IPCA     ABSVD     CCIPCA   NIRPCA
Runtime (s)    96.89    135.41    26.53    60.29

Table 4. Runtime of different incremental face recognition algorithms.

Although the recognition rate of NIRPCA is the best, its

runtime is slower than that of CCIPCA, due to the iterative linearization of the convex optimization using the split Bregman method [GOL 09] (step 3 in Algorithm 1). On the other hand, the runtime of NIRPCA is faster than those of the face recognition methods based on ISVD or ABSVD. When we decrease the number of principal components, the runtimes of ISVD and ABSVD become similar to that of NIRPCA, but their face recognition rates decrease.

4. Conclusion
In this paper, we proposed a new incremental face recognition method based on one-by-one RASL. Our method is robust to sparse corruptions of face images, and the experiments performed on different databases show its efficiency. The advantage of our method is that it handles many aspects of image variation, such as facial expression, image shadowing and various viewing angles. Above all, unlike other existing incremental face recognition methods, the proposed method efficiently handles corrupted images, mainly occluded ones. Experiments based on five different


databases show that our proposed method has better accuracy rates. Further experiments can be extended to video images, so that it could be used in real face recognition applications. However, for video images, important image preprocessing work, such as face detection, should be done before the recognition step.

5. References
[BAS 03] Basri R., Jacobs D., "Lambertian Reflectance and Linear Subspaces", PAMI, vol. 25, 2003, p. 218-233.
[BEL 97] Belhumeur P. N., Hespanha J. P., Kriegman D. J., "Eigenfaces vs. Fisherfaces: Recognition Using Class Specific Linear Projection", 1997.
[BRA 06] Brand M., "Fast low-rank modifications of the thin singular value decomposition", Linear Algebra and its Applications, vol. 415, num. 1, 2006, p. 20-30, Elsevier.
[CAN 11] Candès E., Li X., Ma Y., Wright J., "Robust Principal Component Analysis?", Journal of the ACM, vol. 58, num. 3, 2011.
[GOL 09] Goldstein T., Osher S., "The Split Bregman Method for L1-Regularized Problems", SIAM J. Img. Sci., vol. 2, num. 2, 2009, p. 323-343.
[HAL 00] Hall P., Marshall D., Martin R., "Merging and Splitting Eigenspace Models", IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, 2000, p. 1042-1049.
[HAL 02] Hall P., Marshall D., Martin R., "Adding and subtracting eigenspaces with eigenvalue decomposition and singular value decomposition", Image Vision Comput., vol. 20, num. 13-14, 2002, p. 1009-1016.
[HUA 09] Huang D., Yi Z., Pu X., "A New Incremental PCA Algorithm With Application to Visual Learning and Recognition", Neural Process. Lett., vol. 30, 2009, p. 171-185.
[MAR 98] Martinez A., Benavente R., "The AR face database", report, 1998.
[PEN 10] Peng Y., Ganesh A., Wright J., Xu W., Ma Y., "RASL: Robust alignment by sparse and low-rank decomposition for linearly correlated images", CVPR, 2010, p. 763-770.
[PHI 98] Phillips P., Wechsler H., Huang J., Rauss P., "The FERET Database and Evaluation Procedure for Face Recognition Algorithms", Image and Vision Computing, vol. 16, num. 5, 1998, p. 295-306.
[SAM 94] Samaria F. S., Harter A. C., "Parameterisation of a Stochastic Model for Human Face Identification", 1994.
[SIM 02] Sim T., Baker S., Bsat M., "The CMU Pose, Illumination, and Expression (PIE) Database", 2002.
[WAN 10] Wang Z., Xie X., "An efficient face recognition algorithm based on robust principal component analysis", Proceedings of the Second International Conference on Internet Multimedia Computing and Service, ICIMCS '10, New York, NY, USA, 2010, ACM, p. 99-102.
[WEN 03] Weng J., Zhang Z. Y., Hwang W., "Candid covariance-free incremental principal component analysis", PAMI, vol. 25, 2003, p. 1034-1040.
[WRI 09] Wright J., Ganesh A., Rao S., Peng Y., Ma Y., "Robust Principal Component Analysis: Exact Recovery of Corrupted Low-Rank Matrices via Convex Optimization", NIPS, 2009, p. 2080-2088.
[WU 11] Wu K. K., Wang L., Soong F. K., Yam Y., "A sparse and low-rank approach to efficient face alignment for photo-real talking head synthesis", ICASSP, 2011, p. 1397-1400.


GMM estimation of the first order periodic GARCH processes
Ines Lescheb
Department of Mathematics, University of Mentouri, Constantine, Algeria
ABSTRACT. This paper is concerned with GMM estimation and inference in periodic GARCH models. Sufficient conditions for the estimator to be consistent and asymptotically normal are established for the periodic GARCH(1,1) process.

KEYWORDS:

Periodic GARCH processes, GMM estimation, Consistency, Asymptotic normality.

1. Introduction
Periodic (Generalized) Autoregressive Conditional Heteroscedasticity processes (abbreviated P-(G)ARCH) have attracted great interest in the recent econometric and statistical literature. This class of processes was introduced to the econometric mainstream by Bollerslev and Ghysels [6] and continues to gain popularity, especially for financial data, motivating the development of appropriate analysis methods. These processes are similar to Markov switching GARCH (MS-GARCH) models (see for instance [8]), but the major difference is that a realization of the chain which governs the parameters of the model is assumed to be recurrently and periodically observed in time. This specification makes the process globally non-stationary, but it is stationary within each period. The reader is referred to Bibi and Aknouche [3] and Bibi and Lescheb [4] for background and works dealing with periodic GARCH (P-GARCH) processes. Periodic GARCH models are estimated by the quasi-maximum likelihood method in the only paper of Aknouche and Bibi [1]. In another paper, Bibi and Lescheb [4] study the strong consistency and the asymptotic normality of the least squares estimators. To the best of our knowledge, no theoretical result exists on the GMM estimation method for GARCH processes with periodically time-varying parameters. This paper is therefore concerned with GMM estimation and inference in periodic GARCH models. Sufficient conditions for the estimator to be consistent and asymptotically normal are established for the periodic GARCH(1,1) process.
A second order process (ε_n)_{n∈Z} defined on some probability space (Ω, A, P) is said to have a periodic generalized autoregressive conditional heteroscedastic representation with period s > 0 and orders p and q (noted P-GARCH(p, q)) if it satisfies the non-linear equations

∀n ∈ Z : ε_n = e_n h_n  and  h²_n = a_0(n) + ∑_{i=1}^{q} a_i(n) ε²_{n−i} + ∑_{j=1}^{p} b_j(n) h²_{n−j},   [1]

where (e_n)_{n∈Z} is a sequence of independent identically distributed (i.i.d.) random variables defined on the same probability space (Ω, A, P) with E{e_n} = 0, E{e²_n} = 1, E{e^k_n} = κ_k, and e_k is independent of ε_t for k > t. The parameters (a_i(n))_{0≤i≤q} and (b_i(n))_{0≤i≤p} are periodic in n with period s, i.e., for any (n, k) ∈ Z²: a_i(n) = a_i(n + sk), i = 0, ..., q, and b_j(n) = b_j(n + sk), j = 0, ..., p. So, setting n = st + v, v = 1, ..., s, Equation (1) may be equivalently written in periodic notation as

∀t ∈ Z : ε_{st+v} = e_{st+v} h_{st+v}  and  h²_{st+v} = a_0(v) + ∑_{i=1}^{q} a_i(v) ε²_{st+v−i} + ∑_{j=1}^{p} b_j(v) h²_{st+v−j},   [2]


of which we will make heavy use. In (2), ε_{st+v} (resp. h_{st+v}, e_{st+v}) refers to ε_t (resp. h_t, e_t) during the v-th regime of cycle t, and (a_i(v))_{0≤i≤q} and (b_i(v))_{0≤i≤p} are the model coefficients at season v = 1, ..., s, such that a_0(v) > 0, a_i(v) ≥ 0, b_i(v) ≥ 0 for all v and i ∈ {1, ..., p ∨ q}. In this paper, we continue to investigate the asymptotic behavior of empirical studies of P-GARCH models. Since P-GARCH(p, q) models with p, q ≥ 2 are rare in practice, we restrict ourselves to one particular model which is very often used in applications: the P-GARCH(1, 1) model, in which

ε_{st+v} = e_{st+v} h_{st+v}  and  h²_{st+v} = w(v) + α(v) ε²_{st+v−1} + β(v) h²_{st+v−1},  v = 1, ..., s.   [3]
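For illustration, a minimal simulation sketch of model (3) with standard normal innovations (one possible choice of e_n); the start-up value of h² is arbitrary and a burn-in period should be discarded in practice. This sketch is an assumption, not part of the paper:

import numpy as np

def simulate_pgarch11(w, alpha, beta, n_cycles, rng=None):
    """Simulate the P-GARCH(1,1) model (3): for n = st + v,
    eps_n = e_n * h_n,  h_n^2 = w(v) + alpha(v) * eps_{n-1}^2 + beta(v) * h_{n-1}^2.
    w, alpha, beta are length-s arrays of the season-wise coefficients."""
    rng = np.random.default_rng(rng)
    s = len(w)
    N = s * n_cycles
    eps = np.zeros(N)
    h2 = np.zeros(N)
    eps_prev, h2_prev = 0.0, w[0]        # arbitrary start-up
    for n in range(N):
        v = n % s                        # season index (0-based here; v = 1,...,s in the paper)
        h2[n] = w[v] + alpha[v] * eps_prev ** 2 + beta[v] * h2_prev
        eps[n] = np.sqrt(h2[n]) * rng.standard_normal()
        eps_prev, h2_prev = eps[n], h2[n]
    return eps, np.sqrt(h2)

# e.g. a quarterly pattern (s = 4), satisfying assumption A2 below:
# eps, h = simulate_pgarch11(w=[0.1, 0.2, 0.1, 0.3], alpha=[0.1, 0.2, 0.05, 0.1],
#                            beta=[0.7, 0.5, 0.8, 0.6], n_cycles=500)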

2. GMM estimator of P-GARCH(1,1) processes
In this section, the asymptotic properties of the GMM estimates of the P-GARCH(1,1) coefficients are studied. The process is described by the vector of parameters θ = (θ'(1), ..., θ'(s))', where θ(v) = (w(v), α(v), β(v))', v = 1, ..., s. The vector θ belongs to a parameter space Θ_θ := {θ : θ ∈ (]0, ∞[ × [0, ∞[ × [0, ∞[)^s}. The period s is supposed to be known, and the true parameter value is unknown and denoted by θ_0. Let {ε_1, ε_2, ..., ε_N} be a realization of length N = sn used to estimate the parameter θ. In practice we observe a finite segment of the process (3) and the objective is to estimate the parameters θ ∈ Θ_θ. To this end, define the vector w_{st+v} = (ε_{st+v}, ε²_{st+v} − h²_{st+v})', v = 1, ..., s, and the generalized moment vector

g_{st+v} = Y_{st+v} w_{st+v},  v = 1, ..., s,

where Y_t is an instrumental variable function. The GMM estimator of the parameter vector θ is the solution of (cf. Hansen [7])

θ̂_sn = Arg min_{θ∈Θ_θ} Q_sn(θ),  Q_sn(θ) = [ (1/ns) ∑_{t=1}^{n} ∑_{v=1}^{s} g_{st+v} ]' M_ns [ (1/ns) ∑_{t=1}^{n} ∑_{v=1}^{s} g_{st+v} ],   [4]

where M_ns = (1/ns) ∑_{t=1}^{n} ∑_{v=1}^{s} w̄_{st+v} is an appropriate weighting matrix.

[4]

Efficient GM M corresponds to choosing for all v = 1, ..., s Yst+v =

∂wst+v Σ−1 andwst+v 0 st+v

0

=

∂wst+v ∂θ (v)

! Σ−1 st+v



∂wst+v



0

∂θ (v) ∂θ (v)  /   ∂w  (v) where Σst+v = V ar wst+v Ft−1 and ∂θ0st+v is the Jacobian matrix (see Newey and McFadden [11]). The (v) objective function for an efficient GM M estimator is then given by #0 !−1 " n s # " n s n X s X XX 1 XX Qsn (θ) = g Λst+v g st+v [5] ns t=1 v=1 st+v t=1 v=1 t=1 v=1 0

where Λst+v = g st+v g st+v , v = 1, ..., s. The elements of the generalized moment and the weighting matrix are given by       2   st+v 1 ∂h2st+v st+v −2 gitv = hst+v κ3 − −1 Π ∂θi (v) hst+v h2st+v       2  2  2  st+v ∂hst+v 1 ∂h2st+v st+v −2 −2 Λijtv = h κ − − 1 h 3 st+v st+v 2 2 Π ∂θi (v) hst+v hst+v ∂θj (v) with Π = (κ4 − 1) − κ23 and for all i, j = 1, ..., 3, v = 1, ..., s.


3. Asymptotic properties
All the asymptotic results below are derived with a parameter-dependent weighting matrix. Compared to basing the weighting matrix on an initial consistent estimator of θ, or simply on the identity matrix, no additional restrictions are needed; the latter estimators are of course special cases of the results given below. Furthermore, allowing for a parameter-dependent weighting matrix is unimportant for the asymptotic distribution. The following assumptions are sufficient for the consistency of θ̂_sn.

A1. θ_0 ∈ Θ̊_θ and Θ_θ is compact (where Θ̊_θ denotes the interior of Θ_θ).
A2. ∏_{v=1}^{s} (α(v) + β(v)) < 1.
A3. E{e_t⁴} = κ₄ < ∞.
A4. Θ_θ = {θ : 0 < w*_l(v) ≤ w*(v) ≤ w*_k(v), 0 < α*_l(v) ≤ α*(v) ≤ α*_k(v), 0 < β*_l(v) ≤ β*(v) ≤ β*_k(v) < 1, for all v = 1, ..., s}.

The first assumption in A1 is standard, and the compactness of Θ_θ is assumed so that several results from real analysis may be used. Assumption A2 ensures that the P-GARCH(1,1) model (3) has a unique, periodically correlated, causal, periodically ergodic solution; moreover, the solution process is strictly periodically stationary and possesses some finite moments (for more details see Bibi and Lescheb [5]). A3 is necessary for the existence of the asymptotic variance matrix of the GMM estimator. Assumption A4 is used to establish the consistency of the GMM estimator of the P-GARCH(1,1) process (see Appendix). Define θ̂_sn as the sequence of minimizers of the objective function (4) and suppose that some initial estimators κ̂₃, κ̂₄ are available and that these initial estimators only require Assumptions A1-A4. The next theorem shows the weak consistency of GMM for the P-GARCH(1,1) process.

Suppose Assumptions A1-A4 hold and that κ̂₃ →_p κ*₃, κ̂₄ →_p κ*₄. Then θ̂_sn → θ_0 in probability on Θ_θ, regardless of whether κ*₃ = κ₃ or κ*₄ = κ₄, as long as both κ*₃, κ*₄ are finite. That is, θ̂_sn is consistent for any finite arbitrary guess of κ₃, κ₄.

In practice we are of course interested in obtaining asymptotically valid inference about θ̂_sn, and for this purpose we need consistent initial estimators of κ₃ and κ₄. But this result has a useful consequence in terms of the asymptotic distribution of θ̂_sn: we will be able to conclude that the asymptotic distribution of θ̂_sn is the same regardless of whether κ₃ and κ₄ are known or estimated. The second main result of this section is the following.

Suppose Assumptions A1-A4 hold and that κ̂₃ →_p κ₃, κ̂₄ →_p κ₄, or κ₃, κ₄ are known. Then

√(sn) (θ̂_sn − θ_0) ≅ N(0, (J I⁻¹ J)⁻¹) = N(0, I⁻¹),

with J = diag{J_v, v = 1, ..., s}, I = diag{I_v, v = 1, ..., s}, and each block matrix given respectively by

I_v = ∑_{v=1}^{s} E_{θ_0}{ g_{st+v} g'_{st+v} },   J_v = ∑_{v=1}^{s} E_{θ_0}{ ∂g_{st+v}/∂θ'(v) },


4. References
[1] Aknouche, A. and Bibi, A. (2009). Quasi-maximum likelihood estimation of periodic GARCH and periodic ARMA-GARCH processes. Journal of Time Series Analysis, 29(1), 19-45.
[2] Basawa, I. V. and Lund, R. (2001). Large sample properties of parameter estimates for periodic ARMA models. Journal of Time Series Analysis, 22, 1-13.
[3] Bibi, A. and Aknouche, A. (2008). On periodic GARCH processes: stationarity, existence of moments and geometric ergodicity. Mathematical Methods of Statistics, 17(4), 305-316.
[4] Bibi, A. and Lescheb, I. (2010). Strong consistency and asymptotic normality of least squares estimators for PGARCH and PARMA-PGARCH. Statistics and Probability Letters, 80, 1532-1542.
[5] Bibi, A. and Lescheb, I. (2013). Estimation and asymptotic properties in periodic GARCH(1,1) models. Communications in Statistics - Theory and Methods, accepted.
[6] Bollerslev, T. and Ghysels, E. (1996). Periodic autoregressive conditional heteroscedasticity. Journal of Business & Economic Statistics, 14, 139-151.
[7] Hansen, L. P. (1982). Large sample properties of generalized method of moments estimators. Econometrica, 50, 1029-1054.
[8] Francq, C. and Zakoïan, J.-M. (2005). The L2-structures of standard and switching-regime GARCH models. Stochastic Processes and their Applications, 115, 1557-1582.
[9] Lee, S. W. and Hansen, B. (1994). Asymptotic theory for the GARCH(1,1) quasi-maximum likelihood estimator. Econometric Theory, 10, 29-52.
[10] Lumsdaine, R. L. (1996). Consistency and asymptotic normality of the quasi-maximum likelihood estimator in IGARCH(1,1) and covariance stationary GARCH(1,1) models. Econometrica, 64, 575-596.
[11] Newey, W. K. and McFadden, D. (1994). Large sample estimation and hypothesis testing. In Engle, R. F. and McFadden, D. (eds.), Handbook of Econometrics, Vol. 4, Amsterdam: North-Holland, chapter 36, 2111-2245.


Testing instantaneous causality in presence of non constant unconditional covariance structure
Quentin Giai Gianetto(1), Hamdi Raïssi(2)
(1) and (2) IRMAR-INSA, 20 avenue des buttes de Coësmes, CS 70839, F-35708 Rennes Cedex 7, France. E-mail: [email protected] and [email protected]
ABSTRACT. The problem of testing instantaneous causality between variables with time-varying unconditional covariance matrices is investigated. It is shown that the classical tests based on the assumption of stationary processes must be avoided in our non standard framework. More precisely, we underline that the standard test does not control the type I errors, while the tests with White [WHI 80] and Heteroscedastic Autocorrelation Consistent (HAC) (see [DEN 97]) corrections can suffer from a severe loss of power when the covariance is not constant. Consequently, a modified test based on a bootstrap procedure is proposed.
KEYWORDS: VAR model; Unconditionally heteroscedastic errors; Instantaneous causality.

1. Introduction
The concept of causality, defined by Granger [GRA 69], is widely used to analyze cause and effect relationships between macroeconomic and financial variables. Causality relationships are often analyzed by taking into account only the past values of the studied variables. In many situations, however, the prediction of the unobserved current variables X_2t can be improved by including the available current information on the variables X_1t. In such a case, the instantaneous causality relation between X_1t and X_2t is investigated. In the stationary VAR framework, instantaneous causality is usually tested by using Wald tests for zero restrictions on the innovations' covariance matrix (see Lütkepohl [LUT 05]). Standard tools available in commonly used software are based on the assumption of i.i.d. Gaussian innovations. The weight matrix of the test statistic has to be corrected by using White or HAC type covariance matrices in non standard but stationary situations. Nevertheless, many applied papers questioned the assumption of a constant unconditional covariance structure. For instance, Sensier and van Dijk [SEN 04] found that most of the 214 U.S. macroeconomic variables they investigated exhibit a break in their unconditional variance. Numerous tools for time series analysis in the presence of a potentially non constant variance have been proposed in the literature (see Francq and Gautier [FRA 04], Horváth, Kokoszka and Zhang [HOR 06] or Patilea and Raïssi [PAT 12] among others). In this paper the test of zero restrictions on the time-varying covariance structure is studied. We highlight the inadequacy of the standard Wald test and of the tests based on White or HAC corrections of the Wald test statistic for instantaneous causality when the covariance structure is time-varying. As a consequence, a new approach for testing instantaneous causality that takes the non-stationary unconditional covariance into account is proposed in this paper. It is however found that the asymptotic distribution of the modified statistic is non standard, involving the unknown non constant covariance structure in a functional form. Therefore a wild bootstrap procedure is provided for testing zero restrictions on the non constant covariance structure. It is established through theoretical and empirical results that the modified test is preferable to the tests based on the spurious assumption of a constant unconditional covariance matrix.


2. Vector autoregressive model with non constant covariance structure
Consider the following VAR model

X_t = A_{01} X_{t−1} + ... + A_{0p} X_{t−p} + u_t,
u_t = H_t ǫ_t,   [1]

where X_t ∈ R^d and it is assumed that X_{−p+1}, ..., X_0, X_1, ..., X_T are observed. The d × d matrices A_{0i} are such that det A(z) ≠ 0 for all |z| ≤ 1, where A(z) = I_d − ∑_{i=1}^{p} A_{0i} z^i, with I_d the d × d identity matrix. Note that the process (X_t) should formally be written in a triangular form, but the double subscript is suppressed for notational simplicity. In the following assumption we specify the structure of the covariance using the rescaling approach of Dahlhaus [DAH 97]. F_t denotes the σ-field generated by {ǫ_k : k ≤ t}, and ∥·∥_r is such that ∥x∥_r := (E∥x∥^r)^{1/r} for a random variable x, with ∥·∥ the Euclidean norm.
Assumption A1: (i) The d × d matrices H_t are lower triangular nonsingular matrices with positive diagonal elements and satisfy H_t = G(t/T), where the components of the matrix G(r) := {g_kl(r)} are measurable deterministic functions on the interval (0, 1], such that sup_{r∈(0,1]} |g_kl(r)| < ∞, and each g_kl satisfies a Lipschitz condition piecewise on a finite number of sub-intervals partitioning (0, 1]. The matrices Σ(r) = G(r)G(r)' are assumed positive definite for all r in (0, 1]. (ii) The process (ǫ_t) is α-mixing and such that E(ǫ_t | F_{t−1}) = 0, E(ǫ_t ǫ'_t | F_{t−1}) = I_d and sup_t ∥ǫ_t∥_{4µ} < ∞ for some µ > 1.
If we suppose the process (ǫ_t) Gaussian and the functions g_kl(·) constant, we retrieve the standard case. Nevertheless, when the unconditional covariance is time-varying, it can be expected that the tools developed in the stationary framework are not valid or can suffer from drawbacks, since the tests for instantaneous causality are directly based on the covariance structure. The piecewise Lipschitz condition allows abrupt breaks as well as smooth changes of the unconditional covariance. In the framework of A1 we are interested in testing zero restrictions on the covariance structure Σ(r). From (ii) we see that the error terms are not correlated at the second order. Therefore the tools proposed in this paper are preferably used for relatively low frequency variables, for which it is commonly admitted that there is no second order dynamics (for instance monthly, quarterly or annual macroeconomic data). Now re-write model (1) as follows:

X_t = (X̃'_{t−1} ⊗ I_d) θ_0 + u_t,
u_t = H_t ǫ_t,

with X̃_{t−1} = (X'_{t−1}, ..., X'_{t−p})' and θ_0 = vec{(A_{01}, ..., A_{0p})}, where vec(·) is the usual column vectorization operator of a matrix and ⊗ is the usual Kronecker product. The parameter vector θ_0 may be estimated by Ordinary Least Squares (OLS) or Adaptive Least Squares (ALS). Properties of these estimators are established in Patilea and Raïssi [PAT 12] under A1. Denoting by θ̂ the ALS (or alternatively OLS) estimator of θ_0, we introduce the residuals û_t = X_t − (X̃'_{t−1} ⊗ I_d) θ̂. Note that the more efficient ALS estimation method should be preferred for approximating the innovations. The goodness-of-fit of model (1) can be checked using the portmanteau tests proposed in Patilea and Raïssi [PAT 13] in our non standard framework. Hence the lag length is assumed well fitted in the sequel. More particularly, the OLS and ALS estimators are unbiased and √T-asymptotically normal if the lag length is well adjusted.
Let X_{1t} and X_{2t} be the subvectors of X_t := (X'_{1t}, X'_{2t})' with respective dimensions d_1 and d_2, and let Σ_t^{12} be the d_1 × d_2-dimensional upper right block of Σ_t := E(u_t u'_t). Our goal is to determine whether an instantaneous causality relation exists between X_{1t} and X_{2t}. The next lemma gives some preliminary results and requires additional notation. Let û_t := (û'_{1t}, û'_{2t})', ϑ̂_t := û_{2t} ⊗ û_{1t}, v̂_t := vec(û_{1t} û'_{2t} − Σ_t^{12}) = û_{2t} ⊗ û_{1t} − vec(Σ_t^{12}),
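A minimal sketch of the OLS step just described (estimating θ_0 in model (1) without intercept and forming the residuals û_t); the ALS estimator favoured in the paper is not reproduced here, and the layout of the input array is an assumption:

import numpy as np

def var_ols_residuals(X, p):
    """OLS estimation of the VAR(p) model (1) and its residuals.

    X : (T + p) x d array holding X_{-p+1}, ..., X_T stacked row-wise.
    Returns the d x (d*p) matrix [A_1, ..., A_p] and the T x d residual matrix."""
    Tp, d = X.shape
    # regressors X~_{t-1} = (X'_{t-1}, ..., X'_{t-p})' for t = 1, ..., T
    Z = np.column_stack([X[p - i - 1: Tp - i - 1] for i in range(p)])   # T x (d*p)
    Y = X[p:]                                                           # T x d
    A, *_ = np.linalg.lstsq(Z, Y, rcond=None)                           # (d*p) x d
    U = Y - Z @ A                                                       # residuals u-hat_t
    return A.T, U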

77

Proceedings of MSDM 2013

′ ′ ′ Ht := (H1t , H2t ) and G(r) := (G1 (r)′ , G2 (r)′ )′ be in line with the partition of Xt . We denote by [z] the integer part of a real number z. We also denote by ⇒ the convergence in distribution and → the convergence in probability.

Lemma 1 Under A1 we have as T → ∞ T −1

T X t=1

Z ϑˆt →

and T

− 21

T X t=1

where

Z

1

vec(Σ12 (r))dr,

vˆt ⇒ N (0, Ω),

[3]

Z

1 ′

Ω= 0

[2]

0

(G2 (r) ⊗ G1 (r)) M (G2 (r) ⊗ G1 (r)) dr −

1

vec(Σ12 (r))vec(Σ12 (r))′ dr 0

and M = E(ǫt ǫ′t ⊗ ǫt ǫ′t ). In addition we also have 1

T −2

[T s] X t=1

Z vˆt ⇒

s 0

(G2 (r) ⊗ G1 (r))dBΩ˜ (r)

[4]

˜ := E(ǫt ǫ′t ⊗ ǫt ǫ′t ) − vec(Id )vec(Id )′ , with s ∈ [0, 1]. where BΩ˜ (.) is a Brownian Motion with covariance matrix Ω A proof of this lemma is provided in the long version of the paper [GIA 12]. Next we shall discuss the test for instantaneous causality between X1t and X2t assuming spuriously that the unconditional covariance is constant and propose a new test adapted to our framework.

3. Testing for instantaneous causality Denote by X2t (1|{Xk |k < t}) the optimal one step linear predictor of X2t at the date t − 1, based on the information of the past of the process (Xt ). Similarly we define the one step linear predictor X2t (1|{Xk |k < t} ∪ {X1t }) based on the past of (Xt ) and the present of (X1t ). It is said that there is no instantaneous linear causality between (X1t ) and (X2t ) if X2t (1|{Xk |k < t} ∪ {X1t }) = X2t (1|{Xk |k < t}). In the case of non constant covariance following the assumption A1 and, more particularly, because we assumed that the Ht ’s are lower triangular nonsingular matrices with positive diagonal elements, it can be shown that there is no instantaneous causality between X1t and X2t if and only if the Σ12 t ’s are all equal to zero. Consequently in our non standard framework the following pair of hypotheses has to be tested : H0 : Σ12 (r) = 0 vs H1 : Σ12 (r) 6= 0 for r ∈ [a, b] ⊆ [0, 1] with fixed a < b. Now if we consider the case where the covariance is assumed constant Σt = Σu for all t, it is well known that there is no instantaneous causality between X2t and X1t if and only if Σ12 u = 0 with obvious notation. Therefore the following pair of hypotheses is tested under standard assumptions : ′ 12 H0′ : Σ12 u = 0 vs H1 : Σu 6= 0, R1 PT −1 The block Σ12 ˆ1t u ˆ′2t which converges in probability to 0 Σ12 (r)dr under A1. u is usually estimated by T t=1 u Hence such hypothesis testing does not take into account the time-varying covariance in the sense that it can only

78

Proceedings of MSDM 2013

R1 be interpreted as a global zero restriction testing of the covariance structure, i.e. testing 0 Σ12 (r)dr = 0 against R1 the alternative 0 Σ12 (r)dr 6= 0. Then H0′ and H1′ are inappropriate for testing instantaneous causality in our non standard framework. It is interesting to point out that H0 is a particular case of H0′ , i.e. H0 ⊂ H0′ , since H0′ corresponds to R1 Σ12 (r)dr = 0. On the other hand since 0 Σ12 (r)dr 6= 0 implies that Σ12 (r) 6= 0, then H1′ ⊂ H1 . More 0 R1 precisely, if Σ12 (r) 6= 0 for r ∈ [a, b] ⊆ [0, 1], we may have either 0 Σ12 (r)dr 6= 0, which corresponds to R1 H1 ∩ H1′ , or 0 Σ12 (r)dr = 0, which corresponds to H1 ∩ H0′ . Note that we have H1 = (H1 ∩ H1′ ) ∪ (H1 ∩ H0′ ) and (H1 ∩ H1′ ) ∩ (H1 ∩ H0′ ) = ∅. It is shown in the next part that the case H1 ∩ H0′ entails a loss of power for tests built on the assumption of constant unconditional covariance of the innovations. R1

3.1. Tests based on the assumption of constant error covariance structure In this section the consequences of non constant covariance on the instantaneous causality tests based on the 1 PT spurious assumption of a stationary process are analyzed. Let be δT := T − 2 t=1 ϑˆt where we recall that ϑˆt = u ˆ2t ⊗ u ˆ1t . The standard test statistic is given by ˆ st )−1 δT , Sst = δT′ (Ω where ( ˆ st := Ω

T −1

T X

! u ˆ2t u ˆ′2t

ˆ st → Under A1 it can be shown that Ω general.

R1 0

T −1



t=1

Σ22 (r)dr ⊗

R1 0

T X

!) u ˆ1t uˆ′1t

t=1

11 → Σ22 u ⊗ Σu .

Σ11 (r)dr =: Ωst and we obviously have Ω 6= Ωst in

If the practitioner (spuriously) assumes that the error process is iid but not Gaussian, u1t and u2t could be dependent and the following statistic with White type correction should be used : ˆ w )−1 δT , Sw = δT′ (Ω where ˆ w := T −1 Ω

T X t=1

uˆ2t uˆ′2t ⊗ u ˆ1t uˆ′1t → Ω

under H0 , and it can be shown that Z ˆw → Ω + Ω

1

vec(Σ12 (r))vec(Σ12 (r))′ dr

[5]

0

under H1 . Finally the practitioner may again (spuriously) suppose that the error process is stationary and that the observed heteroscedasticity is a consequence of the presence of nonlinearities. This kind of situation can arise if we assume that the innovations process is driven by a stationary GARCH model. In such a case HAC type weight matrices should be used in the test statistic. For simplicity we focus on the VARHAC weight matrix. Denote by Am,1 , . . . , Am,m the coefficients of the LS regression of ϑˆt on ϑˆt−1 , . . . , ϑˆt−m , taking ϑˆt = 0 for t ≤ 0. Introˆ h = A(1)−1 Σ ˆ z A(1)−1 where A(1) = Id1 d2 − Pm Am,k and duce zˆm,t the residuals of such a regression and Ω k=1 ˆ z = T −1 PT zˆm,t zˆ′ . The order m can be chosen by using an information criterion. The following statistic Σ m,t t=1 involving VARHAC type weight matrix may be used

79

Proceedings of MSDM 2013

ˆ h )−1 δT . Sh = δT′ (Ω Since we assumed that the autoregressive order p is well adjusted (or known), the process ϑt = u2t ⊗ u1t is ˆ h → Ω under H0 and uncorrelated and it can be shown that the Am,k ’s converge to zero in probability. Therefore Ω Z ˆh → Ω + Ω

1

vec(Σ12 (r))vec(Σ12 (r))′ dr,

[6]

0

ˆ h and Ω ˆ w are asymptotically equivalent in the framework of A1. This is not surprising since under H1 , so that Ω second order dynamics are in fact excluded in A1. In this part the asymptotic properties of the above statistics are investigated. The asymptotic behavior of the statistics in our non standard framework is first established under H0 . The results are direct consequences of (3) Proposition 1 Assume that H0 hold. Then under A1 we have as T → ∞ Sst ⇒

dX 1 d2

λj Zj2 ,

[7]

j=1 −1

−1

where the Zj ’s are independent N (0, 1) variables, and λ1 , . . . , λd1 d2 are the eigenvalues of the matrix Ωst 2 ΩΩst 2 . In addition we also have Sw ⇒ χ2d1 d2 and Sh ⇒ χ2d1 d2 . [8] For a fixed asymptotic level α, the standard test (Wst hereafter) consists in rejecting the hypothesis of no instantaneous causality between X1t and X2t if Sst > χ2d1 d2 ,1−α where χ2d1 d2 ,1−α is the (1 − α)th quantile of the χ2d1 d2 law. Therefore it appears from (7) that the standard test is not able to control the type I error since Ωst 6= Ω in general. Denote by Ww (resp. Wh ) the test consisting to reject the hypothesis of no instantaneous causality if Sw > χ2d1 d2 ,1−α (resp. Sh > χ2d1 d2 ,1−α ). From (8) we see that the Ww and Wh tests should have good size properties for large enough T . Now we study the behavior of the test under the alternative of instantaneous causality. In the long version of R1 the paper [GIA 12] it is shown that if the covariance structure of the errors is non constant with 0 Σ12 (r)dr 6= 0, the Wh and Ww tests achieve a gain in power in the Bahadur sense when compared to the Wst test. Here we focus on the ability of the Wst , Ww and Wh tests in detecting instantaneous causality when H1 ∩ H0′ hold, that R1 ˆ h = Op (1) and Ω ˆ w = Op (1) from (5) and (6) and we also is Σ12 (r) 6= 0 and 0 Σ12 (r)dr = 0. In this case Ω 1 1 PT 12 ˆ st = Op (1), while the non centrality term is T − 2 2 have Ω t=1 Σt = o(T ). Therefore we have Si = op (T ) with i = st, w, h in the case H1 ∩ H0′ . When such eventuality is considered, it is clear that the tests based on the assumption of stationary errors may suffer from a severe loss of power. This is a consequence of the fact that this kind of tests are not intended to take into account time varying covariance. The case H1 ∩ H0′ can arise in the important case where Σ12 (r) 6= 0 but close to zero so that Σ12 (r) may have a changing sign. Even when at R1 least one of the components of Σ12 (r) is far from zero, we can have 0 Σ12 (r)dr = 0 as for instance in some cases where the covariance structure is periodic. This can be seen by considering the bivariate case and taking Σ12 (r) = c cos(πr) or Σ12 (r) = c1[0, 12 ] (r) − c1] 12 ,1] (r) with c ∈ R. Therefore the tests based on the spurious assumption of constant unconditional covariance for the error must be avoided. In summary it is found that the tests based on the White and VARHAC corrections should control well the type I errors for large enough samples on the contrary to the Wst . In addition it appears that in the case of non constant unconditional covariance and when H1 ∩ H1′ hold, the Wh and Ww tests have better power properties than the Wst test. Therefore the Wh and Ww tests should be preferred to the Wst test when the unconditional covariance

80

Proceedings of MSDM 2013

is time-varying. However it is also found that the tests based on the assumption of constant covariance may suffer R1 R1 from a severe loss of power in the important cases where 0 Σ12 (r)dr = 0 (or 0 Σ12 (r)dr ≈ 0). A bootstrap test circumventing this power problem in the case H1 ∩ H0′ is proposed in the next part. 3.2. A bootstrap test taking into account non constant covariance structure 1

Introduce δs = T − 2

P[T s] ˆ t=1 ϑt with s ∈ [0, 1] and consider the following statistic : Sb = sup ||δs ||22 . s∈[0,1]

Under H0 and from (4) we write : Z δs ⇒

0

s

(G2 (r) ⊗ G1 (r))dBΩ˜ (r) := K(s),

˜ = M = E(ǫt ǫ′ ⊗ǫt ǫ′ ). Therefore under H0 we have from the Continuous where the covariance matrix becomes Ω t t Mapping Theorem Sb ⇒ sup ||K(s)||22 ,

[9]

s∈[0,1]

since the functional f (Y ) = sups∈[0,1] ||Y (s)||22 is continuous for any Y ∈ D[0, 1], the space of c`adl`ag processes on [0,1]. Under H1 we obtain

T

− 21

δs = T

−1

[T s] X

vˆt + T

t=1

−1

[T s] X

vec(Σ12 t )

[10]

t=1

˜ defined in (4). The first term in the right hand side of (10) converges to zero in probability, while we have with Ω Rs R s P[T s] −1 12 12 T vec(Σ12 (r))dr ||22 = C > 0. Hence we t=1 vec(Σt ) = 0 vec(Σ (r))dr + o(1) and sups∈[0,1] || 0 have in such a situation Sb = CT + op (T ). From (9) we see that the asymptotic distribution of Sb under the null H0 is non standard and depends on the unknown covariance structure and the fourth order cumulants of the process (ǫt ) in a functional form. Thus the statistic Sb cannot directly be used to build a test and we consider a wild bootstrap procedure to provide reliable quantiles for testing the instantaneous causality. In the literature such procedures were used for investigating VAR model specification as in Inoue and Kilian [INO 02] among others. For resampling our test statistic we draw B (i) (i) (i) bootstrap sets given by ϑt := ξt ϑˆt = ξt u ˆ2t ⊗ u ˆ1t , t ∈ {1, . . . , T } and i ∈ {1, . . . , B}, where the univariate (i) random variables ξt are taken iid standard Gaussian, independent from (ut ). For a given i ∈ {1, . . . , B} set 1 P[T s] (i) (i) (i) (i) δs = T − 2 t=1 ϑt and Sb = sups∈[0,1] ||δs ||22 . In our procedure bootstrap counterparts of the xt ’s are not generated and the residuals are directly used to generate the bootstrap residuals. This is motivated by the fact that zero restrictions are tested on the covariance structure, so that we only consider the residuals in the test statistic. It is clear that the wild bootstrap method is designed to replicate the pattern of non constant covariance of the (i) residuals in Sb . More precisely we have under A1 (i)

Sb ⇒P sup ||K(s)||22 ,

[11]

s∈[0,1]

where we denote by ⇒P the weak convergence in probability. A proof of (11) is provided in the long version of the paper [GIA 12]. Denoting E ∗ (.) the expectation under the bootstrap probability measure it is interesting to point

81

Proceedings of MSDM 2013

(i) out that we have by construction E ∗ (ξt ϑˆt ) = 0 even when the alternative is true, that is E(ϑt ) 6= 0 (recall that ϑt = u2t ⊗ u1t ). As a consequence the result (11) is hold whatever Σ(r)12 = 0 or Σ(r)12 6= 0.

The Wb test consists in rejecting H0 if the statistic Sb exceeds the (1 − α) quantile of the bootstrap distribution. R1 Under H1 with 0 Σ(r)12 dr 6= 0 we note that all the statistics considered in this paper increase at the rate T . R1 However when Σ(r)12 6= 0 with 0 Σ(r)12 dr ≈ 0, we can expect that the Wb test is more powerful than the tests based on the assumption of constant unconditional covariance. In such situations we may have Sw = op (T ), Sh = op (T ) while again Sb = Op (T ). If the unconditional covariance is constant, that is Σ12 (r) = Σ12 u , note that Sw = Op (T ), Sh = Op (T ) and Sb = Op (T ). Hence we can expect no major loss of power for the Wb when compared to the Ww and Wh tests if the underlying structure of the covariance is constant. In general since Sb = Op (T ) and in view of (11), the Wb test is consistent. From the above results we can draw the conclusion that the Wb test is preferable in presence of non constant unconditional covariance and for large enough sample sizes. In the long version of the paper [GIA 12], some simulation experiments illustrating the above findings are displayed. It turns out that the Wb test has good finite sample size and power properties in all the cases whether the unconditional covariance structure is constant or time-varying. In accordance with our theoretical results the tests based on the assumption of constant unconditional covariance structure may suffer from a severe loss of power. Some instantaneous causality relationships between US macroeconomic variables are also investigated in [GIA 12].

4. References [DAH 97] DAHLHAUS R., “Fitting time series models to nonstationary processes”, Annals of Statistics, vol. 25, 1997, p. 1-37. [DEN 97] D EN H AAN W., L EVIN A., “A practitioner’s guide to robust covariance matrix estimation”, handbook of statistics 15, chapter 12, eds : G . S . MADDALA and C . R . RAO, Amsterdam : Elsevier, , 1997, p. 299-342. [FRA 04] F RANCQ C., G AUTIER A., “Estimation of time-varying ARMA models with Markovian changes in regime”, Statistics and Probability Letters, vol. 70, 2004, p. 243-251. [GIA 12] G IAI G IANETTO Q., R A¨I SSI H., “Testing instantaneous causality in presence of non constant unconditional variance”, Working paper, arXiv :1207.3246v1, 2012. [GRA 69] G RANGER C., “Investigating causal relations by econometric models and cross-spectral methods”, Econometrica, vol. 12, 1969, p. 424-438. ` [HOR 06] H ORV ATH L. KOKOSZKA P., Z HANG A., “Monitoring constancy of variance in conditionally heteroskedastic time series”, Econometric Theory, vol. 22, 2006, p. 373-402. [INO 02] I NOUE A., K ILIAN L., “Bootstrapping autoregressions with possible unit roots”, Econometrica, vol. 70, 2002, p. 377-391. [LUT 05] L UTKEPOHL H., “New Introduction to Multiple Time Series Analysis”, Berlin, , 2005, p. 46-47. [PAT 12] PATILEA V., R A¨I SSI H., “Adaptive estimation of vector autoregressive models with time-varying variance : application to testing linear causality in mean”, Journal of Statistical Planning and Inference, vol. 142, 2012, p. 2891-2912. [PAT 13] PATILEA V., R A¨I SSI H., “Portmanteau tests for stable multivariate autoregressive processes”, Journal of Multivariate Analysis, vol. 116, 2013, p. 190-207. [SEN 04] S ENSIER M., VAN D IJK D., “Testing for volatility changes in U.S. macroeconomic time series”, Review of Economics and Statistics, vol. 86, 2004, p. 833-839. [WHI 80] W HITE H., “A heteroskedasticity consistent covariance matrix estimator and a direct test for heteroskedasticity”, Econometrica, vol. 48, 1980, p. 817-838.

82

Proceedings of MSDM 2013

Consistency and Asymptotic Normality of the Generalized QML Estimator of a General Volatility Model. Christian Francq1,2 , Fedya Telmoudi2,3 , Mohamed Limam3,4 1

CREST University Lille 3 (EQUIPPE) 3 ISG Of Tunis, LARODEC 4 Dhofar University, Oman 2

This paper investigates the behavior of the generalized quasi-maximum likelihood estimator of a general conditionally heteroscedastic model. Consistent estimation and asymptotic normality are demonstrated. An illustration is proposed. ABSTRACT.

KEYWORDS:

Generalized Quasi Maximum likelihood, Volatility model, Scaling, CAN properties.

1. Introduction A simple and widely formulation of a general conditionally heteroscedastic model is of the form  t = σt ηt σt = σ(t−1 , t−2 , . . . ; θ0 )

[1]

as example, we find the GARCH(p, q) model of Engle (1982) and Bollerslev (1986), defined by 

t = σt ηt P Pp q 2 σt2 = ω0 + i=1 α0i 2t−i + j=1 β0j σt−j

[2]

where ω0 > 0, α0i ≥ 0, β0j ≥ 0. where (ηt ) is a sequence of independent and identically distributed (iid) random variables, with ηt independent of {u , u < t}, θ0 ∈ Rm is a parameter belonging to a compact parameter space Θ, and σ : R∞ × Θ → (0, ∞). Note that the model (1) is not identifiable without a scaling assumption on the distribution of ηt . The standard identifiability assumption is Eηt2 = 1, but we do not need to make this assumption at this stage.The variable σt2 is generally referred to as the volatility of t . For the standard volatility models, the following assumption is satisfied. A1 : There exists a function H such that for any θ ∈ Θ, for any K > 0, and any sequence (xi )i Kσ(x1 , x2 , . . . ; θ) = σ(x1 , x2 , . . . ; H(θ, K)). In the GARCH(1,1) case, we have H(θ0 , K) = (K 2 ω0 , K 2 α01 , β01 )0 . The QML estimator (QMLE) consistency and asymptotic normality, have been investigated for a wide range of strictly stationary GARCH models. The consistency and asymptotic normality (CAN) of this estimator requires only few regularity assumptions, and the standard identifiability condition Eηt2 = 1, (see Weiss [WEI 86], Berkers et al. [BER 03], Francq and Zakoian [FRA 04], Mikosch and Straumann [MIK 06]). In the framework of standard GARCH models, Berkes and Horv´ath [BER 04] introduced generalized non-gaussian QMLE and established their CAN under alternative identifiability conditions. For the general model (1), Francq and Zakoian [FRA 10] showed that particular generalized QMLE lead to convenient one-step predictions of the powers |t |r , r ∈ R.

83

Proceedings of MSDM 2013

2. Asymptotic behaviour of the Generalized QMLE Given observations 1 , . . . , n , and arbitrary initial values e i for i ≤ 0, let σ et (θ) = σ(t−1 , t−2 , . . . , 1 , e 0 , e −1 , . . . ; θ). This random variable can be seen as proxy of σt (θ) = σ(t−1 , t−2 , . . . , 1 , 0 , −1 , . . . ; θ). Given an instrumental density h > 0, consider the QML criterion n

X e n (θ) = 1 g(t , σ et (θ)), Q n t=1

g(x, σ) = log

1 x h , σ σ

[3]

and the (generalized) QMLE e n (θ). θˆn∗ = arg max Q θ∈Θ

Throughout the text, starred symbols are used to designate quantities which depend on the instrumental density h. This estimator is the standard Gaussian QMLE when h is the standard Gaussian density φ.

2.1. Technical assumptions for the consistency. To establish the CAN of θˆn∗ , Berkes and Horv´ath [BER 04] and Francq and Zakoian [FRA 10] make assumptions similar to the following ones. A2 : (t ) is a strictly stationary and ergodic solution of (1), E|1 |s < ∞ for some s > 0. A3 : For some ω > 0, almost surely, σt (θ) ∈ (ω, ∞] for any θ ∈ Θ. Moreover, for any θ1 , θ2 ∈ Θ, we have σt (θ1 ) = σt (θ2 ) a.s. iff θ1 = θ2 . A4 : The function σ → Eg(η0 , σ) is valued in [−∞, +∞) and has a unique maximum at some point σ∗ ∈ (0, ∞). Remark 1 Note that A4 is much less restrictive than the analog assumption in Francq and Zakoian [FRA 10], which requires a maximum at σ∗ = 1 (see A3 in Francq and Zakoian [FRA 10]). Note also that we do not need any identifiability condition of ηt (such that Eηt2 = 1). We need weaker assumptions because, in our framework, it will only be necessary to defined the volatility up to an unknown multiplicative constant. A5 : The instrumental density h is continuous on R, it is also differentiable, except possibly in 0, and there exist constants δ ≥ 0 and C0 > 0 such that, for all u ∈ R \ {0}, |uh0 (u)/h(u)| ≤ C0 (1 + |u|δ ) and E|η0 |2δ < ∞. A6 : There exist a random variable C1 measurable with respect to {u , u < 0} and a constant ρ ∈ (0, 1) such that supθ∈Θ |σt (θ) − σ et (θ)| ≤ C1 ρt . Under A1 and A4, define the parameter θ0∗ = H(θ0 , σ∗ ).

[4]

A7 : The parameter θ0∗ belongs to the compact parameter space Θ. 2.2. Additional assumptions for the asymptotic normality. ◦

A8 : The parameter θ0∗ belongs to the interior Θ of Θ. ∂σt (θ0∗ ) A9 : There exist no non-zero x ∈ Rm such that x0 ∂θ = 0,

a.s.

84

Proceedings of MSDM 2013

A10 : The function θ 7→ σ(x1 , x2 , . . . ; θ) has continuous second-order derivatives, and

2

2

∂σt (θ) ∂e et (θ) σt (θ)

≤ C1 ρt ,

+ ∂ σt (θ) − ∂ σ − sup

∂θ ∂θ ∂θ∂θ0 ∂θ∂θ0 θ∈Θ where C1 and ρ are as in A6. 0

A11 : h is twice continuously differentiable, except possibly at 0, with |u2 (h0 (u)/h(u)) | ≤ C0 (1 + |u|δ ) for all u ∈ R \ {0} and E|η0 |δ < ∞, where C0 and δ are as in A5. A12 : There exists a neighborhood V (θ0∗ ) of θ0∗ such that

1 ∂σt (θ) 4

sup

σt (θ) ∂θ , ∗

θ∈V (θ0 )



1 ∂ 2 σt (θ) 2

sup

σt (θ) ∂θ∂θ0 , ∗

θ∈V (θ0 )

σt (θ0∗ ) 2δ sup ∗ σt (θ)

θ∈V (θ0 )

have finite expectations. The following lemma follows easily from Berkes and Horv´ath [BER 04] and Francq and Zakoian [FRA 10]. Lemma 1 (Asymptotic behavior of generalized QMLE) If A1-A7 hold, then θˆn∗ → θ0∗ ,

a.s.

where θ0∗ is defined by (4). If, in addition, A8-A12 hold and Eg2 (η0 , 1) 6= 0 then  √  ∗ L n θˆn − θ0∗ → N (0, τh J∗−1 ) where

4Eg12 (σ∗−1 η0 , 1) J∗ = 4EDt (θ0∗ )Dt0 (θ0∗ ) and τh =  2 , Eg2 (σ∗−1 η0 , 1)

[5]

with Dt (θ) =

1 ∂σt (θ) , σt (θ) ∂θ

g1 (x, σ) =

∂g(x, σ) ∂σ

and

g2 (x, σ) =

∂g1 (x, σ) . ∂σ

3. Illustrations. Example 1 Assume for instance that we have a standard GARCH(1,1) with parameter θ0 = (ω0 , α0 , β0 ) and ηt ∼ N (0, 1). Note that p E|η1 | = 2/π. If we take the Laplace distribution e−|x| /2 as instrumental density h, then θˆn∗ thus converges to θ0∗ = (2ω0 /π, 2α0 /π, β0 ). Corollary 1 If A1 holds true when σt is replaced by σ ˜t , i.e. if Kσ ˜t (θ) = σ ˜t (θ) {H(θ, K)} ,

[6]

then the estimator H (θ, K) is not changed if h(x) is replaced by hs (x) = s−1 h(s−1 x), for any s > 0. Example 2 As an example, consider the case where h is the Generalized Error Distribution of shape parameter κ > 0 and scale parameter s > 0, denoted by GED(κ, s), and defined by h(x) ∝ exp (−|x/s|κ /2). We then have, for x 6= 0, κ|x/s|κ h0 (x) = − . h 2x 85

Proceedings of MSDM 2013

We obtain σ∗ =

1/κ  κ κ . E|η | 1 2sκ

Moreover, we have g1 (x, σ) = − Thus,  g1

1n κ 1− κ σ 2s

x κ o , σ

g2 (x, σ) =

 κ |η1 | η1 , 1 = −1 + κ, σ∗ E |η1 |

 g2

κ 1 n 1 − (1 + κ) κ σ2 2s

 κ η1 |η1 | , 1 = 1 − (1 + κ) κ σ∗ E |η1 |

and τh := τGED

4 = 2 κ

x κ o . σ



E |η1 |

κ 2

(E |η1 | )

! −1 .

[7]

Note that τh does not depend on the scale parameter s, which is in accordance with Corollary 1.

4. Conclusion This paper has shown that the generalized quasi maximum likelihood will consistently estimate the parameters of a general conditionally heteroscedastic model and it is proved that it is scale invariant. This result is obtained under much less restrictive assumptions than those existing in the literature which requires a maximum at σ∗ = 1. An arbitrary instrumental positive density h can be used to define the QML criterion. It is proved that whatever the distribution of ηt the consistency of the QML estimator is ensured. We characterized the limit θ0∗ as a function of h and of the distribution of ηt .

5. References ´ [BER 03] B ERKES I., H ORV ATH L. , KOKOSZKA P., “ GARCH processes : structure and estimation ”, Bernoulli, 2003, p. 9, 201-227. ´ [BER 04] B ERKES I., H ORV ATH L. , “ The efficiency of the estimators of the parameters in GARCH processes ”, The Annals of Statistics , 2004, p. 32, 633-655. [BOL 86] B OLLERSLEV T., “Generalized autoregressive conditional heteroskedasticity”, p. 31, 307-327.

Journal of Econometrics, 1986,

[ENG 82] E NGLE R., “ Autoregressive conditional heteroscedasticity with esti- mates of the variance of uk inflation ”, Econometrica , 1982, p. 987-1008. [FRA 04] F RANCQ C. , Z AKOIAN J M. , “ Maximum likelihood estimation of pure GARCH and ARMA-GARCH processes ”, Bernoulli , 2004, p. 10, 605-637. [FRA 10] F RANCQ C. , Z AKOIAN J M. , “ Optimal predictions of powers of conditionally heteroskedastic processes ”, MPRA Preprint No. 22155., 2010. [MIK 06] M IKOSCH T., S TRAUMAN D. , “ Stable limits of martingale transforms with application to the estimation of GARCH parameters ”, The Annals of Statistics , 2006, p. 34, 493-522. [WEI 86] W EISS A A., “ Asymptotic theory for ARCH models : estimation and testing ”, Econometric Theory, 1986.

86

Proceedings of MSDM 2013

Testing for mean and volatility contagion of the subprime crisis: Evidence from Asian and Latin American stock markets Leila JEDIDI1, Wajih KHALLOULI2, Mouldi JLASSI1 1

ESSEC Tunis, DEFI ESSEC Tunis, UAQUAP (ISG Tunis)

2

ABSTRACT: In this paper, we examine both mean and volatility contagion of the US subprime crisis on emerging markets namely Brazil, Mexico, Argentina, India, Hong Kong, Indonesia, Malaysia and South Korea. The mean contagion is modelled through the DCC-GARCH model proposed by [ENG 02] and the volatility contagion is measured by the EGARCH model introduced by [NEL 91]. In addition, an iterated cumulative sum of squares (ICSS) test is used to identify endogenously crisis periods in the US stock market during the crisis of 2007-2009. Using daily data from Mars 11st, 2005 to Mars 12nd 2010, our results prove that the Asian region was more vulnerable to contamination unlike the Latin American region. Our results provide evidence of mean contagion only in Mexico, Indonesia, India and Malaysia stock markets. Volatility contagion is identified in all used Asian markets. However, LA’s stock markets were affected by the subprime crisis only through volatility interdependence. Therefore, our finding conclude that the “financial decoupling” hypothesis is not validated. KEYWORDS: Subprime crisis, Contagion, LA and Asian stock markets, ICSS, DCC-GARCH model.

1.

Introduction

The US subprime crisis that has triggered in August 2007, following the rise of defaults on the subprime mortgage lending in the United States, was not actually confined to the US mortgage markets. The crisis spreads to the entire financial markets not only in the US, but also to all sectors into the finance and real economy of developed countries [HOR 09]. In addition, [DOO 09] find that many emerging markets were affected by the Global Financial Crisis (GFC) after a first phase of resistance until the Lehman Brothers bankruptcy in September 2008. These facts have revived the policy debate on the “financial decoupling-recoupling” hypothesis of emerging markets [POW 08]; [PER 09]; [DUF 11]. To test for “financial decoupling” hypothesis, few empirical researches have been interested on the impact of the US subprime crisis on emerging equity markets [DOO 09], despite the abundant literature which has been devoted to the current financial crisis. [DOO 09] examine emerging markets (from Latin America (LA), Asia, Central Europe and other region) vulnerability to the US financial subprime crisis. Using the “event study” approach, they have shown the significant impact of financial and real economic news from US on emerging markets with strong linkages with US economy. Analysing linkage behaviours using Granger causality tests and impulse response, the authors have found evidence of decoupling hypothesis only until summer 2008. From that point on, however, emerging markets became strongly recoupled with US economy. [SAM 11] uses a large panel of 62 between emerging and frontier1 markets in order to show evidence of interdependence and contagion due to shocks from U.S. market to emerging markets and vice-versa. They found that there is contagion from U.S. only to Latin America markets. However, Asian markets have a strong contagious effect from U.S. during the GFC. [ALO 11] use data from BRIC markets (Brazil, Russia, India and Chine) to investigate contagion with U.S. market. Using a multivariate copula approach which allows capturing the non linear interdependences, their results showed not only the 1

Frontier market is defined as small and illiquid markets.

87

Proceedings of MSDM 2013

evidence of time-varying dependence between each markets pairs but also extreme comovement with the U.S. market. Others papers have focused on regional investigation to assess the impact of the subprime crisis on emerging markets. [SYL 11] have tried to capture a potential contagion effect from US market to Central and Eastern European markets during several recent crisis episodes. They examined the time varying conditional correlation computed by a DCC-GARCH model. Their results show the shift behaviours in CEE emerging markets particularly around the 2007-2009 financial turmoil due to increased participation of foreign investors after the accession of these countries to the European Union in 2004. [LAG 09] have tested the MENA stock markets’ vulnerability using the classical approaches based on shifts in conditional correlation coefficients between tranquil and crisis periods following [FOR 02], [COR 05], [FAV 02] and [BAU 09]. Their results highlighted the increasing vulnerability observed in this region during the ongoing financial crisis. [KHA 12] have tested both mean and volatility contagion in the MENA context using a Markov-Switching approach. They provided evidence of a persistence of MENA stock markets recession characterised by low mean/high variance regimes which coincides with the third phases of the subprime crisis. Their results confirmed also evidence of mean and volatility contagion in MENA stock markets caused by the US stock market. For Asian and Latin American countries, a few empirical studies have been investigated focusing only on volatility spillover from U.S. markets during the subprime crisis. [DUF 11] have used a time-varying transition probability MarkovSwitching model which identify endogenously crisis and non-crisis periods, in order to detect the rise in the volatility in the aftermath of the 2007/2008 crisis. Their results reject the financial decoupling hypothesis of LA’s stock markets and especially in Mexico since high volatility regime in these markets is explained by the US financial stress. [SHA 09] have studied the consequences of the recent US crisis on South East Asia major stock markets. They used a BEKK-GARCH model in order to examine volatility spillover from U.S. stock market during the crisis. They found that volatility in South East Asia equity markets is driven by shocks occurring in the US equity markets. The objective of this paper is to analyse the degree of interdependence across US and emerging markets of the LA and Asian emerging markets and also to investigate a possible shift in the spillover channels during the US subprime crisis 2007-2009 in order to test for contagion [RIG 03]. Following [BAU 03], we consider two types of contagion: mean contagion and volatility contagion. The first is realised when changes in the US market returns affect returns in one emerging stock market. Furthermore, we consider volatility contagion when the sudden rise in the return volatility in the emerging market is caused by the change in the US market volatility2. In this article we are interest mainly in the impact of the subprime crisis on the Asian and Latin American countries. Our choice of these markets is motivated by their growth model, based on financial integration. Since, these countries pursued from the years 1980 a process of financial liberalization in order to promote a bigger international financial integration. 
Besides, these countries are characterized by an economic heterogeneity, where some are producers of raw materials and basic products (Brazil); others focus their development on the services (India) and others on the manufacturing production (China), from where this diversification attracts more investors. Also, they are more mature and more experienced. In addition, since the year 2000, these countries have taken many reforms in order to reinforce their situation and limit the consequences disastrous of the crises that they knew. They led the macroeconomic policies aiming to reinforce their capacities of reaction to the outside shocks, to improve their financial system surveillance, the prudential regulation, and to consolidate their banking system. Despite these significant developments, these countries remained under-investigated in the literature of the U.S. subprime crisis contagion. Our article contributes to the literature in at least two points. Firstly, we select endogenously the crisis window. Indeed, [GRA 06] point out the subjective and arbitrary choice of the structural change points which define the beginning and the end of the crisis window. In order to avoid this problem, [RIG 03] recommends using a crisis window containing the same volatility regime of the country generating the crisis. Hence, we identify the crisis windows defined by sudden change in US stocks markets volatility, using the iterated cumulative sums of squares (ICSS) algorithm. Secondly, we extend works on Asian and Latin American markets by taking into account two kinds of contagion through which subprime crisis can spread: mean and volatility contagion. Testing both mean contagion and volatility contagion allow us to identify all types of negative effects 2 In this case, changes in volatility of the US market increase the volatility in MENA market during a particular period of time. That is the volatility contagion of [BAU 03].

88

Proceedings of MSDM 2013

of the US subprime crisis on the emerging stocks markets since mean contagion is not necessarily associated with the volatility contagion [BAU 03]. The layout of this paper is structured as follows: Section 2 outlines our data and the methodologies. Section 3 reports our empirical results. Section 4 offers some concluding remarks.

2.

Methodology

2.1. Detecting for structural breakpoints To testing for contagion, the choice of the crisis windows is crucial to having robust results. On the one hand, for [BIL 03], a long period of crisis must include observations generated by other regimes, and not only by the regime of crisis. The coefficient of interrelationship between the financial markets during the period of crisis becomes a linear combination of the different regime coefficients. In this case, the probability of dismissal of the null hypothesis of non contagion lowers. On the other hand, if the period of crisis is short enough, the risk to include other regimes is very weak but the risk to reduce the power of the test rises [DUN 01]. [RIG 03] define the crisis window as a high-volatility period in market generating the increase of the variance. Therefore, in order to really identify the crisis windows, we use the test of structural change of the volatility (ICSS). This ICSS algorithm allows us to detect discrete changes in the stock return variance of the “ground zero” country. It is assumed that the variance of a time series is stationary between two different sudden changes that occur as the result of global or regional events. After the second breakpoint, the variance changes and reverts to stationary until another shock occurs. More formally, we define u t as a series of independent observations, with zero mean and variance σ t . In 2

order to determine the number of changes in variance and the point in time at which variance shifts of u t , a cumulative sum of squares Ck and the Dk statistics are calculated as follows: k

C k = ∑ u t2 , t = 1,..., T

(1)

t =1

Dk =

Ck k - , t = 1,..., T with D0 = DT = 0 CT T

(2)

We use the iterated algorithm of [INC 94] that employs the

Dk

function to systematically look for change at

different points of the series. This algorithm is based on successive evaluation of the Dk over different time periods, which are determined by the found breakpoints3. Once we detect the k structural breakpoints of the “ground zero” country’s volatility, we next create a set of dummy variables DMk,t in order to use them to test contagion. DMk,t takes a value of one on the time period between two structural breaks constituting an homogenous volatility regime, and take a value of zero elsewhere.

2.2. Testing for mean and volatility contagion There is still wide disagreement about what contagion is exactly and how it should be tested empirically. In practice, contagion test that are based on the correlation coefficient neglect the volatility as a potential factor of contagion. Therefore, in this section, we present two methodologies that explain contagion. In the first stage we estimate the DCC-GARCH to compute conditional correlations and testing for mean contagion following [CHI 07]. In the second stage, we use the univariate EGARCH model in the object to analyse the volatility contagion following [BAU 03].

3

See [INC 94] for more discussion of the different algorithm steps.

89

Proceedings of MSDM 2013

2.2.1.

Mean contagion: DCC-GARCH model

To compute conditional correlations between the “ground zero” and the contaminated stock markets we use the DCC-GARCH model introduced by [ENG 02]. This model is used to examine the time-varying correlations. Therefore, in order to test the mean contagion between an emerging and US stock markets, we estimate a bivariate DCC-GARCH model which provide us all the pair-wise dynamic conditional correlations between US and each emerging market i noted ρ i , t :

ρ i ,t = corr ( rUS ,t , ri ,t ) =

q iUS ,t qUSUS ,t q ii ,t

(3)

Where q ij , t is the typical element of the symmetric positive definite covariance matrix Q t 4:

q ij ,t = ρ ij (1 − α − β ) + β q ij ,t −1 + α u i ,t −1u j ,t −1 Where ρ

ij

(4)

is the unconditional correlation between standardized residual is series uit and ujt for i≠j and is equal

1 for i=j, α and β are non-negative scalars satisfying α + β < 1 . We then use dummy variables DMk,t representing the US volatility structural changes (turmoil US subperiods) in order to look into the time-series behaviour of the dynamic correlation coefficient ρ i , t and sort out the impact of US stock on emerging stock markets movements and variability. The regression model is given by:

ρ i ,t =

P

∑φ p =1

K

p

ρ i ,t − p + ∑ γ k DM k ,t + eij ,t

(5)

k =1

The lag length p in equation (5) is determined by the AIC criterion. Note that following [CHI 07] the significance of the coefficient γ k with a positive sign indicates structural change in mean or/and variance of the time-varying conditional correlation between US and country i. Such a shift would clearly identify a shiftcontagion process since it involves an increase in pair-wise interdependence after the crises. Following [BAU 03] this type of contagion is qualified by “mean contagion”.

2.2.2.

Volatility contagion: Baur (2002)’s test

In addition to the mean contagion, [BAU 03]. distinguishes between volatility spillover and volatility contagion. He considers that, in contrast to volatility spillover which can occur at any time, the volatility increase’s impact in one market on the conditional volatility of another stock market takes place only during crises periods. Then, following [BAU 03] we employ the exponential GARCH (EGARCH) model developed by [NEL 91], in which it is not necessary to restrict the parameters to be non-negative. This approach allows us to conceptualize the notion of volatility contagion as a significant effect of the volatility increase of the US stock market on the conditional volatility of another emerging stock market.

The AR-EGARCH model is given as follows:

4

This matrix is computed from the standardized residual series ui obtained from two stages estimations. In the first stage the univariate GARCH (1,1) model is estimated for each stock market and the residual series are recovered. Then, in the second stage, these residuals and the standard deviations obtained from the first stage are used to estimate the dynamic correlation among the pair-wise market returns (see ENG 2000 for detailed econometric specification and two stages estimation).

90

Proceedings of MSDM 2013

ψ ( L ) Ri ,t = µ + et

(6)

et ~ iidN (0, h ) t

(7)

 ε  ε 2 ˆ2 log(ht ) = ω + α  t −1 − 2  + β log(ht −1 ) + δ t −1 + d1eˆUS ,t −1 + d 2 eUS ,t −1 DM k ,t −1 π h h  t −1  t −1

(8)

et is the error term for the return at time t , ψ (L) is the lag operator and µ is the intercept term. Eq.(8) allows us to model the conditional variance of each Ri ,t . It where ‘ Ri ,t ’ represents the return under investigation,

ensures the positive conditional variance and accounts for the leverage effect5. The coefficient capturing asymmetric respond of conditional variance to shock

εt

2 US ,t

of either sign. eˆ

δ

allows

is the square of the

residual series obtained from estimating Eq. (6) using US returns. It is used as a volatility proxy of US stock market. The first coefficient d1 captures the volatility spillover from US to the emerging market i. However, the second coefficient d2 detects the additional effect during the crisis period captured by the dummy variable DMk,t-1. Following [BAU 03], the significance of this additional effect could be interpreted as the volatility contagion.

3. Data and empirical results To identify contagion of emerging markets before the US subprime crisis 2007-2009, we use stock indices of three Latin American’s countries and five Asian’s countries namely: Bresil (Bovespa), Mexique (IPC), Argentine (Merval), Inde (bse-30), Hong Kong (Hang seng), Indonesie (jkse), Malaisi (Klse), South Korea (KS11). In addition, to investigate contagion from the US stock market, we base our analysis on the S&P500 of the US stock market index price as the originate country of the crisis. All indices are denominated in US dollars6. Following [FOR 02] the stock market returns are obtained from the average of two consecutive daily series since US markets and other emerging markets are not open at the same time. Then, returns are computed as the logarithm difference of two day average of stock index values and calculated as follows: rt = 100 × ln( pt / pt − 2 ) , where pt is the stock price on the date t. The data are sampled over the period from Mars 11st, 2005 to Mars 12nd 2010, providing a total of 1250 observations.

3.1. Detecting for crisis periods The ICSS algorithm is used to identify different crisis sub-periods in the US stock market by detecting structural sudden changes in the volatility of S&P500 returns. Table1 reports the time periods of different regimes of unconditional variance. Therefore, the S&P500 returns show nine sudden change points corresponding to ten distant volatility regimes. The last column in table 1 reports some economic and political events that correlated with the time of these breakpoints. Looking at table 1, we could observe the first significant great increase7 in volatility which occurred on the period from October 23rd, 2007 to September 12nd, 2008 when standard deviation jumped from 1.44 to 21.97. 5

When a negative shock generates more volatility than a positive shock of equal magnitude.

6

All data were extracted from the web site: http://fr.finance.yahoo.com.

7

Our explanations only consider volatility increase.

91

Proceedings of MSDM 2013

This change in volatility corresponds to the market plunge following the behaviour of the rating agencies. In fact, they have a part of responsibility in the appearance and the development of the subprime crisis, as they have given an excellent notation to risky structural product. Also, in this period there are many financial institutions have loosed such as the Merrill Lynch. The next time period of high volatility is from September 12nd, 2008 to February 12, 2008, which coincids with the consequence of the Lehman Brother’s bankruptcy. According to [BAR 09], this time period defines clearly the global equity market crisis. Indeed, they have examined the effect and the magnitude of the subprime crisis on global equity markets and their major components. They found that the equity market reaction is refer to the mortgage and banking crisis until July/August 2008. While the real collapse equity market starts in the middle of September 2008 (the bankruptcy of Lehman Brothers and the bailout of AIG). On the whole, we construct ten dummy variables DMk,t representing distinguish volatility regimes of the US stock market before and after the sub-prime crisis 2007-2009.

3.2. Testing for mean and volatility contagion The next step is to incorporate the sudden change dummy variables in dynamic conditional correlation regression (Eq. 5) and EGARCH model (Eq. 8) and examine the impact of US subprime crisis on LA and Asian stock markets. In order to compute the dynamic conditional correlations, we start by estimating bivariate DCC-GARCH (1,1) models between US and each emerging market8. Figure 1 depicts different dynamic correlation series from July 2005 to December 2010. Interestingly, we can see the sharp increase in all correlation series during the 2008-2009. In other wise, after the Lehman brother’s failure (third quarter of 2008), we can see a higher correlation between the US and our all countries of our sample. Indeed, the conditional correlation coefficient between US and Brazil increases from 0.6 before the subprime crisis to 0.9 after the collapse of Lehman Brother. During the same post-collapse, the correlation with Mexico and Argentina are ranged from 0.6 to 0.8. Moreover, the correlation with Malaysia, Korea, Indonesia, India, and Hong Kong are ranged between 0.3 and 0.6 after the crisis with a particular rise during the period of 2008-2009. Additionally, we can note that correlation of the US with LA stock market returns is more noticeable and higher than those with Asian stock market returns during the sub-prime crisis. This increase of interdependence between the US and emerging markets can be explained by many channels of transmission through which output losses are associated with financial crisis. In the case of LA’s stock markets, we can note the channel of the financial linkages9 or the international trade10 between countries. Indeed, the importance in the coefficient of interdependence between the US and LA’s markets could probably be explained by the fundamental links (trade and financial channels) which are based on the regional proximity [GLI 99]. These markets belong to the same region as constituting a densely intertwined and interconnected. Therefore this higher level of interdependence can be the cause of the neigbhors and their geographic proximity. For the case of Asian stock markets, the “wake-up call” hypothesis11 is more plausible in the absence of geographic proximity. [VAN 03] provide evidence for the wake-up call hypothesis from the Russian crisis which caused generalized outflows from emerging markets. To investigate the relevance of the mean and volatility contagion, we can prove the existence of financial contagion on our selected countries from US stock market during the distinct identified phases of the subprime crisis. Although ICSS test has detected ten US volatility regimes, we delete the first, the second and the last subperiods and we use only seven dummy variables for the seven different windows for the subprime crisis 20072009. Table 2 reports results of significant changes associated with different phases of crisis. Results show that none of DM1 and DM7 are statistically significant, indicating that there is no mean contagion from US stock 8

Results of bivariate GARCHs estimations are available from authors upon request.

9

See, [GOL 05], [ALL 00], and [DAS 04].

10

For a theoretical formalization of this idea see, [GER 95].

11

See [RIG 98]. [VAN 03] for literature reviews.

92

Proceedings of MSDM 2013

market on the analyzed emerging markets during the early (12/02/2007-09/03/2007) and the last phase (20/04/2009-07/15/2009) of the subprime crisis. Our results reveal that Malaysian stock market is the first affected market during the second subprime crisis phase (9/03/07-12/07/07) as the coefficient associated to DM2 is positive and statistically significant. This result provides evidence of the structural change in the returns correlation between US and Malaysian markets. Given the positive sign, the fall in the US stock market returns decreases the Malaysian stock markets. We could interpret this result as the evidence of mean contagion from the US Subprime financial crisis to Malaysia. Moreover, coefficients associated to DM2 and DM3 in the case of Indonesian stock markets are statistically significant but with negative sign implying the sudden decrease in the conditional correlation. According to [FAV 02], this decrease reveals the evidence of “flight-to-quality” at the beginning of the subprime crisis. International Investors on the US stock markets look for other better quality investments on the Indonesian markets. This phenomenon of “flight-to-quality” is shown clearly before the subprime crisis [LON 04]. However, during the next sub-period of 23/10/07- 12/09/08 (associated to DM4 dummy variable) the Indonesian stock market was affected by the failure of the US market. We identify also during the same sub-period the contamination of the Mexican stock market during this crisis’s phase. More interestingly, during the following sub-period 12/09/08--2/12/08 during which the beginning of the Global Financial Crisis (US crisis spreads to all developed countries)12, we identify only the “fligh-to-quality” – the coefficient associated to DM5 is significant and negative – for Mexican, Indian and Malaysian stock markets. This result confirms that international investors have tried to looking for other investments in some emerging markets after the collapse of developed countries and particularly the European markets. Just after the first phase of the Global Financial Crisis and during the first quarter of 2009, our results prove the contamination of these stock markets (Mexico, India and Malaysia) by the mean contagion from US crisis. Indonesia’s stock market is also affected by the US crisis during this same period. Therefore, for mean contagion, our results prove that the Asian region was more vulnerable to contamination unlike the Latin American region. We then try to test volatility contagion between the US and analyzed emerging markets. The results are displayed in table (3) and table (4) for LA and Asian stock markets, respectively. Our results verify the presence of volatility interdependence13 for all emerging markets except for Malaysia during the total crisis period 20072009. In fact all the coefficients associated to d1 (see Eq. 8) are positive and statistically significant regardless all the used sub-periods of the crisis. However, in the Malaysia’s case, the financial interdependence is verified only using both dummy variables DM4 and DM5 (period from October 2007 to December 2008). In addition, results of volatility contagion (parameter d2) in table (3) show that the LA’s markets (Brazil, Mexico, Argentina) are contaminated by the subprime crisis only in the first phase of the crisis from February 2007 to Mars 2007. However, results relatively to other crisis’s phases prove that LA’s stock markets are stay speared from the volatility contagion. 
Thereby, they had undertaken reforms after the failure of the US stock market, which were insulate them from sudden changes in international investors expect. While, the phenomena of volatility interdependence remain usually present explaining the contamination of these markets via fundamental financial links. Moving to the volatility contagion in the Asian emerging markets, we can observe in table (4) that volatility interdependence between US and Asian stock markets is present during all the phases of the subprime crisis. Concerning volatility contagion, results show that the Korea’s stock market is the most country which is contaminated in the different phases of 2007-2009 crisis. The India’s stock market is contaminated from july 2007 to October 2008. In addition, when crisis takes the nature of GFC (from September 2008) we note the presence of volatility contagion in the last sub-period (from April 2009 to July 2009). While Hong Kong were affected by the subprime crisis only during the sub-period 10/23/2007 – 09/12/2008, as the coefficient associated to d2 during the fourth sub-period (DM4) is positive and statistically significant. Indonesia is contaminated during two sub-periods: from July 2007 to October 2007 and from April 2009 to July 2009. However, the subprime crisis has affected the Malaysian stock market during the sub-periods from February 2007 to Mars 12

During this sub-period the US stock market returns was in its higher volatility regime (see table 1).

13

Which is detected by the coefficient d1 in EGARCH equation.

93

Proceedings of MSDM 2013

2007 and from October 2007 to September 2008. Therefore, we can conclude that the volatility contagion can be found for all selected Asian emerging markets, while each markets is contaminated within a specific sub-period. We can conclude that the LA’s markets are spared from this subprime crisis not only in term of volatility contagion but also in term of mean contagion. Therefore, we can qualify the transmission of the subprime crisis to LA’s markets only by the phenomenon of interdependence. However, the Asian region is more affected by the last GFC and we can prove the presence of mean and volatility contagion in these countries. This finding can confirms the “financial decoupling” hypothesis of Asian markets during the first phase of Subprime crisis. This hypothesis is no longer validated during the GFC that started in 2008. That’s why we deny the presence of the re-coupling hypothesis within Asian markets.

4.

Conclusion

The recent financial turmoil started with the collapse of the US mortgage market in 2007, has developed with the behaviour of the rating agencies and the Lehman Brother bankruptcy. Therefore, it has reinforced the issue of contagion effect in both developed and emerging economies. This study contributes to a better understanding of this phenomenon by detecting the time period of sudden changes in volatility of the “ground zero” country which is the US stock market. We find that the mean contagion which concerns Mexico, Indonesia, India and Malaysia, tends to be more significant in the first part of the crisis. Volatility contagion is proved for all cases of Asian markets but it is not identified in any LA’s markets which are affected by volatility interdependence of the US market. Therefore, we can conclude that the “financial decoupling” hypothesis is not validated. Indeed, the decoupling between emerging and developed economies has never been a reality. Thus, Asia and Latin America were not spared by the slowdown in the United States.

References

[ALO 11] Aloui R., Ben Aissa M., Nguyen D. “Global financial crisis extreme interdependence and contagion effects . The role of economic structure”. Journal of Banking and Finance, 2011, PP. 130-141. [BAU 03] Baur. D, “Testing for contagion – mean and volatility contagion”. 2003, Journal of multinational financial Management 13, PP 405-422. [BAU 09] Baur D.G., Fry, R.A. “Multivariate contagion and interdependence”. 2009, Journal of Asian Economics 20, 353–366. [BIL 03] Billio. M, L Pelizzon “ Contagion and interdependence in stock markets: Have they been misdiagnosed?” ,2009 , Journal of Economics and Business 55, PP 405-426. [CHI 07] Chiang .T.C., Jèon B,N., Li H “Dynamic correlation analysis of financial contagion : evidence from Asian markets”. 2007, Journal of International Money and Finance 26, PP 1206-1228. [COR 05] Corsetti.G , Pericoli. M., Sabricia M “Some contagion, some interdependence ; More pitfalls in tests of financial contagion”. 2005, Journal of Inernational Money and Finance 1177-1199. [DOO 09] Dooley.M., Hutchison. M “Transmission of the US subprime crisis to emerging markets: Evidence on the decoupling- recoupling hypothesis”. 2009, Journal of International Money and Finance 28, PP 1331-1349. [DUF 11] Dufrénot.G , Mignon.V., Péguin Feissolle.A “The effects of the subprime crisis in the Latin Americain market: An Ampirical assessment”. 2011, Economic Modelling 28, PP 2342-2357. [DUN 01] Dungey. M., Zhumabekova. D. “Testing for contagion using correlation: some words of caution. 2001,Pacific Basin Working paper PB01-09 (Federal Reserve Bank of San FGrancisco). [ENG 02] Engel. R “Dynamic conditional correlation: A simple class of Multivariate GARCH Models”. 2002, Journal of Business and Economic statistics 20, PP 339-350.


[FAV 02] Favero C.A., Giavazzi F., "Is the propagation of financial shocks non-linear? Evidence from the ERM", Journal of International Economics, vol. 57, 2002, pp. 231-246.
[FOR 02] Forbes K.J., Rigobon R., "No contagion, only interdependence: measuring stock market comovements", Journal of Finance, vol. 57, num. 5, 2002, pp. 2223-2261.
[GLI 99] Glick R., Rose A.K., "Contagion and trade: Why are currency crises regional?", Journal of International Money and Finance, vol. 18, num. 4, 1999, pp. 603-617.
[GRA 06] Gravelle T., Kichian M., Morley J., "Detecting shift contagion in currency and bond markets", Journal of International Economics, vol. 68, num. 2, 2006, pp. 409-423.
[INC 94] Inclan C., Tiao G.C., "Use of cumulative sums of squares for retrospective detection of changes of variance", Journal of the American Statistical Association, vol. 89, 1994, pp. 913-923.
[KAL 12] Kallouli W., Sandretto R., "Testing for "contagion" of the subprime crisis on the Middle East and North African stock markets: A Markov switching EGARCH approach", Journal of Economic Integration, vol. 27, num. 1, 2012, pp. 134-166.
[LAG 09] Lagoarde-Segot T., Lucey B.M., "Shift contagion vulnerability in the MENA stock markets", The World Economy, vol. 32, 2009, pp. 1478-1497.
[LAH 09] Lahet D., "Le repositionnement des pays émergents : de la crise financière asiatique de 1997 à la crise de 2008", Revue d'Économie Financière, vol. 95, 2009, pp. 275-306.
[SYL 11] Syllignakis N., Kouretas G.P., "Dynamic correlation analysis of financial contagion: Evidence from the Central and Eastern European markets", International Review of Economics and Finance, vol. 20, 2011, pp. 717-732.
[NEL 91] Nelson D.B., "Conditional heteroskedasticity in asset returns: A new approach", Econometrica, vol. 59, 1991, pp. 347-370.
[PER 09] Pereira Voladao M.A., Gico Jr., "The (not so) great depression of the 21st century and its impact on Brazil", Working Paper 0002/09, Universidade Catolica de Brasilia, 2009.
[POW 08] Powell A., Martinez J.F., "On emerging economy sovereign spreads and ratings", Working Paper 629, Inter-American Development Bank, 2008.
[RIG 03] Rigobon R., "On the measurement of the international propagation of shocks: is the transmission stable?", Journal of International Economics, vol. 61, 2003, pp. 261-283.
[SAM 11] Samarakoon L.P., "Stock market interdependence, contagion and the US financial crisis: the case of emerging and frontier markets", Journal of International Financial Markets, Institutions and Money, vol. 21, num. 5, 2011, pp. 724-742.
[SHA 09] Shamiri A., Isa Z., "The US crisis and the volatility spillover across South East Asia stock markets", International Research Journal of Finance and Economics, vol. 34, 2009, pp. 234-240.
[SOH 09] Bartram S.M., Bodnar G.M., "No place to hide: the global crisis in equity markets in 2008", Journal of International Money and Finance, vol. 28, 2009, pp. 1246-1292.
[VAN 03] Van Rijckeghem C., Weder B., "Spillovers through banking centers: a panel data analysis of bank flows", Journal of International Money and Finance, vol. 22, 2003, pp. 483-509.


Appendix:

Table 1: Sudden changes in the volatility of US stock market returns

| Time period | Standard deviation | Events |
| 03/14/2005 - 04/24/2006 | 0.37920309 | |
| 04/24/2006 - 07/12/2006 | 0.85188263 | |
| 07/12/2006 - 02/12/2007 | 0.21515034 | HSBC announces losses linked to US subprime mortgages |
| 02/12/2007 - 03/09/2007 | 1.51030518 | The Federal Home Loan Mortgage Corporation (Freddie Mac) announces that it will no longer buy the most risky subprime mortgages and mortgage-related securities |
| 03/09/2007 - 07/12/2007 | 0.4758146 | Standard and Poor's and Moody's Investors Service downgrade over 100 bonds backed by second-lien subprime mortgages |
| 07/12/2007 - 10/23/2007 | 1.44029089 | Swiss bank UBS announces losses linked to US subprime mortgages |
| 10/23/2007 - 09/12/2008 | 21.9735475 | Merrill Lynch announces losses of over $8 billion; Fitch Ratings downgrades Ambac Financial Group's insurance financial strength rating to AA, credit watch negative; Standard and Poor's places Ambac's AAA rating on credit watch negative (Ambac is the US's second largest bond insurer) |
| 09/12/2008 - 12/02/2008 | 6.31510104 | Lehman Brothers declares bankruptcy and European stock markets fall |
| 12/02/2008 - 04/20/2009 | 2.52079203 | SEC approves measures to increase transparency and accountability at rating agencies |
| 04/20/2009 - 07/15/2009 | 0.98711628 | |
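The break points reported in Table 1 partition the US return series into regimes of distinct volatility. A common way to locate such break points is the cumulative-sum-of-squares approach of [INC 94], which appears in this paper's reference list; the Python sketch below is a minimal single-break illustration on simulated returns, under assumed inputs, and is not necessarily the authors' exact procedure.

```python
import numpy as np

def icss_single_break(returns):
    """Locate the most likely single variance break point with the
    Inclan-Tiao centred cumulative sum of squares statistic D_k."""
    a = np.asarray(returns, dtype=float)
    T = len(a)
    C = np.cumsum(a ** 2)              # C_k: cumulative sum of squared returns
    k = np.arange(1, T + 1)
    D = C / C[-1] - k / T              # centred statistic D_k
    stat = np.sqrt(T / 2.0) * np.abs(D)
    k_star = int(np.argmax(stat))      # candidate break date (0-based index)
    return k_star, float(stat[k_star]) # compare with ~1.358 (approx. 5% critical value)

# toy usage: variance doubles halfway through a simulated return series
rng = np.random.default_rng(1)
r = np.concatenate([rng.normal(0, 1, 500), rng.normal(0, 2, 500)])
print(icss_single_break(r))
```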


Table 2: Results of significant changes in dynamic conditional correlations between the US and each emerging market's returns during the distinct phases of the subprime crisis 2007-2009

| Dummy (sub-period) | Brazil | Mexico | Argentina | India | Hong Kong | Indonesia | Malaysia | Korea |
| DM1 (12/02/07-9/03/07) | 0.043 (0.41) | -0.014 (-0.33) | -0.00123 (-0.02) | 0.01959 (0.62) | 0.01195 (0.14) | 0.0177 (0.48) | -0.04105 (-1.18) | 0.01337 (0.33) |
| DM2 (9/03/07-12/07/07) | -0.07136 (-0.09) | -0.00119 (-0.00) | 0.01089 (0.11) | -0.01018 (-0.58) | -0.016344 (-0.23) | -0.044 (-2.47**) | 0.057 (1.93*) | -0.00168 (-0.08) |
| DM3 (12/07/07-23/10/07) | -0.01939 (-0.16) | 0.022062 (0.07) | 0.0012052 (0.01) | -0.0041128 (-0.27) | 0.0062448 (0.13) | -0.0342802 (-2.03**) | -0.0473249 (-1.46) | -0.041236 (-0.49) |
| DM4 (23/10/07-12/09/08) | -0.33977 (-0.18) | 0.0476638 (2.79***) | -0.04411 (-0.42) | -0.0050322 (-0.54) | -0.0160502 (-0.72) | 0.01840 (1.85*) | -0.0125001 (-0.64) | -0.0150458 (-0.69) |
| DM5 (12/09/08-2/12/08) | 0.0496225 (1.02) | -0.0494767 (-2.45**) | 0.0034333 (0.06) | -0.0316998 (-2.61***) | 0.0225661 (0.69) | -0.0045811 (-0.37) | -0.093 (-2.91***) | 0.0300295 (0.60) |
| DM6 (2/12/08-20/04/09) | 0.0091655 (0.06) | 0.008985 (0.23) | 0.0284465 (0.31) | 0.0306861 (2.23**) | -0.0014763 (-0.03) | 0.0298876 (2.75***) | 0.0639523 (2.13**) | 0.0100961 (0.21) |
| DM7 (20/04/09-15/07/09) | 0.0159186 (0.07) | -0.0031747 (-0.05) | 0.0220717 (0.10) | -0.0007061 (-0.04) | -0.0088942 (-0.14) | 0.0234698 (1.23) | 0.0261222 (0.89) | 0.020256 (0.64) |

Note: The estimated model is ρ_{i,t} = ∑_{p=1}^{P} φ_p ρ_{i,t-p} + ∑_{k=1}^{K} γ_k DM_{k,t} + e_{ij,t}. Values reported in brackets are t-statistics. Only the values in bold indicate the presence of mean contagion; other significant results with a negative sign could be interpreted as "flight-to-quality". * Significance of the coefficients at the 10% level; ** at the 5% level; *** at the 1% level.
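The mean-contagion test behind Table 2 amounts to regressing each dynamic conditional correlation series on its own lags and on the crisis-phase dummies DM1-DM7, as in the model stated in the note. A minimal sketch with statsmodels is shown below; the series rho, the number of lags and the dummy windows are simulated placeholders, since the paper's actual DCC estimates and sub-period dates are not reproduced in the code.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
T, P = 600, 2                        # sample size and number of lags (assumed)
rho = pd.Series(0.3 + 0.1 * rng.standard_normal(T), name="rho")

# one dummy per crisis sub-period (placeholder windows, not the paper's dates)
dummies = pd.DataFrame(0, index=rho.index, columns=[f"DM{k}" for k in range(1, 8)])
bounds = np.linspace(250, T, 8, dtype=int)
for k in range(7):
    dummies.iloc[bounds[k]:bounds[k + 1], k] = 1

# lagged correlations as regressors (no intercept, matching the stated model)
lags = pd.concat({f"rho_lag{p}": rho.shift(p) for p in range(1, P + 1)}, axis=1)
X = pd.concat([lags, dummies], axis=1).dropna()
y = rho.loc[X.index]

res = sm.OLS(y, X).fit()
print(res.params.filter(like="DM"))   # significantly positive DM_k suggests mean contagion
print(res.tvalues.filter(like="DM"))
```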


Table 3: Results of volatility contagion during the distinct phases of the subprime crisis 2007-2009 (LA stock markets)

|     |    | Brazil | Mexico | Argentina |
| DM1 | d1 | 0.00329 (2.64***) | 0.00301 (2.51***) | 0.00378 (2.22**) |
|     | d2 | 0.04121 (2.33**) | 0.06898 (4.16***) | 0.09575 (6.42***) |
| DM2 | d1 | 0.00381 (2.89***) | 0.00281 (2.54**) | 0.00552 (2.97***) |
|     | d2 | -0.04816 (-1.61) | -0.01802 (-0.57) | -0.09953 (-3.11***) |
| DM3 | d1 | 0.00325 (2.83***) | 0.00290 (2.60***) | 0.00531 (2.95***) |
|     | d2 | 0.01434 (1.31) | 0.00579 (0.58) | 0.00796 (0.58) |
| DM4 | d1 | 0.00352 (2.79***) | 0.00297 (2.59***) | 0.00526 (2.92***) |
|     | d2 | 0.00489 (0.99) | 0.00331 (0.82) | -0.00250 (-0.39) |
| DM5 | d1 | 0.00638 (2.16**) | 0.00927 (2.78***) | 0.01236 (3.14***) |
|     | d2 | -0.00319 (-1.23) | -0.00611 (-2.04**) | -0.00731 (-2.01**) |
| DM6 | d1 | 0.00313 (2.65***) | 0.00262 (2.39**) | 0.00511 (2.87***) |
|     | d2 | 0.00199 (0.83) | 0.00315 (1.15) | 0.00366 (1.04) |
| DM7 | d1 | 0.00338 (2.73***) | 0.00296 (2.68***) | 0.00549 (3.10***) |
|     | d2 | 0.00297 (0.46) | 0.01068 (1.37) | 0.02335 (1.53) |

Note: Values reported in brackets are t-statistics. Only the values in bold indicate the presence of volatility contagion. * Significance of the coefficients at the 10% level; ** at the 5% level; *** at the 1% level.


Table 4: Results of volatility contagion during the distinct phases of the subprime crisis 2007-2009 (Asian stock markets)

|     |    | India | Hong Kong | Indonesia | Malaysia | Korea |
| DM1 | d1 | 0.00315 (1.82*) | 0.00472 (3.32**) | 0.00613 (3.79***) | -0.00020 (-0.74) | 0.00593 (4.23***) |
|     | d2 | 0.05111 (1.07) | 0.01421 (0.39) | -0.01973 (-0.62) | 0.03884 (3.96***) | -0.00750 (-0.31) |
| DM2 | d1 | 0.00320 (1.83*) | 0.00475 (3.33***) | 0.00611 (3.75***) | -0.00044 (-1.59) | 0.00503 (4.39***) |
|     | d2 | -0.08387 (-2.10**) | -0.01759 (-0.43) | -0.04981 (-0.96) | -0.02928 (-2.02**) | 0.06628 (2.79***) |
| DM3 | d1 | 0.00280 (1.71*) | 0.00487 (3.50***) | 0.00597 (3.76***) | -0.00037 (-1.35) | 0.03071 (9.20***) |
|     | d2 | 0.02773 (2.29**) | 0.03261 (2.04**) | 0.02939 (1.74*) | -0.00228 (-0.49) | 0.34999 (7.53**) |
| DM4 | d1 | 0.00357 (1.99**) | 0.00607 (3.73***) | 0.00614 (3.86***) | 0.01003 (2.67***) | 0.00608 (4.22***) |
|     | d2 | 0.01237 (1.84*) | 0.02200 (2.50***) | 0.01142 (1.65*) | 0.05166 (1.86*) | 0.00454 (0.68) |
| DM5 | d1 | 0.01455 (3.24***) | 0.02296 (4.02***) | 0.01638 (3.71***) | -0.00201 (-2.31**) | 0.01581 (4.52***) |
|     | d2 | -0.01181 (-2.63***) | -0.01771 (-3.38***) | -0.01152 (-2.93***) | 0.00178 (1.94*) | -0.01007 (-3.11***) |
| DM6 | d1 | 0.00271 (1.37) | 0.00418 (2.61***) | 0.00535 (3.40***) | -0.00021 (-0.68) | 0.00555 (4.18***) |
|     | d2 | 0.00445 (1.18) | 0.00697 (1.82*) | 0.00553 (1.42) | -0.00107 (-1.08) | 0.00520 (1.86*) |
| DM7 | d1 | 0.00342 (1.93**) | 0.00500 (3.45***) | 0.00636 (3.97***) | -0.00034 (-1.22) | 0.00593 (4.27***) |
|     | d2 | 0.02662 (2.34**) | 0.01958 (1.17) | 0.03994 (2.44**) | 0.00796 (0.58) | 0.00788 (0.77) |

Note: Values reported in brackets are t-statistics. Only the values in bold indicate the presence of volatility contagion. * Significance of the coefficients at the 10% level; ** at the 5% level; *** at the 1% level.


On the covariance structure of a bilinear stochastic differential equation

Abdelouahab Bibi
Département de mathématiques, Université Mentouri de Constantine, Algeria

ABSTRACT. This talk is concerned with a study of general bilinear stochastic differential equations. We provide conditions for second-order and strict-sense stationarity of the state process. Explicit formulas for the mean and covariance functions of the state process are given. A linear representation is obtained and the optimal linear filter and its asymptotic behavior are investigated. The problem of parameter estimation for some particular cases is also considered.

KEYWORDS: Stochastic differential equations, Covariance structure, Kalman filter

1. The model and its second-order properties

Let us consider the following stochastic differential equation

∑_{j=0}^{p} a_j(t) X^{(j)}(t) = a_0(t) + ∑_{j=0}^{q} b_j(t) W^{(j)}(t) + ∑_{i=1}^{p} ∑_{j=1}^{q} c_{ij}(t) X^{(i)}(t) W^{(j)}(t),   [1]

where the superscript (j) denotes the j-fold differentiation with respect to t, (W(t), t ≥ 0) is the standard Brownian motion defined on some basic probability space (Ω, ℑ, P), and (a_j(t))_{0≤j≤p}, (b_j(t))_{0≤j≤q} and (c_{i,j}(t))_{0≤i≤p, 0≤j≤q} are real measurable functions satisfying the following conditions for all T > 0:

∫_0^T |a_i(t)|^δ dt < +∞,   ∫_0^T |b_j(t)|^δ dt < +∞,   ∫_0^T |c_{i,j}(t)|^δ dt < +∞,   δ = 1, 2.   [2]

The derivatives W^{(j)}(t), j > 0, do not exist in the usual sense. Hence we interpret (1) as being equivalent to the observation and state equations

∀t ≥ 0:   dY(t) = (A_0(t)Y(t) + a_0(t)) dt + ∑_{j=1}^{q} (A_j(t)Y(t) + b_j(t)) dW^{(j)}(t),   Y(0) = Y_0,
          X(t) = H Y(t),   [3]

for some appropriate matrices (A_j(t))_{0≤j≤q} and vectors H, a_0(t), (b_j(t))_{0≤j≤q}. The initial state Y(0) is a random vector defined on (Ω, ℑ, P) with mean vector m(0) and covariance matrix Σ(0). The existence and the uniqueness of the solution process of equation (1) is ensured by the general results on stochastic differential equations (see [3], ch. 4). Let us denote by (Φ(t), t ≥ 0) the solution process of the matrix homogeneous equation

dΦ(t) = A_0(t)Φ(t) dt + ∑_{j=1}^{q} A_j(t)Φ(t) dW^{(j)}(t),   Φ(0) = I_{(n)},   t ≥ 0.   [4]
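For intuition, the state equation [3] can be simulated directly. The sketch below is a minimal Euler-Maruyama discretisation of a scalar special case with a single Brownian term; the constant coefficients are hypothetical values chosen purely for illustration and are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Scalar special case of the state equation [3]:
#   dY(t) = (A0*Y(t) + a0) dt + (A1*Y(t) + b1) dW(t),  Y(0) = y0
# (illustrative constant coefficients, not the paper's general time-varying case)
A0, a0, A1, b1 = -1.0, 0.5, 0.2, 0.1
y0, T, n_steps = 1.0, 10.0, 10_000
dt = T / n_steps

Y = np.empty(n_steps + 1)
Y[0] = y0
for k in range(n_steps):
    dW = rng.normal(0.0, np.sqrt(dt))           # Brownian increment over dt
    Y[k + 1] = Y[k] + (A0 * Y[k] + a0) * dt + (A1 * Y[k] + b1) * dW

print(Y.mean(), Y.var())   # empirical moments of the simulated path
```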


where I_{(n)} denotes the unit matrix. Setting D(t) = det Φ(t), we have (see [2])

D(t) = exp{ ∫_0^t tr(A_0(s)) ds − (1/2) ∑_{j=1}^{q} ∫_0^t tr(A_j²(s)) ds + ∑_{j=1}^{q} ∫_0^t tr(A_j(s)) dW^{(j)}(s) },   t ≥ 0,   [5]

where tr(A) stands for the trace of a square matrix A. Denote by m(t) (resp. π(t)) the mean function of the state process (Y(t), t ≥ 0) (resp. of (Φ(t), t ≥ 0)). Then we have

Theorem 1. The solution of the state equation (3) is given by

Y(t) = Φ(t) { Y(0) + ∫_0^t Φ^{−1}(s) ( a_0(s) − ∑_{j=1}^{q} A_j(s) a_j(s) ) ds + ∫_0^t Φ^{−1}(s) ∑_{j=1}^{q} a_j(s) dW^{(j)}(s) },   [6]

where Φ(t) is defined in (4). Moreover, ṁ(t) = A_0(t) m(t) for t ≥ 0 with m(0) = m_0, and π̇(t) = A_0(t) π(t) for t ≥ 0 with π(0) = π_0. The covariance function of the state process (Y(t), t ≥ 0) is then given by

K(t, s) = π(t) π^{−1}(s) K(s)            if t ≥ s,
K(t, s) = K(t) ( π(s) π^{−1}(t) )′        if t ≤ s,   [7]

where (K(s), s ≥ 0) satisfies the equation

K̇(t) = A_0(t) K(t) + K(t) A_0′(t) + ∑_{j=1}^{q} A_j(t) K(t) A_j′(t) + ∑_{j=1}^{q} [A_j(t) m(t) + a_j(t)][A_j(t) m(t) + a_j(t)]′,   [8]

with K(0) = K_0.

Theorem 2. The state process (Y(t), t ≥ 0) generated by (3) is second-order stationary if and only if, for almost every t ≥ 0,
1. A_0(t) m(0) + a_0(t) = 0, and there exists a constant matrix A_0 such that the following conditions hold;
2. A_0(t) K(0) K^−(0) = A_0 K(0) K^−(0) = K(0) K^−(0) A_0 = A_0;
3. A_0 K(0) + K(0) A_0′ + ∑_{j=1}^{q} A_j(t) K(0) A_j′(t) + ∑_{j=1}^{q} [A_j(t) m(0) + a_j(t)][A_j(t) m(0) + a_j(t)]′ = 0,

where K^− stands for the pseudo-inverse matrix of K. Moreover, under the above conditions, the covariance function of the state process (Y(t), t ≥ 0) reduces to

K(t, s) = e^{A_0 (t−s)} K(0)          if t ≥ s,
K(t, s) = K(0) e^{A_0′ (t−s)}         if t ≤ s.   [9]

2. Linear representation

Theorem 3. There exists a wide-sense Wiener process (W̃(t), t ≥ 0) such that the state process (Y(t), t ≥ 0) generated by (1) admits the representation

dY(t) = (A_0(t)Y(t) + a_0(t)) dt + Σ^{1/2}(t) dW̃(t),   [10]

with t ≥ 0 and Y(0) = Y_0, where (Σ(t), t ≥ 0) is defined by

Σ(t) = ∑_{j=1}^{q} A_j(t) K(t) A_j′(t) + ∑_{j=1}^{q} [A_j(t) m(t) + a_j(t)][A_j(t) m(t) + a_j(t)]′.   [11]


3. References

[1] Arnold, L. (1974). Stochastic Differential Equations: Theory and Applications. New York, J. Wiley.
[2] Lebreton, A., Musiela, M. (1983). A look at bilinear model for multidimensional stochastic systems in continuous time. Statistics & Decisions, 1, 285-303.
[3] Liptser, R.S., Shiryayev, A.N. (1978). Statistics of Random Processes, I, II. Springer.


Monitoring simple linear profiles using the K-chart

Walid Gani¹ & Mohamed Limam¹
¹ LARODEC, ISG, University of Tunis

ABSTRACT. This paper proposes the use of the K-chart for monitoring simple linear profiles. A benchmark example is used to show the construction methodology of the K-chart for simultaneously monitoring the slope and intercept of a linear profile. In addition, the performance of the K-chart in detecting out-of-control profiles is assessed and compared with traditional control charts. Results reveal that the K-chart performs better than the T² control chart, the EWMA control chart and the R-chart under small shifts in the slope.

KEYWORDS: statistical process control, K-chart, linear profiles, T² control chart, EWMA control chart, R-chart

1. Introduction

Control charts for monitoring linear profiles have acquired a prominent role in controlling quality processes characterized by a relationship between a response variable and one or more explanatory variables. A control chart for monitoring linear profiles consists of two phases. In phase I, the parameters of the regression line are estimated to determine the stability of the process. In phase II, the goal is to detect shifts in the process from the baseline estimated in phase I. In the literature, the majority of control charts deal with the phase II analysis of linear profiles. [KAN 03] proposed a multivariate T² control chart for monitoring both the intercept and the slope, while [KIM 03] suggested the use of three univariate exponentially weighted moving average control charts for simultaneously monitoring the intercept, slope and standard deviation. [ZOU 07] proposed a multivariate EWMA scheme when the quality process is characterized by a general linear profile. [ZHA 09] developed a control chart based on the EWMA and the likelihood ratio test. [ZOU 09] developed the LASSO-based EWMA control chart for monitoring multiple linear profiles. [LI 10] established an EWMA scheme with variable sampling intervals for monitoring linear profiles. In the last decade, the kernel distance-based multivariate control chart, also known as the K-chart, developed by Sun and Tsung [SUN 03], has received significant attention as a promising non-parametric control chart with high sensitivity to small shifts in the process mean ([KUM 06], [CAM 08], [GAN 11], [GAN 12]). Unlike traditional control charts, the K-chart does not require any assumption about the model distribution of the quality characteristics and it has the ability to construct flexible control limits based on support vectors (SVs). All these features serve as incentives for the application of the K-chart to monitoring linear profiles. In this paper, we propose the use of the K-chart for monitoring simple linear profiles. We show how to construct the K-chart for simultaneously monitoring the slope and intercept of linear profiles. A comparison between the K-chart and traditional control charts using benchmark simulated data is also discussed in this paper. This paper is organized as follows. Control charts for phase II analysis of simple linear profiles are presented in Section 2. The theoretical background of the K-chart for monitoring simple linear profiles is presented in Section 3. Benchmark simulated data are used in Section 4 to illustrate the application of the K-chart for simultaneously monitoring the slope and intercept of linear profiles, with a comparison with traditional control charts. Section 5 summarizes the paper.


2. Monitoring simple linear profiles

We consider the following simple linear profile model

y_{ij} = β_{0j} + β_{1j} x_{ij} + ǫ_{ij},   i = 1, 2, ..., n and j = 1, 2, ..., m,   [1]

where y_{ij} is the ith measurement of the jth profile, x_{ij} is the corresponding value of the explanatory variable, β_{0j} and β_{1j} are, respectively, the intercept and the slope for profile j, and ǫ_{ij} is the random error, assumed to be independent and normally distributed with mean zero and variance σ². In phase I, the parameters of the model given in Equation (1) are estimated using the least squares method. The estimated slope for profile j, denoted by b_{1j}, is given by

b_{1j} = ∑_{i=1}^{n} (y_{ij} − ȳ_j)(x_{ij} − x̄_j) / ∑_{i=1}^{n} (x_{ij} − x̄_j)²,   [2]

where ȳ_j = ∑_{i=1}^{n} y_{ij} / n and x̄_j = ∑_{i=1}^{n} x_{ij} / n, and the estimated intercept for profile j, denoted by b_{0j}, is given by

b_{0j} = ȳ_j − b_{1j} x̄_j.   [3]
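A minimal sketch of these per-profile estimates is given below; the simulated profiles use the in-control model and design points described later in Section 4, and the helper name is ours.

```python
import numpy as np

def profile_estimates(x, y):
    """Least-squares slope and intercept of one profile, Equations (2)-(3)."""
    xbar, ybar = x.mean(), y.mean()
    b1 = np.sum((y - ybar) * (x - xbar)) / np.sum((x - xbar) ** 2)
    b0 = ybar - b1 * xbar
    return b0, b1

rng = np.random.default_rng(0)
x = np.array([-3.0, -1.0, 1.0, 3.0])        # fixed design points used in Section 4
profiles = [13 + 2 * x + rng.normal(0, 1, x.size) for _ in range(20)]
betas = np.array([profile_estimates(x, y) for y in profiles])   # rows: (b0j, b1j)
print(betas.mean(axis=0))                    # should be close to (13, 2)
```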

Phase II analysis of linear profiles remains the most important step since it aims to assess the performance of control charts in detecting shifts in the parameters of linear profiles. [MAH 11] distinguished between two main categories of control charts for phase II: the omnibus control charts, for simultaneously monitoring the intercept and slope, and the individual control charts, for separately monitoring individual regression parameters. This paper focuses on the omnibus category, since our objective is to simultaneously monitor the slope and intercept of linear profiles. The most applied traditional control charts in this category are the T² control chart, the EWMA control chart and the R-chart. For monitoring b_0, b_1 and σ² in phase II, [KAN 00] recommended the use of the T² control chart for b_0 and b_1. The T² statistic for monitoring the intercept and the slope is given by

T_j² = (z_j − µ)^T Σ^{−1} (z_j − µ),   [4]

where z_j = (b_{0j}, b_{1j}), µ = (b̄_0, b̄_1), b̄_0 = ∑_{j=1}^{m} b_{0j} / m, b̄_1 = ∑_{j=1}^{m} b_{1j} / m, and

Σ = [ σ²_{b_{0j}}        σ_{b_{0j}, b_{1j}}
      σ_{b_{0j}, b_{1j}}  σ²_{b_{1j}} ].

The upper control limit (UCL) for the T² control chart is given by

UCL = χ²_{2,α},   [5]

where χ²_{2,α} is the 100(1 − α) percentile of the chi-squared distribution with 2 degrees of freedom. In addition, [KAN 00] proposed an EWMA control chart to monitor the average deviation from the in-control line. The EWMA statistic for monitoring σ² is given by

EWMA_j = θ ē_j + (1 − θ) EWMA_{j−1},   [6]

where ē_j = ∑_{i=1}^{n} e_{ij} / n is the average deviation for sample j, 0 < θ < 1 is a smoothing constant and EWMA_0 = 0.


The lower control limit (LCL) and the UCL for this EWMA chart are given by

LCL = −L_1 σ √( θ / (n(2 − θ)) )   and   UCL = +L_1 σ √( θ / (n(2 − θ)) ),   [7]

where L_1 > 0 is a constant chosen to give a specified in-control average run length (ARL). Also, [KAN 00] suggested the use of an R-chart for monitoring the process variation as follows:

LCL = σ(d_2 − L_2 d_3)   and   UCL = σ(d_2 + L_2 d_3),   [8]

where L_2 > 0 is a constant selected to produce a specified in-control ARL, and d_2 and d_3 are constants depending on the sample size n.
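A compact sketch of the phase II statistics and limits in Equations (4)-(8) is shown below; the design values (α, n, σ, θ, L_1) and the simulated (b_{0j}, b_{1j}) vectors are assumed placeholders, not the paper's calibrated settings.

```python
import numpy as np
from scipy.stats import chi2

def t2_statistics(betas):
    """Hotelling T^2 of Equation (4) for the (b0j, b1j) vectors (rows of betas)."""
    mu = betas.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(betas, rowvar=False))
    d = betas - mu
    return np.einsum("ij,jk,ik->i", d, S_inv, d)   # T^2_j for each profile j

def ewma(e_bar, theta=0.2):
    """EWMA recursion of Equation (6) applied to the average deviations e_bar."""
    z, out = 0.0, []
    for e in e_bar:
        z = theta * e + (1 - theta) * z
        out.append(z)
    return np.array(out)

alpha, n, sigma, theta, L1 = 0.005, 4, 1.0, 0.2, 3.0     # assumed design values
ucl_t2 = chi2.ppf(1 - alpha, df=2)                       # Equation (5)
half_width = L1 * sigma * np.sqrt(theta / (n * (2 - theta)))   # Equation (7)

betas = np.random.default_rng(0).normal([13, 2], [0.5, 0.2], size=(20, 2))
print(ucl_t2, -half_width, half_width)
print(t2_statistics(betas)[:3])
```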

3. On the use of the K-chart for monitoring simple linear profiles

We consider β_j = [b_{0j}, b_{1j}], the vector of the intercept and slope for profile j, with j = 1, ..., m. The construction of the K-chart for simultaneously monitoring the slopes and intercepts of linear profiles requires two steps. In the first step, a sphere around the samples of β_j is constructed using the support vector data description (SVDD) method. The sphere should contain the maximum number of β_j with minimum volume. This is equivalent to solving the following quadratic program:

Minimize   F(R, a) = R² + C ∑_{j=1}^{m} ξ_j,   [9]

subject to   ‖β_j − a‖² ≤ R² + ξ_j,   [10]

where C > 0 is a parameter introduced for the trade-off between the volume of the sphere and the errors, ξ_j ≥ 0 are slack variables used to penalize large distances, F is the cost function to minimize, a is the center and R is the radius of the sphere. Equation (10) can be incorporated into Equation (9) by using Lagrange multipliers:

L(R, a, α_j, η_j, ξ_j) = R² + C ∑_{j=1}^{m} ξ_j − ∑_{j=1}^{m} α_j [R² + ξ_j − (‖β_j‖² − 2 a·β_j + ‖a‖²)] − ∑_{j=1}^{m} η_j ξ_j,   [11]

with Lagrange multipliers α_j ≥ 0 and η_j ≥ 0. L should be minimized with respect to R, a, ξ_j and maximized with respect to α_j and η_j. Setting the partial derivatives of L to zero, we obtain

∑_{j=1}^{m} α_j = 1,   [12]

a = ∑_{j=1}^{m} α_j β_j,   [13]

C − α_j − η_j = 0.   [14]


From Equation (14), α_j = C − η_j with α_j ≥ 0 and η_j ≥ 0, so the Lagrange multipliers η_j can be removed and we have

0 ≤ α_j ≤ C.   [15]

By substituting Equations (12) and (14) into Equation (11), we obtain

Maximize   L = ∑_{j=1}^{m} α_j (β_j · β_j) − ∑_{j,k=1}^{m} α_j α_k (β_j · β_k),   [16]

subject to   0 ≤ α_j ≤ C.   [17]

A test sample, denoted by β_test, is accepted when its distance to the center is smaller than or equal to the radius. This is equivalent to

(β_test − a)′ (β_test − a) = (β_test · β_test) − 2 ∑_{j=1}^{m} α_j (β_test · β_j) + ∑_{j,k=1}^{m} α_j α_k (β_j · β_k) ≤ R².   [18]

Generally, data is not spherically distributed. To make the method more flexible, the vectors β_j are transformed to a higher dimensional feature space. The inner products in Equations (16) and (18) are substituted by a kernel function K(β_j, β_k). In a higher dimension, the sphere becomes a complex form called a "hypersphere". The problem of finding the optimal hypersphere is given by

Maximize   L = ∑_{j=1}^{m} α_j K(β_j, β_j) − ∑_{j,k=1}^{m} α_j α_k K(β_j, β_k),   [19]

subject to Equation (17). A test sample β_test is accepted when

K(β_test, β_test) − 2 ∑_{j=1}^{m} α_j K(β_test, β_j) + ∑_{j,k=1}^{m} α_j α_k K(β_j, β_k) ≤ R².   [20]

The second step in the construction of the K-chart consists in determining which samples are SVs by solving the following quadratic program:

Maximize   ∑_{j=1}^{m} α_j K(β_j, β_j) − ∑_{j,k=1}^{m} α_j α_k K(β_j, β_k),   [21]

subject to   0 ≤ α_j ≤ C.   [22]
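A possible numerical sketch of the dual problem in Equations (21)-(22) is given below, using scipy's SLSQP solver; the toy data, the choice C = 1 and the particular kernel are assumptions for illustration, not the paper's exact settings or implementation.

```python
import numpy as np
from scipy.optimize import minimize

def fit_svdd(B, kernel, C=1.0):
    """Solve the SVDD dual (Equations (21)-(22)) for samples B (rows = beta_j)."""
    m = B.shape[0]
    K = np.array([[kernel(B[j], B[k]) for k in range(m)] for j in range(m)])

    def neg_dual(a):                       # maximize L  <=>  minimize -L
        return -(a @ np.diag(K) - a @ K @ a)

    cons = {"type": "eq", "fun": lambda a: a.sum() - 1.0}    # Equation (12)
    bounds = [(0.0, C)] * m                                   # Equation (22)
    res = minimize(neg_dual, np.full(m, 1.0 / m), bounds=bounds,
                   constraints=cons, method="SLSQP")
    alpha = res.x
    sv = np.where((alpha > 1e-6) & (alpha < C - 1e-6))[0]     # support vectors
    j = sv[0] if len(sv) else int(np.argmax(alpha))
    # squared radius: kernel distance of a boundary SV to the centre (cf. Equation (20))
    R2 = K[j, j] - 2 * alpha @ K[:, j] + alpha @ K @ alpha
    return alpha, sv, R2

rbf = lambda u, v, s=1.5: np.exp(-np.sum((u - v) ** 2) / (2 * s ** 2))
rng = np.random.default_rng(0)
B = rng.normal(size=(20, 2))               # stand-in for the (b0j, b1j) vectors
alpha, sv, R2 = fit_svdd(B, rbf)
print(len(sv), R2)
```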

4. Application

In this section, the application of the K-chart for simultaneously monitoring the slope and intercept of simple linear profiles is discussed. In addition, to show the efficacy of our proposed approach, we compare the performance of the K-chart with that of the T² control chart, the EWMA control chart and the R-chart in detecting out-of-control (OOC) profiles. In this application, we use the benchmark simulated data of [MAH 11], where we consider the following in-control profile model: Y_{ij} = 13 + 2X_i + ǫ_{ij} (where the ǫ_{ij} are random errors assumed to be independent and normally distributed with mean zero and variance 1), with fixed X-values of -3, -1, 1, and 3. The simulated data set consists of 29 profiles generated as follows. First, 20 in-control profiles were generated. Then, nine OOC profiles were generated after shifting the slope from 2.0 to 2.4. Details about the simulated data can be found in [MAH 11]. During the training phase, the 20 in-control profiles are used to construct the optimal one-class using the SVDD algorithm. Then, the nine remaining profiles are used to detect OOC states. For the construction of the optimal SVDD based one-class, the Gaussian kernel function is used and it is defined as follows:

K(β_j, β_k) = exp( −‖β_j − β_k‖² / (2σ²) ),   [23]

where σ > 0 is the width of the Gaussian kernel that controls the complexity of the SVDD boundary.
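For monitoring, each new profile's (b_0, b_1) vector is then compared with the fitted hypersphere via the kernel distance of Equation (20), computed with the Gaussian kernel of Equation (23). A self-contained sketch is given below; the α weights and training vectors are placeholders (in practice they come from the SVDD fit), and the function names are ours.

```python
import numpy as np

def gaussian_kernel(u, v, sigma=1.5):
    """Gaussian kernel of Equation (23); sigma = 1.5 is the width reported in this section."""
    return np.exp(-np.sum((u - v) ** 2) / (2.0 * sigma ** 2))

def kernel_distance(beta_test, B, alpha, kernel=gaussian_kernel):
    """Left-hand side of Equation (20): squared kernel distance to the centre."""
    s1 = kernel(beta_test, beta_test)
    s2 = sum(a * kernel(beta_test, b) for a, b in zip(alpha, B))
    s3 = sum(ai * ak * kernel(bi, bk)
             for ai, bi in zip(alpha, B) for ak, bk in zip(alpha, B))
    return s1 - 2.0 * s2 + s3

# placeholder fitted quantities, purely for illustration
rng = np.random.default_rng(1)
B = rng.normal(size=(20, 2))
alpha = np.full(20, 1.0 / 20)
beta_new = np.array([13.2, 2.4])
print(kernel_distance(beta_new, B, alpha))   # plotted on the K-chart; signal above the UCL
```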

Figure 1. Examples of SVDD based one-classes for simultaneously monitoring the intercept and slope under different values of the parameter σ.

All calculations were carried out in Matlab software. The optimal value of the Gaussian kernel width was found to be σ = 1.50. Figure 1 shows examples of SVDD based one-classes when varying the parameter σ. The UCL of the K-chart was established using 4 SVs and it was estimated at 0.469. The T 2 control chart was constructed with an UCL = χ22,0.005 = 10.579. For the construction of EWMA control chart, the smoothing parameter θ was set at 0.2 to obtain the charting statistics. Following Equation (7), the control limits were set at ±0.48 so that they produce an in-control ARL of 200. Using Equation (8), the UCL of the R-chart was set at 4.94. It is worth noting that there is no LCL for the R-chart since n < 7. Figure 2 shows the constructed control charts.


Figure 2. Phase II control charts for monitoring linear profiles : (a) the K-chart, (b) the T 2 control chart, (c) the EWMA control chart, and (d) the R-chart.

Regarding the performance of the discussed control charts, the K-chart performed better than the other control charts. In fact, the K-chart detected 6 OOC profiles, namely profiles number 23, 25, 26, 27, 28 and 29, while the T² chart detected only one OOC state, profile number 27. The R-chart detected one OOC state, profile number 26. The EWMA control chart was the weakest control chart since it did not detect any OOC profile. As shown in Table 1, the performance rate of the K-chart, estimated at 66.67%, was highly superior to that of the traditional control charts. The performance rate used in this application is defined as the number of detected OOC profiles divided by the number of generated OOC profiles.

Table 1. Performance evaluation of control charts in detecting OOC profiles.
| Control chart | # OOC profiles detected | Performance rate |
| K-chart | 6 | 66.67% |
| T² chart | 1 | 11.11% |
| EWMA chart | 1 | 11.11% |
| R-chart | 0 | 0.00% |

5. Conclusion

This paper has suggested the use of the K-chart for monitoring simple linear profiles. We have shown how to construct the K-chart for simultaneously monitoring the slope and intercept. The simulation study revealed that the K-chart performs better in detecting small shifts in the slope in comparison with the T² control chart, the EWMA control chart and the R-chart. The high sensitivity level of the K-chart is explained by its flexible control limit based on SVs, making it adaptive to any shift in the process. Many interesting extensions of the use of the K-chart for monitoring simple linear profiles are possible. One possible extension is to apply the K-chart for monitoring multivariate linear profiles and compare it with the multivariate EWMA control charts. This extension could constitute a promising research field in the future.


6. References

[CAM 08] Camci F., Chinnam R.B., Ellis R.D., "Robust kernel distance multivariate control chart using support vector principles," International Journal of Production Research, vol. 46, 2008, p. 5075-5095.
[GAN 11] Gani W., Taleb H., Limam M., "An assessment of the kernel-distance based multivariate control chart through an industrial application," Quality and Reliability Engineering International, vol. 27, 2011, p. 391-401.
[GAN 12] Gani W., Limam M., "Performance evaluation of one-class classification-based control charts through an industrial application," Quality and Reliability Engineering International, DOI: 10.1002/qre.1440, 2012.
[KAN 00] Kang L., Albin S.L., "On-line monitoring when the process yields a linear profile," Journal of Quality Technology, vol. 32, 2000, p. 418-426.
[KIM 03] Kim K., Mahmoud M.A., Woodall W.H., "On the monitoring of linear profiles," Journal of Quality Technology, vol. 35, 2003, p. 317-328.
[KUM 06] Kumar S., Choudhary A.K., Kumar M., Shankar R., Tiwari M.K., "Kernel-distance-based robust support vector methods and its application in developing a robust K-chart," International Journal of Production Research, vol. 44, 2006, p. 77-96.
[LI 10] Li Z., Wang Z., "An exponentially weighted moving average scheme with variable sampling intervals for monitoring linear profiles," Computers and Industrial Engineering, vol. 59, 2010, p. 630-637.
[MAH 11] Mahmoud M.A., "Simple linear profiles," in Statistical Analysis of Profile Monitoring, R. Noorossana, A. Saghaei, and A. Amiri, Eds., p. 21-92, John Wiley & Sons, New York, NY, USA, 2011.
[SUN 03] Sun R., Tsung F., "A kernel-distance-based multivariate control chart using support vector methods," International Journal of Production Research, vol. 41, 2003, p. 2975-2989.
[ZHA 09] Zhang J., Li Z., Wang Z., "Control chart based on likelihood ratio for monitoring linear profiles," Computational Statistics and Data Analysis, vol. 53, 2009, p. 1440-1448.
[ZOU 07] Zou C., Tsung F., Wang Z., "Monitoring general linear profiles using multivariate EWMA schemes," Technometrics, vol. 49, 2007, p. 395-408.
[ZOU 09] Zou C., Qiu P., "Multivariate statistical process control using LASSO," Journal of the American Statistical Association, vol. 104, 2009, p. 1586-1596.


Monitoring the Coefficient of Variation using a Variable Sampling Interval Control Chart

Philippe CASTAGLIOLA¹, Ali ACHOURI², Hassen TALEB³
¹ LUNAM Université, Université de Nantes & IRCCyN UMR CNRS 6597, Nantes, France
² Institut Supérieur de Gestion, Université de Tunis, Tunisie
³ Higher Institute of Business Administration of Gafsa, University of Gafsa, Tunisia

ABSTRACT. The coefficient of variation CV is a quality characteristic that has several applications in applied statistics and is receiving increasing attention in quality control. Few papers have proposed control charts that monitor this normalized measure of dispersion. In this paper an adaptive Shewhart control chart implementing a VSI (Variable Sampling Interval) strategy is proposed to monitor the CV. Tables are provided for the statistical properties of the VSI CV chart and a comparison is performed with a Shewhart FSR (Fixed Sampling Rate) chart for the CV. An example illustrates the use of these charts on real data gathered from a casting process.

KEYWORDS: coefficient of variation, variable sampling interval, average time to signal.

1. Introduction

Quality is one of the most important consumer decision factors. It has become one of the main strategies to increase the productivity of industrial and service organizations. One of the fundamental principles of SPC (Statistical Process Control) is that a normally distributed process cannot be claimed to be in-control until it has a constant mean and variance. This implies that a shift in the mean and/or the standard-deviation makes the process out-of-control. However, control charting techniques were recently extended to various sectors such as health, education, finance and various societal applications where the mean and the standard-deviation may not be constant all the time although the process is operating in-control. In this case, it is natural to explore the use of the coefficient of variation (CV, in short) γ, which is a normalized measure of dispersion of a probability distribution defined as the ratio of the standard deviation σ to the mean µ. It is widely used to compare data sets having different units or widely different means. The goal of this paper is to propose a CV Shewhart control chart using a VSI feature (from now on denoted as VSI-γ) and to evaluate its performance in terms of ATS (Average Time to Signal) and SDTS (Standard Deviation of the Time to Signal). It is important to note that the VSI chart for monitoring the CV introduced in this paper does not outperform an advanced strategy like the EWMA CV chart already proposed by [CAS 11]. Consequently, this paper must be considered as a framework for quality practitioners who have already made the choice to implement a Shewhart type control chart. The remainder of the paper is organized as follows. The statistical properties of the sample coefficient of variation are presented in Section 2. Then, in Section 3, the Shewhart CV control chart using the VSI strategy is proposed. A statistical design strategy is defined in Section 4 in order to obtain optimal chart parameters; the VSI Shewhart CV control chart is compared to the FSR Shewhart CV control chart in terms of out-of-control ATS and SDTS. An illustrative example from a real manufacturing process is presented in Section 5. Finally, the main conclusions are summarized in Section 6 with comments and recommendations for future research.


2. Properties of the (Sample) Coefficient of Variation

Let X be a positive random variable and let µ = E(X) > 0 and σ = σ(X) be the mean and standard-deviation of X, respectively. By definition, the coefficient of variation γ of the random variable X is defined as

γ = σ / µ.   [1]

Now, let us assume that {X_1, ..., X_n} is a sample of n normal i.i.d. (µ, σ) random variables. Let X̄ and S be the sample mean and the sample standard-deviation of X_1, ..., X_n, i.e.

X̄ = (1/n) ∑_{i=1}^{n} X_i,   [2]

and

S = √( (1/(n−1)) ∑_{i=1}^{n} (X_i − X̄)² ).   [3]

The sample coefficient of variation γ̂ is defined as

γ̂ = S / X̄.   [4]

By definition, γ̂ takes values in (0, +∞). The distributional properties of the sample coefficient of variation γ̂ have been studied by [MCK 32], [HEN 36], [IGL 68], [IGL 70], [WAR 82], [VAN 96] and [REH 96]. Among these authors, [IGL 68] noticed that √n/γ̂ follows a noncentral t distribution with n − 1 degrees of freedom and noncentrality parameter √n/γ. Based on this property, it is easy to derive the c.d.f. (cumulative distribution function) F_γ̂(x|n, γ) of γ̂ as

F_γ̂(x|n, γ) = 1 − F_t( √n/x | n − 1, √n/γ ),   [5]

where F_t(·) is the c.d.f. of the noncentral t distribution with n − 1 degrees of freedom and noncentrality parameter √n/γ. Inverting F_γ̂(x|n, γ) gives the inverse c.d.f. F_γ̂^{−1}(α|n, γ) of γ̂ as

F_γ̂^{−1}(α|n, γ) = √n / F_t^{−1}( 1 − α | n − 1, √n/γ ),   [6]

where F_t^{−1}(·) is the inverse c.d.f. of the noncentral t distribution. Concerning confidence intervals or hypothesis tests involving the coefficient of variation, the reader can refer to [TIA 05], [VER 07] and [MAH 09].
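Equations (5)-(6) can be evaluated directly from the noncentral t distribution. The short sketch below uses scipy's nct routines and is an illustration only, not code from the paper; the probability limits at the end anticipate the SH-γ limits of the next section.

```python
import numpy as np
from scipy.stats import nct

def cv_cdf(x, n, gamma):
    """F_gammahat(x | n, gamma), Equation (5)."""
    return 1.0 - nct.cdf(np.sqrt(n) / x, df=n - 1, nc=np.sqrt(n) / gamma)

def cv_ppf(alpha, n, gamma):
    """Inverse c.d.f. of the sample CV, Equation (6)."""
    return np.sqrt(n) / nct.ppf(1.0 - alpha, df=n - 1, nc=np.sqrt(n) / gamma)

n, gamma0, alpha0 = 5, 0.1, 0.0027                 # assumed illustrative values
lcl = cv_ppf(alpha0 / 2, n, gamma0)                # probability limits of a Shewhart CV chart
ucl = cv_ppf(1 - alpha0 / 2, n, gamma0)
print(lcl, ucl)
```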

3. The Variable Sampling Interval (VSI) CV Shewhart chart

Let us suppose that we observe subgroups {X_{i,1}, X_{i,2}, ..., X_{i,n}} of size n, at times i = 1, 2, .... We assume that there is independence within and between these subgroups and we also assume that each random variable X_{i,j} follows a normal (µ_i, σ_i) distribution where the parameters µ_i and σ_i are constrained by the relation γ_i = σ_i/µ_i = γ_0 when the process is in-control. This implies that from one subgroup to another, the values of µ_i and σ_i may change, but the coefficient of variation γ_i = σ_i/µ_i must be equal to some predefined in-control value γ_0 = σ_0/µ_0, common to all the subgroups, where µ_0 is the in-control mean and σ_0 is the in-control standard-deviation.


[KAN 07] were the first to investigate the opportunity to monitor the coefficient of variation through a FSR Shewhart type chart, denoted as the SH-γ chart. In their paper, an application from the medical field is discussed to show the motivations which can lead quality practitioners to monitor the coefficient of variation instead of other sample statistics. The control limits LCL_{SH-γ} and UCL_{SH-γ} proposed by [KAN 07] are probability type control limits with an assumed type I error rate of α_0 = 0.0027, i.e. an in-control ARL_0 = 370.4. That is, LCL_{SH-γ} and UCL_{SH-γ} are respectively equal to:

LCL_{SH-γ} = F_γ̂^{−1}( α_0/2 | n, γ_0 ),   [7]
UCL_{SH-γ} = F_γ̂^{−1}( 1 − α_0/2 | n, γ_0 ),   [8]

where F_γ̂^{−1}(α|n, γ) is the inverse cumulative distribution function of γ̂. Generally, if the sample coefficient of variation γ̂ plotted on the SH-γ chart falls outside the control limits, then a signal is given to inform operators to search for an assignable cause. Otherwise, the process is considered as being in-control, and sampling is continued without any corrective action. In the SH-γ chart, the length of the time interval h between samples is fixed. In the VSI-γ chart, the interval between samples i and i + 1 depends on the value of γ̂_i. This method can improve the detection ability of the standard SH-γ chart by shortening the time to signal. In this paper, we assume that the VSI-γ chart only takes two sampling interval values h_S (S for Short) and h_L (L for Long) with h_S < h_L. This choice is motivated by the works of [Rey 88] and [RUN 91], who showed that the majority of the gain in detection effectiveness achievable by a VSI chart can be obtained by using only two sampling intervals, and it keeps the complexity of VSI schemes to a reasonable level. In the case of two sampling intervals, we have to define the control limits

LCL = µ_0(γ̂) − K σ_0(γ̂),   [9]
UCL = µ_0(γ̂) + K σ_0(γ̂),   [10]

as well as the warning limits

LWL = µ_0(γ̂) − W σ_0(γ̂),   [11]
UWL = µ_0(γ̂) + W σ_0(γ̂),   [12]

where W > 0 and K ≥ W are the warning and control limit parameters, respectively, and where µ_0(γ̂) and σ_0(γ̂) are the mean and standard-deviation of the sample coefficient of variation γ̂ when the process is in-control, i.e. γ_i = γ_0. Since there is no closed form for µ_0(γ̂) and σ_0(γ̂), [HON 08] suggested using the approximations proposed by [REH 96], i.e.

µ_0(γ̂) ≃ γ_0 [ 1 + (1/n)(γ_0² − 1/4) + (1/n²)(3γ_0⁴ − γ_0²/4 − 7/32) + (1/n³)(15γ_0⁶ − 3γ_0⁴/4 − 7γ_0²/32 − 19/128) ],   [13]

σ_0(γ̂) ≃ γ_0 [ (1/n)(γ_0² + 1/2) + (1/n²)(8γ_0⁴ + γ_0² + 3/8) + (1/n³)(69γ_0⁶ + 7γ_0⁴/2 + 3γ_0²/4 + 3/16) ]^{1/2}.   [14]

The VSI strategy works as follows:
– if γ̂ ∈ [LWL, UWL], the process is declared "in-control" and the next sample is collected after a long sampling interval h_L;


– if γ̂ ∈ [LCL, LWL] ∪ [UWL, UCL], the process is also declared "in-control" but the next sample is collected after a short sampling interval h_S;
– if γ̂ < LCL or γ̂ > UCL, the process is declared "out-of-control" and the potential assignable cause(s) must be found and removed.

The properties of a VSI type control chart are determined by the number of samples and the length of time until a signal is given. The number of samples before a signal is usually called the run length in the quality control literature, and the expected number of samples is called the average run length (ARL). With a fixed interval between samples, the ARL can easily be converted to the expected time to signal ATS by multiplying it by the fixed sampling interval h. Hence, the ARL can be thought of as the expected time to signal. With a VSI type chart, however, the time to signal is not a constant multiple of the number of samples to signal. It is necessary to keep track of both the number of samples to signal and the time to signal. [Rey 89] defined the number of samples to signal N as the number of samples taken from the start of the process to the time that the chart signals, and the Average Number of Samples to Signal ANSS = E(N) as the expected value of the number of samples to signal. They also defined the time to signal T as the time from the start of the process to the time when the chart signals. If h_i ∈ {h_S, h_L} is the sampling interval used before the ith sample, then:

T = ∑_{i=1}^{N} h_i.   [15]

The Average Time to Signal ATS = E(T) and the standard-deviation of the time to signal SDTS = √V(T) are the expected value and the standard deviation of the time to signal, respectively.

Let p_S, p_L and q be the following probabilities:

p_L = P(LWL ≤ γ̂ ≤ UWL),   [16]
p_S = P(LCL ≤ γ̂ < LWL) + P(UWL < γ̂ ≤ UCL),   [17]
q = P(γ̂ < LCL) + P(γ̂ > UCL).   [18]

Then we have

p_L = F_γ̂(UWL|n, γ_1) − F_γ̂(LWL|n, γ_1),   [19]
p_S = F_γ̂(UCL|n, γ_1) − F_γ̂(LCL|n, γ_1) − p_L,   [20]
q = 1 − p_S − p_L,   [21]

where F_γ̂(x|n, γ) is the cumulative distribution function of γ̂ and where γ_1 = τγ_0 is an out-of-control value for the CV. Values of τ ∈ (0, 1) correspond to a decrease of the nominal coefficient of variation, while values of τ > 1 correspond to an increase of the nominal coefficient of variation. For the SH-γ chart, the sampling interval is a constant, i.e. h_i = h. But for the VSI-γ chart, the sampling interval h_i is a random variable with two outcomes {h_S, h_L}. In order to make a fair comparison with the fixed sampling interval SH-γ chart, it is important to also evaluate the Average Sampling Interval ASI = E(h_i). Since the probability associated with h_S is p_S, the probability associated with h_L is p_L, and p_S + p_L = 1 − q, we have:

ASI = E(h_i) = (h_S p_S + h_L p_L) / (1 − q),   [22]


and for the same reason, we also have

E(h_i²) = (h_S² p_S + h_L² p_L) / (1 − q),   [23]

ATS = E(N) E(h_i) = (h_S p_S + h_L p_L) / (q(1 − q)),   [24]

and

SDTS = √( E(N) V(h_i) + V(N) E²(h_i) )   [25]
     = √( (h_S² p_S + h_L² p_L) / (q(1 − q)) + (1 − 2q)(h_S p_S + h_L p_L)² / (q²(1 − q)²) ).   [26]
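Equations (16)-(26) translate directly into a few lines of code. The sketch below evaluates ATS and SDTS for given limits and sampling intervals; the c.d.f. of the sample CV is the one of Equation (5), and the numerical inputs are assumptions chosen only for illustration.

```python
import numpy as np
from scipy.stats import nct

def cv_cdf(x, n, gamma):
    # c.d.f. of the sample CV via the noncentral t distribution (Equation (5))
    return 1.0 - nct.cdf(np.sqrt(n) / x, df=n - 1, nc=np.sqrt(n) / gamma)

def vsi_ats_sdts(n, gamma1, lcl, lwl, uwl, ucl, hS, hL):
    """ATS and SDTS of the VSI-gamma chart, Equations (19)-(26)."""
    pL = cv_cdf(uwl, n, gamma1) - cv_cdf(lwl, n, gamma1)
    pS = cv_cdf(ucl, n, gamma1) - cv_cdf(lcl, n, gamma1) - pL
    q = 1.0 - pS - pL
    ats = (hS * pS + hL * pL) / (q * (1.0 - q))                      # Equation (24)
    var_T = ((hS**2 * pS + hL**2 * pL) / (q * (1.0 - q))
             + (1.0 - 2.0 * q) * (hS * pS + hL * pL)**2 / (q**2 * (1.0 - q)**2))
    return ats, np.sqrt(var_T)                                       # Equation (26)

# illustrative (assumed) limits for n = 5, gamma0 = 0.1 and a 20% upward shift
print(vsi_ats_sdts(n=5, gamma1=0.12, lcl=0.03, lwl=0.07, uwl=0.14,
                   ucl=0.22, hS=0.1, hL=1.9))
```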

It is important to note that [Rey 89] derived these equations assuming that ". . . the interval h_1 used before the first sample is determined according to a random interval with the same distribution as the other h_i's, although no sample is taken at time 0". When the process is in-control, we have τ = 1 and the probabilities p_S, p_L and q simplify to

p_{L0} = F_γ̂(UWL|n, γ_0) − F_γ̂(LWL|n, γ_0),   [27]
p_{S0} = F_γ̂(UCL|n, γ_0) − F_γ̂(LCL|n, γ_0) − p_{L0},   [28]
q_0 = 1 − p_{S0} − p_{L0}.   [29]

Consequently, when the process is in-control, the ATS and the ASI are equal to

ATS_0 = (h_S p_{S0} + h_L p_{L0}) / (q_0(1 − q_0)),   [30]
ASI_0 = (h_S p_{S0} + h_L p_{L0}) / (1 − q_0).   [31]

4. Numerical analysis

Usually, we use the mean (ARL) and the standard deviation (SDRL) of the run length (RL) distribution to evaluate the performance of control charts. But for the VSI-γ chart, we use the mean (ATS) and the standard deviation (SDTS) of the time to signal T. When the process is in-control, the average of T will be denoted as ATS_0 and the standard deviation of T as SDTS_0. On the contrary, when the process is out-of-control, the average of T will be denoted as ATS_1 and the standard deviation of T as SDTS_1. A control chart is considered better than its competitors if it has a smaller ATS_1 value for a specific shift τ, when ATS_0 is the same for all the charts. For a FSR model, the ATS is a multiple of the ARL since the sampling interval h is fixed, i.e. ATS^(FSR) = h × ARL^(FSR). For a VSI model, due to the variation of the sampling interval, the equation should be changed to ATS^(VSI) = ASI × ARL^(VSI). The comparison of the statistical performances among different static charts must be conducted by forcing the same value for ARL_0. In the remainder of this paper, we will assume ARL_0 = 370.4. Furthermore, the in-control average sampling interval ASI_0 of the VSI-γ chart should be equal to h. Without loss of generality, in this paper we assume h = 1 t.u. (time unit) and, therefore, to make a fair comparison with other control charts, we assume that ASI_0 = 1 t.u. (this ensures ATS_0 = ARL_0 = 370.4). Consequently, the chart parameters h_S, h_L, W, K of a properly designed VSI-γ chart must satisfy the following two equations:

ATS(n, h_S, h_L, W, K, γ_0, τ = 1) = ATS_0 = 370.4,   [32]
ASI(n, h_S, h_L, W, K, γ_0, τ = 1) = ASI_0 = 1.   [33]

Since n and γ_0 are fixed by the problem itself, two chart parameters also have to be fixed in order to solve the equations above. It is a common choice to fix the values of (h_S, h_L) and compute the chart parameters K and W. Agreeing with the recommendations of [Rey 89], we suggest investigating the following combinations of (h_S, h_L): (0.5, 1.5), (0.3, 1.7), (0.1, 1.9), (0.1, 1.1), (0.1, 1.3), (0.1, 1.5) and (0.1, 4.0), for both the symmetric and the asymmetric cases. If ASI_0 = 1, then we deduce h_S p_{S0} + h_L p_{L0} = 1 − q_0 and therefore ATS_0 = 1/q_0, or equivalently

q_0 = 1/ATS_0 = 1 − F_γ̂(UCL|n, γ_0) + F_γ̂(LCL|n, γ_0).   [34]

Concerning the increasing case, as expected, whatever the values of n, γ_0 or τ, the ATS values of the VSI-γ chart are much smaller than those of the SH-γ chart, clearly demonstrating the outperformance of the former over the latter. Computation of the SDTS values for both the VSI-γ and SH-γ charts also shows that the time-to-signal distribution of the VSI-γ chart is always more under-dispersed than that of the SH-γ chart. We can deduce that the best performance is obtained by the VSI-γ chart for the couple (h_S = 0.1, h_L = 4.0): the larger the spread between h_S and h_L, the shorter the out-of-control ATS of the investigated chart. This evidence stems from the fact that as the difference between h_S and h_L increases, W is reduced; thereby, the warning region tends to cover a larger fraction of the control interval between UCL and LCL. In general, this condition enhances the statistical properties of the VSI chart due to a larger probability of tightened sampling. Thus, as a rule of thumb, we suggest selecting the couple (h_S, h_L) over a range as large as possible, compatibly with the technological constraints related to the rate of inspection and the need to collect a sufficiently large number of samples during the process run.
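Given (h_S, h_L), n and γ_0, the two design conditions (32)-(33) pin down K and W. One possible numerical sketch is shown below: K is obtained from Equation (34) and W from the ASI_0 = 1 constraint; the root-finding brackets and the test values n = 5, γ_0 = 0.1 are assumptions, and this is not necessarily the authors' own design routine.

```python
import numpy as np
from scipy.stats import nct
from scipy.optimize import brentq

def cv_cdf(x, n, g):
    if x <= 0:
        return 0.0                           # the sample CV is positive
    return 1.0 - nct.cdf(np.sqrt(n) / x, df=n - 1, nc=np.sqrt(n) / g)

def mu_sigma_cv(n, g):
    """Approximations (13)-(14) for the in-control mean and std of the sample CV."""
    mu = g * (1 + (g**2 - 0.25) / n
              + (3 * g**4 - g**2 / 4 - 7 / 32) / n**2
              + (15 * g**6 - 3 * g**4 / 4 - 7 * g**2 / 32 - 19 / 128) / n**3)
    var = g**2 * ((g**2 + 0.5) / n
                  + (8 * g**4 + g**2 + 3 / 8) / n**2
                  + (69 * g**6 + 7 * g**4 / 2 + 3 * g**2 / 4 + 3 / 16) / n**3)
    return mu, np.sqrt(var)

def design_vsi(n, g0, hS, hL, ats0=370.4):
    mu0, sd0 = mu_sigma_cv(n, g0)
    # K from Equation (34): q0 = 1/ATS0
    q0_of = lambda K: 1 - cv_cdf(mu0 + K * sd0, n, g0) + cv_cdf(mu0 - K * sd0, n, g0)
    K = brentq(lambda K: q0_of(K) - 1.0 / ats0, 0.5, 8.0)
    q0 = 1.0 / ats0
    # W from ASI0 = 1, which forces pL0 = (1 - q0)(1 - hS)/(hL - hS)
    target = (1 - q0) * (1 - hS) / (hL - hS)
    pL0_of = lambda W: cv_cdf(mu0 + W * sd0, n, g0) - cv_cdf(mu0 - W * sd0, n, g0)
    W = brentq(lambda W: pL0_of(W) - target, 1e-6, K)
    return K, W

print(design_vsi(n=5, g0=0.1, hS=0.1, hL=1.9))
```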

5. Conclusion

Monitoring the coefficient of variation CV by means of a control chart is receiving growing attention in the context of SPC. There are many situations in which the sample mean and standard deviation vary naturally in a proportional manner when the process is in-control, in which case X̄ and S control charts cannot be implemented. In this paper, a VSI-γ chart is proposed to monitor the CV. The ATS and SDTS computation has been performed and a comparison with the SH-γ chart demonstrated the outperformance of the VSI-γ chart over the SH-γ chart: the VSI-γ chart has smaller out-of-control ATS_1 and SDTS_1 values compared to those of the SH-γ chart. A real implementation of the VSI-γ chart on data collected from a die casting hot chamber process has demonstrated its efficiency in the detection of out-of-control situations. Future research should be focused on extending the study to other adaptive schemes, with and without run rules.

6. References

[CAS 11] Castagliola P., Celano G., Psarakis S., "Monitoring the Coefficient of Variation Using EWMA Charts", Journal of Quality Technology, vol. 43, num. 3, 2011, p. 249-265.
[HEN 36] Hendricks W., Robey W., "The Sampling Distribution of the Coefficient of Variation", Annals of Mathematical Statistics, vol. 7, 1936, p. 129-132.
[HON 08] Hong E., Kang C., Baek J., Kang H., "Development of CV Control Chart Using EWMA Technique", Journal of the Society of Korea Industrial and Systems Engineering, vol. 31, num. 4, 2008, p. 114-120.
[IGL 68] Iglewicz B., Myers R., Howe R., "On the Percentage Points of the Sample Coefficient of Variation", Biometrika, vol. 55, num. 3, 1968, p. 580-581.
[IGL 70] Iglewicz B., Myers R., "Comparisons of Approximations to the Percentage Points of the Sample Coefficient of Variation", Technometrics, vol. 12, num. 1, 1970, p. 166-169.
[KAN 07] Kang C., Lee M., Seong Y., Hawkins D., "A Control Chart for the Coefficient of Variation", Journal of Quality Technology, vol. 39, num. 2, 2007, p. 151-158.
[MAH 09] Mahmoudvand R., Hassani H., "Two New Confidence Intervals for the Coefficient of Variation in a Normal Distribution", Journal of Applied Statistics, vol. 36, num. 4, 2009, p. 429-442.
[MCK 32] McKay A., "Distribution of the Coefficient of Variation and the Extended t Distribution", Journal of the Royal Statistical Society, vol. 95, 1932, p. 695-698.
[REH 96] Reh W., Scheffler B., "Significance Tests and Confidence Intervals for Coefficients of Variation", Computational Statistics & Data Analysis, vol. 22, num. 4, 1996, p. 449-452.
[Rey 88] Reynolds Jr. M., Amin R., Arnold J., Nachlas J., "X̄ Charts with Variable Sampling Intervals", Technometrics, vol. 30, num. 2, 1988, p. 181-192.
[Rey 89] Reynolds Jr. M., Arnold J., "Optimal One-Sided Shewhart Charts with Variable Sampling Intervals", Sequential Analysis, vol. 8, num. 1, 1989, p. 51-77.
[RUN 91] Runger G., Pignatiello J., "Adaptive Sampling for Process Control", Journal of Quality Technology, vol. 23, num. 2, 1991, p. 135-155.
[TIA 05] Tian L., "Inferences on the Common Coefficient of Variation", Statistics in Medicine, vol. 24, num. 14, 2005, p. 2213-2220.
[VAN 96] Vangel M., "Confidence Intervals for a Normal Coefficient of Variation", American Statistician, vol. 15, 1996, p. 21-26.
[VER 07] Verrill S., Johnson R., "Confidence Bounds and Hypothesis Tests for Normal Distribution Coefficients of Variation", Communications in Statistics – Theory and Methods, vol. 36, num. 12, 2007, p. 2187-2206.
[WAR 82] Warren W., "On the Adequacy of the Chi-Squared Approximation for the Coefficient of Variation", Communications in Statistics – Simulation and Computation, vol. 11, 1982, p. 659-666.


A new time-adjusting control limit with Fast Initial Response for Dynamic Weighted Majority based control chart

Dhouha Mejri¹, Mohamed Limam², Claus Weihs³
¹ ISG de Tunis, Larodec, University of Tunis
² ISG de Tunis, Larodec, University of Tunis and Dhofar University, Oman
³ Technical University of Dortmund, Germany

ABSTRACT. The control chart is the most important tool for monitoring processes. Three parameters affect control chart performance: the number n of samples taken at each sampling time, the sampling interval h, and the control limits (CL). Most recent studies assume that the CL are fixed values. However, the CL should vary with time because, after a change in a process, one often wishes to adjust the CL in order to detect a potential out-of-control condition early. In this paper, we propose a new time-varying control chart. This method is a new technique for time-adjusting the control limit. It integrates a data mining algorithm, the Dynamic Weighted Majority-Winnow (DWM-WIN) of [MEJ 12], into the new control chart. The proposed method is illustrated assuming a normal process with known standard deviation where we wish to detect shifts in the mean. We show that quick detection of an initial out-of-control condition can be achieved by using a time-adjusting control limit and that the integration of DWM-WIN improves the control chart's effectiveness. Experiments have shown that the proposed time-varying control limit detects the first initial response (FIR) earlier.

KEYWORDS: Time adjusting control chart, control limit, Statistical Process Control, Dynamic Weighted Majority, Fast initial response

1. Introduction

Control charts are used to monitor reliability and performance and to improve the effectiveness of manufacturing processes. Control limits (CL) are one of the most important characteristics of control charts. As first noted by [MON 91], the CL of an EWMA control chart should be time varying since the variance of the test statistic z_t depends on t. The author shows that using fixed control limits rather than time-varying CL makes the EWMA control chart less sensitive to a shift in the new observations. He gives a formulation of the Upper Control Limit UCL(t) and the Lower Control Limit LCL(t) as functions of time:

UCL(t) = µ_x + L σ_x √( λ[1 − (1 − λ)^{2t}] / ((2 − λ)n) ),   [1]

LCL(t) = µ_x − L σ_x √( λ[1 − (1 − λ)^{2t}] / ((2 − λ)n) ),   [2]

where µ_x and σ_x are the sample mean and the sample standard deviation estimated from the preliminary data, λ is a constant and L is the CL constant. [STE 99] improves on the work of [MON 91] by proposing a time-varying control limit with the Fast Initial Response (FIR) suggested for Cumulative Sum (CUSUM) charts by [LUC 82].
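Equations (1)-(2) are the standard time-varying EWMA limits; a short sketch evaluating them is given below, where the numerical values of µ_x, σ_x, λ, L and n are placeholders chosen only for illustration.

```python
import numpy as np

def ewma_limits(t, mu_x, sigma_x, lam=0.2, L=3.0, n=1):
    """Time-varying EWMA limits of Equations (1)-(2) for sampling times t."""
    width = L * sigma_x * np.sqrt(lam * (1 - (1 - lam) ** (2 * t)) / ((2 - lam) * n))
    return mu_x - width, mu_x + width      # (LCL(t), UCL(t))

t = np.arange(1, 21)
lcl, ucl = ewma_limits(t, mu_x=1.0, sigma_x=0.01)
print(lcl[:5], ucl[:5])                    # limits widen toward their asymptotic value
```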


Empirical results have shown that using time-varying limits with an EWMA control chart is akin to FIR and makes the control chart more sensitive to early process shift detection; it helps detect problems with start-up quality. On the one hand, the first objective of the present study is to propose a new formula for a time-adjusting control limit, and we also introduce a time-adjusting control limit with FIR. On the other hand, [THU 10] establish a control limit based on the empirical level of significance on the percentile, estimated by the bootstrap method; they propose a control limit based on a bootstrap ensemble method. First, the bootstrap samples are denoted D²_{j1}, D²_{j2}, D²_{j3}, ..., D²_{jN}. Second, for each bootstrap sample, a parameter in (0, 1) is chosen. If ξ_{n+m} is out of control, then the control limit is adjusted by:

UCL = (ξ̄_1 + ξ_{n+m})/2 + 2ξ_{n+m} = ξ̄_2 + L σ_{n+m} [1 − (1 − f)^{1+a(t−1)}],   [17]

LCL = (ξ̄_1 + ξ_{n+m})/2 − 2ξ_{n+m} = ξ̄_2 − L σ_{n+m} [1 − (1 − f)^{1+a(t−1)}].   [18]

If ξ_{n+m+p}, where p ≥ 1, is out of control, then:

CL± = (ξ̄_{p+1} + ξ_{n+m+p})/2 ± L σ_{n+m+p} [1 − (1 − f)^{1+a(t−1)}].   [19]

Step 3: Monitor the phase II observations; if ξ_i exceeds the control limit, the batch is declared out of control and the process is considered out of control.

4. Empirical results

To illustrate control chart schemes with time-adjusting control limits, we use a sequence of simulated observations from a normal distribution with parameters µ = 1 and σ = 0.01. As shown in Figure 1, two control schemes are presented. One scheme has a constant control limit, CL = µ ± Lσ, and the other scheme uses the time-varying CL based on equations (8), (9), (10) and (11). We report process shifts in units of µ_x/σ_x. In the next section, we present a detailed description of the ARL properties for different values of this ratio.

4.1. Run Length properties of the DWM-WIN control chart with time-adjusting control limit

In order to evaluate the performance of the proposed control chart, we use the average run length (ARL), defined as the average number of samples until an out-of-control condition is signaled. A summary of the ARL results of the proposed control chart based on the time-adjusting CL is presented in Table 1. For the evaluation of the control chart with the new time-adjusting CL, we use equations (8), (9), (10) and (11). We note that FIR is used to make the shift identification as quick as possible. For the simulation, we use f = 0.5, which yields a = 0.3 according to the formula a = (−2/log(1 − f) − 1)/19.

Table 1. Run length properties of the time-adjusting control limit based control chart
| Mean and standard deviation | µ/σ | ARL with time-adjusting CL | ARL with FIR time-adjusting CL |
| (µ = 0.5, σ = 0.5) | 1 | 200 | 200 |
| (µ = 0.5, σ = 0.4) | 1.25 | 50 | 18.18 |
| (µ = 0.5, σ = 0.3) | 1.66 | 20 | 11.76 |
| (µ = 1, σ = 0.5) | 2 | 5.128 | 5 |
| (µ = 1, σ = 0.4) | 2.5 | 4.545 | 2.35 |
| (µ = 1, σ = 0.3) | 3.33 | 3.278 | 2 |
| (µ = 1, σ = 0.2) | 5 | 1.886 | 1.731 |
| (µ = 2, σ = 0.3) | 6.66 | 1.626 | 1.333 |
| (µ = 2, σ = 0.2) | 10 | 1.587 | 1.31 |

The effect of time-adjusting control limit on the ARL decreases with time to demonstrate that the long term run length properties of the proposed control chart with time adjusting control limit decrease when µ /σ increases. As shown in table 1, the time adjusting control limit has very little impact after µ /σ=3.33 when the ARL is less than 2. Table 1 shows also the effect of using the CL of equations 19 and 20 with f = 0.3 and a = 0.5 to simulate FIR adjustment. In fact, for the different values of the shift ratio µ /σ, ARL with FIR adjustment presents an

121

Proceedings of MSDM 2013

Figure 1. Simulation of a normal distribution with mean µ = 1 and σ = 0.01: (a) with constant control limits (equations (6) and (7)); (b) with time-adjusting control limits based on equations (8), (9), (10) and (11). Both panels plot the UCL and LCL against the number of instances (0 to 200).

For the different values of the shift ratio µ/σ, the ARL with the FIR adjustment improves on the run length obtained with the time-adjusting CL alone: all ARL values obtained with FIRadj are smaller than the corresponding values without it, except the first one, which remains 200.

4.2. Impact of the proportion f of FIRadj on the ARL values
This section examines the effect of the FIR proportion f on the variability of the ARL. As shown in Table 2, the ARL values increase as f increases; in fact, the ARL values are best for f = 0.01. Accordingly, in order to obtain a substantial benefit from the FIR proportion, f should be small. Comparing Table 1, where f = 0.5 is used, with Table 2, where several values of f are used, we conclude that the proposed control chart with time-adjusting control limit has superior ARL performance when f is small.


Table 2. ARL results for different FIR proportions f

µx/σx | f = 0.2 | f = 0.3 | f = 0.4 | f = 0.5
1 | 50 | 100 | 200 | 200
1.25 | 43.33 | 100 | 150 | 18.18
1.66 | 4.54 | 6.06 | 9.09 | 11.76
2 | 2.53 | 3.27 | 4.44 | 5
2.5 | 1.32 | 2.02 | 2.06 | 2.35
3.33 | 1.26 | 1.48 | 1.503 | 2
5 | 1.13 | 1.18 | 1.38 | 1.73
6.66 | 1.123 | 1.123 | 1.13 | 1.33

5. Conclusion
We have described a new strategy for a time-adjusting CL based on the DWM-WIN method. The method adjusts the control limit each time a shift occurs, using the information of the past data. We have integrated the DWM-WIN algorithm into the time-varying CL procedure and included a FIR feature in the proposed time-adjusting control limit, which makes the chart more sensitive to start-up shift detection. The new control limit setting is first applied to a normal distribution with artificial shifts. The idea proposed in this study will be applied to DWM-WIN misclassification error rates, and the impact of integrating the data mining algorithm will be evaluated. The proposed control chart with time-adjusting control limit could be extended by combining it with other control charts with time-varying CL.

6. References
[KOL 07] KOLTER Z. J., MALOOF M. A., "Dynamic weighted majority: An ensemble method for drifting concepts", Journal of Machine Learning Research, vol. 8, 2007, p. 2755-2790.
[LUC 82] LUCAS J. M., CROSIER R., "Fast Initial Response for CUSUM quality control schemes", Technometrics, vol. 24, 1982, p. 199-205.
[MEJ 12] MEJRI D., KHANCHEL R., LIMAM M., "An ensemble method for concept drift in nonstationary environment", Journal of Statistical Computation and Simulation, vol. 82, 2012, p. 1-14.
[MEJ 13] MEJRI D., LIMAM M., WEIHS C., "On parameters optimization of dynamic weighted majority algorithm based on genetic algorithm", International Conference on Modeling, Simulation and Applied Optimization (ICMAO'13), IEEE. (Accepted in January 2013).
[MON 91] MONTGOMERY D. C., "Introduction to Statistical Process Control", Second Edition, John Wiley and Sons, New York.
[STE 99] STEINER S. H., "Exponentially Weighted Moving Average control charts with time-varying control limits and fast initial response", Journal of Quality Technology, vol. 31, 1999, p. 2454-2476.
[THU 09] THUNTEE S., SEOUNG B. K., FUGEE T., "One-class classification-based control charts for multivariate process monitoring", IEEE Transactions on Software Engineering, vol. 42, 2009, p. 107-120.


On economic design of np control charts using variable sampling interval

Kooli Imen(1) and Limam Mohamed
ISG, LARODEC, University of Tunis, Bardo 2000, Tunisia.
Abstract
This paper focuses on the economic design of attribute charts, namely np charts, having a sampling interval that varies from one inspection to the next based on the position of the number of nonconforming units on the chart. Variable and fixed sampling interval np charts are compared based on the expected cost per unit time. A sensitivity analysis using a fractional factorial design is run to study the effect of the input parameters, associated with the operation of the chart and the behavior of the process, on the optimal designs of the two schemes. In all the runs used in the experimental design, the VSI np chart economically outperforms the static one.
Key Words: Attribute charts, economic design, variable sampling interval.

1 Introduction

In any manufacturing process, a certain amount of inherent variation, composed of the cumulative effect of many unavoidable causes called chance causes, is always present. A process operating with only chance causes is said to be in statistical control. Other kinds of variability may occasionally be present in the output of a process; in key quality characteristics, such variability usually arises from three sources: improperly adjusted machines, operator errors or defective raw materials. Such variability leads to an unacceptable level of process performance. These sources are referred to, in the quality control context, as assignable causes. A process that is operating in the presence of assignable causes is said to be out-of-control. Control charts are widely used to establish and maintain statistical control of a given process. A control chart is a graphical visualization of the behavior of

(1) Corresponding author. E-mail: [email protected]


the process through time. To use a control chart, the engineer should specify the sample size, which is the number of units to be inspected, the sampling interval between successive samples and the control limits. The chart's control limits are those within which the plotted points would fall with high probability if the process is running in the in-control state. A point outside the control limits is taken as an indication that an assignable cause has occurred. As a consequence, the root cause of the process anomaly should be identified and eliminated before a large number of nonconforming units is manufactured. Control charts are divided into two main types, variable charts and attribute charts. If the quality characteristic can be expressed in terms of numerical measurements such as weight, volume or dimension, then the related chart is called a variable one; the X̄ chart is widely used to monitor the mean of a quality characteristic. In attribute charts, each inspected item is classified as either conforming or nonconforming with respect to the required specifications. The p chart is used to supervise the proportion of nonconforming units observed in a selected sample. If a random sample of n units of product is selected and the probability that any unit will not conform to specifications is p, then the number of nonconforming items, D, observed in the taken sample follows a binomial distribution with parameters n and p. In such a case, the np chart is employed. Control charts can be designed with respect to statistical criteria only. This usually involves selecting the sample size and control limits so that the probability of detecting a particular shift when it really occurs and the probability of getting a false alarm are equal to specified values. The design of a control chart has economic consequences, in that the costs of sampling and testing, the costs associated with investigating out-of-control signals and possibly correcting assignable causes, and the costs of allowing defective products to reach the consumer are all affected by the selection of the control chart parameters. Since it is crucial for a company to obtain better quality at lower cost, economic design should be employed. Optimal design parameters are those which minimize the total expected cost per item or per unit of time. The pioneering work on economically designed charts was that of Duncan [4] for X̄ charts. Since then, much attention has been given to the economic design of control charts; literature surveys are given in Montgomery [12]. The classical control schemes using a fixed sample size, sampling interval and control limits are referred to, in the literature, as fixed-parameter or static control charts. However, keeping the design parameters constant during the production process can result in a delay in detecting small to moderate shifts in the process parameter being controlled. In the last two decades, it has been found that the statistical and economic performance of control charts can be improved when at least one design parameter is allowed to take different values given the position of the last observed sample on


the chart. These charts are called adaptive charts. One way is to use a variable sampling interval, VSI, an approach which has received much attention. In these charts, a short sampling interval is used when there is an indication of a problem, otherwise the long sampling interval is chosen. Reynolds et al. [10] were the first to propose the VSI X̄ chart. A numerical study made to compare the fixed sampling interval, FSI, and the VSI schemes has shown that, relative to FSI control charts, the VSI feature can substantially reduce the time required by the chart to detect small to moderate shifts in the process mean. Later, the properties of VSI X̄ charts were investigated by Runger and Pignatiello [11] and Runger and Montgomery [14], among others. An economic design model of the VSI X̄ chart was treated by Bai and Lee [1]. Luo et al. [9], Epprecht et al. [7] and Lin et al. [8] applied the variable sampling interval approach to other kinds of control charts. Tagaras [15] provided a survey of the developments in the design of adaptive control charts, covering all kinds of adaptive charts that have variable sample size, sampling interval and control limits, as well as different combinations. The economic design of static np charts was treated by Duncan [5]. While attribute charts are of increasing importance due to the rapidity of obtaining count data, adaptive attribute charts have been treated in only a few works. Calabrese [2] developed a Bayesian np control chart having an adaptive control limit. Epprecht et al. [6] studied the statistical properties of a fully adaptive c chart, related to the observed number of nonconformities in a sample. In this paper, the economic design of the VSI np chart is treated, and a sensitivity analysis is conducted in order to find the optimal solutions for a given set of input parameters composed of the cost and process parameters; the relations between the input parameters and the optimal designs are also studied. In Section 2, the notations used throughout the paper are presented. Section 3 describes the operation of the VSI np chart, and the expected cost per unit of time is developed in Section 4. The sensitivity analysis is presented in Section 5, and the conclusion makes up the last section.

2 Notations

In order to develop the economic model of the VSI np chart and its static counterpart, the cost parameters associated with the operation of the chart and the process parameters related to the behavior of the process should be determined. The cost and process parameters used in this paper are given as follows:
• L0 : False alarm cost, incurred if the process is stopped when no assignable cause is actually present.


• L1 : Cost of detecting and eliminating an assignable cause.
• M : The loss per hour due to the increased nonconforming items produced while the process runs out-of-control.
• b : Fixed sampling cost.
• c : Variable sampling cost.
• t1 : Average search time for a false alarm.
• t2 : Average time of detecting and eliminating an assignable cause.
• λ : Rate of occurrence of the assignable cause.
• p0 : Fraction nonconforming when the process operates in the in-control state.
• p1 : Fraction nonconforming when the process operates in the out-of-control state.
The design parameters of the FSI scheme are:
• h : The sampling interval length in the FSI scheme.
• n : The number of units to be inspected.
• d : The upper control limit in the FSI scheme.
The decision variables related to the operation of the VSI chart are:


• h1 : The long sampling interval.
• h2 : The short sampling interval.
• d1 : The warning limit.
• d2 : The upper control limit in the VSI scheme.
The terms related to the derivation of the expected cost per unit of time are:
• EC : Expected cost per production cycle.
• ET : Expected length of a production cycle.
• ECT : Expected cost per hour.

3 Operation of the VSI np control chart

Suppose that a sample of size n is taken at every sampling point, and let the computed statistic Dt denote the number of nonconforming units, which follows a binomial distribution with parameters n and pi when the process operates in state i. If Dt, plotted on the chart, goes beyond the value d2, the process is stopped and a search for an assignable cause is undertaken. Otherwise, the process is considered to be operating in the in-control state, and the next sample is taken at the next sampling point. In this situation, the control chart operates with a fixed sampling interval h, regardless of Dt; this is the FSI np chart. In the VSI np chart, if Dt falls inside the control limits, the monitored process is also considered in-control as in the FSI chart, with the difference that the next sampling interval is a function of the latest observed statistic. For administrative convenience, we consider two values of the sampling interval, since the results in [1] indicated that the expected costs for the VSI X̄ schemes using two and three possible sampling intervals are nearly the same. In this case, there are three alternative courses of action after the relevant statistic of a sample is recorded, as shown in Figure 1.

[Figure 1 plots the number of nonconforming units Dt against time, partitioned into three regions: I1: 0 ≤ Dt ≤ w (next sample after the long interval h1), I2: w < Dt ≤ d (next sample after the short interval h2), and I3: Dt > d (signal).]

Figure 1: Chart’s partition when a V SI scheme is used. A sample point falling in the region I1 is a strong evidence that the process is operating properly so the control could be relaxed by choosing a long time interval, h1 , before taking the next sample. Whereas, a sample point falling in the warning region, I2 , is an indication that the process needs adjustment, as a consequence the control should be tightened by taking the next sample using a short sampling interval, h2 . If the number of defective units exceeds the value of d2 , the process is stopped and a search for the assignable cause is undertaken. The choice of the sampling interval for the first sample and after a false alarm is determined by the conditional distribution of the sampling interval given that D falls in the control region when the process is in-control.

4 Development of cost model

Let q0 and q1 denote, respectively, the probability that D falls outside the control limits when p = p0 and p = p1. Also, let p0j be the conditional probability that D belongs to Ij given that D ≤ d2 when p = p0, and p1j be the corresponding conditional probability when p = p1. Then we have

q_0 = \sum_{j=d_2+1}^{n} C_j^n p_0^j (1 - p_0)^{n-j},   (1)

q_1 = \sum_{j=d_2+1}^{n} C_j^n p_1^j (1 - p_1)^{n-j},   (2)

and

p_{01} = \frac{\sum_{j=0}^{d_1-1} C_j^n p_0^j (1 - p_0)^{n-j}}{1 - q_0},   (3)

p_{02} = \frac{\sum_{j=d_1}^{d_2} C_j^n p_0^j (1 - p_0)^{n-j}}{1 - q_0}.   (4)

In the same way,

p_{11} = \frac{\sum_{j=0}^{d_1-1} C_j^n p_1^j (1 - p_1)^{n-j}}{1 - q_1},   (5)

p_{12} = \frac{\sum_{j=d_1}^{d_2} C_j^n p_1^j (1 - p_1)^{n-j}}{1 - q_1}.   (6)
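As a minimal illustration (not part of the original paper), the probabilities in (1)–(6) can be evaluated with standard binomial routines; the numerical values of n, p, d1 and d2 below are arbitrary placeholders.

from scipy.stats import binom

def chart_probabilities(n, p, d1, d2):
    # q: probability that D falls beyond the control limit d2, eqs. (1)-(2)
    q = 1.0 - binom.cdf(d2, n, p)
    below_d1 = binom.cdf(d1 - 1, n, p) if d1 >= 1 else 0.0
    p_I1 = below_d1 / (1.0 - q)                          # eqs. (3) and (5)
    p_I2 = (binom.cdf(d2, n, p) - below_d1) / (1.0 - q)  # eqs. (4) and (6)
    return q, p_I1, p_I2

q0, p01, p02 = chart_probabilities(n=50, p=0.03, d1=2, d2=5)   # in-control, p = p0
q1, p11, p12 = chart_probabilities(n=50, p=0.07, d1=2, d2=5)   # out-of-control, p = p1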

In economic design models, the operation of the control chart is viewed as a series of cycles. A production cycle is defined as the average time elapsed from the start-up of production until the detection and elimination of the assignable cause. A new cycle begins just after the process has been put back in control. As shown in Figure 2, a production cycle is formed by four components: the in-control period, the out-of-control period, the time spent in searching false alarms and the time period needed for the detection and elimination of the root cause. It is assumed that the process is shut down during the search for an assignable cause.

[Figure 2 timeline: the cycle begins with the in-control period (p = p0); the assignable cause occurs a time τ after the last in-control sample; the out-of-control period (p = p1) follows until the assignable cause is detected and removed, which ends the cycle.]

Figure 2: A diagram for an in-control period followed by an out-of-control period in a process cycle.
The expected length of the in-control period is 1/λ. Let O be the expected length of the out-of-control period, U be the length of the sampling interval in which the assignable cause occurs and Y be the time between the sampling point just prior


to the occurrence of the assignable cause and the occurrence itself. Reynolds et al. [10] showed that

E(O) = E(U) - E(Y) + (S_1 - 1) \sum_{j=1}^{2} h_j p_{1j},   (7)

where S1 denotes the expected number of samples in the out-of-control period; since this number is a geometric random variable with parameter q1, it results that S1 = 1/q1. The variable U takes either the value h1 or h2, and its distribution is found when the process is in-control, since the sample just prior to the occurrence of the assignable cause was taken when p = p0. Reynolds et al. [10] assumed that

P(U = h_j) = \frac{h_j p_{0j}}{\sum_{j=1}^{2} h_j p_{0j}} \quad \text{for } j = 1, 2.   (8)

From the results of Duncan [4], the conditional expected value of Y, given U = h_j, is

E(Y \mid U = h_j) = \frac{1 - (1 + \lambda h_j) e^{-\lambda h_j}}{\lambda (1 - e^{-\lambda h_j})} \quad \text{for } j = 1, 2.   (9)

Therefore the expected length of the out-of-control period is obtained by combining formulas (7) to (9):

E(O) = \frac{1}{\sum_{j=1}^{2} h_j p_{0j}} \sum_{j=1}^{2} \frac{h_j p_{0j} (\lambda h_j - 1 + e^{-\lambda h_j})}{\lambda (1 - e^{-\lambda h_j})} + (S_1 - 1) \sum_{j=1}^{2} h_j p_{1j}.

Let S0 denote the expected number of samples in the in-control state. Bai and Lee [1] showed that

S_0 = \frac{\sum_{j=1}^{2} p_{0j} e^{-\lambda h_j} \, \sum_{j=1}^{2} p_{0j} (1 - e^{-\lambda h_j})}{\left(1 - \sum_{j=1}^{2} p_{0j} e^{-\lambda h_j}\right)^2}.

It results that the expected length of a production cycle is

ET = \frac{1}{\lambda} + E(O) + q_0 t_1 S_0 + t_2.   (10)

The cost model involves the cost of false alarms, the cost incurred while the process is operating off-target, the cost of sampling and the cost of eliminating the assignable cause. It follows that

EC = q_0 L_0 S_0 + M E(O) + (b + cn)(S_0 + S_1) + L_1.   (11)
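A rough numerical sketch of equations (1)–(11) is given below; it is an illustration rather than the authors' implementation. S0 is written in the algebraically equivalent geometric form A/(1 − A) with A = Σ p0j e^{-λ hj}, and the parameter values in the example combine the low factor levels of Table 1 with an arbitrary design.

import math
from scipy.stats import binom

def region_probs(n, p, d1, d2):
    # q = P(D > d2); then P(D in I1 | D <= d2) and P(D in I2 | D <= d2), eqs. (1)-(6)
    q = 1.0 - binom.cdf(d2, n, p)
    below = binom.cdf(d1 - 1, n, p) if d1 >= 1 else 0.0
    return q, below / (1.0 - q), (binom.cdf(d2, n, p) - below) / (1.0 - q)

def ect_vsi(n, d1, d2, h1, h2, p0, p1, lam, t1, t2, L0, L1, M, b, c):
    q0, p01, p02 = region_probs(n, p0, d1, d2)
    q1, p11, p12 = region_probs(n, p1, d1, d2)
    S1 = 1.0 / q1                                     # expected out-of-control samples
    h, w0, w1 = (h1, h2), (p01, p02), (p11, p12)
    EO = sum(hj * pj * (lam * hj - 1 + math.exp(-lam * hj))
             / (lam * (1 - math.exp(-lam * hj))) for hj, pj in zip(h, w0))
    EO = EO / sum(hj * pj for hj, pj in zip(h, w0)) \
         + (S1 - 1) * sum(hj * pj for hj, pj in zip(h, w1))   # display after eq. (9)
    A = sum(pj * math.exp(-lam * hj) for hj, pj in zip(h, w0))
    S0 = A / (1 - A)                                  # expected in-control samples
    ET = 1.0 / lam + EO + q0 * t1 * S0 + t2           # eq. (10)
    EC = q0 * L0 * S0 + M * EO + (b + c * n) * (S0 + S1) + L1  # eq. (11)
    return EC / ET

# Example with the low factor levels of Table 1 and an arbitrary VSI design:
cost = ect_vsi(n=40, d1=1, d2=2, h1=9, h2=1, p0=0.03, p1=0.07,
               lam=0.01, t1=0.1, t2=0.3, L0=100, L1=300, M=100, b=0.5, c=0.1)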


Table 1: Factor levels for the sensitivity analysis

Factor | A = p0 | B = p1 | C = λ | D = t1 | E = t2 | F = L0 | G = L1 | H = M | J = b | K = c
Low    | 0.03   | 0.07   | 0.01  | 0.1    | 0.3    | 100    | 300    | 100   | 0.5   | 0.1
High   | 0.05   | 0.16   | 0.05  | 0.5    | 1.5    | 250    | 400    | 150   | 5     | 1

The cost of the static np chart [5] is obtained by letting d1 = d2 = d and h1 = h2 = h, which implies that p01 = p11 = 1 and p02 = p12 = 0. In this case the expected cycle length reduces to

ET = \frac{1}{\lambda} + \frac{h}{q_1} - \tau + q_0 t_1 \frac{e^{-\lambda h}}{1 - e^{-\lambda h}} + t_2,   (12)

where τ is the expected time between the occurrence of the assignable cause and the preceding sample, i.e. E(Y | U = h) in (9) with h_j = h, and the total expected cost becomes

EC = (q_0 L_0 + b + cn) \frac{e^{-\lambda h}}{1 - e^{-\lambda h}} + \frac{b + cn}{q_1} + M\left(\frac{h}{q_1} - \tau\right) + L_1.   (13)

The function to be minimized in both cases is the expected cost per unit of time, ECT, which is the ratio EC/ET. In designing an economic VSI np chart for a particular application, the objective is to find the set of chart parameters {d1, d2, h1, h2, n} that minimizes the ratio obtained by dividing equation (11) by equation (10). For the static np chart, the set of design parameters is {d, n, h}, and the optimal design is the one that minimizes the expected cost per hour of operation.
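The direct search described in the next section can be sketched, for the static chart, as follows; equations (12) and (13) are used directly, and the search bounds and parameter values are arbitrary assumptions rather than the grid actually used by the authors.

import math
from scipy.stats import binom

def ect_fsi(n, d, h, p0, p1, lam, t1, t2, L0, L1, M, b, c):
    q0 = 1.0 - binom.cdf(d, n, p0)
    q1 = 1.0 - binom.cdf(d, n, p1)
    if q1 <= 0.0:
        return math.inf                       # the chart can never signal the shift
    tau = (1 - (1 + lam * h) * math.exp(-lam * h)) / (lam * (1 - math.exp(-lam * h)))
    ET = 1/lam + h/q1 - tau + q0 * t1 * math.exp(-lam*h) / (1 - math.exp(-lam*h)) + t2
    EC = (q0 * L0 + b + c * n) * math.exp(-lam*h) / (1 - math.exp(-lam*h)) \
         + (b + c * n) / q1 + M * (h/q1 - tau) + L1
    return EC / ET                            # expected cost per hour, eqs. (12)-(13)

def optimize_fsi(params, n_max=200, d_max=30, h_max=25):
    best = None
    for n in range(2, n_max + 1):
        for d in range(0, min(d_max, n) + 1):
            for h in range(1, h_max + 1):
                cost = ect_fsi(n, d, h, **params)
                if best is None or cost < best[0]:
                    best = (cost, n, d, h)
    return best                               # (ECT*, n*, d*, h*)

# Example with the low factor levels of Table 1:
params = dict(p0=0.03, p1=0.07, lam=0.01, t1=0.1, t2=0.3,
              L0=100, L1=300, M=100, b=0.5, c=0.1)
# best_cost, n_star, d_star, h_star = optimize_fsi(params)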

5 Sensitivity analysis

In this section, first the VSI and FSI np control charts are compared with respect to the ECT. In a second part, the impact of the input parameters on the behavior of the optimal solutions of the two schemes is studied. The optimal set of chart parameters of the VSI chart, {d1, d2, n, h1, h2}, minimizing the expected unit cost is found using a direct search procedure over all possible integer values of n, h1, h2, d1 and d2 such that 0 ≤ d1 ≤ d2 and 1 ≤ h2 ≤ h1. The same method is used to find the optimal solutions of the FSI chart. A sensitivity analysis is carried out by varying each of the ten input parameters at two different levels. Table 1 provides the high and low levels of the studied factors. A 2^(10-5) fractional factorial design with resolution IV is used. This resolution provides very good information about main effects. Table 2 shows the runs used, based on the design generators F = ABCD, G =


ABCE, H = ABDE, J = ACDE and K = BCDE. The minus sign represents the low level and the plus sign the high level. The optimal solutions of the FSI and VSI charts are given in Table 3. The column labeled ∆C is the percent reduction in cost achieved when the VSI scheme is used instead of the FSI scheme, that is

\Delta C = \frac{ECT^*_{FSI} - ECT^*_{VSI}}{ECT^*_{FSI}} \times 100.

From Table 3, we can conclude that:
• The expected total costs of the VSI charts are consistently smaller than those of the corresponding FSI charts, with a mean reduction of 2.45%.
• For most of the cases (21 out of 32 runs), h2* = 1, meaning that the process should be sampled after one hour, the minimum allowable value for h2, if the observed number of nonconforming units falls into region I2. This result agrees with the basic idea of the VSI scheme of taking samples as quickly as possible when there is a symptom of a process shift.
• In most cases, the optimal long sampling interval is greater than the sampling interval of the static scheme, while the opposite is observed for the short sampling interval.
To study the effect of the process and cost parameters on the behavior of the optimal solutions of the FSI and VSI schemes, based on Table 3, the statistical software SPSS is used to run a regression analysis for each dependent variable. The regression models, including the significant effects for the two cases, are given respectively in Tables 4 and 5. From Table 4, it is observed that:
• When the fraction of defective units in the in-control state increases, the optimal sampling interval tends to be higher.
• A greater variable sampling cost leads to taking samples at longer sampling intervals, with smaller sample sizes and narrower control limits.
• A higher cost due to manufacturing nonconforming units leads to using a smaller sampling interval. Thus, for processes subject to an important cost of repairing items covered by warranties [13], the process should be supervised frequently in order to detect a process shift as soon as possible.
From Table 5, it can be concluded that:


Table 2: Runs used in the experiment. The 32 runs give the low (−) and high (+) settings of the ten factors A–K; columns F–K follow from the base 2^5 design in A–E through the generators F = ABCD, G = ABCE, H = ABDE, J = ACDE and K = BCDE.


Table 3: Optimal design variables and costs of FSI and VSI charts

Run | FSI: d* h* n* ECT* | VSI: d1* d2* h1* h2* n* ECT* | ∆C
1 | 2 10 53 26.18 | 2 2 9 1 40 25.72 | 1.76%
2 | 10 6 188 11.55 | 6 9 6 1 142 10.58 | 8.4%
3 | 6 5 119 11.62 | 4 6 5 1 97 11.01 | 5.25%
4 | 0 7 17 19.46 | 0 0 7 7 17 19.46 | 0.00%
5 | 6 2 120 34.38 | 5 6 2 1 112 33.55 | 2.42%
6 | 0 5 19 42.4 | 0 0 5 5 19 42.40 | 0.00%
7 | 1 7 38 46.74 | 1 1 7 5 33 46.62 | 0.26%
8 | 9 2 164 39.20 | 7 9 3 1 162 37.94 | 3.22%
9 | 5 5 60 8.06 | 4 5 5 1 51 7.85 | 2.61%
10 | 1 4 14 14.38 | 1 1 4 1 11 13.58 | 5.57%
11 | 2 5 23 15.12 | 1 2 5 1 16 14.03 | 7.21%
12 | 6 5 67 9.11 | 4 5 5 1 49 8.82 | 3.19%
13 | 2 3 22 38.57 | 2 2 4 1 22 37.46 | 2.88%
14 | 6 2 68 26.87 | 5 6 2 1 64 26.53 | 1.27%
15 | 4 2 51 32.48 | 3 4 2 1 46 32.20 | 0.87%
16 | 1 3 16 29.96 | 1 1 3 1 13 29.18 | 2.61%
17 | 0 12 23 17.74 | 0 0 12 12 23 17.74 | 0.00%
18 | 11 8 210 19.24 | 10 11 8 6 202 19.14 | 0.52%
19 | 27 9 452 21.83 | 21 25 9 1 376 20.97 | 3.94%
20 | 0 21 31 23.65 | 0 0 21 21 31 23.65 | 0.00%
21 | 21 6 356 47.49 | 19 21 6 2 335 47.09 | 0.85%
22 | 0 8 29 58.95 | 0 0 8 8 29 58.95 | 0.00%
23 | 0 4 20 52.02 | 0 0 4 4 20 52.02 | 0.00%
24 | 5 5 117 35.56 | 2 5 6 5 117 35.56 | 0.00%
25 | 9 3 78 10.27 | 6 8 3 1 63 9.68 | 5.75%
26 | 3 9 32 15.20 | 2 3 9 1 23 14.31 | 5.86%
27 | 1 7 15 14.56 | 2 2 8 1 20 14.32 | 1.65%
28 | 7 3 65 8.70 | 4 6 3 1 48 8.24 | 5.29%
29 | 1 3 16 41.58 | 1 1 3 2 14 41.28 | 0.73%
30 | 7 2 66 27.49 | 4 6 2 1 52 27.05 | 1.61%
31 | 9 2 80 25.10 | 6 8 2 1 67 24.51 | 2.36%
32 | 3 4 33 47.39 | 3 3 4 1 30 46.32 | 2.26%


Table 4: Regression model in the FSI case

Response | Significant effects | R2
d*  | 5.156 + 1.719 L0 − 4.094 c | 0.560
h*  | 5.594 + 1.031 p0 − 1.719 p1 − 1.844 λ − 0.844 M + 1.406 c | 0.703
n*  | 83.188 − 39.063 p1 − 58.125 c | 0.512
TC* | 27.277 + 1.897 p0 − 4.474 p1 + 11.860 λ + 1.675 L0 + 1.977 M + 4.217 c | 0.947

Table 5: Regression model in the VSI case

Response | Significant effects | R2
d1* | 3.938 + 1.375 L0 − 2.938 c | 0.457
d2* | 4.938 + 1.562 L0 − 3.812 c | 0.540
h1* | 5.688 + 1.062 p0 − 1.688 p1 − 1.750 λ − 0.937 M + 1.375 c | 0.706
h2* | 3.063 − 2.000 p1 + 1.437 c | 0.35
n*  | 73.250 − 36.438 p1 − 50.687 c | 0.523
TC* | 26.805 + 1.997 p0 − 4.595 p1 + 11.861 λ + 1.519 L0 + 1.921 M + 4.260 c | 0.945
∆C  | 2.448 + 0.784 p1 − 1.114 λ + 0.627 L0 − 0.837 b | 0.550

• An increase in the frequency of occurrence of the assignable cause leads to a decrease in the longer sampling interval and an increase in the expected cost per hour of operation.
• The warning and control limits are greater when L0 is large and c is small.
• The percent reduction due to using a variable sampling interval is significantly larger when λ and b are small.
• The other parameters, such as t1 and t2, have no significant effects on the optimal design schemes.

6 Conclusion

In this paper the economic design of VSI np charts is studied. The VSI chart is a modification of the FSI chart in which the sampling interval varies between two values as a function of the most recent process information. The expected cost per hour of the VSI np chart is developed to determine the five decision variables (the sample size, the long sampling interval, the short sampling interval, the warning limit and the control limit). A sensitivity analysis using a 2^(10-5) fractional factorial (resolution IV)




design is conducted to find the optimal solutions of the FSI and VSI charts, and the impact of the 10 process and cost parameters on the behavior of the optimal schemes is studied. Numerical comparisons show that the VSI scheme can be more efficient than the FSI scheme in terms of expected cost.

References [1] D. S. Bai and K. T. Lee. An economic design of variable sampling interval ¯ control charts. International Journal of Production Economics, 54, 57-64. X (1998). [2] J. M. Calabrese. Bayesian process control for attributes. Management Science, 41(4), 637-645. (1995). ¯ control charts for non-normal data using [3] Y. K. Chen. Economic design of X variable sampling interval. International Journal of Production Economics, 92, 61-74. (2004). ¯ charts used to maintain current control [4] A. J. Duncan. The economic design of X of a process. Journal of the American Statistical Association, 51(274), 228-242. (1956). [5] A. J. Duncan. The economic design of p charts to maintain current control of a process: Some numerical results. Technometrics, 20(3), 235-243. (1978). [6] E. K. Epprecht, A.F.B. Costa and F.C.T. Mendes. Adaptive control charts for attributes. IIE Transactions, 35, 567-582. (2003). [7] E. K. Epprecht, B. F. T. Simos and F. C. T. Mendes. A variable sampling interval EWMA chart for attributes. The International Journal of Advanced Manufacturing Technology, 49 (1-4), 281-292. (2010). [8] Y.C. Lin and C.Y. Chou. Robustness of the EWMA and the combined ¯ X-EWMA control charts with variable sampling intervals to non-normality. Journal of Applied Statistics, 38(3), 553-570. (2011) [9] Y. Luo, Z. Li and Wang Z. Adaptive CUSUM control chart with variable sampling intervals. Computational Statistics and Data Analysis, 53(7),26932701. (2009). ¯ charts with [10] M. R. Reynolds, R. W. Amin, J. C. Arnold, J. A. Nachlas. X variable sampling interval. Technometrics, 30(2), 181-192. (1988).


[11] G. C. Runger and J. J. Pignatiello. Adaptive sampling for process control. Journal of Quality Technology, 23, 135-155. (1991). [12] D. C. Montgomery. The economic design of control charts: A review and literature survey. Journal of Quality Technology, 12(2), 75-87. (1980). [13] D. C. Montgomery. Introduction to statistical process control. John Wiley, NY. (1996). [14] G. C. Runger and D.C. Montgomery. Adaptive sampling for process control charts. IIE Transaction, 25, 41-51. (1993). [15] G. Tagaras. A survey of recent developments in the design of adaptive control charts. Journal of Quality Technology, 30(3), 212-231. (1998).


Average run length for MEWMA and MCUSUM charts in the case of skewed distributions
Sihem Ben Zakour, Hassen Taleb
LARODEC, ISG, University of Tunis. [email protected], [email protected]
ABSTRACT: Multivariate control charts have been widely used in many manufacturing industries to control and analyze processes characterized by a large number of quality characteristics. In the semiconductor industry, for example, manufacturers make semiconductor devices around the clock through hundreds of processes. A multivariate chart, an extension of the univariate chart, is used for the joint monitoring of several correlated variables. The two most commonly used multivariate charts for a quick detection of small or moderate shifts in the mean vector are the multivariate exponentially weighted moving average (MEWMA) and multivariate cumulative sum (MCUSUM) charts. The MEWMA and MCUSUM charts use information from previous data, which makes them sensitive to small shifts. These control charts require the assumption that the underlying process follows a multivariate normal distribution. The present paper studies the robustness of the MEWMA and MCUSUM charts under non-normality by considering the multivariate Weibull and multivariate gamma distributions with different sample sizes and correlation coefficients.

KEYWORDS: MEWMA chart, MCUSUM chart, Average Run length (ARL).

1. Introduction
In most process monitoring applications, the quality of a process is determined by two or more quality characteristics [WOO 99]. When an industrial process requires several related variables of interest to be controlled jointly, this is called multivariate statistical process control. The most useful tool for monitoring a multivariate process is a multivariate control chart. In order to construct a multivariate chart, a preliminary set of data is assumed to be in statistical control; this analysis is known as a Phase-I analysis and is conducted to estimate the process parameters that will be used for the monitoring of the future process, the Phase-II process. Numerous multivariate charts and their extensions are presently available. These charts can be categorized into three groups, namely T2, multivariate EWMA (MEWMA) and multivariate CUSUM (MCUSUM) charts. Hotelling's chart was introduced by [HOT 47] for the detection of large sustained shifts. The MCUSUM chart was first suggested by Woodall and Ncube [NCU 85], while the MEWMA chart was introduced by [LOW 92]. The primary criterion is the average run length (ARL), which is the most commonly used measure of a control chart's performance. This paper is organized as follows: Section 2 introduces the MEWMA chart and Section 3 reviews the MCUSUM chart. In Section 4, a simulation study is conducted to compare the performances of the MEWMA and MCUSUM charts for skewed distributions. Finally, conclusions are drawn in Section 5.


2. MEWMA control chart
The MEWMA chart proposed by [LOW 92] is expressed as follows:

Z_t = \lambda X_t + (1 - \lambda) Z_{t-1}, \quad t = 1, 2, \dots,

where Z_0 = \mu_0 and 0 < \lambda \le 1. An out-of-control signal is given when

T_t^2 = Z_t' \Sigma_{Z_t}^{-1} Z_t > h_1,

where h_1 is the limit chosen to achieve a desired in-control ARL (ARL_0) and

\Sigma_{Z_t} = \frac{\lambda}{2 - \lambda} \left[1 - (1 - \lambda)^{2t}\right] \Sigma_{X_t},

with \Sigma_{X_t} the variance-covariance matrix of X_t. [LOW 92] showed that the run length performance of the MEWMA chart depends on the off-target mean vector \mu_1 and the covariance matrix of X_t only through the value of the non-centrality parameter

\delta = \{(\mu_1 - \mu_0)' \Sigma_x^{-1} (\mu_1 - \mu_0)\}^{1/2},

where \mu_0 denotes the in-control mean vector. The Type-I and Type-II error probabilities cannot readily be employed for evaluating the MEWMA control chart performance; accordingly, the average run length (ARL) is the main performance indicator of the MEWMA chart. [LEE 06] provide a method based on the Markov chain approach for the selection of the optimal parameters \lambda and h_1, which produce the minimum out-of-control ARL (ARL_1) for a desired size of shift of interest based on a fixed ARL_0. The longer the average run length when the process is in control (ARL_0), the better the performance of the conceived chart; the shorter the average run length when the process is in an out-of-control state (ARL_1), the greater the performance attained [MOL 01].
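A rough illustration (not the authors' code) of how the MEWMA statistic above can be computed is given below; λ, the in-control mean vector µ0, the covariance matrix and the limit h1 are assumed known, and the example data are bivariate normal.

import numpy as np

def mewma_statistics(X, lam, mu0, sigma):
    # X: (T, p) array of multivariate observations (or subgroup means).
    Z = np.zeros_like(mu0, dtype=float)
    stats = []
    for t, x in enumerate(X, start=1):
        Z = lam * (x - mu0) + (1 - lam) * Z            # recursion centred at mu0
        sigma_z = lam / (2 - lam) * (1 - (1 - lam) ** (2 * t)) * sigma
        stats.append(Z @ np.linalg.solve(sigma_z, Z))  # T_t^2 = Z' Sigma_z^{-1} Z
    return np.array(stats)

# Example: bivariate N(0, sigma) data with rho = 0.5, lam = 0.13, h1 = 10.55
rng = np.random.default_rng(0)
sigma = np.array([[1.0, 0.5], [0.5, 1.0]])
X = rng.multivariate_normal(np.zeros(2), sigma, size=200)
t2 = mewma_statistics(X, lam=0.13, mu0=np.zeros(2), sigma=sigma)
first_signal = int(np.argmax(t2 > 10.55)) if np.any(t2 > 10.55) else None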

3. MCUSUM CONTROL CHART
[CRO 88] suggested two multivariate CUSUM charts. The one with the better ARL performance is based on the following statistic:

C_t = \{(S_{t-1} + X_t - a)' \Sigma_x^{-1} (S_{t-1} + X_t - a)\}^{1/2}, \quad t = 1, 2, \dots,

where S_0 = 0, k > 0 is the reference value and a is the aim point or target value for the mean vector. The cumulative sum vector is then updated as S_t = 0 if C_t ≤ k and S_t = (S_{t-1} + X_t - a)(1 - k/C_t) otherwise. The control charting statistic for the MCUSUM chart is [CRO 88]

Y_t = (S_t' \Sigma_x^{-1} S_t)^{1/2}.

A shift in the mean vector is signaled when Y_t > h_2, where h_2 > 0 is the control limit of the chart.
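A sketch of this recursion is given below for illustration; the shrinking update of S_t when C_t > k follows the rule just stated, and the parameters k = 0.5 and h2 = 6.227 are the values used later in the paper.

import numpy as np

def mcusum_statistics(X, a, sigma, k):
    # Crosier-type MCUSUM: S_t shrinks towards 0 by the factor (1 - k/C_t).
    sigma_inv = np.linalg.inv(sigma)
    S = np.zeros(len(a))
    stats = []
    for x in X:
        d = S + x - a
        C = np.sqrt(d @ sigma_inv @ d)
        S = np.zeros(len(a)) if C <= k else d * (1 - k / C)
        stats.append(np.sqrt(S @ sigma_inv @ S))      # Y_t = (S_t' Sigma^{-1} S_t)^{1/2}
    return np.array(stats)

rng = np.random.default_rng(1)
sigma = np.array([[1.0, 0.5], [0.5, 1.0]])
X = rng.multivariate_normal(np.zeros(2), sigma, size=200)
y = mcusum_statistics(X, a=np.zeros(2), sigma=sigma, k=0.5)
out_of_control = bool(np.any(y > 6.227))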


The MCUSUM procedure assumes that the multivariate observations X_t, t = 1, 2, ..., follow an independent and identically distributed (i.i.d.) multivariate normal distribution. [LEE 06] give an approach based on the Markov chain method for determining the optimal parameters, k and h_2, that give the minimum out-of-control ARL for a size of shift of interest based on a fixed ARL_0.

4. SIMULATION STUDY
Since many semiconductor processes, such as chemical mechanical processes and etch processes, follow skewed distributions, it is difficult, even impossible, to satisfy the multivariate normality assumption of the MEWMA and MCUSUM charts. Non-normality is often studied for individual observations. Based on the central limit theorem, sample means are approximately normally distributed for all reasonable distributions, so non-normality is not a major concern with large subgroups. Nevertheless, for small sample sizes, the distribution of the quality characteristics may be far from normal. In this section, the performances of the MEWMA and MCUSUM charts are analyzed when the multivariate normality assumption is not satisfied. To simplify the work, we consider only the bivariate case (p = 2). The average run lengths of the two charts are compared through their false alarm rates when the process is in control, for multivariate skewed distributions. For a better comparison, the multivariate normal distribution is also studied. The false alarm rates for the three multivariate distributions are computed with MATLAB, each based on 5000 simulation trials. We assume a false alarm rate of 0.0027 when the underlying distribution is bivariate normal. For a quick detection of the mean shift vector, we take δ = 1. For ease of computation, scale parameters of (1, 1) for (X1, X2) are selected for the Weibull and gamma distributions. Based on the procedure given in [LEE 06], the optimal parameters are found to be k = 0.5 and h2 = 6.227 for the MCUSUM chart; the optimal smoothing constant λ = 0.13 and limit h1 = 10.55 are found for the MEWMA chart using the approach described in [LEE 06]. We consider correlation coefficients ρ = 0.3, 0.5 and 0.8 for the bivariate distributions. The shape parameters of (X1, X2) are chosen so that the desired skewness levels (γ1, γ2) = {(1,1), (1,2), (1,3), (2,2), (2,3), (3,3)} are attained. The sample sizes n = 3, 5 and 7 are studied. The false alarm rates for the MEWMA and MCUSUM charts are given in Tables 1 and 2, respectively. Note that the false alarm rates marked with "*" in Tables 1 and 2 cannot be computed because the corresponding shape parameters of one of the gamma-distributed components have negative values. From Tables 1 and 2, the false alarm rates of the MEWMA and MCUSUM charts increase as the level of skewness and the correlation coefficient increase, in the case of the multivariate Weibull and gamma distributions. [KBC 10] demonstrated that the false alarm rates of the weighted standard deviation (WSD) MEWMA and WSD MCUSUM charts, based on the gamma and Weibull distributions, also increase as the level of skewness increases. It should be noted that the false alarm rate decreases as the sample size increases. In most cases, the MEWMA chart has smaller false alarm rates than the MCUSUM chart for the various levels of skewness. Therefore, we conclude that the MEWMA chart has stronger performance and greater robustness than the MCUSUM chart. These results are consistent with [MAR 10], where the MCUSUM chart showed better overall performance than the MEWMA chart under the multivariate normal distribution, with the differences becoming negligible for moderate to large shifts, while [SFZ 12] shows that the MEWMA method performs better than other methods in detecting small shifts of the process mean. If multivariate normality is questionable, then the MEWMA chart should be used [MAR 10].
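As a minimal sketch of how false alarm rates of the kind reported in Tables 1 and 2 can be estimated in the bivariate normal case (the skewed Weibull and gamma generators used by the authors are not reproduced here), the fraction of in-control MEWMA statistics exceeding h1 is computed below; taking that fraction as the false alarm rate is an assumption about the authors' procedure.

import numpy as np

def false_alarm_rate_mewma(n, rho, lam, h1, n_trials=5000, run_length=100, seed=0):
    # Monitor the subgroup mean of size n of a bivariate normal process.
    rng = np.random.default_rng(seed)
    sigma = np.array([[1.0, rho], [rho, 1.0]])
    sigma_xbar = sigma / n                       # covariance of the subgroup mean
    exceed = total = 0
    for _ in range(n_trials):
        Z = np.zeros(2)
        for t in range(1, run_length + 1):
            xbar = rng.multivariate_normal(np.zeros(2), sigma_xbar)
            Z = lam * xbar + (1 - lam) * Z
            sz = lam / (2 - lam) * (1 - (1 - lam) ** (2 * t)) * sigma_xbar
            exceed += Z @ np.linalg.solve(sz, Z) > h1
            total += 1
    return exceed / total

# e.g. false_alarm_rate_mewma(n=5, rho=0.5, lam=0.13, h1=10.55)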


TABLE 1 – False alarm rates for the MEWMA chart when λ = 0.13 and h1 = 10.55

Correlation | Distribution | (γ1, γ2) | n = 3 | n = 5 | n = 7
ρ = 0.3 | Normal | (0,0) | 0.0025830 | 0.0026450 | 0.0026600
ρ = 0.3 | Weibull | (1,1) | 0.0029040 | 0.0027250 | 0.0026130
ρ = 0.3 | Weibull | (1,2) | 0.0035350 | 0.0031600 | 0.0030100
ρ = 0.3 | Weibull | (1,3) | 0.0043070 | 0.0037940 | 0.0035010
ρ = 0.3 | Weibull | (2,2) | 0.0041670 | 0.0035550 | 0.0033540
ρ = 0.3 | Weibull | (2,3) | 0.0049280 | 0.0041320 | 0.0038270
ρ = 0.3 | Weibull | (3,3) | 0.0056780 | 0.0047290 | 0.0043020
ρ = 0.3 | Gamma | (1,1) | 0.0029980 | 0.0028110 | 0.0027850
ρ = 0.3 | Gamma | (1,2) | 0.0034710 | 0.0031680 | 0.0030860
ρ = 0.3 | Gamma | (1,3) | 0.0039560 | 0.0035100 | 0.0033090
ρ = 0.3 | Gamma | (2,2) | 0.0040630 | 0.0036180 | 0.0032550
ρ = 0.3 | Gamma | (2,3) | 0.0047300 | 0.0040400 | 0.0036070
ρ = 0.3 | Gamma | (3,3) | 0.0053240 | 0.0043790 | 0.0039400
ρ = 0.5 | Normal | (0,0) | 0.0025830 | 0.0026450 | 0.0026600
ρ = 0.5 | Weibull | (1,1) | 0.0028690 | 0.0026110 | 0.0024610
ρ = 0.5 | Weibull | (1,2) | 0.0037170 | 0.0032160 | 0.0030640
ρ = 0.5 | Weibull | (1,3) | 0.0047470 | 0.0040950 | 0.0038720
ρ = 0.5 | Weibull | (2,2) | 0.0044490 | 0.0037830 | 0.0035030
ρ = 0.5 | Weibull | (2,3) | 0.0053900 | 0.0045610 | 0.0042110
ρ = 0.5 | Weibull | (3,3) | 0.0062970 | 0.0052450 | 0.0048160
ρ = 0.5 | Gamma | (1,1) | 0.0030340 | 0.0029600 | 0.0027750
ρ = 0.5 | Gamma | (1,2) | 0.0036180 | 0.0031160 | 0.0030910
ρ = 0.5 | Gamma | (1,3) | * | * | *
ρ = 0.5 | Gamma | (2,2) | 0.0041920 | 0.0035860 | 0.0033770
ρ = 0.5 | Gamma | (2,3) | 0.0047770 | 0.0039130 | 0.0036190
ρ = 0.5 | Gamma | (3,3) | 0.0055110 | 0.0046630 | 0.0042070
ρ = 0.8 | Normal | (0,0) | 0.0025830 | 0.0026450 | 0.0026600
ρ = 0.8 | Weibull | (1,1) | 0.0033460 | 0.0027720 | 0.0025900
ρ = 0.8 | Weibull | (1,2) | 0.0049340 | 0.0041010 | 0.0036990
ρ = 0.8 | Weibull | (1,3) | 0.0070200 | 0.0060950 | 0.0057130
ρ = 0.8 | Weibull | (2,2) | 0.0057440 | 0.0046370 | 0.0041530
ρ = 0.8 | Weibull | (2,3) | 0.0071680 | 0.0059480 | 0.0054110
ρ = 0.8 | Weibull | (3,3) | 0.0081140 | 0.0068870 | 0.0062570
ρ = 0.8 | Gamma | (1,1) | 0.0032220 | 0.0030090 | 0.0029240
ρ = 0.8 | Gamma | (1,2) | * | * | *
ρ = 0.8 | Gamma | (1,3) | * | * | *
ρ = 0.8 | Gamma | (2,2) | 0.0048070 | 0.0040280 | 0.0036410
ρ = 0.8 | Gamma | (2,3) | * | * | *
ρ = 0.8 | Gamma | (3,3) | 0.0064080 | 0.0053320 | 0.0047210


TABLE 2 – False alarm rates for the MCUSUM chart when k = 0.5 and h2 = 6.227

Correlation | Distribution | (γ1, γ2) | n = 3 | n = 5 | n = 7
ρ = 0.3 | Normal | (0,0) | 0.0026930 | 0.0027220 | 0.0027220
ρ = 0.3 | Weibull | (1,1) | 0.0029440 | 0.0027420 | 0.0026800
ρ = 0.3 | Weibull | (1,2) | 0.0035020 | 0.0031400 | 0.0030080
ρ = 0.3 | Weibull | (1,3) | 0.0042500 | 0.0037290 | 0.0034920
ρ = 0.3 | Weibull | (2,2) | 0.0036650 | 0.0033290 | 0.0031400
ρ = 0.3 | Weibull | (2,3) | 0.0048150 | 0.0040180 | 0.0037430
ρ = 0.3 | Weibull | (3,3) | 0.0055340 | 0.0045740 | 0.0041640
ρ = 0.3 | Gamma | (1,1) | 0.0026930 | 0.0027220 | 0.0027220
ρ = 0.3 | Gamma | (1,2) | 0.0029440 | 0.0027420 | 0.0026800
ρ = 0.3 | Gamma | (1,3) | 0.0035020 | 0.0031400 | 0.0030080
ρ = 0.3 | Gamma | (2,2) | 0.0042500 | 0.0037290 | 0.0034920
ρ = 0.3 | Gamma | (2,3) | 0.0036650 | 0.0033290 | 0.0031400
ρ = 0.3 | Gamma | (3,3) | 0.0048150 | 0.0040180 | 0.0037430
ρ = 0.5 | Normal | (0,0) | 0.0026930 | 0.0027220 | 0.0027220
ρ = 0.5 | Weibull | (1,1) | 0.0028180 | 0.0025440 | 0.0024610
ρ = 0.5 | Weibull | (1,2) | 0.0036490 | 0.0031830 | 0.0030670
ρ = 0.5 | Weibull | (1,3) | 0.0046620 | 0.0040150 | 0.0038180
ρ = 0.5 | Weibull | (2,2) | 0.0043290 | 0.0036840 | 0.0034060
ρ = 0.5 | Weibull | (2,3) | 0.0052570 | 0.0044200 | 0.0040660
ρ = 0.5 | Weibull | (3,3) | 0.0061670 | 0.0050810 | 0.0046200
ρ = 0.5 | Gamma | (1,1) | 0.0031260 | 0.0028970 | 0.0028610
ρ = 0.5 | Gamma | (1,2) | 0.0034620 | 0.0032160 | 0.0029970
ρ = 0.5 | Gamma | (1,3) | * | * | *
ρ = 0.5 | Gamma | (2,2) | 0.0039720 | 0.0035540 | 0.0033110
ρ = 0.5 | Gamma | (2,3) | 0.0047190 | 0.0039000 | 0.0035950
ρ = 0.5 | Gamma | (3,3) | 0.0055250 | 0.0045230 | 0.0040310
ρ = 0.8 | Normal | (0,0) | 0.0026930 | 0.0027220 | 0.0027220
ρ = 0.8 | Weibull | (1,1) | 0.0030700 | 0.0026280 | 0.0024780
ρ = 0.8 | Weibull | (1,2) | 0.0047220 | 0.0039330 | 0.0036050
ρ = 0.8 | Weibull | (1,3) | 0.0068540 | 0.0059950 | 0.0056410
ρ = 0.8 | Weibull | (2,2) | 0.0054640 | 0.0044190 | 0.0039710
ρ = 0.8 | Weibull | (2,3) | 0.0068990 | 0.0056910 | 0.0051910
ρ = 0.8 | Weibull | (3,3) | 0.0078760 | 0.0065890 | 0.0059620
ρ = 0.8 | Gamma | (1,1) | 0.0032610 | 0.0029770 | 0.0028680
ρ = 0.8 | Gamma | (1,2) | * | * | *
ρ = 0.8 | Gamma | (1,3) | * | * | *
ρ = 0.8 | Gamma | (2,2) | 0.0045670 | 0.0038680 | 0.0035700
ρ = 0.8 | Gamma | (2,3) | * | * | *
ρ = 0.8 | Gamma | (3,3) | 0.0062740 | 0.0052470 | 0.0044940


5. CONCLUSIONS
In this paper, we have studied the average run length of the MEWMA and MCUSUM charts under multivariate normal and multivariate skewed distributions. The average run lengths of the charts, computed using the MATLAB program, are compared for various levels of skewness and sample sizes. The false alarm rates of both the MEWMA and MCUSUM charts vary with the skewness of the underlying distribution, and they are also affected by the sample size and the correlation of the quality characteristics. The simulation results showed that the MEWMA chart has a smaller false alarm rate than the MCUSUM chart when the underlying distribution is skewed. Since it is well known that both the MEWMA and MCUSUM charts have comparable performance in the detection of small shifts when the underlying process is multivariate normally distributed, the use of the MEWMA chart in process monitoring is advocated because the MEWMA chart is more robust towards skewed samples.
6. References
[CHE 41] Cheriyan, K.C. (1941). A bivariate correlated gamma-type distribution function. Journal of the Indian Mathematical Society, 5, 133-144.
[HOT 47] Hotelling, H. (1947). Multivariate quality control. Techniques of Statistical Analysis, Eisenhart, Hastay and Wallis (eds.), McGraw-Hill, New York.
[KOT 00] Kotz, S., Balakrishnan, N. and Johnson, N.L. (2000). Continuous Multivariate Distributions, Vol. 1, 2nd ed., John Wiley, New York.
[KBC 10] Khoo, M.B.C., Teh, S.Y. and Eng, M.Y. (2010). A comparison of multivariate control charts for skewed distributions using weighted standard deviations. Malaysian Journal of Fundamental and Applied Sciences, 6(1).
[MAR 10] Waterhouse, M., Smith, I., Assareh, H. and Mengersen, K. (2010). Implementation of multivariate control charts in a clinical setting. International Journal for Quality in Health Care, 22(5), 408-414.
[MOL 01] Molnau, W.E., Montgomery, D.C. and Runger, G.C. (2001). Statistically constrained designs of the multivariate exponentially weighted moving average control chart. Quality and Reliability Engineering International, 7, 39-49.
[SFZ 12] Saber Fallah Nezhad, M. (2012). A new EWMA monitoring design for multivariate quality control problem. International Journal of Advanced Manufacturing Technology, 62, 751-758.
[LEE 79] Lee, L. (1979). Multivariate distributions having Weibull properties. Journal of Multivariate Analysis, 9, 267-277.
[LEE 06] Lee, M.H. and Khoo, M.B.C. (2006a). Optimal statistical design of a multivariate EWMA chart based on ARL and MRL. Communications in Statistics – Simulation and Computation, 35, 831-847.
[LEE 06] Lee, M.H. and Khoo, M.B.C. (2006b). Optimal statistical design of a multivariate CUSUM chart. International Journal of Reliability, Quality and Safety Engineering, 13, 479-497.
[LOW 92] Lowry, C.A., Woodall, W.H., Champ, C.W. and Rigdon, S.E. (1992). A multivariate exponentially weighted moving average control chart. Technometrics, 34, 46-53.
[RAM 51] Ramabhadran, V.R. (1951). A multivariate gamma-type distribution. Sankhya, 11, 45-46.
[WOO 99] Woodall, W.H. and Montgomery, D.C. (1999). Research issues and ideas in statistical process control. Journal of Quality Technology, 31, 376-386.


The multi-SOM algorithm based on SOM method
Imen KHANCHOUCH (1,2), Khaddouja BOUJENFA (1,3), Amor MESSAOUD (4), Mohamed LIMAM (1,5)
(1) LARODEC, ISG, University of Tunis
(2) [email protected] (3) [email protected] (4) [email protected] (5) [email protected]

ABSTRACT: This paper proposes a clustering algorithm based on the multi-SOM approach. To find the optimal number of clusters, our algorithm uses the Davies-Bouldin (DB) index, which has not been used previously in the multi-SOM. The proposed algorithm is compared to three clustering methods using different public databases. The results show that our algorithm performs as well as the competing methods.
KEYWORDS: Clustering, SOM, multi-SOM, DB index.

1. Introduction

Clustering is an unsupervised learning technique that is the subject of much recent research. It aims to obtain homogeneous partitions of objects while promoting heterogeneity between partitions. Many categories of clustering methods exist, such as hierarchical [Zhang1996], partition-based [MacQueen1976], density-based [Ester1996] and neural network methods [Kohonen1981]. Hierarchical methods build a hierarchy of clusters with many levels. Two types of hierarchical clustering approaches exist: agglomerative methods (bottom-up) and divisive methods (top-down). Agglomerative methods start with the data objects taken as individual clusters, which are then successively merged two by two until a single partition containing all the objects is obtained. Divisive methods, in contrast, begin with the whole data set as one cluster and successively divide it into smaller clusters. Hierarchical methods are time consuming in the presence of large amounts of data; the resulting dendrogram is very large and may include incorrect information. Partitioning methods divide the data set into disjoint partitions, where each partition represents a cluster. Clusters are formed so as to optimize an objective partitioning criterion, often called a similarity function, such as a distance, and each cluster is represented by a centroid or a representative object. Partitioning methods suffer, however, from sensitivity to initialization: an inappropriate initialization may lead to bad results. On the other hand, they are faster than hierarchical methods. Density-based clustering methods aim to discover clusters with different shapes. They are based on the assumption that regions with high density constitute clusters, which are separated by regions with low density, and on the concept of a cloud of points with higher density, where the neighborhood of a point is defined by a distance threshold or a number of nearest neighbors.


Neural networks (NN) are complex systems with a high degree of interconnection between neurons. Unlike hierarchical and partitioning clustering methods, NN can handle large amounts of high-dimensional data. Neural Gas is an artificial neural network proposed by [Martinetz1991], based on feature vectors, to find optimal representations of input data; the algorithm's name refers to the dynamicity of the feature vectors during the adaptation process. SOM [Kohonen1981] is the most commonly used NN method based on competitive learning. In the training process, the nodes compete to be the most similar to the input vector. The Euclidean distance is commonly used to measure distances between input vectors and output node weights. The node with the minimum distance is the winner, also known as the Best Matching Unit (BMU); it is the SOM unit having the closest weight to the current input vector after computing the Euclidean distance from each existing weight vector to the chosen input record. The neighbors of the BMU on the map are then determined and adjusted. The main function of SOM is to map the input data from a high-dimensional space to a lower-dimensional one. The SOM method is appropriate for the visualization of high-dimensional data, allowing a reduction of the data and of its complexity. However, the SOM map is insufficient to define the boundaries of each cluster, since there is no clear separation of the data items; extracting partitions from the SOM grid is thus a crucial task. Along with this shortcoming, SOM requires initializing the topology and the size of the grid, and the choice of the size strongly affects the generalization of the method. An extension of this method, namely multi-SOM, which clusters the SOM grid itself and consists of several levels of SOM grids, can overcome these shortcomings and give the optimal number of clusters without any initialization. This paper is structured as follows. Section 2 describes the multi-SOM approach, Section 3 details the proposed algorithm, experimental results on real datasets are given in Section 4, and finally conclusions and some future work are given in Section 5.

2. The multi-SOM approach

The multi-SOM method was first introduced by [Lamirel2001] for scientific and technical information analysis, specifically for patents on transgenic plants aimed at improving the resistance of plants to pathogen agents. The authors proposed an extension of SOM, called multi-SOM, to introduce the notion of viewpoints into information analysis, with multiple map visualization and dynamicity. A viewpoint is defined as a partition of the analyst's reasoning. The objects in a partition may be homogeneous or heterogeneous and are not necessarily similar, whereas objects in a cluster are similar and homogeneous, and a criterion of similarity is inevitably used. Each map in multi-SOM represents a viewpoint, and the information in each map is represented by nodes (classes) and logical areas (groups of classes). [Lamirel2002] applied multi-SOM to an iconographic database; iconography is the collected representation illustrating a subject, which can be an image or a text document. The multi-SOM model was then applied to the domain of patent analysis in [Lamirel2003] and [Lamirel2006]. The experiments use a database of one thousand patents about oil engineering technology and indicate the efficiency of viewpoint-oriented analysis, where the selected viewpoints correspond to the uses, advantages, patentees and titles subfields of the patents. A patent is an official document conferring a right. [Smith2009] applied multi-SOM to a zoo data set from the UCI repository to illustrate a technique combining multiple SOMs, which visualizes the different feature maps of the zoo data with color-coded clusters superimposed. The multi-SOM algorithm provides good map coverage with minimal topological defects, but it does not facilitate the integration of new data dimensions. [Ghouila2008] applied the multi-SOM algorithm to macrophage gene expression analysis. Their proposed algorithm overcomes some weaknesses of clustering methods, namely the estimation of the number of clusters in partitioning methods and the delimitation of partitions from the output grid of the SOM algorithm. The idea of [Ghouila2008] consists of obtaining compact and well separated clusters using an evaluation criterion, namely the Dynamic Validity Index (DVI). The DVI metric is derived from compactness and separation properties; compactness and separation are two criteria used to evaluate clustering quality and to select the optimal clustering layer. Compactness is assessed by the intra-distance variability, which should be minimized, and separation is assessed by the inter-distance between two clusters, which should be maximized. The DVI metric is given by


DVI = \min_{k=1..K} \{ IntraRatio(k) + \gamma\, InterRatio(k) \},   [1]

where k denotes the number of activated nodes on the layer and γ is a modulating parameter.

FIGURE 1 – Architecture of the proposed multi-SOM.

There are many evaluation criteria such as the DVI. [Fonseka2010] used the DB index to measure cluster quality in the Multi-Layer Growing SOM (GSOM) algorithm for expression data analysis. The DB index quantifies the compactness of the clusters and how well separated they are. GSOM is a neural network algorithm which belongs to the hierarchical agglomerative clustering (HAC) family; it is based on the SOM approach but starts with a minimum number of nodes, usually four, and grows by adding a new node at every iteration. In this work we have chosen the DB index because it is an internal criterion based on the compactness and separation of the clusters, and it has been widely used in many works but not yet in the multi-SOM algorithm. The DB index is given by

DB = \frac{1}{c} \sum_{i=1}^{c} \max_{i \neq j} \left[ \frac{d(X_i) + d(X_j)}{d(c_i, c_j)} \right]   [2]

where c is the number of clusters, i and j are cluster indices, d(Xi) and d(Xj) are the average distances of all objects in clusters i and j to their respective cluster centroids, and d(ci, cj) is the distance between the centroids. Smaller values of the DB index indicate better clustering quality.
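A direct transcription of equation [2] might look as follows; this is only a sketch, in which the centroid of each cluster is taken as its mean and d(Xi) as the average distance of the cluster's objects to that centroid.

import numpy as np

def davies_bouldin(X, labels):
    clusters = np.unique(labels)
    centroids = np.array([X[labels == k].mean(axis=0) for k in clusters])
    # d(X_i): average distance of the objects of cluster i to its centroid
    scatter = np.array([np.mean(np.linalg.norm(X[labels == k] - c, axis=1))
                        for k, c in zip(clusters, centroids)])
    c = len(clusters)
    db = 0.0
    for i in range(c):
        ratios = [(scatter[i] + scatter[j]) / np.linalg.norm(centroids[i] - centroids[j])
                  for j in range(c) if j != i]
        db += max(ratios)                    # worst-case similarity for cluster i
    return db / c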

3. The proposed multi-SOM algorithm

The proposed algorithm (Algorithm 1) aims to find the optimal number of clusters using the DB index as an evaluation criterion.


Given a dataset, the multi-SOM algorithm first uses the SOM approach to cluster the dataset and to generate the first SOM grid, namely SOM1, and then this grid is iteratively clustered, as shown in Figure 1. The grid height (Hs) and the grid width (Ws) of SOM1 are given by the user. The multi-SOM algorithm uses the batch map function proposed by [Fort2002], a version of Kohonen's SOM algorithm that is faster in the training process. The SOM map is then clustered iteratively from one level to the next, and the DB index is computed at each level. The size of the grid decreases at each level until the optimal number of clusters is reached. In [Ghouila2008], the training process stops when a single neuron is reached, whereas our proposed algorithm stops when the DB index reaches its minimum value; at this value the optimal number of clusters is obtained. Consequently, the proposed algorithm uses less computation time than the one proposed by [Ghouila2008]. From the formula of the DB index, its time complexity is O(DB) = O(C), where C is the number of clusters. This is less than the time complexity of the DVI, O(DVI) = O(InterRatio + IntraRatio), which involves many operations to compute the intra- and inter-distances. Thus, computing DVI values at each grid requires more memory space and time than computing the DB index. The different steps of the algorithm are as follows:

Algorithm 1: multi-SOM
Input: W1, H1, I1, max_it
Output: Optimal cluster number
Begin

Step 1: Clustering the data by SOM
    s = 1;
    Batch SOM(W1, H1, I1, max_it);
    Compute DB index;
    s = s + 1;
Step 2: Clustering of the SOM and cluster delimitation
    Hs = Hs − 1; Ws = Ws − 1;
    repeat
        Batch SOM(Ws, Hs, Is, max_it);
        Compute DB index on each SOM grid;
        s = s + 1;
    until (DBs > DBs−1);
End
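A rough sketch of this loop is given below, using the MiniSom library as a stand-in for the Batch SOM step and the scikit-learn DB index; the library choice, the grid-shrinking schedule and the stopping bookkeeping are illustrative assumptions, not the authors' implementation.

import numpy as np
from minisom import MiniSom                      # assumed SOM implementation
from sklearn.metrics import davies_bouldin_score

def multi_som(X, W1, H1, max_it=500, seed=0):
    # Shrink the grid level by level and keep the clustering with minimum DB index.
    w, h = W1, H1
    best_db, best_labels = np.inf, None
    while w >= 2 and h >= 2:
        som = MiniSom(w, h, X.shape[1], sigma=1.0, learning_rate=0.5, random_seed=seed)
        som.train_batch(X, max_it)
        labels = np.array([np.ravel_multi_index(som.winner(x), (w, h)) for x in X])
        if len(np.unique(labels)) < 2:
            break                                 # a single active node: stop
        db = davies_bouldin_score(X, labels)
        if db < best_db:
            best_db, best_labels = db, labels
        else:
            break                                 # DB started to increase: stop
        w, h = w - 1, h - 1
    return best_labels, best_db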
