Nonnegative Matrix Factorization for Binary Data to Extract Elementary Failure Maps from Wafer Test Images

Reinhard Schachtner, Gerhard Pöppel, and Elmar Lang

Abstract We introduce a probabilistic variant of nonnegative matrix factorization (NMF) for binary datasets. We consider binary coded images as a probabilistic superposition of underlying continuous-valued basic patterns, and provide an extension of the well-known NMF procedure to binary-valued datasets that solves the related optimization problem under nonnegativity constraints. We demonstrate the performance of our method by applying it to the detection and characterization of hidden causes for failures during wafer processing: we decompose binary coded (pass/fail) wafer test data into underlying elementary failure patterns and study their influence on the quality of single wafers.

Keywords Binary data · Failure patterns · Nonnegative matrix factorization

1 Introduction

Manufacturing a microchip requires up to hundreds of production steps, depending on the complexity of its components. Lifetime, performance speed and other quality aspects define a set of specifications tailored to the intended application field. The overall functionality of the completed chips is measured in a test series after the last step of production. A chip is labelled "pass" if it satisfies all investigated features, and "fail" otherwise. A disordered or wrongly calibrated production machine can cause a whole series of chips to fail a quality check. The identification and explanation of such systematic errors is a highly interesting and nontrivial problem: while several individual root causes can be the responsible trigger for a dropped-out device, in every case only the overall "pass"/"fail" information for the chip is available. In this paper we introduce a new method which models the systematic part of the errors as a superposition of individual failure causes.

Reinhard Schachtner: Infineon Technologies AG, 93049 Regensburg, Germany, and Institute of Biophysics, University of Regensburg, 93040 Regensburg, Germany, e-mail: [email protected]


1.1 Notation

Measurement data from N wafers constitute a binary N × M data matrix X, each row of which contains all M chips of one wafer, aligned. The entry X_ij indicates whether chip j on wafer i has passed all functionality tests (0) or failed at least one of them (1). In the following, we write X_i· for the i-th row and X_·j for the j-th column of X, i.e., one whole wafer i or one chip position j across all wafers, respectively.

2 Nonnegative Matrix Factorization

Nonnegative matrix factorization (NMF) is a very popular technique for the analysis of real-valued multivariate datasets. In the context of Blind Source Separation, NMF is intended to explain a data generating process as a strictly additive superposition of nonnegative sources. A nonnegative N × M data matrix X is approximated by an N × K matrix W and a K × M matrix H such that

$$ X \approx WH \qquad (W, H \ge 0). \tag{1} $$

The number of basis components K is usually chosen such that (N + M)K ≪ NM. In that case, the product WH can be regarded as a compressed version of the original data X (Lee & Seung, 1999). Technically, the task of NMF can be formulated as an optimization problem minimizing a suitable cost function, such as the squared Euclidean distance

$$ f(W, H) := \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{M} \left( X_{ij} - [WH]_{ij} \right)^2 \tag{2} $$

subject to the nonnegativity constraints W, H ≥ 0. Other cost functions, such as the Kullback–Leibler (Lee & Seung, 1999), Bregman (Dhillon & Sra, 2006) or Csiszár (Cichocki, Zdunek, & Amari, 2006) divergences, have been proposed in the literature. Additional sparseness or smoothness parameters which enforce solutions with desired characteristics, as well as a variety of optimization techniques to achieve the desired matrix decomposition, have also been discussed (see, e.g., Berry, Browne, Langville, Pauca, & Plemmons, 2007; Cichocki, Zdunek, & Amari, 2008 for a survey).

2.1 Alternating Least Squares Algorithm for NMF

A very popular method to minimize the squared Euclidean distance f(W, H) (2) is the Alternating Least Squares (ALS) procedure. It can be summarized by the following steps (see, e.g., Berry et al., 2007):


Initialize W at random.
For i = 1 to maxiter:
    Solve for H in the matrix equation ∂f/∂H = 0.   (3)
    Set all negative elements in H to 0.            (4)
    Solve for W in the matrix equation ∂f/∂W = 0.   (5)
    Set all negative elements in W to 0.            (6)

ALS procedures which properly enforce the nonnegativity of W and H can be proven to converge to a local minimum of the cost function (Berry et al., 2007). Unfortunately, the crude projection onto the nonnegative orthant after every optimization step can cause convergence problems. In case of convergence, however, projected ALS is extremely fast: even computing several runs from different random initializations still outperforms other methods, such as gradient descent or multiplicative update rules, with respect to the required computational time. Despite its convergence problems, the projected ALS method is therefore very attractive for NMF applications (see Cichocki et al., 2008).
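As an illustration, a minimal NumPy sketch of projected ALS for the cost function (2) might read as follows; the function name, the iteration count and the use of least-squares solves are our illustrative choices, not a prescription from the references above.

```python
import numpy as np

def projected_als_nmf(X, K, maxiter=200, seed=0):
    """Projected ALS sketch for X ~ WH with W, H >= 0 (cost function (2))."""
    rng = np.random.default_rng(seed)
    N, M = X.shape
    W = rng.random((N, K))
    H = rng.random((K, M))
    for _ in range(maxiter):
        # Solve df/dH = 0 in the least-squares sense at fixed W ...
        H, *_ = np.linalg.lstsq(W, X, rcond=None)
        H = np.maximum(H, 0.0)          # ... then project onto H >= 0
        # Solve df/dW = 0 at fixed H (via the transposed system) ...
        Wt, *_ = np.linalg.lstsq(H.T, X.T, rcond=None)
        W = np.maximum(Wt.T, 0.0)       # ... then project onto W >= 0
    return W, H
```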

3 NMF for Binary Datasets

3.1 Generative Model

The measurement outcome "fail" of a microchip can have several possible causes. Here we assume that the data is generated by K individual root causes which act simultaneously without influencing each other, and that the probability for a chip to be "pass" can be expressed as

$$ P(\text{"pass"}) = \prod_{k=1}^{K} P(\text{"pass"} \mid \text{root cause } k). \tag{7} $$

Furthermore, two aspects of the data generating process have to be considered:

1. Each root cause can have a varying impact on different wafers.
2. A root cause can be related to a characteristic pattern on a wafer. Such a pattern manifests itself in a certain "fail" likeliness for each chip position.

Incorporating both aspects, we employ parameters W_ik ≥ 0 and H_kj ≥ 0 to describe the probability that chip j of wafer i is "pass" as

$$ P(X_{ij} = 0) = \prod_{k=1}^{K} e^{-W_{ik} H_{kj}} = e^{-[WH]_{ij}}, \tag{8} $$

where we recognize the product of nonnegative matrices in the exponent of the last expression. The parameters W_ik reflect the influence of root cause k on wafer i, such that W_rk < W_sk means that root cause k is more strongly expressed on wafer X_s· than on X_r·. We refer to the row vector H_k· = (H_k1, …, H_kM) ≥ 0 as pattern k, where H_kl < H_km implies that a 1 is more likely to be observed at chip position m than at position l due to root cause k. In our description, the term probability is avoided for the parameters W_ik and H_kj because of scaling indeterminacies; only the terms e^{−W_ik H_kj} have a probabilistic interpretation. Summarizing, the "pass"/"fail" probabilities given the hidden root causes are (see also the model summary in Fig. 1)

$$ P(X_{ij} = 0 \mid H, W) = e^{-[WH]_{ij}}, \tag{9} $$
$$ P(X_{ij} = 1 \mid H, W) = 1 - e^{-[WH]_{ij}}. \tag{10} $$

Fig. 1 Summary of the model:
    N — number of objects (wafers); M — dimension (chips per wafer); K — number of basic patterns, K ≪ min(N, M)
    X — N × M binary data matrix, X_ij ∈ {0, 1} (0: "pass", 1: "fail")
    W — N × K coefficient matrix; W_ik ∈ [0, ∞[ is the weight of pattern k in object i
    H — K × M pattern matrix; H_kj ∈ [0, ∞[ is the value of pattern k at position j
    P(X_ij = 0 | H, W) = e^{−[WH]_ij} ∈ ]0, 1]
    P(X_ij = 1 | H, W) = 1 − e^{−[WH]_ij} ∈ [0, 1[

Both matrices W and H are nonnegative and are related to the binary matrix X as described. The challenge of finding these matrices can thus be viewed as an extension of NMF to this kind of binary dataset.

3.2 Bernoulli Likelihood

The Bernoulli likelihood is a natural choice for modelling binary data. Denoting by p_ij the probability that X_ij = 1, the Bernoulli likelihood of one entry is

$$ P(X_{ij} \mid H, W) = p_{ij}^{X_{ij}} (1 - p_{ij})^{1 - X_{ij}}. \tag{11} $$

Together with (10) this leads to the overall log-likelihood

$$ LL := \sum_{i=1}^{N} \sum_{j=1}^{M} X_{ij} \ln\!\left(1 - e^{-[WH]_{ij}}\right) - [WH]_{ij} + X_{ij} [WH]_{ij}. \tag{12} $$

In Kabán, Bingham, and Hirsimäki (2004), a symmetric linear model is used to approximate the Bernoulli parameter of a similar problem; the authors use an EM-type approach to maximize a lower bound on the log-likelihood. Here, we propose a completely different optimization strategy: we combine an alternating gradient ascent algorithm in the variables W and H with a preceding search for appropriate initial values, in order to reduce the risk of getting stuck in "poor" local maxima. Note that this does not imply that the algorithm will necessarily find the global maximum.
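In code, the log-likelihood (12) can be evaluated directly; the small floor on the fail probability below is a numerical safeguard of our sketch, not part of the model.

```python
import numpy as np

def binary_nmf_loglik(X, W, H):
    """Evaluate the log-likelihood (12) of binary data X under the model
    P(X_ij = 1) = 1 - exp(-[WH]_ij)."""
    A = W @ H                                        # [WH]_ij >= 0
    p_fail = np.clip(1.0 - np.exp(-A), 1e-12, None)  # P(X_ij = 1), floored
    # LL = sum_ij X_ij ln(1 - e^{-A_ij}) - (1 - X_ij) A_ij, cf. (12)
    return np.sum(X * np.log(p_fail) - (1.0 - X) * A)
```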

3.3 Optimizing the Log-Likelihood

3.3.1 Alternating Gradient Ascent Algorithm

After a suitable initialization of the parameter matrices W and H, an iterative gradient ascent scheme for the log-likelihood (12) is given by

$$ W_{ik} \leftarrow W_{ik} + \eta_W \frac{\partial LL}{\partial W_{ik}}, \tag{13} $$
$$ H_{kj} \leftarrow H_{kj} + \eta_H \frac{\partial LL}{\partial H_{kj}}. \tag{14} $$

While one of the matrices is updated, the other one is kept fixed. Due to the nonnegativity constraints on all W_ik, H_kj, the stepsize parameters η_W and η_H have to be controlled carefully. For small stepsizes, however, convergence can be extremely slow, and even in the unconstrained case gradient ascent algorithms can only be guaranteed to find a local maximum for sufficiently small η_W, η_H. In particular, the logarithm in (12) can cause serious global convergence problems by inducing local maxima in the log-likelihood function: single entries X_ij = 1 with a small probability 1 − e^{−[WH]_ij} may trap the optimization algorithm. In the following, we derive a strategy for finding a "good" starting point for the alternating gradient ascent.
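A minimal sketch of the alternating scheme (13)-(14) is given below; the fixed stepsizes, the clipping floor, and the projection onto the nonnegative orthant are simplifications of the careful stepsize control just discussed, and in practice η_W, η_H need tuning.

```python
import numpy as np

def alternating_gradient_ascent(X, W, H, eta_W=1e-3, eta_H=1e-3, maxiter=1000):
    """Alternating gradient ascent (13)-(14) on the log-likelihood (12)."""
    def grad_wrt_A(W, H):
        # dLL/d[WH]_ij = X_ij e^{-A_ij} / (1 - e^{-A_ij}) - (1 - X_ij)
        A = np.clip(W @ H, 1e-12, None)
        E = np.exp(-A)
        return X * E / (1.0 - E) - (1.0 - X)
    for _ in range(maxiter):
        G = grad_wrt_A(W, H)
        W = np.maximum(W + eta_W * (G @ H.T), 0.0)   # update W, H fixed
        G = grad_wrt_A(W, H)
        H = np.maximum(H + eta_H * (W.T @ G), 0.0)   # update H, W fixed
    return W, H
```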


3.3.2 Alternating Least Squares on a Simplified Problem

In order to obtain suitable initial matrices W and H, we apply a preceding standard NMF to a simplified version of the true optimization task. To this end, we introduce an auxiliary variable α ∈ ]0, 1[ and set

$$ P(X_{ij} = 1) = \begin{cases} 0 & \text{if } X_{ij} = 0 \\ \alpha & \text{if } X_{ij} = 1 \end{cases} \qquad \text{for all } i, j. \tag{15} $$

Thus α can be regarded as an averaged probability P(X_ij = 1), given that the observed realization was X_ij = 1. For all (i, j) this can be summarized by

$$ \alpha X_{ij} = 1 - e^{-[WH]_{ij}} \iff -\ln(1 - \alpha X_{ij}) = [WH]_{ij}. \tag{16} $$

Note that the left hand side of the last equation is always nonnegative since α ∈ ]0, 1[. Substituting X'_ij := −ln(1 − αX_ij), we recover a standard NMF problem X' ≈ WH. We choose the squared Euclidean distance as cost function,

$$ E(\alpha, W, H) = \sum_{i=1}^{N} \sum_{j=1}^{M} \left( \ln(1 - \alpha X_{ij}) + [WH]_{ij} \right)^2, \tag{17} $$

and apply the Alternating Least Squares algorithm described in Sect. 2.1 to minimize (17) with respect to W ≥ 0 and H ≥ 0. The ALS updates are given by

$$ H_{rs} \leftarrow \max\Big\{\epsilon,\; -\sum_{i=1}^{N} \big[(W^T W)^{-1} W^T\big]_{ri} \ln(1 - \alpha X_{is})\Big\}, \tag{18} $$

$$ W_{lm} \leftarrow \max\Big\{\epsilon,\; -\sum_{j=1}^{M} \ln(1 - \alpha X_{lj}) \big[H^T (H H^T)^{-1}\big]_{jm}\Big\}. \tag{19} $$

To avoid getting stuck in local minima of the cost function (17), the procedure is repeated with different random initializations of H and W, and only the solution with the smallest Euclidean distance is retained.
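A compact sketch of this initialization step, realizing the updates (18)-(19) via pseudo-inverses, might read as follows; the floor ε and the iteration count are illustrative values, and a full run would repeat the call over several seeds as described above.

```python
import numpy as np

def als_init(X, K, alpha, maxiter=100, eps=1e-9, seed=0):
    """ALS on the simplified problem (17): fit -ln(1 - alpha*X) ~ WH."""
    rng = np.random.default_rng(seed)
    N, M = X.shape
    Xp = -np.log(1.0 - alpha * X)                      # X'_ij = -ln(1 - alpha X_ij)
    W = rng.random((N, K))
    H = rng.random((K, M))
    for _ in range(maxiter):
        # pinv(W) = (W^T W)^{-1} W^T for full column rank, cf. update (18)
        H = np.maximum(eps, np.linalg.pinv(W) @ Xp)
        # pinv(H) = H^T (H H^T)^{-1} for full row rank, cf. update (19)
        W = np.maximum(eps, Xp @ np.linalg.pinv(H))
    return W, H
```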

3.3.3 Determining the Parameter α

Note that the global minimum of (17) as a function of α, W and H is given by E = 0 for α → 0 and W, H → 0, independently of the data X. We therefore determine the optimal α via the log-likelihood of the estimated W(α), H(α). If the parameter α is chosen too small, the probabilities P(X_ij = 1) are consistently underestimated and the related log-likelihood will be small. On the other hand, a large α ≈ 1 leads to very large values [WH]_ij wherever X_ij = 1. Due to the matrix product, this implies an increase of the whole column H_·j and/or row W_i· at the expense of the reconstruction accuracy for the zeros in the same row and column (X_is = 0, X_rj = 0). From simulations on toydata sets (see Sect. 4.1 for details), we observed that the best log-likelihood LL(X, W(α), H(α)) obtained among several randomly initialized runs resembles a concave function of α (see Fig. 2). Thus, a Golden Section Search procedure can be applied to find the optimal α in a reasonable number of trials and a reasonable computational time.

Fig. 2 Log-likelihood of the approximations computed by the ALS method as a function of α, for 10 random initializations. The best value is obtained for α = 0.87 in this example. The horizontal line denotes the true log-likelihood
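Assuming the unimodal shape observed in Fig. 2, a Golden Section Search over α can be sketched as follows; it reuses the als_init and binary_nmf_loglik sketches above, and the bracket, tolerance and number of restarts are illustrative choices.

```python
import numpy as np

def best_alpha(X, K, lo=0.01, hi=0.99, tol=0.01, n_init=5):
    """Golden Section Search for the alpha maximizing LL(X, W(alpha), H(alpha))."""
    def score(alpha):
        # Best log-likelihood over several random initializations.
        return max(binary_nmf_loglik(X, *als_init(X, K, alpha, seed=s))
                   for s in range(n_init))

    g = (np.sqrt(5.0) - 1.0) / 2.0          # inverse golden ratio
    a, b = lo, hi
    x1, x2 = b - g * (b - a), a + g * (b - a)
    f1, f2 = score(x1), score(x2)
    while b - a > tol:
        if f1 < f2:                          # maximum lies in [x1, b]
            a, x1, f1 = x1, x2, f2
            x2 = a + g * (b - a)
            f2 = score(x2)
        else:                                # maximum lies in [a, x2]
            b, x2, f2 = x2, x1, f1
            x1 = b - g * (b - a)
            f1 = score(x1)
    return 0.5 * (a + b)
```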

3.3.4 Semi-supervised Mode

The algorithm presented above can easily be run in a semi-supervised fashion: a fixed pattern of interest is stored in one row of H at initialization, and during the optimization only the remaining, randomly initialized K − 1 rows of H are updated, while the updates for W remain as usual. For example, an uninformative constant pattern of ones H_k0· = (1, …, 1) can be used to model uniformly distributed noise, as sketched below.
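In a sketch, pinning such a pattern amounts to restoring one row of H after every H-update; the helper below is a hypothetical convenience and not part of the algorithms above.

```python
import numpy as np

def pin_pattern(H, pattern, row=-1):
    """Overwrite one row of H with a fixed pattern of interest.

    Call once at initialization and again after every H-update, so that
    effectively only the remaining K-1 rows are optimized (sketch)."""
    H[row, :] = pattern
    return H

# Example: an uninformative constant pattern modelling uniform noise.
# H = pin_pattern(H, np.ones(H.shape[1]))
```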

3.4 Other Cost Functions

The Bernoulli likelihood discussed above is not the only way to handle the binary NMF problem. We also experimented with the following class of cost functions:

$$ E_{pq}(X, W, H) := \sum_{i=1}^{N} \Bigg( \sum_{j=1}^{M} \Big| X_{ij} - 1 + e^{-[WH]_{ij}} \Big|^p \Bigg)^{q}, \qquad p, q > 0, \tag{20} $$

a special case of which is simply the squared Euclidean distance

$$ E_{21}(X, W, H) := \sum_{i=1}^{N} \sum_{j=1}^{M} \left( X_{ij} - 1 + e^{-[WH]_{ij}} \right)^2. \tag{21} $$


In our experience, optimizing cost functions of this type also yields useful decompositions of binary datasets. In this paper, however, we focus on the Bernoulli likelihood approach.
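For reference, the cost family (20) is straightforward to evaluate; the sketch below only computes the cost and does not include the corresponding update rules.

```python
import numpy as np

def epq_cost(X, W, H, p=2.0, q=1.0):
    """Cost family (20); p = 2, q = 1 yields the squared Euclidean distance (21)."""
    R = np.abs(X - 1.0 + np.exp(-(W @ H)))    # |X_ij - 1 + e^{-[WH]_ij}|
    return np.sum(np.sum(R ** p, axis=1) ** q)
```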

4 Results

4.1 Toydata Example

First, we demonstrate the performance of the above algorithm on a constructed toydata example. K = 4 fixed failure patterns H_1·, …, H_4· were created, each constituting a square 30 × 30 image (see Fig. 3, left-hand side); entry H_kj is represented by the greyscale value of pixel j in pattern k (j = 1, …, 900). In this example we use three binary patterns (white: 0, black: 1) and one pattern whose values grade from zero in the center to one at the edges. By means of a randomly generated 1,000 × 4 coefficient matrix W, the failure probabilities p_ij = 1 − e^{−[WH]_ij} were constructed. Finally, a binary data matrix X of realizations in {0, 1} was created by setting the (i, j)-th entry to 1 with probability p_ij (see Fig. 4 for examples). Using the binary matrix X and the correct number of sources K = 4 as input for the ALS algorithm, the best log-likelihood value was achieved for α = 0.87. As displayed in the center image of Fig. 3, the simplified ALS algorithm already yields quite good approximations of the original source patterns in this example. Feeding the ALS solutions as starting points into the Alternating Gradient Ascent, a nearly perfect reconstruction of the original patterns is achieved after 1,000 iterations (Fig. 3, right). Note that in the images, W and H are rescaled such that the maximum value in each pattern H_k· equals one.
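The construction of such a toy dataset can be sketched as follows; the paper only states that W was generated randomly, so the exponential draw and its scale are our assumptions.

```python
import numpy as np

def make_toydata(H, n_wafers=1000, w_scale=2.0, seed=0):
    """Sample binary toy data X with p_ij = 1 - exp(-[WH]_ij), cf. Sect. 4.1."""
    rng = np.random.default_rng(seed)
    K, M = H.shape
    W = w_scale * rng.exponential(size=(n_wafers, K))   # nonnegative coefficients
    p_fail = 1.0 - np.exp(-(W @ H))                     # failure probabilities
    X = (rng.random((n_wafers, M)) < p_fail).astype(int)
    return X, W, p_fail
```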

Fig. 3 Left: 30 × 30 source patterns H_k· valued in [0, 1] (white: 0, black: 1). Center: Reconstructions gained via ALS. Right: Maximum likelihood solutions


Fig. 4 Toydata examples. Top: Original coefficients W_i·. Second row: Binary realizations X_i·. Third row: Coefficients gained by ALS. Bottom: Coefficients after refinement by Gradient Ascent

The top row of Fig. 4 shows the original randomly generated coefficients W_ik, while the second row shows the corresponding binary images X_i·. As an example, the left image in the second row can be represented by the third and fourth basis components, while the second image consists of the fourth component only. The last two rows contain the coefficients estimated by the ALS method and after refinement by Alternating Gradient Ascent.

4.2 Real World Example

Finally, we demonstrate the performance of our method on a real world dataset. The data stems from a special kind of measurement aimed at identifying latent structures and detecting potential failure causes at an early stage. Given a set of N = 3,043 wafers, each containing M = 500 chips, we estimated K = 4 source patterns H_1·, …, H_4· and the related weight coefficients W_·1, …, W_·4 (see Fig. 5). We identified four clearly distinguishable patterns of different characteristics: the first source pattern shows a region of higher fail probability on the upper side of the wafer; the second pattern constitutes a ring of fails in the edge zone; the third pattern is a repeated structure consisting of a group of neighboring fails at constant distance from each other; and the fourth pattern is a bead centered on the wafer. The related W matrix stores the activity of each of the four putative sources on each wafer separately. This new representation of the data contrasts wafers affected by the detected sources with unaffected ones, and is intended to support the detection of potential error causes.


Fig. 5 Estimated source patterns H_1·, …, H_4· (left) and contribution coefficients W_·1, …, W_·4 (right) for a real-data example comprising 3,043 wafers and 500 chips per wafer

5 Conclusion

We introduced a probabilistic framework to model systematic failure causes in microchip production. To this end, a new methodology was presented which extends nonnegative matrix factorization to this kind of binary dataset. We further presented an optimization technique which maximizes a log-likelihood function using a fast alternating least squares algorithm followed by a gradient ascent refinement. The performance of the overall procedure was demonstrated on an artificial and a real world dataset.

References

Berry, M. W., Browne, M., Langville, A. N., Pauca, V. P., & Plemmons, R. J. (2007). Algorithms and applications for approximate nonnegative matrix factorization. Computational Statistics and Data Analysis, 52(1), 155–173.

Cichocki, A., Zdunek, R., & Amari, S. (2006). Csiszár's divergences for non-negative matrix factorization: Family of new algorithms. Lecture Notes in Computer Science (Vol. 3889, pp. 32–39). Berlin: Springer.

Cichocki, A., Zdunek, R., & Amari, S. (2008). Nonnegative matrix and tensor factorization. IEEE Signal Processing Magazine, 25(1), 142–145.

Dhillon, I., & Sra, S. (2006). Generalized nonnegative matrix approximations with Bregman divergences. In Y. Weiss, B. Schölkopf, & J. Platt (Eds.), Advances in Neural Information Processing Systems (Vol. 18, pp. 283–290). Cambridge, MA: MIT Press.

Kabán, A., Bingham, E., & Hirsimäki, T. (2004). Learning to read between the lines: The aspect Bernoulli model. In Proceedings of the 4th SIAM International Conference on Data Mining (pp. 462–466). Lake Buena Vista, FL, USA.

Lee, D. D., & Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature, 401, 788–791.
