Implementation of Non-Threaded Estimation for Run-to-run Control of High Mix Semiconductor Manufacturing

Farshad Harirchi¹, Tyrone Vincent¹, Anand Subramanian², Kameshwar Poolla² and Broc Stirton³

Abstract: In the literature, non-threaded run-to-run control methods have been presented which describe the process bias for a particular wafer as a linear combination of possible bias contributions, with the individual contributions estimated using a Kalman Filter. In this paper, we address two issues that need to be considered in implementations: observability of the state realization of the bias model, and the computational cost of the Kalman filter. While some elements of the bias model that create unobservability are well known, we present a complete analysis of observability that also considers the influence of the thread sequence. We also survey and extend methods of observability recovery that do not require model reduction or the specification of special reference threads, thus easily allowing new threads to be added and old threads removed. Finally, we describe how the problem structure allows the information form of the Kalman filter to be much more computationally efficient. Simulation results illustrate the proposed method.

*This work was supported by GlobalFoundries.
¹F. Harirchi and T. Vincent are with the Department of Electrical Engineering and Computer Science, Colorado School of Mines, USA. tvincent at mines dot edu, fharirch at mines dot edu
²A. Subramanian and K. Poolla are with the Department of Mechanical Engineering, University of California, Berkeley, USA. poolla at berkeley dot edu, anandsub at berkeley dot edu
³B. Stirton is with GlobalFoundries. Broc.Stirton at globalfoundries dot com

I. INTRODUCTION

Run-to-run control is used in the semiconductor manufacturing industry to stabilize unit processes within the manufacturing sequence. The concept of run-to-run control is illustrated in Fig. 1. A wafer arrives at a particular process step and is processed using settings determined by a controller. In run-to-run control, these settings do not depend on in-situ measurements of the process (because of the size of metrology devices), so they are completely determined at the time of wafer arrival at the process step. Later, the wafer is measured ex-situ to determine the results of the processing step. These measurements are given to the run-to-run controller, which uses them to determine the appropriate process settings for future wafers.

[Fig. 1. Run to run control. A process step with gain γ receives process settings u from the run-to-run controller; ex-situ metrology returns wafer measurements y to the controller.]

One of the challenges in run-to-run process control arises from market requirements and technology advances, which drive producers to introduce new products and modify older ones. Semiconductor foundries, which manufacture chips for many different companies, deal with a huge number of products at the same time that pass through many different tools. This manufacturing environment is called high-mix. The processing error that occurs for a particular wafer after processing on a particular tool can be influenced by what is termed the processing context [13].

The context can include the product to be produced (which defines the patterns of material being processed), the process technology, previous processing steps, and other effects. In order to respond to this challenge and many others, enhanced algorithms are needed in this field [2], [5], [12]. To meet this need, several authors [1], [6], [8], [9], [13]–[15], [17], [18] have investigated a method of describing the processing error as a linear combination of biases, each due to one of the items that define the processing context, and using methods such as the Kalman Filter to estimate these context item biases. This work has been important because it allows measurements from a wafer with one processing context to be used in updating the run-to-run control actions for wafers that share only a portion of the processing context.

In this paper, we offer several contributions related to the implementation of such an estimation process. First, we give a complete description of the conditions under which the estimation problem can be ill posed, a condition that is called unobservable in the literature. While effects due to the structure of the problem have been discussed in some depth, here a comprehensive result is given that also describes when the specific sequence of processing contexts can give rise to an unobservable system. Secondly, we describe a method for estimating the correct value of the tuning parameters that describe the context model. Finally, we examine the computational complexity of implementation, and show when an alternate recursive form, called the information Kalman filter, can provide for faster calculation of the desired estimates.

The structure of the paper is as follows. In Section II the run-to-run problem is presented in more detail and important notation and terms are defined. In Section III the estimation problem based on a linear combination of context item bias effects is defined. The new contributions of the paper start in Section IV, which include a complete characterization of observability that includes effects due to the processing sequence, and methods for recovering observability. In Section V the algorithms for implementing the estimation process are described, including the information Kalman filter. Section VI compares the computational complexity of the two implementations. Section VII presents a method for estimating the tunable parameters contained in the estimation problem. Finally, Section VIII presents simulation examples. A preliminary version of this paper, which includes only a discussion of observability, has been submitted to the Conference on Decision and Control (2013).

II. THREADED RUN TO RUN CONTROL

The standard threaded controller used in semiconductor manufacturing is called EWMA (exponentially weighted moving average). In the EWMA approach, each thread is assigned a specific bias estimate $\hat{c}_{\alpha,k}$, where $\alpha$ is the multi-index identifying the thread, and $k$ is the processing time. This controller assumes a non-stationary process where the variation can be modeled as an integrated moving average process; therefore, the model bias can be updated recursively using an EWMA filter [1]. EWMA assumes that the unit operating input/output mapping can be modeled (at least locally) as a static gain plus a bias, that is

$$y_k = \gamma u_k + c_k \tag{1}$$

where $u_k$ represents the process setting, $y_k$ is an important product variable that is impacted by the process, $\gamma$ is a constant called the process gain (assumed known), and $c_k$ is a time varying signal that models disturbances and drifts in the process. In order to create an estimate of the bias, $\hat{c}_k$, from previous measurements, an exponentially weighted moving average filter is utilized

$$\hat{c}_k = \lambda(y_{k-1} - \gamma u_{k-1}) + (1 - \lambda)\hat{c}_{k-1}, \tag{2}$$

where $\lambda$ is a tuning parameter, chosen between 0 and 1. The process input is then adjusted to compensate for the bias

$$u_k = \frac{y^{\mathrm{target}} - \hat{c}_k}{\gamma} \tag{3}$$
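For concreteness, the following is a minimal sketch of the controller of (1)-(3) in Python; the class and names (EwmaController, gamma, lam) are illustrative assumptions, not notation from the paper.

```python
# Minimal sketch of the EWMA run-to-run controller, eqs. (1)-(3).
class EwmaController:
    def __init__(self, gamma, lam, y_target, c0=0.0):
        self.gamma = gamma        # process gain, assumed known
        self.lam = lam            # EWMA weight, 0 < lam < 1
        self.y_target = y_target  # desired output
        self.c_hat = c0           # current bias estimate

    def update(self, y_prev, u_prev):
        # eq. (2): blend the latest bias observation with the old estimate
        self.c_hat = (self.lam * (y_prev - self.gamma * u_prev)
                      + (1.0 - self.lam) * self.c_hat)

    def next_input(self):
        # eq. (3): choose u to cancel the estimated bias
        return (self.y_target - self.c_hat) / self.gamma
```

A small $\lambda$ reacts slowly but filters measurement noise well; a large $\lambda$ tracks drift quickly at the cost of noisier process settings.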

The key value estimated by the EWMA controller is the bias estimate $\hat{c}_k$. A complication with this control architecture is the fact that the true bias $c_k$ is not independent of the wafer being processed. In fact, previous processing steps, or the type of product, due to the characteristics of the layer being defined, can influence the particular realization of $c_k$. The issue of multiple products and processes requiring a more sophisticated control strategy was documented in the late nineties by Miller [12] and others. To account for the difference in processing characteristics seen by each wafer, more sophisticated estimation approaches can be applied that attempt to capture this dependence.

To help clarify further discussion, a specific terminology that is generally utilized in the literature will be defined to describe the processing characteristics, and is illustrated in Figure 2. Each class that represents a choice in the manufacturing process is called a category. The specific realization within that category is called the context. So, for example, one category could be the specific product or layer. Other categories could include the tool selected for a particular process, or a recipe used. A collection of contexts, one from each category, is called a thread.

[Fig. 2. Thread definition. Categories (Products, Layer, Technology, Stepper, Metrology) on one axis and context items on the other; the highlighted combination yields thread 2331 with bias estimate $\hat{c}_{2331}$.]

Because the manufacturing context will affect the result of a particular process, run-to-run control must take the wafer thread into account. One way to do this, called threaded estimation, is to assign a separate bias estimate to each thread. Let $\alpha$ be the multi-index identifying the thread. For example, in Figure 2, a wafer runs with product #2, Layer #3, Technology #3 and Stepper #1, resulting in the thread with multi-index $\alpha = 2331$. Threaded control has a bias update of the form

$$\hat{c}_{\alpha,k} = \begin{cases} \lambda(y_{k-1} - \gamma u_{k-1}) + (1 - \lambda)\hat{c}_{\alpha,k-1} & \text{if thread } \alpha \text{ is measured} \\ \hat{c}_{\alpha,k-1} & \text{otherwise} \end{cases}$$

where $\hat{c}_{\alpha,k}$ is the estimate at time $k$ for thread $\alpha$. The control action when thread $\alpha$ is processed is

$$u_k = \frac{y^{\mathrm{target}} - \hat{c}_{\alpha,k}}{\gamma}.$$
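As a hedged illustration (not code from the paper), threaded estimation can be sketched by keeping one EWMA state per thread; the dictionary-based ThreadedEwma below is an assumed construction.

```python
from collections import defaultdict

# Sketch of threaded estimation: one bias estimate per thread multi-index,
# updated only when that thread is measured; all other threads are held.
class ThreadedEwma:
    def __init__(self, gamma, lam, y_target, c0=0.0):
        self.gamma, self.lam, self.y_target = gamma, lam, y_target
        self.c_hat = defaultdict(lambda: c0)   # thread alpha -> bias estimate

    def update(self, alpha, y_prev, u_prev):
        # only the measured thread alpha is updated (first case above)
        self.c_hat[alpha] = (self.lam * (y_prev - self.gamma * u_prev)
                             + (1.0 - self.lam) * self.c_hat[alpha])

    def next_input(self, alpha):
        return (self.y_target - self.c_hat[alpha]) / self.gamma
```

The defaultdict makes the new-thread initialization problem explicit: an unseen thread simply starts at the prior value c0.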

Note that only measurements from thread $\alpha$ are used to update the bias estimate for that thread, while the others are simply propagated forward from previous estimates. This allows each thread to have an independent bias estimate, but comes at a cost: the production rate for different threads can be non-uniform, and new threads need to be properly initialized. Threads that are updated often are called high-runners, and for those threads the bias estimation is fairly straightforward. However, for low-runners, a large amount of time can pass between instances, causing long delays between updates, and the bias estimate loses validity. Also, initializing the bias estimate of new threads is difficult. Many changes, such as new products and new reticles, can cause new threads to be created.

III. NON-THREADED ESTIMATION

The problem of low-runners led to the development of other approaches, where information is shared amongst threads that have some aspects in common (such as a common tool). These approaches are called non-threaded, as they do not have strict separation of data between threads. The strategy under investigation in this paper uses a linear model for the thread bias. In this case, if a thread is identified by multi-index $\alpha = \ell mop$, then it is assumed that the thread bias is given by

$$\hat{c}_\alpha = \mu + \hat{c}_{1,\ell} + \hat{c}_{2,m} + \hat{c}_{3,o} + \hat{c}_{4,p} \tag{4}$$

where $\mu$ is an average bias shared across all threads, $\hat{c}_{1,\ell}$ is the bias due to context item $\ell$ in category 1, $\hat{c}_{2,m}$ is the bias due to context item $m$ in category 2, etc. Note that the time index $k$ has been left off for clarity, but all of these terms are functions of time. For clarity, we will reserve the following variables to denote the size of the various items defining the non-threaded estimation problem:

• $q$ - number of context item categories.
• $r_i$ - number of context items in category $i$, $i = 1, \cdots, q$.
• $n = 1 + \sum_{i=1}^{q} r_i$ - total number of context items, including the average bias.

In order to consider estimation of these context item biases, it is useful to collect all biases into a single state vector of dimension $n$ ($x \in \mathbb{R}^n$), specifically

$$x = \begin{bmatrix} \mu & \hat{c}_{1,1} & \hat{c}_{1,2} & \cdots & \hat{c}_{2,1} & \cdots \end{bmatrix}^T \tag{5}$$

The bias for thread $\alpha$ can then be represented as

$$c_{\alpha,k} = H_\alpha x_k \tag{6}$$

where $H_\alpha \in \mathbb{R}^{1 \times n}$ is a row vector that selects the terms that are relevant to that thread. For later use, define $C_i$ as the set of indices that represent the $i$th category. For example, consider the thread defined in Figure 2. In this case, $q = 4$, with $r_1 = 4$, $r_2 = 4$, $r_3 = 3$ and $r_4 = 4$. Including the mean, the state vector will be of length $n = 16$, and $H_\alpha$ contains a 1 at the position associated with the mean, as well as a 1 at the position of the relevant context item within each category. For the thread $\alpha = 2331$, it will be of the form

$$H_\alpha = [\,\underbrace{1}_{\text{common mean}}\;\; \underbrace{0\;1\;0\;0}_{\text{category 1}}\;\; \underbrace{0\;0\;1\;0}_{\text{category 2}}\;\; \underbrace{0\;0\;1}_{\text{category 3}}\;\; \underbrace{1\;0\;0\;0}_{\text{category 4}}\,] \tag{7}$$

and we have $C_1 = \{2, 3, 4, 5\}$, $C_2 = \{6, 7, 8, 9\}$, $C_3 = \{10, 11, 12\}$, $C_4 = \{13, 14, 15, 16\}$.

For each wafer, the measurement $y_k$ is compared to the expected value $\gamma u_k$, giving the bias measurement (denoted by $z_k$)

$$z_k = y_k - \gamma u_k \tag{8}$$

This bias measurement is assumed to be a noisy estimate of the true bias of the thread run at time $k$, $c_{\alpha,k}$, so that the final measurement model is

$$z_k = H_{\alpha_k} x_k + n_k \tag{9}$$
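To make the structure of $H_\alpha$ concrete, here is a small sketch (with assumed names) that builds the selection vector of (7) for arbitrary category sizes.

```python
import numpy as np

def make_H(alpha, r):
    """Build the 1 x n selection vector H_alpha of eq. (7).
    alpha: tuple of 1-based context indices, one per category;
    r: list of category sizes r_1..r_q."""
    n = 1 + sum(r)
    H = np.zeros(n)
    H[0] = 1.0                    # common mean is always selected
    offset = 1
    for a, ri in zip(alpha, r):
        H[offset + a - 1] = 1.0   # exactly one context item per category
        offset += ri
    return H

# Thread alpha = 2331 from Figure 2 (q = 4, r = [4, 4, 3, 4], n = 16):
H = make_H((2, 3, 3, 1), [4, 4, 3, 4])
# H == [1, 0,1,0,0, 0,0,1,0, 0,0,1, 1,0,0,0], matching eq. (7)
```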

where $n_k$ is a zero mean i.i.d. Gaussian random sequence with variance $R$, independent of other random variables to be defined later.

Since for each wafer measurement there are multiple context item biases to be estimated, estimation of the context item biases will require collecting measurements over multiple wafers that share common biases. However, it is unlikely that all biases will remain constant over time. Thus, complementing the model of the measurement process is a model of the state trajectories $x_k$. (In what follows, we will use the terms "context item biases" and "states" interchangeably.) The most common model is a random walk,

$$x_{k+1} = x_k + w_k \tag{10}$$

where $w_k \in \mathbb{R}^n$ is a zero mean i.i.d. Gaussian random vector sequence with covariance $Q\Delta t_{k+1}$, independent of other random variables. Note that $Q \in \mathbb{R}^{n \times n}$ is an $n$ by $n$ matrix, and $\Delta t_{k+1}$ is the (scalar) time between processing wafer $k$ and wafer $k+1$. By writing this as $x_{k+1} - x_k = w_k$, it is clear that this is simply a model of the expected difference in states between samples.

Finally, we can add a model for the a-priori expected distribution of the context item biases. Denoting $k_s$ as the index of the first data point used in the estimation process, the model is of the form

$$x_{k_s} = \bar{x} + v \tag{11}$$

where $\bar{x} \in \mathbb{R}^n$ is a fixed, known vector (for example, a vector of zeros) and $v \in \mathbb{R}^n$ is a zero mean Gaussian random vector with covariance $P$, independent of other random variables. We will assume that the initial uncertainty can be modeled as independent between context items, and uniform, so that $P = pI$ where $I$ is the identity matrix.

Suppose the most recent process run occurs at sample time $k_c$. Based on the models above, given data from some past time $k = k_s$ to the current time $k = k_c$, one can obtain an estimate of the state trajectory by solving the following least squares problem:

$$\min_{\{x_k\}_{k_s}^{k_c}} \|x_{k_s} - \bar{x}\|^2_{pI} + \sum_{k=k_s}^{k_c} \|z_k - H_{\alpha_k} x_k\|^2_R + \sum_{k=k_s+1}^{k_c} \|x_k - x_{k-1}\|^2_{Q\Delta t_k} \tag{12}$$

where $\|x\|^2_M = x^T M^{-1} x$ denotes a weighted Euclidean norm. This optimization problem is quadratic and the global minimum is easily obtained analytically. This will correspond to the state trajectory that best matches the observed measurements with the least deviation between sample times, as measured by the stated weighted norms. This corresponds to the minimum variance or maximum a-posteriori estimate when the random variables are jointly Gaussian. The state trajectory that achieves the minimum, denoted $\hat{x}_k$, represents estimates of the context item biases at each time $k$. The prediction of the bias of the next wafer can then be given by the appropriate context item selection matrix $H_{\alpha_{k_c+1}}$ multiplied by the current context bias estimate, $\hat{c}_{k_c+1} = H_{\alpha_{k_c+1}} \hat{x}_{k_c}$.

IV. OBSERVABILITY AND SOLUTION PROPERTIES

In this section, we will discuss how strongly the estimate given by (12) depends on the observed data vs. the a-priori estimate $\bar{x}$. To make this comparison, we can consider an optimization problem that removes the a-priori term, that is,

$$\min_{\{x_k\}_{k_s}^{k_c}} \sum_{k=k_s}^{k_c} \|z_k - H_{\alpha_k} x_k\|^2_R + \sum_{k=k_s+1}^{k_c} \|x_k - x_{k-1}\|^2_{Q\Delta t_k}. \tag{13}$$

Definition 1: The system (10)-(9) is observable over the time window $[k_s, k_c]$ if the solution to (13) is unique.

This definition is equivalent to the standard system theoretic definition of an observable system (see e.g. [3]). Observability is a desirable property, for if it does not hold, there is a part of the estimate that is independent of the observations, and depends only on the a-priori information $\bar{x}$. For non-threaded estimation, observability depends both on the thread sequence $\alpha_k$ (which influences $H_\alpha$) and the particular structure of $H_\alpha$ as given by (7). We begin by discussing the structural issues, before turning our attention to the requirements on the thread sequence.
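For intuition, the following is a minimal dense sketch, under assumed names, of solving the batch problem (12) by stacking all terms into one linear least-squares system. When the system is rank deficient (the unobservable case), np.linalg.lstsq silently returns the minimum-norm solution, which is one numerical symptom of the non-uniqueness discussed below.

```python
import numpy as np

def batch_estimate(z, H_seq, R, Q, dts, x_bar, p):
    """Solve (12) for the trajectory x_ks..x_kc (dense sketch)."""
    n, N = x_bar.size, len(z)
    rows, rhs = [], []
    # prior term ||x_ks - x_bar||^2_{pI}
    A0 = np.zeros((n, N * n)); A0[:, :n] = np.eye(n) / np.sqrt(p)
    rows.append(A0); rhs.append(x_bar / np.sqrt(p))
    for k in range(N):
        # measurement term ||z_k - H_k x_k||^2_R (R scalar)
        Ak = np.zeros((1, N * n))
        Ak[0, k*n:(k+1)*n] = H_seq[k] / np.sqrt(R)
        rows.append(Ak); rhs.append(np.atleast_1d(z[k] / np.sqrt(R)))
        if k > 0:
            # increment term ||x_k - x_{k-1}||^2_{Q dt_k}
            L = np.linalg.cholesky(np.linalg.inv(Q * dts[k]))
            Dk = np.zeros((n, N * n))
            Dk[:, k*n:(k+1)*n] = L.T
            Dk[:, (k-1)*n:k*n] = -L.T
            rows.append(Dk); rhs.append(np.zeros(n))
    A, b = np.vstack(rows), np.concatenate(rhs)
    x = np.linalg.lstsq(A, b, rcond=None)[0]
    return x.reshape(N, n)
```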

The construction of the matrix $H_\alpha$ implies that there exist vectors $x$ such that $H_\alpha x = 0$ for all threads. Recall that $C_j$ is the set of indices of $x$ corresponding to the $j$th category. Then, a vector $x$ with elements $[x]_i$, $i = 1, \cdots, n$, given by

$$[x]_i = \begin{cases} -1 & i = 1 \\ 1 & i \in C_j \\ 0 & \text{otherwise} \end{cases} \tag{14}$$

will satisfy $H_\alpha x = 0$, and thus lie in the nullspace of $H_\alpha$ for any $\alpha$. Note that these vectors are of the form

$$x = \begin{bmatrix} -1 \\ \mathbf{1} \\ 0 \\ \vdots \\ 0 \end{bmatrix} \quad x = \begin{bmatrix} -1 \\ 0 \\ \mathbf{1} \\ \vdots \\ 0 \end{bmatrix} \quad \cdots \tag{15}$$

(where the divisions are by category and $\mathbf{1}$ is a vector of 1s in a particular category). The key structural property of $H_\alpha$ that ensures that these vectors are in the nullspace is that there is always a 1 at index 1, and always exactly one 1 in index set $C_j$. The fact that a particular vector lies in the nullspace of any $H_\alpha$ has implications on the observability of the estimation problem. Specifically, if $\hat{x}_k$ is a minimizing sequence for (13), so is the sequence $\bar{x}_k$ given by $\bar{x}_k = \hat{x}_k + x$ for any $x$ given by (14). (Note that since the same vector is added at each time point, $\bar{x}_k - \bar{x}_{k-1} = \hat{x}_k - \hat{x}_{k-1}$, while the nullspace property implies $H_{\alpha_k}\bar{x}_k = H_{\alpha_k}\hat{x}_k$.) Since the solution to (13) is not unique, the system is not observable for any thread sequence $\alpha_k$.

This structural lack of observability in non-threaded estimation is well known [13] and many solutions have been proposed. However, before discussing the possible solutions, we turn our attention to the less studied issue of unobservability due to the thread sequence. The number of linearly independent vectors described by (14) is equal to the number of categories, $q$. Ideally, we would like to choose a thread sequence $\alpha_k$ so that these are the only vectors in the null space. As a simple example, consider the case of two categories ($q = 2$) with $r_1 = 2$ and $r_2 = 3$. We will examine the matrix $H = \begin{bmatrix} H'_{\alpha_{k_s}} & H'_{\alpha_{k_s+1}} & \cdots & H'_{\alpha_{k_e}} \end{bmatrix}'$ for different choices of $\alpha_k$. If a sequence includes each thread, i.e. $\{\alpha_k\}_{k_s}^{k_e} = \{(1,1),(1,2),(1,3),(2,1),(2,2),(2,3)\}$, then $H$ is of the form

$$H = \begin{bmatrix} 1 & 1 & 0 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 & 1 & 0 \\ 1 & 1 & 0 & 0 & 0 & 1 \\ 1 & 0 & 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 & 1 & 0 \\ 1 & 0 & 1 & 0 & 0 & 1 \end{bmatrix}.$$

Note that $H$ is rank 4. In fact, due to the structural effects discussed above that ensure at least $q$ vectors in the null space of $H$, the maximum column rank of any collection of $H_{\alpha_k}$ will be $n - q$, which in this case is $6 - 2 = 4$. On the other hand, if we choose the shorter sequence $\{\alpha_k\}_{k_s}^{k_e} = \{(1,1),(1,2),(2,1),(2,3)\}$, then

$$H = \begin{bmatrix} 1 & 1 & 0 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 & 1 & 0 \\ 1 & 0 & 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 & 0 & 1 \end{bmatrix}$$

is also rank 4. From the two examples above, we see that it is not necessary for each of the $\Pi_{i=1}^q r_i$ possible threads to occur to obtain the maximal rank, and in fact, the sequence can be as short as $n - q$. However, not every $n - q$ length sequence will work. For example, if $q = 3$, with $r_1 = 2$, $r_2 = 2$ and $r_3 = 4$, then $n - q = 6$, but the length 6 sequence of unique threads $\{(1,1,1),(1,1,2),(1,2,3),(2,1,4),(1,2,4),(2,1,3)\}$ gives

$$H = \begin{bmatrix} 1 & 1 & 0 & 1 & 0 & 1 & 0 & 0 & 0 \\ 1 & 1 & 0 & 1 & 0 & 0 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 & 1 & 0 & 0 & 1 & 0 \\ 1 & 0 & 1 & 1 & 0 & 0 & 0 & 0 & 1 \\ 1 & 1 & 0 & 0 & 1 & 0 & 0 & 0 & 1 \\ 1 & 0 & 1 & 1 & 0 & 0 & 0 & 1 & 0 \end{bmatrix}, \tag{16}$$

which is rank 5, implying that another vector other than the ones described by (14) also satisfies $H_{\alpha_k} x = 0$ for all $\alpha_k$. It is not immediately obvious why this is so, as the context items occur in the different threads fairly equally. The problem is a specific correlation between the occurrence of context items that is captured by the following result.

Theorem 1: Given a sequence $\alpha_k$ and two nonintersecting sets of indices, called $A$ and $B$, such that the following are satisfied:

• for each $k$, not more than one element of $H_{\alpha_k}$ in index set $A$ is equal to 1, and similarly for index set $B$;
• an element of $H_{\alpha_k}$ in index set $A$ equals 1 if and only if an element in index set $B$ is equal to 1.

Then the vector $x$ with elements

$$[x]_i = \begin{cases} 1 & i \in A \\ -1 & i \in B \\ 0 & \text{otherwise} \end{cases} \tag{17}$$

satisfies $H_{\alpha_k} x = 0$ for all $\alpha_k$ in the sequence.

Proof: Consider $H_{\alpha_k} x$ for arbitrary $\alpha_k$ with $x$ as in (17). From the form of $x$, $H_{\alpha_k} x = \sum_{i \in A} [H_{\alpha_k}]_i - \sum_{i \in B} [H_{\alpha_k}]_i$. From the conditions of the theorem, the two sums are either both 0 or both 1, in either case implying $H_{\alpha_k} x = 0$. ∎

Note that this theorem is actually a generalization of the structural unobservability discussed above, as we can apply this theorem with $A = C_i$ and $B = \{1\}$ to show that elements of the form (14) are in the nullspace of $H_{\alpha_k}$. To illustrate the conditions of the theorem, the sequence $\alpha_k$ shown in (16) is repeated here with the context item indices labeled; the conditions of the theorem are satisfied with $A = \{3, 5\}$ and $B = \{8, 9\}$:

$$\begin{array}{ccccccccc}
1 & 2 & 3_A & 4 & 5_A & 6 & 7 & 8_B & 9_B \\
\hline
1 & 1 & 0 & 1 & 0 & 1 & 0 & 0 & 0 \\
1 & 1 & 0 & 1 & 0 & 0 & 1 & 0 & 0 \\
1 & 1 & 0 & 0 & 1 & 0 & 0 & 1 & 0 \\
1 & 0 & 1 & 1 & 0 & 0 & 0 & 0 & 1 \\
1 & 1 & 0 & 0 & 1 & 0 & 0 & 0 & 1 \\
1 & 0 & 1 & 1 & 0 & 0 & 0 & 1 & 0
\end{array}$$

(columns marked $A$ form set $A$ and columns marked $B$ form set $B$).

As a more concrete example of the type of correlation that can result in a larger common nullspace for $H_{\alpha_k}$, consider the case in which there are two categories that correspond to products and tools. If there is

• a subset of products (index set $A$) that is run exclusively on a subset of tools (index set $B$), and
• no products outside of $A$ are run on the tools in $B$,

then $A$ and $B$ satisfy the conditions of Theorem 1. When these patterns occur, context item biases that differ only by a term equal to a linear combination of vectors of the form (17) cannot be differentiated by the observations. More importantly, lack of observability means that when the number of observations becomes large, the recursive methods of solution to be discussed in the next section can become numerically unstable. In the remainder of this section we discuss methods for recovering observability and avoiding these problems. These solutions will not require changing the processing sequence.

One set of solutions uses a transformation $T \in \mathbb{R}^{m \times n}$ to define a new system state $\eta$ such that $\eta = Tx$ and the unobservable subspace associated with $x$ is in the nullspace of $T$ [6], [17]. The optimization problem is stated in terms of this new, smaller variable $\eta \in \mathbb{R}^m$, and since $H_\alpha x = 0$ on the removed subspace, predictions can still be made using just $\eta$. This can be very effective, but does have the drawback that the variables no longer represent the context item biases as simply as they are defined in (5). This can be important if new context items need to be added or removed, as would occur in the estimation process defined in Figure 3(b). Because of this lack of generality in the transformation approach, we will focus on the second set of solutions, which involves adding additional constraints to the estimation problem. However, it needs to be ensured that the constraints do not bias the solution.

Given a thread sequence $\alpha_k$ that contains correlations captured by the pairs of index sets $A_i$ and $B_i$ for $i = 1, \cdots, p$ (including correlations due to categories $C_i$), an acceptable set of constraints can be obtained using the vectors described by (17). Denote the vector associated with $A_i$ and $B_i$ as $s_i$ and define

$$S = \begin{bmatrix} s_1' \\ \vdots \\ s_p' \end{bmatrix}.$$

If the sets $A_i$, $B_i$ are a complete characterization of the common nullspace of $H_{\alpha_k}$, such that the rank of $H$ is $n - p$ and the rank of $S$ is $p$, then by adding the constraints $Sx_k = 0$ to the estimation problem (13), we can ensure that there is a unique solution that is also a minimizer of (13) without the additional constraints, thus ensuring that these constraints do not add an undesired bias.

We will make one small modification to these constraints that is necessary to ensure that the first state keeps the physical meaning of "mean bias over all threads." This is best illustrated using an example. Consider the case of two categories ($q = 2$) with $r_1 = 2$ and $r_2 = 3$. Then, even if the sequence $\alpha_k$ contains every thread, we have correlations defined by $A_1 = C_1 = \{2, 3\}$, $A_2 = C_2 = \{4, 5, 6\}$, and $B_1 = B_2 = \{1\}$. Thus we choose

$$S = \begin{bmatrix} -1 & 1 & 1 & 0 & 0 & 0 \\ -1 & 0 & 0 & 1 & 1 & 1 \end{bmatrix}.$$

These constraints would ensure that the average of the context item biases within each category is equal to the first state. In this case, if one calculates the mean bias over all threads, it will turn out to be equal to not the first state, but the first state times 3. However, this can be easily fixed. Define a vector $f$ elementwise as

$$[f]_i = \begin{cases} 1 & i = 1 \\ \frac{1}{|C_j|} & i \in C_j \end{cases}$$

and set $\nu = 1 + \sum_{i=1}^{p} \frac{1}{|C_i|}$. Let

$$T = \begin{bmatrix} \frac{1}{\nu} f & \begin{matrix} 0 \\ I \end{matrix} \end{bmatrix}$$

be a matrix that has $\frac{1}{\nu} f$ as its first column, with the remainder a row of zeros above the $n-1$ dimensional identity matrix. Then, since $H_\alpha T = H_\alpha$ for any $\alpha$, we are free to modify the constraints to be

$$S = \begin{bmatrix} s_1' \\ \vdots \\ s_p' \end{bmatrix} T. \tag{18}$$

In the particular illustrative case given above, we have

$$f = \begin{bmatrix} 1 & 1/2 & 1/2 & 1/3 & 1/3 & 1/3 \end{bmatrix}'$$

and

$$S = \begin{bmatrix} 0 & 1 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 1 & 1 \end{bmatrix}. \tag{19}$$

Now the average of the context item biases within each category will be zero, and the overall mean must be given by the first state.
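A quick numerical sanity check of these constraints (assumed variable names): build the rescaled $S$ of (19) and the associated projector $\Pi = I - S'(SS')^{-1}S$ used in Theorem 3 below, and verify that a state with zero mean within each category satisfies them.

```python
import numpy as np

# S of eq. (19) for the q = 2 example (r1 = 2, r2 = 3, n = 6)
S = np.array([[0., 1., 1., 0., 0., 0.],
              [0., 0., 0., 1., 1., 1.]])
Pi = np.eye(6) - S.T @ np.linalg.inv(S @ S.T) @ S   # projector onto S x = 0

x = np.array([0.7, 0.3, -0.3, 0.1, 0.2, -0.3])      # zero mean per category
print(np.allclose(S @ x, 0), np.allclose(Pi @ x, x))  # True True
```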

There are two ways of applying constraints that are compatible with the recursive solution given in the next section. The first possibility, noted in [8], [14], is to add fictitious measurements that describe the desired constraints. A second possibility, which apparently has not been discussed in the literature, is to modify the system dynamics. The following two theorems show that adding constraints in these two ways will result in a unique or essentially unique solution that is also a solution to the original estimation problem. Thus observability has been recovered without biasing the solution.

Theorem 2: Choose positive definite $R_c \in \mathbb{R}^{p \times p}$. If the sets $A_i$, $B_i$ are a complete characterization of the common nullspace of $H_{\alpha_k}$, $S$ is defined as in (18), and $k_c - k_s > n - p$, then

$$\min_{\{x_k\}_{k_s}^{k_c}} \sum_{k=k_s}^{k_c} \|z_k - H_{\alpha_k} x_k\|^2_R + \|S x_{k_c}\|^2_{R_c} + \sum_{k=k_s+1}^{k_c} \|x_k - x_{k-1}\|^2_{Q\Delta t_k} \tag{20}$$

has a unique solution, and this solution approaches a minimizer of (13) when $R_c \to \infty$.

Proof: See appendix. ∎

Note that the term that has been added is equivalent to defining "pseudo-measurements" at time $k_c$ given by $0 = S x_{k_c} + v$, where $v$ is a zero mean Gaussian random vector with covariance $R_c$. This is only an approximation of the constraint, but it becomes more exact as $R_c$ increases. In system theoretic terms, these extra measurements make the system observable. Significantly, it should be noted that this pseudo-measurement does not need to be made at every sample time $k$ to recover observability.

An alternate, second method that could be employed recovers uniqueness in the final state by modifying the state dynamics. Let $\Pi = I - S'(SS')^{-1}S$, the matrix that projects a vector onto the space perpendicular to the rows of $S$. Note that $\Pi x = x$ whenever $x$ satisfies the constraints $Sx = 0$.

Theorem 3: If the sets $A_i$, $B_i$ are a complete characterization of the common nullspace of $H_{\alpha_k}$, $S$ is defined as in (18), and $k_c - k_s > n - p$, then

$$\min_{\{x_k\}_{k_s}^{k_c}} \sum_{k=k_s}^{k_c} \|z_k - H_{\alpha_k} x_k\|^2_R + \sum_{k=k_s+1}^{k_c-1} \|x_k - x_{k-1}\|^2_{Q\Delta t_k} + \|x_{k_c} - \Pi x_{k_c-1}\|^2_{Q\Delta t_{k_c}} \tag{21}$$

has solutions that are also minimizers of (13), and these solutions all have the same value at time $k_c$.

Proof: See appendix. ∎

This is equivalent to re-defining the state dynamics at time $k_c - 1$ as $x_{k+1} = \Pi x_k + w_k$. In system theoretic terms, while the overall system is still not observable, it is detectable, as the unobservable subspace has been made stable.

V. IMPLEMENTATION OF NON-THREADED ESTIMATION

While the previous sections outlined a basic approach to non-threaded estimation, in actual implementation there are several issues related to computational stability and efficiency that need to be addressed. To guide the discussion, we consider two possible methods for implementation, which are illustrated in Figure 3. This figure is read with time running horizontally from sample time $k_s$ through sample time $k_c$, and different context items distributed vertically. A dash indicates that a context item occurs for the run sampled at that time, and a green dot is added when this context item is the first to occur since time $k_s$.

[Fig. 3. Two types of methods for data processing: (a) prediction uses a fixed number of past wafers; (b) after startup, multiple wafers to be predicted.]

In Figure 3(a), data from a time window of fixed width, say $N$, is used to estimate the context item biases. This window starts at sample time $k_s = k_c - N$ and runs through the current sample time $k_c$. These estimates are then used to predict the bias of the run at time $k_c + 1$. After this, the problem is re-set with $k_s$ and $k_c$ incremented by one, and $\bar{x}$ modified, if desired. This form of implementation allows other engineering information to be easily and rapidly brought to bear on the estimation problem through the choice of $\bar{x}$, and provides some robustness to outliers, as there is a fixed time limit for which any particular run is used in the estimation process. The following are important characteristics of this processing method:

• The data window-size is fixed.
• The context items to be estimated are all known in advance.

Note that although it is possible that the wafer at time $k_c + 1$ contains a new thread with a new context item (e.g. a new reticle), this will be known at estimation time, and the estimate for this context item would come from the corresponding element of $\bar{x}$.

Figure 3(b) represents an estimation process where the size of the estimation window increases with time. Specifically, while the current time $k_c$ is advanced after each run, the starting index $k_s$ is fixed, so that all runs after $k_s$ are used to make a prediction of the run at time $k_c$. Note that while more data is used in the estimation process, new threads containing new context items could occur at any time, and must be dealt with appropriately. The following are important characteristics of this processing method:

• The data window-size grows in time.
• Which context items will appear during the estimation process is not known in advance.

A. Basic Recursive Solution: Kalman Filter

From the results of Section IV, non-threaded estimation involves solving an optimization problem of the form

$$\min_{\{x_k\}_{k_s}^{k_c}} \|x_{k_s} - \bar{x}\|^2_{pI} + \sum_{k=k_s}^{k_c} \|z_k - H_{\alpha_k} x_k\|^2_R + \|S x_{k_c}\|^2_{R_c} + \sum_{k=k_s+1}^{k_c-1} \|x_k - x_{k-1}\|^2_{Q\Delta t_k} + \|x_{k_c} - \Pi x_{k_c-1}\|^2_{Q\Delta t_{k_c}} \tag{22}$$

where we can take either $\Pi = I$ or $S = 0$ (but not both). As is well known (see e.g. [7]), the Kalman Filter provides a recursive solution to this optimization problem. This recursive solution is as follows:

• Input data: post-run bias measurements $z_k = y_k - \gamma u_k$.
• Initialization: set $\hat{x}_{k_s-1} = \bar{x}$.
• For $k = k_s$ to $k_c - 1$, do:

$$P_k^- = \begin{cases} pI & k = k_s \\ P_{k-1}^+ + Q\Delta t_k & k > k_s \end{cases}$$
$$K_k = P_k^- H'_{\alpha_k} \left( H_{\alpha_k} P_k^- H'_{\alpha_k} + R \right)^{-1}$$
$$\hat{x}_k = \hat{x}_{k-1} + K_k (z_k - H_{\alpha_k} \hat{x}_{k-1})$$
$$P_k^+ = (I - K_k H_{\alpha_k}) P_k^-$$

• At $k = k_c$ (or at regular intervals¹), calculate

$$P_k^- = \Pi P_{k-1}^+ \Pi' + Q\Delta t_k$$
$$H = \begin{bmatrix} H_{\alpha_k} \\ S \end{bmatrix}$$
$$K_k = P_k^- H' \left( H P_k^- H' + \begin{bmatrix} R & 0 \\ 0 & R_c \end{bmatrix} \right)^{-1}$$
$$\hat{x}_k = \Pi \hat{x}_{k-1} + K_k \left( \begin{bmatrix} z_k \\ 0 \end{bmatrix} - H \Pi \hat{x}_{k-1} \right)$$
$$P_k^+ = (I - K_k H) P_k^-$$

¹In the case that this algorithm is to be used for repeated predictions ($k_c$ increasing into the future), for numerical stability the observability/detectability recovery step should be run periodically, with a suggested interval of $n$ samples.

At the conclusion of this algorithm, $\hat{x}_{k_c}$ will be the same as would have been calculated by (22), and can be used for prediction. Given the context item bias estimate $\hat{x}_{k_c}$, the bias for the next run is calculated as

$$\hat{c}_{k_c+1} = H_{\alpha_{k_c+1}} \hat{x}_{k_c}. \tag{23}$$
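A minimal sketch of one pass of this recursion follows (assumed names; the periodic constraint step with $\Pi$ and $S$ is omitted for brevity).

```python
import numpy as np

def kf_step(x_hat, P, z, H, R, Q, dt):
    """One run of the recursion above: time update, then the scalar
    measurement update for the 1-D selection vector H (= H_alpha)."""
    P_minus = P + Q * dt                      # time update
    s = H @ P_minus @ H + R                   # innovation variance (scalar)
    K = P_minus @ H / s                       # Kalman gain, shape (n,)
    x_hat = x_hat + K * (z - H @ x_hat)       # measurement update
    P_plus = (np.eye(len(x_hat)) - np.outer(K, H)) @ P_minus
    return x_hat, P_plus
```

At the constraint step, P_minus would instead be $\Pi P \Pi' + Q\Delta t$ and H would be augmented with the rows of $S$, as in the listing above.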

In the case that the model (9), (10), (11) is accurate, it is also possible to calculate the uncertainty of this prediction, using the state error covariance that is part of the Kalman Filter calculations. Specifically, the variance of the bias prediction is calculated as

$$\sigma^2_{k_c+1} = H_{\alpha_{k_c+1}} \left( P^+_{k_c} + Q\Delta t_{k_c+1} \right) H'_{\alpha_{k_c+1}} \tag{24}$$

The uncertainty prediction can be useful for determining the relative reliability of the non-threaded estimate. In cases when alternate control techniques are available, a large uncertainty is a useful signal that the non-threaded estimate needs to be modified, or the process carefully monitored.

B. Introducing New Context Items

As discussed in the introduction to this section, in some cases new context items may be introduced that are not part of the current state $x_k$. This will require modifying the state $x_k$ and associated matrices so that the prediction can continue. For ease of notation, we will assume that the new state is part of the last category and thus can be placed at the end of the state vector. Changes in the state dimension can be viewed from the context of the following optimization problem, where the optimization variables change size at time $k_1$, so that $x_k$ is dimension $n$, $x^\circ_k$ is dimension $n+1$, and all variables with superscript $\circ$ have increased dimension compatible with $x^\circ_k$:

$$\min_{\{x_k\}_{k_s}^{k_1-1},\,\{x^\circ_k\}_{k_1}^{k_c}} \|x_{k_s} - \bar{x}\|^2_{pI} + \sum_{k=k_s}^{k_1-1} \|z_k - H_{\alpha_k} x_k\|^2_R + \sum_{k=k_1}^{k_c} \|z_k - H^\circ_{\alpha_k} x^\circ_k\|^2_R + \sum_{k=k_s+1}^{k_1-1} \|x_k - x_{k-1}\|^2_{Q\Delta t_k} + \left\| x^\circ_{k_1} - \begin{bmatrix} x_{k_1-1} \\ \bar{x} \end{bmatrix} \right\|^2_{\bar{Q}} + \sum_{k=k_1+1}^{k_c} \|x^\circ_k - x^\circ_{k-1}\|^2_{Q^\circ \Delta t_k} + \cdots$$

where

$$\bar{Q} = \begin{bmatrix} Q\Delta t_{k_1} & 0 \\ 0 & p \end{bmatrix}.$$

The Kalman filter recursion of the last subsection can also be used to find the minimum, but with the following extra steps at time $k_1$ (a sketch follows the list):

• Increase the state dimension:
$$\hat{x}^\circ_{k_1} = \begin{bmatrix} \hat{x}_{k_1} \\ \bar{x} \end{bmatrix}$$
• Increase the dimension of the state error covariance:
$$P^{\circ+}_{k_1} = \begin{bmatrix} P^+_{k_1} & 0 \\ 0' & p \end{bmatrix}$$
where $0$ is an $n \times 1$ vector of zeros.
• Increase the dimension of $Q$:
$$Q^\circ = \begin{bmatrix} Q & 0 \\ 0' & \alpha \end{bmatrix}$$
where the choice of $\alpha$ is discussed below.
• Add the new index as appropriate to $A_i$ and/or $B_i$ and re-calculate the constraint $S^\circ$ and/or $\Pi^\circ$.
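A sketch of these augmentation steps under assumed names; x_bar_new is the a-priori bias for the new item and alpha_q its assumed process-noise level.

```python
import numpy as np

def augment(x_hat, P, Q, x_bar_new, p, alpha_q):
    """Append one new context item at the end of the state."""
    x_hat = np.append(x_hat, x_bar_new)       # new bias starts at its prior
    n = len(x_hat) - 1
    P = np.block([[P, np.zeros((n, 1))],
                  [np.zeros((1, n)), np.array([[p]])]])        # prior variance p
    Q = np.block([[Q, np.zeros((n, 1))],
                  [np.zeros((1, n)), np.array([[alpha_q]])]])  # noise level alpha
    return x_hat, P, Q
```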

The recursion can then continue with the variables denoted with superscript $\circ$.

C. An Alternate Implementation

There is an alternate recursive solution to (22), called the information Kalman filter, that may in some cases be more efficient, especially when using a slightly reformulated objective function. Rather than using a random walk model for the context item biases that updates after every measurement, we assume that the bias states are constant within intervals of fixed length. That is, for $d$ steps, we use the state trajectory model $x_{k+1} = x_k$, and then for one step, the model $x_{k+1} = x_k + w_k$ where $w_k$ has covariance $Q(t_{k+1} - t_{k+1-d})$. This will result in a random walk model with a similar variance as for the previous model, but with elements that are constant over intervals of length $d$.

The information Kalman Filter update equations are formulated in terms of the inverse of the covariance matrix, called the information matrix $\Psi^\pm_k = (P^\pm_k)^{-1}$, and an information state $\psi^\pm_k = \Psi^\pm_k \hat{x}_k$. The recursive solution is as follows:

• Input data: post-run bias measurements $z_k = y_k - \gamma u_k$.
• Initialization: set $\psi_{k_s-1} = \frac{1}{p}\bar{x}$, $\Psi_{k_s-1} = \frac{1}{p}I$.
• For $k = k_s$ to $k_c - 1$, do:
  – (time-update) once every $d$ time steps:
$$\Psi^-_k = \left( (\Psi^+_{k-1})^{-1} + Q(t_k - t_{k-d}) \right)^{-1}$$
$$\psi^-_k = \left( I - \Psi^-_k Q(t_k - t_{k-d}) \right) \psi^+_{k-1}$$
  – (measurement-update) every time step:
$$\Psi^+_k = \Psi^-_k + H'_{\alpha_k} R^{-1} H_{\alpha_k}$$
$$\psi^+_k = \psi^-_k + H'_{\alpha_k} R^{-1} z_k$$
• At $k = k_c$ (or at regular intervals), calculate
$$\Psi^-_k = \left( \Pi (\Psi^+_{k-1})^{-1} \Pi' + Q\Delta t_k \right)^{-1}$$
$$\psi^-_k = \Psi^-_k (\Psi^+_{k-1})^{-1} \psi^+_{k-1}$$
$$H = \begin{bmatrix} H_{\alpha_k} \\ S \end{bmatrix}$$
$$\Psi^+_k = \Psi^-_k + H' \begin{bmatrix} R & 0 \\ 0 & R_c \end{bmatrix}^{-1} H$$
$$\psi^+_k = \psi^-_k + H' \begin{bmatrix} R & 0 \\ 0 & R_c \end{bmatrix}^{-1} \begin{bmatrix} z_k \\ 0 \end{bmatrix}$$

At the end of this recursion, at time $k_c$, the prediction and uncertainty estimate are given by

$$\tilde{H} = H_{\alpha_{k_c+1}} (\Psi^-_{k_c+1})^{-1} \tag{25}$$
$$\hat{c}_{k_c+1} = \tilde{H} \psi^-_{k_c+1} \tag{26}$$
$$\sigma^2_{k_c+1} = \left( \tilde{H} + H_{\alpha_{k_c+1}} Q\Delta t_{k_c+1} \right) H'_{\alpha_{k_c+1}} \tag{27}$$

The major benefit of the information form is the very efficient set of equations that integrates the new measurements (under "measurement-update"). This is unfortunately offset by a more complex "time-update" which requires calculating the inverse of $\Psi$, which may be a very large matrix. However, if the time-update is run only every $d$ time steps, this cost is reduced. A detailed comparison between the computational complexity of the Kalman Filter and the information filter is given in Section VI.
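Before that comparison, a minimal sketch of the two information-form updates (assumed names; the constraint step is analogous to the Kalman case):

```python
import numpy as np

def if_measurement_update(Psi, psi, z, H, R):
    # cheap rank-one update: Psi += H' R^-1 H, psi += H' R^-1 z
    return Psi + np.outer(H, H) / R, psi + H * (z / R)

def if_time_update(Psi, psi, Q, dt_window):
    # expensive step, run only once every d samples
    P = np.linalg.inv(Psi)                        # covariance recovery
    Psi_new = np.linalg.inv(P + Q * dt_window)    # new information matrix
    psi_new = (np.eye(len(psi)) - Psi_new @ (Q * dt_window)) @ psi
    return Psi_new, psi_new
```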

VI. COMPARISON OF COMPUTATIONAL COMPLEXITY

In the implementation of non-threaded bias estimation, the run time of the Kalman filter can be high when there are a large number of context items. While the exact run time will depend on the specific hardware, we can get insight into how the problem scales by counting the number of multiplications needed; this will be our measure of computational complexity. We will look at scenarios characterized by the following parameters:

• $n$: number of states (number of context items used).
• $w$: the window-size used for estimation, defined as $w = k_c - k_s$.
• $d$: the time-update occurs every $d$ iterations.
• $\kappa$: number of categories plus 1 for the common mean, $\kappa = q + 1$.
• $M_k$: the number of nonzero elements in the covariance or information matrix at step $k$.

Because of the structure of the high-mix estimation problem, many elements of the matrices $H_k$ as well as $P^\pm_k$ and $\Psi^\pm_k$ will be zero. When this occurs, it significantly decreases the computational complexity of the implementation, as multiplication by a zero does not take any significant computational resources. Thus, the implementation cost will depend on the number of non-zero elements as captured by the parameter $M_k$.

A sparse matrix is a matrix populated primarily with zeros [16]. By contrast, if a larger number of elements differ from zero, then it is common to refer to the matrix as a dense or full matrix. The fraction of zero (non-zero) elements in a matrix is called the sparsity (density). In numerical analysis, there are two ways of storing a matrix: the full structure, in which all of the elements of the matrix are stored, and the sparse structure, in which the non-zero elements are stored along with their positions in the matrix. Similar to the storage, the operations are also different for sparse matrices. For example, matrix multiplication will only require the multiplication of non-zero elements, and matrix inversion can be more efficiently implemented using iterative methods.

Remark 1: From simulation results, we discovered that using the sparse structure in Matlab is useful up to the point that, for a matrix $A_{m \times n}$, the number of non-zero elements is less than $0.5mn$ (density below 0.5). If the number of non-zeros exceeds this value, then it is beneficial to use the full structure. This will be utilized in our implementations.

We will use big O notation to describe how the computational complexity scales with the problem parameters. The formal definition of big O notation is as follows: let $f(x)$ and $g(x)$ be two functions defined on some subset of the real numbers. Then $f(x) = O(g(x))$ if and only if there is a positive constant $c$ such that for all sufficiently large values of $x$, $|f(x)| \le c|g(x)|$ [4]. In our case $f$ will be the number of multiplications required, and $x$ will be a combination of problem parameters. The function $g(x)$ will be called the order of complexity. As described in the Appendix, we have the following in the case when sparse matrix representations are used:

• Kalman Filter complexity
  – Time-update: $O\!\left(\frac{wn}{d}\right)$
  – Measurement-update: $O\!\left(w\left(n^2 \min\{\kappa, \frac{M_k}{n}\} + (n+2)\kappa + 2n\right)\right)$
• Information Filter complexity
  – Time-update: $O\!\left(\frac{w((2n+1)M_k + n^2)}{d}\right)$
  – Measurement-update: $O\!\left(w((n+2)\kappa + 1)\right)$

Note that for the standard Kalman filter, the computational cost of the measurement update is much higher than the time update, while for the information filter, the time-update is costly. Hence running the time-update every $d$ steps instead of every step has a large impact on the speed of the algorithm.
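To illustrate the sparsity argument numerically, a small sketch with scipy.sparse (names and numbers are assumptions): the rank-one measurement update touches only $\kappa^2$ entries of the information matrix.

```python
import numpy as np
from scipy import sparse

n, R = 445, 1e-6
H = np.zeros(n)
H[[0, 3, 50, 200, 400]] = 1.0          # assumed thread: mean + 4 items
h = sparse.csr_matrix(H)               # 1 x n sparse row

Psi = sparse.identity(n, format="csr") # initial information matrix, n nonzeros
Psi = Psi + (h.T @ h) / R              # rank-one update H' R^-1 H
print(Psi.nnz)                         # n + kappa^2 - kappa = 465 nonzeros
```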

VII. ESTIMATING Q, R AND p

A key part of the bias model is the covariance of the measurement ($R$) and the covariance of the state bias disturbance ($Q$). The measurement covariance can usually be obtained from equipment specifications, experiments, or engineering knowledge. In addition, for scalar measurements, the Kalman filter estimates depend only on the relative size of $R$ and $Q$, rather than their absolute values. The absolute value of $R$ is only important when interpreting the magnitude of the prediction error covariance. However, for completeness, we suggest a method for identifying both $Q$ and $R$ together.

Several approaches for parameter identification are available, which are reviewed in [11]. For systems such as ours that are time varying (e.g. because $H_{\alpha_k}$ is a function of $k$), of these methods only the maximum likelihood (ML) approach is feasible. In this case, the matrices $Q$ and $R$ are parameterized, say as a function of parameters $\beta$, and the log likelihood function $L(\beta)$ for the observed measurements is calculated. The parameters $\beta$ are then adjusted to maximize the likelihood function, or in other words, make the observed measurement sequence $z_k$ maximally likely to have occurred. It turns out that the Kalman filter parameters are very useful for expressing the log likelihood function, which in our case is as follows:

$$L(\beta) = -\frac{1}{2} \sum_{k=k_s}^{k_c} \left( \frac{\left( z_k - H_{\alpha_k} \hat{x}^-_k \right)^2}{\sigma^2_k + R} + \log(\sigma^2_k + R) \right). \tag{28}$$

Note that both $\hat{x}^-_k$ and $\sigma^2_k$ will be functions of $\beta$. While $Q$ could be fully parameterized (i.e. each element of $Q$ can be adjusted, modulo symmetry), with a large number of parameters the resulting optimization problem becomes intractable. However, there are some reasonable assumptions that can be made to reduce the number of parameters. We will assume that the disturbance for each context item bias is independent of the others, so that $Q$ is a diagonal matrix. In addition, we will assume that the disturbance covariance is identical within each category, with a separate covariance for the mean bias. Thus, for our four category example given above, $Q$ will be parameterized as

$$Q = \mathrm{diag}\,[\,\underbrace{\beta_0}_{\text{common mean}}\;\; \underbrace{\beta_1\;\beta_1\;\beta_1\;\beta_1}_{\text{category 1}}\;\; \underbrace{\beta_2\;\beta_2\;\beta_2\;\beta_2}_{\text{category 2}}\;\; \underbrace{\beta_3\;\beta_3\;\beta_3}_{\text{category 3}}\;\; \underbrace{\beta_4\;\beta_4\;\beta_4\;\beta_4}_{\text{category 4}}\,].$$

We add a parameter for $R$, $R = \beta_5$, and for the initial uncertainty, $p = \beta_6$, so that $\beta = \begin{bmatrix} \beta_0 & \beta_1 & \beta_2 & \beta_3 & \beta_4 & \beta_5 & \beta_6 \end{bmatrix}'$ has seven elements.
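A sketch of this tuning loop under stated assumptions: build_Q and kf_predict are hypothetical helpers (the latter runs the Kalman filter of Section V and returns the innovations $e_k = z_k - H_{\alpha_k}\hat{x}^-_k$ and variances $\sigma_k^2$); scipy's minimizer is applied to $-L(\beta)$ with a log re-parameterization to keep the variances positive.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(log_beta, data):
    beta = np.exp(log_beta)                        # enforce beta_0..beta_6 > 0
    Q, R, p = build_Q(beta[:5]), beta[5], beta[6]  # hypothetical helper
    e, sig2 = kf_predict(data, Q=Q, R=R, p=p)      # hypothetical filter run
    # -L(beta), eq. (28)
    return 0.5 * np.sum(e**2 / (sig2 + R) + np.log(sig2 + R))

# res = minimize(neg_log_likelihood, x0=np.log(beta0), args=(data,),
#                method="Nelder-Mead")
```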

VIII. SIMULATION RESULTS

In this section, several different simulations are performed that demonstrate some of the important issues raised in this paper.

A. Demonstration of Unobservability

As discussed in Section IV, choices made during the manufacturing process can cause the estimation problem to be unobservable. In this simulation, a production sequence is chosen where subsets of products are segregated on subsets of tools, resulting in an unobservable estimation problem (if the corrections suggested in Section IV are not taken).

1) Experiment setup: We generated a simulated manufacturing process with 3 categories: tools, technologies and products. In these categories are 20, 10 and 70 context items respectively. Seven of the products within one technology are run on only two tools. We generated the simulation data using the models described in (9) and (10) with the assumption that both the process and measurement noise are i.i.d. white Gaussian with standard deviations of $10^{-2}$ and $10^{-3}$ respectively.

2) Simulation Results: This data was applied to the standard Kalman filter using only constraints for the structural observability (i.e. $S$ of the form (19)), alongside a second filter in which $S$ also includes constraints to reflect the restriction of some products to a subset of the tools. In Figure 4 the maximum singular value of the covariance matrix $P^+_k$ is plotted. As expected, the covariance for the standard Kalman filter increases without bound. Eventually, this will cause $P^+_k$ to become ill-conditioned and give rise to numerical issues. On the other hand, the maximum singular values of the covariance matrix for the modified version of the Kalman filter are bounded. This is because $\Pi$ projects the unobservable parts of the covariance matrix to zero. Note that when using $S = 0$ but the appropriate $\Pi$, the thread bias estimates ($\hat{c}_k$) from the two approaches are identical, and thus numerical stability is achieved without losing accuracy.

[Fig. 4. Maximum singular value of the covariance matrix vs. iterations, for the standard and modified Kalman filters.]

[Fig. 5. PMF of the thread distribution in logarithmic scale (probability mass of a thread occurring vs. thread number).]

B. Sensitivity to Q and R

Implementation of non-threaded estimation requires specifying the covariances $Q$ and $R$ as part of the model. In this section, we demonstrate that these matrices can be estimated from data, and also show the sensitivity to choosing the wrong values.

1) Data Generation: Data was generated using a simulated manufacturing process with four categories and a total of 444 context items. Adding the mean to these context items, the state vector contains 445 elements. The four categories have 37, 22, 149 and 236 context items respectively. There are 1192 threads in the process with 20000 runs. In semiconductor manufacturing there are often a few threads that account for most of the runs in the factory, which are called high runners, while the rest of the threads occur rarely. To simulate this effect, we used a probability mass function for the thread distribution and generated the $H$ matrix accordingly. The PMF of the thread distribution is shown in logarithmic scale in Figure 5: there is a fixed probability for more than 90% of the threads, and an exponentially increasing probability for the rest. 20000 samples were generated using the random walk model

$$X_{k+1} = X_k + w_k$$
$$C_k = H_k X_k + v_k$$

where $X_k$ corresponds to the states, $w_k$ is the process noise and $v_k$ is the measurement noise. $C_k$ is the bias value obtained from this random walk model that will be used as the data for analysis. The covariances of the process and measurement noises are assumed to be $10^{-8} I$ and $10^{-6}$ respectively.
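A sketch of this data generation follows; sample_thread_H, which draws a thread according to the PMF of Figure 5 and returns its selection vector, is an assumed helper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, runs = 445, 20000
X = np.zeros(n)                        # context item biases
C = np.empty(runs)
for k in range(runs):
    H = sample_thread_H(rng)           # assumed sampler over the thread PMF
    C[k] = H @ X + np.sqrt(1e-6) * rng.standard_normal()  # measurement noise
    X = X + np.sqrt(1e-8) * rng.standard_normal(n)        # random walk step
```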

2) Simulation Results: Using the first 10000 runs, the $Q$ and $R$ matrices were estimated using the maximum likelihood approach of Section VII. On the remaining 10000 runs, the Kalman filter was used to predict the bias, with the results shown in Table I. The first row gives the $Q$ and $R$ values that were used to generate the data. The second row shows the estimated values of $Q$ and $R$ using the approach presented in this paper, and the last rows show the mean squared error for the same runs when erroneous values of $Q$ and $R$ are used. As we can see, the estimation of $Q$ and $R$ has a huge effect on the accuracy of the estimates.

TABLE I
THE SIMULATION RESULTS FOR THE KALMAN FILTER

|                  | Q                                                    | R          | RMSE     |
|------------------|------------------------------------------------------|------------|----------|
| true values      | diag{10⁻⁸, 10⁻⁸, 10⁻⁸, 10⁻⁸}                         | 10⁻⁶       | 0.015807 |
| ML estimates     | diag{8.47×10⁻⁹, 3.64×10⁻⁸, 2.857×10⁻⁸, 6.499×10⁻⁸}   | 6.499×10⁻⁷ | 0.015739 |
| erroneous values | diag{10⁻⁷, 10⁻⁷, 10⁻⁶, 10⁻⁹, 10⁻¹⁰}                  | 10⁻⁶       | 0.067124 |
| erroneous values | diag{10⁻⁸, 10⁻⁸, 10⁻⁸, 10⁻⁸}                         | 10⁻³       | 0.15726  |

By choosing wrong $Q$ and $R$, the accuracy can be decreased by a factor of 10 or more.

C. Computational Complexity Comparison

In this section, after illustrating the sparsity behavior of the covariance matrix, the computational complexity and prediction accuracy of the two implementations are compared via simulation.

1) Data Generation: The data generated in the previous section is used in this section as well.

2) Sparsity of the covariance/information matrix: As an illustration of sparsity, consider estimating over a fixed window-size $w$. Initially, the covariance and information matrices are the identity, and thus very sparse. The nonzero elements that occur in the covariance and information matrices during the run will depend on the combination of the states that occur within the window. Figure 6 shows the percentage of non-zero elements of the covariance and information matrices versus the number of updates that have been performed. Clearly, the sparsity is significant, especially over the shorter window-sizes. It is also clear that the pattern of the nonzero elements for the covariance and information matrices is identical. These plots are illustrated for $n = 445$ and $w = 1000$.

[Fig. 6. Percentage of non-zero elements in the covariance and information matrices vs. iterations.]

3) Simulation Results: In order to compare the two implementations, for each simulation the run-time and estimation error are measured for a variety of problem sizes. The experiments considered an estimation process where a fixed window of past data is utilized to predict the current wafer. For each experiment, predictions for 100 wafers were performed, with the run time measured and the estimated bias compared with the real bias, reported as root mean squared error. The simulations were performed on a Mac Pro with a 2.4 GHz Intel Xeon Quad Core processor and 24 GB of 1067 MHz memory. The results are shown in Table II. The parameters in the table are defined at the beginning of Section VI, and Mean represents the average magnitude of the bias being estimated. Note that the information filter runs significantly faster for larger $d$, and that increasing $d$ does not significantly affect the estimation error.

TABLE II
THE SIMULATION RESULTS FOR NON-THREADED HIGH-MIX ESTIMATION

| w    | d   | n   | κ | Mean   | KF RMSE | KF Run-Time (sec) | IF RMSE | IF Run-Time (sec) |
|------|-----|-----|---|--------|---------|-------------------|---------|-------------------|
| 1000 | 1   | 445 | 5 | 0.8901 | 0.0796  | 586               | 0.0796  | 2080              |
| 1000 | 10  | 445 | 5 | 0.8901 | 0.0796  | 586               | 0.0796  | 364               |
| 1000 | 50  | 445 | 5 | 0.8901 | 0.0796  | 586               | 0.0795  | 194               |
| 1000 | 100 | 445 | 5 | 0.8901 | 0.0796  | 586               | 0.0795  | 173               |
| 1000 | 1   | 209 | 4 | 0.5683 | 0.0459  | 144               | 0.0461  | 419               |
| 1000 | 10  | 209 | 4 | 0.5683 | 0.0459  | 144               | 0.0459  | 92                |
| 1000 | 50  | 209 | 4 | 0.5683 | 0.0459  | 144               | 0.0455  | 64                |
| 1000 | 100 | 209 | 4 | 0.5683 | 0.0459  | 144               | 0.0458  | 61                |
| 5000 | 1   | 445 | 5 | 0.8901 | 0.0168  | 3064              | 0.0168  | 9780              |
| 5000 | 10  | 445 | 5 | 0.8901 | 0.0168  | 3064              | 0.0168  | 1614              |
| 5000 | 50  | 445 | 5 | 0.8901 | 0.0168  | 3064              | 0.0168  | 879               |
| 5000 | 100 | 445 | 5 | 0.8901 | 0.0168  | 3064              | 0.0168  | 788               |
| 5000 | 1   | 209 | 4 | 0.5686 | 0.0021  | 715               | 0.0024  | 2048              |
| 5000 | 10  | 209 | 4 | 0.5686 | 0.0021  | 715               | 0.0021  | 436               |
| 5000 | 50  | 209 | 4 | 0.5686 | 0.0021  | 715               | 0.0025  | 293               |
| 5000 | 100 | 209 | 4 | 0.5686 | 0.0021  | 715               | 0.0023  | 275               |

(KF: Kalman filter; IF: information filter.)

4) Complexity vs. window-size: The window-size $w$ has two effects on the complexity of both the Kalman and information filters:

• Increasing the window-size will linearly increase the number of iterations, which results in a linear increase in the complexity. This is true under the constraint that the computer doesn't run out of memory; in the case that memory is not sufficient for larger $w$, the increase will be nonlinear.
• An increase in the size of the window can have a second effect as well. The number of nonzero elements in the covariance/information matrices increases with the size of the window, due to the fact that more combinations of context items appear in a larger window. This effect will change the structure of the matrices from sparse to full, causing slower performance.

Figure 7 shows the run-time versus window-size for both the Kalman filter and the information filter. The plots are generated for window-sizes between 500 and 10000. The setup of the experiment was: $n = 209$, $d = 10$ and $\kappa = 4$. As illustrated, the complexity of both the Kalman and information filter implementations increases linearly with window-size, so the first effect of window-size is much more powerful than the second for this experiment.

[Fig. 7. Run-time vs. window-size $w$ for the Kalman and information filters.]

5) Complexity vs. number of context items, n: Figure 8 shows the plot of the order of complexity versus $n$ for fixed $M_k = 500$ and $\kappa = 5$, for the two cases $d = w$ and $d = 1$. In the case $d = w$, the time-update is skipped and the plots show the measurement-update cost; as illustrated, the information filter cost is very small compared with the Kalman filter. In the case $d = 1$, both the measurement and time-updates are at full cost. Note that the Kalman filter cost remains the same for both cases, because the time-update is very cheap in the Kalman filter implementation. This figure is generated assuming the covariance/information matrices remain sparse.

[Fig. 8. Order of complexity vs. $n$ for $d = w$ and $d = 1$.]

6) Complexity vs. d: The parameter $d$ only affects the time-update complexity. In the information filter implementation, the effect of $d$ is large, because the main algorithm complexity is due to the time-update, whereas in the Kalman filter implementation, the time-update is very cheap and increasing $d$ does not make a considerable improvement. Figure 9 shows the change in run-time with respect to the parameter $d$. As illustrated in Figure 9, the run-time of the Kalman filter is almost constant, but the run-time of the information filter converges to the run-time of the measurement-update alone, which is very small. This figure is generated for $w = 2000$, $n = 209$ and $\kappa = 4$.

[Fig. 9. Run-time vs. $d$ for the Kalman and information filters.]

IX. CONCLUSIONS

In this paper we showed that the non-threaded bias estimation problem can suffer from unobservability issues due to both the structure of the model and the particular processing choices made. Approaches to mitigate these observability issues were presented, and their properties demonstrated both analytically and numerically.

Two different implementations of non-threaded estimation were presented: the Kalman filter and the information Kalman filter. The computational complexity of each implementation was studied as a function of the problem parameters. We also presented a maximum likelihood method for estimating the covariance parameters of the model, and demonstrated via simulations the sensitivity of the prediction error to these parameters.

X. APPENDIX

A. Proof of Theorem 2

Proof: If $A_i$, $B_i$ is a complete characterization, then

$$H = \begin{bmatrix} H_{\alpha_{k_s}} \\ \vdots \\ H_{\alpha_{k_c}} \end{bmatrix}$$

is rank $n - p$ and $S$ is rank $p$. Since $HS' = 0$, the matrix

$$\begin{bmatrix} H_{\alpha_{k_s}} \\ \vdots \\ H_{\alpha_{k_c}} \\ S \end{bmatrix}$$

must be rank $n$. Without loss of generality assume that $k_s = 1$. Assume that (20) has multiple solutions, and that $\{\tilde{x}_k\}_1^{k_c}$ and $\{\hat{x}_k\}_1^{k_c}$ are two arbitrary solutions to it. Let $\tilde{x}$ and $\hat{x}$ be the "vectorized" forms of these sequences. Set

$$B = \begin{bmatrix}
H_1 & 0 & 0 & \cdots & 0 & 0 \\
I & -I & 0 & \cdots & 0 & 0 \\
0 & H_2 & 0 & \cdots & 0 & 0 \\
0 & I & -I & \cdots & 0 & 0 \\
\vdots & & & \ddots & & \vdots \\
0 & 0 & 0 & \cdots & I & -I \\
0 & 0 & 0 & \cdots & 0 & H_{k_c} \\
0 & 0 & 0 & \cdots & 0 & S
\end{bmatrix}$$

and we can write (20) as

$$\min_{\{x_k\}_1^{k_c}} x' B' \left( \mathrm{diag}\{R, Q\Delta t_1, R, Q\Delta t_2, \ldots, R_c\} \right)^{-1} B x$$

where the diag operator forms the block diagonal of its inputs and $x$ is the vectorized form of $\{x_k\}_1^{k_c}$ (shifted by the data, which does not affect the uniqueness argument). If $\tilde{x}$ and $\hat{x}$ have the same value of the objective function, then since $R$ and $Q$ are invertible, $\Delta x = \tilde{x} - \hat{x}$ must be in the null space of $B$, so that $B\Delta x = 0$. By simple column operations on $B$ we can obtain the following, for an invertible matrix $T$:

$$BT = \begin{bmatrix}
H_1 & 0 & 0 & \cdots & 0 & 0 \\
0 & -I & 0 & \cdots & 0 & 0 \\
H_2 & H_2 & 0 & \cdots & 0 & 0 \\
0 & 0 & -I & \cdots & 0 & 0 \\
H_3 & H_3 & H_3 & \cdots & 0 & 0 \\
\vdots & & & \ddots & & \vdots \\
0 & 0 & 0 & \cdots & 0 & -I \\
H_{k_c} & H_{k_c} & H_{k_c} & \cdots & H_{k_c} & H_{k_c} \\
S & S & S & \cdots & S & S
\end{bmatrix}$$

By writing $B$ in this form one can clearly see that it is full column rank if and only if the first block column is full column rank, which holds by the rank-$n$ property above. In this case $B\Delta x = 0$ implies $\Delta x = 0$, so the minimization problem has a unique solution.

Now suppose $\{\hat{x}_k\}_{k_s}^{k_c}$ is a minimizer of (13). Let $\bar{x}_k = \hat{x}_k - S'(SS')^{-1}S\hat{x}_{k_c}$. From the definition of $S$, we have $H_{\alpha_k} S' = 0$ for all $\alpha_k$, so $H_{\alpha_k}\bar{x}_k = H_{\alpha_k}\hat{x}_k$. Since the same constant is added to each term, $\bar{x}_k - \bar{x}_{k-1} = \hat{x}_k - \hat{x}_{k-1}$. These two facts demonstrate that the value of the objective function is the same; thus $\bar{x}_k$ is also a minimizer of (13). Furthermore, since $S\bar{x}_{k_c} = 0$, the solution of (20) will approach $\bar{x}_{k_c}$ as $R_c \to \infty$. ∎

B. Proof of Theorem 3

Proof: Assume that (21) has multiple solutions and $\tilde{x}$ and $\hat{x}$ are two arbitrary solutions to it. Without loss of generality assume that $k_s = 1$. Then $\Delta x = \tilde{x} - \hat{x}$ has to be in the null space of the following matrix:

$$A = \begin{bmatrix}
H_1 & 0 & 0 & \cdots & 0 & 0 \\
I & -I & 0 & \cdots & 0 & 0 \\
0 & H_2 & 0 & \cdots & 0 & 0 \\
0 & I & -I & \cdots & 0 & 0 \\
\vdots & & & \ddots & & \vdots \\
0 & 0 & 0 & \cdots & H_{k_c-1} & 0 \\
0 & 0 & 0 & \cdots & \Pi & -I \\
0 & 0 & 0 & \cdots & 0 & H_{k_c}
\end{bmatrix}$$

so $A\Delta x = 0$, which immediately results in the following:

$$H_i \Delta x_i = 0$$
$$\Delta x_1 = \Delta x_2 = \ldots = \Delta x_{k_c-1}$$
$$\Delta x_{k_c} = \Pi \Delta x_{k_c-1}$$

Using the fact that $H_{k_c}\Pi = H_{k_c}$ and eliminating all $\Delta x_i$ but $\Delta x_{k_c-1}$, we have

$$\begin{bmatrix} H_1' & H_2' & \ldots & H_{k_c-1}' & H_{k_c}' \end{bmatrix}' \Delta x_{k_c-1} = 0$$

By assumption, the matrix on the left has the same null space as $\Pi$. Thus $\Pi\Delta x_{k_c-1} = 0$, but since $\Delta x_{k_c} = \Pi\Delta x_{k_c-1}$, $\Delta x_{k_c} = 0$. Thus all of the multiple solutions of (21) have the same value at $k = k_c$. ∎

C. Theoretical complexity calculations

In this section we will use the same parameters that were defined in Section VI. The complexity of the calculations of the Kalman and information filters is computed equation by equation in terms of the number of multiplies [10]. Assuming a square matrix of size $n \times n$ and a vector of size $n \times 1$, some general rules and facts are as follows:



• Assume $A_{m \times n}$, $B_{n \times p}$ and $C_{n \times n}$ are sparse matrices with $S_1$, $S_2$ and $S$ nonzero elements respectively.
• The product $AB$ has $\min\{pS_1, mS_2\}$ multiplies.
• The cost of inverting the matrix $C$ is also proportional to $S$. Computational complexity also depends linearly on the row/column size $n$ of the matrix, but is independent of $n^2$, the total number of zero and nonzero elements.
• All the complexities are calculated for one step.

1) Kalman Filter Complexity (step-wise):

Time-update:
$$P_k^- = \begin{cases} pI & k = k_s \\ P_{k-1}^+ + Q\Delta t_k & k > k_s \end{cases}$$
A scalar times a diagonal matrix has at most $n$ multiplies.

Measurement-update:
$$K_k = P_k^- H'_{\alpha_k} (H_{\alpha_k} P_k^- H'_{\alpha_k} + R)^{-1}$$
$$\hat{x}_k = \hat{x}_{k-1} + K_k (z_k - H_{\alpha_k} \hat{x}_{k-1})$$
$$P_k^+ = (I - K_k H_{\alpha_k}) P_k^-$$

The complexity of the first equation is $O(n \min\{\kappa, \frac{M_k}{n}\} + \kappa + n)$: the product $H_{\alpha_k}(P_k^- H'_{\alpha_k})$ has complexity $O(\min\{\kappa, \mathrm{nnz}(P_k^- H'_{\alpha_k})\} + n\min\{\kappa, \frac{M_k}{n}\}) \le O(\kappa + n\min\{\kappa, \frac{M_k}{n}\})$; the (scalar) inverse is negligible and the product with $P_k^- H'_{\alpha_k}$ has at most $n$ multiplies.

The complexity of the second equation is $O(\kappa + n)$: $K_k(z_k - H_{\alpha_k}\hat{x}_{k-1})$ involves a product of a scalar and a vector, which has complexity $O(n + \kappa)$; the complexity of the summation with the previous state is negligible.

The complexity of the third equation is $O(\kappa n + n^2 \min\{\frac{M_k}{n}, \kappa\})$: $(I - K_k H_{\alpha_k})$ has complexity $O(\kappa n)$ and its multiplication by $P_k^-$ has complexity $O(n^2 \min\{\frac{M_k}{n}, \kappa\})$.

Kalman Filter complexity for window-size $w$:
Time-update: $O\!\left(\frac{wn}{d}\right)$
Measurement-update: $O\!\left(w(n^2 \min\{\kappa, \frac{M_k}{n}\} + (n+2)\kappa + 2n)\right)$

2) Information Filter Complexity (step-wise):

Time-update:
$$\Psi_k^- = \left( (\Psi_{k-1}^+)^{-1} + Q(t_k - t_{k-d}) \right)^{-1}$$
$$\psi_k^- = \left( I - \Psi_k^- Q(t_k - t_{k-d}) \right) \psi_{k-1}^+$$

The complexity of the first equation is $O(2nM_k + n)$: $(\Psi_{k-1}^+)^{-1} + Q(t_k - t_{k-d})$ has $O(nM_k)$ complexity plus $n$ multiplies, and its inverse also has complexity of order $O(nM_k)$.

The complexity of the second equation is $O(n^2 + n + M_k)$: in $I - \Psi_k^- Q(t_k - t_{k-d})$, the scalar times a diagonal matrix takes at most $n$ multiplies, and the sparse matrix $\Psi$ times the diagonal matrix $Q$ has $O(n^2)$ complexity; the product of the result with $\psi_{k-1}^+$ has complexity $n \cdot \frac{M_k}{n} = M_k$.

Measurement-update:
$$\Psi_k^+ = \Psi_k^- + H'_{\alpha_k} R^{-1} H_{\alpha_k}$$
$$\psi_k^+ = \psi_k^- + H'_{\alpha_k} R^{-1} z_k$$

The complexity of the first equation is $O((n+1)\kappa)$: $\Psi_k^- + H'_{\alpha_k} R^{-1} H_{\alpha_k}$ has complexity $O((n+1)\kappa)$ and the complexity of the summation is negligible. The complexity of the second equation is $O(\kappa + 1)$: $\psi_k^- + H'_{\alpha_k} R^{-1} z_k$ has at most $\kappa + 1$ multiplies.

Information Filter complexity for window-size $w$:
Time-update: $O\!\left(\frac{w((2n+1)M_k + n^2)}{d}\right)$
Measurement-update: $O(w((n+2)\kappa + 1))$

REFERENCES

[1] C.A. Bode, J. Wang, Q.P. He, and T.F. Edgar. Run-to-run control and state estimation in high-mix semiconductor manufacturing. Annual Reviews in Control, 31(2):241–253, 2007.
[2] M.W. Braun, S.T. Jenkins, and N.S. Patel. A comparison of supervisory control algorithms for tool/process disturbance tracking. In Proceedings of the 2003 American Control Conference, volume 3, pages 2626–2631. IEEE, 2003.
[3] C.T. Chen. Linear System Theory and Design. Oxford University Press, Inc., 1998.
[4] Thomas H. Cormen, Clifford Stein, Ronald L. Rivest, and Charles E. Leiserson. Introduction to Algorithms. McGraw-Hill Higher Education, 2nd edition, 2001.
[5] S.K. Firth, W.J. Campbell, A. Toprac, and T.F. Edgar. Just-in-time adaptive disturbance estimation for run-to-run control of semiconductor processes. IEEE Transactions on Semiconductor Manufacturing, 19(3):298–315, 2006.
[6] C.K. Hanish et al. Run-to-run state estimation in systems with unobservable states. In Proceedings of AEC/APC Symposium XVII, 2005.
[7] A.H. Jazwinski. Stochastic Processes and Filtering Theory. Academic Press, 1970.
[8] M.D. Ma, C.C. Chang, D.S.H. Wong, and S.S. Jang. Identification of tool and product effects in a mixed product and parallel tool environment. Journal of Process Control, 19(4):591–603, 2009.
[9] Ming-Da Ma, Chun-Cheng Chang, David Shan-Hill Wong, and Shi-Shang Jang. A novel mixed product run-to-run control algorithm: dynamic ANCOVA approach. In Proc. European Congress of Chemical Engineering, Copenhagen, 16–20 September, 2007.
[10] Mathworks. Matlab Software. http://www.mathworks.com/help/matlab/math/sparse-matrix-operations.html, 2013. [Online; accessed 16-April-2013].
[11] R. Mehra. Approaches to adaptive filtering. IEEE Transactions on Automatic Control, 17(5):693–698, 1972.
[12] Michael L. Miller. Impact of multi-product and -process manufacturing on run-to-run control. In Proc. SPIE, volume 3213, pages 138–146, 1997.
[13] A.J. Pasadyn and T.F. Edgar. Observability and state estimation for multiple product control in semiconductor manufacturing. IEEE Transactions on Semiconductor Manufacturing, 18(4):592–604, 2005.
[14] N.S. Patel. Model regularization for high-mix control. IEEE Transactions on Semiconductor Manufacturing, 23(2):151–158, 2010.
[15] A.V. Prabhu and T.F. Edgar. A new state estimation method for high-mix semiconductor manufacturing processes. Journal of Process Control, 19(7):1149–1161, 2009.
[16] J. Stoer and R. Bulirsch. Introduction to Numerical Analysis. Springer, New York, 3rd edition, August 2002.
[17] A.J. Su, C.C. Yu, J.C. Jeng, H.P. Huang, C.J. Yang, H.W. Chiou, and S.C. Yang. Context-based state estimation in semiconductor manufacturing: reference path based state transformation approach. In Proc. 17th IFAC World Congress, Seoul, South Korea, July 6–11, 2008.
[18] J. Wang, Q. Peter He, and T.F. Edgar. State estimation in high-mix semiconductor manufacturing. Journal of Process Control, 19(3):443–456, 2009.
