Robust Fault Detection and Fault Classification of Semiconductor Manufacturing Equipment

Anna M. Ison and Costas J. Spanos
Department of EECS, University of California, Berkeley, CA 94720-1772
office: (510) 642-9584, fax: (510) 642-2739, email:
[email protected], WWW: http://bcam.eecs.berkeley.edu

Abstract

In this paper we extend our current multivariate statistical process control system for fault detection to deal with long term variability on a lot-to-lot basis. Long term trends in optical emission data collected from a plasma etcher are characterized through data transformations and linear modeling techniques. By filtering the known effects of machine aging, these models facilitate the integration of optical emission data with other sensor signals, resulting in a fault detection system which is robust over time. A methodology to classify the detected faults into discrete categories is also currently under development. We present the framework of a diagnostic system which incorporates various data types and accommodates uncertainty while providing a systematic method of drawing inferences from the available evidence.

1. Introduction

In order to meet the increasing demands of the semiconductor industry to improve yield while simultaneously decreasing circuit geometries, recent efforts have focused on characterizing and controlling variability in critical manufacturing processes such as plasma etching. Real-time tool signals from three sensors (SECS-II machine information, RF monitor, and optical emission spectroscopy) collected in situ provide valuable information about the machine state. Effective monitoring of these signals serves two purposes: (1) it provides a description of the machine and chamber states which can be used to predict final wafer characteristics, and (2) it provides a means of detecting and identifying equipment malfunctions in real time without interrupting the process.

Optical emission data in particular are a valuable source of information about the plasma state. However, measurements of this type exhibit atypical trends due to the confounding effects of window clouding and machine aging. This behavior is cyclical in the sense that the machine state can be "reset" by preventative maintenance (PM) events, and the resulting cycle of long term trends can inflate the false alarm rate during fault detection. This paper describes models which characterize this behavior, enabling the integration of optical emission signals with other sensor data so that real-time statistical process control (RTSPC [1]) techniques can be applied to perform fault detection. By specifically accounting for long term trends, these models partially decouple the machine state from the state of the plasma; such decoupling reduces the false alarm rate associated with preventative maintenance events, resulting in a fault detection mechanism which is robust over time.

The detection of an out-of-control condition by the fault detection mechanism indicates the possible presence of a fault. In order to confirm the hypothesis that a fault has occurred and to identify an assignable cause, a methodology to classify faults into discrete categories is developed. This paper presents the framework of a diagnostic system incorporating both qualitative information, provided through the expert knowledge of human operators, and quantitative information derived from empirical equipment models as well as historical and maintenance records. This framework provides a systematic method of drawing inferences from the available evidence and accounts for uncertainty by retaining a measure of likelihood for each classification decision. Data collected from a Lam TCP 9600 plasma etcher were used to construct the empirical models for fault detection and classification.

2. Modeling and RTSPC

Traditional statistical process control (SPC) techniques assume that the underlying process is stationary, i.e., that the mean and variance do not vary with time, and that the observations are identically, independently, and normally distributed (IIND) [2]. When trends are present in data representing normal operating behavior, applying these techniques directly to the machine data results in increased false alarm and missed alarm rates [1]. To avoid these increased rates, past work used time-series modeling techniques to filter out the time-dependent trends; traditional or multivariate SPC methods were then applied to the resulting residuals to monitor the machine behavior. This system, known as RTSPC, was shown in [1] to be effective in monitoring real-time and wafer-to-wafer data. Our investigation was motivated by the need to extend RTSPC to include long term variability on a lot-to-lot basis.

2.1 Long Term Trends in Optical Emission Data

Examination and analysis of optical emission data over long periods of time shows a different type of trend
than that typically handled by time series models. As depicted in Figure 1, the endpoint signal (a measure of the intensity of the plasma at a particular wavelength) exhibits an exponential decay. In this figure, the average value of the endpoint taken over each lot is plotted against the wafer count. Because the wafer count is reset to zero after a preventative maintenance (PM) event, the plot shows the endpoint signal evolving over the course of a maintenance cycle, where the chamber state is initially clean but becomes progressively dirtier as more wafers are processed. The trend is clearly visible and repeatable, as demonstrated by the five different maintenance cycles overlaid in the plot. The data shown in Figure 1 span a total period of eight months, during which there were five PM events corresponding to chamber and window cleans.
Figure 1: Lot averages of endpoint for five PM cycles (Intensity vs. Wafer Count (RF Time)).

2.2 Modeling the Effect of Window Clouding

Time series models capture the dependencies among a sequence of data points under the assumption that the readings are taken at regularly spaced intervals. However, because the processing of lots is rarely scheduled at such regular intervals, these models are inappropriate for dealing with optical emission data at long time scales. The problem is further complicated by the apparent exponential decay in the measured values. The exponential decay visible in the lot average value of the endpoint signal suggests the use of the log transform as a method of linearizing the data. This choice is further supported by knowledge of the plasma etch process and its effect on optical emission readings: the chamber window becomes clouded as material is progressively deposited on the window surface while wafers are etched, and this clouding in turn attenuates the sensor reading of the plasma intensity. Mathematically, the plasma intensity measurement may be modeled by the following equation:

    I(z) = I_0 e^{-\alpha z}    (1)

where the intensity I decreases exponentially with the thickness z of the deposited material, and the exponential decay constant \alpha is related to the absorption properties of the material. Assuming that the accumulation of deposited material varies as a linear function of time,

    z = z_1 + z_2 \cdot RFtime    (2)

the expression for measured intensity as a function of RF time becomes

    I(RFtime) = I_0 e^{-\alpha z_1} e^{-\alpha z_2 \cdot RFtime}    (3)

Taking the logarithm of equation (3) results in a linear expression relating the log of the intensity to RF time:

    \log I(RFtime) = \log I_0 - \alpha z_1 - \alpha z_2 \cdot RFtime

3. Analysis and Results

To extend the monitoring system and fault detection capability (RTSPC [1]) to accommodate lot-to-lot trends, the optical emission data are first filtered through a log transformation and then modeled using linear regression techniques.

3.1 Linearization of Optical Emission Data

The linear regression model uses wafer count as an input parameter in order to account for the effect of RF time. Figure 2 depicts the transformed data from Figure 1 for the five maintenance cycles. Note that, as expected, the transformation has linearized the data. After the linear trend is filtered out, the resulting linear regression model residuals are filtered using time-series models in order to remove the remaining time dependencies.

Figure 2: Lot averages of transformed endpoint for five PM cycles (Log(Intensity) vs. Wafer Count (RF Time)).
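As a concrete illustration of the filtering described in this section, the sketch below log-transforms lot-averaged endpoint data, removes the window-clouding trend by linear regression on wafer count, and whitens the remaining residuals with a first-order autoregressive fit. This is a minimal sketch only: the array names are hypothetical and the AR(1) order is an assumption, since the paper does not state the order of the time-series model used.

```python
# Minimal sketch of the lot-to-lot filtering pipeline (assumed AR(1) order).
import numpy as np

def filter_endpoint(endpoint_lot_avg, wafer_count):
    """Return approximately IIND residuals from lot-averaged endpoint data."""
    # The log transform linearizes the exponential decay of equation (3).
    log_i = np.log(endpoint_lot_avg)

    # Linear regression on wafer count (a proxy for RF time) removes the
    # deterministic clouding trend: log I = b0 + b1 * wafercount.
    b1, b0 = np.polyfit(wafer_count, log_i, deg=1)
    resid = log_i - (b0 + b1 * wafer_count)

    # An AR(1) fit removes the remaining lot-to-lot autocorrelation; the
    # innovations should then resemble white noise suitable for SPC.
    phi = np.dot(resid[:-1], resid[1:]) / np.dot(resid[:-1], resid[:-1])
    return resid[1:] - phi * resid[:-1]
```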
3.2 Improved Fault Detection

Models to filter trends from the various sensor signals are built using data obtained during normal operating behavior of the machine (baseline data). If the models are formulated appropriately, the resulting residuals resemble IIND random variables. These residuals can be monitored separately using Shewhart control charts. However, because the signals are measurements of the same physical process, there are cross-correlations among the different signal residuals. To account for these cross-correlations, the Hotelling T² statistic is used to combine the individual IIND residuals into a single statistical score. In previous work [1], this method was displayed in the form of a double T² chart plotting scores corresponding to wafer-to-wafer and real-time scales respectively. Here we extend this result by constructing a double T² chart plotting scores corresponding to lot-to-lot and wafer-to-wafer time scales; these are shown in Figure 3 as bars and lines respectively.

The double T² chart for one maintenance cycle is shown in Figure 3(a). This plot was generated using only a time series model, constructed from the original lot averages of the endpoint signal, as a filter. Although the analysis used baseline data, the model produced false alarms (dark bars) at the beginning of the cycle. Examination of the individual signal residuals shows that the problem is indeed caused by the failure of the time-series model to accurately represent the apparent exponential decay in the endpoint signal.
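To make the combination step concrete, the following sketch computes a per-lot Hotelling T² score from a matrix of filtered signal residuals, together with an approximate control limit. The function and variable names and the 99% chi-square limit are illustrative assumptions; [1] gives the exact charting procedure used by RTSPC.

```python
# Sketch of a Hotelling T^2 score over whitened signal residuals.
import numpy as np
from scipy import stats

def hotelling_t2(residuals):
    """residuals: (n_lots, n_signals) array of baseline-filtered residuals.
    Returns per-lot T^2 scores and an approximate upper control limit."""
    centered = residuals - residuals.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(residuals, rowvar=False))
    # T^2_i = r_i' S^{-1} r_i combines the cross-correlated residuals of
    # all signals into a single statistical score per lot.
    t2 = np.einsum('ij,jk,ik->i', centered, cov_inv, centered)
    # Large-sample approximation: chi-square limit with n_signals degrees
    # of freedom at a 99% confidence level (an assumed choice).
    ucl = stats.chi2.ppf(0.99, df=residuals.shape[1])
    return t2, ucl
```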
Figure 3: Double T² Charts. (a) Baseline Chart Using Original Data; (b) Baseline Chart Using Transformed Data; (c) Production Chart Using Transformed Data.

Figure 3(b) depicts a similar double T² chart after the log transformation, followed by linear regression and time-series filtering as described above. The plot shows that the false alarms due to the decay have been eliminated, and thus the models have effectively captured the long term trend. Figure 3(c) is a plot of the T² chart of production data with known injected faults. The figure demonstrates that the known faults were detected on a lot-to-lot basis. (No further investigation was done regarding the alarms at the wafer-to-wafer level.)

4. A Framework for Fault Classification

Once a fault has been detected, the next step is to identify a probable cause of the abnormal behavior. An expert system is currently being developed which combines various data types, including qualitative as well as quantitative information, in order to draw inferences regarding the state of the machine. The system takes the fault space (i.e., all possible faults as determined by the available data) and decomposes this space into successively finer categories. Each division of the fault space has an associated degree of likelihood, which is also calculated from the evidence, or "training" data set. For this application, the sources of evidence include the real-time tool signals collected from the sensors, equipment history and machine maintenance records, operator observations, process or equipment alarms, and equipment models.

Figure 4: A Diagnostic Framework. The fault detection module (double T²) feeds Stage I, which separates nominal from faulty behavior in terms of pressure, power, and gas flow; Stage II, which distinguishes the sensor from the actual state given the input setting; and Stage III, which assigns likelihoods to machine components.
As shown in Figure 4, the first classification stage decomposes the fault space into broad categories corresponding to changes in the machine input settings from their nominal or target values. One assumption is that real machine faults will exhibit symptoms which can be simulated by changing the input pressure, power, and/or gas flows. The training data set for this stage can be obtained through carefully designed experiments (DOEs) in which the input settings are changed from their nominal target
values, forming a discrete set of "recipes". The signatures of the real-time signals corresponding to these known changes in the input settings will be modeled and categorized into the discrete bins. A separate validation data set, not used in the construction of the classification models, will be used to test the performance of the first stage. Production data can thus be compared with the signatures of the real-time signals collected from the DOEs and classified into discrete categories corresponding to changes in the input settings. The likelihood value assigned to each classification will be determined according to how closely the production data resemble the predictions of the models.

The second stage can be considered sensor validation: a determination is made as to whether the fault is a problem with the sensor, or whether the actual machine state is abnormal. In order to validate the sensor output, redundant models using signals from other sensors will be constructed for the same parameter of interest. Ideally, these signals will be independent of the sensor to be validated. For this reason, accurate equipment modeling which captures the relationships among subsystems, components, and signals is critical to the success of this methodology.

Finally, in the third stage, after checking that the fault is not an incorrect input setting, the system calculates a likelihood value that the fault is due to a particular subsystem or machine component. Influence diagrams are used to capture engineering intuition from operators, and to track causality and independence assumptions among the subsystems and components of the machine. Figure 5 shows an example of a simple influence diagram of the chamber pressure system. The system consists of several components including pumps, valves, controllers, and sensors. In this case, a backing pump is connected to a turbo pump via the turbo isolation valve, and to the chamber via the bypass isolation valve. The control gate valve connects the turbo pump to the chamber. A valve controller determines the position of this control gate valve by monitoring the pressure through a sensor (the manometer) and adjusting the valve position until the reading agrees with the input setting, or setpoint, as specified by the operator. Knowledge of the physical system is crucial to identifying potential problems and the propagation of faults through a subsystem.

Figure 5: Equipment Model for Pressure. The influence diagram connects the setpoint, valve controller, control gate valve, manometer, turbo pump, turbo isolation valve, backing pump, bypass isolation valve, and chamber.
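As an illustration of the stage I likelihood assignment described above, the sketch below scores a production signature against signature models built from the DOE recipes and returns a relative likelihood per category. The Gaussian likelihood and the data structures are assumptions chosen for illustration; the framework deliberately leaves the likelihood-estimation method open.

```python
# Hypothetical stage I classifier: Gaussian likelihoods over DOE signatures.
import numpy as np

def classify_stage1(signature, recipe_models):
    """signature: feature vector extracted from production data.
    recipe_models: dict mapping a category (e.g. 'nominal', 'pressure',
    'power', 'gas flow') to a (mean_vector, covariance) pair estimated
    from the DOE training data."""
    scores = {}
    for category, (mu, cov) in recipe_models.items():
        diff = signature - mu
        mahal = diff @ np.linalg.inv(cov) @ diff  # Mahalanobis distance
        # Unnormalized Gaussian likelihood; the (2*pi)^(p/2) factor is
        # omitted since it cancels in the normalization below.
        scores[category] = np.exp(-0.5 * mahal) / np.sqrt(np.linalg.det(cov))
    total = sum(scores.values())
    # Relative likelihood: how closely the production data resemble each model.
    return {c: s / total for c, s in scores.items()}
```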
5. Conclusions and Future Work

The models developed to account for long term trends are consistent with physical equations describing the effect of window clouding on the measured data. Furthermore, the results are repeatable over several preventative maintenance (PM) cycles, with little variation in the linear regression model from one cycle to the next. This suggests that a simple linear adaptive model may be used to effectively predict the behavior of a cycle, even after a change of the machine state as drastic as that produced by a PM event. Future work includes fine-tuning the empirical models and methodology for fault detection, and using these models as points of reference for identifying assignable causes for the faults.

Although much data from various sources have been collected for fault classification, the diagnostic system under development is still in its infancy. We have presented a general framework for this system which is flexible enough to accommodate the available types of evidence, and which allows for different methods of estimating likelihood or probability values for each classification. Such a system promises to be invaluable to the operator, especially as a trouble-shooting tool for finding problems early, thus preventing the propagation of faults and further damage to the machine. A problem can then be resolved before it ever affects the final product.

Acknowledgments

The authors are grateful to Texas Instruments, Digital Semiconductor, Lam Research, and the SRC (95FP-700) for support of this research. We are also grateful to the staff of the Berkeley Microfabrication Laboratory for their help and expertise.

References

[1] S. F. Lee, E. D. Boskin, H. C. Liu, E. Wen, and C. J. Spanos, "RTSPC: A Software Utility for Real-Time SPC and Tool Data Analysis," IEEE Trans. Semiconductor Manufacturing, vol. 8, no. 1, Feb. 1995, pp. 17-25.

[2] D. C. Montgomery, Introduction to Statistical Quality Control, 2nd ed., John Wiley & Sons, 1991.