Rule-based Data Mining for Yield Improvement in Semiconductor Manufacturing

Sholom M. Weiss (1), Robert J. Baseman (1), Fateh Tipu (1), Christopher N. Collins (2), William A. Davies (2), Raminderpal Singh (2), and John W. Hopkins (3)

(1) IBM Research, Yorktown Heights, NY 10598, USA
(2) IBM Systems and Technology Group, Hopewell Junction, NY 12533
(3) IBM Systems and Technology Group, Essex Junction, VT 05452
Abstract. We describe an automated system for improving yield, power consumption and speed characteristics in the manufacture of semiconductors. Data are continually collected in the form of a history of tool usage and of electrical and other real-valued measurements, a feature space of tens of thousands of dimensions. Unique to this approach is the inference of patterns in the form of binary regression rules that demonstrate a significantly higher or lower performance value for tools relative to the overall mean for that manufacturing step. Results are filtered by knowledge-based constraints, increasing the likelihood that empirically validated rules will prove interesting and worth further investigation. This system is currently installed in the IBM 300mm fab, manufacturing game chips and microprocessors. It has detected numerous opportunities for yield and performance improvement, saving many millions of dollars.
1 Introduction
Microprocessor chips are ubiquitous, embedded in mass market devices like electronic games. The chips that power business computers and games are powerful devices. Their manufacture is no routine process. A game-playing chip like the Cell processor takes months to produce. Starting from the initial wafer, the chips are produced by the application of hundreds of steps and tools. Thousands of measurements are taken and recorded to monitor results at different stages of chip production. Given the complexity of these processes and the long periods needed to manufacture a microprocessor, it is not surprising that extensive efforts have been made to collect data and mine them looking for patterns that can eventually lead to improved productivity [1], [2], [3], [4]. If a tool fails acutely, manufacturing engineers have ample techniques to immediately determine the source of failure. However, many opportunities for yield improvement are far more subtle [5], [6]. Months may pass before a chip is completed, hence the great interest in mining data prior to final testing [7], [8], [9]. Some chips will fail these tests, and the percentage of successfully manufactured chips, especially during the early stages of manufacturing, may be significantly below 100%. The volume of collected data, numbering tens of thousands of measured values for each wafer, suggests
the need for analytical processes to extract patterns from the data, for example those patterns that lead to unusually high or low yields of successful chips or those that lead to superior or substandard performance characteristics for chip speed or power consumption.

A sample of measurements is collected for a fixed time period. These features are related to some critical target variable such as overall chip yield. If the objective were to make predictions, this would be a standard regression problem, and many different methods could be applied. Our ultimate objective is diagnosis, not prediction. Moreover, this sample is not from a stationary population where a sample from one week will be fully applicable a month later. Many processes in the fab change, and target variables may change in magnitude as manufacturing programs mature. For example, the in-house engineering staff, as well as the manufacturing tool vendors, continually improve processes and manufacturing tooling to improve mean process performance and reduce process variability.

Instead of a goal of predicting a critical target variable, the defined task is to find patterns that deviate from the overall mean of the target. Analogous to clinical medicine, the apparent opportunities for improvement are associated with values that are far from normal, such as a pattern for high or low yield. This does not mean that prediction methods have no role in diagnosis. However, for diagnosis, it is the patterns that are far from normal that will draw the attention of the engineers. As fab tooling, processes, and products evolve over time, the connection between results from one sample and another may be weak. Opportunities for improvements are exploited and new ones arise. A goal of identifying deviations is consistent with that environment. The patterns that summarize deviations from the mean become the targets of the engineers to improve the performance of the manufacturing line.

In this paper, we describe a fully automated system for problem detection and diagnosis. It cycles through four operational stages: (a) collects and prepares samples of data, (b) finds patterns of tool behavior, described in terms of binary regression rules that identify significant deviations from average current performance, (c) filters the rules by knowledge-based constraints and (d) finds supporting measurements that underline this behavior and can help trace root causes. Among the key advantages of this approach are the exceptional comprehensibility of results and techniques for processing data that are replete with missing values. This system is in routine use in IBM's East Fishkill semiconductor fabrication plant.

The paper is organized as follows. Section 2 provides some basic background material for our application to semiconductor manufacturing. In Section 3, methods are described for data preparation and for finding decision rules that detect weak tool performance. Empirical results and cost benefits are then detailed. We conclude with a discussion of the advantages of this approach and directions for future work.
2 Application Background
The products of the fab are semiconductor chips, including microprocessors for games like the PlayStation 3, Xbox, or Wii, and microprocessors for IBM servers. While chips are the basic products, manufacturing processes are generally applied to individual wafers containing multiple chips [10]. Depending on the complexity of the microprocessor, the number of chips on a wafer may vary from hundreds to thousands. The basic unit of manufacture in the fab is a collection of wafers known as a lot, comprising wafers to be processed together during production. Typical 300mm manufacturing lots include 25 wafers. Chips are manufactured in parallel on each wafer. Any given lot may take months for completion of the manufacturing process.

After completion, the chips are tested, and measures of performance are taken. Typically, measurements reveal some chips with superior and some chips with inferior performance. In extreme cases, chips fail tests and have limited commercial value. The results of testing wafers or lots can be used to characterize the quality of overall fab performance. For example, the percentage of good chips on a wafer is known as the yield and is a primary measure of fab performance. Especially during the early stages of a manufacturing program, yields may be substantially lower than ultimate steady state yields, and substandard chips may account for the majority of production volume for some time. Given the long production period, the complexity of the physical processes, and the extraordinary investments, many intermediate measurements characterizing in-process wafers and production processes are made prior to final testing. A common property is that these are measures relative to wafers or to lots. Thus, a pattern that is detected will find some group of wafers that differ from others in some aggregate value, for example a significantly higher or lower yield.

Chip manufacturing begins with raw wafers and lots. The chips are completed and functionally tested months later. What happens during the life of the wafer for this relatively long period of time? The window on time can be divided into segments or steps. These steps represent processes applied to all wafers. Figure 1 illustrates the life cycle of a wafer, where for simplification, the steps needed to produce a chip are labeled 1 to 500. The wafer completes each of these hypothetical steps sequentially until the last step is finished. At each step the wafer passes through a single process tool. Many alternative tools may be capable of performing the same action on a particular step. At each step in Figure 1, only one of the many similar tools is applied to a wafer. We characterize the "logistics" history of a wafer by the set of manufacturing steps applied to that wafer, and the set of tools, and tool components, applied for those individual steps. In practice, the ordering of steps is somewhat more complicated. However, complete historical knowledge is available: it is known exactly whether a step was invoked and which tool was used.

During the production of chips, thousands of measurements are recorded to monitor product progress. These measurements include electrical measurements on specially designed test structures, defectivity measures, and an abundance of physical measurements characterizing feature dimensions, thin film thicknesses,
Fig. 1. Life Cycle of a Wafer. (A raw wafer becomes an in-process wafer and finally a finished wafer by passing sequentially through Steps 1 to 500; at each step, one of several alternative tools processes the wafer.)
and thin film chemical and physical properties. These measurements are not performed for all wafers. Dynamic sampling plans select subsets of lots, subsets of wafers within lots, and, for chip-specific measurements, subsets of chips within wafers for testing. It is not unusual for only 5% of all wafers to undergo any particular measurement. Large lot-to-lot variations in sampling are also possible: only one or two wafers from one lot may be subject to a particular measurement while all wafers in another lot may be subject to the same measurement. For any given wafer, most measurements are unknown, but over a larger population, like a lot, a summary value, like a mean or median, can be produced from a subsample of wafers.

A sample drawn from wafers in production during a time period may not be representative of future opportunities. Work in process development continues throughout the lifecycle of production. At any time, a sample is drawn from a non-stationary population reflecting the fundamental dynamic nature of fab processes and products. Opportunities for improvements are detected and exploited, processes are improved, tools are changed or taken out of production. In general, modifications are made routinely, and the interactions among all events cannot be understood exactly. This implies that samples must be continually drawn to see the latest opportunities that have arisen.

Acute tool failures are easy to detect by standard means. However, extreme competitive pressures demand that all opportunities for improvement be identified and exploited. There are usually subtle, but still detectable opportunities to improve yield. In the next section, we discuss our approach to detecting these opportunities.
3 Methods and Procedures
Our central theme is diagnosis. We provide a solution that is transparent and understandable to the engineers who maintain and monitor the production line. A major component of our solution is a specialized form of a binary-clause regression rule. Association rules and related forms of classification decision rules have previously been used to extract knowledge from manufacturing databases [11], [12], [13], [14], [15].

Figure 2 illustrates a typical binary regression rule, which is contrasted to the mean value for all wafers in the sample. In Figures 2 and 3, Implanter A1 and Furnace B1 are tools; Extension Implant and Oxidation are steps. n-ion is a measurement characteristic of chip power consumption and speed. Both low and high values are indicative of poor performance. The general form of a discovered rule is the following:

If true-or-false conditions are satisfied
Then average value = a;
Otherwise average value = b.
Figure 2 has one condition, but the number of conditions may be greater, as in Figure 3. In practice, multiple sets of conditions can be very hard to rationalize in terms of manufacturing tool behavior, and we have currently limited these rules to two conditions.

Average median-n-ion for all 292 wafers is 823.4.
IF (Implanter A1 is used for Extension Implant)
THEN (Average median-n-ion for 58 wafers is 801.9)
OTHERWISE (Average median-n-ion for 234 wafers is 828.7)

Fig. 2. Typical Rule
Average median-n-ion for all 292 wafers is 823.4.
IF (Implanter A1 is used for Extension Implant)
AND (Furnace B1 is used for Oxidation)
THEN (Average median-n-ion for 43 wafers is 793.6)
OTHERWISE (Average median-n-ion for 249 wafers is 828.6)

Fig. 3. Typical Paired Rule
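As a concrete illustration (not the production implementation), a rule of this form can be read as a small data structure evaluated against a wafer sample. The sketch below assumes a hypothetical pandas DataFrame with one row per wafer, binary step-tool columns, and a numeric target column; all names are illustrative.

# Minimal sketch of a binary regression rule like Fig. 2, evaluated over a
# wafer sample. Column and target names are hypothetical.
from dataclasses import dataclass
from typing import List, Tuple
import pandas as pd

@dataclass
class BinaryRegressionRule:
    conditions: List[Tuple[str, int]]  # e.g. [("ExtensionImplant:ImplanterA1", 1)]
    target: str                        # e.g. "median_n_ion"

    def summarize(self, wafers: pd.DataFrame) -> str:
        """Report the IF / OTHERWISE averages for this rule over a wafer sample."""
        mask = pd.Series(True, index=wafers.index)
        for column, value in self.conditions:
            mask &= wafers[column] == value
        covered, rest = wafers[mask], wafers[~mask]
        return (f"Average {self.target} for all {len(wafers)} wafers is "
                f"{wafers[self.target].mean():.1f}\n"
                f"IF   ({len(covered)} wafers): avg {covered[self.target].mean():.1f}\n"
                f"ELSE ({len(rest)} wafers): avg {rest[self.target].mean():.1f}")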
The rules have several descriptors, including the number of wafers and the difference in target value from the overall population mean target value. These
are used as constraints and filters on acceptable rules. While a predictive method, such as a decision tree [16], will dynamically select an implied set of covering rules, these may be inadequate for this application. They may be too complex, cover an insufficient number of wafers, or several other factors may militate against the use of standard methods.

Diagnostic procedures require an understanding of the special constraints of the fab's operating environment. The fab operates in a non-stationary environment with substantial probabilistic influence. Engineering resource constraints limit the number of opportunities that can be pursued, and some apparent opportunities are transient in nature. The ones of greatest interest are those that cover a substantial number of wafers and that can be supported by a graphical display of the time-series performance of a candidate tool or process. For example, Figure 4 illustrates the supporting time series for the rule in Figure 2. A problem may be detected when a relatively large number of wafers deviate significantly from the overall mean value. If a pattern is discovered relative to specific tools, an induced rule may be diagnostic of problems with those tools. These are opportunities for improving the overall target, such as yield. For example, the operational parameters of a tool, such as underperforming Implanter A1 in Figure 2, may be compared to similar tools to note a difference in settings that can be adjusted. Complete tool trace data are available for detailed investigation. Not all investigations lead to clear-cut improvements, but some can lead to substantial gains in performance.

The illustrative rules in Figures 2 and 3 also describe the basic units that are analyzed. The target is some key numerical value that is a critical measure of performance and production. For example, yield measurements are percentages of successful manufacturing. The basic unit of analysis is a wafer. A yield target for a wafer would indicate the percentage of good chips on the wafer. Data are analyzed at the wafer level. Results are evaluated at both the wafer and lot level. In general, we are looking for rules that report anomalies at both levels of resolution. A rule that covers a significant number of wafers, all obtained from a small number of lots, would be less interesting than one from a large number of lots.

Our general approach to problem detection and diagnosis involves several critical tasks: data preparation, regression rule learning, rule filtering, and identification of supporting findings. These tasks form a general agenda for which choices in representations and methods are made. Certain characteristics of the application greatly influence the directions taken in implementing a real-world system. Of particular interest are (a) the need to present answers in a form that is amenable to action by the engineers and (b) the overwhelming number of missing values in the collection of wafer data. In the following sections, we will describe the methods used in this system.
3.1 Data Preparation
A massive amount of information is recorded during the manufacturing process. This information is stored in a specialized format optimized for knowledgeable
Fig. 4. Time Series for Pattern. (Median n Ion plotted against processing date from 6/22 to 9/20, for wafers where Implanter A1 was used versus all other wafers.)
specialists. For our diagnostic objectives, a simplified representation can be described. As illustrated in Figure 1, a wafer is processed sequentially in steps. At each step, a tool is applied. For example, Implanter A1, one of a set of 12 nominally identical tools, can be used to introduce precise quantities of dopant atoms into transistors during the Extension Implant Step. Following the final wafer manufacturing step, every wafer is subjected to a comprehensive wafer final test. The steps imply a time ordering on the data. Additional testing occurs after intermediate steps. Not all wafers are tested at these earlier steps. Rather, lots are sampled, and samples of wafers are drawn from sampled lots. If the target is a measure of final testing, such as chip yield, complete information is available for all wafers. The downside is that these wafers have been in process for months, and diagnoses may be quite late for action. Alternatively, a target may be selected from an intermediate testing step. These targets have the great advantage that they may detect opportunities much earlier. There are weaknesses though. Targets from intermediate testing steps can identify a specialized opportunity, but they are not as definitive as a final test measure like yield. Moreover, because of sampling, conclusions will be based on much smaller subsets of wafers than those diagnosed from final test results. The data are transformed into a standard spreadsheet format. This is illustrated in Table 1, where each row represents features of a single wafer. The
target, which can be stored in the last column, is a numerical value that is critical to track.

Table 1. Transformed Sample Data

Wafer-ID   step1-tool1   step1-tool5   ...   step500-tool16   Yield
w1         0             1             ...   0                70
...        ...           ...           ...   ...              ...
w1000      1             0             ...   0                75
The remaining columns are features of the wafers. We could use all information about the wafers and assemble a comprehensive spreadsheet describing everything known about the wafer during manufacture. One technical problem with that approach is that most information is missing because measurements relative to wafers are sampled. Instead, we divide the features into two piles: (a) history of tools and steps and (b) recorded measurements, mostly sampled and numerical.

The step-tool pair is a basic unit of diagnosis. If we find a pattern relative to a step and tool, then the source of an apparent opportunity is greatly narrowed. The history values are never missing. For a given wafer, either a step and tool were applied to the wafer or not. In the spreadsheet of Table 1, the columns are filled in with binary information. Each of these features represents a pair, a step and tool combination. The actual value will be binary, true (1) or false (0), indicating whether that tool was used in that step. The pair can be extended to chambers. Tools may have multiple chambers performing identical or complementary processes and can process multiple wafers simultaneously.

Our step-tool features are not combinations of all possible steps and tools. For any step, only a subset of tools can be applied. In our application, the number of step-tool pairs numbers in the thousands. The actual list of step-tool pairs is not constant and must be composed dynamically from the specialized data of a certain period. The manufacturing line is a dynamic structure; many changes are made: steps are added or eliminated, as are tools for a specific step. When chambers are considered, the step-tool-chamber triples typically number in the tens of thousands. This could be considered a high-dimensional feature space, especially relative to the number of wafers, which can vary from hundreds to thousands. While the high dimensionality may be a concern, because features are binary-valued, the learning methods to be discussed in Section 3.2 are surprisingly effective and efficient. The numerical measurements will also be analyzed, but only after the historical record is considered and patterns are found. This process will be considered in Section 3.4.

In summary, the logistics data are organized as follows:

– Each column is a step-tool pair.
– Each value is true or false, indicating whether the tool was used on that step.
– Each row is a wafer, and the values indicate the history of the wafer in terms of its steps and tools.

In contrast, for the measurement data, each column is a specific measurement, and the values are real or missing.

The sample is collected relative to a target. Unless the target is one of the wafer final tests, i.e. post manufacture, not all wafers will have the target information. The sample will only include those wafers for which the recorded target information is available. In general, a collected sample is of interest for a specified period of time, and new opportunities may be exploited as they surface. Thus, a sample is usually collected for a target from the most recent wafers having the target value. A time window is specified to capture wafers that can relate to current events on the line, for example all wafers from the last 30 days of target recordings. This window is specified based on knowledge of current line operating conditions. For routine triage, 30 days may balance recency with an adequately sized sample. Under some circumstances, the system has been used to explore specific aspects of manufacturing line operations. Larger samples, or samples drawn from predefined dates, may be needed to support such specialized investigations.

The set of features produced by these procedures is not constant. Unlike a typical sampling for prediction, the set of features cannot be specified permanently. For each new sample, the set of features will change, necessitating their re-generation. A typical sample is collected over time in the following manner:

1. Define features at time A.
2. Collect examples of feature values.
3. Add new examples using these features.

For this application, the features must be generated every time a sample is collected. Steps may change, as well as the tools assigned to those steps. Thus the sampling procedure will proceed along these lines for a given time window W and target T (a sketch of this assembly follows at the end of this subsection):

1. Define features found over all examples (wafers) processed in W having a defined value for T.
2. For all wafers having a known value of T, collect their feature values over time window W.

While the complete set of features will vary over different windows, many of the features will be unchanged for different time periods. Discoveries that relate to a small subset of the feature space may be applicable to both old and new samples. If a single step and tool are identified as a potential source of improvement, then unless major changes have been made, the same step-tool will occur in future samples. Thus regression-rule representations will find stable solutions and are readily applicable to new data, even though the full set of features will include changes. In the next section, we discuss techniques for inducing diagnostic regression rules.
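The assembly of such a sample can be sketched as follows. This is a minimal illustration, not the production extraction code; the input tables, logistics (one row per wafer-step-tool processing event) and targets (recorded target values per wafer within the window), and all column names are hypothetical.

# Minimal sketch: build the binary step-tool feature matrix of Table 1 for one
# time window, regenerating the feature set from the events actually observed.
import pandas as pd

def build_sample(logistics: pd.DataFrame, targets: pd.DataFrame,
                 target_name: str) -> pd.DataFrame:
    """Assemble the wafer-level spreadsheet for one time window and target."""
    logistics = logistics.copy()
    # One binary column per step-tool pair observed in this window.
    logistics["step_tool"] = logistics["step"] + ":" + logistics["tool"]
    features = pd.crosstab(logistics["wafer_id"], logistics["step_tool"]).clip(upper=1)
    # Keep only wafers with a recorded target value; append the target column.
    sample = features.join(targets.set_index("wafer_id")[target_name], how="inner")
    return sample

# Usage (hypothetical inputs): sample = build_sample(logistics_30d, yields, "yield")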
3.2 Analytical Methods - Binary Regression Rules
Our diagnostic solution is presented in the form of regression rules. Of interest are patterns of high or low target values. Patterns are presented as binary regression rules of a maximum length, filtered by constraints on minimum numbers of wafers or lots and minimum deviations from mean values. Rule conditions are currently restricted to singletons or pairs because of the difficulties encountered to date in operationally exploiting patterns of greater complexity. The general format is

If X and Y are satisfied Then value = a; Otherwise value = b.    (1)
The task of the learning method is to induce these rules from sample data. Figure 5 is an overview of the learning methods. A regression tree is grown using the CART algorithm [17].

1. Grow binary regression tree to depth k (shallow tree).
2. Cross-validate to get best size tree.
3. Each path to a terminal node is a potential rule.
4. Generalize rules by extracting subsets of paths.
5. Filter rules.

Fig. 5. Overview of learning methods
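Steps 1 and 2 of Figure 5 can be reproduced with any standard CART implementation; the sketch below uses scikit-learn as one possible choice (the paper does not prescribe a library), with mean absolute error as the splitting criterion, as motivated in the following paragraph, and cross-validated cost-complexity pruning to select the best-size tree.

# Minimal sketch, assuming scikit-learn; X is the binary step-tool matrix and
# y the target values from the prepared sample.
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

def grow_shallow_tree(X, y, depth: int = 4) -> DecisionTreeRegressor:
    base = DecisionTreeRegressor(criterion="absolute_error", max_depth=depth)
    # Candidate pruning strengths, then pick the best by cross-validated MAE.
    alphas = base.cost_complexity_pruning_path(X, y).ccp_alphas
    search = GridSearchCV(base, {"ccp_alpha": alphas}, cv=5,
                          scoring="neg_mean_absolute_error")
    search.fit(X, y)
    return search.best_estimator_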
In this manufacturing application, identifying opportunities associated with larger numbers of wafers is more significant than identifying opportunities associated with fewer outlying wafers. As such, absolute deviation from the target average is the minimization criterion as opposed to squared error. The tree is grown to a relatively shallow depth, for example 4 nodes deep. There is no harm in growing a full tree. However, the constraints on the solution, such as length of rules or minimum numbers of wafers and lots, imply that only the top of the tree can satisfy those constraints. The learning algorithm will find significant results and eliminate results that are possibly random. Because tree induction is a well-studied process, a standard algorithm can be used for Steps 1 and 2 in Figure 5. This procedure produces a tree with tested, nonrandom predictive performance. Our task though is diagnosis, not prediction, so the pruned tree is an intermediate structure from which binary regression rules will be inferred. Each path to a leaf node in the final tree is a potential regression rule, a conjunction of nodes on that path, starting with the root node. However, the full path rule may be unnecessarily complex. In this application, it can be beneficial to restrict rule size to 2 or fewer conditions. Figure 6 illustrates a pruned tree, where a full-path 2-condition rule is outlined (the rule in Figure 3), along with a single-condition subpath rule (the rule in Figure 2). Nodes 4 and 5 (not shown in Figure 6) were pruned from node 2 by the learning procedure. Both rules will eventually be presented to the engineers for examination. The single-condition
rule covers more wafers and is easier to investigate. However, the 2-condition rule deviates more from the mean and still covers a substantial number of wafers.

Figure 7 describes a procedure for extracting a rule R from the path traversed beginning at root node 1 and ending at node k, where node k can be terminal or non-terminal. This procedure is repeated for every node in the tree, potentially generating one rule for every node in the tree. Instead of starting at the root node, the rule is assembled by starting at the last node on the path and gradually adding parent nodes, the reverse of how the path was generated. As soon as R's target value is close to the value of the complete path to node k, the procedure halts. A candidate rule's "deviation" is the difference between the mean target value for all wafers in the sample (the root node) and the target value for the rule (the conclusion node). Heuristically, a rule is deemed to be close when its deviation is within 10% of the full path's deviation. When the maximum length is exceeded, the procedure halts and no rule is extracted.

In the tree of Figure 6, a full path consists of nodes {1,3,7}. In terms of rules, the equivalent rule is "node 3 AND node 7," where node 3 represents a true value for the test at node 1, and node 7 represents a true value for the test at node 3. The target value of node 7 is 793.6, and its deviation is 823.4 - 793.6 = 29.8. The procedure will first consider node 7 alone, applied to all 292 wafers in the sample. Hence node 7 will have a different computed count and mean value; its deviation is not within 10% of 29.8. For this reason, the shortened candidate rule is rejected, and the procedure accepts the longer full-path rule.

A potential rule is formed by a path to every node in the tree. These rules may be generalized by removing conditions. In this application, a shallow tree is grown because rules are limited to one or two terms. If longer rules were acceptable, deeper trees could be grown. Each rule induced from the tree is an object of interest and forms the IF part of the binary formulation. The Otherwise part of the rule is readily computed from all wafers in the sample not covered by the IF conditions.
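The generalization procedure of Figure 7 can be sketched as follows, again over the hypothetical wafer-level DataFrame used in the earlier sketches; path conditions are listed from the root down to node k.

# Minimal sketch of the idea in Fig. 7 (not the production code): start from
# the deepest condition on the path and add parent conditions until the
# shortened rule's deviation is within 10% of the full path's deviation, or
# the length limit is exceeded (in which case no rule is extracted).
from typing import List, Optional, Tuple
import pandas as pd

Condition = Tuple[str, int]  # (step-tool column, required 0/1 value)

def deviation(sample: pd.DataFrame, target: str, conds: List[Condition]) -> float:
    mask = pd.Series(True, index=sample.index)
    for col, val in conds:
        mask &= sample[col] == val
    return abs(sample[target].mean() - sample.loc[mask, target].mean())

def generalize(sample: pd.DataFrame, target: str, path: List[Condition],
               max_len: int = 2) -> Optional[List[Condition]]:
    full_dev = deviation(sample, target, path)
    for length in range(1, len(path) + 1):      # deepest condition first, add parents
        if length > max_len:
            return None                         # too long: no rule extracted
        rule = path[-length:]
        if abs(deviation(sample, target, rule) - full_dev) <= 0.1 * full_dev:
            return rule                         # close enough to the full path
    return None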
3.3 Rule Filters - Knowledge-based Constraints
The application of these procedures results in rules that are empirically tested and nonrandom. Their short length makes them understandable to the engineers. For diagnosis, this is not sufficient. They must also be shown to be "interesting," so that any time expended in investigation by the engineers is justified. Interesting rules are found by passing them through filters. An interesting rule is one that meets all thresholds posed by the filters. Table 2 lists some of the filters, for example the magnitude of the minimum deviation from the mean, which may be far greater than implied by a routine significance threshold. The magnitude of units above or below a target can differ. For example, yield degradation is usually of much greater concern than increased wafer yield. The manufacturing line does not operate in a stationary environment; some limited natural variability is expected. Only when deviations are clear indications of opportunities for improvement is a subsequent investigation worth pursuing. A detected pattern may be worth pursuing when it affects a minimum number of wafers. This will
Fig. 6. Rules Extracted from Tree Paths. Node 1 (root: 292 wafers, avg 823.4) is split on "Implanter A1 used for Extension Implant": false leads to Node 2 (234 wafers, avg 828.7); true leads to Node 3 (58 wafers, avg 801.9), the single-condition subpath rule of Figure 2. Node 3 is split on "Furnace B1 used for Oxidation": false leads to Node 6 (15 wafers, avg 825.8); true leads to Node 7 (43 wafers, avg 793.6), the full-path rule of Figure 3.
1. Number the nodes on the path from root node 1 to last node k
2. i=k; R = True; j=max length
3. R = node i AND R
4. if (deviation(R) is within 10% of full-path deviation) stop
5. i=i-1; if (i=1) stop
6. if (k-i > j) {R=null; stop}
7. goto 3

Fig. 7. Extract and Generalize Rules
also filter one-time events that may have already been addressed. The thresholds are set in consultation with the engineers and may vary depending on the target and the degree of sampling.

Table 2. Rule filters

Filter                                                   Threshold
Rule must cover minimum number of wafers                 25
Rule must cover minimum number of lots                   5
Rule must deviate from global mean by X units (above)    5
Rule must deviate from global mean by Y units (below)    5
These filters are applied to rules extracted and generalized from the tree. Rules not satisfying these conditions are eliminated from consideration. The goal is not to investigate all potential opportunities in the plant no matter how fleeting or small. Exploring marginal opportunities may be time-consuming and difficult. Engineering resources must be allocated to maximize potential gain. The use of filters is consistent with that objective. Of particular interest are rules that diverge greatly from the mean and also affect a large number of wafers. Most applications to yield and speed have a set of multiple targets. In practice, even with filters, many rules may be found. These same factors can also be used to order the rules from most promising to least. This is done either by number or percentage of lots or some trade-off with divergence from the target mean. However, pure empirical knowledge will not be the sole determining factor. Knowledge of the manufacturing line conditions and physical operations may lead the engineers to pursue some rules and ignore others.
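A sketch of these filters, continuing the hypothetical representation used in the earlier sketches (a wafer-level DataFrame plus a wafer-to-lot mapping); the thresholds shown are the illustrative values of Table 2 and would in practice be set per target with the engineers.

# Minimal sketch of the Table 2 filters. `lot_of` is a hypothetical pandas
# Series mapping each wafer id to its lot id.
import pandas as pd

def passes_filters(sample, target, conds, lot_of,
                   min_wafers=25, min_lots=5, min_above=5.0, min_below=5.0):
    """Return True if the rule given by conds survives the Table 2 filters."""
    mask = pd.Series(True, index=sample.index)
    for col, val in conds:
        mask &= sample[col] == val
    covered = sample[mask]
    deviation = covered[target].mean() - sample[target].mean()
    return (len(covered) >= min_wafers                        # minimum number of wafers
            and lot_of[covered.index].nunique() >= min_lots   # minimum number of lots
            and (deviation >= min_above or deviation <= -min_below))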
3.4 Supporting Findings
Tools and steps have been identified that have demonstrably different performance. Results are presented in the form of binary regression rules for critical targets. As illustrated in Figure 4, these results must also be supported by a graphical time series demonstrating a tool’s divergent performance relative to its peers. From an automated learning perspective, using historical step and tool data alone has the advantage of having no missing values. As noted in Section 3.1, measurement data are recorded at intermediate stages of manufacture. Unlike the step-tool data, the measurement data are riddled with gaps because of sampling, with a minor percentage of wafers being subject to any specific measurement. Figure 8 illustrates how these data are processed to support the diagnostic pattern found by the regression rule. The rule separates the wafers into two populations: those that satisfy the conditions and the remaining wafers that do not. Although the measurements are sampled, their mean values can readily be computed for each of these populations. These means are compared for the two populations, and significant results noted. While
Fig. 8. Overview of Analyses and Supporting Evidence. (Data flow: the data warehouse and data sources feed data extraction and data transformation; the data mining engine performs rule selection and confirmatory analyses; results are presented as text results, SPC charts, and a web page.)
this may sound like a clinical study comparing two medical treatments, ordinary statistical significance is of minimal interest. We are heuristically looking for extreme differences that support a major difference for the pattern. While we currently compare means for all measurements and rank these measurements, knowledge of the physical processes and manufacturing tools will restrict interest to those measurements that are most relevant to a particular tool-step combination. The tools themselves have additional settings and measured values that are currently stored in auxiliary databases. These could prove valuable, but these data are not currently incorporated in the learning procedures.

Figure 9 illustrates the procedure for finding supporting measurements. Table 3 is an example of several supporting measurements for the example rule in Figure 2. For each parameter, it lists the average value and the number of wafers when tool A1 is used and when it is not used, and the t-test probability comparing the two groups. The complete exploitation of a particular finding from this system requires the identification of ultimate root causes, and the identification and implementation of appropriate remedial measures.
1. Rule pattern defines two populations: Pattern vs. Not Pattern.
2. Perform hypothesis testing using a t-test on measurement mean values.
3. Apply filters:
   – minimum number of wafers per mean
   – minimum difference between means

Fig. 9. Find Key Differentiating Measurements

Table 3. Supporting Measurements

Parameter (median)      Mean A1   Number A1   Mean not A1   Number not A1   Prob
n Ion                   801.9     58          828.7         234             8.90E-13
n Ioff                  121.7     58          179           234             1.00E-11
n Overlap Capacitance   0.243     58          0.25          215             5.90E-06
p Ion                   355.6     58          351.4         234             6.30E-03
p Ioff                  23.15     58          20.91         234             4.40E-03
p Overlap Capacitance   0.24      58          0.236         215             2.40E-05
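The comparison of Figure 9 can be sketched as follows. This is an illustration only, assuming a hypothetical measurements DataFrame (one row per wafer, one column per sampled measurement, missing values where a wafer was not measured) and a boolean indicator of rule membership; Welch's t-test stands in here for the t-test named in Figure 9.

# Minimal sketch: compare in-rule vs. out-of-rule measurement means,
# ignoring wafers missing a given measurement, and keep the filters of Fig. 9.
import pandas as pd
from scipy.stats import ttest_ind

def supporting_measurements(measurements: pd.DataFrame, in_rule: pd.Series,
                            min_wafers: int = 10) -> pd.DataFrame:
    rows = []
    for name in measurements.columns:
        a = measurements.loc[in_rule, name].dropna()    # wafers matching the rule
        b = measurements.loc[~in_rule, name].dropna()   # all other wafers
        if len(a) < min_wafers or len(b) < min_wafers:
            continue                                    # too few sampled wafers
        _, p = ttest_ind(a, b, equal_var=False)
        rows.append({"parameter": name, "mean_in": a.mean(), "n_in": len(a),
                     "mean_out": b.mean(), "n_out": len(b), "prob": p})
    result = pd.DataFrame(rows, columns=["parameter", "mean_in", "n_in",
                                         "mean_out", "n_out", "prob"])
    return result.sort_values("prob")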
The identification of a tool and step leading to a different average performance, such as median n-ion, could be symptomatic of a number of ultimate root causes. In the initial stages of root cause determination, an engineer must recognize a plausible physical or chemical mechanism relating the identified tool and step to the performance measure and hypothesize one or more root causes. The supporting findings provide an efficient and effective means for the engineer to test many such hypothetical root causes. In general, each hypothesized root cause will have implications for other measurements. For example, one hypothetical root cause explaining why the use of Implanter A1 for the Extension Implant would lead to low n-ion may be consistent with no effect on n-ioff for the same transistor design. For the example rule in Figure 2, such a hypothesized root cause would be inconsistent with the observed highly statistically significant change in n-ioff, allowing the engineer to rule out that hypothesized root cause. Another hypothesized root cause may be consistent with no effect on similar p transistors. Any such hypothesized root cause would be inconsistent with the highly statistically significant changes observed for p-ion in Table 3, again allowing the engineer to rule out such a root cause. In such a fashion, the supporting findings allow engineers to test a wide variety of hypotheses, given their understanding of the implications of those hypotheses on the measurements included in the supporting findings.

It may be the case that the engineer, given existing expert knowledge, is unable to hypothesize any physical or chemical relationship between the identified tool and step and the associated performance measurement. In this case, the supporting findings present the engineer with a very broad characterization of the in-rule wafer population. The engineer, presented with such a broad characterization of the in-rule population, may be able to hypothesize another root
cause that may stand completely apart from the fact that the in-rule population shared a common tool at a particular step.
4 Results
We have developed an automated system that samples data every week and reports candidate opportunities for yield improvements. These opportunities are not permanent. A new sample taken at a later time may show that the opportunity has been exploited or is of lesser potential value. The objective of mining the data is to detect opportunities as early as possible. Unlike applications that deal with relatively stationary data, results for diagnosis can vary greatly over time when measured in terms of accuracy.

From the perspective of cost saving, the value of a correct diagnosis can be measured much more objectively than in most applications. The reduction of yield for a period of time translates into lost production of chips. Chips have a commercial value, and the number of failed chips multiplied by the chip value is approximately the total lost sales. Thus we can compute the actual amount of money saved by the early detection and diagnosis of a highly significant opportunity. These numbers can range into the many millions of dollars.

For a run of the system during one time period, Table 4 describes a snapshot of sample data characteristics relative to power consumption and speed. A tool analysis is performed, and then repeated with data for the tool chambers. For these data analyses, the entries in Table 4 are the number of targets, the number of features, and the sample size.
Table 4. Sample Dimensions

Analysis Type   Targets   Features   Sample Size
tool            26        2073       1282
chamber         26        16611      1282
For these same targets, Table 5 describes results for the derived rules. Listed is the median number of rules for the targets' cross-validated trees. This corresponds to the number of leaves, which is the set of unfiltered or lengthy rules. Also included is the number of rules passing all filters, which is the number of rules that the system presents in its reports. Table 6 lists results for the best rule found over the set of 26 targets, as measured by the greatest deviation and largest number of lots. For this rule, the table lists the percentage of lots with that pattern in the sample and the percentage deviation relative to the engineers' threshold. For each target, a threshold is set to indicate the minimum deviation from the mean value required to make results interesting and worth reporting.
Table 5. Rule dimensions

Analysis Type   No. Leaf Rules   No. Filtered Rules
tool            672              5
chamber         1167             4

Table 6. Best Rule Attributes

Analysis Type   % Lots   % Deviation
tool            21       210
chamber         52       172
Given the cost basis for chip manufacture, we can compute the expected cost savings associated with an opportunity for yield improvement. Not all of the opportunities diagnosed by this system are substantiated to be of commercial interest, and based on engineering insight, many are not pursued. However, dozens of diagnoses from this system have influenced fab operations and cost savings in excess of $1M have been associated with individual diagnoses. In detecting opportunities early, or in detecting opportunities that would be missed otherwise, the financial value of this type of analysis is clear.
5 Discussion
A system has been described that periodically samples data from all steps and tools in the fab. The objective is to find opportunities for yield improvement. The opportunities detected are usually not acute. Instead, they represent limited excursions in performance which are seen when compared to alternative tools and wafers. Detecting and exploiting an opportunity can save many millions of dollars, and the system that has been described has identified opportunities that have contributed to such savings. The data are not sampled from a stationary population. Over time, often many weeks, the results may change significantly. This can be traced to simple events like updating or substituting a tool, or more subtle events like reordering or changing steps. The manufacturing line can be a dynamic structure, especially one like IBM’s, where many different products are manufactured simultaneously. The data in this application should not be thought of as a stream. They are collected from many sources at different times and then combined to give a picture for a window on time. Our typical window is from the most recent 30 days of wafer testing during some stage of manufacture. A new sample of wafers is collected every week. The huge volume of data recorded on the line during the manufacture of chips makes it a natural application for data mining. Many companies have reported on their efforts to mine data. There have been notable successes in these efforts,
mostly in detective work for finding the cause of an unresolved problem in the fab. There, a specialized sample is collected to find root causes [18].

We have introduced a binary rule representation that simplifies diagnosis and investigation. Instead of a specialized compilation of data, suitable for investigating a known problem, our task is more standardized. We periodically run a system over collected data. The system is completely automated and can detect patterns of steps, tools, and supporting findings that deviate most from the mean values of a target while covering a relatively large number of wafers. Many different learning methods [19], [20] could be applied to the prediction subtask, but any solution requires a full understanding of the results, so that corrective action may be taken. A regression tree was used as an intermediate structure in this application, but regression rules could also be induced directly [21]. The actual choice of the rule induction method is not as critical as the filtering of candidates. Hundreds or even thousands of rules may be found that are empirically significant, yet only a few are worth investigating. Identified weaknesses in tool performance must be filtered by knowledge that the gain will be substantial and that the identified weaknesses are readily supported by standardized manufacturing techniques, such as time-series graphs of degraded tool performance relative to other tools. Although the rules are empirically verified, they do not necessarily demonstrate causal faults.

Adding knowledge to the system could improve results. We know that some empirical answers are rejected based on physical knowledge. That does not mean that the result is incorrect. Rather, the result may be a proxy for one or more events that are occurring elsewhere and happen to pass through the step and tool that have been suggested. Another example of the advantageous use of knowledge might be to restrict conjunctions in rules to tools and steps that are related or occur within a reasonable time frame. So far, we have relied on empirical evaluation. The filters in the rule requirements are an attempt to specify general knowledge that will lead to greater acceptance of answers.

The view presented here is that of diagnosis. Opportunities can be identified and further information can be provided about potential causes. This is not the same as treatment. A long-range effort would be needed to carefully understand how opportunities are exploited and to make sure that type of history is recorded. Further study of successful solutions, from diagnosis to successful exploitation, would enable an improved system that understands how a diagnosis might be resolved. Our system provides an opening on those possibilities and has already demonstrated substantial savings.
6 Acknowledgments
We acknowledge the following people (a) for critical project leadership: Bernie Meyerson, Neil Poulin, Matt Paggi and Dan Armbrust; (b) for substantive guidance in solution development: James P. Rice, Thomas W. Joseph, Hari V. Mallela, Brian Trapp, Patrick Varekamp, and Keith Tabakman; (c) for expert development of supporting data extracts and solution delivery: John M. Balas, Andrew D. Pond, Yunsheng Song, and Bill Hoffman.
References

1. Goodwin, R., Miller, R., Tuv, E., Borisov, A., Janakiram, M., Louchheim, S.: Advancements and applications of statistical learning/data mining in semiconductor manufacturing. Intel Technology Journal 8(4) (2004) 325–336
2. Harding, J., Shahbaz, M., Srinivas, Kusiak, A.: Data mining in manufacturing: A review. Manufacturing Science and Engineering 128(4) (2006) 969–976
3. Melzner, H.: Statistical modeling and analysis of wafer test fail counts. In: Advanced Semiconductor Manufacturing 2002 IEEE/SEMI Conference and Workshop. (2002) 266–271
4. Weber, C.: Yield learning and the sources of profitability in semiconductor manufacturing and process development. IEEE Transactions on Semiconductor Manufacturing 17(4) (2004) 590–596
5. Kong, G.: Tool commonality analysis for yield enhancement. In: Proceedings of IEEE Conference and Workshop on Advanced Semiconductor Manufacturing. (2004) 202–205
6. Chen, W., Tseng, S., Hsiao, K., Liu, C.: A data mining project for solving low-yield situations of semiconductor manufacturing. In: Proceedings of IEEE Conference and Workshop on Advanced Semiconductor Manufacturing. (2004) 129–134
7. Irani, K.B., Cheng, J., Fayyad, U.M., Qian, Z.: Applying machine learning to semiconductor manufacturing. IEEE Expert: Intelligent Systems and Their Applications 8(1) (1993) 41–47
8. Apte, C., Weiss, S., Grout, G.: Predicting defects in disk drive manufacturing: A case study in high-dimensional classification. In: IEEE CAIA (93). (1993) 212–218
9. Fountain, T., Dietterich, T., Sudyka, B.: Mining IC test data to optimize VLSI testing. In: Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. (2000) 18–25
10. Campbell, S.: Fabrication Engineering at the Micro and Nanoscale. Oxford University Press, Oxford (2007)
11. Shahbaz, M., Srinivas, Harding, J.: Knowledge extraction from manufacturing process and product databases using association rules. In: Proceedings of Conference on Product Data Technology Europe. (2004) 145–153
12. Kusiak, A.: Rough set theory: a data mining tool for semiconductor manufacturing. IEEE Transactions on Electronics Packaging Manufacturing 24(1) (2001) 44–50
13. Lian-Yin, Z., Li-Pheng, K., Sai-Cheong, F.: Derivation of decision rules for the evaluation of product performance using genetic algorithms and rough set theory. In: Data Mining for Design and Manufacturing: Methods and Applications. Kluwer Academic Publishers, Norwell, MA, USA (2002) 337–353
14. Sadoyan, H., Zakarian, A., Mohanty, P.: Data mining algorithm for manufacturing process control. The International Journal of Advanced Manufacturing Technology 28(3/4) (2006) 342–350
15. Hrycej, T., Strobel, C.: Extraction of maximum support rules for the root cause analysis. Computational Intelligence in Automotive Applications 132 (2008) 89–99
16. Chen, R., Yeh, K., Chang, C., Chien, H.: Using data mining technology to improve manufacturing quality - a case study of LCD driver IC packaging industry. In: Seventh ACIS International Conference on Software Engineering, Artificial Intelligence, Networking, and Parallel/Distributed Computing. (2006) 115–119
17. Breiman, L., Friedman, J., Olshen, R., Stone, C.: Classification and Regression Trees. Wadsworth, Monterey, CA (1984)
18. Gardner, M., Bieker, J.: Data mining solves tough semiconductor manufacturing problems. In: Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. (2000) 376–383
19. Chien, C., Wang, W., Cheng, J.: Data mining for yield enhancement in semiconductor manufacturing and an empirical study. Expert Systems with Applications 33(1) (2007) 192–198
20. Rokach, L.: Mining manufacturing data using genetic algorithm-based feature set decomposition. Int. J. Intell. Syst. Technol. Appl. 4(1/2) (2008) 57–78
21. Weiss, S., Indurkhya, N.: Solving regression problems with rule-based ensemble classifiers. In: Proceedings of KDD-2001. (2001)