IEEE TRANSACTIONS ON INSTRUMENTATION AND MEASUREMENT, VOL. 60, NO. 10, OCTOBER 2011
Getting More From the Semiconductor Test: Data Mining With Defect-Cluster Extraction

Melanie Po-Leen Ooi, Eric Kwang Joo Sim, Ye Chow Kuang, Serge Demidenko, Fellow, IEEE, Lindsay Kleeman, and Chris Wei Keong Chan
Abstract—High-volume production data show that dies that failed the probe test on a semiconductor wafer tend to form certain unique patterns, i.e., defect clusters. Identifying such clusters is one of the crucial steps toward improvement of the fabrication process and design for manufacturing. This paper proposes a new technique for defect-cluster identification that combines data mining with defect-cluster extraction using a Segmentation, Detection, and Cluster-Extraction algorithm. It offers high defect-extraction accuracy without any significant increase in test time and cost.

Index Terms—Data mining, defect-cluster extraction, probe testing, segmentation, semiconductor manufacturing.
I. INTRODUCTION

THE PRIMARY mission of the semiconductor manufacturing test has traditionally been seen as the segregation of good devices under test (DUTs) from defective (entirely or partially failed) ones. With the ongoing technological progress in the semiconductor industry, this paradigm is rapidly changing. The semiconductor industry has approached an inflection point whereby test-enabled diagnostics and yield learning have become crucial for further progress in integrated circuit (IC) manufacturing [1]. As technology approaches nanoscale geometry, high defect and fault rates are experienced throughout the semiconductor production process [2]. The latest International Technology Roadmap for Semiconductors report on Test and Test Equipment states that the industry's most difficult challenges in the test technology area are Test for Yield Learning and Detecting Systematic Defects. Production test data were identified as an essential element in overcoming these challenges in the manufacturing process feedback loop.

Manuscript received July 13, 2010; revised October 17, 2010; accepted January 31, 2011. Date of publication March 28, 2011; date of current version September 14, 2011. The Associate Editor coordinating the review process for this paper was Dr. Antonios Tsourdos. M. P.-L. Ooi, E. K. J. Sim, and Y. C. Kuang are with the Monash University Sunway Campus, 46150 Petaling Jaya, Selangor, Malaysia (e-mail: [email protected]; [email protected]; kuang.ye.chow@monash.edu). S. Demidenko is with Royal Melbourne Institute of Technology International University, Ho Chi Minh City, Q7, Vietnam (e-mail: serge.demidenko@rmit.edu.vn). L. Kleeman is with Monash University, Clayton Campus, Clayton, Vic. 3800, Australia (e-mail: [email protected]). C. W. K. Chan is with Freescale Semiconductor Malaysia, Free Industrial Zone, 47300 Petaling Jaya, Selangor, Malaysia (e-mail: chrischan@freescale.com). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TIM.2011.2122430

Through the use of appropriate statistical tools on production
test data, much knowledge can be discovered. For example, in addition to the actual characteristics and functionality of the manufactured ICs, the parameters of the fabrication processes, materials, and equipment involved in product manufacturing can be precisely quantified, which enables improvement of product quality [3].

One of the observations made from high-volume production test data is that dies that fail with identifiable root causes have a tendency to form certain unique patterns that manifest as defect clusters at the wafer level [4] or hot spots at the wafer-lot level [1]. These patterns are found to be caused by systematic failures [5]. Thus, identifying such cluster patterns is one of the crucial steps toward improving the fabrication process, real-time statistical process control, and design for manufacturing (the complex relationship between the failed-die clusters and particular problems in the fabrication process or design is the subject of separate research and is outside the scope of this paper). Statistical analysis can be performed to distinguish between systematic and random failure patterns [1], whereas machine learning, pattern recognition, and image processing techniques can be applied to extract these defect clusters. In this paper, the defect patterns are divided into two categories: global defect patterns (GDPs) and local defect patterns (LDPs). Patterns of the GDP type correspond to cases where dies are found to be randomly failing across the entire wafer. On the other hand, LDPs are patterns with dies failing around specific locations that have consistent shapes and orientations across different wafer lots [5].

Data logging has become a standard procedure in the industry, whereby large databases are generated, and the data are collected in areas such as production, fault detection, design, and quality assurance.
The data are then used for quality management, identification of production issues, and increasing the manufacturing throughput. The rapid progress in computer technology has made data accumulation easy and cheap; thus, huge data sets are collected each day. They can be employed to extract the GDP and LDP data. Traditional statistical data analysis, originally developed for much smaller data sets, has become inadequate at providing the required summary for further analysis, reasoning, and decision making [6]. A general process called knowledge discovery in databases (KDD) was therefore proposed to aid in converting data into knowledge [7], [8]. Data integrity validation and standardization encompasses the first four stages in [7], [8]: Problem Definition, Selection and Addition, Preprocessing and Cleaning, and Transformation. Useful knowledge is then
0018-9456/$26.00 © 2011 IEEE
OOI et al.: GETTING MORE FROM SEMICONDUCTOR TEST: DATA MINING WITH DEFECT-CLUSTER EXTRACTION
TABLE I DEVICES USED IN THE RESEARCH
Fig. 2. Example of raw manufacturing wafer maps. (a) GDP and (b) LDP at the upper-left edge of the wafer [4].
Fig. 1. Flowchart of the data mining with the SDC algorithm.
extracted and analyzed in the final two stages: Data Mining and Interpretation. Data mining is a general term used to describe the process of studying a database for patterns of interest. It is normally achieved using machine learning and pattern recognition techniques.

This paper proposes a new comprehensive data-mining process using a novel Segmentation, Detection, and Cluster-Extraction (SDC) algorithm (see Fig. 1) to extract the LDPs from the raw production test data. This paper presents the SDC approach in explicit detail, with derivations and illustrations from new and substantial experimental trials covering several million IC units. This paper is divided into the following sections: Section II provides the literature review, Section III outlines the data integrity validation and standardization framework (labeled as "Standardization" in Fig. 1), Section IV introduces the concept of Local Yield, whereas Sections V–VII outline the segmentation, detection, and cluster-extraction stages of the SDC algorithm, respectively.

The proposed SDC algorithm has been developed to extract meaningful cluster features from a database of manufacturing test results accurately and automatically. It can be implemented in either an online or offline mode. For online industrial applications, any employed method for cluster extraction and identification should have a low computational time in order to avoid lowering the manufacturing throughput and increasing the cost of test. The online analysis can be used for defect-oriented testing, real-time statistical process control, and fast fault diagnosis. Examples of offline applications include periodic manufacturing process performance reviews and general postanalysis of the manufacturing process.

The SDC algorithm proposed in this paper has undergone an extensive experimental trial using several million IC units of six
types of mainstream high-volume production semiconductor devices. These devices are of different technologies (half-pitch sizes) and different die sizes (dies/wafer counts) (see Table I) and were subjected to testing and analysis. The devices were selected for the experimental study in consultation with Freescale Semiconductor, and their characteristics were obtained from their respective design documents. The names/descriptions of the devices have been intentionally changed to A–F in this paper for confidentiality reasons. The threshold selection used in the SDC algorithm has been determined through extensive wafer defect simulations; this is elaborated in detail in Section VIII. Section IX of this paper discusses the results and performance of the algorithm and is followed by the conclusions, acknowledgments, and references.

II. DEFECT CLUSTERING: REVIEW

KDD is an interdisciplinary field that combines machine learning, pattern recognition, and computer vision with statistical analysis [6]. The area of greatest interest in this paper is defect-cluster extraction. Fabricated semiconductor wafers provide a unique data set that can be visually and graphically interpreted as a 2-D image, as opposed to the data sets used in most data-mining applications (see Fig. 2). Thus, cluster extraction can be achieved either through algorithms from the fields of pattern recognition and machine learning, or via image processing methods (in the field of computer vision), such as segmentation and detection. Another unique feature of semiconductor wafers is the small number of points in the data set (generally ranging from 200 to 5000 dies/wafer), whereas most clustering algorithms in pattern recognition and image processing were developed for much larger numbers of points. This problem is further compounded by the fact that a high proportion of this small data set (between 10% and 50%) consists of random failures, which are interpreted as "noise" in the data set.
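The wafer-map characteristics just described (a few hundred to a few thousand dies per wafer, with 10%–50% random failures) can be illustrated with a small synthetic example. The generator below is only a sketch: the wafer diameter, failure rate, encoding, and function names are illustrative assumptions, not the production data of Table I.

```python
import numpy as np

def synthetic_wafer_map(diameter=40, fail_rate=0.2, seed=0):
    """Generate a toy circular wafer map: 1 = pass, 0 = fail.

    Positions outside the wafer circle are marked -1 (no die).
    Random GDP-like failures are injected at `fail_rate`.
    """
    rng = np.random.default_rng(seed)
    r = diameter / 2.0
    yy, xx = np.mgrid[0:diameter, 0:diameter]
    on_wafer = (xx - r + 0.5) ** 2 + (yy - r + 0.5) ** 2 <= r ** 2
    wafer = np.full((diameter, diameter), -1, dtype=int)
    wafer[on_wafer] = (rng.random(on_wafer.sum()) >= fail_rate).astype(int)
    return wafer

wafer = synthetic_wafer_map()
n_dies = int((wafer >= 0).sum())   # falls in the 200-5000 range quoted above
n_fail = int((wafer == 0).sum())   # the GDP-like "noise" fraction
```

Note that even this modest map already exhibits the two properties that defeat generic clustering algorithms: a small total point count and a substantial fraction of isolated random failures.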
Furthermore, the semiconductor wafer data set is not square or even a perfect circle, and most defect
Fig. 3. Examples of local defect patterns. (a) Bulls eye. (b) Blob. (c) Edge. (d) Ring. (e) Line. (f) Hat [4].
clusters are not convex polygons. Thus, many classical clustering algorithms tend to fail when applied to this research problem. The proposed cluster-extraction algorithm must also be repeatable across a variety of wafers, devices, data distributions, and failure rates.

In pattern recognition and machine learning, there are two general ways to partition a data set: hard clustering and fuzzy clustering. Hard clustering requires all points in a data set to be partitioned into distinct groups, with no overlap between clusters. Fuzzy clustering allows samples to have a varying degree of membership in each of the available clusters. There is no single clear-cut clustering solution that is optimal for all applications; the choice also largely depends on the type of data to be clustered. For fabricated semiconductor wafers, defect clusters do not tend to overlap but are instead divided between regions of random failure [so-called GDPs, as shown in Fig. 2(a)] and systematic patterns [so-called LDPs, as shown in Fig. 2(b)] [5]. Thus, hard clustering is the more suitable approach for this paper.

GDPs are the result of random perturbations in the manufacturing process. They are a particle-related phenomenon normally observed to be randomly distributed. Conversely, LDPs are normally located around a specific location and are process related. There are six LDPs identified and examined in this paper: Bulls eye, Blob, Edge, Ring, Line, and Hat (see Fig. 3). These defect patterns are loosely characterized by engineers and are subject to individual judgment. In this paper, a formal definition is given to each type of pattern, aiming to standardize their characteristics, as shown in Table II.
A. Clustering Algorithms

Fig. 4 shows a brief taxonomy of hard partitioning methods used in machine learning and pattern recognition [9], [10]. Note that computer vision may also use the same partitional techniques to achieve image segmentation [11]. Statistical partitioning clustering algorithms such as k-means, k-medoid, and mean shift are very widely used for cluster extraction across different disciplines.
k-means is the most commonly used partitioning clustering algorithm, whereby each cluster is represented by its center of gravity. The initial cluster centers are randomly chosen from the data set and are iteratively recalculated to represent the centroids of the clusters. The process continues until convergence is reached. Because it is heuristic, the result may never statistically converge [12]. A further drawback of k-means is that the entire result depends on the choice of the initial cluster centers. The biggest problem with k-means is its sensitivity to noise [12]. This makes it unsuitable for implementation on fabricated semiconductor wafers, which are normally dominated by random failures leading to a "noisy" data set.

The k-medoid algorithm [12], [13] was introduced to overcome the drawbacks of k-means. A medoid is the sample inside the cluster with the smallest average dissimilarity to all other objects in the cluster. This property makes the k-medoid algorithm more robust to noise and outliers. The major drawback of the k-medoid algorithm is that the number of clusters must be known prior to clustering. Estimating the number of clusters on a fabricated wafer is an added challenge, making this algorithm unsuitable for the industrial application. It is important to state that both the k-means and k-medoid algorithms only work for convex clusters. This makes them impractical for wafer defects, which are largely nonconvex.

Mean-shift clustering has the following advantages over the k-means and k-medoid algorithms. First, it does not require prior knowledge of the number of clusters. Second, it is capable of determining the distance between clusters and does not impose cluster shape constraints. It operates by recursively searching for the direction of the maximum increase in density in an agglomerative hierarchical manner. The mean-shift algorithm requires only geometrical coordinates and a density value, comprising a very small input set.
Therefore, theoretically, it is computationally fast to implement. However, mean-shift clustering may never really be suitable for fabricated semiconductor wafers despite its ability to consider three dimensions in the data: the x- and y-geometry and the value of each data point. This is because a fabricated semiconductor wafer has a small geometry (it may span as few as 100 points in each direction), whereas most mean-shift clustering applications in the literature involve data or pictures with larger numbers of points or pixels (spanning thousands of points in each direction). Additionally, the value of a data point for wafers can only be either "0" or "1" for "Fail" and "Pass," respectively, as opposed to pixel values typically varying between 0 and 255. Therefore, manipulation of the weight and the x–y geometry at each point in the mean-shift algorithm does not provide enough distinction between the cluster and noncluster regions.

Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) and Clustering Using REpresentatives (CURE) are two other examples of agglomerative hierarchical clustering [13]. BIRCH uses a cluster feature tree as a data structure to compress cluster information into a set of points, to perform merging, and to finally terminate operation on achieving some prespecified condition (e.g., a specified number of clusters,
TABLE II DESCRIPTION OF EACH TYPE OF THE LDP
Fig. 4. Taxonomy of hard clustering algorithms.
cluster diameter, etc.) [13]. Application of BIRCH results in spherical clusters. The CURE algorithm uses a set of representative points that give information about a cluster and attempts to strike a balance between using all points for accuracy and using only centroid points for efficiency. The main advantage of the algorithm is that it is not constrained to spherical clusters. However, it is not capable of clustering in an arbitrary distance space since it relies on vector operations [13].

The biggest problem with hierarchical clustering is that its accuracy depends on the chosen terminating condition for the splitting and merging operations [13]. Additional parameters are normally required to specify the terminating condition, such as size, a cluster validity index, or the distance between cluster centroids. Unfortunately, this introduces some user bias into the algorithm. User bias is notoriously difficult to quantify and control and, thus, should be avoided whenever possible to maintain consistency in the algorithm's performance. Moreover, hierarchical clustering is sensitive to perturbations due to the agglomerating/dividing operation. Thus, a pure hierarchical technique is unsuitable for clustering of wafer defects, for which an unsupervised algorithm insensitive to noise is necessary.

In the field of computer vision, cluster extraction can be achieved through image segmentation, including a nonlinear noise-removal stage, followed by partitioning methods such as connected-components labeling or nearest-neighbor clustering. While image segmentation can also be accomplished through
edge detection, color, or texture segmentation [11], these methods will not be discussed in this paper as they are unsuitable for defect patterns on semiconductor wafers.

Noise removal is typically achieved through standard image processing techniques such as median filtering and erosion–dilation [14], [15]. These aid in the removal of salt-and-pepper noise or "speckles" in the data such that all remaining points are assumed to belong to the same regions (or clusters). This is akin to removing the GDPs on the wafer and leaving behind only areas of the LDPs. A median filter replaces each point in the data set by the median of all points in a prespecified neighborhood (or window) [14]. Erosion replaces the value of the original die with the lowest value within its surrounding neighborhood, whereas dilation replaces it with the highest value [15]. When erosion is performed first, it causes small "specks" of data to be removed since only larger clusters survive the process, as shown in Fig. 5(b). Dilation then enlarges the surviving clusters back to approximately their original size [see Fig. 5(c)]. The erosion-and-dilation technique is employed in this paper as a part of the segmentation algorithm.

Connected-components labeling is a fundamental task that can be applied by simply pairing adjacent "1s" and "0s" in a binary image [16], [17] until all the data belong to some cluster. This is quite attractive for fabricated semiconductor wafers as the patterns are all in a binary form ("1" for a passed die and "0" for a failed one). Additionally, recent developments in connected-components labeling algorithms [18]–[20] have
Fig. 5. Performing erosion before dilation. (a) The original wafer. (b) The wafer map after erosion. (c) The wafer map after dilation.
made it fast and easy to implement. As a result, it is attractive for real-time cluster segmentation of semiconductor wafer defects. This method is used for cluster extraction in this paper.

Nearest-neighbor clustering first assumes that all points belong to individual clusters [14]. It then calculates the shortest distance between two points from different clusters and merges them if the distance is less than a prespecified threshold. Although this method is fast and easy to implement, it is not adaptive to different cluster sizes because the threshold is normally fixed. Specifying the threshold in the first place can be a problem for defect clustering since the cluster size is not known a priori.

B. Clustering Algorithms for Semiconductor Wafers

Clustering with application to fabricated semiconductor wafers is a highly specialized field. While there are a number of methods proposed in the literature [14], [21]–[24], each approach sets out to achieve specific clustering goals with different degrees of accuracy and different limitations. For instance, statistical distribution analyses (such as the Poisson or negative-binomial distribution) are normally applied on a lot-level basis to account for defect-cluster phenomena during yield calculations to estimate return on investment [21], [22]. Although these analyses can detect the presence of a cluster through analysis of outliers, they do not actually perform clustering per se. This means that they are unable to pinpoint the exact wafer that contains a defect cluster in a wafer lot, much less identify and segment the cluster.

Another group of methods attempts to recognize the cluster type directly from the raw manufacturing data. These methods typically confuse clustering with classification. For instance, [23] presents an intelligent neural-network system developed specifically to recognize semiconductor defect patterns. With neural networks, the decision making on cluster membership is unknown to the user.
Instead, such a system requires a sufficiently large amount of training to obtain a stable statistical inference for the chosen neural-network architecture. Thus, if the algorithm fails to achieve the required performance level, there is no method available for fine-tuning or improving the decision-making parameters. The only available solutions are either to increase the training set or to change the neural-network architecture. Another major limitation of any neural-network approach is its inability to recognize shifted or rotated variations of defect patterns as belonging to the same cluster type. The biggest pitfall of this approach is that the algorithm cannot confirm the presence and location of a defect cluster before it goes on to classify it.
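The erosion-and-dilation noise removal described earlier (erosion first, then dilation, as in Fig. 5) can be sketched in a few lines. This is a minimal illustration with hand-rolled 3 × 3 min/max filters on a toy fail map; the window size and the map itself are assumptions, not the paper's production parameters.

```python
import numpy as np

def erode(mask, size=3):
    """Erosion: each cell takes the minimum over its size x size window."""
    pad = size // 2
    padded = np.pad(mask, pad, mode="constant", constant_values=0)
    out = np.ones_like(mask)
    for dy in range(size):
        for dx in range(size):
            window = padded[dy:dy + mask.shape[0], dx:dx + mask.shape[1]]
            out = np.minimum(out, window)
    return out

def dilate(mask, size=3):
    """Dilation: each cell takes the maximum over its size x size window."""
    pad = size // 2
    padded = np.pad(mask, pad, mode="constant", constant_values=0)
    out = np.zeros_like(mask)
    for dy in range(size):
        for dx in range(size):
            window = padded[dy:dy + mask.shape[0], dx:dx + mask.shape[1]]
            out = np.maximum(out, window)
    return out

# Fail map: 1 marks a failing die.  The 4x4 block is an LDP-like cluster;
# the two isolated 1s are GDP-like "speckles" that opening should remove.
fails = np.zeros((10, 10), dtype=int)
fails[2:6, 2:6] = 1            # cluster
fails[0, 9] = fails[8, 1] = 1  # isolated random failures
opened = dilate(erode(fails))  # erosion first, then dilation (Fig. 5 order)
```

After the operation, the isolated speckles are gone while the cluster survives at approximately its original size, which is exactly the behavior relied upon in the segmentation stage.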
In [14], a pure image-processing clustering method is proposed. It utilizes a median filter and the Radon transform to perform cluster extraction. However, the problem with the standalone application of any image- or signal-processing clustering method to semiconductor wafers lies in the size of the data set. As mentioned earlier, the number of points in a wafer data set (200–1000) is very small compared with that in typical images (100 000–10 000 000). Consequently, these techniques tend to produce either high false-acceptance rates or high false-rejection (FR) rates. Additionally, in both cases, the number of false alarms (FAs) is typically quite high. High FA rates will compel the user to turn off or ignore the cluster-extraction algorithm, as demonstrated by human–machine interaction research [25]. Moreover, real manufacturing defect clusters are normally of different and imperfect geometrical shapes. Thus, if a wrong threshold or window size is selected for the noise removal, the extracted cluster will be either over- or underemphasized (see Fig. 6).

During an initial investigation of the problem, a number of the previously proposed defect-clustering algorithms for semiconductor wafers were researched and experimentally studied. Unfortunately, almost all of them failed to provide the level of flexibility and accuracy required by an industrial partner. This is mainly because the investigated methods did not make a clear distinction between clustering and classification. In the majority of cases, they either implicitly assumed that the problem was singular (defect-cluster identification), or they tended to be semiautomated (i.e., they required user input to verify the cluster segmentation). Thus, it was concluded that pure clustering algorithms adopted from other fields were practically unsuitable for direct application to semiconductor wafer data sets.
Such data sets are characterized not only by a very limited number of points but also by a high proportion of those points belonging to the GDPs (or random failures). The most promising of the researched algorithms [24] utilizes joint-count statistics to perform the segmentation. It evaluates each wafer and removes the GDPs. This is a sound approach since an accurate way to account for noisy wafers is through statistical observation. However, the study in [24] was mostly conducted on simulated wafer data. When applied to real manufacturing test data during the experimental study, the resultant wafer maps still showed a high amount of segmented noise. Additionally, the statistical computation employed in the algorithm proved to be quite intensive, requiring up to 48 hours of computation time for cluster detection on a single wafer. Nevertheless, the use of joint-count statistics has been considered a promising approach, and it will be used for detection, rather than segmentation, in Section VI.

III. DATA INTEGRITY VALIDATION AND STANDARDIZATION FRAMEWORK

Although data mining has been very popular in marketing, finance, and health care, it is seldom applied to semiconductor manufacturing due to the inherent complexity of the production processes and data. The semiconductor manufacturing process involves a number of stages, which are typically
Fig. 6. Example of applying standard image processing techniques on fabricated wafers. (a) The original wafer map. (b) The median filter output. (c) The detected defect clusters [24].
performed in different production plants, often located in different parts of the world. Additionally, semiconductor manufacturers produce thousands of different device types in their product lineups. Thus, there are large variations in complexity, technology, equipment, materials, and process steps. It is therefore challenging for any manufacturer to have only one standard database format for all data logging purposes. The number of variations in the data type often results in a significant rate of data corruption, whereas the use of corrupted data during the data mining process inevitably leads to wrong analysis and wrong conclusions. Incorporating faulty analysis into the production flow may in turn lead to significant monetary losses at the end of the process. Some common forms of data corruption found in semiconductor production data include the following.

1) Missing data are the most common form of data corruption, normally caused by poor server connection, incomplete testing, human error during production testing, and data logging into a wrong database.

2) Redundant/repeated data involve rewriting of the data, normally caused by retesting, whereby the data could be written into a different data slot (rather than replacing the previous value in the same data slot).

3) Out-of-range data are acquired data that fall into an unacceptable range, normally due to mislabeling of a measurement unit, automatic data logging into the wrong database column/row, or wrong test sequencing. For instance, a "Fail–Fail–Pass" report sequence is out-of-range data given three test insertions, whereby if any DUT fails a particular test insertion, it should be immediately discarded (and thus would not undergo subsequent testing). This causes some confusion as to whether the device actually passed or failed the testing process. Such cases are particularly common because retesting may be allowed for some devices with lower quality specifications. Thus, the DUT may have been recovered through retesting, but the retest results were not updated in the raw database.

4) Unacceptable data result when, occasionally, some "strange" data within the data set are observed that are unexpected and cannot be explained. This is actually quite common in large databases since all data logging is performed automatically and unsupervised by the available data logging software.
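A cleaning stage along these lines must enumerate every possible result sequence and either resolve it or flag it as corrupt. The sketch below does this for three probe insertions (27 combinations of P/F/N). The specific rules, such as treating a pass after a fail as out-of-range per the "Fail–Fail–Pass" example above, and the error codes, are illustrative assumptions, not the manufacturer's actual lookup table.

```python
from itertools import product

def clean_record(seq):
    """Resolve a 3-insertion probe sequence of 'P' (pass), 'F' (fail),
    'N' (no data) into (status, corrupt_flag, error_code).

    The rules are illustrative assumptions:
      - all-'N' is missing data -> flag corrupt;
      - a 'P' occurring after an 'F' is out-of-range (the DUT should
        have been discarded) -> flag corrupt;
      - otherwise the last non-'N' result decides pass/fail.
    """
    if all(r == "N" for r in seq):
        return ("UNKNOWN", 1, "E01")  # missing data
    if "P" in seq and "F" in seq:
        last_pass = len(seq) - 1 - seq[::-1].index("P")
        if seq.index("F") < last_pass:
            return ("UNKNOWN", 1, "E03")  # out-of-range: pass after fail
    last = next(r for r in reversed(seq) if r != "N")
    return ("PASS" if last == "P" else "FAIL", 0, "000")

# Enumerate all 27 combinations, as the cleaning stage must.
table = {s: clean_record(s) for s in product("PFN", repeat=3)}
```

In a real flow, such a table would be precomputed once per test program and applied row by row to the raw database, with the error code logged for accountability.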
To overcome any data corruption issues, the general framework shown in Fig. 1 (Standardization) is applied. Raw production data normally contain many different features and attributes that may or may not be useful for the purpose of a particular study. It is normally not feasible to process all available data, for several reasons. First, there is the so-called "curse of dimensionality," whereby too many features in the data would provide no clear result since the results may never be statistically significant enough to allow for extraction of useful knowledge. Second, the processing power and time required to mine the data may not be sufficient, even when a very powerful processor is used. Third, some irrelevant data may even show up as "noise" during statistical calculations, which would negatively affect the accuracy of the extracted knowledge.

The aforementioned does not mean that the data set should be kept as small as possible. On the contrary, a better and more rounded analysis can be provided if more attributes are considered [26]. However, these attributes should be screened for relevance to the particular study. Thus, a set of selection criteria, i.e., C = {1, 2, . . . , n}, must first be predetermined based on the problem definition. The problem definition and the selection criteria for the reported research are shown in Table III. During Selection and Addition, all the databases for the device under study must be scanned against the selection criteria. The data are then extracted into a new database.

The Preprocessing and Data Cleaning stages must address all forms of data corruption. Thus, all possible combinations of probe test insertions must be accounted for. For example, for three probe test insertions, there are 27 possible combinations of test results. At the end of the process, a decision must be made on whether the DUT passed or failed the overall probe test. If no decision can be made due to data corruption, the Corrupt Flag must be set. Table IV shows a simple example of data cleaning using a lookup table, where F corresponds to "Fail," N to "No data," and P to "Pass." The Error Code is written to represent each cause of data corruption for accountability purposes. The cleaned data are finally transformed into a prespecified format, which is used by the defect-cluster-extraction algorithm. The new and standardized database has five columns of data for the following fields:

1) Wafer Lot Identification Number (five digits);
2) Wafer Identification Number (two digits);
3) Probe Pass/Fail (one digit);
TABLE III PROBLEM DEFINITION AND SELECTION CRITERIA
TABLE IV EXAMPLE OF A LOOK-UP TABLE FOR PREPROCESSING AND D ATA C LEANING
Fig. 7. Neighborhood used by Intel [27], [29], i.e., 1, 2, 4, 8, 12, 24, 48, and 80.
4) Corrupt Flag (one digit);
5) Error Code (three digits).

IV. LOCAL-YIELD CONVERSION

A Local-Yield concept is proposed based on an empirical study performed by Intel [27]. This study found that a passing die surrounded by failing dies had a higher likelihood of containing a latent defect. It prompted the company to perform correlation studies between test failure rates (in particular, for the burn-in test [28]) and the Local Yield of a die within a range of neighborhoods, i.e., 1, 2, 4, 8, 12, 24, 48, and 80 (see Fig. 7). For example, if the 12 neighborhood is used, the region of interest around die X consists of the dies marked 1, 2, 4, 8, and 12. Equation (1) shows the calculation of Local Yield, where λ0 is the pass/fail status of the die itself, λi is the pass/fail status (0 or 1) of a neighboring die i, and n is the neighborhood used. Fig. 8 shows the conversion of Pass/Fail data from a wafer map to Local-Yield values

\[ \text{Local Yield} = \frac{\lambda_0 + \sum_{i=1}^{n} \lambda_i}{n + 1}. \tag{1} \]
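A direct sketch of (1) for the 8 neighborhood follows. For brevity, off-wafer positions are simply skipped here, with the denominator reduced accordingly; the paper instead handles incomplete edge neighborhoods with the die-replacement scheme described below. The map encoding (1 = pass, 0 = fail, −1 = off-wafer) is an assumption of this sketch.

```python
import numpy as np

def local_yield(wafer, y, x):
    """Local Yield per (1) for the 8-neighborhood:
    (lam0 + sum(lam_i)) / (n + 1), with lam = 1 for pass, 0 for fail.

    Off-wafer positions (value -1) are skipped and the divisor shrinks;
    the paper uses die replacement for edge dies instead.
    """
    vals = [wafer[y, x]]  # lam0: the die itself
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue
            ny, nx = y + dy, x + dx
            if (0 <= ny < wafer.shape[0] and 0 <= nx < wafer.shape[1]
                    and wafer[ny, nx] >= 0):
                vals.append(wafer[ny, nx])
    return sum(vals) / len(vals)

# 1 = pass, 0 = fail (a small toy map, analogous to Fig. 8)
wafer = np.array([
    [1, 1, 1, 1],
    [1, 0, 0, 1],
    [1, 0, 1, 1],
    [1, 1, 1, 1],
])
ly = local_yield(wafer, 2, 2)  # passing die with three failing neighbors
```

For the die at (2, 2), the sum is 6 over 9 positions (itself plus eight neighbors), giving a Local Yield of 6/9; a corner die uses only its three available neighbors.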
Fig. 8. Example of the Local-Yield conversion for the 8 neighborhood.
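A minimal Python sketch of the conversion in (1) for the 8-surrounding neighborhood is given below. The edge handling is a simplifying assumption: missing neighbors are skipped rather than replaced, so the edge bias discussed next is left uncorrected.

```python
# Local-Yield conversion per (1): each die's pass/fail status (1 = pass,
# 0 = fail) is averaged together with its 8-connected neighbors'.
# Edge dies simply use the neighbors that exist (no die replacement).

def local_yield(wafer):
    """wafer: 2-D list of 0/1 pass-fail values; returns same-shape yields."""
    rows, cols = len(wafer), len(wafer[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            total, count = 0, 0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols:
                        total += wafer[rr][cc]   # includes the die itself (λ0)
                        count += 1
            out[r][c] = total / count            # (λ0 + Σ λi) / (n + 1)
    return out
```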
Fig. 9. Die replacement procedure. (a) The wafer regions. (b) Neighborhood of 24 with die replacement. (c) Neighborhood of 24 without die replacement.
The Local-Yield calculation introduces a statistical bias for the dies at the wafer edge, where the neighborhood may be incomplete. To overcome this, the specialists from Intel proposed a die-replacement scheme based on categorizing the wafer map into four regions, as shown in Fig. 9 [27], [29]. Fig. 9(a) shows Regions 1, 2, 3, and 4 in purple, blue, red, and green, respectively. It has been observed that a die at the wafer edge (in Region 1) had better defect
correlation with another die at the opposite end of the wafer edge (also in Region 1) compared with a closer die in Region 4. This could be caused by variations in the wafer fabrication process between the wafer center and the wafer edge, resulting in different defect distributions between Regions 1, 2, 3, and 4. For example, the photoresist thickness has a different uniformity depending on the radius from the wafer center [30]. Thus, to replace a missing neighbor, the closest die in the same or a higher region is selected, as shown in Fig. 9(b). If the neighborhood is complete, no die replacement is needed [see Fig. 9(c)]. In additional studies by both Intel and IBM [27], [29], [31], the Local Yield was used to predict burn-in failure rates. Subsequent studies performed at Freescale [32] confirmed that a die with a low Local Yield has a higher probability of falling within a defect cluster. However, the study also showed that, while the Local Yield is a good defect-cluster indicator, it could be economically unsuitable for prediction purposes due to substantial losses in the case of a wrong prediction. This paper uses the 8-surrounding-neighborhood Local Yield as a defect indicator.
Fig. 10. FA and FR (Segmentation) rates for the full set of threshold values for different wafer yields.
V. DEFECT-CLUSTER SEGMENTATION

Cluster segmentation is performed through connected-components labeling [18]–[20], which is a strictly geometrical method. The advantage of this technique lies in its simplicity and computational speed. It involves analyzing and labeling adjacent objects as belonging to the same group or otherwise. In semiconductor wafer maps, there are only two groups: Pass and Fail (corresponding to the dies that pass or fail the test). Thus, direct application of connected-components labeling would be highly susceptible to noise (i.e., the random failures in the GDP). Therefore, the standardized manufacturing data are first converted to the Local-Yield map, and a noise-removal technique is applied prior to the connected-components labeling. Noise removal is achieved via erosion and dilation (as discussed in Section II and illustrated in Fig. 5). After noise removal, the Local-Yield value of each die must be converted back to binary Pass/Fail data. Since erosion and dilation are performed on the 8-surrounding-neighborhood Local Yield, there are eight possible thresholds, i.e., 1/9, 2/9, 3/9, 4/9, 5/9, 6/9, 7/9, and 8/9. Any die with a Local-Yield value below a particular threshold will be "Failed," whereas any die at or above the threshold will be "Passed." Dies that fail now take the value "1," whereas all other dies have the value "0." Connected-components labeling then groups all adjacent dies with the value "1." The choice of threshold is extremely important in ensuring that the cluster-segmentation algorithm produces minimum errors. There are two types of errors in clustering: FA and FR. FA refers to cases whereby a cluster is incorrectly segmented although it is not present in the wafer, whereas FR refers to cases whereby a cluster is not segmented despite being present in the original wafer map. The selection of the threshold hence becomes a tradeoff between the FA and the FR.
A higher threshold results in a low FR rate but a high FA rate; conversely, a lower threshold results in a high FR rate and a low FA rate. Fig. 10
Fig. 11. Segmentation. (a) The original wafer with 90% yield. With segmentation at thresholds (b) 1/9, (c) 4/9, and (d) 7/9. (e) The original wafer with 70% yield. With segmentation at thresholds (f) 1/9, (g) 4/9, and (h) 7/9.
shows an example of the FA (solid curves) and FR (dotted curves) rates for Device C for wafer yields of approximately 50% (light blue), 70% (red), and 90% (dark blue). For 50% yield, the FA rate for Device C is always 1. For yields of 70% and 90%, there is a clear tradeoff between the FR and FA rates. Fig. 11 shows the results of segmentation after connected-components labeling. The original wafer at 90% and 70% yields is shown in Fig. 11(a) and (e), respectively. For wafers with a high manufacturing yield, selecting a higher segmentation threshold leads to a good cluster-extraction result [see Fig. 11(b)–(d)]. However, selecting a high threshold for a manufacturing yield of 70% tends to over-emphasize the cluster [see Fig. 11(f)–(h)]. It is thus important to note that a low-yield wafer will result in most dies having a Local Yield close to 0, whereas a very high yield wafer will lead to the opposite effect (most dies will have Local-Yield values close to 1). Therefore, the segmentation threshold needs to be adjusted for different wafer yields to prevent cases whereby the clusters are either over- or underemphasized. Further elaboration on the segmentation threshold is provided in Section VIII. Note that the segmentation threshold is chosen based on the LDP wafer map (after Extraction) rather than on the Segmentation wafer map (see Fig. 1).
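The thresholding and connected-components labeling described above can be sketched as follows. The 4-connectivity and the hand-rolled flood fill are illustrative choices, and the erosion-dilation noise-removal step is omitted for brevity.

```python
# Sketch of the Segmentation stage: threshold the Local-Yield map into a
# binary fail map, then label 4-connected components via flood fill.

def segment(local_yield_map, threshold):
    """Return a map of cluster labels (0 = no cluster) for dies whose
    Local Yield falls below the threshold."""
    rows, cols = len(local_yield_map), len(local_yield_map[0])
    fail = [[1 if local_yield_map[r][c] < threshold else 0
             for c in range(cols)] for r in range(rows)]
    labels = [[0] * cols for _ in range(rows)]
    next_label = 0
    for r in range(rows):
        for c in range(cols):
            if fail[r][c] and labels[r][c] == 0:
                next_label += 1
                labels[r][c] = next_label
                stack = [(r, c)]                 # flood-fill one component
                while stack:
                    y, x = stack.pop()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and fail[ny][nx] and labels[ny][nx] == 0):
                            labels[ny][nx] = next_label
                            stack.append((ny, nx))
    return labels
```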
VI. DEFECT-CLUSTER DETECTION
The defect-cluster-detection algorithm is independent of segmentation; thus, the detection and segmentation can be executed concurrently, which significantly decreases the overall computation time. The proposed defect-cluster-detection algorithm shown in Fig. 1 (Detection) iteratively removes from the map those dies that fail the spatial randomness test until all remaining dies pass the test. The removed dies therefore correspond to the detected defect regions. This is achieved by setting the Local-Yield threshold to zero for the first calculation and slowly incrementing it by 1/9 until all dies pass the spatial randomness test. The spatial randomness test involves the calculation of joint counts, whereby each possible "Pass–Pass" (or "Fail–Fail") combination between two dies forms a single joint; thus, two dies are required to form each joint. Let Ω be the set of all dies on a wafer. Given that N0 is the set of index positions of all the devices that fail and N1 is the set of index positions of all the devices that pass, Ω = N1 ∪ N0. Assuming that the total number of devices in the wafer set Ω is W, the total number of joint counts between two passing devices can be written as (2), whereas the total number of joint counts between two failing devices is written as (3), i.e.,

$$T_{N_1} = W^{-1} \sum_{i \in \Omega} \sum_{j \in \Omega} w_i(j)\, I_{N_1}(i)\, I_{N_1}(j) \tag{2}$$

and

$$T_{N_0} = W^{-1} \sum_{i \in \Omega} \sum_{j \in \Omega} w_i(j)\, I_{N_0}(i)\, I_{N_0}(j) \tag{3}$$
where $w_i(j)$ refers to the weight of a particular joint count, which varies according to the importance of that joint count; $I_{N_1}(i)$ is 1 if the die at index position $i$ passes and 0 if it fails; and $I_{N_0}(i)$ is 1 if the die fails and 0 if it passes. The statistics $(T_{N_1}, T_{N_0})$ are asymptotically distributed according to a bivariate normal distribution [33]. In order to normalize the distribution, the mean (or expectation) and variance of $T_{N_1}$ and $T_{N_0}$, as well as their covariance, must be calculated. Let $p$ refer to the probability that a particular device will pass and $q$ to the probability that a particular device will fail. If the total number of devices in set $N_1$ is $n_1$ and the total number of devices in set $N_0$ is $n_0$, then $\hat{p} = n_1/W$ is the empirical estimate of $p$, and $\hat{q} = n_0/W$ that of $q$, where $W = n_0 + n_1$. Denoting the expectation of $T_{N_1}$ given $\hat{p} = n_1/W$ as $E(T_{N_1}|\hat{p} = n_1/W)$, the expectation of $T_{N_0}$ given $\hat{q} = n_0/W$ as $E(T_{N_0}|\hat{q} = n_0/W)$, the variances as $\sigma^2_{T_{N_1}}$ and $\sigma^2_{T_{N_0}}$, respectively, and the covariance of $T_{N_1}$ and $T_{N_0}$ as $\sigma_{T_{N_0}T_{N_1}}$, the transformation to the normal distribution is written as

$$z^2 = \mathbf{x}^{T} \Sigma^{-1} \mathbf{x} \tag{4}$$

where

$$\mathbf{x} = \begin{bmatrix} T_{N_1} - E\!\left(T_{N_1}\middle|\hat{p} = \frac{n_1}{W}\right) \\[4pt] T_{N_0} - E\!\left(T_{N_0}\middle|\hat{q} = \frac{n_0}{W}\right) \end{bmatrix}, \qquad \Sigma = \begin{bmatrix} \sigma^2_{T_{N_1}} & \sigma_{T_{N_0}T_{N_1}} \\[4pt] \sigma_{T_{N_0}T_{N_1}} & \sigma^2_{T_{N_0}} \end{bmatrix}.$$
The z-score is calculated by taking the square root of $z^2$ in (4). If the z-score is lower than a specified z-threshold, the Local-Yield threshold is incremented, and the entire process is repeated for the remaining dies. The selection of the z-threshold is a managerial decision based on the acceptable levels of the FA and FR rates. Therefore, it is advantageous to decide the threshold level through simulation before analyzing the wafer. Wafer-to-wafer variation (yield and defect-cluster type) does not affect this threshold, as all the variations have been normalized in the process of calculating $z^2$.

A. Derivation of the Expectation of $T_{N_1}$ and $T_{N_0}$

The total number of possible combinations in the $N_1$ set is $^{W}C_{n_1} = W!/(n_1!\,n_0!)$. Because the joint-count probability relates to two devices, i.e., two points are fixed, the total number of possible combinations for one joint (with two fixed points) is written as

$$^{(W-2)}C_{(n_1-2)} = \frac{(W-2)!}{(n_1-2)!\,n_0!}. \tag{5}$$
The expectation of $T_{N_1}$, given that $\hat{p} = n_1/W$, can therefore be derived as shown in the following:

$$E\!\left(T_{N_1}\middle|\hat{p} = \frac{n_1}{W}\right) = \frac{^{(W-2)}C_{(n_1-2)}}{^{W}C_{n_1}} = \frac{\;\frac{(W-2)!}{(n_1-2)!\,n_0!}\;}{\;\frac{W!}{n_1!\,n_0!}\;} = \frac{(W-2)!\,n_1!}{W!\,(n_1-2)!} = \frac{n_1(n_1-1)}{W(W-1)}. \tag{6}$$
A similar derivation is performed to find the expectation of $T_{N_0}$. The squared expectation terms of $T_{N_1}$ and $T_{N_0}$ are shown in

$$E^2\!\left(T_{N_1}\middle|\hat{p} = \frac{n_1}{W}\right) = \frac{n_1^2(n_1-1)^2}{W^2(W-1)^2} \tag{7}$$

$$E^2\!\left(T_{N_0}\middle|\hat{q} = \frac{n_0}{W}\right) = \frac{n_0^2(n_0-1)^2}{W^2(W-1)^2}. \tag{8}$$
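The closed form in (6) can be checked by exhaustive enumeration on a small example. The following sketch (an illustration only, not part of the SDC algorithm) averages the indicator that two fixed positions both hold passing dies over all possible arrangements.

```python
# Exhaustive check of the probability in (6): average the indicator
# "positions 0 and 1 both hold passing dies" over all C(W, n1) ways of
# placing n1 passing dies among W positions; result is n1(n1-1)/(W(W-1)).
from itertools import combinations

def both_pass_probability(W, n1):
    arrangements = list(combinations(range(W), n1))  # choose pass positions
    hits = sum(1 for a in arrangements if 0 in a and 1 in a)
    return hits / len(arrangements)
```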
B. Derivation of the Variance and Covariance of $T_{N_1}$ and $T_{N_0}$

The variance of $T_{N_1}$ requires the calculation of $E(T^2_{N_1}|\hat{p} = n_1/W)$, which is obtained by expanding $T^2_{N_1} = T_{N_1} T_{N_1}$, as shown in (9). The expectation of $T^2_{N_1}$ is found by averaging across all possible joint combinations of indexes $i$, $j$, $k$, and $l$. Thus, (9) can be decomposed into four disjoint sets, as illustrated in Fig. 12. There are four disjoint sets of configurations, which depend on the occurrence of the overlap
Fig. 12. Sets of possible joint configurations when the four indexes are (a) on the same location, (b) on two different locations, (c) on three different locations, and (d) on four distinct locations.
between indexes $i$, $j$, $k$, and $l$. This is represented in (11). To simplify the calculations for $E(T^2_{N_1}|\hat{p} = n_1/W)$, $\Phi$ is defined as a general function of two integers $a$ and $b$ [see (10)], where $a$ is the number of unique indexes from the set $N_1$, and $b$ is the number of unique indexes from the set $N_0$. $\Phi$ in (11) refers to the probability (in the space of all possible configurations) of finding $a$ unique indexes from $n_1$ and $b$ unique indexes from $n_0$. For example, let $W = 100$, $n_1 = 90$, and $n_0 = 10$, with two passing dies at positions $i = 13$ and $j = 17$ and two failing dies at overlapping positions $k = l = 19$. Hence, $a = 2$ and $b = 1$, and there are $^{100-3}C_{90-2}$ possible configurations on the wafer with the same pass–fail pattern on $[i, j, k, l]$. The total number of configurations is always $^{100}C_{90} = {^{100}C_{10}}$. Therefore, the probability of finding $[i, j, k, l] = [13(\text{pass}), 17(\text{pass}), 19(\text{fail}), 19(\text{fail})]$ is $^{100-3}C_{90-2}/^{100}C_{90} = 0.0826$, i.e.,

$$T_{N_1}^2 = \left[W^{-1}\sum_{i\in\Omega}\sum_{j\in\Omega} w_i(j)\, I_{N_1}(i)\, I_{N_1}(j)\right] \times \left[W^{-1}\sum_{k\in\Omega}\sum_{l\in\Omega} w_k(l)\, I_{N_1}(k)\, I_{N_1}(l)\right] \tag{9}$$

$$\Phi(a, b) = \frac{^{(W-a-b)}C_{n_1-a}}{^{W}C_{n_1}} = \frac{^{(W-a-b)}C_{n_0-b}}{^{W}C_{n_0}}. \tag{10}$$
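The worked example above can be reproduced directly from (10); the following sketch is illustrative only.

```python
# Φ(a, b) per (10): the probability of finding a given pass-fail pattern on
# a + b fixed, distinct locations, out of C(W, n1) total configurations.
from math import comb

def phi(W, n1, a, b):
    return comb(W - a - b, n1 - a) / comb(W, n1)
```

For W = 100, n1 = 90, a = 2, and b = 1, `phi` returns approximately 0.0826, matching the example.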
The four sets for $T^2_{N_1}$ are as shown in the following (note that, when calculating the variance of set $N_1$, $b = 0$, since there will be no failing devices from set $N_0$).
1) $\Phi(1, 0)$ applies when all pass–pass joint counts from set $N_1$ overlap on the same index (e.g., $[i = j = k = l]$; thus, $a = 1$). Substituting $a = 1$ and $b = 0$ into (10), $\Phi(1, 0) = n_1/W$; $f_1(i, j, k, l)$ is the number of possible combinations for $[i = j = k = l]$ for a wafer (which is dependent on the wafer geometry).
2) $\Phi(2, 0)$ applies when there are two unique indexes from set $N_1$ (e.g., $[(i = j = k) \neq l]$, $[(i = j = l) \neq k]$, $[(j = k = l) \neq i]$, $[(i = j) \neq (k = l)]$, $[(i = k) \neq (j = l)]$, etc.; thus, $a = 2$). $\Phi(2, 0) = n_1(n_1 - 1)/W(W - 1)$; $f_2(i, j, k, l)$ is the number of possible combinations that have two unique indexes for a wafer.
3) $\Phi(3, 0)$ applies when there are three unique indexes from set $N_1$ (e.g., $[(i = j) \neq k \neq l]$, $[(i = k) \neq j \neq l]$, $[(i = l) \neq j \neq k]$, etc.; thus, $a = 3$). $\Phi(3, 0) = n_1(n_1 - 1)(n_1 - 2)/W(W - 1)(W - 2)$; $f_3(i, j, k, l)$ is the number of possible combinations that have three unique indexes for a wafer.
4) Φ(4, 0) is the expected value of TN2 1 such that there are four unique indexes overlapping from set N1 (e.g., [i = j = k = l]; thus, a = 4). Φ(4, 0) = (n1 (n1 − 1)(n1 − 2)(n1 − 3))/(W (W − 1)(W − 2)(W − 3)).f4 (i, j, k, l) is the number of possible combinations that have four unique indexes for a wafer. Since f1 (i, j, k, l), f2 (i, j, k, l), f3 (i, j, k, l), and f4 (i, j, k, l) are the actual joint-count combinations of pass–pass–pass–pass with i, j, k, and l for the particular position; the expectation of TN2 1 is as shown in (11). Similar derivation is applied to calculate the expectation of TN2 0 , as shown in (12). The variance σT2 N is calculated by substituting (7) and (11) into 1 2 (13), whereas σN is obtained by substituting (8) and (12) into 0 (14), i.e., n1 p= E TN2 1 |ˆ = f1 (i, j, k, l)Φ(1, 0)+f2 (i, j, k, l)Φ(2, 0) W +f3 (i, j, k, l)Φ(3, 0) +f4 (i, j, k, l)Φ(4, 0) (11) n0 E TN2 0 |ˆ = f1 (i, j, k, l)Φ(0, 1) + f2 (i, j, k, l)Φ(0, 2) q= W +f3 (i, j, k, l)Φ(0, 3) +f4 (i, j, k, l)Φ(0, 4) (12) n1 n1 σT2 N = E TN2 1 |ˆ p= p= −E 2 TN1 |ˆ (13) 1 W W n0 n0 −E 2 TN0 |ˆ . (14) q= q= σT2 N = E TN2 0 |ˆ 0 W W The covariance of TN1 and TN0 is written as σT2 N TN = 0 1 p = (n1 /W ), qˆ= (n0 /W ))−E(TN1 |ˆ p = (n1 /W ))· E(TN1 TN0 |ˆ q = (n0 /W )). The term E(TN1 TN0 |ˆ p = (n1 /W ), qˆ = E(TN0 |ˆ (n0 /W )) has to be expanded as follows: n0 n1 E TN1 TN0 |ˆ , qˆ = p= W W ⎤ ⎛⎡ wi (j)IN1 (i)IN1 (j)⎦ = E ⎝⎣W −1 i∈Ω j∈Ω
× W −1
⎞ wk (l)IN0 (k)IN0 (l) ⎠
k∈Ω l∈Ω
⎛ = E ⎝W −2
i∈Ω j∈Ω k∈Ω l∈Ω
wk (l)wi (j)IN1 (i) ⎞
×IN1 (j)IN0 (k)IN0 (l)⎠ .
(15)
Although (15) appears to be similar to (9), it is conceptually different, since certain combinations have no possibility of existence when multiplying $T_{N_1}$ (the pass–pass joint count) with $T_{N_0}$ (the fail–fail joint count). In other words, indexes $i$ and $j$ will never overlap with indexes $k$ and $l$. Thus, the calculations must exclude any configuration where the pass–fail
Fig. 13. Graphical illustration of the clustered region. (a) The wafer partitioned by sets N0 and N1 . (b) The wafer partitioned by a selected region R and its complement. (c) Obtained by superimposing (a) and (b).
joint counts are overlapping. Here, $w_i(j)$ refers to the weight of the joint between devices $i$ and $j$ on a wafer; $w_i(j) = 0$ indicates that $i$ and $j$ are not neighbors, and hence, the joint will not be counted.

C. Derivation of the Expectation of $T_{N_1}$ and $T_{N_0}$ Given a Region R

When some dies are removed from the original wafer set, the mean (or expectation) of the set changes. A new mean must be calculated for the set of passing dies, given a region $R$ of the remaining dies. Assuming that the specific joint-count cases in a selected region $R$ are fixed, the expectation of $T_{N_1}$, given $\hat{p} = n_1/W$ and $R$, can be written as $E(T_{N_1}|\hat{p} = n_1/W, R)$, which is generally not equivalent to $E(T_{N_1}|\hat{p} = n_1/W)$. This is because any joint count involving fixed locations will be subjected to different combinatorial factors. This concept is illustrated in Fig. 13. Note that, previously, the wafer set $\Omega$ was divided into the sets $N_1$ and $N_0$, as shown in Fig. 13(a). In this section, a specified region $R$ is inserted, as shown in Fig. 13(b). By superimposing these two partitions, the pattern shown in Fig. 13(c) is obtained, whereby $R_1 = N_1 \cap R$ and $R_0 = N_0 \cap R$. Let $r$, $r_0$, and $r_1$ denote the total number of devices in regions $R$, $R_0$, and $R_1$, respectively. Given that $i$ and $j$ index the wafer $\Omega$ with the total number of dies $W$,

$$E\!\left(T_{N_1}\middle|\hat{p} = \frac{n_1}{W}, R\right) = E\!\left(W^{-1}\sum_{i\in\Omega}\sum_{j\in\Omega} w_i(j)I_{N_1}(i)I_{N_1}(j)\,\middle|\,\hat{p} = \frac{n_1}{W}, R\right) = E(T_{N_1}|i \in R, j \in R) + E(T_{N_1}|i \in R, j \in \bar{R}) + E(T_{N_1}|i \in \bar{R}, j \in R) + E(T_{N_1}|i \in \bar{R}, j \in \bar{R}). \tag{16}$$

Because (16) is very complex to expand, a general function $\theta_1$ is defined, as shown in (17). Equation (16) can then be written as (18). Note that $\theta_1(\Omega, \Omega) = T_{N_1}$, i.e.,

$$\theta_1(A, B) = W^{-1}\sum_{i\in A}\sum_{j\in B} w_i(j)\,I_{N_1}(i)\,I_{N_1}(j) \tag{17}$$

$$E\!\left(T_{N_1}\middle|\hat{p} = \frac{n_1}{W}, R\right) = E\{\theta_1(R, R)\} + E\{\theta_1(R, \bar{R})\} + E\{\theta_1(\bar{R}, R)\} + E\{\theta_1(\bar{R}, \bar{R})\}. \tag{18}$$

There are three possible cases for the $i$ and $j$ indexes, which are as follows.
1) Case 1: $i \in R$, $j \in R$. When both $i$ and $j$ are inside the defined region $R$, there will be no permutation, since the joint-count cases are fixed. Therefore, $E\{\theta_1(R, R)\} = \theta_1(R, R) = \theta_1(R_1, R_1)$.
2) Case 2: $\{i \in R, j \in \bar{R}\}$ or $\{i \in \bar{R}, j \in R\}$. When either one of the indexes is inside the region $R$ while the other is not, the number of possible configurations is $^{(W-r-1)}C_{(n_1-r_1-1)}$ out of the total of $^{(W-r)}C_{(n_1-r_1)}$ possible configurations. Thus, $E\{\theta_1(\bar{R}, R)\} = \theta_1(\bar{R}, R)\,(n_1 - r_1)/(W - r)$.
3) Case 3: $i \in \bar{R}$, $j \in \bar{R}$. When both $i$ and $j$ are outside the defined region $R$, the joint-count configurations are $E\{\theta_1(\bar{R}, \bar{R})\} = \theta_1(\bar{R}, \bar{R})\,(n_1 - r_1)(n_1 - r_1 - 1)/(W - r)(W - r - 1)$.

The same procedure is carried out for the calculation of $E(T_{N_0}|\hat{q} = n_0/W, R)$, whereby $\theta_0$ is defined as $\theta_0(A, B) = W^{-1}\sum_{i\in A}\sum_{j\in B} w_i(j)\,I_{N_0}(i)\,I_{N_0}(j)$.

D. Derivation of the Variance and Covariance of $T_{N_1}$ and $T_{N_0}$ Given a Region R

The variance and the covariance of $T_{N_1}$ and $T_{N_0}$ given a region $R$ must also be derived to obtain the $\mathbf{x}$ term in (4). This requires an evaluation of the second moment of the statistics. To simplify the computation, the summation has to be broken down into different groups, depending on the number of unique elements present in each $i$–$j$–$k$–$l$ link. Two functions, $\psi_1$ and $\psi_0$, are defined to accept four sets as arguments and return a matrix as

$$\psi_1(A, B, C, D) = W^{-2}\sum_{i\in A}\sum_{j\in B}\sum_{k\in C}\sum_{l\in D} w_i(j)w_k(l)\,I_{N_1}(i)I_{N_1}(j)I_{N_1}(k)I_{N_1}(l)\,U(i, j, k, l) \tag{19}$$

$$\psi_0(A, B, C, D) = W^{-2}\sum_{i\in A}\sum_{j\in B}\sum_{k\in C}\sum_{l\in D} w_i(j)w_k(l)\,I_{N_0}(i)I_{N_0}(j)I_{N_0}(k)I_{N_0}(l)\,U(i, j, k, l). \tag{20}$$
The function $U(i, j, k, l)$ returns a $4 \times 4$ matrix. The matrix element $U_{a,b}$ has a value of 0 unless the set $\{i, j, k, l\}$ contains $a$ elements from the set $N_1 \cap \bar{R}$ and $b$ elements from the set $N_0 \cap \bar{R}$, in which case it returns the value 1. Hence, $w_i(j)w_k(l)$ for various $\{i, j, k, l\}$ combinations with different numbers of unique indexes will be added through different channels of the output matrix. The full expression to
Fig. 14. Application of the SDC algorithm on Device C.
calculate $E(T^2_{N_1}|\hat{p} = n_1/W, R)$ is given in (21), whereas the corresponding expression for $E(T^2_{N_0}|\hat{q} = n_0/W, R)$ is given in (22). Both $\Psi_1$ and $\Psi_0$ are $4 \times 4$ matrices, i.e.,

$$E\!\left(T_{N_1}^2\middle|\hat{p} = \frac{n_1}{W}, R\right) = \sum_{a=1}^{4}\sum_{b=1}^{4} \Phi(a, b)\,\Psi_1(a, b) \tag{21}$$

where $\Psi_1$ is the sum of $\psi_1(A, B, C, D)$ over all 16 combinations of the arguments $A, B, C, D \in \{R_1, \bar{R}\}$, i.e., $\Psi_1 = \psi_1(R_1, R_1, R_1, R_1) + \psi_1(R_1, R_1, R_1, \bar{R}) + \cdots + \psi_1(\bar{R}, \bar{R}, \bar{R}, \bar{R})$, and

$$E\!\left(T_{N_0}^2\middle|\hat{q} = \frac{n_0}{W}, R\right) = \sum_{a=1}^{4}\sum_{b=1}^{4} \Phi(a, b)\,\Psi_0(a, b) \tag{22}$$

where $\Psi_0$ is defined analogously as the sum of $\psi_0(A, B, C, D)$ over all 16 combinations of $A, B, C, D \in \{R_0, \bar{R}\}$.

The computation of $E(T_{N_1}T_{N_0}|\hat{p} = n_1/W, \hat{q} = n_0/W, R)$ is slightly more complex. A function $\upsilon$ is defined to accept four sets as arguments and return a $4 \times 4$ matrix, as in (23). Note that the first two input sets $A$ and $B$ are tested for membership in $N_1$, whereas the latter sets $C$ and $D$ are tested for membership in $N_0$. Due to this limitation, certain combinations, such as $i = k$ or $j = k$, are not possible, i.e.,

$$\upsilon(A, B, C, D) = W^{-2}\sum_{i\in A}\sum_{j\in B}\sum_{k\in C}\sum_{l\in D} w_i(j)w_k(l)\,I_{N_1}(i)I_{N_1}(j)I_{N_0}(k)I_{N_0}(l)\,U(i, j, k, l). \tag{23}$$

Finally, the full expression to evaluate $E(T_{N_1}T_{N_0}|\hat{p} = n_1/W, \hat{q} = n_0/W, R)$ is governed by

$$E\!\left(T_{N_1}T_{N_0}\middle|\hat{p} = \frac{n_1}{W}, \hat{q} = \frac{n_0}{W}, R\right) = \sum_{a=1}^{4}\sum_{b=1}^{4} \Phi(a, b)\,\Upsilon(a, b) \tag{24}$$

where $\Upsilon$ is the sum of $\upsilon(A, B, C, D)$ over all 16 combinations with $A, B \in \{R_1, \bar{R}\}$ and $C, D \in \{R_0, \bar{R}\}$, i.e., $\Upsilon = \upsilon(R_1, R_1, R_0, R_0) + \upsilon(R_1, R_1, R_0, \bar{R}) + \cdots + \upsilon(\bar{R}, \bar{R}, \bar{R}, \bar{R})$.

A naive implementation of functions such as $\theta_1$, $\theta_0$, $\psi_1$, $\psi_0$, and $\upsilon$ using nested loops is impractically slow. However, with a more careful analysis of the underlying symmetries and uniqueness constraints, it is possible to significantly reduce the total number of arithmetic operations and speed up the computation.

VII. CLUSTER EXTRACTION

After Segmentation (presented in Section V) and Detection (discussed in Section VI), two wafer maps are produced. Cluster Extraction combines these two wafer maps to form the final LDP wafer map as the final stage of the SDC algorithm. Fig. 14 shows a graphical example of how the SDC algorithm is applied to a set of manufacturing data for Device C. The original wafer map contains a noticeable defect cluster near the center. The detected cluster locations from the Detection wafer map are used to validate the segmented cluster regions from the Segmentation wafer map. This is achieved using an Intersect operation. Next, the valid segments from both wafer maps are combined via a Union operation. The final step of cluster extraction allows the user to input the minimum cluster size, whereby any cluster below a predetermined size will be removed. The purpose of this step is to increase the robustness of the clustering algorithm by allowing it to be fine tuned to suit a specific application. Since the 8-neighborhood Local Yield is used, the absolute minimum cluster size is 9. Engineers can also choose to consider only large clusters (e.g., those exceeding 50% of the total dies/wafer), depending on the need. For a stable manufacturing process, the minimum cluster size can be increased to reduce the FA rate (at the expense of a higher FR rate).
The result is reported as the Local Defect-Pattern (LDP) wafer map.

It is insufficient to apply either the Segmentation or the Detection as a stand-alone algorithm to extract the defect clusters. The Segmentation algorithm works by highlighting any region with a low Local Yield, regardless of the failure distribution in other regions of the wafer. Fig. 15(b) shows a typical example whereby several clusters were segmented, although the engineers agreed that there were no identifiable clusters present on
Fig. 15. Typical example of applying (b) Segmentation, (c) Detection, and (d) Cluster Extraction to (a) the original wafer map for Device C, whereby the Segmentation produced a lot of noise that was then erased by the Detection algorithm.
the original wafer [see Fig. 15(a)]. This is remedied by the Detection algorithm, since the overall failure distribution on the original wafer was found to be random [see Fig. 15(c)]. Thus, by combining the results of the two algorithms, the FAs caused by the Segmentation algorithm are greatly reduced. In contrast, the Detection algorithm assesses the degree of nonrandomness in the failure distribution based on the overall wafer yield. The centers of nonrandom neighborhoods are then identified using their respective Local Yields (discussed in Section IV). This second step has a tendency to fragment large clusters into several smaller ones or to reduce the size of the detected cluster, since only the peaks are detected. Figs. 16(a) and 17(a) show two examples of a bulls-eye defect pattern that was fragmented [see Fig. 16(c)] and reduced in size [see Fig. 17(c)], respectively. This problem is remedied in the cluster-extraction stage in Figs. 16(d) and 17(d), respectively. Although the segmentation tends to give a better representation of the defect clusters, it introduces a very high rate of FAs. Table V shows the FA and FR rates obtained from simulating 5000 wafers for each device at 70% yield. The segmentation threshold was chosen as 0.33 to minimize the FR rates. If the Segmentation algorithm is used stand alone, the FA (or noise) rate is almost 100% for all the devices. However, by using the SDC algorithm, the Segmentation noise (or FA rate) is reduced to below 3% for Devices A and B and below 1% for Devices C–F, with negligible effect on the FR rates.

VIII. SEGMENTATION THRESHOLD SELECTION THROUGH WAFER DEFECT SIMULATION

The segmentation threshold discussed in Section V must be selected based on the characteristics of the device and the wafer yield.

Fig. 16. Typical example of applying (b) Segmentation, (c) Detection, and (d) Cluster Extraction to (a) the original wafer map for Device C, whereby the Detection algorithm fragments a larger cluster into several smaller areas.

Fig. 17. Typical example of applying (b) Segmentation, (c) Detection, and (d) Cluster Extraction to (a) the original wafer map for Device C, whereby the Detection algorithm significantly reduces the size of the cluster.

It would not be feasible to obtain the segmentation threshold purely through experimental trials. This is because there will be insufficient samples to confidently select the threshold value, even if the device is manufactured in high volume. Thus, the statistics are highly likely to be biased. Additionally, in any mature production cycle, the occurrence of wafers with frequently changing yields is not normal. Typically, the yield remains within a particular range, with some outliers. Therefore, in order to obtain a reliable estimation of the FA and FR rates associated with each segmentation threshold, a defect-cluster simulator is used. Fig. 18 shows the three stages of the defect-cluster simulation. In the first stage, a "perfect" LDP is produced using a set of knowledge rules, such as the location and shape of the defect.
TABLE V FA AND FR RATES FOR SEGMENTATION AND SDC ALGORITHMS, RESPECTIVELY, FOR 70% YIELD WITH SEGMENTATION THRESHOLD = 0.33. THE DATA ARE OBTAINED FROM SIMULATING 5000 WAFERS FOR EACH DEVICE
IX. EXPERIMENTAL RESULTS AND DISCUSSION
Fig. 18. Defect cluster simulation stages.
For example, the Edge LDP has its cluster location set to the wafer edges, whereas a Bulls eye LDP clusters at the wafer center. In Stage 2, some failing dies are randomly inserted into or omitted from the defect map obtained in Stage 1 to produce a more realistic defect pattern. The last stage of the defect-cluster simulation inserts the GDP (or random failures) into the defect map from Stage 2 based on a predetermined wafer yield. In the reported research, 5000 wafers were generated for each of 23 wafer yields (ranging from 10% to 90%), seven defect types (including random failure), and six device types (the total number of simulated wafers was 4 800 000). The reason for simulating 5000 wafers for each point was to ensure that the FA and FR rates converge to an asymptotic value, as shown in Fig. 19, which is produced for Device C. Similar graphs were produced for all other devices to verify that the simulation results are stable with 5000 wafers. Fig. 20 shows an example of the FA rates for Device C at different manufacturing yields after the SDC application. It is observed that, for manufacturing yields below 40%, the probability of an FA is about 0.3, regardless of the threshold value. This is true for all six devices. When the manufacturing yield is at or above 40%, the user can choose a different threshold, depending upon the tradeoff between the allowable FA (see Fig. 20) and FR (see Fig. 21) rates. The optimal tradeoff point depends on the user's requirements. For this paper, the segmentation-threshold values were chosen (see Table VI) by the industry partner based on the intended applications of the devices. It was initially suspected that differences in technology and process steps may affect the threshold selection. However, this was not found to be the case. Instead, the segmentation threshold was only affected by the number of dies/wafer and the wafer yield.
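A minimal sketch of the three-stage simulator is given below, with a hypothetical Bulls-eye rule. The knowledge rule, noise level, and circular cluster shape are illustrative assumptions rather than the simulator actually used in the reported research.

```python
# Three-stage defect-cluster simulator sketch (Fig. 18): (1) place an ideal
# cluster at the wafer center, (2) randomly flip a few dies to roughen it,
# (3) overlay a global random-failure pattern for a target yield.
# 1 = pass, 0 = fail; all parameters are illustrative.
import random

def simulate_wafer(size, cluster_radius, yield_target, noise=0.02, seed=0):
    rng = random.Random(seed)
    cx = cy = size / 2 - 0.5
    wafer = [[1] * size for _ in range(size)]
    for r in range(size):                              # Stage 1: ideal LDP
        for c in range(size):
            if (r - cy) ** 2 + (c - cx) ** 2 <= cluster_radius ** 2:
                wafer[r][c] = 0
    for r in range(size):                              # Stage 2: roughen edges
        for c in range(size):
            if rng.random() < noise:
                wafer[r][c] ^= 1
    for r in range(size):                              # Stage 3: insert GDP
        for c in range(size):
            if wafer[r][c] == 1 and rng.random() > yield_target:
                wafer[r][c] = 0
    return wafer
```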
This is primarily because the Segmentation algorithm uses an image processing technique (erosion–dilation), which treats each die as a “pixel.” Because this technique utilizes only the spatial relationship between the dies, it highlights any large failure region, regardless of the type of cluster.
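The erosion-dilation treatment of dies as "pixels" can be sketched as a morphological opening on a binary fail map. This is a simplified binary version (the paper applies erosion and dilation to the Local-Yield map), and the 4-connected structuring element is an assumed choice.

```python
# Morphological opening sketch: erosion clears a failing die unless all of
# its in-bounds 4-connected neighbors also fail; dilation then regrows the
# boundary of surviving regions. Isolated failing dies (noise) are removed.

def _morph(fail_map, keep_if):
    rows, cols = len(fail_map), len(fail_map[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            neigh = [fail_map[r + dr][c + dc]
                     for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
                     if 0 <= r + dr < rows and 0 <= c + dc < cols]
            out[r][c] = keep_if(fail_map[r][c], neigh)
    return out

def open_map(fail_map):
    """Erosion followed by dilation on a 0/1 fail map."""
    eroded = _morph(fail_map, lambda v, n: 1 if v and all(n) else 0)
    return _morph(eroded, lambda v, n: 1 if v or any(n) else 0)
```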
The SDC algorithm was applied to all six devices shown in Table I using the respective segmentation thresholds shown in Table VI. The algorithm can successfully extract five of the six known defect types shown in Fig. 3, i.e., Bulls eye, Blob, Edge, Ring, and Line, from all six devices, despite their differences in manufacturing yield and dies/wafer. Some examples of the results are shown in Figs. 22 and 23. The Hat LDP only occurs for Devices B and C, for which the SDC results are shown in Fig. 24. Table VII summarizes the performance of the proposed SDC algorithm. There are four possible states (or outcomes) of the SDC output, which are as follows. 1) False Positive (FP) is the incorrect extraction of a cluster that is not present in the original wafer [see Fig. 25(c)]. This is also known as the FA. 2) False Negative (FN) is the incorrect exclusion of a cluster that is genuinely present in the original wafer [see Fig. 25(d)]. This is also known as the FR. 3) True Positive is the correct extraction of a cluster that is genuinely present in the original wafer [see Fig. 25(a)]. 4) True Negative is the correct exclusion of a cluster that does not exist in the original wafer [see Fig. 25(b)]. The SDC algorithm can accurately extract defect clusters with a small degree of error. For devices with an average manufacturing yield above 40%, the FR and FA rates (or FN and FP, respectively, in Table VII) are generally below 10%. In other words, the average accuracy of the algorithm is 90% for manufacturing yields above 40%. Device F has a very low manufacturing yield of below 40%, and the observed FA rate after SDC application is very high, thus confirming the simulated results. The SDC algorithm is highly suitable as a tool for yield improvement and process-control procedures for mature devices, since mature manufacturing processes tend to have yields above 70%.
If defect clusters are found through data mining with SDC and the sources of these clusters are isolated, the yield can be further increased. This can significantly increase the profitability of the company in the long run, particularly since the same implementation can be applied to all devices within the same family or product group. Additionally, the average computational time of the SDC algorithm is less than 3 s/wafer, even when running on a commodity computer powered by an Intel Pentium T4400 (2.2-GHz) processor. This opens the possibility of analyzing wafer patterns in real time, without increasing the test time and cost.
Fig. 19. Simulation stability at 5000 wafers after the SDC application for the (a) FA and (b) FR rates for Device C.

TABLE VI
SEGMENTATION THRESHOLD VALUES FOR DEVICES A–F FOR DIFFERENT WAFER YIELDS
Fig. 20. FA rates for Device C after the SDC application at different manufacturing yields.
Fig. 21. FR rates for Device C after the SDC application at different manufacturing yields.
X. CONCLUSION

Production test data enhanced with data mining can provide valuable insight into the actual semiconductor manufacturing process on top of its traditional fault isolation role. This paper has presented a real-time data mining solution built around the SDC algorithm that can automatically and accurately extract defect clusters from raw wafer probe test production data. The experimental results have shown that this solution achieves an accuracy of approximately 90% for devices with over 40% manufacturing yield. The obtained results can be fed back into the manufacturing process loop to identify the sources of defects, leading to yield improvement and better quality control. The implementation of the proposed technique does not significantly increase test time and cost, making it a feasible and attractive tool for industrial applications.

This paper is a further development of the authors' initial ideas presented at the conferences in [4] and [34]. The material has undergone very substantial enhancement since those earlier communications. In essence, the conference papers were rapid publications that, in very brief form, introduced the initial proposal for defect clustering and outlined the integrated framework for data mining with the SDC algorithm. In addition to the concise description, the SDC algorithm presented in the earlier publications had very limited experimental trial information, involving just two particular devices. In this paper, the SDC approach is presented in the full detail expected of a scholarly journal publication, with all the required comprehensive derivations and illustrations, thus making it clearly explicable to the audience and applicable to engineering practice. This paper has also presented results of substantial new experimental trials, in which several million units of six different types of mainstream high-volume production semiconductor devices were tested and analyzed. These devices were of different implementation technologies (half-pitch sizes), different die sizes (dies/wafer counts), different numbers of metallization layers, etc. The devices were selected for experimental study, in
Fig. 22. Application of the SDC algorithm to Devices A, B, and C for Bulls eye, Blob, Edge, Ring, and Line.
Fig. 23. Application of the SDC algorithm to Devices D, E, and F for Bulls eye, Blob, Edge, Ring, and Line.
consultation with the industrial partner (one of the world's major semiconductor manufacturers), and their characteristics and test datalog information were made available to the researchers. The very substantial volume of this real-world production and experimental data makes the paper rather unique, as reliable engineering data related to on-going manufacturing and experimental research are almost impossible to find in the literature due to confidentiality requirements. The body of this paper (research derivations, discussions, illustration materials, etc.) has been extended to include new findings, plans for future research, and observations from the extended experiments.
Fig. 24. Application of the SDC algorithm on Hat defect clusters for Devices (a) B and (b) C.

TABLE VII
PERFORMANCE OF THE SDC ALGORITHM

Fig. 25. Performance categories of the SDC algorithm.

ACKNOWLEDGMENT

The authors would like to express their sincere thanks to Freescale Semiconductor Malaysia for the provision of equipment and data, as well as for the continuing encouragement and support of this research. The authors would also like to thank Monash University for the provision of laboratory facilities, postgraduate scholarships, and research funding.

REFERENCES
[1] International Technology Roadmap for Semiconductors, Report Test and Test Equipment 2009 Edition, 2009. [Online]. Available: http://www.itrs.net/Links/2009ITRS/Home2009.htm
[2] S. Ahn, Z. Patitz, N.-J. Park, H. J. Kim, and N. Park, "A floorprint-based defect tolerance for nano-scale application-specific IC," IEEE Trans. Instrum. Meas., vol. 58, no. 5, pp. 1283–1290, May 2009.
[3] L. Bechou, D. Dallet, Y. Danto, P. Daponte, Y. Ousten, and S. Rapuano, "An improved method for automatic detection and location of defects in electronic components using scanning ultrasonic microscopy," IEEE Trans. Instrum. Meas., vol. 52, no. 1, pp. 135–142, Feb. 2003.
[4] M. P.-L. Ooi, E. K. J. Sim, Y. C. Kuang, L. Kleeman, C. Chan, and S. Demidenko, "Automatic defect cluster extraction for semiconductor wafers," in Proc. IEEE Int. Instrum. Meas. Technol. Conf., Austin, TX, 2010, pp. 1024–1029.
[5] X. Zhao and L. Cui, "Defect pattern recognition on nano/micro integrated circuits wafer," in Proc. 3rd IEEE Int. Conf. NEMS, Sanya, China, 2008, pp. 519–523.
[6] I. Kononenko and M. Kukar, Machine Learning and Data Mining: Introduction to Principles and Algorithms. Cambridge, U.K.: Horwood, 2007.
[7] Q. Luo, "Advancing knowledge discovery and data mining," in Proc. 1st Int. Workshop Knowl. Discovery Data Mining, Adelaide, SA, Australia, 2008, pp. 3–5.
[8] C. Apte, "Data mining: An industrial research perspective," Comput. Sci. Eng., vol. 4, no. 2, pp. 6–9, Apr.–Jun. 1997.
[9] D. A. Forsyth and J. Ponce, Computer Vision: A Modern Approach. Englewood Cliffs, NJ: Prentice-Hall, 2003.
[10] W. E. Snyder and H. Qi, Machine Vision. Cape Town, South Africa: Cambridge Univ. Press, 2004.
[11] L. G. Shapiro and G. C. Stockman, Computer Vision. Englewood Cliffs, NJ: Prentice-Hall, 2001.
[12] R. Jornsten, Y. Vardi, and C. H. Zhang, "A robust clustering method and visualisation tool based on data depth," in Statistical Data Analysis Based on the L1-norm and Related Methods. Basel, Switzerland: Birkhauser, 2002, pp. 353–366.
[13] A. K. Pujari, Data Mining Techniques, 4th ed. Andhra Pradesh, India: Orient BlackSwan, 2001.
[14] C.-J. Huang, C.-C. Wang, and C.-F. Wu, "Image processing techniques for wafer defect cluster identification," IEEE Des. Test Comput., vol. 19, no. 2, pp. 44–48, Mar./Apr. 2002.
[15] J. C. Russ, The Image Processing Handbook, 5th ed. Boca Raton, FL: CRC Press, 2006.
[16] M. B. Dillencourt, H. Samet, and M. Tamminen, "A general approach to connected component labeling for arbitrary image," J. ACM, vol. 39, no. 2, pp. 253–280, Apr. 1992.
[17] H. Samet and M. Tamminen, "An improved approach to connected component labeling of images," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 1986, pp. 312–318.
[18] Y. Yang and D. Zhang, "A novel line scan clustering algorithm for identifying connected components in digital images," Image Vis. Comput., vol. 21, no. 5, pp. 459–472, May 2003.
[19] L. He, Y. Chao, and K. Suzuki, "A linear-time two-scan labeling algorithm," in Proc. IEEE ICIP, 2007, vol. 5, pp. V-241–V-244.
[20] L. Di Stefano and A. Bulgarelli, "A simple and efficient connected components labeling algorithm," in Proc. 10th Int. Conf. Image Analysis Process., Los Alamitos, CA, 1999, pp. 322–327.
[21] C. H. Stapper, "LSI yield modeling and process monitoring," IBM J. Res. Develop., vol. 20, no. 3, pp. 228–234, May 1976.
[22] N. Kumar, K. Kennedy, K. Gildersleeve, R. Abelson, C. M. Mastrangelo, and D. C. Montgomery, "A review of yield modelling techniques for semiconductor manufacturing," Int. J. Prod. Res., vol. 44, no. 23, pp. 5019–5036, Dec. 2006.
[23] F.-L. Chen and S.-F. Liu, "A neural-network approach to recognize defect spatial pattern in semiconductor fabrication," IEEE Trans. Semicond. Manuf., vol. 13, no. 3, pp. 366–373, Aug. 2000.
[24] D. Friedman, M. Hansen, V. Nair, and D. James, "Model-free estimation of defect clustering in integrated circuit fabrication," IEEE Trans. Semicond. Manuf., vol. 10, no. 3, pp. 344–359, Aug. 1997.
[25] R. Parasuraman and V. Riley, "Humans and automation: Use, misuse, disuse, abuse," Hum. Factors, vol. 39, no. 2, pp. 230–253, Jun. 1997.
[26] O. Maimon and L. Rokach, The Data Mining and Knowledge Discovery Handbook, 1st ed. New York: Springer Science+Business Media Inc., 2005.
[27] R. B. Miller and W. C. Riordan, "Unit level predicted yield: A method of identifying high defect density die at wafer sort," in Proc. Int. Test Conf., Baltimore, MD, 2001, pp. 1118–1127.
[28] M. P.-L. Ooi, Z. A. Kassim, and S. N. Demidenko, "Shortening burn-in test: Application of HVST and Weibull statistical analysis," IEEE Trans. Instrum. Meas., vol. 56, no. 3, pp. 990–999, Jun. 2007.
[29] R. B. Miller, W. C. Riordan, and E. R. S. Pierre, "Reliability improvement and burn in optimization through the use of die level predictive modeling," in Proc. 43rd IEEE Annu. Int. Rel. Symp., San Jose, CA, 2005, pp. 435–445.
[30] A. Tay, W. K. Ho, X. Wu, and X. Chen, "In situ monitoring of photoresist thickness uniformity of a rotating wafer in lithography," IEEE Trans. Instrum. Meas., vol. 58, no. 12, pp. 3978–3984, Dec. 2009.
[31] T. S. Barnett and A. D. Singh, "Relating yield models to burn-in fall-out in time," in Proc. Int. Test Conf., Charlotte, NC, 2003, pp. 77–84.
[32] M. P.-L. Ooi, C. Chan, S.-L. Lee, A. A. Mohanan, L. Y. Goh, and Y. C. Kuang, "Towards identification of latent defects: Yield mining using defect characteristic model and clustering," in Proc. IEEE/SEMI Adv. Semicond. Manuf. Conf., Berlin, Germany, 2009, pp. 194–199.
[33] A. J. Lee, U-Statistics, Theory and Practice. New York: Marcel Dekker, 1990.
[34] M. P.-L. Ooi, "Fast and accurate automatic defect cluster extraction for semiconductor wafers," in Proc. 5th IEEE Int. Symp. Electron. Des., Test Appl., Ho Chi Minh City, Vietnam, 2010, pp. 276–280.
Melanie Po-Leen Ooi received the B.Eng. (Hons.) degree with First Class Honors and the M.Eng.Sc. (Research) degree in 2003 and 2006, respectively, from Monash University, Sunway Campus, Petaling Jaya, Selangor, Malaysia, where she is currently working toward the Ph.D. degree. She is a Lecturer with Monash University, Sunway Campus. Through extensive collaboration with Freescale Semiconductor and Texas Instruments, she has developed new testing methodologies for integrated circuits. Her areas of research include embedded hardware design, testing and design of microelectronic devices, and image and signal processing.
Eric Kwang Joo Sim received the B.Eng. (Hons.) degree in electrical and computer system engineering from Monash University, Sunway Campus, Petaling Jaya, Selangor, Malaysia. He completed this degree with a final-year research project on microelectronic testing in joint collaboration with Freescale Semiconductor. He is currently with Monash University, Sunway Campus.
Ye Chow Kuang received the B.Eng. (Hons.) degree in electromechanical engineering with First Class Honors and the Ph.D. degree, for which he studied and modeled noninvasive diagnostic techniques for distribution transformers, from the University of Southampton, Southampton, U.K., in 2000 and 2004, respectively. Since 2005, he has been a Lecturer with Monash University, Sunway Campus, Petaling Jaya, Selangor, Malaysia. His research interests are in machine intelligence, algorithm development, signal processing, and statistical modeling.
Serge Demidenko (M’91–SM’94–F’04) received the M.E. degree from the Belarusian State University of Informatics and Radio Electronics, Minsk, Belarus, in 1977 and the Ph.D. degree from the Belarusian Academy of Sciences, Minsk, in 1984. He was the Head of the School of Engineering, Monash University, Sunway Campus, Petaling Jaya, Selangor, Malaysia, and the Chair of the Department of Electronic Engineering, Massey University, Palmerston North, New Zealand. He is currently the Vice President (Academic) and the Head of the Centre of Technology, Royal Melbourne Institute of Technology International University, Ho Chi Minh City, Vietnam. His research areas include electronic design and test, fault tolerance, and signal processing. Dr. Demidenko is a Fellow of the Institution of Engineering and Technology and a U.K. Chartered Engineer.
Lindsay Kleeman received the B.S. degrees in electrical engineering in 1982 and mathematics in 1983, both with university medals, and the Ph.D. degree from the University of Newcastle, Newcastle, Australia, in 1986. He is currently an Associate Professor and the Deputy Head with the Department of Electrical and Computer Systems Engineering, Monash University, Clayton Campus, Clayton, Vic., Australia. He has over a hundred research publications in the areas of robotics, sensing, and digital systems. He has been an IEEE Centennial student and the president of the Computer Chapter of the IEEE Victorian Section.
Chris Wei Keong Chan received the B.Eng. (Hons.) degree in electrical and electronics engineering with First Class Honors from Sussex University, Falmer, U.K., in 1997. He is currently working toward the MBA degree with Manchester Business School, Manchester, U.K. He currently works with Freescale Semiconductor, Petaling Jaya, Selangor, Malaysia, as the New Product Introduction Department Manager. He focuses on improving new product manufacturability through upfront engagement in development. His experience in the industry includes yield and process integration in wafer fabrication, probe, and final testing.