finding descriptors useful for data mining in the ... - CiteSeerX

Copyright ©JCPDS - International Centre for Diffraction Data 2004, Advances in X-ray Analysis, Volume 47.

FINDING DESCRIPTORS USEFUL FOR DATA MINING IN THE CHARACTERIZATION DATA OF CATALYSTS

C. K. Lowe-Ma, A. R. Drews, A. E. Chen
Research & Advanced Engineering, Ford Motor Company, Dearborn, Michigan, USA

ABSTRACT

The ultimate goal of materials characterization is often to optimize materials by relating observed features to a response function or performance specification. For X-ray data to be successfully included in statistical or data mining methodologies that examine contributions to a response function, sufficient pieces of information of the right kind must be extracted from the X-ray data. Traditional X-ray analysis methods using individual comparisons cannot keep up with the flux of specimens and data needed for data mining approaches to materials optimization. The work described herein focuses on obtaining descriptors from X-ray fluorescence, X-ray powder diffraction, and other characterization data from automotive exhaust-gas catalysts using automated or semi-automated processes, and relating these descriptors to other performance measures. Our results are also relevant to informatics requirements for high-throughput screening and combinatorial studies.

INTRODUCTION

The goal of this work is to combine X-ray powder diffraction features with other characterization data to build up mathematical relationships in automated or semi-automated processes that not only describe existing data, but can also predict results and materials performance. These new data analysis approaches can (a) help to develop better catalysis strategies and new materials, (b) help to understand failure mechanisms, and (c) help in examining large numbers of fleet and customer-aged catalysts for usage-dependent aging. Understanding and improving automotive exhaust-gas catalysis enables us to improve air quality by continuing to mitigate undesirable exhaust gases.
Although automotive exhaust-gas catalysts are only one component of a complex exhaust emissions system, the catalysts themselves are also complex heterogeneous chemical systems designed to perform multiple functions. An example of an automotive exhaust-gas catalyst is shown in Figure 1.

FINDING AND USING DESCRIPTORS

Statistical methods and data mining algorithms have evolved to handle discrete bits of information that are abstractions of data and that may not have simple physical interpretations. Ideal descriptors are those that provide real distinctions amongst data without redundancy. The value of descriptors derives from using them to enable comparisons between disparate, unrelated types of characterization data. One of the challenges of obtaining useful descriptors is that the results of subsequent statistical analyses or data mining algorithms may depend strongly on the form and choice of descriptors [1].

Complex data, such as X-ray powder diffraction patterns of catalyst materials, are often difficult and very time consuming to interpret, making detailed individual interpretations impractical if large numbers of samples are involved. Accurately predicting catalyst performance over a wide variety of scenarios requires models built upon large numbers of specimens and upon data from different sources, hence the need to find data-driven descriptors that provide the most useful information from the fewest variables in an automated fashion.

Figure 1. Shown at the left is a catalytic converter brick for a vehicle. The magnified image in the middle shows the channels in the brick through which engine exhaust gases pass. The electron microscopy image at the right shows a single corner of a channel, with the active catalytic material that has been washcoated onto the substrate.

Most physical characterization techniques (and their associated software) have evolved to examine one sample, or a small number of samples, at a time. For example, X-ray powder diffraction scans are collected sequentially, one at a time, on an individual specimen. Each resulting diffraction scan is processed either by hand or in a batch mode for baseline correction, possibly some additional geometric corrections, and peak picking. Each processed diffraction scan is then analyzed to identify the phases present, possibly analyzed for crystallite size or quantitative information, and relationships to other characterization data are deduced manually. The limitations of this conventional approach are obvious: it is problematic for materials containing many phases with severe overlap; it is problematic for complex mixtures of crystalline and poorly crystalline materials; and it is certainly problematic for handling large numbers of diffraction patterns containing many phases of variable crystallinity mixed with highly crystalline (but uninteresting) substrate phases. Figure 2 shows representative powder diffraction data obtained from the active catalytic material scraped from three different catalysts and illustrates how unrealistic a conventional approach might be.

Several approaches to computationally examining diffraction patterns were considered. Described herein are results obtained by: (a) using whole-pattern (SNAP)-derived correlations and peaks as descriptors, (b) using expectation maximization to bin peaks into clusters and using the clusters as descriptors, and (c) using principal component analysis of large regions of raw powder pattern data to obtain key factors with high variance (high information content).

COMPUTATIONALLY-DERIVED DESCRIPTORS INSTEAD OF PHASE DESCRIPTORS

Using the non-parametric whole-pattern analysis described by Gilmore et al. [2] (also called SNAP), correlations amongst a set of catalyst diffraction patterns were obtained from pair-wise comparisons between the catalyst diffraction patterns and a standard set of reference patterns for ceria-zirconia of different compositions and crystallinity. When examined for clustering, these correlations yield the hierarchical cluster analysis tree shown in Figure 3 (obtained without pruning, using standard tree clustering algorithms [3]). The resulting tree exhibits three (or possibly four) major clusters. These correlation values (or the mean value of each cluster) could therefore be used in subsequent analyses as a variable (a descriptor) representative of the type and composition of ceria-zirconia present in the catalyst.
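The idea of clustering patterns by their pair-wise correlations can be illustrated with a minimal sketch. It is not the SNAP procedure itself: plain Pearson correlation stands in for the non-parametric whole-pattern comparison, average linkage is an assumed choice of tree-clustering rule, and the toy patterns, the function names, and the threshold of 0.5 are all invented for illustration.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation between two equal-length intensity profiles."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

def cluster_patterns(patterns, threshold):
    """Average-linkage agglomerative clustering on the distance d = 1 - r.

    Merging stops once the closest pair of clusters is farther apart than
    `threshold`. Returns a sorted list of clusters of pattern indices.
    """
    n = len(patterns)
    dist = [[1.0 - pearson(patterns[i], patterns[j]) for j in range(n)]
            for i in range(n)]
    clusters = [[i] for i in range(n)]

    def linkage(c1, c2):
        # Mean distance over all cross-cluster pairs (average linkage).
        return sum(dist[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        pairs = [(linkage(clusters[p], clusters[q]), p, q)
                 for p in range(len(clusters))
                 for q in range(p + 1, len(clusters))]
        d, p, q = min(pairs)
        if d > threshold:
            break
        merged = sorted(clusters[p] + clusters[q])
        clusters = [c for k, c in enumerate(clusters) if k not in (p, q)]
        clusters.append(merged)
    return sorted(clusters)

# Hypothetical toy patterns: intensity at successive 2-theta steps.
patterns = [
    [0, 1, 5, 1, 0, 0, 0, 0],
    [0, 2, 9, 2, 0, 0, 0, 1],
    [0, 0, 0, 0, 1, 6, 1, 0],
    [1, 0, 0, 0, 2, 8, 2, 0],
]
clusters = cluster_patterns(patterns, threshold=0.5)
print(clusters)
```

The cluster index of each pattern (or the mean correlation within its cluster) is then a single categorical or numerical descriptor that can be carried into later analyses.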

Figure 2. Representative diffraction data obtained from the catalytic material in automotive catalysts.

Figure 3. Hierarchical cluster analysis tree illustrating the clustering of correlation coefficients obtained from SNAP (see text). The horizontal axis shows individual data labels.

A concern with using SNAP-derived correlations is that every pair-wise comparison yields only a single value, which may not adequately represent the complexity and subtle differences between diffraction patterns of the type shown in Figure 2. For this reason, an approach using peaks was also considered. Peak positions and intensities (d's and I's) have historically been used as short-hand descriptors for powder patterns. However, the mechanics of the approach we used were rather different and more amenable to computational methods. A complete list of all observed peak positions from a number of diffraction patterns was used as input to a clustering algorithm based on expectation maximization [4]. Expectation maximization examines distributions of data and develops naïve Bayesian probabilities about which data values (peak positions) belong together. A sampling of the clustering (binning) results is shown in Table I. Then, using normalized intensities (from zero to one) to represent the peak height that every diffraction pattern exhibits for each bin, relationships between the peak-position bins can be examined further. For example, Table II shows intensity-based correlations for a small subset of bins.

Over an entire scan, redundant phase composition information is present in diffraction data that contain no distortions due to preferred orientation. In the present example, although the major ceria-zirconia [111] peak envelope region was not included in the expectation maximization procedure, enough ceria-zirconia phase information remains in other parts of the diffraction pattern to derive a general regression relationship between binned (peak) intensities and the amount of cerium observed by X-ray fluorescence (Figure 4).

Table I. Sample of expectation maximization results (after including diffraction knowledge about the likely spread in peak position values). N is the number of patterns examined containing a peak at that average position.

Cluster (bin) #   2theta (°)   Std. Error   -99.00%   +99.00%    N
1                 18.115       0.008        18.094    18.136    21
2                 19.007       0.008        18.986    19.027    22
3                 19.354       0.021        19.299    19.410     3
4                 19.670       0.011        19.643    19.698    12
6                 20.427       0.010        20.402    20.452    15
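As a rough illustration of how expectation maximization can bin peak positions, the sketch below fits a one-dimensional Gaussian mixture to a pooled list of peak positions. The text does not state which mixture model or software settings were used, so the Gaussian form, the initialization at evenly spaced order statistics, the variance floor `min_sigma`, the number of bins `k`, and the toy peak list are all assumptions made for this example.

```python
from math import exp, pi, sqrt

def em_bin_peaks(positions, k, n_iter=50, min_sigma=0.005):
    """Fit a k-component 1-D Gaussian mixture to peak positions with EM.

    Returns (means, sigmas, assignments), where assignments pairs every
    sorted peak position with the index of its most probable bin.
    """
    xs = sorted(positions)
    n = len(xs)
    # Initialise the bin centres at evenly spaced order statistics.
    means = [xs[(2 * j + 1) * n // (2 * k)] for j in range(k)]
    sigmas = [0.05] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: probability that each peak belongs to each bin.
        resp = []
        for x in xs:
            dens = [w * exp(-0.5 * ((x - m) / s) ** 2) / (s * sqrt(2 * pi))
                    for w, m, s in zip(weights, means, sigmas)]
            total = sum(dens) or 1e-300  # guard against total underflow
            resp.append([d / total for d in dens])
        # M-step: re-estimate centre, spread, and weight of every bin.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj < 1e-12:
                continue  # empty bin: keep its previous parameters
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            sigmas[j] = max(sqrt(var), min_sigma)
            weights[j] = nj / n
    labels = [max(range(k), key=lambda j: r[j]) for r in resp]
    return means, sigmas, list(zip(xs, labels))

# Hypothetical peak positions (degrees 2-theta) pooled from several patterns.
peaks = [18.10, 18.11, 18.12, 18.13, 19.00, 19.01, 19.02, 19.66, 19.67, 19.68]
means, sigmas, assignments = em_bin_peaks(peaks, k=3)
for j, m in enumerate(means):
    count = sum(1 for _, label in assignments if label == j)
    print(f"bin {j + 1}: mean 2-theta = {m:.3f}, N = {count}")
```

Each fitted component plays the role of one peak-position bin; the per-bin mean and membership count correspond loosely to the 2theta and N columns of Table I.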

Table II. Example of intensity-based correlations between peak-position bins. The blue highlighted values in the 18.114 and 19.006 columns are correlations from peaks due to the same phase, cordierite from the substrate. The green highlighted correlation is new information not previously known; peaks at 19.67° and at 21.31° appear to be due to the same (but unknown) phase.

          18.114   19.006   19.354   19.670
18.114    1.000    0.886    0.408    0.182
19.006    0.886    1.000    0.416    0.292
19.354    0.408    0.416    1.000    0.416
19.670    0.182    0.292    0.416    1.000
20.427    0.661    0.669    0.301    0.183
21.307    0.229    0.336    0.401    0.815
21.739    0.815    0.856    0.407    0.292
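A correlation table of this kind can be computed directly from a matrix of normalized bin intensities, one row per diffraction pattern and one column per peak-position bin. The sketch below is illustrative only: the intensity values are invented (they are not the data behind Table II), the function names are hypothetical, and the cut-off of 0.8 for calling two bins "linked" is an assumption.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

def bin_correlations(intensity_rows):
    """Correlate bins (columns) across patterns (rows)."""
    cols = list(zip(*intensity_rows))
    return [[pearson(ci, cj) for cj in cols] for ci in cols]

def linked_bins(corr, labels, threshold=0.8):
    """List bin pairs whose intensities rise and fall together."""
    return [(labels[i], labels[j], round(corr[i][j], 3))
            for i in range(len(labels))
            for j in range(i + 1, len(labels))
            if corr[i][j] >= threshold]

# Hypothetical normalized intensities (0 to 1): five patterns, four bins.
bin_labels = ["18.114", "19.006", "19.670", "21.307"]
intensities = [
    [0.90, 0.80, 0.10, 0.15],
    [0.50, 0.45, 0.60, 0.55],
    [0.20, 0.25, 0.90, 0.80],
    [0.70, 0.60, 0.30, 0.35],
    [0.10, 0.15, 0.20, 0.10],
]
corr = bin_correlations(intensities)
linked = linked_bins(corr, bin_labels)
for a, b, r in linked:
    print(f"bins {a} and {b} correlate with r = {r}")
```

Bins that correlate strongly across many patterns are candidates for belonging to the same phase, which is how the cordierite pair and the unknown-phase pair are picked out in Table II.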

Expectation maximization may represent a different, and possibly improved, approach to phase analysis through intensity correlations, but this approach still results in far too many variables (too many peak positions). Another approach is to use Principal Component Analysis (PCA) to obtain composite factors that contain the largest amount of information [5]. PCA has been used in other diffraction studies [6] and has been widely used in the spectroscopy community [5b, 7]. PCA could be used to reduce the number of peak-position bins obtained by expectation maximization, but PCA on peak-position bins could be problematic because, across any given set of diffraction scans, many of the peak-position bins may have zero peak intensity. However, if PCA is used on raw normalized data, every 2θ step becomes a variable and the intensity is the value of the variable. Shown in Figure 5 are plots of the first two principal components obtained from raw diffraction data over a "low-angle" region and over a "mid-angle" region. Because PCA derives directions in parameter space with the highest variance (information content), the PCA approach is able to delineate differences between the diffraction data for three types of catalysts, and these differences are more substantive than the variable amount of cordierite substrate that is inadvertently included when scraping catalyst washcoat.
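The following sketch shows one way PCA on raw normalized data can be set up when every 2θ step is treated as a variable. It is a hedged illustration rather than the procedure used for Figure 5: normalization to the strongest step, extraction of components by power iteration with deflation, the function names, and the toy count data are all assumptions.

```python
from math import sqrt

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(pattern):
    """Scale one pattern so that its strongest step has intensity 1."""
    top = max(pattern)
    return [v / top for v in pattern]

def center_columns(rows):
    """Subtract the mean over patterns from every 2-theta step (column)."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]

def covariance(centered):
    """Sample covariance matrix between 2-theta steps."""
    n = len(centered)
    cols = list(zip(*centered))
    return [[dot(ci, cj) / (n - 1) for cj in cols] for ci in cols]

def power_iteration(matrix, n_iter=200):
    """Approximate the eigenvalue of largest magnitude and its unit eigenvector."""
    vec = [float(i + 1) for i in range(len(matrix))]
    for _ in range(n_iter):
        w = [dot(row, vec) for row in matrix]
        norm = sqrt(dot(w, w))
        vec = [x / norm for x in w]
    eigval = dot(vec, [dot(row, vec) for row in matrix])
    return eigval, vec

def first_two_components(raw_patterns):
    """Return (variance, loading) for PC1 and PC2 plus per-pattern scores."""
    centered = center_columns([normalize(p) for p in raw_patterns])
    cov = covariance(centered)
    l1, v1 = power_iteration(cov)
    # Deflate: remove PC1 so the next power iteration converges to PC2.
    size = len(cov)
    deflated = [[cov[i][j] - l1 * v1[i] * v1[j] for j in range(size)]
                for i in range(size)]
    l2, v2 = power_iteration(deflated)
    scores = [(dot(row, v1), dot(row, v2)) for row in centered]
    return (l1, v1), (l2, v2), scores

# Hypothetical raw counts at six 2-theta steps for two catalyst types.
raw = [
    [0, 10, 100, 10, 0, 0],    # type A
    [0, 40, 200, 40, 0, 10],   # type A
    [0, 0, 5, 5, 50, 5],       # type B
    [10, 0, 0, 40, 200, 40],   # type B
]
(l1, v1), (l2, v2), scores = first_two_components(raw)
for pc1, pc2 in scores:
    print(f"PC1 = {pc1:+.3f}, PC2 = {pc2:+.3f}")
```

In this toy case the two pattern types fall on opposite sides of zero along the first component, which is the kind of separation plotted in Figure 5. The overall sign of a component is arbitrary, so only relative positions are meaningful.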

Figure 4. Comparison of the regression-predicted Ce composition with the observed Ce from XRF.

INCORPORATING DESCRIPTORS FROM OTHER CHARACTERIZATION TECHNIQUES

Analytical techniques such as quantitative X-ray fluorescence generally yield descriptors (numerical values for composition) that are easily examined for relationships. Other spectroscopic characterization methods can yield descriptors using procedures similar to those described here for X-ray diffraction data. Continuous curve data (e.g., reactor or emissions data) represent another type of data for which obtaining useful descriptors can be difficult. Images, and microscopy images in particular, also pose challenges to using computational methods to examine relationships. For electron microprobe images of catalysts, we obtain descriptors by separating substrate regions from washcoat and then deriving elemental spatial correlations from the X-ray emission maps [8].

PUTTING IT ALL TOGETHER: RELATIONSHIPS BETWEEN DESCRIPTORS AND PERFORMANCE

Aggregate X-ray diffraction descriptors, such as the first few PCA factors, can be compared to performance groupings derived from, for example, tailpipe emissions to determine the usefulness of the descriptors. Figure 6a shows the effectiveness of the selected X-ray diffraction PCA factors in discriminating amongst emissions-based groupings. The larger the numbers on both axes, the better the discrimination. Some discrimination occurs using just the X-ray diffraction descriptors (Figure 6a), but the discrimination between performance groups is greatly enhanced if the most important PCA factors from electron microprobe image correlations and from XRF-based compositions are included with the X-ray diffraction factors (Figure 6b).

Figure 5. Plots illustrating the ability of the first PCA factor (horizontal axis) and second PCA factor (vertical axis), derived from raw X-ray diffraction data, to separate the catalysts into their three (known) types. The plot on the left shows the first two PCA factors derived from "low-angle" data; the plot on the right shows the first two PCA factors for "mid-angle" data.

Figure 6. Plots of each catalyst's discriminant score for the first two discriminant functions (canonical roots). Discriminant analysis is used to determine which variables can successfully differentiate between groups, in this case groups based on emission performance. Discriminant functions are obtained from weighted linear combinations of variables, with the weights derived to maximize differentiation between groups. PCA factors derived from raw XRD data can (marginally) discriminate between emission groups (6a), but the differentiation between emissions groups is notably more effective if the first few PCA factors obtained from XRF and EPMA data are also included (6b).
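The notion of a discriminant function as a weighted linear combination of descriptors can be made concrete with a small sketch. It is a simplification of the analysis behind Figure 6: it uses Fisher's two-group linear discriminant with only two descriptors, whereas the figure involves several emission groups and two canonical roots. The descriptor values, group names, and function names are invented for illustration.

```python
def mean_vec(rows):
    """Column-wise mean of a list of (x, y) descriptor pairs."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def scatter(rows, mean):
    """2x2 within-group scatter matrix about the group mean."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = [r[0] - mean[0], r[1] - mean[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s

def fisher_weights(group_a, group_b):
    """Weights w = Sw^-1 (mean_a - mean_b) that best separate two groups."""
    ma, mb = mean_vec(group_a), mean_vec(group_b)
    sa, sb = scatter(group_a, ma), scatter(group_b, mb)
    sw = [[sa[i][j] + sb[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    diff = [ma[0] - mb[0], ma[1] - mb[1]]
    return [inv[0][0] * diff[0] + inv[0][1] * diff[1],
            inv[1][0] * diff[0] + inv[1][1] * diff[1]]

def score(weights, row):
    """Discriminant score: weighted linear combination of the descriptors."""
    return weights[0] * row[0] + weights[1] * row[1]

# Hypothetical descriptors (PCA factor 1, PCA factor 2) for two emission groups.
low_emissions = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.6), (1.2, 2.4)]
high_emissions = [(3.0, 1.0), (3.5, 0.6), (2.8, 1.4), (3.3, 0.8)]

weights = fisher_weights(low_emissions, high_emissions)
scores_low = [score(weights, r) for r in low_emissions]
scores_high = [score(weights, r) for r in high_emissions]
print("weights:", [round(w, 3) for w in weights])
print("low-emission scores:", [round(s, 2) for s in scores_low])
print("high-emission scores:", [round(s, 2) for s in scores_high])
```

When the descriptors carry useful information, the scores of the two groups do not overlap, as in this toy case. Adding descriptors from further techniques, as in Figure 6b, corresponds to lengthening the descriptor vector before the weights are derived.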

CONCLUSIONS

SNAP is a powerful approach to comparing diffraction patterns and yields useful correlations between whole patterns. Expectation maximization is found to be useful for deriving clusters of peaks associated with a single average peak position (a peak bin); correlations between intensities in the bins can then be used to determine which peaks belong together and are due to a single phase. Principal Component Analysis of normalized step-scan data is found to yield useful descriptors that can subsequently be related to measures of materials performance. These approaches to deriving descriptors from powder diffraction data will facilitate finding new and improved inorganic materials, especially heterogeneous catalytic materials, using high-throughput discovery methods and data mining to target specific property criteria.

REFERENCES

[1] Kantardzic, M., Data Mining: Concepts, Models, Methods, and Algorithms, Wiley-Interscience, IEEE Press: New Jersey (2003), pp. 19-38.
[2] Gilmore, C. J., Barr, G., Paisley, J., "High Throughput Powder Diffraction I: Full-profile Qualitative and Quantitative Powder Diffraction Pattern Analysis", J. Appl. Crystall. (submitted).
[3] StatSoft, Inc. (2003). STATISTICA (data analysis software system), version 6. Tulsa, Oklahoma: www.statsoft.com.
[4] Mitchell, T. M., Machine Learning, McGraw-Hill: Boston (1997), pp. 191-195.
[5] (a) Reference [1], pp. 48-51; (b) Jurs, P. C., "Chemometric and Multivariate Analysis in Analytical Chemistry" in Reviews in Computational Chemistry, Lipkowitz and Boyd, eds., VCH Publishers: New York (1990), pp. 169-212; (c) Jambu, M., Exploratory and Multivariate Data Analysis, Academic Press: Boston (1991), pp. 129-167.
[6] (a) Kato, M., Fujii, S., Ui, T., Asada, E., Powder Diffract. 5(1), 33-35 (1990); (b) Klar, P. J., Chen, L., Rentschler, T., J. Mater. Chem. 6(11), 1815-1821 (1996); (c) Artursson, T., et al., Applied Spectr. 54(8), 1222-1230 (2000); (d) Hida, M., Sato, H., Sugawara, H., Mitsui, T., Forensic Science International 115, 129-134 (2001).
[7] (a) Aries, R., Lidiard, D., Spragg, R., Spectroscopy 5(3), 41-44 (1990); (b) Workman, J. J., et al., "Review of Chemometrics Applied to Spectroscopy: 1985-95, Part I" in Applied Spectr. Reviews 31(1&2), 73-124 (1996).
[8] Chen, A. E. and Lowe-Ma, C. K., Microscopy and Microanalysis 7, Suppl. 2, 1116-7 (2001).
