
Intelligent Bucketing for Metabonomics - Part 1

Brent Lefebvre and Sergey Golotvin
Advanced Chemistry Development
Toronto, ON, Canada

Laura Schoenbachler and Richard Beger
National Center for Toxicological Research (NCTR)
Jefferson, AR, USA

INTRODUCTION

DISCUSSION

A typical principal components analysis (PCA) of a metabonomics data set starts with the important step of splitting the spectra into small integral units called bins or buckets. Traditionally, the divisions between these regions were placed arbitrarily, based solely on the bucket width the user had chosen. This method has worked reasonably well, but it leaves room for improvement.

When preparing data for a principal components analysis in a metabonomics environment, the only widely available option has been standard bucketing. This technique serves its purpose, but the placement of the bucket divisions is purely arbitrary. Intelligent bucketing was inspired by the search for a better way of slicing up a spectrum for statistical analysis. It uses the information in the spectrum to decide where bucket divisions should be placed. This is done by overlaying the spectra in the ACD/1D NMR Processor window. When the spectra are assembled into a series and treated in “Group Mode”, the local minima common to all the spectra are identified and bucket divisions are placed there. The algorithm can even take a shifting peak into account and ensure that it remains in a single bucket across all the spectra in the data set.
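The idea of placing divisions at local minima of the overlaid series can be sketched as follows. This is a minimal illustration, not the ACD/1D NMR Processor algorithm, whose details are not given here. It assumes the spectra share one axis, that the overlay is taken as the pointwise maximum across the series, and that each bucket may be between 50% and 150% of the nominal width. The names `bucket_edges`, `integrate`, and `peak` are hypothetical.

```python
def bucket_edges(spectra, width_pts, looseness=0.5):
    """Return bucket edge indices for a series of spectra on a shared axis.

    Assumption: each edge is placed at the lowest point of the overlaid
    series within the window allowed by the looseness setting.
    """
    n = len(spectra[0])
    # Overlay the series: pointwise maximum across all spectra.
    envelope = [max(s[i] for s in spectra) for i in range(n)]
    min_w = max(1, round(width_pts * (1 - looseness)))
    max_w = round(width_pts * (1 + looseness))
    edges = [0]
    while n - edges[-1] > max_w:
        prev = edges[-1]
        nominal = prev + width_pts
        window = range(prev + min_w, prev + max_w + 1)
        # Lowest envelope point wins; ties go to the edge nearest the nominal width.
        edges.append(min(window, key=lambda i: (envelope[i], abs(i - nominal))))
    edges.append(n)  # the final bucket may be narrower than min_w
    return edges


def integrate(spectrum, edges):
    """Sum the points of one spectrum inside each bucket."""
    return [sum(spectrum[a:b]) for a, b in zip(edges, edges[1:])]


def peak(n, center):
    """Synthetic five-point peak on an otherwise flat baseline."""
    shape = {-2: 1, -1: 3, 0: 5, 1: 3, 2: 1}
    return [shape.get(i - center, 0) for i in range(n)]


# Two spectra in which the same peak has shifted by one point.
series = [peak(40, 8), peak(40, 9)]
edges = bucket_edges(series, width_pts=8)
print(edges)
print([integrate(s, edges) for s in series])
```

Because the edges are chosen from the overlay rather than from any single spectrum, the shifted peak falls inside the same bucket in both synthetic spectra.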

Intelligent bucketing is an algorithm designed to make informed decisions about where a bucket division should be placed. A common shortcoming of traditional bucketing occurs when the edge of a bucket falls in the middle of a peak, so that the peak contributes to two integral regions. PCA can partly correct this error by lumping those two regions together into a single principal component vector. An inaccurate result is obtained, however, when the peak shifts due to minor pH or salt variations between samples, because the contribution of the same peak is then split asymmetrically, and differently in each sample, across more than one integral region. Intelligent bucketing chooses integral divisions based on local minima and therefore avoids this common problem.
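The effect of a small shift on fixed-width buckets can be illustrated with a synthetic peak. The sketch below is an assumption-laden example, not data from this study: it uses a Gaussian line shape, an assumed digital resolution of 0.001 ppm per point (so a 0.04 ppm bucket is 40 points), and a shift of 5 points. The helper names are hypothetical.

```python
import math


def gaussian_peak(n_points, center, sigma):
    """Synthetic Gaussian peak sampled at integer point indices."""
    return [math.exp(-0.5 * ((i - center) / sigma) ** 2) for i in range(n_points)]


def area_fractions(spectrum, width_pts):
    """Fraction of the total area that falls in each fixed-width bucket."""
    areas = [sum(spectrum[a:a + width_pts])
             for a in range(0, len(spectrum), width_pts)]
    total = sum(areas)
    return [a / total for a in areas]


WIDTH = 40  # 0.04 ppm at the assumed 0.001 ppm per point

# Peak centred exactly on the edge between the first two buckets.
reference = area_fractions(gaussian_peak(120, center=39.5, sigma=4), WIDTH)
# The same peak after a shift of 5 points (0.005 ppm at the assumed resolution).
shifted = area_fractions(gaussian_peak(120, center=44.5, sigma=4), WIDTH)

print([round(f, 3) for f in reference])
print([round(f, 3) for f in shifted])
```

In the reference spectrum the two buckets share the peak about equally, while after the shift most of the area moves into the second bucket. Two samples with the same amount of metabolite would therefore give different bucket integrals, which is the asymmetry described above.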

Peter Price; Judit Megyesi, MD; Robert Safirstein, MD
University of Arkansas for Medical Sciences (UAMS)
Central Arkansas Veterans Administration Healthcare System (CAVAHS)
Little Rock, AR, USA

The important regions highlighted in red in Figure 3 were distributed across two buckets in traditional bucketing, but were more accurately characterized with intelligent bucketing. The spectral data that correspond to these integral regions are shown in Figure 4. The two regions around 7.84 ppm appear as positive loadings in PC1 with traditional bucketing, but as a single region in PC1 with intelligent bucketing. Figure 4 makes it evident that the hippurate peak is the important feature: it is split between two buckets in traditional bucketing but is placed in one bucket with intelligent bucketing.

RESULTS

The clustering of data seen in PCA is useful because it allows a user to associate a group of data points in an unbiased way, and then determine what distinguishes the clusters from each other. The tighter a cluster is, the more similar its samples are expected to be. A bucketing technique that better resolves the samples into their components would therefore most likely produce better clustering. As can be seen in Figure 2, the clustering obtained for the first three principal components is somewhat better with the intelligent buckets.


Figure 1 - A group of urine spectra in ACD/1D NMR Processor. Intelligent Bucketing must be done on the entire series at one time to ensure the buckets are consistent across the data set.

Figure 5 - The integral regions around 7.6 ppm for traditional bucketing (top) and intelligent bucketing (bottom). The nearly empty region is properly given its own integral region by intelligent bucketing.

Figure 3 - The loadings plots for PC1 in the region between 7 and 8 ppm for both data sets. The red regions outline the advantage of intelligent bucketing. In the 7.8 ppm region, traditional bucketing splits an important peak into two bins, whereas the peak appears in only one bin with intelligent bucketing. In the 7.6 ppm region, an intelligent bucket identifies a negative correlation that is missed by traditional bucketing.

CONCLUSION

As shown here, intelligent bucketing can produce better clustering of data points. Potentially more important, however, is its ability to produce buckets that better fit the data set. More peaks will then be in their own bins and will be able to influence the principal components independently, which could yield more useful principal components for a data set. A future study is necessary to demonstrate this result.

METHODS

The data set consisted of 13 samples from 4 rats over 3 days, dosed with the renal toxin cisplatin. Urine (75 to 250 mL) was collected over ice with 1 mL of 0.1% sodium azide in the tube. The urine volume was increased to 400 mL by adding the appropriate amount of H2O, and 200 mL of 200 mM sodium phosphate (pH 7.4) was added. The samples were centrifuged, and TMSP was added to the 0.5 mL of sample for a final concentration of 0.1 mM. NMR measurements were then made on a Bruker Avance 600 MHz NMR spectrometer. No cisplatin or cisplatin metabolites were observed in any of the spectra.

The NMR spectra were reduced into spectral bins in two ways. The first was traditional bucketing, which resulted in bins 0.04 ppm wide. The second was intelligent bucketing, which started from 0.04 ppm widths and allowed each width to be adjusted by up to 50% (what we will call “looseness”), so bucket sizes could range from 0.02 to 0.06 ppm. Bins containing the water and urea resonances (4.50 - 6.40 ppm), and bins below 0.10 ppm or above 10.0 ppm, were removed. The area under each bin was integrated, and the integrated bins were normalized so that their total was one prior to PCA using Statistica software version 6.1 (Statsoft, Tulsa, OK). PC extraction was done by correlation of the bins in this study.
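The data reduction steps described above (removing excluded regions, normalizing each spectrum to a total of one, and extracting PCs from the correlation of the bins) can be sketched in plain Python. This is only an illustration of the arithmetic, not a reproduction of the Statistica procedure. The bucket integrals below are made-up numbers, the function names are hypothetical, and power iteration is used here simply as a dependency-free way to obtain the first principal component.

```python
import math

EXCLUDED = [(4.50, 6.40)]        # water and urea resonances
PPM_MIN, PPM_MAX = 0.10, 10.0    # bins outside this range are removed


def keep_bucket(lo, hi):
    """True if the bucket spanning lo..hi ppm is retained."""
    if lo < PPM_MIN or hi > PPM_MAX:
        return False
    return all(hi <= a or lo >= b for a, b in EXCLUDED)


def normalize(row):
    """Scale one spectrum's bucket integrals so that they sum to one."""
    total = sum(row)
    return [v / total for v in row]


def correlation_matrix(rows):
    """Correlation between buckets (columns) across samples (rows).

    Assumes every bucket varies across the samples (no zero variance).
    """
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    sds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
           for j in range(p)]
    z = [[(r[j] - means[j]) / sds[j] for j in range(p)] for r in rows]
    return [[sum(zr[j] * zr[k] for zr in z) / (n - 1) for k in range(p)]
            for j in range(p)]


def first_pc(corr, iterations=500):
    """Unit-length PC1 direction and its eigenvalue, by power iteration."""
    p = len(corr)
    v = [float(j + 1) for j in range(p)]  # arbitrary asymmetric start vector
    for _ in range(iterations):
        w = [sum(corr[j][k] * v[k] for k in range(p)) for j in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    cv = [sum(corr[j][k] * v[k] for k in range(p)) for j in range(p)]
    return v, sum(a * b for a, b in zip(v, cv))


# Made-up bucket integrals: 4 samples (rows) by 3 retained buckets (columns).
raw = [[2, 1, 1], [4, 1, 3], [1, 2, 1], [1, 1, 2]]
normalized = [normalize(r) for r in raw]
corr = correlation_matrix(normalized)
pc1, eigenvalue = first_pc(corr)
print([round(x, 3) for x in pc1], round(eigenvalue, 3))
```

Using the correlation matrix rather than the covariance matrix gives every bucket unit variance, so small peaks can influence the principal components as much as large ones.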

The region around 7.6 ppm shows the opposite effect. The intelligent bucketing loadings plot for PC1 shows a negative value that does not appear at all in the traditional bucketing loadings plot. Figure 5 explains this discrepancy: a nearly empty region that is split between two buckets in traditional bucketing is put into its own bucket by intelligent bucketing.


90 Adelaide St. W., Suite 600, Toronto, Canada M5H 3V9
Tel: (416) 368-3435 Fax: (416) 368-5596 Toll Free: 1-800-304-3988
Email: [email protected]
www.acdlabs.com

Figure 2 - 3D plots of the first three principal components (PCs). (A) corresponds to traditional bucketing and (B) was produced with intelligent buckets. The clustering is noticeably better with intelligent bucketing.

As can be expected, the clustering effect seen in the PCA plot can be traced back to the loadings plot. Figure 3 shows two loadings plots, one for each bucketing method. The loadings in red are highlighted as examples of important regions.

Figure 4 - The spectral region at 7.8 ppm that is highlighted in red in Figure 3. Notice how in the traditionally bucketed spectra (top), the hippurate peak is split into 2 buckets. The intelligent bucketing (bottom) recognizes the peak and puts it in one bucket.