Linear Predictive Coding and Wavelet Decomposition for Robust Microarray Data Clustering

Robert S. H. Istepanian, Senior MIEEE, Ala Sungoor, and Jean-Christophe Nebel

Abstract: Microarrays are powerful tools for simultaneous monitoring of the expression levels of a large number of genes. Their analysis is usually achieved by using clustering techniques. Genomic signal processing is a new area of research that combines genomics with digital signal processing methodologies. In this paper, we present a comparative analysis of two genomic signal processing methods, namely Linear Predictive Coding and Discrete Wavelet Decomposition coefficients, for robust microarray data clustering. Vector quantization is applied to the resultant coefficients to provide the clustering of the data samples. Both techniques were validated on a standard data set. Comparative analyses of the results indicate that these methods provide improved clustering accuracy compared to some conventional clustering techniques. Moreover, these classifiers do not require any prior training procedures.

I. INTRODUCTION

It is well known that microarray technologies are among the most powerful tools for extracting and interpreting simultaneous gene activities and relevant genomic information. In particular, analysis of microarray genetic data allows a better understanding of genetically based diseases such as diabetes, cardiovascular diseases and some forms of cancer. Although extensive work has been reported in the literature on microarray imaging with biomedical signal processing analysis techniques [1-2], enhanced gene classification and clustering is still an active field of research. In general, gene expression analysis is based on statistical methods that are capable of detecting relevant genomic patterns that reflect individual genes in different regulatory states. The importance of this research area is reflected in the large body of literature published in recent years [3].

In recent years, various computational and genomic signal processing methodologies have been introduced for different microarray application areas to support different clustering studies. These were used for detection, prediction, classification and statistical modeling [4]. Other classification methods based on spectral component analysis were also investigated. An autoregressive technique was used in [5] to evaluate the potential regulatory relationship between genes with dominant spectral components. Also, in the work reported in [6], the expression profiles were decomposed into spectral components to correlate the profiles and obtain high accuracy expression values. However, to date no work has been reported on the comparative performance analysis of advanced signal processing methods such as Linear Predictive Coding (LPC) and discrete wavelet decomposition (DWD) for microarray data sample clustering.

In this paper we present a comparative analysis of two genomic signal processing methods, LPC and DWD coefficients, with vector quantization applied for robust microarray data clustering. The paper is outlined as follows: in Section II we present an outline of the LPC and DWD based clustering methods. In Section III we describe the application and validation of the methods on a sample of microarray data sets and discuss the results obtained. Finally, the paper concludes with ongoing and future work in this area.

Manuscript received April 2nd, 2007. The authors are with Mobile Information and Network Technologies Research Centre (MINT), Kingston University, London, KT1 2EE (e-mail: [email protected], [email protected] and [email protected]).

II. LPC AND DWD BASED CLUSTERING

The LPC approach is based on the estimation of spectral distortion measures that provide relative measures of gene expression changes. Therefore, enhanced classification of genes or samples into classes can be performed according to their distortion values. The DWD clustering approach is likewise a well known digital signal processing method, based on the estimation of the spectral distortion measures of relevant wavelet coefficients. Since wavelets are particularly powerful tools for analyzing large and noisy datasets in a global fashion, they are well suited to microarray data analysis [7]. Recent work on the application of wavelets to microarray data analysis used the wavelet transform to identify the features of gene profiles and explore their relationship with phenotypic outcome [8]. The choice of a wavelet family for a given task depends on the computational and memory requirements. We chose Daubechies orthogonal wavelets, which have been proven to provide good spatial data mining and accurate data representation [9].

The processing steps of the proposed clustering methods are as follows. Microarray data is first preprocessed and filtered using statistical estimators. The gene expression data is then transformed into distortion measures. The LPC and DWD algorithms are used to generate the feature vectors at different scales. For the LPC algorithm, a predictive model is calculated to produce the relevant microarray vector coefficients. Subsequently, the LPC coefficients are converted into Line Spectral Frequency (LSF) coefficients to increase their spectral robustness. The DWD method decomposes the original feature space of microarray data samples into two sets of coefficients: approximation subband coefficients cA(n) that define the cluster content, and detail subband coefficients cD(n) that describe the cluster boundaries. Finally, Vector Quantization (VQ) is applied to both methods to detect the clusters from the resultant coefficients.

A. LPC Coefficient prediction for microarray data

The expression sampling rate of the microarray data set format produces a limited range of frequencies. In addition, there are correlations both between genes and between samples. LPC is a coding method well suited to data with such properties, which are common in speech and imaging data. The basic idea behind LPC analysis is that each expression sample is approximated as a linear combination of past samples [10]. Equation (1) defines the LPC principle, where the value of the present output, $s(n)$, can be predicted approximately by a linear combination of $p$ past samples; $p$ is called the order of the LPC.

$$ s(n) = \sum_{j=1}^{p} a_j \, s(n-j) \tag{1} $$
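As an illustration of Equation (1), the following minimal Python sketch predicts each sample of a one-dimensional profile from its $p$ predecessors. The function name `lpc_predict`, the toy profile and the coefficient values are assumptions made for this example only; they are not taken from the paper.

```python
import numpy as np

def lpc_predict(s, a):
    """Predict s[n] for n >= p as sum_{j=1..p} a[j-1] * s[n-j] (Eq. 1)."""
    s = np.asarray(s, dtype=float)
    a = np.asarray(a, dtype=float)
    p = len(a)
    pred = np.zeros(len(s) - p)
    for j in range(1, p + 1):
        # s[p-j : len(s)-j] holds s(n-j) for n = p .. len(s)-1
        pred += a[j - 1] * s[p - j : len(s) - j]
    return pred

# Hypothetical toy profile that obeys s(n) = 2 s(n-1) - s(n-2) exactly
profile = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
coeffs = np.array([2.0, -1.0])           # order p = 2
predicted = lpc_predict(profile, coeffs)
residual = profile[2:] - predicted        # prediction error e(n)
print(predicted, residual)
```

For real expression profiles the residual is not zero, and its energy is the quantity that the coefficient estimation described next seeks to minimize.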

The goal of the LPC analysis is to find the best prediction coefficients $a_j$ so that the predicted sample is a good approximation of the original sample. This optimization process is performed by minimizing the energy of the prediction error. This involves choosing $a_j$ to minimize the mean energy, $E$, in the error signal over a frame or window of the data set:

$$ E = \sum_{n=-\infty}^{\infty} \left[ s(n) - \sum_{j=1}^{p} a_j \, s(n-j) \right]^2 \tag{2} $$

The values of $a_j$ that minimize $E$ are found by setting all derivatives $\partial E / \partial a_j$ equal to zero. This is expressed by:

$$ \sum_{i=1}^{p} a_i \, E(s_{n-i} s_{n-j}) = E(s_n s_{n-j}) \tag{3} $$
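One common way to solve Equation (3) in practice is the autocorrelation method. The sketch below assumes a stationary signal, so that $E(s_{n-i} s_{n-j})$ depends only on $|i-j|$ and can be replaced by a biased sample autocorrelation. The function name `lpc_autocorrelation` and the synthetic test signal are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def lpc_autocorrelation(s, p):
    """Estimate p LPC coefficients by solving the normal equations (Eq. 3)."""
    s = np.asarray(s, dtype=float)
    n = len(s)
    # Biased autocorrelation estimates r[k] ~ E(s_n s_{n-k}) for k = 0..p
    r = np.array([np.dot(s[k:], s[:n - k]) / n for k in range(p + 1)])
    # Toeplitz matrix with entries R[i, j] = r[|i - j|]
    lags = np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    R = r[lags]
    # Solve sum_i a_i r[|i - j|] = r[j] for j = 1..p
    return np.linalg.solve(R, r[1:])

# Synthetic (hypothetical) signal: s(n) = 0.6 s(n-1) - 0.3 s(n-2) + noise
rng = np.random.default_rng(1)
signal = np.zeros(5000)
noise = rng.standard_normal(5000)
for t in range(2, 5000):
    signal[t] = 0.6 * signal[t - 1] - 0.3 * signal[t - 2] + noise[t]
print(lpc_autocorrelation(signal, 2))
```

With enough samples the estimated coefficients should approach the generating values 0.6 and -0.3, up to sampling noise.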

To solve equation (3), the expectation $E(s_{n-i} s_{n-j})$ needs to be estimated for $i, j \in \{1, \dots, p\}$. The autocorrelation and covariance methods are two of the most common and efficient linear predictive spectral estimation techniques. Their main difference lies in the placement of the analysis window. Since the covariance method windows the error signal instead of the original signal, it has the higher accuracy. The energy $E$ of the windowed error signal is:

$$ E = \sum_{n=-\infty}^{\infty} e^2(n)\, w(n) = \sum_{n=-\infty}^{\infty} \left[ s(n) - \sum_{k=1}^{p} a_k \, s(n-k) \right]^2 w(n) \tag{4} $$

where the error is minimized over a finite interval of size $N$, as defined by the rectangular window function $w(n)$. After reducing equation (4) and differentiating it with respect to $a_k$, we obtain:

$$ \sum_{n=0}^{N-1} s(n-i)\, s(n) = \sum_{k=1}^{p} a_k \sum_{n=0}^{N-1} s(n-k)\, s(n-i) \tag{5} $$
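The covariance-method equations in (5) form a small linear system in the coefficients $a_k$. The sketch below is one possible way to set it up with NumPy. It assumes that the analysis window starts at sample $p$, so that every delayed sample $s(n-k)$ exists inside the frame; the function name `lpc_covariance` and the test sequence are hypothetical.

```python
import numpy as np

def lpc_covariance(s, p):
    """Solve the covariance-method normal equations (Eq. 5) for p coefficients."""
    s = np.asarray(s, dtype=float)
    n = len(s)
    # Column k-1 holds the delayed samples s(n-k) for n = p .. len(s)-1
    past = np.column_stack([s[p - k : n - k] for k in range(1, p + 1)])
    target = s[p:]
    phi = past.T @ past      # phi[i-1, k-1] = sum_n s(n-i) s(n-k)
    rhs = past.T @ target    # rhs[i-1]      = sum_n s(n-i) s(n)
    return np.linalg.solve(phi, rhs)

# Noise-free (hypothetical) sequence obeying s(n) = 1.5 s(n-1) - 0.7 s(n-2)
seq = [1.0, 0.5]
for _ in range(28):
    seq.append(1.5 * seq[-1] - 0.7 * seq[-2])
print(lpc_covariance(seq, 2))
```

Because the error is windowed rather than the signal, a sequence that follows an exact linear recurrence is recovered without bias, which is consistent with the accuracy argument made above.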

However, direct quantization of the LPC coefficients $a_j$ is not advisable. The issue is that small changes due to the quantization error could make the poles of the internal digital filter unstable and produce large spectral errors. Thus, other superior parametric representations have been formulated to replace the LPC coefficients $a_j$ [11]. We chose the Line Spectral Frequency (LSF) representation because it has been shown to be particularly efficient for scalar quantization of LPC information [12]: it does not distort the spectrum, varies smoothly in time and offers better coding in relation to spectral peaks. These LSF coefficients are subsequently used to determine the distortion between samples.

B. DWD for microarray clustering

In general, the DWD is defined as the decomposition of a signal onto a set of basis functions obtained by dilations, contractions, scaling, and shifting. It has two main advantages: it provides resolution optimality in both the time and frequency domains, and it does not require a stationary signal [13]. It is based on two major sub-operations: scaling, which captures the information at different frequencies by successive low pass/high pass filtering and downsampling, and translation, which captures the information at different locations. It decomposes data into several groups of coefficients which contain information regarding the sampled signal at different scales. Coarse scale coefficients capture gross and global features of the signal, while fine scale coefficients contain local details.

Microarray data represent the activity of genes across either time or samples, and display a specific range of frequencies with specific correlations both between genes and between samples. Therefore, wavelet decomposition is a method that can be applied to convert spatial data into the frequency domain. The method has a high degree of spatial localization, but the degree of concentration depends on the frequency content of the wavelet function [7]. Since high frequency wavelets are narrower than lower frequency ones, wavelets can be seen as a set of adaptive basis functions.

The DWD algorithm starts by recursively applying two convolution functions, a low pass and a high pass filter known as quadrature mirror filters [14], to the given data signal $S$. Each function produces an output stream that is half the length of the original input.
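A single filtering and downsampling step of this kind can be sketched as follows. The example assumes the 4-tap Daubechies filter pair and periodic extension at the signal boundary; both choices, as well as the function name `dwt_step` and the sample values, are assumptions for illustration, since the filter length and boundary handling are not specified here.

```python
import numpy as np

SQRT3 = np.sqrt(3.0)
# 4-tap Daubechies low pass filter H and its quadrature mirror high pass filter G
H = np.array([1 + SQRT3, 3 + SQRT3, 3 - SQRT3, 1 - SQRT3]) / (4 * np.sqrt(2.0))
G = np.array([H[3], -H[2], H[1], -H[0]])

def dwt_step(s):
    """One decomposition level: returns (cA, cD), each half the input length."""
    s = np.asarray(s, dtype=float)
    n = len(s)
    if n < 4 or n % 2:
        raise ValueError("signal length must be even and at least 4")
    # Row k holds the indices 2k, 2k+1, 2k+2, 2k+3, wrapped periodically
    idx = (2 * np.arange(n // 2)[:, None] + np.arange(4)[None, :]) % n
    windows = s[idx]
    cA = windows @ H   # low pass output: approximation coefficients
    cD = windows @ G   # high pass output: detail coefficients
    return cA, cD

# Hypothetical expression profile with 8 samples
sample = np.array([4.0, 1.0, -2.0, 7.0, 3.0, 3.0, 0.5, -1.0])
cA, cD = dwt_step(sample)
print(cA, cD)
```

Applying `dwt_step` again to `cA` gives the next, coarser level, which is how the recursion described above proceeds.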
The iterative process ends once a resolution threshold is reached. As a result, two sets of coefficients are calculated: the approximation coefficients cA(n) are generated by the low pass filter and the detail coefficients cD(n) are produced by the high pass filter. The mathematical formulation of the problem can be summarized as follows. Given a microarray dataset sample $s(t)$, it can be decomposed in the wavelet space as

$$ W_{i,j}(f) = \int s(t)\, \psi_{i,j}(t)\, dt \tag{6} $$

where the Daubechies wavelet basis functions $\psi_{i,j}(t)$ are defined by the parameters $i$ and $j$. They are derived from a function $\psi(t)$, called the mother wavelet, and are defined as

$$ \psi_{i,j}(t) = 2^{m/2}\, \psi(2^{m} t - n) \tag{7} $$

Using Eq. (7) we obtain an orthonormal wavelet basis. Here the indices $i$ and $j$ of Eq. (6) play the roles of the scale parameter $m$ and the shift parameter $n$. Parameter $m$ stretches the mother wavelet, leading to either a narrower or a broader new function. Parameter $n$ translates the mother wavelet along the $t$ axis. Therefore, all the basis functions $\psi_{i,j}(t)$ have the same profile, but are dilated and translated according to parameters $m$ and $n$ respectively. The inverse discrete wavelet transform is given by the reconstruction formula:

$$ s(t) = \sum_{i} \sum_{j} W_{i,j}(f)\, \psi_{i,j}(t) \tag{8} $$
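The reconstruction formula (8) can be checked numerically for one decomposition level. The sketch below assumes the 4-tap Daubechies filters with periodic extension, for which the discrete transform is orthonormal, so the inverse is the transpose of the forward step: each coefficient is multiplied by its basis vector and the contributions are summed. The names `dwt_step` and `idwt_step` are hypothetical.

```python
import numpy as np

SQRT3 = np.sqrt(3.0)
# 4-tap Daubechies low pass filter H and its quadrature mirror high pass filter G
H = np.array([1 + SQRT3, 3 + SQRT3, 3 - SQRT3, 1 - SQRT3]) / (4 * np.sqrt(2.0))
G = np.array([H[3], -H[2], H[1], -H[0]])

def _window_indices(n):
    # Row k holds the indices 2k .. 2k+3 of a length-n signal, wrapped periodically
    return (2 * np.arange(n // 2)[:, None] + np.arange(4)[None, :]) % n

def dwt_step(s):
    """Forward step: inner products of the signal with the basis vectors (Eq. 6)."""
    s = np.asarray(s, dtype=float)
    windows = s[_window_indices(len(s))]
    return windows @ H, windows @ G

def idwt_step(cA, cD):
    """Inverse step: sum of coefficients times basis vectors (Eq. 8)."""
    cA = np.asarray(cA, dtype=float)
    cD = np.asarray(cD, dtype=float)
    n = 2 * len(cA)
    s = np.zeros(n)
    contrib = cA[:, None] * H[None, :] + cD[:, None] * G[None, :]
    # Unbuffered scatter-add, because neighbouring windows overlap
    np.add.at(s, _window_indices(n), contrib)
    return s

rng = np.random.default_rng(0)
original = rng.standard_normal(16)   # hypothetical signal, even length >= 4
restored = idwt_step(*dwt_step(original))
print(np.max(np.abs(original - restored)))
```

The printed maximum difference should be at the level of floating point rounding, reflecting the orthonormality of the basis.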

A coarser approximation of the microarray samples $S_0$ can be calculated by iterative convolution with the low pass filter $h_j$ and downsampling of the signal by two. Therefore, a set of discrete approximations $S_j$, 1
