Biosignal Pattern Recognition and Interpretation Systems
Part 2 of 4: Methods for Feature Extraction and Selection

E.J. Ciaccio, S.M. Dunn, M. Akay
Department of Biomedical Engineering, Rutgers University

For pattern processing problems to be tractable, patterns must be converted to features, which are condensed representations of patterns, ideally containing only salient information. Feature extraction methods are subdivided into: 1) statistical characteristics and 2) syntactic descriptions. There are numerous ways to represent patterns as a grouping of features. Feature selection techniques provide a means for choosing the features that are best for classification, based on various criteria. If a given criterion minimizes some measure of representation error, the feature extraction technique is considered optimal.

Table 1. Common Techniques Used in Pattern Analysis

Features (measurements): amplitude, bias, duration, phase, energy, linear predictive coefficients, parametric models, moments, singular values, Karhunen-Loeve eigenvalues
Features (transforms): polynomials, harmonic analysis (Fourier, cosine, wavelets), Haar, Karhunen-Loeve expansion
Features (structure): peaks, slopes, lines, corners
Classifiers: discriminant analysis, Fisher's linear discriminant, linear discriminant functions, Bayes linear classifier, maximum likelihood, density functions (Parzen estimate, k-nearest neighbor, histograms), parsing, production rules
Clustering methods: Euclidean distance, Mahalanobis distance, isodata algorithm
Feature selection: Chernoff bound, Bhattacharyya distance, divergence, exhaustive search, dynamic programming


The choice of techniques appropriate for a given pattern analysis task is rarely obvious. At each level (i.e., feature extraction, feature selection, clustering, and classification), many techniques exist. Moreover, there is rarely a correspondence between techniques at each level, making the choice of the correct technique combination for all levels tedious. Often, in the literature, comparisons are made by using several methods at a given level. Table 1 lists the more common techniques used at each level of pattern analysis.

Feature Extraction Methods

Features are used to represent patterns with minimal loss of important information. They can be divided into four categories:

1. Nontransformed structural characteristics such as moments, power, phase information, and model parameters.
2. Transformed structural characteristics such as frequency spectra and subspace mapping methods.
3. Structural descriptions such as formal languages and their grammars, parsing techniques, and string matching techniques.
4. Graph descriptors such as attributed graphs, relational graphs, and semantic networks.

The feature vector, which comprises the set of all features used to describe a pattern, is a reduced-dimensional representation of that pattern. In effect, the set of all features that could be used to describe a given pattern (a large and in fact infinite set, if infinitesimal changes in some parameter are allowed to distinguish different features) is limited to those actually stated in the feature vector. One purpose of the dimensionality reduction is to meet engineering constraints on software and hardware complexity and computing cost, and (for purposes of data transmission) the desirability of compressing pattern information. In addition, classification is often more accurate (the criterion for accuracy may be, for example, the percentage of input patterns classified correctly) when the pattern is simplified through representation by important features or properties only [1].


There is, however, a tradeoff between too few and too many features (termed the peaking phenomenon [2]). The choice of features used in classification is governed by: 1) hardware and software constraints, 2) the peaking phenomenon, and 3) the permissible information loss. Some feature extraction methods used in biomedical signal pattern recognition are now given.

Nontransformed Signal Characteristics

Moments

Moments are defined as [3]:

$$m_p = E[x^p] = \int_{-\infty}^{\infty} x^p\, p(x)\, dx \qquad (6)$$

where p(x) represents the probability density function (Fig. 1), x is a random variable, and p is the order of the moment. Central moments are defined as:

$$\mu_p = E\big[(x-\bar{x})^p\big] = \int_{-\infty}^{\infty} (x-\bar{x})^p\, p(x)\, dx \qquad (7)$$

where $\bar{x}$ is the mean. The moments most commonly used in pattern recognition are the first moment (the signal average), the second moment (the signal power), and the second central moment (the signal variance).
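As a minimal sketch (not from the original article; the signal x and sampling parameters below are illustrative), these moment features can be estimated directly from the samples:

```python
import numpy as np

# Hypothetical sampled biosignal (e.g., one channel); replace with real data.
rng = np.random.default_rng(0)
x = np.sin(2 * np.pi * 5 * np.arange(0, 1, 1 / 500)) + 0.1 * rng.standard_normal(500)

mean = np.mean(x)                     # first moment: signal average
power = np.mean(x ** 2)               # second moment: signal power
variance = np.mean((x - mean) ** 2)   # second central moment: signal variance

features = np.array([mean, power, variance])
print(features)
```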

Parametric Modeling

Parametric modeling is a means of representing a signal as a weighted combination of previous samples $x_{n-k}$ of the system [4]:

$$\hat{x}_n = -\sum_{k=1}^{p} a_k\, x_{n-k} \qquad (8)$$

The pole-zero model, also known as the autoregressive moving average (ARMA) model, gives the output $x_n$ in terms of the input $u_n$:

$$x_n = -\sum_{k=1}^{p} a_k\, x_{n-k} + \sum_{l=0}^{q} b_l\, u_{n-l} \qquad (9)$$

with parameters $a_k$ and $b_l$. In the Z domain the transfer function is:

$$H(z) = \frac{\sum_{l=0}^{q} b_l\, z^{-l}}{1 + \sum_{k=1}^{p} a_k\, z^{-k}} \qquad (10)$$

The model is called pole-zero because it contains poles (roots of the denominator) and zeros (roots of the numerator) in its Z transfer function. Another way to describe the model is as an autoregressive moving average (ARMA): it is autoregressive in the sense that it depends on past values of the output, and moving average in that it depends on present and past values of the input. Under certain conditions, the model can be simplified and written as an all-pole (autoregressive, AR) or as an all-zero (moving average, MA) model. In the all-pole model the coefficients $b_l = 1$ for $l = 0$ and $b_l = 0$ for $l = 1$ to $q$, that is:

$$x_n = -\sum_{k=1}^{p} a_k\, x_{n-k} + u_n \qquad (11)$$

This is also known as linear predictive modeling. The input is assumed to be a zero-mean, white-noise process, the model coefficients are $a_k$, and the model order is $p$. The Z domain transfer function is then given by:

$$H(z) = \frac{1}{1 + \sum_{k=1}^{p} a_k\, z^{-k}} \qquad (13)$$

The power spectral density of the model signal can be shown to be [4]:

$$P(\omega) = \frac{\sigma_u^2}{\left|1 + \sum_{k=1}^{p} a_k\, e^{-j\omega k}\right|^2}$$

where $\sigma_u^2$ is the variance of the input noise. The problem then is to determine $a_k$ given $x_n$. If the input $u_n$ is totally unknown, $x_n$ can be estimated from the linear prediction of Eq. 8. The residual error is:

$$e_n = x_n - \hat{x}_n = x_n + \sum_{k=1}^{p} a_k\, x_{n-k}$$

and the mean square error (MSE) is:

$$E = \sum_{n} e_n^2 \qquad (14)$$

The MSE is minimized by taking its partial derivative with respect to each $a_i$, $i = 1$ to $p$, and setting it to zero. Writing the signal autocorrelation as:

$$r(i) = \sum_{n=-\infty}^{\infty} x_n\, x_{n+i}, \qquad r(-i) = r(i) \qquad (17, 18)$$

this yields the normal equations:

$$\sum_{k=1}^{p} a_k\, r(i-k) = -r(i), \qquad 1 \le i \le p \qquad (19)$$

and the minimum total squared error:

$$E_p = r(0) + \sum_{k=1}^{p} a_k\, r(k) \qquad (20)$$

The coefficients $r(i-k)$ form the autocorrelation matrix $R$, so the equation for the predictive coefficients can be written in matrix form as $R\mathbf{a} = -\mathbf{r}$. These relations are known as the Yule-Walker equations. The autocorrelation matrix is a symmetric Toeplitz matrix, which makes the equations readily solvable by iterative techniques. Algorithms such as Durbin's recursive procedure [5] can be used to find the autoregressive coefficients $a_k$ that optimally represent the signal. Commonly, these coefficients, and the poles of Eq. 13, are used as features [6].
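As a sketch (not the article's implementation; the function and signal names are illustrative), the AR coefficients can be obtained from the biased autocorrelation estimates by Durbin's recursion:

```python
import numpy as np

def ar_coefficients(x, p):
    """Estimate AR (linear predictive) coefficients a_1..a_p of signal x by
    solving the Yule-Walker equations with Durbin's recursion."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    # Biased autocorrelation estimates r(0)..r(p)
    r = np.array([np.dot(x[:N - i], x[i:]) / N for i in range(p + 1)])
    a = np.zeros(p + 1)
    a[0] = 1.0
    E = r[0]                                  # prediction error power
    for i in range(1, p + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / E   # reflection coefficient
        a[1:i] = a[1:i] + k * a[1:i][::-1]    # update a_1..a_(i-1)
        a[i] = k
        E *= (1.0 - k * k)
    return a[1:], E                           # coefficients a_k and error E_p

# Example: AR(4) coefficients of a synthetic signal, used as features
rng = np.random.default_rng(1)
x = np.cos(2 * np.pi * 0.05 * np.arange(512)) + 0.2 * rng.standard_normal(512)
a, err = ar_coefficients(x, p=4)
print(a, err)
```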


Transformed Signal Characteristics

Fourier Series

A time series x(k), k = 1 to N, can be represented as a sum of periodic components [7]. The determination of the magnitude and phase of each component is known as spectral analysis. If x(k) is periodic, the spectrum may be represented as a linear combination of harmonics whose frequencies are integer multiples of the fundamental frequency (the Fourier series). If x(k) contains nonperiodic components, the spectrum must be expressed as a continuum of frequencies (the Fourier transform). The discrete Fourier transform (DFT) is given by:

$$X(m) = \sum_{k=1}^{N} x(k)\, e^{-j 2\pi m k / N}$$

where x(k) is the measured signal, $j = \sqrt{-1}$, and N is the total number of samples (the fast Fourier transform, FFT, is a computationally efficient implementation). The power spectral density (PSD), which represents the power of each component of the Fourier series, is given by:

$$S(m) = \frac{1}{N}\, X(m)\, X^{*}(m)$$

where $X^{*}(m)$ is the complex conjugate of $X(m)$.
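A minimal numerical sketch (not part of the original text; the sampling rate fs and the test signal are assumptions), computing the DFT and PSD with NumPy's FFT:

```python
import numpy as np

fs = 250.0                                   # assumed sampling rate in Hz
t = np.arange(0, 2, 1 / fs)
x = np.sin(2 * np.pi * 10 * t) + 0.5 * np.sin(2 * np.pi * 25 * t)

X = np.fft.fft(x)                            # discrete Fourier transform
psd = (np.abs(X) ** 2) / len(x)              # power of each Fourier component
freqs = np.fft.fftfreq(len(x), d=1 / fs)     # frequency of each bin

pos = freqs >= 0
features = psd[pos]                          # non-negative-frequency spectrum as features
print(freqs[pos][np.argmax(features)])       # dominant frequency (about 10 Hz here)
```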

Cosine Transform

The discrete cosine transform (DCT) of a data sequence x(m), m = 0, 1, 2, ..., M-1, can be stated as [8]:

$$G_x(0) = \frac{\sqrt{2}}{M} \sum_{m=0}^{M-1} x(m)$$

$$G_x(k) = \frac{2}{M} \sum_{m=0}^{M-1} x(m)\, \cos\!\frac{(2m+1)k\pi}{2M}, \qquad k = 1, 2, \ldots, M-1 \qquad (26)$$

where $G_x(k)$ is the kth DCT coefficient and the set of basis vectors is $\{1/\sqrt{2},\ \cos((2m+1)k\pi/2M)\}$.
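As a hedged sketch (the test signal is illustrative, and SciPy's DCT-II normalization differs from Eq. 26 by constant factors), low-order DCT coefficients can be kept as features:

```python
import numpy as np
from scipy.fft import dct

x = np.cos(2 * np.pi * 0.02 * np.arange(64)) \
    + 0.1 * np.random.default_rng(2).standard_normal(64)

# DCT-II with orthonormal scaling; the low-order coefficients capture the
# slowly varying part of the signal and are often retained as features.
G = dct(x, type=2, norm='ortho')
features = G[:8]
print(features)
```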

Walsh Transform

The Walsh transform is appealing because it is implementable in N log2 N additions (no multiplications) [10]. Given a pattern of length $N = 2^n$, the (u, x) element of the transform kernel can be stated as:

$$A_{ux} = \frac{1}{N}\,(-1)^{\sum_{i=0}^{n-1} b_i(u)\, b_i(x)}$$

where $b_i(u)$ and $b_i(x)$ are the ith bits of the binary representations of the row and column indices u and x.
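A small sketch (not from the article): the Walsh-Hadamard kernel in natural (Hadamard) order built from the bit-product rule above, applied to a short illustrative pattern:

```python
import numpy as np

def walsh_hadamard_matrix(n):
    """N x N Walsh-Hadamard kernel (natural order), N = 2**n, entries +/- 1/N."""
    N = 2 ** n
    A = np.empty((N, N))
    for u in range(N):
        for x in range(N):
            bits = bin(u & x).count("1")   # sum of b_i(u) * b_i(x)
            A[u, x] = (-1) ** bits / N
    return A

x = np.array([1.0, 2.0, 1.0, 0.0, -1.0, -2.0, -1.0, 0.0])
W = walsh_hadamard_matrix(3) @ x           # Walsh spectrum, usable as features
print(W)
```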

Haar Transform

The Haar functions $h_k(x)$, defined on the continuous interval $x \in [0, 1]$ with k = 0, 1, ..., N-1, are given by [9]:

$$h_0(x) = h_{00}(x) = \frac{1}{\sqrt{N}}, \qquad x \in [0, 1] \qquad (27)$$

$$h_k(x) = h_{pq}(x) = \frac{1}{\sqrt{N}}
\begin{cases}
2^{p/2}, & \dfrac{q-1}{2^p} \le x < \dfrac{q-\tfrac{1}{2}}{2^p} \\[6pt]
-2^{p/2}, & \dfrac{q-\tfrac{1}{2}}{2^p} \le x < \dfrac{q}{2^p} \\[6pt]
0, & \text{otherwise for } x \in [0, 1]
\end{cases} \qquad (28)$$

given that $k = 2^p + q - 1$ for $0 \le p \le n-1$; $q = 0, 1$ for $p = 0$, and $1 \le q \le 2^p$ for $p \ne 0$.

Singular Value Decomposition

The singular value decomposition of a matrix provides a robust, numerically reliable, and efficient technique for feature extraction [11]. Singular value decomposition methods have been used, for example, to extract the fetal electrocardiogram from surface electrode signals [11]. The singular value decomposition (SVD) of a matrix A can be written [12]:

$$A = U\, W\, V^{H} \qquad (31)$$

where U and V are unitary matrices and $V^H$ is the Hermitian transpose of V. The singular values of A are the positive square roots of the eigenvalues of $A^H A$ and are the diagonal elements of W. The SVD can also be written as:

$$A = \sum_{j=1}^{k} s_j\, \mathbf{u}_j \mathbf{v}_j^{H} \qquad (33)$$

where the matrix A is of rank k, and Eq. 33 represents A as the sum of k matrices of rank 1. Each matrix $s_j \mathbf{u}_j \mathbf{v}_j^H$ is a representation of A after eliminating the redundancy. The singular values obtained from the SVD have been used directly as features in pattern recognition. To compute the SVD of a matrix A [12]:

1. Find the eigenvalues and orthonormalize the eigenvectors of $A^H A$.
2. Construct the submatrix D by placing the singular values of A on the diagonal and setting all other elements equal to zero.
3. Partition $V = [V_1 \mid V_2]$, where $V_1$ consists of eigenvectors corresponding to positive eigenvalues of $A^H A$ and $V_2$ consists of all other eigenvectors.
4. Calculate $U_1 = A V_1 D^{-1}$.
5. Augment onto $U_1$ the identity matrix with the same number of rows as $U_1$.
6. Identify those columns of the augmented matrix which form a maximal set of linearly independent column vectors and delete the others. Orthonormalize this matrix to form U.
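As an illustrative sketch (not the article's code; the data matrix of signal segments is hypothetical), the singular values of a segment matrix can be computed with NumPy and used as features:

```python
import numpy as np

# Hypothetical data matrix: each row is one beat/segment of a biosignal.
rng = np.random.default_rng(3)
template = np.sin(2 * np.pi * np.arange(100) / 100)
A = np.vstack([template + 0.05 * rng.standard_normal(100) for _ in range(20)])

U, s, Vh = np.linalg.svd(A, full_matrices=False)   # A = U @ diag(s) @ Vh
features = s[:5]                                   # leading singular values as features

# Rank-1 reconstruction: the dominant component shared by all segments
A1 = s[0] * np.outer(U[:, 0], Vh[0, :])
print(features, np.linalg.norm(A - A1))
```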

Karhunen-Loeve Transform

The Karhunen-Loeve transform (KLT) is used to represent a discrete set of signals in terms of a set of orthonormal basis vectors, ordered optimally in the least squares sense, for estimation of signal patterns [13]. If all basis vectors are used, an exact representation is obtained. Each basis vector represents important features contained within the pattern data. Since the features are optimally ordered in terms of relative importance, the problem of feature selection becomes simply one of choosing how many features are to be used for classification purposes (in part dependent on representation error and in part on computational cost restrictions). This transformation also provides data compression, so that redundancies in the patterns are removed, and some immunity is provided from motion artifact and other noise [14-16].

The Karhunen-Loeve expansion (KLE) is found as follows. Any vector x can be written as:

$$\mathbf{x} = \sum_{i=1}^{n} y_i\, \boldsymbol{\phi}_i = \Phi\, \mathbf{y}$$

The transform matrix $\Phi$ is made up of n linearly independent column vectors (basis vectors). These vectors $\boldsymbol{\phi}_i$, i = 1 to n, form an orthonormal set and are the KLE of x. Therefore,

$$\mathbf{y} = \Phi^{T} \mathbf{x} \qquad (38)$$

$$y_i = \boldsymbol{\phi}_i^{T} \mathbf{x} \qquad (39)$$

To find this basis, let $\Sigma_x$ be the covariance matrix of x:

$$\Sigma_x = E\!\left[(\mathbf{x} - \bar{\mathbf{x}})(\mathbf{x} - \bar{\mathbf{x}})^{T}\right]$$

where $\bar{\mathbf{x}}$ is the mean vector and E is the expectation operator. Consider that an approximate representation of x can be given by using only m of the n basis vectors. It can be shown [13] that the optimal choice of basis vectors satisfies the eigenvector equation:

$$\Sigma_x\, \boldsymbol{\phi}_i = \lambda_i\, \boldsymbol{\phi}_i$$

and that the resulting representation error is:

$$\varepsilon = \sum_{i=m+1}^{n} \lambda_i \qquad (43)$$

Therefore the eigenvalues $\lambda_i$, i = m+1 to n (which are not used), should be those with the lowest magnitude, for minimal error of representation. The KLT is optimal in the sense of minimizing the mean square error of representation when m < n basis vectors are used, as compared to any other choice of basis functions or ordering [15, 17]. The implementation of the KLE can be stated as [18]:

1. For all sample vectors $\mathbf{x}_i$, find the mean vector $\bar{\mathbf{x}} = E[\mathbf{x}_i]$ and the covariance matrix $\Sigma_x$.
2. Perform eigenanalysis on $\Sigma_x$ and order the eigenvectors according to the descending magnitude of the eigenvalues $\lambda_i$.
3. The first basis vector $\boldsymbol{\phi}_1$ is the eigenvector associated with the largest eigenvalue, and so on. The matrix $\Phi$ is formed from these eigenvectors.
4. Select m of the n basis vectors for representation, based on some error criterion, for example average rms error [14].
5. Once the transform matrix is known, the coefficients $\mathbf{y} = \Phi^T \mathbf{x}$ can be generated to represent any input vector.
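A compact sketch of the implementation steps above (assumptions: data arranged one sample vector per row; the function and data names are illustrative):

```python
import numpy as np

def kle_basis(X, m):
    """Steps 1-4: mean, covariance, eigenanalysis, keep m leading basis vectors.
    X holds one sample vector per row."""
    x_bar = X.mean(axis=0)
    Sigma = np.cov(X, rowvar=False)            # covariance matrix of the samples
    eigvals, eigvecs = np.linalg.eigh(Sigma)   # symmetric eigen-decomposition
    order = np.argsort(eigvals)[::-1]          # descending eigenvalue order
    Phi = eigvecs[:, order[:m]]                # n x m transform matrix
    return x_bar, Phi, eigvals[order]

# Step 5: features y = Phi^T x for any input vector (Eq. 38; in practice the
# mean is often subtracted first).
rng = np.random.default_rng(4)
X = rng.standard_normal((200, 3)) @ np.array([[2.0, 0.3, 0.1],
                                              [0.3, 1.0, 0.0],
                                              [0.1, 0.0, 0.2]])
x_bar, Phi, lam = kle_basis(X, m=2)
y = X[0] @ Phi
print(y, lam)
```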

Wavelets

Wavelets are a way to represent signals in terms of functions that are local in time and frequency (or space and wave number) [19]. Wavelets are basis functions which can be translated (change in phase) and dilated (change in sequence length). The scaling function $\phi(x)$ is dilated using the equation:

$$\phi(x) = \sum_{k} c_k\, \phi(2x - k)$$

where the $c_k$ are coefficients used to determine the orthogonality properties of the wavelets. Once $\phi(x)$ is found, the wavelet $W(x)$ can be determined:

$$W(x) = \sum_{k} (-1)^k\, c_{1-k}\, \phi(2x - k)$$
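As a minimal sketch (assuming the Haar case, where the only nonzero coefficients are $c_0 = c_1 = 1$; the signal is illustrative), one level of a discrete wavelet decomposition splits a signal into coarse (scaling) and detail (wavelet) coefficients, whose energies are often used as features:

```python
import numpy as np

def haar_dwt_level(x):
    """One level of the Haar discrete wavelet transform.
    Returns (approximation, detail) coefficient arrays."""
    x = np.asarray(x, dtype=float)
    if len(x) % 2:
        x = x[:-1]                       # drop the last sample if length is odd
    even, odd = x[0::2], x[1::2]
    approx = (even + odd) / np.sqrt(2)   # low-pass (scaling) coefficients
    detail = (even - odd) / np.sqrt(2)   # high-pass (wavelet) coefficients
    return approx, detail

x = np.sin(2 * np.pi * 0.05 * np.arange(128)) \
    + 0.1 * np.random.default_rng(5).standard_normal(128)
a1, d1 = haar_dwt_level(x)
a2, d2 = haar_dwt_level(a1)              # second level from the approximation
features = [np.sum(d1 ** 2), np.sum(d2 ** 2)]   # detail-band energies as features
print(features)
```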

Structural Descriptors

The underlying point of view of this class of modeling techniques is that the signal characteristics of interest are separable, deterministic, and, when grouped, describe the concept of interest. Therefore, signals in this class can be described concisely in symbolic form.

Terminology

Several shorthand notations are often used in structural modeling. Consider the grammar:

$$G = \{V_N, V_T, P, S\}$$

with nonterminal symbols $V_N$, denoted by uppercase letters, terminal symbols $V_T$, denoted by lowercase letters, a set of productions P, and start symbol S. Productions are denoted by $\alpha \to \beta$, where $\alpha$ and $\beta$ represent strings composed of terminals and/or nonterminals, and $\to$ is the substitution arrow, which confers the meaning "$\beta$ replaces $\alpha$." $|\alpha|$ denotes the length (i.e., the number of elements, or symbols) in string $\alpha$; $a^n$ denotes that string a is repeated n times. Several types of grammars can be defined [21, 22]:

1. Type 0 (Free or Unrestricted). These grammars place no restrictions on the rewriting rules, which allows highly descriptive power.

For a grammar with terminal symbols $V_T = [a, b]$ and a set of productions P, a representation by graph is made using the following rules:

- Nodes are the elements of $V_N$, plus a terminal node T.
- $A_i \to a A_j$ is represented by an arc labeled a between nodes $A_i$ and $A_j$.
- $A_i \to a$ is represented by an arc labeled a between nodes $A_i$ and T.

Thus, the in-degree of node S (the number of arcs entering S) and the out-degree of node T (the number of arcs leaving T) are zero.
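A small illustrative sketch (not from the article; the symbol alphabet and the "peak" production are hypothetical): a signal is first converted to a string of terminal symbols (u for upslope, d for downslope, f for flat), and a simple production is matched to recognize a structural primitive:

```python
import re

def to_symbols(x, threshold=0.0):
    """Encode consecutive sample differences as terminal symbols u/d/f."""
    symbols = []
    for a, b in zip(x[:-1], x[1:]):
        diff = b - a
        symbols.append("u" if diff > threshold else "d" if diff < -threshold else "f")
    return "".join(symbols)

x = [0, 1, 3, 6, 4, 2, 1, 1, 2, 4, 3, 1]
s = to_symbols(x)
# Production PEAK -> u+ d+, expressed here as a regular expression for matching
peaks = [m.start() for m in re.finditer(r"u+d+", s)]
print(s, peaks)
```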

Feature Vector Dimensionality Reduction - Stepwise Discriminant Analysis (SWDA)

Given the general linear discriminant function [20]:

$$d_i = \sum_{j=1}^{N} w_{ij}\, x_j + w_{i0} \qquad (50)$$

where $\mathbf{w}_i$ is the weight vector for the ith class, using the full N-dimensional feature vector, SWDA finds the best subset of lower dimension k in the following manner. Sequentially, the best features are determined by minimizing the within-group variance. The F statistic is used as the performance index.
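SWDA admits features one at a time with the F statistic as the performance index; the fragment below (an assumption-laden stand-in, not the full stepwise procedure) only illustrates the scoring step, ranking features by their one-way F statistic with scikit-learn:

```python
import numpy as np
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(6)
labels = np.repeat([0, 1], 50)
X = rng.standard_normal((100, 5))
X[labels == 1, 2] += 2.0                 # feature 2 carries the class difference

F, p = f_classif(X, labels)              # one-way F statistic for each feature
order = np.argsort(F)[::-1]              # rank features by F
k = 2
print(order[:k], F[order[:k]])
```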

Feature Selection Methods

Feature selection has two meanings: 1) which components of a pattern, or 2) which of a set of features, best represent a given pattern. The following methods are used to select features [18].

Exhaustive search: This is feature selection based on the merit of each possible combination of features. Given features $y_i$, i = 1 to D, the problem is to select the best d of these features so that the error of classification is minimized. The total number of different combinations to examine is:

$$\frac{D!}{(D-d)!\, d!} \qquad (67)$$

Branch and bound algorithm: This assumes that the feature selection criterion satisfies the monotonicity property, that is:

$$J(\bar{y}_1) \le J(\bar{y}_2) \le \ldots \le J(\bar{y}_i)$$

where $\bar{y}_i$ is a candidate feature set with i features and $J(\bar{y}_i)$ is a criterion function. The method is shown graphically (Fig. 5) for the selection of two features out of four measurements $y_1$, $y_2$, $y_3$, and $y_4$ [18]. At each level, the greatest criterion value found at the previous level is chosen as the correct path. At each node, a feature is removed from the total feature set. J is updated as follows: if it is greater than $J_{max}$, continue along the same path; else go to the next branch at left (Fig. 5).

Max-min feature selection: Feature selection is based on the individual and pairwise merit of features, using a criterion function J. Consider that k features have already been selected, and the available features are $Y - X_k$, where Y is the total pool of features and $X_k$ are the features already chosen. Let $y_j \in Y - X_k$ and $x_i \in X_k$. The algorithm finds:

$$\max_{y_j}\ \min_{x_i}\ \Delta J(y_j, x_i)$$

where $\Delta J(y_j, x_i)$ is the increase in the criterion obtained by adding $y_j$ given $x_i$.

Sequential forward and backward selection: Sequential forward selection (SFS) is a bottom-up process. Starting with the empty set, select as the first feature the individually best feature. At each subsequent level, choose the feature from the remaining set which, in combination with the features already selected, gives the best value of the criterion function. Sequential backward selection (SBS) is a top-down process. Starting from the full feature set $\{y_j;\ j = 1 \text{ to } D\}$, features are removed one at a time; after k removals the feature set is $\bar{y}_{D-k}$, and the (k+1)st measurement eliminated is chosen so that the criterion $J(\bar{y}_{D-k-1})$ is maximized over the remaining candidates. A problem with the SFS algorithm is that there is no mechanism to remove features that become superfluous once other features are added. The SBS algorithm has the difficulty that there is no way to re-include discarded features that would be helpful after other features are discarded. These difficulties make both algorithms suboptimal.

Plus l, take away r algorithm: This is an alternating process of augmentation and depletion of the feature set (by employing, respectively, the SFS and SBS algorithms). After l features are added to the current feature set, r are removed. If l > r, the feature selection process operates from the bottom up; if l < r, it operates from the top down. A sketch of SFS and SBS is given below.
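An illustrative sketch (not from the article) of SFS and SBS with a generic criterion function J; the toy criterion used here (squared distance between class means over the chosen subset) and all names are assumptions for demonstration only:

```python
import numpy as np

def sfs(J, D, d):
    """Sequential forward selection: greedily add the feature that maximizes J."""
    selected = []
    while len(selected) < d:
        best = max((j for j in range(D) if j not in selected),
                   key=lambda j: J(selected + [j]))
        selected.append(best)
    return selected

def sbs(J, D, d):
    """Sequential backward selection: greedily remove the feature whose
    removal keeps J largest."""
    selected = list(range(D))
    while len(selected) > d:
        worst = max(selected, key=lambda j: J([s for s in selected if s != j]))
        selected.remove(worst)
    return selected

# Toy criterion: squared distance between the class means over the chosen subset.
rng = np.random.default_rng(7)
labels = np.repeat([0, 1], 40)
X = rng.standard_normal((80, 6))
X[labels == 1, 1] += 1.5
X[labels == 1, 4] += 1.0

def J(subset):
    m0 = X[labels == 0][:, subset].mean(axis=0)
    m1 = X[labels == 1][:, subset].mean(axis=0)
    return float(np.sum((m0 - m1) ** 2))

print(sfs(J, D=6, d=2), sbs(J, D=6, d=2))
```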


Conclusions

The wide variety of techniques in existence for feature extraction presents two problems: 1) which techniques should be used, and 2) how to select from among the features that each extraction technique generates. Selected features are "best" only by some standard (i.e., criterion); therefore, techniques for generating features tend not to be very portable from one pattern processing problem to another. Production of salient features is the connecting link between prototypical and symbolic representations of a class. Often, thresholds govern the selection of features. Many techniques do not generate independent features; therefore there is redundancy in the data, which potentially affects both efficiency and accuracy in pattern recognition.

Edward Ciaccio received a B.A. in chemistry in 1979, an M.S. in biomedical engineering in 1989, and a Ph.D. in biomedical engineering in 1993, all at Rutgers University, New Brunswick, N.J. Since 1990, he has worked part time as an engineer in the Dept. of Pharmacology, Columbia University College of Physicians and Surgeons, New York City, designing multichannel data acquisition systems for cardiac mapping. His Ph.D. dissertation is entitled "Representation and Comparison of Sampled Distributions for Biomedical Signal Pattern Recognition."

He can be reached at Rutgers University, Department of Biomedical Engineering, P.O. Box 909, Piscataway, NJ 08855.

Stanley M. Dunn (S'75, M'86, SM'90) is Associate Professor of Biomedical Engineering at Rutgers University, Piscataway, and Associate Professor in the Division of Oral and Maxillofacial Radiology at the New Jersey Dental School, Newark, both in New Jersey. He received the B.S.E.E. and B.S. in computer science from Drexel University in 1979, and an M.S. and a Ph.D. in computer science from the University of Maryland, College Park, Md., in 1983 and 1985, respectively. He is currently completing a Ph.D. in oral radiology at the Vrije Universiteit, Amsterdam, the Netherlands. His research interests include pattern recognition, learning in machine vision, nonrigid motion analysis, and problems of image registration in two and three dimensions. In addition to the IEEE, he is a member of ACM, SPIE, IADR, and AAOMR.

Metin Akay was born in Sivas, Turkey. He received the B.S. and M.S. degrees in electrical engineering from Bogazici University, Istanbul, Turkey, in 1981 and 1984. From 1984 to 1986 he was in the Ph.D. program at the same university. In 1986, he joined the Department of Biomedical Engineering, Rutgers University, Piscataway, N.J., where he held a fellowship and a teaching assistantship. He received his Ph.D. degree from Rutgers University in 1990.


He is currently a visiting assistant professor at Rutgers University. His research areas of interest are digital signal processing, detection and estimation theory, and their application to biomedical signals; he currently teaches two graduate courses in these areas. His biomedical research areas include breathing control, noninvasive detection of coronary artery disease, and the understanding of the autonomic nervous system.

References

1. Dunn SM: An introduction to model-based imaging. Dentomaxillofac Radiol 21:184-189, 1992.
2. Fu KS: Syntactic Pattern Recognition and Applications. Prentice-Hall, Englewood Cliffs, NJ, 1982.
3. Hu MK: Pattern recognition by moment invariants. Proc Inst Radio Eng 49:1428, 1961.
4. Makhoul J: Linear prediction: a tutorial review. Proc IEEE 63:561-580, 1975.
5. Cohen A: Biomedical Signal Processing, Vols 1 and 2. CRC Press, Boca Raton, FL, 1986.
6. Tavathia S, Rangayyan RM, Frank CB, Bell GD, Ladly KO, Zhang Y-T: Analysis of knee vibration signals using linear prediction. IEEE Trans Biomed Eng 39:959-970, 1992.
7. Baker GL, Gollub JP: Chaotic Dynamics: An Introduction. Cambridge University Press, Cambridge, 1990.
8. Ahmed N, Natarajan T, Rao KR: Discrete cosine transform. IEEE Trans Comput C-23:90-93, 1974.
9. Jain AK: Fundamentals of Digital Image Processing. Prentice Hall, Englewood Cliffs, NJ, 1989.
10. Andrews HC, Pratt WK: Digital image transform processing. In: Proc Symp Walsh Function Applications, 183-194, Washington, DC, Apr 1970.
11. Callaerts D, De Moor B, Vandewalle J, Sansen W, Vantrappen G, Janssens J: Comparison of SVD methods to extract the foetal electrocardiogram from cutaneous electrode signals. Med Biol Eng Comput 28:217-224, 1990.
12. Bronson R: Matrix Operations. Schaum's Outline Series, McGraw-Hill, NY, 1990.
13. Fukunaga K: Introduction to Statistical Pattern Recognition. Academic Press, NY, 1991.
14. Lux RL, Evans AK, Burgess MJ, Wyatt RF, Abildskov JA: Redundancy reduction for improved display and analysis of body surface potential maps I: spatial compression. Circulation Res 49:186-196, 1981.
15. Evans AK, Lux RL, Burgess MJ, Wyatt RF, Abildskov JA: Redundancy reduction for improved display and analysis of body surface potential maps II: temporal compression. Circulation Res 49:197-203, 1981.
16. Moody GB, Mark RG: QRS morphology representation and noise estimation using the Karhunen-Loeve transform. In: Proc IEEE Computers in Cardiology, 269-272, Jerusalem, 1989.
17. Chen C-H: Statistical Pattern Recognition. Spartan Books, Rochelle Park, NJ, 1973.
18. Kittler J: Feature selection and extraction. In: Young TY, Fu K-S (Eds): Handbook of Pattern Recognition and Image Processing, 59-83. Academic Press, Orlando, 1986.
19. Mallat SG: A theory for multiresolution signal decomposition: the wavelet representation. IEEE Trans Pattern Anal Machine Intell 11:674-693, 1989.
20. Afifi A, Azen S: Statistical Analysis: A Computer Oriented Approach. Academic Press, NY.
21. Chomsky N: Syntactic Structures. Mouton, The Hague, 1957.
22. Schalkoff R: Pattern Recognition: Statistical, Structural and Neural Approaches. John Wiley & Sons, NY, 1992.
