Biosignal Pattern Recognition and Interpretation Systems
Part 3 of 4: Methods of Classification

E.J. Ciaccio, S.M. Dunn, M. Akay
Department of Biomedical Engineering, Rutgers University

Statistical techniques for pattern classification fall into several categories:
1. Minimum distance classifiers, which classify a pattern based on its distance in a feature space to class prototypes.
2. Maximum likelihood estimation, which optimizes a parameter based on likelihoods.
3. Fisher's Linear Discriminant, which reduces the dimensionality of a feature vector in such a way as to improve classification.
4. Entropy criteria, which classify a pattern based on minimizing uncertainty.

Bayes/Minimum Distance Classifiers

Consider the Bayes linear classifier (BLC). For two normal distributions w1 and w2, the Bayes decision rule can be expressed as a quadratic function of the sample vector x as [1]:

$(\mathbf{x}-\boldsymbol{\mu}_2)^T C_2^{-1}(\mathbf{x}-\boldsymbol{\mu}_2) - (\mathbf{x}-\boldsymbol{\mu}_1)^T C_1^{-1}(\mathbf{x}-\boldsymbol{\mu}_1) + \ln\frac{|C_2|}{|C_1|} \;\gtrless\; 2\ln\frac{P(w_2)}{P(w_1)}$

where μ1 and μ2 are the mean vectors for classes 1 and 2, C1 and C2 are the class covariance matrices, P is the prior probability, and the inequality denotes class membership of x in w1 or w2, dependent upon the direction in which it holds. The equation can be simplified if the covariance matrices are equal, C1 = C2 = C:

$(\mathbf{x}-\boldsymbol{\mu}_2)^T C^{-1}(\mathbf{x}-\boldsymbol{\mu}_2) - (\mathbf{x}-\boldsymbol{\mu}_1)^T C^{-1}(\mathbf{x}-\boldsymbol{\mu}_1) \;\gtrless\; 2\ln\frac{P(w_2)}{P(w_1)}$

Further, if the observation vector x is corrupted by white noise, C = I (i.e., the identity matrix), and the rule becomes linear in the products μiᵀx:

$\boldsymbol{\mu}_1^T\mathbf{x} - \boldsymbol{\mu}_2^T\mathbf{x} - \tfrac{1}{2}\left(\boldsymbol{\mu}_1^T\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2^T\boldsymbol{\mu}_2\right) \;\gtrless\; \ln\frac{P(w_2)}{P(w_1)}$

Now the product μiᵀx is the correlation between μi and x, which can be written for a time-sampled process as:

$\boldsymbol{\mu}_i^T\mathbf{x} = \sum_{t=1}^{n}\mu_i(t)\,x(t)$

The correlation classifier compares the difference in the correlation of x with μ1 and μ2. The correlation between μ1 and x can also be considered to be the output of a linear (matched) filter [2], whose impulse response is the time-reversed template:

$h_1(T-t) = \mu_1(t)$
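Before continuing the derivation, the correlation form above can be illustrated with a minimal sketch (not from the original article), assuming equal-length, time-sampled templates mu1 and mu2 and known prior probabilities; the second function shows the matched-filter reading of the same correlation.

```python
import numpy as np

def correlation_classifier(x, mu1, mu2, p1=0.5, p2=0.5):
    """Bayes rule for identity covariance (white noise), written in
    correlation form: compare mu_i^T x corrected for template energy
    and prior probability."""
    g1 = mu1 @ x - 0.5 * mu1 @ mu1 + np.log(p1)
    g2 = mu2 @ x - 0.5 * mu2 @ mu2 + np.log(p2)
    return 1 if g1 >= g2 else 2

def matched_filter_output(x, mu):
    """The correlation mu^T x is the output, sampled at time T, of a
    filter whose impulse response is the time-reversed template.
    Assumes len(x) == len(mu)."""
    h = mu[::-1]                            # h(T - t) = mu(t)
    return np.convolve(x, h)[len(mu) - 1]   # equals np.dot(mu, x)
```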

Now if μiᵀx is multiplied by two, and xᵀx is added to and subtracted from the left-hand side, the decision rule becomes:

$\left(2\boldsymbol{\mu}_1^T\mathbf{x} - \boldsymbol{\mu}_1^T\boldsymbol{\mu}_1 - \mathbf{x}^T\mathbf{x}\right) - \left(2\boldsymbol{\mu}_2^T\mathbf{x} - \boldsymbol{\mu}_2^T\boldsymbol{\mu}_2 - \mathbf{x}^T\mathbf{x}\right) \;\gtrless\; 2\ln\frac{P(w_2)}{P(w_1)}$

which can be rewritten as:

$(\mathbf{x}-\boldsymbol{\mu}_2)^T(\mathbf{x}-\boldsymbol{\mu}_2) - (\mathbf{x}-\boldsymbol{\mu}_1)^T(\mathbf{x}-\boldsymbol{\mu}_1) \;\gtrless\; 2\ln\frac{P(w_2)}{P(w_1)}$

which is termed the Euclidean distance classifier. If P(w1) = P(w2) = 0.5, the decision boundary is the perpendicular bisector of the line joining μ1 and μ2. The Euclidean distance measure is given by:

$d_i^2 = (\mathbf{x}-\boldsymbol{\mu}_i)^T(\mathbf{x}-\boldsymbol{\mu}_i) = \mathbf{x}^T\mathbf{x} - 2\boldsymbol{\mu}_i^T\mathbf{x} + \boldsymbol{\mu}_i^T\boldsymbol{\mu}_i$

We want to choose the class that minimizes this Euclidean distance. Ignoring the first term, which is a constant for all distances, a distance function can be written:

$d_i' = \boldsymbol{\mu}_i^T\mathbf{x} - \tfrac{1}{2}\,\boldsymbol{\mu}_i^T\boldsymbol{\mu}_i$

which, when maximized over i, allows x to be classified in w_i. These expressions are equivalent to the linear discriminant function:

$g_i(\mathbf{x}) = \mathbf{w}_i^T\mathbf{x} + w_{i0}, \qquad \mathbf{w}_i = \boldsymbol{\mu}_i, \quad w_{i0} = -\tfrac{1}{2}\,\boldsymbol{\mu}_i^T\boldsymbol{\mu}_i$

The Euclidean distance has been used for classification of acoustic signals from knee cartilage motion [3]. A vector x is assigned to the cluster whose mean is closest to x. The cluster centers are found by averaging [4], which can be performed adaptively. If a priori statistics of the classes are known, the Mahalanobis distance can be used instead of the Euclidean distance. This distance is a Euclidean distance normalized with respect to intraclass scatter:

$d_i^2 = (\mathbf{x}-\boldsymbol{\mu}_i)^T C_i^{-1}(\mathbf{x}-\boldsymbol{\mu}_i)$

where x is the feature vector, μi is the mean feature vector for class w_i, and C_i is the covariance matrix of class w_i. The Mahalanobis distance has been used to classify evoked response potentials, such as the visual evoked potential (VEP) [5]. If the observation noise is correlated (colored noise), then the covariance matrix does not reduce to the identity matrix I, and the Bayes classifier is not as readily interpreted. The decision rule can still be viewed as a correlation or distance classifier if a "whitening" transformation is introduced, that is:

$\mathbf{y} = A\mathbf{x}, \qquad \text{where } A^T A = C^{-1}$

and the expectation of y is:

$E[\mathbf{y}] = A\boldsymbol{\mu}$

so that the distance computed on the whitened vector is based on y - Aμi.
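As an illustration of the minimum distance classifiers described above, the following sketch (an assumed implementation, not the authors' code) estimates class prototypes from labeled training data and classifies by Euclidean distance, or by Mahalanobis distance when class covariance matrices are supplied.

```python
import numpy as np

def fit_prototypes(X, labels):
    """Estimate a mean vector and covariance matrix per class from
    labeled training data (rows of X are feature vectors)."""
    classes = np.unique(labels)
    means = {c: X[labels == c].mean(axis=0) for c in classes}
    covs = {c: np.cov(X[labels == c], rowvar=False) for c in classes}
    return means, covs

def classify_min_distance(x, means, covs=None):
    """Assign x to the class whose prototype is nearest: squared Euclidean
    distance by default, squared Mahalanobis distance if covariances are
    given (intraclass scatter known)."""
    best_c, best_d = None, np.inf
    for c, mu in means.items():
        diff = x - mu
        if covs is None:
            d = diff @ diff                           # squared Euclidean distance
        else:
            d = diff @ np.linalg.inv(covs[c]) @ diff  # squared Mahalanobis distance
        if d < best_d:
            best_c, best_d = c, d
    return best_c
```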

Maximum Likelihood (ML) Estimation

Given a training set of sample feature vectors, we can compute the best estimate of a given class parameter. For example, to compute the class mean, the objective is to find a μi that maximizes the likelihood function p(Hi | μi) with respect to μi, where Hi is the set of observation vectors {x1, x2, ..., xn}. If the xk in Hi are assumed to be independent, then:

$p(H_i \mid \boldsymbol{\mu}_i) = \prod_{k=1}^{n} p(\mathbf{x}_k \mid \boldsymbol{\mu}_i)$

The goal is to maximize p(Hi | μi) with respect to μi. Any convenient monotonically increasing function can be used in the likelihood equation, but normally the log function is used, so that the product operator is replaced by a sum operator:

$\ln p(H_i \mid \boldsymbol{\mu}_i) = \sum_{k=1}^{n} \ln p(\mathbf{x}_k \mid \boldsymbol{\mu}_i)$

To maximize, solve:

$\nabla_{\boldsymbol{\mu}_i} \sum_{k=1}^{n} \ln p(\mathbf{x}_k \mid \boldsymbol{\mu}_i) = 0$

If the probability density function is Gaussian, the gradient of the log likelihood can be shown to be:

$\sum_{k=1}^{n} C_i^{-1}\left(\mathbf{x}_k - \boldsymbol{\mu}_i\right) = 0$

Therefore the class mean is given by the sample average:

$\hat{\boldsymbol{\mu}}_i = \frac{1}{n}\sum_{k=1}^{n}\mathbf{x}_k$
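A minimal sketch of this estimate, assuming Gaussian class densities; the function names are illustrative and SciPy is used only to evaluate the Gaussian density. The sample average returned by ml_class_mean maximizes log_likelihood over candidate means, which can be checked numerically.

```python
import numpy as np
from scipy.stats import multivariate_normal

def ml_class_mean(samples):
    """ML estimate of the class mean under the independence assumption:
    the sample average of the observation vectors."""
    return np.asarray(samples).mean(axis=0)

def log_likelihood(mu, samples, cov):
    """Log-likelihood of a candidate mean for a Gaussian class density:
    the product over samples becomes a sum of log densities."""
    return sum(multivariate_normal.logpdf(x, mean=mu, cov=cov) for x in samples)
```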

Fisher's Linear Discriminant

This is a means of reducing the dimensionality of the feature vector from n to d = M-1, where M is the total number of classes, while optimally preserving class separability [6]. The n-dimensional feature vector is projected onto a lower dimensional surface (assuming d < n). Given the general case that M classes are present, the within-class scatter matrix will be derived. Consider that there are N known samples of the feature vector y, N1 of which belong to class w1 and N2 of which belong to class w2. Then y can be projected by:

$z_i = \mathbf{a}^T\mathbf{y}_i$

Here z_i is a projection of y_i onto a line in the direction a, scaled so that ||a|| = 1, in n-dimensional space. Let the mean feature vector for class w_i, with N_i samples, be given by:

$\boldsymbol{\mu}_i = \frac{1}{N_i}\sum_{\mathbf{y}\in w_i}\mathbf{y}$

and let the mean projection be given by:

$\tilde{\mu}_i = \frac{1}{N_i}\sum_{\mathbf{y}\in w_i}\mathbf{a}^T\mathbf{y} = \mathbf{a}^T\boldsymbol{\mu}_i$

The separation of the projected means can be written as:

$|\tilde{\mu}_1 - \tilde{\mu}_2| = |\mathbf{a}^T(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)|$

The difference between the means of the projected data alone may be insufficient to act as a good classifier (Fig. 1). Therefore, class variances are often used as well as means.

[Figure 1: Population of Two Classes; population density of Population 1 and Population 2 plotted against parameter magnitude.]

The n x n scatter matrix of the ith class, which represents a measure of the dispersion of signals in w_i, is given by:

$W_i = \sum_{\mathbf{y}\in w_i}(\mathbf{y}-\boldsymbol{\mu}_i)(\mathbf{y}-\boldsymbol{\mu}_i)^T$

The within-class scatter matrix is given by:

$W = \sum_{i} W_i$

and the one-dimensional scatter of the projection z for class w_i is aᵀW_i a. The between-class scatter matrix, which represents interclass dispersion in n-dimensional feature space, is given for the two-class case by:

$B = (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^T$

The variance of the means between classes in the one-dimensional projection is then:

$(\tilde{\mu}_1 - \tilde{\mu}_2)^2 = \left(\mathbf{a}^T\boldsymbol{\mu}_1 - \mathbf{a}^T\boldsymbol{\mu}_2\right)^2 = \mathbf{a}^T B\,\mathbf{a}$

A criterion for separability based on the within- and between-class scatter matrices must be determined. One is to maximize the ratio of the determinants:

$J(A) = \frac{\left|A^T B A\right|}{\left|A^T W A\right|}$

The optimal transformation matrix A that maximizes this criterion has columns p_i that satisfy, for i = 1, 2, ..., M-1, the generalized eigenvalue problem:

$B\,\mathbf{p}_i = \lambda_i\,W\,\mathbf{p}_i$

The eigenvectors p_i are orthogonal coordinate axes onto which the distributions can be projected to maximize the interclass scatter while holding intraclass scatter fixed.
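The criterion above can be sketched in a few lines. This is an illustrative implementation (not the authors' code): it uses the multiclass form of the between-class scatter (weighted by class sample counts) and SciPy's generalized symmetric eigensolver for B p = lambda W p.

```python
import numpy as np
from scipy.linalg import eigh

def fisher_discriminant(X, labels, d=None):
    """Build within-class (W) and between-class (B) scatter matrices and
    solve B p = lambda W p; return the projection matrix whose columns are
    the eigenvectors with the largest eigenvalues."""
    classes = np.unique(labels)
    n = X.shape[1]
    mu = X.mean(axis=0)                       # overall mean
    W = np.zeros((n, n))
    B = np.zeros((n, n))
    for c in classes:
        Xc = X[labels == c]
        mu_c = Xc.mean(axis=0)
        W += (Xc - mu_c).T @ (Xc - mu_c)      # class scatter, summed over classes
        diff = (mu_c - mu).reshape(-1, 1)
        B += len(Xc) * diff @ diff.T          # multiclass between-class scatter
    evals, evecs = eigh(B, W)                 # generalized symmetric eigenproblem
    order = np.argsort(evals)[::-1]           # largest eigenvalues first
    d = d if d is not None else len(classes) - 1
    return evecs[:, order[:d]]
```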

Entropy Criteria

Entropy is a statistical measure of the uncertainty of a system, and therefore gives some insight into a system's information content. Finding the minimum entropy of a system is analogous to minimizing the intraclass dispersion while not changing the interclass dispersion significantly. First, the feature vector dimensionality is reduced by finding a linear transformation:

$\mathbf{z} = A^T\mathbf{x}$

The covariance matrix C_z of the reduced vector z, with expectation μ_z, is given by:

$C_z = E\left[(\mathbf{z}-\boldsymbol{\mu}_z)(\mathbf{z}-\boldsymbol{\mu}_z)^T\right]$

Since z is a linear transformation of a Gaussian process, it is also Gaussian. Therefore the conditional probability density in the reduced feature space is:

$p(\mathbf{z}\mid w_i) = (2\pi)^{-d/2}\,|C_z|^{-1/2}\exp\!\left[-\tfrac{1}{2}(\mathbf{z}-\boldsymbol{\mu}_z)^T C_z^{-1}(\mathbf{z}-\boldsymbol{\mu}_z)\right]$

The entropy H in the reduced feature space, if the distribution is Gaussian, is:

$H = -\int p(\mathbf{z}\mid w_i)\,\ln p(\mathbf{z}\mid w_i)\,d\mathbf{z} = \frac{1}{2}\sum_{j=1}^{d}\ln\lambda_j + \frac{d}{2}\left(1 + \ln 2\pi\right)$

where the λ_j are the eigenvalues of the covariance matrix C_z. A procedure for finding entropy can be listed as [6]:
1. Find the feature vector covariance matrix eigenvalues and eigenvectors.
2. Use the eigenvector corresponding to the smallest eigenvalue to transform x into z.
3. Project z onto that eigenplane and use discriminant functions or (for a reduced vector of one dimension) a threshold to separate clusters.
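A small sketch of the entropy computation and of step 2 of the procedure, assuming a Gaussian fit to the feature vectors; the function names are illustrative, not from the article.

```python
import numpy as np

def gaussian_entropy(X):
    """Entropy (in nats) of a Gaussian fit to the rows of X, computed from
    the eigenvalues of the sample covariance matrix."""
    cov = np.cov(X, rowvar=False)
    evals = np.linalg.eigvalsh(cov)
    d = len(evals)
    return 0.5 * np.sum(np.log(evals)) + 0.5 * d * (1.0 + np.log(2.0 * np.pi))

def min_variance_projection(X):
    """Step 2 of the procedure: project onto the eigenvector of the
    covariance matrix with the smallest eigenvalue."""
    cov = np.cov(X, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    return X @ evecs[:, 0]               # one-dimensional reduced feature
```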

k Nearest Neighbor (k-NN) Classification

Often the statistics of each class population are unknown. Nearest neighbor classification is a nonparametric technique for classifying a feature vector x once features have been extracted from a sample vector. The probability that feature vector x falls in a given volume R of feature space is approximately:

$P \approx p(\mathbf{x})\,V$

where V is the hypervolume of the region R and p(x) is the probability density of feature vector x (probability density is the probability per unit hypervolume). The probability density can be estimated by:

$\hat{p}(\mathbf{x}) = \frac{k}{m\,V}$

where k is the number of feature vectors falling in R, and m is the total number of feature vectors. V should be chosen in a range between too large (the estimate is over-smoothed) and too small (the variance of the estimate increases).
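The density estimate and the resulting majority-vote rule can be sketched as follows; this is an illustrative implementation, not from the article, and it assumes Euclidean distance in feature space.

```python
import numpy as np
from math import gamma, pi

def knn_density(x, samples, k):
    """k-NN density estimate: grow a hypersphere around x until it holds
    k samples, then p_hat = k / (m * V)."""
    samples = np.asarray(samples)
    m, d = samples.shape
    dists = np.sort(np.linalg.norm(samples - x, axis=1))
    r = dists[k - 1]                                   # radius enclosing k samples
    V = (pi ** (d / 2) / gamma(d / 2 + 1)) * r ** d    # hypersphere volume
    return k / (m * V)

def knn_classify(x, samples, labels, k):
    """Majority vote among the k nearest training vectors."""
    dists = np.linalg.norm(np.asarray(samples) - x, axis=1)
    nearest = np.argsort(dists)[:k]
    vals, counts = np.unique(np.asarray(labels)[nearest], return_counts=True)
    return vals[np.argmax(counts)]
```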

Syntactic Classification Techniques

Syntactic techniques analyze waveform shape. An advantage of syntactic pattern recognition techniques is that we can describe an infinite set of strings (i.e., signals) by a finite (i.e., computer-representable) description or graph.

String Matching

One method of recognition is string matching. Similar to the 1-NN rule in statistical pattern recognition, a matching metric can be employed using strings. A string is a list of symbols that represent basic elements of pattern structure, i.e., primitives, connected in serial fashion. It must be determined whether a string x is an element of a given language L(Gi) for all grammars i = 1, 2, ..., c. Therefore a class-specific library of all strings generated by the grammar (i.e., the entire language) must be stored. If the language L(Gi) is infinite, then even though the entire set of strings cannot be stored, the language can still be represented by its (finite) grammar. Since the strings must then be generated, the time for each comparison will increase. This procedure is inefficient relative to other structural pattern recognition methods in terms of computational cost per match [7]. Improvements to the basic string matching technique include:
1. Prescreening data: remove those strings that contain terminal symbols not found in the vocabulary of the language(s) to be checked.
2. Efficient search algorithms: find ways to check key portions of a string that can be used to differentiate between classes.
3. Hierarchical matching: try matching on lower and then higher resolution levels by, for example, combining symbols (features) at the highest level to form symbols (features) representing a lower resolution of structure.
4. Prototypical strings: narrow down the language of a class to a few prototypical examples, and match the unknown string to these prototypes using some measure of feature similarity (see the sketch after this list).
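As a hedged illustration of prototype string matching, the sketch below uses the Levenshtein (edit) distance as the matching metric and assigns the unknown string to the nearest class prototype, in the spirit of the 1-NN rule; the primitive symbols and prototype strings are hypothetical.

```python
def edit_distance(s, t):
    """Levenshtein distance between two primitive strings, used here as the
    matching metric between an unknown string and class prototypes."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (a != b)))   # substitution
        prev = cur
    return prev[-1]

def match_to_prototypes(unknown, prototypes):
    """1-NN-style matching: assign the unknown string to the class whose
    prototype string is closest in edit distance."""
    return min(prototypes, key=lambda cls: min(edit_distance(unknown, p)
                                               for p in prototypes[cls]))

# Hypothetical primitives: 'u' upstroke, 'p' peak, 'n' notch, 'r' rebound,
# 'f' reflection.
prototypes = {"normal pulse": ["upnrff"], "damped pulse": ["upnr"]}
print(match_to_prototypes("upnrf", prototypes))
```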

Parsing

Parsing is a frequently employed method of syntactic pattern recognition. The objective of this method is to determine whether an input string (i.e., a pattern) could have been generated by the given grammar. Parsing is accomplished by syntactic analyzers called parsers. A parse tree shows the various routes by which, beginning with the starting symbol, productions are applied to produce a set of terminal symbols (i.e., a sentence belonging to the language of the particular grammar). As an example, consider the syntactic structure of a blood pressure pulse (Fig. 2). The grammar G1 is given by the following definitions.

G1 = {VT, VN, P, S}

where VT is the set of terminal symbols, VN is the set of nonterminal symbols, P is the set of production rules, and S is the starting symbol. In the rules below, | denotes "or", + denotes concatenation, and → denotes "may be replaced by".

VT = { slow upstroke, fast upstroke, peak, notch, rebound, reflection 1, reflection 2 }

VN = { systole, diastole, upstroke, runoff }

P = { S → systole + diastole,
      systole → upstroke + peak + notch,
      upstroke → slow upstroke + fast upstroke,
      diastole → rebound + runoff,
      runoff → reflection 1 + reflection 2 }
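For illustration only, G1 can be represented as a dictionary of productions and expanded from the starting symbol; the representation and function names are my own, not the authors'.

```python
# Grammar G1 as a dictionary from nonterminals to the right-hand sides of
# their productions (each right-hand side is a list of symbols).
G1 = {
    "S":        [["systole", "diastole"]],
    "systole":  [["upstroke", "peak", "notch"]],
    "upstroke": [["slow upstroke", "fast upstroke"]],
    "diastole": [["rebound", "runoff"]],
    "runoff":   [["reflection 1", "reflection 2"]],
}

def derive(symbol, grammar):
    """Expand a symbol into terminal symbols by repeatedly applying the
    (here unambiguous) production rules."""
    if symbol not in grammar:                  # terminal symbol
        return [symbol]
    sentence = []
    for rhs_symbol in grammar[symbol][0]:      # single production per nonterminal
        sentence.extend(derive(rhs_symbol, grammar))
    return sentence

print(derive("S", G1))
# ['slow upstroke', 'fast upstroke', 'peak', 'notch', 'rebound',
#  'reflection 1', 'reflection 2']
```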

Consider the objective of deciding whether x ∈ wi, where wi is a given class. The objective can also be stated as determining, syntactically, whether x ∈ L(Gi). A top-down parse works from the root (top, or starting symbol) to the branches of the tree (bottom, or terminal symbols) (Fig. 2), therefore giving a speculative expansion of a given string. A bottom-up parse works from the branches to the root of the tree; it is a "brute force" technique because it simply seeks to find a path to the starting symbol. The Cocke-Younger-Kasami (CYK) parsing algorithm, a variant of dynamic programming, can be used to parse a string x in a number of steps proportional to |x|^3 [7]. A Chomsky Normal Form (CNF) grammar must be used; that is, productions are of the type A → BC or A → a. A table is used to order and direct the flow of parsing. However, a fixed table cannot be used for problems in which variable numbers of parses are allowed to reach the starting symbol. For example, given the string x = x1, x2, ..., xn, where xi ∈ VT and |x| = n, and grammar G, write table entries t_ij, 1 ≤ i ≤ n and 1 ≤ j ≤ n - i + 1. Symbols are parsed from the bottom up, placing each result in the box above the left-most symbol being parsed. If the starting symbol, S, is obtained in the top box, then the string is a valid sentence in grammar G. This technique is limited because few signals fall neatly into the category of a CNF grammar. For example, it may be difficult to use with signals containing additive noise, and particularly colored noise. This is because it is difficult to distinguish an arc representing periodic noise from arcs representing the structure of the signal if the context is not accounted for. Productions with only one member on the right-hand side cannot adequately account for contextual information.
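A compact sketch of the CYK table-filling procedure for a grammar already in CNF; the toy grammar below (generating strings of the form (ab)^n) is illustrative and is not the blood pressure grammar G1, which is not in CNF.

```python
def cyk_parse(x, terminal_rules, binary_rules, start="S"):
    """CYK parsing: t[i][j] holds the nonterminals that derive the substring
    of length i starting at position j. Runs in O(|x|^3) table-filling steps."""
    n = len(x)
    t = [[set() for _ in range(n)] for _ in range(n + 1)]
    for j, symbol in enumerate(x):                 # length-1 substrings: A -> a
        t[1][j] = {A for A, a in terminal_rules if a == symbol}
    for i in range(2, n + 1):                      # substring length
        for j in range(n - i + 1):                 # start position
            for k in range(1, i):                  # split point
                for A, B, C in binary_rules:       # A -> B C
                    if B in t[k][j] and C in t[i - k][j + k]:
                        t[i][j].add(A)
    return start in t[n][0]

terminal_rules = [("A", "a"), ("B", "b")]
binary_rules = [("S", "A", "B"), ("S", "A", "T"), ("T", "B", "S")]  # L = (ab)^n
print(cyk_parse("abab", terminal_rules, binary_rules))   # True
print(cyk_parse("aab", terminal_rules, binary_rules))    # False
```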

Syntactic Learning

Certain procedures may be used to generate constrained grammars by using grammatical inference. As an example, let x(1) = abcabc be a sentence known to be part of the language L(G). Simple production rules can be formulated that produce the given string under the grammatical constraints given above. The following procedure is used to generate these productions:

1. For all x(j) ∈ (VT ∪ VN)*, determine the set of distinct terminals VT:

VT = { a, b, c }

2. For each x(j) ∈ (VT ∪ VN)*, define a corresponding set of productions P; for this example:

P:  S → aA1
    A1 → bA2
    A2 → cA3
    A3 → aA4
    A4 → bA5
    A5 → c

This defines VN and P for all x(j).

3. P may contain redundancies, which can be removed by combining productions that serve the same purpose. Merging the productions produces a recursive grammar and a corresponding infinite language. After merging:

    S → aA1
    A1 → bA2
    A2 → cS
    A2 → c

The merged grammar generates abc, abcabc, abcabcabc, ..., that is, the infinite language (abc)^n with n ≥ 1.
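A small sketch of this inference procedure for a single training sentence; the merge step here is a deliberately simplified assumption (loop the repeated block back to S) rather than a general merging algorithm, and all names are illustrative.

```python
def infer_chain_grammar(sentence):
    """Steps 1-2: distinct terminals and one chain production per symbol,
    e.g. 'abcabc' -> S -> aA1, A1 -> bA2, ..., A5 -> c."""
    vt = sorted(set(sentence))
    productions = []
    lhs = "S"
    for i, symbol in enumerate(sentence):
        rhs_nonterminal = f"A{i + 1}" if i < len(sentence) - 1 else ""
        productions.append((lhs, symbol + rhs_nonterminal))
        lhs = rhs_nonterminal
    return vt, productions

def merge_repeats(productions, period):
    """Step 3 (simplified merge): keep the first `period` productions,
    terminate the last one, and add a recursive alternative back to S.
    Assumes single-character terminals."""
    merged = productions[:period]
    lhs, rhs = merged[-1]
    terminal = rhs[0]
    merged[-1] = (lhs, terminal)                 # e.g. A2 -> c
    return merged + [(lhs, terminal + "S")]      # e.g. A2 -> cS

vt, P = infer_chain_grammar("abcabc")
print(vt)                   # ['a', 'b', 'c']
print(merge_repeats(P, 3))  # [('S', 'aA1'), ('A1', 'bA2'), ('A2', 'c'), ('A2', 'cS')]
```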

[Figure 2: Blood Pressure Pulse; parse tree of the pulse waveform with labeled segments including systole, upstroke, and notch.]