Characterization of Event-Related Potentials Using Information Theoretic Distance Measures

Selin Aviyente, Linda W. Brakel, Ramesh K. Kushwaha, Michael Snodgrass, Howard Shevrin, William J. Williams
Abstract: Analysis of event-related potentials (ERPs) using signal processing tools has become extremely widespread in recent years. Nonstationary signal processing tools such as wavelets and time-frequency distributions have proven to be especially effective in characterizing the transient phenomena encountered in event-related potentials. In this paper, we focus on the analysis of event-related potentials collected during a psychological experiment in which two groups of subjects, spider phobics and snake phobics, are shown the same set of stimuli: a blank stimulus, a neutral stimulus, and a spider stimulus. We introduce a new information theoretic distance measure on the time-frequency plane based on Rényi entropy to discriminate between the different stimuli. The results illustrate the effectiveness of information theoretic methods combined with time-frequency distributions in differentiating between the two classes of subjects and the different regions of the brain.

Keywords: Time-frequency distribution, Event-Related Potentials, Rényi entropy, Divergence Measure

S. Aviyente is with the Department of Electrical and Computer Engineering, Michigan State University, East Lansing, MI 48824. L. W. Brakel, R. K. Kushwaha, M. Snodgrass and H. Shevrin are with the Ormond and Hazel Hunt Event-Related Potential Laboratory, directed by H. Shevrin and supported in part by funds contributed by Robert H. Berry, Department of Psychiatry, University of Michigan, 900 Wall St., Riverview Building, Ann Arbor, MI 48105. W. J. Williams is with the Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI 48105. E-mail: [email protected], [email protected], [email protected], [email protected], [email protected], [email protected].
I. Introduction
Event-related potentials (ERPs) are transient signals which appear as voltage deviations in the electroencephalogram (EEG). They are caused by external stimuli or by cognitive processes triggered by external events. ERPs have found numerous applications in clinical neurophysiology and psychiatry [1], [2], [3], [4]. This is because their recording is non-invasive and accurate, and they have consistently been shown to be an indicator of brain function and its abnormalities. The clinical applications of ERPs could be significantly extended if they could be interpreted more accurately and effectively. This requires the development of novel signal processing methods. In recent years, there has been tremendous growth in applying signal processing techniques such as independent component analysis, wavelet and time-frequency methods for separating the source signals and extracting useful information about the underlying brain activity [2], [3], [4], [5]. Event-related potentials, like most other real-life signals, are nonstationary and thus are best tackled with nonstationary signal analysis techniques such as time-frequency distributions (TFDs) and wavelet analysis. Analysis of brain responses to subliminal stimuli, such as the types of stimuli we often encounter in psychiatry, is particularly challenging since the response is subtle compared to that evoked by auditory or visual stimuli and thus requires powerful mathematical techniques for a full understanding. In recent years, there has been an interest in using information theoretic methods to extract information from event-related potentials [6], [7], [8]. The method presented in this paper is a new approach in that direction in that it combines information theoretic methods with nonstationary signal processing tools. More specifically, we try to extract information about how phobic and neutral stimuli are processed in the brain by two groups of subjects, spider phobics and non-phobics, at different regions of the brain. For this purpose, we analyze the ERP data recorded at two electrodes, Cz, the central electrode, and Oz, the occipital electrode, using a time-frequency based method. We introduce a new information theoretic distance measure to differentiate between the pre-stimulus and post-stimulus activity at the two electrodes for the two groups of subjects. The distance measure is based on Rényi entropy and its application to the time-frequency distribution.
In Section II, the background on Rényi entropy and its application on the time-frequency plane will be briefly summarized. Section III will review the basics of information theoretic distance measures and will introduce a Rényi entropy based distance measure for differentiating between time-frequency surfaces. We will apply this measure to the data set collected during the experiment. The results will be summarized in Section IV, where an analysis of variance will be applied to the computed distance values to quantify the performance of this new discrimination measure. Finally, Section V summarizes the results and discusses the major contributions of this paper.

II. Rényi Entropy and its Application on the Time-Frequency Plane

In this section, the basics of Rényi entropy and its application on time-frequency distributions will be reviewed.

A. Definition of Rényi Entropy

Let P = (p_1, p_2, ..., p_n) be a finite discrete probability distribution, that is, p_k ≥ 0 for k = 1, 2, ..., n and W(P) = \sum_{k=1}^{n} p_k = 1. The amount of uncertainty of the distribution P, that is, the amount of uncertainty concerning the outcome of an experiment, is called the entropy of the distribution P and is usually measured by the quantity H[P] = H(p_1, p_2, ..., p_n), introduced by Shannon and defined by [9]:

H(p_1, p_2, \ldots, p_n) = \sum_{k=1}^{n} p_k \log_2 \frac{1}{p_k}    (1)

The entropy functional satisfies the following set of postulates:
1. H(p_1, p_2, ..., p_n) is a symmetric function of its variables.
2. H(p, 1 − p) is a continuous function of p for 0 ≤ p ≤ 1.
3. H(1/2, 1/2) = 1.
4. H(tp_1, (1 − t)p_1, p_2, ..., p_n) = H(p_1, p_2, ..., p_n) + p_1 H(t, 1 − t) for any distribution P = (p_1, p_2, ..., p_n) and for 0 ≤ t ≤ 1.
5. Mean value condition:

H(P \cup Q) = \frac{W(P)\, H(P) + W(Q)\, H(Q)}{W(P) + W(Q)}

where W(P) + W(Q) ≤ 1 and P ∪ Q = (p_1, p_2, ..., p_n, q_1, q_2, ..., q_m).
The characterization of measures of entropy becomes much simpler if we consider these quantities as defined on the set of generalized probability distributions. In [10], Rényi introduced an alternative axiomatic derivation of entropy based on incomplete probability mass functions P = (p_1, p_2, ..., p_n) such that W(P) = \sum_{k=1}^{n} p_k ≤ 1. He observed that the Shannon entropy

H[P] = -\frac{1}{W(P)} \sum_{k=1}^{n} p_k \log_2 p_k    (2)

uniquely satisfies the set of postulates given above. Extending the arithmetic mean value condition yields generalized entropies closely resembling Shannon's. Considering the generalized mean value condition

H^R(P \cup Q) = f^{-1}\left( \frac{W(P)\, f[H^R(P)] + W(Q)\, f[H^R(Q)]}{W(P) + W(Q)} \right)    (3)

with f a continuous monotone function, Rényi demonstrated that just two types of functions f are compatible with the other four postulates. The first is the class of linear functions, which yield the arithmetic mean and the Shannon entropy. The second, f(x) = 2^{(α−1)x} with α > 0 and α ≠ 1, yields the functional

H_{\alpha}^{R}(P) = \frac{1}{1-\alpha} \log_2 \frac{\sum_{k=1}^{n} p_k^{\alpha}}{\sum_{k=1}^{n} p_k}    (4)

known as the Rényi entropy of order α. The Shannon entropy can be recovered as \lim_{\alpha \to 1} H_{\alpha}^{R} = H. Since the passage from Shannon entropy to the generalized class of Rényi entropies involves only the relaxation of the mean value property from an arithmetic to an exponential mean, H_{\alpha}^{R} behaves much like H.

Rényi entropy based measures have recently been applied to different areas of signal processing and information theory. Some major applications include image registration based on the joint Rényi entropy of the images, data clustering, and bounding error probabilities in classification problems [11], [12], [13], [14]. In all of these applications, Rényi entropy has proven to be a very robust measure. In this paper, we concentrate on applications of Rényi entropy on the time-frequency plane. Therefore, we need to define Rényi entropy for time-frequency distributions.
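Before moving to the time-frequency setting, the following sketch (an illustrative aside with arbitrarily chosen numbers) implements equation (4) for an incomplete distribution and shows numerically that the Rényi entropy approaches the Shannon entropy of equation (2) as α → 1.

```python
import numpy as np

def renyi_entropy(p, alpha):
    """Rényi entropy of order alpha for a (possibly incomplete) pmf, eq. (4)."""
    p = np.asarray(p, dtype=float)
    return np.log2(np.sum(p ** alpha) / np.sum(p)) / (1.0 - alpha)

def shannon_entropy_incomplete(p):
    """Shannon entropy of an incomplete pmf, eq. (2)."""
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log2(p)) / np.sum(p)

P = np.array([0.4, 0.3, 0.2])          # W(P) = 0.9 <= 1 (incomplete distribution)
print(renyi_entropy(P, 3.0))           # third-order Rényi entropy, as used in this paper
print(renyi_entropy(P, 1.0 + 1e-6))    # alpha close to 1 ...
print(shannon_entropy_incomplete(P))   # ... approaches the Shannon entropy of eq. (2)
```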
B. Rényi Entropy for Time-Frequency Distributions

The uncertainty of signals is studied indirectly through their time-frequency distributions, which represent the energy distribution of a signal as a function of both time and frequency. A time-frequency distribution C(t, ω) from Cohen's class can be expressed as [15]

C(t, \omega) = \frac{1}{4\pi^2} \int\!\!\int\!\!\int \phi(\theta, \tau)\, s\!\left(u + \frac{\tau}{2}\right) s^{*}\!\left(u - \frac{\tau}{2}\right) e^{j(\theta u - \theta t - \tau\omega)}\, du\, d\theta\, d\tau    (5)

where φ(θ, τ) is the kernel function and s is the signal. (All integrals are from −∞ to ∞ unless otherwise specified.) The kernel completely determines the properties of its corresponding TFD. Some of the most desired properties of TFDs are energy preservation and the marginals. They are given as follows and are satisfied when φ(θ, 0) = φ(0, τ) = 1 for all τ, θ:

\int\!\!\int C(t, \omega)\, dt\, d\omega = \int |s(t)|^2\, dt = \int |S(\omega)|^2\, d\omega

\int C(t, \omega)\, d\omega = |s(t)|^2, \qquad \int C(t, \omega)\, dt = |S(\omega)|^2    (6)

The formulas given above evoke an analogy between a TFD and the probability density function (pdf) of a two-dimensional random variable. In order to have the TFD behave like a pdf, one needs to normalize it properly. The main tool for measuring the information content, or uncertainty, of a given probability distribution is the entropy function. Williams et al. have extended measures of information from probability theory to the time-frequency plane by treating time-frequency distributions (TFDs) as density functions [16]. The well-known Shannon entropy for TFDs is defined as:

H(C) = -\int\!\!\int C(t, \omega) \log_2 C(t, \omega)\, dt\, d\omega    (7)

This measure is only defined when C(t, ω) > 0 for all t, ω. Therefore, it is valid only for positive distributions such as the spectrogram. In [16], Rényi entropy was introduced as an alternative way of measuring the complexity of TFDs, and the properties of this measure were established extensively in [17].
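As an illustration of equation (7) (an aside, not drawn from the paper's own analysis), the sketch below computes the spectrogram of a hypothetical test signal with SciPy, normalizes it so that it behaves like a two-dimensional pdf, and evaluates its Shannon entropy; the spectrogram is used because equation (7) requires a positive distribution, and the signal, window length, and overlap are arbitrary choices.

```python
import numpy as np
from scipy.signal import spectrogram

# Test signal: a linear chirp in white noise (arbitrary choice for illustration).
fs = 1000.0
t = np.arange(0, 1.0, 1.0 / fs)
s = np.cos(2 * np.pi * (50 * t + 100 * t ** 2)) + 0.1 * np.random.randn(t.size)

# Spectrogram: a positive member of Cohen's class (a smoothed Wigner distribution).
f, tt, S = spectrogram(s, fs=fs, nperseg=128, noverlap=120)

# Normalize so that the sampled TFD sums to one, then apply eq. (7) on the grid.
C = S / np.sum(S)
eps = np.finfo(float).tiny            # guard against log2(0) in empty cells
H_shannon = -np.sum(C * np.log2(C + eps))
print("Shannon entropy of the normalized spectrogram:", H_shannon, "bits")
```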
The extension of the Rényi entropy H_{\alpha}^{R}(P) to continuous-valued bivariate densities is:

H_{\alpha}^{R}(C) = \frac{1}{1-\alpha} \log_2 \frac{\int\!\!\int C^{\alpha}(t, \omega)\, dt\, d\omega}{\int\!\!\int C(t, \omega)\, dt\, d\omega}    (8)

However, for TFDs we prefer to modify this measure by first pre-normalizing the distribution by its energy, i.e. C_{normalized}(t, ω) = C(t, ω) / \int\!\!\int C(t, \omega)\, dt\, d\omega, and then applying the entropy measure on the time-frequency plane. Therefore,

H_{\alpha}(C) = \frac{1}{1-\alpha} \log_2 \int\!\!\int \left( \frac{C(t, \omega)}{\int\!\!\int C(u, v)\, du\, dv} \right)^{\alpha} dt\, d\omega    (9)

This measure is related to the original definition of Rényi entropy by H_{\alpha}^{R}(C) = H_{\alpha}(C) − \log_2 \|s\|_2^2, and thus H_{\alpha} is invariant to the energy of the signal. This is the main motivation for using the modified measure. Another main difference between TFDs and probability density functions is that TFDs are not necessarily positive. Most Cohen's class TFDs take on negative values and therefore cannot be interpreted strictly as densities of signal energy. Therefore, one should be careful when interpreting the results. In particular, these functionals can be interpreted as inverse measures of concentration. In this paper, all of the analysis is carried out for the third-order Rényi entropy, since it has been proved that α = 3 is the smallest integer value to yield a well-defined, useful information measure, i.e. \int\!\!\int C^{\alpha}(t, \omega)\, dt\, d\omega \ge 0 and the entropy is not equal to zero [17]. (Although there are some pathological signals for which this inequality does not hold, for the purposes of this paper it is always satisfied.)
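The following sketch (again an illustrative aside, with a randomly generated surface standing in for a real TFD) evaluates the energy-normalized Rényi entropy of equation (9) with α = 3 for a sampled TFD stored as a two-dimensional array, and verifies numerically that the measure is invariant to a rescaling of the signal energy.

```python
import numpy as np

def renyi_tfd(C, alpha=3.0):
    """Energy-normalized Rényi entropy of a sampled TFD, eq. (9).

    C is a 2-D array of TFD samples; the discrete sums stand in for the
    integrals over the time-frequency plane.
    """
    C = np.asarray(C, dtype=float)
    C_norm = C / np.sum(C)                         # pre-normalize by the energy
    return np.log2(np.sum(C_norm ** alpha)) / (1.0 - alpha)

# Any sampled TFD can be used here; a random surface is enough to show the
# energy invariance H_alpha(C) = H_alpha(k * C) for k > 0.
rng = np.random.default_rng(0)
C = rng.standard_normal((64, 128)) + 2.0           # mildly non-positive surface
print(renyi_tfd(C), renyi_tfd(10.0 * C))           # identical up to round-off
```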
III. Information Theoretic Distance Measures

Using entropy based distance functionals is a well-known discrimination method in signal processing. These functionals are known as divergence measures and are applied directly to statistical models describing the signals. They are useful in a variety of signal processing applications such as detection, segmentation, classification, recognition and coding [18]. The most general class of distance measures is known as Csiszár's f-divergence, which includes some well-known measures such as the Hellinger distance, the Kullback-Leibler divergence and the Rényi divergence [19]. For this class, the divergence between two probability density functions p_1 and p_2 can be expressed as

d(p_1, p_2) = g\!\left( E_1\!\left[ f\!\left( \frac{p_2}{p_1} \right) \right] \right)    (10)

where f is a continuous convex function, g is an increasing function and E_1 is the expectation operator with respect to p_1. For example, for f(x) = −log x and g(x) = x, d(p_1, p_2) becomes the Kullback-Leibler divergence:

d(p_1, p_2) = \int p_1(x) \log \frac{p_1(x)}{p_2(x)}\, dx    (11)
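A minimal discrete version of equations (10) and (11) is sketched below (an illustrative aside; the two example pmfs are arbitrary): the f-divergence is written for a generic convex f and then specialized to the Kullback-Leibler divergence with f(x) = −log x.

```python
import numpy as np

def f_divergence(p1, p2, f, g=lambda x: x):
    """Discrete Csiszár f-divergence, eq. (10): g( E_1[ f(p2 / p1) ] )."""
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    return g(np.sum(p1 * f(p2 / p1)))

# Two example pmfs on the same support (arbitrary values for illustration).
p1 = np.array([0.5, 0.3, 0.2])
p2 = np.array([0.4, 0.4, 0.2])

# Kullback-Leibler divergence, eq. (11): f(x) = -log(x), g the identity.
kl = f_divergence(p1, p2, f=lambda x: -np.log(x))
print(kl)   # equals sum(p1 * log(p1 / p2))
```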
A. Entropy Based Distance Measure for TFDs

Recently, Rényi entropy based distance functionals for time-frequency distributions have been proposed in [20]. The time-frequency divergence functions indicate the distance between different TFDs, and they can be interpreted as indicators of the difference in complexity of the distributions. These measures could be useful as a time-frequency detection statistic for comparing the distributions of a reference and an observation. A familiar way of deriving distance measures from entropy functionals is to form the Jensen difference. The Jensen difference of two probability densities p_1 and p_2 is defined by [21]

H\!\left( \frac{p_1 + p_2}{2} \right) - \frac{H(p_1) + H(p_2)}{2}    (12)

Positivity of this quantity relies on the concavity of the entropy functional H. Since the Rényi entropy H_{\alpha} is neither concave nor convex for α ≠ 1, this expression makes sense only for the Shannon entropy. The Rényi entropy is derived from the same set of axioms as the Shannon entropy, the only difference being the use of a more general exponential mean instead of the arithmetic mean. This observation inspires the modification of equation (12) from an arithmetic to a geometric mean, and the following quantity is obtained for two positive TFDs C_1 and C_2:

J_1(C_1, C_2) = H_{\alpha}\!\left( \sqrt{C_1 C_2} \right) - \frac{H_{\alpha}(C_1) + H_{\alpha}(C_2)}{2}    (13)

where (\sqrt{C_1 C_2})(t, \omega) = \sqrt{C_1(t, \omega) C_2(t, \omega)}. This quantity is obviously null when C_1 = C_2. Its positivity can be proven using the Cauchy-Schwarz inequality:

\left| \int\!\!\int [C_1(t, \omega) C_2(t, \omega)]^{\alpha/2}\, dt\, d\omega \right|^2 \le \int\!\!\int C_1^{\alpha}(t, \omega)\, dt\, d\omega \int\!\!\int C_2^{\alpha}(t, \omega)\, dt\, d\omega    (14)
and, since the log function is monotonically increasing, for α > 1

\frac{1}{1-\alpha} \log_2 \left| \int\!\!\int [C_1(t, \omega) C_2(t, \omega)]^{\alpha/2}\, dt\, d\omega \right|^2 \ge \frac{1}{1-\alpha} \log_2 \int\!\!\int C_1^{\alpha}(t, \omega)\, dt\, d\omega + \frac{1}{1-\alpha} \log_2 \int\!\!\int C_2^{\alpha}(t, \omega)\, dt\, d\omega    (15)

Thus H_{\alpha}(\sqrt{C_1 C_2}) \ge \frac{H_{\alpha}(C_1) + H_{\alpha}(C_2)}{2}, which proves that the distance measure is always positive. One of the disadvantages of this measure is that it diverges for disjoint TFDs, i.e. when C_1(t, ω) C_2(t, ω) = 0 for all t, ω, and that it cannot be applied to nonpositive TFDs, since \sqrt{C_1 C_2}(t, \omega) will not always be real. Due to this fact, Michel et al. restricted themselves to positive TFDs like the spectrogram. In the following section, we propose a method to generalize this distance measure to arbitrary TFDs.

B. Generalized Distance Measure for TFDs

In this section, we introduce a way of modifying the Rényi entropy based distance measure so that it is applicable to non-positive TFDs. The modification method can be summarized as follows (a code sketch of this procedure is given after Conjecture 1 below):
1. Find the minimum of C_1(t, ω) and C_2(t, ω) over all t and ω. If the minimum is greater than zero, go to step 3; otherwise, continue.
2. Add |min_{t,ω}(C_1(t, ω), C_2(t, ω))| to C_1(t, ω) and C_2(t, ω).
3. Normalize C_1(t, ω) and C_2(t, ω).
4. Apply the distance measure J_1 given by equation (13).

This distance measure is always positive, as shown in the previous section, and can be applied to any TFD. The addition of a positive constant to the TFD surface eliminates the problem of divergence, since C_1(t, ω) C_2(t, ω) can no longer be equal to zero over the whole time-frequency surface. It is also worth mentioning that although the addition of the constant introduces a bias in the distance measure, the distance is still equal to zero when the two distributions are the same. One important issue that needs to be checked is whether this modified distance measure preserves the ordering of distances regardless of the offset value that is added to the time-frequency surface. The following conjecture clarifies this question.
Conjecture 1: If d_1(C_1(t, ω), C_2(t, ω)) < d_1(C_1(t, ω), C_3(t, ω)), then d_2(C_1(t, ω), C_2(t, ω)) < d_2(C_1(t, ω), C_3(t, ω)), where d_1 and d_2 denote the modified distance measure computed with two different offset values.
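As referenced above, the following sketch (an illustrative aside; the function names and the random surfaces are ours, not the paper's) implements steps 1–4 of the modification procedure for sampled TFD arrays, using the third-order Rényi entropy of equation (9) and the Jensen-type distance J_1 of equation (13).

```python
import numpy as np

def renyi_tfd(C, alpha=3.0):
    """Energy-normalized Rényi entropy of a sampled TFD, eq. (9)."""
    C = np.asarray(C, dtype=float)
    C = C / np.sum(C)
    return np.log2(np.sum(C ** alpha)) / (1.0 - alpha)

def j1_distance(C1, C2, alpha=3.0):
    """Jensen-type Rényi distance of eq. (13) for positive, normalized TFDs."""
    geo_mean = np.sqrt(C1 * C2)
    return renyi_tfd(geo_mean, alpha) - 0.5 * (renyi_tfd(C1, alpha) + renyi_tfd(C2, alpha))

def modified_distance(C1, C2, alpha=3.0):
    """Steps 1-4 of Section III-B: offset non-positive TFDs, normalize, apply J1."""
    C1, C2 = np.asarray(C1, float), np.asarray(C2, float)
    m = min(C1.min(), C2.min())                    # step 1: joint minimum
    if m <= 0:
        C1, C2 = C1 + abs(m), C2 + abs(m)          # step 2: shift by the offset
    C1, C2 = C1 / np.sum(C1), C2 / np.sum(C2)      # step 3: normalize
    return j1_distance(C1, C2, alpha)              # step 4: Jensen-type distance

# Toy example with possibly non-positive surfaces standing in for real TFDs.
rng = np.random.default_rng(1)
C1 = rng.standard_normal((32, 64)) + 1.0
C2 = C1 + 0.5 * rng.standard_normal((32, 64))
print(modified_distance(C1, C1))   # zero (up to round-off) for identical surfaces
print(modified_distance(C1, C2))   # distance between the two different surfaces
```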