Data mining and EEG

Arthur Flexer
The Austrian Research Institute for Artificial Intelligence
Schottengasse 3, A-1010 Vienna, Austria
[email protected]

Abstract

An overview of Data Mining (DM) and its application to the analysis of EEG is given by (i) presenting a working definition of DM, (ii) motivating why EEG analysis is a challenging field of application for DM technology and (iii) by reviewing exemplary work on DM applied to EEG analysis. The current status of work on DM and EEG is discussed and some general conclusions are drawn.

1 Introduction

The purpose of this review article is to give an overview of Data Mining and its application to the analysis of electroencephalogram (EEG) data. The intended audience is people with an interest in DM and EEG, with both statistical and medical backgrounds. The review is structured as follows: In Sec. 1.1 we will give a working definition of Data Mining. We will not try to give a rigorous definition of DM since this is not really possible given how DM intersects with statistics and other fields of science. We will rather focus on a description of DM in relation to statistics as well as to the problem of EEG analysis. This should help us in distinguishing work concerned with Data Mining in EEG from the large quantity of work published about EEG analysis in general. In Sec. 1.2 we try to motivate why EEG is a challenging area of application for DM technology. In Sec. 2 we review exemplary work on DM and EEG based on our working definition of DM. In Sec. 3 we discuss the current status of work on DM and EEG presented in the review and draw some general conclusions.

1.1 A working definition of Data Mining

According to [Fayyad et al. 1996a], DM is one part of Knowledge Discovery in Data Bases (KDD). Where KDD refers to the overall process of discovering useful knowledge from data, DM refers to the application of algorithms to extract patterns from data. The whole process of KDD includes DM but also data cleaning and preprocessing, data reduction and

projection, incorporation of prior knowledge and also the proper validation and interpretation of the results. KDD and DM are seen to be related to the fields of machine learning, statistics, pattern recognition, artificial intelligence, data bases and visualization amongst others. From a statistical point of view, KDD is closely related to exploratory data analysis (EDA). However, the unifying goal in KDD and DM is the extraction of knowledge from data in the context of large data bases. [Hand 1999] also states that statistics and DM are intersecting disciplines. But he also claims that in statistics there is a certain emphasis on rigour and "a tendency to require proof that a proposed method will work prior to the use of that method", whereas DM has a more experimental attitude. DM therefore often "produced methods which apparently work, even if they cannot be (or have not yet been) proven to work". Related to this difference is the fact that whereas in "modern statistics the model is king" and the computation (model selection criteria) is only secondary, in DM the algorithm plays a much more central role due to its proximity to computer science and machine learning. Similar opinions about DM and its relation to statistics can be found in numerous other publications on the subject (see [Elder & Pregibon 1996], [Glymour et al. 1997] and [Witten & Frank 1999] amongst others). Our loose definition of DM, which we will use throughout the article, will therefore be: DM is the application of algorithms to extract patterns from large data bases. DM emphasizes the central role of algorithms rather than models and uses methods from statistics, machine learning and other related fields.

1.2 EEG analysis as a challenging field of application

After having distinguished DM from other related fields (especially statistics), we can now turn to our chosen area of application: the analysis of EEG. The recording of the human electroencephalogram (EEG) is a non-invasive method to record electric brain potentials from the human scalp via a set of electrodes. Most of the available data in human EEG is believed to be generated by sources in the neocortex. It is therefore a measure of the collective interaction of masses of neocortical neurons plus some subcortical structures. It is beyond the scope of this article to review what is known about the origin of EEG as well as its relation to human behaviour (see [Posner & Raichle 1994] and [Nunez 1995] for recent overviews). It should suffice to say that analysis of EEG is a way to study the function of the brain by monitoring its neuro-electric correlates

under diverse circumstances. EEG studies are concerned with all kinds of human behaviour, from active cognition to the description of human sleep. From a technical point of view, EEG is a multi-dimensional signal since it is usually recorded via an array of electrodes. The number of electrodes used for recording varies from just a few to around 20 electrodes which are arranged according to the international 10-20 system [Jasper 1958]. EEG studies with up to 128 electrodes have also been reported [Johnson & Hamm 2000]. EEG is a temporal real valued signal; nevertheless some studies deal with EEG at single points in time. It is also common practice to discard temporal information by various forms of integration along the time axis. In most studies EEG is recorded from a number of test subjects in a standardized situation and the goal is to find a common description of the EEG recordings across all test subjects (or distinct subclasses of test subjects, e.g. men and women). A lot of EEG studies are also concerned with the sheer discrimination between subclasses of test subjects based on EEG without any interest in further description. Although research concerning EEG analysis has been done for decades and within a wide range of disciplines (signal processing, pattern recognition, statistics, etc.), there are still a lot of open questions and there is room for new technical developments. The following two problems in EEG analysis have attracted a considerable amount of DM applications: analysis of sleep recordings and of evoked potentials. Sleep staging is one of the most important steps in sleep analysis. It is a very time consuming task consisting of classifying all 30 second pieces of an approximately eight hour recording into one of six sleep stages: wakefulness, S1 (light sleep), S2, S3, S4 (deep sleep), REM (rapid eye movement) sleep. A sleep recording is made with a minimum setting of four channels: electro-encephalogram (EEG) from electrodes C3 and C4 (two electrodes placed above the central cortex according to the international 10-20 system), electro-myogram (EMG) and electro-oculogram (EOG). In order to classify each 30 second segment of sleep according to the classical [Rechtschaffen & Kales 1968] (R&K) rules, the human scorer looks for defined patterns of waveforms in the EEG, for rapid eye movements in the EOG and for the EMG level. It is therefore a valuable goal to try and automate this process and quite some work has already been done in trying to replicate R&K sleep staging with diverse automatic methods (see [Hasan 1983] and [Penzel et al. 1991] for overviews). There is however a considerable dissatisfaction within the sleep research community concerning the very basis of R&K sleep staging [Penzel et al. 1991]: R&K is based on a predefined set of rules leaving much room for subjective interpretation;

it is a very time consuming and tedious task; it is designed for young normal subjects only; it has a low 30 second temporal resolution; it is defined in terms of six stages neglecting the micro-structure of sleep; it cannot be automated reliably due to the large inter-scorer variability and insufficient rules for staging. Some DM approaches towards sleep staging are therefore concerned with finding alternative descriptions of the sleep process. An evoked potential (EP) is the electro cortical potential measurable in the EEG before, during and after sensoric, motoric or psychic events. An EP is defined as the combination of the brain's electric activity that occurs in association with the eliciting event and 'noise', which is brain activity not related to the event together with interference from non-neural sources (see e.g. [Ruchkin 1988]). Since the noise contained in single trial EPs is significantly stronger than the signal, the common approach is to compute an average across several EPs recorded under equal conditions to improve the signal-to-noise ratio. Averaging will attenuate the noise and not the signal given that: (i) the evoked signal is time locked to the onset of the recording, (ii) the evoked signal is the same for each recorded EP, (iii) signal and noise linearly sum together to produce the recorded EPs, (iv) the noise contributions can be considered to constitute statistically independent samples of a random process. DM methods applied to EPs either deal with the analysis of EPs where the above assumptions do not hold (e.g. cognitive EPs) or they are used to classify or describe EPs after conventional averaging. Summing up, EEG analysis poses a number of challenges which make it an interesting field of application for DM technology:

- EEG comes in large data bases. E.g. one whole night recording of human sleep results in eight hours of multi-channel data sampled with up to 256 Hz.

- EEG signals are very noisy. Whereas the electrical background activity of the human brain is in the range of 1-200 µV, evoked potentials (EPs) have an amplitude of only 1-30 µV (see e.g. [Birbaumer & Schmidt 1990]).

- EEG signals have a large temporal variance. Whereas the spatial localization of EEG is already well researched, a lot of effort is still needed to take the between-subjects temporal variation into account.

- Analysis of EEG data requires the use of the full range of DM techniques. There are tasks for classification, regression, clustering, sequence analysis, etc.

2 Review of work on Data Mining in EEG analysis

After having given a loose working definition of Data Mining as well as having motivated why EEG analysis is a challenging and interesting field of application for DM, we will now review work falling into our focus on DM and EEG. Of course it would be best to make such a review following the usual division into classification, regression, clustering, etc. However, as with probably any field of application, work published on DM and EEG usually is a mixture of several categories of methods and algorithms. Instead of following the more strict categorisation above, we decided to structure the review into different approaches commonly distinguished within DM: neural networks, machine learning methods, discovery of sequential patterns, statistical approaches, fuzzy and knowledge based approaches. Even given our special focus on DM and EEG, there is still a substantial amount of work published in the literature. Rather than giving a comprehensive overview of all the work, we give examples for a range of different applications and methods that have been tried. At the end of this paper we will draw some general conclusions concerning DM and EEG.

2.1 Neural networks

The wealth of work published on EEG analysis with a DM flavour uses the method of artificial neural networks (NN). Historically, many concepts in NN computing have been inspired by biological neural networks. Usually NN are made of a number of local information processing units called neurons which are organized in a number of layers (e.g. input, hidden, output in the case of so-called multi layer perceptrons). These neurons are densely connected with each other and the connections are used to pass information between the local processing units. From a statistical perspective, NN are a semi-parametric approach to achieve general nonlinear mappings (this is true for both so-called multi layer perceptrons and radial basis function networks, two of the most widely used approaches) which can be used for both classification and regression. Recent accounts of NN technology from a statistical and pattern recognition perspective can be found in [Bishop 1995] and [Ripley 1996]. Although NN can be seen as belonging to the more general field of machine learning (see e.g. [Mitchell 1997]), it nevertheless makes sense to review NN approaches separately from others. After all, NN methods lack the emphasis on comprehensible explanation and on discrete instead of numeric values and therefore form a coherent subgroup within machine learning.
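
As a generic illustration (not the setup of any particular study reviewed below), such a multi layer perceptron with one hidden layer can be trained for classification of feature vectors along the following lines; the scikit-learn library and the synthetic data are our assumptions:

    import numpy as np
    from sklearn.neural_network import MLPClassifier
    from sklearn.model_selection import train_test_split

    # Synthetic stand-in for parametrised EEG feature vectors (e.g. 17
    # spectral parameters) with binary labels.
    rng = np.random.default_rng(1)
    X = rng.standard_normal((500, 17))
    y = (X[:, :3].sum(axis=1) > 0).astype(int)

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

    # One hidden layer: the classical input-hidden-output MLP described above.
    mlp = MLPClassifier(hidden_layer_sizes=(10,), max_iter=1000, random_state=1)
    mlp.fit(X_train, y_train)
    print("test accuracy:", mlp.score(X_test, y_test))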


2.1.1 Analysis of sleep recordings

In what follows we give examples of work trying to replicate classical methods of sleep staging, of work trying to find alternative descriptions of human sleep and of work about the detection of certain distinguished patterns in EEG sleep recordings.

[Schaltenbrand et al. 1993] report about work trying to automate [Rechtschaffen & Kales 1968] (R&K) based sleep staging. They use multi layer perceptrons (MLP) to classify all-night sleep recordings according to the R&K rules. They use spectral parameters like relative and total power, ratio of powers and mean frequencies of densities of EEG (two channels, one central, one occipital), EOG and EMG signals as inputs to the network. For every two second segment such a feature vector of dimension 17 is computed. Although the authors state that the computation of these features is only meaningful for short intervals (two seconds) in which the signals can be said to be stationary, they nevertheless average 15 of the feature vectors to get an estimate for the feature vector corresponding to the 30 second windows on which the R&K rules are based. Using these feature vectors, an MLP is able to reach an accuracy of about 80% on 12 new, previously unseen all-night sleep recordings. Employing a simplified version of ART2 [Carpenter & Grossberg 1987], NeoART, the system is able to reject input vectors which are too different from those used during training. Thereby artifacts and generally ambiguous epochs of the sleep recordings can be detected and final decisions can be handed over to human experts. NeoART is an algorithm closely related to the classical LBG-algorithm [Linde et al. 1980] for vector quantization as well as to the k-means clustering approach developed in the cluster analysis literature starting from [MacQueen 1967]. All these algorithms represent a set of data via a smaller set of prototypical codebook vectors or cluster centers. A new input pattern is rejected if it does not fall within any of the hyper-spheres of a given radius around the centers obtained by NeoART.

[Roberts & Tarassenko 1992] do not want to apply a fixed set of rules in order to automatically classify sleep stages but instead aim at giving some indication of the dynamics of sleep in humans. The authors want to develop an alternative analysis system not based on the R&K rules, especially since the latter have been defined for normal healthy subjects and tend to break down when applied to abnormal sleep patterns. Their database consists of nine whole-night sleep records of young, healthy adults from which only one channel of EEG and the EMG were used for further analysis. The EEG signal was parametrised using a tenth order Kalman filter for every one second segment which provides ten time-varying linear prediction coefficients as input for the network. The network was a

self organizing map (SOM) according to [Kohonen 1984] with a two dimensional output map of 10 × 10 units. SOM can be said to do clustering and at the same time to preserve the spatial ordering of the input data, reflected by an ordering of the cluster centroids in a one or two dimensional output space. The latter property is closely related to multidimensional scaling (MDS) in statistics. More on the relation of SOM to clustering and MDS can be found in [Balakrishnan et al. 1994], [Flexer 1997], [Bishop et al. 1998] and [Flexer 1999]. Using the Kalman coefficients for every one second segment as input vectors for the SOM, neglecting their ordering in time, the authors observed eight clearly visible clusters in the output map. Even more importantly, they were able to distinguish three typical trajectories between these clusters when the input vectors are presented in their temporal ordering. These trajectories correspond to the physiological states of wakefulness, rapid eye movement (REM) sleep and deep sleep. By adding an additional supervised layer to the SOM and using sequences of input vectors corresponding to the three above mentioned states (agreed upon by three human experts) for training, probabilities for being in one of the three states at a certain time could be computed. The use of the three probability parameters enabled the authors to quantify severely disturbed sleep which could not have been analysed with conventional systems. The authors claim that even the traditional R&K sleep scores on the much coarser 30 second time scale can be computed quite well from the probabilities of the three states (wake, REM and deep sleep).

[Jansen & Desai 1994] use multi layer perceptrons (MLP) and recurrent neural networks (according to [Elman 1990]) for the detection of K-complexes in simulated and real EEG data. A K-complex is a distinct waveform in the EEG observable during sleep and detecting its presence or absence is part of the R&K rules for sleep staging. A recurrent neural network is an MLP containing feedback loops, thereby introducing a kind of temporal integration into the model. Both types of network perform well on the simulated but not on the real data. Whereas the input for the MLPs consisted of magnitude and/or phase values obtained through Fourier transform for ten second segments, the input for the recurrent neural networks were the raw data samples without any parametrisation. The authors conclude that the poor performance of their networks on real contrary to simulated data, as well as on test sets contrary to training sets, is due to too large differences in the respective data that could not have been learned during the training. What could have contributed even more to the severe declines in performance (e.g. from 100% correct on the training data to 50% correct on the test data) is the fact that in all cases far too little data exists for training the rather large networks (e.g. using only about 60 input vectors of dimension 128 for training of an MLP).
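
Since SOMs are central to the approach of [Roberts & Tarassenko 1992] described above, a minimal SOM training loop may be sketched as follows (a generic sketch with synthetic data, not their implementation; grid size and learning schedule are our assumptions):

    import numpy as np

    def train_som(data, grid=(10, 10), epochs=20, lr0=0.5, sigma0=3.0, seed=0):
        """Minimal SOM: map d-dimensional inputs onto a 2-D grid of units."""
        rng = np.random.default_rng(seed)
        rows, cols = grid
        w = rng.standard_normal((rows * cols, data.shape[1]))          # codebook
        yy, xx = np.mgrid[0:rows, 0:cols]
        pos = np.column_stack([yy.ravel(), xx.ravel()]).astype(float)  # unit coords
        n_steps = epochs * len(data)
        for step in range(n_steps):
            x = data[rng.integers(len(data))]
            frac = step / n_steps
            lr, sigma = lr0 * (1 - frac), sigma0 * (1 - frac) + 0.5
            bmu = np.argmin(((w - x) ** 2).sum(axis=1))                # best match
            h = np.exp(-((pos - pos[bmu]) ** 2).sum(axis=1) / (2 * sigma ** 2))
            w += lr * h[:, None] * (x - w)                             # pull units
        return w

    # E.g. ten Kalman/LPC coefficients per one second EEG segment (synthetic here).
    data = np.random.default_rng(1).standard_normal((2000, 10))
    codebook = train_som(data)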

[Bankman et al. 1992] employ multi layer perceptrons (MLP) to detect K-complexes in either raw EEG-data or in a set of several features based on amplitude and duration measurements taken on significant points of the waveforms which are candidates for K-complexes. These features include information concerning slopes, peaks and troughs as well as measures of the sharpness of the waveforms. Best results (around 90% specificity and sensitivity) were obtained by using an MLP with the mentioned features. A linear discriminant based on the same features worked almost as well; the MLP based on the raw data did considerably worse. It is important to notice that in the two examples of the application of neural networks to the detection of K-complexes the proper parametrisation of the input seems to be the crucial point. Using a very specialist parametrisation of suspected K-complexes, [Bankman et al. 1992] are able to achieve very good results both with MLPs and a simple linear discriminant. [Jansen & Desai 1994] on the other hand use a rather simple parametrisation based on the Fourier transform with only limited success.
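
A sketch of amplitude, duration, slope and sharpness measures of the kind used by [Bankman et al. 1992] could look as follows (the concrete formulas are our simplifications, not theirs):

    import numpy as np

    def waveform_features(segment, fs=100.0):
        """Simple amplitude/slope/sharpness measures for a candidate waveform."""
        peak, trough = segment.max(), segment.min()
        d = np.diff(segment) * fs                     # first derivative (uV/s)
        return {
            "peak_to_peak": peak - trough,            # K-complexes are large
            "duration_s": len(segment) / fs,
            "max_slope": np.abs(d).max(),
            "sharpness": np.abs(np.diff(d)).max(),    # second-difference measure
        }

    # Hypothetical 1.5 s candidate waveform sampled at 100 Hz.
    seg = np.random.default_rng(2).standard_normal(150)
    print(waveform_features(seg))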

2.1.2 Analysis of evoked potentials

[Masic & Pfurtscheller 1993] compare different types of NN for the classification of non-averaged single trial EPs. It is their goal to use EEG recorded prior to cue triggered finger movements to predict the side of the movement. This is possible in principle due to specific neuro-electric patterns recordable from the sensorimotor cortex which are related to the preparation of movements. The three types of NN are multi layer perceptrons (MLP), a partially recurrent network (PR) and a Cascade-Correlation network (CC). The PR is an MLP containing feedback loops which introduce a kind of temporal integration into the model. The CC is an MLP which optimizes its own structure (number of units and connections) during training. Therefore it is an approach to do parameter estimation and model order selection at the same time. For the NN experiments the authors used EEG recorded via 14 channels from three test subjects. Around 200 single trials were recorded from each test subject and limited to a bandwidth of 10-12 Hz. Five samples from the second before the actual movement were extracted as representative input points for the networks. Therefore one input pattern for the MLP and the CC consisted of 70 values (14 electrodes × 5 time intervals). The input patterns for the PR networks consisted of only 14 values since they are able to directly deal with temporal information due to their feedback loops. The data for each subject were divided into a training and a test set. Separate NN models were trained for all three subjects using the respective training sets but the model orders of all three NN are optimized using data from the test set of subject 1. Whereas the results in terms

of classification accuracy on the test set for subject 1 range from 70 to 90% depending on the type of NN, results for subjects 2 and 3 are around 50% (i.e. chance level), again for all three types of NN. It seems obvious that the NN models were not able to cope with the inter-subject variance in the EP data.
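
The construction of the input patterns can be illustrated with a small sketch (array shapes follow the description above; the data is synthetic):

    import numpy as np

    # Hypothetical single trial: 14 electrodes, 5 representative samples taken
    # from the second before the movement. The MLP/CC input is the flattened
    # 14 x 5 = 70-dimensional pattern; the recurrent network instead receives
    # the five 14-dimensional samples one after the other.
    trial = np.random.default_rng(3).standard_normal((14, 5))
    mlp_input = trial.ravel(order="F")       # all electrodes of sample 1 first
    recurrent_inputs = [trial[:, t] for t in range(5)]
    print(mlp_input.shape, len(recurrent_inputs), recurrent_inputs[0].shape)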

2.2 Machine learning methods

Applications of "classical" machine learning methods (e.g. decision trees, inductive logic programming) to EEG analysis are rather rare, which might be due to the fact that this part of machine learning usually deals with symbols (discrete data) and not with numbers. The example described below even employs a transformation from numerical to symbolic data.

[Kubat et al. 1994] introduce two machine learning methods, induction of decision trees (ID3) and learning vector quantization (LVQ), to the problem of sleep classification in infants. Their work is concerned with sleep staging according to the rules by [Guilleminault & Souquet 1979]. ID3 by [Quinlan 1982] is an algorithm for the automatic induction (learning) of a decision tree from a set of examples. A decision tree can be seen as a hierarchical way to describe a partition of a data space. An advantage over other approaches to classification is that the terminology of trees is graphic and easy to comprehend. It is important to note that each branch introduces a new partitioning of the data set based on the values of one variable alone. ID3 therefore also achieves something like a ranking of the importance of the variables since more discriminant variables are used in branches high up in the hierarchy. LVQ, introduced by [Kohonen 1990], is a non-parametric method for classification. A randomly initialized codebook similar to a set of cluster centers partitions the data set and is iteratively improved. During each run through the data set, codebook vectors are moved towards nearby examples of their own class, and away from ones of other classes. See the respective chapters in [Ripley 1996] for a statistical view on both decision trees and LVQ. The authors employ a special combination of LVQ and ID3: the numerical data is transformed into symbolic data by quantizing it into two or three intervals; ID3 is used to build a decision tree on the symbolic data and LVQ is used only on those numerical variables for which the corresponding symbolic variables got high rankings by ID3; whereas LVQ is used for classification, the decision tree is used to obtain an approximate interpretation. The study uses polygraphic recordings from eight infants of six months of age. From 22 biological signals diverse parameters like the power in certain frequency bands or complexity measures (Hjorth parameters) are computed for the EEG signals, as well as parameters concerning heart rates, respiration or movements of the infants. These parameters are computed for every 30 second epoch. The average accuracy rate of correctly classifying the sleep

stages with a combined ID3-LVQ algorithm is around 75% using 15 of these parameters. This average is over all eight infants and 10 test runs each. The variance for each test subject was less than 3%.
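
The combination of discretization and tree induction can be sketched as follows (a loose analogue: scikit-learn's CART trees stand in for ID3, and the data, bin counts and labels are synthetic assumptions):

    import numpy as np
    from sklearn.preprocessing import KBinsDiscretizer
    from sklearn.tree import DecisionTreeClassifier

    # Synthetic stand-in for 30-second-epoch parameters (band powers, Hjorth
    # parameters, heart rate, ...) with sleep stage labels.
    rng = np.random.default_rng(4)
    X = rng.standard_normal((1000, 15))
    y = rng.integers(0, 4, size=1000)

    # Step 1 (as in the combined scheme above): quantize numeric variables
    # into a few intervals to obtain symbolic data for tree induction.
    X_sym = KBinsDiscretizer(n_bins=3, encode="ordinal",
                             strategy="quantile").fit_transform(X)

    # Step 2: induce a decision tree; its feature_importances_ play the role
    # of the variable ranking that decides which variables LVQ then uses.
    tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_sym, y)
    top = np.argsort(tree.feature_importances_)[::-1][:5]
    print("highest-ranked variables:", top)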

2.3 Discovery of sequential patterns

In classical evoked potential (EP) studies wave-shapes in single channels are interpreted as sequences of components, each appearing as peaks or troughs in the voltage versus time plot, with certain amplitudes and latencies. Given that these fixed components really do appear at defined latencies in all single EP trials, it is justified to average across all trials to improve the signal-to-noise ratio (see Sec. 1.2). But if there is e.g. considerable temporal variance between or even within test subjects, alternative ways of analysis have to be developed to cope with this problem. One possibility is to shift the focus to the multi-channel spatial distribution of EEG patterns and to analyse and describe their temporal ordering.

[Pascual-Marqui et al. 1995] view multi-channel records of EP data as sequences of momentary topographical potential distributions across the neocortex. Following [Lehmann & Skrandies 1984], a certain topographical configuration is believed to persist during an extended epoch and to change rapidly to new configurations. The time segments of stable topographical configurations are said to reflect the different steps or modes of information processing, i.e. so-called functional micro-states of the brain (see also [Lehmann 1987]). In what follows we will describe approaches focusing on this spatio-temporal nature of EPs. These approaches use information from all EEG channels simultaneously to achieve a segmentation of the multi-variate signals.

[Wackermann et al. 1993] view the EEG globally as a sequence of time-varying, spatially distributed electrical fields. Their space-oriented segmentation approach uses information from all recording channels at a single sample point simultaneously and searches for sequences of spatially stable EEG topographies. In a first step, primary segments are identified sequentially and their characteristics are computed. This is done by computing the so-called Global Field Power (GFP) at each sample point. The GFP is the square root of the sum of the squared potentials at one sample point, where the sum is divided by the number of electrodes. Only topographic maps at times of local maxima of the GFP are used for further analyses. The rationale behind the latter (see also [Lehmann et al. 1987]) is that EEG topographies tend to be stable around maximal GFP and change landscape around the time of minimal GFP. Thus, the maps with locally maximal GFPs serve as representatives of a longer period of multi-channel EEG. Next, the centroid locations of the positive and negative field areas of the maps with locally maximal GFPs are computed.
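
The GFP computation and the selection of maps at local GFP maxima can be sketched directly from the definition above (synthetic data; SciPy's peak finder is our choice):

    import numpy as np
    from scipy.signal import find_peaks

    def global_field_power(v):
        """GFP as defined above: the root of the sum of squared potentials
        across electrodes at each sample point, divided by their number."""
        return np.sqrt((v ** 2).mean(axis=1))

    # Hypothetical 42-channel EEG epoch of 3750 samples.
    eeg = np.random.default_rng(5).standard_normal((3750, 42))
    gfp = global_field_power(eeg)

    # Only the maps at local GFP maxima enter the segmentation.
    peaks, _ = find_peaks(gfp)
    maps_for_analysis = eeg[peaks]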

A segment is defined as a sequence of consecutive maps that do not differ by more than a preset threshold value with respect to these centroid locations. The procedure works in strict sequence without any look-aheads or backtracks. In a second step, the obtained centroid locations of positive and negative field areas representing the individual segments are clustered via a hierarchical agglomerative clustering algorithm. Finally, simple descriptive statistics like number of classes, number of segments assigned to classes, number of maximal GFP topographies or time covered by the segments belonging to one class are computed. These, together with the different segment classes represented by the centroid locations of their positive and negative field areas and the "time profiles" showing the sequences of segments, are the final results of the discussed approach. The algorithm is applied to 42-channel EEG obtained from six healthy male subjects during four 60 second recordings collected during a state of relaxed awakeness with eyes closed. From each of the 24 recordings, a continuous 16 second epoch without artifacts was selected, which gave four epochs of 3750 samples (15 seconds) for each of the six persons after FIR filtering to 2-16.5 Hz. The analysis is done separately for each of the epochs; reported inter-subject variances are between averages of four epochs corresponding to the test subjects and are quite high for most of the descriptive statistics. The number of maximal GFP maps used for the analysis was 293 on average with little variance (12.0 sd) across subjects. The average segment duration was 128.0 ms (52.6 sd) with an inter-quartile range of 91.6-161.5. The number of classes covering at least 90% of the total time varied between 2.75 and 5.36, suggesting that a small number of segment classes is sufficient to represent one whole epoch of data. The most frequent class covered 44.6% to 73.7% of the total time. Since for each of the 24 epochs different prototypical EEG topographies are obtained due to the separate analysis, an interesting question is whether these EEG maps are consistent within and across subjects. The authors give only an indirect answer by commenting on the agreement between the axis of the connected centroids of positive and negative field areas and the frequency of the respective class. They observe that across all 24 analysis epochs the most frequent class is dominated by anterior-posterior axes which also show the highest GFP values. For the less frequent classes, the agreement is less clear.

In a similar approach, [Pascual-Marqui et al. 1995] segment averaged EP multi-channel data into non-overlapping micro-states with variable duration and intensity dynamics by clustering the EP topographies obtained at each of the sample points regardless of their temporal ordering. Since the authors try to segment the data into classes with equal topographies regardless of any variation in the amplitudes, they use the so-called subspace pattern recognition method. This method by [Oja 1983] is a version of the classical k-means cluster method that works with angles between vectors rather than with distances.
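
A simple approximation of such angle-based clustering (not [Oja 1983]'s actual algorithm) is to normalize each topography to unit length, after which ordinary k-means groups maps by their angle regardless of intensity:

    import numpy as np
    from sklearn.cluster import KMeans

    # For unit vectors, squared Euclidean distance is a monotone function of
    # the angle between them, so k-means on normalized maps clusters equal
    # topographies of different intensities together.
    rng = np.random.default_rng(6)
    maps = rng.standard_normal((500, 21))                  # 21-channel EP maps
    unit_maps = maps / np.linalg.norm(maps, axis=1, keepdims=True)
    labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(unit_maps)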

This allows EP samples with equal topographies but of different intensities to be clustered into the same classes. Because of the assumption that micro-states should be stable for a certain amount of time until they change, EP topographies that are neighbours in time should be labelled alike, i.e. be classified to the same cluster. This is enforced by adding a penalty term for non-smoothness of the time series of labels to the functional that is minimized by the clustering method. The problem of deciding how many micro-states are appropriate for the data is solved by using resampling methods and a cross-validation residual variance estimator, i.e. the authors look for an increase in the value of the error function on unseen data while they let the number of micro-states grow. Since the usual error functions for clustering algorithms are monotonically decreasing with the number of clusters, it is our belief that the increases in the value of the error function observed by the authors are due only to the additional penalty term for non-smoothness. The approach is applied to auditory EPs where test subjects had to discriminate between frequent and rare tones. EEG is recorded via 21 electrodes from 12 normal adult subjects (six female, six male) and averages both across subjects and stimuli with a duration of half a second are computed. The optimal number of micro-states is five and the authors are able to show that these micro-states correspond to classical EP waveforms like the N1, P2 or P300 (the N1 is a distinct negative-going waveform that always appears 100 ms after the eliciting event; P2 and P300 are positive-going waveforms at 200 ms and 300 ms).

In contrast to the above described approaches, [Flexer & Bauer 2000] are concerned with the analysis of EPs prior to their averaging. In fact the authors try to find an alternative way of averaging long-lasting EPs for which the necessary assumptions (that EPs consist of stable waveforms with a fixed latency) for common averaging do not hold. Building on the hypothesis that only subsequences of long-lasting EPs with variable latency can be expected to be similar across EP trials and test subjects, they develop an alternative way of averaging across such subsequences. They use a combination of techniques from statistics and computer science (vector quantization, Sammon mapping and sequence alignment) to find topographical subsequences common to a set of EPs. The developed method of "multichannel piecewise selective averaging with variable latency" first discovers common subsequences across a set of EP trials and then computes an average across the subsequences. From a statistical perspective, both visualization of high dimensional sequential data and unsupervised discovery of patterns within a multivariate set of real valued time series are tackled. Visualization is achieved by discretizing the sequences via vector quantization and performing a Sammon mapping of the codebook.

Figure 1: (a) Example of a complete 8.5 second 22-channel EP recording. (b) The corresponding trajectory across codebook vectors depicted as ordered codebook numbers (y-axis) as a function of time (x-axis). Taken from [Flexer & Bauer 2000], see Sec. 2.3.

Instead of conducting a time-consuming search for common subsequences in the set of multivariate sequential data, a multiple sequence alignment procedure is applied to the set of one-dimensional discrete time series. The approach is tested on data that stems from 10 good and 8 poor female spatializers who were subjected to both a spatial imagination and a verbal task. The EEG is recorded via 22 electrodes and limited to frequencies between 0 and 8 Hz. The complete data base of EP recordings is divided into four groups: 319 EPs spatial/good, 167 EPs spatial/poor, 399 EPs verbal/good, 270 EPs verbal/poor, where each EP trial consists of 2125 samples, each being a 22 dimensional real valued vector. One complete 22-channel EP trial (duration is 8.5 seconds) is depicted in Fig. 1(a). The EP time series are vector quantized together by using all the EP vectors at all the sample points as input vectors to a clustering algorithm, disregarding their ordering in time. Then the sequence of the original vectors x is replaced by the sequence of the prototypical codebook vectors x̂. K-means clustering is used for vector quantization (see e.g. [Duda & Hart 1973, p.201]). The authors claim that a number of 64 codebook vectors is sufficient to preserve the main characteristics of the original signal. The set of 22-dimensional time series is thereby transformed to a set of discrete sequences of 64 codebook vectors x̂. For the 64 codebook vectors, a 64 × 64 distance matrix D_C is calculated. The sequences of codebook vectors can be visualized in a graph where the x-axis stands for time and the y-axis for the number of the codebook vector.
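
The vector quantization step and the codebook distance matrix D_C can be sketched as follows (scikit-learn's k-means and synthetic data are our assumptions):

    import numpy as np
    from scipy.spatial.distance import cdist
    from sklearn.cluster import KMeans

    # All 22-dimensional EP samples from all trials, pooled without regard to
    # their ordering in time (synthetic stand-in data).
    samples = np.random.default_rng(7).standard_normal((20000, 22))

    # Vector quantization with 64 codebook vectors via k-means.
    km = KMeans(n_clusters=64, n_init=4, random_state=0).fit(samples)
    codebook = km.cluster_centers_

    # Each trial becomes a discrete sequence of codebook indices ...
    trial = np.random.default_rng(8).standard_normal((2125, 22))
    sequence = km.predict(trial)

    # ... and the 64 x 64 distance matrix D_C later scores the subsequence
    # comparisons in the alignment step.
    D_C = cdist(codebook, codebook)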

Since in the course of time the trajectory moves only between codebook vectors that are close to each other in the 22 dimensional vector space, this neighbourhood should also be reflected by an appropriate ordering of the numbers of the codebook vectors. Such an ordered numbering results in smooth curves of the time vs. codebook number graphs and enables visual inspection of similarities between trajectories. This can be obtained by first performing a Sammon mapping [Sammon 1969] of the 22-dimensional codebook vectors to one output dimension and by then renumbering the codebook vectors according to a trivially achieved ordering of their one-dimensional representation (see [Flexer 1997] for more detail). An example of a trajectory across an ordered set of codebook vectors is given in Fig. 1(b). It should be noted that the ordering of the numbers of the codebook vectors is needed only for visualization and is not necessary for the subsequent sequence alignment. To discover common subsequences in the set of sequences made of 64 discrete symbols (corresponding to 64 codebook vectors x̂) the authors apply a so-called fixed length subsequence approach. Given two sequences E and F of length m, all possible overlapping subsequences having a particular window length W from E are compared to all subsequences from F. For each pair of elements the score taken from the distance matrix D_C is recorded and summed up for the comparison of subsequences. Successive application of this pairwise method allows for the alignment of more than two sequences. Such a fixed subsequence approach that is explicitly designed for multiple sequence alignment is given by [Bacon & Anderson 1986]. It computes a multiple alignment by iteratively comparing sequences to the multiple alignment obtained so far, always keeping just the L best alignments as an intermediate result. From the final L (L = 100 in this application) alignments the one with the minimal inter-alignment variance is chosen. The final alignment contains a subsequence of length W (W = 125, corresponding to 500 ms of EP in this application) from each of the original EPs. This set of subsequences corresponds to a set of beginning points b_i^min. Each b_i^min is the same for all d = 22 channels of the corresponding i-th EP. For each channel of EP an alternative selective average ŝ(t) can be computed: the duration T is equal to the length of the subsequences, W, and the beginning points of the averaging are the parameters b_i^min. The results of computing selective averages via beginning points b_i^min for both good and poor spatializers doing the spatial imagination task are given in Fig. 2 as sequences of topographical patterns. Each topography is a spherical spline interpolation of the 22 values at a single point in time of the selective averaging window. Given are topographies at 40, 80, ..., 440, 480 ms of the window for poor spatializers (top two rows) and good spatializers (lower two rows). It can be seen that for both groups there is one specific dominant topographical pattern visible, albeit at changing levels of amplitude. It is a pattern of more positive amplitudes at frontal to central regions relative to more negative amplitudes at occipital to parietal regions. This common topographical pattern is generally more negative for poor spatializers.
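
One pairwise step of such a fixed length subsequence comparison can be sketched as follows (a naive exhaustive search of our own, not the algorithm of [Bacon & Anderson 1986]):

    import numpy as np

    def best_alignment(seq_e, seq_f, D_C, W=125):
        """Compare all length-W subsequences of two discrete codebook sequences
        and return the pair of start points with the smallest summed distance."""
        m = len(seq_e)
        best = (None, None, np.inf)
        for i in range(m - W + 1):
            for j in range(m - W + 1):
                # Element-wise distances between aligned symbols, summed up.
                score = D_C[seq_e[i:i + W], seq_f[j:j + W]].sum()
                if score < best[2]:
                    best = (i, j, score)
        return best

    # Toy sequences over 64 symbols; D_C as computed above.
    rng = np.random.default_rng(9)
    D_C = rng.random((64, 64))
    e, f = rng.integers(0, 64, 300), rng.integers(0, 64, 300)
    print(best_alignment(e, f, D_C, W=50))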

Figure 2: Sequences of topographies for poor spatializers (top two rows) and good spatializers (lower two rows). Scale is from −4 to +4 µV. Taken from [Flexer & Bauer 2000], see Sec. 2.3.

The results obtained via the method of selective averaging have also been analysed by a series of analyses of variance (ANOVA). In accordance with the procedure of analysis of classical averages, selective averages of EPs are computed separately for each test subject and serve as inputs to the ANOVAs. A selective average ŝ(t) is computed separately for a test subject by averaging across all corresponding EPs. The first analysis of variance was computed to test for significance of the general differences between spatially and verbally evoked subsequences of topographies. A significant difference between spatial and verbal task in terms of their cortical activity distribution is obtained, which is of course well in line with the psychophysiological literature. The more negative amplitudes at occipital to parietal regions visible in Fig. 2 are as expected and both kinds of information processing are accompanied by a series of activations and in-activations. Taking this difference into account, data of good vs. poor spatial performers were analysed separately within task "spatial" and task "verbal". Since the two performance levels "good" and "poor" represent extreme groups of spatial ability selected by psychological testing, they should be discriminable in their EP correlates during the spatial but not during the verbal task. This is what the authors have found and others [Vitouch et al. 1997] have also reported and attributed to a higher investment of cortical effort visible as a more negative amplitude level of one similar pattern.

2.4 Statistical approaches

Of course there exist far too many statistical approaches to the analysis of EEG to cover even just a part of them in this review. We therefore want to restrict ourselves to work in a more statistical framework which is on one hand rather novel and on the other hand deals with a large EEG data base. After all, the application to large data bases is one of the central notions in our working definition of DM given in Sec. 1.1. This rather novel application of statistical methods to EEG analysis is the usage of Hidden Markov Models (HMM). HMMs [Rabiner & Juang 1986] allow analysis of non-stationary multi-variate time series by modeling both the probability density functions of locally stationary multi-variate data and the transition probabilities between these stable states. This probabilistic description of locally stable states plus the transition probabilities seems to be well suited to handle the temporal variation between and within test subjects visible in EEG data. HMMs have been applied with great success to all kinds of biological sequence modeling as well as speech processing (see [Charniak 1993], [Durbin et al. 1998] and [Baldi & Brunak 1998] for comprehensive reviews). Applications to EEG analysis are still a novelty.

[Flexer et al. 2000] report about an automatic continuous sleep stager based on probabilistic principles which overcome the known drawbacks of traditional Rechtschaffen & Kales (R&K) sleep staging (e.g. it is based on a predefined set of rules leaving much room for subjective interpretation; it has a low 30 sec temporal resolution; it is defined in terms of six stages neglecting the micro-structure of sleep). They develop a sleep stager based on Hidden Markov Models (HMM) which is objective by being automatic and has a 30-fold increased temporal resolution compared to R&K. Data consisted of nine whole night sleep recordings from a control group (total sleep time = 70.5 h, age ranges from 20 to 60, 5 females and 4 males). Only channels C3, C4 and EMG are used for further analysis. Five recordings are used to train the automatic sleep stager (training set), four are set aside to evaluate it (test set). Both sets are matched for sex and age. As already mentioned, HMMs allow modeling of locally stable states plus the transition probabilities between these stable states. In the context of sleep analysis, the locally stable states can be thought of as sleep stages. Although the classical HMM uses a set of discrete symbols as observation output, [Rabiner & Juang 1986] already discuss the extension to continuous observation symbols. Such a Gaussian Observation HMM (GOHMM) [Penny & Roberts 1998] has already been proposed as a model for EEG analysis. A GOHMM is defined over the first reflection coefficients and stochastic temporal complexity measures [Rezek & Roberts 1998], computed for the EEG signals (electrodes C3 and C4), and a measure of EMG power.

The EMG measure is normalized for each subject by subtracting the lower 10% percentile and dividing by the interquartile range to minimize differences in EMG level between subjects. To train the model, only data labelled as "extreme" R&K stages "wake", "rem" and "deep" are used in a semi-supervised way: extreme label data is only used to train a 3-state Gaussian Mixture Model which is needed to initialize a 3-state GOHMM. The GOHMM is trained on all available data from the training set and the probabilities of being in any of the 3 states are computed at each point in time. From the results 3 continuous probability plots which indicate the amount of wakefulness, REM and deep sleep with a 1 sec resolution (P(wake), P(rem) and P(deep)) are obtained. To produce an automatic system which replicates R&K sleep staging, the authors construct a classifier for 30 sec sections as suggested in [Roberts & Tarassenko 1992]. Mean values of P(wake), P(rem) and P(deep) are computed for each of the six human scored R&K stages. The minimum Euclidean distance between these mean values and the current probabilities defines classification of the current sample.

Table 1: R&K scores vs. GOHMM classification; GOHMM classification is given in percentages, separately for each sleep stage. Taken from [Flexer et al. 2000], see Sec. 2.4.

                         GOHMM
    R&K     wake    S1    S2    S3    deep   REM
    wake      86    11     0     0      0      3
    S1        52    22     6     6      0     13
    S2        13    12    14    14     11     37
    S3         2     2    17    20     51      8
    deep       1     0     4    14     81      0
    REM       32    16    13    12      1     26
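
A GOHMM of this kind can be sketched with the third-party hmmlearn package (our choice of library; the features, state count and training details here are synthetic assumptions, not those of [Flexer et al. 2000]):

    import numpy as np
    from hmmlearn.hmm import GaussianHMM

    # Synthetic stand-in for the per-second feature vectors (reflection
    # coefficients, complexity measures, normalized EMG power).
    rng = np.random.default_rng(10)
    features = rng.standard_normal((5000, 5))

    # A 3-state HMM with Gaussian observations, analogous to the GOHMM:
    # locally stable states plus transition probabilities between them.
    hmm = GaussianHMM(n_components=3, covariance_type="diag", n_iter=20)
    hmm.fit(features)

    # Posterior probabilities of being in each state at every time step,
    # playing the role of P(wake), P(rem) and P(deep).
    posteriors = hmm.predict_proba(features)
    print(posteriors.shape)  # (5000, 3)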

The trained GOHMM is evaluated using four whole night recordings from the test set. The newly obtained continuous sleep profiles are compared with traditional R&K scoring. R&K scores are taken as true scores and for each sleep stage separately the percentages of GOHMM classification into each of the six stages are given in Tab. 1. The authors expect that the GOHMM should be able to correctly classify data from the "extreme" R&K stages "wake", "rem" and "deep" which they used during initialization. Whereas this is true for "wake" and "deep" (86% and 81%), it is not for "rem" (26%). Probability plots plus R&K and HMM scoring for one whole night recording (subject from the test data group) are given in Fig. 3. Whereas the overall structure of sleep plus short periods of wakefulness are clearly visible in the probability plots, there is substantial mix up between REM sleep and S2 at the end of the night.

Figure 3: Whole night results of the automatic sleep stager by [Flexer et al. 2000] for one test subject, see Sec. 2.4; from top to bottom: R&K scoring, GOHMM scoring, P(wake), P(deep), P(rem).

The GOHMM has great problems discriminating REM sleep from wakefulness and stages 1, 2 and even 3. Whereas it is known that detection of REM is difficult from EEG alone, EMG should help in this respect. The authors state that close inspection of the EMG recordings reveals that discrimination even within subjects is very difficult. This seems to be due to a too coarse quantification resolution during recording which does not allow detection of the often very small drop in muscle tone in REM sleep. The two other "extreme" R&K stages "wake" and "deep" can be detected very satisfactorily. There is only minor mix up between wake and S1, and deep and S3. S1 is mainly mixed up with wake, S2 with REM but also with all other stages. Both phenomena might in part be due to the EMG problem described above. S2 has already been described as a "compound" state not easily discriminable from other states (see [Roberts & Tarassenko 1992]). S3 is mainly mixed up with deep sleep, which is as expected.

2.5 Fuzzy and Knowledge based approaches

[Karim & Jansen 1992] present a knowledge based approach to the detection of K-complexes in sleep recordings. Knowledge is represented in the form of so-called "specialists" which are basically a set of "if-then" rules for the detection of certain patterns in the EEG.

E.g. the "specialist" for K-complexes contains rules concerning minimum amplitude, maximum duration, a certain isolation of the pattern in the signal but also concerning preferred channels and the relationship between channels. "Specialists" for other EEG-patterns like sleep spindles or slow waves plus a resolution mechanism are used to decide whether a K-complex is observed or not. These rules are then refined in an interactive process. The decisions of the knowledge based system on a training set are presented to human experts on a graphic display. Based on this information the experts are able to refine the rules by changing the thresholds or by introducing fuzzy criteria for some of the rules. The data for this study consisted of overnight EEG recordings from six subjects (three channels: Fp1, F3 and C3). A total of 159 ten second epochs were selected for analysis, each containing at least one of the following events: K-complexes, paroxysmal delta bursts and isolated waves bearing some resemblance to genuine K-complexes. This data set was divided into a training set and test sets A and B. Whereas test set A contained data from the same subjects as the training set, test set B contained data from two totally different subjects. The rate of correctly classified epochs of K-complexes and No-K-complexes was 87% for the training set (84% for the K-complexes and 90% for the No-K-complexes). Whereas the overall accuracy of classification for test set A increased to 89.5%, it dropped to 54.5% for test set B. The authors state that a lot of misclassifications are due to the fact that the automatic system's preferred location of a K-complex is the frontal channel but in some subjects most of the K-complexes can be found in the central channel. Nevertheless the results seem to suggest that the automatic system has problems generalizing to new, previously unseen test subjects since there is a dramatic drop in performance for test set B.
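
A toy version of such a rule-based "specialist" might look as follows (rules and threshold values are purely illustrative, not those of [Karim & Jansen 1992]):

    import numpy as np

    def k_complex_specialist(segment, fs=100.0,
                             min_amp=75.0, max_dur=2.0, isolation=2.0):
        """A few if-then rules with hand-set thresholds, as a stand-in for
        the interactive, expert-refined rule base described above."""
        peak_to_peak = segment.max() - segment.min()
        if peak_to_peak < min_amp:                 # rule: minimum amplitude
            return False
        if len(segment) / fs > max_dur:            # rule: maximum duration
            return False
        # rule: isolation - the candidate must stand out from its background;
        # here: its amplitude must exceed `isolation` times the segment's std.
        if peak_to_peak < isolation * segment.std():
            return False
        return True

    candidate = np.random.default_rng(11).standard_normal(150) * 20.0
    print(k_complex_specialist(candidate))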

3 Discussion

The purpose of this review article is to give an overview of Data Mining and its application to the analysis of EEG. We tried to achieve this goal by (i) giving a working definition of DM, (ii) motivating why EEG analysis is a challenging field of application for DM technology and (iii) reviewing exemplary work on DM applied to EEG analysis. We would now like to discuss whether the challenges of EEG analysis which we listed in Sec. 1.2 are met by current work on DM and EEG. The emphasis on large data bases is one of the central notions in all definitions of DM (see Sec. 1.1). In principle EEG data comes in large data bases. This is true for both the length of the recordings (e.g. hours of data in the case of sleep recordings) as well as their dimensionality (the number of electrodes can be as high as 128).

At least all work on EEG recorded during sleep does deal with large data bases in the form of hours of recordings from numerous test subjects (see [Schaltenbrand et al. 1993] and [Roberts & Tarassenko 1992] in Sec. 2.1.1, [Kubat et al. 1994] in Sec. 2.2, [Flexer et al. 2000] in Sec. 2.4). The number of EEG channels in these studies is however limited to only one to three electrodes, with the exception of [Kubat et al. 1994] who work on 22 biological signals. Work on the discovery of sequential patterns reviewed in Sec. 2.3 by definition deals with temporal multi-variate signals. Consequently all work reviewed in Sec. 2.3 deals with signals of high dimensionality (21 to 42) and often considerable length (thousands of samples) and instances (hundreds of single EP trials). Therefore we conclude that the majority of work on DM and EEG deals with considerably large data bases, albeit a lot of them employ only very few EEG channels. EEG signals are generally very noisy; often the noise is significantly stronger than the signal. However, in none of the work on DM and EEG reviewed in this article is noise explicitly modeled. Instead various forms of temporal integration and/or parametrisation of the signals are employed which might help in improving the signal-to-noise ratio. Temporal integration is the computation of average parameters in windows of fixed length. Parametrisation includes computation of spectral parameters, Kalman filtering, Hjorth parameters, stochastic complexity, auto-regressive parameters and limiting the signals to certain bandwidths via FIR-filtering. All these techniques are used both for analysis of sleep and EP recordings. For the latter the computation of conventional averages prior to any application of DM technology is another means to cope with the noisy signals. Sometimes DM algorithms are even applied to raw EEG signals without any parametrisation (see the work on K-complexes by [Bankman et al. 1992] and [Jansen & Desai 1994] in Sec. 2.1.1). Up till now DM applications to EEG analysis cope with noisy signals by employing all kinds of "classical" signal processing techniques. Development of new concepts and algorithms to make DM techniques themselves more noise resistant is still missing in work on DM and EEG. Another important issue in EEG analysis is the amount of temporal variation between subjects or even within subjects (i.e. differences between single EP trials from one test subject). As has already been described in Sec. 1.2, the "classical" technique of averaging across a set of EP trials to improve the signal-to-noise ratio relies on the assumption that there exists a fixed EP waveform with very little or no temporal variation at all. If this assumption does not hold, suboptimal solutions include various forms of integration over time and/or finding representative points in time (see [Masic & Pfurtscheller 1993] in Sec. 2.1.2). Other approaches ignore temporal variation within subjects by applying "classical" averaging to data from individual test subjects and then trying to describe the temporal variation between subjects after this improvement of the individual signal-to-noise ratios.


Both [Wackermann et al. 1993] and [Pascual-Marqui et al. 1995] compute a range of measures (average duration of stationary signal segments, number of such different segments, etc.; see Sec. 2.3) from individual averages and then report between subject variance. [Flexer & Bauer 2000] even try to account for temporal variation on a single trial basis by finding common subsequences in a set of single EP trials and then averaging across these sufficiently similar subsequences. Analysis of sleep recordings is concerned with classification of segments of temporal signals into a predefined set of categories (sleep stages), where the resolution varies from one to 30 seconds. Only after this classification is variation between test subjects described, based on computation of a range of measures describing duration and succession of typical sleep periods. The application of Hidden Markov Models (see Sec. 2.4) can per se be seen as an attempt to account for temporal variation. After all, HMMs allow modeling of locally stable states plus transitions between them, both in a probabilistic way, thereby accounting for variation and uncertainty. It can be concluded that temporal variation of EEG signals is an issue in DM applications to EEG analysis, albeit it is often only considered on a between subjects level. EEG recordings from individual subjects are very often still averaged despite possible temporal variation. This might also be the cause of the frequently encountered problem of generalizing results from one set of test subjects to another (see [Jansen & Desai 1994] in Sec. 2.1.1, [Masic & Pfurtscheller 1993] in Sec. 2.1.2 or [Karim & Jansen 1992] in Sec. 2.5). As we have suggested in Sec. 1.2, all kinds of different DM techniques are being used for the analysis of EEG. What is also obvious is the fact that mainly DM techniques which are able to deal with numerical values are being applied (e.g. neural networks). DM algorithms which are intended for symbolic data and have to be adjusted to work with numerical data are naturally more rare (e.g. decision trees). As a final comment it can be said that EEG analysis indeed is a stimulating field of application for Data Mining methods and that there still are a lot of questions left unanswered. Promising lines of future research include: (i) explicit modeling of the noise contained in EEG recordings, (ii) finding solutions to the problem of temporal variation within and between test subjects, and (iii) considering the full topographical nature of EEG data (i.e. considering the different distances of electrodes on the human scalp).

Acknowledgements: The Austrian Research Institute for Artificial Intelligence is supported by the Austrian Federal Ministry of Education, Science and Culture. This research was partly supported by the Austrian Scientific Research Fund (FWF), project P14100-MED "Improving analysis of brain electric potentials to enable monitoring of temporally extended cognition".

References

[Bacon & Anderson 1986] Bacon D.J., Anderson W.F.: Multiple Sequence Alignment, Journal of Molecular Biology, 191, 153-161, 1986.

[Balakrishnan et al. 1994] Balakrishnan P.V., Cooper M.C., Jacob V.S., Lewis P.A.: A study of the classification capabilities of neural networks using unsupervised learning: a comparison with k-means clustering, Psychometrika, Vol. 59, No. 4, 509-525, 1994.

[Baldi & Brunak 1998] Baldi P., Brunak S.: Bioinformatics: The Machine Learning Approach, MIT Press, Cambridge/Boston/London, 1998.

[Bankman et al. 1992] Bankman I.N., Sigillito V.G., Wise R.A., Smith P.L.: Feature-based detection of the K-complex wave in the human electroencephalogram using neural networks, IEEE Transactions on Biomedical Engineering, 39, pp.1305-1310, 1992.

[Birbaumer & Schmidt 1990] Birbaumer N., Schmidt R.F.: Biologische Psychologie, Springer, Berlin/Heidelberg/New York/Tokyo, 1990.

[Bishop 1995] Bishop C.M.: Neural Networks for Pattern Recognition, Clarendon Press, Oxford, 1995.

[Bishop et al. 1998] Bishop C.M., Svensen M., Williams C.K.I.: GTM: The Generative Topographic Mapping, Neural Computation, Vol. 10, Issue 1, p.215-234, 1998.

[Carpenter & Grossberg 1987] Carpenter G.A., Grossberg S.: ART2: Self-Organization of Stable Category Recognition Codes for Analog Input Patterns, in Caudill M. & Butler C.(eds.), IEEE First International Conference on Neural Networks, San Diego, IEEE, pp.727-736, 1987.

[Charniak 1993] Charniak E.: Statistical Language Learning, MIT Press/Bradford Books, Cambridge/London, 1993.

[Duda & Hart 1973] Duda R.O., Hart P.E.: Pattern Classification and Scene Analysis, John Wiley & Sons, N.Y., 1973.

[Durbin et al. 1998] Durbin R., Eddy S., Krogh A., Mitchison G.: Biological sequence analysis, Cambridge University Press, 1998.

[Elder & Pregibon 1996] Elder J.F., Pregibon D.: A Statistical Perspective on Knowledge Discovery in Databases, in [Fayyad et al. 1996b], pp.83-116.

[Elman 1990] Elman J.L.: Finding Structure in Time, Cognitive Science, 14(2), 179-211, 1990.

[Fayyad et al. 1996a] Fayyad U.M., Piatetsky-Shapiro G., Smyth P.: From Data Mining to Knowledge Discovery: An Overview, in [Fayyad et al. 1996b], pp.1-36.

[Fayyad et al. 1996b] Fayyad U.M., Piatetsky-Shapiro G., Smyth P., Uthurusamy R.(eds.): Advances in Knowledge Discovery and Data Mining, MIT Press, Cambridge/Boston/London, 1996.

[Flexer 1997] Flexer A.: Limitations of Self-Organizing Maps for Vector Quantization and Multidimensional Scaling, in Mozer M.C., et al.(eds.), Advances in Neural Information Processing Systems 9, MIT Press/Bradford Books, pp.445451, 1997. [Flexer 1999] Flexer A.: On the use of self-organizing maps for clustering and visualization, in Zytkow J.M. & Rauch J.(eds.), Principles of Data Mining and Knowledge Discovery, Third European Conference, PKDD'99, Prague, Czech Republic, Proceedings, Lecture Notes in Arti cial Intelligence 1704, Springer, p.80-88, 1999. [Flexer & Bauer 2000] Flexer A., Bauer H.: Monitoring human information processing via intelligent data analysis of EEG recordings, to appear in: Intelligent Data Analysis, Volume 4(2), 2000. [Flexer et al. 2000] Flexer A., Sykacek P., Rezek I., Dor ner G.: Using Hidden Markov Models to build an automatic, continuous and probabilistic sleep stager, to appear in: Proceedings of the International Joint Conference on Neural Networks (IJCNN'2000), Como, Italy, 24-27 July, 2000. [Glymour et al. 1997] Glymour C., Madigan D., Pregibon D., Smyth P.: Statistical Themes and Lessons for Data Mining, Data Mining and Knowledge Discovery, 1(1), 11-28, 1997. [Guilleminault & Souquet 1979] Guilleminault C., Souquet M.: Sleep states and related pathology, in Korobkin R. & Guilleminault C., Advances in perinatal neurology, S.P. Medical and Scienti c Books, New York, pp 225-247, 1979. [Hand 1999] Hand D.J.: Statistics and Data Mining: Intersecting Disciplines, SIGKDD Explorations, Vol. 1, Issue 1, pp.16-19, 1999. [Hasan 1983] Hasan J.: Di erentiation of normal and disturbed sleep by automatic analysis, Acta Physiologica Scandinavia, Supplementum 526, 1983. [Jansen & Desai 1994] Jansen B.H., Desai P.R.: K-complex detection using multilayer perceptrons and recurrent networks, International Journal of Bio-Medical Computing, 37, pp.249 -257, 1994. [Jasper 1958] Jasper H.H.: The ten-twenty electrode system of the International Federation, Electroencephalography and Clinical Neurophysiology, 20, 371375, 1958. [Johnson & Hamm 2000] Johnson B.W., Hamm J.P.: High-density mapping in an N400 paradigm: evidence for bilateral temporal lobe generators., Clinical Neurophysiology, 111(3):532- 545, 2000. [Karim & Jansen 1992] Karim Meddahi J., Jansen B.H.: Knowledge Acquisition for Multi-Channel Electroencephalogram Interpretation, Arti cial Intelligence in Medicine, 4(5), 315-328, 1992. [Kohonen 1984] Kohonen T.: Self-Organization and Associative Memory, Springer, Berlin, 1984. [Kohonen 1990] Kohonen T.: Improved Version of Learning Vector Quantization, in International Joint Conference on Neural Networks, San Diego, IEEE, pp.545-550, 1990. [Kubat et al. 1994] Kubat M., Pfurtscheller G., Flotzinger D.: AI-Based Approach to Automatic Sleep Classi cation, Biological Cybernetics, 70, 443-448, 1994. [Lehmann & Skrandies 1984] Lehmann D., Skrandies W.: Spatial analysis of evoked potentials in man - A review, Prog. Neurobiol., vol. 23, pp.227-250, 1984. [Lehmann 1987] Lehmann D.: Principles of spatial analysis, in Gevins A.S. & Re-

mond A.(eds.), Handbook of Electroencephalography and Clinical Neurophysiology, Elsevier, Amsterdam, Vol. 1, pp.309-354, 1987. [Lehmann et al. 1987] Lehmann D., Ozaki H., Pal I.: EEG alpha map series: brain micro-states by space-oriented adaptive segmentation, Electroencephal. Clinical Neurophysiol., 67: 271-288, 1987. [Linde et al. 1980] Linde Y., Buzo A., Gray R.M.: An Algorithm for Vector Quantizer Design, IEEE Transactions on Communications, Vol. COM-28, No. 1, January, 1980. [MacQueen 1967] MacQueen J.: Some Methods for Classi cation and Analysis of Multivariate Observations, Proc. of the Fifth Berkeley Symposium on Math., Stat. and Prob., Vol. 1, pp. 281-296, 1967. [Masic & Pfurtscheller 1993] Masic N., Pfurtscheller G.: Neural Network Based Classi cation of Single- Trial EEG Data, Arti cial Intelligence in Medicine, 5(6), 503-513, 1993. [Mitchell 1997] Mitchell T.M.: Machine Learning, McGraw-Hill, London/New York/Tokyo, 1997. [Nunez 1995] Nunez P.L.(ed.): Neocortical Dynamics and Human EEG Rhythms, Oxford University Press, London/Oxford/New York, 1995. [Oja 1983] Oja E.: Subspace Methods of Pattern Recognition, Wiley, New York, 1983. [Pascual-Marqui et al. 1995] Pascual-Marqui R.D., Michel C.M., Lehmann D.: Segmentation of Brain Electrical Activity into Microstates: Model Estimation and Validation, IEEE Transactions on Biomedical Engineering, Vol. 42, No. 7, p.658-665, 1995. [Penny & Roberts 1998] Penny W.D., Roberts S.J.: Gaussian Observation Hidden Markov Models for EEG analysis, Technical Report, Imperial College, London, TR-98-12, 1998. [Penzel et al. 1991] Penzel T., Stephan K., Kubicki S., Herrmann W.M.: Integrated Sleep Analysis, with emphasis on automatic methods, in Degen R., Rodin E.A. (eds): Epilepsy, Sleep and Sleep Deprivation, Elsevier, pp. 177203, 1991. [Posner & Raichle 1994] Posner M.I., Raichle M.E.: Images of mind, Freeman, New York, 1994. [Quinlan 1982] Quinlan J.R.: Learning Ecient Classi cation Procedures and Their Application to Chess End Games, in Michalski R.S., et al.(eds.), Machine Learning: An Arti cial Intelligence Approach, Tioga, Palo Alto, CA, 1982. [Rabiner & Juang 1986] Rabiner L.R., Juang B.H.: An Introduction To Hidden Markov Models, IEEE ASSP Magazine, 3(1):4-16, 1986. [Rechtscha en & Kales 1968] Rechtscha en A., Kales A.: A Manual of Standardized Terminology, Techniques and Scoring System for Sleep Stages of Human Subjects, U.S. Dept. Health, Education and Welfare, National Institute of Health Publ. No.204, Washington, 1968. [Rezek & Roberts 1998] Rezek I.A., Roberts S.J.: Stochastic complexity measures for physiological signal analysis, IEEE Transactions on Biomedical Engineering, Vol. 44, No.9, 1998. [Ripley 1996] Ripley B.D.: Pattern Recognition and Neural Networks, Cambridge University Press, Cambridge, UK, 1996. [Roberts & Tarassenko 1992] Roberts S., Tarassenko L.: New Method of Auto-

mated Sleep Quanti cation, Medical and Biological Engineering and Computing, (5), 509-517, 1992. [Ruchkin 1988] Ruchkin D.S.: Measurement of Event-Related Potentials: Signal Extraction, in Picton T.W.(ed.), Human Event Related Potentials, Elsevier, Amsterdam/New York, 1988. [Sammon 1969] Sammon J.W.: A Nonlinear Mapping for Data Structure Analysis, IEEE Transactions on Comp., Vol. C-18, No. 5, p.401-409, 1969. [Schaltenbrand et al. 1993] Schaltenbrand N., Lengelle R., Macher J.-P.: Neural Network Model: Application to Automatic Analysis of Human Sleep, Computers and Biomedical Research, 26, pp. 157-171, 1993. [Vitouch et al. 1997] Vitouch O., Bauer H., Gittler G., Leodolter M., Leodolter U.: Cortical activity of good and poor spatial test performers during spatial and verbal processing studied with Slow Potential Topography, International Journal of Psychophysiology, Volume 27, Issue 3, p.183-199, 1997. [Wackermann et al. 1993] Wackermann J., Lehmann D., Michel C.M., Strik W.K.: Adaptive segmentation of spontaneous EEG map series into spatially de ned microstates, International Journal of Psychophysiology, 14, p.269-283, 1993. [Witten & Frank 1999] Witten I.H., Frank E.: Data Mining, Morgan Kaufmann, Los Altos/Palo Alto/San Francisco, 1999.