Processing of Multichannel Recordings for Data-Mining Algorithms


IEEE TRANSACTIONS ON BIOMEDICAL ENGINEERING, VOL. 54, NO. 3, MARCH 2007

Processing of Multichannel Recordings for Data-Mining Algorithms Oren Shmiel*, Tomer Shmiel, Yaron Dagan, and Mina Teicher

Abstract—Data mining, or knowledge discovery, is the computer-assisted process of digging through and analyzing large quantities of data in order to extract meaningful knowledge. Data mining methods are used in many studies to identify phenomena more quickly and accurately than human experts. One class of these methods was designed for dealing with time series data. However, when several channels of data are collected simultaneously, data mining algorithms encounter numerous difficulties, since channels may be measured in different units, may be recorded at different sampling rates, or may have completely different characteristics. Furthermore, as the size of these data increases, the amount of irrelevant data usually increases as well, and the process becomes impractical. Hence, in such cases, the analyst must be able to focus on the informative parts while ignoring the noise. These difficulties complicate the analysis of multichannel data as compared to that of single-channel data. This paper presents a useful technique for preprocessing multichannel data. Our technique supplies tools for coping with all the above-mentioned difficulties and prepares the data for further analysis using common algorithms, especially from the data mining field. The paper is organized as follows. After the introduction (Section I) we describe the state of the art (Section II), followed by the main section, methodology (Section III), which is divided into four steps (Sections III-B–III-E). The results are described in a separate section (Section IV). A discussion and conclusions of the proposed methodology are then given in Sections V and VI. Acknowledgements and the references follow.

Index Terms—Data mining, multichannel, multi-channel, multivariable, recordings, signal quantization, signal discretization.

I. INTRODUCTION

Manuscript received September 8, 2005; revised August 20, 2006. Asterisk indicates corresponding author. *O. Shmiel is with the Mathematics and Computer Science Department, Bar-Ilan University, Ramat Gan 52900, Israel (e-mail: [email protected]; [email protected]). T. Shmiel is with the Mathematics and Computer Science Department, Bar-Ilan University, Ramat Gan 52900, Israel. Y. Dagan is with the Institute for Fatigue and Sleep Medicine, Sheba Medical Center, Tel Hashomer 52621, Israel, and is affiliated with the Sackler School of Medicine, Tel-Aviv University, Tel-Aviv, Israel (e-mail: [email protected]). M. Teicher is with the Brain Research Center, Bar-Ilan University, Ramat Gan 52900, Israel. She is also with the Ministry of Science and Technology, P.O. Box 49100, Jerusalem 91490, Israel (e-mail: [email protected]). Digital Object Identifier 10.1109/TBME.2006.888826

The technological revolution has made large memory spaces cheaper and easier to acquire. This has encouraged investigators to collect and preserve every possible piece of recordable data. Although digital storages are very useful for preserving the history of relevant targets, they can also be used

Fig. 1. Multichannel data recorded in a sleep laboratory. 13 s of a few simultaneous recordings are presented in the above panels. Each panel represents a different recorded channel over the same time scale. The EEG panel displays the electrical brain activity in microvolts with a sensitivity of 50 µV/cm. The EKG panel displays the electrical activity of the heart in millivolts with a sensitivity of 1 mV/cm. The EMG-Submental panel displays the chin muscles activity in microvolts with a sensitivity of 50 µV/cm. The Pulse panel displays the heartbeat rate in beats-per-minute. The SaO2 panel displays the oxygen carried by hemoglobin in the blood in percent saturation of hemoglobin.

intelligently to explore interesting phenomena about these targets. This kind of analysis takes advantage of the huge amount of data to discover knowledge, and is known as knowledge discovery in databases (KDD) or as data mining. A major subgroup of analysis techniques deals with data that describe the behavior of an object over time. This type of data supplies dynamic information about how the behavior of the object changes as time goes by. Hence, analysis of such data can be very useful for learning trends in the characteristics of the object. However, in many cases single-channel data is not enough to characterize the behavior of an object over time, and several types of information are collected simultaneously using the same time scale. For example, if the object is a patient in a sleep laboratory, diagnosticians may be interested in simultaneous recording of many channels during night sleep, such as electroencephalography (EEG), electrocardiography (EKG), pulse rate, oxygen saturation (SaO2), eye movements [electrooculography (EOG)], activity of chin muscles (EMG), respiration-flow, etc. (Fig. 1). Although multichannel data supplies much more useful information about the target, its analysis is fraught with difficulties. Some of these are detailed as follows. 1) As the total number of recorded channels increases, so does the total amount of data to process. This might render common analysis methods impractical. 2) Channels may contain irrelevant information that we would like to ignore. Furthermore, focusing on specific behaviors of channels sometimes leads to better analysis. For example, it is advantageous to ignore all the characteristics of the EMG channel except its amplitude.

0018-9294/$25.00 © 2007 IEEE


3) The units in which channels are measured may be conceptually different from one channel to another. Hence, it becomes more difficult to detect cross rules which relate information from different channels. 4) The resolution at which channels are sampled may differ from one channel to another; e.g., the EEG channel may be sampled at 200 Hz whereas the pulse channel may be sampled at 1 Hz. Hence, information channels are represented by significantly asymmetric amounts of data, and this is usually undesirable for data-mining algorithms. 5) Sometimes, a recorded channel cannot be interpreted as a set of single values, but as values in some multidimensional space. For instance, a channel can be created by sampling the location of an object in a 2-D plane. Nevertheless, in cases 3)–5) we still want our analysis to handle all channels together, although they are conceptually different. All the difficulties involving multichannel data led us to look for a processing technique in which all recorded data are represented in a more convenient way. This representation should be very brief, compact, and independent of the total number of original channels. In other words, it should extract all relevant information from the original channels and ignore the irrelevant parts. The motivation for finding such a technique is to use it as a preprocessing stage for analyzing multichannel data. Once applied, common algorithms (e.g., data-mining algorithms) can easily be activated on the new simple representation, and explore the target knowledge.

II. STATE OF THE ART Representing multichannel (or multivariable) data in a simple way is an important issue in data analysis. One of the most well-known techniques for achieving this complexity reduction is called quantization (or discretization), which is the process of converting a continuous variable into a discrete variable. The discretized variable has a finite number of values, which is considerably smaller than the number of possible values in the empirical data set. The discretization simplifies the data representation, improves interpretability of results, and makes data accessible to more data mining methods ([1] and [2]). In decision trees, quantization as a preprocessing step is preferable to a local quantization process as part of the decision tree building algorithm ([3] and [4]). One could approach the discretization process by discretizing all variables at the same time (global), or each one separately (local). The methods may use all of the available data at every step in the process (global) or concentrate on a subset of data (local) depending on the current level of discretization. Decision trees, for instance, are usually local in both senses. Furthermore, the two following search procedures could be employed. The top-down approach [4] starts with a small number of bins, which are iteratively split further. The bottom-up approach [5], on the other hand, starts with a large number of narrow bins which are iteratively merged. In both cases, a particular split or merge operation is based on a defined performance criterion, which can be global (defined for all bins) or local (defined for two adjacent bins only). An example of a local criterion is presented in [5].
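As a concrete illustration, the equal-width and maximum-entropy schemes discussed above can be sketched in a few lines of Python. This is our own sketch, not code from the paper; the function names are ours. Maximum-entropy binning is approximated here by equal-frequency (quantile) binning, so that every bin holds roughly the same number of samples.

```python
import numpy as np

def equal_width_bins(x, n_bins):
    """Divide the range of x into n_bins bins of equal width [1]."""
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    # Interior edges only: digitize maps each sample to a bin index 0..n_bins-1.
    return np.digitize(x, edges[1:-1]), edges

def max_entropy_bins(x, n_bins):
    """Equal-frequency bins: each bin receives roughly the same number of
    samples, which (approximately) maximizes the entropy of the
    discretized variable [7]."""
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1))
    return np.digitize(x, edges[1:-1]), edges
```

For uniformly distributed data the two schemes coincide; they diverge as soon as the empirical distribution is skewed, which is exactly when the maximum-entropy criterion pays off.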


Fig. 2. Summary of analysis stages.

Discretizing variables separately assumes independence between them, an assumption that is usually violated in practice. However, this simplifies the algorithms and makes them scalable to large data sets with many variables. In contemporary data mining problems, these attributes become especially important. Four useful quantization methods are the following: • Equal Width Interval: By far the simplest and most frequently applied method of discretization is to divide the range of data into a predetermined number of bins [1]. • Maximum Entropy: An alternative method is to create bins so that each bin equally contributes to the representation of the input data. In other words, the probability of each bin for the data should be approximately equal [7]. • Maximum Mutual Information: In classification problems, it is important to optimize the quantized representation with regard to the distribution of the output variable. In order to measure information about the output preserved in the discretized variable, mutual information may be employed. Mutual information was used in the discretization process of the decision tree construction algorithm (ID3) in [8]. • Maximum Mutual Information with Entropy: By combining the maximum entropy and the mutual information approaches, one hopes to obtain a solution with the merits of both. In other words, one would like to retain balanced bins that turn out to be more reliable (prevent overfitting in this context) but simultaneously optimize the binning for classification. III. METHODOLOGY A. Summary of Methods Our technique is based on five independent stages. The following is a brief summary of all stages involved in the final analysis. 1) Every channel of the original data is projected into one or more projected-channels. 2) In each of the projected-channels, a set of critical-points is detected. 3) Once all sets are ready, they are merged into one large set which can be interpreted as a long series of events along the time scale. 
4) The next stage is activating the desired data-mining algorithm on this series, giving an output. 5) The output is translated into insights in the context of the original data. The flowchart in Fig. 2 illustrates all five stages of the analysis process. Note that the dashed arrows are used when the results yield no insights. In such a case, a redefinition of either how to create the projected-channels or how to detect the critical-points is carried out.
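The five stages above might be skeletonized as follows. This is a sketch under our own naming: the stage-2 detector is a deliberately trivial placeholder (it marks every value change) standing in for the methods of Section III-C, and stages 4–5 are left to the chosen data-mining algorithm.

```python
def project(channels, projection_functions):
    # Stage 1: apply every (id, func) pair to its channel.
    return [(cid, func(channels[cid])) for cid, func in projection_functions]

def detect_critical_points(projected):
    # Stage 2: stub detector -- mark a point whenever the value changes.
    points = []
    for cid, values in projected:
        for t in range(1, len(values)):
            if values[t] != values[t - 1]:
                points.append((t, cid, values[t]))
    return points

def merge(points):
    # Stage 3: merge all sets of critical-points along the time scale.
    return sorted(points)

# Stages 4-5 (mining the merged series and translating the output back
# into the context of the original data) depend on the chosen algorithm.
```

For instance, a hypothetical pulse channel `{"pulse": [60, 60, 80, 80]}` with the identity projection yields the single critical-point `(2, "pulse", 80)`.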


Fig. 3. Example of projection-functions and critical-points. Each of the two panels above shows a possible projected-channel. The top projected-channel (a) focuses on the direction of the trajectory and is measured in radians. Due to its cyclic range [0, 2π], dashed lines were used to connect equivalent values. The bottom panel (b) focuses on the velocity of the sampled object and is measured in cm/s. In both channels a safe-pass-factor was used and critical-points are marked by open circles (see Section III-C2).

B. Projection-Functions

A projection-function in our context is a function that maps a channel into another space. The resulting space represents a chosen projection of the original channel which is relevant for its analysis. Formally, it can be seen as a pair (id, func), where id is the identifier of the channel to be projected and func is the function to be applied to it. For example, if we define the set of projection-functions to contain only the pair (EEG, frequency), only the frequency of the EEG channel will be used in our future analysis. Generally, just as the FFT projects a signal from the time domain to the frequency domain, it is possible to use any projection-function for projecting the signal into another space. Usually, one would like to use several projections for a specific channel; thus we allow the existence of many pairs with the same id. In other cases, one would like to use the channel itself as a projected-channel. This can simply be fulfilled by pairing the identity function with this channel. It is very important to understand the intuition behind the projection-functions. This set represents a way of thinking about the general measures that are important for the future analysis. All information from the data-channels that is not a result of these functions is excluded from the next processing stages. Hence, this stage can be seen as a filter for the relevant measures in general. Fig. 3 illustrates two possible projection-functions to be applied to a channel which samples the location of an object over time. The first function projects the channel onto its direction, and the second one onto its velocity. Using these two functions for this channel means that all future analyses will be based solely on the direction and the velocity of the object, ignoring all other types of information such as curvature, acceleration or absolute location.

C. Detection of Critical-Points

In the previous stage, all projected-channels were created from the original channels, representing the characteristics we want our analysis to focus on. In this stage we independently transform these projected-channels into a significantly smaller set of points (called the critical-points), which is considered the behavioral summary of all channel data.
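The projection stage of Section III-B can be sketched for the direction and velocity projections of Fig. 3 as follows. The function names and the sample trajectory are hypothetical; the set of projection-functions is simply a list of (id, func) pairs, with the same id allowed to repeat.

```python
import math

# Hypothetical 2-D trajectory channel: a list of (x, y) location samples.
trajectory = [(0.0, 0.0), (1.0, 0.0), (2.0, 1.0), (2.0, 2.0)]

def direction(channel):
    # Project each step onto its direction, as radians in [0, 2*pi).
    return [math.atan2(y1 - y0, x1 - x0) % (2 * math.pi)
            for (x0, y0), (x1, y1) in zip(channel, channel[1:])]

def velocity(channel, dt=1.0):
    # Project each step onto its speed (distance units per second).
    return [math.hypot(x1 - x0, y1 - y0) / dt
            for (x0, y0), (x1, y1) in zip(channel, channel[1:])]

# The set of projection-functions: pairs (id, func). The same id may
# appear more than once; the identity function keeps a channel as-is.
projection_functions = [
    ("trajectory", direction),
    ("trajectory", velocity),
]
projected_channels = [(cid, f(trajectory)) for cid, f in projection_functions]
```

All later stages see only these projected-channels; curvature, acceleration, and absolute location are filtered out exactly as the text describes.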

1) Definition of a Critical-Point: A critical-point in our context is a time-point at which an important change occurs in some property of a projected-channel. Formally, a critical-point can be seen as a triple (t, id, r), where t is the time at which the change occurred, id is the identifier of the related projected-channel, and r holds information about the change and the related property. An example of r could be a change in the acceleration of the object from 1 cm/s² to 2 cm/s². The key issue is how to generate the set of critical-points. On one hand they should represent all trends in the data, but on the other hand, a large quantity of them can significantly increase the total runtime of the next analysis stages. There are many ways of implementing the concept of critical-points. One useful way is based on transitions between ranges of values. Whereas in common signal quantization each sampled time-point is associated with a specific bin of signal values, here we save only the time-points at which a transition from one bin to another occurs. Following this idea, data are represented in a more compact way, which reduces their size and serves as a better basis for future detection of repeating trends. 2) Transitions Between Ranges (TBR): Given a projected-channel, our aim is to define criteria which will be used to detect all points along the time scale that seem to represent similar behavior of this channel. The following describes a method consisting of a few consecutive steps for each such channel. As a first step, the distribution of all potential values for the projected-channel is estimated. This could be done either by statistical computations or by empirical processing of the real data values. As a second step, these values are divided into subranges representing clusters of this channel. One simple division of the potential range would be to break it into subranges of the same size.
More complex divisions can be based on the estimated distribution of values for this channel, and will be described later. In the last step, the projected-channel is used to generate a set of critical-points that represents special time points at which the channel value alternates between the predefined subranges. In other words, we translate each projected-channel into a series of points along the time scale where there were transitions between channel clusters. In order to demonstrate the above steps we used the projected-channel shown in Fig. 3(b). This channel holds the velocity of an object at each time, and we would like to translate it into a set of critical-points as described above. By visual inspection, we saw that all possible velocity values are between 0 cm/s and 12 cm/s. Then we counted the total number of appearances of each such possible value, thus obtaining the estimated distribution shown in Fig. 4. At this stage, we needed to divide the potential range of velocities into subranges. We could do this either by a simple division into fixed-width intervals [Fig. 4(a)] or by considering the internal structure of this distribution [Fig. 4(b)]. Although visual inspection of the distribution may be useful for setting the borders between the ranges, one can use various clustering algorithms for identifying clusters ([9] and [10]). These clusters are considered as the desired ranges of values, and the critical-points are defined as the transitions between these ranges. Moreover, in some cases the original data contain multichannel recordings for each patient separately (see the application
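A minimal sketch of the basic TBR idea (without a safe-pass-factor) follows; the function name is ours. Given fixed subrange borders, a critical-point is marked only at time-points where the value's bin changes.

```python
import bisect

def tbr_critical_points(values, edges, channel_id):
    """Mark a critical-point at every time-point where the value crosses
    from one predefined subrange into another. `edges` are the interior
    borders between subranges, in increasing order."""
    bins = [bisect.bisect_right(edges, v) for v in values]
    # Record (time, channel id, (old bin, new bin)) on every bin change.
    return [(t, channel_id, (bins[t - 1], bins[t]))
            for t in range(1, len(bins)) if bins[t] != bins[t - 1]]
```

For a velocity trace over 0–12 cm/s with fixed-width 3 cm/s intervals, `edges` would be `[3.0, 6.0, 9.0]`, mirroring the division of Fig. 4(a).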


Fig. 4. Dividing the potential range into subranges. This figure demonstrates two possibilities to break down the potential range of a projected-channel into subranges. In the top panel (a), fixed-width intervals with a width of 3 cm/s were used. Hence, each subrange contains the same number of distinct values independent of the values themselves. In the bottom panel (b), a different division was made by considering both the distribution of these values and some “human logic”. For instance, the value 0 was put in a cluster of its own, as it represents the fact that the object is static. Intuitively, a velocity transition between 0 and 0.5 cm/s is conceptually different from a transition between 5.5 and 6 cm/s (this fact was ignored when the fixed-width intervals method was used).

for sleep research in Section IV-A). In these cases one should distinguish between channels whose distribution of values is common to all patients, and channels whose distribution of values may differ from patient to patient. The second case can be handled by defining the division into ranges relative to the distribution of a specific patient. For example, a critical-point may be defined each time the value exceeds the interquartile range of the distribution, which is specific to each patient. Safe-pass-factor: The decision to mark a critical-point each time the value alternates between predefined ranges may lead to undesired noise. Whenever the measurement comes near a boundary between two ranges, many back and forth transitions may occur (e.g., due to noisy measurements) without identifying any trend of the channel. To solve this difficulty we use an additional parameter called the safe-pass-factor, whose value is a real number between 0 and 1. The idea behind this parameter is that a transition between subranges is sometimes not enough to obtain a critical-point. Similar to Schmitt trigger logic ([11] and [12]), we require that the trend of the transition keep taking place even after the first time we detect it. In practice, this parameter determines how deep the channel should enter the target range to convince us that the transition is not noisy. If so, a critical-point is generated the first time we find this transition. For example, if we choose the safe-pass-factor to be 15%, a transition from one subrange into a target subrange [20, 60] is a critical-point only if the channel reaches (or passes) the value 26 within the target range, where 26 = 20 + 15% × (60 − 20). Intuitively, unless a safe-pass-factor had been used to generate the critical-points shown in Fig. 3(b), we would expect a noisy critical-point each time the speed crosses the 4 cm/s border (from either direction). However, in practice we disregard these points, as is shown in the figure.
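The safe-pass-factor logic might be sketched as follows. This is our own simplified implementation, not the paper's: like a Schmitt trigger, the current bin only updates once the value has penetrated the target subrange by `factor` of that subrange's width.

```python
import bisect

def tbr_with_safe_pass(values, edges, low, high, factor=0.15):
    """TBR with a safe-pass-factor: a transition into a new subrange
    yields a critical-point (time, old bin, new bin) only once the value
    has entered the target subrange by `factor` of its width.
    `edges` are interior borders; [low, high] is the full potential range."""
    bounds = [low] + list(edges) + [high]   # all subrange borders

    def bin_of(v):
        return bisect.bisect_right(edges, v)

    points, current = [], bin_of(values[0])
    for t, v in enumerate(values[1:], start=1):
        b = bin_of(v)
        if b == current:
            continue
        lo, hi = bounds[b], bounds[b + 1]
        width = hi - lo
        # Upward moves must reach lo + factor*width; downward, hi - factor*width.
        deep_enough = (v >= lo + factor * width) if b > current else (v <= hi - factor * width)
        if deep_enough:
            points.append((t, current, b))
            current = b
    return points
```

With the text's example (subranges [0, 20] and [20, 60], factor 15%), a value of 22 or 25 is ignored as boundary noise, while 27 crosses the 26 threshold and is marked.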


Cyclic range: In some cases the range we want to divide is cyclic by definition. For example, a projected-channel that projects the original channel onto its direction in radians has a cyclic range of [0, 2π]. If we divide this range into the four subranges [0, π/2), [π/2, π), [π, 3π/2), and [3π/2, 2π), we must allow transitions between the first and the last subranges. Thus, unlike regular ranges, cyclic ranges enable critical-points describing direct transitions between the first and the last subranges in both directions. Safe-distance-time: It is very common in some channels to have critical-points that appear in chunks along the time scale. Each such chunk contains a burst of these points, and sometimes we are not interested in all of them. For this purpose we use a parameter called the safe-distance-time. The idea behind it is that several types of critical-points cannot be generated within a short time of each other; the length of this interval is called the safe-distance-time. For example, an analyst of a complicated network may want to ignore all critical-points describing "Error connecting computer X" if they are preceded (by less than 5 min) by a critical-point describing "voltage drop in computer X." In such a way, the safe-distance-time parameter helps reduce the number of irrelevant critical-points, especially when they appear in bursts. Using transitions between ranges in Fig. 3: Fig. 3 shows not only the projected-channels but also the critical-points in both of them. These points were generated using the TBR criteria as described in Section III-C2. The potential range of the direction channel was divided into the subranges [0, π/2), [π/2, π), [π, 3π/2), and [3π/2, 2π). Moreover, the original range was assumed to be cyclic, meaning that a transition between [3π/2, 2π) and [0, π/2) is enabled in both directions (and is marked by a critical-point). In order to ignore noisy transitions, we used the safe-pass-factor parameter and set it at 20%. All these parameters led to the detection of the critical-points shown as small circles on the direction channel in Fig. 3.
Similarly, the generation of critical-points for the speed channel involved the following parameters. The potential range was divided into five subranges, where all units are given in millimeters per second. The range was not assumed to be cyclic, and the safe-pass-factor was set at 20%. As can be seen in the figure, in both examples there are many time-points at which the graph crosses a division border. If a safe-pass-factor had not been used, critical-points would have been generated at each of these time-points. 3) Using Patterns-Mining Inside a Projected-Channel: In complex channels the use of TBR might not be enough for hinting at the human logic behind the main characteristics of the channel. Therefore, in this model, each critical-point generated by the TBR method is not considered a final critical-point, but an intermediate transaction for detecting the critical-points. These transactions serve as input for specialized patterns-mining models that were designed for detecting patterns (Section III-E) (e.g., [13] and [14]). Now, each detected pattern is associated with a particular type of critical-point. Following this, a single critical-point represents a specific occurrence of a particular behavior-pattern. The revealed patterns can be seen as a second level of complexity above the basic transitions level. Hence, the more complicated the patterns are, the more detailed the logical meaning behind them. In Section IV-B we present
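The safe-distance-time filter might be sketched as follows. The rule format is our own assumption: each suppressible point type maps to the type that suppresses it and the minimal separation required.

```python
def apply_safe_distance_time(points, rules):
    """Drop a critical-point if a point of its suppressing type occurred
    less than `delta` time units before it.
    `points` is a time-sorted list of (time, point_type) pairs;
    `rules` maps point_type -> (suppressing_type, delta)."""
    last_seen = {}   # point type -> time of its most recent occurrence
    kept = []
    for t, ptype in points:
        rule = rules.get(ptype)
        if rule is not None:
            suppressor, delta = rule
            if suppressor in last_seen and t - last_seen[suppressor] < delta:
                last_seen[ptype] = t   # still remember it, but do not keep it
                continue
        last_seen[ptype] = t
        kept.append((t, ptype))
    return kept
```

Using the network example from the text: a connection error 3 min after a voltage drop is suppressed, while one 10 min later survives.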


a useful preprocessing for such channels based on revealing patterns of behavior that clarify the basic characteristics of these channels.

D. Merge of Critical-Points

In the first stage, all projected-channels were created. In the second stage, all critical-points were detected for each of the projected-channels independently. In this stage, all sets of critical-points are merged along the time scale in order to create the final summary of the multichannel data. This final representation of the data can be used as an input for known data-mining algorithms in order to explore rules and insights regarding the critical-points. Formally, the final representation of the multichannel data is an ordered set of triples of the form (t_i, c_i, r_i) along the time scale, where t_i is the time of the ith critical-point, c_i is the projected-channel in which the ith critical-point was marked, and r_i is the identifier representing the type of this critical-point. Intuitively, this translation of the data is like a formal description of it to the machine, from our point of view. Following this idea, a complete 10-min recording of five different channels could be summarized into one sentence such as: "At 15:28 the pulse increased rapidly to above 100 beats-per-minute; 2 min later, the object stopped walking, and after five more minutes his sugar-level decreased by 10%." When a large amount of raw data is available, descriptions like this can be useful for further analysis, as discussed in Section III-E.

E. Postprocessing Algorithms and Data-Mining

The term data mining (or knowledge discovery in databases) can be described as a nontrivial extraction of implicit, previously unknown, and potentially useful information from data. It uses machine learning, statistical and visualization techniques to discover and present knowledge in a form which is easily comprehensible to humans. Put simply, it is necessary to reduce the volume of data before processing them.
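The merge stage described in Section III-D reduces to an ordered merge of the per-channel point sets; since each set is already time-sorted, a k-way merge suffices. A minimal sketch (the function name is ours):

```python
import heapq

def merge_critical_points(*point_sets):
    """Merge per-channel critical-point sets, each already sorted by time,
    into one series of triples (t_i, c_i, r_i) ordered along the time scale."""
    return list(heapq.merge(*point_sets, key=lambda p: p[0]))
```

Because `heapq.merge` is lazy and linear in the total number of points, the final summary can be produced without loading any of the raw channel data again.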
The problems that data mining can overcome are the enormity, high dimensionality, and heterogeneous nature of data, which do not allow traditional techniques to be used. In general, data mining is designed to reach the following goals. Prediction Methods: predicting unknown or future values of variables, using some known values. Description Methods: finding patterns describing the data in a human-understandable way. The common tasks of data mining are the following: • Classification (predictive): Given a collection of records (the training set), where each record contains several attributes, one of which is the class, the task is to find a model for the class attribute as a function of the other attributes, so that previously unknown records can be assigned a class as accurately as possible. To determine the accuracy, a test set is often used. • Clustering (descriptive): Given a set of data points, each having a set of attributes, and a similarity measure among them, the task is to find clusters such that the data points in each cluster are more similar to each other

and the data points in different clusters are less similar to each other. • Association Rule Discovery (descriptive): Given a set of records, each of which contains a certain number of items from a given collection, the task is to find dependency rules predicting the occurrence of an item based on occurrences of other items. • Sequential Pattern Discovery (descriptive): Given a set of objects, each having its own timeline of events, the task is to predict sequential dependencies among different events. • Regression (predictive): Given a set of continuous-valued variables, the task is to predict the values of a variable using other variables, assuming a linear or nonlinear model of dependency. • Deviation Detection (predictive): The task of this method is to discover the most significant changes in data from previously measured or normative values. Data mining can be performed using several models and technologies, including decision trees, neural networks, rule induction, the nearest neighbor method, and genetic algorithms ([8], [15], [16], and [17]). However, in this paper we only demonstrate methods from the "Sequential Pattern Discovery" field ([13] and [14]). As described in the introduction, it may be very difficult to apply common algorithms to multichannel data. To do so, the algorithms must cope with parallel processing of many channels on the same time scale. Furthermore, they must be adjusted to handle large amounts of data as well as to ignore the large amounts of irrelevant data within it, and so on. However, when the multichannel data are represented as a series of critical-points, the situation is completely different. In computer science there are many algorithms for processing data which are in the form of a series of events with an associated time for each datum. Three variations of common "Sequential Pattern Discovery" problems are briefly discussed in Sections III-E1–III-E3.
All of them assume an input of the form ((e_1, t_1), (e_2, t_2), ...), where e_i is the ith event in some described activity and t_i is its time of occurrence. Because these problems are so important, efficient methods have already been developed to solve each of them. In Section IV, we take advantage of this fact to demonstrate the advantages of applying our technique to multichannel data prior to the activation of a data-mining algorithm. 1) Discovering Serial-Episodes: A serial-episode is an ordered set of events, denoted by S = (s_1, s_2, ..., s_k). Given a time-window w, we say that S occurs at time t if the two following conditions hold: a) the event s_1 occurs at time t; b) the events s_2, ..., s_k occur in this order within the time interval (t, t + w]. Note that they can be separated by other events. The frequency of S, denoted by freq(S), is defined as the maximal number of nonoverlapping occurrences of S. A serial-episode whose frequency has crossed some predefined threshold is called frequent. Given a series of events, a time-window w and a frequency-threshold f, the problem of discovering serial-episodes is the problem of finding all serial-episodes which are frequent with respect to w and f.
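A naive, greedy sketch of counting non-overlapping occurrences of a serial-episode follows; it is our own illustration, not the efficient episode-mining algorithms of [13] and [14], which avoid rescanning the event series.

```python
def serial_episode_frequency(events, episode, window):
    """Count non-overlapping occurrences of `episode` (an ordered tuple of
    event types) in `events`, a time-sorted list of (event, time) pairs.
    Greedy left-to-right scan: an occurrence starts at events[i] and its
    remaining events must follow, in order, within `window` time units."""
    freq, i, n = 0, 0, len(events)
    while i < n:
        if events[i][0] == episode[0]:
            t0, j, k = events[i][1], i + 1, 1
            while j < n and k < len(episode) and events[j][1] - t0 <= window:
                if events[j][0] == episode[k]:
                    k += 1
                j += 1
            if k == len(episode):
                freq += 1
                i = j          # non-overlapping: resume after this occurrence
                continue
        i += 1
    return freq
```

Note how shrinking the time-window w prunes occurrences: the episode ("a", "b") below occurs twice with w = 2 but only once with w = 1.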


2) Discovering Parallel-Episodes: A parallel-episode is an unordered set of events, noted by Given a time-window , we say that occurs at time if the following two conditions hold. a) One of the events of occurs at time . b) All other events of occur in any order within the time . interval The frequency of and the problem of discovering parallelepisodes are defined just like the ones of a serial-episode. Note that given a frequency-threshold , the total number of chance episodes appearing at least times can usually be estimated with statistics regarding the frequencies of the single events. This idea can be used to determine the value of . For example, can be computed such that the expected total number of frequent outcome episodes is approximately 1000. 3) Discovering Generalized-episodes: A generalized-episode G is a partially ordered set of events. Intuitively it is a combination of a serial-episode and a parallel-episode. , For example, the generalized-episode is a parallel-episode made up of three (noted by freq is serial-episodes. The frequency of were a parallel-episode considering the defined as though constraints caused by the serial parts within it. The problem of discovering generalized-episodes is defined in the same manner as for the above two types of episodes. 4) Using Frequent Episodes To Explore Interesting Rules: Several data-mining methods as well as other statistical methods can take advantage of the discovered set of frequent episodes to discover interesting rules involving them. Each rule represents a specific relation between the frequent-episodes which are unand likely to occur by chance. 
Given two frequent episodes S1 and S2, an example of a simple relation would be: "70% of the occurrences of S1 are followed by an occurrence of S2 within at most 2 min, while in the other direction, 85% of the occurrences of S2 are preceded by an occurrence of S1." In order to understand the practical meaning of each such relation, it is translated into the context of the original data. For the above example, S1 and S2 are two repeating trends in the data which can be described in natural language, as can the detected relation between them. At a further verification stage, the data might be divided into a training-set and a test-set. In this way, rules which were detected in the training-set can be verified on the test-set, yielding more convincing results.

5) Critical-Point Considerations and Taxonomies: Generally, critical-points help data-mining methods interpret the data as small repeating components of behavior. For humans, too, it is much simpler to understand a sentence like "object turns right" than to trace its 100 recorded locations. However, in some cases we want to distinguish between several types of "right turns," so we would like to have smaller behavioral components. On the one hand, using many types of critical-points represents the data in more detail; on the other hand, handling the logical relations between them becomes more complicated for further algorithms. The use of predefined taxonomies in the data-mining stage can be a good solution to this problem. Formally, a taxonomy is a specific hierarchy describing "is-a" relations between types of critical-points. In such a manner, although there are many kinds of "right turns," the algorithm understands


that they are all derived from the same concept. Naturally, taxonomies can have as many hierarchical levels as needed, and they are usually represented by trees. When they are used in data-mining algorithms, patterns and rules involving inner nodes may be found despite the fact that they do not hold for any of their children.

IV. RESULTS

A. Implementation in Sleep Research

In this section, we illustrate one practical use of the multichannel analysis technique. We used data recorded in a sleep laboratory for three different patients during one night of sleep. Our analysis was done on three recorded channels for each patient.
• EEG channel—records the electrical activity of the brain in microvolts, sampled at 100 Hz.
• Pulse channel—records the heartbeat rate in beats-per-minute, sampled at 1 Hz.
• SaO2 channel—records the oxygen carried by hemoglobin in the blood in percent-saturation-of-hemoglobin, sampled at 1 Hz.
All channels for each patient were recorded using a common time scale during sleep.

1) Stage 1: Determining the Projected-Channels: We need to decide what behavior of the channels we want to focus on in our analysis. In other words, we must determine the set of projection-functions associated with these channels (see Section III-B). Here, we decided to use the following three projection-functions (yielding three projected-channels, respectively).
• EEG, frequency—this projected-channel was constructed using a fast Fourier transform (FFT) such that its value at time t is the estimated frequency (in hertz) of the EEG signal in a window around t.
• Pulse, first derivative—this projected-channel is constructed such that its value at time t is the direction of the pulse (in radians). This value is computed as the arctan of the first derivative of the graph representing the pulse signal at time t. Formally: P'(t) = arctan((P(t) - P(t - d)) / d) for a small value of d.
• SaO2, identity—this projected-channel is equivalent to the original SaO2 channel.
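The first two projection-functions can be sketched as follows. This is a minimal illustration only: the window handling, the naive DFT used here in place of the paper's FFT, and all function names are our assumptions, not the authors' implementation:

```python
import math

def direction_projection(signal, dt=1.0):
    """Pulse projection: arctan of the first derivative at each sample,
    i.e. P'(t) = atan((P(t) - P(t - dt)) / dt), in radians."""
    return [math.atan((signal[i] - signal[i - 1]) / dt)
            for i in range(1, len(signal))]

def dominant_frequency(window, fs):
    """EEG projection: estimate the dominant frequency (in Hz) of one
    window of samples with a naive discrete Fourier transform."""
    n = len(window)
    best_k, best_power = 1, 0.0
    for k in range(1, n // 2):  # skip the DC component
        re = sum(window[t] * math.cos(2 * math.pi * k * t / n) for t in range(n))
        im = sum(window[t] * math.sin(2 * math.pi * k * t / n) for t in range(n))
        power = re * re + im * im
        if power > best_power:
            best_k, best_power = k, power
    return best_k * fs / n      # convert bin index to hertz
```

For a pure 10-Hz sine sampled at 100 Hz, dominant_frequency returns 10.0; a production implementation would use an FFT library for speed.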
The idea behind the identity projection is that we want to use the original values of the channel to generate the critical-points (see stage 2). Once we applied these three projection-functions to the original channels, we obtained the desired set of projected-channels, for which we had to detect the critical-points.
2) Stage 2: Detecting the Critical-Points: We had to decide, for each projected-channel, under what circumstances a critical-point should be generated. Note that the more meaningful the generation in this stage, the better the chances of obtaining meaningful results in the final stage.
To illustrate the TBR method (as described in Section III-C2), we applied it to the EEG frequency channel. To do so, we needed a meaningful division of the EEG frequency range. Hence, we used a division known in the sleep literature, having the


Fig. 5. Merging critical points. In the top panel (a), critical points were marked on three original channels. In the bottom panel (b), all critical points were merged onto the same time scale. The merged representation is the input for postprocessing algorithms such as data mining.
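The merge illustrated in Fig. 5 amounts to a k-way merge of time-sorted critical-point lists, for example (a sketch; names and the pair representation are ours):

```python
import heapq

def merge_critical_points(*channels):
    """Merge per-channel critical points, each given as a time-sorted
    list of (time, label) pairs, onto one common time scale (stage 3)."""
    return list(heapq.merge(*channels, key=lambda point: point[0]))
```

heapq.merge performs a lazy k-way merge, so the per-channel lists never need to be concatenated and re-sorted.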

Fig. 6. Example of a patient’s pulse distribution. The frequency of each potential pulse value was computed and is shown in the above histogram. The interquartile range is marked by two dashed vertical lines (Q1 and Q3). These two values were used as borders for generating the critical-points as discussed above.

following predefined ranges [18]: 0 Hz–4 Hz (delta waves), 4 Hz–8 Hz (theta waves), 8 Hz–13 Hz (alpha waves), and above 13 Hz (beta waves). Each time the estimated EEG frequency alternated between two of these subranges, a critical-point was generated. Furthermore, to overcome noisy critical-points, we used the safe-pass-factor parameter (see Section III-C2). In this example we set it to 15%, observing that choosing any other value between 5% and 30% led to the same results as given in stage 4.
Regarding the pulse channel, we were interested in generating critical-points at all extremum points which were significantly higher/lower than the values around them. For this purpose, we estimated the distribution of all potential pulse values for each patient separately, as shown in Fig. 6. Then we computed the first quartile value (say Q1) and the third quartile value (say Q3), yielding [Q1, Q3] as the interquartile range of the distribution. Eventually, we passed through the values of the first derivative of the pulse to detect all time points at which extremums exist (i.e., the sign of the first derivative alternates between positive and negative values). The criteria for generating a critical-point at a certain extremum point were the two following conditions.
a) The extremum value is not within the interquartile range [Q1, Q3].
b) There is at least one time-point between the last generated critical-point and the current extremum point in which the pulse value is within the interquartile range.
Intuitively, the above two conditions force the extremum peak value to be outside the interquartile range, and to be preceded by at least one value which is within the interquartile range.
For the SaO2 channel, we wanted each critical-point to represent a critical increase/decrease in the saturation value. We estimated the distribution of all potential saturation values (for each patient separately, similar to that described in Fig. 6),

indicating that the interpercentile range contains all saturation values between 90% and 98%. Generating the critical-points was done again using TBR, assuming the subranges below 90%, 90%–98%, and above 98%. Intuitively, values between 90% and 98% are the most common, whereas values of less than 90% and values greater than 98% are rare. Note that saturation values below 50% were considered as noise and were ignored.
3) Stage 3: Merging the Critical-Points: Once all critical-points of each projected-channel were detected in stage 2, at this stage all critical-points of all projected-channels were merged on a common time scale (see Fig. 5). The result of the merge was a compact representation of the multichannel data for a postprocessing algorithm.
4) Stage 4: Applying a Data-Mining Algorithm: We used data-mining methods to discover all the generalized-episodes along the time scale (see Section III-E3). For the demonstration, we chose a time-window of 30 s, and we focused only on patterns involving critical-points from all original channels. The most frequent pattern includes three critical-points and is formulated as follows.
a) SaO2 went below 90%.
b) EEG waves moved from the range 4–8 Hz to the range 8–13 Hz (i.e., theta to alpha).
c) A local maximum peak was detected in the pulse, as described in stage 2.
In data-mining terms, this generalized-episode means that the critical-points occurred in the SaO2 and EEG channels first, without a significant order between them, and were then followed by the critical-point which occurred in the pulse channel. Note that this episode repeated 70 times in the data. Two of its occurrences are shown in Fig. 7. Note that all other frequent episodes found were diagnosed as derivatives of the same episode. This can be explained by the different ways of representing this episode within a time-window.
For example, a derivative episode could replace “decrease below 90%” with “increase over 90%” in the SaO2 channel, since after a saturation value goes below its normal range (below 90%), it should return to it within a short while.
5) Stage 5: Back to the Sleep Laboratory: Understanding the meaning of this episode in the context of the original data involved consulting experts in sleep analysis. All 70 occurrences of this episode were examined, and 59 of them (84%) were found to be correlated with a known sleep disorder called arousal with deep-desaturation ([18] and [19]). Note that the change in the heartbeat rate indicates the heart’s reaction to this disorder, a reaction which may, but does not necessarily, appear in this kind of arousal. Therefore, these episodes do not indicate all arousals with deep-desaturation that exist in the data, but a subgroup of them in which there is a noticeable reaction of the heart rate as well.
According to academic polysomnography ([18] and [19]), SaO2 values below 90% are considered abnormal. A situation in which SaO2 values go below 90% is called deep-desaturation. An arousal with deep-desaturation is defined as an abrupt shift in the EEG channel to theta or alpha frequency, for at least 3 s, while the values of the SaO2 channel go below 90%. This


Fig. 7. Two occurrences of a generalized-episode found by a data-mining algorithm. This figure shows an episode that consists of three critical-points from three different channels (marked by small arrows on the time scale below them). In the left occurrence (a), the critical-point marked on the EEG channel precedes the one marked on the saturation channel, whereas in the right occurrence (b) the opposite took place.

kind of arousal is a common sleep disorder found in many patients. The data for our three patients contained 81 such arousals, of which 73% (59/81) were recognized by the most frequent episode.

B. Implementation in Brain Research

Here we show an application of the presented technique to data recorded in brain research projects. The comprehensive results were published in [20].
1) Background: Macaca fascicularis monkeys were trained to hold a two-joint manipulandum and scribble in the horizontal plane. One out of 19 hexagonal patches which tiled the working space was randomly selected as a target. When the monkey hit the target, a short beep was sounded, the monkey obtained a juice reward, and the target jumped at random to another location. In this way the monkey was encouraged to produce continuous movements. After training, two data channels were simultaneously recorded: the drawing channel, which holds the manipulandum’s location sampled at 10 Hz, and the neural channel, which holds the firing times (also called spikes) of each recorded neuron, at a resolution of 0.1 ms.
Our main goal was to look for unknown relations between hand motion and neural activity. Because relations based on the average firing rate of the neurons are already known [21], we focused on relations between hand motion and precise firing sequences (PFSs). A PFS is defined as a series of two or more spikes with constant time delays between their occurrence times. In our context, the input multichannel data consist of two channels: the drawing channel and the neural channel. Sections IV-B2–IV-B4 show how our technique can be applied to such data in order to construct the set of critical points (stages 1, 2, and 3) for further postprocessing techniques (which will not be described here).
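As a flavor of what a statistics-based postprocessing stage might compute over the merged critical points, the “followed-by” confidence between two streams of critical points (cf. the rule example in Section III-E4) can be sketched as follows. The names are illustrative; this is not the algorithm of [20]:

```python
import bisect

def followed_by_confidence(times_a, times_b, max_gap):
    """Fraction of occurrences of episode A that are followed by an
    occurrence of episode B within max_gap time units.

    times_a, times_b -- sorted lists of occurrence times
    """
    if not times_a:
        return 0.0
    hits = 0
    for t in times_a:
        # index of the first B occurrence strictly after t
        j = bisect.bisect_right(times_b, t)
        if j < len(times_b) and times_b[j] - t <= max_gap:
            hits += 1
    return hits / len(times_a)
```

A rule such as “70% of the occurrences of A are followed by B within 2 min” corresponds to followed_by_confidence returning 0.7 with max_gap set to 120 s.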
Recall that the way in which critical points are chosen for each channel determines what characteristics of the channel we want to use as the input

for the postprocessing algorithms. In this example we used an internal patterns-mining stage inside each of the channels (Section III-C3).
2) Critical Points of the Drawing Channel: The main idea was to look for repetitions of similar drawing patterns and to define the critical points at the occurrence times of these patterns. For the detection of these components, we used two projection-functions for the drawing channel: angular-direction and direction-delta. The angular-direction projection creates a channel D(t) such that D(t) holds the direction of the original drawing at time t (in degrees). The direction-delta projection creates a channel ΔD(t) such that ΔD(t) holds the difference between two consecutive values of D(t). All possible values of D(t) were quantized into six sections of 60° each. Using the TBR method (Section III-C2), transitions from one section to another were marked. In the same manner, the values of ΔD(t) were quantized into four sections of 90° each, and transitions were marked. In both projections we used the safe-pass-factor (Section III-C2).
The two types of transitions were merged on the same time scale, and sequences of repeating transitions (one or more) that last for at most 0.5 s were considered as drawing components. Drawing components were found using a patterns-mining algorithm for discovering serial episodes (described in Section III-E1). Each start time of an occurrence of the repeating drawing components was considered as a critical point in the drawing channel. A few examples of drawing components are shown in Fig. 8.
3) The Critical Points of the Neural Channel: Following our goal to find relations between drawing components and PFSs, we generated a critical point at the start times of the occurrences of the repeating PFSs. Formally, we followed the three following steps.
a) We used the “identity projection” for the neural channel and merged the spike times from all neurons on the same time scale.


Fig. 8. Example of three drawing components found by patterns-mining techniques. Each row represents several occurrences of the same component which are marked by thicker lines.

b) We applied patterns-mining techniques for discovering repeating patterns. This stage was termed above as patterns-mining inside a channel (Section III-C3). Each such pattern was defined as two or more spikes having constant time intervals between them. In addition, the time difference between the earliest spike and the latest spike in a pattern was limited to 100 ms at most. A single pattern is represented by a vector of ascending ordered time intervals. For instance, a vector (d1, d2) stands for a pattern consisting of three spikes, meaning that a spike was emitted at some time t, a second spike was emitted at t + d1, and a third one was emitted at t + d2.
4) Postprocessing: Once critical points were marked in both the drawing and neural channels, they were all merged on a common time scale and passed to a postprocessing algorithm. This algorithm was based on statistical methods and was aimed at finding unknown relations between the two types of critical points. More details regarding the postprocessing and the results can be found in [20].

V. DISCUSSION

Analyzing multichannel recordings is crucial in many studies. While known techniques for analyzing such data restrict each variable to a specific set of values (an action called discretization), this paper proposes an easily applicable method that transforms multichannel recordings into an alternative representation which consists solely of the major incidents in these recordings. The presented method has two major advantages over the existing methods.
1) The amount of the transformed data is significantly smaller (for example, if one of the variables remains constant at each time point, it will not be represented at all in the transformed data). This fact has a major impact on the runtime of the next analysis stages, especially when the amount of raw data is large.
2) The pruning of noisy data is done at an early stage of the analysis process.
In such a manner, a variety of postprocessing algorithms can be activated solely on the major incidents in the original data.
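The TBR-with-safe-pass-factor idea, which both case studies rely on, can be sketched as follows. Reading the safe-pass-factor as a hysteresis margin is our assumption, since the details of Section III-C2 are not reproduced here:

```python
def tbr_critical_points(values, borders, safe_pass=0.15):
    """Transitions-Between-Ranges sketch with a safe-pass-factor.

    borders   -- ascending border values, e.g. [4, 8, 13] for the EEG bands
    safe_pass -- hysteresis fraction (assumption): a transition is registered
                 only after the signal clears the crossed border by
                 safe_pass times the smallest subrange width
    Returns a list of (sample_index, new_subrange_index) pairs.
    """
    def subrange(v):
        idx = 0
        while idx < len(borders) and v >= borders[idx]:
            idx += 1
        return idx

    gaps = [borders[i + 1] - borders[i] for i in range(len(borders) - 1)]
    margin = safe_pass * (min(gaps) if gaps else borders[0])
    current = subrange(values[0])
    points = []
    for i, v in enumerate(values[1:], start=1):
        new = subrange(v)
        # require the value to clear the crossed border by the safety margin,
        # so that small oscillations around a border produce no critical point
        if new > current and v >= borders[new - 1] + margin:
            points.append((i, new))
            current = new
        elif new < current and v <= borders[new] - margin:
            points.append((i, new))
            current = new
    return points
```

With the EEG borders [4, 8, 13] and safe_pass = 0.15, a value oscillating between 3.9 and 4.2 generates no critical point, while a clear jump to 5 does.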

The results section demonstrates two applications of our technique in two different studies. In both analyses we first transformed the raw data into a series of critical points using the techniques described in Section III. For example, we used the TBR method and the safe-pass-factor idea (Section III-C2) for interpreting monkeys’ drawings as an ordered set of either direction changes or velocity changes. Another example is the use of patterns mining inside a projected channel (Section III-C3) for interpreting the neural data as a series of precise firing patterns. These interpretations of the raw data led us to state that cortical neurons exhibit precise interspike timing in correspondence to behavior [20]. Many types of multichannel recordings can be analyzed following the same steps as described in this paper. Different choices of projection-functions and critical-point definitions will lead to different resulting patterns of behavior, which may shed light on new interesting facts about the data.

VI. CONCLUSION

In recent years, computerized methods have become more and more useful in many studies for processing large amounts of raw data. However, choosing the right processing technique is often critical. This decision must consider the type of the data, the memory limit of the machine, and the noise in the data. The desired technique should be able to extract meaningful information from the raw data and also provide a flexible infrastructure for further postprocessing analysis. The methods suggested in this paper provide a convenient template which is a basis for exploring a variety of relations between different behaviors in multichannel data. We find these techniques very useful in our studies and we think they might be useful in many others.

ACKNOWLEDGMENT

This paper is part of a Ph.D. dissertation done at Bar-Ilan University.

REFERENCES

[1] H. Liu and H. Motoda, Feature Selection for Knowledge Discovery and Data Mining, ser. Engineering and Computer Science.
Kluwer Academic Publishers, 1998.
[2] M. R. Chmielewski and J. W. Grzymala-Busse, “Global discretization of continuous attributes as preprocessing for machine learning,” Int. J. Approximate Reasoning, vol. 15, pp. 319–331, 1996.
[3] J. Catlett, “On changing continuous attributes into ordered discrete attributes,” in Proc. Machine Learning—EWSL-91, Mar. 1991, vol. 482, pp. 164–178.
[4] U. M. Fayyad and K. B. Irani, “Multi-interval discretization of continuous-valued attributes for classification learning,” in Proc. IJCAI-93, Aug./Sep. 1993, vol. 2, pp. 1022–1027.
[5] R. Kerber, “Chimerge: Discretization of numeric attributes,” in Proc. 10th Nat. Conf. Artificial Intelligence, 1992, pp. 123–128.
[6] T. M. Cover and J. A. Thomas, Elements of Information Theory, ser. Telecommunications. New York: Wiley, 1991.
[7] A. K. C. Wong and D. K. Y. Chiu, “Synthesizing statistical knowledge from incomplete mixed-mode data,” IEEE Trans. Pattern Anal. Mach. Intell., vol. PAMI-9, no. 6, pp. 796–805, 1987.
[8] J. Quinlan, “Induction of decision trees,” Mach. Learning, vol. 1, no. 1, pp. 81–106, 1986.
[9] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” J. Roy. Statist. Soc. B (Methodological), vol. 39, no. 1, pp. 1–38, 1977.
[10] T. F. Gonzalez, “Clustering to minimize the maximum intercluster distance,” Theoretical Comput. Sci., vol. 38, pp. 293–306, 1985.


[11] B. McNamara and K. Wiesenfeld, “Theory of stochastic resonance,” Phys. Rev. A, vol. 39, p. 4854, 1989.
[12] S. Fauve and F. Heslot, “Stochastic resonance in a bistable system,” Phys. Lett. A, vol. 97, p. 5, 1983.
[13] H. Mannila, H. Toivonen, and A. I. Verkamo, “Discovering frequent episodes in sequences,” in Proc. Knowledge Discovery and Data Mining (KDD), 1995, pp. 210–215.
[14] H. Mannila and H. Toivonen, “Discovering generalized episodes using minimal occurrences,” in Proc. Knowledge Discovery and Data Mining (KDD), 1996, pp. 146–151.
[15] M. Kantardzic, Data Mining: Concepts, Models, Methods, and Algorithms. Piscataway, NJ: Wiley-IEEE Press, 2002.
[16] K. Cios, W. Pedrycz, and R. Swiniarski, Data Mining Methods for Knowledge Discovery. Kluwer, 1998.
[17] E. Cox, Fuzzy Modeling and Genetic Algorithms for Data Mining and Exploration. San Mateo, CA: Morgan Kaufmann, 2004.
[18] N. Butkov, Atlas of Clinical Polysomnography. Medford, OR: Synapse Media, 1996, vols. 1 & 2.
[19] S. Chokroverty, R. Thomas, and M. Bhatt, Atlas of Sleep Medicine. Amsterdam, The Netherlands: Elsevier, 2005, p. 376.
[20] T. Shmiel, R. Drori, O. Shmiel, Y. Ben-Shaul, Z. Nadasdy, M. Shemesh, M. Teicher, and M. Abeles, “Neurons of the cerebral cortex exhibit precise interspike timing in correspondence to behaviour,” Proc. Nat. Acad. Sci., vol. 102, pp. 18655–18657, 2005.
[21] H. B. Barlow, “Single units and sensation: A neuron doctrine for perceptual psychology?,” Perception, vol. 1, pp. 371–394, 1972.

Oren Shmiel was born in Israel in 1980. In the 8th grade he took the matriculation exam in mathematics, and before the age of 14 he started working towards the bachelor’s degree in computer science and mathematics at Bar-Ilan University, Ramat Gan, Israel. In 2001 he completed the master’s degree, and he is currently working towards the Ph.D. degree in computerized methods for sleep analysis at Bar-Ilan University. He is a specialist in algorithms and software development, and is well-experienced with digital video compression and DSP.


Tomer Shmiel was born in Israel in 1977. In the 8th grade he took the matriculation exam in mathematics, and before the age of 14 he started working towards the bachelor’s degree in computer science and mathematics at Bar-Ilan University, Ramat Gan, Israel. In 1998 he completed the master’s degree in computer science (Bar-Ilan), including a thesis in the field of efficient data-mining algorithms, and in 2006 he received the Ph.D. degree (Bar-Ilan) for the discovery of new relationships between precise firing sequences of neurons and hand motion in monkeys. He is an expert in efficient algorithms and C++ programming.

Yaron Dagan is the head of the Institute for Fatigue and Sleep Medicine at Sheba Medical Center, Israel, affiliated with the Sackler School of Medicine, Tel-Aviv University, where he also heads the Department of Medical Education. He has been a sleep physician and researcher since 1984. His main interest is chronobiological sleep disorders. Dr. Dagan is a board member of the World Association for Chronobiology, the president of the Israeli Association for Chronobiology, and a board member of the journal Chronobiology International.

Mina Teicher is a Professor of Mathematics with the Brain Research Center, Bar-Ilan University, Ramat Gan, Israel. She is currently Chief Scientist with the Ministry of Science and Technology, Jerusalem, Israel. She previously served as Vice President for Research and Development of Bar-Ilan University. Her research interests cover algebraic geometry, group-based cryptography, computer vision, and mathematical methods in neuroscience and their medical applications. She runs a research group of 25 students and postdoctoral researchers. Dr. Teicher is the Chairman of the Education Committee of the European Mathematical Union.
