Abstract. Objective: In this paper, a genetic fuzzy inference system based on expert knowledge was developed for automatic sleep staging. Methods: Eight ...
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TBME.2015.2510365, IEEE Transactions on Biomedical Engineering
Combination of expert knowledge and a genetic fuzzy inference system for automatic sleep staging Sheng-Fu Liang, Chih-En Kuo, Fu-Zen Shaw, Ying-Huang Chen, Chia-Hu Hsu, Jyun-Yu Chen
Sleep is one of the most important human activities. Sleep disorders, such as insomnia and obstructive sleep apnea, may cause daytime sleepiness, irritability, depression, anxiety or even death [1, 2]. Sleep analysis is helpful in disease diagnosis and in several psychophysiological analyses. For the diagnosis of sleep problems, all-night polysomnographic (PSG) recordings, including electroencephalogram (EEG), electrooculogram (EOG) and electromyogram (EMG), are usually taken from the patients, and the recordings are scored by a well-trained expert according to the Rechtschaffen & Kales (R&K) rules codified in 1968 [3]. According to the R&K rules, each epoch (i.e., 30 s of data) is classified into one of the sleep stages, including wakefulness (Wake), non-rapid eye movement (stages 1-4, from light to deep sleep) and rapid eye movement (REM). Stages 3 and 4 were recently combined as the slow wave sleep (SWS) stage [4]. The expert manually scores the all-night PSG recordings according to the characteristics of each sleep stage described by the R&K rules [3]. The Wake stage consists of alpha activity (8-13 Hz) or low-voltage mixed-frequency activity. During light sleep, the S1 stage consists of relatively low-voltage mixed-frequency activity (3-7 Hz); the S2 stage is characterized by the appearance of sleep spindles and/or K-complexes. During deep sleep, the SWS stage consists of high-voltage (>75 μV), low-frequency ( 30 min) and 75.25 min (> 15 min), respectively. These measurements were approved by the internal review board of National Cheng Kung University. Subjects were recruited from the public by online advertisements and announcements on notice boards at National Cheng Kung University. Participants had to refrain from any drug/medication and limit caffeine use (no caffeine intake for at least 5-6 h prior to sleep laboratory visits). The all-night PSGs were recorded in the sleep laboratory at the cognitive institute of National Cheng Kung University. There was no outside interference during data collection, and no medications were used to induce sleep. The recordings (Siesta 802 PSG, Compumedics, Inc.) included six EEG channels (F3-A2, F4-A1, C3-A2, C4-A1, P3-A2, and P4-A1, according to the
international 10-20 standard system), two EOG channels (positioned 1 cm lateral to the left and right outer canthi), and a chin EMG channel. The sampling rate was 256 Hz with 16-bit resolution. All 48 PSG sleep recordings were visually scored by a sleep specialist using the R&K rules with a 30-s interval (named an epoch).

B. Feature extraction

The automatic sleep staging system analyzes the data from two recording channels: the central EEG (C3-A2) and the chin EMG. According to the R&K criteria [3], the EEG data were filtered with an eighth-order Butterworth band-pass filter with a passband of 0.5-30 Hz, and the EMG data were filtered with a 5-100 Hz eighth-order Butterworth band-pass filter. The continuous time signals were segmented into nonoverlapping 30-s epochs. Because of the properties of sleep recordings, spectral information may be lost if the processing window is too long. For example, the characteristics of Stage 2, i.e., the K-complex and the spindle, are often less than 2 seconds in duration. The SWS epoch has > 6-s delta activity and Stage 2 may contain < 6-s delta activity. Therefore, we segmented each epoch into 15 nonoverlapping intervals of two seconds for feature extraction. Each nonoverlapping 2-s interval was called a window, i.e., there are a total of 15 2-s windows in a 30-s epoch. A 512-point (256 Hz × 2 s) FFT was applied to each 2-s window.

Table I lists the eight features used in this study. The types of feature include power spectrum (PS), power ratio (PR), spectral frequency (SF), duration ratio (DR), and EMG energy. PS is calculated by averaging the power of a specific frequency band over the 15 windows. PR is the power ratio of two frequency bands. SF is the mean frequency of the spectral power. DR is the ratio between the number of windows in which the energy of a specific frequency band is higher than a threshold and the total of 15 windows in an epoch.
The feature Amp M is the mean value of the absolute amplitude of all EMG data points in an epoch (for fuzzy rules) or in a 2-s window (for movement epoch detection).

TABLE I The features for automatic sleep scoring.

No.  Source  Type        Feature                    Label         Link to R&K
1    EEG     PS          Total power of 0-30 Hz     0-30 E        NC
2    EEG     PR          0-4 Hz/0-30 Hz             0-4 E         SWS
3    EEG     SF          Mean frequency of 0-30 Hz  Mean(fre.) E  NC
4    EEG     DR          Alpha ratio                Alpha E       Wake
5    EEG     DR          Spindle ratio              Spindle E     S2
6    EEG     DR          SWS ratio                  SWS E         SWS
7    EMG     PS          Total power of 0-30 Hz     0-30 M        NC
8    EMG     EMG energy  Mean amplitude             Amp M         REM

* PS (= Power spectrum), PR (= Power ratio), SF (= Spectral frequency), DR (= Duration ratio), NC (= No correspondence).
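The window-based features of Table I can be illustrated with a minimal sketch. It assumes a naive DFT in place of the 512-point FFT, a PR computed as the ratio of window-averaged band powers, and a hypothetical `alpha_threshold`; the function names, the threshold value and the omission of normalization are illustrative assumptions, not details from the paper.

```python
import cmath
import math

FS = 256          # EEG sampling rate (Hz), as stated in the text
WIN = 2 * FS      # 512 samples per 2-s window
N_WIN = 15        # 2-s windows per 30-s epoch

def power_spectrum(window, max_hz=30):
    """Power at each DFT bin from 0 Hz to max_hz (naive DFT, 0.5 Hz bins)."""
    n = len(window)
    n_bins = int(max_hz * n / FS) + 1
    spec = []
    for k in range(n_bins):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(window))
        spec.append(abs(coeff) ** 2 / n)
    return spec

def band_power(spec, lo_hz, hi_hz):
    """Sum the power of all bins whose frequency lies in [lo_hz, hi_hz]."""
    res = FS / WIN  # frequency resolution: 0.5 Hz
    return sum(p for k, p in enumerate(spec) if lo_hz <= k * res <= hi_hz)

def epoch_features(eeg_epoch, alpha_threshold):
    """Return three Table I style features for one 30-s EEG epoch.

    alpha_threshold is a hypothetical per-window power threshold.
    """
    windows = [eeg_epoch[i * WIN:(i + 1) * WIN] for i in range(N_WIN)]
    specs = [power_spectrum(w) for w in windows]
    total = sum(band_power(s, 0, 30) for s in specs) / N_WIN   # PS
    delta = sum(band_power(s, 0, 4) for s in specs) / N_WIN
    alpha_hits = sum(band_power(s, 8, 13) > alpha_threshold for s in specs)
    return {
        "0-30 E": total,
        "0-4 E": delta / total if total else 0.0,   # PR (assumed form)
        "Alpha E": alpha_hits / N_WIN,              # DR
    }

def mean_abs_amplitude(emg):
    """Amp M: mean absolute amplitude of the EMG samples."""
    return sum(abs(x) for x in emg) / len(emg)
```

For a pure 10 Hz sine epoch, every window exceeds a small alpha threshold, so the alpha duration ratio is 1 and the delta power ratio is close to 0.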
The relations between the features and the R&K rules are also given in Table I. The feature “Alpha E” corresponds to the portion of the epoch that consists of alpha (8-13 Hz) activity for
0018-9294 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
stage Wake. The “Spindle E” corresponds to the duration of spindle activity for stage S2. The “0-4 E” and “SWS E” correspond to the magnitude and duration of delta activity for stage SWS. The “Amp M” corresponds to the magnitude of EMG for stage REM. More details of these features can be found in reference [8]. Most of the features we used have a physiological meaning corresponding to the characteristics of sleep signals described in the R&K criteria [3]. For example, the feature “Alpha E” is a quantized value of the alpha ratio for each epoch. According to the R&K criteria, the signature of the Wake stage is that >50% of the epoch consists of alpha (8-13 Hz) activity or low-voltage mixed-frequency activity. Therefore, the feature “Alpha E” can be used to determine Wake stages. Similarly, the feature “SWS E” is a quantized value of the SWS ratio for each epoch. According to the R&K criteria, the identifying feature of SWS is that >20% of the epoch consists of high-voltage (>75 μV) and low-frequency (75 μV) and low-frequency (20 Hz) with higher muscle tone, is observed during the Wake stages. So, the logic of Rule 1 is that, when alpha EEG is high (Alpha E is high), EMG is high (0-30 M is high) and low-frequency (delta, 0-4 Hz) EEG is low (0-4 E is low), the rule concludes that the stage is Wake. Similarly, for Rule 2, 50% of the page (epoch) consists of relatively low-voltage mixed-frequency (3-7 Hz) activity during the S1 stage according to the R&K rules. Thus, the logic of Rule 2 is that, when alpha EEG is high (Alpha E is high), EMG is high (0-30 M is high), low-frequency EEG is middle or high (0-4 E is mid OR high) and spindle EEG is middle or low (Spindle E is low OR Spindle E is mid), the rule concludes that the stage is S1.
Fig. 1. The flow chart of the proposed automatic sleep staging system.
Fig. 2. Histogram of each feature in Table I for the Wake, S1, S2, SWS, and REM stages. The X-axis represents the normalized feature values and the Y-axis represents the number of epochs.
Moreover, the main characteristic of the proposed fuzzy inference system is that multiple features are concurrently included and considered to determine a specific sleep stage. For example, we primarily use the feature Spindle E to detect spindles for classifying sleep stage 2, as in the R&K rules. However, other features such as Alpha E, 0-30 M, 0-4 E, SWS E, and Amp M are also used in fuzzy rules no. 4-7 to distinguish S2 from S1, SWS, and REM. These features were selected by observing the feature distribution in each stage, as shown in Fig. 2. Similarly, fuzzy rule 9 is used to detect the REM stage. In addition to the primary feature Amp M used to distinguish REM from awake or light sleep, four other features, Alpha E, 0-30 M, Spindle E, and Mean(fre.) E, are also used in rule no. 9.

TABLE II The description of the nine rules for automatic sleep scoring.

Rule No.  Target sleep stage  Rule description
1         Wake                IF Alpha E is high AND 0-30 M is high AND 0-4 E is low THEN out is WAKE.
2         S1                  IF Alpha E is high AND 0-30 M is high AND (0-4 E is mid OR 0-4 E is high) AND (Spindle E is low OR Spindle E is mid) THEN out is S1.
3         S1                  IF (Alpha E is low OR 0-30 M is low) AND Mean(fre.) E is high AND Amp M is high AND (Spindle E is low OR Spindle E is mid) THEN out is S1.
4         S2                  IF Alpha E is high AND 0-30 M is high AND (0-4 E is mid OR 0-4 E is high) AND Spindle E is high THEN out is S2.
5         S2                  IF (Alpha E is low OR 0-30 M is low) AND Mean(fre.) E is low AND SWS E is low THEN out is S2.
6         S2                  IF (Alpha E is low OR 0-30 M is low) AND Mean(fre.) E is high AND Amp M is low AND (Spindle E is high OR (Spindle E is mid AND 0-30 E is high)) THEN out is S2.
7         S2                  IF (Alpha E is low OR 0-30 M is low) AND Mean(fre.) E is high AND Amp M is high AND Spindle E is high THEN out is S2.
8         SWS                 IF (Alpha E is low OR 0-30 M is low) AND Mean(fre.) E is low AND SWS E is high THEN out is SWS.
9         REM                 IF (Alpha E is low OR 0-30 M is low) AND Mean(fre.) E is high AND Amp M is low AND (Spindle E is low OR (Spindle E is mid AND 0-30 E is low)) THEN out is REM.
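The nine rules of Table II can be written down as a small sketch. It assumes AND is evaluated as the minimum and OR as the maximum of the membership values, and it uses hypothetical piecewise-linear membership functions with made-up breakpoints; in the paper the membership functions are tuned by the genetic algorithm, so the shapes and numbers below are placeholders only. The stage is taken from the rule with the largest activation.

```python
def low(x, a=0.3, b=0.6):
    """Hypothetical 'low' set: 1 below a, falling linearly to 0 at b."""
    if x <= a:
        return 1.0
    if x >= b:
        return 0.0
    return (b - x) / (b - a)

def high(x, a=0.4, b=0.7):
    """Hypothetical 'high' set: 0 below a, rising linearly to 1 at b."""
    return 1.0 - low(x, a, b)

def mid(x, a=0.2, c=0.5, b=0.8):
    """Hypothetical 'mid' set: triangle peaking at c."""
    if x <= a or x >= b:
        return 0.0
    return (x - a) / (c - a) if x <= c else (b - x) / (b - c)

def rule_activations(f):
    """f maps the Table I labels to normalized feature values in [0, 1]."""
    A, M, D, S = f["Alpha E"], f["0-30 M"], f["0-4 E"], f["Spindle E"]
    F, W, P, T = f["Mean(fre.) E"], f["SWS E"], f["Amp M"], f["0-30 E"]
    wake_like = min(high(A), high(M))      # Alpha E high AND 0-30 M high
    not_wake = max(low(A), low(M))         # Alpha E low OR 0-30 M low
    d_mid_hi = max(mid(D), high(D))
    s_low_mid = max(low(S), mid(S))
    return [
        ("Wake", min(wake_like, low(D))),
        ("S1", min(wake_like, d_mid_hi, s_low_mid)),
        ("S1", min(not_wake, high(F), high(P), s_low_mid)),
        ("S2", min(wake_like, d_mid_hi, high(S))),
        ("S2", min(not_wake, low(F), low(W))),
        ("S2", min(not_wake, high(F), low(P),
                   max(high(S), min(mid(S), high(T))))),
        ("S2", min(not_wake, high(F), high(P), high(S))),
        ("SWS", min(not_wake, low(F), high(W))),
        ("REM", min(not_wake, high(F), low(P),
                    max(low(S), min(mid(S), low(T))))),
    ]

def classify(f):
    """Return the stage of the rule with the maximum activation value."""
    return max(rule_activations(f), key=lambda r: r[1])[0]
```

With these placeholder sets, an epoch with high Alpha E and 0-30 M and low values elsewhere activates Rule 1 fully and is labeled Wake.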
After calculating the activation values of all fuzzy rules, the Takagi-Sugeno fuzzy system [28] was applied in our method to make the decision. For each epoch, the inputs of the fuzzy inference system are the eight feature values, and the output is the stage of the rule that has the maximum activation value among all nine rules. For example, the epoch is classified as Wake if fuzzy Rule 1 has the maximum activation value.

2.2) Genetic algorithm

After constructing the fuzzy rules and the fuzzy sets based on human knowledge and the distributions of feature values, the genetic algorithm (GA) was utilized to fine-tune the membership functions of the fuzzy sets to improve the recognition rate of the developed fuzzy inference system. The genetic algorithm is often applied to solve multi-parameter problems, and it has also been utilized to construct optimal membership functions of fuzzy sets [29, 30]. Fig. 3 shows how the genetic algorithm [30] is applied in our system for membership function optimization. Each chromosome is used to build the fuzzy sets for the fuzzy inference system. First, the GA randomly generates values (ranging from 0 to 1) for the initial population and assesses the fitness of each chromosome, defined as the overall accuracy on the training data. After generating the initial population or producing a new population, the chromosomes were sorted according to fitness. The 50 best chromosomes were selected for the next population, and the remaining new chromosomes were produced by crossover and mutation from these best 50 chromosomes. Each chromosome X was a vector of 40 real numbers, the population size was 100, and the number of generations was 200. In our proposed method, two-point crossover was used and the crossover rate was 0.95. The two-point crossover scheme randomly chose two crossover segments of the same length and exchanged these segments between the strings. The mutation rate was set to 0.01. After crossover, if a random number was lower than the mutation rate, we randomly produced a number (ranging from 0 to 1) and a gene index, and replaced the value of the indexed gene with the random number. After the initialization, crossover and mutation processes, the genetic algorithm produced the best parameter set for the fuzzy inference system. The resultant fuzzy sets and the parameters of the chromosome after training are shown in Fig. 4 and Table III, respectively. Note that the x-axis and y-axis in Fig. 4 represent the values of the features (from 0 to 1) and the fuzzy variables, respectively.

[Fig. 3 flow chart: initialization of parameters (population size, chromosome size, crossover rate, mutation rate); generation of the initial population; fitness evaluation with the fuzzy inference system; if the final generation is reached, finish; otherwise produce a new population.]
Fig. 3. The flow chart of the genetic algorithm for optimization of our system.
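The loop in Fig. 3 can be sketched as follows, using the stated settings (40 genes, population 100, 50 survivors, 200 generations, crossover rate 0.95, mutation rate 0.01). The fitness function is left as a parameter: in the paper it is the training accuracy of the fuzzy inference system, whereas any callable can be plugged in here. The standard two-point crossover and the random parent selection among the survivors are assumptions of this sketch.

```python
import random

GENES, POP, ELITE, GENERATIONS = 40, 100, 50, 200
CROSSOVER_RATE, MUTATION_RATE = 0.95, 0.01

def two_point_crossover(a, b, rng):
    """Swap the segment between two random cut points of the parents."""
    i, j = sorted(rng.sample(range(GENES + 1), 2))
    return a[:i] + b[i:j] + a[j:], b[:i] + a[i:j] + b[j:]

def mutate(chrom, rng):
    """With probability MUTATION_RATE, overwrite one random gene."""
    if rng.random() < MUTATION_RATE:
        chrom[rng.randrange(GENES)] = rng.random()
    return chrom

def evolve(fitness, generations=GENERATIONS, seed=0):
    """Return the fittest chromosome found; fitness maps a gene list to a score."""
    rng = random.Random(seed)
    pop = [[rng.random() for _ in range(GENES)] for _ in range(POP)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        elite = pop[:ELITE]                 # best 50 survive unchanged
        children = []
        while len(children) < POP - ELITE:
            p1, p2 = rng.sample(elite, 2)   # assumed parent selection
            if rng.random() < CROSSOVER_RATE:
                c1, c2 = two_point_crossover(p1, p2, rng)
            else:
                c1, c2 = p1[:], p2[:]
            children += [mutate(c1, rng), mutate(c2, rng)]
        pop = elite + children[:POP - ELITE]
    return max(pop, key=fitness)
```

Because the best 50 chromosomes are carried over unchanged, the best fitness in the population never decreases from one generation to the next.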
3) Contextual rule smoothing Sleep staging has periodicity and continuity from light to deep [3]. General classifiers may not consider temporal contextual information, but the expert may refer to the neighbor epochs in addition to the current epoch to make decisions. Therefore, after classifying the sleep stage using the GA-fuzzy inference system, a smoothing process, considering the temporal contextual information, was applied for continuity [3].
[Fig. 4 panels: (a) 0-30 E (EEG); (b) 0-30 M (EMG); (c) 0-4 E (EEG); (d) Mean(fre.) E (EEG); (e) Alpha E (EEG); (f) Spindle E (EEG); (g) SWS E (EEG); (h) Amp M (EMG). Each panel plots the L, M and H fuzzy sets over the normalized feature value (0 to 1), with breakpoints labeled by the genes X1 to X40.]
Fig. 4. The final membership functions of the fuzzy sets determined by the genetic algorithm. The symbols “L”, “M”, and “H” represent the sets of low, mid, and high, respectively. The x-axis represents the value of the feature (from 0 to 1) and the y-axis represents the possibility.

TABLE III The best parameters of the chromosome after training by GA (total 40 genes).

x1   0.001495   x11  0.454787   x21  0.969481   x31  0.059114
x2   0.629536   x12  0.674123   x22  0.984466   x32  0.112125
x3   0.542222   x13  0.707175   x23  0.014130   x33  0.028291
x4   0.959105   x14  0.84582    x24  0.71633    x34  0.618549
x5   0.301614   x15  0.639485   x25  0.111911   x35  0.205512
x6   0.841639   x16  0.777276   x26  0.897671   x36  0.237342
x7   0.027497   x17  0.027772   x27  0.033814   x37  0.358867
x8   0.086337   x18  0.196844   x28  0.360363   x38  0.889248
x9   0.052431   x19  0.92819    x29  0.76516    x39  0.047670
x10  0.35008    x20  0.960021   x30  0.919828   x40  0.180578

These rules refer to the relationship between the epochs prior to and after the current epoch. For example, three consecutive epochs of S2, REM, and S2 were replaced with the sequence S2, S2, and S2. Similarly, consecutive epochs of REM, S1, and REM were replaced with the sequence REM, REM, and REM. Table IV shows the rules for smoothing in the present method.

TABLE IV List of the smoothing rules.

Rule No.  Modification
1         Any REM epochs before the very first appearance of S2 are replaced with S1 epochs
2         Wake, REM, S2 → Wake, S1, S2
3         S1, REM, S2 → S1, S1, S2
4         S2, S1, S2 → S2, S2, S2
5         S2, SWS, S2 → S2, S2, S2
6         S2, REM, S2 → S2, S2, S2
7         SWS, S2, SWS → SWS, SWS, SWS
8         REM, Wake, REM → REM, REM, REM
9         REM, S1, REM → REM, REM, REM
10        REM, S2, REM → REM, REM, REM
11        Mov, REM, S2 → Mov, S1, S2
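The smoothing rules of Table IV can be sketched as a lookup on three-epoch patterns. The sketch assumes that Rule 1 is applied first and that Rules 2-11 are then applied in a single left-to-right pass in which earlier corrections are visible to later epochs; the order of application is not stated in the text, and the function name is illustrative.

```python
# Rules 2-11 of Table IV: (previous, current, next) -> corrected current stage.
TRIPLE_RULES = {
    ("Wake", "REM", "S2"): "S1",
    ("S1", "REM", "S2"): "S1",
    ("S2", "S1", "S2"): "S2",
    ("S2", "SWS", "S2"): "S2",
    ("S2", "REM", "S2"): "S2",
    ("SWS", "S2", "SWS"): "SWS",
    ("REM", "Wake", "REM"): "REM",
    ("REM", "S1", "REM"): "REM",
    ("REM", "S2", "REM"): "REM",
    ("Mov", "REM", "S2"): "S1",
}

def smooth(stages):
    """Return a smoothed copy of a list of per-epoch stage labels."""
    out = list(stages)
    # Rule 1: REM epochs before the first S2 epoch become S1.
    first_s2 = out.index("S2") if "S2" in out else len(out)
    for i in range(first_s2):
        if out[i] == "REM":
            out[i] = "S1"
    # Rules 2-11: correct the middle epoch of each matching triple.
    for i in range(1, len(out) - 1):
        key = (out[i - 1], out[i], out[i + 1])
        if key in TRIPLE_RULES:
            out[i] = TRIPLE_RULES[key]
    return out
```

For example, `smooth(["S2", "REM", "S2"])` returns `["S2", "S2", "S2"]`, which matches Rule 6.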
4) Movement epochs elimination

After smoothing, an elimination procedure was applied to the movement time (MT) epochs following the AASM scoring methods [4]. The final staging result (hypnogram) was still characterized by five stages (Wake, S1, S2, SWS, and REM).

D. Performance evaluation

The performance of the proposed method was evaluated by a confusion matrix, which is the typical evaluation method for multi-class classification problems. The overall agreement (accuracy, AC), sensitivity (SE), and specificity (SP) of each sleep stage were also calculated. They are defined as:
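$$\mathrm{AC} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (1)$$

$$\mathrm{SE} = \frac{TP}{TP + FN} \qquad (2)$$

$$\mathrm{SP} = \frac{TN}{TN + FP} \qquad (3)$$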
where TP is the total number of correctly positive classifications, TN is the total number of correctly negative classifications, FP is the total number of erroneously positive classifications, and FN is the total number of erroneously negative classifications. In addition, Cohen’s kappa coefficient [31] was calculated for each subject to assess the robustness of our system. Cohen’s kappa coefficient (κ) is a statistical measure of inter-rater agreement between two raters. It is generally thought to be a more robust measure than simple percent agreement because κ takes into account agreements that occur by chance. The interpretation of kappa coefficients [31] is as follows: values less than 0.00 indicate no agreement; 0.00 to 0.20 indicate slight agreement; 0.21 to 0.40 indicate fair agreement; 0.41 to 0.60 indicate moderate agreement; 0.61 to 0.80 indicate substantial agreement; and more than 0.80 indicates excellent agreement.

III. EXPERIMENTAL RESULTS

Several experiments on multiple datasets were performed to evaluate the proposed method: (1) performance comparison between the proposed method and the rule-based method [8] on dataset PDB-1; (2) subject-by-subject performance evaluation of our method on PDB-1; (3) performance of our proposed method on PDB-2; (4) performance of an integrated genetic fuzzy inference system on data from subjects with good and poor sleep efficiency as well as subjects with insomnia from PDB-1 and PDB-2; (5) application of the proposed method to a publicly available sleep database [32] and comparison with two existing methods [33, 34] using the same database.

A. Performance of our method on PDB-1

All-night PSG sleep recordings obtained from 32 subjects were used. In PDB-1, half of the subjects’ sleep efficiencies were equal to or higher than 85% and the other half were lower than 85%. In order to effectively construct and evaluate our method, the PSG data were sorted according to sleep efficiency. From the sorted list, the data from the odd-numbered subjects (16 subjects) were used to train the system, and the data from the other 16 subjects were used for testing. Tables V (a) and (b) show the confusion matrices of the five-stage epoch classification using the rule-based method [8] and our method, respectively. The identical smoothing process was applied to both methods. The rows and columns are the results staged by the expert and the automatic sleep scoring method, respectively. The unidentified signals and the movement epochs are not taken into account here. The overall agreement, sensitivity, specificity, and kappa coefficient of each sleep stage corresponding to the rule-based method [8] and our method are shown in Table V. As shown in Table V (a), the overall agreement between the expert and the rule-based system was 85.85%. It was higher than the range of inter-scorer agreement [20]. The kappa value is 0.79, indicating substantial agreement. Most of the misclassifications occur in stage transitions. The sleep process is continuous and epochs during stage transitions are not typical, so these epochs are more likely to be erroneously classified by the hard thresholding of the rule-based system. Therefore, we propose the genetic fuzzy logic system in this paper to enhance the scoring performance.

TABLE V Confusion matrix between two computer scoring methods and the visual scorings on 16 subjects from PDB-1.

(A) Method in [8]
Expert \ Computer   Wake    S1     S2     SWS    REM    Total   SE(%)
Wake                1189    78     98     0      14     1379    86.22
S1                  72      232    65     1      95     465     49.89
S2                  115     235    5734   165    219    6468    88.65
SWS                 4       6      337    1861   13     2221    83.79
REM                 49      165    123    2      2249   2588    86.90
Overall                                                 13121   85.85
Specificity         0.98    0.96   0.88   0.99   0.97
kappa               0.79

(B) The proposed method

Expert \ Computer   Wake    S1     S2     SWS    REM    Total   SE(%)
Wake                1258    21     92     0      8      1379    91.23
S1                  85      159    100    1      120    465     34.19
S2                  110     80     5856   243    179    6468    90.54
SWS                 4       4      227    1970   16     2221    88.70
REM                 73      54     162    5      2294   2588    88.64
Overall                                                 13121   87.93
Specificity         0.98    0.99   0.91   0.98   0.97
kappa               0.82
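As a check on how the summary figures follow from a confusion matrix, the following sketch recomputes the overall agreement, the per-stage sensitivity and specificity (Eqs. (2) and (3) applied one stage against the rest), and Cohen's kappa from the counts of Table V (b). The function name is illustrative, and the overall agreement is taken as the fraction of epochs on the diagonal.

```python
def stage_metrics(cm):
    """cm[i][j]: epochs scored as stage i by the expert and stage j by the computer."""
    k = len(cm)
    n = sum(sum(r) for r in cm)
    row = [sum(r) for r in cm]                              # expert totals
    col = [sum(cm[i][j] for i in range(k)) for j in range(k)]  # computer totals
    diag = [cm[i][i] for i in range(k)]
    acc = sum(diag) / n                                     # overall agreement
    se = [diag[i] / row[i] for i in range(k)]               # TP / (TP + FN)
    # TN = n - row - col + diag, and TN + FP = n - row
    sp = [(n - row[i] - col[i] + diag[i]) / (n - row[i]) for i in range(k)]
    pe = sum(row[i] * col[i] for i in range(k)) / n ** 2    # chance agreement
    kappa = (acc - pe) / (1 - pe)
    return acc, se, sp, kappa

# Table V (b): rows and columns ordered Wake, S1, S2, SWS, REM.
TABLE_V_B = [
    [1258, 21, 92, 0, 8],
    [85, 159, 100, 1, 120],
    [110, 80, 5856, 243, 179],
    [4, 4, 227, 1970, 16],
    [73, 54, 162, 5, 2294],
]
acc, se, sp, kappa = stage_metrics(TABLE_V_B)
print(f"agreement={acc:.4f} kappa={kappa:.2f}")
```

The recomputed overall agreement (about 87.93%) and kappa (about 0.82) agree with the values reported in Table V (b).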
Table V (b) shows the performance of the proposed genetic fuzzy inference system. The overall agreement between the expert and the proposed system was 87.93%, and the sensitivities for all stages, except for S1, were higher than 88%. In addition, the sensitivities for Wake and S2 were higher than 90%. The specificity for all stages is at least 91%. Compared to Table V (a), the results on Wake, S2, SWS, REM, and overall increased by 5.01%, 1.89%, 4.91%, 1.74% and 2.08%, respectively. It was also observed that the kappa of our system shows excellent agreement (0.82). In addition, a statistical analysis of the performance difference between the proposed method and the method in [8] for subject-by-subject five-stage sleep scoring was performed using the paired t-test. The sensitivities (mean ± sd (%)) with respect to the five sleep stages and the overall agreement for the proposed method and the method in [8] are Wake: 93.92 ± 5.48 and 87.37 ± 14.23, S1: 33.75 ± 19.39 and 44.83 ± 15.83, S2: 90.69 ± 4.37 and 90.48 ± 5.63, SWS: 88.1 ± 10.3 and 74.49 ± 25.4, REM: 87.16 ± 8.73 and 87.68 ± 11.44, overall agreement: 87.76 ± 3.65 and 85.43 ±
6.56, respectively. The accuracies of the proposed method on Wake (p=0.027), SWS (p=0.004), and overall agreement (p=0.024) are statistically superior to those of the method in [8]. The method in [8] performs better on S1 only (p=0.005). The results demonstrated that the performance of the automatic sleep scoring method might be further improved by using GA and fuzzy logic techniques.

B. Subject-by-subject performance evaluation on PDB-1

Table VI shows the subject-by-subject sleep efficiencies, overall agreements, and kappa coefficients of the manual scoring and our automatic scoring on the first private database. The differences between these two scorings are also given. The average kappa value of the 16 testing subjects was κ = 0.81±0.04, and the individual kappa values ranged from 0.7 to 0.87. The individual kappa values were at least in substantial agreement (>0.7), and the kappa values of 12 subjects were excellent (> 0.8).

In addition, Tables VI (a) and (b) present the results of the good and poor sleep efficiency groups, respectively. The average difference in sleep efficiency between the manual scoring and the automatic sleep staging in the good sleep efficiency group is 0.86% (S.D. = 1.41%). The average kappa value of the proposed method is 0.83 (S.D. = 0.02). The average difference in sleep efficiency between the manual scoring and the automatic sleep staging in the poor sleep efficiency group is -2.42% (S.D. = 3.91%). The average kappa value of the proposed method is 0.80 (S.D. = 0.06). This analysis demonstrates that the proposed method is robust and effective for scoring the sleep stages of subjects with both good and poor sleep efficiencies. The accuracies are both in the inter-scorer agreement range for clinical applications (>82.6%) [20] with excellent agreement (kappa>0.8). Fig. 5 shows the hypnograms of subject no. 7 (from the good sleep efficiency group) and no. 11 (from the poor sleep efficiency group), including the manual scoring by the expert and the automatic staging. The automatic scoring hypnograms are close to the hypnograms scored by the expert.

TABLE VI Subject-by-subject sleep efficiencies, overall agreements, and kappa coefficients of the automatic scorings compared with the manual scorings.

(A) The subjects’ sleep efficiency ≥ 85%

Subject number   manual    auto      diff      agreement   kappa
1                97.26%    95.47%    1.79%     91.27%      0.85
2                95.71%    94.07%    1.65%     87.07%      0.80
3                94.09%    90.42%    3.67%     89.73%      0.83
4                93.70%    93.80%    -0.11%    89.80%      0.83
5                93.51%    92.28%    1.24%     88.11%      0.82
6                90.78%    90.88%    -0.10%    90.34%      0.83
7                90.16%    90.98%    -0.83%    90.26%      0.83
8                87.28%    87.70%    -0.42%    89.47%      0.85
avg.             92.81%    91.95%    0.86%     89.51%      0.83
std.             3.02%     2.32%     1.41%     1.33%       0.02

(B) The subjects’ sleep efficiency < 85%

Subject number   manual    auto      diff      agreement   kappa
9                84.95%    82.77%    2.18%     89.81%      0.85
10               84.66%    86.04%    -1.38%    86.55%      0.80
11               79.56%    82.40%    -2.84%    90.04%      0.85
12               78.15%    80.54%    -2.39%    78.66%      0.70
13               78.02%    79.57%    -1.55%    83.24%      0.78
14               72.29%    72.29%    0.00%     85.76%      0.79
15               64.02%    66.19%    -2.16%    91.77%      0.87
16               53.13%    64.37%    -11.24%   82.32%      0.75
avg.             74.35%    76.77%    -2.42%    86.02%      0.80
std.             10.92%    8.12%     3.91%     4.46%       0.06
Fig. 5. The hypnogram of subject no. 7 and no. 11: (a) the expert scored hypnogram of subject no. 7, (b) the automatic staging hypnogram of subject no. 7, and (c) the expert scored hypnogram of subject no. 11, (d) the automatic staging hypnogram of subject no. 11.
C. Performance of our method on PDB-2

All-night PSG sleep recordings obtained from 16 subjects with insomnia were used to confirm the robustness and clinical applicability of the proposed method. The average sleep efficiency of PDB-2 was 71.07% (< 85%). The PSG data were also sorted according to sleep efficiency; the data from the odd-numbered subjects (8 subjects) were used to train the system, and the data from the other 8 subjects were used for testing. Table VII shows the performance of the proposed genetic fuzzy inference system on PDB-2. The overall agreement between the expert and the proposed system was 81.77%, and the sensitivities for all stages, except for S1, were higher than 83%. In addition, the sensitivities for S2 and SWS were higher than 85%. The specificity for all stages is at least 92%. It was also observed that the kappa of our system shows substantial agreement (0.75). The result demonstrated that the agreement between the expert and our proposed method is still higher than 80% even on the subjects with insomnia.
TABLE VII Confusion matrix between the computer scorings and the visual scorings on 8 subjects with insomnia from PDB-2.

Expert \ Computer   Wake    S1     S2     SWS    REM    Total   SE(%)
Wake                1577    125    75     12     110    1899    83.04
S1                  87      122    61     4      84     358     34.08
S2                  46      93     2113   101    128    2481    85.17
SWS                 4       0      105    782    7      898     87.08
REM                 32      26     61     13     671    803     83.56
Overall                                                 6439    81.77
Specificity         0.96    0.96   0.92   0.98   0.94
kappa               0.75
D. Performance of the integrated system on PDB-1 and PDB2 The experiments A and C demonstrate that knowledge of the experts in sleep scoring and the elasticity of fuzzy systems in reasoning can be integrated to develop the automatic sleep staging systems for the healthy subjects and the subjects with insomnia. Finally, an integrated sleep scoring system that can be applied to the data from subjects with (1) good and (2) poor sleep efficiency as well as (3) subjects with insomnia was designed. Because the sleep patterns of these subject groups may be different [24], the two fuzzy inference systems developed based on database PDB-1 (denoted as GA-fuzzy model-1) and database PDB-2 (denoted as GA-fuzzy model-2) were integrated as shown in Fig. 6.
When PSG data arrive, the extracted features are first classified by GA-fuzzy model-1, and the sleep efficiency of the subject is then estimated from the results of GA-fuzzy model-1. If the estimated sleep efficiency is lower than 75%, the sleep quality of the subject is considered poor and the subject has a high possibility of suffering from insomnia, so GA-fuzzy model-2 is utilized to classify the sleep stages. Otherwise, the subject is considered a healthy individual and the results of GA-fuzzy model-1 are adopted directly.

The data from 24 subjects that were not used to train GA-fuzzy model-1 and GA-fuzzy model-2, namely 8 subjects with good sleep efficiency (PDB-1), 8 subjects with poor sleep efficiency (PDB-1) and 8 subjects with insomnia (PDB-2), were utilized to test this system. The experimental results are shown in Table VIII. The overall agreement between the expert and the proposed system was 86.44%, and the sensitivities for all stages except S1 were higher than 86%. The specificity for all stages was higher than 92%. The kappa coefficient of the integrated system was 0.81, which indicates excellent agreement. When the method in [8] was applied to the same data, the sensitivities for Wake, S1, S2, SWS and REM were 79.19%, 34.51%, 83.22%, 84.42% and 84.49%, respectively, the overall agreement was 80.9%, and the kappa coefficient was 0.73. These results show that the overall agreement of our integrated system (86%) is still within the inter-scorer agreement range for clinical applications (>82.6%) [20], with excellent agreement (kappa > 0.8), for automatic scoring of data from multiple subject groups. However, the agreement of the rule-based method [8] falls below the inter-scorer agreement range when data from subjects with insomnia are included.

TABLE VIII
Confusion matrix between the computer scorings and the visual scorings on 24 subjects from PDB-1 and PDB-2. Rows: expert scoring; columns: computer scoring.

Expert \ Computer   Wake    S1      S2      SWS     REM     Total    SE (%)
Wake                2845    146     157     12      118     3278     86.79
S1                  170     290     160     5       198     823      35.24
S2                  124     173     8018    344     290     8949     89.60
SWS                 8       4       320     2764    23      3119     88.62
REM                 105     80      198     18      2990    3391     88.17
Overall                                                     19560    86.44
Specificity         0.97    0.98    0.92    0.98    0.96
kappa               0.81

Fig. 6. The flow chart of the integrated automatic sleep staging system.
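The routing step of the integrated system (Fig. 6) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions rather than the implementation used in the study: `toy_model_1` and `toy_model_2` are hypothetical stand-ins for the two GA-fuzzy models, and sleep efficiency is approximated here as the percentage of scored epochs that are not Wake, which may differ from the definition used for the recordings. Only the 75% cut-off is taken from the text.

```python
from typing import Callable, List, Sequence

Stage = str  # "Wake", "S1", "S2", "SWS" or "REM"
Epochs = Sequence[Sequence[float]]  # one feature vector per 30-s epoch
Model = Callable[[Epochs], List[Stage]]

SE_THRESHOLD = 75.0  # percent; the cut-off stated in the text


def estimate_sleep_efficiency(stages: Sequence[Stage]) -> float:
    """Assumed definition: percentage of scored epochs that are not Wake."""
    if not stages:
        raise ValueError("at least one scored epoch is required")
    asleep = sum(1 for stage in stages if stage != "Wake")
    return 100.0 * asleep / len(stages)


def score_recording(features: Epochs, model_1: Model, model_2: Model) -> List[Stage]:
    """Score with model-1 first; re-score with model-2 if efficiency is low."""
    first_pass = model_1(features)
    if estimate_sleep_efficiency(first_pass) < SE_THRESHOLD:
        # Low estimated efficiency: classify the stages with the second model.
        return model_2(features)
    # Otherwise the first-pass result is adopted directly.
    return first_pass


# Toy stand-ins for the two GA-fuzzy models (hypothetical, illustration only).
def toy_model_1(features: Epochs) -> List[Stage]:
    return ["Wake" if epoch[0] > 0.5 else "S2" for epoch in features]


def toy_model_2(features: Epochs) -> List[Stage]:
    return ["Wake" if epoch[0] > 0.8 else "S1" for epoch in features]


good_sleeper = [[0.9], [0.1], [0.2], [0.3]]  # model-1 scores 1 of 4 epochs as Wake
poor_sleeper = [[0.9], [0.7], [0.2], [0.3]]  # model-1 scores 2 of 4 epochs as Wake
print(score_recording(good_sleeper, toy_model_1, toy_model_2))
print(score_recording(poor_sleeper, toy_model_1, toy_model_2))
```

In this toy run the first recording keeps the model-1 output because its estimated efficiency is exactly 75%, which is not below the cut-off, while the second recording is re-scored by the second model.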
E. Performance of the method on public dataset
To further assess and demonstrate the abilities of our method, we also applied it to a publicly available sleep database (International Database PhysioNet Sleep Recordings, http://www.physionet.org) [32] that provides sleep recordings and the corresponding hypnograms in European Data Format. The EEG recording sites are Pz-Oz, and the sampling rate is 100 Hz for EEG and EOG in both data sets. The sampling rates for EMG are 1 Hz and 100 Hz in the sc* and st* recordings, respectively. Because the EMG sampling rate in the sc* recordings is too low compared with the recommendation of our method, the st* recordings (4 subjects) were utilized in our experiment. The sampling rate and the EEG recording sites of the st* recordings (100 Hz, Pz-Oz) still differ from those of the recordings used to develop our method (256 Hz, C3-A2); therefore, we went through the training phase again to fine-tune the system parameters. The subjects were sorted according to sleep efficiency. From the sorted list, the odd-numbered subjects were used to train the system, and the even-numbered subjects were used for testing.
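The odd/even split described above can be sketched as follows. The subject identifiers and sleep efficiency values are hypothetical, and the ascending sort order is an assumption, since the text does not state the direction of the sort.

```python
from typing import Dict, List, Tuple


def split_by_sorted_efficiency(
    sleep_efficiency: Dict[str, float],
) -> Tuple[List[str], List[str]]:
    """Sort subjects by sleep efficiency, then alternate train and test."""
    # Ascending order is assumed here.
    ordered = sorted(sleep_efficiency, key=sleep_efficiency.get)
    train = ordered[0::2]  # 1st, 3rd, ... subject in the sorted list
    test = ordered[1::2]   # 2nd, 4th, ... subject in the sorted list
    return train, test


# Hypothetical subject IDs and sleep efficiencies (percent), illustration only.
example = {"subj_a": 91.0, "subj_b": 68.5, "subj_c": 80.2, "subj_d": 74.9}
train_ids, test_ids = split_by_sorted_efficiency(example)
print("train:", train_ids)
print("test:", test_ids)
```

Alternating over the sorted list gives both the training and the testing set a similar spread of sleep efficiencies, which a plain first-half/second-half split would not.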
Table IX shows the confusion matrix of the five-stage epoch classification by the proposed sleep staging method. The testing dataset for performance estimation comprised 30-s epochs. As shown in Table IX, the overall accuracy of the proposed system was 81.34%. The sensitivities for S2, SWS and REM were over 85%. The specificity for all stages was higher than 93%. The overall kappa value was 0.72.

The methods proposed in [33-37] have been applied to the same database. The method in [33] combined spectral features and hidden Markov models (HMM) for sleep staging. In [34], spectral and temporal features were integrated with an adaptive fuzzy logic iterative system. The kappa coefficients of the methods in [33] and [34] were 0.61 and 0.52, respectively. Thus, the proposed automatic sleep staging system is in substantial agreement with the expert, and its results are superior to those of the existing methods in [33] and [34]. Because the methods in [35] and [36] combined stages S1 and REM into a single stage and the method in [37] removed some epochs in which the quality of the PSG signals did not satisfy its requirement, the results of these methods were not included, to avoid an unfair comparison.

TABLE IX
Confusion matrix between our proposed method and the visual scoring for the public dataset in [32]. Rows: expert scoring; columns: computer scoring.

Expert \ Computer   Wake    S1      S2      SWS     REM     Total    SE (%)
Wake                137     45      16      5       13      216      63.43
S1                  53      57      21      3       10      144      39.58
S2                  22      22      633     41      20      738      85.77
SWS                 3       0       23      332     0       358      92.74
REM                 7       9       19      18      367     420      87.38
Overall                                                     1876     81.34
Specificity         0.95    0.96    0.93    0.96    0.97
kappa               0.72
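As a worked check of how the per-stage sensitivity, the per-stage specificity and the overall agreement in Table IX follow from the raw counts, the sketch below recomputes them in plain Python. The counts are those of Table IX with expert scorings as rows and computer scorings as columns. The function name and layout are illustrative choices, not part of the original method.

```python
from typing import Dict, List

STAGES = ["Wake", "S1", "S2", "SWS", "REM"]

# Rows: expert scoring; columns: computer scoring (counts from Table IX).
TABLE_IX = [
    [137, 45, 16, 5, 13],
    [53, 57, 21, 3, 10],
    [22, 22, 633, 41, 20],
    [3, 0, 23, 332, 0],
    [7, 9, 19, 18, 367],
]


def agreement_metrics(confusion: List[List[int]]) -> Dict[str, object]:
    """Overall agreement plus per-stage sensitivity and specificity."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    row_totals = [sum(row) for row in confusion]  # epochs per expert stage
    col_totals = [sum(confusion[i][j] for i in range(n)) for j in range(n)]
    correct = sum(confusion[k][k] for k in range(n))

    # Sensitivity: fraction of the expert's epochs of a stage that were matched.
    sensitivity = [confusion[k][k] / row_totals[k] for k in range(n)]

    # Specificity: fraction of epochs of all other stages not assigned to it.
    specificity = []
    for k in range(n):
        false_positives = col_totals[k] - confusion[k][k]
        negatives = total - row_totals[k]
        specificity.append(1.0 - false_positives / negatives)

    return {
        "overall": correct / total,
        "sensitivity": sensitivity,
        "specificity": specificity,
    }


metrics = agreement_metrics(TABLE_IX)
print(f"overall agreement: {100 * metrics['overall']:.2f}%")
for stage, se, sp in zip(STAGES, metrics["sensitivity"], metrics["specificity"]):
    print(f"{stage:>4}: SE = {100 * se:.2f}%, specificity = {sp:.2f}")
```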
IV. DISCUSSION
The three main ideas of our method are as follows. First, decision rules that implement human knowledge of sleep scoring were utilized. In observing the PSG signals of each stage, we found that many types of signal features can be present within one stage. The scoring rules of our method were therefore designed based on expert knowledge and data distributions. The resulting multi-rule-based staging method, which considers various situations according to frequency-domain and time-domain features, performs better than methods that only refer to the R&K rules or rely on numerical classifiers such as neural networks [12, 15] or SVM [38].

Second, we utilized the fuzzy inference system to transform the hard thresholds into soft thresholds, and the GA was applied to fine-tune the membership functions. The fuzzy system has the advantages of modeling human or approximate reasoning and of dealing with uncertainty within the data. These characteristics are especially suitable for sleep scoring, in which manual scoring is itself fuzzy because of confusing PSG data at sleep-stage transitions or in atypical epochs [20]. Therefore, the GA fuzzy inference system brings the automatic scoring closer to manual scoring by an expert.
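The difference between a hard and a soft threshold in the second point can be illustrated with a short sketch. The sigmoid shape, the `center` and `slope` parameters and the feature values are assumptions made for illustration; the membership functions and the GA-tuned parameters of the actual system are not reproduced here.

```python
import math


def hard_threshold(x: float, threshold: float) -> float:
    """Crisp rule: fully satisfied at or above the threshold, otherwise not."""
    return 1.0 if x >= threshold else 0.0


def soft_threshold(x: float, center: float, slope: float) -> float:
    """Sigmoid membership: degree in (0, 1) to which x exceeds the center."""
    return 1.0 / (1.0 + math.exp(-slope * (x - center)))


# Two hypothetical feature values lying just below and just above a cut-off of 0.5.
for value in (0.48, 0.52):
    crisp = hard_threshold(value, threshold=0.5)
    fuzzy = soft_threshold(value, center=0.5, slope=10.0)
    print(f"feature = {value:.2f}: hard = {crisp:.0f}, soft = {fuzzy:.3f}")
```

With the hard rule the two nearly identical values receive opposite decisions, whereas the soft rule gives them similar membership degrees close to 0.5. In a genetic fuzzy system, parameters such as `center` and `slope` are the kind of quantities a GA could tune against expert scorings.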
Finally, most previous studies focused on either healthy subjects or patients being treated for sleep disorders. In other words, their testing subjects may not have included both the good and the poor sleep efficiency groups. The sleep patterns of subjects with good and poor sleep efficiency may differ [24]. To evaluate the applicability of an automatic sleep scoring method, it is essential that the data include both groups of subjects. Our experimental results show that the overall agreement of our method applied to both subject groups is higher than 86%. Our proposed method also maintains good accuracy (>80%) for subjects with insomnia. These results demonstrate the robustness and reliability of the proposed method.

In previous studies, several approaches, including rule-based methods, back-propagation neural networks (BPNN), hidden Markov models (HMM), support vector machines (SVM) and fuzzy classifiers, have been applied to automated sleep staging [8, 12, 17, 21, 39]. Fraiwan et al. [24] reviewed recent sleep staging works, and the overall agreement of these studies was reported to be in the range of 70% to 87.5%. Although pooling the data from all subjects may achieve higher classification accuracy [25], most automatic sleep scoring methods were evaluated in a subject-independent manner to simulate practical conditions.

In biomedical and clinical applications, expert knowledge plays an important role because clinical diagnoses are mainly based on the judgment of clinical staff and experts. The results of automatic interpretation by a machine or computer are also evaluated by experts. In this study, the fuzzy rules were designed based on the knowledge of experts, so the results of our method are superior to those of the existing numerical methods [33, 34] applied to the same public dataset [32].

V. CONCLUSION
This paper studies the feasibility of integrating expert knowledge of PSG scoring with the flexibility of fuzzy systems in reasoning and decision making to develop an automatic sleep staging system. The overall agreement and kappa coefficient of the proposed genetic fuzzy inference system applied to all-night PSGs from 16 subjects with good and poor sleep efficiencies (PDB-1) were 87.93% and 0.82, respectively. The overall agreement and kappa coefficient of our system for 8 subjects with insomnia (PDB-2) were 81.77% and 0.75, respectively. A sleep scoring system that integrates two fuzzy inference models and performs robustly on various subject groups was also developed. The overall agreement and kappa coefficient of this integrated system applied to PSG data from 8 subjects with good sleep efficiency, 8 subjects with poor sleep efficiency and 8 subjects with insomnia were 86.44% and 0.81, respectively. Because home-based PSG is associated with a better sleep efficiency [40], the proposed method can be combined with a portable PSG system [41] for sleep monitoring in clinical or homecare applications in the future.

REFERENCES
[1] M. Ciolek, M. Niedzwiecki, S. Sieklicki, J. Drozdowski, and J. Siebert, "Automated Detection of Sleep Apnea and Hypopnea Events Based on Robust Airflow Envelope Tracking in the Presence of Breathing Artifacts," IEEE J. Biomed. Health Inform., DOI: 10.1109/JBHI.2014.2325997, 2014.
[2] G. Sannino, I. D. Falco, and G. D. Pietro, "An automatic rules extraction approach to support OSA events detection in a mHealth system," IEEE J. Biomed. Health Inform., DOI: 10.1109/JBHI.2014.2311325, 2014.
[3] A manual of standardized terminology, techniques and scoring system for sleep stages of human subjects. Allan Rechtschaffen and Anthony Kales, editors. Bethesda, Md: U. S. National Institute of Neurological Diseases and Blindness, Neurological Information Network, 1968.
[4] D. Moser, P. Anderer, G. Gruber, S. Parapatics, E. Loretz, M. Boeck, et al., "Sleep classification according to AASM and Rechtschaffen & Kales: effects on sleep scoring parameters," Sleep, vol. 32, p. 139, 2009.
[5] H. Danker-Hopfe, D. Kunz, G. Gruber, G. Klösch, J. Lorenzo, S. Himanen, et al., "Interrater reliability between scorers from eight European sleep laboratories in subjects with different sleep disorders," J. Sleep Res., vol. 13, pp. 63-69, 2004.
[6] H. Danker-Hopfe, P. Anderer, J. Zeitlhofer, M. Boeck, H. Dorn, G. Gruber, et al., "Interrater reliability for sleep scoring according to the Rechtschaffen & Kales and the new AASM standard," J. Sleep Res., vol. 18, pp. 74-84, 2009.
[7] P. Anderer, G. Gruber, S. Parapatics, M. Woertz, T. Miazhynskaia, G. Klösch, et al., "An E-health solution for automatic sleep classification according to Rechtschaffen and Kales: validation study of the Somnolyzer 24×7 utilizing the Siesta database," Neuropsychobiology, vol. 51, pp. 115-133, 2005.
[8] S.-F. Liang, C.-E. Kuo, Y.-H. Hu, and Y.-S. Cheng, "A rule-based automatic sleep staging method," J. Neurosci. Methods, vol. 205, pp. 169-176, 2012.
[9] H. Griessenberger, D. Heib, A. Kunz, K. Hoedlmoser, and M. Schabus, "Assessment of a wireless headband for automatic sleep scoring," Sleep Breath., pp. 1-6, 2013.
[10] C. Stepnowsky, D. Levendowski, D. Popovic, I. Ayappa, and D. M. Rapoport, "Scoring accuracy of automated sleep staging from a bipolar electroocular recording compared to manual scoring by multiple raters," Sleep Med., vol. 14, pp. 1199-1207, 2013.
[11] J. Virkkala, J. Hasan, A. Värri, S.-L. Himanen, and K. Müller, "Automatic sleep stage classification using two-channel electrooculography," J. Neurosci. Methods, vol. 166, pp. 109-115, 2007.
[12] N. Schaltenbrand, R. Lengelle, M. Toussaint, R. Luthringer, G. Carelli, A. Jacqmin, et al., "Sleep stage scoring using the neural network model: comparison between visual and automatic analysis in normal subjects and patients," Sleep, vol. 19, p. 26, 1996.
[13] E. Tagliazucchi, F. von Wegner, A. Morzelewski, S. Borisov, K. Jahnke, and H. Laufs, "Automatic sleep staging using fMRI functional connectivity data," Neuroimage, 2012.
[14] S.-T. Pan, C.-E. Kuo, J.-H. Zeng, and S.-F. Liang, "A transition-constrained discrete hidden Markov model for automatic sleep staging," Biomed. Eng. Online, vol. 11, pp. 1-19, 2012.
[15] M. Ronzhina, O. Janoušek, J. Kolářová, M. Nováková, P. Honzík, and I. Provazník, "Sleep scoring using artificial neural networks," Sleep Med. Rev., vol. 16, pp. 251-263, 2012.
[16] S.-F. Liang, C.-E. Kuo, Y.-H. Hu, Y.-H. Pan, and Y.-H. Wang, "Automatic Stage Scoring of Single-Channel Sleep EEG by Using Multiscale Entropy and Autoregressive Models," IEEE Trans. Instrum. Meas., vol. 61, pp. 1649-1657, 2012.
[17] G. Zhu, Y. Li, and P. Wen, "Analysis and Classification of Sleep Stages Based on Difference Visibility Graphs from a Single Channel EEG Signal," IEEE J. Biomed. Health Inform., DOI: 10.1109/JBHI.2014.2303991, 2014.
[18] H. Park, K. Park, and D.-U. Jeong, "Hybrid neural-network and rule-based expert system for automatic sleep stage scoring," in Proc. 22nd Annu. EMBS Int. Conf., 2000, pp. 1316-1319.
[19] F. Chapotot and G. Becq, "Automated sleep-wake staging combining robust feature extraction, artificial neural network classification, and flexible decision rules," Int. J. Adapt. Control Signal Process., vol. 24, pp. 409-423, 2010.
[20] R. S. Rosenberg and S. Van Hout, "The American Academy of Sleep Medicine inter-scorer reliability program: sleep stage scoring," J. Clin. Sleep Med., vol. 9, pp. 81-87, 2013.
[21] S.-F. Liang, Y.-H. Chen, C.-E. Kuo, J.-Y. Chen, and S.-C. Hsu, "A fuzzy inference system for sleep staging," in Fuzzy Systems (FUZZ), 2011 IEEE Int. Conf., 2011, pp. 2104-2107.
[22] E. Zhou and A. Khotanzad, "Fuzzy classifier design using genetic algorithms," Pattern Recognit., vol. 40, pp. 3401-3414, 2007.
[23] H. Ishibuchi, K. Nozaki, N. Yamamoto, and H. Tanaka, "Selecting fuzzy if-then rules for classification problems using genetic algorithms," IEEE Trans. Fuzzy Syst., vol. 3, pp. 260-270, 1995.
[24] L. Fraiwan, K. Lweesy, N. Khasawneh, M. Fraiwan, H. Wenz, and H. Dickhaus, "Classification of sleep stages using multi-wavelet time frequency entropy and LDA," Methods Inf. Med., vol. 49, p. 230, 2010.
[25] J. F. Gao, Y. Yang, P. Lin, P. Wang, and C. X. Zheng, "Automatic removal of eye-movement and blink artifacts from EEG signals," Brain Topogr., vol. 23, pp. 105-114, 2010.
[26] L. A. Zadeh, "Fuzzy sets," Inf. Control, vol. 8, pp. 338-353, 1965.
[27] A. D. Kulkarni, Computer vision and fuzzy-neural systems: Prentice Hall PTR, 2001.
[28] T. Takagi and M. Sugeno, "Fuzzy identification of systems and its applications to modeling and control," IEEE Trans. Syst. Man Cybern., pp. 116-132, 1985.
[29] J. H. Holland, "Genetic algorithms and the optimal allocation of trials," SIAM J. Comput., vol. 2, pp. 88-105, 1973.
[30] D. E. Goldberg and J. H. Holland, "Genetic algorithms and machine learning," Mach. Learn., vol. 3, pp. 95-99, 1988.
[31] J. Cohen, "A coefficient of agreement for nominal scales," Educ. Psychol. Meas., vol. 20, pp. 37-46, 1960.
[32] A. L. Goldberger, L. A. Amaral, L. Glass, J. M. Hausdorff, P. C. Ivanov, R. G. Mark, et al., "Physiobank, physiotoolkit, and physionet components of a new research resource for complex physiologic signals," Circulation, vol. 101, pp. e215-e220, 2000.
[33] L. Doroshenkov, V. Konyshev, and S. Selishchev, "Classification of human sleep stages based on EEG processing using hidden Markov models," Biomed. Eng., vol. 41, pp. 25-28, 2007.
[34] C. Berthomier, X. Drouot, M. Herman-Stoïca, P. Berthomier, J. Prado, D. Bokar-Thire, et al., "Automatic analysis of single-channel sleep EEG: validation in healthy individuals," Sleep, vol. 30, p. 1587, 2007.
[35] F. Ebrahimi, M. Mikaeili, E. Estrada, and H. Nazeran, "Automatic sleep stage classification based on EEG signals by using neural networks and wavelet packet coefficients," in Proc. 30th Ann. Int. IEEE EMBS Conf., 2008, pp. 1151-1154.
[36] Y. Liu, L. Yan, B. Zeng, and W. Wang, "Automatic sleep stage scoring using Hilbert-Huang transform with BP neural network," in Proc. Int. Conf. on Bioinformatics and Biomedical Engineering (iCBBE), 2010, pp. 1-4.
[37] Y.-L. Hsu, Y.-T. Yang, J.-S. Wang, and C.-Y. Hsu, "Automatic sleep stage recurrent neural classifier using energy features of EEG signals," Neurocomputing, vol. 104, pp. 105-114, 2013.
[38] J. Hedner, D. P. White, A. Malhotra, S. Herscovici, S. D. Pittman, D. Zou, et al., "Sleep staging based on autonomic signals: a multi-center validation study," J. Clin. Sleep Med., vol. 7, p. 301, 2011.
[39] A. Flexer, G. Gruber, and G. Dorffner, "A reliable probabilistic sleep stager based on a single EEG signal," Artif. Intell. Med., vol. 33, pp. 199-207, 2005.
[40] M. Bruyneel, C. Sanida, G. Art, W. Libert, L. Cuvelier, M. Paesmans, et al., "Sleep efficiency during sleep studies: results of a prospective study comparing home-based and in-hospital polysomnography," J. Sleep Res., vol. 20, pp. 201-206, 2011.
[41] D.-W. Chang, Y.-D. Liu, C.-P. Young, J.-J. Chen, Y.-H. Chen, C.-Y. Chen, et al., "Design and Implementation of a modularized polysomnography system," IEEE Trans. Instrum. Meas., vol. 61, pp. 1933-1944, 2012.